

   comp.lang.c      Meh, in C you gotta define EVERYTHING      243,242 messages   


   Message 242,619 of 243,242   
   Paul to All   
   Re: is_binary_file()   
   27 Dec 25 01:28:18   
   
   From: nospam@needed.invalid   
      
   On Fri, 12/26/2025 10:13 PM, Lawrence D’Oliveiro wrote:   
   > On Sun, 7 Dec 2025 19:01:02 +0000, Richard Harnden wrote:   
   >   
   >> A text file is supposed to end with a '\n'   
   >   
   > PDF files end with that. The object index comes at the end, and each   
   > index entry is fixed in length and ends with \015\012.   
   >   
   > But the spec makes it very clear that PDF files are not supposed to be   
   > treated as text files.   
   >   
      
   The best you can do is a PDF that is entirely text except for   
   some bytes near the top (the second line). I've seen at least one   
   document that lacks the binary line. As far as I can tell it is not a   
   hash over the document: the PDF spec recommends a comment line there   
   with at least four bytes of value 128 or greater, so that transfer   
   tools treat the file as binary.   
      
   At least in the PDF below, the document is 99% text. And Mutool can be   
   used to convert a "mostly binary" PDF into a "mostly text" PDF.   
      
   If a PDF is encrypted, it is unlikely to look like text   
   when you naively open it.   
      
   PDFs can be "anywhere from 99% binary to 99% text". It all depends.   
   Generally, the ones that are mostly text are the simplest of documents.   
   Rich media documents will have a lot more binary that cannot be   
   reduced by simple transformations. To fix that, you would have to start   
   from different source materials that have a closer-to-textual   
   representation in the first place.   
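      
   To put a number on "99% binary" versus "99% text", something like the   
   following could be run over a buffer read from the file. It is only a   
   sketch: the names count_non_text() and non_text_percent() are made up   
   for illustration, and "text" is assumed to mean printable ASCII plus   
   tab, LF, FF and CR, which is just one of several possible definitions.   

```c
#include <stddef.h>

/* Count bytes that are not plain ASCII text.
   Assumption: "text" means printable ASCII (0x20..0x7E)
   plus tab, line feed, form feed and carriage return. */
size_t count_non_text(const unsigned char *buf, size_t len)
{
    size_t n = 0;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = buf[i];
        int is_text = (c >= 0x20 && c <= 0x7E) ||
                      c == '\t' || c == '\n' || c == '\f' || c == '\r';
        if (!is_text)
            n++;
    }
    return n;
}

/* Percentage (0..100) of non-text bytes; 0 for an empty buffer. */
double non_text_percent(const unsigned char *buf, size_t len)
{
    if (len == 0)
        return 0.0;
    return 100.0 * (double)count_non_text(buf, len) / (double)len;
}
```

   Where to draw the line between "text" and "binary" (1%? 10%?) is a   
   policy decision the caller would still have to make.   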
      
   ****************************************************************   
   %PDF-1.4   
   "25 B8 9A 92 9D 0A"                      <=== these can "look like binary"    
   1 0 obj<>   
   endobj   
   2 0 obj<>   
   endobj   
   3 0 obj<>   
   endobj   
   4 0 obj<>>>/Contents 5 0   
   R>>   
   endobj   
   5 0 obj<>stream   
   BT   
   /F0 12 Tf   
   1 0 0 1 100 702.7366667 Tm   
   (Hello World!)Tj   
   ET   
   endstream   
   endobj   
   6 0 obj<>   
   endobj   
   7 0 obj[278 278 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0   
   0 0 0 0 0 0 0 722 0 0 0 0 0 0 0 0 0 0 0 0 0 0 944 0 0 0 0 0 0 0 0 0 0 0 0 556   
   556 0 0 0 0 0 0 222 0 0 556 0 0 333]   
   endobj   
   8 0 obj<>   
   endobj   
   xref   
   0 9   
   0000000000 65535 f   
   0000000015 00000 n   
   0000000059 00000 n   
   0000000179 00000 n   
   0000000257 00000 n   
   0000000346 00000 n   
   0000000451 00000 n   
   0000000573 00000 n   
   0000000773 00000 n   
   trailer   
   <<9392A59F3BE7   
   840805D62746E8A4F29>]/Info 2 0 R/Size 9>>   
   startxref   
   988   
   %%EOF   
   ****************************************************************   
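      
   The xref entries in the listing above are the fixed-length index   
   entries mentioned in the quoted text: a 10-digit byte offset, a   
   5-digit generation number, and 'n' (in use) or 'f' (free), followed   
   by a two-byte end-of-line. A rough sketch of parsing one entry is   
   below. The struct and function names are hypothetical, and the   
   end-of-line bytes are deliberately not checked.   

```c
#include <ctype.h>
#include <stddef.h>

struct xref_entry {
    unsigned long long offset;     /* byte offset (for 'n' entries) */
    unsigned long      generation; /* generation number */
    char               type;       /* 'n' = in use, 'f' = free */
};

/* Parse one entry of the form "nnnnnnnnnn ggggg t".
   Only the first 18 bytes are examined; the 2-byte end-of-line
   that follows in a real file is not checked here.
   Returns 1 on success, 0 if the input is malformed. */
int parse_xref_entry(const char *s, size_t len, struct xref_entry *out)
{
    unsigned long long off = 0;
    unsigned long gen = 0;
    size_t i;

    if (len < 18)
        return 0;

    for (i = 0; i < 10; i++) {              /* 10-digit offset */
        if (!isdigit((unsigned char)s[i]))
            return 0;
        off = off * 10 + (unsigned long long)(s[i] - '0');
    }
    if (s[10] != ' ')
        return 0;

    for (i = 11; i < 16; i++) {             /* 5-digit generation */
        if (!isdigit((unsigned char)s[i]))
            return 0;
        gen = gen * 10 + (unsigned long)(s[i] - '0');
    }
    if (s[16] != ' ')
        return 0;

    if (s[17] != 'n' && s[17] != 'f')       /* entry type */
        return 0;

    out->offset = off;
    out->generation = gen;
    out->type = s[17];
    return 1;
}
```

   For example, "0000000015 00000 n" from the listing would parse as   
   offset 15, generation 0, in use.   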
      
   If "there has to be binary in it", it's on the second line.   
   The other lines can be text... if the tools and print drivers   
   wanted to do it that way.   
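      
   For an is_binary_file() style check, the second line could be tested   
   directly. The sketch below assumes the file starts with "%PDF-" and   
   that the binary line, when present, is a '%' comment holding bytes of   
   value 128 or greater (the PDF spec recommends at least four). The   
   function name pdf_binary_marker_bytes() is invented for illustration.   

```c
#include <stddef.h>
#include <string.h>

/* Returns the number of bytes >= 0x80 on the comment line that
   follows the "%PDF-" header line, 0 if the second line is not a
   comment, or -1 if buf does not start with "%PDF-". */
int pdf_binary_marker_bytes(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    int high = 0;

    if (len < 5 || memcmp(buf, "%PDF-", 5) != 0)
        return -1;

    /* Skip the rest of the header line. */
    while (i < len && buf[i] != '\r' && buf[i] != '\n')
        i++;

    /* Skip the end-of-line marker: CR, LF, or CR LF. */
    if (i < len && buf[i] == '\r')
        i++;
    if (i < len && buf[i] == '\n')
        i++;

    /* The marker line is a comment, so it starts with '%'. */
    if (i >= len || buf[i] != '%')
        return 0;

    /* Count high-bit bytes up to the end of that line. */
    for (i++; i < len && buf[i] != '\r' && buf[i] != '\n'; i++)
        if (buf[i] >= 0x80)
            high++;

    return high;
}
```

   On the header shown above ("25 B8 9A 92 9D 0A" as the second line)   
   this would report four high bytes. A result of 0 only says the   
   marker is missing, not that the rest of the file is text.   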
      
      Paul   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   


