Forums before death by AOL, social media and spammers... "We can't have nice things"
|    comp.lang.c    |    Meh, in C you gotta define EVERYTHING    |    243,242 messages    |
|    Message 242,619 of 243,242    |
|    Paul to All    |
|    Re: is_binary_file()    |
|    27 Dec 25 01:28:18    |
From: nospam@needed.invalid

On Fri, 12/26/2025 10:13 PM, Lawrence D’Oliveiro wrote:
> On Sun, 7 Dec 2025 19:01:02 +0000, Richard Harnden wrote:
>
>> A text file is supposed to end with a '\n'
>
> PDF files end with that. The object index comes at the end, and each
> index entry is fixed in length and ends with \015\012.
>
> But the spec makes it very clear that PDF files are not supposed to be
> treated as text files.
>

The best you can do is for the PDF to be entirely text except for
some bytes near the top (second line). That line is not a hash over the
document: the spec recommends a comment there containing at least four
bytes of value 128 or more, so that transfer tools treat the file as
binary. It is not always present; I've seen at least one document that
misses the binary line.

At least in this PDF, the document is 99% text. And Mutool can be
used to convert a "mostly binary" PDF into a "mostly text" PDF.

If a PDF is encrypted, it is unlikely to have a textual representation
when you naively open it.

PDFs can be "anywhere from 99% binary to 99% text". It all depends.
Generally, the ones that are mostly text are the simplest of documents.
Rich media documents will have a lot more binary that cannot be
simplified by simple transformations. You could fix that at the source,
by starting from materials that have a closer-to-textual representation.
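To put a number on "99% binary to 99% text", here is a minimal sketch of a byte counter. The names text_fraction() and file_text_fraction() are made up for this example, and the definition of "text" (printable ASCII plus tab, LF, CR and FF) is an assumption, not anything the PDF spec or the thread defines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Assumption: "text" means printable ASCII plus tab, LF, CR and FF.
   Everything else, including every byte >= 0x80, counts as binary. */
static int looks_like_text(unsigned char c)
{
    return (c >= 0x20 && c <= 0x7E) ||
           c == '\t' || c == '\n' || c == '\r' || c == '\f';
}

/* Fraction of text-like bytes in buf, in the range 0.0 to 1.0.
   An empty buffer is reported as 1.0 (no binary byte was seen). */
double text_fraction(const unsigned char *buf, size_t len)
{
    size_t i, text = 0;

    assert(buf != NULL || len == 0);
    if (len == 0)
        return 1.0;
    for (i = 0; i < len; i++)
        text += looks_like_text(buf[i]);
    return (double)text / (double)len;
}

/* Same measure over a whole file, read in chunks.
   Returns -1.0 if the file cannot be opened or a read error occurs. */
double file_text_fraction(const char *path)
{
    unsigned char chunk[4096];
    size_t n, i, total = 0, text = 0;
    int failed;
    FILE *fp = fopen(path, "rb");   /* "rb": no newline translation */

    if (fp == NULL)
        return -1.0;
    while ((n = fread(chunk, 1, sizeof chunk, fp)) > 0) {
        total += n;
        for (i = 0; i < n; i++)
            text += looks_like_text(chunk[i]);
    }
    failed = ferror(fp);
    fclose(fp);
    if (failed)
        return -1.0;
    return total == 0 ? 1.0 : (double)text / (double)total;
}
```

Where to draw the line between "text" and "binary" on the returned fraction is a judgment call; an is_binary_file() built on this would have to pick a threshold, and a PDF can land on either side of it.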
**********************************************************************************************************
%PDF-1.4        <=== these can "look like binary"        "25 B8 9A 92 9D 0A"
1 0 obj<>
endobj
2 0 obj<>
endobj
3 0 obj<>
endobj
4 0 obj<>>>/Contents 5 0 R>>
endobj
5 0 obj<>stream
BT
/F0 12 Tf
1 0 0 1 100 702.7366667 Tm
(Hello World!)Tj
ET
endstream
endobj
6 0 obj<>
endobj
7 0 obj[278 278 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 722 0 0 0 0 0 0 0 0 0 0 0 0 0 0 944 0 0 0 0 0 0 0 0 0 0 0 0 556
556 0 0 0 0 0 0 222 0 0 556 0 0 333]
endobj
8 0 obj<>
endobj
xref
0 9
0000000000 65535 f
0000000015 00000 n
0000000059 00000 n
0000000179 00000 n
0000000257 00000 n
0000000346 00000 n
0000000451 00000 n
0000000573 00000 n
0000000773 00000 n
trailer
<<9392A59F3BE7
840805D62746E8A4F29>]/Info 2 0 R/Size 9>>
startxref
988
%%EOF
**********************************************************************************************************

If "there has to be binary in it", it's on the second line.
The other lines can be text... if the tools and print drivers
wanted to do it that way.

 Paul

--- SoupGate-Win32 v1.05
 * Origin: you cannot sedate... all the things you hate (1:229/2)
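The "binary is on the second line" observation can be checked mechanically. Below is a rough sketch, not a PDF parser: pdf_has_binary_marker() is a name invented for this example, and the threshold of four bytes with value 128 or more follows the PDF spec's recommendation for the comment line after the header, as far as I understand it. It looks only at the first two lines of a buffer and says nothing about binary data elsewhere in the file.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if buf starts with "%PDF-" and the line right after the
   header is a comment holding at least four bytes >= 0x80.
   Returns 0 otherwise, including for non-PDF input. */
int pdf_has_binary_marker(const unsigned char *buf, size_t len)
{
    size_t i = 0, high = 0;

    assert(buf != NULL || len == 0);
    if (len < 5 || memcmp(buf, "%PDF-", 5) != 0)
        return 0;

    /* Skip the rest of the header line. */
    while (i < len && buf[i] != '\n' && buf[i] != '\r')
        i++;
    /* Skip the end-of-line marker: LF, CR, or CR LF. */
    if (i < len && buf[i] == '\r')
        i++;
    if (i < len && buf[i] == '\n')
        i++;

    /* The second line only counts if it is a comment. */
    if (i >= len || buf[i] != '%')
        return 0;
    i++;

    /* Count high-bit bytes up to the end of that line. */
    while (i < len && buf[i] != '\n' && buf[i] != '\r') {
        if (buf[i] >= 0x80)
            high++;
        i++;
    }
    return high >= 4;
}
```

Fed the first bytes of a file whose second line is 25 B8 9A 92 9D 0A, as in the dump above, this returns 1; a PDF that omits the comment line returns 0 even though it is still a valid PDF.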
(c) 1994, bbs@darkrealms.ca