comp.lang.c
Message 242,347 of 243,242
Paul to Michael Sanders
Re: is_binary_file()
06 Dec 25 03:14:55
   From: nospam@needed.invalid   
      
   On Fri, 12/5/2025 8:05 PM, Michael Sanders wrote:   
   > Am I close? Missing anything you'd consider to be (or not) needed?   
   >   
   >    
   >   
   > /*   
   >  * Checks if a file is likely a binary by examining its content   
   >  * for NULL bytes (0x00) or unusual control characters.   
   >  * Returns 0 if text, 1 if binary or file open failure.   
   >  */   
   >   
   > int is_binary_file(const char *path) {   
   >     FILE *f = fopen(path, "rb");   
   >     if (!f) return 1; // cannot open file, treat as error/fail check   
   >   
   >     unsigned char buf[65536];   
   >     size_t n, i;   
   >   
   >     while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {   
   >         for (i = 0; i < n; i++) {   
   >             unsigned char c = buf[i];   
   >   
   >             // 1. check for the NULL byte (strong indicator of binary data)   
   >             if (c == 0x00) {   
   >                 fclose(f);   
   >                 return 1; // IS binary   
   >             }   
   >   
   >             // 2. check for C0 control codes (0x01-0x1F), excluding known   
   >             // text formatting characters: 0x09 (Tab), 0x0A (LF), 0x0D (CR)   
   >             if (c < 0x20) {   
   >                 if (c != 0x09 && c != 0x0A && c != 0x0D) {   
   >                     fclose(f);   
   >                     return 1; // IS binary (contains unexpected control code)   
   >                 }   
   >             }   
   >         }   
   >     }   
   >   
   >     fclose(f);   
   >     return 0; // NOT binary   
   > }   
   >   
      
   It is the year 2025.   
      
   How many times do you suppose someone has considered this question?
      
   I'm not trying to be a smart ass by saying this, just that the   
   question is bound to be nuanced. You can do a fast and totally   
   inaccurate determination. You can do a computationally expensive   
   or I/O expensive determination.   
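
   One way to make that trade-off explicit is to cap how much of the
   file gets read. The following is only a sketch: is_binary_prefix()
   and its limit parameter are hypothetical names, and the control-code
   rule is the one from the quoted function. It also reports an open or
   read failure separately (-1) instead of folding it into "binary".

```c
#include <stdio.h>

/*
 * Hypothetical variant of the quoted heuristic with a read cap.
 * Returns 1 if a NUL or unexpected C0 control code is seen,
 * 0 if none is seen in the bytes examined, -1 on open/read failure.
 * limit == 0 means "read the whole file".
 */
int is_binary_prefix(const char *path, size_t limit)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;

    unsigned char buf[4096];
    size_t seen = 0;
    int result = 0;

    while (result == 0 && (limit == 0 || seen < limit)) {
        size_t want = sizeof buf;
        if (limit != 0 && limit - seen < want)
            want = limit - seen;      /* never read past the cap */

        size_t n = fread(buf, 1, want, f);
        if (n == 0)
            break;                    /* EOF or error */

        for (size_t i = 0; i < n; i++) {
            unsigned char c = buf[i];
            /* 0x00-0x1F except TAB, LF, CR counts as binary */
            if (c < 0x20 && c != 0x09 && c != 0x0A && c != 0x0D) {
                result = 1;
                break;
            }
        }
        seen += n;
    }

    if (result == 0 && ferror(f))
        result = -1;
    fclose(f);
    return result;
}
```

   A small limit is fast but misses anything past the cap; limit 0 is
   the I/O expensive version. Neither one is "accurate" in any strong
   sense.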
      
   There has to be a reason for doing this, and a damn good reason.   
      
   *******   
      
   There is the "file" command.   
      
   It was invented in 1973.   
      
      https://en.wikipedia.org/wiki/File_%28command%29   
      
   The beauty of this command is that it takes an ordered
   approach to file determination.
      
   Originally, as I understand it (I don't see it in the Wiki), it
   was not supposed to read more than 1024 bytes of the file. This
   was because the command was intended to settle file determinations
   for "ordered types". For example, an MSWord doc might have four
   unique bytes near the beginning of the file. The designers felt
   they could quickly "sort" or "determine" what kind of highly
   stylized file they were dealing with.
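
   A minimal sketch of that idea, for illustration only: compare the
   first few bytes of a buffer against a short table of signatures.
   The table layout and the name sniff_magic() are hypothetical and are
   not taken from the "file" source. The byte sequences themselves are
   the commonly documented ones for these formats.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical signature table entry: a label plus leading bytes. */
struct magic_sig {
    const char *name;
    const unsigned char *bytes;
    size_t len;
};

static const struct magic_sig sigs[] = {
    { "PNG image",    (const unsigned char *)"\x89PNG\r\n\x1a\n", 8 },
    { "PDF document", (const unsigned char *)"%PDF-",             5 },
    { "ELF object",   (const unsigned char *)"\x7f" "ELF",        4 },
    { "ZIP archive",  (const unsigned char *)"PK\x03\x04",        4 },
    { "gzip data",    (const unsigned char *)"\x1f\x8b",          2 },
};

/*
 * Returns the label of the first signature that matches the start
 * of buf, or NULL if nothing in the table matches.
 */
const char *sniff_magic(const unsigned char *buf, size_t n)
{
    for (size_t i = 0; i < sizeof sigs / sizeof sigs[0]; i++) {
        if (n >= sigs[i].len &&
            memcmp(buf, sigs[i].bytes, sigs[i].len) == 0)
            return sigs[i].name;
    }
    return NULL;
}
```

   The caller would fread() the first handful of bytes and pass them
   in. A NULL result only means "not in this tiny table", not "text".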
      
   But the results I got one day a couple of years ago suggest
   they have strayed from that. I got around 100 different text
   file declarations. For example, a text file with a binary block
   in it as a "corruption" is still declared a text file, but
   "ISO something or other" becomes part of the file type
   determination. Thus, when I see that a certain file on my computer
   is no longer a plain text file, but contains the word ISO,
   I must scroll through it with a hex editor and see what
   the hell has triggered this determination.
      
   The experience suggested the entire text file was being read.   
   I did not craft any tests to see if that was true.   
      
   Some file types receive very little differentiation. There is
   only the one detection for them, and that detection offers no
   help for technical people.
      
   That's an exemplar of a still-supported effort to identify files.   
   The "file" command. It does not rely upon, or use, the extension.   
      
   And those people are wizards. You can't expect to just read their   
   source and make some instant discovery. Sometimes, when someone   
   asks for a new detection, the wizards know of some dependencies   
   in the detection tree that prevent the craftsmanship necessary.   
   Mere mortals need not apply while this is going on.   
      
   To find 100 different text file types, I un-tarred the Firefox   
   source tarball and scanned it, then used AWK to total the   
   various detections and print them out. I only used the AWK   
   code, after being shocked to find what a shithole the tarball was.   
   I had originally intended to run UNIX2DOS over the thing, but   
   that was entirely out of the question when the detections   
   came in. In fact, there is just one source file in the Firefox   
   tree, that you MUST NOT alter. It breaks the build, if you do   
   ANYTHING to it. Good times. I could not figure out why gcc   
   had such a problem with the file. Could not root cause it.   
      
   *******   
      
   As a little example, I will scan the Sent file of my News Client,
   which I happen to know is corrupted, but I haven't bothered to
   fix it yet. And how I detected the corruption in the first place
   was by running this!

   $ file Sent
   Sent:        Non-ISO extended-ASCII text, with very long lines, with CRLF, NEL line terminators
      
   That is a corrupt one.   
      
   $ file Trash
   Trash:       ASCII text, with CRLF line terminators   
      
   That is not corrupt.   
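
   To narrow down what triggered a report like the one for Sent, a
   byte-class tally can point the hex editor at the right offset. This
   is a sketch under assumptions, not how "file" does it: the struct,
   the field names and scan_bytes() are hypothetical, and it treats
   the input as a single-byte encoding. NEL is byte 0x85 in the C1
   control range.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical tally of byte classes; names are illustrative only. */
struct byte_stats {
    size_t nul;            /* 0x00 */
    size_t c0_other;       /* 0x01-0x1F except TAB, LF, CR */
    size_t c1;             /* 0x80-0x9F, which includes NEL */
    size_t nel;            /* 0x85 specifically */
    size_t high;           /* 0xA0-0xFF */
    size_t crlf;           /* LF preceded by CR */
    size_t lone_lf;        /* LF not preceded by CR */
    size_t first_suspect;  /* offset of first NUL/C0/C1 byte,
                              or (size_t)-1 if none */
};

void scan_bytes(const unsigned char *buf, size_t n, struct byte_stats *st)
{
    memset(st, 0, sizeof *st);
    st->first_suspect = (size_t)-1;

    for (size_t i = 0; i < n; i++) {
        unsigned char c = buf[i];
        int suspect = 0;

        if (c == 0x00) {
            st->nul++;
            suspect = 1;
        } else if (c == 0x0A) {
            if (i > 0 && buf[i - 1] == 0x0D)
                st->crlf++;
            else
                st->lone_lf++;
        } else if (c < 0x20 && c != 0x09 && c != 0x0D) {
            st->c0_other++;
            suspect = 1;
        } else if (c >= 0x80 && c <= 0x9F) {
            st->c1++;
            if (c == 0x85)
                st->nel++;
            suspect = 1;
        } else if (c >= 0xA0) {
            st->high++;
        }

        if (suspect && st->first_suspect == (size_t)-1)
            st->first_suspect = i;
    }
}
```

   This works on one buffer at a time, so a CR LF pair split across
   two reads would be counted as a lone LF. UTF-8 text will also show
   hits in the c1 field, because its continuation bytes overlap that
   range.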
      
   $ dd if=/dev/urandom of=big.bin bs=1048576 count=1024   
   1024+0 records in   
   1024+0 records out   
   1073741824 bytes (1.1 GB, 1.0 GiB) copied, 7.44362 s, 144 MB/s   
      
   $ file big.bin
   big.bin: data

   "data" is not definitive, as even trivially distorted files do this.
   This file just happens to be "perfectly undetectable".
      
   A file full of zeros is also "data". There is no special detection for it.
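
   If that particular case matters, it is cheap to check separately.
   A minimal sketch, with hypothetical helper names is_all_zero() and
   file_is_all_zero(); note that an empty file counts as "all zero"
   here, which may or may not be what you want.

```c
#include <stdio.h>
#include <stddef.h>

/* Returns 1 if every byte in buf is 0x00 (vacuously true for n == 0). */
int is_all_zero(const unsigned char *buf, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (buf[i] != 0)
            return 0;
    }
    return 1;
}

/*
 * Reads f from its current position to EOF.
 * Returns 1 if all bytes are zero, 0 if a non-zero byte is found,
 * -1 on read error.
 */
int file_is_all_zero(FILE *f)
{
    unsigned char buf[4096];
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, f)) > 0) {
        if (!is_all_zero(buf, n))
            return 0;
    }
    return ferror(f) ? -1 : 1;
}
```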
      
      Paul   
      