
   comp.lang.c      Meh, in C you gotta define EVERYTHING      243,242 messages   


   Message 242,346 of 243,242   
   Keith Thompson to Michael Sanders   
   Re: is_binary_file()   
   05 Dec 25 17:42:30   
   
   From: Keith.S.Thompson+u@gmail.com   
      
   Michael Sanders writes:
   > Am I close? Missing anything you'd consider to be (or not) needed?   
      
   There is no completely reliable way to do this, but you might be   
   able to make a reasonable guess.  A binary file might happen to   
   contain only byte values that represent printable characters.   
      
   >    
   >   
   > /*   
   >  * Checks if a file is likely a binary by examining its content   
   >  * for NULL bytes (0x00) or unusual control characters.   
   >  * Returns 0 if text, 1 if binary or file open failure.   
   >  */   
      
   Please use the term "null bytes", not "NULL bytes".  NULL is a standard   
   macro that expands to a null pointer constant.   
      
   > int is_binary_file(const char *path) {   
   >     FILE *f = fopen(path, "rb");   
   >     if (!f) return 1; // cannot open file, treat as error/fail check   
      
   It seems odd to say that a file is assumed to be binary if you can't   
   open it.  I suggest having the function return more than two distinct   
   values:   
      
   - File seems to be binary   
   - File seems to be text   
   - Could be either   
   - Something went wrong   
      
   An enum is probably a good choice.   
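
A minimal sketch of what such an enum might look like. The type name, the enumerator names, and the file_kind_name() helper are made up for illustration; only the four outcomes come from the list above.

```c
/* Hypothetical names: one enumerator per outcome listed above. */
enum file_kind {
    FILE_KIND_BINARY,   /* file seems to be binary */
    FILE_KIND_TEXT,     /* file seems to be text */
    FILE_KIND_UNKNOWN,  /* could be either */
    FILE_KIND_ERROR     /* something went wrong (e.g. fopen failed) */
};

/* Human-readable label for a result, handy for diagnostics. */
const char *file_kind_name(enum file_kind k)
{
    switch (k) {
    case FILE_KIND_BINARY:  return "binary";
    case FILE_KIND_TEXT:    return "text";
    case FILE_KIND_UNKNOWN: return "unknown";
    case FILE_KIND_ERROR:   return "error";
    }
    return "invalid";  /* value outside the enum */
}
```

With this, a failed fopen() can be reported as FILE_KIND_ERROR instead of being folded into the "binary" result.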
      
   >     unsigned char buf[65536];   
   >     size_t n, i;   
   >   
   >     while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {   
      
   Since you're only looking at individual characters, you might as well   
   read one character at a time.  The stdio functions will buffer the input   
   for you, so there won't be much loss of performance.   
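
A rough sketch of the one-character-at-a-time loop, simplified to look only for null bytes. The function name and return convention are assumptions for illustration, and the caller is assumed to have opened the stream in binary mode.

```c
#include <stdio.h>

/* Hypothetical helper: returns 1 if the stream contains a null byte,
   0 if it does not, and -1 if a read error occurred. */
int stream_has_null_byte(FILE *f)
{
    int c;  /* int, not char: EOF must be distinguishable from every byte value */

    while ((c = getc(f)) != EOF) {
        if (c == '\0')
            return 1;
    }
    /* getc returns EOF for both end-of-file and errors; ferror tells them apart. */
    return ferror(f) ? -1 : 0;
}
```

No explicit buffer is needed here; stdio does the buffering behind getc().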
      
   >         for (i = 0; i < n; i++) {   
   >             unsigned char c = buf[i];   
   >   
   >             // 1. check for the NULL byte (strong indicator of binary data)
      
   "null byte", not "NULL byte".   
      
   >             if (c == 0x00) {   
   >                 fclose(f);   
   >                 return 1; // IS binary   
   >             }   
   >   
   >             // 2. check for C0 control codes (0x01-0x1F), excluding known   
   >             // text formatting characters: 0x09 (Tab), 0x0A (LF), 0x0D (CR)   
   >             if (c < 0x20) {   
   >                 if (c != 0x09 && c != 0x0A && c != 0x0D) {   
      
   This test will detect '\0' bytes, making your first check redundant.   
      
   >                     fclose(f);   
   >                     return 1; // IS binary (contains unexpected control code)   
   >                 }   
      
   You're assuming an ASCII-based character set, which is very   
   probably a safe assumption.  But I'd suggest replacing most of   
   the hex constants with character constants.  Aside from being more   
   portable (realistically EBCDIC systems are the only case where it   
   will matter), it makes the code more readable.  Encodings such as
   UTF-8 and UTF-16 make things a lot more complicated.
      
   0x00 -> '\0'   
   0x20 -> ' '   
   0x09 -> '\t'   
   0x0A -> '\n'   
   0x0D -> '\r'   
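
With those substitutions, the control-character test might be written as follows. This is only a sketch: the function name is made up, and the `c < ' '` comparison still assumes an ASCII-based character set, where all of those control characters sort below space.

```c
/* Hypothetical predicate: nonzero if c is a control character below
   space other than tab, line feed, or carriage return.
   Assumes an ASCII-based character set. */
int is_suspicious_control(unsigned char c)
{
    return c < ' ' && c != '\t' && c != '\n' && c != '\r';
}
```

Since '\0' is less than ' ', this one test also covers null bytes.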
      
   >             }   
   >         }   
   >     }   
   >   
   >     fclose(f);   
      
   fclose(f) can fail.  That's not likely, but you should check.   
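
One possible way to check it, as a sketch. The close_checked() name and the choice to report the failure with perror() are assumptions, not something from the original code.

```c
#include <stdio.h>

/* Hypothetical wrapper: closes f and reports a failure on stderr.
   Returns 0 on success, -1 if fclose failed. */
int close_checked(FILE *f, const char *path)
{
    if (fclose(f) == EOF) {
        perror(path);  /* prints path followed by the system error message */
        return -1;
    }
    return 0;
}
```

The caller can then map a -1 result to whatever "something went wrong" value the function returns.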
      
   >     return 0; // NOT binary   
   > }   
      
   You treat an empty file as text.  That's not entirely unreasonable,   
   but you should at least document it.   
      
   You assume that a binary file is one that contains any byte values   
   in the range 0..31 other than '\t', '\n', and '\r'.  So a "text"   
   file can't contain formfeed characters (debatable), but it can   
   contain DEL characters and anything above 127.   
      
   For Latin-1, values from 0xa0 to 0xff are printable (0xa0 is   
   NO-BREAK SPACE, so that might be debatable).  For UTF-8, bytes with   
   values 0x80 and higher can be valid, but only in certain contexts.   
   And so on.   
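
To illustrate the "only in certain contexts" part, here is a sketch of a UTF-8 well-formedness check over a buffer. The function name is made up, and this is just one way to write it; it rejects stray continuation bytes, truncated sequences, overlong encodings, surrogates, and code points above U+10FFFF. Note that it accepts null bytes and other control characters, so it complements the control-character test rather than replacing it.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if buf[0..len) is well-formed UTF-8, else 0. */
int is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        size_t need;        /* number of continuation bytes expected */
        unsigned long cp;   /* code point being decoded */
        unsigned long min;  /* smallest code point allowed for this length */

        if (c < 0x80) { i++; continue; }  /* single-byte (ASCII) */
        else if ((c & 0xE0) == 0xC0) { need = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { need = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { need = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;  /* stray continuation byte or invalid lead byte */

        if (len - i - 1 < need)
            return 0;   /* sequence truncated at end of buffer */

        for (size_t k = 1; k <= need; k++) {
            unsigned char cc = buf[i + k];
            if ((cc & 0xC0) != 0x80)
                return 0;  /* not a continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (cp < min) return 0;                       /* overlong encoding */
        if (cp > 0x10FFFF) return 0;                  /* beyond Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* UTF-16 surrogate */

        i += need + 1;
    }
    return 1;
}
```

For example, the two bytes 0xC3 0xA9 (UTF-8 for "é") pass, while a lone 0xE9 (Latin-1 for "é") does not, because 0xE9 is a lead byte that needs two continuation bytes.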
      
   Depending on how far you want to get into it, distinguishing between   
   text and binary files is anywhere from difficult to literally   
   impossible.   
      
   Take a look at the "file" command.   
      
   --   
   Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com   
   void Void(void) { Void(); } /* The recursive call of the void */   
      

(c) 1994,  bbs@darkrealms.ca