   comp.lang.c      Meh, in C you gotta define EVERYTHING      243,242 messages   

   Message 242,404 of 243,242   
   James Kuyper to Michael Sanders   
   Re: is_binary_file()   
   10 Dec 25 12:46:36   
   
   From: jameskuyper@alumni.caltech.edu   
      
   On 2025-12-10 06:35, Michael Sanders wrote:   
   ...   
   > #include <stdio.h>  // FILE, fopen, fread, fclose   
   > #include <stddef.h>  // size_t   
   >   
   > // is_text_file()   
   > // Returns:   
   > // -1 : could not open file   
   > // 0 : is NOT a text file (binary indicators found)   
   > // 1 : is PROBABLY a text file (no strong binary signatures)   
   >   
   > int is_text_file(const char *path) {   
   >     // Try opening the file in binary mode,   
   >     // required so that bytes are read exact.   
   >     FILE *f = fopen(path, "rb");   
   >     if (!f) return -1; // Could not open file   
   >   
   >     unsigned char buf[4096]; // 4KB chunks   
   >     size_t n, i;   
   >   
   >     // Read in file until EOF   
   >     while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {   
   >         for (i = 0; i < n; i++) {   
   >             unsigned char c = buf[i];   
      
   I'd recommend against buffering this; C stdio is already buffered, and   
   it just complicates your code to keep track of a second level of   
   buffering. Use getc() instead.   
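      
   A minimal sketch of what the getc() version might look like. The   
   function name is_text_file_getc() is hypothetical, and the checks are   
   carried over unchanged from the quoted code. Like the original, it does   
   not distinguish a read error from end of file.   

```c
#include <stdio.h>

/* Hypothetical getc()-based variant of the quoted is_text_file().
 * Returns -1 if the file could not be opened,
 *          0 if a binary indicator was found,
 *          1 if the file is probably text. */
int is_text_file_getc(const char *path)
{
    FILE *f = fopen(path, "rb");   /* binary mode: bytes are read unchanged */
    if (!f)
        return -1;

    int c;            /* int, not char, so EOF stays distinct from data */
    int result = 1;   /* assume text until a suspicious byte shows up */

    /* stdio buffers the stream internally, so reading one byte at a
     * time does not mean one system call per byte. */
    while ((c = getc(f)) != EOF) {
        /* Same test as the quoted code: null byte or any C0 control
         * code other than tab, line feed, carriage return. */
        if (c < 0x20 && c != 0x09 && c != 0x0A && c != 0x0D) {
            result = 0;
            break;
        }
    }

    fclose(f);
    return result;
}
```

   With a single exit path, fclose() appears only once instead of being   
   repeated before every early return.   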
      
   >             // 1. null byte is a very strong indication of binary data.   
   >             // Text files virtually never contain 0x00.   
   >             if (c == 0x00) {   
   >                 fclose(f);   
   >                 return 0; // Contains binary-only byte: NOT text   
   >             }   
   >   
   >             // 2. Check for raw C0 control codes (0x01–0x1F).   
   >             // We *allow* \t (09), \n (0A), \r (0D) because they are normal in text.   
   >             // Any other control code is highly suspicious and usually means binary.   
   >             if (c < 0x20) {   
   >                 if (c != 0x09 && c != 0x0A && c != 0x0D) {   
   >                     fclose(f);   
   >                     return 0; // unexpected control character → NOT text   
   >                 }   
   >             }   
      
   I would recommend against using explicit numerical codes for   
   characters. They make your code dependent upon a particular encoding.   
   You're free to make that choice, but on implementations where that   
   encoding is the default, the corresponding C character constants and   
   escape sequences will have precisely the correct value, and they make   
   it easier to understand what your code is doing:   
      
   0x00 '\0'   
   0x09 '\t'   
   0x0A '\n'   
   0x0D '\r'   
   0x20 ' '   
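      
   As a sketch, the per-byte test could be written with those constants   
   as follows. The helper name is_binary_indicator() is hypothetical. Note   
   that the ordering comparison c < ' ' still assumes an ASCII-like layout   
   in which the control characters sort below the space character.   

```c
/* Hypothetical helper: returns 1 if byte c suggests binary data,
 * 0 otherwise. Uses character constants instead of numeric codes. */
int is_binary_indicator(unsigned char c)
{
    /* A null byte is a strong sign of binary data. */
    if (c == '\0')
        return 1;

    /* Control characters other than tab, newline and carriage return.
     * Assumption: control characters compare less than ' ', which
     * holds for ASCII-compatible encodings. */
    if (c < ' ' && c != '\t' && c != '\n' && c != '\r')
        return 1;

    return 0;
}
```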
      
(c) 1994,  bbs@darkrealms.ca