

   comp.lang.c      Meh, in C you gotta define EVERYTHING      243,242 messages   


   Message 242,400 of 243,242   
   Michael Sanders to Lew Pitcher   
   Re: is_binary_file()   
   10 Dec 25 11:35:48   
   
   From: porkchop@invalid.foo   
      
   On Sat, 6 Dec 2025 02:00:22 -0000 (UTC), Lew Pitcher wrote:   
      
   > I should have added that I feel that you probably haven't really   
   > defined /what/ "text file" means, and that has interfered with   
   > the development of this function. As Keith pointed out, the task   
   > of distinguishing between a "text" file and a "binary" file is not   
   > easy. I'll add that a lot of the difficulty stems from the fact   
   > that there are many definitions (some conflicting) of what a "text"   
   > file actually contains.   
      
   Yes. Here's my 2nd attempt following the template (of thinking)   
   you've suggested...   
      
   #include <stdio.h>   // FILE, fopen, fread, fclose
   #include <stddef.h>  // size_t
      
   // is_text_file()   
   // Returns:   
   //   -1 : could not open file   
   //    0 : is NOT a text file (binary indicators found)   
   //    1 : is PROBABLY a text file (no strong binary signatures)   
      
   int is_text_file(const char *path) {   
       // Open the file in binary mode, which is required
       // so that bytes are read exactly as stored.
       FILE *f = fopen(path, "rb");   
       if (!f) return -1; // Could not open file   
      
       unsigned char buf[4096]; // 4KB chunks   
       size_t n, i;   
      
       // Read in file until EOF   
       while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {   
           for (i = 0; i < n; i++) {   
               unsigned char c = buf[i];   
      
               // 1. A null byte is a very strong indication of binary data.
               //    ASCII and UTF-8 text files virtually never contain 0x00
               //    (UTF-16/UTF-32 text does, so it is reported as NOT text).
               if (c == 0x00) {   
                   fclose(f);   
                   return 0; // Contains binary-only byte: NOT text   
               }   
      
               // 2. Check for raw C0 control codes (0x01–0x1F).   
               //    We *allow* \t (09), \n (0A), \r (0D) because they are
               //    normal in text. Any other control code is highly
               //    suspicious and usually means binary.
               if (c < 0x20) {   
                   if (c != 0x09 && c != 0x0A && c != 0x0D) {   
                       fclose(f);   
                       return 0; // unexpected control character → NOT text   
                   }   
               }   
      
               // 3. NOTE: We intentionally do *not* reject bytes >= 0x80.   
               //    These occur in UTF-8, extended ASCII, and many local
               //    encodings.
               //    Rejecting them would treat valid multilingual text as binary.   
               //    So we treat high bytes as acceptable for "probably text".   
           }   
       }   
      
       fclose(f);   
       return 1; // Probably text (no strong binary signatures found)   
   }   
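
For testing the byte rule without touching the filesystem, the same checks can be pulled out into a buffer-level helper. This is only a sketch: is_text_byte() and buffer_looks_like_text() are hypothetical names that are not part of the function above, and they assume exactly the rule stated in the comments (reject 0x00 and every C0 control code except \t, \n, \r; accept everything else, including 0x7F and bytes >= 0x80).

```c
#include <stddef.h>  // size_t

// Hypothetical helper (not in the original post): applies the same
// per-byte rule as is_text_file() to a single byte.
// Returns 1 if the byte is acceptable for "probably text", 0 otherwise.
static int is_text_byte(unsigned char c) {
    if (c >= 0x20) return 1;  // printable ASCII, DEL (0x7F), and high bytes pass
    // Below 0x20 (this includes 0x00), only \t, \n and \r are allowed.
    return c == 0x09 || c == 0x0A || c == 0x0D;
}

// Hypothetical helper: scans an in-memory buffer with that rule.
// Returns 1 if every byte passes, 0 at the first byte that fails.
// An empty buffer returns 1, just as is_text_file() does for an empty file.
static int buffer_looks_like_text(const unsigned char *buf, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (!is_text_byte(buf[i])) return 0;
    }
    return 1;
}
```

Feeding it a few literal buffers makes the consequences of the rule easy to see: UTF-8 bytes pass, while an ANSI escape sequence (ESC, 0x1B) or a form feed (0x0C) is classified as binary, which may or may not be what a given definition of "text" wants.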
      
   --   
   :wq   
   Mike Sanders   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   


