From: lew.pitcher@digitalfreehold.ca   
      
   On Sat, 06 Dec 2025 01:05:44 +0000, Michael Sanders wrote:   
      
   > Am I close? Missing anything you'd consider to be (or not) needed?   
   >   
   >    
   >   
   > /*   
   > * Checks if a file is likely a binary by examining its content   
   > * for NULL bytes (0x00) or unusual control characters.   
   > * Returns 0 if text, 1 if binary or file open failure.   
   > */   
      
   First off, until we get computers that store file data in formats   
   other than binary, /all/ files (text or not) are "binary" files   
   (meaning that an is_binary_file() function should always return true).   
   OTOH, "text files" are a distinguishable subset of binary files.   
   I suggest that this makes an "is_text_file()" function more valuable   
   and more fitting than an "is_binary_file()" function.   
      
   Secondly, ISTM that the function should return a unique failure value   
   rather than overload the "is binary" return value. After all, you   
   actually have three return values: is_text, is_not_text, and   
   is_indeterminate (because of file access failure).   
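
   A minimal sketch of that three-way result, for illustration only. The
   names text_check, TEXT_CHECK_* and check_text_file() are hypothetical
   (they are not from the quoted code), and the classification itself is
   reduced to a NUL-byte test so that the return values stay in focus.

```c
#include <stdio.h>

/* Hypothetical names; the three outcomes are the ones listed above. */
enum text_check {
    TEXT_CHECK_ERROR    = -1, /* indeterminate: could not open or read */
    TEXT_CHECK_NOT_TEXT = 0,
    TEXT_CHECK_IS_TEXT  = 1
};

/* Placeholder classification: only a NUL byte counts as "not text". */
enum text_check check_text_file(const char *path)
{
    FILE *f = fopen(path, "rb");
    enum text_check result = TEXT_CHECK_IS_TEXT;
    int c;

    if (f == NULL)
        return TEXT_CHECK_ERROR;

    while ((c = getc(f)) != EOF) {
        if (c == '\0') {
            result = TEXT_CHECK_NOT_TEXT;
            break;
        }
    }

    /* A read error is also "indeterminate", not "binary". */
    if (ferror(f))
        result = TEXT_CHECK_ERROR;

    fclose(f);
    return result;
}
```

   With this shape, a caller can report "cannot read file" separately
   from "file is not text" instead of conflating the two.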
      
   Thirdly, your determination of whether or not the file contains text   
   seemingly depends only on the existence or absence of certain control   
   characters. But text isn't just control characters; so you need a test   
   for invalid non-control characters as well. And, IIRC, not all control   
   characters occupy the ASCII/Unicode C0 band, so you might have to expand   
   your "acceptable control character" test to include some of those other   
   control codes.   
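
   As a rough illustration of control codes outside the C0 band: DEL is
   0x7F, and the C1 set occupies 0x80-0x9F in single-byte ISO 8859 style
   encodings. The helper below (a hypothetical name) assumes such an
   ASCII-compatible single-byte encoding. It does not apply to UTF-8,
   where those byte values are continuation bytes and the data would have
   to be decoded to code points first.

```c
/* Assumption: ASCII-compatible single-byte encoding (e.g. ISO 8859-1).
   Returns 1 for DEL (0x7F) and the C1 controls (0x80-0x9F), else 0. */
int is_control_outside_c0(unsigned char c)
{
    return c == 0x7F || (c >= 0x80 && c <= 0x9F);
}
```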
      
   Finally, you've hardcoded the numeric values of certain acceptable
   ASCII/Unicode control characters. However, not all platforms use ASCII
   or Unicode, and on those platforms these tests would check the wrong
   character values (I think here of EBCDIC, where "Line Feed" doesn't exist
   but its equivalent "NewLine" is 0x15 and Horizontal Tab is 0x05). Better
   here to use the equivalent C escape sequences '\n' and '\t' instead.
   You may also consider expanding the control-character test to include other   
   line-formatting characters (at least as far as C will allow): Vertical Tab   
   ('\v'), Form Feed ('\f'), Carriage Return ('\r') and Backspace ('\b').   
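
   A sketch of a per-character test written that way, under stated
   assumptions: is_text_char() is a hypothetical name, and isprint() is
   used as a stand-in for "valid non-control character". In the default
   "C" locale that typically rejects every byte outside the basic
   character set, so multibyte text such as UTF-8 would need a different
   validity test.

```c
#include <ctype.h>
#include <string.h>

/* Returns 1 if c is acceptable in a text file, 0 otherwise.
   Uses C escape sequences rather than hardcoded numeric values, so the
   compiler supplies the right codes for the execution character set. */
int is_text_char(unsigned char c)
{
    /* strchr() also matches the terminating '\0', so reject NUL first. */
    if (c == '\0')
        return 0;

    /* Line-formatting characters: LF/NL, HT, VT, FF, CR, BS. */
    if (strchr("\n\t\v\f\r\b", c) != NULL)
        return 1;

    /* Everything else must be printable in the current locale. */
    return isprint(c) != 0;
}
```

   The byte loop in the quoted function could then call this helper in
   place of the two hardcoded comparisons.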
      
      
   > int is_binary_file(const char *path) {
   >     FILE *f = fopen(path, "rb");
   >     if (!f) return 1; // cannot open file, treat as error/fail check
   >
   >     unsigned char buf[65536];
   >     size_t n, i;
   >
   >     while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {
   >         for (i = 0; i < n; i++) {
   >             unsigned char c = buf[i];
   >
   >             // 1. check for the NULL byte (strong indicator of binary data)
   >             if (c == 0x00) {
   >                 fclose(f);
   >                 return 1; // IS binary
   >             }
   >
   >             // 2. check for C0 control codes (0x01-0x1F), excluding known
   >             // text formatting characters: 0x09 (Tab), 0x0A (LF), 0x0D (CR)
   >             if (c < 0x20) {
   >                 if (c != 0x09 && c != 0x0A && c != 0x0D) {
   >                     fclose(f);
   >                     return 1; // IS binary (contains unexpected control code)
   >                 }
   >             }
   >         }
   >     }
   >
   >     fclose(f);
   >     return 0; // NOT binary
   > }
      
      
      
      
   --   
   Lew Pitcher   
   "In Skills We Trust"   
   Not LLM output - I'm just like this.   
      