From: porkchop@invalid.foo   
      
   On Sat, 6 Dec 2025 02:00:22 -0000 (UTC), Lew Pitcher wrote:   
      
   > I should have added that I feel that you probably haven't really   
   > defined /what/ "text file" means, and that has interfered with   
   > the development of this function. As Keith pointed out, the task   
   > of distinguishing between a "text" file and a "binary" file is not   
   > easy. I'll add that a lot of the difficulty stems from the fact   
   > that there are many definitions (some conflicting) of what a "text"   
   > file actually contains.   
      
   Yes. Here's my second attempt, following the line of thinking
   you suggested...
      
   #include <stdio.h>  // FILE, fopen, fread, fclose
   #include <stddef.h> // size_t
      
   // is_text_file()
   // Returns:
   //  -1 : could not open or read the file
   //   0 : is NOT a text file (binary indicators found)
   //   1 : is PROBABLY a text file (no strong binary signatures)

   int is_text_file(const char *path) {
       // Open the file in binary mode so that bytes are read
       // exactly as stored (no newline translation).
       FILE *f = fopen(path, "rb");
       if (!f) return -1; // Could not open file

       unsigned char buf[4096]; // 4 KiB chunks
       size_t n, i;

       // Read the file chunk by chunk until EOF or a read error
       while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {
           for (i = 0; i < n; i++) {
               unsigned char c = buf[i];

               // 1. A null byte is a very strong indication of
               //    binary data. ASCII, UTF-8, and 8-bit encoded
               //    text virtually never contains 0x00 (UTF-16 and
               //    UTF-32 text does, so it is reported as binary).
               if (c == 0x00) {
                   fclose(f);
                   return 0; // Null byte found: NOT text
               }

               // 2. Check for raw C0 control codes (0x01–0x1F).
               //    We *allow* \t (09), \n (0A), \r (0D) because
               //    they are normal in text. Any other control code
               //    is highly suspicious and usually means binary.
               if (c < 0x20 && c != 0x09 && c != 0x0A && c != 0x0D) {
                   fclose(f);
                   return 0; // unexpected control character → NOT text
               }

               // 3. NOTE: We intentionally do *not* reject bytes
               //    >= 0x80. These occur in UTF-8, extended ASCII,
               //    and many local encodings. Rejecting them would
               //    treat valid multilingual text as binary, so we
               //    treat high bytes as acceptable for
               //    "probably text".
           }
       }

       // fread() returns 0 both at EOF and on error; tell them apart.
       int read_error = ferror(f);
       fclose(f);
       if (read_error) return -1; // Could not read the file

       return 1; // Probably text (no strong binary signatures found)
   }
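
One possible way to exercise the three rules without touching the
filesystem is to pull the per-byte check out into its own function.
This is only a sketch: the names byte_is_texty() and buffer_is_texty()
are hypothetical helpers, not part of the function as posted, and they
assume the same definition of "probably text" used in is_text_file().

```c
#include <stddef.h> // size_t

// Hypothetical helper (not in the posted code): classify one byte
// with the same three rules as is_text_file().
// Returns 1 if the byte is acceptable in "probably text", else 0.
int byte_is_texty(unsigned char c) {
    if (c == 0x00)
        return 0; // rule 1: null byte
    if (c < 0x20 && c != 0x09 && c != 0x0A && c != 0x0D)
        return 0; // rule 2: C0 control other than \t, \n, \r
    return 1;     // rule 3: everything else passes, including >= 0x80
}

// Hypothetical helper: apply the rules to an in-memory buffer.
// Returns 1 for "probably text", 0 for "binary indicators found".
// An empty buffer counts as text, just as an empty file does in
// is_text_file().
int buffer_is_texty(const unsigned char *buf, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (!byte_is_texty(buf[i]))
            return 0;
    }
    return 1;
}
```

With these rules, the UTF-8 bytes for "café" followed by a newline
are accepted, while the start of an ELF header (0x7F 'E' 'L' 'F'
0x02 ...) is rejected at the 0x02 byte. Note that 0x7F (DEL) itself
passes, since only codes below 0x20 are checked.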
      
   --   
   :wq   
   Mike Sanders   
      