From: lew.pitcher@digitalfreehold.ca   
      
   On Wed, 10 Dec 2025 11:35:48 +0000, Michael Sanders wrote:   
      
   > On Sat, 6 Dec 2025 02:00:22 -0000 (UTC), Lew Pitcher wrote:   
   >   
   >> I should have added that I feel that you probably haven't really   
   >> defined /what/ "text file" means, and that has interfered with   
   >> the development of this function. As Keith pointed out, the task   
   >> of distinguishing between a "text" file and a "binary" file is not   
   >> easy. I'll add that a lot of the difficulty stems from the fact   
   >> that there are many definitions (some conflicting) of what a "text"   
   >> file actually contains.   
   >   
   > Yes. Here's my 2nd attempt following the template (of thinking)   
   > you've suggested...   
      
   FWIW, my opinion doesn't matter in the measure of whether or not you have   
   written a competent is_text_file() function; what matters is that it   
   fits (or does not fit) the use-case you wrote it for. If it were me,   
   I'd have a hard time writing this function, because I don't know your   
   use-case, and I'd try to generalize it. I've worked with text files   
   stored in ASCII, and in EBCDIC, and in various Unicode formats, and   
   (god help me) in a bunch of other formats as well, and I'd have a hard   
   time generalizing all that into a universal is_text_file() function.   
      
   So, my real advice is to pick your battles, and document exactly what   
   sort of text file you intend to look for with this function. What   
   you've written might suit your needs exactly, without accounting for
   all the variations of what a text file consists of.   
      
      
   > #include <stdio.h>  // FILE, fopen, fread, fclose
   > #include <stddef.h> // size_t
   >   
   > // is_text_file()   
   > // Returns:   
   > // -1 : could not open file   
   > // 0 : is NOT a text file (binary indicators found)   
   > // 1 : is PROBABLY a text file (no strong binary signatures)   
   >   
   > int is_text_file(const char *path) {
   >     // Try opening the file in binary mode,
   >     // required so that bytes are read exact.
   >     FILE *f = fopen(path, "rb");
   >     if (!f) return -1; // Could not open file
   >
   >     unsigned char buf[4096]; // 4KB chunks
   >     size_t n, i;
   >
   >     // Read in file until EOF
   >     while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {
   >         for (i = 0; i < n; i++) {
   >             unsigned char c = buf[i];
   >
   >             // 1. null byte is a very strong indication of binary data.
   >             // Text files virtually never contain 0x00.
      
   Except for UTF-16 and UTF-32 text files, of course.

   So, part of your definition of what constitutes a text file is that
   a text file (at least as far as is_text_file() is concerned) is not
   encoded as UTF-16 or UTF-32.
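
To make the UTF-16/UTF-32 point concrete, here is a minimal sketch of a
byte-order-mark check that could run on the first few bytes before the
NUL test. The names detect_wide_bom and enum bom_kind are hypothetical
and not part of the quoted code. A BOM is optional, so this only catches
files that start with one, and FF FE 00 00 is inherently ambiguous (it
could also be a UTF-16LE BOM followed by U+0000); the sketch assumes
UTF-32LE in that case.

```c
#include <stddef.h>  // size_t

// Hypothetical helper, not part of the quoted code: reports which
// UTF-16/UTF-32 byte-order mark (if any) begins the buffer.
enum bom_kind {
    BOM_NONE,
    BOM_UTF16_LE,
    BOM_UTF16_BE,
    BOM_UTF32_LE,
    BOM_UTF32_BE
};

enum bom_kind detect_wide_bom(const unsigned char *buf, size_t n) {
    // Test the 4-byte UTF-32 marks first: FF FE 00 00 begins with the
    // UTF-16LE mark FF FE, so the order of these checks matters.
    if (n >= 4 && buf[0] == 0xFF && buf[1] == 0xFE &&
        buf[2] == 0x00 && buf[3] == 0x00)
        return BOM_UTF32_LE;
    if (n >= 4 && buf[0] == 0x00 && buf[1] == 0x00 &&
        buf[2] == 0xFE && buf[3] == 0xFF)
        return BOM_UTF32_BE;
    if (n >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16_LE;
    if (n >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16_BE;
    return BOM_NONE;  // no mark found; says nothing about BOM-less files
}
```

A caller could apply this to the first chunk read from the file and
treat any result other than BOM_NONE as "text, but not the kind the
byte-by-byte loop understands".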
      
      
   >             if (c == 0x00) {
   >                 fclose(f);
   >                 return 0; // Contains binary-only byte: NOT text
   >             }
   >
   >             // 2. Check for raw C0 control codes (0x01–0x1F).
   >             // We *allow* \t (09), \n (0A), \r (0D) because they are normal in text.
   >             // Any other control code is highly suspicious and usually means binary.
   >             if (c < 0x20) {
   >                 if (c != 0x09 && c != 0x0A && c != 0x0D) {
      
   Except for all the flavours of EBCDIC.   
      
   So, another part of your definition of what constitutes a text file is that
   a text file (at least as far as is_text_file() is concerned) is not encoded
   in EBCDIC.
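
As an illustration of the EBCDIC point, the sketch below runs the same
byte test as checks 1 and 2 of the quoted function over one short line
encoded two ways. The function name has_rejected_byte and the two sample
arrays are hypothetical. The EBCDIC bytes assume the common code page 037
layout, in which the new-line control NL is 0x15.

```c
#include <stddef.h>  // size_t

// Mirrors checks 1 and 2 of the quoted is_text_file(): returns 1 if the
// buffer holds a NUL or a C0 control byte other than 0x09, 0x0A, 0x0D.
int has_rejected_byte(const unsigned char *buf, size_t n) {
    for (size_t i = 0; i < n; i++) {
        unsigned char c = buf[i];
        if (c < 0x20 && c != 0x09 && c != 0x0A && c != 0x0D)
            return 1;
    }
    return 0;
}

// "HELLO" followed by a line end, in ASCII (LF = 0x0A).
const unsigned char hello_ascii[] = {
    0x48, 0x45, 0x4C, 0x4C, 0x4F, 0x0A
};

// The same line in EBCDIC, assuming code page 037 (NL = 0x15).
const unsigned char hello_ebcdic[] = {
    0xC8, 0xC5, 0xD3, 0xD3, 0xD6, 0x15
};
```

Under those assumptions the ASCII line passes, while the EBCDIC line is
rejected on its line terminator alone, because 0x15 falls in the range
the quoted code treats as suspicious.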
      
   >                     fclose(f);
   >                     return 0; // unexpected control character → NOT text
   >                 }
   >             }
   >
   >             // 3. NOTE: We intentionally do *not* reject bytes >= 0x80.
   >             // These occur in UTF-8, extended ASCII, and many local encodings.
   >             // Rejecting them would treat valid multilingual text as binary.
   >             // So we treat high bytes as acceptable for "probably text".
      
   Except for ASCII, which is limited to 7-bit characters between 0x00 and 0x7F
   (ignoring, of course, those text files that store ASCII with even or odd
   parity).
      
   So, another part of your definition of what constitutes a text file is that   
   a text file (at least as far as is_text_file() is concerned) may contain   
   ASCII, but is not guaranteed to do so.   
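
If the use-case really were "strict 7-bit ASCII only", the extra test is
small. This is a sketch under that assumption, with the hypothetical name
is_7bit_clean; it is not something the quoted code does, and it ignores
parity-bit variants entirely.

```c
#include <stddef.h>  // size_t

// Hypothetical stricter check: returns 1 only if every byte in the
// buffer is below 0x80, i.e. fits in 7 bits. An empty buffer passes.
int is_7bit_clean(const unsigned char *buf, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (buf[i] >= 0x80)
            return 0;  // high bit set: not plain 7-bit ASCII
    }
    return 1;
}
```

Any UTF-8 text containing a non-ASCII character would fail this test,
which is exactly the trade-off the quoted comment 3 chooses to avoid.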
      
   >         }
   >     }
   >
   >     fclose(f);
   >     return 1; // Probably text (no strong binary signatures found)
   > }
      
      
   --   
   Lew Pitcher   
   "In Skills We Trust"   
   Not LLM output - I'm just like this.   
      