From: nospam@needed.invalid   
      
   On Wed, 12/10/2025 5:37 PM, bart wrote:   
   > On 10/12/2025 19:42, Richard Heathfield wrote:   
   >> On 10/12/2025 17:18, Scott Lurndal wrote:   
   >>> Michael S writes:   
   >>>> On Wed, 10 Dec 2025 15:07:30 GMT   
   >>>> scott@slp53.sl.home (Scott Lurndal) wrote:   
   >>>>   
   >>>>> Michael Sanders writes:   
   >>>>>> On Sat, 6 Dec 2025 02:00:22 -0000 (UTC), Lew Pitcher wrote:   
   >>>>>>> I should have added that I feel that you probably haven't really   
   >>>>>>> defined /what/ "text file" means, and that has interfered with   
   >>>>>>> the development of this function. As Keith pointed out, the task   
   >>>>>>> of distinguishing between a "text" file and a "binary" file is not   
   >>>>>>> easy. I'll add that a lot of the difficulty stems from the fact   
   >>>>>>> that there are many definitions (some conflicting) of what a "text"   
   >>>>>>> file actually contains.   
   >>>>>>   
   >>>>>> Yes. Here's my 2nd attempt following the template (of thinking)   
   >>>>>> you've suggested...   
   >>>>>   
   >>>>> The problem with all of your attempts is the performance   
   >>>>> issue. Success requires reading every single byte of the   
   >>>>> file, one byte at a time. The word 'slow' is not sufficient   
   >>>>> to describe how bad the performance will be for a very large   
   >>>>> file.   
   >>>>>   
   >>>>> At a minimum, dump the stdio double-buffered byte-by-byte   
   >>>>> algorithm and use mmap().   
   >>>>>   
   >>>>   
   >>>> I suggest to do actual speed measurements before making bold   
   >>>> claims like above. Don't trust your intuition!   
   >>>   
   >>> I have, more than once, done such measurements after mmap()   
   >>> was introduced in SVR4 circa 1989 (ported from SunOS).   
   >>>   
   >>> On a single-user system, running a single job, the difference   
   >>> for smaller files is in the noise. For larger files, or when   
   >>> the system is heavily loaded or multiuser, it can be significant.   
   >>   
>> 1989 is 36 years ago. Technology has moved on. If reading your file is
>> too slow to read, get yourself a real computer.
   >>   
>> On my very ordinary desktop machine, I just freq'd[1] a
>> 7,032,963,565-byte file in 12.256 seconds. That's 573,838,410 bytes
>> per second. It's a damn sight faster than I could do by hand.
   >>   
   >> How, exactly, are you using `slow'?   
   >>   
   >   
   > A getc loop took 4.3 seconds to read a 192MB file from SSD, on my Windows PC.   
   >   
   > Under WSL it took 8.4 seconds (8.4/0.5 real/user).   
   >   
   > However reading it all in one go took 0.14 seconds.   
   >   
   > I guess not all 'getc' implementations are the same.   
      
#include <stdio.h>
#include <stdlib.h>
#include <windows.h>   /* QueryPerformanceCounter(), __int64 */
      
   /* gcc -Wl,--stack,1200000000 -o getcbench.exe getcbench.c */   
      
   int main(int argc, char **argv)   
   { FILE* source;   
      
    int c; /* getc holder */   
    const int size = 1000*1000*1000;   
    char keep[size];   
    int i=0;   
      
    printf( "\nWelcome to getcbench.exe\n\n" );   
      
    __int64 time1 = 0, time2 = 0, freq = 0; /* code added for timestamp */   
      
    if (argc != 2) {   
    fprintf(stderr, "Usage: %s source_file\n", argv[0]);   
    return -1;   
    }   
      
    printf( "Array ready, opening file %s\n", argv[1] );   
      
    source = fopen(argv[1], "rb");   
    if (!source) {   
    fprintf(stderr, "Could not open %s\n", argv[1]);   
    return -1;   
    }   
      
 QueryPerformanceCounter((LARGE_INTEGER *) &time1); /* clock is running */
 QueryPerformanceFrequency((LARGE_INTEGER *)&freq);
    printf("time1 = %llX freq = %lld \n", time1, freq);   
      
    while ((c = getc(source)) != EOF) {   
    keep[i++] = c;   
    if (i >= size) break;   
    }   
      
    QueryPerformanceCounter((LARGE_INTEGER *) &time2);   
    printf("time2 = %llX \n", time2);   
      
 printf("Read %d bytes in %010.6f seconds\n", i, (double)(time2-time1)/freq);
   }   
      
$ getcbench.exe D:\test.txt        # D: is capable of gigabytes per second speeds
      
   Welcome to getcbench.exe   
      
   Array ready, opening file D:test.txt   
   time1 = 3380876B31 freq = 10000000   
   time2 = 338D011DCC   
Read 1000000000 bytes in 020.930217 seconds   # Process Monitor shows that 4096-byte reads are being done
      
   $   
      
   ***************************************************************   
      
   This has additional gubbins.   
      
   https://en.cppreference.com/w/c/io/setvbuf   
      
   Add some code after the fopen.   
      
    if (setvbuf(source, NULL, _IOFBF, 65536) != 0)   
    {   
    fprintf(stderr, "setvbuf() failed\n\n" );   
    return -1;   
    }   
      
   Process Monitor shows the reads now happen in 65536 chunks.   
      
   But this does not do a thing for performance (with this style of I/O and no   
   optimization).   
      
   $ getcbenchbuf.exe D:\test.txt   
      
   Welcome to getcbenchbuf.exe   
      
   Array ready, opening file D:test.txt   
   time1 = 37192A7827 freq = 10000000   
   time2 = 37256FEAFA   
   Read 1000000000 bytes in 020.587797 seconds   
      
   ***************************************************************   
      
If I compile the original program with -O2, it is still doing
4096-byte reads, but the performance is much better.
      
   $ gcc -O2 -Wl,--stack,1200000000 -o getcbench.exe getcbench.c   
      
   $ getcbench.exe D:\\test2.txt   
      
   Welcome to getcbench.exe   
      
   Array ready, opening file D:\test2.txt   
   time1 = 3B4D7C1022 freq = 10000000   
   time2 = 3B4E5EB775   
   Read 1000000000 bytes in 001.485397 seconds   
      
   Busy sum = FFFFFFFFE216FE9C   
      
   Extra code was added so keep[] was not optimized away.   
      
 for (k = 0; k < size; k++) sum += keep[k];   /* sum is printed as the "Busy sum" above */