
   comp.lang.c      Meh, in C you gotta define EVERYTHING      243,242 messages   


   Message 242,356 of 243,242   
   Waldek Hebisch to Michael Sanders   
   Re: is_binary_file()   
   07 Dec 25 03:43:58   
   
   From: antispam@fricas.org   
      
   Michael Sanders  wrote:   
   > Am I close? Missing anything you'd consider to be (or not) needed?   
      
   You are missing a definition: you should first decide what you
   consider to be a binary file (this is the hard part).  You may wish
   to consider my experience from many years ago: I looked at problem
   reports about SunOS.  Those were considered text files, about
   160 MB in total.  For my purposes it would have been convenient to
   find a character code _not_ appearing in those files.  But checking
   showed that the only code which did not appear was 0.  The reports
   were mostly in English, but there were non-English pieces
   contributing international characters.  There were a handful of
   box-drawing characters.  There were (I think stray) control codes.
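
   The kind of survey described above can be sketched as a byte
   histogram.  This is only an illustration, not the tool used at the
   time; the names count_bytes and absent_values are hypothetical, and
   the caller is assumed to feed in the file contents chunk by chunk.

```c
#include <assert.h>
#include <stddef.h>

/* Add the bytes of one chunk to a running histogram.
   counts[] must have 256 entries and start out zeroed. */
void count_bytes(const unsigned char *buf, size_t len,
                 unsigned long counts[256])
{
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++)
        counts[buf[i]]++;
}

/* Collect the byte values (0..255) that never occurred.
   Stores them in absent[] in ascending order and returns how many. */
int absent_values(const unsigned long counts[256],
                  unsigned char absent[256])
{
    int n = 0;
    for (int v = 0; v < 256; v++)
        if (counts[v] == 0)
            absent[n++] = (unsigned char)v;
    return n;
}
```

   Calling count_bytes() on every chunk of every file and then
   absent_values() once at the end lists the codes that appear nowhere
   in the sample.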
      
   You can take from this that a zero code is a strong indicator of a
   non-text file.  But do you consider UTF-16 encoded text binary?
   Note that such text is likely to contain a lot of zero bytes.
   Any byte other than zero will appear in a file considered by
   its author to be a text file, as long as you take a large enough
   sample.
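
   A minimal sketch of that heuristic, under the assumption that
   UTF-16 text should be reported separately rather than as binary.
   The enum and the name classify_sample are hypothetical.  UTF-16 is
   recognised only by a leading byte order mark, so UTF-16 without a
   BOM still lands in the binary bucket, and a UTF-32LE BOM (which
   also starts with FF FE) is mistaken for UTF-16.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum sample_kind { SAMPLE_TEXT, SAMPLE_UTF16, SAMPLE_BINARY };

/* Classify a sample taken from the start of a file. */
enum sample_kind classify_sample(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    /* A UTF-16 byte order mark: FF FE (little endian) or FE FF (big). */
    if (len >= 2 &&
        ((buf[0] == 0xFF && buf[1] == 0xFE) ||
         (buf[0] == 0xFE && buf[1] == 0xFF)))
        return SAMPLE_UTF16;

    /* Any zero byte is taken as a strong hint of non-text data. */
    if (len > 0 && memchr(buf, 0, len) != NULL)
        return SAMPLE_BINARY;

    return SAMPLE_TEXT;
}
```

   An empty sample is reported as text here; whether that is the right
   answer is again a matter of definition.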
      
   If you have a few hundred characters from a file you can apply
   a reasonably simple statistical test to decide whether the text came
   from one of the popular human languages and, if so, the test will
   tell you which language.
      
   For security purposes you may wish to check whether a file contains
   only safe codes.  But the definition of "safe" depends on the
   application.  In a US context you could decide that anything outside
   printable ASCII + newline is unsafe.  Or you may add to this some
   selected control codes like tabs.  In an international context you
   probably need to allow the relevant national character codes, which
   depend on the specific environment.
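
   The strict US-context rule could look roughly like the sketch
   below.  The name only_safe_codes and the allow_tab switch are
   hypothetical.  The range is spelled out as 0x20..0x7E instead of
   calling isprint(), so the result does not depend on the current
   locale; an ASCII execution character set is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Return true if every byte is printable ASCII (0x20..0x7E) or a
   newline; tabs are also accepted when allow_tab is true. */
bool only_safe_codes(const unsigned char *buf, size_t len, bool allow_tab)
{
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        unsigned char c = buf[i];
        if (c >= 0x20 && c <= 0x7E)
            continue;               /* printable ASCII, space included */
        if (c == '\n')
            continue;               /* newline */
        if (allow_tab && c == '\t')
            continue;               /* optional extra control code */
        return false;               /* anything else is unsafe */
    }
    return true;
}
```

   An international variant would need a wider table of allowed codes
   chosen for the environment at hand.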
      
   --   
                                 Waldek Hebisch   
      
