From: googlegroupie@coloraura.com   
      
   David Kinny wrote:   
   > "Daniel Pitts" writes:   
   >   
   > > Hey, I'm trying to write a program that will go through my filesystem   
   > > and group together equal (or very similar) files. While it isn't hard
   > > for an exact match (using some sort of hash table), it's a bit
   > > more difficult to find "similar" files in a reasonable amount of time.
   > > (We're talking > 200gigabytes and >60,000 individual files). Right   
   > > now, I'm not concerned with semantic value of the data (such as   
   > > image/video/text), although that might be the way to go in the future.   
   >   
   > > Part of the problem is defining similarity. I would think something   
   > > like "Most of the bytes in one file are in the same order as the bytes   
   > > in the other file." Or, something perhaps like "the ratio of   
   > > frequencies of values is similar between the two files". The problem   
   > > with these approaches is that the algorithms involved are polynomial:
   > > O(n^2) to compare all files against each other, times some polynomial
   > > in the size of the dataset. Ouch.
   >   
   > > I don't know a whole lot about AI, but I've had a little exposure. Is   
   > > there perhaps a way to implement a self organizing map which will help   
   > > cluster similar files together? I'd like to avoid many false   
   > > positives, although a few are okay.   
   >   
   > > Thanks in advance,
   > > Daniel.   
   >   
   > You can get a good estimate of the difference between two text files   
   > by the size of the output from "diff", i.e. "diff ... | wc -c". You
   > can even compare binary files with "diff -a", though this will be   
   > less satisfactory. You can avoid comparing quite dissimilar files by   
   > sorting all files by size and then comparing a file only with others   
   > of roughly similar size. Of course, this has nothing to do with AI.
   >   
   > HTH,   
   > David   
   >   
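
   For reference, the size-sort prefilter David suggests could be sketched
   roughly like this (the function name and tolerance value are mine, purely
   illustrative):

```python
import os

def candidate_pairs(paths, tolerance=0.1):
    """Yield pairs of files whose sizes differ by at most `tolerance`
    (as a fraction of the smaller size). Because the list is sorted by
    size, each file is only compared against its near neighbours."""
    sized = sorted((os.path.getsize(p), p) for p in paths)
    for i, (size_i, path_i) in enumerate(sized):
        for size_j, path_j in sized[i + 1:]:
            if size_j > size_i * (1 + tolerance):
                break  # sizes are sorted, so no later file can qualify
            yield path_i, path_j
```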
      
   Thanks for the response.   
   The size filter is something I had considered, but I'm also trying to   
   find files that are possible prefixes of other files (such as an   
   incomplete download), so that doesn't help a whole lot. Also, MOST of   
   the files are not text, so diff isn't satisfactory at all. I'm thinking   
   of using some sort of statistical approach, but my biggest concern is   
   the O(n^2) cost of comparing every file against every other. If I could
   possibly get that down to O(n log n), that would be a big help. My initial
   version of this project used hashing, so the algorithm was on the order   
   of O(n), but hashing doesn't work for "near matches" as far as I know.   
   Unless I develop a hash function such that slight changes only change
   the hash slightly, if at all. I'm considering the approach of making
   the hash a 256-byte array, ordering the byte values from most to least
   frequent. That way, files with similar statistical signatures will be
   grouped together. The only downside to this is that "aaabbc" and
   "cccaaaaabbbb" will hash the same, and they are definitely different.
   (Also, I foresee a lot of English documents matching each other
   incorrectly.)
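
   The frequency-ordering hash I have in mind would look something like
   this (a rough sketch; the function name and tie-breaking rule are mine):

```python
from collections import Counter

def frequency_signature(data: bytes) -> bytes:
    """Order all 256 byte values by how often they occur in `data`,
    most frequent first (ties broken by byte value, for determinism).
    Files with similar byte-frequency statistics get similar signatures."""
    counts = Counter(data)
    ranked = sorted(range(256), key=lambda b: (-counts[b], b))
    return bytes(ranked)
```

   Note that this reproduces exactly the collision above: "aaabbc" and
   "cccaaaaabbbb" both rank a before b before c, so their signatures are
   identical.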
      
   This is why I'd like to use an algorithm that will organize the files   
   as they come in, with less than O(n) time per file.   
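
   One way that incremental idea might look: bucket each arriving file by a
   short prefix of its byte-frequency signature, so inserting a file costs
   roughly O(1) and candidate matches fall out of the same bucket. This is
   only a hedged sketch (class and parameter names are mine), not a worked-out
   design:

```python
from collections import Counter, defaultdict

def signature_prefix(data: bytes, k: int = 4) -> bytes:
    """The k most frequent byte values in `data`, most frequent first."""
    counts = Counter(data)
    ranked = sorted(range(256), key=lambda b: (-counts[b], b))
    return bytes(ranked[:k])

class IncrementalGrouper:
    """Group files as they arrive: files sharing a signature prefix land
    in the same bucket, at roughly constant cost per inserted file."""
    def __init__(self, k: int = 4):
        self.k = k
        self.buckets = defaultdict(list)

    def add(self, name: str, data: bytes) -> list:
        bucket = self.buckets[signature_prefix(data, self.k)]
        bucket.append(name)
        return bucket  # files already here are candidate near-matches
```

   A pairwise comparison would then only run within a bucket, not across
   the whole collection.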
      
   [ comp.ai is moderated ... your article may take a while to appear. ]   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   