From: ted.dunning@gmail.com   
      
   Daniel Pitts wrote:   
   > David Kinny wrote:   
   > > "Daniel Pitts" writes:   
   > >   
   > > > Hey, I'm trying to write a program that will go through my filesystem   
   > > > and group together equal (or very similar) files. The exact match
   > > > isn't hard (using some sort of hash table), but it's a bit
   > > > more difficult to find "similar" files in a reasonable amount of time.
   > > > (We're talking > 200 gigabytes and > 60,000 individual files). Right
   > > > now, I'm not concerned with semantic value of the data (such as   
   > > > image/video/text), although that might be the way to go in the future.   
   > >   
   > > > Part of the problem is defining similarity. I would think something   
   > > > like "Most of the bytes in one file are in the same order as the bytes   
   > > > in the other file." Or, something perhaps like "the ratio of   
   > > > frequencies of values is similar between the two files". The problem   
   > > > with these approaches is that the algorithms involved are polynomial:
   > > > O(n^2) to compare all files against each other, times some polynomial
   > > > for the size of the dataset. Ouch.   
   > >   
   > > > I don't know a whole lot about AI, but I've had a little exposure. Is   
   > > > there perhaps a way to implement a self organizing map which will help   
   > > > cluster similar files together? I'd like to avoid many false   
   > > > positives, although a few are okay.   
   > >   
   > > > Thanks in advance,
   > > > Daniel.   
   > >   
   > > You can get a good estimate of the difference between two text files
   > > by the size of the output from "diff", i.e. "diff ... | wc -c". You
   > > can even compare binary files with "diff -a", though this will be
   > > less satisfactory. You can avoid comparing quite dissimilar files by
   > > sorting all files by size and then comparing a file only with others
   > > of roughly similar size. Of course, this has nothing to do with AI.
   > >   
   > > HTH,   
   > > David   
   > >   
   >   
   > Thanks for the response.   
   > The size filter is something I had considered, but I'm also trying to   
   > find files that are possible prefixes of other files (such as an   
   > incomplete download), so that doesn't help a whole lot. Also, MOST of   
   > the files are not text, so diff isn't satisfactory at all. I'm thinking   
   > of using some sort of statistical approach, but my biggest concern is   
   > the O(n^2) for comparing all files to each other. If I could possibly
   > get that down to O(n log n), that would be a big help. My initial
   > version of this project used hashing, so the algorithm was on the order   
   > of O(n), but hashing doesn't work for "near matches" as far as I know.   
   > Unless I develop a hash function such that slight changes only change
   > the hash slightly, if at all. I'm considering the approach of making
   > the hash a 256-byte array, ordering the bytes by most frequent
   > occurrence. That way, files with similar statistical signatures will be
   > grouped together. The only downside to this is that "aaabbc" and
   > "cccaaaaabbbb" will hash the same, and they are definitely different.
   > (Also, I foresee a lot of English documents matching each other
   > incorrectly).
   >   
   > This is why I'd like to use an algorithm that will organize the files   
   > as they come in, with less than O(n) time per file.   
   >   
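   [Moderator's note: the frequency-signature hash sketched above might look
   like the following in Python. This is a hypothetical illustration, not
   code from the thread; `freq_signature` is an invented name.]

```python
from collections import Counter

def freq_signature(data: bytes) -> bytes:
    """Return the 256 byte values ordered from most to least frequent.

    Files with similar byte-frequency statistics get similar signatures.
    As noted above, this ignores ordering: "aaabbc" and "cccaaaaabbbb"
    produce the same signature.
    """
    counts = Counter(data)
    # Ties are broken by byte value so the signature is deterministic.
    ranked = sorted(range(256), key=lambda b: (-counts[b], b))
    return bytes(ranked)
```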
      
   Reduce all of your files to overlapping substrings (n-grams for n-long   
   substrings). Count the frequencies. Subtract the counts. The absolute   
   difference is a reasonably interesting distance metric.   
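   [Moderator's note: a minimal sketch of that n-gram distance in Python,
   added for illustration; the function names are invented, and for 60,000
   files you would still need an index or clustering step to avoid the
   O(n^2) pairwise comparisons discussed above.]

```python
from collections import Counter

def ngram_counts(data: bytes, n: int = 4) -> Counter:
    """Count the overlapping n-byte substrings (n-grams) of the data."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

def ngram_distance(a: bytes, b: bytes, n: int = 4) -> int:
    """Sum of absolute differences between the two files' n-gram counts.

    Identical files score 0; unrelated files score roughly the total
    number of n-grams in both.
    """
    ca, cb = ngram_counts(a, n), ngram_counts(b, n)
    return sum(abs(ca[g] - cb[g]) for g in ca.keys() | cb.keys())
```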
      
      
   Search the net for BLAST and homology also. This is a common problem   
   in genetics.   
      
      
   Search also for dotplot.   
      
      
   [ comp.ai is moderated ... your article may take a while to appear. ]   
      