
Forums before death by AOL, social media and spammers... "We can't have nice things"

   comp.ai      Awaiting the gospel from Sarah Connor      1,954 messages   

[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]

   Message 1,264 of 1,954   
   Milind Joshi to Ted Dunning   
   Re: Grouping similar datasets...   
   14 Dec 06 06:17:24   
   
   From: milind.a.joshi@gmail.com   
      
   Ted Dunning wrote:   
   > Daniel Pitts wrote:   
   > > David Kinny wrote:   
   > > > "Daniel Pitts"  writes:   
   > > >   
   > > > > Hey, I'm trying to write a program that will go through my filesystem
   > > > > and group together equal (or very similar) files.  The exact-match
   > > > > case isn't hard (using some sort of hash table), but it's a bit more
   > > > > difficult to find "similar" files in a reasonable amount of time.
   > > > > (We're talking > 200 gigabytes and > 60,000 individual files.)  Right
   > > > > now, I'm not concerned with semantic value of the data (such as
   > > > > image/video/text), although that might be the way to go in the future.
   > > >   
   > > > > Part of the problem is defining similarity.  I would think something   
   > > > > like "Most of the bytes in one file are in the same order as the bytes   
   > > > > in the other file."  Or, something perhaps like "the ratio of   
   > > > > frequencies of values is similar between the two files".  The problem   
   > > > > with these approaches is that the algorithms involved are polynomial:
   > > > > O(n^2) to compare all files against each other, times some polynomial
   > > > > for the size of the dataset.  Ouch.
   > > >   
   > > > > I don't know a whole lot about AI, but I've had a little exposure.  Is   
   > > > > there perhaps a way to implement a self organizing map which will help   
   > > > > cluster similar files together?  I'd like to avoid many false   
   > > > > positives, although a few are okay.   
   > > >   
   > > > > Thanks in advance,
   > > > > Daniel.   
   > > >   
   > > > You can get a good estimate of the difference between two text files
   > > > by the size of the output from "diff", i.e. "diff ... | wc -c".  You
   > > > can even compare binary files with "diff -a", though this will be
   > > > less satisfactory.  You can avoid comparing quite dissimilar files by
   > > > sorting all files by size and then comparing a file only with others
   > > > of roughly similar size.  Of course, this has nothing to do with AI.
   > > >   
   > > > HTH,   
   > > > David   
   > > >   
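David's two suggestions can be sketched in a few lines of Python; this is only an illustration, with `difflib` standing in for the external diff tool and the file names and sizes below invented for the example:

```python
import difflib

def diff_distance(a: bytes, b: bytes) -> int:
    """Approximate 'diff ... | wc -c': total size of the diff output."""
    diff = difflib.unified_diff(
        a.decode("latin-1").splitlines(),
        b.decode("latin-1").splitlines(),
        lineterm="",
    )
    return sum(len(line) for line in diff)

def candidate_pairs(sizes: dict, tolerance: float = 0.2):
    """David's pre-filter: sort by size, then only pair files whose
    sizes are within `tolerance` of each other."""
    ordered = sorted(sizes, key=sizes.get)
    for i, name in enumerate(ordered):
        for other in ordered[i + 1:]:
            if sizes[other] > sizes[name] * (1 + tolerance):
                break  # everything later is even bigger
            yield name, other
```

Identical files give a distance of 0, and the size pre-filter cuts the number of diffs run without ever loading file contents.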
   > >   
   > > Thanks for the response.   
   > > The size filter is something I had considered, but I'm also trying to
   > > find files that are possible prefixes of other files (such as an
   > > incomplete download), so that doesn't help a whole lot.  Also, MOST of
   > > the files are not text, so diff isn't satisfactory at all.  I'm thinking
   > > of using some sort of statistical approach, but my biggest concern is
   > > the O(n^2) cost of comparing all files against each other.  If I could
   > > possibly get that down to O(n log n), that would be a big help.  My
   > > initial version of this project used hashing, so the algorithm was on
   > > the order of O(n), but hashing doesn't work for "near matches" as far
   > > as I know, unless I develop a hash function such that slight changes
   > > only change the hash slightly, if at all.  I'm considering the approach
   > > of making the hash a 256-byte array, ordering the bytes by most
   > > frequent occurrence.  That way, files with similar statistical
   > > signatures will be grouped together.  The only downside is that
   > > "aaabbc" and "cccaaaaabbbb" will hash the same, and they are definitely
   > > different.  (I also foresee a lot of English documents matching each
   > > other incorrectly.)
   > >   
   > > This is why I'd like to use an algorithm that will organize the files   
   > > as they come in, with less than O(n) time per file.   
   > >   
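Daniel's byte-frequency signature, including the collision he points out, can be sketched like this (a minimal illustration, not his actual code):

```python
from collections import Counter

def freq_signature(data: bytes) -> bytes:
    """Daniel's proposed 'hash': all 256 byte values, ordered by
    descending frequency in the file (ties broken by byte value)."""
    counts = Counter(data)
    return bytes(sorted(range(256), key=lambda b: (-counts[b], b)))

# The collision he describes: different contents, same frequency order.
assert freq_signature(b"aaabbc") == freq_signature(b"cccaaaaabbbb")
```

Both inputs rank a > b > c, so the signatures are identical even though the files clearly differ.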
   >   
   > Reduce all of your files to overlapping substrings (n-grams for n-long   
   > substrings).  Count the frequencies. Subtract the counts.  The absolute   
   > difference is a reasonably interesting distance metric.   
   >   
   >   
   > Search the net for BLAST and homology also.  This is a common problem   
   > in genetics.   
   >   
   >   
   > Search also for dotplot.   
   >   
   >   
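Ted's n-gram metric is straightforward to sketch: count overlapping n-grams in each file, subtract the counts, and sum the absolute differences (an L1 distance). A minimal version, with n = 3 chosen arbitrarily:

```python
from collections import Counter

def ngram_distance(a: bytes, b: bytes, n: int = 3) -> int:
    """L1 distance between overlapping n-gram frequency counts."""
    ca = Counter(a[i:i + n] for i in range(len(a) - n + 1))
    cb = Counter(b[i:i + n] for i in range(len(b) - n + 1))
    return sum(abs(ca[g] - cb[g]) for g in ca.keys() | cb.keys())
```

Identical files score 0, and a single changed byte perturbs only the n grams that overlap it, so near-matches stay close.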
      
   I'd suggest multilevel filtering, instead of a single-step
   similarity algorithm.
      
   In the first step, you could just take the first "x" bytes, which could
   be a very short sequence, say 8 bytes.  This should be enough to separate
   files of different types, as each file type usually has a unique header.
   This way, you can even decide not to bother looking into
   "uninteresting" file types.
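This header filter is a one-liner per file. A sketch, operating on in-memory contents for brevity (real code would read just the first bytes of each file from disk; the sample entries are invented):

```python
from collections import defaultdict

def group_by_header(contents: dict, header_len: int = 8):
    """First-pass filter: bucket files by their leading bytes (magic
    numbers), so costlier comparisons stay within one bucket."""
    buckets = defaultdict(list)
    for name, data in contents.items():
        buckets[data[:header_len]].append(name)
    return buckets

files = {
    "a.png": b"\x89PNG\r\n\x1a\n" + b"x" * 10,
    "b.png": b"\x89PNG\r\n\x1a\n" + b"y" * 99,
    "c.gif": b"GIF89a\x00\x00" + b"tail",
}
```

Here the two PNGs land in one bucket and the GIF in another, so no cross-type comparison is ever attempted.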
      
   The other measures you used, such as size, are good, but a log-measure
   would work better than a direct count, as files of a similar order of
   magnitude are more likely to be similar than files that are not, and
   this is a fairly inexpensive operation.
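One way to get such a log-measure without even touching floating point, sketched as an assumption about what "log-measure" means here:

```python
def size_bucket(num_bytes: int) -> int:
    """Bucket by order of magnitude: bit_length() is floor(log2) + 1,
    so all sizes in [2**k, 2**(k+1)) share a bucket."""
    return num_bytes.bit_length()
```

A 100 KB file and a 120 KB file fall in the same bucket, while a 500-byte file does not, which is exactly the coarse grouping wanted before any expensive comparison.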
      
   Search engines use similarity measures based on distance metrics,
   which could be useful to you too.
      
   Of course, how you implement any approach is just as important as the   
   algorithm itself, so you'll surely benefit from using good data   
   structure representations for file lists, index information, and so on.   
      
   All your other text-type measures can come after these initial   
   filtering steps, so you're doing those expensive operations only on   
   files you really care about.   
      
   Hope this is useful! :-)   
      
   Regards,   
   Milind   
      
   [ comp.ai is moderated ... your article may take a while to appear. ]   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   



(c) 1994,  bbs@darkrealms.ca