
Forums before death by AOL, social media and spammers... "We can't have nice things"

   comp.ai      Awaiting the gospel from Sarah Connor      1,954 messages   

[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]

   Message 1,256 of 1,954   
   David Kinny to Daniel Pitts   
   Re: Grouping similar datasets...   
   11 Dec 06 11:29:40   
   
   From: dnk@OMIT.csse.unimelb.edu.au   
      
   "Daniel Pitts"  writes:   
      
   > Hey, I'm trying to write a program that will go through my filesystem
   > and group together equal (or very similar) files.  The exact match
   > isn't hard (using some sort of hash table), but it's a bit more
   > difficult to find "similar" files in a reasonable amount of time.
   > (We're talking > 200 gigabytes and > 60,000 individual files.)  Right
   > now, I'm not concerned with the semantic value of the data (such as
   > image/video/text), although that might be the way to go in the future.
      
   > Part of the problem is defining similarity.  I would think something
   > like "most of the bytes in one file are in the same order as the bytes
   > in the other file", or perhaps something like "the ratio of
   > frequencies of values is similar between the two files".  The problem
   > with these approaches is that the algorithms involved are polynomial:
   > O(n^2) to compare all files against each other, times some polynomial
   > in the size of the dataset.  Ouch.
      
   > I don't know a whole lot about AI, but I've had a little exposure.  Is   
   > there perhaps a way to implement a self organizing map which will help   
   > cluster similar files together?  I'd like to avoid many false   
   > positives, although a few are okay.   
      
   > Thanks in advance,
   > Daniel.   
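
   For the exact-match pass Daniel mentions, one possible hashing approach
   is sketched below (the function names and the file-size pre-filter are
   illustrative assumptions, not something from the thread):

   ```python
   # Minimal sketch: group byte-identical files by content hash, hashing only
   # files that share a size (a file with a unique size cannot have a duplicate).
   import hashlib
   import os
   from collections import defaultdict

   def sha256_of(path, chunk_size=1 << 16):
       """Hash a file in chunks so large files need not fit in memory."""
       h = hashlib.sha256()
       with open(path, "rb") as f:
           for chunk in iter(lambda: f.read(chunk_size), b""):
               h.update(chunk)
       return h.hexdigest()

   def group_identical(root):
       """Return lists of paths whose contents are byte-for-byte equal."""
       by_size = defaultdict(list)
       for dirpath, _, names in os.walk(root):
           for name in names:
               path = os.path.join(dirpath, name)
               by_size[os.path.getsize(path)].append(path)
       by_hash = defaultdict(list)
       for paths in by_size.values():
           if len(paths) < 2:          # unique size: skip the hash entirely
               continue
           for path in paths:
               by_hash[sha256_of(path)].append(path)
       return [g for g in by_hash.values() if len(g) > 1]
   ```

   The size pre-filter keeps the 60,000-file scan cheap: only files whose
   sizes collide are ever read in full.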
      
   You can get a good estimate of the difference between two text files
   from the size of the output of "diff", i.e. "diff ... | wc -c".  You
   can even compare binary files with "diff -a", though this will be
   less satisfactory.  You can avoid comparing quite dissimilar files by
   sorting all files by size and then comparing a file only with others
   of roughly similar size.  Of course, this has nothing to do with AI.
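
   David's two ideas, scoring similarity by the size of a diff and pruning
   comparisons by sorting on file size, can be sketched in Python.  Here
   difflib stands in for the external "diff ... | wc -c" pipe, and the
   tolerance/threshold values are illustrative assumptions:

   ```python
   # Sketch of the reply's heuristic: sort by size, compare each file only
   # against neighbours of roughly similar size, and score a pair by the
   # byte count of their unified diff (smaller output = more similar).
   import difflib

   def diff_size(a_lines, b_lines):
       """Bytes of unified-diff output between two line lists."""
       return sum(len(line) for line in difflib.unified_diff(a_lines, b_lines))

   def similar_pairs(files, size_tolerance=0.1, max_diff=64):
       """files: list of (name, lines) pairs; returns names judged similar."""
       ranked = sorted(files, key=lambda f: sum(len(l) for l in f[1]))
       pairs = []
       for i, (name_a, a) in enumerate(ranked):
           size_a = sum(len(l) for l in a)
           for name_b, b in ranked[i + 1:]:
               size_b = sum(len(l) for l in b)
               if size_b > size_a * (1 + size_tolerance):
                   break                  # sorted, so no later file can qualify
               if diff_size(a, b) <= max_diff:
                   pairs.append((name_a, name_b))
       return pairs
   ```

   The inner "break" is what turns the O(n^2) all-pairs comparison into a
   scan over near-equal-size neighbours, which is the whole point of the
   size sort.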
      
   HTH,   
   David   
      
   [ comp.ai is moderated ... your article may take a while to appear. ]   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   



(c) 1994,  bbs@darkrealms.ca