   comp.ai      Awaiting the gospel from Sarah Connor      1,954 messages   

   Message 1,257 of 1,954   
   Daniel Pitts to All   
   Grouping similar datasets...   
   11 Dec 06 11:16:49   
   
   From: googlegroupie@coloraura.com   
      
   Hey, I'm trying to write a program that will go through my filesystem   
   and group together equal (or very similar) files.  Exact matches
   aren't hard (using some sort of hash table), but it's a bit
   more difficult to find "similar" files in a reasonable amount of time.
   (We're talking > 200 gigabytes and > 60,000 individual files.)  Right
   now, I'm not concerned with semantic value of the data (such as   
   image/video/text), although that might be the way to go in the future.   
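
   For reference, the exact-match case can be sketched roughly as below.
   This is only a minimal sketch: the function names, the choice of
   SHA-256, and the size-first bucketing are assumptions for illustration,
   not a finished tool.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files never sit fully in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def group_exact_duplicates(root):
    """Return lists of paths under `root` whose contents hash identically."""
    # Bucket by size first: files of different sizes cannot be equal,
    # so most files never need to be read at all.
    by_size = defaultdict(list)
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        by_hash = defaultdict(list)
        for p in paths:
            by_hash[file_digest(p)].append(p)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

   Each file is read at most once, so the cost is roughly linear in the
   total data size rather than quadratic in the number of files.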
      
   Part of the problem is defining similarity.  I would think something   
   like "Most of the bytes in one file are in the same order as the bytes   
   in the other file."  Or, something perhaps like "the ratio of   
   frequencies of values is similar between the two files".  The problem   
   with these approaches is that the algorithms involved are polynomial.
   O(n^2) to compare all files against each other, times some polynomial
   for the size of the dataset. Ouch.   
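
   To make the second definition concrete, here is a rough sketch of the
   "ratio of frequencies" idea: each file is reduced once to a 256-entry
   byte-frequency vector, and two vectors are compared with cosine
   similarity.  The names byte_histogram and cosine_similarity, and the
   choice of cosine as the measure, are assumptions for illustration only.

```python
import math
from collections import Counter


def byte_histogram(data: bytes):
    """Relative frequency of each byte value 0..255 in `data`."""
    total = len(data)
    if total == 0:
        return [0.0] * 256
    counts = Counter(data)  # iterating bytes yields ints 0..255
    return [counts.get(b, 0) / total for b in range(256)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

   Building the vector is linear in the file size, and each pairwise
   comparison then costs a fixed 256 multiplications, but the number of
   pairs is still O(n^2).  The measure also ignores byte order entirely,
   so b"aabb" and b"abab" come out as identical, which is one source of
   false positives.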
      
   I don't know a whole lot about AI, but I've had a little exposure.  Is   
   there perhaps a way to implement a self organizing map which will help   
   cluster similar files together?  I'd like to avoid many false   
   positives, although a few are okay.   
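
   As a rough idea of what I mean by a self organizing map, here is a
   minimal sketch.  It assumes each file has already been reduced to a
   fixed-length feature vector (for example, 256 byte frequencies).  The
   grid size, learning rate, linear decay schedule, and the names
   train_som and best_matching_unit are all assumptions, not tuned values.

```python
import math
import random


def best_matching_unit(grid, v):
    """Grid cell whose weight vector is closest (squared Euclidean) to v."""
    return min(
        grid,
        key=lambda cell: sum((w - x) ** 2 for w, x in zip(grid[cell], v)),
    )


def train_som(vectors, rows=4, cols=4, epochs=50, lr0=0.5, seed=0):
    """Train a small rectangular SOM; returns {(row, col): weight_vector}."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    grid = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                   # learning rate decays
        sigma = max(sigma0 * (1.0 - frac), 0.5)   # neighbourhood shrinks
        for v in rng.sample(vectors, len(vectors)):  # shuffled pass
            bmu = best_matching_unit(grid, v)
            for cell, w in grid.items():
                d2 = (cell[0] - bmu[0]) ** 2 + (cell[1] - bmu[1]) ** 2
                h = math.exp(-d2 / (2.0 * sigma * sigma))
                # Pull this cell's weights toward v, scaled by grid distance.
                for i in range(dim):
                    w[i] += lr * h * (v[i] - w[i])
    return grid
```

   After training, files whose vectors land on the same or neighbouring
   cells would be the candidates for a closer pairwise comparison, so the
   expensive check only runs within small groups.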
      
   Thanks in advance,   
   Daniel.   
      
   [ comp.ai is moderated ... your article may take a while to appear. ]   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   
