|    comp.ai    |    Awaiting the gospel from Sarah Connor    |    1,954 messages    |
|    Message 1,257 of 1,954    |
|    Daniel Pitts to All    |
|    Grouping similar datasets...    |
|    11 Dec 06 11:16:49    |
From: googlegroupie@coloraura.com

Hey, I'm trying to write a program that will go through my filesystem
and group together equal (or very similar) files. The exact match isn't
hard (using some sort of hash table), but it's a bit more difficult to
find "similar" files in a reasonable amount of time. (We're talking
> 200 gigabytes and > 60,000 individual files.) Right now, I'm not
concerned with the semantic value of the data (such as
image/video/text), although that might be the way to go in the future.

Part of the problem is defining similarity. I would think something
like "Most of the bytes in one file are in the same order as the bytes
in the other file." Or, something perhaps like "the ratio of
frequencies of values is similar between the two files". The problem
with these approaches is that the algorithms involved are polynomial:
O(n^2) to compare all files against each other, times some polynomial
for the size of the dataset. Ouch.

I don't know a whole lot about AI, but I've had a little exposure. Is
there perhaps a way to implement a self-organizing map which will help
cluster similar files together? I'd like to avoid many false
positives, although a few are okay.

Thanks in advance,
Daniel.

[ comp.ai is moderated ... your article may take a while to appear. ]

--- SoupGate-Win32 v1.05
 * Origin: you cannot sedate... all the things you hate (1:229/2)
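
A minimal sketch of two ideas from the post, assuming Python and only the standard library. It is not the poster's code: the function names (file_digest, group_exact_duplicates, byte_histogram, cosine_similarity), the choice of SHA-256, and the use of cosine similarity as the measure for "the ratio of frequencies of values is similar" are all assumptions made for illustration. The first part groups exact duplicates with a hash table; the second reduces each file to a fixed 256-value byte-frequency vector, so the per-file cost is paid once and each pairwise comparison is constant-time.

```python
import hashlib
import math
import os
from collections import Counter, defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def group_exact_duplicates(root):
    """Return lists of paths under root whose contents hash identically."""
    # Bucket by size first: a file with a unique size cannot have a twin,
    # so it never needs to be read or hashed.
    by_size = defaultdict(list)
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups


def byte_histogram(data):
    """Return the normalized frequency of each of the 256 byte values."""
    counts = Counter(data)  # iterating over bytes yields ints 0..255
    total = len(data) or 1  # avoid dividing by zero for empty input
    return [counts.get(b, 0) / total for b in range(256)]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Comparing every pair of histograms is still O(n^2) in the number of files (about 1.8 billion pairs for 60,000 files), but each comparison no longer depends on file size. A byte histogram ignores byte order, so unrelated files with similar byte statistics will score as close; it is best treated as a cheap first-pass filter. The same fixed-length vectors could also serve as input to a clustering method such as a self-organizing map.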
(c) 1994, bbs@darkrealms.ca