Forums before death by AOL, social media and spammers... "We can't have nice things"
|    comp.ai    |    Awaiting the gospel from Sarah Connor    |    1,954 messages    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
|    Message 886 of 1,954    |
|    Ted Dunning to All    |
|    Re: Algorithm/Theory help: Patterns, com    |
|    10 Jan 06 00:34:47    |
      From: ted.dunning@gmail.com

      The definition and computation of distances between strings is a
      heavily worked problem.

      To get a handle on the literature, look for Levenshtein distance
      (check my spelling, too) and for substring matching.

      In a nutshell, most reasonable string edit distance measures involve
      a normalization step (such as expansion of "&" to "and" and
      elimination of repeated spaces for English text, or conversion of a
      peptide sequence to a nucleotide sequence for genomic work).

      Then you find the minimum-cost sequence of edits required to convert
      one string to another (or to a substring of another). To do this, it
      is common to define a positive cost for each non-trivial edit
      (deletion, transposition and so on). The cost may depend on the
      content as well (so deleting a space after another space might cost
      nearly nothing ... this could make the normalization redundant).

      The minimum cost can then be used as a distance measure between the
      strings. Moreover, if you define the cost of each edit as independent
      of all the others, then you can use moderately efficient algorithms
      (dynamic programming) to find the minimum cost and the sequence of
      edits.

      There is a natural inspiration for edit distance algorithms if you
      think of random editing processes such as might occur in biological
      evolution. The edit distance is just the negative logarithm of the
      sum of the likelihoods of the sequences of edits that get the same
      answer.

      You can approximate the minimum edit cost form of distance in many
      cases by looking at how many common substrings there are.
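      A minimal sketch of the dynamic-programming idea described above, in
      Python, using the classic Levenshtein recurrence with unit costs for
      insert, delete and substitute (the function name and cost choices are
      illustrative, not from the original post):

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b (unit costs)."""
    m, n = len(a), len(b)
    # prev[j] holds the distance between the first i-1 chars of a and
    # the first j chars of b; only two rows of the DP table are kept.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # delete a[i-1]
                          curr[j - 1] + 1,     # insert b[j-1]
                          prev[j - 1] + cost)  # substitute (or match)
        prev = curr
    return prev[n]

print(edit_distance("kitten", "sitting"))  # prints 3
```

      The M x N table is where the M x N cost mentioned below comes from;
      content-dependent costs drop in by replacing the constant 1s with a
      cost function of the characters involved.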
      The cost of computing edit distance is M x N, where M and N are the
      lengths of the strings in question, while the cost of most reasonable
      common substring algorithms can be made M + N, or even min(M, N) if
      one string is much larger than the other (as in genomics) or if the
      longer string is really a great wad of shorter strings (as in
      comparing, say, names and addresses).

      Combining these two approaches is also possible.

      Start with this. Most of the literature consists of adding frills to
      this basic starting point.

      Also, to your question about a generic starting point, I would
      recommend making the cost of deletions and insertions the same and
      forgetting about transpositions and combinations for the moment. You
      may want to add transpositions later, possibly along with change
      (i.e. delete + insert combo). Combinations larger than this are
      probably best handled by the search algorithm.

      [ comp.ai is moderated. To submit, just post and be patient, or if ]
      [ that fails mail your article to |
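      The generic starting point recommended in the message (equal-cost
      insertions and deletions, no substitutions or transpositions) can be
      sketched with the same dynamic program; this "indel" variant is a
      hypothetical illustration, and its value equals M + N minus twice the
      length of the longest common subsequence:

```python
def indel_distance(a: str, b: str) -> int:
    """Edit distance using only unit-cost insertions and deletions,
    with no substitution or transposition operations."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1]           # match costs nothing
            else:
                curr[j] = 1 + min(prev[j],      # delete a[i-1]
                                  curr[j - 1])  # insert b[j-1]
        prev = curr
    return prev[n]

print(indel_distance("kitten", "sitting"))  # prints 5
```

      Under this costing a character change is a delete plus an insert
      (cost 2), which is why "kitten" to "sitting" is 5 here rather than
      the 3 given by unit-cost substitutions.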
(c) 1994, bbs@darkrealms.ca