
Forums before death by AOL, social media and spammers... "We can't have nice things"

   comp.ai      Awaiting the gospel from Sarah Connor      1,954 messages   

[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]

   Message 886 of 1,954   
   Ted Dunning to All   
   Re: Algorithm/Theory help: Patterns, com   
   10 Jan 06 00:34:47   
   
   From: ted.dunning@gmail.com   
      
   The definition and computation of distances between strings is a heavily   
   worked problem.   
      
   To get a handle on the literature, look for Levenshtein distance (check   
   my spelling, too) and for substring matching.   
      
   In a nutshell, most reasonable string edit distance measures involve a   
   normalization step (such as expansion of "&" to "and" and elimination of   
   repeated spaces for English text, or conversion of a peptide sequence to   
   a nucleotide sequence for genomic work).   
      
   Then you find the minimum cost sequence of edits that are required to   
   convert one string to another (or to a substring of another).  To do   
   this, it is common to define a positive cost for each non-trivial edit   
   (deletion, transposition and so on).  The cost may depend on the   
   content as well (so deleting a space after another space might cost   
   nearly nothing ... this could make the normalization redundant).   
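   A minimal sketch of such a normalization step for English text (the   
   function name and the exact rules are illustrative choices, not a fixed   
   recipe):   

```python
import re

def normalize(s):
    """Normalize an English string before edit-distance comparison.
    Illustrative rules: lowercase, expand "&" to "and", and collapse
    repeated whitespace into single spaces."""
    s = s.lower().strip()
    s = s.replace("&", " and ")
    s = re.sub(r"\s+", " ", s)  # collapse runs of whitespace
    return s
```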
      
   The minimum cost can then be used as a distance measure between the   
   strings.  Moreover, if you define each edit as independent of all the   
   others in terms of computing cost, then you can use moderately   
   efficient algorithms (dynamic programming) to find the minimum cost and   
   the sequence of edits.   
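   The dynamic-programming computation described above can be sketched as   
   follows, assuming a unit cost for each edit (a common default; the name   
   and cost choices are illustrative):   

```python
def edit_distance(a, b):
    """Minimum-cost edit sequence via dynamic programming.
    Unit cost for insertion, deletion and substitution; runs in
    O(M x N) time and O(min row) space per iteration."""
    m, n = len(a), len(b)
    # prev[j] holds the distance between a[:i-1] and b[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(
                prev[j] + 1,                           # delete a[i-1]
                cur[j - 1] + 1,                        # insert b[j-1]
                prev[j - 1] + (a[i - 1] != b[j - 1]),  # substitute
            )
        prev = cur
    return prev[n]
```

   Tracking which of the three cases wins at each cell also recovers the   
   edit sequence itself, not just its cost.   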
      
   There is a natural inspiration for edit distance algorithms if you   
   think of random editing processes such as might occur in biological   
   evolution.  The edit distance is just the negative logarithm of the sum   
   of the likelihoods of the sequences of edits that produce the same result.   
      
   You can approximate the minimum edit cost form of distance in many   
   cases by looking at how many common substrings there are.  The cost of   
   computing edit distance is M x N where M and N are the lengths of the   
   strings in question, while the cost of most reasonable common substring   
   algorithms can be made M + N or even min(M, N) if one is much larger   
   than the other (as in genomics) or if the longer string is really a   
   great wad of shorter strings (as in comparing, say, names and   
   addresses).   
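   One cheap stand-in for common-substring counting is comparing the sets   
   of character n-grams, which runs in roughly M + N time; a sketch, with   
   the window size n=3 as an arbitrary choice:   

```python
def ngrams(s, n=3):
    """Set of all length-n character substrings of s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(a, b, n=3):
    """Fraction of character n-grams shared by the two strings
    (Jaccard overlap); a cheap proxy for edit-distance similarity."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)
```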
      
   Combining these two approaches is also possible.   
      
   Start with this.  Most of the literature consists of adding frills to   
   this basic starting point.   
      
   Also, to your question about a generic starting point, I would   
   recommend making the cost of deletions and insertions the same and   
   forgetting about transpositions and combinations for the moment.  You   
   may want to add transpositions later, possibly along with change (i.e.   
   delete + insert combo).  Combinations larger than this are probably   
   best handled by the search algorithm.   
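   If transpositions are added later, the same dynamic program extends   
   with one extra case per cell (the restricted Damerau-Levenshtein   
   variant); a sketch, again assuming unit costs:   

```python
def damerau_levenshtein(a, b):
    """Edit distance with adjacent-transposition edits added
    (restricted Damerau-Levenshtein), unit cost per edit."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                # swap of two adjacent characters
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]
```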
      
   [ comp.ai is moderated.  To submit, just post and be patient, or if ]   
   [ that fails mail your article to , and ]   
   [ ask your news administrator to fix the problems with your system. ]   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   



(c) 1994,  bbs@darkrealms.ca