From: jjw@cs.sfu.ca   
      
   This message was originally posted on 2010/05/21 at 23:09 PDT.   
   It has not yet appeared on any of the newsgroups; numerous other messages in   
   this thread have   
   appeared since then.   
      
   When I tried to repost it just now, a message box appeared saying "Sending of   
   message failed. You   
   may only send to one newsgroup at a time."   
      
   Accordingly, I am now posting it to each group separately.   
      
      
   Dick Wesseling wrote:   
    > /*   
    > Here is the final version of my solution.   
    >   
    > Let: N number of cells, size of the input   
    > V number of values, size of symbol set.   
    > M the number of distinct values in the largest sequence.   
    > T the size of the largest n-in-n sequence, i.e. the   
    > largest sequence consisting of distinct values only.   
    >   
    > As explained in my previous message <4bf43d60$0$22939$c5fe704   
   @news6.xs4all.nl>   
    > the algorithm uses two different search strategies:   
    >   
    > Plan A   
    >   
    > Find the largest sequence with size M   
    >   
    > for n from M downwards search n-in-m sequences, keeping only sequences   
    > with minimal value of m.   
    > Print sequence if n in the next step).   
    >   
    > Now n=m=T.   
    >   
    > Plan B   
    > Prints all maximal n-in-n sequences.   
    >   
    > Plan A scans the input 1+(M-T) .. 2+(M-T) times, depending on the   
    > input position of the longest sequence.   
    > Plan B scans the input once.   
    >   
    > The running time is therefore bounded by O(N * (3+(M-T)) which   
    > is typically better than or equal to O(N*C).   
    > For small values of N the cost of erasing the frequency array should   
    > also be accounted for.   
    >   
    > Two arrays are used:   
    > found[N] Intermediate results of plan A.   
    > freq[C] Frequency count for all symbols in the sliding window   
    >   
    >   
    > Finding sequences is done using a straightforward sliding window   
    > scan of the input. The values in freq[] are updated when the window   
    > edges move and the numer of unique symbols is updated when a   
    > frequency count changes between zero and non-zero.   
    >   
    >   
    > The output is unsorted.   
   ...   
   When I did my original solution a few days ago, I didn't even consider a   
   moving window approach   
   because the OP had said that it's performance was terrible. He also mentioned   
   complexities such as   
   having to restart the scan multiple times, etc.   
      
   Seeing your sliding window approach inspired me to give that approach a try.   
      
   I have come up with a streamlined version of your algorithm that I believe   
   meets the OP's   
   specifications. All of us who have participated in discussing this problem   
   have been bandying about   
   terms such as unique, distinct, minimal, shortest, without making the intended   
   interpretation   
   completely clear. I had to do a bit of reading between the lines to come up   
   with what I believe are   
   the OP's intended specifications, which is why I only say that I believe my   
   solution meets his   
   specifications. I also inadvertently added to the confusion: by the time I   
   wrote my first solution,   
   I did not have the original post in front of me; I remembered that the OP had   
   referred to cells and   
   values so I used C for the number of cells and V for the number of values.   
   Only later did I discover   
   that he had used N and T for these quantities. I have renamed some of the   
   variables in my program to   
   make it consistent with the OP's notation.   
      
   Let me suggest the following definitions and nomenclature:   
      
   The data consist of a sequence of n cells stored in an array, x, with   
   subscripts running from 1 to n.   
      
   Each cell contains one of a set of t values which are integers between 1 and t   
   inclusive.   
      
   A subsequence of length l consists of l consecutive cells from the array x.   
      
   A k-subsequence is a subsequence of length l>=k which contains exactly k of   
   the t possible values.   
      
   A minimal k-subsequence is one for which the removal of the first or last cell   
   would reduce it to a   
   (k-1)-subsequence.   
      
   A shortest k-subsequence is a minimal k-subsequence whose length is less than   
   of equal to that of   
   any other minimal k-subsequence in the array.   
      
   A shortest possible k-subsequence is a k-subsequence of length k.   
      
   If a minimal k-subsequence is not shortest possible, it contains at least one   
   repeated value. Since   
   every minimal 1- or 2-subsequence is shortest possible shortest k-subsequences   
   are only of interest   
   for k>=3. The array may but need not contain a shortest possible k-subsequence   
   for 3<=k<=t.   
   If the array contains a shortest possible k-subsequence, it also contains   
   shortest possible   
   j-subsequences for 3<=j
|