From: ted.dunning@gmail.com   
      
   tim smith wrote:   
   > On Mon, 17 Jul 2006 01:45:32 GMT, "Dephased"    
   > wrote:   
   >   
   > >Hello everyone,   
   > >   
   > >I have a dataset of observations (about 100 000 observations). Each   
   > >observation gives me the state of 30 discrete variables at a given   
   > >time.   
   > >   
   > >I would like to know if there exists any "distance" that could tell me   
   > >"how far" an observation is from another? I am not trying to get the   
   > >distance between two variables but rather between two "vectors" which   
   > >are made of the observations of 30 different variables at a given time.   
   > >   
   > >I read up a bit on the subject but I must admit I am confused with all   
   > >the possible measures and what they achieve: chi square, euclidean,   
   > >mahalanobis...   
   > >   
   > >   
   > >Thanks in advance for the help you may give me !   
   > >   
   >   
   > I would use the "distance between two points" formula and treat the 30   
   > observations as dimensions.   
   >   
   > Just as in two dimensions dist=SQRT( (x2-x1)^2 + (y2-y1)^2 )
   > and three dimensions dist=SQRT( (x2-x1)^2 + (y2-y1)^2 + (z2-z1)^2 )   
   > you can extrapolate that out to 30 dimensions.   
   >   
   > You end up with a big formula. You can "normalize" the components of   
   > the vectors using ratios to map them into a range (1-100) so that each   
   > component of the vector will have equal weight. And then you could   
   > weight them by importance, for example.   
   >   
   > Hope that helps,   
   >   
   > Tim   
   >   
      
   Remember the original question: the poster said the data are discrete.
   Discrete data often cause problems when the Euclidean metric is
   applied naively, as proposed here.
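
To make the concern concrete, here is a minimal Python sketch. It is my own illustration, not something from the thread, and the colour variable and its integer codes are made up. When a discrete variable is nominal (labels with no order) and is coded with arbitrary integers, the Euclidean formula treats code 3 as farther from code 1 than code 2 is, even though nothing in the data says so. A simple count of mismatched positions does not depend on the coding.

```python
import math

# Hypothetical coding of a nominal variable: the integers are labels,
# not magnitudes. Any other assignment would be equally valid.
COLOUR = {"red": 1, "green": 2, "blue": 3}


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mismatches(a, b):
    """Number of positions where two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


red = [COLOUR["red"]]
green = [COLOUR["green"]]
blue = [COLOUR["blue"]]

# Euclidean distance depends on the arbitrary codes:
d_red_green = euclidean(red, green)
d_red_blue = euclidean(red, blue)

# The mismatch count treats every pair of different labels alike:
m_red_green = mismatches(red, green)
m_red_blue = mismatches(red, blue)
```

With these codes the Euclidean distance from red to blue is twice the distance from red to green, while the mismatch count is the same for both pairs. Whether a mismatch count is the right measure still depends on what the variables mean; if they are ordinal, the picture changes.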
      
   This metric may work, but it probably won't work nearly as well in any
   real application as something built with a bit more understanding of
   the problem. Looking at the univariate distributions, and at the
   bivariate distributions of possibly correlated variables, is a much
   better first step. The best second step is to stop and think a bit
   about what you actually know about the problem.
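
As a rough sketch of that first step, the following Python tabulates the values of one variable and the joint values of a pair of variables. The function names and the toy rows are assumptions for illustration only; the real data would have about 100 000 rows of 30 variables.

```python
from collections import Counter


def univariate_counts(rows, j):
    """Count how often each value of variable j occurs."""
    return Counter(row[j] for row in rows)


def bivariate_counts(rows, i, j):
    """Count how often each (value_i, value_j) pair occurs."""
    return Counter((row[i], row[j]) for row in rows)


# Toy observations (made up): three discrete variables per row.
rows = [
    (0, "a", 1),
    (0, "a", 1),
    (1, "b", 0),
    (1, "b", 1),
    (0, "a", 0),
]

uni = univariate_counts(rows, 0)
bi = bivariate_counts(rows, 0, 1)
```

In this toy data, variables 0 and 1 always move together (0 with "a", 1 with "b"), which shows up as empty off-diagonal cells in the pair counts. Strongly dependent variables like these would be counted twice by any distance that simply sums over all components.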
      
   [ comp.ai is moderated ... your article may take a while to appear. ]   
      