
Forums before death by AOL, social media and spammers... "We can't have nice things"

   comp.ai      Awaiting the gospel from Sarah Connor      1,954 messages   


   Message 1,113 of 1,954   
   Ted Dunning to russell kym horsell   
   Re: Distance between two instances?   
   19 Jul 06 01:58:47   
   
   From: ted.dunning@gmail.com   
      
   russell kym horsell wrote:   
   > Dephased  wrote:   
   > > Hello everyone,   
   >   
   > > I have a dataset of observations (about 100 000 observations). Each   
   > > observation gives me the state of  30 discrete variables at a given   
   > > time.   
   >   
   > ...   
   >   
   >   
   > Sounds like another case of hittin yor hed up aginst Arrow's Theorem.   
   > There are some std fixes in "multivariate decision theory", but nothing   
   > is good. But if it were easy, we'd all be kicking back with a Bud.   
   >   
      
   While it is true that there isn't any globally best way to compare   
   instances like this, I don't think that Arrow's theorem applies (which   
   states that no voting system can meet all of a set of plausible   
   desiderata).   
      
   The reason that this theorem doesn't apply is that the original
   question is about whether there is a universally good metric that can
   be imposed on the observation space.  The answer is that there are
   many metrics, each of which has different characteristics, all of
   which satisfy the criteria for being a metric (non-negativity,
   symmetry, zero distance only between identical points, and the
   triangle inequality).  For different applications, different metrics
   are useful.
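For concreteness, the metric axioms can be spot-checked numerically. This is a minimal sketch of my own (not from the original post), using the Euclidean metric on random points:

```python
import itertools
import random

# Sketch: numerically spot-check the metric axioms for the Euclidean
# distance on a sample of random points.  Illustrative only.

def euclidean(a, b):
    """Euclidean distance between two equal-length point lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

random.seed(0)
pts = [[random.random() for _ in range(5)] for _ in range(20)]

for a, b, c in itertools.permutations(pts, 3):
    d_ab, d_ba = euclidean(a, b), euclidean(b, a)
    assert d_ab >= 0                                           # non-negativity
    assert abs(d_ab - d_ba) < 1e-12                            # symmetry
    assert d_ab <= euclidean(a, c) + euclidean(c, b) + 1e-12   # triangle inequality
for a in pts:
    assert euclidean(a, a) == 0                                # identity
print("all metric axioms hold on this sample")
```

Of course, a finite check like this is no proof; it just illustrates what the criteria mean in practice.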
      
   For example, in error correction applications, Hamming distance is   
   useful.  This is just a city block metric applied to binary data.   
   Euclidean metrics are not very useful for these applications.   
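The equivalence between Hamming distance and the city-block (L1) metric on binary data is easy to see in code. A small sketch, with function names of my own choosing:

```python
# Sketch: Hamming distance equals the city-block (L1) distance when
# the coordinates are restricted to 0/1.  Names are illustrative.

def hamming(a, b):
    """Number of positions at which the two sequences differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

def city_block(a, b):
    """L1 (Manhattan) distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

u = [1, 0, 1, 1, 0]
v = [0, 0, 1, 0, 1]
print(hamming(u, v))     # 3
print(city_block(u, v))  # 3 -- identical on binary data
```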
      
   For many data mining problems where there is plenty of data and a
   limited number of dimensions, the Euclidean metric or some variant of
   it, such as Euclidean distance on normalized coordinates or the
   Mahalanobis distance, might be useful.
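A sketch of these three variants in Python (the toy dataset and variable names are mine, purely for illustration):

```python
import numpy as np

# Sketch: plain Euclidean distance, Euclidean distance on normalized
# (z-scored) coordinates, and the Mahalanobis distance, which also
# accounts for correlations between variables.  Illustrative only.

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))   # toy dataset: 1000 observations, 3 variables
X[:, 1] *= 10                    # give one variable a much larger scale

a, b = X[0], X[1]

# Plain Euclidean: dominated by the large-scale coordinate.
d_euc = float(np.linalg.norm(a - b))

# Normalized Euclidean: divide each coordinate by its standard deviation.
s = X.std(axis=0)
d_norm = float(np.linalg.norm((a - b) / s))

# Mahalanobis: whiten the difference with the inverse covariance matrix.
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = a - b
d_mah = float(np.sqrt(diff @ cov_inv @ diff))
```

On nearly uncorrelated data like this, the Mahalanobis distance is close to the normalized Euclidean one; the two diverge when the variables are strongly correlated.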
      
   For many other data mining or machine learning problems, an entropic
   metric is much more useful, especially if the data is sparse.
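One common entropy-based choice is the Jensen-Shannon distance: the square root of the Jensen-Shannon divergence is a true metric, and it is well behaved on sparse discrete data because it never divides by a zero probability. A minimal sketch of my own (not from the post):

```python
import math

# Sketch: Jensen-Shannon distance between two discrete distributions,
# using base-2 logarithms so the value lies in [0, 1].  Illustrative only.

def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence of p and q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; zero-probability terms contribute 0.
        return sum(xi * math.log2(xi / yi) for xi, yi in zip(x, y) if xi > 0)

    return math.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

p = [0.5, 0.5, 0.0, 0.0]
q = [0.0, 0.0, 0.5, 0.5]
print(js_distance(p, q))  # 1.0 -- maximal for disjoint supports
```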
      
   So, the right advice to the original poster is that they should say a   
   bit more about where the data in question comes from, whether there are   
   strong correlations in the data, whether it is continuous or discrete   
   and so on.  Without that kind of information, it is highly unlikely   
   that there will be any good advice to be had.   
      
   [ comp.ai is moderated ... your article may take a while to appear. ]   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   



(c) 1994,  bbs@darkrealms.ca