
Forums before death by AOL, social media and spammers... "We can't have nice things"

   comp.ai      Awaiting the gospel from Sarah Connor      1,954 messages   


   Message 1,122 of 1,954   
   russell kym horsell to Ted Dunning   
   Re: Distance between two instances?   
   20 Jul 06 09:05:13   
   
   From: kym@ukato.freeshell.org   
      
   Ted Dunning  wrote:   
   [...]   
   > Remember the original question.  They stated they had discrete data.   
   > This often leads to problems with the naive application of the   
   > Euclidean metric as proposed here.
   [...]   
      
   It probably doesn't matter, because the problem is "hard" anyway.
      
      
   Here are some real-world examples of using utility functions (usually
   a "distance" of some type in an abstract N-space where the axes are
   nominal data rather than numbers anyway :). Have a giggle about
   the implications.
      
      
   (1).   
   Quite a few years ago some group tried to analyze some photos.   
   In the typical manner of the day, the photos were meant to be classified   
   as "containing a hidden military asset" or "not containing".   
      
   A set of photos containing said assets (partly hidden behind vegetation
   or cam netting, etc.) was data-mined for statistically significant features,
   and an "average" model in feature space extracted. A similar process
   was used to extract an "average" not-an-asset dataset.
   The idea was that a utility function -- based on
   some vector norm/distance type thing -- would decide whether any new photo
   (after relevant feature extraction) was "closer" to "has asset" than to
   "does not have asset".
   Surprisingly, the two sets could be separated by a hyperplane, and the
   automaton created was believed to be very reliable.
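   A minimal sketch of that scheme (features and numbers invented for
   illustration, not from the actual project): reduce each class to its mean
   feature vector, then label a new photo by the nearer centroid. If one
   nuisance feature -- say, overall brightness -- happens to differ between
   the training classes, it silently dominates the distance:

```python
import math

def centroid(vectors):
    """Component-wise mean: the 'average' model in feature space."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Entirely made-up feature vectors: [brightness, edge_density, texture]
with_asset    = [[0.90, 0.40, 0.70], [0.80, 0.50, 0.60], [0.95, 0.45, 0.65]]
without_asset = [[0.30, 0.50, 0.20], [0.20, 0.60, 0.30], [0.25, 0.55, 0.25]]

c_with, c_without = centroid(with_asset), centroid(without_asset)

def classify(features):
    return ("has asset"
            if euclidean(features, c_with) < euclidean(features, c_without)
            else "no asset")

# An overcast-day photo that DOES contain an asset: the brightness axis
# dominates the distance, so it lands on the wrong side of the boundary.
print(classify([0.30, 0.45, 0.60]))   # -> no asset
```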
      
   However, the customer brought along some new photos, none of which   
   could be correctly classified by said automaton.   
      
   It later turned out the training photos had an unusual feature.
   Most of the "with asset" photos had been taken on a particular sunny day;
   those "without asset" had been taken on overcast days.
      
      
   (2).   
   Quite a few years ago a research project tried to create an automaton   
   that could mark short-answer questions in economics. The idea was that   
   a training set of "model answers" would be data-mined, creating data points   
   on an N-dim feature space. New answers could then be feature-extracted and   
   matched against the model answers. Any new answer closer than a given
   distance to one of the models was then called a "pass"; the others were
   marked "fail".
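   The marking rule amounts to a nearest-neighbour threshold test. A toy
   version (model answers and threshold invented for illustration, not the
   project's actual features) shows why proximity alone can mark nonsense
   as correct:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Invented feature vectors for model answers, e.g. counts of key terms.
model_answers = [[3, 1, 0, 2], [2, 2, 1, 1], [0, 3, 2, 0]]
THRESHOLD = 1.5   # the "given distance"; value is arbitrary here

def mark(answer_features):
    nearest = min(euclidean(answer_features, m) for m in model_answers)
    return "pass" if nearest <= THRESHOLD else "fail"

# Keyword-stuffed nonsense can sit close to a model answer in feature
# space and pass, while a correct but unusually-worded answer fails.
print(mark([3, 1, 0, 1]))   # -> pass  (distance 1.0 from first model)
print(mark([9, 9, 9, 9]))   # -> fail
```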
      
   By this time I knew of the pitfalls of Arrow's theorem, and persisted in
   showing that even after about a dozen changes to the utility functions/distance
   metrics and feature sets involved, there were answers that were obviously
   not right but were marked "pass". After some mention of selling said
   answers to interested 3rd parties, the project was abandoned.
      
      
   (3).   
   A few years ago a certain company was developing radar processing s/w.   
   The idea behind airborne radar is to keep track of up to (say) a dozen
   targets at once. Each target is represented by a data point in about 10 dimensions.
   After each pass of the radar beam, each "new" target must be matched up
   against the "old" targets from the previous sweep. In the case I have in mind,
   a distance metric was used to determine the "goodness of fit", and
   the naive algorithm sought to minimise this using a questionable
   numerical method that was shoe-horned onto the primitive available hardware.
      
   Unfortunately, there was something wrong with the whole idea. :)   
   The symptom was that on some random occasion the minimum-distance
   metric mis-matched targets in the old and new sweeps, thereby transforming
   enemies into friendlies, and vice versa. At the time I posited the idea
   of using a stable-marriage type algorithm, which at least guaranteed something
   about performance. I was shouted down.
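   A toy contrast between the two approaches (1-D positions and the distance
   function are made up for illustration; this is not the actual radar code):
   greedy per-return nearest matching can assign two new returns to the same
   old track, while a Gale-Shapley stable matching guarantees a one-to-one
   assignment.

```python
def greedy_match(old, new, dist):
    """Each new return independently grabs its nearest old track.
    Two returns can claim the same track, leaving another unmatched."""
    return {j: min(range(len(old)), key=lambda i: dist(old[i], new[j]))
            for j in range(len(new))}

def stable_match(old, new, dist):
    """Gale-Shapley: new returns 'propose' to old tracks in order of
    increasing distance; each track keeps its closest proposer so far.
    Yields a one-to-one, stable assignment (assumes len(old) == len(new))."""
    n = len(old)
    prefs = {j: sorted(range(n), key=lambda i: dist(old[i], new[j]))
             for j in range(n)}
    rank = [{j: k for k, j in
             enumerate(sorted(range(n), key=lambda j: dist(old[i], new[j])))}
            for i in range(n)]
    engaged = {}                     # old track i -> new return j
    nxt = {j: 0 for j in range(n)}   # next track each return will propose to
    free = list(range(n))
    while free:
        j = free.pop()
        i = prefs[j][nxt[j]]
        nxt[j] += 1
        if i not in engaged:
            engaged[i] = j
        elif rank[i][j] < rank[i][engaged[i]]:
            free.append(engaged[i])  # track i trades up; old partner re-queued
            engaged[i] = j
        else:
            free.append(j)
    return {j: i for i, j in engaged.items()}

# 1-D toy sweep: two old tracks, two new returns drifting toward track 0.
old, new = [0.0, 10.0], [3.0, 4.0]
d = lambda a, b: abs(a - b)
print(greedy_match(old, new, d))   # {0: 0, 1: 0} -- both claim track 0
print(stable_match(old, new, d))   # {0: 0, 1: 1} -- one-to-one
```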
      
   Safety tip: it may not be safe to fly when fighters with certain radar
   equipment are in the air at the same time.
      
      
   (4).   
   A couple of years ago someone wanted to match up "buyers" with
   "sellers". Each buyer and seller had nominated a set of features,
   each on a scale of 1 through 5. The idea was to find all those matches
   that were better than "65%". Said customer was informed that "65%"
   was a bit of a fuzzy concept. The response was shock. Weren't we a professional   
   outfit? Hadn't we gone to kindergarten?   
      
   But given a small sample of prospective data, it was shown that
   suitably-chosen distance metrics -- all of them "obvious" in some sense --
   could order the data from closest to furthest in any way whatever.
   Which of the data points was "65% closer" was completely arbitrary.
   Perhaps they could just display a number of random data points and charge   
   the customer anyway -- it would be easier than burning more research budget.   
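   To see how arbitrary the ordering is, here is a tiny invented instance
   (one buyer, two sellers, three 1-to-5 features; nothing from the actual
   engagement). Two equally "obvious" metrics rank the same sellers in
   opposite order:

```python
import math

buyer = [5, 1, 3]                            # invented 1-to-5 ratings
sellers = {"A": [2, 1, 3], "B": [4, 3, 2]}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

# Same data, opposite rankings -- so "better than a 65% match" means
# nothing until someone commits to a particular metric.
print(sorted(sellers, key=lambda s: euclidean(buyer, sellers[s])))   # ['B', 'A']
print(sorted(sellers, key=lambda s: manhattan(buyer, sellers[s])))   # ['A', 'B']
```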
      
   [ comp.ai is moderated ... your article may take a while to appear. ]   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   



(c) 1994,  bbs@darkrealms.ca