
Forums before death by AOL, social media and spammers... "We can't have nice things"

   comp.ai      Awaiting the gospel from Sarah Connor      1,954 messages   

[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]

   Message 525 of 1,954   
   Ted Dunning to All   
   Re: Effect of feature selection on gener   
   21 Dec 04 01:49:53   
   
   From: tdunning@san.rr.com   
      
   Presumably when you say bigram, you mean something smaller than a word   
   rather than two words in a row.  Otherwise, your thesis that bigrams   
   make better features (as in more general and less isolated to   
   particular instances) is flawed.   
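   The two senses of "bigram" can be made concrete; a minimal sketch
   (helper names are hypothetical, not from the original post) contrasting
   sub-word character bigrams with two-words-in-a-row bigrams:

```python
# Sketch contrasting the two senses of "bigram" discussed above.
# Helper names are hypothetical illustrations, not from the post.

def char_bigrams(text):
    """Sub-word features: overlapping pairs of adjacent characters."""
    s = text.replace(" ", "_")
    return [s[i:i + 2] for i in range(len(s) - 1)]

def word_bigrams(text):
    """Two words in a row."""
    words = text.split()
    return [(words[i], words[i + 1]) for i in range(len(words) - 1)]

print(char_bigrams("the cat"))      # ['th', 'he', 'e_', '_c', 'ca', 'at']
print(word_bigrams("the cat sat"))  # [('the', 'cat'), ('cat', 'sat')]
```

   Character bigrams are much more general (few distinct values, each
   seen often), which is why the generalization claim only makes sense
   for the sub-word reading.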
      
   You also make the claim that single bit features generalize well in   
   practice.  This is a highly problem specific claim.  For instance, in   
   text classification, single bit classifiers are essentially useless, as   
   are very long word strings.  The most useful features are fairly rare   
   words or word strings that still appear often enough to give some   
   leverage.   
      
   I think that the text classification problem provides valuable insight   
   into this sort of feature selection problem, especially for the case   
   where the overall probability of a relevant case is very low.  In text   
   classification, features that are much more common than the documents   
   you are trying to find are inherently going to have very low utility   
   because they give very poor precision.  Features that are much more   
   rare than relevant documents may be highly specific to a sub-class of   
   relevant documents, but they will in themselves give poor recall.   
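   The precision and recall ceilings described above follow from simple
   probability; a toy illustration with invented numbers:

```python
# Toy illustration of the precision/recall ceilings described above.
# The probabilities are invented for illustration only.

p_relevant = 0.001  # 1 in 1000 documents is relevant

# A feature far more common than relevant documents: even if every
# relevant document contains it, precision is capped at p_rel / p_feat.
p_common_feature = 0.05
max_precision = min(1.0, p_relevant / p_common_feature)
print(f"precision ceiling for common feature: {max_precision:.3f}")  # 0.020

# A feature far rarer than relevant documents: even if it occurs only
# in relevant documents, recall is capped at p_feat / p_rel.
p_rare_feature = 0.0001
max_recall = min(1.0, p_rare_feature / p_relevant)
print(f"recall ceiling for rare feature: {max_recall:.3f}")  # 0.100
```

   This is why the most useful features sit near the frequency of the
   relevant class itself.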
      
   Another domain where there is a very low probability of a positive case   
   is fraud detection.  In fact, pretty much any domain where you are   
   trying to find a case of much higher economic value than the average,   
   you will tend to have a very low rate of positives, if only because of   
   the way that Zipf's law tends to occur in the presence of competition.   
      
   It sounds to me like your experience is in problems which have a much   
   more balanced mix of positive and negative cases.  In such problems,   
   there are inherently very few features that are much more common than   
   your positive (or negative) cases, but there are many features that are   
   much more rare.  As such, I think you are seeing only half the problem   
   of feature selection.   
      
   There are also a number of practically interesting problems of   
   reasonably high dimension which exhibit something like rotational   
   symmetry in feature space.  In such cases, techniques like structural   
   risk minimization become very interesting.  In these cases, you don't   
   have the luxury of picking just a few features.  Instead, you have to   
   reduce the power of your decision machine by other methods such as   
   large decision margin.   
      
   Through it all, I don't think that powerful techniques for this problem   
   exist in general.  Very general theories do exist, but they tend to   
   give very loose bounds that can be improved substantially in practice.   
   In specific domains, very powerful techniques are available, but in   
   many domains these techniques are so heuristically based as to provide   
   very little theoretical insight.   
      
   To answer your specific final question,   
      
   - Vapnik has addressed the general question of decision machine power   
      
   - the text retrieval community has used a wide variety of heuristic   
   weighting methods based on overall term frequency.  They then select   
   terms based on queries or example documents.   My Luduan system was an   
   extreme example of this term selection approach (with very good   
   results, I should add, details on request).   
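   The frequency-based weighting idea can be sketched generically; this
   is an IDF-style weight of the kind the text retrieval community uses,
   NOT the Luduan system, whose details the post only offers on request:

```python
import math
from collections import Counter

# Generic sketch of heuristic term weighting based on overall term
# frequency: rarer terms across the collection get larger weights.
# The example documents are invented for illustration.

docs = [
    "neural networks learn features",
    "rare words give leverage",
    "networks of words",
]

n_docs = len(docs)
doc_freq = Counter(term for d in docs for term in set(d.split()))

def idf(term):
    """Inverse document frequency: log(N / df), 0 for unseen terms."""
    return math.log(n_docs / doc_freq[term]) if term in doc_freq else 0.0

# "words" appears in 2 of 3 docs, "rare" in only 1, so "rare" weighs more.
print(idf("rare") > idf("words"))  # True
```

   Term selection then amounts to ranking candidate terms from queries
   or example documents by such a weight and keeping the top few.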
      
   - Ronan Collobert has some interesting recent work on applying   
   margin techniques to committees of perceptrons (see   
   http://www.idiap.ch/~collober/)   
      
   [ comp.ai is moderated.  To submit, just post and be patient, or if ]   
   [ that fails mail your article to , and ]   
   [ ask your news administrator to fix the problems with your system. ]   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   



(c) 1994,  bbs@darkrealms.ca