Forums before death by AOL, social media and spammers... "We can't have nice things"
|    comp.ai    |    Awaiting the gospel from Sarah Connor    |    1,954 messages    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
|    Message 524 of 1,954    |
|    yaroslavvb@gmail.com to All    |
|    Effect of feature selection on generalization    |
|    20 Dec 04 06:41:49    |
   
   We know that features that correspond to memorizing individual
   instances, i.e. words, make for bad features, whereas features that
   test something more "general", such as whether a particular bigram
   appears, make for good ones.
      
   We know that from experiments, but obviously we can't
   experimentally test every possible feature combination in every
   possible domain. The question is: can we say something more general
   about when one set of features of size k is better than another set
   of the same size?
      
   Here's how one can ask this question formally in a probabilistic
   framework.
   Let's restrict our attention to exponential families, since they are
   easy to work with and are dense in the space of densities.
      
   So   
      
   X: the space of all binary strings of length k
   F: the space of all feature functions f_i: X->{0,1}; there are 2^2^k of them
   P: the space of probability measures over X, of dimension 2^k - 1
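   Just as a sanity check on those counts (a throwaway snippet; k = 3 is an
   arbitrary example):

```python
k = 3
n_strings = 2 ** k            # |X|: binary strings of length k
n_features = 2 ** n_strings   # |F|: all functions X -> {0,1}, i.e. 2^(2^k)
dim_P = n_strings - 1         # probabilities over X are constrained to sum to 1
```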
      
   We choose a subset of F, i.e. a statistic. It defines an exponential
   family, and we then do ML fitting to the training data. The resulting
   model generalizes well if the test data has high expected likelihood.
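   As a concrete sketch of this procedure (hypothetical code; the
   bit-indicator statistic, the toy data and the gradient-ascent settings are
   my own choices, not anything standard):

```python
import itertools
import numpy as np

k = 3
X = list(itertools.product([0, 1], repeat=k))   # all 2^k binary strings

# A hypothetical choice of statistic: one indicator feature per bit
feats = [lambda x, i=i: x[i] for i in range(k)]

def fit_ml(data, feats, steps=2000, lr=0.5):
    """ML fit of the exponential family p(x) proportional to exp(theta . f(x)).

    The gradient of the average log-likelihood in theta_i is
    E_data[f_i] - E_model[f_i], so gradient ascent drives the model's
    feature expectations to match the empirical ones."""
    F = np.array([[f(x) for f in feats] for x in X], float)            # |X| x m
    emp = np.array([[f(x) for f in feats] for x in data], float).mean(axis=0)
    theta = np.zeros(len(feats))
    for _ in range(steps):
        p = np.exp(F @ theta)
        p /= p.sum()
        theta += lr * (emp - F.T @ p)
    p = np.exp(F @ theta)
    return theta, p / p.sum()

# Toy training data with empirical bit frequencies (0.75, 0.5, 0.25)
data = [(1, 0, 0), (1, 1, 0), (1, 0, 1), (0, 1, 0)]
theta, probs = fit_ml(data, feats)
model_means = [sum(p * x[i] for x, p in zip(X, probs)) for i in range(k)]
```

   Since the problem is convex, the fitted model's bit marginals end up
   matching the empirical frequencies.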
      
   With infinite data and resources, we could just choose a statistic
   which is sufficient for every density in P, i.e. some linearly
   independent subset of F of size 2^k. This defines an upper bound on
   the generalization power attainable for any model and any true
   probability density.
      
   However, in reality we have to limit ourselves to m features.
   Some subsets of that size work well, while others work poorly.
      
   For instance, we can take the first m functions of the form
   f_i(x) = {1 if the i'th bit of x is 1}.
   That's the feature basis for MaxEnt classifiers, and it generalizes
   well in practice.
      
   Alternatively, we can choose the first m functions of the form
   f_i(x) = {1 if x is the i'th binary string}.
   That corresponds to estimating probabilities by memorizing individual
   strings, and it generalizes poorly.
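   To make that contrast concrete, here's a hypothetical experiment along
   these lines (the generating distribution, sample sizes, seed and fitting
   schedule are all made-up choices): fit each m-feature statistic by ML on a
   small training sample, then compare held-out log-likelihood.

```python
import itertools
import numpy as np

k = 4
X = [tuple(s) for s in itertools.product([0, 1], repeat=k)]

def fit(F, train_idx, steps=3000, lr=0.5):
    # ML fit of p(x) proportional to exp(theta . f(x)) by gradient ascent;
    # the gradient in theta is E_data[f] - E_model[f]
    emp = F[train_idx].mean(axis=0)
    theta = np.zeros(F.shape[1])
    for _ in range(steps):
        p = np.exp(F @ theta)
        p /= p.sum()
        theta += lr * (emp - F.T @ p)
    p = np.exp(F @ theta)
    return p / p.sum()

rng = np.random.default_rng(0)
# "True" distribution: independent bits with P(bit = 1) = 0.7
true_p = np.array([np.prod([0.7 if b else 0.3 for b in x]) for x in X])
train = rng.choice(len(X), size=50, p=true_p)
test = rng.choice(len(X), size=200, p=true_p)

m = k   # same feature budget for both statistics
F_bits = np.array([[x[i] for i in range(m)] for x in X], float)  # bit indicators
F_memo = np.eye(len(X))[:, :m]                                   # string indicators

p_bits = fit(F_bits, train)
p_memo = fit(F_memo, train)
ll = lambda p: np.log(p[test]).mean()   # held-out log-likelihood per sample
```

   On this kind of setup the bit-indicator basis should come out ahead: with
   m = k it already pins down the true product distribution, while the
   string-indicator basis spends its whole budget on four individual strings
   and stays near-uniform everywhere else.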
      
   So the question is: given m, and some reasonable assumptions, what
   would be a good set of f_i to use? Can anyone recommend any
   techniques or related work?
      
   bulatov@cs.oregonstate.edu   
   Yaroslav Bulatov   
   Dearborn 101, Oregon State University   
   Corvallis, OR 97330   
      
   [ comp.ai is moderated. To submit, just post and be patient, or if ]   
   [ that fails mail your article to
|
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
(c) 1994, bbs@darkrealms.ca