Forums before death by AOL, social media and spammers... "We can't have nice things"
|    comp.ai    |    Awaiting the gospel from Sarah Connor    |    1,954 messages    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
|    Message 525 of 1,954    |
|    Ted Dunning to All    |
|    Re: Effect of feature selection on gener    |
|    21 Dec 04 01:49:53    |
      From: tdunning@san.rr.com

      Presumably when you say bigram, you mean something smaller than a word
      rather than two words in a row. Otherwise, your thesis that bigrams
      make better features (as in more general and less isolated to
      particular instances) is flawed.

      You also make the claim that single-bit features generalize well in
      practice. This is a highly problem-specific claim. For instance, in
      text classification, single-bit classifiers are essentially useless, as
      are very long word strings. The most useful features are fairly rare
      words or word strings that still appear often enough to give some
      leverage.

      I think that the text classification problem provides valuable insight
      into this sort of feature selection problem, especially for the case
      where the overall probability of a relevant case is very low. In text
      classification, features that are much more common than the documents
      you are trying to find are inherently going to have very low utility
      because they give very poor precision. Features that are much more
      rare than relevant documents may be highly specific to a sub-class of
      relevant documents, but they will in themselves give poor recall.

      Another domain where there is a very low probability of a positive case
      is fraud detection. In fact, in pretty much any domain where you are
      trying to find a case of much higher economic value than the average,
      you will tend to have a very low rate of positives, if only because of
      the way that Zipf's law tends to occur in the presence of competition.

      It sounds to me like your experience is in problems which have a much
      more balanced mix of positive and negative cases.
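      [Editor's note: the precision/recall bound for single-term features under a
      low base rate can be sketched with a toy calculation. All counts below
      (collection size, document frequencies) are made-up illustrative numbers,
      not from the post.]

```python
# Treat "document contains term" as a one-feature classifier's positive
# decision, in a collection where relevant documents are very rare.

def single_term_pr(df_term, df_relevant, df_both):
    """df_term     - documents containing the term
    df_relevant   - relevant documents in the collection
    df_both       - relevant documents that contain the term"""
    precision = df_both / df_term
    recall = df_both / df_relevant
    return precision, recall

df_relevant = 100  # e.g. 100 relevant docs in a collection of 1,000,000

# A term far more common than the relevant set: even covering every
# relevant document, precision is capped at 100 / 50_000 = 0.2%.
print(single_term_pr(50_000, df_relevant, 100))   # → (0.002, 1.0)

# A term far rarer than the relevant set: precision can be perfect,
# but recall is capped at 10 / 100 = 10%.
print(single_term_pr(10, df_relevant, 10))        # → (1.0, 0.1)
```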
      In such problems, there are inherently very few features that are much
      more common than your positive (or negative) cases, but there are many
      features that are much more rare. As such, I think you are seeing only
      half the problem of feature selection.

      There are also a number of practically interesting problems of
      reasonably high dimension which exhibit something like rotational
      symmetry in feature space. In such cases, techniques like structural
      risk minimization become very interesting. In these cases, you don't
      have the luxury of picking just a few features. Instead, you have to
      reduce the power of your decision machine by other methods, such as a
      large decision margin.

      Through it all, I don't think that powerful techniques for this problem
      exist in general. Very general theories do exist, but they tend to
      give very loose bounds that can be improved substantially in practice.
      In specific domains, very powerful techniques are available, but in
      many domains these techniques are so heuristically based as to provide
      very little theoretical insight.

      To answer your specific final question:

      - Vapnik has addressed the general question of decision machine power.

      - The text retrieval community has used a wide variety of heuristic
      weighting methods based on overall term frequency. They then select
      terms based on queries or example documents. My Luduan system was an
      extreme example of this term selection approach (with very good
      results, I should add; details on request).

      - Ronan Collobert has some interesting recent work on applying
      margin techniques to committees of perceptrons (see
      http://www.idiap.ch/~collober/)
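      [Editor's note: the "reduce the power of your decision machine by a
      large decision margin" idea can be sketched with a minimal margin
      perceptron: it updates not only on mistakes but on any point classified
      with margin below a threshold, so capacity is limited without discarding
      features. The toy data, learning rate, and margin value below are
      illustrative assumptions, not from the post.]

```python
# Minimal margin-perceptron sketch: keep all features, but demand that
# every training point be classified with at least a fixed margin.

def margin_perceptron(points, labels, margin=1.0, epochs=50, lr=0.1):
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score < margin:   # update even on "barely correct" points
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

# Linearly separable toy data in 2-D; no feature is discarded.
pts = [(2, 2), (3, 1), (-2, -1), (-1, -3)]
ys = [1, 1, -1, -1]
w, b = margin_perceptron(pts, ys)
scores = [w[0] * x[0] + w[1] * x[1] + b for x in pts]
print(all(s * y > 0 for s, y in zip(scores, ys)))  # → True
```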
(c) 1994, bbs@darkrealms.ca