Forums before death by AOL, social media and spammers... "We can't have nice things"
|    comp.ai    |    Awaiting the gospel from Sarah Connor    |    1,954 messages    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
|    Message 376 of 1,954    |
|    gforman to All    |
|    what text-classification failures have y    |
|    16 Jul 04 23:43:18    |
From: george.forman@gmail.com

The published literature is full of success stories in text classification, but the failures are rarely published, if ever.

If you have found traditional methods failing to perform decently in text classification (supervised machine learning in the text domain or a related high-dimensional domain, such as bio-informatics), please share something about the failures.

I have run into several failures of common methods. The biggest was in feature selection for a large industrial multi-class problem. Information Gain, Mutual Information, Chi-Squared, etc. all failed to produce a decent selection of features. Investigating the failure, I found that some 'easy' classes (for which there were many good predictive features) were hogging all the features, and that other 'hard' classes got none or very few of the features they would need to discriminate. I call this the 'Siren Pitfall'. In the extreme, imagine that you are trying to classify email into folders, and just one of the folders contains German emails -- there will be a huge number of very predictive words for this folder, and IG etc. will each focus on these features to the exclusion of the other needed features. You may be thinking that this is a problem only on unusual datasets, but I carefully studied a well-balanced, homogeneous dataset and found it exhibited the same problem to some degree. So I think this problem is pretty common.

Also, when there are very few positives, I see Information Gain substantially weakening (I prefer Bi-Normal Separation for this case), and I've seen SVMs perform very poorly here compared with Naive Bayes. People quickly say -- haven't you varied the SVM's C parameter? Yes, and it doesn't help.

So, what other failures have you experienced or know about?
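To make the Siren Pitfall concrete: one way to keep easy classes from hogging the budget is to rank features separately per class and then pick them round-robin, one class at a time, instead of taking the top k of a single global ranking. A minimal Python sketch (function and variable names here are just illustrative, not anyone's production code):

```python
def round_robin_select(scores_per_class, k):
    """Pick k features by cycling through per-class rankings.

    scores_per_class: {class_label: [(feature, score), ...]}
    Each round takes the next best not-yet-chosen feature from each
    class's own ranking, so a class with many strong features cannot
    crowd out the features a 'hard' class needs.
    """
    rankings = {c: sorted(feats, key=lambda t: -t[1])
                for c, feats in scores_per_class.items()}
    iters = {c: iter(r) for c, r in rankings.items()}
    selected, seen = [], set()
    while len(selected) < k and iters:
        for c in list(iters):
            for feat, _score in iters[c]:
                if feat not in seen:          # skip features another class already claimed
                    seen.add(feat)
                    selected.append(feat)
                    break
            else:                             # ranking exhausted for this class
                del iters[c]
            if len(selected) >= k:
                break
    return selected

# Toy example: the 'easy' class has high scores everywhere, but the
# 'hard' class still gets every other pick.
scores = {"easy": [("a", 9.0), ("b", 8.0), ("c", 7.0)],
          "hard": [("x", 1.0), ("y", 0.5)]}
print(round_robin_select(scores, 4))  # ['a', 'x', 'b', 'y']
```

A global top-4 by score would have returned only the easy class's features ('a', 'b', 'c') plus one; the round-robin pass guarantees each class some representation.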
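For reference, Bi-Normal Separation scores a feature as the gap between the inverse standard normal CDF of its true-positive rate and of its false-positive rate, which keeps it from collapsing when positives are rare. A minimal sketch using only the Python standard library (the clipping constant 0.0005 is one common choice to avoid infinite values at rates of 0 or 1; treat the whole thing as illustrative):

```python
from statistics import NormalDist

_INV = NormalDist().inv_cdf  # inverse standard normal CDF, F^-1


def bns(tp, fp, pos, neg, eps=0.0005):
    """Bi-Normal Separation: |F^-1(tpr) - F^-1(fpr)|.

    tp/fp: positive/negative documents containing the feature;
    pos/neg: total positive/negative documents. Rates are clipped
    away from 0 and 1 because F^-1 diverges at the extremes.
    """
    tpr = min(max(tp / pos, eps), 1 - eps)
    fpr = min(max(fp / neg, eps), 1 - eps)
    return abs(_INV(tpr) - _INV(fpr))


# A feature present in most of the few positives but almost no negatives
# gets a large separation; one equally common in both scores near zero.
print(bns(tp=9, fp=1, pos=10, neg=1000))    # large: strongly predictive
print(bns(tp=5, fp=500, pos=10, neg=1000))  # ~0: uninformative
```

The intuition: Information Gain is dominated by the (huge) negative class when positives are scarce, while BNS measures how far apart the two occurrence rates sit on the normal curve, so a feature covering most of a tiny positive class still scores well.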
PS: Just to be clear, let's not count a few classification errors here and there as a failure, but rather when the error rate is terrible overall.

[ comp.ai is moderated. To submit, just post and be patient, or if ]
[ that fails mail your article to |
(c) 1994, bbs@darkrealms.ca