Forums before death by AOL, social media and spammers... "We can't have nice things"
|    comp.ai    |    Awaiting the gospel from Sarah Connor    |    1,954 messages    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
|    Message 385 of 1,954    |
|    Eray Ozkural exa to gforman    |
|    Re: what text-classification failures ha    |
|    25 Jul 04 05:19:53    |
From: erayo@bilkent.edu.tr

george.forman@gmail.com (gforman) wrote in message news:<40f8681$1@news.unimelb.edu.au>...
> The published literature is full of success stories in text
> classification, but the failures are rarely published, if ever.
>
> If you have found traditional methods fail to perform decently in
> text classification (supervised machine learning in the text domain or
> a related high-dimensional domain, such as bio-informatics), please
> share something about the failures.

What is the language? What preprocessing do you apply?

What is the input space? (Is it term-frequency vectors, weighted by
IDF or similar?)

> I have run into several failures of common methods. The biggest was
> in feature selection for a large industrial multi-class problem.
> Information Gain, Mutual Information, Chi-Squared, etc. all failed to
> produce a decent selection of features. Investigating the failure, I
> found that some 'easy' classes (for which there were many good
> predictive features) were hogging all the features, and that other
> 'hard' classes got none or very few of the features they would
> need to discriminate. I call this the 'Siren Pitfall'. In the
> extreme, imagine that you are trying to classify email into folders,
> and just one of the folders contains German emails -- there will be a
> huge number of very predictive words for this folder, and IG etc. will
> each focus on these features, to the exclusion of the other needed
> features. You may be thinking that this is a problem only on unusual
> datasets, but I carefully studied a well-balanced, homogeneous dataset
> and found it exhibited the same problem to some degree. So, I think
> this problem is pretty common.
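[ Editor's note: the 'Siren Pitfall' above can be demonstrated, and one
natural mitigation sketched: score features per class (one-vs-rest
information gain) and pick them round-robin across classes, so an easy
class cannot hog the whole feature budget. This is a minimal sketch, not
gforman's actual method; the toy corpus, function names, and round-robin
policy are all illustrative assumptions. ]

```python
import math

def entropy(p):
    # Binary entropy in bits; zero at the degenerate probabilities.
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def info_gain(docs, labels, term, cls):
    # One-vs-rest information gain of the binary feature "term present"
    # for class `cls`. docs is a list of sets of terms.
    n = len(docs)
    gain = entropy(sum(y == cls for y in labels) / n)
    for present in (True, False):
        part = [y for d, y in zip(docs, labels) if (term in d) == present]
        if part:
            gain -= len(part) / n * entropy(sum(y == cls for y in part) / len(part))
    return gain

def round_robin_select(docs, labels, k):
    # Rank the vocabulary separately for each class, then take features
    # in turns so every class gets a share of the budget.
    classes = sorted(set(labels))
    vocab = sorted(set().union(*docs))
    k = min(k, len(vocab))
    ranked = {c: sorted(vocab, key=lambda t: info_gain(docs, labels, t, c),
                        reverse=True)
              for c in classes}
    selected, i = [], 0
    while len(selected) < k:
        for t in ranked[classes[i % len(classes)]]:
            if t not in selected:
                selected.append(t)
                break
        i += 1
    return selected

# Toy corpus: the 'de' folder has several exclusive, highly predictive
# words (the easy "siren" class); 'hr' and 'sales' share most vocabulary.
docs = [{"der", "die", "das"}, {"der", "und", "das"}, {"die", "und", "der"},
        {"pipeline", "quota"}, {"quota", "forecast"},
        {"pipeline", "benefits"}, {"forecast", "benefits"}]
labels = ["de", "de", "de", "sales", "sales", "hr", "hr"]
selected = round_robin_select(docs, labels, 3)
```

A global top-3 by best-over-classes gain would pick only German words
here; the round-robin pass instead yields one strong feature per class.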
I think this may be just a manifestation of poor information in our
representation (bag of words) rather than a failure of the feature
selection / dimensionality reduction methods themselves. However, there
may be better feature selection methods. Did you try PCA, etc.?

> Also, when there are very few positives, I see InfoGain substantially
> weakening (I prefer Bi-Normal Separation for this case), and I've seen
> SVMs perform very poorly here compared with Naive Bayes. People
> quickly say -- haven't you varied the SVM's C parameter? Yes, and it
> doesn't help.

Interesting; I would expect an SVM with a Gaussian kernel to perform
better than plain NBC (without boosting, etc.). But it is known that
the SVM isn't Bayes-optimal, so I suppose other classifiers (even NBC)
can surpass it. It's certainly not the ultimate classifier.

> So, what other failures have you experienced or know about?

When the sizes of the classes are not balanced in the training set and
there are several classes, the TF representation often introduces
strange biases, in my opinion. A colleague mentioned that when the
class balance differs between the training set and the test set,
accuracy drops significantly. I think this is because using the L2 norm
over TF vectors is not a very intelligent dissimilarity metric.
What do you think?

Best Regards,

--
Eray Ozkural

[ comp.ai is moderated. To submit, just post and be patient, or if ]
[ that fails mail your article to |
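[ Editor's note: the Bi-Normal Separation metric mentioned above scores
a term by |F^-1(tpr) - F^-1(fpr)|, where F^-1 is the inverse standard
normal CDF, tpr the fraction of positive documents containing the term,
and fpr the fraction of negatives. A minimal sketch using Python's
statistics.NormalDist; the clamping constant is an assumption to keep
the score finite, not part of the definition. ]

```python
from statistics import NormalDist

def bns(tp, fp, pos, neg, eps=0.0005):
    # Bi-Normal Separation: |F^-1(tpr) - F^-1(fpr)|, with F^-1 the
    # inverse standard normal CDF. Rates are clamped into (eps, 1 - eps)
    # because F^-1 diverges at 0 and 1 (eps is an assumed constant).
    inv = NormalDist().inv_cdf
    tpr = min(max(tp / pos, eps), 1 - eps)
    fpr = min(max(fp / neg, eps), 1 - eps)
    return abs(inv(tpr) - inv(fpr))

# With very few positives: a term in 9 of 10 positives but only 1 of
# 1000 negatives scores far above a common, uninformative term.
rare_predictive = bns(tp=9, fp=1, pos=10, neg=1000)
uninformative = bns(tp=5, fp=500, pos=10, neg=1000)
```

Because the normal quantile function stretches the tails, BNS keeps
rewarding terms whose fpr is tiny, which is one way to see why it holds
up better than InfoGain when positives are scarce.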
(c) 1994, bbs@darkrealms.ca