Forums before death by AOL, social media and spammers... "We can't have nice things"
|    comp.ai    |    Awaiting the gospel from Sarah Connor    |    1,954 messages    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
|    Message 385 of 1,954    |
|    Eray Ozkural exa to gforman    |
|    Re: what text-classification failures ha    |
|    25 Jul 04 05:19:53    |
From: erayo@bilkent.edu.tr

george.forman@gmail.com (gforman) wrote in message news:<40f8681$1@news.unimelb.edu.au>...
> The published literature is full of success stories in text
> classification, but the failures are rarely published, if ever.
>
> If you have found traditional methods fail to perform decently in
> text classification (supervised machine learning in the text domain or
> a related high-dimensional domain, such as bio-informatics), please
> share something about the failures.

What is the language? What preprocessing do you apply?

What is the input space? (Is it term-frequency vectors, weighted by
IDF or similar?)

> I have run into several failures of common methods. The biggest was
> in feature selection for a large industrial multi-class problem.
> Information Gain, Mutual Information, Chi-Squared, etc. all failed to
> produce a decent selection of features. Investigating the failure, I
> found that some 'easy' classes (for which there were many good
> predictive features) were hogging all the features, and that other
> 'hard' classes got none or very few of the features they would
> need to discriminate. I call this the 'Siren Pitfall'. In the
> extreme, imagine that you are trying to classify email into folders,
> and just one of the folders contains German emails -- there will be a
> huge number of very predictive words for this folder, and IG etc. will
> each focus on these features, to the exclusion of the other needed
> features. You may be thinking that this is a problem only on unusual
> datasets, but I carefully studied a well-balanced, homogeneous dataset
> and found it exhibited the same problem to some degree. So, I think
> this problem is pretty common.
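[ Editor's note: the 'Siren Pitfall' above can be demonstrated, and one
natural mitigation sketched: score features per class (one-vs-rest
information gain) and pick them round-robin across classes, so an easy
class cannot hog the whole feature budget. This is a minimal sketch, not
gforman's actual method; the toy corpus, function names, and round-robin
policy are all illustrative assumptions. ]

```python
import math

def entropy(p):
    # Binary entropy in bits; zero at the degenerate probabilities.
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def info_gain(docs, labels, term, cls):
    # One-vs-rest information gain of the binary feature "term present"
    # for class `cls`. docs is a list of sets of terms.
    n = len(docs)
    gain = entropy(sum(y == cls for y in labels) / n)
    for present in (True, False):
        part = [y for d, y in zip(docs, labels) if (term in d) == present]
        if part:
            gain -= len(part) / n * entropy(sum(y == cls for y in part) / len(part))
    return gain

def round_robin_select(docs, labels, k):
    # Rank the vocabulary separately for each class, then take features
    # in turns so every class gets a share of the budget.
    classes = sorted(set(labels))
    vocab = sorted(set().union(*docs))
    k = min(k, len(vocab))
    ranked = {c: sorted(vocab, key=lambda t: info_gain(docs, labels, t, c),
                        reverse=True)
              for c in classes}
    selected, i = [], 0
    while len(selected) < k:
        for t in ranked[classes[i % len(classes)]]:
            if t not in selected:
                selected.append(t)
                break
        i += 1
    return selected

# Toy corpus: the 'de' folder has several exclusive, highly predictive
# words (the easy "siren" class); 'hr' and 'sales' share most vocabulary.
docs = [{"der", "die", "das"}, {"der", "und", "das"}, {"die", "und", "der"},
        {"pipeline", "quota"}, {"quota", "forecast"},
        {"pipeline", "benefits"}, {"forecast", "benefits"}]
labels = ["de", "de", "de", "sales", "sales", "hr", "hr"]
selected = round_robin_select(docs, labels, 3)
```

A global top-3 by best-over-classes gain would pick only German words
here; the round-robin pass instead yields one strong feature per class.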
I think this may be just a manifestation of poor information in our
representation (bag of words) rather than a failure of the feature
selection / dimensionality reduction methods themselves. However, there
may be better feature selection methods. Did you try PCA, etc.?

> Also, when there are very few positives, I see InfoGain substantially
> weakening (I prefer Bi-Normal Separation for this case), and I've seen
> SVMs perform very poorly here compared with Naive Bayes. People
> quickly say -- haven't you varied the SVM's C parameter? Yes, and it
> doesn't help.

Interesting; I would expect an SVM with a Gaussian kernel to perform
better than plain NBC (without boosting, etc.). But it is known that
the SVM isn't Bayes-optimal, so I suppose other classifiers (even NBC)
can surpass it. It's certainly not the ultimate classifier.

> So, what other failures have you experienced or know about?

When the sizes of the classes are not balanced in the training set and
there are several classes, the TF representation often introduces
strange biases, in my opinion. A colleague mentioned that when the
class balance differs between the training set and the test set,
accuracy drops significantly. I think this is because using the L2 norm
over TF vectors is not a very intelligent dissimilarity metric.
What do you think?

Best Regards,

--
Eray Ozkural

[ comp.ai is moderated. To submit, just post and be patient, or if ]
[ that fails mail your article to |
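[ Editor's note: the Bi-Normal Separation metric mentioned above scores
a term by |F^-1(tpr) - F^-1(fpr)|, where F^-1 is the inverse standard
normal CDF, tpr the fraction of positive documents containing the term,
and fpr the fraction of negatives. A minimal sketch using Python's
statistics.NormalDist; the clamping constant is an assumption to keep
the score finite, not part of the definition. ]

```python
from statistics import NormalDist

def bns(tp, fp, pos, neg, eps=0.0005):
    # Bi-Normal Separation: |F^-1(tpr) - F^-1(fpr)|, with F^-1 the
    # inverse standard normal CDF. Rates are clamped into (eps, 1 - eps)
    # because F^-1 diverges at 0 and 1 (eps is an assumed constant).
    inv = NormalDist().inv_cdf
    tpr = min(max(tp / pos, eps), 1 - eps)
    fpr = min(max(fp / neg, eps), 1 - eps)
    return abs(inv(tpr) - inv(fpr))

# With very few positives: a term in 9 of 10 positives but only 1 of
# 1000 negatives scores far above a common, uninformative term.
rare_predictive = bns(tp=9, fp=1, pos=10, neg=1000)
uninformative = bns(tp=5, fp=500, pos=10, neg=1000)
```

Because the normal quantile function stretches the tails, BNS keeps
rewarding terms whose fpr is tiny, which is one way to see why it holds
up better than InfoGain when positives are scarce.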
(c) 1994, bbs@darkrealms.ca