
Forums before death by AOL, social media and spammers... "We can't have nice things"

   comp.ai      Awaiting the gospel from Sarah Connor      1,954 messages   

[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]

   Message 376 of 1,954   
   gforman to All   
   what text-classification failures have y   
   16 Jul 04 23:43:18   
   
   From: george.forman@gmail.com   
      
   The published literature is full of success stories in text   
   classification, but the failures are rarely published, if ever.   
      
   If you have found traditional methods fail to perform decently in
   text classification (supervised machine learning in the text domain
   or a related high-dimensional domain, such as bioinformatics),
   please share something about the failures.
      
   I have run into several failures of common methods.  The biggest was   
   in feature selection for a large industrial multi-class problem.   
   Information Gain, Mutual Information, Chi Squared, etc. all failed to   
   produce a decent selection of features.   Investigating the failure, I   
   found that some 'easy' classes (for which there were many good   
   predictive features) were hogging all the features, and that other   
   'hard' classes got none or very few of the features that they would   
   need to discriminate.   I call this the 'Siren Pitfall'.  In the   
   extreme, imagine that you are trying to classify email into folders,   
   and just one of the folders contains German emails-- there will be a   
   huge number of very predictive words for this folder and IG/etc. will   
   each focus on these features, to the exclusion of the other needed   
   features.  You may be thinking that this is a problem only on
   unusual datasets, but I carefully studied a well-balanced,
   homogeneous dataset and found that it exhibited the same problem to
   some degree.  So I think this problem is pretty common.
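   The pitfall, and the kind of per-class round-robin selection that
   avoids it, can be sketched as follows.  This is a toy illustration,
   not code from the post: the score matrix is made up, with one
   'siren' class holding many high-scoring features, and the
   round-robin remedy is one plausible fix.

   ```python
   import numpy as np

   # Toy score matrix: rows = classes, cols = candidate features.
   # Class 0 is the 'siren' (e.g. the German-email folder): it has 15
   # highly predictive features that swamp a single global ranking.
   scores = np.full((4, 40), 0.1)
   scores[0, :15] = np.linspace(1.0, 0.7, 15)     # siren class
   scores[1, 15:18] = [0.30, 0.28, 0.26]          # hard class 1
   scores[2, 18:21] = [0.30, 0.28, 0.26]          # hard class 2
   scores[3, 21:24] = [0.30, 0.28, 0.26]          # hard class 3

   k = 12

   # Global selection: the k features with the best score for ANY class.
   # Every slot goes to the siren class's features.
   global_pick = list(np.argsort(scores.max(axis=0))[::-1][:k])

   # Round-robin selection: cycle over classes, each taking its best
   # not-yet-chosen feature, so hard classes keep their discriminators.
   per_class_rank = scores.argsort(axis=1)[:, ::-1]
   chosen = []
   cursors = [0] * scores.shape[0]
   while len(chosen) < k:
       for c in range(scores.shape[0]):
           while per_class_rank[c, cursors[c]] in chosen:
               cursors[c] += 1
           chosen.append(per_class_rank[c, cursors[c]])
           if len(chosen) == k:
               break

   def owner(col):
       return int(scores[:, col].argmax())

   print("global     :", sorted(owner(int(c)) for c in global_pick))
   # -> all 12 features belong to class 0 (the siren)
   print("round-robin:", sorted(owner(int(c)) for c in chosen))
   # -> 3 features per class
   ```

   With the global ranking, classes 1-3 get zero features even though
   each has usable (if weaker) predictors of its own; the round-robin
   pass distributes the budget evenly.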
      
   Also, when there are very few positives, I see InfoGain substantially   
   weakening (I prefer Bi-Normal Separation for this case), and I've seen   
   SVMs perform very poorly here compared with Naive Bayes.  People
   are quick to ask: haven't you varied the SVM's C parameter?  Yes,
   and it doesn't help.
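   Bi-Normal Separation, per Forman's published description, scores a
   feature by the distance between the inverse-normal-CDF transforms of
   its true-positive and false-positive rates.  A minimal sketch, with
   an Information Gain baseline for comparison (the 2x2 counts below
   are made up for illustration, not from the post):

   ```python
   import numpy as np
   from scipy.stats import norm

   def bns(tp, fn, fp, tn, eps=0.0005):
       # Bi-Normal Separation: |F^-1(tpr) - F^-1(fpr)|, where F^-1 is
       # the inverse standard normal CDF; rates are clipped away from
       # 0 and 1 so the transform stays finite.
       tpr = np.clip(tp / (tp + fn), eps, 1 - eps)
       fpr = np.clip(fp / (fp + tn), eps, 1 - eps)
       return abs(norm.ppf(tpr) - norm.ppf(fpr))

   def info_gain(tp, fn, fp, tn):
       # H(class) - H(class | feature), from the same 2x2 counts.
       n = tp + fn + fp + tn
       def H(*counts):
           p = np.array(counts, dtype=float)
           p = p[p > 0] / p.sum()
           return -(p * np.log2(p)).sum()
       return (H(tp + fn, fp + tn)
               - ((tp + fp) / n) * H(tp, fp)
               - ((fn + tn) / n) * H(fn, tn))

   # A feature covering 8 of only 10 positives in a 10,000-doc
   # collection, with 5 false hits:
   print(bns(8, 2, 5, 9985))        # large: ~4.13
   print(info_gain(8, 2, 5, 9985))  # tiny: ~0.007
   ```

   Because the positives are so rare, the class-entropy ceiling keeps
   Information Gain near zero for every feature, compressing the
   ranking; BNS's normal-deviate scale still separates a strong rare
   predictor from the noise.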
      
   So, what other failures have you experienced or know about?   
      
      
   PS: Just to be clear, let's not count a few classification errors here   
   and there as a failure, but rather when the error rate is terrible   
   overall.   
      
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   



(c) 1994,  bbs@darkrealms.ca