
Forums before death by AOL, social media and spammers... "We can't have nice things"

   comp.ai      Awaiting the gospel from Sarah Connor      1,954 messages   

[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]

   Message 1,658 of 1,954   
   Ted Dunning to Felix Crux   
   Re: Issues regarding testing of a classi   
   03 Feb 08 08:59:49   
   
   From: ted.dunning@gmail.com   
      
   On Jan 31, 4:01 am, Felix Crux  wrote:   
   > On Tue, 2008-01-29 at 02:35 +0000, talse...@gmail.com wrote:   
   > > Hi all,   
   >   
   > > I have a general question, I hope you guys could help me.   
   >   
   > > Suppose I have a classifier A that discriminates between two classes:   
   > > class W and B (White balls and Black balls, respectively).   
   >   
   > > Suppose I have to run the classifier on a vast set of balls (:= P), in   
   > > which the distribution of White and Black balls is unknown (Which   
   > > means I don't know the a-priori probability of getting a white or a   
   > > black ball to examine).   
   >   
   > > Now I would like to test the classifier. I choose a subset of P (:=N)   
   > > that consists of N balls and run the experiment to get the ROC curve   
   > > of the classifier.   
   >   
   > > My question is: What is the best way to set the distribution of White   
   > > and Black balls in N if the distribution of P is unknown? 0.5*N Black   
   > > balls and 0.5*N White balls sounds right, but is it really right?! And   
   > > how would the answer change if P can be determined?   
   >   
   >     I wouldn't think that a 50/50 distribution is the best way to test,   
   > since it makes it  impossible to distinguish accurate classification   
   > from random guessing. In other words, did your classifier actually   
   > determine that half were white and half were black, or did it flip a   
   > coin each time one came up? Try something like 75% of one and 25% of the   
   > other. Cheers,   
   >   
   > Felix   
   >   
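As an aside on the ROC curve the original poster wants to produce: the points on it can be computed directly from scored, labeled test samples. A minimal sketch (the scores and labels below are invented toy data, with 1 standing in for "White"):

```python
# Sketch: computing ROC points from scored test samples. Each distinct
# score acts as a threshold; sweeping it from high to low traces the curve.

def roc_points(scores, labels):
    """Return (false_positive_rate, true_positive_rate) pairs,
    sweeping the decision threshold from the highest score down."""
    pos = sum(labels)
    neg = len(labels) - pos
    # Sort by descending score so each step lowers the threshold by one sample.
    ranked = sorted(zip(scores, labels), reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _score, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

# Toy data: higher score should mean "White" (label 1).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
print(roc_points(scores, labels))
```

Note that the curve's *shape* does not depend on the class mix in the test set; what the mix changes is how precisely each point is estimated, which is the subject of the reply below... er, of the discussion about convergence.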
      
If the problem is to estimate the risk of the classifier in real operation   
with asymmetric error costs, then testing with unequal   
proportions of test samples can give you slightly faster convergence   
for your estimate of average error cost.   
      
Unless your costs of false negatives and false positives are   
dramatically different, this won't make a big enough difference to   
matter.   
      
   A common counter-example is fraud detection.  The cost of a false   
   negative can be tens of thousands of dollars and the major cost of a   
   false positive is a slight customer disaffection and (usually) a   
   requirement that you not contact the customer about possible fraud for   
   90 days with the attendant risk of undetected subsequent fraud.  The   
   probability of fraud is also very low as a fraction of all   
transactions.  If we assume a fraud rate of 1% (not realistic,   
actually), a false-negative cost of $5K and a false-positive cost of   
$1, then the expected cost of error is   
   
  E(cost(error)) = $5K x 1% x p(false negative) + $1 x 99% x p(false positive)   
   
                 ≈ $50 x p(false negative) + $1 x p(false positive)   
      
The error in your estimate of p(false negative) is proportional to   
1/sqrt(k(fraud)), and similarly for p(false positive).  The expected   
error on E(cost(error)) is then proportional to   
1/sqrt(50 x k(fraud) + k(non fraud)).  Clearly, in this case, having   
many more fraud transactions in your test set is good if the total   
number of test cases is fixed.  On the other hand, it is also good to   
just add additional test cases of either kind, so it is always bad to   
exclude test cases.   
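The convergence argument can be checked with a small simulation. The cost figures ($5K false negative, $1 false positive, 1% fraud rate) come from the example above; the true error rates of the classifier are invented for illustration. The sketch draws fixed-size test sets with different fraud fractions and compares the spread of the resulting plug-in cost estimates:

```python
import random

# Simulation sketch: estimate the expected cost of error from a test set,
# varying the fraction of fraud cases. Cost figures follow the example
# above; P_FN and P_FP are assumed error rates of a hypothetical classifier.

FN_COST, FP_COST = 5000, 1
P_FN, P_FP = 0.10, 0.05       # assumed true error rates
FRAUD_RATE = 0.01             # assumed population fraud rate

def estimated_cost(n_test, fraud_frac, rng):
    """One experiment: draw a test set with the given fraud fraction and
    return the plug-in estimate of expected cost per transaction."""
    k_fraud = int(n_test * fraud_frac)
    k_ok = n_test - k_fraud
    fn = sum(rng.random() < P_FN for _ in range(k_fraud))
    fp = sum(rng.random() < P_FP for _ in range(k_ok))
    # Reweight by the population class priors, as in the formula above.
    return (FN_COST * FRAUD_RATE * fn / k_fraud
            + FP_COST * (1 - FRAUD_RATE) * fp / k_ok)

def spread(fraud_frac, trials=2000, n_test=1000, seed=1):
    """Standard deviation of the cost estimate across repeated experiments."""
    rng = random.Random(seed)
    ests = [estimated_cost(n_test, fraud_frac, rng) for _ in range(trials)]
    mean = sum(ests) / trials
    var = sum((e - mean) ** 2 for e in ests) / trials
    return var ** 0.5

# Oversampling fraud in the (fixed-size) test set shrinks the spread
# of the estimate, as the convergence argument predicts.
print("std at  1% fraud in test set:", spread(0.01))
print("std at 50% fraud in test set:", spread(0.50))
```

With a 1% fraud mix, only ~10 of the 1000 test cases estimate the expensive p(false negative), so that term dominates the variance; oversampling fraud cuts it substantially.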
      
   [ comp.ai is moderated ... your article may take a while to appear. ]   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   



(c) 1994,  bbs@darkrealms.ca