comp.ai
amnon.meyers@textanalysis.com to All
Computing with Confidence: Much Ado abou
13 Apr 08 04:54:06
   XPost: comp.ai.nat-lang   
   From: amnon@textanalysis.com   
      
   Subj: Computing with Confidence: Much Ado about Nothing   
      
   Hi,   
      
   For the past week or more, I've been grappling with the issue of
   missing information in documents.
      
   When the information is known to be there, I can build a confidence   
   rating for whether the text is correct as output by the software (a   
   text analyzer or information extraction system).   
      
   But when I can't find the information, reporting that and reporting   
   confidence about that has been driving me a little crazier than usual.   
      
   As background, the documents I'm dealing with are non-standardized   
   forms that have been scanned and OCRed.  The text there mixes any and   
   all of: computer formatted, typed, handprinted/handwritten, and   
   stamped.  I'm looking mainly for things like names, addresses, dates,   
   and associated information.  So if a date is present and found, I can   
   produce an output such as   
      
   95%  Date installed: 4/12/08   
      
   Because of OCR issues, sometimes the value isn't present, and   
   sometimes the (highly variable) text markers for "date installed"   
   themselves are garbled or missing (or unanticipated).  So if I say   
   something like   
      
   90% Date installed:   [empty]   
      
   presumably that represents confidence that this datum is correct,   
   i.e., it is actually missing in the original document.   
      
   My mind gyrated as follows: Well, it's likely to be just a 0/1
   decision as to whether the item is actually in the original document,
   rather than a percentage.  (Or if a percentage, usually something
   close to 0% or 100%.)  But then I thought that this made more sense:
      
   Date installed present:  0 or 1   
   Date installed present confidence:  0% to 100%   
      
   I could keep these two metrics hidden internally in the system, yet   
   output something like   
      
   0%  Date installed:   
      
   In this way, the output pragmatically informs that the item must be   
   fetched or validated manually.  If I write   
      
   100% Date installed:   
      
   then I'm saying I'm totally sure that it's missing.  This seems
   philosophically hard or impossible to justify, given the nature of the
   domain and of "presence" and "absence".  But if I'm reporting both a
   0 or 1 and a 0 to 100% confidence, isn't this the same as a single
   percentage plus a threshold value?  E.g., if the confidence that it's
   there is > 70%, then it's there, else it's not there?
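
   To make the "percentage plus a threshold" reading concrete, here is a
   minimal sketch.  It assumes the system can estimate one probability
   that the datum is present; the names PresenceJudgment and
   judge_presence are hypothetical, and taking the complement as the
   confidence in an "absent" decision is an assumption, not something
   established above.

```python
from dataclasses import dataclass

THRESHOLD = 0.70  # the 70% cutoff used as an example in the post


@dataclass
class PresenceJudgment:
    present: bool      # the 0/1 decision
    confidence: float  # confidence in that decision, 0.0 to 1.0


def judge_presence(p_present: float, threshold: float = THRESHOLD) -> PresenceJudgment:
    """Collapse one probability of presence into (decision, confidence)."""
    if p_present > threshold:
        return PresenceJudgment(True, p_present)
    # Assumption: confidence in "absent" is the complement of p_present.
    return PresenceJudgment(False, 1.0 - p_present)


print(judge_presence(0.90))  # decided present
print(judge_presence(0.10))  # decided absent
print(judge_presence(0.60))  # decided absent, but with low confidence
```

   Under these assumptions the pair carries no information beyond the
   single probability and the threshold, which is the equivalence the
   question points at.  A case like 60% comes out "absent" with low
   confidence, so it would still need manual review.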
      
   Hopefully the ramble above offers some insight into my confusion.
   I'd appreciate pointers to any work that's been done in this area,
   computing with confidence.  Comments from math and statistics experts
   are welcome too.  I'm looking for guidance on untangling or
   simplifying the handling of:
      
   1. Measuring whether the datum is in the document or absent.
   2. Measuring whether the datum was localized or found properly.
   3. Measuring whether the datum text is correct or was correctable.
   4. If absent, dealing with incomplete knowledge/methods for finding
      it, handling this cleanly, etc.
      
   (An aside that may be of interest to some: As to how I compute
   confidence, NLP++ has a confidence operator that lets you accumulate
   evidence without exceeding 100%.  For example, 90 %% 90, or 90 conf
   90, adds or composes two high-confidence pieces of evidence.  The
   NLP++ computation looks like
      
   90 %% 90 = 94   
   94 %% 90 = 96   
   99 %% 99 = 99   
      
   and so on.  Our old resume analyzer used this to good effect, and I'm   
   having a great time using it on the latest app.)   
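
   The formula behind %% isn't given above.  As a sketch only, one rule
   that reproduces all three results shown is to convert each percentage
   to odds, add the odds, convert back, and truncate to a whole percent.
   This is a guess fitted to those three figures, not NLP++'s documented
   implementation, and combine_confidence is a hypothetical name.  (The
   more familiar noisy-OR rule, 1 - (1-a)(1-b), would give 99 for 90 and
   90, so it does not match.)

```python
def combine_confidence(a: int, b: int) -> int:
    """Combine two percent confidences (0-100) by adding their odds.

    Assumed rule, chosen only because it matches the three examples
    in the post; the real %% operator may be computed differently.
    """
    if a >= 100 or b >= 100:
        return 100  # certainty has infinite odds; avoid dividing by zero
    odds = a / (100 - a) + b / (100 - b)
    return int(100 * odds / (1 + odds))  # truncate to a whole percent


print(combine_confidence(90, 90))  # 94
print(combine_confidence(94, 90))  # 96
print(combine_confidence(99, 99))  # 99
```

   Like the operator described above, this rule never reaches 100 unless
   one of the inputs is already 100.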
      
   Thanks for any clarity!   
      
   Amnon Meyers   
   CTO   
   Text Analysis International, Inc.   
   http://www.textanalysis.com   
      