|    comp.ai    |    Awaiting the gospel from Sarah Connor    |    1,954 messages    |
|    Message 1,710 of 1,954    |
|    amnon.meyers@textanalysis.com to All    |
|    Computing with Confidence: Much Ado about Nothing    |
|    13 Apr 08 04:54:06    |
XPost: comp.ai.nat-lang
From: amnon@textanalysis.com
Subj: Computing with Confidence: Much Ado about Nothing

Hi,

The past week or more, I've been grappling with the issue of missing
information in documents.

When the information is known to be there, I can build a confidence
rating for whether the text is correct as output by the software (a
text analyzer or information extraction system).

But when I can't find the information, reporting that and reporting
confidence about that has been driving me a little crazier than usual.

As background, the documents I'm dealing with are non-standardized
forms that have been scanned and OCRed. The text there mixes any and
all of: computer formatted, typed, handprinted/handwritten, and
stamped. I'm looking mainly for things like names, addresses, dates,
and associated information. So if a date is present and found, I can
produce an output such as

    95% Date installed: 4/12/08

Because of OCR issues, sometimes the value isn't present, and
sometimes the (highly variable) text markers for "date installed"
themselves are garbled or missing (or unanticipated). So if I say
something like

    90% Date installed: [empty]

presumably that represents confidence that this datum is correct,
i.e., it is actually missing in the original document.

My mind gyrated as follows: Well, it's likely to be just a 0/1
decision as to whether the item is actually in the original document,
rather than a percentage. (Or if a percent, usually something like 0%
and 100%).
But then I thought that this made more sense:

    Date installed present: 0 or 1
    Date installed present confidence: 0% to 100%

I could keep these two metrics hidden internally in the system, yet
output something like

    0% Date installed:

In this way, the output pragmatically tells the user that the item
must be fetched or validated manually. If I write

    100% Date installed:

then I'm saying I'm totally sure that it's missing. This seems
philosophically hard or impossible to justify, given the nature of the
domain and of "presence" and "absence". But if I'm saying 0 or 1 and
0 to 100%, isn't this the same as a percentage plus a threshold
value? E.g., if it's > 70% that it's there, then it's there, else it's
not there?

Hopefully the ramble above offers some insight into my confusion. I'd
appreciate pointers to any work that's been done in this area,
computing with confidence. Comments from math and statistical experts
appreciated as well. Looking for guidance on detangling or
simplifying the handling of:

1. Measuring whether the datum is in the document or absent.
2. Measuring if the datum was localized or found properly.
3. Measuring if the datum text is correct or was correctable.
4. If absent, dealing with incomplete knowledge/methods for finding
   it, handling this cleanly, etc.

(An aside that may be of interest to some: As to how I compute
confidence, NLP++ has a confidence operator that lets you accumulate
evidence without exceeding 100%. For example, 90 %% 90, or 90 conf
90, adds or composes two high confidence pieces of evidence. The NLP++
computation looks like

    90 %% 90 = 94
    94 %% 90 = 96
    99 %% 99 = 99

and so on.
Our old resume analyzer used this to good effect, and I'm
having a great time using it on the latest app.)

Thanks for any clarity!

Amnon Meyers
CTO
Text Analysis International, Inc.
http://www.textanalysis.com

[ comp.ai is moderated ... your article may take a while to appear. ]
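
A minimal sketch of the question raised in the post: is a 0/1 "present" flag plus a 0 to 100% confidence the same thing as a single percentage plus a threshold? Everything named here (`FieldReport`, `to_probability_present`, `from_probability_present`) is hypothetical and is not part of NLP++ or the poster's system. The sketch also assumes that "confidence" means the estimated chance that the present/absent call is right, and that it is above 50%.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldReport:
    """Hypothetical record: a present/absent call plus confidence in that call."""
    name: str
    present: bool      # the 0/1 decision from the post
    confidence: float  # 0.0 to 1.0, assumed chance that the decision is right


def to_probability_present(report: FieldReport) -> float:
    """Fold the pair into one number: estimated probability the datum is present."""
    return report.confidence if report.present else 1.0 - report.confidence


def from_probability_present(name: str, p: float, threshold: float = 0.5) -> FieldReport:
    """Recover a (present, confidence) pair from one probability by thresholding."""
    present = p >= threshold
    # Confidence in the call is p if we say "present", otherwise 1 - p.
    return FieldReport(name, present, p if present else 1.0 - p)


# The two outputs from the post: "95% ... 4/12/08" and "90% ... [empty]".
found = FieldReport("Date installed", True, 0.95)
missing = FieldReport("Date installed", False, 0.90)

for report in (found, missing):
    p = to_probability_present(report)
    back = from_probability_present(report.name, p)
    print(f"{report.name}: p(present)={p:.2f} -> present={back.present}, "
          f"confidence={back.confidence:.2f}")

# The post's 70% cut applied to a middling score of 60%.
print(from_probability_present("Date installed", 0.60, threshold=0.70))
```

Under these assumptions, a 50% cut gives back the original pair, so the two views carry the same information. With the 70% cut from the post, a 60% score is reported as absent with only 40% confidence, which is where the two views stop being interchangeable.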