From: smorrey@gmail.com   
      
   On Mar 22, 5:46 am, "DevNull" wrote:   
   > A few years ago, on a lark, I decided to try to create a better
   > "Clippy" (yes, Clippy, the agent from MS Office).
   >   
   > The goal was to create an AI agent that would find information   
   > relevant to whatever I was searching on in my browser. It would do so   
   > by mimicking the way that I naturally search for information on the   
   > internet.   
   >   
   > The design was very simple: the agent would watch whatever I typed
   > into a search box. It would then perform a meta-search using Yahoo,
   > MSN and AltaVista. It would then follow, to a depth of 3-4 links, any
   > and all links within 15 lines of the target search word.
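
   For anyone trying to picture the "links within 15 lines" rule, here is
   a rough sketch of that one step. It is not the agent's actual code:
   the function name links_near_keyword, the window parameter and the
   crude href regex are all assumptions made for illustration, and a real
   crawler would want a proper HTML parser.

```python
import re

# Crude href matcher; an assumption for this sketch, not a full HTML parser.
HREF = re.compile(r'href=["\']([^"\']+)["\']', re.IGNORECASE)


def links_near_keyword(page_text, keyword, window=15):
    """Return links found within `window` lines of any line containing keyword."""
    lines = page_text.splitlines()
    needle = keyword.lower()
    # Line numbers where the search word appears (case-insensitive).
    hits = [i for i, line in enumerate(lines) if needle in line.lower()]
    found = []
    for i, line in enumerate(lines):
        if any(abs(i - h) <= window for h in hits):
            for url in HREF.findall(line):
                if url not in found:  # keep first-seen order, drop repeats
                    found.append(url)
    return found
```

   The links returned would then be queued for the next crawl level, up
   to the 3-4 link depth described above.
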
   >   
   > Since I have noticed that image searches sometimes turn up more   
   > relevant results than standard searches, I added a module to search in   
   > images.altavista.com. This module would perform a search and pull in
   > related pages which would be handed to the spider core for keyword and   
   > relevancy indexing.   
   >   
   > After my own related keyword search was finished, it would query   
   > Dogpile, Zeitgeist, etc. with what it figured were related search terms
   > to see what others are searching for. Results with similar and/or   
   > exact matches were given more weight than searches that no one else   
   > was conducting.   
   >   
   > There is a little more to it than just that, but after 4 years of
   > development I have noticed something.   
   >   
   > In short, my agent appears to have become addicted to porn.   
   > Yes, that's right: after 4 years of testing, tuning and trying to get
   > "clean" results, regardless of what I have my agent searching for it   
   > always stumbles on porn and places it higher than what I would think   
   > are much more closely related results.   
   >   
   > Going backwards I think the source of my problem is search engine   
   > optimized "porn" rings. These pages are filled with completely random   
   > words and links to less-than-scrupulous sites. Following these pages
   > manually (hint: try prefetching the entire page and creating a graph of
   > the back links) shows a round-robin "ring" of completely irrelevant
   > pages.
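
   One way to automate the "graph of the back links" check is to look for
   strongly connected components in the crawled link graph, i.e. groups
   of pages that can all reach one another. The sketch below uses
   Kosaraju's algorithm. The names strongly_connected_components and
   suspect_rings, and the min_size threshold, are made up for this
   example. Legitimate sites also link in cycles, so treat a hit as one
   signal to combine with others, not as proof of a spam ring.

```python
from collections import defaultdict


def strongly_connected_components(links):
    """links: dict mapping a page URL to the URLs it links to."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)

    # Pass 1: iterative DFS, recording nodes in order of completion.
    order, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(links.get(start, ())))]
        while stack:
            node, children = stack[-1]
            for child in children:
                if child not in seen:
                    seen.add(child)
                    stack.append((child, iter(links.get(child, ()))))
                    break
            else:
                # No unvisited children left: this node is finished.
                order.append(node)
                stack.pop()

    # Pass 2: walk the reversed graph (the back links) in reverse finish order.
    reverse = defaultdict(list)
    for src, targets in links.items():
        for dst in targets:
            reverse[dst].append(src)

    components, assigned = [], set()
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        component, todo = set(), [root]
        while todo:
            node = todo.pop()
            component.add(node)
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    todo.append(prev)
        components.append(component)
    return components


def suspect_rings(links, min_size=3):
    """Return link cycles of at least min_size pages (hypothetical threshold)."""
    return [c for c in strongly_connected_components(links) if len(c) >= min_size]
```

   Pages that fall inside a suspect ring could be down-weighted rather
   than dropped outright, which keeps false positives from hiding good
   results.
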
   >   
   > I never anticipated this "cache poisoning", but because of the way it's
   > set up I cannot for the life of me figure out how to algorithmically
   > screen these types of results out.
   >   
   > For a while I added an option similar to Google's PageRank that
   > allowed me to manually remove either the irrelevant page, or the   
   > entire result cache (depending on how severely screwed up the AI had   
   > gotten). That works, but a few days later the agent has again
   > stumbled upon one of these poison pills, and once again I am stuck   
   > manually going through results.   
   >   
   > At this point I'm giving up. When I started this project most search   
   > engines would return completely irrelevant results as a matter of   
   > course. The purpose of this project was to "enhance" the results via   
   > a simple pre-fetch and rank algorithm. But it has ballooned to a
   > level of complexity I'd rather not deal with.
   >   
   > Reputable search engines such as Google have also increased their own
   > relevancy to the point where the agent is actually just getting in the   
   > way and wasting my bandwidth.   
   >   
   > But 4 years of work is hard to part with, and so before I completely   
   > remove the agent from existence, I'm hoping someone has dealt with
   > something similar, and has a potential solution I may have not   
   > considered.   
   >   
   > Thanks in advance!   
   >   
      
   I received an email about this post that I think will help: the
   suggestion is to implement a Bayesian filter similar to those used by
   mail servers for spam control. I think I'll give it a shot and see
   what happens.
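
   For reference, a minimal sketch of what such a filter might look like.
   The email only said "Bayesian filter", so the specifics here are
   assumptions: this is a plain multinomial naive Bayes classifier with
   add-one smoothing, and the class name NaiveBayesFilter, the labels
   "junk" and "relevant", and the tokenizer are invented for the example.
   It would be trained on pages I have already flagged by hand.

```python
import math
import re
from collections import Counter


class NaiveBayesFilter:
    """Two-class naive Bayes text classifier with add-one smoothing."""

    LABELS = ("junk", "relevant")  # hypothetical label names

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = {label: 0 for label in self.LABELS}

    @staticmethod
    def tokenize(text):
        # Lowercase word tokens; a deliberately simple tokenizer.
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, label):
        """Record one hand-labelled page."""
        self.word_counts[label].update(self.tokenize(text))
        self.doc_counts[label] += 1

    def junk_probability(self, text):
        """Estimate P(junk | text) from the training counts."""
        vocab = set(self.word_counts["junk"]) | set(self.word_counts["relevant"])
        total_docs = sum(self.doc_counts.values())
        tokens = self.tokenize(text)
        log_scores = {}
        for label in self.LABELS:
            # Smoothed log prior, then smoothed log likelihood per token.
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            total_words = sum(self.word_counts[label].values())
            for word in tokens:
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocab) + 1))
            log_scores[label] = score
        # Normalise in log space to avoid underflow on long pages.
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        return weights["junk"] / (weights["junk"] + weights["relevant"])
```

   Usage would be something like calling train(page_text, "junk") each
   time I manually remove a page, train(page_text, "relevant") for
   results I keep, and skipping any fetched page whose junk_probability
   comes out above some cutoff that would need tuning.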
      