comp.ai
Ted Dunning to vaib
Re: Some help in research..!
05 Nov 09 12:27:24
   From: ted.dunning@gmail.com   
      
   Vaibhav,   
      
   That is a very nice summary of a number of difficult, but very   
   interesting areas of research.  Some of them are more oriented towards   
   systems-level research in that the problems have mostly to do with   
   implementation.  Others are more limited by problems of conception and   
   algorithms.   
      
   Your summary is good enough to make me curious about who the "we" that   
   you mention are.  Can you say more?   
      
   I have added a few comments.   
      
   On Oct 28, 6:00 am, vaib wrote:
   > ...   
   > 1. Indexers ( in search engines ).   
      
   Some of the problems here are scaling the indexers and crawlers to web   
   scale with a small number of machines.  Many people have worked on   
   these problems.  See for instance the blixo project.  Another major   
   problem has to do, not so much with indexing per se, but simply the   
   accumulation of all of the various kinds of data that you get about   
   web pages.  What you need is a very large flexible table store with   
   many of the properties of a column-oriented database.   
      
   On the NLP research side, some of the problems include the issues   
   involved in seeing text in multiple ways at once without losing the   
   precision provided by exact matching.  Thus, if you decide a word
   might be a compound, how can you index it so that it is found
   efficiently as both the compound and the phrase?  Synonymy and
   stemming can be treated similarly.  There are also interesting issues
   of compression.  Most web-scale indexers compress their indexes by
   simply ignoring many occurrences of words and retaining only those
   that are from "higher-quality" pages.
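
   To make the "multiple ways at once" idea concrete, here is a minimal
   Python sketch, not a description of any particular engine.  Each
   position is indexed under its exact surface form and under derived
   readings, with a flag that preserves exact-match precision.  The
   COMPOUNDS table and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical table of compounds and their split forms (an assumption
# made for this sketch; a real system would need a much better source).
COMPOUNDS = {"notebook": ("note", "book")}
SPLITS = {parts: whole for whole, parts in COMPOUNDS.items()}


def analyze(text):
    """Yield (term, position, exact) triples.

    Derived readings share the position of the surface form and are
    flagged exact=False so exact matching stays available.
    """
    tokens = text.lower().split()
    for pos, tok in enumerate(tokens):
        yield tok, pos, True
        if tok in COMPOUNDS:
            # Compound as written; also index the phrase reading.
            yield " ".join(COMPOUNDS[tok]), pos, False
        pair = tuple(tokens[pos:pos + 2])
        if pair in SPLITS:
            # Phrase as written; also index the compound reading.
            yield " ".join(pair), pos, True
            yield SPLITS[pair], pos, False


def build_index(docs):
    """Map each term to a list of (doc_id, position, exact) postings."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for term, pos, exact in analyze(text):
            index[term].append((doc_id, pos, exact))
    return index


def search(index, term, exact_only=False):
    """Return sorted doc ids containing the term, optionally exact only."""
    return sorted({doc for doc, _pos, exact in index.get(term, [])
                   if exact or not exact_only})


docs = {1: "my notebook is here", 2: "a note book on the desk"}
index = build_index(docs)
```

   With these two toy documents, a search for either "notebook" or
   "note book" finds both, while exact_only=True restricts the result
   to the document that actually contains that form.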
      
   Finally, there is the question of how you can use more and different   
   data from anchor text and linking patterns.  Can you use the terms   
   people use in queries in combination with their clicking patterns to   
   enhance your index?   
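
   One simple way to read this question is sketched below, assuming a
   click log of (query, clicked document) pairs.  Query terms that
   repeatedly lead to a click on a document are attached to that
   document as extra index terms.  The log format, the min_clicks
   threshold and the function name are assumptions for illustration.

```python
from collections import Counter, defaultdict


def click_terms(click_log, min_clicks=2):
    """Collect query terms that led to clicks on each document.

    click_log: iterable of (query, doc_id) pairs (hypothetical format).
    Returns {doc_id: {term: click_count}} keeping only terms seen in at
    least min_clicks clicked queries, as a crude noise and spam filter.
    """
    counts = defaultdict(Counter)
    for query, doc_id in click_log:
        # Count each term once per query so repeats inside one query
        # do not inflate the evidence.
        for term in set(query.lower().split()):
            counts[doc_id][term] += 1
    return {doc: {t: n for t, n in c.items() if n >= min_clicks}
            for doc, c in counts.items()}


log = [
    ("cheap flights", "d1"),
    ("flights to paris", "d1"),
    ("cheap hotels", "d2"),
    ("paris flights", "d1"),
]
extra_terms = click_terms(log)
```

   The resulting terms could be stored in a separate field so that they
   can be weighted differently from the page's own text.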
      
   > 4. Ontology generation from text ( we have heard a lot about it   
   > although we have no idea as to what it is )   
      
   You are in good company.  The experts on the subject are quite clear   
   about what it is, but not necessarily clear about what it should be.   
      
   > 7. Tags and folksonomies.   
   > 8. Proof, trust and provenance for web information Applications.   
      
   These two issues are very closely related and very, very important.   
   They are also related to the spam problem.   
      
   > 9. Intelligent information retrieval   
      
   Here, I prefer to not put the intelligence in the retrieval system but   
   instead to reflect the intelligences of my users back at them.   
      
   Reflected intelligence is MUCH easier than artificial intelligence and   
   thus has much higher commercial value in many cases.   
      
   > 13. Trend spotting   
      
   This is done only very poorly at present, and many of the results so
   far are pretty much negative.  A contribution here would be very
   interesting.
      
   > 14. Web link-analysis & graph mining   
      
   See the Pregel paper.  Open source implementations of Pregel, in the
   way that Hadoop is one for map-reduce, would be very useful for many
   people.
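
   For readers who have not seen the model, the following is a
   single-process Python sketch of the vertex-centric, superstep style
   of computation, applied to PageRank.  It only simulates the message
   passing in memory; it is not the Pregel API, and the function name
   and parameters are assumptions.

```python
def pagerank_supersteps(graph, supersteps=50, damping=0.85):
    """Vertex-centric PageRank run as a fixed number of supersteps.

    graph: dict mapping each vertex to a list of its out-neighbours.
    Every vertex must appear as a key, even if it has no out-links.
    """
    n = len(graph)
    value = {v: 1.0 / n for v in graph}
    for _ in range(supersteps):
        # Message phase: each vertex sends an equal share of its
        # current value along every outgoing edge.
        inbox = {v: [] for v in graph}
        dangling = 0.0
        for v, out in graph.items():
            if out:
                share = value[v] / len(out)
                for w in out:
                    inbox[w].append(share)
            else:
                # Vertices without out-links spread their value evenly.
                dangling += value[v]
        # Compute phase: each vertex combines the messages it received.
        value = {
            v: (1 - damping) / n
            + damping * (sum(inbox[v]) + dangling / n)
            for v in graph
        }
    return value


links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank_supersteps(links)
```

   In a distributed implementation the inbox would be delivered over
   the network between supersteps, with vertices partitioned across
   machines.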
      
   > 15. Word Sense Disambiguation.   
      
   This is related to the indexing problem mentioned above.  Strictly   
   disambiguating a word is much less useful than indexing the level of   
   information that you have (or don't have).  In most interesting cases,   
   you can limit the ambiguity of a word, but not eliminate it.  Your   
   indexing should not force you to commit to a single reading.   
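
   A minimal sketch of indexing without committing to a single reading
   follows.  Each occurrence is posted under its surface form and under
   every candidate sense, with a weight.  The sense model here is a toy
   stand-in, and its senses and probabilities are invented purely for
   illustration.

```python
from collections import defaultdict


def toy_sense_model(tokens, pos):
    """Hypothetical sense model: returns {sense: weight} for a token."""
    if tokens[pos] != "bank":
        return {}
    if "river" in tokens:
        return {"river": 0.8, "finance": 0.2}
    if "loan" in tokens:
        return {"finance": 0.9, "river": 0.1}
    # No useful context: record the ambiguity instead of guessing.
    return {"river": 0.5, "finance": 0.5}


def index_with_senses(docs, sense_model):
    """Post each token under its surface form and each weighted sense."""
    index = defaultdict(list)
    for doc_id, tokens in docs.items():
        for pos, tok in enumerate(tokens):
            index[tok].append((doc_id, pos, 1.0))
            for sense, weight in sense_model(tokens, pos).items():
                index[f"{tok}#{sense}"].append((doc_id, pos, weight))
    return index


def docs_for(index, term, min_weight=0.0):
    """Return sorted doc ids whose posting weight reaches min_weight."""
    return sorted({doc for doc, _pos, w in index.get(term, [])
                   if w >= min_weight})


sense_docs = {1: ["river", "bank"], 2: ["bank", "loan"], 3: ["the", "bank"]}
sense_index = index_with_senses(sense_docs, toy_sense_model)
```

   A query can then choose how much ambiguity to tolerate at search
   time, for example by setting min_weight, rather than having that
   decision fixed at indexing time.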
      
   > 16. Emotion Analysis ( seems very interesting but can anything   
   > concrete be done here ? )   
      
   Yes.  And many people have done interesting things.  One of the most   
   interesting tidbits I have heard was the use of local vocabulary   
   measures as an indicator of the mood of the writer.   
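
   The post does not say which vocabulary measure was used.  As one
   possible reading, the sketch below computes a moving type-token
   ratio, that is, the fraction of distinct words in each sliding
   window of text.  The choice of measure, the window size and the
   function name are all assumptions.

```python
def moving_type_token_ratio(tokens, window=50):
    """Return the type-token ratio for every full window of tokens.

    Each value is (distinct tokens in the window) / window, so it lies
    between 1/window and 1.0.  Texts shorter than one window give [].
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if len(tokens) < window:
        return []
    return [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]


sample = "a b a b c d".split()
ratios = moving_type_token_ratio(sample, window=4)
```

   Whether such a series tracks the mood of a writer would have to be
   checked against labeled text; the code only computes the measure.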
      
   [ comp.ai is moderated ... your article may take a while to appear. ]   
      