comp.ai

Gargunzola to tatata9...@gmail.com

Re: Semantic/Conceptual Similarity of Tw

09 Aug 10 02:38:21

   From: flamingpoodle@gmail.com   

   On Jul 14, 11:18 am, blabla12345  wrote:   
   > First, I'm an idiot, that is, I'm completely new to AI.  Here's a   
   > problem I'm trying to solve:   
   > a base text file that has about 5000 words (of 5 or 6k)   
   > and   
   > a small text file that has about 500 to 1000 words (of 0.5 to 1k,   
   > about 10 to 20% of the bigger file)   
   > I need to find how close the second/smaller file is to the bigger/base   
   > file conceptually or semantically.   
   >   
   > I understand there are a lot of praise about LSA or LSI for machine   
   > learning of texts.  In the meantime, I'm thinking totally off my head   
   > or you may call crazy to pose the following question, how difficult it   
   > would be to extract 10 to 20 concepts or key meanings of the big file,   
   > then figure out   
   > what are the 3 to 5 Most Valuable Concepts (MVC) of the 10 or 20   
   > concepts, and then, what's the author's view towards these MVCs,   
   > hence, these concepts become contextual...  Well, it's slightly   
   > easier, better, better accuracy with the smaller file, apply it the   
   > smaller file then...   
   >   
   > Am I out of my mind?   
   >   
   > Thanks for your time.   
   >   

   I found the book Programming Collective Intelligence by Toby Segaran
   an invaluable introduction to AI topics. It gives a good overview of
   the statistical techniques you can use to measure how similar two
   sets of data are. Look at the sections on the Pearson correlation
   and Euclidean distance measures, and also at the section on K-means
   clustering.
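As a rough illustration of the two similarity measures mentioned above, here is a minimal Python sketch that compares two texts by their word frequencies. The function names, the tokenizing regex, and the choice to use relative frequencies (so that the 5000-word file and the 500-word file are on the same scale) are my own assumptions, not code from the book.

```python
import math
import re
from collections import Counter


def word_frequencies(text):
    """Map each word to its share of all words in the text."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


def pearson(freq_a, freq_b):
    """Pearson correlation of two frequency maps over their joint vocabulary."""
    vocab = sorted(set(freq_a) | set(freq_b))
    x = [freq_a.get(w, 0.0) for w in vocab]
    y = [freq_b.get(w, 0.0) for w in vocab]
    n = len(vocab)
    if n == 0:
        return 0.0
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant vector
    return cov / math.sqrt(var_x * var_y)


def euclidean_similarity(freq_a, freq_b):
    """Turn Euclidean distance into a score in (0, 1]; 1 means identical."""
    vocab = set(freq_a) | set(freq_b)
    dist = math.sqrt(
        sum((freq_a.get(w, 0.0) - freq_b.get(w, 0.0)) ** 2 for w in vocab)
    )
    return 1.0 / (1.0 + dist)
```

You would call `word_frequencies` once per file and pass the two results to either scoring function. Both scores only reflect shared word usage, so common words like "the" will dominate unless you filter them out first.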

   K-means tries to partition your texts into K clusters. In theory,
   each cluster represents one of your MVCs. In practice, it's going to
   need a bit more work.
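To make the clustering idea concrete, here is a bare-bones K-means sketch in plain Python. It assumes you have already turned each piece of text (say, each paragraph) into a numeric vector of equal length, for example word frequencies; the function names, the iteration limit, and the fixed random seed are illustrative choices of mine, not the book's implementation.

```python
import random


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(vectors, k, iterations=20, seed=0):
    """Return (assignments, centroids) for a list of numeric vectors."""
    rng = random.Random(seed)
    # Start from k distinct input vectors chosen at random.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignments = None
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the clustering has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids
```

The returned `assignments` list gives a cluster index for each input vector. The result depends on the starting centroids and on the K you pick, which is part of the "bit more work" in practice.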

   These techniques will not give you semantic data; they only measure
   statistical overlap in word usage. Either way, a background in
   statistical techniques would be helpful. If the words overlap, that
   is good evidence of conceptual/semantic overlap too, as
   conceptually related texts often use similar terminology.
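A quick way to check word overlap directly is to compare the two vocabularies as sets. The sketch below is only an illustration: the tiny stopword list is a hypothetical placeholder, and the two scores (Jaccard similarity and a "containment" ratio that suits a small file compared against a much larger one) are my own choice of measures.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real one would be longer.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}


def vocabulary(text):
    """Set of distinct lowercase words in the text, minus stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return {w for w in words if w not in STOPWORDS}


def jaccard(vocab_a, vocab_b):
    """Shared words divided by all distinct words across both texts."""
    union = vocab_a | vocab_b
    return len(vocab_a & vocab_b) / len(union) if union else 0.0


def containment(small_vocab, base_vocab):
    """Fraction of the smaller text's words that also occur in the base text."""
    if not small_vocab:
        return 0.0
    return len(small_vocab & base_vocab) / len(small_vocab)
```

Because the smaller file is only 10 to 20% of the base file, `containment` is probably the more useful number here: Jaccard is pulled down by all the base-file words the small file never had room to use.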

   Hope this helps!   

