
Forums before death by AOL, social media and spammers... "We can't have nice things"

   comp.ai      Awaiting the gospel from Sarah Connor      1,954 messages   

[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]

   Message 1,278 of 1,954   
   Ted Dunning to ArchZhou   
   Re: What is the relation between cluster   
   30 Dec 06 13:45:29   
   
   From: ted.dunning@gmail.com   
      
   ArchZhou wrote:   
   > Both methods are unsupervised learning methods. Clustering
   > tries to group similar objects together, and PLSA can estimate the
   > values of P(z|d), which are the probabilities of topics given the
   > document. The probabilities P(z|d) are also used in clustering methods
   > such as unsupervised naive Bayes.  The cluster a document belongs to
   > is determined by z = argmax_z P(z|d) in clustering methods.
   > A similar approach can be applied to PLSA, so can PLSA be thought of
   > as a clustering method?
   >   
      
   You are in essence correct that a probabilistic clustering algorithm is   
   similar to PLSA and related techniques, but it is very   
   counter-productive to try to coerce PLSA or any similar technique into   
   being a non-probabilistic clustering algorithm by setting the largest   
   of the P(z|d) to 1 and the others to zero.  This loses the   
   representation of ambiguity that is crucial to the success of these   
   techniques.   
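   The loss of ambiguity is easy to see concretely.  A minimal Python
   sketch (the vector values are made up for the example):

   ```python
   import numpy as np

   # A hypothetical P(z|d) for one document: nearly tied between topics 0 and 1.
   p_z_given_d = np.array([0.48, 0.47, 0.05])

   # Hard assignment: set the largest component to 1, the others to 0.
   hard = np.zeros_like(p_z_given_d)
   hard[np.argmax(p_z_given_d)] = 1.0

   print(p_z_given_d)  # the near-tie between topics 0 and 1 is visible
   print(hard)         # [1. 0. 0.] -- the ambiguity is gone
   ```

   After the hard assignment, this document looks exactly like one that
   was unambiguously about topic 0, which is precisely the information
   you wanted to keep.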
      
   A more appropriate view of PLSA (and all of the LSA-inspired methods
   such as LDA and DCA) is that the components of the probability
   vector are the probabilities that a word (or document) has to do with
   each topic.  Obviously, in the case of words, it is not reasonable
   to commit to a single topic.  Moreover, all of these techniques appear
   to encode content as mixtures of topics (z), so it isn't even reasonable
   to coerce very narrowly focussed documents into a single topic.
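   The mixture view is just P(w|d) = sum_z P(z|d) P(w|z).  A small
   sketch with invented numbers (both the topic-word distributions and
   the topic mixture are hypothetical):

   ```python
   import numpy as np

   # Hypothetical topic-word distributions P(w|z): 2 topics over 4 words.
   p_w_given_z = np.array([[0.60, 0.30, 0.05, 0.05],
                           [0.05, 0.05, 0.40, 0.50]])

   # Even a narrowly focussed document gets a mixture, not a single topic.
   p_z_given_d = np.array([0.9, 0.1])

   # PLSA's mixture: P(w|d) = sum_z P(z|d) * P(w|z)
   p_w_given_d = p_z_given_d @ p_w_given_z
   print(p_w_given_d)  # [0.545 0.275 0.085 0.095] -- still a distribution
   ```

   Note that even with a 0.9/0.1 mixture, the minority topic still
   contributes to the word distribution; rounding z to a single topic
   would throw that contribution away.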
      
   There are methods that encode P(z|d) as sparse binary vectors
   so that the appropriate probabilistic computations are still
   approximated satisfactorily.  The idea is that this would give a
   representation better suited to a traditional retrieval system.  I
   haven't seen any system do this very well, however, since the sparse
   binary vectors rarely provide any real performance boost and they
   are beastly hard to generate properly.  My advice is to stick with
   the nice soft, continuous probability vectors and not worry too much
   about interpreting them as clusters.
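   For what it's worth, the binary-vector idea can be sketched like
   this (the threshold and the probability vectors are invented for the
   example; a real system would need a far more careful construction):

   ```python
   import numpy as np

   def sparsify(p, threshold=0.1):
       # Keep only topics whose probability clears the threshold, as a 0/1 vector.
       return (p >= threshold).astype(float)

   p_d1 = np.array([0.50, 0.40, 0.05, 0.05])
   p_d2 = np.array([0.45, 0.05, 0.45, 0.05])

   # Exact "shared topic" probability under independent topic draws:
   exact = float(p_d1 @ p_d2)

   # Binary approximation: count of shared above-threshold topics.
   approx = float(sparsify(p_d1) @ sparsify(p_d2))
   print(exact, approx)  # 0.27 vs 1.0 -- the overlap is only a coarse proxy
   ```

   The binary overlap is cheap to index but, as the numbers show, it
   only coarsely tracks the probabilistic computation it replaces.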
      
   [ comp.ai is moderated ... your article may take a while to appear. ]   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   



(c) 1994,  bbs@darkrealms.ca