Forums before death by AOL, social media and spammers... "We can't have nice things"
|    comp.ai    |    Awaiting the gospel from Sarah Connor    |    1,954 messages    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
|    Message 1,278 of 1,954    |
|    Ted Dunning to ArchZhou    |
|    Re: What is the relation between cluster    |
|    30 Dec 06 13:45:29    |
From: ted.dunning@gmail.com

ArchZhou wrote:
> Both of the two methods are unsupervised learning methods. Clustering
> tries to group similar objects together, and PLSA can estimate the
> values of P(z|d), which are the probabilities of topics given the
> document. The probabilities P(z|d) are also used in clustering methods
> such as unsupervised naive Bayes. The cluster a document belongs to
> is determined by z = argmax_z P(z|d) in clustering methods.
> A similar approach can be applied to PLSA, so can PLSA be thought of
> as a clustering method?

You are in essence correct that a probabilistic clustering algorithm is
similar to PLSA and related techniques, but it is very counter-productive
to try to coerce PLSA or any similar technique into being a
non-probabilistic clustering algorithm by setting the largest of the
P(z|d) to 1 and the others to zero. This loses the representation of
ambiguity that is crucial to the success of these techniques.

A more appropriate view of PLSA (and all of the LSA-inspired methods
such as LDA and DCA) is that the components of the vector of
probabilities are the probabilities that a word|document has to do with
a single topic. Obviously, in the case of words, it is not reasonable
to commit to a single topic. Moreover, all of these techniques appear
to encode content as mixtures of topics (z), so it isn't even reasonable
to coerce documents that are very narrowly focussed.

There are methods that involve encoding P(z|d) as sparse binary vectors
so that all of the appropriate probabilistic computations are
approximated satisfactorily. The thought is that this would give us a
representation more amenable to a traditional retrieval system.
I haven't seen any systems do this very well, however, since the sparse
binary vectors rarely provide any real performance boost and they are
beastly hard to generate properly. My advice is to stick with the nice
soft and continuous probability vectors and don't worry too much about
interpreting them as clusters.

[ comp.ai is moderated ... your article may take a while to appear. ]

--- SoupGate-Win32 v1.05
 * Origin: you cannot sedate... all the things you hate (1:229/2)
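[Editor's note: the point about hard assignment losing ambiguity can be seen in a few lines of NumPy. This is a minimal sketch with made-up P(z|d) vectors, not anything from the thread: two documents that are nearly identical topic mixtures end up in different clusters once each distribution is coerced to its argmax.]

```python
import numpy as np

# Hypothetical P(z|d) distributions over 3 topics for three documents.
# doc_a and doc_b are both mixtures of topics 0 and 1, with slightly
# different emphasis; doc_c is narrowly about topic 2.
p_z_given_d = {
    "doc_a": np.array([0.55, 0.40, 0.05]),
    "doc_b": np.array([0.45, 0.50, 0.05]),
    "doc_c": np.array([0.05, 0.05, 0.90]),
}

def hard_assign(p):
    """Coerce a topic distribution to a one-hot cluster label:
    set the largest P(z|d) to 1 and the others to 0."""
    z = np.zeros_like(p)
    z[np.argmax(p)] = 1.0
    return z

def cosine(u, v):
    """Cosine similarity between two nonzero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Soft vectors: doc_a and doc_b are nearly identical mixtures.
soft_sim = cosine(p_z_given_d["doc_a"], p_z_given_d["doc_b"])

# Hard assignment puts doc_a in cluster 0 and doc_b in cluster 1,
# so their similarity collapses to exactly zero.
hard_sim = cosine(hard_assign(p_z_given_d["doc_a"]),
                  hard_assign(p_z_given_d["doc_b"]))

print(f"soft similarity: {soft_sim:.3f}")  # close to 1
print(f"hard similarity: {hard_sim:.3f}")  # exactly 0
```

The soft vectors keep the fact that both documents are mostly about the same two topics; the one-hot vectors cannot represent "mostly topic 0, partly topic 1" at all, which is the ambiguity Dunning says is crucial to these techniques.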
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
(c) 1994, bbs@darkrealms.ca