From: lmd@manobo.com   
      
   "zzt" wrote in message   
   news:409ed820$1@news.unimelb.edu.au...   
   > I want to cluster many pages that were crawled from the web.
   > But I don't know which clustering algorithm would do it best.
   > Such a clustering algorithm must:
   > (1) be efficient and fast; I am trying to process 10,0000 pages per hour
   > (not including the parsing time)
   > (2) give reasonably good results that a person can understand
   >
   > Can anyone give me some suggestions, or tell me where I can post my
   > request for help?
   >   
   > --   
   > Best Regards!   
   >   
      
   There are many somewhat effective methods of text classification; you can
   search Google for 'semantic classification'.
   But as far as I know, none of them is perfect and all of them require hand
   tuning. As for me, I chose a simple SOM (self-organizing map) with a radial
   neighbourhood to improve performance. That was several years ago, but
   nothing interesting for practical purposes has happened since 1996 :)
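
   To make the idea concrete, here is a minimal sketch of a SOM with a radial
   (Gaussian) neighbourhood, in plain Python. It is not the Radial-SOM code
   mentioned in this post: the names train_som and best_unit, the grid size
   and the linear decay schedules are all assumptions chosen for
   illustration, and "radial neighbourhood" is read here as an update weight
   that depends only on grid distance from the winning unit.

```python
import math
import random


def best_unit(grid, x):
    """Return (row, col) of the unit whose weights are closest to x."""
    best, best_d = (0, 0), float("inf")
    for r, row in enumerate(grid):
        for c, w in enumerate(row):
            d = sum((a - b) ** 2 for a, b in zip(x, w))
            if d < best_d:
                best, best_d = (r, c), d
    return best


def train_som(vectors, rows, cols, epochs=20, lr0=0.5, radius0=None, seed=0):
    """Train a rows x cols SOM; grid[r][c] is the weight vector of a unit."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    radius0 = radius0 or max(rows, cols) / 2.0
    grid = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
            for _ in range(rows)]
    total = epochs * len(vectors)
    step = 0
    for _ in range(epochs):
        order = vectors[:]
        rng.shuffle(order)
        for x in order:
            frac = step / total
            lr = lr0 * (1.0 - frac)                    # learning rate decays
            radius = max(radius0 * (1.0 - frac), 0.5)  # neighbourhood shrinks
            br, bc = best_unit(grid, x)
            # Truncate the Gaussian at 2 * radius so that only units near
            # the winner are touched (an assumed speed-up, not a requirement).
            reach = int(math.ceil(2 * radius))
            for r in range(max(0, br - reach), min(rows, br + reach + 1)):
                for c in range(max(0, bc - reach), min(cols, bc + reach + 1)):
                    d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-d2 / (2.0 * radius * radius))
                    w = grid[r][c]
                    for i in range(dim):
                        w[i] += lr * h * (x[i] - w[i])
            step += 1
    return grid
```

   After training, documents whose vectors map to the same unit (or to
   adjacent units) can be treated as one cluster. For web pages the input
   vectors would typically be term-weight vectors, one per page.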
      
   For a quick start see http://websom.hut.fi/websom/ and
   http://websom.hut.fi/websom/doc/publications.html.
      
   There are many tricks to improve classification, such as phonosemantic
   classification to improve quality, or vector space reduction to improve
   speed.
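
   As an illustration of vector space reduction, here is a small sketch that
   uses random projection, which is one common way to shrink sparse
   term-count vectors before clustering. The choice of random projection,
   the function names make_projection and project, and the +1/-1 entries are
   assumptions for the example, not a description of any particular tool.

```python
import random


def make_projection(vocab, k, seed=0):
    """Assign every term a fixed random row of k entries, each +1 or -1."""
    rng = random.Random(seed)
    return {term: [rng.choice((-1.0, 1.0)) for _ in range(k)]
            for term in vocab}


def project(doc_counts, projection, k):
    """Map a sparse {term: count} document to a dense k-dimensional vector."""
    out = [0.0] * k
    for term, count in doc_counts.items():
        row = projection.get(term)
        if row is None:
            continue  # terms outside the vocabulary are ignored
        for i in range(k):
            out[i] += count * row[i]
    scale = k ** 0.5  # keeps vector lengths roughly comparable on average
    return [v / scale for v in out]
```

   Each page then costs k numbers instead of one per vocabulary term, so the
   distance computations inside the clustering step become much cheaper.
   Distances between pages are only approximately preserved, and k has to be
   tuned by hand.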
      
   If you are interested, I can try to find my Radial-SOM C source (I last
   saw it several weeks ago). It would be easy to add some clustering
   functionality to it.
   Or you can just download Kohonen's SOM toolkit; it is slow and very
   complicated compared to the one I wrote, but it works :)
      
      