Forums before death by AOL, social media and spammers... "We can't have nice things"
|    comp.ai    |    Awaiting the gospel from Sarah Connor    |    1,954 messages    |
[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]
|    Message 535 of 1,954    |
|    NickName to All    |
|    ID3 entropy calculation question with th    |
|    30 Dec 04 09:20:23    |
From: dadada@rock.com

Happy holidays to you all!

I have a question regarding entropy calculation for a decision tree
using the ID3 algorithm.

To refresh your memory, the sample data (training data) looks like
this:

Day  Outlook   Temp. Humidity Wind    Play Tennis
D1   Sunny     Hot   High     Weak    No
D2   Sunny     Hot   High     Strong  No
D3   Overcast  Hot   High     Weak    Yes
D4   Rain      Mild  High     Weak    Yes
D5   Rain      Cool  Normal   Weak    Yes
D6   Rain      Cool  Normal   Strong  No
D7   Overcast  Cool  Normal   Weak    Yes
D8   Sunny     Mild  High     Weak    No
D9   Sunny     Cool  Normal   Weak    Yes
D10  Rain      Mild  Normal   Strong  Yes
D11  Sunny     Mild  Normal   Strong  Yes
D12  Overcast  Mild  High     Strong  Yes
D13  Overcast  Hot   Normal   Weak    Yes
D14  Rain      Mild  High     Strong  No

I believe I have some rudimentary understanding of the information
theory:

entropy(S) = -(p1*log2(p1) + ... + pn*log2(pn))
gain(S, A) = entropy(S) - sum over values v of A of (|Sv|/|S|)*entropy(Sv)

No problem with the first "pass" for the entropies (and hence the
gains), such as

HUMIDITYhigh   = -(3/7)*log2(3/7) - (4/7)*log2(4/7) = 0.985
HUMIDITYnormal = -(6/7)*log2(6/7) - (1/7)*log2(1/7) = 0.592
WindWeak   = 0.811
WindStrong = 1
...

And Gain4Outlook = 0.246 (the highest among the four attributes).

So we pick OUTLOOK as the first attribute. OUTLOOK has 3 values:
Sunny, Overcast and Rain. Let's start with Sunny.

Entropy for [D1,D2,D8,D9,D11] = entropySet4Sunny = 0.970 (same as the
lecture material).

However, at the second "pass", the entropy calculation "threw" me off,
in the sense that my result differs from ID3 lectures by several
different institutions (their entropies for the second "pass" all
agree, so I must be the one who's wrong). So the question is: what
went wrong?
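In case it helps anyone check the arithmetic, here are the first-pass
numbers reproduced in a few lines of Python (the function and variable
names are just my own labels, not from any ID3 library):

```python
import math
from collections import Counter

# PlayTennis training data: (Outlook, Temp, Humidity, Wind, PlayTennis)
DATA = [
    ("Sunny",    "Hot",  "High",   "Weak",   "No"),
    ("Sunny",    "Hot",  "High",   "Strong", "No"),
    ("Overcast", "Hot",  "High",   "Weak",   "Yes"),
    ("Rain",     "Mild", "High",   "Weak",   "Yes"),
    ("Rain",     "Cool", "Normal", "Weak",   "Yes"),
    ("Rain",     "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Weak",   "Yes"),
    ("Sunny",    "Mild", "High",   "Weak",   "No"),
    ("Sunny",    "Cool", "Normal", "Weak",   "Yes"),
    ("Rain",     "Mild", "Normal", "Strong", "Yes"),
    ("Sunny",    "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High",   "Strong", "Yes"),
    ("Overcast", "Hot",  "Normal", "Weak",   "Yes"),
    ("Rain",     "Mild", "High",   "Strong", "No"),
]

def entropy(labels):
    """Entropy of a list of class labels (0*log2(0) never arises:
    Counter only yields counts > 0)."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, attr):
    """Information gain of splitting rows on the attribute at index attr."""
    g = entropy([r[-1] for r in rows])
    for v in set(r[attr] for r in rows):
        sub = [r[-1] for r in rows if r[attr] == v]
        g -= len(sub) / len(rows) * entropy(sub)
    return g

entropy_all  = entropy([r[-1] for r in DATA])            # approx 0.940
gain_outlook = gain(DATA, 0)                             # approx 0.247
                                                         # (0.246 if you round
                                                         # intermediate values)
sunny = [r[-1] for r in DATA if r[0] == "Sunny"]
entropy_sunny = entropy(sunny)                           # approx 0.971
```

Outlook comes out with the highest gain of the four attributes, which
matches the first pass above.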
Here's my calculation:

HUMIDITYhigh2   = -(0/5)*log2(0/5) - (3/5)*log2(3/5)
                = -0 - (3/5)*log2(3/5)
                = 0.442
HUMIDITYnormal2 = -(2/5)*log2(2/5) - (0/5)*log2(0/5)
                = -(2/5)*log2(2/5) - 0
                = 0.528

BUT the "lecture material" reads

Gain(Ssunny, Humidity) = 0.970 - (3/5)*0.0 - (2/5)*0.0 = 0.970

implying that HUMIDITYhigh2 is 0 and HUMIDITYnormal2 is 0 as well.
How did they do their entropy calculation here, or what did I do
wrong?

Or is there a rule that goes like "if one 'sub element' is zero then
the entropy is zero", as in the first "pass":

OUTLOOKovercast = -(4/4)*log2(4/4) - 0
                = 0 (the second "sub element" is zero)

?

Another question: the tree looks like this

            OUTLOOK
           /   |    \
      sunny overcast rain
        /   Y (stop)   \
   HUMIDITY            WIND
     ...                ...

For the OUTLOOK --> overcast branch, does it stop there because it is
(4+, 0-)? Or because the entropy at that point is 0? Or ...?
I'm just getting started.

TIA.

D

[ comp.ai is moderated. To submit, just post and be patient, or if ]
[ that fails mail your article to                                  ]
(c) 1994, bbs@darkrealms.ca