[ home   bbs   files   messages ]

Forums before death by AOL, social media and spammers... "We can't have nice things"

   comp.ai      Awaiting the gospel from Sarah Connor      1,954 messages   

[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]

   Message 1,222 of 1,954   
   Ted Dunning to David   
   Re: Entropy in java   
   28 Oct 06 00:08:48   
   
   From: ted.dunning@gmail.com   
      
   David wrote:   
   > Hi,   
   >   
   > I am just wondering is there any Java implementation of the information
   > entropy used in C4.5?   
   >   
   > regards,   
   >   
   >  David   
   >   
      
   I don't know exactly what C4.5 uses, but here are some functions for   
   related measures.  The LLR measure is as I described in my 1993 paper   
   in Comp. Linguistics, with a Poisson version thrown in.  Entropy is   
   Shannon entropy based on maximum likelihood estimators.   
      
      
      
   /**
    * Copyright 2004, Ted Dunning
    * Distribution allowed under GPL v2.  Other terms available on request.
    * @author Ted Dunning
    */
   package com.tdunning.stat;

   import Jama.Matrix;
      
   /**   
    * Defines several matrix functions related to Shannon entropy.   
    */   
   public class Entropy {   
       /**   
        * Returns the generalized log-likelihood ratio test for row and   
        * column independence for a contingency table k.  Typically, k   
        * represents observed counts for multinomial processes.  Each row   
        * represents a separate process and each column represents a   
        * symbol.  This test gives an indication of how likely these   
        * observations were to have come from an identical multinomial   
        * distribution.  The output in the case that the observations   
        * come from the same distribution is asymptotically chi^2   
        * distributed with (n-1) (m-1) degrees of freedom where n is the   
        * number of rows and m the number of columns.   
        * @param k   
        * @return The result of the llr test.   
        */   
       public static double llrMultinomial(Matrix k) {
           return 2 * (entropy(k.rowSum()) + entropy(k.columnSum()) - entropy(k));
       }
      
       /**
        * Returns a signed version of the generalized log-likelihood ratio
        * test for a two-dimensional table of counts.  Each row contains
        * counts for a different context in which some condition might hold;
        * each column contains the number of times the condition held or not.
        * The result of this test is positive if the first row of the table
        * had more times where the condition held than would be expected
        * compared to the second row.
        *
        * Fractional counts are allowed because the counts may have been
        * estimated by some odd process rather than actually counted.
        * @param k1a             Number of times condition or event "a" occurred in context 1.
        * @param k1notA          Number of times something other than "a" occurred in context 1.
        * @param k2a             Number of times condition or event "a" occurred in context 2.
        * @param k2notA          Number of times something other than "a" occurred in context 2.
        * @return A signed score that is positive when "a" occurs more often
        * in context 1 than in context 2.  If the probability of getting "a"
        * is the same in either context and whatever difference we see is due
        * purely to chance, then this score will be asymptotically normally
        * distributed with mean 0 and variance 1.
        */
       public static double signedLlrFrequencyTest(double k1a, double k1notA, double k2a, double k2notA) {
           double llr = Math.sqrt(llrMultinomial(new Matrix(new double[]{k1a, k2a, k1notA, k2notA}, 2)));
           if (k1a / (k1a + k1notA) > k2a / (k2a + k2notA)) {
               return llr;
           } else {
               return -llr;
           }
       }
      
       /**
        * Returns the generalized log-likelihood ratio test for a
        * contingency table k with row times in the last column.
        * Typically, k represents observed counts for mixed Poisson
        * processes.  Each row represents a separate process and each
        * column represents a symbol except for the last, which contains
        * the total observation time for the row.  This test gives an
        * indication of how likely these observations were to have come
        * from an identical mixed Poisson distribution.  This test can be
        * used on normal Poisson processes by giving it a matrix with
        * exactly two columns.  The output in the case that the
        * observations are taken from the same mixed Poisson distribution
        * is chi^2 distributed with (n-1) (m-1) degrees of freedom where
        * n is the number of rows and m is the number of columns in the
        * matrix (note that the last column holds times rather than counts).
        * @param k
        * @return The result of the llr test.
        */
       public static double llrPoisson(Matrix k) {
           int rows = k.getRowDimension();
           int cols = k.getColumnDimension();
           Matrix counts = k.getMatrix(0, rows-1, 0, cols-2);
           Matrix times = k.getMatrix(0, rows-1, cols-1, cols-1);
           return 2 * (entropy(times) + entropy(k.columnSum()) - entropy(counts));
       }
      
       /**
        * Computes the entropy of a matrix.  Strictly speaking, this
        * returns sum_ij k_ij log p_ij where p_ij = k_ij / sum_ij k_ij.
        * If sum_ij k_ij = 1, then this does no harm and it is convenient
        * for other computations.
        *
        * @param k
        * @return The entropy in base e.  Divide by log(2) to get bits.
        */
       public static double entropy(Matrix k) {
           double total = k.columnSum().rowSum().get(0, 0);

           if (k.columnMin().rowMin().get(0, 0) < 0) {
               throw new IllegalArgumentException("Cannot have negative count");
           }
           if (total <= 0) {
               throw new IllegalArgumentException("Must have non-zero count");
           }

           // normalize p and add one where zero.  Adding one will have no effect since we
           // multiply by the original value later.  Doing this just avoids log(0) errors.
           Matrix p = k.times(1/total).plusEquals(k.lessEqual(0.0));

           return -k.arrayTimes(p.log()).rowSum().columnSum().get(0, 0);
       }
   }   
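   [ Editor's note: the listing appears to rely on extensions to Jama's
   Matrix class -- rowSum, columnSum, log, lessEqual, columnMin and rowMin
   are not in the stock Jama API -- so it will not compile against plain
   Jama as posted.  For readers who only want the statistic itself, here
   is a rough dependency-free sketch of the same G^2 measure for the
   common 2x2 case; the class and method names below are ours, not from
   the post. ]

   ```java
   public class LlrSketch {
       // Unnormalized "entropy" as in the post: -sum_i k_i * ln(k_i / total).
       // Zero counts contribute nothing, which sidesteps log(0).
       static double entropy(double... k) {
           double total = 0;
           for (double x : k) total += x;
           double h = 0;
           for (double x : k) {
               if (x > 0) h -= x * Math.log(x / total);
           }
           return h;
       }

       // G^2 = 2 * (H(row sums) + H(column sums) - H(cells)) for the table
       // [[k11, k12], [k21, k22]]; asymptotically chi^2 with 1 degree of freedom
       // under row/column independence.
       static double llr2x2(double k11, double k12, double k21, double k22) {
           return 2 * (entropy(k11 + k12, k21 + k22)
                     + entropy(k11 + k21, k12 + k22)
                     - entropy(k11, k12, k21, k22));
       }

       public static void main(String[] args) {
           System.out.println(llr2x2(10, 10, 10, 10));   // independent table
           System.out.println(llr2x2(100, 0, 0, 100));   // strongly dependent table
       }
   }
   ```

   [ For the uniform table (10, 10, 10, 10) the statistic is exactly 0; for
   the fully dependent table (100, 0, 0, 100) it works out to 400 * ln 2,
   about 277.26. ]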
      
   [ comp.ai is moderated ... your article may take a while to appear. ]   
      
   --- SoupGate-Win32 v1.05   
    * Origin: you cannot sedate... all the things you hate (1:229/2)   

[   << oldest   |   < older   |   list   |   newer >   |   newest >>   ]


(c) 1994,  bbs@darkrealms.ca