|    comp.ai    |    Awaiting the gospel from Sarah Connor    |    1,954 messages    |
|    Message 1,222 of 1,954    |
|    Ted Dunning to David    |
|    Re: Entropy in java    |
|    28 Oct 06 00:08:48    |
   
   From: ted.dunning@gmail.com   
      
   David wrote:   
   > Hi,   
   >   
   > I am just wondering, is there any Java implementation of the information   
   > entropy used in C4.5?   
   >   
   > regards,   
   >   
   > David   
   >   
      
   I don't know exactly what C4.5 uses, but here are some functions for   
   related measures. The LLR measure is as I described in my 1993 paper   
   in Comp. Linguistics, with a Poisson version thrown in. Entropy is   
   Shannon entropy based on maximum likelihood estimators.   
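   To make the identity concrete before the class itself: the test computes
   G^2 = 2 * (H(row sums) + H(column sums) - H(cells)), where H is the
   unnormalized -sum k log(k / total). A toy version on plain arrays, no
   matrix library needed (all names here are mine, just for illustration):

```java
public class LlrSketch {
    // Unnormalized "entropy" of raw counts: -sum c * log(c / total).
    static double h(double... counts) {
        double total = 0;
        for (double c : counts) total += c;
        double sum = 0;
        for (double c : counts) {
            if (c > 0) sum += c * Math.log(c / total);
        }
        return -sum;
    }

    // G^2 = 2 * (H(row sums) + H(column sums) - H(cells)) for the
    // 2x2 table [[a, b], [c, d]].
    public static double llr2x2(double a, double b, double c, double d) {
        return 2 * (h(a + b, c + d) + h(a + c, b + d) - h(a, b, c, d));
    }

    public static void main(String[] args) {
        System.out.println(llr2x2(10, 20, 30, 60));   // near zero: rows are proportional
        System.out.println(llr2x2(100, 10, 10, 100)); // large: rows differ strongly
    }
}
```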
      
      
      
   /**
    * Copyright 2004, Ted Dunning
    * Distribution allowed under GPL v2. Other terms available on request.
    * @author Ted Dunning
    */
   package com.tdunning.stat;

   import Jama.Matrix;

   /**
    * Defines several matrix functions related to Shannon entropy.
    */
   public class Entropy {
       /**
        * Returns the generalized log-likelihood ratio test for row and
        * column independence for a contingency table k. Typically, k
        * represents observed counts for multinomial processes. Each row
        * represents a separate process and each column represents a
        * symbol. This test gives an indication of how likely these
        * observations were to have come from an identical multinomial
        * distribution. In the case that the observations come from the
        * same distribution, the output is asymptotically chi^2
        * distributed with (n-1)(m-1) degrees of freedom, where n is the
        * number of rows and m is the number of columns.
        * @param k The contingency table of observed counts.
        * @return The result of the llr test.
        */
       public static double llrMultinomial(Matrix k) {
           return 2 * (entropy(k.rowSum()) + entropy(k.columnSum()) - entropy(k));
       }

       /**
        * Returns a signed version of the generalized log-likelihood ratio
        * test for a two-dimensional table of counts. Each row contains
        * counts for a different context in which some condition might
        * hold; each column contains the number of times the condition
        * held or not. The result of this test is positive if the first
        * row of the table had more times where the condition held than
        * would be expected compared to the second row.
        *
        * Fractional counts are allowed because the counts may have been
        * estimated by some odd process rather than actually counted.
        * @param k1a Number of times condition or event "a" occurred in context 1.
        * @param k1notA Number of times something other than "a" occurred in context 1.
        * @param k2a Number of times condition or event "a" occurred in context 2.
        * @param k2notA Number of times something other than "a" occurred in context 2.
        * @return A signed score that is positive when "a" occurs more
        * often in context 1 than in context 2. If the probability of
        * getting "a" is the same in either context and whatever
        * difference we see is due purely to chance, then this score will
        * be asymptotically normally distributed with mean 0 and variance 1.
        */
       public static double signedLlrFrequencyTest(double k1a, double k1notA,
                                                   double k2a, double k2notA) {
           double llr = Math.sqrt(llrMultinomial(
                   new Matrix(new double[]{k1a, k2a, k1notA, k2notA}, 2)));
           if (k1a / (k1a + k1notA) > k2a / (k2a + k2notA)) {
               return llr;
           } else {
               return -llr;
           }
       }

       /**
        * Returns the generalized log-likelihood ratio test for a
        * contingency table k with row times in the last column.
        * Typically, k represents observed counts for mixed Poisson
        * processes. Each row represents a separate process and each
        * column represents a symbol, except for the last, which contains
        * the total observation time for the row. This test gives an
        * indication of how likely these observations were to have come
        * from an identical mixed Poisson distribution. This test can be
        * used on normal Poisson processes by giving it a matrix with
        * exactly two columns. In the case that the observations are
        * taken from the same mixed Poisson distribution, the output is
        * chi^2 distributed with (n-1)(m-1) degrees of freedom, where n
        * is the number of rows and m is the number of columns in the
        * matrix (note that one of these columns is not a count at all).
        * @param k The table of counts with observation times in the last column.
        * @return The result of the llr test.
        */
       public static double llrPoisson(Matrix k) {
           int rows = k.getRowDimension();
           int cols = k.getColumnDimension();
           Matrix counts = k.getMatrix(0, rows - 1, 0, cols - 2);
           Matrix times = k.getMatrix(0, rows - 1, cols - 1, cols - 1);
           return 2 * (entropy(times) + entropy(k.columnSum()) - entropy(counts));
       }

       /**
        * Computes the entropy of a matrix. Strictly speaking, this
        * returns -sum_ij k_ij log p_ij where p_ij = k_ij / sum_ij k_ij.
        * If sum_ij k_ij = 1, then this does no harm and it is convenient
        * for other computations.
        *
        * @param k The matrix of non-negative counts.
        * @return The entropy in base e. Divide by log(2) to get bits.
        */
       public static double entropy(Matrix k) {
           double total = k.columnSum().rowSum().get(0, 0);

           if (k.columnMin().rowMin().get(0, 0) < 0) {
               throw new IllegalArgumentException("Cannot have negative count");
           }
           if (total <= 0) {
               throw new IllegalArgumentException("Must have non-zero count");
           }

           // Normalize p and add one where zero. Adding one has no effect
           // since we multiply by the original value later. Doing this
           // just avoids log(0) errors.
           Matrix p = k.times(1 / total).plusEquals(k.lessEqual(0.0));

           return -k.arrayTimes(p.log()).rowSum().columnSum().get(0, 0);
       }
   }
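   One detail worth calling out is the log(0) trick in entropy(): zero
   counts get their probability bumped to 1 before taking logs, and the
   later multiply by the original zero count wipes the term out, so empty
   cells contribute exactly 0 instead of NaN. A standalone sketch of the
   same trick on a plain array (the naming is mine):

```java
public class EntropyTrick {
    // Entropy in nats of non-negative counts, using the same dodge as
    // above: set p to 1 wherever the count is zero, so v * log(p)
    // contributes exactly 0 for empty cells instead of 0 * -Infinity.
    public static double entropy(double[] k) {
        double total = 0;
        for (double v : k) {
            if (v < 0) throw new IllegalArgumentException("Cannot have negative count");
            total += v;
        }
        if (total <= 0) throw new IllegalArgumentException("Must have non-zero count");
        double sum = 0;
        for (double v : k) {
            double p = v / total + (v <= 0 ? 1 : 0); // 1 where zero: log(1) == 0
            sum += v * Math.log(p);
        }
        return -sum;
    }

    public static void main(String[] args) {
        // Zero cells are handled without NaN or -Infinity:
        System.out.println(entropy(new double[]{2, 2, 0}));
    }
}
```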
      
   [ comp.ai is moderated ... your article may take a while to appear. ]   
      