Estimating Discrete Entropy, Part 2

Sebastian Nowozin - Sat 21 February 2015 -

In the last part we looked at the basic problem of discrete entropy estimation. In this article we will look at a number of proposals for improved estimators.

Miller Correction

In 1955 George Miller proposed a simple correction to the naive plugin estimator \(\hat{H}_N\): add back the leading term of the bias expression as a constant offset, as follows.

$$\hat{H}_M = \hat{H}_N + \frac{K-1}{2n}.$$

This is an improvement over the plugin estimator, but the added offset does not depend on the distribution at all: it is a function only of the alphabet size \(K\) and the sample size \(n\). We can do better.

(A variant of the Miller estimator for the infinite-alphabet case is the so-called Miller-Madow estimator, in which the quantity \(K\) is estimated from the data as well.)
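As a concrete reference, here is a minimal sketch of the plugin and Miller-corrected estimates (Python/NumPy; the function names are mine, and the histogram is assumed to cover the full alphabet of size \(K\)):

    import numpy as np

    def plugin_entropy(counts):
        """Naive plugin estimate of the entropy (in nats) from a histogram of counts."""
        h = np.asarray(counts, dtype=float)
        n = h.sum()
        p = h[h > 0] / n
        return -np.sum(p * np.log(p))

    def miller_entropy(counts):
        """Miller-corrected estimate: plugin plus the (K-1)/(2n) offset."""
        h = np.asarray(counts, dtype=float)
        n, K = h.sum(), len(h)   # K taken as the known alphabet size
        return plugin_entropy(h) + (K - 1) / (2.0 * n)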

Jackknife Estimator

A classic method for bias correction is the jackknife resampling method due to (Quenouille, 1949), although the somewhat catchy name is due to John Tukey. (The jackknife literature is by now classic; a very readable modern summary can be found in DasGupta's book, and an older but still readable introduction is (Miller, 1974).)

In a nutshell, jackknife resampling is used to estimate the bias and variance of an estimator. It is typically simple to implement and often computationally cheaper than the bootstrap. For bias reduction it is often remarkably effective, typically knocking an \(O(n^{-1})\) bias down to \(O(n^{-2})\).
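To make the recipe concrete, here is a small generic sketch of jackknife bias correction, \(\hat{\theta}_J = n\,\hat{\theta} - (n-1)\,\bar{\theta}_{(\cdot)}\), applied to an arbitrary estimator (Python/NumPy; the function and example are my own illustration, not taken from the references):

    import numpy as np

    def jackknife_bias_correct(estimator, x):
        """Jackknife bias-corrected value of `estimator` on the sample `x`."""
        x = np.asarray(x)
        n = len(x)
        theta_full = estimator(x)
        theta_loo = np.array([estimator(np.delete(x, i)) for i in range(n)])
        return n * theta_full - (n - 1) * theta_loo.mean()

    # Example: correcting the biased (1/n) variance estimator recovers the
    # unbiased (1/(n-1)) sample variance.
    rng = np.random.default_rng(0)
    x = rng.normal(size=50)
    print(np.var(x), np.var(x, ddof=1), jackknife_bias_correct(np.var, x))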

The use of jackknife bias estimation to improve entropy estimation was suggested by (Zahl, 1977). The jackknife bias-corrected estimator of the plugin estimator is given as follows.

\begin{equation} \hat{H}_{J} = n \hat{H}_N - (n-1) \hat{H}^{(\cdot)}_N. \label{H:jackknife} \end{equation}

The quantity \(\hat{H}^{(\cdot)}_N\) is the average of the \(n\) estimates obtained by leaving out one observation at a time. Writing \(\mathbb{h}=(h_1,\dots,h_K)\) for the histogram of bin counts on the full sample, and \(\mathbb{h}_{\setminus i}\) for the histogram in which the count of the bin \(k\) with \(X_i=k\) is reduced by one, we have

$$\hat{H}^{\setminus i}_N := \hat{H}_N(\mathbb{h}_{\setminus i}),$$

and the mean of these quantities,

$$\hat{H}^{(\cdot)}_N := \frac{1}{n} \sum_{i=1}^n \hat{H}^{\setminus i}_N.$$

Normally it would be expensive to compute \(n\) leave-one-out estimates. Here, however, two tricks apply: first, because the histogram is a sufficient statistic, only \(K\) distinct holdout estimates need to be computed instead of \(n\); second, the computation can be organized so that each holdout estimate costs only \(O(1)\), by updating the single term \(h_k \log h_k\) of the affected bin. This reduces the overall computation of \((\ref{H:jackknife})\) to \(O(K)\) runtime with no additional memory over the plugin estimate, so the jackknife estimator is essentially as cheap as the plugin estimator.
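Here is a minimal sketch of this \(O(K)\) computation (Python/NumPy, natural-log entropy, histogram of counts as input; the function name and the running-sum update are my implementation choices):

    import numpy as np

    def jackknife_entropy(counts):
        """Jackknife bias-corrected plugin entropy (in nats) from a histogram."""
        h = np.asarray(counts, dtype=float)
        h = h[h > 0]
        n = h.sum()
        S = np.sum(h * np.log(h))                 # sum_k h_k log h_k
        H_full = np.log(n) - S / n                # plugin estimate

        # Leave-one-out estimates: removing one sample from bin k changes only
        # that bin's term in S, so all K holdout estimates cost O(1) each.
        with np.errstate(divide='ignore', invalid='ignore'):
            term = np.where(h > 1, (h - 1) * np.log(h - 1), 0.0)
        S_loo = S - h * np.log(h) + term          # updated sum, one per bin
        H_loo = np.log(n - 1) - S_loo / (n - 1)   # K distinct holdout estimates
        H_dot = np.sum(h * H_loo) / n             # average over the n leave-one-out samples

        return n * H_full - (n - 1) * H_dot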

Grassberger Estimator

Another proposal for an improved estimator is due to Peter Grassberger. In (Grassberger, 2003) he derives two estimators based on an analytic-continuation argument which, I have to admit, is somewhat beyond my grasp. The better of the two estimators is the following:

$$\hat{H}_G = \log n - \frac{1}{n} \sum_{k=1}^K h_k G(h_k),$$

where the logarithm in the count form of the naive estimator, \(\hat{H}_N = \log n - \frac{1}{n} \sum_{k=1}^K h_k \log h_k\), has been replaced by a scalar function \(G\), defined as

$$G(h) = \psi(h) + \frac{1}{2} (-1)^{h} \left(\psi(\frac{h+1}{2}) - \psi(\frac{h}{2})\right).$$

The function \(\psi\) is the digamma function. (The function \(G(h)\) is the solution of \(G(h)=\psi(h)+(-1)^h \int^1_0 \frac{x^{h-1}}{x+1} \textrm{d}x\) given as equation \((30)\) in the paper.) Computationally this estimator is almost as efficient as the original plugin estimator, because for integer arguments the digamma function can be accurately approximated by an efficient series expansion.
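In code, the estimator is a one-line change to the plugin formula in count form; the following is a minimal sketch (Python with SciPy's digamma; the function name is mine):

    import numpy as np
    from scipy.special import digamma

    def grassberger_entropy(counts):
        """Grassberger (2003) entropy estimate (in nats) from a histogram of counts."""
        h = np.asarray(counts, dtype=float)
        h = h[h > 0]
        n = h.sum()
        # G(h) = psi(h) + 0.5 * (-1)^h * (psi((h+1)/2) - psi(h/2))
        G = digamma(h) + 0.5 * (-1.0) ** h * (digamma((h + 1) / 2) - digamma(h / 2))
        return np.log(n) - np.sum(h * G) / n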

Comparing the per-bin term \(h\, G(h)\) with the plugin term \(h \log h\), we can see an upward correction but also an interesting alternation between even and odd histogram counts.

Figure: \(h \log h\) compared to \(h\, G(h)\).

Although the original derivation in Grassberger's paper is quite involved, for practical purposes the 2003 Grassberger estimator is probably the most useful and robust among the computationally efficient estimators.

Experiment

The following plots show a simple evaluation of some popular discrete entropy estimators. We assume a categorical distribution over \(K=64\) outcomes, with the probability vector \(\mathbb{p}\) sampled from a symmetric Dirichlet distribution with hyperparameter \(\alpha \in [0.25,5.0]\). We draw \(n=100\) samples from the distribution and estimate the entropy from this sample. For each \(\alpha\) we repeat this procedure 5,000 times to estimate the root mean squared error (RMSE) and bias of each estimator.
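A sketch of this protocol (Python/NumPy; this is my own reconstruction of the setup described above, not the exact code used to produce the plots):

    import numpy as np

    def true_entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    def evaluate(estimator, alpha, K=64, n=100, reps=5000, seed=0):
        """Monte Carlo estimate of the RMSE and bias of `estimator`."""
        rng = np.random.default_rng(seed)
        errors = np.empty(reps)
        for r in range(reps):
            p = rng.dirichlet(np.full(K, alpha))   # sample a categorical distribution
            counts = rng.multinomial(n, p)         # draw n observations as a histogram
            errors[r] = estimator(counts) - true_entropy(p)
        return np.sqrt(np.mean(errors ** 2)), np.mean(errors)   # (RMSE, bias)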

We plot all the estimators discussed above, as well as four additional ones:

  • Hausser-Strimmer estimator, based on a shrinkage estimate;
  • Polynomial estimator due to Vinck et al., which is equivalent to Zhiyi Zhang's estimator but numerically simpler and more stable to evaluate (I mention this only because I have not found it noted elsewhere);
  • Two Bayesian estimators (Bayes and NSB), to be discussed in the next post.

Figure: RMSE and bias experiments for seven different discrete entropy estimators.

These experiments are not fully representative: the Dirichlet prior makes assumptions which may not be satisfied in your application, and in particular the experiment considers the well-sampled regime (\(n > K\)), in which simple bias-correction methods work well (as pointed out to me by Ilya and Jonas). Nevertheless, some clear trends are visible:

  • The Plugin estimator fares badly in both RMSE and bias;
  • The Miller estimator fares less badly, suggesting that the RMSE is driven mainly by bias, although significant errors remain for small values of \(\alpha\);
  • The Bayes estimator fares almost as badly as the Plugin estimator, except at \(\alpha=1/2\); more on this point in the next post;
  • The Jackknife, Grassberger 2003, and NSB estimators provide excellent performance throughout the whole range of \(\alpha\) values;
  • The performance of the Polynomial and Hausser-Strimmer estimators is mediocre.

In the next part we will be looking at Bayesian estimators.

Acknowledgements. I thank Il Memming Park, Jonas Peters, and Ilya Nemenman for reading a draft version of the article and providing very helpful feedback.