In the last two parts (part one, part two) we looked at the problem of entropy estimation and several popular estimators.

In this final article we will take a look at two Bayesian approaches to the problem.

## Bayesian Estimator due to Wolpert and Wolf

The first Bayesian approach to entropy estimation was proposed by David Wolpert and David Wolf in 1995 in their paper "Estimating functions of probability distributions from a finite set of samples", published in Physical Review E, Vol. 52, No. 6, 1995, publisher link, and a longer tech report from 1993.

The idea is simple and elegant Bayesian reasoning: specify a model relating the known observations to the unknown quantity, then compute the posterior distribution over the entropy given the observations.

The model is the following Dirichlet-Multinomial model, assuming a given non-negative vector \(\mathbb{\alpha} \in \mathbb{R}^K_+\),

- \(\mathbb{p} \sim \textrm{Dirichlet}(\mathbb{\alpha})\),
- \(x_i \sim \textrm{Categorical}(\mathbb{p})\), \(i=1,\dots,n\), iid.

If we define, for each bin \(k \in \{1,2,\dots,K\}\) the count

so that \((n_1,n_2,\dots,n_K)\) is a histogram over \(K\) outcomes, which is
distributed according to a multinomial distribution.
Then, due to *conjugacy*, the
posterior over the unknown distribution \(\mathbb{p}\) is again a Dirichlet
distribution and given as

We can now attempt to compute the squared-error optimal point estimate of the entropy under this posterior. One of the main contributions of Wolpert and Wolf is to provide a family of results that enable moment computations of the Shannon entropy under the Dirichlet distribution.

In particular, with \(n = \sum_{k=1}^K n_k\) and \(\alpha = \sum_{k=1}^K \alpha_k\), they provide the posterior mean of the entropy as

where \(\psi\) is the digamma function. This expression is efficient to compute, and similarly the second moment and hence the variance of \(H(p)\) under the posterior can be computed efficiently.

The only open question is how to select the prior vector of \(\mathbb{\alpha}\). In absence of further information about the distribution we can assume symmetry. Then there are four common options,

- \(\alpha_k = 1\), due to Bayes in 1763 and Laplace in 1812.
- \(\alpha_k = 1/K\), due to Perks in 1947.
- \(\alpha_k = 1/2\), due to Jeffreys in 1946 and 1961.
- \(\alpha_k = 0\), due to Haldane in 1948. This yields an improper prior.

It may not be clear which choice is the best, but I found an interesting discussion in a paper by de Campos and Benavoli. Further down in this article we will be better equiped to assess the above choices.

Independent of the choice of the prior parameter Wolpert and Wolf are very optimistic about their model and highlight the advantages that come from the Bayesian approach:

"One of the strength of Bayesian analysis is its power for dealing with such small-data cases. In particular, not only are Bayesian estimators in many respects more 'reasonable' than non-Bayesian estimators for small data, they also naturally provide error bars to govern one's use of their results. ... In addition, the Bayesian formalism automatically tells you when it is unsure of its estimate, through its error bars."

Also, on the empirical performance they comment,

"... for all N the Bayes estimator has a smaller mean-squared error than the frequency-counts estimator."

And indeed, also asymptotically the prior has support for every possible distribution, so consistency of the estimated entropy is guaranteed as \(n\to\infty\).

All good then?

Here is the comparison of the squared error and bias of various Bayes
estimators with different choices of prior \(\alpha\). The plot shows, like in
the previous article, the performance when evaluated on data generated from a
*different* Dirichlet prior. Each value on the x-axis is a different
generating distribution, but the prior of the estimator remains fixed.

While all of the Bayes estimators perform better than the plugin estimator, overall they all fare quite badly: there is a low error and bias only at the matching \(\alpha\) value, but they deteriorate quickly at different values of \(\alpha\).

How can this be the case?

## Nemenman-Shafee-Bialek

In 2002 Nemenman, Shafee, and Bialek recognized that the innocent looking Dirichlet-Multinomial model implies a very concentrated prior belief over the entropy of the distribution:

"Thus a seemingly innocent choice of the prior ... leads to a disaster: fixing \(\alpha\) specifies the entropy almost uniquely. Furthermore, the situation persists even after we observe some data: until the distribution is well sampled, our estimate of the entropy is dominated by the prior!"

### The Implied Beliefs over the Entropy

The following experiment visualizes this: each of the following histograms
shows the *implied prior* over \(H(\mathbb{p})\).
To create each histogram, I fixed \(K\) and \(\alpha\) and take 1,000,000 samples
of distributions \(\mathbb{p}\), then record its entropy.
In each histogram plot the x-axis covers exactly the full range over
possible entropies.

For the case \(K=2\) everything looks fine: the implied prior spreads well over the entire range of possible entropies. But look what happens for \(K=10\) and \(K=100\):

Here, the implied prior clearly concentrates sharply. (The least possible concentration of the entropy can be achieved using Perks choice of \(\alpha = 1/K\).) In fact, there is no choice of \(\alpha\) for which the prior belief over the very quantity to be estimated does not concentrate as \(K \to \infty\). If we have no reason to believe that the entropy really is in the range where the prior dictates it should be, then this is a bad prior.

How did Nemenman, Shafee, and Bialek solve this problem?

## NSB estimator

They construct a mixture-of-Dirichlet prior by defining a hyperprior on \(\alpha\) itself. The hyperprior \(P(\alpha)\) is chosen such that

Let us take a look at how this can be derived. Nemenman and collaborators first show that under the Dirichlet-Multinomial model the expected entropy is a strictly monotonic continuous function in \(\alpha\), and therefore it is invertible. Let us define the shorthand \(g^{-1}(\alpha) := \mathbb{E}[H|\alpha]\) as the function that takes \(\alpha\) to the expected entropy. Now, by the transformation formula for random vairables, we have the induced density

If we assume that \(P(H|\alpha)\) is highly concentrated (at least for large \(K\) in the above plots, this holds), then \(P_H(g^{-1}(\alpha)) \approx P(H|\alpha)\), and we want this density to be constant. Hence, we have

Because the right hand side is positive, with \(g^{-1}(\alpha) = \mathbb{E}[H|\alpha]\) this yields exactly the original expression above. This expression has an analytic solution which, properly normalized is

where \(\psi_1\) is the trigamma function.

Let us look at the *implied prior* of the entropy when using the NSB prior.
They are much more uniform now:

This uniformity results in the NSB estimator having excellent robustness properties and small bias. It is probably the best general purpose discrete entropy estimator available. One drawback however is the increased computational cost: in order to compute the estimator we need to solve a 1D integral numerically over \(\alpha\). Each pointwise evaluation of the integrand function corresponds to computing \(\hat{H}_{\textrm{Bayes}}\) for a fixed value of \(\alpha\). High accuracy requires several hundred such evaluations, and this may be prohibitively expensive in some applications (for example, decision tree induction).

### Addendum: Undersampled Regime

After a comment from Ilya Nemenman on the previous version of this article, I also did an experiment in the undersampled regime (\(N < K\)), where we observe fewer outcomes than there are bins. I am glad I did perform this experiment!

I select \(N=100\) and \(K=2000\), with \(500\) replicates and compare the same methods as in the second part of the article. The results are as follows.

Almost all estimators perform very poorly in this setting, with the naive Miller correction even being off the chart. Only the NSB and the Hausser-Strimmer estimator can be considered usable in this severely undersampled regime, with clear preference towards the NSB estimator.

Ilya Nemenman, the inventor of the NSB estimator, was kind enough to share his feedback on these experiments with me and to allow me to post them here:

I am glad to hear that NSB estimator did well on this test. It's also not surprising that HS estimator did rather well too -- in some sense, it's a frequentist version of NSB. Both NSB and HS perform shrinking towards the uniform distribution (infinite pseudocounts or "alpha" in your notation), and then they lift the shrinkage as \(N\) grows. However, HS shrinks much stronger than NSB does. As a result, HS performs very well for large entropy (large alpha) distributions, and worse for lower entropies. It's probably possible to set up a frequentist shrinkage estimator that would shrink towards entropy being half of the maximum value, or shrink towards the maximum value, but less strongly than HS — I think that such an estimator would do better over the whole range of alpha. In practice, the strong shrinkage imposed by HS becomes problematic when the alphabet size is very large, say \(2^{150}\), which is what one gets when one takes a 30ms long spike train and discretizes it at 0.2 ms resolution (yes spike = 1, no spike =0). We had numbers like this in our 2008 PLoS Comp Bio paper. With entropy of \(\approx 15\) bits, alphabet size of \(2^{150}\), and 100-1000 samples, NSB may work (more on this below), and HS will shrink towards 150 bits, and will likely overestimate. One way to see this problem is to realize that, in your comparison plots, once you use \(\alpha > 1\), the entropy is nearly the maximum possible entropy. This is why HS works well there, but fails for \(\alpha \ll 1\), where the entropy is substantially smaller than the maximum. If you were to replot the data putting the true entropy of the analyzed probability distribution (rather than alpha) on the horizontal axis, this will be visible, I think.

He continues,

A key point for both NSB and HS is that both may work in the regime of \(N \sim \sqrt{K}\) (better yet, \(\sim \sqrt{2^{H/2}}\)). On the contrary, most other estimators you analyzed work well only up to \(N \sim 2^H\) (unless I am missing something important). This is because NSB and HS require not good sampling of the underlying distribution, but coincidences in the data only. They estimate entropy, effectively, by inverting the usual birthday paradox, and using the frequency of coincidences to measure the diversity of data. One can illustrate this by pushing \(K\) to even larger values in your last plot, 10000 or even more, if you limit yourself to smaller alpha.

These comments are very insightful and show that my earlier discussion and results were, in a way, limited to the simple case where we have a reasonable number of samples per bin. The case Ilya considers in his work is the severely undersampled regime.

One difficulty in producing the plot he suggests that plots the entropy of the distribution along the x-axis is that it would require an additional binning operation along that axis, so I have not produced this plot yet.

### Reference Prior, Anyone?

I wonder whether the NSB prior is a simplification of a full reference prior treatment. This is not exactly the standard setting of reference priors because we are interested in a function (the entropy) of our random variables, so there is an additional indirection. But I believe it could work as follows: find in the space of all priors on \(\alpha\) the prior that maximizes the KL divergence between implied entropy prior and entropy posterior.

Using the numerical method suggested in the paper above, I obtained a numerical reference prior (with one additional ABC approximation for the likelihood) for \(K=2\) and this closely matches the NSB prior.

(Interestingly, I recently discovered this work
on overall objective priors
in which their *hierarchical reference prior* approach for the
Dirichlet-Multinomial model yields an analytic proper prior which is very
similar to the NSB and numerical reference priors.)

### Further Reading

As you hopefully have noticed, the problem of discrete entropy estimation is quite rich and still actively being worked on. Current work focuses on the case where the distribution is countably infinite. For example, the probability distribution of English words in popular usage is an example: there are infinitely many possible words, but a total lexical order implies a countable discrete distribution.

A great up-to-date overview of discrete entropy estimation, including a summary of the current work on the countable infinite case is given in this article by Memming Park.

To a general introduction to the difficulties of the entropy estimation problem, this 2003 paper by Liam Paninski is still the best entry point. Another very nice overview on entropy estimation is due to Thomas Schürmann in 2004. To me the best introduction to the family of Bayesian estimators is (Archer, Park, Pillow, 2013).

If you wonder why I care about entropy estimation, then, my ICML 2012 paper was the application that originally led me to consider the problem.

*Acknowledgements*. I thank Il Memming Park,
Jonas Peters, and
Ilya Nemenman
for reading a draft version of the article and providing very helpful feedback.