The Entropy of a Normal Distribution

Sebastian Nowozin - Sat 13 June 2015 -

The multivariate normal distribution is one of the most important probability distributions for multivariate data. In this post we will look at the entropy of this distribution and how to estimate the entropy given an iid sample.

For a multivariate normal distribution in \(k\) dimensions with mean vector \(\mathbf{\mu} \in \mathbb{R}^k\) and covariance matrix \(\mathbf{\Sigma}\) we have the density function

$$f(\mathbf{x};\mathbf{\mu},\mathbf{\Sigma}) = \frac{1}{\sqrt{(2\pi)^k |\mathbf{\Sigma}|}} \exp\left(-\frac{1}{2} (\mathbf{x}-\mathbf{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x}-\mathbf{\mu})\right).$$

For this density, the differential entropy takes the simple form

\begin{equation} H = \frac{k}{2} + \frac{k}{2} \log(2\pi) + \frac{1}{2} \log |\mathbf{\Sigma}|.\label{eqn:Hnormal} \end{equation}
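
As a quick sanity check, this closed form is easy to evaluate numerically. Here is a minimal one-line Julia sketch (the function name normal_entropy is mine) that computes the entropy in nats for a positive definite \(\mathbf{\Sigma}\):

using LinearAlgebra       # logdet

# Differential entropy (in nats) of a normal distribution with covariance Σ
normal_entropy(Σ) = 0.5*size(Σ,1)*(1 + log(2π)) + 0.5*logdet(Σ)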

In practice we are often provided with a sample

$$\mathbf{x}_i \sim \mathcal{N}(\mathbf{\mu},\mathbf{\Sigma}), \quad i=1,\dots,n,$$

without knowledge of either \(\mathbf{\mu}\) or \(\mathbf{\Sigma}\). We are then interested in estimating the entropy of the distribution from the sample.

Plugin Estimator

The simplest method to estimate the entropy is to first estimate the mean as the empirical mean,

$$\hat{\mathbf{\mu}} = \frac{1}{n} \sum_{i=1}^n \mathbf{x}_i,$$

and the sample covariance as

$$\hat{\mathbf{\Sigma}} = \frac{1}{n-1} \sum_{i=1}^n (\mathbf{x}_i - \hat{\mathbf{\mu}}) (\mathbf{x}_i - \hat{\mathbf{\mu}})^T.$$

Given these two estimates we simply use equation \((\ref{eqn:Hnormal})\) on \(\mathcal{N}(\hat{\mathbf{\mu}},\hat{\mathbf{\Sigma}})\). (We can also use \(\mathcal{N}(\mathbf{0},\hat{\mathbf{\Sigma}})\) instead as the entropy is invariant under translation.)

This is called a plugin estimate because we first estimate parameters of a distribution, then plug these into the analytic expression for the quantity of interest.
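
In Julia this could look as follows; a minimal sketch (the function name entropy_plugin is mine), using the \(1/(n-1)\) sample covariance from above.

using LinearAlgebra       # logdet
using Statistics          # mean

function entropy_plugin(X)
    # X is a (k,n) matrix, samples in columns
    k, n = size(X)
    xbar = vec(mean(X, dims=2))      # empirical mean
    Xc = X .- xbar                   # centered samples
    Σhat = Xc*Xc' / (n-1)            # sample covariance
    0.5*k*(1 + log(2π)) + 0.5*logdet(Σhat)
end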

It turns out that the plugin estimator systematically underestimates the true entropy and that one can use improved estimators. This is not special; plugin estimates are often biased or otherwise deficient. In the case of estimating the entropy of an unknown normal distribution, however, the known results are especially beautiful. In particular,

  • there exist unbiased estimators,
  • there exists an estimator that is a uniformly minimum variance unbiased estimator (within a restricted class, see below),
  • this estimator is also a (generalized) Bayes estimator under squared loss, with an improper prior distribution.

Hence, for this case, a single estimator is satisfactory from both a Bayesian and frequentist viewpoint, and moreover it is easily computable.

We will look at this estimator, but let us first look at earlier work that studies a simpler case.

Ahmed and Gokhale, 1989

A UMVUE estimator for the problem of a zero-mean Normal distribution \(\mathcal{N}(\mathbf{0},\Sigma)\) has been found by (Ahmed and Gokhale, 1989). This is a restricted case: while the entropy does not depend on the mean of the distribution, knowledge of the mean does affect how the covariance matrix is estimated.

For a sample their estimator is

$$\hat{H}_{\textrm{AG}} = \frac{k}{2} \log(e\pi) + \frac{1}{2} \log \left|\sum_{i=1}^n \mathbf{x}_i \mathbf{x}_i^T\right| - \frac{1}{2} \sum_{j=1}^k \psi\left(\frac{n+1-j}{2}\right),$$

where \(\psi\) is the digamma function.

If you know the mean of your distribution (so you can center your data to ensure \(\mu=0\)), this estimator provides a big improvement over the plugin estimate. Here is an example comparing mean squared error and bias, where \(\Sigma \sim \textrm{Wishart}(\nu,I_k)\) and \(\mathbf{x}_i \sim \mathcal{N}(\mathbf{0},\Sigma)\), with \(k=3\) and \(n=20\). The plot below shows a Monte Carlo result with \(80,000\) replicates.

[Figure: mean squared error and bias of the Ahmed-Gokhale estimator versus the plugin estimator.]

As promised, we can observe a big improvement over the plugin estimate, and we also see that the Ahmed-Gokhale estimator is indeed unbiased.

Here is a Julia implementation.

using LinearAlgebra       # logdet
using SpecialFunctions    # digamma

function entropy_ag(X)
    # X is a (k,n) matrix, samples in columns
    k, n = size(X)
    C = X*X'              # sum of outer products, sum_i x_i x_i'
    H = 0.5*k*(1.0 + log(pi)) + 0.5*logdet(C)
    for i = 1:k
        H -= 0.5*digamma(0.5*(n+1-i))
    end
    H
end

Because the case of a known mean is perhaps less interesting, we go straight to the general case.

Misra, Singh, and Demchuk, 2005

In (Misra, Singh, and Demchuk, 2005) (here is the PDF) the authors do a thorough job of analyzing the general case. Besides a detailed bias and risk analysis, the paper proposes two estimators for the general case:

  • A UMVUE estimator in a restricted class of estimators, which is a slight variation of the Ahmed and Gokhale estimator;
  • A shrinkage estimator in a larger class, which is proven to dominate the UMVUE estimator in the restricted class.

The authors are apparently unaware of the work of Ahmed and Gokhale. For their UMVUE estimator \(\hat{H}_{\textrm{MSD}}\) they use the matrix

$$S = \sum_{i=1}^n (\mathbf{x}_i-\hat{\mu})(\mathbf{x}_i-\hat{\mu})^T$$

and define

\begin{equation} \hat{H}_{\textrm{MSD}} = \frac{k}{2} \log(e\pi) + \frac{1}{2} \log |S| - \frac{1}{2} \sum_{j=1}^k \psi\left(\frac{n-j}{2}\right). \label{Hmsd} \end{equation}

Can you spot the differences to the Ahmed and Gokhale estimator? There are two: the matrix \(S\) is centered using the sample mean \(\hat{\mu}\), and, to adjust for the use of the sample mean for centering, the argument to the digamma function is shifted by \(1/2\).

Here is a Julia implementation.

using LinearAlgebra       # logdet
using SpecialFunctions    # digamma
using Statistics          # mean

function entropy_msd(X)
    # X is a (k,n) matrix, samples in columns
    k, n = size(X)

    xbar = vec(mean(X, dims=2))      # sample mean
    Xc = X .- xbar                   # centered samples
    S = Xc*Xc'                       # scatter matrix, sum_i (x_i - xbar)(x_i - xbar)'

    res = 0.5*k*(1.0 + log(pi)) + 0.5*logdet(S)
    for i = 1:k
        res -= 0.5*digamma(0.5*(n-i))
    end
    res
end

Outline of Derivation

The key result that is used for deriving both the MSD and the AG estimator is a lemma due to Robert Wijsman from 1957 (PDF).

Wijsman proved a result relating the determinants of two matrices: the covariance matrix \(\Sigma\) of a multivariate Normal distribution, and the empirical outer product matrix \(X X^T\) of a sample \(X \in \mathbb{R}^{k\times n}\) (samples in columns) from that Normal. In Lemma 3 of the above paper he showed that

$$\frac{|X X^T|}{|\Sigma|} \sim \prod_{i=1}^k \chi_{n-i+1}^2,$$

where the \(\chi^2\) random variables on the right-hand side are independent. By taking the logarithm of this relation we can relate the central quantity in the differential entropy, namely \(\log |\Sigma|\), to the log-determinant of the sample outer product matrix.

The sample outer product matrix of a zero-mean multivariate Normal sample with \(n \geq k\) is known to be distributed according to a Wishart distribution, with many known analytic properties. Using the known properties of the Wishart and \(\chi^2\) distributions, one can then derive the AG and MSD estimators and prove their unbiasedness.
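
To make the unbiasedness argument concrete, here is the key expectation step, using the standard identity \(\mathbb{E}[\log \chi^2_\nu] = \psi(\nu/2) + \log 2\). Taking logarithms and expectations in Wijsman's relation gives

$$\mathbb{E}\left[\log |X X^T|\right] = \log |\Sigma| + \sum_{i=1}^k \left(\psi\left(\frac{n-i+1}{2}\right) + \log 2\right),$$

so that

$$\frac{1}{2} \log |X X^T| - \frac{1}{2} \sum_{i=1}^k \psi\left(\frac{n-i+1}{2}\right) - \frac{k}{2} \log 2$$

is an unbiased estimate of \(\frac{1}{2} \log |\Sigma|\). Substituting this into \(H = \frac{k}{2}\log(2\pi e) + \frac{1}{2}\log|\Sigma|\) and absorbing the \(-\frac{k}{2}\log 2\) term into the constant yields exactly the \(\frac{k}{2}\log(e\pi)\) constant of \(\hat{H}_{\textrm{AG}}\).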

Generalized Bayes

Misra, Singh, and Demchuk also show that their MSD estimator is the mean of a posterior that arises from a full Bayesian treatment with an improper prior. This prior is shown (Theorem 2.3 in Misra et al., 2005) to be

$$\pi(\mu,\Sigma) = \frac{1}{|\Sigma|^{(k+1)/2}}.$$

This is a most satisfying result: a frequentist-optimal estimator in a large class of possible estimators is also shown to be a Bayes estimator for a suitable matching prior.

Because the posterior is proper for \(n \geq k\), one could also use the proposed prior to derive posterior credible regions for the entropy, and this is most likely a good choice in that the resulting regions could achieve good coverage properties.
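
As a sketch of how such credible regions could be obtained (this is not worked out in the paper): under the prior above, a standard conjugacy calculation gives the marginal posterior \(\Sigma \,|\, \mathbf{x}_1,\dots,\mathbf{x}_n \sim \textrm{Inverse-Wishart}(n-1, S)\), with \(S\) the centered scatter matrix from above, so one can sample \(\Sigma\) from the posterior and evaluate the closed-form entropy on each draw. The function name below is mine.

using LinearAlgebra, Statistics
using Distributions       # InverseWishart

# Sketch: posterior samples of the entropy under the improper prior
# π(μ,Σ) ∝ |Σ|^(-(k+1)/2), assuming the standard conjugacy result
# Σ | data ~ InverseWishart(n-1, S).
function entropy_posterior_samples(X, nsamples=10_000)
    k, n = size(X)
    xbar = vec(mean(X, dims=2))
    Xc = X .- xbar
    S = Xc*Xc'                                        # centered scatter matrix
    post = InverseWishart(n - 1, Matrix(Symmetric(S)))  # symmetrize for numerical safety
    [0.5*k*(1 + log(2π)) + 0.5*logdet(rand(post)) for _ in 1:nsamples]
end

# A 95% credible interval for the entropy:
# quantile(entropy_posterior_samples(X), [0.025, 0.975])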

Brewster-Zidek estimator

Going even further, Misra and coauthors also show that while the MSD estimator is optimal in the class of affine-equivariant estimators, when one enlarges the class of possible estimators there exist estimators which uniformly dominate the MSD estimator by achieving a lower risk.

They propose a shrinkage estimator, termed the Brewster-Zidek estimator, which I give here without further details.

$$\hat{H}_{BZ} = \frac{k}{2} \log(2 e \pi) + \frac{1}{2} \log |S + YY^T| + \frac{1}{2}(\log T - d(T))$$
$$d(r) = \frac{\int_r^1 t^{\frac{n-k}{2}-1} (1-t)^{\frac{k}{2}-1} \left[\log t + k \log 2 + \sum_{i=1}^k \psi\left(\frac{n-i+1}{2}\right) \right] \textrm{d}t}{\int_r^1 t^{\frac{n-k}{2}-1}(1-t)^{\frac{k}{2}-1} \textrm{d}t}$$
$$T = |S| |S+YY^T|^{-1}$$
$$Y = \sqrt{n} \hat{\mu}$$

Here is a Julia implementation using numerical integration for evaluating \(d(r)\).

using LinearAlgebra       # logdet, det
using SpecialFunctions    # digamma
using Statistics          # mean
using QuadGK              # quadgk

function entropy_bz(X)
    # X is a (p,n) matrix, samples in columns
    p, n = size(X)

    # Integrands for the shrinkage correction d(r)
    Bfun(t) = t^((n-p)/2-1) * (1-t)^(p/2-1)
    function Afun(t)
        res = log(t) + p*log(2)
        for i = 1:p
            res += digamma(0.5*(n-i+1))
        end
        res * Bfun(t)
    end
    A(r) = quadgk(Afun, r, 1.0)[1]
    B(r) = quadgk(Bfun, r, 1.0)[1]
    d(r) = A(r) / B(r)

    xbar = vec(mean(X, dims=2))      # sample mean
    Y = sqrt(n)*xbar                 # Y = sqrt(n) * sample mean
    Xc = X .- xbar
    S = Xc*Xc'                       # centered scatter matrix

    T = det(S)/det(S + Y*Y')
    dBZ = logdet(S + Y*Y') - d(T) + log(T)

    0.5*(p*(1 + log(2*pi)) + dBZ)
end

Shoot-out

Remember the zero-mean case? Let us start there. I use \(k=3\) and \(n=20\) as before, and \(\Sigma \sim \textrm{Wishart}(\nu,I_k)\). Samples are then generated as \(\mathbf{x}_i \sim \mathcal{N}(\mathbf{0},\Sigma)\). All numbers are from \(80,000\) replications of the full procedure.

[Figure: comparison of four estimators (zero-mean case) — mean squared error and bias.]

What you can see from the above plot is that the AG estimator, which is UMVUE for this special case, dominates the MSD estimator. Both unbiased estimators are indeed unbiased. In terms of risk the Brewster-Zidek estimator is indistinguishable from the MSD estimator.

Now, what about \(\mu \neq \mathbf{0}\)? Here the simulation setting is as before, but the mean is drawn as \(\mu \sim \mathcal{N}(\mathbf{0},2I)\), so that samples are distributed as \(\mathbf{x}_i \sim \mathcal{N}(\mu,\Sigma)\).

[Figure: comparison of four estimators (general case, \(\mu \neq \mathbf{0}\)) — mean squared error and bias.]

The result shows that the AG estimator becomes useless if its assumption is violated, as is to be expected. (Interestingly, if we were to use the scaled sample covariance matrix \(n \hat{\Sigma}\) with the AG estimator, the result is reasonable but biased; that is, the estimator has lost its UMVUE property.) The MSD and Brewster-Zidek estimators are virtually indistinguishable and appear to be unbiased in this case as well.
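
For reference, here is a rough sketch of the kind of Monte Carlo comparison used above. The Wishart degrees of freedom \(\nu\) and the number of replicates below are illustrative choices, not the exact settings used for the plots, and the function name is mine.

using LinearAlgebra, Statistics
using Distributions       # Wishart, MvNormal

# Estimate bias and mean squared error of an entropy estimator by
# Monte Carlo: draw Σ ~ Wishart(ν, I_k), μ ~ N(0, 2I), a sample of
# size n, and compare the estimate to the true entropy.
function mc_bias_mse(estimator; k=3, n=20, ν=5, reps=10_000)
    Ik = Matrix(1.0I, k, k)
    errs = zeros(reps)
    for r in 1:reps
        Σ = Matrix(Symmetric(rand(Wishart(ν, Ik))))   # symmetrize for numerical safety
        μ = rand(MvNormal(zeros(k), 2*Ik))
        X = rand(MvNormal(μ, Σ), n)                   # (k,n) sample matrix
        Htrue = 0.5*k*(1 + log(2π)) + 0.5*logdet(Σ)
        errs[r] = estimator(X) - Htrue
    end
    mean(errs), mean(errs.^2)                         # (bias, MSE)
end

# For example: mc_bias_mse(entropy_msd) or mc_bias_mse(entropy_bz)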

Conclusion

Estimating the entropy of a multivariate Normal distribution from a sample has a satisfying solution, the MSD estimator \((\ref{Hmsd})\), which can be robustly used in all circumstances. It is computationally efficient, and with sufficient samples, \(n \geq k\), the Bayesian interpretation also provides a proper posterior distribution over \(\mu\) and \(\Sigma\) which can be used to derive a posterior distribution over the entropy.

Acknowledgements. I thank Jonas Peters for reading a draft version of the article and providing feedback.