Effective Sample Size in Importance Sampling

Sebastian Nowozin - Fri 21 August 2015 -

In this article we will look at a practically important measure of efficiency in importance sampling, the so-called effective sample size (ESS) estimate. This measure was proposed by Augustine Kong in 1992 in a technical report that until recently was difficult to locate online; after getting in contact with the University of Chicago I am pleased that the report is now available (again):

  • Augustine Kong, "A Note on Importance Sampling using Standardized Weights", Technical Report 348, PDF, Department of Statistics, University of Chicago, July 1992.

Before we discuss the usefulness of the effective sample size, let us first define the notation and context for importance sampling.

Importance sampling is one of the most generally applicable methods to sample from otherwise intractable distributions. In machine learning and statistics, importance sampling is regularly used for sampling from distributions in low dimensions (say, up to maybe 20 dimensions). The general idea of importance sampling has been extended since the 1950s to the sequential setting, and the resulting class of modern Sequential Monte Carlo (SMC) methods constitutes the state of the art among Monte Carlo methods in many important time series modeling applications.

The general idea of importance sampling is as follows. We are interested in computing an expectation,

$$\mu = \mathbb{E}_{X \sim p}[h(X)] = \int h(x) p(x) \,\textrm{d}x.$$

If we can sample from \(p\) directly, the standard Monte Carlo estimate applies: we draw \(X_i \sim p\), \(i=1,\dots,n\), and use

$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^n h(X_i).$$

In many applications we cannot directly sample from \(p\). In this case importance sampling can still be applied by sampling from a tractable proposal distribution \(q\), with \(X_i \sim q\), \(i=1,\dots,n\), then reweighting the sample using the ratio \(p(X_i)/q(X_i)\), leading to the standard importance sampling estimate

$$\tilde{\mu} = \frac{1}{n} \sum_{i=1}^n \frac{p(X_i)}{q(X_i)} h(X_i).$$

In case \(p\) is known only up to an unknown normalizing constant, the so-called self-normalized importance sampling estimate can be used. Denoting the weights by \(w(X_i) = \frac{p(X_i)}{q(X_i)}\) it is defined as

$$\bar{\mu} = \frac{\frac{1}{n} \sum_{i=1}^n w(X_i) h(X_i)}{ \frac{1}{n} \sum_{i=1}^n w(X_i)}.$$
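As a minimal sketch (my own, not from the original article), here are both estimators in NumPy, using the Student-t target and Normal proposal from the example further below; the integrand \(h(x)=x^2\) is an arbitrary placeholder:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n = 10_000

    p = stats.t(df=8)                    # target density (here fully normalized)
    q = stats.norm(loc=1.0, scale=4.0)   # proposal, reading N(1, 16) as variance 16
    h = lambda x: x**2                   # some integrand of interest

    x = q.rvs(size=n, random_state=rng)  # X_i ~ q
    w = p.pdf(x) / q.pdf(x)              # importance weights p(X_i)/q(X_i)

    mu_tilde = np.mean(w * h(x))             # standard IS estimate
    mu_bar = np.sum(w * h(x)) / np.sum(w)    # self-normalized IS estimate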

The quality of this estimate chiefly depends on how well the proposal distribution \(q\) matches the form of \(p\). Because \(p\) is difficult to sample from, it is typically also difficult to make a precise statement about how well \(q\) approximates it.

The effective sample size addresses this issue: it can be used after or during importance sampling to provide a quantitative measure of the quality of the estimated mean. Even better, the estimate is provided on a natural scale, namely the worth of the sample measured in direct samples from \(p\): if we use \(n=1000\) samples \(X_i \sim q\) and obtain an ESS of, say, 350, this indicates that the quality of our estimate is about the same as if we had used 350 direct samples \(X_i \sim p\). This justifies the name effective sample size.

Since the late 1990s the effective sample size has been widely used as a reliable diagnostic in importance sampling and sequential Monte Carlo applications. Sometimes it even informs the algorithm during sampling; for example, one can continue an importance sampling method until a certain ESS has been reached. Another example is SMC, where the ESS is often used to decide whether operations such as resampling or rejuvenation should be performed, as sketched below.
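For instance, a common rule of thumb in SMC implementations (one convention among several, not from the article) is to resample once the ESS of the normalized particle weights, computed with the estimate defined in the next section, drops below half the particle count; a schematic NumPy sketch, where particles is assumed to be a NumPy array:

    import numpy as np

    def maybe_resample(particles, w, rng, threshold=0.5):
        """Multinomial resampling when the ESS of normalized weights w falls below threshold * n."""
        n = len(w)
        ess = 1.0 / np.sum(w**2)                        # ESS of normalized weights
        if ess < threshold * n:
            idx = rng.choice(n, size=n, p=w)            # resample particle indices by weight
            return particles[idx], np.full(n, 1.0 / n)  # weights reset to uniform
        return particles, w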

Definition

Two alternative but equivalent definitions exist. Assume normalized weights \(w_i \geq 0\) with \(\sum_{i=1}^n w_i = 1\). Then the original definition of the effective sample size estimate, due to Kong and popularized by Jun Liu in this paper, is

$$\textrm{ESS} = \frac{n}{1 + \textrm{Var}_q(W)},$$

where \(W\) denotes the importance weight \(p(X)/q(X)\) for \(X \sim q\) (as in the derivation below), whose variance is estimated from the normalized weights as \(\textrm{Var}_q(W) \approx \frac{1}{n} \sum_{i=1}^n (n w_i - 1)^2 = n \sum_{i=1}^n w_i^2 - 1\). The alternative form emerged later (I did not manage to find its first use precisely), and has the form

$$\textrm{ESS} = \frac{1}{\sum_{i=1}^n w_i^2}.$$

Substituting the above estimate of \(\textrm{Var}_q(W)\) into the first form recovers exactly this expression. When the weights \(w_i\) are unnormalized, we define \(\tilde{w}_i = w_i / \sum_{j=1}^n w_j\) and see that

$$\textrm{ESS} = \frac{1}{\sum_{i=1}^n \tilde{w}_i^2} = \frac{(\sum_{i=1}^n w_i)^2}{\sum_{i=1}^n w_i^2}.$$
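In code this is a one-liner; a small helper assuming the (possibly unnormalized) weights are available as a NumPy array:

    import numpy as np

    def ess(w):
        """Effective sample size (sum w)^2 / sum(w^2) from possibly unnormalized weights."""
        w = np.asarray(w, dtype=float)
        return np.sum(w)**2 / np.sum(w**2)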

As is often the case in numerical computations with probabilistic models, the quantities are stored in the log-domain, i.e. we store \(\log w_i\) instead of \(w_i\) and evaluate the above expressions in log-space, using the log-sum-exp trick to avoid underflow and overflow.
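When only \(\log w_i\) is available, the same quantity can be computed without leaving log-space via two log-sum-exp reductions; a sketch using scipy:

    import numpy as np
    from scipy.special import logsumexp

    def ess_from_log_weights(log_w):
        """ESS = exp(2*logsumexp(log_w) - logsumexp(2*log_w)), i.e. (sum w)^2 / sum(w^2)."""
        log_w = np.asarray(log_w, dtype=float)
        return np.exp(2.0 * logsumexp(log_w) - logsumexp(2.0 * log_w))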

Example

As a simple example we set the target distribution to be a \(\textrm{StudentT}(0,\nu)\) with \(\nu=8\) degrees of freedom, and the proposal to be a Normal \(\mathcal{N}(\mu,16)\). We then visualize the ESS as a function of the shift \(\mu\) of the Normal proposal. The effective sample size should be highest when \(\mu\) is at the true mean (zero) and decrease as the proposal is shifted away from it.

[Figure: ESS example, StudentT target, Normal proposal]

This is indeed what happens in the above plot; moreover (not shown), the variance predicted from the ESS agrees with the empirical variance over many replicates.
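A sketch to reproduce this experiment with the ess helper from above (the sample size, the grid of shifts, and the reading of \(\mathcal{N}(\mu,16)\) as variance 16 are my own choices):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n = 10_000
    target = stats.t(df=8)                        # StudentT(0, 8) target

    for mu in np.linspace(-10.0, 10.0, 11):
        proposal = stats.norm(loc=mu, scale=4.0)  # N(mu, 16), i.e. standard deviation 4
        x = proposal.rvs(size=n, random_state=rng)
        w = target.pdf(x) / proposal.pdf(x)       # unnormalized importance weights
        print(f"mu = {mu:6.2f}   ESS = {ess(w):9.1f}")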

Derivation

The following derivation is from Kong's technical report; to make it self-contained and accessible I have fleshed out some details and give explanations inline.

We start with an expression for \(\textrm{Var}_q(\bar{\mu})\). This is the variance of a ratio with a positive denominator; hence we can apply the multivariate delta method for ratio expressions (see appendix below) to obtain an asymptotic approximation. Following Kong's original notation we define \(W_i = w(X_i)\) and \(W=W_1\), \(H_i = h(X_i)\) and \(H=H_1\), as well as \(Z_i = H_i W_i\) and \(Z = Z_1\), so that \(Z = WH\). Then we have the asymptotic delta method approximation

\begin{eqnarray} \textrm{Var}_q(\bar{\mu}) & \approx & \frac{1}{n}\left[\frac{\textrm{Var}_q(Z)}{(\mathbb{E}_q W)^2} - 2 \frac{\mathbb{E}_q Z}{(\mathbb{E}_q W)^3} \textrm{Cov}_q(Z,W) + \frac{(\mathbb{E}_q Z)^2}{(\mathbb{E}_q W)^4} \textrm{Var}_q(W)\right].\label{eqn:delta1} \end{eqnarray}

We can simplify this somewhat intimidating expression by realizing that

$$\mathbb{E}_q W = \int \frac{p(x)}{q(x)} q(x) \,\textrm{d}x = \int p(x) \,\textrm{d}x = 1.$$

(For the unnormalized case the derivation result is the same because the ratio \(\bar{\mu}\) does not depend on the normalization constant.) Then we can simplify \((\ref{eqn:delta1})\) to

\begin{eqnarray} & = & \frac{1}{n}\left[\textrm{Var}_q(Z) - 2 (\mathbb{E}_q Z) \textrm{Cov}_q(Z,W) + (\mathbb{E}_q Z)^2 \textrm{Var}_q(W)\right].\label{eqn:delta2} \end{eqnarray}

The next step is to realize that \(\mathbb{E}_q Z = \int w(x) h(x) q(x) \,\textrm{d}x = \int \frac{p(x)}{q(x)} q(x) h(x) \,\textrm{d}x = \int h(x) p(x) \,\textrm{d}x = \mu.\) Thus \((\ref{eqn:delta2})\) further simplifies to

\begin{eqnarray} & = & \frac{1}{n}\big[\underbrace{\textrm{Var}_q(Z)}_{\textrm{(B)}} - 2 \mu \underbrace{\textrm{Cov}_q(Z,W)}_{\textrm{(A)}} + \mu^2 \textrm{Var}_q(W)\big]. \label{eqn:delta3} \end{eqnarray}

This is great progress, but we need to nibble on this expression some more. Let us consider the parts (A) and (B), in this order.

(A). To simplify this term we use the definition of the covariance and then apply the expectations we have already computed. This yields

\begin{eqnarray} \textrm{(A)} = \textrm{Cov}_q(Z,W) & = & \mathbb{E}_q[\underbrace{Z}_{= W H} W] - \underbrace{(\mathbb{E}_q Z)}_{= \mu} \underbrace{(\mathbb{E}_q W)}_{= 1}\nonumber\\ & = & \mathbb{E}_q[H W^2] - \mu\nonumber\\ & = & \mathbb{E}_p[H W] - \mu.\label{eqn:A1} \end{eqnarray}

Note the change of measure from \(q\) to \(p\) in the last step. To break down the expectation of the product further we use the familiar identity \(\textrm{Cov}(X,Y) = \mathbb{E}[XY] - (\mathbb{E}X)(\mathbb{E}Y)\), which leads us to

\begin{eqnarray} \textrm{(A)} = \textrm{Cov}_q(Z,W) & = & \textrm{Cov}_p(H,W) + \mu \mathbb{E}_p W - \mu.\label{eqn:A2} \end{eqnarray}

(B). First we expand the variance by its definition, then simplify.

$$\textrm{Var}_q(Z) = \textrm{Var}_q(W H) = \mathbb{E}_q[W^2 H^2] - (\underbrace{\mathbb{E}_q[WH]}_{= \mu})^2 = \mathbb{E}_p[W H^2] - \mu^2.$$

For approaching \(\mathbb{E}_p[W H^2]\) we need to leverage the second-order delta method (see appendix) which gives the following approximation,

\begin{eqnarray} \mathbb{E}_p[W H^2] & \approx & (\mathbb{E}_p W)\underbrace{(\mathbb{E}_p H)^2}_{= \mu^2} + 2 \underbrace{\mathbb{E}_p[H]}_{\mu} \textrm{Cov}_p(W,H) + (\mathbb{E}_p W) \textrm{Var}_p(H)\nonumber\\ & = & (\mathbb{E}_p W) \mu^2 + 2 \mu \textrm{Cov}_p(W,H) + (\mathbb{E}_p W) \textrm{Var}_p(H). \label{eqn:B1} \end{eqnarray}

Ok, almost done. We now leverage our work to harvest:

\begin{eqnarray} \textrm{Var}_q(\bar{\mu}) & \approx & \frac{1}{n}\big[\underbrace{\textrm{Var}_q(Z)}_{\textrm{(B)}} - 2 \mu \underbrace{\textrm{Cov}_q(Z,W)}_{\textrm{(A)}} + \mu^2 \textrm{Var}_q(W)\big]\nonumber\\ & \approx & \frac{1}{n}\big[ \left( (\mathbb{E}_p W) \mu^2 + 2 \mu \textrm{Cov}_p(W,H) + (\mathbb{E}_p W) \textrm{Var}_p(H) - \mu^2 \right)\nonumber\\ & & \qquad - 2 \mu \left(\textrm{Cov}_p(H,W) + \mu\mathbb{E}_p W - \mu\right) \nonumber\\ & & \qquad + \mu^2 \textrm{Var}_q(W) \big]\nonumber\\ & = & \frac{1}{n}\left[\mu^2 \left( 1 + \textrm{Var}_q(W) - \mathbb{E}_p W\right) + (\mathbb{E}_p W) \textrm{Var}_p(H)\right].\label{eqn:H1} \end{eqnarray}

Finally, we can reduce \((\ref{eqn:H1})\) further by

$$\mathbb{E}_p W = \mathbb{E}_q[W^2] = \textrm{Var}_q(W) + (\mathbb{E}_q W)^2 = \textrm{Var}_q(W) + 1.$$

For the other term we have

$$\frac{1}{n} \textrm{Var}_p(H) = \textrm{Var}_p(\hat{\mu}).$$

This simplifies \((\ref{eqn:H1})\) to the following satisfying expression.

$$\textrm{Var}_q(\bar{\mu}) \approx \textrm{Var}_p(\hat{\mu}) (1 + \textrm{Var}_q(W)).$$

This reads as "the variance of the self-normalized importance sampling estimate is approximately equal to the variance of the simple Monte Carlo estimate times \(1 + \textrm{Var}_q(W)\)."
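This relation can be checked by simulation. A minimal sketch, with a setup of my own choosing (standard Normal target, shifted Student-t proposal so that the weights have finite variance, and \(h(x)=x\)); the two printed numbers should roughly agree, though not exactly, since the approximation drops terms that depend on the integrand:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n, reps = 500, 4000
    target = stats.norm(0.0, 1.0)        # p: standard Normal
    proposal = stats.t(df=5, loc=1.0)    # q: shifted Student-t proposal
    h = lambda x: x                      # integrand: the mean, mu = 0

    mu_bar = np.empty(reps)              # self-normalized IS estimates
    mu_hat = np.empty(reps)              # direct Monte Carlo estimates
    for r in range(reps):
        x = proposal.rvs(size=n, random_state=rng)
        w = target.pdf(x) / proposal.pdf(x)
        mu_bar[r] = np.sum(w * h(x)) / np.sum(w)
        mu_hat[r] = np.mean(h(target.rvs(size=n, random_state=rng)))

    # Var_q(W) estimated from one large pilot sample from q
    xq = proposal.rvs(size=1_000_000, random_state=rng)
    var_w = np.var(target.pdf(xq) / proposal.pdf(xq))

    # left- and right-hand side of the approximate relation
    print(np.var(mu_bar), np.var(mu_hat) * (1.0 + var_w))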

Therefore, when taking \(n\) samples to compute \(\bar{\mu}\) the effective sample size is estimated as

$$\textrm{ESS} = \frac{n}{1 + \textrm{Var}_q(W)}.$$

Two comments:

  1. We can estimate \(\textrm{Var}_q(W)\) by the sample variance of the importance weights after rescaling them to have mean one.
  2. This estimate does not depend on the integrand \(h\).

The simpler form of the ESS estimate follows by returning to normalized weights \(w_i\) (so that \(n w_i\) has mean one) and estimating

\begin{eqnarray} \textrm{Var}_q(W) & \approx & \frac{1}{n} \sum_{i=1}^n (n w_i - 1)^2 \nonumber\\ & = & n \sum_{i=1}^n w_i^2 - 1,\nonumber \end{eqnarray}

which yields

$$\textrm{ESS} = \frac{n}{1 + n \sum_{i=1}^n w_i^2 - 1} = \frac{1}{\sum_{i=1}^n w_i^2}.$$
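A two-line numerical check that the two forms indeed coincide for normalized weights:

    import numpy as np

    w = np.array([0.5, 0.2, 0.2, 0.1])           # example normalized weights
    var_w = len(w) * np.sum(w**2) - 1.0          # estimate of Var_q(W) as above
    print(len(w) / (1.0 + var_w), 1.0 / np.sum(w**2))   # both print 2.9411...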

Conclusion

Monte Carlo methods such as importance sampling and Markov chain Monte Carlo can fail if the proposal distribution is not suitably chosen. Therefore, we should always employ diagnostics, and for importance sampling the effective sample size diagnostic has become the standard due to its simplicity, intuitive interpretation, and robustness in practical applications.

However, the effective sample size can fail, for example when all proposal samples fall in a region where the target distribution has little probability mass. In that case the weights are approximately equal and the ESS close to optimal, failing to diagnose the mismatch between proposal and target distribution. This is, in a way, unavoidable: if we never see a high-probability region of the target distribution, it is hard to recognize how little our samples are worth.

For another discussion on importance sampling diagnostics and an alternative derivation, see Section 9.3 in Art Owen's upcoming Monte Carlo book. Among many interesting things in that chapter, he proposes an effective sample size statistic specific to the particular integrand \(h\). For this, redefine the weights as

$$w_h(X_i) = \frac{\frac{p(X_i)}{q(X_i)} |h(X_i)|}{ \sum_{j=1}^n \frac{p(X_j)}{q(X_j)} |h(X_j)|},$$

then use the usual \(1/\sum_i w_h(X_i)^2\) estimate. This variant is more accurate because it takes the integrand into account.
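A sketch of this integrand-specific variant; w and hx are assumed to be arrays of the unnormalized weights \(p(X_i)/q(X_i)\) and integrand values \(h(X_i)\):

    import numpy as np

    def ess_h(w, hx):
        """Integrand-specific ESS based on weights |h(X_i)| p(X_i)/q(X_i)."""
        wh = np.abs(hx) * np.asarray(w, dtype=float)
        wh = wh / np.sum(wh)                 # normalize the h-weighted weights
        return 1.0 / np.sum(wh**2)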

Addendum: This paper by Martino, Elvira, and Louzada takes a detailed look at variations of the effective sample size statistic.

Appendix: The Multivariate Delta Method

The delta method is a classic tool used in asymptotic statistics to obtain limiting expressions for the mean and variance of functions of random variables. It can be seen as the statistical analog of the Taylor approximation of a function.

The multivariate extension is also classic, and the following theorem can be found in many works; I picked the version given as Theorem 3.7 in DasGupta's book on asymptotic statistics (by the way, this book is a favorite of mine for its accessible presentation of many practical results in classical statistics). A more advanced and specialized book on expansions beyond the delta method is Christopher Small's book on the topic.

Delta Method for Distributions

Theorem (Multivariate Delta Method for Distributions). Suppose \(\{T_n\}\) is a sequence of \(k\)-dimensional random vectors such that

$$\sqrt{n}(T_n - \theta) \stackrel{\mathcal{L}}{\rightarrow} \mathcal{N}_k(0,\Sigma(\theta)).$$

Let \(g:\mathbb{R}^k \to \mathbb{R}^m\) be once differentiable at \(\theta\) with the gradient vector \(\nabla g(\theta)\). Then

$$\sqrt{n}(g(T_n) - g(\theta)) \stackrel{\mathcal{L}}{\rightarrow} \mathcal{N}_m(0, \nabla g(\theta)^T \Sigma(\theta) \nabla g(\theta))$$

provided \(\nabla g(\theta)^T \Sigma(\theta) \nabla g(\theta)\) is positive definite.

This simply says that if we have a vector \(T\) of random variables and we know that \(T\) converges asymptotically to a Normal, then we can make a similar statement about the convergence of \(g(T)\).

For the effective sample size derivation we will need to instantiate this theorem for a special case of \(g\), namely where \(g: \mathbb{R}^2 \to \mathbb{R}\) and \(g(x,y) = \frac{x}{y}\). Let's quickly do that. We have

$$\nabla g(x,y) = \left(\begin{array}{c} \frac{1}{y} \\ -\frac{x}{y^2}\end{array}\right).$$

We further define \(X_i \sim P_X\), \(Y_i \sim P_Y\) iid, \(X=X_1\), \(Y=Y_1\),

$$T_n=\left(\begin{array}{c} \frac{1}{n}\sum_{i=1}^n X_i\\ \frac{1}{n} \sum_{i=1}^n Y_i\end{array}\right),\qquad \theta=\left(\begin{array}{c} \mathbb{E}X\\ \mathbb{E}Y\end{array}\right),$$

assuming that \(\frac{1}{n} \sum_{i=1}^n X_i \to \mathbb{E}X\) and \(\frac{1}{n} \sum_{i=1}^n Y_i \to \mathbb{E}Y\). For the covariance matrix we know that the empirical average of \(n\) iid samples has a variance that scales as \(1/n\), that is

$$\textrm{Var}(\frac{1}{n}\sum_{i=1}^n X_i) = \frac{1}{n^2} \textrm{Var}(\sum_{i=1}^n X_i) = \frac{1}{n^2} \sum_{i=1}^n \textrm{Var}(X_i) = \frac{1}{n} \textrm{Var}(X),$$

and similar for the covariance, so we have

$$\Sigma(\theta) = \frac{1}{n} \left(\begin{array}{cc} \textrm{Var}(X) & \textrm{Cov}(X,Y)\\ \textrm{Cov}(X,Y) & \textrm{Var}(Y)\end{array}\right).$$

Applying the above theorem we have for the resulting one-dimensional transformed variance

\begin{eqnarray} B(\theta) & := & \nabla g(\theta)^T \Sigma(\theta) \nabla g(\theta)\nonumber\\ & = & \frac{1}{n} \left(\begin{array}{c} \frac{1}{\mathbb{E}Y} \\ -\frac{\mathbb{E}X}{(\mathbb{E}Y)^2}\end{array}\right)^T \left(\begin{array}{cc} \textrm{Var}(X) & \textrm{Cov}(X,Y)\\ \textrm{Cov}(X,Y) & \textrm{Var}(Y)\end{array}\right) \left(\begin{array}{c} \frac{1}{\mathbb{E}Y} \\ -\frac{\mathbb{E}X}{(\mathbb{E}Y)^2}\end{array}\right)\nonumber\\ & = & \frac{1}{n} \left[ \frac{1}{(\mathbb{E}Y)^2} \textrm{Var}(X) - 2 \frac{\mathbb{E}X}{(\mathbb{E}Y)^3} \textrm{Cov}(X,Y) + \frac{(\mathbb{E}X)^2}{(\mathbb{E}Y)^4} \textrm{Var}(Y) \right].\nonumber \end{eqnarray}

One way to interpret the quantity \(B(\theta)\) is that the limiting variance of the ratio \(X/Y\) depends both on the variances of \(X\) and of \(Y\), but crucially it depends most sensitively on \(\mathbb{E}Y\) because this quantity appears in the denominator: small values of \(Y\) have large effects on \(X/Y\).

This is an asymptotic expression which is based on the assumption that both \(X\) and \(Y\) are concentrated around the mean so that the linearization of \(g\) around the mean will incur a small error. As such, this approximation may deteriorate if the variance of \(X\) or \(Y\) is large so that the linear approximation of \(g\) deviates from the actual values of \(g\).
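The ratio formula can be sanity-checked by simulation; a sketch with a correlated pair \((X,Y)\) of my own choosing, with \(Y\) bounded well away from zero:

    import numpy as np

    rng = np.random.default_rng(2)
    n, reps = 500, 10_000

    def sample_xy(size):
        x = rng.normal(2.0, 1.0, size=size)
        y = 5.0 + 0.5 * x + rng.normal(0.0, 0.5, size=size)   # Cov(X, Y) = 0.5, E[Y] = 6
        return x, y

    # empirical variance of the ratio of sample means over many replicates
    ratios = np.array([np.mean(x) / np.mean(y)
                       for x, y in (sample_xy(n) for _ in range(reps))])

    # delta-method prediction B(theta), with moments estimated from one large sample
    x, y = sample_xy(1_000_000)
    ex, ey = x.mean(), y.mean()
    vx, vy, cxy = x.var(), y.var(), np.cov(x, y)[0, 1]
    pred = (vx / ey**2 - 2.0 * ex * cxy / ey**3 + ex**2 * vy / ey**4) / n

    print(ratios.var(), pred)   # the two values should nearly agree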

(For an exact expansion of the expectation of a ratio, see this 2009 note by Sean Rice.)

Second-order Delta Method

The above delta method can be extended to higher order by a multivariate Taylor expansion. I give the following result without proof.

Theorem (Second-order Multivariate Delta Method). Let \(T\) be a \(k\)-dimensional random vector such that \(\mathbb{E} T = \theta\). Let \(g:\mathbb{R}^k \to \mathbb{R}\) be twice differentiable at \(\theta\) with Hessian \(H(\theta)\). Then

$$\mathbb{E} g(T) \approx g(\theta) + \frac{1}{2} \textrm{tr}(\textrm{Cov}(T) \: H(\theta)).$$

For the derivation of the effective sample size we need to apply this theorem to the function \(g(X,Y)=XY^2\), so that

$$H(X,Y)=\left[\begin{array}{cc} 0 & 2Y\\ 2Y & 2X\end{array}\right].$$

Then the above result gives

$$\mathbb{E} g(X,Y) \approx (\mathbb{E}X)(\mathbb{E}Y)^2 + 2 (\mathbb{E}Y) \textrm{Cov}(X,Y) + (\mathbb{E}X) \textrm{Var}(Y).$$
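Similarly, a quick check of this second-order approximation for \(g(X,Y)=XY^2\), again with a correlated toy pair of my own choosing:

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.normal(1.0, 0.3, size=1_000_000)
    y = 2.0 + 0.5 * x + rng.normal(0.0, 0.3, size=1_000_000)

    exact = np.mean(x * y**2)                     # Monte Carlo estimate of E[X Y^2]
    approx = (x.mean() * y.mean()**2
              + 2.0 * y.mean() * np.cov(x, y)[0, 1]
              + x.mean() * y.var())               # second-order delta approximation
    print(exact, approx)                          # close, since (X, Y) are concentrated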