Sebastian Nowozin's slow blog

Thoughts on Trace Estimation in Deep Learning

2022-08-09T19:00:00+01:00

Efficiently estimating the trace $\textrm{tr}(A) = \sum_{i=1}^d A_{ii}$ of a square matrix $A \in \mathbb{R}^{d \times d}$ is an important problem required in a number of recent deep learning and machine learning models. In those cases the matrix $A$ is typically positive-definite, large and dense.

As a sample of recent occurrences of needing to compute the trace of large matrices in machine learning, I picked the following applications.

Continuous normalizing flows, as in diffusion models (Song et al., ICLR 2021), FFJORD (Grathwohl et al., ICLR 2019) and Neural ODEs (Chen et al., NeurIPS 2018), where an initial sample $x(0) \sim p_0$ is continuously transformed by a function, i.e. $\partial x(t)/\partial t = f(x(t),t)$ from $t=0$ to $t=1$. To evaluate $\log p(x(1))$ we need to rely on the instantaneous change of variable formula,
$$\frac{\partial \log p(x(t))}{\partial t} = -\textrm{tr}\left( \frac{\partial f}{\partial x(t)}\right),$$
such that the log-probability is determined by
$$\log p(x(1)) = \log p(x(0)) - \int_0^1 \textrm{tr}\left( \frac{\partial f}{\partial x(t)}\right)\,\textrm{d}t.$$
Computing the trace of the Jacobian $\frac{\partial f}{\partial x(t)}$ is the computational bottleneck.
Efficient Gaussian Process evidence computation. (Wenger et al., ICML 2022), where trace estimation is used to estimate the log-marginal likelihood, and the matrix $A$ is a kernel matrix.
Approximating log-determinants in invertible ResNets. (Behrmann et al., 2018) propose a variant of ResNet blocks that is invertible by constraining the Lipschitz-constant of the ResNet block update to be smaller than one. Once invertible the ResNet block can be used for generative modelling via a normalizing flow model. That is, we sample $x_0 \sim p_0$ from a simple prior $p_0$ and then map $f(x_0)$ to the target density. To compute log-likelihoods for a given $x$ we invert the map and compute $\log p(x) = \log p_0(f^{-1}(x)) + \log |\det J_{f^{-1}}(x)|$. By exploiting the structure of the $i$'th ResNet block, $f_i(x) = x + g_i(x)$, and the Lipschitz constraint on $g_i$, the log-determinant computation can be reduced to a convergent power series, $\textrm{tr}(\log (I + J_g(x))) = \sum_{k=1}^{\infty} \frac{(-1)^{k+1}}{k} \textrm{tr}(J^k_g).$ Without going into detail, Behrmann et al. truncate the power series and compute the trace terms using Hutchinson's trace estimator, thus are able to use invertible ResNets for generative modelling. The same group, in (Chen et al., 2019), improve on the finite truncation by using stochastic truncation in the form of Russian roulette estimators, managing to create unbiased estimates, again using trace estimation for each term of the power series. (If you hear the term "Russian roulette estimator" for the first time, it is a quite general technique that is worth knowing about; a good self-contained brief introduction and history of randomized series truncation can be found in section 2.1 and 2.2 of (Beatson and Adams, 2019).)
Regularizing continuous dynamics. (Finlay et al., ICML 2020) regularize the Frobenius norm $\|A\|_F^2 = \textrm{tr}(A^T A)$ of the Jacobian of a neural ODE leading to smoother dynamics and fewer adaptive integrator steps.
Neural network quantization layer-wise sensitivity metric. (Dong et al., NeurIPS 2020) and (Qian et al., 2020) use the trace-of-Hessian of parameters belonging to the same neural network layer to allocate the quantization fidelity needed. Such a trace-of-Hessian regularization is also effectively used in one of the early papers on energy-based models, (Kingma and Le Cun, 2010), where it is used to regularize the curvature of learned energy functions. The diagonal of the Hessian is a natural local sensitivity measure and perhaps the earliest use in neural networks is in the classic optimal brain damage sensitivity metric of (Le Cun et al., 1989), which used second derivatives for each parameter to determine deletion of neurons.
Sliced score matching. (Hyvarinen, JMLR 2005) introduced score matching as a learning objective for energy-based models, $p(x) \propto \exp(-E(x))$, and in the score matching objective a sum of second derivatives of the energy function needs to be evaluated, $\sum_{i=1}^d \partial^2 E(x) / (\partial x_i)^2$. Because evaluating these second-order derivatives is expensive this limited the applicability of score matching until (Song et al., UAI 2020) introduced sliced score matching where the expensive term is replaced by a stochastic estimate $\mathbb{E}_z\left[\sum_{i=1}^d \sum_{j=1}^d \frac{\partial^2 E(x)}{\partial x_i \, \partial x_j} z_i z_j\right]$, i.e. a Hutchinson estimate of the trace of the Hessian of $E$. For a great overview of these techniques see the recent review by (Song and Kingma, 2021).
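To make the power series from the invertible ResNet item concrete, here is a minimal Julia sketch. It is my own illustration, not the code of Behrmann et al.: the function name `logdet_series`, the closure `matvec` (a stand-in for a Jacobian-vector product) and the small random test matrix are all assumptions. Each term $\textrm{tr}(J^k)$ is estimated from the same probe vector by repeated matrix-vector products.

```julia
using LinearAlgebra

# Truncated series tr(log(I + J)) ≈ sum_{k=1}^K (-1)^(k+1)/k * tr(J^k),
# with each trace term estimated as the average of z' J^k z over the columns of Z.
# `matvec` is a hypothetical stand-in for a Jacobian-vector product.
function logdet_series(matvec, Z::AbstractMatrix, K::Int)
    total = 0.0
    for z in eachcol(Z)
        w = copy(z)
        for k in 1:K
            w = matvec(w)                          # w = J^k z
            total += (-1)^(k + 1) / k * dot(z, w)
        end
    end
    return total / size(Z, 2)
end

d = 5
G = randn(d, d)
J = 0.5 * G / opnorm(G)          # spectral norm 0.5 < 1, so the series converges
Z = rand([-1.0, 1.0], d, 200)    # Rademacher probe vectors as columns
estimate = logdet_series(v -> J * v, Z, 30)
exact = logdet(I + J)
println((estimate, exact))
```

The truncation level and the number of probes are free choices here; a fixed truncation gives a biased estimate, which is the issue the Russian roulette variant addresses.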

What is the shared difficulty in all of the above applications? After all, computing the trace of an explicitly given matrix $A$ is trivial: simply sum the diagonal elements,

$$\textrm{tr}(A) := \sum_i A_{ii}.$$

However, in the above applications arising in deep learning the problem is that it is very expensive to compute $A$ explicitly, but we can query matrix-vector products efficiently. Given $z \in \mathbb{R}^d$, we can efficiently compute

$$y = A \, z.$$

Clearly, if we are able to compute many such products, say $d$ times, we can reconstruct the matrix $A$ completely. The simplest example is to take $z^{(m)} := e_m$, the natural basis vectors in $\mathbb{R}^d$, such that $y^{(m)} = A \, z^{(m)}$ directly extracts the $m$'th column of the matrix. By extracting all columns we could obtain $A$ in explicit form.
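As a small sketch of this brute-force route, assume the matrix is only available through a closure `matvec` (a hypothetical name; `A_hidden` merely plays the role of the implicit matrix). Querying with all $d$ basis vectors rebuilds the matrix, after which the trace is exact.

```julia
using LinearAlgebra

# Hypothetical implicit matrix: we only get to call `matvec`, never look at A_hidden.
d = 4
A_hidden = [Float64(i + 2j) for i in 1:d, j in 1:d]
matvec(z) = A_hidden * z

# Query with each basis vector e_m; the product A * e_m is the m-th column of A.
function reconstruct(matvec, d)
    A = zeros(d, d)
    for m in 1:d
        e = zeros(d)
        e[m] = 1.0
        A[:, m] = matvec(e)
    end
    return A
end

A_rec = reconstruct(matvec, d)
trace_exact = tr(A_rec)      # costs d matrix-vector products
println(trace_exact)
```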

The drawback of this technique is that performing many matrix-vector multiplications is expensive, where typically each matrix-vector product corresponds to one forward-backprop operation in a neural network. Is there a better way, requiring only a small number of matrix-vector products to obtain an accurate estimate of the trace of $A$?

Yes, and we will discuss the main technique below. But first, to add more excitement to our goal: if the problem of trace estimation is tractable, a number of related problems are also within reach using extended methods such as variants of conjugate gradients (Seeger, 2000) and the stochastic Lanczos quadrature method of (Ubaru et al., 2017) which allows efficient estimation of functions of the form

$$\textrm{tr}(f(A)),$$

where $f: \mathbb{R} \to \mathbb{R}$ is a scalar function and $\textrm{tr}(f(\cdot))$ is the resulting trace function, $\textrm{tr}(f(A)) := \sum_{i=1}^d f(\lambda_i(A))$. Through different choices of $f$, trace functions enable estimation of other quantities:

Log-determinants, $\log \det(A) = \textrm{tr}(\log(A)) = \sum_{i=1}^d \log(\lambda_i)$.
Nuclear norm, for $X \in \mathbb{R}^{k \times d}$ defined as $\|X\|_* = \textrm{tr}(\sqrt{X^T X})$, and more general Schatten $p$-norms.
Trace of $A^{-1}$, where $\textrm{tr}(A^{-1}) = \textrm{tr}(f(A))$ with $f(t)=1/t$.

As we will see below, another quantity that can be estimated using Hutchinson-style estimators is the diagonal of a matrix $A$.

Skilling-Hutchinson 1989 trace estimator

The estimator appeared in two works in parallel. In his original 1989 paper, (Hutchinson, "A Stochastic Estimator of the Trace of the Influence Matrix for Laplacian Smoothing Splines", 1989), Hutchinson introduced the first stochastic estimator of the matrix trace, and simultaneously John Skilling introduced the same technique in (Skilling, "The Eigenvalues of Mega-dimensional Matrices", 1989).

The Skilling-Hutchinson trace estimator is not just historically interesting; it is still the most common method used today due to its general applicability and simplicity of implementation.

Skilling-Hutchinson's trace estimate: If $z \in \mathbb{R}^{d}$ is a random vector satisfying $\mathbb{E}[z z^T] = I$, then

$$\mathbb{E}[z^T A z] = \textrm{tr}(A).$$

The Skilling-Hutchinson estimator is

$$\hat{T}_{\cdot,M}(A) := \frac{1}{M} \sum_{m=1}^M (z^{(m)})^T A\, z^{(m)},$$

where $z^{(m)}$ are random vectors satisfying the above condition. In Hutchinson's original estimator these vectors are Rademacher vectors with elements iid in $\{-1,1\}$ and we write $\hat{T}_{H,M}$, but the term Hutchinson's trace estimator is also commonly used nowadays if standard Normal vectors are used in the Gaussian trace estimator $\hat{T}_{G,M}$.

The Skilling-Hutchinson estimator is unbiased, meaning $\mathbb{E}[\hat{T}_{\cdot,M}(A)] = \textrm{tr}(A)$. Moreover, it is known that for standard Normal vectors we have

$$\mathbb{V}[\hat{T}_{G,M}(A)] = \frac{2}{M} \|A\|_F^2 = \frac{2}{M} \sum_{i=1}^d \lambda^2_i(A),$$

where $\lambda_i(A)$ is the $i$'th eigenvalue of $A$ (the second equality holds for symmetric $A$). For Rademacher vectors it is known that

$$\mathbb{V}[\hat{T}_{H,M}(A)] = \frac{2}{M} \left(\|A\|_F^2 - \sum_{i=1}^d A_{ii}^2\right).$$

You can see that using Rademacher vectors has provably smaller variance than using Gaussian vectors,

$$\mathbb{V}[\hat{T}_{H,M}(A)] \leq \mathbb{V}[\hat{T}_{G,M}(A)].$$
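The estimator and the two variance formulas fit into a few lines of Julia. This is a sketch for checking the formulas empirically, assuming a symmetric test matrix; the helper names (`hutchinson`, `rademacher`, `gaussian`, `var_gaussian`, `var_rademacher`) are mine.

```julia
using LinearAlgebra, Statistics

# M-sample Skilling-Hutchinson estimate; `sampler(d)` returns one probe vector z
# with E[z z'] = I. Only products A * z are needed.
function hutchinson(A, M, sampler)
    d = size(A, 1)
    total = 0.0
    for _ in 1:M
        z = sampler(d)
        total += dot(z, A * z)
    end
    return total / M
end

rademacher(d) = rand([-1.0, 1.0], d)
gaussian(d) = randn(d)

# Closed-form variances for symmetric A, following the two formulas above.
var_gaussian(A, M) = 2 / M * sum(abs2, A)
var_rademacher(A, M) = 2 / M * (sum(abs2, A) - sum(abs2, diag(A)))

d = 50
B = randn(d, d)
A = B' * B / d                     # symmetric positive semi-definite test matrix
M = 20
reps = 2000
est_g = [hutchinson(A, M, gaussian) for _ in 1:reps]
est_h = [hutchinson(A, M, rademacher) for _ in 1:reps]
println((tr(A), mean(est_g), mean(est_h)))        # both means should sit near tr(A)
println((var(est_g), var_gaussian(A, M)))         # empirical vs. closed-form variance
println((var(est_h), var_rademacher(A, M)))
```

For a diagonal matrix the Rademacher variance formula gives zero, and indeed a single Rademacher probe then returns the trace exactly.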

There is a wealth of theory available for the estimator, and a good recent entry point into known results is Maciej Skorski's paper "A Modern Analysis of Hutchinson's Trace Estimator" from 2020. In it he gives an error bound for the Rademacher version, using the relative error

$$\textrm{err}(\hat{T}_{H,M}, A) := \frac{\hat{T}_{H,M}(A)}{\textrm{tr}(A)}-1.$$

For this error and for any $d \geq 2$ he gives the tail bound for any $0 < \varepsilon < 3/8$ of the form

$$P(|\textrm{err}(\hat{T}_{H,M}, A)| \geq \varepsilon) \leq \exp\left(-\frac{M \varepsilon^2}{2(1-(8/3)\varepsilon)}\right).$$

Praise for Hutchinson's estimator

There is a lot of good to say about Hutchinson's trace estimator:

It is simple: the estimator is easy to understand and implement. It is free from exotic ingredients, uses just basic linear algebra, and does not make strong assumptions thus is widely applicable. Because it is simple it works well with auto-differentiation.

Linear trade-off $M$: Hutchinson's estimator comes with a free choice of $M \geq 1$, the number of matrix-vector products to evaluate. The parameter $M$ linearly controls both variance and computational effort with the estimator becoming exact for $M \to \infty$.

Parallelizable: for larger values of $M$ all evaluations can be done in parallel, i.e. the sequential compute depth does not increase for more accurate estimates.

Unbiasedness: for any $M \geq 1$ the estimator is unbiased. How valuable is an unbiased estimator? In general whether an estimator is unbiased or not may not matter (see Andrew Gelman's points here and here). But our situation here is special for two reasons: 1. there is an exact quantity of interest, $\textrm{tr}(A)$, and our estimation is done only for computational benefits; and 2. for most deep learning applications it is incredibly important: it allows iterative stochastic optimization algorithms to work correctly and to asymptotically average out estimator variance.

So is all good then with Hutchinson's estimator?

Problems of Hutchinson's estimator

Despite singing the praise just now, the estimator has a number of fundamental problems as well.

High Monte Carlo variance: the estimator has a decaying variance at rate $O(1/M)$ arising from taking the average of $M$ estimates. To see why this is a bad rate, consider the case where we take $M=d$, and we take Normal vectors $z^{(m)} \sim \mathcal{N}_d(0,I)$. We then could recover the exact matrix $A$ and thus its trace without any uncertainty: the $d$ Normal vectors are linearly independent with probability one, so stacking them as columns of $Z$ and the products as columns of $Y$ gives $A = Y Z^{-1}$. Hutchinson would still offer us only a $1/d$ decrease in variance and hence does not use all information contained in our measurements.
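A quick way to convince yourself of this is the following sketch, where a random symmetric matrix stands in for the implicit $A$ (an assumption for illustration). The same $d$ products are used twice: once to solve for $A$ exactly and once for the Gaussian Hutchinson average.

```julia
using LinearAlgebra

d = 6
B = randn(d, d)
A = B + B'                 # hidden matrix; only the products A * z are "observed"

Z = randn(d, d)            # d Gaussian probes as columns, invertible almost surely
Y = A * Z                  # the d matrix-vector products

A_rec = Y / Z              # solves A_rec * Z = Y, i.e. A_rec = Y * inv(Z)
hutch = sum(Z .* Y) / d    # Gaussian Hutchinson estimate from the same d products

println((tr(A), tr(A_rec), hutch))
```

The recovered trace matches up to floating point error, while the Hutchinson average from the very same measurements is still noisy.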

The analysis from Skorski reflects this hunger for large sample sizes. Skorski's analysis estimates that for given $(\varepsilon,\delta)$ parameters, we need $n(\varepsilon,\delta) = 2(1-(8/3)\varepsilon)\log(1/\delta)/\varepsilon^2$ samples to achieve an absolute bound of $\varepsilon$ on the relative error with probability $1-\delta$. As an example, his result requires $n(0.1, 0.1) \approx 338$ and $n(0.01, 0.1) \approx 44824$ samples, independent of $d$.
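The sample size formula is easy to tabulate. The function name `skorski_samples` is my own; the body just evaluates the expression for $n(\varepsilon,\delta)$ given above and rounds up to a whole number of samples.

```julia
# Sample size implied by the tail bound, valid for 0 < ε < 3/8 and 0 < δ < 1.
function skorski_samples(ε, δ)
    @assert 0 < ε < 3/8 && 0 < δ < 1
    return 2 * (1 - (8 / 3) * ε) * log(1 / δ) / ε^2
end

for ε in (0.1, 0.05, 0.01)
    println("ε = $ε, δ = 0.1: n ≈ ", ceil(Int, skorski_samples(ε, 0.1)))
end
```

The $1/\varepsilon^2$ factor dominates: tightening the relative error by a factor of ten costs roughly a hundred times more matrix-vector products.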

Complete prior ignorance: in some applications we may have a prior idea about $A$ or of its trace value. For example, in deep learning we learn iteratively by gradient descent, and a matrix $A_t$ at step $t$ may not be too different from a matrix $A_{t+\Delta}$ for small $\Delta$.

Complete random design $(z^{(1)},\dots,z^{(M)})$: whether Normal or Rademacher vectors are used, the random vectors $z^{(m)}$ are chosen independently at random. Can we improve the estimate by choosing them dependently? Or by choosing $z^{(m)}$ adaptively based on $(y^{(j)},z^{(j)})_{j < m}$? The latter is an adaptive experimental design and may or may not be an option depending on our needs to parallelize computation over $z^{(m)}$'s.

Variance Reduction Approaches

A number of approaches have been proposed to preserve the spirit of the Hutchinson estimator but to lower its variance. The shared idea is to think sequentially and to use prior measurements to construct some form of estimate $\hat{A}$ of $A$, which can then be used to lower the variance.

I am aware of two classes of methods: one based on control-variates, and one based on constructing a low-rank approximation to $A$.

In addition to these two classes, I will also throw in an attractive new method into the mix, based on randomized quasi Monte Carlo.

Control-variate Methods

Control variates are a classic method for variance reduction and are frequently used in reinforcement learning, where they are called baselines. A great introduction to classic variance reduction methods can be found in Chapter 8 of Art Owen's yet-unreleased Monte Carlo book, with Section 8.9 introducing various forms of control variates.

In its simplest form the idea is this: we are interested in estimating $\mathbb{E}_{z \sim p}[f(z)]$ using samples from $p(z)$. If we know a "simple" function $h$ and this function is similar to $f$, i.e. we have $h(z) \approx f(z)$, then we can instead attempt to estimate the equivalent quantity

$$\mathbb{E}_{z \sim p}[f(z) - h(z)] + \mathbb{E}_{z \sim p}[h(z)].$$

The first expectation is now likely smaller in magnitude, so our Monte Carlo estimate of this first term has smaller variance. But what about the second term? If $h$ is simple enough we may be able to compute this quantity analytically, with no Monte Carlo variance at all.

To make this idea realistic, we typically relax the definition somewhat and define $h_{\beta}(z) := \beta \, h(z)$, where $\beta \in \mathbb{R}$ can be estimated to maximally mimic the behaviour of $f(z)$ and thus to reduce the variance of $f(z) - h_{\beta}(z)$ the most.

For trace estimation (Adams et al., 2018) first proposed to use control variates to reduce variance:

They propose to set $h_{\beta}(z) = \beta \, z^T B z$, where $B \in \mathbb{R}^{d \times d}$ is a matrix chosen by us, ideally $B \approx A$, and $\beta \in \mathbb{R}$ is estimated or fixed to $\beta=1$.
The $M$-sample trace estimator now becomes
$$\hat{T}_{C}(A,B,\beta) = \frac{1}{M}\sum_{m=1}^M \left[(z^{(m)})^T A z^{(m)} - \beta (z^{(m)})^T B z^{(m)}\right] + \beta \textrm{tr}(B).$$
When $z^{(m)} \sim \mathcal{N}_d(0,I)$, Adams et al. show (Lemma 4.1 in their work) that the variance-minimizing choice of $\beta$ is $\beta^* = \textrm{tr}(A\,B)/\textrm{tr}(B\,B)$ and that for this choice the variance of the estimator is reduced compared to the Gaussian trace estimator by $2 \textrm{tr}(A \,B)^2 / \textrm{tr}(B \, B)$. This also shows that when $B=A$ the estimator variance is zero.
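The claims about $\beta^*$ can be checked numerically. The sketch below is my own illustration for symmetric $A$ and $B$ with Gaussian probes; it uses the per-sample variance $2\|A - \beta B\|_F^2$ of $z^T (A - \beta B) z$, and the choice of $B$ as the exact diagonal of $A$ is only for demonstration, since in practice the diagonal would itself be estimated.

```julia
using LinearAlgebra

# Per-sample variance of z'(A - βB)z for Gaussian z and symmetric A, B.
cv_variance(A, B, β) = 2 * sum(abs2, A - β * B)
beta_opt(A, B) = tr(A * B) / tr(B * B)

# M-sample control-variate estimate with Gaussian probes.
function cv_trace(A, B, β, M)
    d = size(A, 1)
    total = 0.0
    for _ in 1:M
        z = randn(d)
        total += dot(z, A * z) - β * dot(z, B * z)
    end
    return total / M + β * tr(B)
end

d = 30
G = randn(d, d)
A = G' * G / d
B = Diagonal(diag(A))          # control variate: the exact diagonal of A
β = beta_opt(A, B)
reduction = 2 * tr(A * B)^2 / tr(B * B)
println((cv_variance(A, zero(A), 0.0), cv_variance(A, B, β), reduction))
println((tr(A), cv_trace(A, B, β, 10)))
```

The difference between the first two printed variances equals the third number, and plugging in $B = A$ drives the variance to zero.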

How to select the matrix $B$? Adams et al. make one efficient proposal, which is to estimate the diagonal of $A$ in the form $B = \textrm{diag}(b)$, where $b \in \mathbb{R}^d$. The diagonal is a simple choice because we can evaluate $\textrm{tr}(B) = \sum_{i=1}^d b_i$ but also because the Hutchinson-style trace estimator already contains an estimator of the diagonal within it:

$$\mathbb{E}_{z}[z \odot (A z)] = \textrm{diag}(A),$$

where $\odot$ is the elementwise product. This identity holds for both the Rademacher vectors and the Gaussian vectors because $\mathbb{E}[z_i^2] = 1$. For the $i$'th element of the diagonal, we can see that

$$ \begin{align*} \mathbb{E}\left[z_i \left(\sum_{j=1}^d A_{ij} z_j\right)\right] &= \sum_{j=1}^d A_{ij} \mathbb{E}[z_i z_j]\\ &= A_{ii} \underbrace{\mathbb{E}[z_i^2]}_{=1} + \sum_{j \neq i} A_{ij} \underbrace{\mathbb{E}[z_i]}_{=0} \underbrace{\mathbb{E}[z_j]}_{=0}\\ &= A_{ii}. \end{align*} $$
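In code, the diagonal estimate falls out of the same matrix-vector products. This is a minimal sketch with assumed names (`diag_estimate`, `matvec`) and Rademacher probes as the default sampler.

```julia
using LinearAlgebra

# Average of z .* (A z) over M probes; each term is an unbiased estimate of diag(A).
function diag_estimate(matvec, d, M; sampler = n -> rand([-1.0, 1.0], n))
    acc = zeros(d)
    for _ in 1:M
        z = sampler(d)
        acc .+= z .* matvec(z)
    end
    return acc ./ M
end

d = 8
G = randn(d, d)
A = G' * G / d + Diagonal(collect(1.0:d))   # test matrix with a pronounced diagonal
b = diag_estimate(z -> A * z, d, 500)
println(maximum(abs.(b .- diag(A))))        # largest elementwise error of the estimate
println((sum(b), tr(A)))                    # summing the estimate gives a trace estimate
```

Summing the entries of the estimate recovers the plain Hutchinson trace estimate, which is why the diagonal comes at no additional cost.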

Adams et al. also propose to apply the control variate idea once more to the diagonal estimate itself. One way to achieve this is to look at the $m$'th iteration, where our instantaneous diagonal estimate is

$$\hat{b}^{(m)} = z^{(m)} \odot A z^{(m)}.$$

Instead we can use our existing knowledge of $A$, in the form of the previous estimate $\hat{b}^{(m-1)}$:

$$ \begin{align*} \hat{b}^{(m)} &:= z^{(m)} \odot \left(A-\textrm{diag}(\hat{b}^{(m-1)})\right) z^{(m)} + \mathbb{E}_z\left[z \odot \textrm{diag}(\hat{b}^{(m-1)}) z\right]\\ &= z^{(m)} \odot \left(A-\textrm{diag}(\hat{b}^{(m-1)})\right) z^{(m)} + \hat{b}^{(m-1)}. \end{align*} $$

Putting the two control variate ideas together, we can implement the Adams et al. trace estimator in the following Julia code.

function adams_trace_estimator(A, M::Int; use_diag_cv=false)
    d = size(A,1)
    b_diag = zeros(d)   # B = diag(b_diag)

    tr_est = 0.0
    for m = 1:M
        z = randn(d)    # Gaussian z^{(m)}
        y = A*z

        y_B = b_diag .* z   # B z
        tr_est += z'*y - (z'*y_B - sum(b_diag))   # z'Az - (z'Bz - tr(B))

        # Update diagonal estimate
        if use_diag_cv
            b_diag_cur = (z .* (y - y_B)) + b_diag  # z .* ((A-B)z) + diag(B)
        else
            b_diag_cur = z .* y   # instantaneous estimate of diag(A)
        end
        b_diag .*= (m-1)
        b_diag += b_diag_cur
        b_diag ./= m        # Running mean; without use_diag_cv: b^{(m)} = (1/m) sum_{j=1}^m (z^{(j)} .* y^{(j)})
    end
    tr_est / M, b_diag
end

Low-rank Approximation Methods (Hutch++)

(Meyer et al., 2021) present improvements on the Hutchinson estimator by first extracting a low-rank approximation to $A$ and then using this low-rank approximation to reduce the variance of the trace estimate.

Given a good approximation $\tilde{A}$ of $A$ the method also uses the same technique as the control variate approach, representing

$$\textrm{tr}(A) = \textrm{tr}(\tilde{A}) + \textrm{tr}(A - \tilde{A}),$$

where $\textrm{tr}(\tilde{A})$ is computed analytically and the second term is stochastically estimated at reduced variance. How to obtain a good approximation $\tilde{A}$? Meyer et al. make two proposals, which then form the Hutch++ and the Nystroem-Hutch++ estimator. I will only discuss the Hutch++ briefly here.

Hutch++ estimator. Given a symmetric psd matrix $A \in \mathbb{R}^{d \times d}$ and an overall budget of $m$ query vectors, split this budget into $q_k$ and $\ell$ such that $2 q_k + \ell = m$. Create $S \in \mathbb{R}^{d \times q_k}$ with each element $S_{ij} \sim \mathcal{N}(0,1)$. Evaluate $Y = A S$ and orthonormalize $Y$ to $Q \in \mathbb{R}^{d \times q_k}$. Set $\tilde{A} = Q Q^T A$, so that $\textrm{tr}(\tilde{A}) = \textrm{tr}(Q^T A Q)$ costs another $q_k$ queries, and apply the control variate method on $\ell$ additional samples.

In Julia this can be implemented as follows.

using LinearAlgebra, Statistics   # qr, tr, mean

function hutchpp(A,m)
    @assert m >= 3            # need at least one query vector in each stage
    d = size(A,1)
    k = floor(Int, (m-2)/8)   # Variance-optimal allocation of initial queries
    qk = 2*k+1                # qk: number of initial query vectors
    ell = m - 2*qk            # ell: remaining budget for final estimate
    @assert (2*(qk)+ell) <= m # make sure total query budget m is satisfied

    # initial basis construction
    S = randn(d, qk)      # qk initial query vectors
    Y = A*S              # query matrix
    Q = Matrix(qr(Y).Q)  # orthonormalize Y to a (d,qk) basis

    # variance-reduced stochastic estimate
    z = randn(d,ell)      # ell remaining queries
    y0 = A*z
    y = y0 - Q*(Q'*y0)    # adjust estimate using low-rank approximation
    tr_ests = sum(z .* y, dims=1)
    tr_est = Statistics.mean(tr_ests) + tr(Q'*A*Q)   # another qk queries
    tr_est
end
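To see why the low-rank step can pay off, consider the extreme case where $A$ has rank no larger than the number of sketch vectors. The following self-contained sketch (variable names are my own, and it does not call the function above) shows that the projected part then carries the whole trace and nothing is left for the stochastic part to estimate.

```julia
using LinearAlgebra

d, r = 40, 3
U = randn(d, r)
A = U * U'                        # symmetric psd matrix of rank r

S = randn(d, r)                   # r sketch vectors
Q = Matrix(qr(A * S).Q)           # orthonormal basis of range(A * S), size (d, r)

low_rank_part = tr(Q' * A * Q)    # trace captured by the low-rank approximation
residual = A - Q * (Q' * A)       # what the stochastic part would have to estimate

println((tr(A), low_rank_part, norm(residual)))
```

For matrices with a slowly decaying spectrum the residual stays large, so how much is gained depends on how much of the spectrum the sketch captures.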

A problem of the Hutch++ family of estimators shared with the control variate one is that it is difficult to parallelize: there are two sequential steps and the second step relies on the output of the first. This may not be a problem in most applications, but in training deep neural networks we typically prefer parallelization.

Preconditioning

If the matrix $A$ is a kernel matrix $K$, i.e. $K_{ij} = k(x_i, x_j)$ for some kernel function $k$, then the variance of a stochastic trace estimator can be greatly reduced using an appropriate preconditioner.

An extensive set of results is given by (Wenger et al., "Preconditioning for Scalable GP Hyperparameter Optimization", ICML 2022) with the application of computing the log-marginal likelihood (evidence) of Gaussian processes.

In their application, they exploit the identity

$$\log \det K = \log \det P + \textrm{tr}(\log K - \log P),$$

and estimate the second term using a stochastic trace estimator for variance reduction.
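The identity itself is easy to verify on a small example. The sketch below only checks the algebra: the RBF test matrix, the jitter and the diagonal "preconditioner" are my own placeholder choices and not the preconditioners studied by Wenger et al.

```julia
using LinearAlgebra

# Matrix logarithm of a symmetric positive-definite matrix via its eigendecomposition.
function spd_log(K)
    F = eigen(Symmetric(K))
    return F.vectors * Diagonal(log.(F.values)) * F.vectors'
end

# RBF kernel matrix on a 1D grid with a small diagonal jitter (hypothetical test setup).
x = LinRange(0.0, 1.0, 12)
K = [exp(-(a - b)^2 / (2 * 0.1^2)) for a in x, b in x] + 0.1 * I
P = Diagonal(diag(K))               # crude diagonal preconditioner, for illustration only

lhs = logdet(K)
rhs = logdet(P) + tr(spd_log(K) - spd_log(Matrix(P)))
println((lhs, rhs))
```

In the actual method the trace on the right-hand side is not formed explicitly but estimated stochastically; the closer $P$ is to $K$, the smaller the term that has to be estimated.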

Wenger et al. show in theory and through experiments that this leads to a large reduction in variance. As an example, if $A$ is a kernel matrix arising from a radial basis function (RBF) kernel in one dimension then the variance scaling that can be achieved with a suitable preconditioner can be exponential, $\mathbb{V}[\hat{T}_P] = O(\exp(-c\,m))$.

The paper by Wenger et al. is very well written and the code is already available in GPyTorch.

Randomized Quasi Monte-Carlo (RQMC)

Quasi Monte Carlo methods (QMC) aim to improve on Monte Carlo integration. Whereas basic Monte Carlo methods draw samples independently, quasi Monte Carlo methods draw samples from a dependent distribution chosen such that for classes of integrands better convergence rates are obtained. Typically QMC methods start with a uniform distribution in the hypercube $[0,1]^d$. We can map the hypercube $[0,1]^d$ to a domain such as $\mathbb{R}^d$ using the inverse cumulative distribution function (inverse CDF) of a chosen distribution. For example, for the standard Normal distribution the inverse CDF would be the Normal quantile function. QMC points are deterministic and this determinism would lead to unavoidable bias when used for sampling. An effective remedy is to randomize QMC methods once more, by shifting all generated points using a randomly chosen offset, modulo one. This is the RQMC construction and it guarantees that the marginal distribution of every point follows the target distribution.

To see intuitively how selecting dependent samples could lead to better properties, here is a visual example of 64 multivariate Normal samples in 2D as used in Monte Carlo methods such as the Gaussian trace estimator:

Now, for comparison, the following Figure shows a draw of marginally Normal-distributed points generated with a RQMC construction, implemented by the following Julia code using the Sobol.jl package.

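using Sobol, Distributions   # SobolSeq, next!; Normal, quantile

M = 64
d = 2
points = zeros(M, d)
sobolseq = skip(SobolSeq(d), M)   # skip the initial portion of the sequence
for m = 1:M
    points[m,:] = Sobol.next!(sobolseq)
end
points = mod.(points .+ rand(1,d), 1.0)   # random shift modulo one
points = quantile.(Normal(), points)      # map to marginally Normal points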

As you can see the points are more equally spaced out. The hope with RQMC methods is that such more homogeneous spacing improves the rate of the Monte Carlo average.

Formally, the starting point of RQMC methods is to assume an integration problem over a function $f: [0,1]^d \to \mathbb{R}$. Here, for the purpose of trace estimation, we can define our function as

$$f(u) = (\Psi^{-1}(u))^T A \, \Psi^{-1}(u),$$

where $\Psi^{-1}$ is the standard Normal quantile function, applied elementwise. We have

$$\int_{u \in [0,1]^d} f(u) \,\textrm{d}u = \textrm{tr}(A).$$

This construction is beneficial if we can approximate the integral of $f$ over the $d$-dimensional unit cube effectively. This is what randomized QMC methods do. The theory of most QMC results requires $f$ to satisfy bounded variation conditions on partial derivatives ("bounded variation in the sense of Hardy and Krause", aka BVHK), but these conditions can be difficult to verify. Here $f$ has unbounded derivatives and even $\Psi^{-1}$ itself is unbounded when approaching the boundary at zero or one. Nevertheless, we can still go ahead and simply apply RQMC methods to assess their performance empirically. This is popular practice in quantitative finance and other applications of RQMC methods, and as a safety net RQMC methods typically perform no worse than plain Monte Carlo. They have also been used successfully in other applications in machine learning, e.g. for variational inference in (Buchholz et al., 2018).

RQMC trace estimator. The proposed trace estimator is simply the Hutchinson construction but using a RQMC point set instead of independent samples. Here I use a Sobol sequence.

Comparison

For testing the estimators we will use a matrix extracted from a recent diffusion model for image generation. This model generates 32x32x3 ImageNet images and in order to compute the training objective we need to estimate the trace of a 3072-by-3072 matrix. I extracted this implicit matrix by performing 3072 matrix-vector products with the canonical basis vectors. The matrix is quite benign, is positive-definite and has a rather smooth spectrum (see plot below). I assume these nice properties are present in most image diffusion models.

I ran 500 replicates of the following experiment: draw $z^{(m)} \sim \mathcal{N}_d(0,I)$, $m=1,2,\dots,250$, and pass this vector to all estimators. I record the estimate after each value of $m$ for each replicate. Then I estimate the variance of the estimator, as well as its bias. All estimators are unbiased for all values of $m$, as expected, so the main quantity of interest is the variance as a function of $m$.

We can understand the variance behaviour best in a log-log plot because relationships of the form $y=b x^{\alpha}$ become linear in the log-log plot, $\log y = \log b + \alpha \log x$, and if the behaviour is well modelled as a line in the log-log plot, then the slope coefficient $\alpha$ gives us the scaling behavior as $M \to \infty$. For example, simple Monte Carlo estimates have variance behavior $M^{-1}$ so $\alpha = -1$. Any value smaller than $-1$ denotes an improvement over simple Monte Carlo. Randomized Quasi Monte Carlo methods can achieve $\alpha = -2$ for example, (Gerber and Chopin, 2015).
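Reading off the slope amounts to a two-parameter least-squares fit in log-log space. Here is a minimal sketch with a hypothetical helper `fit_power_law`, applied to synthetic variance curves rather than to the experimental data.

```julia
using LinearAlgebra

# Least-squares fit of log y = log b + α log x; returns (b, α).
function fit_power_law(x, y)
    X = [ones(length(x)) log.(x)]
    coef = X \ log.(y)
    return exp(coef[1]), coef[2]
end

Ms = [1, 2, 4, 8, 16, 32, 64, 128]
vars_mc = 3.0 ./ Ms           # synthetic variances following b * M^(-1)
vars_fast = 3.0 ./ Ms .^ 2    # synthetic variances following b * M^(-2)
println(fit_power_law(Ms, vars_mc))
println(fit_power_law(Ms, vars_fast))
```

On real variance estimates the fit is noisy for small $M$, so it is worth looking at the plot before trusting a single slope number.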

Hutch++ estimator: Unfortunately, despite solid theory in the paper, I have not been able to observe practical improvements over even the simple Gaussian trace estimate on my test matrix.

Bayesian Estimation

A classic method for approaching estimation problems is Bayesian decision theory. (Sidenote: I have mentioned (Parmigiani and Inoue, "Decision Theory: Principles and Approaches", 2009) in my blog before, but it really is a wonderful introduction to the topic.)

The key steps in the Bayesian approach are: 1. write down what you know; 2. write down how what you know relates to what you would like to know; and 3. make optimal decisions by optimizing expected utility. This recipe is simple and elegant in principle but becomes challenging quickly, as we will see shortly for trace estimation.

Benefits and Pitfalls of Bayesian Estimation

Before we look at trace estimation, I want to give one concrete example of the risks but also benefits of the Bayesian approach to estimation. This is the example of estimating the entropy of a discrete random variable discussed on this blog before. A short summary is this: between 1993-1995 David Wolpert and David Wolf proposed a sound Bayesian approach to the problem, using a standard Dirichlet-Multinomial model, which allows for efficient estimation due to conjugacy. The model appears elegant, and has support everywhere, thus can recover the true entropy and is asymptotically unbiased as well.

However, six years later, in 2001, Ilya Nemenman and colleagues found grave flaws in this benign looking Bayesian approach: the prior almost completely specifies the entropy, i.e. the prior predictive is highly concentrated when samples from the Dirichlet distribution, i.e. probability vectors, are mapped to their entropy. The full story is in my prior blog article.

It is really nice that the story does not end here: (Nemenman, Shafee, and Bialek, "Entropy and inference, revisited", 2001) proposed to add one more hyperprior layer to the Dirichlet-Multinomial model and chose this hyperprior to be maximally uninformative with respect to entropy, akin to a reference prior approach, but targeted to entropy inference. This estimator, the NSB estimator of entropy, is still state-of-the-art for estimating the entropy of discrete random variables, dominating almost all other methods in terms of RMSE and bias in a wide variety of practical distribution types. However, it is computationally expensive compared to most other entropy estimates.

This story is very concrete but the lessons implied are general:

Bayesian estimation relies on a suitable prior, and whether a prior is suitable or not also depends on the implied prior predictive over the quantity of interest.
It may be hard to construct suitable uninformative priors, and it may not be obvious when to call a model a success.
When a suitable prior can be designed, the Bayesian approach uses all information in the data, and can provide accurate estimates with uncertainty quantification.
There may be a tradeoff between computational efficiency and suitability of the model.

Bayesian Trace Estimation

For Bayesian trace estimation we can propose the following directed graphical model.

The unknown matrix $A$ is assumed to come from a prior $p(A)$ and $T=\textrm{tr}(A)$ is the implied distribution over the trace. $z^{(m)} \sim p(z)$ independently, for example $z^{(m)} \sim \mathcal{N}_d(0,I)$. We then observe $y^{(m)} = A z^{(m)}$ and are interested in $p(T|(z^{(m)},y^{(m)})_{m=1,\dots,M})$.

To make things concrete we can assume $A$ is symmetric and model $A_{ij} \sim \mathcal{N}(0,\sigma^2)$ for $i \leq j$. Thus $A \sim \mathcal{N}(\mu, \Sigma)$ with $\mu = 0_n$ and $\Sigma = \sigma^2 I_n$, where $n=d(d+1)/2$ is the number of upper-triangular elements in the unknown $A$, so we index with coordinates of $A$, like so $\mu_{(i,j)}$, and $\Sigma_{(i,j),(k,l)}$.

When observing $y^{(m)}$ we know with certainty that

$$y^{(m)} = A z^{(m)}$$

must hold for any possible $A$. Thus we can remove all matrices from our prior which violate this equality constraint. This means we condition our multivariate Normal belief $A \sim \mathcal{N}(A; \mu, \Sigma)$ on a subspace implied by the equality. Doing so is not a standard operation on multivariate Normals, but is possible and results in a rank-deficient multivariate Normal. The result is a new posterior belief $A \sim \mathcal{N}(A; \mu', \Sigma')$.

Graphically, in 2D, this conditioning on a subspace looks as in this figure. (The detailed equations for conditioning a multivariate Normal on a subspace are in the appendix below.) The black dots are samples of possible matrices from the prior, and after conditioning on an observed subspace we retain a rank-deficient posterior, visualized by blue samples.

Thus, for our simple choice of multivariate Normal prior on $A$ we can, for each observed $(z^{(m)}, y^{(m)})$ pair update our posterior beliefs analytically. (This update is relatively expensive and may preclude the Bayesian approach entirely, see discussion below.)

At any time, we can also compute the closed-form posterior over the trace itself, as it is a sum of Normal random variables and thus Bienayme's identity applies and moreover the resulting sum is again Normal. We have $T \sim \mathcal{N}(\mu_T, \sigma^2_T)$, with

$$\mu_{T} = \sum_{i=1}^d \mu_{ii},$$

$$\sigma^2_T = \sum_{i=1}^d \Sigma_{(i,i),(i,i)} + 2 \sum_{i=1}^d \sum_{j=i+1}^d \Sigma_{(i,i),(j,j)}.$$
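For small $d$ the whole model fits into a short sketch. It rests on assumptions of mine rather than on the appendix equations: I use the prior $\mu = 0$, $\Sigma = \sigma^2 I$ from above, write the constraint $y = A z$ as a linear map $H a = y$ on the vector $a$ of upper-triangular entries, and condition on all observations at once via the pseudo-inverse, which for this prior gives $\mu' = H^{+} y$ and $\Sigma' = \sigma^2 (I - H^{+} H)$. The names `tri_index`, `observation_matrix` and `trace_posterior` are hypothetical.

```julia
using LinearAlgebra

# Index of A[i, j] (i <= j) in the vector a of upper-triangular entries, column by column.
tri_index(i, j) = i + (j - 1) * j ÷ 2

# Linear map H with H * a == A * z for symmetric A parameterised by a.
function observation_matrix(z)
    d = length(z)
    H = zeros(d, d * (d + 1) ÷ 2)
    for i in 1:d, j in 1:d
        H[i, tri_index(min(i, j), max(i, j))] += z[j]
    end
    return H
end

# Posterior over T = tr(A) under the prior a ~ N(0, σ² I), after observing Y = A * Z exactly.
function trace_posterior(Z, Y; σ = 1.0)
    d, M = size(Z)
    n = d * (d + 1) ÷ 2
    H = vcat([observation_matrix(Z[:, m]) for m in 1:M]...)   # stacked constraints
    Hp = pinv(H; rtol = 1e-10)
    μ = Hp * vec(Y)                      # posterior mean of a
    Σ = σ^2 * (I - Hp * H)               # rank-deficient posterior covariance of a
    c = zeros(n)
    for i in 1:d
        c[tri_index(i, i)] = 1.0         # selects the diagonal entries A[i, i]
    end
    return dot(c, μ), dot(c, Σ * c)      # (μ_T, σ²_T)
end

d = 4
G = randn(d, d)
A = (G + G') / 2
Z = randn(d, d)
for M in (1, 2, d)
    μT, σ2T = trace_posterior(Z[:, 1:M], A * Z[:, 1:M])
    println("M = $M: posterior mean $μT, variance $σ2T, true trace $(tr(A))")
end
```

After $M = d$ observations the posterior variance collapses to zero, in contrast to the $1/M$ behaviour of the Hutchinson average. The price is the pseudo-inverse of a matrix with $d(d+1)/2$ columns, which is exactly the computational burden discussed above.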

Overall this seems a satisfactory if computationally heavy model for trace estimation. But we can go further with the Bayesian approach and choose $z^{(m)}$ intelligently using adaptive experimental design techniques.

Adaptive Experimental Design

Experimental design refers to making intelligent choices about what to measure in order to draw more informative inferences. In static experimental design one chooses a set of things to measure a priori, selecting measurements that for example are on average not too strongly correlated in order to maximize the expected information content of the measurements. The RQMC approach would be a simple example of a static experimental design because the $z^{(m)}$ choices are dependent for different values of $m$.

In adaptive experimental design we consider a sequential setting and thus sequentially decide what to measure based on all observations measured up to that point. You can think of adaptive experimental design as a simplification to the general reinforcement learning setup: your actions (what to measure) do not have an effect on the state of the world, and your reward is internal in terms of what information you have gained.

As a simple example, consider a paper survey setting: a static experimental design consists of a printed questionnaire with a set of well-chosen questions. An adaptive experimental design would only show one question to you at first and pick the next question based on your answer to all prior questions.

Personal anecdote. I first used Bayesian experimental design to great effect in my work with Microsoft Israel on time-of-flight (ToF) camera technology (around 2013-2017). A time-of-flight camera is an active sensing system where time-modulated light is emitted into the world and the light bounces are recorded back on a camera, whose sensitivities are also time-modulated. By using Bayesian experimental design methods we were able to design the actively controllable part of the system and halve the mean absolute range estimation error (Section 7 in TPAMI 2016 paper) and to learn to measure maximally complementary information over time (dynamic time-of-flight CVPR 2017 paper). The Bayesian ToF approach shipped in a few thousand first-gen Hololens prototypes to developers but was replaced a year later with a different sensor and algorithm and unfortunately the entire Microsoft Israel time-of-flight team was let go, thus four years of hard work and my collaboration with an outstanding team in Israel, Amit Adam in particular, came to an end. (That is a separate story for another day.)

Later, in 2018, Cheng Zhang, Chao Ma, myself, and colleagues at Microsoft used adaptive Bayesian experimental design in more general settings such as questionnaire design (ICML 2019 paper, NeurIPS 2019 paper) and Cheng and team productized much of this work, now available through Azure and shipped in successful products.

For a wonderful introduction to experimental design and decision theory more generally, I highly recommend the book (Parmigiani and Inoue, "Decision Theory: Principles and Approaches", 2009).

For trace estimation, here is what the adaptive experimental design model would look like, visualized as an influence diagram.

Choice nodes $z^{(m)}$ are now rectangular to indicate that they are under our control and not independent random variables as before. How should we choose $z^{(m)}$? A natural approach is to select $z^{(m)}$ as the one that maximizes the reduction in posterior uncertainty or variance. For this, denote all prior observations as $\mathcal{D}_{<k} := \{(z^{(i)},y^{(i)})\}_{i < k}$. Then we can choose $z^{(m)}$ as

$$z^{(m)} = \textrm{argmax}_{z \in \mathbb{R}^d} \mathbb{V}[T | \mathcal{D}_{<m}] - \mathbb{E}_{(y,A) \sim p(y|A,z) \, p(A| \mathcal{D}_{<m})}\left[ \mathbb{V}[T | \mathcal{D}_{<m}, (z, y)] \right]. $$

This expression looks somewhat complex but here are some interpretation aids:

It reads "variance before minus variance after". The "variance after", i.e. after additionally measuring $(z,y)$ is always smaller than the "variance before". Hence the objective measures the reduction in variance, which we want to maximize.
The "variance after" term is also contained in an expectation over $(y,A)$. How come? We do not know $y$ and $A$, so we take an expectation over our best current beliefs up to that point.

The optimization problem may or may not have a closed-form solution; I did not investigate this. Instead, I did a simple implementation where I sample 100 points from $\mathcal{N}(0,I)$, then pick the point that maximizes the objective.
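The candidate search can be sketched as follows. This is not the code used for the experiment, only a minimal version under the multivariate Normal model above. One simplification is worth stating: for exact linear observations of a Normal belief, the posterior covariance does not depend on the observed $y$, so the expectation over $(y,A)$ in the "variance after" term reduces to a single covariance update per candidate. Maximizing the variance reduction is then the same as minimizing the "variance after". The helper names, the coordinate ordering, and the pseudo-inverse are illustrative assumptions.

```python
import numpy as np

def upper_index(d):
    """Row-major list of coordinates (i, j) with i <= j (an assumed ordering)."""
    return [(i, j) for i in range(d) for j in range(i, d)]

def observation_matrix(z, idx):
    """Matrix H with H @ a == A @ z for the upper-triangular entries a of symmetric A."""
    H = np.zeros((len(z), len(idx)))
    for col, (i, j) in enumerate(idx):
        H[i, col] += z[j]
        if i != j:
            H[j, col] += z[i]
    return H

def trace_variance_after(Sigma, z, idx):
    """V[T | D, (z, y)]; independent of y in the linear-Gaussian model."""
    H = observation_matrix(z, idx)
    S = H @ Sigma @ H.T
    Sigma_new = Sigma - Sigma @ H.T @ np.linalg.pinv(S, 1e-10) @ H @ Sigma
    diag = [k for k, (i, j) in enumerate(idx) if i == j]
    return float(Sigma_new[np.ix_(diag, diag)].sum())

def choose_probe(Sigma, d, rng, n_candidates=100):
    """Sample candidates from N(0, I) and keep the one with the smallest variance after."""
    idx = upper_index(d)
    candidates = rng.standard_normal((n_candidates, d))
    scores = [trace_variance_after(Sigma, z, idx) for z in candidates]
    best = int(np.argmin(scores))
    return candidates[best], scores[best]
```

Since the observations are noise-free, rescaling a candidate $z$ does not change its score, so only the direction of $z$ matters in this search.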

Here is a small experiment. The experiment is smaller than small: with $d=16$ I sampled $A \sim \textrm{symmat}(\mathcal{N}_k(0,\Sigma_0))$, where $k=d(d+1)/2$ and $\Sigma_0$ is chosen such that $\Sigma_{(i,i),(i,i)}=\sigma_d^2$ and $\Sigma_{(i,i),(j,j)}=\sigma_o^2$ for $i\neq j$. I used $\sigma_d=200$ and $\sigma_o=5$. This prior encodes diagonal-dominant matrices. To give the maximal possible edge to the Bayesian model I sampled $A$ from this prior, i.e. there is no misspecification in this experiment. I ran 500 replicates of the trace estimation experiment, so the plots will be a bit noisy, but here are the results.

The Bayes model has an order of magnitude lower variance than the next best method (RQMC). The Bayesian method is not unbiased, so this low variance could be due to strong influence of the prior, so let's look at the root-mean-squared-error (RMSE) as well.

Again both Bayes methods are doing very well. If my implementation is correct, this must in fact be the case, as the Bayes estimate is optimal and thus the model achieves the Bayes risk in terms of RMSE. But is the model biased? The limited experiments do not allow a conclusion except that the plot shows that the unbiased methods show up as biased due to the estimated bias itself having an estimation error and the Bayesian models being in that same range.

Difficulties of the Bayesian approach

Clearly, the Bayesian approach to trace estimation is not ready to be used due to excessive runtime requirements. It may be possible to intelligently perform the same computation in terms of sparse updates or implicit representations of the evolution of $\Sigma$, and thus make the Bayesian approach relevant.

Conclusion and Future Directions

We looked at a few existing estimators of the trace of a matrix. Here is a list of ideas for research in this area:

Sequential or not? The estimators we have discussed can be divided into two classes. In the first class we have static estimators that can be parallelized because no computation depends on the output of prior computation. In the second class we have estimators that do some clever sequential processing (estimating control variates, estimating a low-rank approximation, or similar) and then benefit in a second stage. In practice, for deep learning applications, we may be able to get the best of both worlds by amortizing computation over time: instead of treating one optimization step as a closed-world, we can estimate the necessary quantities over multiple steps, for example in the control variate or low-rank approximation case. So the dichotomy between static and sequential is not as hard, which brings me to the following concrete idea.
Parameterized control variates: in ML applications we often need trace estimates where the matrix $A$ is a function of other quantities. For example, in diffusion models the matrix $A$ may depend on an input vector or time variable, e.g. $A=A(x,t)$, and we do not have one trace estimation task but a large number of unique tasks with varying $x$ and $t$. This makes Hutchinson's estimator so popular: it is cheap in this setting, and this dependence on inputs seems to rule out approaches such as the control variate method which requires multiple samples. However, in reinforcement learning control variates called state-dependent baselines are commonly employed for variance reduction in policy gradient methods, e.g. (Tucker et al., 2018). So if our matrix has dependencies such as $A(x,t)$ it may be beneficial to simultaneously learn a cheap control variate $B(x,t)$, perhaps as an auxiliary output of the main model, in order to amortize computation over learning iterations, which in effect is a simple form of learning-to-learn more efficiently.
Bayesian trace estimation? Conceptually the Bayesian approach is particularly attractive for trace estimation as the latent structure of the problem is exactly known. In practice I have my doubts whether this approach will be useful in deep learning, for three reasons: 1. already the simplest faithful model I could come up with is computationally very expensive; 2. it seems challenging to find suitable priors $p(A)$ over matrices for two reasons, a) standard choices such as Wishart distributions are not closed under subspace conditioning so must be handled using even more expensive computational approaches, and b) trace estimation is used in a wide variety of domains and a generally useful yet uninformative prior seems too much to ask for; and 3. unbiasedness is highly desirable in most deep learning uses of trace estimation and Bayesian estimates are generally biased in the small sample setting and only asymptotically unbiased for $M \to \infty$, whereas Hutchinson's estimator is unbiased for any $M$. Finding a general prior for matrices that is computationally efficient under subspace conditioning could be interesting. Perhaps a good starting point would be the multivariate Normal distribution, but then marginalizing most dimensions away. This would make computation more efficient while retaining tractability.

So there you have it. Given my understanding so far, I even venture to make some recommendations for the current estimators:

First, use Hutchinson's estimator or the Gaussian trace estimator. Try both and measure the variance.
If you can afford $M > 1$ and $d < 21,201$: give the RQMC approach a try; it should be simple to implement, with SciPy, TensorFlow, and PyTorch all supporting Sobol sequence generation. (The restriction to $d < 21,201$ is not intrinsic to the approach but a practical constraint due to the limited availability of so-called direction numbers.)
If the variance of your trace estimates is a major bottleneck in your application, try the diagonal control variate approach, perhaps learning this control variate as part of your learning objective if the matrix is varying with the inputs to your network.
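To make the first two recommendations concrete, here is a minimal sketch that implements the Hutchinson (Rademacher) and Gaussian estimators behind a matrix-vector product interface, together with a helper to measure their variance empirically. The RQMC variant is one possible construction and an assumption of this sketch: it thresholds scrambled Sobol points at 0.5 to obtain dependent $\pm 1$ probes, which may differ from other RQMC trace estimators. All function names are illustrative; `scipy.stats.qmc.Sobol` is the SciPy Sobol generator.

```python
import numpy as np
from scipy.stats import qmc

def mean_quadratic_form(matvec, Z):
    """Average of z^T A z over the rows z of Z, using only products A @ z."""
    return float(np.mean([z @ matvec(z) for z in Z]))

def hutchinson(matvec, d, M, rng):
    """Hutchinson's estimator with i.i.d. Rademacher probes."""
    Z = rng.choice([-1.0, 1.0], size=(M, d))
    return mean_quadratic_form(matvec, Z)

def gaussian(matvec, d, M, rng):
    """Gaussian trace estimator with i.i.d. N(0, I) probes."""
    Z = rng.standard_normal((M, d))
    return mean_quadratic_form(matvec, Z)

def rqmc_rademacher(matvec, d, M, rng):
    """Assumed RQMC variant: scrambled Sobol points thresholded to +-1 probes.
    M should be a power of two to keep the Sobol balance properties."""
    U = qmc.Sobol(d=d, scramble=True, seed=rng).random(M)
    Z = np.where(U < 0.5, -1.0, 1.0)
    return mean_quadratic_form(matvec, Z)

def empirical_variance(estimator, matvec, d, M, rng, replicates=200):
    """Variance of an estimator over independent replicates."""
    return float(np.var([estimator(matvec, d, M, rng) for _ in range(replicates)]))
```

Calling `empirical_variance(hutchinson, matvec, d, M, rng)` and `empirical_variance(gaussian, matvec, d, M, rng)` on a representative `matvec` from your application is one simple way to "try both and measure the variance".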

Acknowledgements. I thank Yang Song for careful reading and feedback on the draft, including a number of corrections and pointing me to two more uses of trace estimation, and Florian Wenzel for corrections, references, and improvements to the quasi Monte Carlo methods.

Appendix

Conditioning a multivariate Normal on a subspace

First, the following result: if $x \sim \mathcal{N}(\mu,\Sigma)$, and

$$T(x) := Ax + b,$$

then $T(x) \sim \mathcal{N}(A\mu + b, A\Sigma A^T)$. Furthermore we have joint Normality,

$$\left[\begin{array}{c}x\\T(x)\end{array}\right] \sim \mathcal{N}\left( \left[\begin{array}{c}\mu\\ A\mu + b\end{array}\right], \left[\begin{array}{cc}\Sigma & \Sigma A^T\\ A\Sigma & A\Sigma A^T\end{array}\right] \right).$$

Observing $y=T(x)$ we have $x | y \sim \mathcal{N}(\bar{\mu},\bar{\Sigma})$, with

$$\bar{\mu} = \mu + \Sigma A^T (A \Sigma A^T)^{-1} (y-(A\mu + b)),$$

$$\bar{\Sigma} = \Sigma - \Sigma A^T (A \Sigma A^T)^{-1} A \Sigma.$$
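As a small numerical check of these formulas, the following sketch conditions a 2D Normal on the line $x_1 + x_2 = 1$, mirroring the 2D picture in the main text. The function name and the example numbers are illustrative assumptions; the code uses a linear solve in place of the explicit inverse, which requires $A \Sigma A^T$ to be invertible.

```python
import numpy as np

def condition_on_affine(mu, Sigma, A, b, y):
    """Return (mu_bar, Sigma_bar) of x | A x + b = y for x ~ N(mu, Sigma)."""
    S = A @ Sigma @ A.T
    # G = Sigma A^T S^{-1}; S is symmetric, so solve S G^T = A Sigma.
    G = np.linalg.solve(S, A @ Sigma).T
    mu_bar = mu + G @ (y - (A @ mu + b))
    Sigma_bar = Sigma - G @ A @ Sigma
    return mu_bar, Sigma_bar

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
A = np.array([[1.0, 1.0]])   # observe T(x) = x_1 + x_2
b = np.array([0.0])
y = np.array([1.0])

mu_bar, Sigma_bar = condition_on_affine(mu, Sigma, A, b, y)
# mu_bar    = [0.375, 0.625]                      (lies on the line x_1 + x_2 = 1)
# Sigma_bar = [[0.4375, -0.4375], [-0.4375, 0.4375]]   (rank one)
```

The posterior covariance has rank one: all remaining uncertainty lies along the direction $(1,-1)$ inside the observed subspace, and $A \bar{\Sigma} A^T = 0$.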

Longevity and Supplements

2021-02-03T22:00:00+00:00

TLDR: in the past decade longevity has emerged as a serious research field. Several studies now indicate that a number of safe supplements may extend lifespan and healthspan in adult humans.

Note: I normally blog about statistics and machine learning. This article is different and I hope my readers will find the information interesting.

Healthy Ageing

"I would like to live long and in good health."

-- Approximately 99 percent of humans.

Modern societies enable us to live in good health. Investment into healthcare systems, emergency response, medical research, pharmaceutical industries, but also elderly care facilities, gyms and sports, as well as educational policies and taxes for products causing ill health such as alcohol and tobacco all lead to better health outcomes and longer lives. Even policies and politics that lead to world peace and lower violence have remarkable outcomes on longevity.

These efforts have been remarkably effective, and globally so: in 1900 the global average life expectancy at birth was just 31 years. In 2020 it is 72.6 years. The average increase per year in life expectancy globally between 2000 and 2020 has been 0.46 percent.

Despite this amazing progress, the idea of systematically aiming scientific research at directly extending lifespan or even to overcome death entirely is recent. Of course, the idea of living forever---to cheat death---is not new. In fact, it appears in the oldest surviving story. However, before the last decade most serious scientists did not consider it possible to overcome death; after all any empiricist can see that all humans eventually have to die. Yuval Harari puts this as follows:

"Even a few years ago, very few doctors or scientists would seriously say that they are trying to overcome old age and death. They would say no, I am trying to overcome this particular disease, whether it's tuberculosis or cancer or Alzheimers. Defeating disease and death, this is nonsense, this is science fiction."

--- Yuval Harari, in a 2015 interview with Daniel Kahneman

Not only has the last decade changed the view of serious scientists, there are now organized efforts and serious funding. The SENS foundation, founded in 2009, has limited funding but coordinates a number of research programmes and investments in longevity startups. Calico, founded in 2013 as part of Google, has a mission to combat ageing and 2.5B USD investment. Juvenescence, a UK-based longevity startup secured 100M USD investment in 2019. There are many other smaller companies and institutional investors now regularly fund these startups. For investors it makes sense: not only is the total addressable market of longevity products all of humanity, but it also seems that a technical solution to death is in reach within our lifetime. Yuval Harari again puts it better than I could:

"People never die because the Angel of Death comes, they die because their heart stops pumping, or because an artery is clogged, or because cancerous cells are spreading in the liver or somewhere. These are all technical problems, and in essence, they should have some technical solution. And this way of thinking is now becoming very dominant in scientific circles, and also among the ultra-rich who have come to understand that, wait a minute, something is happening here. For the first time in history, if I'm rich enough, maybe I don't have to die."

--- Yuval Harari, in the same 2015 interview

In parallel to these more visible investments and research efforts, there is also a renewed interest in the study of ageing-related effects in existing medication and nutritional supplements. In the remainder of this post we will look at some of the safer supplements and what we now know about them.

But first, a disclaimer.

Disclaimer: This article is not a substitute for consulting your physician about which supplements may or may not be right for you. I do not endorse supplement use or any product or supplement vendor, and all descriptions here are for scientific interest. Also: I am not a medical doctor or physician.

With the disclaimer out of the way, let's start.

Alpha-Ketoglutarate (AKG)

Supplement name: Arginine Alpha-Ketoglutarate (AAKG)
Typical supplement dosage: 2000 mg/day
Typical supplement price: 0.35 USD/day

Alpha-Ketoglutarate is naturally present in human blood plasma and is part of the metabolic cycle. However, naturally occurring AKG levels decrease severely with age, with a ten-fold reduction between the age of 40 and 80 years. As a supplement AKG has an excellent safety record and is widely and freely available, commonly used in the bodybuilding community. Two reviews on what is known about the effects of AKG in the human body are (Wu et al., Biomol Ther, 2016) and (Liu et al., BioMed Research International, 2018).

AKG was first identified as a possible longevity supplement in the nematode C. elegans animal model, (Chin et al., Nature 2014), (free PDF). In that study the authors demonstrated a concentration-dependent improvement of lifespan, achieving an almost 50 percent increase in lifespan and delaying age-related diseases. Moreover, Chin et al. identified one causal mechanism for these benefits.

(Shahmirzadi et al., Cell, 2020), (preprint PDF) report on a study done in mice. They demonstrate robust improvements in lifespan and healthspan when supplementing the mouse diet with AKG, with lifespan increased by about 10 percent (slightly more for female mice, slightly less for male mice). In particular, the AKG supplementation postpones the occurrence of a number of aging phenotypes such as deterioration of color in the fur and weakened grip strength of the mice, and improvements are also present when only starting AKG supplementation in the later half of the mouse's life.

The study concludes:

"Given its GRAS status and human safety record, our findings point to a potential safe human intervention that may impact important elements of aging and improve quality of life in the elderly population."

--- Shahmirzadi et al., 2020

Glucosamine Sulphate

Supplement name: Glucosamine Sulphate
Typical supplement dosage: 1500 mg/day
Typical supplement price: 0.10 USD/day

Glucosamine is a supplement commonly used to manage joint pain as it has been shown to be able to relieve joint pain. A recent study (Ma et al., BMJ, 2019) based on large sample observational UK Biobank data showed that regular use of glucosamine supplements is associated with lowering the risk of cardiovascular disease. Another recent study, (Li et al., Annals of the Rheumatic Diseases, 2020), also based on large sample observational UK Biobank data ($n=495,077$) finds compelling evidence that Glucosamine may lead to overall reduced mortality. While Li et al. do not perform a randomized controlled trial and hence such observational data does not imply causation, they do consider confounders and perform a detailed analysis on large sample subgroups. They carefully note:

"In general, with the current observational study design the possibility of residual confounding due to imprecise measurements or unknown factors cannot be excluded for all findings in our study, despite our careful adjustment of all measured confounders."

However, the effect size in Li's study is large: assuming an absence of non-measured confounders, the hazard ratio for overall mortality over a nine year follow-up period is 0.85 with regular glucosamine supplementation, meaning that the mortality hazard is 15 percent lower than in the group not taking glucosamine. For cardiovascular mortality, respiratory mortality, and digestive mortality the hazard ratio is even lower at 0.82, 0.73, and 0.74, respectively. The authors conclude that

"regular glucosamine supplementation was associated with lower mortality due to all causes, cancer, CVD, respiratory and digestive diseases."

Glucosamine supplementation is considered safe by the US National Center for Complementary and Integrative Health and no side effects are known when used regularly for multiple years.

NAD+ boosters: NMN / NR

Supplement name: Nicotinamide riboside chloride (NR)
Typical supplement dosage: 300-1000mg NR (sold as "TRU NIAGEN"), taken in the morning on an empty stomach
Typical supplement price: 1.80 USD/day (300mg dose)

and

Supplement name: Nicotinamide mononucleotide (NMN)
Typical supplement dosage: 500-2000mg NMN, taken in the morning on an empty stomach
Typical supplement price: 0.34 USD/day (500mg dose)

NAD+, nicotinamide adenine dinucleotide, is central to metabolism in many organisms, including humans. NAD+ levels decrease strongly with age, (Massudi et al., PLoS ONE, 2012). The reasons for this decline are not yet fully understood but are actively researched, (Schultz and Sinclair, Cell Metabolism, 2017).

While NAD+ decreases with age, does the low NAD+ level cause aspects of ageing? Schultz and Sinclair put this poetically,

"The discovery of nicotinamide adenine dinucleotide (NAD+) as a "cozymase" factor in fermentation has its 110th anniversary this year [2016]. Of the two billion people who were alive back in 1906, only 150 people remain. Interestingly, NAD+ itself may be the reason for their longevity."

--- Schultz and Sinclair, 2017

A growing line of research in the last 15 years proposes to artificially elevate NAD+ levels to improve health in ageing individuals. For a number of reasons NAD+ cannot be directly used as supplement. Instead a number of precursors are used. Two of these, NR and NMN, have been shown to lead to dose-dependent increase of bioavailable NAD+ in humans, (Trammell et al., Nature Communications, 2016), and (Poddar et al., 2019). NR is a vitamer of Vitamin B3 and present in normal human diet. Likewise NMN is found naturally in small amounts in some foods such as broccoli and avocado.

In an important paper, (Zhang et al., Science, 2016), NR supplementation has extended lifespan in mice by 10 percent. In particular, NR supplementation has direct observable effects on markers of ageing such as muscle function. The authors conclude that:

"Our findings suggest that NAD+ repletion may be revealed as an attractive strategy for lengthening mammalian life span."

--- Zhang et al., 2016

There is now strong causal evidence that NAD+ supplementation in the form of NMN or NR has health and ageing benefits in mice through a broad set of mechanisms, (Yoshino et al., Cell Metabolism, 2018), (Poddar et al., 2019). Recently NAD+ supplementation via NMN has even been shown to restore female fertility in ageing mice, (Bertoldo et al., Cell Reports, 2020).

For both NR and NMN there are no large scale studies in humans. A small scale study on NR in human adults, (Martens et al., Nature Communications, 2018) showed some cardiovascular benefits but the small scale of the study makes conclusions difficult.

The safety of NMN and NR is only partially established. Both occur naturally in a normal human diet but only in small amounts, much smaller than the amount used in supplements. NR is generally recognized as safe and a randomized controlled human trial, (Conze et al., Scientific Reports, 2019), finds no adverse effects over a duration of eight weeks. In comparison NMN has limited safety studies, (Irie et al., 2019). However, for both NMN and NR there is no long-term safety data available for humans.

Overall NMN and NR supplementation are promising supplements to improve healthspan in humans, but a definite conclusion is still open. Definite results will likely be available within the next ten years. In practical terms one big disadvantage at the moment is that NR is comparatively expensive.

Resveratrol

Supplement name: Resveratrol
Typical supplement dosage: 500-1000 mg/day
Typical supplement price: 0.15-0.30 USD/day

Resveratrol is a substance naturally occurring in plants in low concentrations. Among foods, red wine has one of the highest concentrations of resveratrol; however, even the highest concentrations occurring naturally in food (up to 15mg per liter) are well below the doses used when taking resveratrol as a supplement.

There is a wide body of evidence that supplementation with resveratrol has strong beneficial health effects in a wide variety of animals. A comprehensive review article, (Baur and Sinclair, Nature Reviews, 2006), summarizes the state of knowledge around 2006:

"Resveratrol, a constituent of red wine, has long been suspected to have cardioprotective effects. Interest in this compound has been renewed in recent years, first from its identification as a chemopreventive agent for skin cancer, and subsequently from reports that it activates sirtuin deacetylases and extends the lifespans of lower organisms. Despite scepticism concerning its bioavailability, a growing body of in vivo evidence indicates that resveratrol has protective effects in rodent models of stress and disease."

--- Baur and Sinclair, 2006

Most importantly at that time there already was evidence of lifespan and healthspan extension by resveratrol:

Yeast, 70 percent mean lifespan extension;
Nematode Caenorhabditis elegans, 18 percent mean lifespan extension;
Fly Drosophila melanogaster, 29 percent mean lifespan extension;
Turquoise killifish Nothobranchius furzeri, 56 percent mean lifespan extension.

While these are very different animals and the lifespan extension varies, the effect size is nevertheless remarkably strong. In 2006, Baur and Sinclair state:

"The question of whether enhanced SIRT1 activity and/or resveratrol treatment will increase mammalian lifespan looms large in the ageing-research community. (...) It is becoming clear that resveratrol and more potent mimetics show great promise in the treatment of the leading causes of morbidity and mortality in the Western world. So far, little evidence suggests that these health benefits are coupled with deleterious side effects. Even the trade off between individual health and reproductive potential that is characteristic of caloric restriction does not seem to occur in animals with lifespans that have been extended by resveratrol. Could resveratrol and similar molecules form the next class of wonder-drugs?"

--- Baur and Sinclair, 2006

A first study on mammals in the form of mice was also published by the same authors in 2006, (Baur et al., Nature, 2006). It demonstrated no lifespan extension by resveratrol supplementation but convincingly demonstrated that mice that are fed on a high calorie diet together with resveratrol survive as well as normally fed mice and have similar health characteristics, whereas mice fed the same high calorie diet without resveratrol suffer adverse health and shortened lifespan. Interestingly the mice fed with the high calorie diet and resveratrol were healthier but had the same body weight as the mice that only received the high calorie diet.

A result similar to the "high calorie diet" study on mice was repeated on humans: (Timmers et al., Cell Metabolism, 2011) report broad improvements of health markers on obese humans supplemented with resveratrol. The authors conclude,

"In conclusion, we demonstrate beneficial effects of resveratrol supplementation for 30 days on the metabolic profile in healthy obese males, which seems to reflect effects observed during calorie restriction. Although most of the effects that we observed were modest, they were very consistently pointing toward beneficial metabolic adaptations. Furthermore, therewere no effects on safety parameters, and no adverse events were reported."

--- Timmers et al., 2011

Given these promising findings on health benefits, what about lifespan extension in mammals and humans? A meta-analysis across studies with different species, (Hector et al., Biology Letters, 2012) could confirm the life extension effect of resveratrol but found a diminished effect for higher-order species.

(Pearson et al., Cell Metabolism, 2008) perform a study on mice that demonstrates no increase in lifespan for normally fed mice that received resveratrol supplements. But again these mice showed a marked improvement in health markers when compared to mice that did not receive resveratrol:

Significantly improved bone density at age;
Reduction in age-related cataracts at age;
Improved balance and motor coordination at age; and
Improved cardiovascular function at age.

The observed health benefits are similar to those of a low calorie diet, which has been shown to lead to both increased healthspan as well as lifespan extensions in mammals, (Pifferi and Aujard, 2019). However, here only health is improved, not lifespan.

The authors conclude:

"In conclusion, long-term resveratrol treatment of mice can mimic transcriptional changes induced by dietary restriction and allow them to live healthier, more vigorous lives. In addition to improving insulin sensitivity and increasing survival in HC mice [high calorie diet mice], we show that resveratrol improves cardiovascular function, bone density, and motor coordination, and delays cataracts, even in nonobese rodents. Together, these findings confirm the feasibility of finding an orally available DR [dietary restriction] mimetic. Since cardiovascular disease is a major cause of age-related morbidity and mortality in humans but not mice, it is possible that DR mimetics such as resveratrol could have a greater impact on humans. However, resveratrol does not seem to mimic all of the salutary effects of DR [dietary restriction] in that its introduction into the diet of normal 1-year-old mice did not increase longevity."

--- Pearson et al., 2008

Overall, the available evidence makes it likely that humans will also enjoy some health benefits at age when regularly using resveratrol supplementation. Whether resveratrol supplementation also has longevity benefits in humans is an open question.

Safety: a metastudy of clinical trials of resveratrol, (Patel et al., Annals of the New York Academy of Sciences, 2011), analyzed the data of 17 studies of resveratrol in humans and found no adverse effects for typical doses up to 1g per day in humans and minor dose-dependent side effects for doses up to 5g per day.

Multivitamin Supplementation

Supplement name: generic multivitamin supplement
Typical supplement dosage: typically 1 pill/day with dosages matching US FDA recommendations
Typical supplement price: 0.10 USD/day

General multivitamin supplements are widely available and widely used. A recent study finds that users of multivitamins often self-report better health status than clinically present, see (Paranjpe et al., 2020). In addition the study finds that no clinical differences exist between people using multivitamin supplements and people who do not.

However, vitamin deficiency is a real thing and multivitamin supplements are safe, cheap, and effectively guard against certain deficiencies.

Take vitamin D deficiency for example: (Forrest and Stuhldreher, Nutrition Research, 2011) report that 41.6 percent of US adults have a vitamin D deficiency (higher depending on skin type, with a vitamin D deficiency present in 82.1 percent of blacks, and in 69.2 percent of Hispanics). They conclude:

"Given that vitamin D deficiency is linked to some of the important risk factors of leading causes of death in the United States, it is important that health professionals are aware of this connection and offer dietary and other intervention strategies to correct vitamin D deficiency, especially in minority groups."

--- Forrest and Stuhldreher, 2011

Take vitamin B deficiency for example: (Sechi et al., Nutrition Reviews, 2016) summarize multiple studies of empirically observed vitamin B deficiencies in different populations. For elderly persons (age 65 and up) vitamin B deficiency is observed at 22.9 percent for vitamin B1, 11.7 percent for vitamin B2, and 30 percent for vitamin B9. They also report on the potential neurological impairments that come with vitamin B deficiencies and while they do not advocate for broad supplementation they conclude:

"Taken together, these findings indicate that subclinical or overt B vitamin deficiency, with frequent involvement both of the central and peripheral nervous system at all stages of life, is a global health concern, especially in selected populations at risk and in certain clinical settings."

--- Sechi et al., 2016

So overall, while you will likely not benefit from multivitamin supplementation if you have a varied diet and an active lifestyle, it is still cheap and safe and provides an additional guard against certain vitamin deficiencies which can have real health consequences.

Summary

In the last 15 years scientists have identified a number of promising supplements that yield health benefits and potentially increase lifespan. Should we all take these supplements then? Given the partial evidence we always have to make an uncertain risk/benefit tradeoff, the main risks being the known side effects and the unknown potential long-term consequences.

The potential benefits of healthspan and lifespan extension are large and we are objectively at the dawn of a revolution in our understanding of ageing and in treating ageing. We should not be surprised if any of the above supplements prove their promise in human studies within the next decade.

Not included

I intentionally did not include medications and certain supplements:

Creatine: generally a beneficial supplement for muscular function. It has been studied for possible beneficial effects on aging, but it is too early to tell, see (Smith et al., 2015).
Metformin: a prescription medicine used for treating diabetes. There is strong causal evidence of a link of Metformin to delaying ageing in mice, (Martin-Montalvo et al., Nature Comms, 2013), and strong evidence of a potential Metformin-related reduction of overall mortality in humans with effect size measured in years, (Campbell et al., Ageing Research Reviews, 2017), and a potential link to reduction of dementia, (Campbell et al., Journal of Alzheimer's Disease, 2018). A recent summary of what we know about Metformin in ageing is (Piskovatska et al., Biogerontology, 2018). There are three reasons why I do not include Metformin in the above list: 1. Metformin is a prescription medicine in most of the world, not a supplement; 2. while Metformin overall is relatively safe and has been used for decades to treat diabetes, there are very common and unpleasant side effects including diarrhoea, nausea, and vomiting; and 3. two large randomized controlled human trials evaluating Metformin specifically as a treatment for ageing are currently underway.
Rapamycin: a prescription medicine that is an immunomodulator, with an overwhelming evidence base for increasing longevity and enhancing health markers, with large effect sizes, in a wide variety of animals including mammals. A randomized double-blind placebo-controlled study, the PEARL study, to be completed in December 2023, studies the anti-ageing effects of Rapamycin in 150 humans. For an opinionated but well-written overview of the primary literature, see (Blagosklonny, Aging, 2019).

Debiasing Approximate Inference

2018-12-05T08:30:00+00:00

This year at NeurIPS 2018 the Symposium on Advances in Approximate Bayesian Inference discussed challenges and advances in approximating probabilistic inference in rich models. It was a genuinely exciting program!

I was lucky enough to give an invited talk at the event.

Title: Debiasing Approximate Inference
Abstract:

At its heart, the field of approximate inference is about trade-offs between computation and estimation accuracy: when we approximate quantities such as the evidence or posterior expectations no randomness is left and given limitless computation budget all quantities can be evaluated exactly. But given finite computation, how do we select inference methods such that they provide accurate estimates of quantities of interest? In this talk I will argue for a more explicit consideration of bias-variance tradeoffs of common inference methods. In particular, I highlight that current inference methods such as variational inference and Markov Chain Monte Carlo make particular bias-variance tradeoffs which may be suboptimal for our inferential question at hand. What can we do about this? There is a rich portfolio of methods to change bias-variance tradeoffs in the form of debiasing methods; I will provide a brief overview and demonstrate a number of recent successful applications of these methods to variational inference and stochastic gradient MCMC.

Here are the talk slides and a voice recording (I believe the symposium organizers plan to eventually release a video recording).

MLSS 2018 in Madrid

2018-09-02T23:00:00+01:00

The Machine Learning Summer Schools (MLSS) is the largest and most popular machine learning summer school series. For two weeks in August and September the MLSS 2018 is held in Madrid.

I am happy to speak on the topic of generative adversarial networks (GANs) this year.

My talk materials are now available. The total talk duration is 3 hours.

Talk, PDF version: Introduction, 9MB
Talk, PowerPoint version: Introduction, (286MB), pptx

The PowerPoint version contains animations which unfortunately are not preserved in the PDF version. However, the PDF version is much smaller and works across all platforms.

Do Bayesians Overfit?

2018-07-11T22:30:00+01:00

TLDR: Yes, and there are precise results, although they are not as well known as they perhaps should be.

Over the last few years I had many conversations in which the statement was made that Bayesian methods are generally immune to overfitting, or at least, robust against overfitting, or---everybody would have to agree, right?---that they clearly are better than maximum aposteriori estimation.

Various loose arguments in support include the built-in Bayesian version of Occam's razor, and the principled treatment of any uncertainty throughout the estimation. However, over the years it has always bothered me that this argument is only made casually and for many years I was not aware of a formal proof or discussion except for the well-known result that in case the model is well-specified the Bayes posterior predictive is risk-optimal.

Until recently! A colleague pointed me to a book written by Sumio Watanabe (reference and thanks below) and this blog post is the result of the findings in this nice book.

Overfitting

In machine learning, the concept of overfitting is very important in practice. In fact, it is perhaps the most important concept to understand when learning from data. Many practices and methods aim squarely at measuring and preventing overfitting. The following are just a few examples:

Regularization limits the capacity of a machine learning model in order to avoid overfitting;
Separating data into a training, validation, and test set, is best practice to assess generalization performance and to avoid overfitting;
Dropout, a regularization scheme for deep neural networks, is popularly used to mitigate overfitting.

But what is overfitting? Can we formally define it?

Defining Overfitting

The most widely used loose definition is the following.

Overfitting is the gap between the performance on the training set and the performance on the test set.

This definition makes a number of assumptions:

The data is independent and identically distributed and comes separated in a training set and a test set.
There is a clearly defined performance measure.
The test set is of sufficient size so that the performance estimation error is negligible.

For example, in a classification task the performance measure may be the classification error or the softmax-cross-entropy loss (log-loss).

However, in practice this definition of overfitting can be too strict: in many cases we care about minimizing generalization error, not about the difference between generalization error and training error. For deep learning in particular, the training error is often zero for the model that is selected as the one minimizing validation error. The recent paper (Belkin, Ma, Mandal, "To Understand Deep Learning We Need to Understand Kernel Learning", ICML 2018) studies this phenomenon.

Is overfitting relevant for Bayesians as well?

The Bayesian Case

(This paragraph summarizes Bayesian prediction and contains nothing new or controversial.)

Since de Finetti, a subjective Bayesian measures the performance of any model by the predicted likelihood of future observables. Given a sample $D_n=(x_1, \dots, x_n)$, generated from some true data-generating distribution $x_i \sim Q$, a Bayesian proceeds by setting up a model $P(x|\theta)$, where $\theta$ are unknown parameters of the model, with prior $P(\theta)$. The data reveals information about $\theta$ in the form of a posterior distribution $P(\theta|D_n)$. The posterior distribution over parameters is then useful in constructing our best guess of what we will see next, in the form of the posterior predictive distribution,

$$P(x | D_n) = \int P(x | \theta) \, P(\theta | D_n) \,\textrm{d}\theta.$$

Note that in particular the only degrees of freedom are in the choice of model $P(x|\theta)$ and in the prior $P(\theta)$.

How good is $P(x|D_n)$? A Bayesian cares about the predicted likelihood of future observables, which corresponds to the cross-entropy loss, and is also called the Bayesian generalization loss,

$$B_g = -\mathbb{E}_{x \sim Q}[\log P(x|D_n)].$$

Likewise, given our training sample $D_n$, we can define the Bayesian training loss,

$$B_t = - \frac{1}{n} \sum_{i=1}^n \log P(X_{n+1}=x_i | D_n).$$

However, the concept of a "Bayesian training loss" is unnatural to a Bayesian because it uses the data twice: first, to construct the posterior predictive $P(x|D_n)$, and then a second time, to evaluate the likelihood on $D_n$. Nevertheless, we will see below that the concept, combined with the so-called Gibbs training loss, is a very useful one.

The question of whether Bayesians overfit is then clearly stated as:

$$B_t \ll B_g\,?$$

A Simple Experiment

We consider an elementary experiment of sampling data from a Normal distribution with unknown mean.

\begin{eqnarray} \mu & \sim & \mathcal{N}(\mu_0, \sigma^2_0),\\ x_i & \sim & \mathcal{N}(\mu, \sigma^2), \qquad i=1,\dots,n. \end{eqnarray}

In this case, exact Bayesian inference is feasible because the posterior and posterior-predictive distributions have simple closed-form solutions, each of which is a Normal distribution.

For varying sample size $n$ we perform 2,000 replicates of generating data according to the above sampling procedure and evaluate the Bayesian generalization loss and the Bayesian training loss. The following plot shows the average errors over all replicates.

Clearly $B_t < B_g$, and there is overfitting.
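The experiment can be sketched in a few lines of Julia. This is a minimal sketch, not the code used to produce the plot: the hyperparameters $\mu_0 = 0$, $\sigma_0^2 = \sigma^2 = 1$, the sample sizes, and the helper names `normal_logpdf`, `posterior` and `replicate` are all assumptions made for illustration. It relies on the standard conjugate result that the posterior over $\mu$ is Normal with precision $1/\sigma_0^2 + n/\sigma^2$, and evaluates $B_g$ exactly as the expectation of the negative log predictive density under $x \sim \mathcal{N}(\mu, \sigma^2)$.

```julia
using Random, Statistics

# Log-density of a Normal distribution with mean m and variance v.
normal_logpdf(x, m, v) = -0.5 * (log(2pi * v) + (x - m)^2 / v)

# Conjugate posterior N(m_n, v_n) over mu, for known noise variance s2.
function posterior(x, mu0, v0, s2)
    prec = 1 / v0 + length(x) / s2
    m_n = (mu0 / v0 + sum(x) / s2) / prec
    return m_n, 1 / prec
end

# One replicate: draw mu and a data set of size n, return (B_t, B_g).
function replicate(rng, n; mu0=0.0, v0=1.0, s2=1.0)
    mu = mu0 + sqrt(v0) * randn(rng)
    x = mu .+ sqrt(s2) .* randn(rng, n)
    m_n, v_n = posterior(x, mu0, v0, s2)
    vp = v_n + s2                          # posterior predictive variance
    Bt = -mean(normal_logpdf.(x, m_n, vp))
    # Exact expectation of -log N(x; m_n, vp) under x ~ N(mu, s2).
    Bg = 0.5 * log(2pi * vp) + (s2 + (mu - m_n)^2) / (2 * vp)
    return Bt, Bg
end

rng = MersenneTwister(1)
for n in (1, 2, 5, 10, 50)
    losses = [replicate(rng, n) for _ in 1:2000]
    println("n = ", n, "  B_t = ", mean(first.(losses)),
            "  B_g = ", mean(last.(losses)))
end
```

Averaged over replicates, the training loss should come out below the generalization loss for every $n$, with the gap shrinking as $n$ grows.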

What about non-Bayesian estimators, such as MAP estimation and maximum likelihood estimation?

Maximum Aposteriori (MAP) and Maximum Likelihood (MLE)

Two popular point estimators are the maximum aposteriori estimator (MAP), defined as

$$\hat{\theta}_{\textrm{map}} = \textrm{argmax}_{\theta} P(\theta | D_n),$$

and the maximum likelihood estimator (MLE), defined as

$$\hat{\theta}_{\textrm{mle}} = \textrm{argmax}_{\theta} \sum_{i=1}^n \log P(x_i|\theta).$$

Each of these estimators also has a generalization loss and a training loss. In our experiment the MLE estimator is dominated by the MAP estimator, which is in turn dominated by the Bayesian estimate, which is optimal in terms of generalization loss.

The gap between the MLE generalization error (top line, dotted) and the MAP generalization error (black dashed line) is due to the use of the informative prior about $\mu$. The gap between the Bayesian generalization error (black solid line) and the MAP generalization error (black dashed line) is due to the Bayesian handling of estimation uncertainty. In this simple example the information contained in the prior is more important than the Bayesian treatment of estimation uncertainty.

Can we estimate $B_g$ except via prediction on hold-out data?

WAIC: Widely Applicable Information Criterion

It turns out that we can estimate $B_g$ to order $O(n^{-2})$ from just our training set. This is useful because it provides us an estimate of our generalization performance, and hence can be used for model selection and hyperparameter optimization.

The Widely Applicable Information Criterion (WAIC), invented by Sumio Watanabe, estimates the Bayesian generalization error,

$$\textrm{WAIC} = B_t + 2(G_t - B_t),$$

where $G_t$ is the Gibbs training loss, defined as the average loss of individual models from the posterior,

$$G_t = -\mathbb{E}_{\theta \sim P(\theta|D_n)}\left[\frac{1}{n} \sum_{i=1}^n \log P(X_{n+1} = x_i|\theta)\right].$$

Due to Jensen's inequality we always have $G_t \geq B_t$ so the right hand summand in $\textrm{WAIC}$ is never negative. Importantly, given a training set we can actually evaluate $\textrm{WAIC}$, but we cannot evaluate $B_g$.

Watanabe showed that

$$\mathbb{E}[B_g] = \mathbb{E}[\textrm{WAIC}] + O(n^{-2}).$$

Evaluating the previous experiment we can see that $\textrm{WAIC}$ indeed accurately estimates $B_g$.
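For the Normal-mean experiment all terms of $\textrm{WAIC}$ are available in closed form, so the comparison can be sketched as follows. This is an illustrative sketch under assumed settings ($\mu_0 = 0$, $\sigma_0^2 = \sigma^2 = 1$, $n = 10$); the function name `waic_replicate` is hypothetical. The Gibbs training loss uses the identity $\mathbb{E}_{\theta \sim \mathcal{N}(m_n, v_n)}[(x_i - \theta)^2] = (x_i - m_n)^2 + v_n$.

```julia
using Random, Statistics

# Log-density of a Normal distribution with mean m and variance v.
normal_logpdf(x, m, v) = -0.5 * (log(2pi * v) + (x - m)^2 / v)

# One replicate of the Normal-mean experiment; returns (B_t, G_t, WAIC, B_g).
function waic_replicate(rng, n; mu0=0.0, v0=1.0, s2=1.0)
    mu = mu0 + sqrt(v0) * randn(rng)
    x = mu .+ sqrt(s2) .* randn(rng, n)
    # Conjugate posterior N(m_n, v_n) over the mean.
    v_n = 1 / (1 / v0 + n / s2)
    m_n = v_n * (mu0 / v0 + sum(x) / s2)
    vp = v_n + s2                          # posterior predictive variance
    Bt = -mean(normal_logpdf.(x, m_n, vp))
    # Gibbs training loss: posterior expectation of the average negative log-likelihood.
    Gt = 0.5 * log(2pi * s2) + mean((x .- m_n) .^ 2 .+ v_n) / (2 * s2)
    waic = Bt + 2 * (Gt - Bt)
    # Exact generalization loss under x ~ N(mu, s2).
    Bg = 0.5 * log(2pi * vp) + (s2 + (mu - m_n)^2) / (2 * vp)
    return Bt, Gt, waic, Bg
end

rng = MersenneTwister(2)
results = [waic_replicate(rng, 10) for _ in 1:2000]
println("mean WAIC = ", mean(getindex.(results, 3)))
println("mean B_g  = ", mean(getindex.(results, 4)))
```

Under these assumptions the two averages should agree closely, while the average $B_t$ alone is noticeably smaller than both.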

Even better, Watanabe also showed that $\textrm{WAIC}$ continues to estimate the Bayesian generalization error accurately in singular models and in case the model is misspecified. Here, singular means that there is not a bijective map between model parameters and distributions. Misspecified means that no parameter exists which matches the true data-generating distribution.

WAIC with Approximate Posteriors

The above definition of $\textrm{WAIC}$ assumes an exact Bayesian posterior. In practice we may not have the luxury of being able to compute the exact posterior, and instead use approximate inference methods such as Markov chain Monte Carlo (MCMC) to get sample based approximations to the posterior, or variational Bayes (VB) to get approximations within a parametric family of distributions.

For such approximations WAIC can degenerate considerably. For example, consider a posterior family

$$\mathcal{U}_v := \{ \mathcal{N}(\mu, C) \, | \, \mu \in \mathbb{R}^d, \, 0 \prec C \prec vI \},$$

where a scalar $v > 0$ bounds the eigenvalues of $C$ from above. Doing variational Bayes with $\mathcal{U}_{\epsilon}$ then corresponds to MAP estimation and the difference $G_t - B_t$ will be close to zero, yet $B_t$ can be very small. In this case, applying the $\textrm{WAIC}$ equation would suggest that $B_g \approx B_t$, so we severely underestimate the Bayesian generalization loss.

The same holds true for MCMC: even if $\theta^{(1)}, \dots, \theta^{(K)}$ were exact samples from $P(\theta|D_n)$ and we approximate the exact posterior by the set of these parameters, the estimate of $B_t$ is now too large so $G_t - B_t$ is underestimated.
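The effect of a finite sample set can be illustrated with a small sketch. Here the "MCMC samples" are exact draws from the conjugate posterior of the Normal-mean model, and the prior $\mathcal{N}(0, 1)$, the noise variance, the data, and the helper names `logmeanexp` and `sample_losses` are assumptions made for illustration. With $K = 1$ the sample-based $B_t$ and $G_t$ coincide exactly, which is the MAP-like degenerate case; as $K$ grows the difference $G_t - B_t$ should move towards its exact-posterior value.

```julia
using Random, Statistics

# Log-density of a Normal distribution with mean m and variance v.
normal_logpdf(x, m, v) = -0.5 * (log(2pi * v) + (x - m)^2 / v)

# Numerically stable log(mean(exp.(a))) via max-subtraction.
function logmeanexp(a)
    alpha = maximum(a)
    return alpha + log(mean(exp.(a .- alpha)))
end

# Sample-based estimates (B_t, G_t) from posterior samples `thetas` of the mean.
function sample_losses(x, thetas, s2)
    # logp[i, k] = log P(x_i | theta_k)
    logp = [normal_logpdf(xi, th, s2) for xi in x, th in thetas]
    Bt = -mean([logmeanexp(logp[i, :]) for i in axes(logp, 1)])
    Gt = -mean(logp)
    return Bt, Gt
end

rng = MersenneTwister(3)
n, s2 = 10, 1.0
x = 0.5 .+ sqrt(s2) .* randn(rng, n)
v_n = 1 / (1 + n / s2)                     # posterior variance under prior N(0, 1)
m_n = v_n * sum(x) / s2                    # posterior mean
for K in (1, 10, 1000)
    thetas = m_n .+ sqrt(v_n) .* randn(rng, K)
    Bt, Gt = sample_losses(x, thetas, s2)
    println("K = ", K, "  G_t - B_t = ", Gt - Bt)
end
```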

Conclusion

Clearly, Bayesians do overfit, just like any other procedure does.

The following is a list of relevant literature with some comments.

(Sumio Watanabe, "Algebraic Geometry and Statistical Learning Theory", Cambridge University Press, 2009), a monograph summarizing in detail earlier results. The results are particularly relevant for neural networks (which are singular models) and for Bayesian neural networks.
For WAIC, see also Section 7.1 in (Sumio Watanabe, "A Widely Applicable Bayesian Information Criterion", JMLR, 2013).
(Gelman, Hwang, Vehtari, "Understanding predictive information criteria for Bayesian models", Statistics and Computing, 2013) have good things to say about WAIC when comparing multiple information criteria (AIC, DIC, WAIC), "WAIC is fully Bayesian (using the posterior distribution rather than a point estimate), gives reasonable results in the examples we have considered here, and has a more-or-less explicit connection to cross-validation"
The application of WAIC to select hyperparameters is studied by Watanabe in (Watanabe, "Bayesian Cross Validation and WAIC for Predictive Prior Design in Regular Asymptotic Theory", 2015).
Can one improve on the Bayesian risk? Yes, if the model is misspecified. A not so well-known paper, (Fushiki, "Bootstrap prediction and Bayesian prediction under misspecified models", Bernoulli, 2005) compares the Bayesian posterior predictive generalization loss with the generalization loss of a so-called Bootstrap prediction posterior, proving that the latter is more efficient asymptotically in the misspecified setting.

Acknowledgements. I thank Ryota Tomioka for exciting discussions and for pointing me to Watanabe's book. Thanks also to Ferenc Huszár and Richard Turner for feedback on a draft of the article and to Vitaly Kurin and Artem for a correction.

Stable GAN Models and Creative Machines

2017-12-04T15:00:00+00:00

We just published an article discussing our recent work on stabilizing generative adversarial networks.

NIPS 2016 Generative Adversarial Training workshop talk

2016-12-10T22:30:00+00:00

The biggest AI conference of the year has just ended: NIPS in Barcelona broke all records this year and the program was exciting as always. It certainly remains my favorite conference to attend.

One of the best things about NIPS are the numerous high-quality workshops; this year David Lopez-Paz, Alex Radford, and Léon Bottou put together a workshop on Adversarial Training, with most of the content related to generative adversarial networks (GAN).

If you have not heard of GANs before, Ian Goodfellow gave a detailed tutorial on GANs, slides here, earlier in the week and certainly GANs were the hot topic of this year's NIPS.

$f$-GAN Talk Slides

I gave an invited talk at the GAN workshop on the NIPS 2016 paper on f-GAN, authored by Ryota Tomioka, Botond Cseke, and myself.

Here is the slide deck I used during the talk.

Please let me know your feedback.

Acknowledgments. This is joint work with Ryota Tomioka and Botond Cseke.

Book Review: Computer Age Statistical Inference

2016-11-23T21:00:00+00:00

A new book, Computer Age Statistical Inference: Algorithms, Evidence, and Data Science by Bradley Efron and Trevor Hastie, was released in July this year. I finished reading it a few weeks ago and this is a short review from the point of view of a machine learning researcher.

Living in Cambridge I indulge myself every once in a while by taking a break at the Cambridge University Press bookstore at the market square. Located just opposite King's College, it is the oldest book shop in England. Besides having an excellent collection of mathematics and computer science books, at the entrance of the shop they showcase new releases from Cambridge University Press. Most of these new books fall outside my interest, but what a pleasure it was to discover a bold new book on the broad theme of statistics in the modern age, written by two experts in the field! I took a look at the table of contents and a minute later purchased the book.

Review

The book examines statistics broadly through three lenses.

First, it tells the history of the field of statistics, often with interesting remarks about the prevalent views at the time a method was invented. Second, correlated with the chronological order, the authors classify methods by their use of computation. Classic methods use little to no computation but often leverage asymptotic arguments. Newer methods are increasingly realistic in their assumptions but rely on heavy use of machine computation. Third, the flavour of the presented methods is interpreted as Fisherian, frequentist, or Bayesian.

The terminology in the book is easily accessible to a person with basic statistics training, perhaps with the exception of the word "Inference" in the title. In the book the authors use "inference" to describe the means by which legitimacy of statistical results can be established. This sense is different from the common use of the word in the machine learning community, where it would usually refer in a broad sense to "perform computation of consequences given a model and observations".

From a machine learner's perspective the most interesting parts of the book are the wide applicability of the empirical Bayes methodology, which is demonstrated in a number of generally relevant applications including large-scale testing and deconvolution.

Another benefit for someone with a machine learning background is the modern view on classic methods such as resampling methods (bootstrap and jackknife), a readable motivation for topics and applications which are popular in statistics but not popular in machine learning (survival analysis, large-scale testing, confidence intervals, etc.), and the historical remarks and subjective commentary on developments in the field.

The subjective commentary in the Epilogue makes predictions about the field of statistics and data science as a whole, with the main trends being a branching out into applications and an increased reliance on computation.

Criticism

The book is a wonderful book and many readers will enjoy reading it, as I did. There are only two minor points where I feel the book could be improved.

First, while the authors readily acknowledge that many topics could have been added to the book, I feel that certain topics should have been included due to their broad applicability and heavy use of computation in many successful models: variational Bayesian inference, approximate Bayesian computation (ABC), kernel methods more generally, and Bayesian nonparametrics. Perhaps variational inference and kernel methods have not reached the core statistics community yet, but ABC and Bayesian nonparametrics originate with them and are only possible because of the massive computation available today.

Second, in the description of solutions to statistical problems throughout the book there is a strong emphasis on empirical Bayes and the bootstrap.

Summary

If you enjoy statistics, computation, or machine learning, get the book! The breadth of topics and the independence between the chapters will make it easy for you to find something interesting.

Acknowledgements. Thanks to Diana Gillooly for corrections.

Streaming Log-sum-exp Computation

2016-05-08T21:30:00+01:00

A common numerical operation in statistical computing is to compute

$$\log \sum_{i=1}^n \exp x_i,$$

where $x_i \in \mathbb{R}$, and $n$ is potentially very large.

We can implement the above computation by exponentiating each number, then summing them, then taking a logarithm as follows (written in Julia).

logsumexp_naive(X) = log(sum(exp.(X)))  # exp.() applies exp elementwise

When the above function returns a finite number then it is numerically accurate. However, the above computation is not robust if one of the elements is very large (say, larger than 710 for double precision IEEE floating point). Then $\exp(x_i)$ returns a floating point infinity and the entire computation returns a floating point infinity as well.

Standard Batch Solution

The standard solution to this problem is to use the mathematical identity

$$\log \sum_{i=1}^n \exp x_i = \alpha + \log \sum_{i=1}^n \exp (x_i - \alpha),$$

which holds for any $\alpha \in \mathbb{R}$. By selecting $\alpha = \max_{i=1,\dots,n} x_i$ no argument to the $\exp$-function will be larger than zero and the above naive computation can be applied on the transformed numbers. The code is as follows.

function logsumexp_batch(X)
    alpha = maximum(X)  # Find maximum value in X
    log(sum(exp.(X .- alpha))) + alpha  # Elementwise shift by alpha, then exponentiate
end

Code such as the above is used in almost all packages for performing statistical computation and is described as the standard solution, see e.g. here and here.

However, there are the following problems:

It requires two scans over the data array, one to find the maximum, one to compute the summation. For modern systems and large input arrays the above computation is memory-bandwidth limited so two memory scans mean twice the runtime.
It requires knowledge of the number of elements in the sum prior to computation.

Streaming log-sum-exp Computation

The solution is to also compute the maximum element in a streaming manner and to correct a running estimate whenever a new maximum is found. I have not seen this solution elsewhere, but I hope you may find it useful.

First, here is the code.

function logsumexp_stream(X)
    alpha = -Inf
    r = 0.0
    for x = X
        if x <= alpha
            r += exp(x - alpha)
        else
            r *= exp(alpha - x)
            r += 1.0
            alpha = x
        end
    end
    log(r) + alpha
end

As you can see by glancing over the code, only one linear access over the input is required and we do not need to know the number of elements.

To understand how the code works, assume we maintain two quantities. The first is the largest value seen after $i$ elements,

$$\alpha_i := \max_{j = 1,\dots,i} x_j.$$

The second is the accumulated sum so far with the current maximum subtracted,

$$r_i := \sum_{j=1}^i \exp(x_j - \alpha_i).$$

Now when we visit a new element $x_{i+1}$ there are two cases that can happen. If $x_{i+1} \leq \alpha_i$ then $\alpha_{i+1} = \alpha_i$ and we simply update

$$r_{i+1} = r_i + \exp(x_{i+1} - \alpha_{i+1}).$$

However, if we see a new largest element, we can write $r_i$ as

$$r_i = \sum_{j=1}^i \exp(x_j - \alpha_i) = \exp(-\alpha_i) \sum_{j=1}^i \exp(x_j).$$

We correct this estimate so that it uses the new maximum $x_{i+1}$ and cancels the old maximum $\alpha_i$,

$$r'_{i+1} = \exp(\alpha_i - x_{i+1}) \, r_i.$$

The factor is always smaller than one. Then we proceed to accumulate as normal to obtain

$$r_{i+1} = r'_{i+1} + \exp(x_{i+1} - \alpha_{i+1}) = r'_{i+1} + 1.$$

The above code is as numerically robust as the commonly used batch version and for large arrays can be twice as fast.
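To make the two update cases concrete, here is a small hand-worked trace. The input $(x_1, x_2, x_3) = (1, 3, 2)$ is an arbitrary choice for illustration; the initialization $\alpha_0 = -\infty$, $r_0 = 0$ is the one used in the code.

$$x_1 = 1 > \alpha_0: \quad r_1 = \exp(\alpha_0 - x_1) \, r_0 + 1 = 1, \quad \alpha_1 = 1,$$

$$x_2 = 3 > \alpha_1: \quad r_2 = \exp(1 - 3) \, r_1 + 1 = 1 + e^{-2}, \quad \alpha_2 = 3,$$

$$x_3 = 2 \leq \alpha_2: \quad r_3 = r_2 + \exp(2 - 3) = 1 + e^{-2} + e^{-1}, \quad \alpha_3 = 3.$$

The returned value is then

$$\alpha_3 + \log r_3 = 3 + \log(1 + e^{-1} + e^{-2}) = \log(e^3 + e^2 + e^1) \approx 3.4076,$$

which agrees with the direct evaluation of $\log \sum_{i=1}^3 \exp x_i$.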

Example

Running

n = 10_000_000
X = 500.0*randn(n)

logsumexp_naive(X), logsumexp_batch(X), logsumexp_stream(X)

gives the following output

(Inf,2686.7659554831052,2686.7659554831052)

Where will Artificial Intelligence come from?

2016-04-20T23:30:00+01:00

Artificial Intelligence (AI) is making progress in great strides, or at least it appears so! Almost no week passes by without some major announcements of new challenges solved by AI technology or new products powered by AI.

Indeed many quantifiable factors attest to an unprecedented level of activity: capital investments, number of academic papers, and number of products involving AI technology have all risen steeply over the past five years.

Computers are already very capable at some specialized tasks that require reasoning and other abilities that we typically associate with intelligence. For example, computers can play a decent game of chess or can help us order our holiday photos. Despite this genuine progress, we are still a long way from human level intelligence because our best artificial intelligence systems are not general purpose. They cannot quickly adapt to novel tasks the way most humans can do.

When talking about artificially intelligent systems there is a risk of emphasizing humans too much. Computers are already more capable than any human at many tasks, for example in numerical computation and search. Yet, in discussions about artificial intelligence we emphasize the shrinking set of abilities where humans still outperform machines. For a nice and more balanced recent discussion on issues surrounding artificial intelligence I recommend reading the contributions to the Edge 2015 question.

As artificial intelligence continues to make progress, I would like to ask the following question:

Where will the next major advance towards general purpose artificial intelligence come from?

Below I list seven possible areas which I believe could be the answer to this question; these answers are highly subjective and biased and they may be all wrong, but hopefully they do contain some interesting pointers for everyone.

The point of this exercise is to show that there are many strands of active research that could result in major AI advances. So here they are, the seven areas where a major general purpose AI breakthrough could come from.

1. Composable Differentiable Architectures (aka Deep Learning)

Composable differentiable architectures describe current state-of-the-art deep learning systems. Frameworks such as Caffe, Theano, Torch, and Chainer all allow the specification of function classes and the automatic composition and differentiation of such functions. Because of this mix-and-match composability there is a frictionless and rapid diffusion of components and (sub-)models across application domains.

This commoditizes machine learning and allows customization to specific applications; it commoditizes machine learning because the level of knowledge required to leverage modern deep learning frameworks is low. These deep learning frameworks also allow for easy customization of the model to the application at hand. Years ago, this was the unattained dream for graphical models, but today it is achieved by deep learning frameworks where bespoke models are built for most applications.

But is it enough for general purpose AI? What is missing?

I believe there are two obstacles; first, almost all deep learning systems require large amounts of supervised data to work. For high-value industrial applications this may be okay because the required label data can be collected. However, there is a long tail of useful applications where label data is rare but unlabeled data is abundant. Future AI systems need to be able to leverage this abundant data source.

Second, what is missing are general architectures for reasoning, and an intense search for such building blocks is currently taking place. Maybe classic ideas from AI, such as blackboard systems, could be adapted and made differentiable to enable reasoning, or maybe some entirely unexpected new building block will appear.

Besides better models, the key novel technology to look out for in deep learning is custom hardware and novel engineering abstractions. Custom hardware could enable energy savings, or increased speed, or both. Current deep learning piggybacks on GPU development funded largely by the gaming industry. This is a great thing because developing a new GPU generation such as Nvidia's new Pascal GPU requires very large research and development budgets. Novel engineering abstractions in the form of next generation deep learning frameworks could enable automatic scalability, distributed computation, or offer help in identifying the right architecture for the task.

Scalability is important beyond just training speed. For example, consider basic estimates of the computing power of the human brain or the following quote from a recent interview with Geoff Hinton.

"So in the brain, you have connections between the neurons called synapses, and they can change. All your knowledge is stored in those synapses. You have about 1,000-trillion synapses-10 to the 15, it's a very big number. So that's quite unlike the neural networks we have right now. They're far, far smaller, the biggest ones we have right now have about a billion synapses. That's about a million times smaller than the brain."

This puts up a ballpark estimate for the number of primitive computational units in the human brain, and it is quite reasonable to attempt to achieve this scale.

One important fact to consider: the driving force behind applications of deep learning is largely the industry, and this will remain the case as long as it pays dividends (it does so greatly at the moment).

2. Brain Simulations

Understanding the brain and simulating it is what I think of as the safe route to general AI. We do not know whether it will take 5, 50, or 500 years, but it is likely that we eventually will get there and be able to accurately simulate an artificial brain which is functionally indistinguishable from a real human brain.

Novel technology and approaches to study neural systems, such as optogenetics, multi-electrode arrays and connectomics, eventually will enable us to obtain a high-fidelity understanding of the brain. Likewise, increase in computation and custom hardware will allow accelerated simulation of neural models.

Most of the investments in this area of research are government funds, for example through the large US BRAIN initiative and the Human Brain Project, and more general neuroscience funding.

3. Algorithmic Information Theory and Universal Intelligence

Whatever intelligence is, if we were to accept the possibility of a mathematical theory for it, the closest contenders for such theory are found in a field called algorithmic information theory. If you have not heard of algorithmic information theory before, Gregory Chaitin recently wrote an excellent essay on the conceptual roots of algorithmic information theory and the general history of notions of complexity in science and mathematics.

One approach which leverages algorithmic information theory for general artificial intelligence is the AIXI agent, a theory put forward by Marcus Hutter that attempts to be universal in the sense that it will successfully and optimally solve any solvable task. At its heart it is a Bayesian reinforcement learning agent where the hypothesis space consists of the possible programs of a Turing machine. It is an extension of an earlier idea to consider Turing machines for predicting future symbols in an observed sequence. This idea is Solomonoff induction, proposed by Ray Solomonoff. Because Turing machines are universal any computable hypothesis can be entertained. AIXI extends this idea from mere prediction of symbols to acting in an unknown environment, that is, to reinforcement learning.

Grounding intelligence in Turing machines is very appealing: not only is it universal, but it also allows a formal definition of universal intelligence. In essence, reasoning and acting intelligently is reduced to formal manipulation of a notion of complexity defined by programs on a Turing machine. See also Jürgen Schmidhuber's speed prior for Turing machines.

Despite this promise, so far we do not see impressive results achieved by AIXI agents. Why not? There are at least two obstacles:

Universal Turing machines are not practically implementable and approximating AIXI is hard; there have been some approximation attempts, e.g. in the work of (Veness et al., JAIR 2011), but at best results have matched other reinforcement learning methods without enabling novel applications that were out of reach before. More recently Jürgen Schmidhuber proposed a more practical integration of recurrent neural network models of the world with algorithmic information theory in the form of RNN-based AIs.
The choice of Turing machine is not clear. There is an infinite set of possible universal Turing machines and we could reasonably hope that the particular choice would not influence the agent's efficiency except perhaps for some small overhead. (For a related example, the Kolmogorov complexity of a sequence is defined through a Turing machine, but whatever the choice of Turing machine there is an invariance property in that only a constant overhead is introduced compared to any other Turing machine.) Unfortunately for AIXI this is not the case: the choice of Turing machine can determine the behaviour of the AIXI agent entirely. (This may also affect Bayesian reinforcement learning more generally: when using a non-parametric prior process the choice of prior may determine more than intended.)

This recent negative result leaves AIXI in an interesting state at the moment. It is clearly the most complete theory of universal agents we have at the moment, see e.g. Hutter's own review from 2012, but it may turn out to be entirely subjective (if no "natural" Turing machine can be identified) or practically unworkable.

4. Artificial Life

In the above section on brain simulation I argued that by understanding the human brain and then simulating it we will eventually be able to attain human-level intelligence. However, we can start at a more basic level: by understanding and simulating a synthetic form of chemistry we may be able to simulate artificial life. Given a sufficiently rich environment such life may evolve to become intelligent.

The field of Artificial Life (ALife) studies the formation and dynamics of life itself on top of artificial simulations of life. This life does not need to be intelligent, and in fact, so far no such simulation has produced life with intelligence beyond that of a simple organism. But it is clear that since the early 1990s, by any generally plausible definition of life (of which there are many and there is some controversy), artificial life does indeed spontaneously form in computer simulations, and complex evolutionary dynamics such as symbiosis and parasitism do occur in these simulations.

For a dated but inspiring introduction to the field of artificial life more generally, see Christoph Adami's book on Artificial Life. Adami also wrote a recent article on evolving artificial intelligence that highlights current research issues for the goal to evolve artificial intelligence.

More fundamentally, in another article Adami argues that theoretical results in a field called integrated information theory (which I had not heard of before) may imply that, due to the complexity of general intelligence, it is not possible to design it and an evolutionary approach is needed instead.

Given that our goal is to evolve intelligence artificially, there are the following fundamental obstacles in producing useful general artificial intelligence through artificial life:

The big intelligence filter hypothesis. This hypothesis goes as follows: life may be abundant but intelligent life may be exceedingly rare. We currently do not know if intelligent life is rare or abundant in the universe, but if it is rare, it may also be exceedingly rare in any simulation of artificial life. A related point is what is known as the Fermi paradox, namely, that what science tells us about astrophysics implies we should have likely observed alien civilizations by now, but this has not happened yet. (See Tim Urban's wonderful article on the Fermi paradox.) Even for life on our own planet we are not sure what triggered intelligence to appear; one widely believed hypothesis is that it happened in a short time, akin to a phase transition, due to a change in ocean oxygen levels 540 million years ago, leading to the Cambrian explosion.
Harnessing intelligence. One of our closest genetic relatives, the chimpanzee, is clearly intelligent, but harnessing this intelligence for something useful is difficult. Now imagine a giant squid, swimming a kilometer deep within the ocean. Likely it is also intelligent but we can hardly leverage this for anything useful. Who knows which form intelligent artificial life will take? Will we be able to recognize this life as intelligent? If we do, encountering such life will likely be similar to encountering an alien species in our universe: unlike anything you can imagine or predict beforehand. That is to say, we may be able to achieve intelligent artificial life but may still struggle to make it useful. Even with full control over the simulation environment, a god-like state if you will, it seems that to make intelligent artificial life useful we will at least have to decode its representation or "language" and understand its incentives sufficiently well in order to communicate with such intelligence and motivate it to work for us.

In summary, the evolutionary approach to constructing AIs is promising in the long run and there are now several labs working on it (the labs of Adami, Clune, and Hintze).

5. Robotics and Autonomous Systems

Autonomous robots are rapidly conquering novel applications in industry and consumer space, such as in self-driving cars, agricultural robotics, industry 4.0, and drones.

The key enablers of this development are improved sensing technology (e.g. low-cost depth sensors), increased compute and memory capacities, and improved pattern recognition methods. As a result of the maturity of basic required technologies significant industry capital is invested in driving advanced autonomous robotics research.

Beyond the natural urge to feel scared by autonomous machines, how could this lead to a breakthrough in artificial intelligence that cannot be found in one of the constituent technologies?

One line of thought in the field of embodied cognition argues that an intelligent system is conditioned on its environment in a fundamental way, shaping the allocation of precious (evolutionary) resources in order to maximally exploit the types of sensors and actuators available to it. Therefore the specific nature of sensing and acting abilities is not ancillary to intelligence but the main driving force that enables intelligence in the first place.

If the above thesis were true, autonomous robots with modern sensors and actuators would provide a rich enough embodiment for artificial intelligence, and the lack of such an embodiment in other domains would likely impede the emergence of general intelligence.

In the past decade, the European Union, through its robotics programme in the 7th Framework Programme (FP7, totalling more than 50 billion Euro for 2007-2013), has placed an emphasis on combining cognition with robotics, to the exclusion of funding research on artificial intelligence not involving robotics. However, many of the resulting large research projects of that time reflect the ample availability of funding more than they represent progress on fundamental questions of cognition.

6. Game Playing

Games entertain humans; what could they do to enable artificial intelligence?

The answer: quite a lot! Games are designed to challenge our intellect, involve interactions between multiple agents, and are sufficiently abstract to be formalized. A computer-implemented game can be as simple and abstract as tic-tac-toe, Chess, or Go, or as sophisticated and close to reality as the latest Grand Theft Auto game.

Therefore games are an almost ideal research vehicle to drive artificial intelligence research. Julian Togelius argues this point eloquently in a recent article.

In fact, there are now popular game playing competitions and platforms which drive AI research: the Stanford general game playing competition, the Computer Poker Competition, the StarCraft AI Competition, the Atari 2600 Arcade Learning Environment, and, most recently Microsoft's Minecraft AI environment (Malmo).

It is likely that such platforms will provide diverse and challenging environments for testing the abilities of artificial general intelligence agents, thus accelerating research and enabling breakthroughs. Perhaps the next breakthrough will be in the form of mastering another game.

7. Knowledge Bases

A knowledge base is a discrete representation of basic facts and relations about entities. Large-scale knowledge bases constructed semi-automatically from the web are already incredibly useful commercially and they power search engine results and personal assistants.

In search results they provide highly accurate results for known entities in all major search engines (e.g. Knowledge Graph and Knowledge Vault in Google, Satori in Microsoft Bing). To see an example, search for a well-known person, e.g. "Stanislaw Ulam" (results from Bing, results from Google) and observe the details about the person that are displayed.

In personal intelligent assistants such as Apple Siri, Google Now, Microsoft Cortana, or Amazon Alexa they are responsible for providing facts and basic reasoning abilities. For example, in order to answer queries such as "Who was the president following Thomas Jefferson?" a basic natural language understanding ability and a large knowledge base go a long way.

But can knowledge bases provide the substrate for artificial intelligence? The Cyc project started in 1984 and the Open Mind Common Sense project started in 1999 are both based on the belief that in order to enable artificial intelligence we need to encode common sense reasoning, particularly the entities and relationships of everyday life. The hope was that knowledge, encoded in this way, will make reasoning and discovery of novel knowledge simpler.

It is fair to say that while the (commercial) usefulness of knowledge bases for intelligent applications is now well established, it is too early to say whether general artificial intelligence would require reasoning on top of an explicit symbolic knowledge base. Perhaps a more continuous and non-symbolic representation of knowledge that supports reasoning is sufficient.

Conclusion

The goal of artificial general intelligence (AGI) is challenging and exciting on many levels. In all likelihood artificial intelligence will make rapid progress in the next decade, perhaps along the directions we just discussed.

Acknowledgements. I thank Jürgen Schmidhuber, Neil Lawrence, Pushmeet Kohli, Chris Adami, and Chris Watkins for feedback, pointers to literature, and corrections on a draft version of the article.

Image credits. The tree image is licensed CC-BY-SA by adoomer. The brain image is a drawing of the brain of Gauss and is public domain. The Anomalocaris image is CC-BY-3.0 licensed art by Nobu Tamura. The octopus image is CC-BY-2.0 licensed art by Paul K. The robot image is licensed CC-BY-2.0 by striatic. The dice image is public domain by Personeoneste. The couple image is public domain.

The Best of Unpublished Machine Learning and Statistics Books

2016-02-09T23:00:00+00:00

Nowadays authors in the fields of statistics and machine learning often choose to write their books openly by publishing early draft versions. For popular books this creates a lot of feedback and in the end clearly improves the final book when it is published.

Here is a short list of very promising draft books. Because completing a book is difficult it is likely that some of these books will never be finished.

"Deep Learning"

By Ian Goodfellow, Yoshua Bengio, and Aaron Courville.

Deep learning has revolutionized multiple applied pattern recognition fields since 2011. If you want to get started in applying deep learning methods, now is the time.

First, there are now many well-engineered frameworks available that make learning and experimentation fun.

Second, there are good learning resources available. If you have a solid background in basic machine learning and basic linear algebra, this book is for you. The field of deep learning is advancing very swiftly and likely this book will not cover all the latest models and techniques used when it is published. However, at the moment it is quite up-to-date and simply contains some of the most organized material on the topic of deep learning.

"Advanced Data Analysis from an Elementary Point of View"

By Cosma Rohilla Shalizi.

This book is covering a lot of ground from classical statistics (linear/additive/spline regression, resampling methods, density estimation, dimensionality reduction), and some more recent topics (causality, graphical models, models for dependent data).

One feature of this book is highlighted by the addition "from an elementary point of view" to the title: it is quite accessible and the author genuinely cares about conveying understanding, often in a delightfully casual tone.

See also the unfinished but great book written by Cosma Shalizi and Aryeh Kontorovich, "Almost None of the Theory of Stochastic Processes".

"Monte Carlo theory, methods and examples"

By Art Owen.

While Monte Carlo methods have advanced in the last decade, in particular driven by the need for large scale Bayesian statistics, most comprehensive textbooks (Liu, Robert and Casella, Rubinstein and Kroese) are now dated.

Art Owen's book is a wonderful addition that will become a classic once it is completed. From first principles and with great depth Art Owen introduces the theory and practice of Monte Carlo methods. Chapters 8 to 10 contain a wealth of material not found in other textbook treatments. I am eagerly waiting for chapters 11 to 17.

"A Course in Machine Learning"

By Hal Daume III.

A basic course on a broad set of machine learning methods including (eventually, not yet) chapters on structured prediction and Bayesian learning. Very accessible and lots of pseudo-code.

"Introduction to Machine Learning"

By Alex Smola and SVN Vishwanathan.

This draft was last updated in 2010 and probably will never be completed. Chapter 3 however, covering continuous optimization methods, is complete and nicely written.

The Fair Price to Pay a Spy: An Introduction to the Value of Information

2016-01-09T22:30:00+00:00

(This article covers the decision-theoretic concept of value of information through a classic example.)

What is the value of a piece of information?

It depends. Two factors determine the value of information: first, whether the information is new to you; second, whether the information causes you to change your decisions.

The first point is immediately clear as you would be unwilling to pay a reward for information which you already know. Information is understood here in the sense of probabilistic knowledge represented by a probability distribution. As such, if the information keeps your beliefs unchanged, it cannot have any value.

The second point is more subtle. Only decisions and actions can have value; information itself has only indirect value through the decisions and actions that it influences. The consequence of a decision is a realized utility which can be either positive or negative. As a simple monetary example, imagine you buy a share of a company. Then the utility is a function of the change in the share price. Information such as insider information can lead to a belief that the share price will drop, thus leading to the decision to sell the share and realize the utility. If the information I learn about the company does not change my decision whether to sell the share or not, then it also cannot change the utility. Therefore value is understood as a subjective but quantitative utility that is realized at decision time.

The Fair Price to Pay a Spy

The following example is from one of the important papers on decision theory and decision analysis, now in its 50th anniversary year(!), (Howard, "Information Value Theory", 1966). Unfortunately the paper is behind a paywall, but I will keep the presentation below self-contained and also took the liberty to update the exotic notation used in the paper to a more modern form.

Imagine you run a construction company and the government advertises a contract to build a large development. The bidding happens via a lowest price closed bidding, where every construction company submits a price for which they would construct the development in a technically acceptable manner. You do not see any competing bids and the lowest-price bid wins.

Leaving moral and legal concerns aside, how much would you pay a spy to reveal to you the lowest competing bid prior to you making your bid? We will follow Ronald Howard in answering this question using decision theory, thus putting a monetary value on a piece of information.

The following are the key quantities in this problem:

$E$, the expense to your company in constructing the development. It is a random variable.
$L$, the lowest price among all competing bids. It is a random variable.
$B$, your bid. It is a decision variable under your control, not a random variable.
$V$, the profit you realize, a random variable.

The situation is represented using influence diagrams in the following figure. (Incidentally influence diagrams were also first formally published by Ronald Howard in (Howard and Matheson, "Influence diagrams", 1981), and a nice historical piece on them is available from Judea Pearl in (Pearl, "Influence Diagrams - Historical and Personal Perspectives", 2005).)

In the diagram the round nodes represent random variables, just like in directed graphical models (Bayesian networks). The rectangular node represents a decision node under our control, here the bid $B$ we submit. The diamond-shaped utility node represents a value achieved, in our case the profit $V$. The diagram alone is not enough: we also need to specify how our profit $V$ comes about.

The first step in applying decision theory is to assume that everything is known. So let us assume $B$, $E$, $L$ are known. Then, it is easy to see whether we actually won the contract, i.e. whether our bid is small enough, $B < L$. If $B \geq L$, we do not obtain the contract and the profit is zero. (We assume here, for simplicity, that the cost for making the bid is zero.) If we won the bid, that is, if $B < L$ is true, then the profit is simply the bid price minus our expenses, $B - E$. Therefore we have the profit as a function of $B$, $E$, and $L$ as

$$V = \left\{\begin{array}{cl}0,&\textrm{if $B \geq L$,}\\ B-E,&\textrm{if $B < L$.}\end{array}\right.$$

The above expression can also be written using indicator notation as $V = \mathbb{1}_{\{B < L\}} \cdot (B-E)$.

But $B$, $E$, $L$ are not known. The second step in applying decision theory is therefore to take expectations with respect to everything that is unknown ($E$ and $L$ in our case) and to maximize utility with respect to all decisions ($B$ in our case). We do this in two steps. Let us first assume $B$ is fixed. Then we take the expectation of the above expression with respect to the unknown $E$ and $L$,

$$\mathbb{E}[V | B] = \mathbb{E}_{E,L}[\mathbb{1}_{\{B < L\}} \cdot (B-E)].$$

Now we further assume independence of the cost $E$ and the lowest competing bid $L$, that is $P(E,L) = P(E) \, P(L)$, a reasonable assumption. Here is an example visualization of priors $P(E) = \textrm{Gamma}(\textrm{Shape}=80,\textrm{Scale}=6)$ and $P(L) = \mathcal{N}(\mu=1100, \sigma=120)$.

Assuming independence we obtain

\begin{eqnarray} \mathbb{E}[V | B] & = & \mathbb{E}_{E,L}[\mathbb{1}_{\{B < L\}} \cdot (B-E)]\nonumber\\ & = & P(B < L) (B - \mathbb{E}_E[E]).\label{eqn:VgivenB} \end{eqnarray}

The expression (\ref{eqn:VgivenB}) is intuitive: the expected profit is given by the probability of winning the bidding times the difference between bid and expected cost. Here is a visualization for the above priors, with our bid $B$ on the horizontal axis.

You can see three regimes: 1. When $P(B < L)$ is very large (up to about $B=850$) the expected profit behaves linearly as $B-\mathbb{E}_E[E]$, and if we bid below our expected cost we realize a negative expected profit (loss). 2. When $P(B < L)$ is very small (above $B=1300$) the expected profit drops to zero. 3. Between $B=850$ and $B=1300$ we see the product expression resulting in a nonlinear profit as a function of $B$.

To finish the second step of applying decision theory we have to maximize (\ref{eqn:VgivenB}) over our decision $B$, yielding

$$\mathbb{E}[V] = \max_{B} \mathbb{E}[V|B].$$

This tells us how to bid without the help of a spy: in the above example figures, we obtain an expected profit $\mathbb{E}[V] = 421.8$ for a bid of $B=966.2$.
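The numbers above can be checked with a few lines of code. The following is a minimal sketch, not the code used for the original figures: it assumes the example priors stated earlier, uses only that the Gamma prior has mean shape times scale, and replaces a proper optimizer with a simple grid search over bids. The names `prob_win` and `expected_profit`, and the grid range and step, are my own choices for illustration.

```python
import math

# Example priors from the text.
MEAN_E = 80 * 6               # mean of Gamma(shape=80, scale=6) is shape * scale = 480
MU_L, SD_L = 1100.0, 120.0    # L ~ Normal(mu=1100, sigma=120)

def prob_win(bid):
    """P(B < L): the normal survival function of L evaluated at the bid."""
    return 0.5 * math.erfc((bid - MU_L) / (SD_L * math.sqrt(2.0)))

def expected_profit(bid):
    """E[V | B] = P(B < L) * (B - E[E]), valid under independence of E and L."""
    return prob_win(bid) * (bid - MEAN_E)

# Grid search over bids from 400.0 to 1500.0 in steps of 0.1 (an arbitrary choice).
bids = [400.0 + 0.1 * i for i in range(11001)]
best_bid = max(bids, key=expected_profit)
best_profit = expected_profit(best_bid)

print(f"best bid {best_bid:.1f}, expected profit {best_profit:.1f}")
```

Only the mean of $E$ enters the computation, so the shape of the cost prior does not matter for the optimal bid as long as $E$ and $L$ are independent.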

Revealing $L$ gives a large competitive advantage, but how much would we be willing to pay a spy for this information? To this end Howard introduces the concept of clairvoyance and value of information.

In clairvoyance we consider what could happen if a clairvoyant appears and offers us perfect information about $L$. If we knew $L$ we could compute as before

\begin{eqnarray} \mathbb{E}[V | B, L] & = & P(B < L) (B - \mathbb{E}_E[E])\nonumber\\ & = & \mathbb{1}_{\{B < L\}} (B - \mathbb{E}_E[E]),\nonumber \end{eqnarray}

where the probability $P(B < L)$ is now deterministic one or zero as $B$ is our decision and $L$ is known. As $B$ is our decision we again maximize over it.

\begin{eqnarray} \mathbb{E}[V | L] & = & \max_B \mathbb{E}[V | B,L]\nonumber\\ & = & \max_B \mathbb{1}_{\{B < L\}} (B - \mathbb{E}_E[E])\nonumber\\ & = & \left\{\begin{array}{cl}L - \mathbb{E}_E[E], & \textrm{if $L > \mathbb{E}_E[E]$,}\\ 0, & \textrm{otherwise (do not bid).}\end{array}\right.\nonumber \end{eqnarray}

The last step can be seen as follows: our bid $B$ should be above our expected expenses $\mathbb{E}_E[E]$, otherwise we would incur a negative expected profit, but $B$ should also be as high as possible while staying just below $L$. Hence if this is impossible ($L \leq \mathbb{E}_E[E]$) we do not bid. Otherwise we bid $B=L-\epsilon$ and realize the expected profit $L-\mathbb{E}_E[E]$.

Ok, so this tells us how to bid when we know $L$. But we do not know $L$ yet. Instead we would like to put a value on the information about $L$. We do this by integrating out $L$,

$$\mathbb{E}_L[\mathbb{E}[V|L]]$$

(Howard introduces a special notation for the above expression, but I am not a fan of it and will omit it here.)

The value of information (value of $L$) is now defined as

$$\textrm{EVPI}(L) = \mathbb{E}_L[\mathbb{E}[V|L]] - \mathbb{E}[V].$$

This quantity is again intuitive: the value of knowing $L$ is the expected difference between the utility achieved with knowledge of $L$ and the expected utility achieved without such knowledge.

The abbreviation $\textrm{EVPI}$ denotes the expected value of perfect information, a term that was introduced later and has become standard in decision analysis.

So how much is the knowledge of $L$ worth in our example? We compute

$$\mathbb{E}_L[\mathbb{E}[V|L]] \approx 620.0$$

with Monte Carlo and we had $\mathbb{E}[V] = 421.8$ from earlier, hence

$$\textrm{EVPI}(L) \approx 620.0 - 421.8 = 198.2,$$

is the maximum price we should pay our spy for telling us $L$ exactly.
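As a rough check of this Monte Carlo computation, here is a small sketch under the same example priors. It uses the closed form for $\mathbb{E}[V|L]$ derived above, takes the no-spy profit of 421.8 from the text rather than recomputing it, and the sample size and seed are arbitrary choices of mine.

```python
import random

MEAN_E = 480.0                # E[E] for Gamma(shape=80, scale=6)
MU_L, SD_L = 1100.0, 120.0    # prior on the lowest competing bid L
NO_SPY_PROFIT = 421.8         # E[V] without information, taken from the text

def profit_given_l(lowest_bid):
    """E[V | L]: bid just below L if that beats our expected cost, else do not bid."""
    return max(lowest_bid - MEAN_E, 0.0)

rng = random.Random(0)        # fixed seed so that the sketch is reproducible
n_samples = 200_000
with_spy = sum(profit_given_l(rng.gauss(MU_L, SD_L)) for _ in range(n_samples)) / n_samples
evpi = with_spy - NO_SPY_PROFIT

print(f"E_L[E[V|L]] ~ {with_spy:.1f}, EVPI(L) ~ {evpi:.1f}")
```

Because $L$ falls below the expected cost of 480 only with negligible probability under this prior, the estimate is essentially the prior mean of $L$ minus 480.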

The Fair Price to Pay an Expert

The above was the original scenario described in Howard's paper. In practice obtaining perfect knowledge is often infeasible. But the above reasoning extends easily to the general case where we only obtain partial information.

Here is an example for our setup: consider that we can ask an expert to provide us an estimate $L'$ of what the lowest bid $L$ could be.

By assuming a probability model $P(L' | L)$ we can relate the true unknown lowest bid $L$ to the expert's guess.

The influence diagram looks as follows:

Recipe for Value of Information Computation

To understand how the above derivation extends to this case, let us state a recipe for computing the value of information:

State the expected utility, conditioned on decisions and the information to be valued.
Maximize the expression of step 1 over all decisions.
Marginalize the expression of step 2 over the information to be valued, using your prior beliefs. The resulting expression is the expected utility with information.
Start over: state the expected utility, conditioned only on decisions.
Maximize the expression of step 4 over all decisions. The resulting expression is the expected utility without information.
Compute the value of information as the difference between the two expected utilities (step 3 minus step 5).

This recipe works for any single-step decision problem, and any potential difficulties are computational.
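For a finite problem the six steps fit into a few lines. The sketch below is purely illustrative: the function `value_of_information`, the dictionary layout, and the umbrella toy problem with its numbers are hypothetical and not part of Howard's paper. It uses the fact that summing $\max_d \sum_s P(s)\,P(y|s)\,u(d,s)$ over signals $y$ equals the expectation over $y$ of the best posterior expected utility, so no explicit normalization by $P(y)$ is needed.

```python
def value_of_information(prior, likelihood, utility):
    """Value of observing a signal before deciding, for a finite decision problem.

    prior:      dict state -> P(state)
    likelihood: dict state -> dict signal -> P(signal | state)
    utility:    dict decision -> dict state -> utility
    """
    decisions = list(utility)
    signals = {y for s in likelihood for y in likelihood[s]}

    # Steps 1-3: expected utility given signal and decision, maximized over
    # decisions, then marginalized over the signal.
    with_info = 0.0
    for y in signals:
        with_info += max(
            sum(prior[s] * likelihood[s].get(y, 0.0) * utility[d][s] for s in prior)
            for d in decisions
        )

    # Steps 4-5: expected utility given only the decision, maximized over decisions.
    without_info = max(sum(prior[s] * utility[d][s] for s in prior) for d in decisions)

    # Step 6: the difference is the value of information.
    return with_info - without_info

# Hypothetical toy problem: take an umbrella or not, with an optional forecast.
prior = {"rain": 0.3, "dry": 0.7}
utility = {"umbrella": {"rain": -1.0, "dry": -1.0},
           "none": {"rain": -10.0, "dry": 0.0}}
perfect = {"rain": {"says_rain": 1.0}, "dry": {"says_dry": 1.0}}
noisy = {"rain": {"says_rain": 0.8, "says_dry": 0.2},
         "dry": {"says_rain": 0.2, "says_dry": 0.8}}

voi_perfect = value_of_information(prior, perfect, utility)
voi_noisy = value_of_information(prior, noisy, utility)
print(voi_perfect, voi_noisy)
```

In this toy problem a perfect forecast is worth much more than the noisy one, because the noisy forecast only rarely changes the decision to take the umbrella.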

Application of the Recipe to our Example

Here is its application to our generalized example:

This is $\mathbb{E}[V | L', B]$ which is obtained by marginalizing over $E$ and $L$ in $\mathbb{E}[V | L', B, E, L]$ and the marginal of $L$ is $P(L|L')$ obtained by Bayes rule.
Maximize over $B$, obtaining $\max_B \mathbb{E}[V | L', B]$.
Take the expectation over $L'$, which is defined via $P(L') = \mathbb{E}_{L}[P(L'|L)]$, yielding
\begin{equation} \mathbb{E}_{L'}[\max_B \mathbb{E}[V | L', B]]\label{eqn:Ltick-withinfo} \end{equation}
This is $\mathbb{E}[V | B]$ which is obtained by marginalizing over $E$ and $L$, here the marginal of $L$ is the prior $P(L)$.
Maximize over $B$, obtaining
\begin{equation} \max_B \mathbb{E}[V | B].\label{eqn:Ltick-withoutinfo} \end{equation}
The value of information is the difference between (\ref{eqn:Ltick-withinfo}) and (\ref{eqn:Ltick-withoutinfo}),
$$ \textrm{EVPI}(L') = \mathbb{E}_{L'}[\max_B \mathbb{E}[V | L', B]] - \max_B \mathbb{E}[V | B].$$

To make the above example concrete, let us assume that our expert is unbiased and we have

$$P(L'|L) = \mathcal{N}(L, \sigma),$$

where $\sigma > 0$ is the standard deviation. Computing $\textrm{EVPI}(L')$ as a function of $\sigma$ is possible by solving the maximization and integration problems.

Using the same parameters as before and using Monte Carlo for the integration, here is a visualization of the fair price to pay our expert.

We can see that for $\sigma \to 0$ we recover the previous case of perfect information as the expert provides increasingly accurate knowledge about $L$ when $\sigma$ decreases. Conversely, with increasing expert uncertainty the value of his expert advice decreases.
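The curve described above can be approximated with the following sketch. It is not the code behind the original figure and rests on several assumptions of mine: it exploits that a normal prior on $L$ with a normal expert model gives a normal posterior $P(L|L')$ and a normal marginal $P(L')$, it takes the no-information profit of 421.8 from the text, and it maximizes over bids on a grid placed relative to the posterior. The function name `value_of_expert`, the grid, the sample size, and the seed are illustrative choices.

```python
import math
import random

MEAN_E = 480.0                # E[E] for Gamma(shape=80, scale=6)
MU_L, SD_L = 1100.0, 120.0    # prior on the lowest competing bid L
NO_INFO_PROFIT = 421.8        # max_B E[V | B], taken from the text

# Candidate bids are parametrized as posterior mean + z * posterior sd.
Z_GRID = [-6.0 + 0.05 * i for i in range(201)]               # z from -6 to 4
SURVIVAL = [0.5 * math.erfc(z / math.sqrt(2.0)) for z in Z_GRID]  # P(L > bid | L')

def value_of_expert(sigma, n_samples=2000, seed=0):
    """Monte Carlo estimate of E_{L'}[max_B E[V | L', B]] - max_B E[V | B]."""
    rng = random.Random(seed)
    post_var = 1.0 / (1.0 / SD_L**2 + 1.0 / sigma**2)   # variance of L given L'
    post_sd = math.sqrt(post_var)
    marginal_sd = math.sqrt(SD_L**2 + sigma**2)         # sd of the marginal of L'
    total = 0.0
    for _ in range(n_samples):
        estimate = rng.gauss(MU_L, marginal_sd)         # draw the expert estimate L'
        post_mean = post_var * (MU_L / SD_L**2 + estimate / sigma**2)
        best = max(p * (post_mean + z * post_sd - MEAN_E)
                   for z, p in zip(Z_GRID, SURVIVAL))
        total += max(best, 0.0)                         # not bidding yields zero
    return total / n_samples - NO_INFO_PROFIT

for sigma in (5.0, 50.0, 200.0):
    print(sigma, round(value_of_expert(sigma), 1))
```

The estimates carry Monte Carlo noise of a few units at this sample size, so only the overall trend should be read off: the value is close to the perfect-information case for an accurate expert and shrinks as $\sigma$ grows.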

Computation

(This was added in April 2016 after the original article was published.)

Computing the EVPI can be challenging because in many cases both the maximization problem and the expectation are intractable analytically and sample-based Monte Carlo approximations induce a non-negligible bias.

The recent work of (Takashi Goda, "Unbiased Monte Carlo estimation for the expected value of partial perfect information", arXiv:1604.01120) addresses part of the computational difficulties by applying a randomly truncated series to de-bias the ordinary Monte Carlo estimate. I have not performed any experiments but it seems to be a potentially useful method in the context of value of information computation problems.

Summary

From a formal decision theory point of view the value of information does not occupy a special place. It just measures the difference between two different expected utilities, given optimal decisions.

But value of information appears frequently in almost any statistical decision task. Here are two more examples.

In active learning we are interested in minimizing the amount of supervision needed to learn to perform a task and we can obtain supervision (ground truth class labels, for example) for instances of our choice at a cost. By applying value of information we can select for supervision the instances whose revealed label information brings the highest expected increase in utility.

In experimental design we have to make choices about which information to acquire, such as the number of patients to sample in a medical trial, or what information to collect at different costs in a customer survey. Value of information provides a way to make these choices, either statically or, better, adaptively.

Limitations

While decision theory is rather uncontroversial, it is a normative theory, that is, it tells you how to derive decisions which are optimal and coherent (rational). There are two main limitations I would like to point out:

As a normative theory it cannot claim to be a description of how humans (or other intelligent agents) make decisions.
It assumes infinite reasoning resources on behalf of the acting agent.

Both limitations are related of course in that real intelligent agents may deviate from normative decision theory precisely because they are limited in their reasoning abilities. There are both normative and descriptive theories to address these limitations. On the normative side we have for example computational rationality, taking into account the computational costs of reasoning and deriving optimal decisions within these constraints. On the descriptive side we have for example prospect theory, aiming to describe human decision making.

ICCV 2015, Day 4

2015-12-16T23:50:00+00:00

This article summarizes the fourth day of the ICCV 2015 conference, the International Conference on Computer Vision. A summary of the first day, second day, and third day is also available.

ICCV 2017 and 2019

ICCV 2017 will be in Venice, Italy.

For ICCV 2019 there was an open voting between Seoul (Korea) and Shanghai (China), with Seoul winning the election. Both proposals were strong and because I have lived in Shanghai for two years I favored that proposal, but I am confident that ICCV 2019 in Seoul will be wonderful as well.

Parties

Computer vision is now fully recognized as having an impact in the industry. All large tech companies invested heavily in the last three years or so, and one of the visible results is the increased number of conference sponsors and the conference parties.

Conferences such as NIPS, CVPR, and ICCV now host invite-only open bar parties with several hundred attendees; this year at ICCV there were parties by Microsoft, Intel, Google, and Facebook.

Interestingly they do not come across as recruiting events: there is a minimal announcement perhaps, but otherwise people just chat with food and drinks. It is more a show of strength and goodwill towards the community that computer vision is taken seriously and the parties do demonstrate that the companies are in good shape, much like banks invest in a marble floor and shiny glass facades to gain the trust of their customers.

Interesting Papers

Polarized 3D: High-Quality Depth Sensing with Polarization Cues

By Achuta Kadambi, Vage Taamazyan, Boxin Shi, and Ramesh Raskar.

Polarization of light is a rarely exploited cue for 3D reconstruction. This work revisits shape-from-polarization and shows fine detail 3D reconstruction from polarization information (with non-trivial post-processing).

Paper.

ICCV 2015, Day 3

2015-12-16T01:00:00+00:00

This article summarizes the third day of the ICCV 2015 conference, the International Conference on Computer Vision. A summary of the first day and second day is also available.

Interesting Papers

Registering Images to Untextured Geometry Using Average Shading Gradients

By Tobias Ploetz and Stefan Roth.

This work considers the difficult problem of aligning an untextured 3D surface to a real image of the same object, a challenging problem because of the absence and presence of edges depending on texture and light.

The authors propose an alignment procedure that uses efficiently computable average shading gradient images that capture expected visible edges due to shadows despite unknown light direction.

Paper.

Robust Nonrigid Registration by Convex Optimization

By Qifeng Chen and Vladlen Koltun.

The authors consider the problem of aligning two 3D shapes to each other, where each shape may be corrupted by missing surfaces (non water-tight surfaces) and undergo severe nonrigid deformations. Previous work has proposed to minimize a specific geodesic distortion measure over suitable classes of continuous transformations; however, this yields difficult non-convex optimization problems.

Because the distortion measure makes sense, this work proposes a way to approximate it while simultaneously convexifying the problem. This is achieved by representing the transformation nonparametrically through correspondences on randomly sampled points. While the original problem was continuous and non-convex, it is now a discrete energy minimization problem that can be approximately solved using a standard LP-based relaxation approach, for which the authors use TRW-S.

What is surprising is how much the results improve on benchmark data sets; the error is reduced by a factor of three compared to strong baseline methods.

Paper.

ICCV 2015, Day 2

2015-12-15T01:20:00+00:00

This article summarizes the second day of the ICCV 2015 conference, the International Conference on Computer Vision. A summary of the first day is also available.

Awards

The following awards were given at ICCV 2015.

Achievement awards

PAMI Distinguished Researcher Award (1): Yann LeCun
PAMI Distinguished Researcher Award (2): David Lowe
PAMI Everingham Prize Winner (1): Andrea Vedaldi for VLFeat
PAMI Everingham Prize Winner (2): Daniel Scharstein and Rick Szeliski for the Middlebury Datasets

Paper awards

PAMI Helmholtz Prize (1): David Martin, Charles Fowlkes, Doron Tal, and Jitendra Malik for their ICCV 2001 paper "A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics".
PAMI Helmholtz Prize (2): Serge Belongie, Jitendra Malik, and Jan Puzicha, for their ICCV 2001 paper "Matching Shapes".
Marr Prize: Peter Kontschieder, Madalina Fiterau, Antonio Criminisi, and Samuel Rota Bulo, for "Deep Neural Decision Forests".
Marr Prize honorable mention: Saining Xie and Zhuowen Tu for "Holistically-Nested Edge Detection".

Interesting Papers

The above Marr prize winning papers are very nice, but here I also want to highlight three other papers I found interesting today.

Fast R-CNN

By Ross Girshick.

Since 2014 the standard object detection pipeline for natural images has been the R-CNN system, which first extracts a set of object proposals and then scores them using a convolutional neural network. The two key weaknesses of the approach are: first, the separation between proposal generation and scoring, which prevents joint training of model parameters; and second, the separate scoring of each hypothesis, which leads to significant runtime overhead. This work and the follow-up work ("Faster R-CNN" at NIPS this year) address both issues by proposing a joint model that is trained end-to-end, including proposal generation, leading to a new state of the art in object detection.

Code, paper.

Unsupervised Visual Representation Learning by Context Prediction

By Carl Doersch, Abhinav Gupta, and Alexei A. Efros.

Supervised deep learning needs lots of labeled training data to achieve good performance. This paper investigates whether we can create and train deep neural networks on artificial tasks for which we can create large amounts of training data. In particular, the paper proposes to predict where a certain patch appears within the image. For this task, an almost infinite amount of training data is easily created. Perhaps surprisingly the resulting network, despite being trained on this artificial task, has learned useful representations for real vision tasks such as image classification.

Paper.

Deep Fried Convnets

By Zichao Yang, Marcin Moczulski, Misha Denil, Nando de Freitas, Alex Smola, Le Song, and Ziyu Wang.

In deep convolutional networks the last few densely connected layers have the most parameters and thus account for most of the memory required during training and at test time. This work proposes to leverage the fastfood kernel approximation to replace densely connected layers with specific efficient and low-parameter operations.

The empirical results are impressive and the fastfood justification is plausible, but I wonder if this work may even provide a hint at a more general approach to construct efficient neural network architectures by using arbitrary dense but efficient matrix operations (FFT, DCT, Walsh-Hadamard, etcetera).
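To make the idea of replacing a dense layer concrete, here is a minimal NumPy sketch of a Fastfood-style structured transform of the form S H G P H B, where S, G, and B are diagonal, P is a permutation, and H is the Walsh-Hadamard matrix. This is an illustration under my own assumptions, not the authors' implementation: the function names, the random initialization, and the choice to normalize each Hadamard transform by the square root of the dimension are all hypothetical.

```python
import numpy as np

def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform of a length-2^k vector."""
    x = np.asarray(x, dtype=float).copy()
    n = x.shape[0]
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        # Butterfly step: combine blocks of size h pairwise.
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x

def fastfood_apply(x, S, G, perm, B):
    """Compute S H G P H B x, with the diagonal matrices stored as vectors.

    Dividing by d corresponds to using the orthonormal transform
    H / sqrt(d) twice (a normalization chosen for this sketch).
    """
    d = x.shape[0]
    return S * fwht(G * fwht(B * x)[perm]) / d

# Hypothetical usage: a d-by-d "layer" described by 3 d numbers and a permutation.
rng = np.random.default_rng(0)
d = 8
S = rng.normal(size=d)
G = rng.normal(size=d)
B = rng.choice([-1.0, 1.0], size=d)
perm = rng.permutation(d)
y = fastfood_apply(rng.normal(size=d), S, G, perm, B)
```

The structured map stores three vectors of length d and one permutation instead of d squared weights, and applying it costs on the order of d log d operations because of the fast transform.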

Paper.

ICCV 2015, Day 1

2015-12-14T23:00:00+00:00

ICCV 2015, the International Conference on Computer Vision, is one of the premier venues for computer vision research, together with the CVPR conference. This ICCV is happening in Santiago, Chile, a beautiful city with amazing food.

The computer vision community is growing, and this ICCV is the largest so far (1460 attendees, 525 papers). For a few years now computer vision has been broadly relevant to industry, and there are no fewer than 22 companies sponsoring the conference. The acceptance rate this year was 30.92%, with the acceptance rate for oral presentations at 3.30%. All papers of the conference are available as open-access PDF here.

There was a lot of interesting work presented on the first day, but here is my subjective selection of interesting work.

Aligning Books and Movies

By Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler.

Movies and the books they are based on form a rich paired data source. In this work the authors propose a recurrent neural network model to align these two sources semantically. The challenge is that movies and books are often substantially different, but apparently modern recurrent neural networks have enough semantic discrimination ability to enable such alignment.

Project page, paper.

Convolutional Color Constancy

By Jonathan Barron.

Color constancy deals with the correction of colors in digital images. While there have been a large number of works in this area, the issue remains challenging and important.

In this work the author convincingly demonstrates that common changes in colors correspond to simple translation of a color histogram in a transformed 2D histogram space. Then, the problem of correcting for these translations can be posed as simply recognizing the true center position of the observed color histogram and undoing the translation.
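The translation property can be checked numerically in a few lines. The sketch below assumes a log-chroma parameterization u = log(g/r), v = log(g/b) and models an illuminant as a per-channel gain; the function names and the example gains are mine, chosen for illustration only. Under these assumptions, tinting an image shifts every pixel by the same offset in (u, v), so the 2D histogram over (u, v) is translated without changing shape.

```python
import numpy as np

def log_chroma(rgb):
    """Map RGB values (last axis) to log-chroma coordinates (u, v)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return np.stack([np.log(g / r), np.log(g / b)], axis=-1)

def tint(rgb, gains):
    """Apply a per-channel illuminant gain (an assumed illuminant model)."""
    return rgb * np.asarray(gains, dtype=float)

# Hypothetical data: 1000 random strictly positive "pixels".
rng = np.random.default_rng(0)
pixels = rng.uniform(0.05, 1.0, size=(1000, 3))
gains = (1.8, 1.0, 0.6)  # example illuminant, chosen arbitrarily

# Per-pixel displacement in log-chroma space caused by the tint.
shift = log_chroma(tint(pixels, gains)) - log_chroma(pixels)

# The displacement predicted from the gains alone.
expected = np.array([np.log(gains[1] / gains[0]), np.log(gains[1] / gains[2])])
print(np.allclose(shift, expected))
```

Because the displacement depends only on the gains and not on the pixel values, estimating the illuminant reduces to locating the histogram in this 2D space.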

Paper.

Self-Calibration of Optical Lenses

By Michael Hirsch and Bernhard Schoelkopf.

Both cheap and expensive camera lenses suffer from many optical effects, leading to deterioration in image quality. This work proposes an automatic way to obtain non-parametric kernel estimates of the point spread functions characterising a lens. The resulting model can then be used to deblur images. In effect, this allows better image quality even when using cheap lenses.

Paper.

The second day is available now.

Ten Tips for Writing CS Papers, Part 2

2015-12-10T22:30:00+00:00

This continues the first part on tips to write computer science papers.

6. Ideal Structure of a Paragraph

A paper has different levels of formal structure: sections, subsections, paragraphs, sentences. It is important to ensure that the structure of the content aligns well with the formal structure because the formal structure is readily perceived by the reader, whereas the structure of the content is not. With a good alignment we make it easy for the reader to have the right mental model for the organization of the content; this enables a better navigation and memory of the content.

An important consequence of a well organized paper is to minimize the possible surprise for the reader. In general you may want to surprise readers with how amazing your method or achievements are, but not through the organization of the paper.

How to align the content with the formal structure? There is more to say about this and I recommend the references at the end of this article, but here I want to focus on the structure of one or multiple paragraphs. The basic rules are:

One paragraph should contain only a single idea or a single point of argumentation.
The beginning and the end of a paragraph glue the paragraph into the surrounding content.

There is an ambiguity as to what constitutes a separate idea and indeed paragraphs may be of quite different lengths.

To achieve a good structure, here is a recipe that works for me. For a section I would like to write I make a list of bullet points of things I want to say, with one bullet point being a single idea or important point. Each point may have one or more dependencies on other points and I use the dependencies to order the list. Finally, I write one paragraph for each item on the list and I may add an additional paragraph at the beginning and end of the section to connect the section to the surrounding content.

I found that this recipe also makes my job as a writer easier because it overcomes my writing inhibition in two ways. First, I can start by simply making a list and this does not feel like writing. Second, once the ordering of ideas is clear, the actual writing becomes a lot simpler.

Here is an example of a less-than-ideal paragraph from Section 2.3 in (Gehler and Nowozin, 2008).

"As already mentioned to our knowledge (Argyriou et al., 2006) were the first to note the possibility of an infinite set of base kernels and they also stated the subproblem (Problem 1). We will defer the discussion of the subproblem to the next section and shortly comment on the differences of the Algorithm of (Argyriou et al., 2006) and the IKL Algorithm. We denote with $g$ the objective value of a standard SVM classifier with loss function $L$."

Let us reverse engineer the content of this paragraph, then restructure it. The paragraph makes two points: first, a connection to the work of (Argyriou et al., 2006). Second, it establishes some notation. So it should perhaps be split into two paragraphs.

For the first point, the beginning is also less than ideal: "as already mentioned to our knowledge"; it is a bit redundant and apologetic to point out that we already mentioned it and that we may not know better. The second point, the notation, is okay by itself, but it is unclear why it follows the first: is it done in order to enable the comparison between approaches? We would need to read ahead to find out. (This is indeed the case.) Here is a proposed improvement:

(Argyriou et al., 2006) first recognized the possibility of an infinite set of base kernels and we now discuss the connection to our work.

To make the connection explicit we first establish the notation we will use throughout the paper. We use $g$ to denote the objective value of a standard SVM classifier, where $L$ is the loss function.

It is simpler to read and makes it clear why we introduce the notation. Also note the end and beginning of the two short paragraphs: the end of the first paragraph tells you what comes next ("the connection to our work"), the beginning of the second paragraph tells you how this is done (through notation). The flow between the two paragraphs is natural now and they could almost be merged into one again with the single point of the resulting paragraph being "the connection between (Argyriou et al.) and our work".

7. Avoid Ambiguous Relative Pronouns (This, These, That, Which)

When used properly, a relative pronoun, such as "this", "these", "that", "which", can effectively refer to a previously mentioned noun, and that has to be remembered by the reader.

In the previous sentence, which entity did "that" refer to? Is it "a previously mentioned noun"? Or is it "a relative pronoun"? Or is it the proper use?

Ambiguities of relative pronouns are common because the writer does not experience the ambiguity. After all, it is clear to the writer what he refers to. Train yourself to recognize any potentially ambiguous relative pronoun, ideally by using a highlighter to mark them in a printout.

To resolve the ambiguity the easiest solution is simply to add the noun it refers to. For the above example, "that" would become "that noun".

(Another issue I ran into frequently is using "which" in cases where "that" should have been used, such as in "We use an algorithm which is efficient." I remember annoying a former American colleague of mine by using "which" a bit too often. Some advice is available.)

Here is a real example from an ICDM 2008 paper of mine. I highlight all relative pronouns.

Extracting such geometric patterns from molecular 3D structures is one of the central topic in computational biology, and numerous approaches have been proposed. Most of them are optimization methods, which detect one pattern at a time by minimizing a loss function (e.g., [14, 15, 6]). They are different from our approach enumerating all patterns satisfying a certain geometric criterion. In particular, they do not have a minimum support constraint. Instead they try to find a motif that matches all graphs.

This is not the worst example but can be improved nevertheless. The first "which" is best removed, the other relative pronouns are best clarified. Here is a proposed improvement:

Extracting such geometric patterns from molecular 3D structures is one of the central topics in computational biology, and numerous approaches have been proposed. Most of them are optimization methods, detecting one pattern at a time by minimizing a loss function (e.g., [14, 15, 6]). These optimization methods are different from our approach enumerating all patterns satisfying a certain geometric criterion. In particular, other methods do not have a minimum support constraint and instead try to find a motif that matches all graphs.

8. Provide Continuation Markers

Continuation markers are sentences or paragraphs, typically at the beginning of sections, that tell the reader what will be presented next and how it is relevant or how it relates to what has been presented already. They provide structure and flow, connecting the different parts of the paper.

Here is an example, from an ICCV 2015 paper:

"3. Method

We now describe our model for tracking fast moving objects. While the motion model is standard, the observation model for raw ToF captures is a novel contribution."

Note two elements here: first, there is an explicit statement of what will be presented next (the model for tracking fast moving objects). Second, we establish relevance with respect to the contribution.

There are two reasons why thinking about natural continuation markers for reading the paper is important. First, it enables navigation through the paper by allowing the reader to skip sections more efficiently. Second, without the necessary background it may take a reader multiple repeated readings to fully understand the paper. If you lost the reader, providing a natural re-entry point makes it easier to continue reading the paper despite a lack of understanding of some parts.

Both reasons are especially important for reviewers, a special type of reader. Ideally the reviewer is an expert in the field already, so we would like to make it easy for him to quickly navigate to relevant parts of the paper. Less ideally, the reviewer is working under time pressure or without keen interest in the work; in this case we would like to minimize misunderstanding or missing important points during reading.

It is important to co-locate the continuation markers with the actual text itself. It is not sufficient to provide a mini table-of-contents as part of the introduction ("In Section 2 we present related work. In Section 3 we present our method. etc.").

9. Multiple Authors

It is a reality that most computer science papers are authored by multiple authors. Coordinating the writing between multiple authors can be challenging on both the level of content and in terms of technology.

In terms of content, in my experience a recipe for disaster is to divide the paper into parts and agree that "Author A will write the introduction, author B will write the method, etcetera". The resulting draft will be incoherent and everyone has an excuse for delaying their part due to perceived dependencies ("I will write the method once the notation is defined in the introduction", "I will write the introduction when we have results").

Also, when dividing up work this way the draft can be poorly balanced in terms of relevant parts, as sub-authors tend to be assigned to the parts they have contributed to the most, which provides an incentive to describe their own contribution in too much detail (for example senior authors writing the introduction will fill it discussing their past research agenda that led to this work; the author writing about the implementation will want to go into detail because it was really difficult to get it to work and people may miss just how difficult it was, etcetera).

It is better to assign responsibility to a single author to write a full draft, then iterate together over this draft. There are two reasons why it is better: first, clear responsibility gets stuff done; second, the draft will be more coherent with a more linear flow of arguments.

The single author draft works best if the draft writer is an experienced author because iterating on a poorly organized draft may take more effort than a complete rewrite. When iterating on a draft it is important to distinguish substantial from minor changes. Minor changes are changes that fix issues locally, such as adding a sentence for clarification, changes of word order, typos, etc. These changes are important but not urgent. Most accomplished authors I know prefer to make these changes in passes through the full paper, much like polishing the paper with each reading.

Substantial changes are things like addition or removal of sections, changing the order of the presentation, enlarging or shrinking the claimed contribution, etcetera. Such changes can have large implications on the other parts of the paper which need to be addressed and therefore such changes are important and urgent because they require less time if made early.

In terms of technology, I frequently experienced problems due to the diversity of authors and their working styles. Often some authors will be senior authors with a proven but dated work setup, for example, not using basic version control systems and being stuck in an inflexible editor that mangles LaTeX every time it opens a file. To be fair, these authors are often most essential in terms of providing feedback on the content of the paper and they may have little time available to stay up to date with the latest tools. For addressing this problem with technology, my recommendations are the following:

Use a version control system: this should almost go without saying, and even if you are the sole author of a paper it is best to use a version control system because it provides a simple method to back your work up. But for multiple authors, coordinating the writing of a paper without a version control system is simply a waste of time and nerves for everyone involved.
Use a friendly version control system that provides a simple web interface; Bitbucket is my favorite for paper writing because it offers free private git repositories and allows you to view changes in a neat timeline in the browser. While hardly surprising to any git user, this feature is readily appreciated by everyone. Also, for minor changes Bitbucket actually allows editing from within the browser.
For yourself: when writing LaTeX write one sentence in a line and use a line break after each sentence. This makes merging conflicts easier and leads to fewer surprises with strange editors breaking long lines. (I also found that this helps me to improve the organization of a paragraph because every sentence now starts at the beginning of a line.)
When you need only high level feedback from your coauthors, sending them a PDF for annotation via email may still be the most efficient way.

10. Authorship and Author Ordering

Besides the writing itself, another common problem with multiple authors is discussions about authorship and author ordering. While not related to writing papers per se, I do want to share some remarks on this topic. There are only a few common situations where debates about author ordering arise. Here are a few examples, with the more common cases first:

A small contributor or someone involved in early discussions wants to be a co-author, but other authors disagree based on the amount of time they contributed.
There is a PhD student, a post-doc, and a faculty author and in most computer science venues the recognition is strongest for the first and last author position. The post-doc feels he guided the student the most so deserves to be recognized, but the faculty member may feel different based on seniority or being the source of funding.
Two or more students contributed to a piece of work and see their contribution as the strongest; this happens sometimes when a student postpones a line of work and another student is continuing with the work, directed by a joint supervisor.
Two or more senior authors feel that they started or guided the project the most.

Obviously there is no "right way" to handle all circumstances, and indeed computer science handles authorship differently from, say, mathematics. Of course everyone agrees that scientific authorship should imply substantial contributions to the work, but that is about as ambiguous a statement as can be made. To be more concrete, here are some observations.

First, some conflicts can be anticipated, for example the case of two students. Here, it is best to discuss a possible publication and authorship as soon as the second student gets involved. This discussion should be summarized via email for future reference. Likewise for the case of the small contributor, as soon as it is clear the work will end up in a publication a discussion should help to set expectations, for example to offer authorship only if additional work is invested.

Second, as a young PhD student one naturally underestimates the implicit future benefits that arise from co-authorship. For example the senior co-authors may present the work at venues otherwise inaccessible, or the work will lead to substantial future collaborations with the original co-authors.

Third, when considering whether to include a small contributor as co-author, the problem is most often not the co-authorship itself, but possible future actions by the contributor after the paper is published (for example, giving seminar talks about the paper). The other authors may then feel that the credit and opportunities are taken away from them. These problems can be avoided by discussing early not just the co-authorship itself but also which future paper-related actions are done by whom. For example, all authors may agree that seminar and job talks about the work should only be presented by the lead author.

Ten Tips for Writing CS Papers, Part 1

2015-11-29T21:00:00+00:00

As a non-native English speaker I can relate to the challenge of writing concise and clear English. Scientific writing is particularly challenging because the audience is only partially known at the time of writing: at best, the paper will still be read in 10 or 20 years from the time of writing by people from all over the world.

Learning to write papers well takes a long time and is achieved mostly by practice, that is, writing and publishing papers. But to improve your writing at a faster pace you can actively reflect on certain patterns and writing habits you may have.

Below I compiled a short list of some best practices from my own experience and preference, with more following in a second part. This list is by no means exhaustive and has a certain bias towards computer science publications. However, I hope it will serve as an inspiration to improve your writing.

I provide some examples of poor writing from published papers. To avoid offending anyone, I select the examples from my own published papers.

1. Use Simple Language

Concepts and ideas in scientific papers can at times be complex but the writing used to describe them should remain simple. Simple writing has short sentences, a clear logical structure, and uses minimal jargon. Writing papers is not poetry but still requires you to pay attention to the language you use.

Computer science does not seem to have an overly large problem with complex writing, possibly due to a large number of non-native English speakers. Or perhaps there is a strong desire to be understood by the writers; other academic fields are more challenged.

Yet, I have frequently seen non-native English speaking junior authors, perhaps when writing their first paper, who attempt to copy style from their native language. At least for native German speakers (like me) this would often lead to comparatively complex writing in terms of sentence lengths and less than optimal didactics in terms of presenting the abstract before the concrete.

If still in doubt whether using simple language is a good idea, check this Ig-Nobel-prize-winning work: (Oppenheimer, "Consequences of Erudite Vernacular Utilized Irrespective of Necessity: Problems with Using Long Words Needlessly", Applied Cognitive Psychology, 2006).

2. State your Contribution

The key contribution of most published papers falls into exactly one out of the following three categories.

Insight: you have an explanation for something that is already there.
Performance: you can do something better.
Capability: you can do something that could not be done before.

If you know which category your paper falls into, emphasize this aspect early in the paper, ideally in the abstract. This sets the tone and expectations for the remainder of the paper.

3. See Everything as a Facet on the Contribution

Every scientific paper claims a contribution over previous work. Once you have stated the contribution clearly, the rest of the paper is there just to support the contribution: The introduction motivates the need for your contribution. The related work section differentiates prior work from your claimed contribution. The method section typically provides a description of the contribution. The experiments verify that your contribution works as advertised. Etcetera.

The point is: the contribution anchors everything else in the paper. If the contribution is clear, every part of the paper should make sense and become a different facet or view onto the contribution.

There are two common ways in which this simple structure is violated, leading to a poorly written paper. The first way is to not clearly state the contribution, leaving it ambiguous during the whole paper. In such papers some method may be described, some experiments may be performed, but the higher goal does not emerge. At the end of the paper, the reader may agree with all statements of the paper and still wonder what he should make of it.

The second way to violate the structure is less severe: a long description of another method or work is added to the paper. I have seen this frequently with junior authors who have just learned about a cool method and want to showcase their understanding. Such description may even be interesting to a reader of the paper, but it is orthogonal to the contribution of the paper thus has negative value and is best removed.

4. Consider Using a Page-1 Figure

Consider using an explanatory figure on page one of the paper. This was started in the SIGGRAPH community with the work of Randy Pausch, but has slowly spread to other communities.

The main purpose of a page one figure is to provide a gist of the paper, much like a "visual abstract". It highlights what is important and sets the right expectations. It is also visually engaging and whets the appetite of the reader.

What makes a good page one figure? 1. Simplicity: You need to be able to understand it in 20 seconds or less. 2. Being self-contained: All relevant information should be in the figure or the figure caption. The figure caption should be short.

Many papers benefit from the addition of a page one figure, but there are some exceptions, for example in theory papers it could appear out of place.

5. Avoid the Passive Voice

You can write clear English in both the active and passive voice. A historical note on this is available in this essay on active vs passive voice in scientific writing:

"More than a century ago, scientists typically wrote in an active style that included the first-person pronouns I and we. Beginning in about the 1920s, however, these pronouns became less common as scientists adopted a passive writing style.

Considered to be objective, impersonal, and well suited to science writing, the passive voice became the standard style for medical and scientific journal publications for decades.

...

Nowadays, most medical and scientific style manuals support the active over the passive voice."

The reason for this change is simple: most people find text written in the active voice easier to read and more engaging. Duke university published a guide on scientific writing that contains a long discussion on the active versus passive voice.

In my writing there are very few exceptions where the passive voice may be more appropriate, for example when discussing prior work ("The relationship between iron intake and lifespan of parrots was studied by Miller and Smith.") or when discussing experimental results ("The test error remained small even when the regularization strength was decreased."), but even for these two examples we can find an alternative active formulation ("Miller and Smith studied the relationship between iron intake and lifespan of parrots.") and ("Even when we decreased the regularization strength the test error remained small."). The use of the passive voice in these two exceptions conveys an impersonal attitude that may be justified when discussing the work of others or reporting (as opposed to interpreting) experimental results.

Here is a real example from an ICCV 2007 paper of mine (page 4):

The dual problem has a limited number of variables, but a huge number of constraints. Such a linear program can be solved efficiently by the constraint generation technique: Starting with an empty hypothesis set, the hypothesis whose constraint (6) is violated the most is identified and added iteratively. Each time a hypothesis is added, the optimal solution is updated by solving the restricted dual problem.

I highlight all the passive formulations. Here is a rewrite of the paragraph using only the active voice:

The dual problem has a limited number of variables, but a huge number of constraints. We can solve such a linear program efficiently by the constraint generation technique: Starting with an empty hypothesis set, we identify the hypothesis with the largest constraint violation in (6) and add the hypothesis to the hypothesis set. Each time we add a hypothesis, we also update the optimal solution by solving the restricted dual problem.

I made a few minor changes such as changing the word order and adding the noun ("to the hypothesis set") for added clarity. I hope you agree that the second version is easier to read.

The next part is available now.

History of Monte Carlo Methods - Part 3

2015-11-13T21:30:00+00:00

This is the third part of a three part post. The first part covered the early history of Monte Carlo and the rejection sampling method, the second part covered sequential Monte Carlo.

Part 3

In this part we are going to look at Markov chain Monte Carlo.

The video files are also available for offline viewing in MP4/H.264, WebM/VP8, and WebM/VP9 formats.

(Also note there are three additional video visualizations below.)

Transcript

(This is a slightly edited and link-annotated transcript of the audio in the above video. As this is spoken text, it is not as polished as my writing.)

Speaker: So this was one family of Monte Carlo methods. I have too little time remaining, but still a little bit of time, to talk about a completely different family of Monte Carlo methods, and you may have heard this abbreviation before. It is called MCMC.

MCMC stands for Markov chain Monte Carlo and it is completely different from importance sampling. The basic difference is instead of growing a configuration or weighting configurations, I always have a certain state and I manipulate that state iteratively, and if I do this long enough then I have obtained a sample that is uniformly distributed. I will get into the details in a minute. I first want to talk briefly about the history.

It was invented by Marshall Rosenbluth but it is called the Metropolis algorithm. Why is that? Well, there were five authors on the paper. The first of which was Nicholas Metropolis. And I roughly sized the pictures according to contributions to the paper.

There are two different historical accounts of how the method came about, and they agree that Edward Teller posed the mathematical problem, Marshall Rosenbluth solved it, Arianna Rosenbluth implemented it, and the other two authors did not do anything. They ordered the author names alphabetically and Nicholas Metropolis happened to be head of the group at that time. In any case, there are two interesting things about it: none of these authors used the method in their subsequent research, and yet the method is now very, very popular. Actually, Marshall Rosenbluth afterwards founded the field of plasma physics, so he went in a completely different direction. At the turn of the 21st century, Jack Dongarra and Francis Sullivan, two researchers in scientific computing, were asked to compile a list of the top 10 algorithms of the 20th century and this was one of them in their list. And quicksort was another one. So it is really an important algorithm. First, the intuition, then a little bit of formalization and how it applies to our problem.

So, here is the intuition. What the method does: it constructs a directed graph where each possible state is a node, and there are simple modifications you can make indicated by these directed arcs to transform one state into another state. For example, in our chain case, we could bend the chain at a certain node or we could maybe change the last state a little bit or something. Some simple transformations you can perform to transform one state into another state. Only a few for each state, only a few arcs that leave each state. And then it performs a random walk on this graph in such a way that if you perform the random walk long enough and then stop it, you are uniformly distributed across the whole graph, or according to some target probability distribution.

The graph is much too large to explicitly construct, right? It has exponentially many configurations in our case, but the method is still able to perform a random walk on this graph. It is called Markov chain Monte Carlo because the basic concept is a Markov chain and I want to quickly introduce that concept to you. So here is a simple graph with just three states, A, B, and C. And imagine that you are standing at State B and you follow the very simple rule of sequentially walking along that graph. You see all the edges that leave your current state. They have numbers associated with them and these numbers sum up to one. So if you stand on State B, you have a 40% probability to move to State A, a 50% probability to move to State C, and a 10% probability to stay in State B. So you just follow that rule in each time step and you arrive at a new state.

(This is the video I showed, which is not visible in the slides above. The video file is also available as MP4/H.264 file (1MB).)

And when you do that and you record how often you reach a certain state, then this is the histogram you would obtain. So it seems to converge to some values after just a few hundred steps. And in fact, the limit distribution that you obtain ultimately is given by these numbers here. And it does not depend on where you start; it is completely independent of where you start.
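As a minimal sketch of the random walk and histogram described above: only the row for state B (40% to A, 10% stay, 50% to C) comes from the talk. The rows for A and C, and the names `simulate` and `limit_distribution`, are made up for illustration, since the slide values are not in the transcript.

```python
import random

# Transition probabilities P[current][next]. Only the row for "B" is taken
# from the talk; the rows for "A" and "C" are assumed values.
P = {
    "A": {"A": 0.6, "B": 0.3, "C": 0.1},
    "B": {"A": 0.4, "B": 0.1, "C": 0.5},
    "C": {"A": 0.2, "B": 0.3, "C": 0.5},
}

def simulate(P, start, steps, rng):
    """Walk the chain for `steps` steps and return visit frequencies."""
    counts = {s: 0 for s in P}
    state = start
    for _ in range(steps):
        targets = list(P[state])
        probs = [P[state][t] for t in targets]
        state = rng.choices(targets, weights=probs)[0]
        counts[state] += 1
    return {s: c / steps for s, c in counts.items()}

def limit_distribution(P, iterations=1000):
    """Approximate the limit distribution by repeatedly applying P."""
    dist = {s: 1.0 / len(P) for s in P}
    for _ in range(iterations):
        dist = {t: sum(dist[s] * P[s][t] for s in P) for t in P}
    return dist

rng = random.Random(0)
freq_from_a = simulate(P, "A", 100_000, rng)   # start in state A
freq_from_c = simulate(P, "C", 100_000, rng)   # start in state C
limit = limit_distribution(P)
print(freq_from_a, freq_from_c, limit)
```

Both runs should produce visit frequencies close to `limit`, which illustrates the independence from the starting state for this particular (assumed) chain.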

The Metropolis algorithm solves the inverse problem. The inverse problem is the following. So here we had the rules for how to walk and we just recorded the limit distribution. What if you have a graph structure and you have some target probabilities that you want to realize? How should you put numbers at the edges to reach that limiting distribution? That was the problem that was posed by Edward Teller in essence.

Here we have that target distribution and the Metropolis algorithm is a constructive way to choose the transition probabilities. It assumes that you have a base Markov chain, so some basic random walk on the graph. So let us say, just uniformly go over all outgoing edges on that graph. And we could follow that random walk, right? It would not have the limit distribution that we are interested in, but we could follow that random walk and we would get a different limit distribution. And what the algorithm now does is, whenever we transition from one state to the other, it additionally modulates that decision and has the option to reject that step. It can only accept or reject steps proposed by the base Markov chain. The final Markov chain of the Metropolis algorithm is this chain $T$ which is the base Markov chain multiplied by its acceptance rate. And the acceptance rate is calculated according to that formula, which has a quite simple interpretation.

The acceptance probability is high when the target probability, $\pi_j$, is high but the probability of your current state is low; conversely, if the proposed state is unlikely, you are more likely to stay at the current configuration. And if the base chain pushes you toward some other state, you divide by that probability to compensate for that bias of the base chain. So the numerator and denominator have these two effects. And the remaining probability, everything that is modulated down, is the reject probability of staying in the current state.
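The formula on the slide is not reproduced in this transcript. One standard way to write the construction described above, commonly known as the Metropolis-Hastings form, is the following, where $Q$ denotes the base chain (the symbol $Q$ is an assumption made here for illustration, not necessarily the notation on the slide):

$$T_{ij} = Q_{ij}\,\min\left\{1,\ \frac{\pi_j\, Q_{ji}}{\pi_i\, Q_{ij}}\right\} \quad \textrm{for } i \neq j, \qquad T_{ii} = 1 - \sum_{j \neq i} T_{ij}.$$

The ratio $\pi_j / \pi_i$ favors moves toward states with higher target probability, the ratio $Q_{ji} / Q_{ij}$ compensates for the bias of the base chain, and $T_{ii}$ collects all rejected probability mass.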

(This is the video I showed, which is not visible in the slides above. The video file is also available as MP4/H.264 file (1MB).)

So let us do that simple calculation for our example here. For that limit distribution, we would obtain those values. And let us take a look at whether we converge to that limiting distribution. So the limiting distribution was 0.7, 0.1, 0.2; it converges more slowly than in the example before, but ultimately, we are guaranteed to converge to the limit distribution that we set up. It is not the unique way to construct such a Markov chain but it is a constructive way to do so.
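The calculation can be sketched in a few lines. The target (0.7, 0.1, 0.2) is from the talk; the base chain `Q` (move to one of the two other states with probability 0.5 each) and the function names are assumptions for illustration, so the resulting numbers need not match those on the slide.

```python
def metropolis_chain(pi, Q):
    """Build T[i][j] = Q[i][j] * min(1, pi[j] * Q[j][i] / (pi[i] * Q[i][j]))."""
    n = len(pi)
    T = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and Q[i][j] > 0:
                accept = min(1.0, (pi[j] * Q[j][i]) / (pi[i] * Q[i][j]))
                T[i][j] = Q[i][j] * accept
        # Rejected proposals keep the chain in its current state.
        T[i][i] = 1.0 - sum(T[i][j] for j in range(n) if j != i)
    return T

def step_distribution(dist, T):
    """Push a distribution over states through one step of the chain."""
    n = len(dist)
    return [sum(dist[i] * T[i][j] for i in range(n)) for j in range(n)]

pi = [0.7, 0.1, 0.2]              # target distribution from the talk
Q = [[0.0, 0.5, 0.5],             # assumed base chain: pick one of the
     [0.5, 0.0, 0.5],             # two other states uniformly
     [0.5, 0.5, 0.0]]
T = metropolis_chain(pi, Q)

dist = [0.0, 1.0, 0.0]            # start with all mass on the second state
for _ in range(500):
    dist = step_distribution(dist, T)
print(T)
print(dist)
```

Here `dist` tracks the exact distribution of the chain after each step rather than a simulated histogram, so it shows the convergence to the target without sampling noise.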

In our (self-avoiding random walk) example, we want to walk on this graph. And what we are going to do is we just allow simple transformations: we pick a random element on that chain and bend it 90 degrees in a random fashion. So that gives us the arcs on that graph and you can imagine that for a chain of a certain length, there are only so many ways, which you can enumerate, of bending the elements by 90 degrees. And if we happen to bend it in such a way that it is actually no longer self-avoiding, we can remove that and add that back to the probability of staying in a certain state. And remember we wanted to sample uniformly over all the states on the graph. So we just plug that into the acceptance rate calculation of the Metropolis algorithm and the $\pi$, because it is uniformly distributed, just cancels out of that rate. So it is a very simple calculation.

And in practice, it is even simpler to implement that. We have a certain state. We just initialize it with any state we want. We propose a random modification and we accept or reject that, and then we have a new state, and we iterate, we accept or reject that. So we keep doing that and after a certain amount of time, we just keep the number of samples that we have generated and we can compute our expectations with that sample that we have generated. They are no longer independent samples because we have always modified them a little bit but they are a set of samples that we have generated. And this is the estimate, again, same curve that we had before, this time, with the MCMC approach. And I am almost out of time but I want to take the last three slides here to show you our last method, the final method of the talk. Any questions on MCMC so far?

Attendee: You said we accept or reject. We make a bend and we accept it. How do you decide whether to accept or reject? Just by whether it crosses itself?

Speaker: If it crosses itself, it is no longer a valid state, and we immediately reject. If it does not cross itself, you compute the acceptance probability by that formula and the acceptance probability may be 0.8, and then you draw a random number between zero and one, uniformly distributed, and if that number is below the acceptance rate, then you accept. If it is above the acceptance rate that you have calculated, you reject. So if the acceptance rate is one, you always accept. If the acceptance rate is 0.5, you flip a fair coin to accept or reject.

Attendee: How would I calculate the acceptance rate in the self-avoiding random walk problem?

Speaker: That depends on this graph structure here. So every state in the graph has a certain set of allowed changes, right? For some state which is very compact, the number of allowed changes becomes smaller. For some longer chain, you can basically bend it at any element, in any direction, and it would still be a self-avoiding walk.

Attendee: We have to check them all to count how many were self-intersecting?

Speaker: Yes. In some cases you can avoid it, I mean, not in this case. In some cases, you have probability mass everywhere and no hard constraint like self-avoidance, and then it cancels out as well. But in this case, we have to enumerate all of them.
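The accept/reject loop for the chain example can be sketched as follows. This sketch uses the simpler variant mentioned in the talk, in which a bend that breaks self-avoidance counts as staying in the current state. Under that assumption every (element, direction) pair is proposed with the same probability in both directions, so with a uniform target every valid bend is accepted and no enumeration of allowed changes is needed. All function names are hypothetical, and the sketch does not address whether these moves alone can reach every configuration.

```python
import random

def rotate(dx, dy, clockwise):
    """Rotate the offset (dx, dy) by 90 degrees."""
    return (dy, -dx) if clockwise else (-dy, dx)

def propose_bend(walk, rng):
    """Bend the chain by 90 degrees at a randomly chosen interior element."""
    k = rng.randrange(1, len(walk) - 1)        # index of the pivot element
    clockwise = rng.random() < 0.5
    px, py = walk[k]
    tail = []
    for (x, y) in walk[k + 1:]:
        dx, dy = rotate(x - px, y - py, clockwise)
        tail.append((px + dx, py + dy))
    return walk[:k + 1] + tail

def is_self_avoiding(walk):
    return len(set(walk)) == len(walk)

def mcmc_step(walk, rng):
    """Accept the proposed bend if it is valid, otherwise stay put."""
    candidate = propose_bend(walk, rng)
    return candidate if is_self_avoiding(candidate) else walk

def squared_end_to_end(walk):
    (x0, y0), (x1, y1) = walk[0], walk[-1]
    return (x1 - x0) ** 2 + (y1 - y0) ** 2

rng = random.Random(1)
n_bonds = 20
walk = [(i, 0) for i in range(n_bonds + 1)]    # start from a straight chain
samples = []
for _ in range(20_000):
    walk = mcmc_step(walk, rng)
    samples.append(squared_end_to_end(walk))   # dependent samples
estimate = sum(samples) / len(samples)
print(estimate)
```

The squared end-to-end distance is only an example of a quantity whose expectation one might estimate; the samples are correlated, as noted in the talk.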

(A warning in this edit: I cannot recommend learning about simulated annealing by browsing the web, as there is a lot of misinformation around: special cases are described as simulated annealing, or the base Markov chain is not a reversible Markov chain, etc. See the references at the end of this transcript for good links if you want to learn about the method.)

The final method is called Simulated Annealing. It is a method to convert a Markov chain into an optimization method. As simple as that, a Markov chain into an optimization method. It was proposed in 1983 by Scott Kirkpatrick and co-workers and it is a very simple and often quite effective optimization method. And simple to implement. It can optimize over complex state spaces. And for that reason, it is very popular. So this Science paper that they published in 1983 has 28,000 citations (now 35,000, October 2015). And interestingly, Scott Kirkpatrick later in the '90s worked at the IBM T.J. Watson Research Center and there, at least he writes this, he invented the first pen-based tablet computer. So it is nice to see that he is innovative on quite different levels.

So how does it work? Say we have a function that we want to optimize. A very simple function here. There are only 40 possible inputs to that function. So in that case, we could simply enumerate all the 40 possible states and pick the one that is maximal, so we want to maximize that function. But imagine you have a different problem with exponentially many states so we can no longer list them and this is just for illustration. But imagine instead of 40, you would have 2 to the power of 40 or something. What we are going to do is we convert that function into a probability distribution and we do that by what is called a Gibbs distribution. So just a simple formula, where $Z$ is a normalizing constant and the formula depends on the parameter $T$, the so called temperature parameter. If the temperature parameter is very high, you divide that function value by a very large value and the argument almost does not matter. So the function value does not matter. In this case, the temperature is 100 and you see that the resulting distribution is almost uniform. Maybe hard to see but it is almost uniform because the temperature is quite high compared to the function value.

Attendee: What is $Z$?

Speaker: $Z$ is a normalizing constant. So it is just a sum over all the possible configurations. But it is a constant and it just depends on $T$. And interestingly, if we apply this Metropolis algorithm, the $Z$ constant cancels out, it is not really important. You do not even need to write it down. You can just write $P$ is proportional to $X$ ($P \propto X$) or something; the constant is just a normalizing constant. If I decrease the temperature, you see, so this is temperature 10, temperature 4, 1, and now I decrease it even further, 0.1, the distribution puts more and more mass on the function values that are higher. And basically, what simulated annealing does is run a Markov chain, a Metropolis chain, for example. But, while it runs it, it modifies it by decreasing the temperature.
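The formula on the slide is not in the transcript. A common form of the Gibbs distribution for a maximization problem is $p_T(x) = \frac{1}{Z(T)} \exp(f(x)/T)$, and the sketch below assumes exactly that form. The 40-state function `f` is a made-up stand-in for the function shown in the talk.

```python
import math

def gibbs_distribution(values, temperature):
    """Return p(x) proportional to exp(f(x) / T) over a finite state space."""
    scaled = [v / temperature for v in values]
    shift = max(scaled)                    # subtracted for numerical stability
    unnormalized = [math.exp(s - shift) for s in scaled]
    Z = sum(unnormalized)                  # normalizer (of the shifted terms)
    return [u / Z for u in unnormalized]

# Hypothetical objective with 40 possible inputs.
f = [math.sin(x / 3.0) + 0.5 * math.sin(x / 1.3) for x in range(40)]

for T in [100, 10, 4, 1, 0.1]:
    p = gibbs_distribution(f, T)
    print(f"T={T:>5}: largest probability = {max(p):.3f}")
```

At temperature 100 the printed value should be close to 1/40, the uniform value, and it grows as the temperature decreases and the mass concentrates on the maximizer.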

So it tries to shift all the probability mass, and our current state as well, to the states that have a high function value. And how does it do it? Well, it chooses a schedule, a temperature schedule. So on the x-axis here I have the steps that I take with the Markov chain and on the y-axis I have the temperature that I use, and I just decrease the temperature here with a geometric schedule, so I just multiply the temperature on each step by 0.99 or something.

For very high temperatures, the Markov chain is basically just a purely random walk; it does not even look at the function value. For very low temperatures, it basically is a local search algorithm; it only accepts improvements in the function value. But, for intermediate temperature values, it is something in between, so it tries to optimize but it can still escape local optima.
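A minimal sketch of simulated annealing with a geometric schedule, assuming the Gibbs form $p_T(x) \propto \exp(f(x)/T)$ and a symmetric base chain that moves one position left or right on a ring of states. The objective `f`, the ring neighborhood, the starting temperature, and the lower bound `t_min` are illustrative assumptions, not details from the talk.

```python
import math
import random

def f(x):
    """Hypothetical objective to maximize over the states 0..39."""
    return math.sin(x / 3.0) + 0.5 * math.sin(x / 1.3)

def simulated_annealing(f, n_states, steps, t_start, cooling, rng, t_min=1e-9):
    state = rng.randrange(n_states)
    best = state
    temperature = t_start
    for _ in range(steps):
        # Base chain: step left or right on a ring (a symmetric proposal).
        candidate = (state + rng.choice([-1, 1])) % n_states
        delta = f(candidate) - f(state)
        # Metropolis rule: always accept improvements, otherwise accept
        # with probability exp(delta / T), which shrinks as T decreases.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            state = candidate
        if f(state) > f(best):
            best = state
        # Geometric schedule: multiply the temperature by a constant factor.
        temperature = max(temperature * cooling, t_min)
    return state, best

rng = random.Random(0)
final_state, best_state = simulated_annealing(
    f, n_states=40, steps=2000, t_start=100.0, cooling=0.99, rng=rng)
print(final_state, best_state, f(best_state))
```

Keeping track of the best state seen so far is a common practical addition; it does not change the chain itself.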

So that is the intuition. There's some theory to it actually, in another famous paper by Stuart and Don Geman, a 1984 paper which is actually famous for a different reason, because they proposed another famous Monte Carlo method, the Gibbs sampler, in that paper. But in that very paper they also have some theory of simulated annealing and they prove that if you decrease the temperature slowly enough, the probability of obtaining the optimal state is one. But that optimal schedule is too slow in practice so you cannot use it and that is why we are still stuck with the geometric schedule.

Last minute, let us do simulated annealing and I go back to the more complicated model where we actually have these two types of elements: the black ones that attract each other and the white ones which are neutral. And you see here that whenever two black elements are close to each other on this 2-D grid, I plotted a red line, and I am going to try to maximize the number of red lines, so I try to get as many red connections as possible.

(This is the video I showed, which is not visible in the slides above. The video file is also available as MP4/H.264 file (5MB).)

And that is really a model for protein folding, folding in such a way that there are many black-to-black noncovalent bonds. So here is an animation of performing simulated annealing exactly with that proposal that I had, bending it at 90 degrees left or right at a random position, and I do 100,000 steps and I show you every 100th step. At high temperatures, and this is still quite a high temperature, you see it is very stretched out, there are not very many compact structures, but as the optimization proceeds the temperature decreases and it favors more and more compact configurations. So I think already at a step of, like now, you would see already quite some compact structures appearing.

So this is a purely random walk. I think it goes until 1,000. I can skip it in the interest of time but, I can show you the result. This is the configuration that we have obtained with 100,000 steps in our Markov chain in simulated annealing and for that model problem there is a paper that analyzes different model problems and the optimal configuration is known, the ground state is known and the ground state is slightly better. It has 23 connections. We only have 21 but actually, with such a simple method we have obtained a quite good solution and that is really the essence of why simulated annealing is popular.

We can often get quite far with comparatively little effort both in implementation and runtime, although there may be a better method in a specific domain that is optimized for it. And let us reflect a little bit on what we did. We have solved a rather complicated problem like folding this 2D protein with a very, very simple method, with just a random walk that accepts or rejects simple modifications.

And that is basically it. Before I have my last slide, I just want to say a little bit about the literature if you are interested in the Monte Carlo Methods I can highly recommend the first book and if you are interested in the black and white pictures I showed of the people that are relevant to the invention of the Monte Carlo method I can recommend the last book which is the autobiography of Stanislaw Ulam, it is very interesting.

So, thank you very much for your attention.

References

Here are some of the introductory book references mentioned in the talk.

The historical context and anecdotes are mostly from the autobiography of Stan Ulam, Adventures of a Mathematician. The book is accessible to anyone with a basic high school math background. See also this kind 1978 review of the book.

A great, now somewhat dated introduction to Monte Carlo methods is Jun Liu's Monte Carlo Strategies in Scientific Computing. I learned Monte Carlo through this book and it has a spot in my bookshelf that is at arm-length from my desk.

The Liu book is somewhat dated and it covers a lot of ground; a slightly more formal but up-to-date Monte Carlo book is Art Owen's upcoming book Monte Carlo theory, methods and examples, which is excellent.

A highly accessible and very well written introduction to Markov chains and simple Monte Carlo methods is Olle Häggström's Finite Markov Chains and Algorithmic Applications. I recommend it if you want a most gentle introduction to the theory behind MCMC. Still the most authoritative reference on MCMC is the Handbook of Markov Chain Monte Carlo. In particular, the first twelve chapters are on general methodology and contain a wealth of information not found in other textbooks.

History of Monte Carlo Methods - Part 2

2015-10-30T20:00:00+00:00

This is the second part of a three part post. The last part covered the early history of Monte Carlo and the rejection sampling method.

Part 2

In this part we are going to look at importance sampling and sequential Monte Carlo.

The video files are also available for offline viewing in MP4/H.264, WebM/VP8, and WebM/VP9 formats.

Transcript

(This is a slightly edited and link-annotated transcript of the audio in the above video. As this is spoken text, it is not as polished as my writing.)

Speaker: Okay, so rejection sampling works only for short chain lengths. And then, in light of this finding, the next method, sequential importance sampling, was introduced independently by two groups: by John Michael Hammersley, who actually did his PhD here in Cambridge and then moved to Oxford to become professor, and by Marshall Rosenbluth and his wife Arianna Rosenbluth. They did it independently. They called it by different names. I think Hammersley called it inversely restricted sampling and the Rosenbluths called it biased sampling and both these names did not really stick. So nowadays it is called sequential importance sampling. In different communities, it is also called the growth method or the Rosenbluth method.

How does this method work? It is based on the idea that was suggested by the audience just before (in the first part). Remember when we are growing these chains step by step and we make the step and we would have to reject the sample? Well, we could just prevent making that step, all right. We could just say, "Look, there are two possibilities but you would not reject that sample. So why not just take one of these?"

Of course the method is not fail-safe. So in such a situation, no matter what we do, if we keep growing we will still run into trouble. Right, so the method is still myopic. We still only make one inference at a time. But the real problem with this method is that we no longer sample uniformly from the set that interests us. In fact, we favor more compact configurations, and what Hammersley, and Morton, and the Rosenbluths realized is a method to systematically compensate for that bias.

So let me first talk about how this is done in general and then how it is done in our specific example. So in general, we have this expectation expression that we want to approximate. And what they said is, well, we assign one weight to each sample, and if the weights were all one that would be the original expectation. But we assign a weight to the sample and we choose the weight in such a way that we compensate for that bias, that we favor some configurations. So whenever we favor some configurations, we down-weight them, and whenever some configuration is rare but we generate it, we up-weight it.

In practice, we would generate a few samples and every sample would have a weight. In this particular instance how it works is, well, we just count the number of possibilities we have in each step. So we basically decompose the sampling distribution. So in the first step we have four possibilities, four free points adjacent to it, and in the next one we have only three available. And so we just unroll basically our decisions. Here we only have two available, two choices available, all right? So because we grow the chain sequentially, the probability of generating that particular configuration also decomposes sequentially. So the final chain would have this probability of being generated. And what they simply do is say, "Okay, this is the distribution by which we generate the sample, but we want to generate it uniformly at random so we also decompose the weight and just build the weight as the inverse". So when we weight these samples by these weights we will systematically de-bias the sampling distribution, so we will get unbiased estimates of the expectation that we are interested in.
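The weight construction above can be sketched as follows: at each growth step the number of free neighboring sites is counted, one of them is chosen uniformly, and the weight is multiplied by that count (the inverse of the step probability). Treating a dead end as a sample with weight zero and using the squared end-to-end distance as the quantity of interest are assumptions of this sketch, as are all function names.

```python
import random

MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def grow_chain(n_bonds, rng):
    """Grow one self-avoiding chain on the 2-D grid; return (chain, weight)."""
    chain = [(0, 0)]
    occupied = {(0, 0)}
    weight = 1.0
    for _ in range(n_bonds):
        x, y = chain[-1]
        free = [(x + dx, y + dy) for dx, dy in MOVES
                if (x + dx, y + dy) not in occupied]
        if not free:
            return chain, 0.0          # dead end: contributes weight zero
        weight *= len(free)            # inverse of the 1/len(free) step probability
        site = rng.choice(free)
        chain.append(site)
        occupied.add(site)
    return chain, weight

def squared_end_to_end(chain):
    (x0, y0), (x1, y1) = chain[0], chain[-1]
    return (x1 - x0) ** 2 + (y1 - y0) ** 2

def weighted_mean(values, weights):
    """Self-normalized weighted average."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

rng = random.Random(0)
samples = [grow_chain(30, rng) for _ in range(5000)]
values = [squared_end_to_end(chain) for chain, _ in samples]
weights = [w for _, w in samples]
print(weighted_mean(values, weights))
```

For very short chains the weights are all equal (for example 4 * 3 * 3 = 36 for three bonds); they only start to differ once the chain can block its own neighboring sites.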

Let us take a look at where we were with the rejection sampler. So this is the limit of what we could do with a rejection sampler. Now, with the growth method, we can go to significantly longer chain lengths, to a chain length of 60, and then again the uncertainty estimates, the confidence intervals, blow up. So why is this? I mean, why do the uncertainty intervals blow up in this improved method? Well, the thing is, the weights that we compute become very unbalanced. And even though we generate maybe a few thousand samples, only a few of them will have a significant share of the weight.

So here is a visualization of that. So here, I grow 50 chains in parallel, one step at a time, and I show you in each step the normalized weights. So in the beginning, everything is uniform because in the beginning, everything has an equal number of possibilities. But over time, as I grow more and more, as I append more elements, the weights become very unbalanced so that after 100 steps, actually only five elements have weights significantly different from zero. And this means we actually do not have 50 samples in this case, we only have five, and our estimates become very poor. And this only amplifies when you have a few thousand. One way to measure the quality of the samples we have generated is to ask, "Okay, I have generated, say, 5,000 samples with weights. How much are these worth in terms of computing expectations, in terms of unweighted samples?"

Because unweighted samples are optimal there is a quality measure that you can compute that is an estimated quantity and it is called effective sample size, which exactly measures the worthiness of a weighted sample set. And for this plot I have shown you now with 5,000 samples, you see that it drops and drops and drops until it is almost close to one. So that is a real problem.
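The talk does not spell out the formula. A widely used estimate of the effective sample size is $(\sum_i w_i)^2 / \sum_i w_i^2$, and the sketch below assumes that definition.

```python
def effective_sample_size(weights):
    """Estimate how many unweighted samples a weighted sample set is worth.

    Assumes the common estimate (sum of weights)^2 / (sum of squared weights)
    and at least one strictly positive weight.
    """
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return total * total / total_sq

uniform = [1.0] * 5000                    # perfectly balanced weights
unbalanced = [1.0] + [1e-6] * 4999        # one sample dominates
print(effective_sample_size(uniform))
print(effective_sample_size(unbalanced))
```

With balanced weights the estimate equals the number of samples; when a single weight dominates it approaches one, matching the behavior described above.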

You guessed that the next step would be another improved Monte Carlo method and indeed it improves on that, and it is generally known as the Sequential Monte Carlo method. The idea is quite simple and natural and it has been reinvented many times in different communities. So the original paper is from 1959 but it has been reinvented in the signal processing community as the particle filter, and it has been reused and pioneered in computer vision by our Andrew Blake for tracking objects and their contours. And it is used across many different communities, often under very different names. But generally, Sequential Monte Carlo is the preferred name. And the basic idea here is quite simple. The problem is unbalanced weights. We have to prevent getting unbalanced weights in each step. So how we are going to do this is by introducing a process which removes samples that have low weight and duplicates samples that have high weight. That is called re-sampling.

So say, in one certain time step, we grow all the chains in parallel, we grow say 50 chains in parallel. In this example, it is only six chains in parallel. And we have observed that the weights are unbalanced. Then we remove some of the low weight instances and we duplicate some of the high weight instances as shown here. The algorithm that corresponds to that is the same as before, just weighted sequential importance sampling, but we grow all the samples in parallel and monitor the weights. And if the weights are in trouble, if the weights become unbalanced, we enforce balanced weights again, by removing low weight samples and duplicating others.

Attendee: Question.

Speaker: Yes.

Attendee: In your little white chain example, the weights are going to be high when at every step you can take three choices, right? Like three times three times three. And that is going to get bigger. The long ones do not go near each other. So this is going to bias in favor of things that just sort of go off into the distance, all of them curly wurly things, is that right?

Speaker: Right, exactly, because the sampling distribution biases towards compact configurations the weights have to undo that bias by favoring the configurations that walk off and are no longer compact.

Attendee: But isn't that bad because then essentially we'll be dominated by all the ones that go off into the distance and we won't get any curly wurly ones.

Speaker: It would be bad but remember that the sampling distribution that I had above the weight, the sampling distribution exactly has that opposing bias. So we generate from that distribution and we want the weights to compensate for that bias. So the extra samples that we get are more compact than they should be. And that is why we have to downweight these to get a low weight, right? And we have to upweight these samples that are not compact but that go out into basically long chains.

Attendee: I did not get that, sorry. But never mind. I believe you.

Speaker: Okay. So if you do that re-sampling operation to compensate for the unbalanced weight effect, and I plot again the effective sample size, then whenever we see that the effective sample size, in this case, drops below 2,500, I perform this re-sampling operation and reset the weights to uniform and I enforce this effective sample size to become 5,000 again. You see that basically, I can control how unbalanced the weights become. Here is another visualization in terms of the same plot that I had before, and now the red arrows indicate whenever I reset the weights to uniform. I perform this re-sampling operation so I can always ensure that my weights are close to the uniform distribution.
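A minimal sketch of the re-sampling step, assuming multinomial re-sampling (drawing with replacement in proportion to the weights) and the effective sample size estimate $(\sum_i w_i)^2 / \sum_i w_i^2$. The talk does not say which re-sampling scheme was used, so both choices, as well as the function names, are assumptions.

```python
import copy
import random

def effective_sample_size(weights):
    total = sum(weights)
    return total * total / sum(w * w for w in weights)

def resample_if_needed(particles, weights, threshold, rng):
    """Re-sample when the effective sample size drops below `threshold`.

    Returns (particles, weights). After re-sampling, low-weight particles
    tend to disappear, high-weight particles tend to be duplicated, and
    all weights are reset to uniform.
    """
    n = len(particles)
    if effective_sample_size(weights) >= threshold:
        return particles, weights
    chosen = rng.choices(particles, weights=weights, k=n)
    # Copy so that duplicated particles can be grown independently later.
    return [copy.deepcopy(p) for p in chosen], [1.0 / n] * n

rng = random.Random(0)
particles = ["a", "b", "c", "d"]
weights = [0.97, 0.01, 0.01, 0.01]
new_particles, new_weights = resample_if_needed(particles, weights, 2, rng)
print(new_particles, new_weights)
```

In the setting of the talk, `particles` would be the 5,000 partially grown chains and `threshold` would be 2,500; the function would be called after each growth step.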

So let us compare that again. This was a plot without re-sampling. You will see that the uncertainty estimates indicate that our estimates are very unreliable. And this is basically with re-sampling. The whole family of Sequential Monte Carlo approaches are really state of the art methods. This scales almost without limit, so people have used it to generate chains with over a million bonds. It is state of the art for any kind of probabilistic model where you can sequentially decompose the model. For example, for time series models, hidden Markov models, state space models, dynamic Bayesian networks, all these kinds of models, these methods are applicable and highly efficient.

History of Monte Carlo Methods - Part 1

2015-10-16T21:30:00+01:00

Some time ago in June 2013 I gave a lab tutorial on Monte Carlo methods at Microsoft Research. These tutorials are seminar-talk length (45 minutes) but are supposed to be light, accessible to a general computer science audience, and fun.

In this tutorial I explain and illustrate a number of Monte Carlo methods (rejection sampling, importance sampling, sequential Monte Carlo, Markov chain Monte Carlo, and simulated annealing) on the same problem. Although I am not exactly a comedian, in order to keep the tutorial fun I peppered the talk with lots of historical anecdotes from the inventors of the methods.

This is the first of three parts.

Part 1

The first part (17 minutes) covers the history of modern Monte Carlo methods, their use in scientific computation, and one of the most basic Monte Carlo methods, rejection sampling.

The video files are also available for offline viewing in MP4/H.264, WebM/VP8, and WebM/VP9 formats.

Transcript

(This is a slightly edited and link-annotated transcript of the audio in the above video. As this is spoken text, it is not as polished as my writing.)

Speaker: Thank you all for coming to this lab tutorial. I know many of you have used Monte Carlo techniques in your research or in your projects. And still I decide to keep the level of this tutorial very basic and I try to show you a few different Monte Carlo methods and how they may be useful in your research. I hope that after the talk you basically understand how these methods can be applied and what different limitations the different methods have. And I will introduce these different methods in chronological order and also say a little about the interesting history, how these methods have been invented.

But before I get to that, I first want to ask you, do you like to play solitaire? I certainly do sometimes play solitaire and when you play a couple of games, you realize that some games are actually not solvable. So some games are just, no matter what you try, no matter what you do, they are just provably not solvable. And so if you shuffle a random deck of 52 cards and put it out as a solitaire deck, it's a valid question to ask: with what probability do you get a solvable game? That's the question. And it's precisely this question, precisely this question for the game of Canfield Solitaire, that has led to the invention of the modern Monte Carlo methods.

One way to attack this problem would be instead of trying analytic or mathematical approaches, basically having to take into account all the rules of the game, is to just take a random set of cards, play a hundred times after randomly shuffling the cards and just looking at how many times you come up with a solvable game. And that would give you a ballpark estimate on the probability. And that's precisely what this man, Stanislaw Ulam, has recognized, that this is possible.

I want to say a few words about Stanislaw Ulam because he's so crucial to the invention of Monte Carlo methods. So he was born in today's Ukraine in a town called Lviv (formerly Lemberg, in Austria-Hungary). And he enjoyed a very good education. His family had a good background. And he discovered very early in his life that he likes to do mathematics. He was part of the Lviv School of Mathematics, which made many contributions to the more abstract mathematics, vector spaces. So he's known for some of the mathematical results. But then he had to flee to the United States in the 1930s and there became a professor of mathematics and was recruited to Los Alamos to do research on the Hydrogen bomb. Not the first nuclear weapon but on the second Hydrogen bomb design.

During that time in 1946, working at Los Alamos, he had a breakdown. For a couple of days he had a headache and he had a breakdown and was delivered to a hospital. The doctors performed an emergency surgery, removed part of his skull, because it turned out he had a brain infection, the brain had swollen and he would have died if the doctors hadn't performed this operation. And the doctors told him "You have to recover, you have to stay at home for half a year and don't do any mathematics."

He was obsessed with Mathematics for his whole life. So instead of doing mathematics, he tried to pass the time playing Canfield Solitaire. And while playing Canfield Solitaire he asked the question, "Okay, what's the probability to solve this game?" and with his quite broad knowledge of Mathematics he tried a few different analytic attempts to come up with the answer to that question. But ultimately he realized that it is much easier to get an estimate by just playing games randomly.

And at that time he was already doing research in Los Alamos. He recognized that this has applications as well for studying different scientific problems such as Neutron Transport, which is essential to understand when designing nuclear weapons. So he is also the inventor of the first working Hydrogen bomb design together with Edward Teller. And the inventor of the Monte Carlo method, published a few years later in 1949. And also he's known for having performed probably the most laborious manual computation ever undertaken (with Cornelius Everett) to disprove Edward Teller's earlier nuclear weapon design, to show that it is not possible. So very interesting history. I will talk a little bit later about him some more.

So nowadays, Monte Carlo methods, and by Monte Carlo methods I really mean any method where you perform some random experiment, which is typically quite simple, and you aggregate these results into some inferences about a more complex system. Today, Monte Carlo methods are very popular in simulating complex systems. For example, models of physical or biological or chemical processes, for example, weather forecasting, and of course, nuclear weapon design. But also just last week, it was used to simulate the HIV capsid. A simulation of 64 million atoms, a major breakthrough in understanding HIV. So it has huge applications in scientific simulations, it also has applications in doing inference in probabilistic models. The most famous system there would be the BUGS system also developed here in Cambridge at the University, initially developed in the early '90s. Infer.NET also supports Monte Carlo inference and here at MSRC also the Filzbach system does. Also there's a quite popular system now, from Columbia University, called STAN. It's actually named STAN because of Stanislaw Ulam.

Monte Carlo methods can also be used for optimization. So not just for simulating but also for optimizing a cost function. We will see an example later, but typically it is often used where very complicated systems are optimized. So something like the circuit layout that has many interdependent constraints. And it is also used for planning, for games, and for robotics, where it is essential to approximate intractable quantities, to perform planning under uncertainty or where measurement noise makes it essential to represent uncertainty in a representative way. So these are many, many different applications, too many to really list. I want to pick out one application for the rest of the talk and illustrate this application with different Monte Carlo methods.

And that application is protein folding. So protein folding happens right now in your body, in every cell of your body. In every cell you have a structure called the Ribosome and that's basically the factory in your cell. It transforms information, encoded in the DNA into one linear long structure, the protein. And that structure is such a long chain that folds itself into very intricate three dimensional structures. Very beautiful structures arise, and it is really the three-dimensional shape that this long chain folds into that determines the functional properties. It is really essential to understand in order to make predictions about what these molecules do. This can take anywhere between a few milliseconds and a few hours. And I think the state of the art on a modern machine is to be able to simulate accurately something like 60 nanoseconds per computer day. So we are nowhere in reach of being able to accurately fold these structures. But there is the Stanford Folding@home project which uses Monte Carlo methods. And I think right now, they have something like a hundred fifty thousand computers working right now on the problem of protein folding. So it is quite essential to understand a couple of different diseases.

We are not going to solve protein folding in this talk but I am going to use a slightly more simplified model. One thing to simplify is you still have a chain. But you say, "Okay, first the chain does not live in three dimensions, it only lives on the plane." And we do not have many different amino acids, we only have two: the black ones and the white ones. And the white ones like water and the black ones repel water and so they fold into something that has a black core and a white surrounding. In fact, I am going to make it even simpler. I say, "Okay, it lives on the plane but it lives on the grid". So it is a further simplification. And now for the next few slides, I even simplify this one step further: we only have the white bonds.

So that is a so-called 2D lattice self-avoiding random walk model. So you have a certain length. Say 48 bonds, 48 elements, and you have a self-avoiding walk, so this walk is not allowed to cross onto itself. And this is a very simplified model but already some questions which are interesting become very hard or actually intractable. For example, if I fix a number of elements in this walk, one question is, how many self-avoiding walks are there on the plane? Another question is, okay the number is finite, there are many but finitely many possible combinations, how do I uniformly generate such a random walk? And the third question would be, okay, I am interested in some average properties, for example, the average end-point distance between the two ends, how do I compute an approximation to that average quantity?

These are really typical problems that can be addressed with Monte Carlo methods: average quantities, counting problems, and random sampling problems. So that's what's going to be with us in the next few slides.

The first method is a very simple one. It's called rejection sampling and the idea is really very simple to explain. We have this complicated set, the set of all self-avoiding random walks of a certain length, and we want to generate one element uniformly at random from the set. This is hard. So what we do is we instead consider a superset, the set of all random walks of a certain length, and these are allowed to cross onto themselves. And it is very easy to simulate from that set. So we just simulate from this orange set, from this larger set, and whenever we end up outside the blue set we discard that sample. And whenever we are inside the blue set, then we keep the sample. And because we uniformly generate samples, we can just keep doing this and collect whenever we reach an element in the blue set.

In practice this would work as follows: we start and we just keep appending in a randomly chosen direction, one out of three say, and if we happen to cross on ourselves we can already discard that sample and start over. And we would keep all the samples that we can grow to the full length we want and we keep them and we maybe collect a thousand of them. And compute whatever property we want from that sample set.
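The procedure just described can be sketched in a few lines of Python. This is only a minimal sketch under my own assumptions: the first step is fixed to the right, each later step goes straight, left, or right with equal probability (never backwards), and the names `try_self_avoiding_walk` and `sample_walks` are hypothetical, not from the talk.

```python
import random

# Unit steps on the square lattice: right, up, left, down.
STEPS = [(1, 0), (0, 1), (-1, 0), (0, -1)]


def try_self_avoiding_walk(n_steps, rng):
    """Grow one walk of n_steps >= 1 bonds; return its sites, or None if it crosses itself."""
    path = [(0, 0), (1, 0)]        # assumption: the first step always goes right
    visited = set(path)
    direction = 0                  # index into STEPS of the most recent step
    for _ in range(n_steps - 1):
        # One of three choices: straight (+0), left turn (+1), right turn (+3).
        # A reversal (+2) is never proposed.
        direction = (direction + rng.choice((0, 1, 3))) % 4
        dx, dy = STEPS[direction]
        site = (path[-1][0] + dx, path[-1][1] + dy)
        if site in visited:
            return None            # crossed itself: discard early and start over
        visited.add(site)
        path.append(site)
    return path


def sample_walks(n_steps, n_attempts, seed=0):
    """Run n_attempts independent attempts and keep only the accepted walks."""
    rng = random.Random(seed)
    attempts = (try_self_avoiding_walk(n_steps, rng) for _ in range(n_attempts))
    return [walk for walk in attempts if walk is not None]


if __name__ == "__main__":
    walks = sample_walks(n_steps=10, n_attempts=2000, seed=1)
    print(len(walks), "of 2000 attempts accepted")
```

Because every non-reversing walk of a given length is proposed with the same probability, the accepted walks are uniformly distributed over the self-avoiding ones, and discarding early does not change that.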

Attendee 1: May I ask a question?

Speaker: Yes, sure.

Attendee 1: What happens when instead you say, "Oh Dear, I shouldn't have gone down, I should have gone in a different direction." Do you just get a biased sample or something?

Speaker: You are anticipating the future. We are still in 1949.

Attendee 1: But I thought this was the-- you said, right to begin with, generating the ones that don't cross themselves is hard.

Speaker: Yes and it would still be hard-- just bear with me for a few slides, this is actually where it's leading to.

Attendee 1: Okay.

Speaker: But this is a simple method. And we can, once we have generated the set of samples, we can compute average properties. For example, this squared extension there where you compute the distance in the plane between the two end points, that is a model problem that people considered. And more generally, what we would like to do is to compute expectations. So we have a distribution $\pi(X)$ over some state $X$ and we would like to evaluate some quantity $\phi(X)$. For example, this distance between the two endpoints of a certain state and we want to compute some sum and the sum contains exponentially many terms in this case. We want to compute the sum as an expectation, an average quantity. And the Monte Carlo idea is to simply replace it, approximate that huge sum with exponentially many terms with something that has only say a thousand or 10,000 terms which is the samples we generated.

When we do that, when we actually do this rejection sampling here as a function of the chain length. I do that and I generate here 10 million samples, 10 million times I try and I keep all the samples that happened to be self-avoiding. Then I can plot this average distance and because it is an average of many terms I know that the central limit theorem applies so I can also plot confidence intervals. So I not only get the inferences that I am interested in now, I also get an estimate, a confidence interval that captures with a certain probability the true value.
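Written out, the approximation described here reads as follows, where $N$ (the number of accepted samples) and $x_1,\dots,x_N$ (the samples themselves) are symbols I introduce for illustration:

$$\mathbb{E}_{\pi}[\phi(X)] = \sum_{x} \pi(x)\,\phi(x) \;\approx\; \frac{1}{N} \sum_{i=1}^{N} \phi(x_i), \qquad x_i \sim \pi.$$

The sum on the left runs over all states, of which there are exponentially many, while the sum on the right has only as many terms as samples were generated.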
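As a rough sketch of how such an average and its confidence interval could be computed, the following Python code estimates the mean squared end-to-end distance by rejection sampling. The 1.96 factor (a 95 percent normal interval), the move set (first step right, then straight/left/right), and the function names are my assumptions, not details from the talk.

```python
import math
import random

# Unit steps on the square lattice: right, up, left, down.
STEPS = [(1, 0), (0, 1), (-1, 0), (0, -1)]


def try_walk_endpoint(n_steps, rng):
    """Grow one walk; return its end point, or None if it crosses itself."""
    x, y, direction = 1, 0, 0          # assumption: the first step goes right
    visited = {(0, 0), (1, 0)}
    for _ in range(n_steps - 1):
        direction = (direction + rng.choice((0, 1, 3))) % 4   # no reversals
        x, y = x + STEPS[direction][0], y + STEPS[direction][1]
        if (x, y) in visited:
            return None
        visited.add((x, y))
    return x, y


def squared_extension_estimate(n_steps, n_attempts, seed=0):
    """Return (mean, half_width, n_accepted) for the squared end-to-end distance."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_attempts):
        end = try_walk_endpoint(n_steps, rng)
        if end is not None:
            values.append(end[0] ** 2 + end[1] ** 2)
    m = len(values)                    # needs at least two accepted walks
    mean = sum(values) / m
    variance = sum((v - mean) ** 2 for v in values) / (m - 1)
    half_width = 1.96 * math.sqrt(variance / m)   # central limit theorem interval
    return mean, half_width, m


if __name__ == "__main__":
    for n in (5, 10, 20):
        mean, hw, m = squared_extension_estimate(n, 20000, seed=0)
        print(f"length {n}: {mean:.2f} +/- {hw:.2f} from {m} accepted walks")
```

The half-width shrinks like one over the square root of the number of accepted walks, so the interval widens as fewer attempts survive at longer chain lengths.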

Okay and it works until a chain length of thirty so already quite large chain lengths. Then the confidence intervals become larger because I get less and less samples accepted. I use 10 million attempts here but actually similar methods are very useful even for a few hundred attempts. This is a picture around that time in Los Alamos where they performed the simulation manually by drawing with this drawing device, basically, on a sheet of paper, and whenever they cross from one type of material to another type of material, they would change the wheels and roll a new random number and then move it and turn it in random directions and they do it a few hundred times and get a global picture on how the neutrons are scattered in this matter. Because everything was named MANIAC, ENIAC, etc., and this idea was from Enrico Fermi they called this device a FERMIAC.

But anyway, another thing we could do is solve the counting problem. So we can estimate the acceptance rate. We have the number of attempts that we made and the number of attempts that were accepted that happened to be self-avoiding. It gives us the acceptance rate and we can estimate the number of self-avoiding walks simply as a product of this acceptance rate with a total number of 2D walks that are not necessarily self-avoiding. That is easy to calculate as well because the first step was to the right and we had three possibilities in each step, so we could just have a formula for that one and this gives an estimate of the number of self-avoiding walks. So here is a plot of that, and I found this paper from 2005 where people have exhaustively computed that with clever enumeration methods up to a length of 59, but beyond that the exact number is unknown. But it happens to agree very well with these known ground truths.

Attendee 1: Quick question. Is that even with your early rejection business?
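The counting estimate can be sketched as follows in Python. I assume here that the total number of proposal walks is $4 \cdot 3^{n-1}$ (three non-reversing choices after the first step, times four for the symmetric choice of the first direction); that factor of four is my own convention, chosen so the estimate can be compared against a brute-force enumeration for short walks. All function names are hypothetical.

```python
import random

# Unit steps on the square lattice: right, up, left, down.
STEPS = [(1, 0), (0, 1), (-1, 0), (0, -1)]


def attempt_is_self_avoiding(n_steps, rng):
    """Grow one non-reversing walk and report whether it stayed self-avoiding."""
    x, y, direction = 1, 0, 0          # first step fixed to the right
    visited = {(0, 0), (1, 0)}
    for _ in range(n_steps - 1):
        direction = (direction + rng.choice((0, 1, 3))) % 4
        x, y = x + STEPS[direction][0], y + STEPS[direction][1]
        if (x, y) in visited:
            return False
        visited.add((x, y))
    return True


def estimate_saw_count(n_steps, n_attempts, seed=0):
    """Estimate the number of self-avoiding walks as acceptance rate times proposals."""
    rng = random.Random(seed)
    accepted = sum(attempt_is_self_avoiding(n_steps, rng) for _ in range(n_attempts))
    acceptance_rate = accepted / n_attempts
    total_walks = 4 * 3 ** (n_steps - 1)   # assumed counting convention, see lead-in
    return acceptance_rate * total_walks, acceptance_rate


def count_saws_exact(n_steps, site=(0, 0), visited=None):
    """Brute-force enumeration, only feasible for short walks."""
    if visited is None:
        visited = {site}
    if n_steps == 0:
        return 1
    total = 0
    for dx, dy in STEPS:
        nxt = (site[0] + dx, site[1] + dy)
        if nxt not in visited:
            visited.add(nxt)
            total += count_saws_exact(n_steps - 1, nxt, visited)
            visited.remove(nxt)
    return total


if __name__ == "__main__":
    for n in (4, 8, 12):
        estimate, rate = estimate_saw_count(n, 20000, seed=0)
        print(n, round(estimate), count_saws_exact(n), f"acceptance rate {rate:.3f}")
```

The same run also shows the acceptance rate falling as the walk gets longer, which is what limits this simple estimator.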

Speaker: Yes, that's the one thing.

Attendee 1: Okay.

Speaker: It's exactly with the rejection sampler here. So the acceptance rate is from the rejection sampler here. $P$ is estimated from the rejection sampler.

Attendee 1: What is the acceptance rate when you get to 30?

Speaker: Again, next slide here.

Speaker: I am impressed. One second. Let us first enjoy what we have achieved, let us take a look at Monte Carlo, enjoy some sunshine. So the name Monte Carlo, I mean, what first comes to mind is all the casinos, right? And the gambling and that is indeed one of the origins of the name. But the particular reason is this: the person who suggested the name was Nicholas Metropolis, a colleague of Stanislaw Ulam, who was very much amused by the stories Stan was telling about his uncle, Michael Ulam, who was a wealthy businessman in his hometown of Lviv. He then switched to the finance industry and spent the rest of his life gambling away his fortune in Monte Carlo, and Nicholas Metropolis found this so amusing that he insisted on the method being called the Monte Carlo method. So that is the real reason why it is called the Monte Carlo method.

And it is not all sunny and that is where we come to this slide, which is the acceptance rate as a function of the chain length. And you see the problem of the simple rejection sampler for long enough chains. I mean, intuitively you can understand when you grow the chain very long, the probability to cross onto yourself when you walk randomly becomes higher and higher. The acceptance rate is very, very small. So I think for a million samples I had only like 15 walks accepted at a length of 30. And that's why the confidence intervals have been blowing up because the estimates become unreliable.

The next part is available.

The Julia language for Scientific Computing

2015-10-02T22:30:00+01:00

Julia is a relatively new programming language with the declared goal to become the leading language for scientific computing.

I have probably annoyed half of my colleagues by raving about how great the language is and what it is good at. Before we get to this, and in my defense, let me provide some context. I have been developing using C and C++ for 20 years now, and have been using Matlab and Python for over ten years now. These are great languages and I can be productive using each; in fact I continue to use them regularly.

Also, I tend to be quite conservative in terms of adopting new languages or development tools: while learning a new language and environment is fun it also takes a lot of effort and most languages/tools/libraries tend to come and go rather quickly and every developer carries with him a graveyard of tools and languages long gone.

Because of this short-lived nature of software, when someone approaches me with a new language or tool I am skeptical by default, and my litmus test question is usually how confident they are that this tool will still be around in five years' time. This is of course unfair, but I prefer to invest my time in learning things that have long term value. Which brings me to the point that I firmly believe Julia is here to stay and in fact may even become a popular language in scientific computing.

Enough rambling, let's get to the good parts.

I have been using Julia for the last 18 months now, both for work and pleasure. Counting all code I wrote at work (just counting .jl files, no notebooks) I see that I wrote more than 15k lines of Julia code in that time, including several larger projects, ports of existing Matlab and C++ code, and interfaces to C libraries. In my experience Julia is ready for production in internal projects (as opposed to shipping executable code to a customer) and in particular is very well suited to research-type projects.

Julia

Developing code for research projects is in many ways similar to developing other software, but the key difference for me is that I need a quick turnaround time from idea to result not just once but in multiple iterations, sometimes changing the idea and implementation drastically.

In a very real sense most research projects should fail to achieve their original goals; almost by definition research is beyond what is known to work. If you only attempt known-to-work ideas it is not research. If your project fails it is important to learn as much as possible from the failure, that is, increasing the understanding of the problem and finding suitable new research ideas, and quick iterations make this process fun. The new ideas are often variants of earlier ideas and thus can reuse code. If this code happens to be compact and flexible this translates directly into productivity.

Matlab, R, and Python achieve this tight cycle of iterations quite successfully, but in all three languages there is a price towards the later iterations in that for achieving a high performance implementation significant parts of the code need to be rewritten in a more basic language such as C++, which then needs to be interfaced to the other code through some interface specification. For big high-value projects in industry with dedicated engineering support the additional effort required is typically not a problem, but for individual researchers it means hours and days spent writing additional code without adding functionality.

This process is cumbersome, error-prone, and creates a strong coupling, making further iterations of changing ideas and implementations slower. (As an example, in my grante library I prototyped many algorithms in Matlab, then programmed them in C++, then wrote a Matlab interface which by itself is almost 2,000 lines of C++ code.)

Julia also achieves this tight cycle, but does not require you to resort to compiled statically-typed languages such as C++ in order to achieve high performance. Using a single language maintains productivity both at the very beginning (prototyping) and towards the later iterations (productization).

Productivity in Julia (roughly "scientific results per wallclock developer time") is achieved through a number of features:

compact syntax, for example I can declare a function using f(x) = 2x+5. As mentioned above, I see the advantage of a compact syntax not in the keystrokes saved initially, but in lowering the barrier to future understanding and modification as the code evolves.
optional type annotation, the above function will work for x being an integer, or a float, or anything that has a multiplication and addition with integer arguments defined; in fact, I could write f(x::Float64) = 2x+5 to require that x is a float, but performance-wise they both yield the same code. This means that I can be strict about types when I need to be, but have the feel of a dynamic programming language.
Jupyter notebook interface for quick think-implement-results cycles.
excellent default choices of numerical libraries, dense linear algebra, sparse linear algebra, numerical optimization libraries, arbitrary precision computation, special functions, FFT, etcetera, most of what you can wish for in a technical computing environment is already there by default or in the many numerical packages available. In terms of numerical optimization codes Julia is probably one of the best environments available. All these libraries are carefully chosen to be the best-in-class for the functions that they implement.
foreign function interfaces to a number of languages: C and Fortran, C++ (unfortunately planned only for Julia 0.5), Python, R, Matlab. This makes it relatively easy to use code in any of these languages and I have used several Python libraries without issues.
high performance, I regularly find my first-attempt Julia code for a problem to be an order of magnitude faster than the equivalent Matlab code. In fact, I unlearned a number of bad Matlab programming patterns such as using bsxfun and vectorizing all code. Last year I wrote Julia code for an R-tree data structure to maintain a dynamic spatial index. Doing this in Matlab/R/Python in a reasonably performant way would be unthinkable! Instead you have to resort to wrapping native libraries. In Julia it was fun to write and it is fast, and I could add the required methods I needed for my application easily, including fancy filtering iterators.
no separation between user and developer, almost all of the base library is implemented in Julia itself, and it is easy to find where things are. For example, do you want to find out how two complex numbers are multiplied in Julia's base library? Enter methods(*) and have a look! This transparency makes it easy to learn good Julian style and extends further to how code is run: Want to see what machine code is executed when you call the sqrt function on a single precision float? Enter code_native(sqrt, (Float32,)) and see

    .text
Filename: math.jl
Source line: 132
    push    RBP
    mov     RBP, RSP
    xorps   XMM1, XMM1
    ucomiss XMM1, XMM0
Source line: 132
    ja      6
    sqrtss  XMM0, XMM0
    pop     RBP
    ret
    movabs  RAX, 140269793784104
    mov     RDI, QWORD PTR [RAX]
    movabs  RAX, 140269778958624
    mov     ESI, 132
    call    RAX

Almost nothing is hidden from the eyes of the user and this makes it easy and fun to look into the implementation.

Weak parts

Julia, while ready for serious use, is not yet at version 1.0 and lacks several important features. In my work, I found the following pieces missing (as of version 0.4).

Simple single machine parallelism. In C/C++/Fortran this would be OpenMP and in Matlab it is parfor. While Julia does have good support for distributed parallel computing, it currently does not have simple single-machine parallelism. In my experience using the distributed computing abstractions for single machine parallelism has severe performance overheads because all data is serialized and remote method invocations are used to execute code. (Also, I found the use of @everywhere macros cumbersome.) Apparently a simpler single machine parallelism is difficult to implement but in the works, as shown in this recent work by Intel presented at JuliaCon 2015.
Debugger. Quite simply, a debugger is essential for larger projects where errors can arise that are difficult to understand and debug without being able to interactively inspect the context in which the error appeared. Currently Julia has Debug.jl which provides debugging at gdb level in terms of functionality. But Julia lacks an interactive debugging capability on par with what is available in Matlab or most C/C++ environments (actually, I am not sure about Python debuggers here, is there a single popular tool?). As far as I understand, this is planned for the 0.5 version of Julia.
Shipping/productization/static-compilation. With this I mean the ability to select the distribution mechanism for the software, in particular to select whether all dependencies are included so that the software "will just run" on the target system, and whether binaries or source code is delivered. For most researchers and open-source programmers this is not an issue and the Julia package system caters for all their needs, but I found it relevant in a company environment because explaining to someone how they install Julia and a piece of code takes a while, whereas for C++ I can typically easily send an executable file and some library dependencies. As far as I understand, static compilation is planned for a future version of Julia.

How good are your beliefs? Part 2: The Quiz

2015-09-18T21:00:00+01:00

This post continues the previous post, part 1 on scoring rules. However, today we will be more hands on, testing your skill of making good and well-calibrated predictions.

To this end, I will ask you several questions about numerical quantities and I would like to hear an answer stated as a belief interval. First, we consider scoring rules for intervals.

Interval Scoring Rules

Often the prediction or elicitation of a full probability distribution is cumbersome due to the many degrees of freedom a distribution has.

Therefore, in practice we can instead ask our model or users for intervals. This carries the implicit assumption of unimodal beliefs, which may not be satisfied in important tasks, but has the advantage of requiring only two numbers to be elicited.

Given an interval forecast $[L,U]$, where $U > L > 0$, and $x > 0$ is a realization, (Gneiting and Raftery, 2007) define the following interval scoring rule for $\alpha \in (0,\frac{1}{2})$,

$$S_{\textrm{int}}(L,U,x,\alpha) = (U-L) + 1_{\{x < L\}} \, \frac{1}{\alpha} (L-x) + 1_{\{x > U\}} \, \frac{1}{\alpha} (x-U).$$

This is a proper scoring rule for intervals constructed from the sum of two quantile losses at the $\alpha$-quantile and the $(1-\alpha)$-quantile. However, it has the problem that if the score is used in different contexts where the quantities $x$ are of very different scales, then the resulting scores also carry this scale and are not comparable.

To remove this dependence on scale, we propose the following scale-free interval scoring rule.

$$S_{\textrm{sf}}(L,U,x,\alpha) = \alpha \log(U/L) + 1_{\{x < L\}} \log(L/x) + 1_{\{x > U\}} \log(x/U).$$

The rule is negatively oriented, thus acting as a loss function. This scoring rule is proper and is minimized in expectation over $X$ if we set $L = F^{-1}(\alpha)$ and $U = F^{-1}(1-\alpha)$ where $F$ is the cumulative distribution function of $X$ so that $L$ and $U$ become the $\alpha$-quantile and the $(1-\alpha)$-quantile. (You can find a short proof that this is a proper scoring rule in an appendix to this article.)
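The minimization claim can be checked numerically. The following Python sketch draws samples from a log-normal belief (my own choice of example distribution), scans candidate interval endpoints taken at different empirical quantile levels, and reports which levels minimize the average scale-free score. The sample size, the grid of levels, and all names are assumptions made for illustration.

```python
import math
import random


def scale_free_interval_score(lower, upper, x, alpha):
    """Scale-free interval score S_sf(L, U, x, alpha); lower values are better."""
    score = alpha * math.log(upper / lower)
    if x < lower:
        score += math.log(lower / x)
    elif x > upper:
        score += math.log(x / upper)
    return score


def mean_score(lower, upper, samples, alpha):
    """Monte Carlo estimate of the expected score under the sampled belief."""
    total = sum(scale_free_interval_score(lower, upper, x, alpha) for x in samples)
    return total / len(samples)


rng = random.Random(0)
alpha = 0.1
samples = sorted(rng.lognormvariate(0.0, 1.0) for _ in range(5000))


def empirical_quantile(q):
    """Sample value at quantile level q of the sorted samples."""
    return samples[int(q * (len(samples) - 1))]


levels = [k / 100 for k in range(2, 50, 2)]   # candidate levels 0.02, 0.04, ..., 0.48

# Scan the lower endpoint while the upper endpoint sits at the (1 - alpha)-quantile.
upper_fixed = empirical_quantile(1 - alpha)
best_lower_level = min(
    levels, key=lambda q: mean_score(empirical_quantile(q), upper_fixed, samples, alpha))

# Scan the upper endpoint (at level 1 - q) while the lower one sits at the alpha-quantile.
lower_fixed = empirical_quantile(alpha)
best_upper_level = min(
    levels, key=lambda q: mean_score(lower_fixed, empirical_quantile(1 - q), samples, alpha))

print("best lower level:", best_lower_level)
print("best upper level:", 1 - best_upper_level)
```

If the rule behaves as claimed, the scan should settle on the level $\alpha$ for the lower endpoint and $1-\alpha$ for the upper endpoint.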

Quiz

The following quiz tests your ability to make well-calibrated but uncertain assessments. (This also means that the quiz becomes somewhat pointless if you resort to Google or Wikipedia searches.) The quiz contains twelve items, and each item asks for a number, assuming there is a single true answer. Please pay attention to the units being asked for. Your knowledge regarding the different items is likely quite variable and for some questions you may have a good idea (your beliefs are concentrated), whereas for some other questions you may be more uncertain.

Because of this uncertainty the quiz does not ask you for your best guess but instead asks for an interval in an attempt to elicit your subjective beliefs. The lower number should be chosen such that you consider it 10 percent likely that the truth is below this number. The upper number is a 90 percent quantile and should be chosen such that there is a 10 percent chance that the truth is above this number.

For example, say the question is "Maximum horsepower of a 2015 Audi R8 car (horsepower)". Given my limited knowledge of cars I know that the Audi R8 is likely a quite powerful car so I would provide maybe an interval of 200 to 510. How I arrive at this is up to me, for example, I may consider that a car manufacturer may want to break the magic "500 horsepower" mark for marketing purposes. Fixing this interval, the truth is revealed. The truth is 570 horsepower, and the above scale-free interval loss would be 0.205.
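To make the arithmetic of this example explicit, here is a small Python sketch of the scale-free interval score with $\alpha = 0.1$, matching the 10 and 90 percent quantiles the quiz asks for. The function name is hypothetical.

```python
import math


def scale_free_interval_score(lower, upper, x, alpha):
    """Scale-free interval score S_sf(L, U, x, alpha); lower values are better."""
    score = alpha * math.log(upper / lower)       # width penalty, always paid
    if x < lower:
        score += math.log(lower / x)              # truth fell below the interval
    elif x > upper:
        score += math.log(x / upper)              # truth fell above the interval
    return score


# Audi R8 example: interval [200, 510], truth 570 horsepower.
audi_score = scale_free_interval_score(200, 510, 570, alpha=0.1)
print(round(audi_score, 3))  # 0.205
```

The score splits into a width term, $0.1 \log(510/200) \approx 0.094$, and a miss term, $\log(570/510) \approx 0.111$, so in this example most of the loss comes from the interval not covering the truth.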

For the interval score a lower score is better, that is, the score is negatively-oriented and behaves like a loss function. Here is an illustration of different intervals and their scores for the example. I plot the true value 570 as a solid green vertical line, and the intervals are green if they cover the truth and red otherwise. The score is shown next to each interval.

Have fun, and feel free to comment or suggest new questions/answer in the comment field.

Based on my informal testing with a few volunteers, for the above questionnaire the following seems like a reasonable subjective scale for the average score:

$0$ to $0.1$, expert
$0.1$ to $0.2$, proficient
$0.2$ to $0.5$, good
$0.5$ to $1.0$, medium
above $1$, fair

As for calibration, you should ideally have around eight to ten out of the twelve questions showing as green, because the quantile range should have 80 percent coverage. (Most persons who do not work with probability on a regular basis will have a lower coverage because of overconfidence.)
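As a rough back-of-the-envelope check of the "eight to ten" range, assume (this is my simplification) that a perfectly calibrated participant covers each of the twelve answers independently with probability $0.8$, and write $K$ for the number of covered answers. Then

$$\mathbb{E}[K] = 12 \times 0.8 = 9.6, \qquad \sqrt{\textrm{Var}(K)} = \sqrt{12 \times 0.8 \times 0.2} \approx 1.39,$$

so eight to ten green answers is within about one standard deviation of the expected count.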

Acknowledgements. I thank John Winn for the original calibration experiment he conducted in 2014 which inspired this article, Tilmann Gneiting for commenting on the scale-free quantile score, Peter Gehler for feedback and providing further questions, Cheng Soon Ong for comments that improved clarity of the article, Ian Kash for explaining scoring rules, Christoph Dann and Juan Gao for feedback on the questionnaire.

Appendix: Propriety of the Scale-free Interval Scoring Rule

The following is a proof that the scale-free interval scoring rule is proper. We will use the result from (Gneiting and Raftery, 2007) and show that our scoring rule is a special case.

First, consider the general form of a scoring rule for an $\alpha$-quantile from Theorem 6 in Gneiting and Raftery; for a choice $r$ and realization $x$ this takes the form

\begin{equation} S(r,x,\alpha) = \alpha s(r) + (s(x) - s(r)) \, 1_{\{x \leq r\}} + h(x). \label{eqn:Squantile} \end{equation}

Gneiting and Raftery show that for any nondecreasing function $s$ and an arbitrary function $h$ this yields a proper scoring rule for quantiles. In fact, it is known that any proper scoring rule for quantiles has to be of the form $(\ref{eqn:Squantile})$, see Theorem 3.3 in (Gneiting, 2009). In the Gneiting and Raftery JASA paper the authors propose the choices $s(y)=y$ and $h(y)=-\alpha y$. But here, in order to achieve a scale-free rule we propose to use

$$s(y) = \log y,\qquad h(y) = -\alpha \log y.$$

We obtain the specialization of $(\ref{eqn:Squantile})$ as

\begin{eqnarray} S_{\textrm{q}}(r,x,\alpha) & = & \alpha \log r + (\log x - \log r) \,1_{\{x \leq r\}} - \alpha \log x\nonumber\\ & = & \alpha \log (r/x) + 1_{\{x \leq r\}} \, \log (x/r).\label{eqn:qscore} \end{eqnarray}

Because $s$ is a non-decreasing function this is a proper scoring rule for quantiles. This quantile loss looks as follows (compare to the check loss figure earlier), for different quantiles ($x=5$ is the sample realization, and the horizontal axis denotes our quantile estimate).

The expected risk plot has a different shape compared to the check loss that we have seen earlier, but note that the minimizer again corresponds to the right quantiles of the $N(5,1)$ belief distribution.

By using Corollary 1 in (Gneiting and Raftery, 2007) the sum of multiple quantile scoring rules remains a proper scoring rule. To obtain a scoring rule for intervals we use the $\alpha$-quantile and the $(1-\alpha)$-quantile to obtain (after some rewriting of terms)

\begin{eqnarray} S_{\textrm{sf}}(L,U,x,\alpha) & = & -S_{\textrm{q}}(L,x,\alpha) - S_{\textrm{q}}(U,x,1-\alpha)\nonumber\\ & = & \alpha \log(U/L) + 1_{\{x < L\}} \log(L/x) + 1_{\{x > U\}} \log(x/U).\nonumber \end{eqnarray}
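The "rewriting of terms" can be sanity-checked numerically. The sketch below, with hypothetical function names and randomly drawn test values of my own choosing, compares the interval score against the negated sum of the two quantile scores from the appendix.

```python
import math
import random


def quantile_score(r, x, alpha):
    """Quantile score S_q(r, x, alpha) with s(y) = log y and h(y) = -alpha log y."""
    indicator = 1.0 if x <= r else 0.0
    return alpha * math.log(r / x) + indicator * math.log(x / r)


def scale_free_interval_score(lower, upper, x, alpha):
    """Scale-free interval score S_sf(L, U, x, alpha)."""
    score = alpha * math.log(upper / lower)
    if x < lower:
        score += math.log(lower / x)
    elif x > upper:
        score += math.log(x / upper)
    return score


rng = random.Random(0)
max_gap = 0.0
for _ in range(1000):
    lower = rng.uniform(0.1, 10.0)
    upper = lower * rng.uniform(1.01, 5.0)     # guarantees upper > lower > 0
    x = rng.uniform(0.05, 60.0)                # realizations below, inside, and above
    alpha = rng.uniform(0.01, 0.49)
    interval = scale_free_interval_score(lower, upper, x, alpha)
    from_quantiles = -quantile_score(lower, x, alpha) - quantile_score(upper, x, 1 - alpha)
    max_gap = max(max_gap, abs(interval - from_quantiles))

print("largest absolute difference:", max_gap)
```

Any difference should be at the level of floating-point rounding, which supports the identity but is of course not a proof.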

How good are your beliefs? Part 1: Scoring Rules

2015-09-04T22:00:00+01:00

This article is the first of two on proper scoring rules, a specific type of loss function defined on probability distributions or functions of probability distributions.

If this article sparks your interest, I recommend the gentle introduction to scoring rules in the context of decision theory in Chapter 10 of Parmigiani and Inoue's "Decision Theory" book, which is a great book to have on your data science bookshelf in any case and it deservedly won the DeGroot prize in 2009.

Scoring Rules

Consider the following forecasting setting. Given a set of possible outcomes $\mathcal{X}$ and a class of probability measures $\mathcal{P}$ defined on a suitably constructed $\sigma$-algebra, we consider a forecaster which makes a forecast in the form of a probability distribution $P \in \mathcal{P}$. After the forecast is fixed, a realization $x \in \mathcal{X}$ is revealed and we would like to assess quality of the prediction made by the forecaster.

A scoring rule is a function $S$ such that $S(P,x)$ is taken to mean the quality of the forecast. Hence the function has the form $S: \mathcal{P} \times \mathcal{X} \to \mathbb{R} \cup \{-\infty,\infty\}$. There are two variants popular in the literature: the positively-oriented scoring rules assign higher values to better forecasts, the negatively-oriented scoring rules behave like loss functions, taking smaller values for better forecasts.

A proper scoring rule has desirable behaviour, to be made precise shortly. Let us first think what could be desirable in a scoring rule. Intuitively we would like to make "cheating" difficult, that is, if we really subjectively believe in $P$, we should have no incentive to report any deviation from $P$ in order to achieve a better score. Formally, we first define the expected score under distribution $Q$,

$$S(P,Q) = \mathbb{E}_{x \sim Q}[S(P,x)].$$

So if we truly believe in $P \in \mathcal{P}$, then we should demand that reporting any other $Q$ does not improve the expected score under $P$, that is (for negatively-oriented scores)

$$S(P,P) \leq S(Q,P),\qquad \forall P,Q \in \mathcal{P}.$$

For strictly proper scoring rules the above inequality holds strictly except for $Q=P$. For a proper scoring rule the above inequality means that in expectation the lowest possible score can be achieved by faithfully reporting our true beliefs. Therefore, a rational forecaster who aims to minimize expected score (loss) is going to report his beliefs.

Key uses of scoring rules are:

Evaluating the predictive performance of a model;
Eliciting probabilities;
Using them for parameter estimation.

Let us look briefly at the different uses.

Model Evaluation

For assessing the model performance, we simply use the scoring rule as a loss function and measure the predictive performance on a holdout data set.

Probability Elicitation

For probability elicitation we can use a scoring rule as follows: we ask a user to make predictions and we tell him that we will reward him proportionally to the value achieved by the scoring rule once the prediction can be scored. Assuming that the user is rational and aims to maximize his reward, if we use a proper scoring rule, then he can maximize his expected reward by making predictions according to the true beliefs he holds. However, while the existence of a strictly proper scoring rule roughly means that elicitation of a quantity is possible, more efficient methods for probability elicitation may exist. In fact, Simon French and David Rios Insua argue in their book Statistical Decision Theory, page 76, that

"de Finetti (1974; 1975) and others have championed the use of scoring rules to elicit probabilities of events. ... Scoring rules are important in de Finetti's development of subjective probability, but it is not clear that they have a practical use in statistical or decision analysis. ... Scoring rules could provide a very expensive method of eliciting probabilities. In training probability assessors, however, they can have a practical use."

If you wonder what more efficient alternatives French and Insua have in mind, they do propose several methods to elicit probabilities, such as an idealized "probability wheel" the user can configure and spin, and a sequence of proposed gambles in order to find a fair value accepted by the user.

In general it seems to me (as an outsider to this field) that probability elicitation is as much about theoretically sound methods as it is about human psychology and biases, and how to avoid them. The human aspect of probability elicitation is discussed in Roger Cooke's book-length monograph on the topic, and in the recent study of (Goldstein and Rothschild, "Lay understanding of probability distributions", 2014) (thanks to Ian Kash for pointing me to this study!).

Estimation

For parameter estimation we perform empirical risk minimization on a probabilistic model using the scoring rule as a loss function, an approach dating back to (Pfanzagl, 1969). This is a special case of M-estimation but generalizes maximum likelihood estimation (MLE), where the log-probability scoring rule is used.

If the model class contains the true generating model this yields a consistent estimator, but for misspecified models this can yield answers different from the MLE, and these answers may be preferable. For example, if model assumptions are violated and for any choice of parameter the model would put a low density on some observations, these observations tend to influence the MLE severely because the log-prob scoring rule assigns a large penalty to them. Using a suitable scoring rule cannot prevent misspecification of course, but the consequences can be made less severe.

It should also be said that for estimation problems the log-prob scoring rule is the most principled in that it is the only one that can be justified from the likelihood principle.

Scoring Rule Examples

Here are a few examples of common and not so common scoring rules both for discrete and continuous outcomes.

Scoring Rule Example: Brier Score

This scoring rule was historically the first, proposed by Glenn Wilson Brier (1913-1998) in his seminal work (Brier, "Verification of Forecasts Expressed in Terms of Probability", 1950) as a means to verify weather forecasts.

Given a discrete outcome set $\{1,2,\dots,K\}$ the forecaster specifies a distribution $P=(p_1,\dots,p_K)$ with $p_i \geq 0$ and $\sum_i p_i = 1$. Then, when an outcome $j$ is realized we score the forecaster according to the Brier score,

$$S_B(P,j) = \sum_{i=1}^K (1_{\{i=j\}} - p_i)^2.$$

The Brier score is extensively discussed in (DeGroot and Fienberg, 1983) and they show that it can be decomposed into two terms measuring calibration and refinement, respectively. Here, refinement measures the information available to discriminate between different outcomes that is contained in the prediction.
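To make the definition concrete, here is a short sketch of the Brier score together with its expected value under a belief distribution. It assumes NumPy and zero-based outcome indices; the function names are mine.

```python
import numpy as np

def brier_score(p, j):
    """Brier score of forecast p (a length-K probability vector) when outcome j occurs."""
    p = np.asarray(p, dtype=float)
    onehot = np.zeros_like(p)
    onehot[j] = 1.0
    return float(np.sum((onehot - p) ** 2))

def expected_brier(report, belief):
    """Expected Brier score of `report` when outcomes are distributed as `belief`."""
    return sum(b * brier_score(report, j) for j, b in enumerate(belief))
```

Because the score is negatively oriented, properness shows up as `expected_brier(belief, belief) <= expected_brier(report, belief)` for every alternative `report`, which is easy to probe with a few random probability vectors.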

For the case with binary classes, the definitive work is (Buja, Stuetzle, Shen, 2005) in which a class of scoring rules is proposed based on the Beta distribution which generalizes both the Brier score and the log-probability score.

Scoring Rule Example: Log-Probability

The most common scoring rule in estimation problems is the log-probability, also known as the log-loss in machine learning. Maximum likelihood estimation can be seen as optimizing the log-probability scoring rule.

For the discrete outcome case it is given simply by

$$S_{\textrm{log}}(P,i) = -\log p_i.$$

If $p_i = 0$ the score $S_{\textrm{log}}(P,i) = \infty$. The log-probability is a proper scoring rule, but what really distinguishes it is that it is local in that when outcome $j$ realizes only the predicted value $p_j$ is used to compute the score. Intuitively this is a desirable property because if $j$ happens, why should we care about the precise distribution of probability mass for the other events?

It turns out that this local property is unique to the log-probability scoring rule. (For the result and proof see Theorem 10.1 in Parmigiani and Inoue's book.)
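The locality property can be illustrated with a tiny sketch (NumPy, zero-based outcomes, hypothetical function names): two forecasts that agree on the realized outcome but distribute the remaining mass differently receive the same log-probability score, while their Brier scores differ.

```python
import numpy as np

def log_score(p, j):
    """Negatively-oriented log-probability score, -log p_j (infinite if p_j = 0)."""
    p = np.asarray(p, dtype=float)
    return np.inf if p[j] == 0 else float(-np.log(p[j]))

def brier_score(p, j):
    """Brier score, which depends on the whole forecast vector."""
    p = np.asarray(p, dtype=float)
    onehot = np.zeros_like(p)
    onehot[j] = 1.0
    return float(np.sum((onehot - p) ** 2))

# Two forecasts with the same probability for outcome 0 but different tails.
p_a = [0.5, 0.3, 0.2]
p_b = [0.5, 0.1, 0.4]
same_log = log_score(p_a, 0) == log_score(p_b, 0)          # local: only p_0 matters
different_brier = brier_score(p_a, 0) != brier_score(p_b, 0)
```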

Scoring Rule Example: Energy Statistic

This scoring rule is for predicting a distribution in $\mathbb{R}^d$ and is defined for $\beta \in (0,2)$, realization $x \in \mathbb{R}^d$, and distribution $P$ on $\mathbb{R}^d$ as

$$S_E(P,x) = \mathbb{E}_{X \sim P}[\|X-x\|^\beta] - \frac{1}{2} \mathbb{E}_{X,X' \sim P}[\|X-X'\|^\beta].$$

This score has an intuitive interpretation: the score is the expected distance to the realization minus half the expected pairwise sample distance. Let us think about a few cases: if $P$ is a point mass, then the first term is just the distance to the realization and the second term is zero; in particular for $\beta \to 2$ the score recovers the squared Euclidean norm loss. The original definition is from (Gneiting and Raftery, 2007) except for the sign change, but is based on Szekely's energy statistic which also independently found its way into machine learning through the Hilbert-Schmidt independence criterion.

For $\beta \in (0,2)$ the energy score is a strictly proper scoring function for all Borel measures with finite moment $\mathbb{E}_P[\|X\|^\beta] < \infty$.

Here is a visualization, where $P = \mathcal{N}([0,0]^T, \textrm{diag}([1/2, 5/2]))$ is given by the 10k samples and the red marker corresponds to the realization $x$. Here we have $\beta=1$. We can see that the Euclidean nature of the scoring rule seems to dominate the anisotropic distribution $P$, that is, a realization that is unlikely under our belief distribution (leftmost plot) achieves a lower score than a sample with higher density (second leftmost plot).

As a practical matter, the energy score is simple to evaluate even when you have only predictive Monte Carlo realizations of your model, compared to the log-probability rule which requires the normalizer of the predictive distribution.
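A minimal sketch of such a Monte Carlo evaluation follows, assuming NumPy and an array of samples of shape `(n, d)` drawn from $P$. The function name is mine, and the all-pairs estimate of the second term needs $O(n^2)$ memory, so it is only meant for moderate $n$.

```python
import numpy as np

def energy_score(samples, x, beta=1.0):
    """Monte Carlo estimate of E||X - x||^beta - 0.5 * E||X - X'||^beta."""
    samples = np.asarray(samples, dtype=float)
    x = np.asarray(x, dtype=float)
    n = samples.shape[0]
    # First term: average distance from the samples to the realization.
    term1 = np.mean(np.linalg.norm(samples - x, axis=1) ** beta)
    # Second term: average distance over all ordered pairs i != j
    # (the diagonal contributes zero, so only the divisor changes).
    diffs = samples[:, None, :] - samples[None, :, :]
    pairwise = np.linalg.norm(diffs, axis=2) ** beta
    term2 = pairwise.sum() / (n * (n - 1))
    return term1 - 0.5 * term2
```

For samples that are all identical (a point mass), the second term vanishes and the score reduces to the distance between that point and the realization, as described above.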

Scoring Rule Example: Check Loss

The check loss, also known as quantile loss or tick loss, is a loss function used for quantile regression, where we would like to learn a model that directly predicts a quantile of a distribution, but we are given only samples of the distribution at training time.

This scoring rule is somewhat different in that a specific property of a belief distribution is scored, namely the quantile of the distribution. Being proper here means that the lowest expected loss is achieved by predicting the corresponding quantile of your belief. (Interestingly proper scoring rules exist only for some functions of the distribution, see (Gneiting, 2009).)

You may know a special case of the check loss already: when using an absolute value loss, your expected risk is minimized by taking the median of your belief distribution, that is, the $\frac{1}{2}$-quantile. The check loss generalizes this to a richer family of loss functions such that the expected minimizer corresponds to arbitrary quantiles, not just the median. Thus, instead of scoring an entire belief distribution $P$ we only score its quantile statistics.

The check loss is defined as

$$S_{\textrm{c}}(r,x,\alpha) = (x-r) (1_{\{x \leq r\}} - \alpha),$$

where $r$ is our predicted $\alpha$-quantile and $x \sim Q$ is a sample from the true unknown distribution $Q$.

Plotting this loss explains the name check loss and tick loss, because it looks like two tilted lines. I show it for a sample realization of $x=5$, and the horizontal axis denotes the quantile estimate.

For any belief distribution, taking the minimum expected risk decision yields the matching quantile. For example, if your beliefs are distributed according to $X \sim N(5,1)$, then you would consider the expected risk

$$R_{\alpha}(r,\alpha) = \mathbb{E}_{X \sim N(5,1)}[-S_c(r,X,\alpha)].$$

This convolves the check loss function with the belief distribution, in this case corresponding to a Gaussian kernel. The minimizer over $r$ of this expected risk function would correspond to your optimal decision.

The above plot marks the 10/50/90 quantiles and these correspond to the minimizers of the expected risks of the respective check losses.
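The relation between the expected risk and the quantiles can be reproduced with a short sketch. It assumes NumPy, approximates the expectation by an average over samples from the belief distribution, and uses a plain grid search; the function names are illustrative.

```python
import numpy as np

def check_score(r, x, alpha):
    """S_c(r, x, alpha) = (x - r) * (1{x <= r} - alpha); never positive."""
    r = np.asarray(r, dtype=float)
    x = np.asarray(x, dtype=float)
    return (x - r) * ((x <= r).astype(float) - alpha)

def expected_risk(grid, samples, alpha):
    """Monte Carlo estimate of E[-S_c(r, X, alpha)] for every r in `grid`."""
    # Rows index belief samples, columns index candidate quantile estimates.
    return -check_score(grid[None, :], samples[:, None], alpha).mean(axis=0)

def risk_minimizer(grid, samples, alpha):
    """Grid point with the smallest estimated expected risk."""
    return grid[np.argmin(expected_risk(grid, samples, alpha))]
```

With samples from $N(5,1)$ and `alpha` set to 0.1, 0.5, and 0.9, the minimizer should sit within a grid step of the corresponding empirical quantile of the samples.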

Conclusion

The above is only a small peek into the vast literature on scoring rules. If you are mathematically inclined, I highly recommend (Gneiting and Raftery, 2007) as an enjoyable further read and (Frongillo and Kash, 2015) for the most recent general results; everyone else may enjoy the book mentioned in the introduction.

In the second part we are going to put your forecasting skills to the test via an interactive quiz!

Acknowledgements. I thank Ian Kash for further insightful discussions on scoring rules and pointing me to relevant literature.

Machine Learning for Intelligent Image and Video Processing (ICCV 2015 Workshop)

2015-09-02T23:30:00+01:00

Michael Hirsch and I are organizing a workshop on the topic of machine learning for image and video processing as part of the ICCV 2015 programme.

The workshop takes place on the 17th December 2015 in Santiago, Chile, right after the main ICCV conference.

Call for Contributions

Image processing methods are highly relevant in a large variety of industrial and consumer applications. Traditionally some of the successful methods have been derived based on a careful consideration of the particular imaging modality and task, or on an ad hoc basis by image processing practitioners. More recently statistical machine learning models have been proposed for tasks such as denoising, deblurring, inpainting, etc., often leading to significant gains in image quality. Machine learning methods require training data to learn about the image statistics and the task, and challenges arise in how this data should be collected and how ground truth is obtained.

The goal of this workshop is to bring together researchers from the image processing and machine learning community to discuss all issues related to machine learning models for image processing applications.

We invite submission of papers on relevant topics including, but not limited to the following areas:

Statistical modelling of image processing tasks
Runtime and data efficiency
Tractable estimation
Deep learning for image processing applications
Procedures to obtain ground truth data sets

In all aspects the ICCV community has been at the forefront of developing new ideas and we hope to continue this development through this workshop.

Keynote Speakers

Join us for an exciting program including invited talks by:

Peyman Milanfar, Google
Stefan Roth, TU Darmstadt

Important Dates

Submission deadline: Friday, September 25th, 2015
Author Notification: Friday, October 16th, 2015
Final version of submission: Friday, October 23rd, 2015

Submission Instructions

Papers should be in ICCV style
Maximum paper length is 6 pages
Papers will be reviewed in a double blind process
Accepted papers are not published as part of IEEE Proceedings but unofficially on the workshop website

Accepted papers will be presented at the poster session with an additional poster spotlight presentation. One author of every accepted paper has to attend the workshop to present the poster and the spotlight talk.

Organizers

Sebastian Nowozin, Microsoft Research, Cambridge, UK
Michael Hirsch, Max Planck Institute for Intelligent Systems, Germany

Please find further details at the workshop website or send me an email in case you have any questions.

Effective Sample Size in Importance Sampling

2015-08-21T21:30:00+01:00

In this article we will look at a practically important measure of efficiency in importance sampling, the so called effective sample size (ESS) estimate. This measure was proposed by Augustine Kong in 1992 in a technical report which until recently has been difficult to locate online, but after getting in contact with the University of Chicago I am pleased that the report is now available (again):

Augustine Kong, "A Note on Importance Sampling using Standardized Weights", Technical Report 348, PDF, Department of Statistics, University of Chicago, July 1992.

Before we discuss the usefulness of the effective sample size, let us first define the notation and context for importance sampling.

Importance sampling is one of the most generally applicable methods to sample from otherwise intractable distributions. In machine learning and statistics importance sampling is regularly used for sampling from distributions in low dimensions (say, up to maybe 20 dimensions). The general idea of importance sampling has been extended since the 1950s to the sequential setting and the resulting class of modern Sequential Monte Carlo (SMC) methods constitutes the state of the art Monte Carlo methods in many important time series modeling applications.

The general idea of importance sampling is as follows. We are interested in computing an expectation,

$$\mu = \mathbb{E}_{X \sim p}[h(X)] = \int h(x) p(x) \,\textrm{d}x.$$

If we can sample from $p$ directly, the standard Monte Carlo estimate is possible, and we draw $X_i \sim p$, $i=1,\dots,n$, then use

$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^n h(X_i).$$

In many applications we cannot directly sample from $p$. In this case importance sampling can still be applied by sampling from a tractable proposal distribution $q$, with $X_i \sim q$, $i=1,\dots,n$, then reweighting the sample using the ratio $p(X_i)/q(X_i)$, leading to the standard importance sampling estimate

$$\tilde{\mu} = \frac{1}{n} \sum_{i=1}^n \frac{p(X_i)}{q(X_i)} h(X_i).$$

In case $p$ is known only up to an unknown normalizing constant, the so called self-normalized importance sampling estimate can be used. Denoting the weights by $w(X_i) = \frac{p(X_i)}{q(X_i)}$ it is defined as

$$\bar{\mu} = \frac{\frac{1}{n} \sum_{i=1}^n w(X_i) h(X_i)}{ \frac{1}{n} \sum_{i=1}^n w(X_i)}.$$

The quality of this estimate chiefly depends on how well the proposal distribution $q$ matches the form of $p$. Because $p$ is difficult to sample from, it typically is also difficult to make a precise statement about the quality of approximation of $q$.
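A minimal sketch of the self-normalized estimate $\bar{\mu}$ follows, assuming NumPy. The function signature and argument names are my own. Log-densities are used so that both $p$ and $q$ only need to be known up to constants, which cancel in the ratio.

```python
import numpy as np

def self_normalized_is(h, log_p_unnorm, sample_q, log_q, n, rng):
    """Self-normalized importance sampling estimate of E_p[h(X)].

    h            : integrand, applied elementwise to an array of samples
    log_p_unnorm : log of the target density, up to an additive constant
    sample_q     : function (n, rng) -> array of n draws from the proposal
    log_q        : log of the proposal density, up to an additive constant
    """
    x = sample_q(n, rng)
    log_w = log_p_unnorm(x) - log_q(x)
    # Subtracting the maximum avoids overflow; the factor cancels in the ratio.
    w = np.exp(log_w - np.max(log_w))
    return np.sum(w * h(x)) / np.sum(w)
```

As a usage example under assumed choices, take the target proportional to $\exp(-x^2/2)$, the proposal $\mathcal{N}(0, 4)$ and $h(x) = x^2$; the estimate should then be close to one, the variance of a standard Normal.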

The effective sample size solves this issue: it can be used after or during importance sampling to provide a quantitative measure of the quality of the estimated mean. Even better, the estimate is provided on a natural scale of worth in samples from $p$, that is, if we use $n=1000$ samples $X_i \sim q$ and obtain an ESS of say 350 then this indicates that the quality of our estimate is about the same as if we would have used 350 direct samples $X_i \sim p$. This justifies the name effective sample size.

Since the late 1990s the effective sample size has been widely used as a reliable diagnostic in importance sampling and sequential Monte Carlo applications. Sometimes it even informs the algorithm during sampling; for example, one can continue an importance sampling method until a certain ESS has been reached. Another example is during SMC where the ESS is often used to decide whether operations such as resampling or rejuvenation are performed.

Definition

Two alternative but equivalent definitions exist. Assume normalized weights $w_i \geq 0$ with $\sum_{i=1}^n w_i = 1$. Then, the original definition of the effective sample size estimate is by Kong, popularized by Jun Liu in this paper, as

$$\textrm{ESS} = \frac{n}{1 + \textrm{Var}_q(W)},$$

where $\textrm{Var}_q(W)$ is estimated from the rescaled weights $n w_i$, which have mean one, as $\textrm{Var}_q(W) \approx \frac{1}{n} \sum_{i=1}^n (n w_i - 1)^2$. The alternative form emerged later (I did not manage to find its first use precisely), and has the form

$$\textrm{ESS} = \frac{1}{\sum_{i=1}^n w_i^2}.$$

When the weights are unnormalized, we define $\tilde{w}_i = w_i / (\sum_{i=1}^n w_i)$ and see that

$$\textrm{ESS} = \frac{1}{\sum_{i=1}^n \tilde{w}_i^2} = \frac{(\sum_{i=1}^n w_i)^2}{\sum_{i=1}^n w_i^2}.$$

As is often the case in numerical computation with probabilistic models, the quantities are stored in the log-domain, i.e. we would store $\log w_i$ instead of $w_i$, and compute the above equations in log-space.
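One way to carry out the computation in log-space is sketched below, assuming NumPy and unnormalized log-weights. The helper names are mine; the identity used is $\log \textrm{ESS} = 2 \log \sum_i w_i - \log \sum_i w_i^2$.

```python
import numpy as np

def log_sum_exp(a):
    """Numerically stable log(sum(exp(a)))."""
    a = np.asarray(a, dtype=float)
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

def ess_from_log_weights(log_w):
    """ESS = (sum_i w_i)^2 / sum_i w_i^2, computed from unnormalized log-weights."""
    log_w = np.asarray(log_w, dtype=float)
    return float(np.exp(2.0 * log_sum_exp(log_w) - log_sum_exp(2.0 * log_w)))
```

Equal weights give an ESS of $n$, a single dominating weight gives an ESS close to one, and adding a constant to all log-weights leaves the result unchanged.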

Example

As a simple example we set the target distribution to be a $\textrm{StudentT}(0,\nu)$ with $\nu=8$ degrees of freedom, and the proposal to be a Normal $\mathcal{N}(\mu,16)$. We then visualize the ESS as a function of the shift $\mu$ of the Normal proposal. The effective sample size should be highest at the true mean (zero) and decrease as the proposal moves away from it.

This is indeed what happens in the above plot and, not shown, the estimated variance from the ESS agrees with the variance over many replicates.
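The experiment can be sketched in a few lines, assuming NumPy and reading the 16 in $\mathcal{N}(\mu,16)$ as a variance (that reading, the sample size, and the function name are my assumptions). Both densities are used only up to constants, which do not affect the ESS.

```python
import numpy as np

def ess_for_shift(mu, n=20000, nu=8.0, proposal_var=16.0, seed=0):
    """ESS of n proposal draws from N(mu, proposal_var) for a Student-t(nu) target."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu, np.sqrt(proposal_var), size=n)
    log_p = -0.5 * (nu + 1.0) * np.log1p(x ** 2 / nu)   # Student-t, unnormalized
    log_q = -0.5 * (x - mu) ** 2 / proposal_var         # Normal, unnormalized
    log_w = log_p - log_q
    w = np.exp(log_w - np.max(log_w))                   # rescale for stability
    return w.sum() ** 2 / np.sum(w ** 2)
```

Evaluating `ess_for_shift` on a range of shifts, for example `[ess_for_shift(m) for m in np.linspace(-8, 8, 17)]`, should give a curve that peaks near zero and falls off on both sides.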

Derivation

The following derivation is from Kong's technical report; however, to make it self-contained and accessible I fleshed out some details and give explanations inline.

We start with an expression for $\textrm{Var}(\bar{\mu})$. This is a variance of a ratio expression with positive denominator; hence we can apply the multivariate delta method for ratio expressions (see appendix below) to obtain an asymptotic approximation. Following Kong's original notation we define $W_i = w(X_i)$ and $W=W_1$, as well as $Z_i = h(X_i) w(X_i)$ and $Z = Z_1$. Then we have the asymptotic delta method approximation

\begin{eqnarray} \textrm{Var}_q(\bar{\mu}) & \approx & \frac{1}{n}\left[\frac{\textrm{Var}_q(Z)}{(\mathbb{E}_q W)^2} - 2 \frac{\mathbb{E}_q Z}{(\mathbb{E}_q W)^3} \textrm{Cov}_q(Z,W) + \frac{(\mathbb{E}_q Z)^2}{(\mathbb{E}_q W)^4} \textrm{Var}_q(W)\right].\label{eqn:delta1} \end{eqnarray}

We can simplify this somewhat intimidating expression by realizing that

$$\mathbb{E}_q W = \int \frac{p(x)}{q(x)} q(x) \,\textrm{d}x = \int p(x) \,\textrm{d}x = 1.$$

(For the unnormalized case the derivation result is the same because the ratio $\bar{\mu}$ does not depend on the normalization constant.) Then we can simplify $(\ref{eqn:delta1})$ to

\begin{eqnarray} & = & \frac{1}{n}\left[\textrm{Var}_q(Z) - 2 (\mathbb{E}_q Z) \textrm{Cov}_q(Z,W) + (\mathbb{E}_q Z)^2 \textrm{Var}_q(W)\right].\label{eqn:delta2} \end{eqnarray}

The next step is to realize that $\mathbb{E}_q Z = \int w(x) h(x) q(x) \,\textrm{d}x = \int \frac{p(x)}{q(x)} q(x) h(x) \,\textrm{d}x = \int h(x) p(x) \,\textrm{d}x = \mu.$ Thus $(\ref{eqn:delta2})$ further simplifies to

\begin{eqnarray} & = & \frac{1}{n}\big[\underbrace{\textrm{Var}_q(Z)}_{\textrm{(B)}} - 2 \mu \underbrace{\textrm{Cov}_q(Z,W)}_{\textrm{(A)}} + \mu^2 \textrm{Var}_q(W)\big]. \label{eqn:delta3} \end{eqnarray}

This is great progress, but we need to nibble on this expression some more. Let us consider the parts (A) and (B), in this order.

(A). To simplify this expression we can leverage the definition of the covariance and then apply the known relations of our special expectations. This yields

\begin{eqnarray} \textrm{(A)} = \textrm{Cov}_q(Z,W) & = & \mathbb{E}_q[\underbrace{Z}_{= W H} W] - \underbrace{(\mathbb{E}_q Z)}_{= \mu} \underbrace{(\mathbb{E}_q W)}_{= 1}\nonumber\\ & = & \mathbb{E}_q[H W^2] - \mu\nonumber\\ & = & \mathbb{E}_p[H W] - \mu.\label{eqn:A1} \end{eqnarray}

Note the change of measure from $q$ to $p$ in the last step. To break down the expectation of the product further we use the known rules about expectations, namely $\textrm{Cov}(X,Y) = \mathbb{E}[XY] - (\mathbb{E}X)(\mathbb{E}Y)$, which leads us to

\begin{eqnarray} \textrm{(A)} = \textrm{Cov}_q(Z,W) & = & \textrm{Cov}_p(H,W) + \mu \mathbb{E}_p W - \mu.\label{eqn:A2} \end{eqnarray}

(B). First we expand the variance by its definition, then simplify.

$$\textrm{Var}_q(Z) = \textrm{Var}_q(W H) = \mathbb{E}_q[W^2 H^2] - (\underbrace{\mathbb{E}_q[WH]}_{= \mu})^2 = \mathbb{E}_p[W H^2] - \mu^2.$$

For approaching $\mathbb{E}_p[W H^2]$ we need to leverage the second-order delta method (see appendix) which gives the following approximation,

\begin{eqnarray} \mathbb{E}_p[W H^2] & \approx & (\mathbb{E}_p W)\underbrace{(\mathbb{E}_p H)^2}_{= \mu^2} + 2 \underbrace{\mathbb{E}_p[H]}_{\mu} \textrm{Cov}_p(W,H) + (\mathbb{E}_p W) \textrm{Var}_p(H)\nonumber\\ & = & (\mathbb{E}_p W) \mu^2 + 2 \mu \textrm{Cov}_p(W,H) + (\mathbb{E}_p W) \textrm{Var}_p(H). \label{eqn:B1} \end{eqnarray}

Ok, almost done. We now leverage our work to harvest:

\begin{eqnarray} \textrm{Var}_q(\bar{\mu}) & \approx & \frac{1}{n}\big[\underbrace{\textrm{Var}_q(Z)}_{\textrm{(B)}} - 2 \mu \underbrace{\textrm{Cov}_q(Z,W)}_{\textrm{(A)}} + \mu^2 \textrm{Var}_q(W)\big]\nonumber\\ & \approx & \frac{1}{n}\big[ \left( (\mathbb{E}_p W) \mu^2 + 2 \mu \textrm{Cov}_p(W,H) + (\mathbb{E}_p W) \textrm{Var}_p(H) - \mu^2 \right)\nonumber\\ & & \qquad - 2 \mu \left(\textrm{Cov}_p(H,W) + \mu\mathbb{E}_p W - \mu\right) \nonumber\\ & & \qquad + \mu^2 \textrm{Var}_q(W) \big]\nonumber\\ & = & \frac{1}{n}\left[\mu^2 \left( 1 + \textrm{Var}_q(W) - \mathbb{E}_p W\right) + (\mathbb{E}_p W) \textrm{Var}_p(H)\right].\label{eqn:H1} \end{eqnarray}

Finally, we can reduce $(\ref{eqn:H1})$ further by

$$\mathbb{E}_p W = \mathbb{E}_q[W^2] = \textrm{Var}_q(W) + (\mathbb{E}_q W)^2 = \textrm{Var}_q(W) + 1.$$

For the other term we have

$$\frac{1}{n} \textrm{Var}_p(H) = \textrm{Var}_p(\hat{\mu}).$$

This simplifies $(\ref{eqn:H1})$ to the following satisfying expression.

$$\textrm{Var}_q(\bar{\mu}) \approx \textrm{Var}_p(\hat{\mu}) (1 + \textrm{Var}_q(W)).$$

This reads as "the variance of the self-normalized importance sampling estimate is approximately equal to the variance of the simple Monte Carlo estimate times $1 + \textrm{Var}_q(W)$."

Therefore, when taking $n$ samples to compute $\bar{\mu}$ the effective sample size is estimated as

$$\textrm{ESS} = \frac{n}{1 + \textrm{Var}_q(W)}.$$

Two comments:

We can estimate $\textrm{Var}_q(W)$ by the sample variance of the normalized importance weights.
This estimate does not depend on the integrand $h$.

The simpler form of the ESS estimate can be obtained as follows. With weights $w_i$ normalized to sum to one, the rescaled weights $n w_i$ have mean one, in agreement with $\mathbb{E}_q W = 1$, and we estimate

\begin{eqnarray} \textrm{Var}_q(W) & \approx & \frac{1}{n} \sum_{i=1}^n (n w_i - 1)^2 \nonumber\\ & = & n \sum_{i=1}^n w_i^2 - 1,\nonumber \end{eqnarray}

which yields

$$\textrm{ESS} = \frac{n}{1 + n \sum_i w_i^2 - 1} = \frac{1}{\sum_{i=1}^n w_i^2}.$$
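The agreement of the two forms can be checked numerically. The sketch below assumes NumPy, rescales the weights to have mean one (in line with $\mathbb{E}_q W = 1$ in the derivation), and uses the $1/n$ convention for the sample variance; the function names are mine.

```python
import numpy as np

def ess_variance_form(w):
    """n / (1 + Var(W)), with the weights rescaled to have mean one."""
    w = np.asarray(w, dtype=float)
    n = w.size
    w_mean_one = w / w.mean()
    var_w = np.mean((w_mean_one - 1.0) ** 2)   # sample variance, 1/n convention
    return n / (1.0 + var_w)

def ess_sum_form(w):
    """1 / sum_i w_i^2, with the weights normalized to sum to one."""
    w = np.asarray(w, dtype=float)
    w_norm = w / w.sum()
    return 1.0 / np.sum(w_norm ** 2)
```

For any vector of positive weights the two functions should return the same number up to floating point error. With the $1/(n-1)$ convention they would agree only approximately.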

Conclusion

Monte Carlo methods such as importance sampling and Markov chain Monte Carlo can fail in case the proposal distribution is not suitably chosen. Therefore, we should always employ diagnostics, and for importance sampling the effective sample size diagnostic has become the standard due to its simplicity, intuitive interpretation, and its robustness in practical applications.

However, the effective sample size can fail, for example when all proposal samples are in a region where the target distribution has little probability mass. In that case, the weights would be approximately equal and the ESS close to optimal, failing to diagnose the mismatch between proposal and target distribution. This is, in a way, unavoidable: if we never get to see a high probability region of the target distribution, the low value of our samples is hard to recognize.

For another discussion on importance sampling diagnostics and an alternative derivation, see Section 9.3 in Art Owen's upcoming Monte Carlo book. Among many interesting things in that chapter, he proposes an effective sample size statistic specific to the particular integrand $h$. For this, redefine the weights as

$$w_h(X_i) = \frac{\frac{p(X_i)}{q(X_i)} |h(X_i)|}{ \sum_{i=1}^n \frac{p(X_i)}{q(X_i)} |h(X_i)|},$$

then use the usual $1/\sum_i w_h(X_i)^2$ estimate. This variant is more accurate because it takes the integrand into account.
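A small sketch of this integrand-specific statistic, assuming NumPy, log-weights $\log(p(X_i)/q(X_i))$ known up to a constant, and precomputed values $h(X_i)$ that are not all zero. The function name is mine.

```python
import numpy as np

def integrand_specific_ess(log_w, h_values):
    """1 / sum_i w_h(X_i)^2, with w_h proportional to (p/q) * |h|."""
    log_w = np.asarray(log_w, dtype=float)
    h_values = np.asarray(h_values, dtype=float)
    # Unnormalized w_h; subtracting the maximum log-weight avoids overflow.
    a = np.exp(log_w - np.max(log_w)) * np.abs(h_values)
    w_h = a / a.sum()
    return 1.0 / np.sum(w_h ** 2)
```

When $|h|$ is constant the statistic coincides with the ordinary effective sample size, and when $h$ is nonzero at a single sample only it drops to one.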

Addendum: This paper by Martino, Elvira, and Louzada takes a detailed look at variations of the effective sample size statistic.

Appendix: The Multivariate Delta Method

The delta method is a classic method used in asymptotic statistics to obtain limiting expressions for the mean and variance of functions of random variables. It can be seen as the statistical analog of the Taylor approximation to a function.

The multivariate extension is also classic, and the following theorem can be found in many works; I picked the one given as Theorem 3.7 in DasGupta's book on asymptotic statistics (by the way, this book is a favorite of mine for its accessible presentation of many practical results in classical statistics). A more advanced and specialized book on expansions beyond the delta method is Christopher Small's book on the topic.

Delta Method for Distributions

Theorem (Multivariate Delta Method for Distributions). Suppose $\{T_n\}$ is a sequence of $k$-dimensional random vectors such that

$$\sqrt{n}(T_n - \theta) \stackrel{\mathcal{L}}{\rightarrow} \mathcal{N}_k(0,\Sigma(\theta)).$$

Let $g:\mathbb{R}^k \to \mathbb{R}^m$ be once differentiable at $\theta$ with the gradient vector $\nabla g(\theta)$. Then

$$\sqrt{n}(g(T_n) - g(\theta)) \stackrel{\mathcal{L}}{\rightarrow} \mathcal{N}_m(0, \nabla g(\theta)^T \Sigma(\theta) \nabla g(\theta))$$

provided $\nabla g(\theta)^T \Sigma(\theta) \nabla g(\theta)$ is positive definite.

This simply says that if we have a vector $T$ of random variables and we know that $T$ converges asymptotically to a Normal, then we can make a similar statement about the convergence of $g(T)$.

For the effective sample size derivation we will need to instantiate this theorem for a special case of $g$, namely where $g: \mathbb{R}^2 \to \mathbb{R}$ and $g(x,y) = \frac{x}{y}$. Let's quickly do that. We have

$$\nabla g(x,y) = \left(\begin{array}{c} \frac{1}{y} \\ -\frac{x}{y^2}\end{array}\right).$$

We further define $X_i \sim P_X$, $Y_i \sim P_Y$ iid, $X=X_1$, $Y=Y_1$,

$$T_n=\left(\begin{array}{c} \frac{1}{n}\sum_{i=1}^n X_i\\ \frac{1}{n} \sum_{i=1}^n Y_i\end{array}\right),\qquad \theta=\left(\begin{array}{c} \mathbb{E}X\\ \mathbb{E}Y\end{array}\right),$$

assuming our sequence $\frac{1}{n} \sum_{i=1}^n X_i \to \mathbb{E}X$ and $\frac{1}{n} \sum_{i=1}^n Y_i \to \mathbb{E}Y$. For the covariance matrix we know that the empirical average of $n$ iid samples has a variance that scales as $1/n$, that is

$$\textrm{Var}(\frac{1}{n}\sum_{i=1}^n X_i) = \frac{1}{n^2} \textrm{Var}(\sum_{i=1}^n X_i) = \frac{1}{n^2} \sum_{i=1}^n \textrm{Var}(X_i) = \frac{1}{n} \textrm{Var}(X),$$

and similar for the covariance, so we have

$$\Sigma(\theta) = \frac{1}{n} \left(\begin{array}{cc} \textrm{Var}(X) & \textrm{Cov}(X,Y)\\ \textrm{Cov}(X,Y) & \textrm{Var}(Y)\end{array}\right).$$

Applying the above theorem we have for the resulting one-dimensional transformed variance

\begin{eqnarray} B(\theta) & := & \nabla g(\theta)^T \Sigma(\theta) \nabla g(\theta)\nonumber\\ & = & \frac{1}{n} \left(\begin{array}{c} \frac{1}{\mathbb{E}Y} \\ -\frac{\mathbb{E}X}{(\mathbb{E}Y)^2}\end{array}\right)^T \left(\begin{array}{cc} \textrm{Var}(X) & \textrm{Cov}(X,Y)\\ \textrm{Cov}(X,Y) & \textrm{Var}(Y)\end{array}\right) \left(\begin{array}{c} \frac{1}{\mathbb{E}Y} \\ -\frac{\mathbb{E}X}{(\mathbb{E}Y)^2}\end{array}\right)\nonumber\\ & = & \frac{1}{n} \left[ \frac{1}{(\mathbb{E}Y)^2} \textrm{Var}(X) - 2 \frac{\mathbb{E}X}{(\mathbb{E}Y)^3} \textrm{Cov}(X,Y) + \frac{(\mathbb{E}X)^2}{(\mathbb{E}Y)^4} \textrm{Var}(Y) \right].\nonumber \end{eqnarray}

One way to interpret the quantity $B(\theta)$ is that the limiting variance of the ratio $X/Y$ depends both on the variances of $X$ and of $Y$, but crucially it depends most sensitively on $\mathbb{E}Y$ because this quantity appears in the denominator: small values of $Y$ have large effects on $X/Y$.

This is an asymptotic expression which is based on the assumption that both $X$ and $Y$ are concentrated around the mean so that the linearization of $g$ around the mean will incur a small error. As such, this approximation may deteriorate if the variance of $X$ or $Y$ is large so that the linear approximation of $g$ deviates from the actual values of $g$.
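To get a feeling for how well the linearization works when the means are well concentrated, the following sketch compares $B(\theta)$ with the variance of the ratio of sample means over many simulated replicates. It assumes NumPy; the Gaussian model with $\mathbb{E}X=2$, $\mathbb{E}Y=5$, $\textrm{Var}(X)=1$, $\textrm{Var}(Y)=1.25$ and $\textrm{Cov}(X,Y)=0.5$ is an arbitrary choice of mine.

```python
import numpy as np

def delta_ratio_variance(mean_x, mean_y, var_x, var_y, cov_xy, n):
    """B(theta): delta-method variance of mean(X)/mean(Y) over n iid pairs."""
    return (var_x / mean_y ** 2
            - 2.0 * mean_x / mean_y ** 3 * cov_xy
            + mean_x ** 2 / mean_y ** 4 * var_y) / n

def simulated_ratio_variance(n=200, replicates=4000, seed=0):
    """Empirical variance of mean(X)/mean(Y) across independent replicates."""
    rng = np.random.default_rng(seed)
    x = rng.normal(2.0, 1.0, size=(replicates, n))
    # Y is correlated with X: Var(Y) = 0.25 + 1 = 1.25 and Cov(X, Y) = 0.5.
    y = 5.0 + 0.5 * (x - 2.0) + rng.normal(0.0, 1.0, size=(replicates, n))
    return np.var(x.mean(axis=1) / y.mean(axis=1))
```

Here `delta_ratio_variance(2.0, 5.0, 1.0, 1.25, 0.5, 200)` evaluates to $0.032/200$, and the simulated variance should agree with it up to Monte Carlo error. Moving $\mathbb{E}Y$ closer to zero is a simple way to watch the approximation deteriorate.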

(For an exact expansion of the expectation of a ratio, see this 2009 note by Sean Rice.)

Second-order Delta Method

The above delta method can be extended to higher-order by a multivariate Taylor expansion. I give the following result without proof.

Theorem (Second-order Multivariate Delta Method). Let $T$ be a $k$-dimensional random vector such that $\mathbb{E} T = \theta$. Let $g:\mathbb{R}^k \to \mathbb{R}$ be twice differentiable at $\theta$ with Hessian $H(\theta)$. Then

$$\mathbb{E} g(T) \approx g(\theta) + \frac{1}{2} \textrm{tr}(\textrm{Cov}(T) \: H(\theta)).$$

For the proof of the effective sample size we need to apply this theorem to the function $g(X,Y)=XY^2$ so that

$$H(X,Y)=\left[\begin{array}{cc} 0 & 2Y\\ 2Y & 2X\end{array}\right].$$

Then the above result gives

$$\mathbb{E} g(X,Y) \approx (\mathbb{E}X)(\mathbb{E}Y)^2 + 2 (\mathbb{E}Y) \textrm{Cov}(X,Y) + (\mathbb{E}X) \textrm{Var}(Y).$$
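To spell out the trace term, here is the intermediate step, with the Hessian evaluated at the mean $\theta = (\mathbb{E}X, \mathbb{E}Y)$ and $\textrm{Cov}(T)$ written out for $T=(X,Y)$:

$$\frac{1}{2} \textrm{tr}\left( \left[\begin{array}{cc} \textrm{Var}(X) & \textrm{Cov}(X,Y)\\ \textrm{Cov}(X,Y) & \textrm{Var}(Y)\end{array}\right] \left[\begin{array}{cc} 0 & 2\,\mathbb{E}Y\\ 2\,\mathbb{E}Y & 2\,\mathbb{E}X\end{array}\right] \right) = \frac{1}{2} \left[ 2 (\mathbb{E}Y) \textrm{Cov}(X,Y) + 2 (\mathbb{E}Y) \textrm{Cov}(X,Y) + 2 (\mathbb{E}X) \textrm{Var}(Y) \right].$$

The first summand is the $(1,1)$ entry of the matrix product, and the remaining two summands form its $(2,2)$ entry.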

Reverse Search

2015-08-07T21:30:00+01:00

One of my all-time favorite algorithms is reverse search, proposed by David Avis and Komei Fukuda in 1992 (PDF).

Reverse search addresses enumeration problems, that is, problems where you would like to list a finite set of typically combinatorially related elements. Strictly speaking, reverse search is not an algorithm but a general construction principle that is applicable to a wide variety of problems and often leads to optimal algorithms for enumeration problems.

Problems in which reverse search is applicable often have the flavour where the elements have a natural partial order (such as sets, sequences, graphs where we can define subsets, subsequences, and subgraphs), or where there is a natural neighborhood relation between elements which can be used to traverse from one element to the other (such as the linear programming bases considered in the Avis and Fukuda examples).

The reverse search construction leads to a structured search space that is also suitable for combinatorial search and optimization algorithms. For example, we can often readily use the resulting enumeration tree in branch-and-bound search methods. I made heavy use of this possibility during my PhD a few years ago during my work with Koji Tsuda, and reverse search is the workhorse in my CVPR 2007, ICCV 2007, and ICDM 2008 papers. (Needless to say, I have fond memories of it, but even now I regularly see applications of the reverse search idea.) In the following, my presentation will differ quite a bit from the Avis and Fukuda paper.

Basic Idea

At its core reverse search is a method to organize all elements to be enumerated into a tree where the nodes in the tree each represent a single element. Each element appears exactly once in the tree and by traversing the tree from the root we can enumerate all elements exactly once.

Here is the recipe:

Define a ``reduction'' operation which takes an enumeration element and reduces it to a simpler one. This defines an enumeration tree.
Invert the reduction operation.
Enumerate all elements, starting from the root.

Let us illustrate this recipe first on a simple example: enumerating subsets of a given set. Say we are given the set $\{1,2,3\}$ and would like to enumerate subsets. To define the reduction operation we simply say ``remove the largest integer from the set''. Formally, this defines a function $f$ from the set of sets to the set of sets. Here is an illustration:

Now we consider the inverse map $f^{-1}$, from the set of sets to the set of powersets. Here is an illustration:

The inverse defines an enumeration strategy: we start at $\emptyset$ and evaluate $f^{-1}(\emptyset) = \{\{1\}, \{2\}, \{3\}\}$. For each set element we now recurse. This enumerates all elements in the tree exactly once.
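The subset example fits in a few lines of code. The following is a minimal sketch, assuming subsets are stored as sorted integer vectors; the function names are mine and chosen only for illustration.

```julia
# Reduction f: drop the largest element (s is kept sorted ascending).
reduce_set(s::Vector{Int}) = s[1:end-1]

# Inverse of f: all sets that reduce to s, that is,
# s plus one new element larger than every element of s.
function children(s::Vector{Int}, n::Int)
    lo = isempty(s) ? 1 : s[end] + 1
    [vcat(s, v) for v in lo:n]
end

# Depth-first traversal of the enumeration tree rooted at s.
function enumerate_subsets!(out, s::Vector{Int}, n::Int)
    push!(out, s)
    for c in children(s, n)
        enumerate_subsets!(out, c, n)
    end
    out
end

subsets = enumerate_subsets!(Vector{Vector{Int}}(), Int[], 3)
foreach(println, subsets)
```

Note that the traversal never stores the set of already visited elements: the only state is the recursion stack, and every subset is produced by exactly one parent.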

The above recipe has the following practical advantages:

Reverse search often yields a simple algorithm.
Typically there is no additional memory or bookkeeping required beyond the recursion call stack, so that the total memory required is $O(r)$ where $r$ is the recursion depth.
Yields output-linear, polynomial-delay enumeration algorithms, which means that the total time complexity is linear in the number of items enumerated and for each item only polynomial time is needed. (This slightly unconventional notion of complexity makes sense for enumeration problems because the answer is often exponential in the size of the input.)
Often yields optimal enumeration algorithms in terms of memory and runtime.
The resulting algorithms are trivially parallelizable over the enumeration tree.

Ok, the above was a trivial example; let us look at a more complicated one.

Example: Enumerating all Connected Subgraphs

Let us consider a non-trivial application of the reverse search idea: enumerating all connected subgraphs of a given graph.

To apply the recipe, what could the reduction operation look like? Intuitively, we are given a connected graph and we could remove a single vertex from the graph, thereby making it smaller. By removing one vertex at a time we would eventually arrive at the empty graph.

But given a graph, how do we determine which vertex to remove? For this, let us assume all vertices in the given graph have a unique integer index. Given such a graph we can then attempt to remove the highest-index vertex, just as in the set example above. Here we hit a complication: upon removal of the vertex the graph may become disconnected. For example, consider the chain graph $1-3-2$. Here the vertex labeled $3$ would be removed, yielding two disconnected components, which violates the requirement of enumerating only connected subgraphs. Therefore we simply say: ``Remove the highest-index vertex such that the resulting graph remains connected''.

Here is an example of the reduction operation in action on the following simple cycle graph:

The enumeration tree of all fourteen connected subgraphs (counting the empty graph as well) looks as follows. Here each arrow is the application of one reduction operation.

Looking at the above tree, you can note the following:

The graph $1-4-2$ has the highest vertex $4$ but this cannot be removed because it would yield a disconnected subgraph; therefore the reduction operation removes $2$ instead.
By construction, there is a unique path from every graph to the root.
By construction only connected subgraphs are present in the tree, and each such graph is present exactly once.

In order to enumerate all connected subgraphs, we have to invert the arrows of this graph. That is, we have to invert the reduction operation and given a graph we have to generate all child nodes in the reversed graph. This reversion is what gives reverse search its name.

The inverse operation is described as follows: ``given a connected subgraph, add a vertex such that the resulting graph is connected and the added vertex is the highest-index vertex whose removal retains a connected graph.'' This is quite a mouthful but luckily the actual implementation is simple.

Here is a Julia implementation.

using LightGraphs

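# Graphs with at most one vertex count as connected.
is_connected1(g::Graph) = nv(g) <= 1 ? true : is_connected(g)
# Does removing vertex rmv from the subgraph induced by vset keep it connected?
is_removable(g::Graph, vset::IntSet, rmv) =
    is_connected1(induced_subgraph(g, setdiff(vset, rmv)))
# Reduction operation: the highest-index removable vertex of vset.
rm_vertex(g::Graph, vset::IntSet) =
    maximum(filter(rmv -> is_removable(g, vset, rmv), vset))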

function connsubgraphs(g::Graph)
    function _connsubgraphs(vset::IntSet)
        produce(copy(vset))   # output current subgraph vertex set

        # Generate child nodes of the current subgraph.
        # Consider all vertices not yet in graph
        for add_vi = filter(v -> !in(v, vset), vertices(g))
            push!(vset, add_vi)     # Add new vertex
            if is_connected1(induced_subgraph(g, vset)) &&
                add_vi == rm_vertex(g, vset)
                # Recurse
                _connsubgraphs(vset)
            end
            setdiff!(vset, add_vi)  # Remove new vertex
        end
    end
    function _connsubgraphs()
        _connsubgraphs(IntSet())
    end
    Task(_connsubgraphs)
end

g = Graph(4)
add_edge!(g, 1, 3)
add_edge!(g, 3, 2)
add_edge!(g, 4, 2)
add_edge!(g, 1, 4)
S = collect(connsubgraphs(g))

Note the key statements between the push! and setdiff! lines that govern the recursion. In the if-condition we check that the new graph remains connected and the added vertex would be the one that would be removed.

The above code uses the Julia producer-consumer pattern. When run, it produces the following output, identical to the above diagram.

14-element Array{Any,1}:
 IntSet([])          
 IntSet([1])         
 IntSet([1, 3])      
 IntSet([1, 2, 3])   
 IntSet([1, 2, 3, 4])
 IntSet([1, 3, 4])   
 IntSet([1, 4])      
 IntSet([1, 2, 4])   
 IntSet([2])         
 IntSet([2, 3])      
 IntSet([2, 3, 4])   
 IntSet([2, 4])      
 IntSet([3])         
 IntSet([4])
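For readers without the graph package at hand, the same recipe can be sketched without any dependencies. This is an illustrative rewrite under my own assumptions, not the listing above: the graph is stored as a plain adjacency dictionary, vertex sets are `Set{Int}`, and the names `ADJ`, `connected`, `removable`, and `connected_subgraphs!` are hypothetical.

```julia
# Adjacency lists of the 4-cycle used above: edges 1-3, 3-2, 2-4, 4-1.
const ADJ = Dict(1 => [3, 4], 2 => [3, 4], 3 => [1, 2], 4 => [1, 2])

# Is the subgraph induced by vs connected? (Empty and singleton sets count as connected.)
function connected(adj, vs::Set{Int})
    length(vs) <= 1 && return true
    start = first(vs)
    seen = Set([start])
    stack = [start]
    while !isempty(stack)
        u = pop!(stack)
        for w in adj[u]
            if w in vs && !(w in seen)
                push!(seen, w)
                push!(stack, w)
            end
        end
    end
    length(seen) == length(vs)
end

# Reduction: the highest-index vertex whose removal keeps the subgraph connected.
removable(adj, vs) = maximum(v for v in vs if connected(adj, setdiff(vs, [v])))

# Reverse search: output vs, then recurse into every child that reduces back to vs.
function connected_subgraphs!(out, adj, vs::Set{Int})
    push!(out, sort(collect(vs)))
    for v in sort(collect(keys(adj)))
        v in vs && continue
        child = union(vs, [v])
        if connected(adj, child) && removable(adj, child) == v
            connected_subgraphs!(out, adj, child)
        end
    end
    out
end

result = connected_subgraphs!(Vector{Vector{Int}}(), ADJ, Set{Int}())
foreach(println, result)
```

Because candidate vertices are tried in increasing index order, this sketch should visit the fourteen vertex sets in the same order as the output shown above.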

Conclusion

Reverse search is a general recipe to construct tree-structured enumeration methods useful for enumerating combinatorial sets and optimization over them.

In fact, it is so useful that some authors have reinvented reverse search without noticing. For example, the popular gSpan algorithm of Yan and Han published in 2003 defines a clever total ordering on labeled graphs essentially in order to be able to define the reduction operation needed in reverse search.

So, check it out, the Avis and Fukuda paper is very rich and well worth a read! (If you prefer a different presentation similar to the one above but more technical, have a look at my PhD thesis.)

Acknowledgements. I thank Koji Tsuda for reading a draft version of the article and providing feedback.

Stochastic Computation Graphs

2015-07-24T22:00:00+01:00

This post is about a recent arXiv submission entitled Gradient Estimation Using Stochastic Computation Graphs, and authored by John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel.

In a nutshell this paper generalizes the backpropagation algorithm to allow differentiation through expectations, that is, to compute unbiased estimates of

$$\frac{\partial}{\partial \theta} \mathbb{E}_{x \sim q(x|\theta)}[f(x,\theta)].$$

The paper also provides a nice calculus on directed graphs that allows quick derivation of unbiased gradient estimates. The basic technical results in the paper have been known and used in various communities before and the arXiv submission properly discusses these.

But dismissing the paper as non-novel would miss the point, much as it would miss the point to say that backpropagation is ``just an application of the chain rule of differentiation''. Instead, the contribution of the current paper is in the practical utility of the graphical calculus and a rich catalogue of machine learning problems where the computation of unbiased gradients of expectations is useful.

In typical statistical point estimation tasks unbiasedness is often not quite as important compared to expected risk. However, here it is crucial. This is because the applications where stochastic computation graphs are useful involve optimization over $\theta$ and stochastic approximation methods such as stochastic gradient methods can only be justified theoretically in the case of unbiased gradient estimates.

A Neat Derivative Trick

To get an idea of the flavour of derivatives involving expectations, let us look at a simpler case explained in Section 2.1 of the paper. The proof of that case also contains a neat trick worth knowing. The case is as above but inside the expectation we have only $f(x)$ instead of $f(x,\theta)$. The ``trick'' is in the identity (obvious in retrospect),

$$\frac{\partial}{\partial \theta} p(x|\theta) = p(x|\theta) \frac{\partial}{\partial \theta} \log p(x|\theta).$$

This allows us to establish

\begin{eqnarray} \frac{\partial}{\partial \theta} \mathbb{E}_{x \sim p(x|\theta)}[f(x)] & = & \frac{\partial}{\partial \theta} \int p(x|\theta) f(x) \,\textrm{d}x\nonumber\\ & = & \int \frac{\partial}{\partial \theta} p(x|\theta) f(x) \,\textrm{d}x\nonumber\\ & = & \int p(x|\theta) f(x) \frac{\partial}{\partial \theta} \log p(x|\theta) \,\textrm{d}x\nonumber\\ & = & \mathbb{E}_{x \sim p(x|\theta)}[f(x) \frac{\partial}{\partial \theta} \log p(x|\theta)].\nonumber \end{eqnarray}

In this case the derivation was straightforward but for multiple expectations a derivation based on this elementary definition of the expectation is cumbersome and error-prone. Stochastic computation graphs allow a much quicker derivation of the derivative.
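The identity is easy to check by simulation. The following is a minimal sketch, assuming the illustrative choice $p(x|\theta) = \mathcal{N}(\theta,1)$ and $f(x)=x^2$ (neither is taken from the paper), for which $\mathbb{E}[f(x)] = \theta^2+1$ and the exact derivative is $2\theta$.

```julia
using Random, Statistics

f(x) = x^2
score(x, theta) = x - theta        # derivative of log N(x; theta, 1) w.r.t. theta

# Monte Carlo average of f(x) times the score, with x ~ N(theta, 1).
function score_function_gradient(rng, theta, nsamples)
    x = theta .+ randn(rng, nsamples)
    mean(f.(x) .* score.(x, theta))
end

rng = MersenneTwister(0)
theta = 1.5
ghat = score_function_gradient(rng, theta, 200_000)
println("estimate: ", ghat, "  exact: ", 2 * theta)
```

The estimate is unbiased but noisy, which is why a large number of samples is used here.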

Stochastic Computation Graphs

Stochastic computation graphs are directed acyclic graphs that encode the dependency structure of computation to be performed. The graphical notation generalizes directed graphical models. Here is an example graph.

There are three (or four) types of nodes in a stochastic computation graph:

Input nodes. These are the fixed parameters with respect to which we would like to compute the derivative. In the example graph, this is the $\theta$ node; input nodes are drawn without any container. While technically it is possible to have graphs without input nodes, in order to compute gradients the graph should include at least one input node.
Deterministic nodes. These compute a deterministic function of their parents. In the above graph this is the case for the $x$ and $f$ nodes.
Stochastic nodes. These nodes specify a random variable through a distribution conditional on their parents. In the above graph this is true for the $y$ node, and the circle mirrors the notation used in directed graphical models.
Cost nodes. These are a subset of the deterministic nodes in the graph whose range is the real numbers. In the above graph the node $f$ is a cost node. I draw them shaded; this is not the case in the original paper.

The entire stochastic computation graph specifies a single objective function whose domain are the input nodes and whose scalar objective is the sum of all cost nodes. The sum of all cost nodes is taken as an expectation over all stochastic nodes in the graph.

Therefore the above graph has the objective function

$$F(\theta) = \mathbb{E}_{y \sim p(y|x(\theta))}[f(y)].$$

Derivative Calculus

The notation used in the paper is a bit heavy and (for my taste at least) a bit too custom, but here it is. Let $\Theta$ be the set of input nodes, $\mathcal{C}$ the set of cost nodes, and $\mathcal{S}$ be the set of stochastic nodes. The notation $u \prec v$ denotes that there exists a directed path from $u$ to $v$ in the graph. The notation $u \prec^D v$ denotes that there exists a path whose nodes are all deterministic with the exception of the last node $v$ which may be of any type. We write $\hat{c}$ for a sample realization of a cost node $c$. The final notation needed for the result is

$$\textrm{DEPS}_v = \{ w \in \Theta \cup \mathcal{S} | w \prec^D v\}.$$

The key result of the paper, Theorem 1, is now stated as follows:

$$\frac{\partial}{\partial \theta} \mathbb{E}\left[\sum_{c \in \mathcal{C}} c\right] = \mathbb{E}\Bigg[\underbrace{\sum_{w \in \mathcal{S}, \theta \prec^D w} \left( \frac{\partial}{\partial \theta} \log p(w|\textrm{DEPS}_w) \right) \sum_{c \in \mathcal{C}, w \prec c} \hat{c}}_{\textrm{(A)}} + \underbrace{\sum_{c \in \mathcal{C}, \theta \prec^D c} \frac{\partial}{\partial \theta} c(\textrm{DEPS}_c)}_{\textrm{(B)}}\Bigg].$$

The two parts, (A) and (B), can be interpreted as follows. If we only have deterministic computation so that $\mathcal{S} = \emptyset$, as in an ordinary feedforward neural network for example, the part (B) is just the ordinary derivative and we have to apply the chain rule to that expression. The part (A) originates from each stochastic node, and the consequences that originate from the stochastic nodes are absorbed in the sample realizations $\hat{c}$.

It takes a bit of practice to apply Theorem 1 quickly to a given graph, and I found it easier to instead work through Algorithm 1 of the paper by hand. That algorithm generalizes backpropagation and builds the derivative node by node by traversing the graph backwards.

Example

To understand the basic technique I illustrate it on the graph above, which is problem (1) in the paper (Section 2.3), using the following concrete choices.

$$x(\theta) = (\theta-1)^2,$$

$$y(x) \sim \mathcal{N}(x,1),$$

$$f(y) = \left(y-\frac{5}{2}\right)^2.$$

Before we apply Theorem 1 to the graph, here is what the problem actually looks like. First, the objective $F(\theta) = \mathbb{E}_{y \sim p(y|x(\theta))}[f(y)]$. This objective is just an ordinary one-dimensional deterministic function.

The true gradient of the objective is also just an ordinary function. You can see three zero-crossings at approximately -0.6, 1, and 2.6, corresponding to two local minima (near -0.6 and 2.6) and a local maximum (at 1) of the objective function.
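The zero-crossings can be checked in closed form. Writing $y = x + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0,1)$, we have $\mathbb{E}[(y-\frac{5}{2})^2] = (x-\frac{5}{2})^2 + 1$, and substituting $x(\theta) = (\theta-1)^2$ gives

$$F(\theta) = \left((\theta-1)^2 - \frac{5}{2}\right)^2 + 1, \qquad \frac{\partial}{\partial \theta} F(\theta) = 4(\theta-1)\left((\theta-1)^2 - \frac{5}{2}\right).$$

The gradient vanishes at $\theta = 1$ and at $\theta = 1 \pm \sqrt{5/2}$, that is, at approximately $-0.58$ and $2.58$.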

For this simple example we can find a closed form expression for $F(\theta)$, but in general stochastic computation graphs we are not able to evaluate $F(\theta)$ and can instead only sample values $\hat{F}_1, \hat{F}_2, \dots$ which are unbiased estimates of the true $F(\theta)$. By taking averages of a few samples, say of 100 samples, we can improve the accuracy of our estimates. In order to minimize $F(\theta)$ over $\theta$ our goal is to sample unbiased gradients as well. The unbiased sample gradients look as follows, for $1$ sample (shown in green) and for averages of $100$ samples (shown in red), evaluated at 100 points equispaced along the $\theta$ axis shown.

To derive the unbiased gradient estimate we apply Theorem 1. From the summation (A) we will only have one term because our graph contains only one stochastic node, namely $y$. We will not have any term from (B) as there is no deterministic path from $\theta$ to $f$. Therefore we have

$$\frac{\partial}{\partial \theta} \mathbb{E}_{y \sim p(y|x(\theta))}[f(y)] = \mathbb{E}_{y \sim p(y|x(\theta))}\left[\frac{\partial}{\partial \theta} \log p(y|x(\theta)) \hat{f}\right].$$

For the logarithm we need to differentiate the log-likelihood of the Normal distribution and compute

\begin{eqnarray} \frac{\partial x}{\partial \theta} \frac{\partial}{\partial x} \log p(y|x(\theta)) & = & \frac{\partial x}{\partial \theta} \frac{\partial}{\partial x} \left[ - \frac{(y-x(\theta))^2}{2} - \frac{1}{2} \log 2\pi \right]\nonumber\\ & = & \frac{\partial x}{\partial \theta} (y-x(\theta))\nonumber\\ & = & 2(\theta - 1)(y - x(\theta)).\nonumber \end{eqnarray}

So the overall unbiased gradient estimator is

$$\mathbb{E}\left[\frac{\partial}{\partial \theta} \log p(y|x(\theta)) \hat{f}\right] = \mathbb{E}[2(\theta-1)(\hat{y}-\hat{x}) \hat{f}].$$

And the last expression in the expectation is the estimate for a single sample realization.
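The estimator can be simulated directly. The following is a minimal sketch of the single-sample estimate $2(\theta-1)(\hat{y}-\hat{x})\hat{f}$ averaged over many draws; the function names, the sample size, and the evaluation points are my own choices, and the closed-form gradient $4(\theta-1)((\theta-1)^2-\frac{5}{2})$ used for comparison follows from $\mathbb{E}[(y-\frac{5}{2})^2] = (x-\frac{5}{2})^2+1$.

```julia
using Random, Statistics

x_of(theta) = (theta - 1)^2
f(y) = (y - 2.5)^2

# Average of the single-sample estimates 2(theta-1)(y - x) f(y), with y ~ N(x, 1).
function scg_gradient(rng, theta, nsamples)
    x = x_of(theta)
    y = x .+ randn(rng, nsamples)
    mean(2 * (theta - 1) .* (y .- x) .* f.(y))
end

# Closed-form gradient, available only because this example is so simple.
true_gradient(theta) = 4 * (theta - 1) * ((theta - 1)^2 - 2.5)

rng = MersenneTwister(0)
for theta in (-0.5, 0.0, 2.0, 3.0)
    println(theta, "  estimate: ", scg_gradient(rng, theta, 100_000),
            "  exact: ", true_gradient(theta))
end
```

With a single sample per evaluation the estimates scatter widely around the exact gradient; averaging is what makes the two columns agree.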

Variational Bayesian Neural Networks

One important application of being able to compute gradients of expectation objectives is the approximate variational Bayesian posterior inference of neural network parameters.

The original pioneering work of applying variational Bayes (aka mean field inference) to neural network learning is this 1993 paper of Hinton and van Camp. Recently this has seen a revival, in particular through the appearance of stochastic variational inference methods around 2011, including a paper of Alex Graves. Many works followed up on this lead, for example Kingma and Welling, Rezende et al., ICML 2014, Blundell et al., ICML 2015, and Mnih and Gregor. They use different estimators of the gradient with varying quality and the SCG paper provides a nice overview of the bigger picture.

In any case, here is a visualization of prototypical variational Bayes learning for feedforward neural networks. A normal feedforward neural network training objective yields the following computation graph, without any stochastic nodes.

Here we have a fixed weight vector $w$ with a regularizer $R(w)$. We have $n$ training instances and each input $x_i$ produces a network output, $P_i(x_i,w)$, for example a distribution over class labels. Together with a known ground truth label $y_i$ this yields a loss $\ell_i(P_i,y_i)$, for example the cross-entropy loss. If we use a likelihood-based loss and a regularizer derived from a prior, i.e. $R(w)=-\log P(w)$, the training objective becomes just regularized maximum likelihood estimation:

$$F(w) = -\log P(w) - \sum_{i=1}^n \log P(y_i|x_i;w).$$

The variational Bayes training objective yields the following slightly extended stochastic computation graph.

Here $w$ is still a network parameter, but it is now a stochastic vector, $w \sim Q(w|\theta)$ and $\theta$ becomes the parameter we would like to learn. The additional cost node $H$ arises from the entropy of the approximating posterior distribution $Q$. (An interesting detail: in principle we would not need an arrow $w \to H$ because we can compute $H(Q)$. However, if we allow this arrow, then we can use a Monte Carlo approximation of the entropy for approximating families which do not have an analytic entropy expression.) The training objective becomes:

$$F(\theta) = \mathbb{E}_{w \sim Q(w|\theta)}\left[-\log P(w) + \log Q(w|\theta) - \sum_{i=1}^n \log P(y_i|x_i;w)\right].$$

The stochastic computation graph rules can now be used to derive the unbiased gradient estimate.

$$\frac{\partial}{\partial \theta} F(\theta) = \mathbb{E}_{w \sim Q(w|\theta)}\left[ \frac{\partial}{\partial \theta} \log Q(w|\theta) \left( -\log P(w) + \log Q(w|\theta) - \sum_{i=1}^n \log P(y_i|x_i;w) \right)\right].$$

This is now quite practical: the expectation can be approximated using simple Monte Carlo samples of $w$ values using the current approximating posterior $Q(w|\theta)$. Because the gradient is unbiased we can improve the approximation by running standard stochastic gradient methods.

Additional Applications

The paper contains a large number of machine learning applications, but there are many others. Here is one I find useful.

Experimental design. In Bayesian experimental design we make a choice that influences our future measurements and we would like to make these choices in such a way that we will maximize the future expected utility or minimize expected loss. For this we use a model of how our choices relate to the information we will capture and to how valuable this information will be. Because this is just decision theory and the idea is general, let me be more concrete. Let us assume the objective function

$$\mathbb{E}_{z \sim p(z)}[\mathbb{E}_{x \sim p(x|z,\theta)}[\ell(\tilde{z}(x,\theta), z)]].$$

Here $\theta$ is our design parameter, $z$ is the true state we are interested in with a prior $p(z)$. The measurement process produces $x \sim p(x|z,\theta)$. We have an estimator $\tilde{z}(x,\theta)$ and a loss function which compares the estimated value against the true state. The full objective function is then the expected loss of our estimator $\tilde{z}$ as a function of the design parameters $\theta$. The above expression looks a bit convoluted but this structure appears frequently when the type of information that is collected can be controlled. One example application of this: $z$ could represent user behaviour and $\theta$ some subset of questions we could ask that user to learn more about his behaviour. We then assume a model $p(x|z,\theta)$ of how the user would provide answers $x$ given questions $\theta$ and behaviour $z$. This allows us to build an estimator $\tilde{z}(x,\theta)$. The design objective then tries to find the most informative set of questions to ask.

Acknowledgements. I thank Michael Schober for discussions about the paper and Nicolas Heess for feedback on this article.

Multilevel Splitting

2015-07-10T22:50:00+01:00

This article is about multilevel splitting, a method for estimating the probability of rare events.

Estimating the probability of rare events is important in many fields. One vivid example is in the study of reliability of systems; imagine for example, that we are responsible for building a mechanical structure such as a bridge and we aim to design it to last one hundred years. To provide any kind of guarantee we need to have a model of what could happen in these 100 years, for example how the bridge will be used during that time, what weight it will have to bear, how strong winds and floods may be, how corrosion and other processes deteriorate the structure, etc. Considering all these factors may only be possible approximately via a simulation of the structure under different effects.

For concreteness let's say we denote by $X$ the random variable that represents the maximum force that is applied to the bridge during the 100-year lifetime. Each simulation allows us to obtain a sample $X_i \sim P$ of this force, where $P$ is a probabilistic model of everything that can happen during the 100 years. Given that we designed the bridge to withstand a certain force, the question is now to make statements of the form

$$P(X \geq \delta) \leq \epsilon.$$

Often we want the probability of something bad happening (the event $X \geq \delta$) to be exceptionally small, say $\epsilon = 10^{-9}$.

Another common example is the computation of P-values, where we observe a sample $x$ and compute a test statistic $t=T(x)$. Given a null model in the form of a distribution $P(X)$ we are interested in the P-value, that is, the probability $P(T(X) \geq t)$. This number is the probability under the null of observing a test statistic at least as extreme as the one actually observed. Using the multilevel splitting idea we can hope to accurately compute the P-value as long as we can run an MCMC chain on the null model. Also, more general P-values for composite null models, such as the posterior predictive P-value, are computable. So if this sounds good, how does multilevel splitting work and why is it needed in the first place?

In the absence of an analytic form for $P$, a naive simulation approach is to repeatedly draw samples $X_i \sim P$ and to count how often the bad event happens. For rare events such as the one above this does not work very well: if we exactly met the guarantee of the above example, $\epsilon = 10^{-9}$, then we would on average have to draw around $1/\epsilon = 10^9$ samples just to see a single bad event. But because we would like to estimate the rare event probability we need even more samples.
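To quantify "even more", consider the naive estimator $\hat{p} = \frac{1}{n} \sum_{i=1}^n 1_{\{X_i \geq \delta\}}$ of $p = P(X \geq \delta)$. It is an average of $n$ independent Bernoulli variables, so its relative standard deviation is

$$\frac{\sqrt{\textrm{Var}(\hat{p})}}{p} = \sqrt{\frac{1-p}{n p}} \approx \frac{1}{\sqrt{n p}}.$$

For a relative error of ten percent we therefore need $n \approx 100/p$, which for $p = 10^{-9}$ means about $10^{11}$ samples.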

There are a number of custom methods for accurate estimation of rare event probabilities. The remainder of the article discusses multilevel splitting, but at this point I would like to mention that another popular set of methods for rare events is based on adaptive importance sampling which is described in detail in Rubinstein and Kroese's book on Monte Carlo methods.

Multilevel Splitting

John von Neumann had an idea better than naive simulation on how to address the problem of estimating rare event probabilities. He named his solution multilevel splitting. The first published description of multilevel splitting is due to Kahn and Harris in 1951 (who attribute it to John von Neumann).

The basic idea of multilevel splitting is to steer an iterative simulation process towards the rare event region by removing samples far away from the rare event and splitting samples closer to the rare event.

The application considered in the 1951 paper is interesting in this regard in that it clearly relates to nuclear weapon research:

"We wish to estimate the probability that a particle is transmitted through a shield, when this probability is of the order of $10^{-6}$ to $10^{-10}$, and we wish to do this by sampling about a thousand life histories." ... "In one method of applying this, one defines regions of importance in the space being studied, and, when the sampled particle goes from a less important to a more important region, it is split into two independent particles, each one-half the weight of the original."

Back in 1951 the algorithm was somewhat ad hoc but effective. In a recent 2011 paper by Guyader, Hengartner, and Matzner-Lober the authors propose a more practical variant of the same idea and provide theoretical results.

Setup

The general setup is as follows. We have a distribution $P$ defining our system. We have $X \in \mathcal{X}$ for the realizations $X \sim P$. A continuous map $\phi: \mathcal{X} \to \mathbb{R}$ defines the quantity of interest. We are interested in computing the probability $P(\phi(X) \geq q)$. To this end we assume we can approximately simulate from $P$ using a Markov chain, which is typically possible even in complex models.

The basic idea of the original 1951 algorithm is to fix a set of levels $-\infty = L_0 < L_1 < L_2 < \dots < L_k = q$. Then we can formally write

$$P(\phi(X) \geq q) = \prod_{i=1}^k P(\phi(X) \geq L_i \:|\: \phi(X) \geq L_{i-1}).$$

The above product can be estimated term-by-term as follows: we use a set of $N$ particles $(X_1,\dots,X_N)$ and simulate these according to $X_i \sim P(X)$. Then we estimate the fraction

$$P(\phi(X) \geq L_1 \:|\: \phi(X) \geq L_0) = P(\phi(X) \geq L_1) \approx \frac{\sum_{i=1}^N 1_{\{\phi(X_i) \geq L_1\}}}{N}.$$

Afterwards we discard all particles with $\phi(X_i) < L_1$ and use the remaining particles to resample a set of $N$ particles (the splitting). Finally, we update all particles using a number of steps of our MCMC kernel, but this time restricted to $\phi(X_i) \geq L_1$, that is, we reject all proposals that would violate this condition. This is one level, and for the multilevel scheme we repeat the above procedure with the next level. Eventually, when we reach the final level $L_k$, we take the product of the estimated probabilities as the estimate of the rare event probability. Upon reaching the final level the surviving particles are properly distributed conditional on the restriction $\phi(X) \geq q$.
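The fixed-level scheme just described can be sketched in a few lines. This is an illustration under my own assumptions, not code from the 1951 or 2011 papers: the target is a standard Normal with $\phi(x)=x$, the levels are $1,2,3,4$, the MCMC kernel is a random-walk Metropolis step, and the function name and tuning constants are hypothetical.

```julia
using Random, Statistics

# Fixed-level multilevel splitting for P(X >= levels[end]), X ~ N(0,1), phi(x) = x.
function multilevel_splitting(rng, levels; N=10_000, T=20, step=0.5)
    X = randn(rng, N)                       # particles from the unrestricted target
    phat = 1.0
    for L in levels
        survivors = X[X .>= L]
        isempty(survivors) && return 0.0    # all particles died: estimate is zero
        phat *= length(survivors) / N       # estimate of P(phi >= L_i | phi >= L_{i-1})
        X = rand(rng, survivors, N)         # splitting: resample N particles
        for i in 1:N, t in 1:T              # Metropolis steps restricted to phi >= L
            y = X[i] + step * randn(rng)
            log_alpha = (X[i]^2 - y^2) / 2  # log-density ratio of the N(0,1) target
            if y >= L && log(rand(rng)) <= log_alpha
                X[i] = y
            end
        end
    end
    phat
end

rng = MersenneTwister(0)
phat = multilevel_splitting(rng, [1.0, 2.0, 3.0, 4.0])
println("estimate: ", phat, "  (exact value is about 3.17e-5)")
```

Each of the four conditional probabilities is moderately large and hence easy to estimate, even though their product is small.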

The above algorithm is effective but has the major drawback of having to fix a ladder of levels a priori. It would be more practical to instead have an automatic method to create these levels or to get rid of them entirely. The algorithm of Guyader et al. achieves this automatic selection by keeping the particles sorted according to $\phi$, with the lowest particle defining the current level, at the cost of having a random runtime of the algorithm.

The 2011 paper is quite rich in that it also contains an approximate confidence interval for the true probability as well as an analysis of the random runtime and an interesting application of estimating the false positive rate of watermark detection schemes (which ideally should be very small). Also, a variant of their method can solve for the quantile, that is, given $p$ in $p = P(\phi(X) \geq q)$, solve for $q$. (Unfortunately, in the paper, as is often the case with many statistics and applied math papers, the algorithm (in Section 3.2) is not presented very clearly compared to a typical CS or ML paper.)

Example

The following is an implementation in the Julia language that estimates $P(X \geq 16.5)$ where $X \sim \mathcal{N}(0,1)$ is a standard Normal random variable.

using Distributions

N=2000  # number of particles
T=10   # number of MCMC steps
q=16.5 # quantile
target=Normal(0.0, 1.0)
K=Normal(0.0, 0.2)  # Markov kernel

m=1
X=sort(rand(target, N))
L=X[1]

while L < q   # as long as there are particles below q
    X[1] = X[rand(2:N)]

    # Run a Markov chain on the lowermost sample
    for t=1:T
        y = X[1] + rand(K)
        log_alpha = logpdf(target, y) - logpdf(target, X[1])
        if log(rand()) <= log_alpha && y > L
            X[1] = y
        end
    end
    X = sort(X)
    L = X[1]
    m += 1
end
phat = (1.0-1.0/N)^(m-1)
# Estimate, Truth
phat, ccdf(target, q)

Giving the output

(1.4581487078794118e-61,1.8344630031647276e-61)

where the first number is the estimate and the second number is the ground truth, known in this case analytically. The relative estimation accuracy in this case is remarkable, given that this event occurs on average only once every $10^{61}$ samples. For this simulation a total of $m=280,092$ sample updates have been performed until the algorithm stopped.
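As a quick sanity check, the reported estimate can be recomputed from the reported number of updates alone, using the final formula of the listing with $N=2000$ and $m=280{,}092$:

```julia
N = 2000          # number of particles, as in the listing above
m = 280_092       # final value of m reported above
phat = (1.0 - 1.0 / N)^(m - 1)
println(phat)     # should agree with the reported estimate of about 1.458e-61
```

Each iteration multiplies the estimate by $1-1/N$, so the number of iterations needed grows like $N \log(1/p)$ rather than like $1/p$.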

Conclusion

Multilevel splitting is a useful algorithm for estimating the probability of rare events and the recent algorithm of Guyader et al. is practical in that it can be implemented on top of an arbitrary MCMC sampler.

There are caveats, however. In the above example, the problem structure is almost ideal for the application of multilevel splitting: a slowly varying continuous function $\phi$ whose level sets are topologically connected. This means that the MCMC sampler can mix easily in the restricted subsets and the resulting rare event probabilities can be accurately estimated. If these assumptions are not satisfied the algorithm may fail to work, and current research addresses these more general situations; see, for example, this recent paper by Walter.

In summary, although some care is required for the application of multilevel splitting to real problems it is likely to be orders of magnitude more efficient than naive approaches.

Bayesian P-Values

2015-06-27T00:15:00+01:00

P-Values (see also Jim Berger's page on p-values) are probably one of the most misunderstood concepts in statistics and certainly have been abused in statistical practice. Originally proposed as an informal diagnostic by Ronald Fisher, there are many reasons for the bad reputation of p-values, and in many relevant situations good alternatives such as Bayes factors can and should be used instead. One key objection to p-values is that although they provide statistical evidence against an assumed hypothesis, this does not imply that the deviation from the hypothesis is large or relevant. In practice, the largest criticisms are not related to the p-value itself but related to the widespread misunderstanding of p-values and the arbitrariness of accepting formal tests of significance based on p-values in scientific discourse.

In this article I am not going to defend p-values, also because others have done a good job at giving a modern explanation of their benefits in context, as well as refuting some common criticisms, for example the article Two cheers for P-values? by Stephen Senn and the more recent In defense of P values by Paul Murtaugh.

Instead, I will consider a situation which often arises in practice.

Setup

Suppose you have decided on a probabilistic model $P(X)$ or $P(X|\theta)$, where $\theta$ is some unknown parameter. By decided I mean that we actually commit and ship our model and we no longer entertain alternative models. Alternatives could be too expensive computationally or it could be too difficult to accurately specify these alternative models. (For example, a more complicated model may involve additional latent variables for which it is difficult to elicit prior beliefs.)

Given such a model but no assumed alternative, and some observed data $X$, can we identify whether "the model fits the data"? This problem is the classic goodness of fit problem and classical statistics has a repertoire of methods for standard models. These methods have their own problems in that they are often unsatisfactory case-by-case studies or strong results are obtained only in asymptotia. However, it would be too easy to just criticise these methods. The real question is whether the problem they address is an important one, and what alternatives should be used, especially from a Bayesian viewpoint.

Prediction versus Scientific Theories

In machine learning, at least in its widespread current industrial use, we are most often concerned with building predictive models that automatically make decisions such as showing the right advertisements, classifying spam emails, etcetera.

This current focus on prediction may shift in the future, for example due to a revival in artificial intelligence systems or in general more autonomous agent type systems which do not have a single clearly defined prediction task.

But as it currently stands, model checking and goodness of fit is not so relevant for building predictive models.

First, even when the observation does not comply with model assumptions, your prediction may still be correct, in which case the non-compliance does not matter. I.e. the p-value does not use a decision-theoretic viewpoint that includes a task-dependent utility; cf. Watson and Holmes. Knowing whether the model is "correct" or not may not be important at all for prediction, and the same can hold even within science, as summarized by Bruce Hill in this comment,

"A major defect of the classical view of hypothesis testing, [...], is that it attempts to test only whether the model is true. This came out of the tradition in physics, where models such as Newtonian mechanics, the gas laws, fluid dynamics, and so on, come so close to being "true" in the sense of fitting (much of) the data, that one tends to neglect issues about the use of the model. However, in typical statistical problems (especially in the biological and social sciences but not exclusively so) one is almost certain a priori that the model taken literally is false in a non-trivial way, and so one is instead concerned whether the magnitude of discrepancies is sufficiently small so that the model can be employed for some specific purpose."

Second, if the deviation from modelling assumptions leads to incorrect predictions you would detect this through simple analysis of incorrect predictions using ground truth holdout data, not through fancy model checking. Checking accuracy of predictions is easy with annotated ground truth data, and is the bread-and-butter basic tool of machine learning.

The only useful application of model checking for predictive systems that I could think of is in systems where a conservative "prefer-not-to-predict" option exists, so that observations which violate model assumptions can be excluded from further automated processing. Yet, much of this potential benefit may already be accessible through posterior uncertainty of the model. Only the subset of instances for which the model is certain but its predictions are wrong could profit from this special treatment.

In contrast to prediction, in science we build models not purely for prediction, but as a formal approximation to reality. Here I see that model checking is crucial, because it allows falsification of scientific hypotheses, leading hopefully to improved scientific understanding in the form of new models. One historically efficient method to falsify a scientific model is to check the predictions it makes, so a scientific model must normally also be a "predictive model". This viewpoint of establishing a model not just for making good predictions but also to understand mechanisms of reality also seems closer to the field of statistics.

The above separation of prediction versus science is of course not a simple dichotomy, but just a preference of the practitioner.

Bayesian Viewpoints?

So then, what is the Bayesian viewpoint here? The answer is that some well respected figures in the field accept frequentist tests and p-values as a method to criticise and attempt to falsify Bayesian models. One example can be seen in a recent article by Andrew Gelman and Cosma Shalizi where mechanisms to falsify a Bayesian model are discussed, stating

"The main point where we disagree with many Bayesians is that we do not think that Bayesian methods are generally useful for giving the posterior probability that a model is true, or the probability for preferring model A over model B, or whatever. Bayesian inference is good for deductive inference within a model, but for evaluating a model, we prefer to compare it to data (what Cox and Hinkley , 1974, call "pure significance testing") without requiring that a new model be there to beat it."

(They use pure significance tests and frequentist predictive checks, but no p-values in that paper.)

Another example is an article by Susie Bayarri and James Berger, where "Bayesian p-values" are discussed.

A third and maybe more popular pragmatic Bayesian stance is summarized in Bruce Hill's comment on Gelman, Meng, and Stern's article on posterior predictive testing,

"Like many others, I have come to regard the classical p-value as a useful diagnostic device, particularly in screening large numbers of possibly meaningful treatment comparisons. It is one of many ways quickly to alert oneself to some of the important features of a data set. However, in my opinion it is not particularly suited for careful decision-making in serious problems, or even for hypothesis testing. Its primary function is to alert one to the need for making such a more careful analysis, and perhaps to search for better models. Whether one wishes actually to go beyond the p-value depends upon, among other things, the importance of the problem, whether the quality of the data and information about the model and a priori distributions is sufficiently high for such an analysis to be worthwhile, and ultimately upon the perceived utility of such an analysis."

Not arguing by reference to authorities, but given the broad spectrum of contributions of Andrew Gelman, Bruce Hill, and James Berger (many of us learned Bayesian methods from the books Bayesian Data Analysis and Statistical Decision Theory and Bayesian Analysis), it should be clear that if they take frequentist tests and p-values seriously in statistical practice, they may actually be useful.

So let's look again at our goodness-of-fit problem.

Simple Models (simple null model)

The p-value can provide a useful diagnostic of goodness of fit. For the case of a simple model $P(X)$ with an observation $X \in \mathcal{X}$ we can pick a test statistic $T: \mathcal{X} \to \mathbb{R}$ where high values indicate unlikely outcomes, and then compute

$$p_{\textrm{classic}} = \textrm{Pr}_{X' \sim P}(T(X') \geq T(X)),$$

that is, the probability of observing a $T(X')$ at least as large as the actually observed $T(X)$, given the assumed model $P(X)$. This probability is the p-value, and if the probability of observing a more extreme test statistic is small we should rightly be suspicious of the assumed model. The choice of test statistic $T$ is the only degree of freedom and has to be made given the model.

This is the classic p-value and its formal definition is completely unambiguous. One important observation is that if we assume the null hypothesis is true and we treat the p-value as a random variable, then this random variable is uniformly distributed, for any sample size.
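To see the uniformity claim in a case where everything is available in closed form, here is a minimal sketch. The Exponential(1) null model and the statistic $T(X)=X$ are assumptions chosen purely for illustration (they are not used elsewhere in this post), because then the p-value is simply $\exp(-X)$.

```julia
using Random, Statistics

# Hypothetical simple null model: X ~ Exponential(1), test statistic T(X) = X.
# Then Pr(T(X') >= T(X)) = exp(-X) in closed form.
pvalue(x) = exp(-x)

rng = MersenneTwister(1)
pvals = [pvalue(randexp(rng)) for _ in 1:100_000]   # data drawn from the null itself

# If the p-values are Uniform(0,1), their mean is about 1/2
# and about 5% of them fall below 0.05.
pval_mean = mean(pvals)
frac_below_005 = mean(pvals .< 0.05)
```

A histogram of `pvals` would be the more complete check; the two summary numbers are just the cheapest thing to look at.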

Latent Variable Models (composite null model)

Now assume a slightly more general setting, where we have a model $P(X|\theta)$, and $\theta \in \Theta$ is some unknown parameter of the model which is not observed.

Because it is not observed, the above definition does not apply. We could apply the definition only if we knew $\theta$. Classic methods assume that we have an estimator $\hat{\theta}$ so that we can evaluate the p-value on $P(X|\hat{\theta})$, fixing the parameter to a value hopefully close to its true value. The key problem with this approach is that the p-value in general will no longer be uniformly distributed. This diminishes its value as a diagnostic for model misspecification. (Another alternative is to take the supremum probability over all possible parameters, again yielding a non-uniformly distributed p-value under the null.)

Bayesians to the rescue! Twice!

First, assume we would like to compute a p-value in the above setting. What would a Bayesian do? Of course, he would integrate over the unknown parameter, using a prior. This yields the so-called posterior predictive p-value going back to the work of Guttman. Assuming a prior $P(\theta)$ we compute the posterior predictive p-value as

$$p_{\textrm{post}} = \mathbb{E}_{\theta \sim P(\theta|X)}[ \textrm{Pr}_{X' \sim P(X'|\theta)}(T(X') \geq T(X))],$$

where $P(\theta|X) \propto P(X|\theta) P(\theta)$ is the proper posterior. The definition is simple: take the expectation of the ordinary p-value weighted by the parameter posterior. This definition is very general and typically easy to compute during posterior inference, i.e. it is quite practical computationally.

Unfortunately, it is also overly conservative, as explained in the JASA paper "Asymptotic Distribution of P Values in Composite Null Models" by Robins, van der Vaart, and Ventura from 2000. Intuitively this is because the observed data $X$ is used twice, a violation of the likelihood principle of Bayesian statistics: first it is used to obtain the posterior $P(\theta|X)$, and then it is used again to compute the p-value.

Bayesians to the rescue again! This time it is Susie Bayarri and Jim Berger, and in their JASA paper P values for Composite Null Models they introduce two alternative p-values which exactly "undo" the effect of using the data twice by conditioning on the information already observed. (I will not discuss the U-conditional predictive p-value proposed by Bayarri and Berger.) Here is the basic idea: let $X$ be the observed data and $t=T(X)$ the test statistic. We then define the partial posterior,

$$P(\theta|X \setminus t) \propto \frac{P(X|\theta) P(\theta)}{P(t|\theta)}.$$

To understand this definition remember that random variables are functions from the sample space to another set. Hence, conditioning on $t$ means that we condition on the event $\{X' \in \mathcal{X} | T(X') = t\}$. The partial posterior predictive p-value is now defined as

$$p_{\textrm{ppost}} = \mathbb{E}_{\theta \sim P(\theta|X \setminus t)}[ \textrm{Pr}_{X' \sim P(X'|\theta)}(T(X') \geq T(X))].$$

Bayarri and Berger, as well as Robins, van der Vaart, and Ventura, analyze the properties of this particular p-value and show that it is asymptotically uniformly distributed and thus is neither conservative nor anti-conservative.

If you are a Bayesian and consider providing a general model-fit diagnostic in the absence of a formal alternative hypothesis this partial posterior predictive p-value is the method to use.

However, there are two drawbacks I can see that have affected its usefulness for me:

It is much harder to compute. Whereas the posterior predictive p-value can be well approximated even with naive Monte Carlo as soon as normal posterior inference is achieved, this is not the case for the partial posterior predictive p-value. The reason is that $P(t|\theta)$, although typically a univariate density in the test statistic, is the integral over potentially complicated sets in $\mathcal{X}$, that is $P(t|\theta) = \int_{\mathcal{X}} 1_{\{T(X)=t\}} P(X|\theta) \,\textrm{d}X$. I have not seen generally applicable methods to compute $p_{\textrm{ppost}}$ efficiently so far.
The nice results of Bayarri and Berger do not extend to so-called discrepancy statistics as proposed by Xiaoli Meng in his 1994 paper. These more general test statistics include the parameter, i.e. we use $T(X,\theta)$ instead of just $T(X)$. Why is this useful? For example, and I found this a very useful test statistic, you can directly use the likelihood of the model itself as a test statistic: $T(X,\theta) = -P(X|\theta)$.

Enough thoughts, let's get our hands dirty with a simple experiment.

Experiment

We take a simple composite null setting as follows. Our assumed model is

$$X_i \sim \mathcal{N}(\mu, \sigma^2),\qquad i=1,\dots,n.$$

We get to observe $X=(X_1,\dots,X_n)$ and know $\sigma$ but consider $\mu$ unknown.

After some observations we would like to assess whether our model is accurate in light of the data. To this end we would like to use the P-values described above. We will need two ingredients: we need to define a test statistic and we need to work out the posterior inference in our model.

For the test statistic we actually use a generalized test statistic (discrepancy variable in Meng's vocabulary) as

$$T(X,\mu) = - \prod_{i=1}^n p(X_i|\mu) = -\prod_{i=1}^n \mathcal{N}(X_i ; \mu, \sigma^2).$$

For the posterior inference, as Bayesians we place a prior on $\mu$ and we select

$$\mu \sim \mathcal{N}(\mu_0, \sigma^2_0).$$

The Bayesian analysis is particularly straightforward in this case, as this note by Kevin Murphy details. In particular, after observing $n$ samples $X=(X_1,\dots,X_n)$ the posterior on $\mu$ has a simple closed form as

$$p(\mu|X) = \mathcal{N}(\mu_n, \sigma^2_n),$$

with

$$\sigma^2_n = \frac{1}{\frac{n}{\sigma^2}+\frac{1}{\sigma^2_0}},$$

and

$$\mu_n = \sigma^2_n \left(\frac{\mu_0}{\sigma^2_0} + \frac{n \bar{x}}{\sigma^2}\right),$$

where $\bar{x} = \frac{1}{n} \sum_i X_i$ is the sample average.
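The posterior update above, together with a naive Monte Carlo approximation of the posterior predictive p-value for the discrepancy $T(X,\mu)$, can be sketched as follows. This is a minimal sketch and not the code behind the plots: the function names, the prior values $\mu_0=0$, $\sigma_0=10$, and the number of Monte Carlo draws are my assumptions; $\sigma=1$, $\mu=4.5$ and $n=10$ follow the setting used in this experiment. Because $T$ is the negative likelihood, $T(X',\mu) \geq T(X,\mu)$ is the same event as the log-likelihood of $X'$ being at most that of $X$, which is what the code compares.

```julia
using Random, Statistics

# Closed-form posterior N(mu_n, sigma2_n) for known sigma and prior N(mu0, sigma0^2).
function posterior_mu(X, sigma, mu0, sigma0)
    n = length(X)
    sigma2_n = 1.0 / (n / sigma^2 + 1.0 / sigma0^2)
    mu_n = sigma2_n * (mu0 / sigma0^2 + n * mean(X) / sigma^2)
    mu_n, sigma2_n
end

# Log-likelihood up to an additive constant that is identical for X and X'.
loglik(X, mu, sigma) = -sum(abs2, X .- mu) / (2 * sigma^2)

# Naive Monte Carlo estimate of the posterior predictive p-value.
function pvalue_post(rng, X, sigma, mu0, sigma0; nsamples=10_000)
    mu_n, sigma2_n = posterior_mu(X, sigma, mu0, sigma0)
    n = length(X)
    hits = 0
    for _ in 1:nsamples
        mu = mu_n + sqrt(sigma2_n) * randn(rng)    # mu ~ P(mu | X)
        Xrep = mu .+ sigma .* randn(rng, n)        # X' ~ P(X' | mu)
        # T = -likelihood, so T(X', mu) >= T(X, mu) iff loglik(X') <= loglik(X)
        hits += loglik(Xrep, mu, sigma) <= loglik(X, mu, sigma)
    end
    hits / nsamples
end

rng = MersenneTwister(42)
X = 4.5 .+ randn(rng, 10)                      # data from the assumed model
p_post = pvalue_post(rng, X, 1.0, 0.0, 10.0)   # hypothetical prior: mu0 = 0, sigma0 = 10
```

The partial posterior predictive p-value would only change the distribution from which `mu` is drawn; the inner comparison stays the same.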

From this simple form of the posterior distribution we can derive the closed form partial posterior $P(\mu|X\setminus t)$ as well (not shown here, but essentially using known properties of the $\chi^2$ distribution). Here is a picture of the posterior $P(\mu|X)$ and the partial posterior $P(\mu | X \setminus t)$, where the data $X$ actually comes from the assumed model with true $\mu=4.5$ and $n=10$. Interestingly the partial posterior is more concentrated (which makes sense from the theory derived in Robins et al.).

Let us generate data from the assumed prior and model and see how our p-values behave. Because the null model is then correct, we can hope that the resulting p-values will be uniformly distributed. Indeed, if they were perfectly uniformly distributed they would be proper frequentist p-values. Because of the paper of Robins et al. we know that they will only be asymptotically uniformly distributed as $n \to \infty$. But here we are also outside the theory because our test statistic $T(X,\mu)$ includes the unknown parameter $\mu$. So, walking on thin theory, let's verify the distribution for $n=10$ by taking a histogram over $10^6$ replicates.

This looks good, and the partial posterior predictive p-value is more uniformly distributed obtaining better frequentist properties, in line with the claims in Bayarri and Berger and in Robins et al.

Finally, let us check with data from a model that is different from the assumed model. Here I sample from $\mathcal{N}(\mu, s^2)$, where $s \in [0,2]$. For $s=1$ this is the assumed model, but ideally we can refute the model for values that differ from one by detecting this deviation through a p-value close to zero. The plot below shows, for each $s$, the average p-value over 1000 replicates.

Clearly for $s < 0.6$ or so we can reliably discover that our assumed model is problematic. Interestingly the partial posterior predictive p-value has significantly more power, in line with the theory.

For $s > 1$ however, our p-value goes to one! How can this be? Well, remember that the choice of test statistic determines which deviations from our assumptions we can detect and that the p-value cannot verify the correctness of our assumed model but instead may only provide one-sided evidence against the model. With our current test statistic clearly this significant deviation passes undetected. We could replace our test statistic using the negative of our current test statistic and would be able to detect the above deviation for $s > 1$, but this implicitly more or less starts the process of thinking about alternative models, a point Bruce Hill mentioned above.

If we would like to consider alternative models we should ideally consider them in a formal way, and as a result we would be better off using a fully Bayesian approach over an enlarged model class.

The Entropy of a Normal Distribution

2015-06-13T23:30:00+01:00

The multivariate normal distribution is one of the most important probability distributions for multivariate data. In this post we will look at the entropy of this distribution and how to estimate the entropy given an iid sample.

For a multivariate normal distribution in $k$ dimensions in standard form with mean vector $\mathbf{\mu} \in \mathbb{R}^k$ and covariance matrix $\mathbf{\Sigma}$ we have the density function

$$f(\mathbf{x};\mathbf{\mu},\mathbf{\Sigma}) = \frac{1}{\sqrt{(2\pi)^k |\mathbf{\Sigma}|}} \exp\left(-\frac{1}{2} (\mathbf{x}-\mathbf{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x}-\mathbf{\mu})\right).$$

For this density, the differential entropy takes the simple form

\begin{equation} H = \frac{k}{2} + \frac{k}{2} \log(2\pi) + \frac{1}{2} \log |\mathbf{\Sigma}|.\label{eqn:Hnormal} \end{equation}
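As a small illustration of this formula, the entropy for a known covariance matrix can be evaluated directly. The function name `normal_entropy` and the two example matrices are mine; entropies are in nats.

```julia
using LinearAlgebra

# H = k/2 + k/2*log(2*pi) + 1/2*log|Sigma| for a k-dimensional Normal
normal_entropy(Sigma) = 0.5 * size(Sigma, 1) * (1 + log(2π)) + 0.5 * logdet(Sigma)

H1 = normal_entropy(fill(1.0, 1, 1))       # standard Normal in one dimension: 0.5*log(2*pi*e)
H2 = normal_entropy([2.0 0.0; 0.0 0.5])    # determinant is 1, so H2 = 1 + log(2*pi)
```

Using `logdet` instead of `log(det(Sigma))` avoids overflow or underflow of the determinant in higher dimensions.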

In practice we are often provided with a sample

$$\mathbf{x}_i \sim \mathcal{N}(\mathbf{\mu},\mathbf{\Sigma}), \quad i=1,\dots,n,$$

without knowledge of either $\mathbf{\mu}$ or $\mathbf{\Sigma}$. We are then interested in estimating the entropy of the distribution from the sample.

Plugin Estimator

The simplest method to estimate the entropy is to first estimate the mean as the empirical mean,

$$\hat{\mathbf{\mu}} = \frac{1}{n} \sum_{i=1}^n \mathbf{x}_i,$$

and the sample covariance as

$$\hat{\mathbf{\Sigma}} = \frac{1}{n-1} \sum_{i=1}^n (\mathbf{x}_i - \hat{\mathbf{\mu}}) (\mathbf{x}_i - \hat{\mathbf{\mu}})^T.$$

Given these two estimates we simply use equation $(\ref{eqn:Hnormal})$ on $\mathcal{N}(\hat{\mathbf{\mu}},\hat{\mathbf{\Sigma}})$. (We can also use $\mathcal{N}(\mathbf{0},\hat{\mathbf{\Sigma}})$ instead as the entropy is invariant under translation.)

This is called a plugin estimate because we first estimate parameters of a distribution, then plug these into the analytic expression for the quantity of interest.
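A minimal sketch of the plugin estimate, together with a small Monte Carlo run that looks at its bias, is given below. The choice of $\mathcal{N}(\mathbf{0}, I_k)$ as the data distribution, the seed and the number of replicates are assumptions made for this illustration; $k=3$ and $n=20$ mirror the setting used further down.

```julia
using LinearAlgebra, Random, Statistics

# Plugin estimate: sample covariance (1/(n-1) normalisation) plugged into the entropy formula.
function entropy_plugin(X)
    k = size(X, 1)                  # X is a (k,n) matrix, samples in columns
    Sigma_hat = cov(X, dims=2)      # centers with the empirical mean
    0.5 * k * (1 + log(2π)) + 0.5 * logdet(Sigma_hat)
end

rng = MersenneTwister(0)
k, n, reps = 3, 20, 20_000
H_true = 0.5 * k * (1 + log(2π))    # entropy of N(0, I_k), since log|I| = 0

# Average estimation error over many independent samples
bias = mean(entropy_plugin(randn(rng, k, n)) for _ in 1:reps) - H_true
```

In this setting `bias` should come out clearly negative, that is, the plugin estimate sits below the true entropy on average.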

It turns out that the plugin estimator systematically underestimates the true entropy and that one can use improved estimators. This is not special and plugin estimates are often biased or otherwise deficient. In case of the problem of estimating the entropy of an unknown normal distribution however, the known results are especially beautiful. In particular,

there exist unbiased estimators,
there exists an estimator that is a uniformly minimum variance unbiased estimator (within a restricted class, see below),
this estimator is also a (generalized) Bayesian estimator under the squared-loss, with an improper prior distribution.

Hence, for this case, a single estimator is satisfactory from both a Bayesian and frequentist viewpoint, and moreover it is easily computable.

Great, we will look at this estimator, but first look at an earlier work that studies a simpler case.

Ahmed and Gokhale, 1989

An optimal (UMVUE) estimator for the problem of a zero-mean Normal distribution $\mathcal{N}(\mathbf{0},\Sigma)$ has been found by (Ahmed and Gokhale, 1989). This is a restricted case: while the entropy does not depend on the mean of the distribution, an unknown mean does affect the estimation of the covariance matrix.

For a sample their estimator is

$$\hat{H}_{\textrm{AG}} = \frac{k}{2} \log(e\pi) + \frac{1}{2} \log \left|\sum_{i=1}^n \mathbf{x}_i \mathbf{x}_i^T\right| - \frac{1}{2} \sum_{j=1}^k \psi\left(\frac{n+1-j}{2}\right),$$

where $\psi$ is the digamma function.

If you know the mean of your distribution (so you can center your data to ensure $\mu=0$), this estimator provides a big improvement over the plugin estimate. Here is an example in mean squared error and bias, where $\Sigma \sim \textrm{Wishart}(\nu,I_k)$ and $\mathbf{x}_i \sim \mathcal{N}(\mathbf{0},\Sigma)$, with $k=3$ and $n=20$. The plot below shows a Monte Carlo result with $80{,}000$ replicates.

As promised, we can observe a big improvement over the plugin estimate, and we also see that the Ahmed and Gokhale estimator is indeed unbiased.

Here is a Julia implementation.

using LinearAlgebra, SpecialFunctions   # logdet, digamma

function entropy_ag(X)
    # X is a (k,n) matrix, samples in columns
    k = size(X,1)
    n = size(X,2)
    C = zeros(k,k)
    for i=1:n
        C += X[:,i]*X[:,i]'
    end
    H = 0.5*k*(1.0 + log(pi)) + 0.5*logdet(C)
    for i=1:k
        H -= 0.5*digamma(0.5*(n+1-i))
    end
    H
end

Because the case of a known mean is maybe less interesting, we go straight to the general case.

Misra, Singh, and Demchuk, 2005

In (Misra, Singh, and Demchuk, 2005) (here is the PDF) the authors do a thorough job of analyzing the general case. Besides a detailed bias and risk analysis the paper proposes two estimators for the general case:

A UMVUE estimator in a restricted class of estimators, that is a slight variation of the Ahmed and Gokhale estimator;
A shrinkage estimator in a larger class, which is proven to dominate the UMVUE estimator in the restricted class.

The authors are apparently unaware of the work of Ahmed and Gokhale. For their UMVUE estimator $\hat{H}_{\textrm{MSD}}$ they use the matrix

$$S = \sum_{i=1}^n (\mathbf{x}_i-\hat{\mu})(\mathbf{x}_i-\hat{\mu})^T$$

and define

\begin{equation} \hat{H}_{\textrm{MSD}} = \frac{k}{2} \log(e\pi) + \frac{1}{2} \log |S| - \frac{1}{2} \sum_{j=1}^k \psi\left(\frac{n-j}{2}\right). \label{Hmsd} \end{equation}

Can you spot the difference to the Ahmed and Gokhale estimator? There are two: the matrix $S$ is centered using the sample mean $\hat{\mu}$, and, to adjust for the use of the sample mean for centering, the argument to the digamma function is shifted by $1/2$.

Here is a Julia implementation.

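using LinearAlgebra, SpecialFunctions, Statistics   # logdet, digamma, mean

function entropy_msd(X)
    # X is a (k,n) matrix, samples in columns
    k = size(X,1)
    n = size(X,2)

    Xbar = vec(mean(X, dims=2))   # sample mean as a length-k vector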
    S = zeros(k,k)
    for i=1:n
        S += (X[:,i]-Xbar)*(X[:,i]-Xbar)'
    end

    res = 0.5*k*(1.0 + log(pi)) + 0.5*logdet(S)
    for i=1:k
        res -= 0.5*digamma(0.5*(n-i))
    end
    res
end
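To check the unbiasedness claim numerically without any extra packages, here is a self-contained sketch. Two things in it are assumptions of mine: `digamma_approx` is a stand-in for `SpecialFunctions.digamma` built from the recurrence $\psi(x) = \psi(x+1) - 1/x$ and the standard asymptotic series, and the data distribution (identity covariance, a fixed non-zero mean) is chosen only so that the true entropy is known.

```julia
using LinearAlgebra, Random, Statistics

# Stand-in for SpecialFunctions.digamma (assumption: recurrence plus asymptotic series).
function digamma_approx(x::Real)
    x = float(x)
    acc = 0.0
    while x < 6.0              # shift the argument up until the series is accurate
        acc -= 1.0 / x
        x += 1.0
    end
    y = 1.0 / (x * x)
    acc + log(x) - 0.5 / x - y * (1/12 - y * (1/120 - y / 252))
end

# Same formula as the MSD estimator above, written with a centered data matrix.
function entropy_msd_sketch(X)
    k, n = size(X)                  # samples in columns
    Xc = X .- mean(X, dims=2)       # center with the sample mean
    S = Xc * Xc'
    0.5 * k * (1 + log(π)) + 0.5 * logdet(S) -
        0.5 * sum(digamma_approx(0.5 * (n - j)) for j in 1:k)
end

rng = MersenneTwister(1)
k, n, reps = 3, 20, 20_000
mu = [1.0, -2.0, 0.5]               # hypothetical non-zero mean
H_true = 0.5 * k * (1 + log(2π))    # entropy of N(mu, I_k)

# Average estimation error over many independent samples
bias = mean(entropy_msd_sketch(mu .+ randn(rng, k, n)) for _ in 1:reps) - H_true
```

With these settings `bias` should be zero up to Monte Carlo noise, which is what unbiasedness predicts.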

Outline of Derivation

The key result that is used for deriving both the MSD and the AG estimator is a lemma due to Robert Wijsman from 1957 (PDF).

Wijsman proved a result relating the determinants of two matrices: the covariance matrix $\Sigma$ of a zero-mean multivariate Normal distribution, and the empirical outer product matrix $X X^T$ of a sample $X \in \mathbb{R}^{k\times n}$ (samples in columns) from that Normal. In Lemma 3 of the above paper he showed that, in distribution,

$$\frac{|X X^T|}{|\Sigma|} \sim \prod_{i=1}^k \chi_{n-i+1}^2,$$

where the $\chi^2$ factors are independent.

By taking the logarithm of this equation we can relate the central quantity in the differential entropy, namely $\log |\Sigma|$ to the log-determinant of the sample outer product matrix.

The sample outer product matrix of a zero-mean multivariate Normal sample with $n \geq k$ is known to be distributed according to a Wishart distribution, which has many known analytic properties. Using the known properties of the Wishart and $\chi^2$ distributions, one can derive the AG and MSD estimators and prove their unbiasedness.
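As a worked version of the logarithm step for the zero-mean (AG) case, with the samples stored as the columns of $X$, the only extra ingredient is the standard identity $\mathbb{E}[\log \chi^2_\nu] = \psi(\nu/2) + \log 2$. Taking logarithms and expectations in Wijsman's result gives

$$\mathbb{E}\left[\log |X X^T|\right] = \log |\Sigma| + k \log 2 + \sum_{i=1}^k \psi\left(\frac{n-i+1}{2}\right).$$

Solving for $\log|\Sigma|$ and substituting into $H = \frac{k}{2}\log(2\pi e) + \frac{1}{2}\log|\Sigma|$ yields

$$H = \mathbb{E}\left[\frac{k}{2}\log(e\pi) + \frac{1}{2}\log |X X^T| - \frac{1}{2}\sum_{i=1}^k \psi\left(\frac{n+1-i}{2}\right)\right],$$

where the $\frac{k}{2}\log 2$ term has been absorbed by turning $\log(2\pi e)$ into $\log(e\pi)$. The expression inside the expectation is the AG estimator, which shows its unbiasedness. For the MSD estimator the centered matrix $S$ has one degree of freedom less, so $n$ is replaced by $n-1$ inside the digamma arguments.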

Generalized Bayes

Misra, Singh, and Demchuk also show that their MSD estimator is the mean of a posterior that arises from a full Bayesian treatment with an improper prior. This prior is shown to be, in (Theorem 2.3 in Misra et al., 2005),

$$\pi(\mu,\Sigma) = \frac{1}{|\Sigma|^{(k+1)/2}}$$

This is a most satisfying result: a frequentist-optimal estimator in a large class of possible estimators is shown to be also a Bayes estimator for a suitable matching prior.

Because the posterior is proper for $n \geq k$, one could also use the proposed prior to derive posterior credible regions for the entropy, and most likely this is a good choice in that it could achieve good coverage properties.

Brewster-Zidek estimator

Going even further, Misra and coauthors also show that while the MSD estimator is optimal in the class of affine-equivariant estimators, when one enlarges the class of possible estimators there exist estimators which uniformly dominate the MSD estimator by achieving a lower risk.

They propose a shrinkage estimator, termed the Brewster-Zidek estimator, which I give here without further details.

$$\hat{H}_{BZ} = \frac{k}{2} \log(2 e \pi) + \frac{1}{2} \log |S + YY^T| + \frac{1}{2}(\log T - d(T))$$

$$d(r) = \frac{\int_r^1 t^{\frac{n-k}{2}-1} (1-t)^{\frac{k}{2}-1} \left[\log t + k \log 2 + \sum_{i=1}^k \psi\left(\frac{n-i+1}{2}\right) \right] \textrm{d}t}{\int_r^1 t^{\frac{n-k}{2}-1}(1-t)^{\frac{k}{2}-1} \textrm{d}t}$$

$$T = |S| |S+YY^T|^{-1}$$

$$Y = \sqrt{n} \hat{\mu}$$

Here is a Julia implementation using numerical integration for evaluating $d(r)$.

using LinearAlgebra, QuadGK, SpecialFunctions, Statistics   # det/logdet, quadgk, digamma, mean

function entropy_bz(X)
    # X is a (p,n) matrix, samples in columns
    p = size(X,1)
    n = size(X,2)

    Bfun(t) = t^((n-p)/2-1) * (1-t)^(p/2-1)
    function Afun(t)
        res = log(t) + p*log(2)
        for i=1:p
            res += digamma(0.5*(n-i+1))
        end
        res * Bfun(t)
    end
    A(r::Float64) = quadgk(Afun, r, 1.0)[1]
    B(r::Float64) = quadgk(Bfun, r, 1.0)[1]
    d(r) = A(r) / B(r)

    Xbar = vec(mean(X, dims=2))   # sample mean as a length-p vector
    Xs = sqrt(n)*Xbar
    S = zeros(p,p)
    for i=1:n
        S += (X[:,i]-Xbar)*(X[:,i]-Xbar)'
    end

    T = det(S)/det(S+Xs*Xs')
    dBZ = logdet(S + Xs*Xs') - d(T) + log(T)

    0.5*(p*(1+log(2*pi))+dBZ)
end

Shoot-out

Remember the zero-mean case? Let us start with this case. I use $k=3$ and $n=20$ as before, and $\Sigma \sim \textrm{Wishart}(\nu,I_k)$. Then samples are generated as $\mathbf{x}_i \sim \mathcal{N}(\mathbf{0},\Sigma)$. All numbers are from $80{,}000$ replications of the full procedure.

What you can see from the above plot is that the AG estimator which is UMVUE for this special case dominates the MSD estimator. Both unbiased estimators are indeed unbiased. In terms of risk the Brewster-Zidek estimator is indistinguishable from the MSD estimator.

Now, what about $\mu \neq \mathbf{0}$? Here, for the simulation the setting is as before, but the mean is $\mu \sim \mathcal{N}(\mathbf{0},2I)$, so that samples are distributed as $\mathbf{x}_i \sim \mathcal{N}(\mu,\Sigma)$.

The result shows that the AG estimator becomes useless if its assumption is violated, as is to be expected. (Interestingly, if we were to try using the scaled sample covariance matrix $n \hat{\Sigma}$ with the AG estimator it is reasonable but biased, that is, it has lost its UMVUE property.) The MSD estimator and the Brewster-Zidek estimators are virtually indistinguishable and seem to be both unbiased in this case.

Conclusion

Estimating the entropy of a multivariate Normal distribution from a sample has a satisfying solution, the MSD estimator $(\ref{Hmsd})$, which can be robustly used in all circumstances. It is computationally efficient, and with sufficient samples, $n \geq k$, the Bayesian interpretation also provides a proper posterior distribution over $\mu$ and $\Sigma$ which can be used to derive a posterior distribution over the entropy.

Acknowledgements. I thank Jonas Peters for reading a draft version of the article and providing feedback.

A quick summary of CVPR 2015

2015-06-11T22:00:00+01:00

CVPR 2015, "Computer Vision and Pattern Recognition" is the main conference of the computer vision community and just finished. I unfortunately was only able to stay for the three main conference days, but here is my short subjective summary.

For an overview of individual research papers, see this excellent summary page by Andrej Karpathy.

From the papers I have seen at the conference my personal favorite is Barron et al., "Fast Bilateral-Space Stereo for Synthetic Defocus", PDF here. I liked it for a number of reasons. First, this research is already successfully productized in a high-profile product and the presentation of the work was excellent. Second, the flavour of this work is to take a data structure (the permutohedral lattice) which has been used for one problem successfully (bilateral filtering), and use it to solve a more difficult problem (disparity from stereo) within the domain of the data structure. This general idea may be useful in other contexts. To admit the truth, I never liked pixels as a representation of image data, and many statistical models are just awkward to specify on the pixel level; for this reason we as a community often use higher representations such as superpixels or region proposals. This paper provides an alternative method on how a regular representation that is more aligned with the semantic content of the image could be used to solve problems in such a way that one can reconstruct a solution on the pixel level.

Research Trends

Deep Learning and Convolutional Neural Networks. Since the seminal ECCV 2012 workshop presentation by Alex Krizhevsky, which announced the ImageNet results and was published as a NIPS paper the same year, the computer vision community has rapidly adopted convolutional networks. Some of the largest vision labs developed toolkits that democratized this technology, such as Caffe, and existing toolkits such as Torch and Theano are also used. In effect, I estimate that around 30 percent of all papers used convolutional networks or features derived from them in their work, often substantially increasing predictive performance on the given task. Significant research directions remain open to everyone, but it is fair to say that standard convnets are now a mature vision technology regularly used by large parts of the community.
Rich Linguistic Outputs. Automatic image captioning is now feasible. There is an excellent summary of the many works at Piotr Dollar's blog and also in another summary by John Platt. Many of these works are enabled by the recent Microsoft COCO dataset and by recurrent neural networks.

Non-Research Trends and the IEEE Controversy

Growth in attendance. Attendance was at more than 2,400 persons, continuing the rapid growth of the computer vision community.
More code published. On almost every second poster there was a github URL and the licenses are generally very liberal (MIT, BSD, etc.) so as to permit wide distribution; this is great as it further accelerates the speed at which efforts can be redirected towards promising approaches.
IEEE splits from CVPR. The conference has always been organized in part by IEEE in various capacities as an insurer, organizer, and publisher. However, with traditional publishing models becoming obsolete, with examples of independent conferences and journals in the machine learning community (NIPS, ICML, and JMLR), and considering that CVPR is one of the premier conferences in all of computer science, the power balance has shifted away from IEEE towards the computer vision community; as a result, over the last few years the ties with IEEE have been weakened and now seem to be lost. To be fair, following CVPR 2011, IEEE has moved and negotiated a fairer deal, with CVPR papers made available open-access since CVPR 2013, and allowing co-sponsoring arrangements with the Computer Vision Foundation. But now, after threats made by IEEE, it has been voted at the PAMI-TC meeting that for future CVPR conferences (starting with CVPR 2016) the Computer Vision Foundation will take over the functions previously carried out by the IEEE. More details will be announced shortly, I am sure. Whether this has any repercussions for the TPAMI journal is unclear at the current point, but before making threats and taking actions that would serve as a catalyst for community action, IEEE would be wise to consider what has happened to Springer's Machine Learning journal in 2001 and the events that led to the founding of the Journal of Machine Learning Research, a very successful experiment.

Demosaicing

2015-05-29T23:00:00+01:00

This article describes the basic problem of image demosaicing and a recent work of mine providing a research dataset for demosaicing research.

Image demosaicing is a procedure used in almost all digital cameras. From your smartphone camera to the top-of-the-line digital SLR cameras, they all use a demosaicing algorithm to convert the captured sensor information into a color image. So what is this algorithm doing and why is it needed?

Why do we need Demosaicing?

Modern imaging sensors are based on semiconductors which have a large number of photo-sensitive sensor elements, called sensels. When a quantum of light hits a sensel it creates an electric charge. The amount of the charge created depends on the energy of the photon, which in turn depends on the wavelength of the incident light. Unfortunately, in current imaging sensors, once the electric charge is created it is no longer possible to deduce the color of the light. (The exception is the Foveon sensor, which uses a layered silicon design in which photons of lower energy (red) penetrate into deeper silicon layers than photons of higher energy (green and blue).)

To produce color images current sensors therefore do not record all wavelengths equally at each sensor element. Instead, each element has its own color filter. A typical modern sensor uses three distinct filter types, each most sensitive to a particular range of wavelengths. The three types are abbreviated red (R), green (G), and blue (B), although in reality they remain sensitive to all wavelengths. For a detailed plot of the wavelength sensitivities, this page has a nice graph.

Each sensor element therefore records only one measurement: the charge related to a certain range of wavelengths. It does not record the full color information. To reproduce an image suitable for human consumption we require three measurements, such as red/green/blue values. (This is a simplification, and in real systems the concept of a color space is used; a camera records in a camera-specific color space which is then transformed into a standard color space such as sRGB or Adobe RGB.)

The most popular arrangement of color filters is the so-called Bayer filter, which has a layout as shown below.

Image demosaicing is the process of recovering the missing colors at each sensor element. For example, in the top left sensel of the above figure only the blue response is measured and we need to recover the value of the green and red responses at this spatial location.

In principle, why should this even be possible? Because images of the natural world are slowly changing across the sensor, we can use color information from adjacent sensels (but different filter types) to provide the missing information.
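To make the idea concrete, here is a minimal sketch of the simplest such scheme, applied to the green channel only. It assumes a Bayer layout with blue at the top-left sensel, so that green sensels sit where row plus column is odd; the function name `interpolate_green` is hypothetical and this is not one of the algorithms discussed later.

```python
import numpy as np

def interpolate_green(raw):
    """Fill in green at red/blue sensels by averaging the four direct neighbours.

    raw: 2-D array of Bayer measurements, assumed layout with blue at (0, 0),
    so that green sensels are exactly those with (row + col) odd.
    """
    raw = np.asarray(raw, dtype=np.float64)
    rows, cols = np.indices(raw.shape)
    is_green = (rows + cols) % 2 == 1

    # Reflect padding keeps the checkerboard parity intact at the border,
    # so every neighbour used below is a genuine green measurement.
    padded = np.pad(raw, 1, mode="reflect")
    neighbour_mean = (
        padded[:-2, 1:-1] + padded[2:, 1:-1] +   # up, down
        padded[1:-1, :-2] + padded[1:-1, 2:]     # left, right
    ) / 4.0

    # Keep measured greens, use the neighbour average everywhere else.
    return np.where(is_green, raw, neighbour_mean)
```

On a slowly varying image this averaging recovers the missing green values well; near sharp edges it blurs across the edge, which is one source of the artifacts discussed next.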

Challenges for Demosaicing Algorithms

The above description is correct in that all demosaicing algorithms use correlations among spatially close sensels to restore the missing information. However, there are around three dozen publicly available demosaicing algorithms and probably many more proprietary ones. Besides differences in resource requirements and complexity, these algorithms also differ widely in their demosaicing performance.

Without considering implementation concerns for a moment, what makes a good demosaicing algorithm? A good demosaicing method has the following desirable properties:

Visually pleasing demosaiced output images;
No visible high-frequency artifacts (zippering), no visible color artifacts;
Robustness to noise present in the input;
Applicable to different color filter array layouts (not just Bayer);

To achieve this, a demosaicing algorithm has to be highly adapted to the statistics of natural images. That is, it has to have an understanding of typical image components such as textures, edges, smooth surfaces, etcetera.

Research Dataset

One approach to image demosaicing is to treat it as a statistical regression problem. By learning about natural image statistics from ground truth data, one should be able -- given sufficient data -- to approach the optimal demosaicing performance possible.

The problem is, perhaps surprisingly, that there are no suitable datasets. Current comparisons of demosaicing algorithms in the literature resort to two approaches to provide results for their algorithms:

Use a small set of Kodak images that were scanned onto Photo CDs (remember those?) in the mid-1990s from analogue films. To me it is unclear whether this scanning involved demosaicing, and whether the properties of the analogue films are an adequate proxy for digital imaging sensors.
Download sRGB images from the Internet and remove color channels to obtain a mosaiced image. But all these images have been demosaiced already, so we merely measure the closeness of one demosaicing algorithm to another one.
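The second approach can be sketched in a few lines. This is only an illustration under assumptions: the Bayer layout is taken to have blue at the top-left sensel, channels are ordered R, G, B, and the helper name `bayer_mosaic` is made up for this example.

```python
import numpy as np

def bayer_mosaic(rgb):
    """Simulate a Bayer mosaic by keeping one color channel per pixel.

    rgb: array of shape (height, width, 3), channels ordered R, G, B.
    Assumed layout: blue at even row/even column, red at odd row/odd column,
    green everywhere else.
    Returns the mosaic (height, width) and the channel index kept per pixel.
    """
    h, w, _ = rgb.shape
    rows, cols = np.indices((h, w))

    channel = np.ones((h, w), dtype=int)               # 1 = green by default
    channel[(rows % 2 == 0) & (cols % 2 == 0)] = 2     # 2 = blue
    channel[(rows % 2 == 1) & (cols % 2 == 1)] = 0     # 0 = red

    # Pick, for every pixel, the single channel its filter would let through.
    mosaic = np.take_along_axis(rgb, channel[..., None], axis=2)[..., 0]
    return mosaic, channel
```

The discarded two thirds of the values then serve as "ground truth", which is exactly the problem: they were themselves produced by some earlier demosaicing algorithm.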

This is appalling on the one hand, but it is certainly challenging to improve on it, if only for the reason that currently no sensor can capture ground truth easily. There have been ideas to obtain ground truth using a Foveon sensor or by using a global switchable color filter and multiple captures. The first idea (using a Foveon camera) sounds feasible but the noise and sensitivity characteristics of a Foveon sensor are quite different from popular CFA-CMOS sensors. The second idea sounds ideal but would only work in a static lab setup.

We introduce the Microsoft Research Demosaicing Dataset, our attempt at providing a suitable dataset. Our dataset is described in detail in an IEEE TIP paper. The dataset contains 500 images captured by ourselves containing both indoor and outdoor imagery. Here are some example images.

How did we overcome the problem of creating suitable ground truth images? The basic idea is as follows: it is difficult to capture ground truth for demosaicing for a full image sensor, but if we group multiple sensels into one virtual sensel then we can interpret this group as possessing all necessary color information. That is, we simultaneously reduce the image resolution and perform demosaicing. The paper contains multiple proposals for how to do this in a technically sound manner, but to see it visually, here is an example of downsampling using 3-by-3 sensel blocks on a Bayer filter.

As you can see, within each 3-by-3 sensel block we may have an unequal number of measurements of each color, but the spatial distribution of sensels of different types is uniform in the sense that their center of gravity is the center of the 3-by-3 block. This is not the case in general: for example, when averaging 4-by-4 blocks of Bayer measurements, the red measurements will have a higher density in the upper left corner of each block.
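The center-of-gravity argument is easy to check numerically. The sketch below assumes a Bayer layout with blue at the top-left sensel and uses a hypothetical helper `color_centroids`; which color ends up shifted towards the upper left in the 4-by-4 case depends on the layout of the sensor, and under the layout assumed here it is blue rather than red.

```python
import numpy as np

def color_centroids(block_size, row0=0, col0=0):
    """Center of gravity (row, col) of each color within one square block.

    The block starts at sensor position (row0, col0). Assumed Bayer layout:
    blue at even row/even column, red at odd row/odd column, green elsewhere.
    Coordinates are relative to the block, so the block center is at
    ((block_size - 1) / 2, (block_size - 1) / 2).
    """
    rows, cols = np.indices((block_size, block_size))
    r, c = rows + row0, cols + col0
    masks = {
        "B": (r % 2 == 0) & (c % 2 == 0),
        "R": (r % 2 == 1) & (c % 2 == 1),
        "G": (r + c) % 2 == 1,
    }
    return {
        name: (rows[m].mean(), cols[m].mean())
        for name, m in masks.items()
        if m.any()
    }

# 3-by-3 block: every color is centered at (1, 1), the block center.
print(color_centroids(3))
# 4-by-4 block: green is centered at (1.5, 1.5), blue and red are not.
print(color_centroids(4))
```

For 3-by-3 blocks the same holds for any block offset, which is what makes the odd block size attractive.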

Algorithm Comparison

So how do common algorithms (and our novel algorithm) fare on our benchmark data set? Performance is typically measured as a function of the mean-squared-error of the predicted image intensities. The most common measurement is the peak signal-to-noise ratio (PSNR) measured in decibels (dB), where higher is better. We also report another performance measure based on a perceptual similarity metric, the structural similarity index (SSIM), which measures mean and variance statistics in image blocks; again a higher score means a better demosaiced image.
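For reference, the standard definition is $\textrm{PSNR} = 10 \log_{10}(\textrm{peak}^2 / \textrm{MSE})$. A minimal sketch follows; the peak value of 255 and the choice to average over all pixels and channels are assumptions for illustration, and the exact evaluation protocol of the paper may differ.

```python
import numpy as np

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape.

    peak is the maximum possible intensity (255 for 8-bit images is an
    assumption here; use 1.0 for images scaled to [0, 1]).
    """
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```

Because of the logarithm, a fixed gain in dB corresponds to a fixed ratio of errors: an improvement of 0.5 dB means the mean-squared-error shrinks by a factor of $10^{0.05} \approx 1.12$.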

The top algorithms achieve the following performance. I also include bilinear interpolation as a baseline method.

Method	PSNR (dB)	SSIM
Bilinear interpolation	30.86	0.882
Non-Local means	38.42	0.978
Contour stencils	39.41	0.980
RTF (our method)	39.39	0.980

Hence we achieve a result comparable to the state of the art. The experiments become interesting when we perform simultaneous denoising and demosaicing. Performing both operations simultaneously is desirable in a real imaging pipeline because both happen at the same stage in the processing. For the task of simultaneous denoising and demosaicing the results tell a different story.

Method	PSNR (dB)	SSIM
Bilinear interpolation	30.40	0.859
Non-Local means	36.46	0.949
Contour stencils	37.17	0.953
RTF (our method)	37.78	0.961

In the paper we compare more than a dozen methods. The proposed method achieves an improved demosaicing performance of over 0.5dB in realistic conditions which is visually significant. Our method is based on the non-parametric regression tree field model (RTF) which we have published earlier; essentially this is a Gaussian conditional random field (CRF) with very rich potential functions defined by regression trees. Due to its high capacity it can learn a lot about image statistics relevant to demosaicing.

The next best method is the contour stencils method of Getreuer. This method performs smoothing and completion of values along a graph defined on the sensor positions. While the method works well it is manually defined for the Bayer pattern and may not be easily generalized to arbitrary color filter arrays.

Outlook

Demosaicing for the Bayer layout is largely solved, but for novel color filter array layouts there currently is no all-around best method. While our machine learning approach is feasible and leads to high quality demosaicing results, the current loss functions used (such as peak signal to noise ratio (PSNR) and structural similarity (SSIM)) are not sufficiently aligned with human perception to accurately measure image quality, in particular for zippering artifacts along edge structures. Whatever demosaicing method is adopted, it is beneficial to simultaneously perform demosaicing and denoising, because either task becomes more difficult if performed in isolation.

Becoming a Bayesian, Part 3

2015-05-15T21:00:00+01:00

This post continues the previous post, part 1 and part 2, outlining my criticism towards a ''naive'' subjective Bayesian viewpoint:

The consequences of model misspecification.
The ''model first computation last'' approach.
Denial of methods of classical statistics, in this post.

Denial of the Value of Classical Statistics

Suppose for the sake of a simple example that our task is to estimate the unknown mean $\mu$ of an unknown probability distribution $P$ with bounded support over the real line. To this end we receive a sequence of $n$ iid samples $X_1$, $X_2$, $\dots$, $X_n$.

Now suppose that after receiving these $n$ samples I do not use the obvious sample mean estimator but I take only the first sample $X_1$ and estimate $\hat{\mu} = X_1$. Is this a good estimator? Intuition tells us that it is not, because it ignores part of the useful input data, namely $X_i$ for any $i > 1$, but how can we formally analyze this?

From a subjective Bayesian viewpoint the likelihood principle does not permit us to ignore evidence which is already available. If we posit a model $P(X_i|\theta)$ and a prior $P(\theta)$ we have to work with the posterior

$$P(\theta | X_1,\dots,X_n) \propto \prod_{i=1,\dots,n} P(X_i|\theta) P(\theta).$$

Therefore our estimator $\hat{\mu}=X_1$ cannot correspond to a Bayesian posterior mean of any non-trivial model for all parameters. This is of course a very strict viewpoint and one may object that we can talk about properties of the sequence of posteriors $P(\theta | X_1)$, $P(\theta | X_1, X_2)$, etc. But even in this generous view, after observing all samples we are not permitted to ignore part of them. (If you are still not convinced, consider the estimator defined by $\hat{\mu} = X_2$ if $X_1 > 0$, and $\hat{\mu} = X_3$ otherwise.) So Bayesian statistics does not offer us a method to analyze our proposed estimator.

A classical statistician can analyze pretty much arbitrary procedures, including ones of the silly type $\hat{\mu}$ that we proposed. The analysis may be technically difficult or apply only in the asymptotic regime but does not rule out any estimator a priori. Typical results may take the form of a derivation of the variance or bias of the estimator. In our case we have an unbiased estimate of the mean, $\mathbb{E}[\hat{\mu}]-\mu = 0$. As for the variance, because we only take the first sample, even as $n \to \infty$ the variance $\mathbb{V}[\hat{\mu}]$ remains constant, so the estimator is inconsistent, a clear indication that our $\hat{\mu}$ is a bad estimator.
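This frequentist analysis can also be checked by simulation. The sketch below assumes, purely for illustration, that $P$ is a uniform distribution of width one around $\mu = 0.3$ (so the variance of a single sample is $1/12$); the function name and constants are hypothetical.

```python
import numpy as np

def simulate(n, rng, mu=0.3, trials=2000):
    """Compare mu_hat = X_1 with the sample mean over repeated experiments.

    Each trial draws n iid samples from Uniform(mu - 0.5, mu + 0.5), an
    assumed bounded-support distribution with mean mu and variance 1/12.
    Returns (mean, variance) of each estimator across trials.
    """
    samples = rng.uniform(mu - 0.5, mu + 0.5, size=(trials, n))
    first = samples[:, 0]          # the "silly" estimator X_1
    mean = samples.mean(axis=1)    # the usual sample mean
    return (first.mean(), first.var()), (mean.mean(), mean.var())

rng = np.random.default_rng(0)
for n in (10, 100, 1000):
    (m1, v1), (m2, v2) = simulate(n, rng)
    print(f"n={n:5d}  X_1: mean={m1:.3f} var={v1:.4f}   "
          f"sample mean: mean={m2:.3f} var={v2:.6f}")
```

Both estimators are centered on $\mu$, but the variance of $X_1$ stays near $1/12$ for every $n$ while the variance of the sample mean shrinks like $1/(12n)$.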

Another typical result is in the form of a confidence interval of a parameter of interest. One can argue that confidence intervals are not exactly answering the question of interest (that is, whether the parameter really is in the given interval), but if they are of interest, one can sometimes obtain them also from a Bayesian analysis.

There exist cases where existing statistical procedures can be reinterpreted from a Bayesian viewpoint. This is achieved by proposing a model and prior such that inferences under this model and prior exactly or approximately match the answers of the existing procedure or at least have satisfying frequentist properties. Two cases of this are the following:

Matching priors, where in some cases it is possible to establish an exact equivalence for simple parametric models without latent variables. One recent example for even a non-parametric model is the Good-Turing estimator for the missing mass, where an asymptotic equivalence between the classic Good-Turing estimator and a Bayesian non-parametric model is established.
Reference priors, a generalization of the Jeffreys prior, in which the prior is constructed to be least informative. Here least informative is in the sense that when you sample from the prior and consider the resulting posterior using the sample, the divergence to the original prior should be large in expectation; that is, samples from the prior should be able to change your beliefs to the maximum possible extent. When it is possible to derive reference priors, these typically have excellent frequentist robustness properties, and are useful default prior choices. Unfortunately, in models with multiple parameters there is no unique reference prior, and generally the set of known reference priors seems to be quite small. This problematic case-by-case state is nicely summarized in this recent work on overall objective priors.

Should we care at all about these classic notions of the quality of an estimator? I have seen Bayesians dismiss properties such as unbiasedness and consistency as unimportant, but I cannot understand this stance. For example, an unbiased estimator operating on iid sampled data immediately implies a scalable parallel estimator applicable to the big data setting, simply by separately estimating the quantity of interest, then taking the average of estimates. This is a practical and useful consequence of the unbiasedness property. Similarly, consistency is at least a guarantee that when more data is available the qualities of your inferences are improving, and this should be of interest to anyone whose goal it is to build systems which can learn. (There do exist some results on Bayesian posterior consistency, for a summary see Chapter 20 of DasGupta's book.)
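The parallelization argument can be sketched as follows, using variance estimation as a hypothetical example: averaging the unbiased estimator (divisor $m-1$) over many small equally sized chunks remains unbiased, whereas averaging the biased maximum likelihood estimator (divisor $m$) keeps the per-chunk bias no matter how many chunks are used. The chunk size of 5 and the helper name are illustrative choices.

```python
import numpy as np

def chunked_estimate(data, num_chunks, estimator):
    """Apply an estimator to each chunk separately and average the results.

    Each chunk could be processed on a different machine; only the per-chunk
    estimates need to be communicated. Chunks are assumed to be equally sized.
    """
    chunks = np.array_split(data, num_chunks)
    return float(np.mean([estimator(chunk) for chunk in chunks]))

def unbiased_var(x):
    return np.var(x, ddof=1)   # divisor m - 1

def biased_var(x):
    return np.var(x, ddof=0)   # divisor m (maximum likelihood)

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=100_000)   # true variance is 1
num_chunks = 20_000                         # chunks of m = 5 samples each

print(chunked_estimate(data, num_chunks, unbiased_var))  # close to 1.0
print(chunked_estimate(data, num_chunks, biased_var))    # close to (m-1)/m = 0.8
```

With the unbiased estimator the errors of the chunks average out; with the biased one the systematic error of each chunk survives the averaging.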

Let me summarize. Bayesian estimators are often superior to alternatives. But the set of procedures yielding Bayesian estimates is strictly smaller than the set of all statistical procedures. We need methods to analyze the larger set, in particular to characterize the subset of useful estimators, where useful is application dependent.

Acknowledgements. I thank Jeremy Jancsary, Peter Gehler, Christoph Lampert, and Cheng Soon Ong for feedback.

Becoming a Bayesian, Part 2

2015-05-02T18:30:00+01:00

This post continues the previous post, part 1, outlining my criticism towards a ''naive'' subjective Bayesian viewpoint:

The consequences of model misspecification.
The ''model first computation last'' approach, in this post.
Denial of methods of classical statistics.

The ''Model First Computation Last'' approach

Without a model (not necessarily probabilistic) we cannot learn anything. This is true for science, but it is also true for any machine learning system. The model may be very general and make only a few general assumptions (e.g. ''the physical laws remain constant over time and space''), or it may be highly specific (e.g. ''$X \sim \mathcal{N}(\mu,1)$''), but we need a model in order to relate observations to quantities of interest.

But in contrast to science, when we build machine learning systems we are also engineers. We build models not in isolation or on a piece of whiteboard, but instead we build them to run on our current technology.

Many Bayesians adhere to a strict separation of model and inference procedure; that is, the model is independent of any inference procedure. They argue convincingly that the goal of inference is to approximate the posterior under the assumed model, and that for each model there exist a large variety of possible approximate inference methods that can be applied, such as Markov chain Monte Carlo (MCMC), importance sampling, mean field, belief propagation, etc. By selecting a suitable inference procedure, different accuracy and runtime trade-offs can be realized. In this viewpoint, the model comes first and computation comes last, once the model is in place.

In practice this beautiful story does not play out very often. What is more common is that instead of spending time building and refining a model, time is spent on tuning the parameters of inference procedures, such as:

MCMC: Markov kernel, diagnostics, burn-in, possible extensions (annealing, parallel tempering ladder, HMC parameters, etc.);
Importance sampling: selecting the proposal distribution, effective sample size, possible extensions (e.g. multiple importance sampling);
Mean field and belief propagation: message initialization, schedule, damping factor, convergence criterion.
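As a small illustration of how much such choices matter, here is a sketch of self-normalized importance sampling with the usual effective sample size diagnostic $(\sum_i w_i)^2 / \sum_i w_i^2$. The target $\mathcal{N}(0,1)$, the zero-mean Gaussian proposals, and the test function $f(x) = x^2$ are assumptions chosen for simplicity.

```python
import numpy as np

def log_normal_pdf(x, mean, std):
    return -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((x - mean) / std) ** 2

def importance_estimate(f, proposal_std, num_samples, rng):
    """Estimate E[f(X)] for X ~ N(0, 1) using samples from N(0, proposal_std^2).

    Returns the self-normalized estimate and the effective sample size.
    """
    x = rng.normal(0.0, proposal_std, size=num_samples)
    log_w = log_normal_pdf(x, 0.0, 1.0) - log_normal_pdf(x, 0.0, proposal_std)
    w = np.exp(log_w - log_w.max())       # rescale for numerical stability
    estimate = np.sum(w * f(x)) / np.sum(w)
    ess = np.sum(w) ** 2 / np.sum(w ** 2)
    return estimate, ess

rng = np.random.default_rng(0)
for std in (1.0, 2.0, 0.3):
    est, ess = importance_estimate(lambda x: x ** 2, std, 5000, rng)
    print(f"proposal std={std}: estimate={est:.3f} (truth 1.0), ESS={ess:.0f}")
```

The model is the same in all three runs; only the proposal changes, and with it the quality of the answer. A proposal that is narrower than the target leaves a few samples with very large weights and a correspondingly small effective sample size.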

In fact, it seems to me that many works describing novel models ultimately also describe inference procedures that are required to make their models work. I say this not to diminish the tremendous progress we as a community have made in probabilistic inference; it is just an observation that the separation of model and inference is not plug-and-play in practice. (Other pragmatic reasons for deviating from the subjective Bayesian viewpoint are provided in a paper by Goldstein.)

Suppose we have a probabilistic model and we are provided an approximate inference procedure for it. Let us draw a big box around these two components and call this the effective model, that is, the system that takes observations and produces some probabilistic output. How similar is this effective model to the model on our whiteboard? I know of only very few results, for example Ruozzi's analysis of the Bethe approximation.

Another practical example along these lines, given to me by Andrew Wilson, is to compare an analytically tractable model such as a Gaussian process against a richer but intractable model such as a Gaussian process with Student-t noise. The latter model is certainly more capable formally but requires approximate inference. In this case, the approximate inference implicitly changes the model and it is not clear at all whether it is worth giving up analytic tractability.

Resource-Constrained Reasoning

It seems that when compared to machine learning, the field of artificial intelligence is somewhat ahead; in 1987 Eric Horvitz had a nice paper at UAI on reasoning and decision making under limited resources. When read liberally the problem of adhering to the orthodox (normative) view he described in 1987 seems to mirror the current issues faced by large scale probabilistic models used in machine learning, namely that exact analysis in any but the simplest models is intractable and resource constraints are not made explicit in the model or the inference procedures.

But some recent work is giving me new hope that we will treat computation as a first-class citizen when building our models. Here is some of that work from the computer vision and natural language processing communities:

Adrian Barbu's active random fields from 2009, where he explicitly considers the effects of using suboptimal inference procedures in graphical models.
Stoyanov, Ropson, and Eisner's work on predicting with approximate inference procedures at AISTATS 2011; although this is an empirical risk minimization approach.
Justin Domke's work on unrolling approximate inference procedures and training the resulting models end-to-end using backpropagation.

Cheng Soon Ong pointed me to work on anytime probabilistic inference, which I am not familiar with, but the goal of having inference algorithms which adapt to the available resources is certainly desirable. The anytime setting is practically relevant in many applications, particularly in real-time systems.

All these works share the characteristic that they take a probabilistic model and approximate inference procedure and construct a new "effective model" by entangling the model and inference. By doing so the resulting model is tractable by construction and retains to a large extent the specification of the original intractable model. However, the separation between model and inference procedure is lost.

This is the first step towards a computation first approach, and I believe we will see more machine learning works which recognize available computational primitives and resources as equally important to the model specification itself.

Acknowledgements. I thank Jeremy Jancsary, Peter Gehler, Christoph Lampert, Andrew Wilson, and Cheng Soon Ong for feedback.

Becoming a Bayesian, Part 1

2015-04-19T17:30:00+01:00

I have used probabilistic models for a number of years now and over this time I have used different paradigms to build my models, to estimate them from data, and to perform inference and predictions.

Overall I have slowly become a Bayesian; however, it has been a rough journey. When I say that "I became a Bayesian" I mean that my default view on problems now is to think about a probabilistic model that relates observables to quantities of interest and of suitable prior distributions for any unknowns that are present in this model. When it comes to solving the practical problem using a computer program however, I am ready to depart from the model on my whiteboard whenever the advantages to do so are large enough, for example in simplicity, runtime speed, tractability, etc. Some recent work to that end:

Our work on informed sampling for generative computer vision models with Varun, Matthew and Peter, where we argue for a generative and Bayesian approach to computer vision problems;
Our Bayesian NMR work (and here) with Andrew Wilson and collaborators from the University of Cambridge chemistry department, where we have taken a fully Bayesian viewpoint, with great success over conventional NMR Fourier analysis;
Our work on using GPs for structured prediction with Sebastien, Novi, and Zoubin, which was motivated by the struggle to scale up a conceptually satisfying model.
My work on maximum expected utility in some structured prediction models at CVPR 2014, which was motivated by applying basic decision theory, but ended up trying to cope with resulting intractabilities.

However, I have remained skeptical of the naive and unconditional adoption of the subjective Bayesian viewpoint. In particular, I object to the viewpoint that every model and every system ought to be Bayesian, or to the view that at the very least, if a statistical system is useful, it should have an approximate Bayesian interpretation. In this post and the following two posts I will try to explain my skepticism.

There is a risk of barking up the wrong tree by attacking a caricature of a Bayesian here, which is not my intention. In fact, to be frank, every one of the researchers I have interacted with in the past few years holds a nuanced view of their principles and methods and more often than not is aware of their principles' limitations and willing to adjust if circumstances require it.

Let me summarize the subjective Bayesian viewpoint. In my experience this view of the world is arguably the most prevalent among Bayesians in the machine learning community, for example at NIPS and at machine learning summer schools.

The Subjective Bayesian Viewpoint

The subjective Bayesian viewpoint on any system under study is as follows:

Specify a probabilistic model relating what is known to what is unknown;
Specify a proper prior probability distribution over unknowns based on any information that is available to you;
Obtain the posterior distribution over unknowns given the known data (using Bayes rule);
Draw conclusions based on the posterior distribution; for example, solve a decision problem or select a model among the alternative models.
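The four steps can be walked through on the simplest possible example. The sketch below assumes a Bernoulli model for coin flips with a conjugate Beta prior, and a 0-1 loss for the final decision; all names and numbers are illustrative.

```python
from fractions import Fraction

def posterior_params(a, b, flips):
    """Step 3: Bayes rule for a Bernoulli model with a Beta(a, b) prior.

    Conjugacy gives the posterior Beta(a + #heads, b + #tails) in closed form.
    """
    heads = sum(flips)
    tails = len(flips) - heads
    return a + heads, b + tails

def predictive_heads(a, b):
    """Probability that the next flip is heads under Beta(a, b) beliefs."""
    return Fraction(a, a + b)

# Step 1: model. Flips x_i ~ Bernoulli(theta) with unknown theta.
flips = [1, 1, 0, 1]  # 1 = heads, 0 = tails

# Step 2: proper prior. theta ~ Beta(2, 2), mildly favouring a fair coin.
prior_a, prior_b = 2, 2

# Step 3: posterior over theta given the observed flips.
post_a, post_b = posterior_params(prior_a, prior_b, flips)

# Step 4: decision. Under 0-1 loss, predict the more probable outcome.
p_heads = predictive_heads(post_a, post_b)
decision = "heads" if p_heads > Fraction(1, 2) else "tails"
print(post_a, post_b, p_heads, decision)
```

Here every step is available in closed form; the difficulties discussed in these posts arise because for realistic models neither step 1 nor step 3 is this clean.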

This approach is used exclusively for any statistical problem that may arise. This approach is strongly advocated, for example by Lindley and in a paper by Michael Goldstein.

Alternative Bayesian views deviate from this recipe. For example, they may allow for improper prior distributions or instead aim to select uninformative prior distributions, or even select the prior as a function of the inferential question at hand.

Criticism

My main criticisms of a ''naive'' subjective Bayesian viewpoint relate to the following three points:

The consequences of model misspecification.
The ''model first computation last'' approach.
Denial of methods of classical statistics.

The Consequences of Model Misspecification

To model some system in the world we often use probabilistic models of the form

$$p(x;\theta),\qquad \theta \in \Theta,$$

where $x \in \mathcal{X}$ is a random variable of interest and $\Theta$ is the set of possible parameters $\theta$. We are interested in $p(x)$ and thus would like to find a suitable parameter given some observed data $x_1, x_2, \dots, x_n \in \mathcal{X}$. Because we can never be entirely certain about our parameters we may represent our current beliefs through a posterior distribution $p(\theta|x_1,\dots,x_n)$.

Misspecification is the case when no parameter in $\Theta$ leads to a distribution $p(x;\theta)$ that behaves like the true distribution. This is not exceptional; in fact, most models of real world systems are misspecified. It also is not a property of any inferential approach but rather a fundamental limitation of building expressive models given our limited knowledge. If we could observe all relevant quantities and know their deterministic relationships we would not need a probabilistic model. Hence the need for a probabilistic model arises because we cannot observe everything and we do not know all the dependencies that exist in the real world. (Alas, as Andrew Wilson pointed out to me, the previous two sentences expose my deterministic world view.) So what can be said about this common case of misspecified models?

Let us talk about calibration of probabilities, and what happens in case your model is wrong. Informally, you are well-calibrated if you neither overestimate nor underestimate the probability of certain events. Crucially, this does not imply a degree of certainty, only that your uncertain statements (forecasted probabilities of events) are on average correct.
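One common way to check this empirically is a reliability table: group the forecasts into bins and compare the average forecast in each bin with the observed frequency of the event. The sketch below is a minimal version with equal-width bins; the function name and the synthetic forecasters are assumptions for illustration.

```python
import numpy as np

def reliability_table(forecasts, outcomes, num_bins=10):
    """Compare forecasted probabilities with observed event frequencies.

    Returns one (mean forecast, observed frequency, count) row per non-empty
    bin. For a well-calibrated forecaster the first two entries agree.
    """
    forecasts = np.asarray(forecasts, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    # Equal-width bins on [0, 1]; a forecast of exactly 1.0 goes to the last bin.
    bins = np.minimum((forecasts * num_bins).astype(int), num_bins - 1)
    rows = []
    for b in range(num_bins):
        mask = bins == b
        if mask.any():
            rows.append((forecasts[mask].mean(), outcomes[mask].mean(), int(mask.sum())))
    return rows

rng = np.random.default_rng(0)
p_true = rng.uniform(size=50_000)
events = rng.uniform(size=50_000) < p_true

# A calibrated forecaster reports p_true; an overconfident one exaggerates it.
overconfident = np.clip(2.0 * p_true - 0.5, 0.0, 1.0)
for name, forecast in (("calibrated", p_true), ("overconfident", overconfident)):
    print(name)
    for mean_forecast, frequency, count in reliability_table(forecast, events):
        print(f"  forecast={mean_forecast:.2f}  observed={frequency:.2f}  n={count}")
```

Note that the calibrated forecaster is allowed to say 0.5 as often as it likes; calibration only requires that events forecast with probability 0.5 happen about half of the time.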

For any probabilistic model, being well-calibrated is a desirable goal. There are various methods to assess calibration and to check the forecasts of your model. In 1982 Dawid, in a seminal paper, established a general theorem whose consequence (in Section 4.3 of that paper) is to guarantee that a Bayesian using a parametric model will eventually be well-calibrated.

This is reassuring, except there is one catch: it does not apply in the case when the model is misspecified. Unfortunately, in most practical applications of probabilistic modelling, misspecification is the rule rather than the exception (''All models are wrong''). We could hope for a ''graceful degradation'', in that we are still at least approximately calibrated. But this is not the case.

Calibration and Misspecification

In the misspecified case, there are simple examples due to Brad Delong and Cosma Shalizi where beliefs in a parametric model do not converge and become less-calibrated over time. In their example two contradicting things happen at the same time: the beliefs become very confident, yet a single new observation revises the belief to the other extreme, again confident.
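To get a feeling for how this can happen, here is a hypothetical toy construction of my own; it is not claimed to be the example referenced above and it does not reproduce the single-observation flips. The data come from a fair coin, but the model only contains two hypotheses, a coin with heads probability 0.75 and one with 0.25. The posterior log-odds then perform a symmetric random walk with steps $\pm\log 3$, so they never settle down, yet their typical magnitude grows like $\sqrt{n}$ and the posterior is almost always very confident.

```python
import numpy as np

def posterior_path(num_flips, rng, p_a=0.75, p_b=0.25):
    """Posterior probability of hypothesis A after each flip of a fair coin.

    Hypothesis A: P(heads) = p_a. Hypothesis B: P(heads) = p_b. Prior odds 1:1.
    The true coin is fair, so both hypotheses are wrong (misspecification).
    """
    heads = rng.random(num_flips) < 0.5
    # Log-likelihood ratio contributed by each observation.
    step = np.where(heads, np.log(p_a / p_b), np.log((1 - p_a) / (1 - p_b)))
    log_odds = np.cumsum(step)
    # Numerically stable logistic function: sigmoid(x) = (1 + tanh(x / 2)) / 2.
    return 0.5 * (1.0 + np.tanh(log_odds / 2.0))

rng = np.random.default_rng(0)
post = posterior_path(10_000, rng)
undecided = np.mean((post > 0.1) & (post < 0.9))
switches = np.sum(np.diff(np.sign(post - 0.5)) != 0)
print(f"fraction of time with posterior in (0.1, 0.9): {undecided:.3f}")
print(f"number of times the favoured hypothesis changes: {switches}")
```

In this construction the posterior spends only a small fraction of the time away from the extremes, and which extreme it sits at is decided by the random walk rather than by anything true about the coin.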

Improving the model?

One can object that in these examples, and more generally, one should revise the model to more accurately reflect the system under study. But then, in order not to end up in an infinite loop of trying to improve a model, how to determine when to stop? Actually, how to even determine the accuracy of the model? Model evidence cannot be used to this end, as it is conditioned on the set of possible models being used. (In fact, in Delong's example the evidence would assure us that everything is fine.) The answers to how models can be criticised and improved are not simple, and quite likely not Bayesian.

Andrew Gelman and Cosma Shalizi discuss this issue and others in a position paper, and I find myself agreeing with their assessment that there is no answer to wrong model assumptions within the (strictly) subjective Bayesian viewpoint:

"We fear that a philosophy of Bayesian statistics as subjective, inductive inference can encourage a complacency about picking or averaging over existing models rather than trying to falsify and go further. Likelihood and Bayesian inference are powerful, and with great power comes great responsibility. Complex models can and should be checked and falsified."

Non-parametric Models to the Rescue?

Another objection is that this is all well-known and hence we should use non-parametric models which endow us with prior support over essentially all reasonable alternatives.

Unfortunately, while the resulting models are richer and are practically useful in real applications, we now may have other problems: even when there is prior support for the true model simple properties like consistency (which were guaranteed to hold in the parametric case) can no longer be taken for granted. The current literature and basic results on this topic are nicely summarized in Section 20.12 of DasGupta's book.

Conclusion

Misspecification is not a specifically Bayesian problem; it applies equally to other estimation approaches (for the case of maximum likelihood estimation, see the book by White). However, a subjective Bayesian has no Bayesian means to test for the presence of misspecification and that makes it hard to deal with the consequences.

There are some ideas for applying Bayesian inference in a misspecification-aware manner, for example the Safe Bayesian approach, and an interesting analysis of approximate Bayesian inference using the Bootstrap in a relatively unknown paper of Fushiki.

Are these alternatives practical and do they somehow overcome the misspecification problem? To be frank, I am not aware of any satisfactory solution and common practice seems to be a careful model criticism using tools such as predictive model checking and graphical inspection. But these require first acknowledging the problem.

When the model is wrong, it would ideally be reassuring to have

a reliable diagnostic and quantification of how wrong it is (say, an estimate of $D(q\|p^*)$ where $q$ is the true distribution), and
a test for whether the type of model error present will matter for making certain predictions (say, an error bound on the deviation of certain expectations, $\mathbb{E}_q[f(x)] - \mathbb{E}_{p^*}[f(x)]$ for a given function $f$).

To me it appears the (pure) subjective Bayesian paradigm cannot provide the above.

Addendum

Andrew Wilson pointed out to me that in most cases of statistical problems we cannot know the true distribution, even in principle. I agree, and indeed if we pursue such an elusive ideal then this may divert our attention away from the practical issue of building a model good enough for the task at hand. I entirely agree with taking such a pragmatic stance and this follows Francis Bacon's ideal of assessing the worth of a model (scientific theory in his case) not by an abstract ideal of truthfulness, but instead by its utility.

In machine learning and most industrial applications building the model is easy because we merely focus on predictive performance which can be reliably assessed using holdout data. For scientific discovery however, things are more subtle in that our goal is in establishing the truth of certain statements with sufficient confidence; but this truth is only a conditional truth, conditioned on assumptions we have to make.

A Bayesian makes all assumptions explicit and then proceeds by formally treating them as truth, correctly inferring the consequences. A classical/frequentist approach also makes assumptions by positing a model, but then may be able to make statements that hold uniformly over all possibilities encoded in the model. Therefore, in my mind the Bayesian is an optimist, believing entirely in their assumptions, whereas the classical approach is more pessimistic, believing in their model but then providing worst-case results over all possibilities. Misspecification affects both approaches.

If you want to continue reading, the second part of this post is now available.

Acknowledgements. I thank Jeremy Jancsary, Peter Gehler, Christoph Lampert, and Andrew Wilson for feedback.

Extended Formulations

2015-04-05T16:30:00+01:00

An amazing fact in high dimensions is this: Projecting a simple convex set described by a small number of inequalities can create a complicated convex set with an exponential number of inequalities.

It is amazing because it contradicts our everyday human experience. We are most familiar with projections of objects in three dimensions down to two dimensions, namely when objects cast shadows, like this:

(Image courtesy of Cloud Nines Designs.)

In three dimensions any polyhedral object, when projected onto a plane, becomes simpler, i.e. the number of facets stays the same or becomes smaller. Think of a three dimensional cube that casts a shadow. The cube has six facets but its shadow has four or six, depending on the position of the light and plane. [Edit and correction, July 2015: Thanks to reader Paul (comment below), I have been made aware that it is not true that the number of facets cannot increase when projecting from three dimensions onto the plane. A great example is provided by Sebastian Pokutta, where a convex 3D polytope with six facets projects onto the 2D plane as an octagon with eight facets. Thanks Paul!]

Now, how can I convince you that a convex set can become more complex when projected? Here is an impressive example.

Ben-Tal/Nemirovski Polyhedron

The following example is from (Ben-Tal, Nemirovski, 2001), (PDF). In this paper the authors are motivated by approximating certain second order cones using extended polyhedral formulations, in order to be able to perform robust optimization using linear programming. As a special case of their results I select the problem of approximating a unit disk in the 2D plane. (The following is a specialization of equation (8) in the paper.)

First, let us fix some notation. Let $x=(x_1,x_2,\dots,x_n,\alpha_1,\dots,\alpha_m) \in \mathbb{R}^{n+m}$ be a vector, where $x_1$ to $x_n$ represent the basic dimensions and $\alpha_1$ to $\alpha_m$ represent the extended dimensions. For any set $\mathcal{E} \subseteq \mathbb{R}^{n+m}$ we define the projection as

$$\textrm{proj}_x(\mathcal{E}) = \{ (x_1,\dots,x_n) \:|\: \exists (\alpha_1,\dots,\alpha_m): (x_1,\dots,x_n,\alpha_1,\dots,\alpha_m) \in \mathcal{E} \}.$$

This corresponds to the familiar notion of a projection.

For the 2D unit disk the following is an extended polyhedral formulation, parametrized by an integer accuracy parameter $k \geq 2$. The formulation has the basic dimensions $x_1$ and $x_2$, and the extended dimensions $\mathbf{\alpha}=(\xi_j,\eta_j)_{j=0,\dots,k}$. Defining the constants $c_j = \cos(\pi / 2^{j})$, $s_j = \sin(\pi / 2^j)$, and $t_j = \tan(\pi / 2^j)$ the polyhedral set $\mathcal{E}_k$ is given by the following intersection of linear inequality and equality constraints.

\begin{eqnarray} \xi_0 - x_1 & \geq & 0,\nonumber\\ \xi_0 + x_1 & \geq & 0,\nonumber\\ \eta_0 - x_2 & \geq & 0,\nonumber\\ \eta_0 + x_2 & \geq & 0,\nonumber\\ \xi_j - c_{j+1} \xi_{j-1} - s_{j+1} \eta_{j-1} & = & 0, \qquad\textrm{for $j=1,\dots,k$,}\nonumber\\ \eta_j + s_{j+1} \xi_{j-1} - c_{j+1} \eta_{j-1} & \geq & 0, \qquad\textrm{for $j=1,\dots,k$,}\nonumber\\ \eta_j - s_{j+1} \xi_{j-1} + c_{j+1} \eta_{j-1} & \geq & 0, \qquad\textrm{for $j=1,\dots,k$,}\nonumber\\ \xi_k & \leq & 1,\nonumber\\ \eta_k - t_{k+1} \xi_k & \leq & 0.\nonumber \end{eqnarray}

Note that the set $\mathcal{E}_k$ can be described by $6+3k$ sparse linear constraints. The intersection of these convex constraint sets is of course again a convex set. Thus, the description of the set takes $O(k)$ space, where $k$ is the approximation parameter.

If we write $\mathcal{D}_k := \textrm{proj}_{x_1,x_2} \mathcal{E}_k$ for the projection onto the first two dimensions, the following figure illustrates just how remarkably accurate the formulation is as we increase $k$.

How accurate is it? Ben-Tal and Nemirovski say that a set $\mathcal{D}$ is an $\epsilon$-approximation to a set $\mathcal{L}$ if $\mathcal{L} \subseteq \mathcal{D}$ and if for all $x \in \mathcal{D}$ it holds that $(\frac{1}{1+\epsilon} x) \in \mathcal{L}$. They then show that the above formulation is an $\epsilon_k$-approximation, where

$$\epsilon_k = \frac{1}{\cos(\pi / 2^{k+1})} - 1 = O(\frac{1}{4^k}).$$

That is, despite having a compact description in $O(k)$ space the accuracy improves exponentially. In the basic dimensions the set $\mathcal{D}_k$ has exponentially many facets and cannot be described compactly through a polynomial sized collection of linear inequalities. (The paper further generalizes the above results to the family of $d$-dimensional Lorentz cones.)
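To get a feel for these numbers, here is a minimal Python sketch that tabulates the constraint count $6+3k$ and the accuracy $\epsilon_k$ from the two formulas above. The function names are hypothetical and chosen only for illustration; nothing beyond the two formulas is taken from the paper.

```python
import math

def num_constraints(k):
    """Number of linear constraints describing E_k, i.e. 6 + 3k."""
    return 6 + 3 * k

def epsilon(k):
    """Approximation accuracy eps_k = 1 / cos(pi / 2^(k+1)) - 1."""
    return 1.0 / math.cos(math.pi / 2 ** (k + 1)) - 1.0

for k in range(2, 13, 2):
    print(f"k={k:2d}  constraints={num_constraints(k):3d}  eps_k={epsilon(k):.3e}")
```

Each increment of $k$ adds three constraints while shrinking $\epsilon_k$ by roughly a factor of four, which is the $O(1/4^k)$ rate stated above.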

Is there a Recurring Pattern?

The abstract idea behind obtaining complicated structures in one space by means of something like an extended formulation can be found in other domains; for example, in probabilistic graphical models.

Suppose we would like to specify a potentially complicated probability distribution $P(X)$. Akin to an extended formulation we may proceed as follows. We define an extended set of random variables $\alpha$ and a distribution $P(\alpha)$. We then couple both spaces by means of a conditional specification, $P(X|\alpha) P(\alpha)$. We then project, that is, marginalize out, the extended dimensions $\alpha$ to obtain

$$P(X) = \int P(X|\alpha) P(\alpha) \,\textrm{d}\alpha.$$

In practice this construction is often used in the form of a hierarchical graphical model, for example when using a scale mixture of Normals to define a Student's t distribution.

The increase in flexibility of the resulting marginal distribution can be as impressive as for the above polyhedral sets: for example, if $P(X|\alpha)$ is a Normal distribution and $P(\alpha)$ is a distribution over Normal parameters, then the infinite Normal mixture can essentially represent any absolutely continuous distribution.

Another observation, which may be just a coincidence, but maybe there is more to it: the extended formulation construction in both cases suggests a practical implementation. In the polyhedral set this was through linear programming in the extended space, for the graphical model it would be ancestral sampling or MCMC inference.
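The Student's t example can be sketched with ancestral sampling in a few lines of Python. This is only an illustration under stated assumptions: I use the standard scale-mixture representation (a Gamma-distributed precision $\tau$ as the extended variable, then $X|\tau$ Normal), and the function name `sample_student_t` is hypothetical. Discarding $\tau$ after sampling is the "projection" step.

```python
import random

def sample_student_t(nu, rng):
    """Ancestral sampling: draw the extended variable, then X given it."""
    # Precision tau ~ Gamma(shape=nu/2, rate=nu/2); gammavariate expects a scale.
    tau = rng.gammavariate(nu / 2.0, 2.0 / nu)
    # X | tau ~ Normal(0, 1/tau); dropping tau marginalizes it out.
    return rng.gauss(0.0, 1.0 / tau ** 0.5)

rng = random.Random(0)
nu = 10.0
samples = [sample_student_t(nu, rng) for _ in range(200_000)]

mean = sum(samples) / len(samples)
variance = sum((x - mean) ** 2 for x in samples) / (len(samples) - 1)
print("sample variance:           ", variance)
print("Student's t variance nu/(nu-2):", nu / (nu - 2.0))
```

The sample variance should land close to $\nu/(\nu-2)$, which is larger than the variance of any single Normal component with unit precision.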

This leaves me with the following questions:

Are there more examples of similar constructions (extension, coupling, projection)?
What is the shared mathematical structure behind this similarity (e.g. permitting a projection operation that leads to complexity in the basic dimensions that no longer admits a compact description in this space)?

Feedback very much welcome :-)

Conclusion

I first learned of extended formulations from this book of Pochet and Wolsey, who pioneered the technique for practical scheduling optimization problems. (Yes, I had enough time for tinkering during my PhD to take such creative diversions.) A recent summary of extended formulations for combinatorial optimization problems is Conforti, Cornuejols, Zambelli, 2012.

Many so called higher-order interactions in computer vision random field models are representable as extended formulations, a point I elaborated on in a talk I gave at the Inference in Graphical Models with Structured Potentials workshop at the CVPR 2011 conference. Another relevant work is Miller and Wolsey, 2003.

How to report uncertainty

2015-03-19T22:30:00+00:00

Error bars and the $\pm$-notation are used to quantitatively convey uncertainty in experimental results. For example, you would often read statements like $140.7 \textrm{Hz} \pm 2.8 \textrm{ Hz SEM}$ in a paper to report both an experimental average and its uncertainty.

Unfortunately, in many fields (such as computer vision, and, to a lesser extent, machine learning) researchers often do not report uncertainty or if they do, they may do it wrong.

Of course, dear reader, I am sure you always do report it properly, so the following remarks may only serve as a reminder to your common practice.

First, when reporting a quantitative measurement of uncertainty, it is important to establish the goal of doing so. The two popular goals are as follows.

1. Convey Variability

Here the focus is on the variability itself. For example, take a look at this table of food intake of US teenagers. The variability among the participants of the study is reported through the standard deviation, the square root of the variance.

The reason why the standard deviation (SD) is preferred over the variance is that the SD is on the same scale as the original values. That is, if the original measurements were in $\textrm{Hz}$ the standard deviation is also in the unit of $\textrm{Hz}$, whereas the variance is in the squared unit.

One easy question you can ask yourself when thinking about the results you would like to report in an experiment is this: Do you expect the error bars to shrink with more available data? If your goal is to convey variability they would not shrink but remain of a certain size, no matter how many samples are available.

The correct wording to report this type of uncertainty is something similar to

"We report the mean and one unit standard deviation."

2. Convey Uncertainty about an Unknown Parameter

Here the focus is on your remaining uncertainty about a fixed quantity which does not vary. For example, take a look at Table 1 in Ogden et al., 2004 where the average weight of US children is reported. Together with the mean weight in pounds the authors report the standard error of the mean. (Sometimes this is just called standard error.)

Here the uncertainty represents a measurement of uncertainty about the average weight. It is related to the standard deviation $\sigma$ by means of

$$\textrm{SEM} = \frac{\sigma}{\sqrt{n}},$$

where $n$ is the sample size of the experiment. For example, in Table 1 of the above paper the authors report that between 1963 and 1965 for boys of age 6 years living in the USA the average weight was $\hat{\mu}=48.4$ pounds with $\textrm{SEM}=0.3$ standard error of the mean and a sample size of $n=575$. Using the above formula this immediately gives

$$\sigma \approx \sqrt{n} \textrm{SEM} = \sqrt{575} \cdot 0.3 \approx 7.19.$$

What is the use of the standard error? Because of the central limit theorem for independent samples the standard error provides approximate confidence intervals for the unknown true mean of the population, as

$$[\hat{\mu} - 1.96 \textrm{SEM}, \hat{\mu} + 1.96 \textrm{SEM}].$$

Using the above numbers we then know that with 95% confidence over the sampling variation the true average weight $\mu \in [47.8,49.0]$. (Note that for a single experiment this does not mean we cover the true value with a certain probability, because either we cover it or we do not cover it. The 95% probability is the probability associated to a (hypothetical) repetition of the experiment.)
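The arithmetic above is easy to reproduce. The following Python sketch recomputes the standard deviation and the approximate 95% interval from the reported mean, SEM, and sample size; the helper names are hypothetical.

```python
import math

def sd_from_sem(sem, n):
    """Invert SEM = sigma / sqrt(n)."""
    return sem * math.sqrt(n)

def confidence_interval(mean, sem, z=1.96):
    """Approximate central-limit interval; z=1.96 corresponds to 95%."""
    return mean - z * sem, mean + z * sem

mean, sem, n = 48.4, 0.3, 575      # values reported in the table cited above
sigma = sd_from_sem(sem, n)
low, high = confidence_interval(mean, sem)
print(f"sigma  ~ {sigma:.2f}")
print(f"95% CI ~ [{low:.1f}, {high:.1f}]")
```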

The correct wording to report this type of uncertainty is

"We report the average of $n=123$ samples and the standard error of the mean."

How many digits to report?

When writing out numbers a natural question that arises is how many significant digits to report. Richard Clymo has some advice on how many digits to report.

Most bioscientists need to report mean values, yet many have little idea of how many digits are significant, and at what point further digits are mere random junk. Thus a recent report that the mean of 17 values was 3.863 with a standard error of the mean (SEM) of 2.162 revealed only that none of the seven authors understood the limitations of their work. The simple rule derived here by experiment for restricting a mean value to its significant digits (sig-digs) is this: the last sig-dig in the mean value is at the same decimal decade as the first sig-dig (the first non-zero) in the SEM. ... For the example above the reported values should be a mean of 4 with SEM 2.2. Routine application of these simple rules will often show that a result is not as compelling as one had hoped.

Let's compare with the numbers from before: the average weight was reported as 48.4 and the SEM as 0.3. The last significant digit in the mean is the four after the decimal point, and this is the same decimal decade as the first significant digit of the SEM. So the study did it right.

Clymo develops the following simple-to-follow rules for reporting the sample average and SEM:

Rule 1 (for determining the significant digits in the reported mean): the last significant digit in the mean is in the same decade as the first non-zero digit in the SEM.
Rule 2 (for determining significant digits in the reported SEM): depending on the sample size $n$, as per the following table:

Sample size $n$	Significant digits to report
$2 \leq n \leq 6$	1
$7 \leq n \leq 100$	2
$101 \leq n \leq 10,000$	3
$10,001 \leq n \leq 10^6$	4
$n > 10^6$	5
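The two rules are mechanical enough to put into a small helper. The Python sketch below is my reading of the rules as stated here, not code from Clymo's paper; the function names are hypothetical, the SEM is assumed to be strictly positive, and the corner case where rounding pushes the SEM into the next decade (for example 0.96 rounding to 1.0) is not handled.

```python
import math

def sem_sig_digits(n):
    """Rule 2: significant digits to report for the SEM, by sample size."""
    if n < 2:
        raise ValueError("need at least two samples")
    if n <= 6:
        return 1
    if n <= 100:
        return 2
    if n <= 10_000:
        return 3
    if n <= 10 ** 6:
        return 4
    return 5

def report(mean, sem, n):
    """Return (mean, SEM) as strings formatted according to Rules 1 and 2."""
    # Decade of the first non-zero digit of the SEM, e.g. 0 for 2.16, -1 for 0.3.
    decade = math.floor(math.log10(abs(sem)))
    # Rule 1: the last digit of the mean sits in that same decade.
    mean_str = f"{round(mean, -decade):.{max(0, -decade)}f}"
    # Rule 2: number of significant digits of the SEM depends on n.
    digits = sem_sig_digits(n)
    sem_str = f"{round(sem, digits - 1 - decade):.{max(0, digits - 1 - decade)}f}"
    return mean_str, sem_str

print(report(3.863, 2.162, 17))    # the example quoted from Clymo above
print(report(48.4, 0.3, 575))      # the weight example from the previous section
```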

Quiz

Ok, that is enough information. Let's practice.

Question 1

You sample the height of male students in a German school class (grade 6) in centimeters: 148, 148, 137, 152, 140, 149, 152, 152, 159, 155. Report your estimate of the population height (here the population is all German male students in grade 6).

Answer: $149\textrm{cm} \pm 2.1\textrm{cm}$ SEM. Explanation: we are interested in the population mean and hence would like to convey the remaining uncertainty of our estimate. The sample mean is $\hat{\mu} = 149.2\textrm{cm}$, the standard deviation is $6.579429\textrm{cm}$, and the sample size is $n=10$. This gives a $\textrm{SEM} = 6.579429/\sqrt{10} \approx 2.080598$. Applying the above rules: Rule 1 tells us that the last significant digit of the mean is in the $10^0$ decade (the decade of the first non-zero digit of the SEM), so we report $149\textrm{cm}$ as mean. Rule 2 tells us that for a sample size of $n=10$ we should report two digits in the SEM, which needs to be properly rounded to $2.1\textrm{cm}$.

Question 2

You run a company and regularly send bills to customers for payment. You measure the time in days between sending the bill and receiving the payment: 10, 7, 10, 7, 12, 10, 8, 4, 15, 3, 9, 4. Report the average and variability.

Answer: $8 \pm 3.5$ SD. Explanation: we are interested in the average time and the variability, so a standard deviation is appropriate. Rule 1 from Clymo still applies and we truncate the sample mean of $8.25$ after the first digit. Rule 2 does not apply (this is the standard deviation, not the SEM), but because we have truncated the mean it makes no sense to be more accurate than the mean except for one additional digit.
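Both quiz answers can be checked from the raw data with the Python standard library. This is a small sketch; `mean_sd_sem` is a hypothetical helper name, and the standard deviation used is the sample standard deviation with the $n-1$ denominator.

```python
import math
import statistics

heights_cm = [148, 148, 137, 152, 140, 149, 152, 152, 159, 155]   # Question 1
days = [10, 7, 10, 7, 12, 10, 8, 4, 15, 3, 9, 4]                  # Question 2

def mean_sd_sem(values):
    """Sample mean, sample standard deviation (n - 1), and SEM."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    sem = sd / math.sqrt(len(values))
    return mean, sd, sem

for name, data in (("heights", heights_cm), ("days", days)):
    mean, sd, sem = mean_sd_sem(data)
    print(f"{name}: n={len(data)}  mean={mean:.4f}  SD={sd:.4f}  SEM={sem:.4f}")
```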

Estimating Discrete Entropy, Part 3

2015-03-07T16:00:00+00:00

In the last two parts (part one, part two) we looked at the problem of entropy estimation and several popular estimators.

In this final article we will take a look at two Bayesian approaches to the problem.

Bayesian Estimator due to Wolpert and Wolf

The first Bayesian approach to entropy estimation was proposed by David Wolpert and David Wolf in 1995 in their paper "Estimating functions of probability distributions from a finite set of samples", published in Physical Review E, Vol. 52, No. 6, 1995, publisher link, and a longer tech report from 1993.

The idea is simple and elegant Bayesian reasoning: specify a model relating the known observations to the unknown quantity, then compute the posterior distribution over the entropy given the observations.

The model is the following Dirichlet-Multinomial model, assuming a given non-negative vector $\mathbb{\alpha} \in \mathbb{R}^K_+$,

$\mathbb{p} \sim \textrm{Dirichlet}(\mathbb{\alpha})$,
$x_i \sim \textrm{Categorical}(\mathbb{p})$, $i=1,\dots,n$, iid.

We define, for each bin $k \in \{1,2,\dots,K\}$, the count

$$n_k = \sum_{i=1}^n 1_{\{x_i = k\}},$$

so that $(n_1,n_2,\dots,n_K)$ is a histogram over $K$ outcomes, which is distributed according to a multinomial distribution. Then, due to conjugacy, the posterior over the unknown distribution $\mathbb{p}$ is again a Dirichlet distribution and given as

$$P(\mathbb{p} | x_1,\dots,x_n) = \textrm{Dirichlet}(\alpha_1 + n_1, \dots, \alpha_K + n_K).$$

We can now attempt to compute the squared-error optimal point estimate of the entropy under this posterior. One of the main contributions of Wolpert and Wolf is to provide a family of results that enable moment computations of the Shannon entropy under the Dirichlet distribution.

In particular, with $n = \sum_{k=1}^K n_k$ and $\alpha = \sum_{k=1}^K \alpha_k$, they provide the posterior mean of the entropy as

$$\hat{H}_{\textrm{Bayes}} = \mathbb{E}[H(\mathbb{p}) | n_1,\dots,n_K] = \psi(n + \alpha + 1) - \sum_{k=1}^K \frac{n_k+\alpha_k}{n+\alpha} \psi(n_k + \alpha_k + 1),$$

where $\psi$ is the digamma function. This expression is efficient to compute, and similarly the second moment and hence the variance of $H(p)$ under the posterior can be computed efficiently.
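As an illustration, the posterior mean can be computed in a few lines of Python. This is a sketch under assumptions: the prior is taken to be symmetric ($\alpha_k$ equal for all bins), the result is in nats, and since the standard library has no digamma function I use a hand-rolled approximation (recurrence plus asymptotic series) that is adequate for illustration only. All function names are hypothetical.

```python
import math

def digamma(x):
    """Digamma function for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:                 # shift upwards using psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return (result + math.log(x) - 0.5 / x
            - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252)))

def bayes_entropy(counts, alpha):
    """Posterior mean of the entropy under a symmetric Dirichlet(alpha) prior."""
    total = sum(counts) + alpha * len(counts)          # n + sum_k alpha_k
    return digamma(total + 1.0) - sum(
        (c + alpha) / total * digamma(c + alpha + 1.0) for c in counts)

counts = [12, 7, 1, 0]             # made-up histogram, for illustration only
for alpha in (1.0, 0.5, 1.0 / len(counts)):
    print(f"alpha_k={alpha:.2f}  H_Bayes={bayes_entropy(counts, alpha):.4f} nats")
```

A quick sanity check: with $K=2$, $\alpha_k=1$ and no data, the formula gives $\psi(3)-\psi(2)=1/2$ nats, the mean binary entropy under a uniform prior on $p$.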

The only open question is how to select the prior vector of $\mathbb{\alpha}$. In absence of further information about the distribution we can assume symmetry. Then there are four common options,

$\alpha_k = 1$, due to Bayes in 1763 and Laplace in 1812.
$\alpha_k = 1/K$, due to Perks in 1947.
$\alpha_k = 1/2$, due to Jeffreys in 1946 and 1961.
$\alpha_k = 0$, due to Haldane in 1948. This yields an improper prior.

It may not be clear which choice is the best, but I found an interesting discussion in a paper by de Campos and Benavoli. Further down in this article we will be better equipped to assess the above choices.

Independent of the choice of the prior parameter Wolpert and Wolf are very optimistic about their model and highlight the advantages that come from the Bayesian approach:

"One of the strength of Bayesian analysis is its power for dealing with such small-data cases. In particular, not only are Bayesian estimators in many respects more 'reasonable' than non-Bayesian estimators for small data, they also naturally provide error bars to govern one's use of their results. ... In addition, the Bayesian formalism automatically tells you when it is unsure of its estimate, through its error bars."

Also, on the empirical performance they comment,

"... for all N the Bayes estimator has a smaller mean-squared error than the frequency-counts estimator."

And indeed, the prior has support for every possible distribution, so consistency of the estimated entropy is guaranteed asymptotically as $n\to\infty$.

All good then?

Here is the comparison of the squared error and bias of various Bayes estimators with different choices of prior $\alpha$. The plot shows, like in the previous article, the performance when evaluated on data generated from a different Dirichlet prior. Each value on the x-axis is a different generating distribution, but the prior of the estimator remains fixed.

While all of the Bayes estimators perform better than the plugin estimator, overall they all fare quite badly: there is a low error and bias only at the matching $\alpha$ value, but they deteriorate quickly at different values of $\alpha$.

How can this be the case?

Nemenman-Shafee-Bialek

In 2002 Nemenman, Shafee, and Bialek recognized that the innocent looking Dirichlet-Multinomial model implies a very concentrated prior belief over the entropy of the distribution:

"Thus a seemingly innocent choice of the prior ... leads to a disaster: fixing $\alpha$ specifies the entropy almost uniquely. Furthermore, the situation persists even after we observe some data: until the distribution is well sampled, our estimate of the entropy is dominated by the prior!"

The Implied Beliefs over the Entropy

The following experiment visualizes this: each of the following histograms shows the implied prior over $H(\mathbb{p})$. To create each histogram, I fix $K$ and $\alpha$, take 1,000,000 samples of distributions $\mathbb{p}$, and record the entropy of each. In each histogram plot the x-axis covers exactly the full range of possible entropies.

For the case $K=2$ everything looks fine: the implied prior spreads well over the entire range of possible entropies. But look what happens for $K=10$ and $K=100$:

Here, the implied prior clearly concentrates sharply. (The least possible concentration of the entropy can be achieved using Perks' choice of $\alpha = 1/K$.) In fact, there is no choice of $\alpha$ for which the prior belief over the very quantity to be estimated does not concentrate as $K \to \infty$. If we have no reason to believe that the entropy really is in the range where the prior dictates it should be, then this is a bad prior.
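The concentration effect is easy to reproduce at small scale. The Python sketch below draws from a symmetric Dirichlet by normalizing Gamma variates and summarizes the implied entropy prior by its mean and standard deviation, relative to the maximum entropy $\log K$. It is a scaled-down stand-in for the histogram experiment (4,000 samples rather than 1,000,000, and only $\alpha=1$), and all names are hypothetical.

```python
import math
import random

def sample_entropy(K, alpha, rng):
    """Entropy (nats) of one draw p ~ Dirichlet(alpha, ..., alpha)."""
    g = [rng.gammavariate(alpha, 1.0) for _ in range(K)]
    total = sum(g)
    ps = [x / total for x in g]
    return -sum(p * math.log(p) for p in ps if p > 0.0)

def implied_prior_summary(K, alpha, num_samples, rng):
    """Mean and standard deviation of H(p) under the Dirichlet prior."""
    hs = [sample_entropy(K, alpha, rng) for _ in range(num_samples)]
    mean = sum(hs) / num_samples
    sd = math.sqrt(sum((h - mean) ** 2 for h in hs) / (num_samples - 1))
    return mean, sd

rng = random.Random(0)
results = {}
for K in (2, 10, 100):
    mean, sd = implied_prior_summary(K, 1.0, 4000, rng)
    results[K] = (mean, sd)
    print(f"K={K:3d}  mean H={mean:.3f}  SD={sd:.3f}  SD/log K={sd / math.log(K):.4f}")
```

The ratio of the standard deviation to $\log K$ shrinks as $K$ grows, which is the concentration visible in the histograms.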

How did Nemenman, Shafee, and Bialek solve this problem?

NSB estimator

They construct a mixture-of-Dirichlet prior by defining a hyperprior on $\alpha$ itself. The hyperprior $P(\alpha)$ is chosen such that

$$P(\alpha) \propto \frac{\textrm{d} \mathbb{E}[H|\alpha]}{\textrm{d} \alpha}.$$

Let us take a look at how this can be derived. Nemenman and collaborators first show that under the Dirichlet-Multinomial model the expected entropy is a strictly monotonic continuous function in $\alpha$, and therefore it is invertible. Let us define the shorthand $g^{-1}(\alpha) := \mathbb{E}[H|\alpha]$ as the function that takes $\alpha$ to the expected entropy. Now, by the transformation formula for random variables, we have the induced density

$$P_{\alpha}(\alpha) = P_H(g^{-1}(\alpha)) \cdot \left|\frac{\textrm{d} g^{-1}(\alpha)}{\textrm{d} \alpha}\right|.$$

If we assume that $P(H|\alpha)$ is highly concentrated (at least for large $K$ in the above plots, this holds), then $P_H(g^{-1}(\alpha)) \approx P(H|\alpha)$, and we want this density to be constant. Hence, we have

$$P_{\alpha}(\alpha) \propto \left|\frac{\textrm{d} g^{-1}(\alpha)}{\textrm{d} \alpha}\right|.$$

Because the right hand side is positive, with $g^{-1}(\alpha) = \mathbb{E}[H|\alpha]$ this yields exactly the original expression above. This expression has an analytic solution which, properly normalized, is

$$P(\alpha) = \frac{1}{\log K} \left(K \psi_1(K \alpha + 1) - \psi_1(\alpha+1)\right),$$

where $\psi_1$ is the trigamma function.
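As a numerical sanity check of the normalization, the Python sketch below integrates $P(\alpha)$ over $(0,\infty)$. Assumptions to note: the trigamma function is a hand-rolled approximation (recurrence plus asymptotic series) because the standard library does not provide one, the substitution $\alpha = u/(1-u)$ and the midpoint rule are my own choices for handling the slowly decaying tail, and all names are hypothetical.

```python
import math

def trigamma(x):
    """Trigamma function for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:                 # psi_1(x) = psi_1(x + 1) + 1/x^2
        result += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return (result + inv + 0.5 * inv2
            + inv * inv2 * (1.0 / 6 - inv2 * (1.0 / 30 - inv2 / 42)))

def nsb_prior(alpha, K):
    """NSB hyperprior density P(alpha) for K bins."""
    return (K * trigamma(K * alpha + 1.0) - trigamma(alpha + 1.0)) / math.log(K)

def integrate_prior(K, num_points=20000):
    """Midpoint rule on u in (0, 1) with alpha = u / (1 - u)."""
    h = 1.0 / num_points
    total = 0.0
    for i in range(num_points):
        u = (i + 0.5) * h
        alpha = u / (1.0 - u)
        total += nsb_prior(alpha, K) / (1.0 - u) ** 2   # d alpha = du / (1 - u)^2
    return total * h

for K in (2, 10, 100):
    print(f"K={K:3d}  integral of P(alpha) ~ {integrate_prior(K):.5f}")
```

The integral comes out close to one because the antiderivative of the numerator is $\mathbb{E}[H|\alpha]$, which runs from $0$ at $\alpha=0$ to $\log K$ as $\alpha\to\infty$.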

Let us look at the implied priors over the entropy when using the NSB prior. They are much more uniform now:

This uniformity results in the NSB estimator having excellent robustness properties and small bias. It is probably the best general purpose discrete entropy estimator available. One drawback however is the increased computational cost: in order to compute the estimator we need to solve a 1D integral numerically over $\alpha$. Each pointwise evaluation of the integrand function corresponds to computing $\hat{H}_{\textrm{Bayes}}$ for a fixed value of $\alpha$. High accuracy requires several hundred such evaluations, and this may be prohibitively expensive in some applications (for example, decision tree induction).
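To make the cost structure concrete, here is a simplified Python sketch of an NSB-style estimate. It is not the reference implementation: I assume the estimate is the average of $\hat{H}_{\textrm{Bayes}}(\alpha)$ weighted by $P(\alpha)$ times the Dirichlet-multinomial evidence of the counts, approximate the 1D integral with a fixed log-spaced grid (grid size and range are arbitrary choices), and use hand-rolled digamma and trigamma approximations. All names are hypothetical.

```python
import math

def digamma(x):
    """Digamma for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return (result + math.log(x) - 0.5 / x
            - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252)))

def trigamma(x):
    """Trigamma for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return (result + inv + 0.5 * inv2
            + inv * inv2 * (1.0 / 6 - inv2 * (1.0 / 30 - inv2 / 42)))

def bayes_entropy(counts, alpha):
    """Posterior mean entropy (nats) for a fixed symmetric alpha."""
    total = sum(counts) + alpha * len(counts)
    return digamma(total + 1.0) - sum(
        (c + alpha) / total * digamma(c + alpha + 1.0) for c in counts)

def log_evidence(counts, alpha):
    """log P(counts | alpha), up to a constant that does not depend on alpha."""
    K, n = len(counts), sum(counts)
    return (math.lgamma(K * alpha) - math.lgamma(n + K * alpha)
            + sum(math.lgamma(c + alpha) - math.lgamma(alpha) for c in counts))

def nsb_prior(alpha, K):
    return (K * trigamma(K * alpha + 1.0) - trigamma(alpha + 1.0)) / math.log(K)

def nsb_entropy(counts, grid_size=400, log10_min=-4.0, log10_max=4.0):
    """Grid approximation of the 1D integral over alpha."""
    K = len(counts)
    step = (log10_max - log10_min) / (grid_size - 1)
    alphas = [10.0 ** (log10_min + step * i) for i in range(grid_size)]
    # The extra log(alpha) term is the Jacobian of the log-spaced grid.
    logw = [math.log(nsb_prior(a, K)) + log_evidence(counts, a) + math.log(a)
            for a in alphas]
    top = max(logw)
    w = [math.exp(l - top) for l in logw]
    # One evaluation of the fixed-alpha Bayes estimator per grid point.
    return sum(wi * bayes_entropy(counts, a) for wi, a in zip(w, alphas)) / sum(w)

print(nsb_entropy([900, 50, 30, 20]))
```

Each grid point costs one full evaluation of the fixed-$\alpha$ estimator, so with a few hundred grid points the estimate is a few hundred times more expensive than $\hat{H}_{\textrm{Bayes}}$.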

Addendum: Undersampled Regime

After a comment from Ilya Nemenman on the previous version of this article, I also did an experiment in the undersampled regime ($N < K$), where we observe fewer outcomes than there are bins. I am glad I did perform this experiment!

I select $N=100$ and $K=2000$, with $500$ replicates and compare the same methods as in the second part of the article. The results are as follows.

Almost all estimators perform very poorly in this setting, with the naive Miller correction even being off the chart. Only the NSB and the Hausser-Strimmer estimator can be considered usable in this severely undersampled regime, with clear preference towards the NSB estimator.

Ilya Nemenman, the inventor of the NSB estimator, was kind enough to share his feedback on these experiments with me and to allow me to post them here:

I am glad to hear that NSB estimator did well on this test. It's also not surprising that HS estimator did rather well too -- in some sense, it's a frequentist version of NSB. Both NSB and HS perform shrinking towards the uniform distribution (infinite pseudocounts or "alpha" in your notation), and then they lift the shrinkage as $N$ grows. However, HS shrinks much stronger than NSB does. As a result, HS performs very well for large entropy (large alpha) distributions, and worse for lower entropies. It's probably possible to set up a frequentist shrinkage estimator that would shrink towards entropy being half of the maximum value, or shrink towards the maximum value, but less strongly than HS — I think that such an estimator would do better over the whole range of alpha. In practice, the strong shrinkage imposed by HS becomes problematic when the alphabet size is very large, say $2^{150}$, which is what one gets when one takes a 30ms long spike train and discretizes it at 0.2 ms resolution (yes spike = 1, no spike =0). We had numbers like this in our 2008 PLoS Comp Bio paper. With entropy of $\approx 15$ bits, alphabet size of $2^{150}$, and 100-1000 samples, NSB may work (more on this below), and HS will shrink towards 150 bits, and will likely overestimate. One way to see this problem is to realize that, in your comparison plots, once you use $\alpha > 1$, the entropy is nearly the maximum possible entropy. This is why HS works well there, but fails for $\alpha \ll 1$, where the entropy is substantially smaller than the maximum. If you were to replot the data putting the true entropy of the analyzed probability distribution (rather than alpha) on the horizontal axis, this will be visible, I think.

He continues,

A key point for both NSB and HS is that both may work in the regime of $N \sim \sqrt{K}$ (better yet, $\sim \sqrt{2^{H/2}}$). On the contrary, most other estimators you analyzed work well only up to $N \sim 2^H$ (unless I am missing something important). This is because NSB and HS require not good sampling of the underlying distribution, but coincidences in the data only. They estimate entropy, effectively, by inverting the usual birthday paradox, and using the frequency of coincidences to measure the diversity of data. One can illustrate this by pushing $K$ to even larger values in your last plot, 10000 or even more, if you limit yourself to smaller alpha.

These comments are very insightful and show that my earlier discussion and results were, in a way, limited to the simple case where we have a reasonable number of samples per bin. The case Ilya considers in his work is the severely undersampled regime.

One difficulty in producing the plot he suggests that plots the entropy of the distribution along the x-axis is that it would require an additional binning operation along that axis, so I have not produced this plot yet.

Reference Prior, Anyone?

I wonder whether the NSB prior is a simplification of a full reference prior treatment. This is not exactly the standard setting of reference priors because we are interested in a function (the entropy) of our random variables, so there is an additional indirection. But I believe it could work as follows: find in the space of all priors on $\alpha$ the prior that maximizes the KL divergence between implied entropy prior and entropy posterior.

Using the numerical method suggested in the paper above, I obtained a numerical reference prior (with one additional ABC approximation for the likelihood) for $K=2$ and this closely matches the NSB prior.

(Interestingly, I recently discovered this work on overall objective priors in which their hierarchical reference prior approach for the Dirichlet-Multinomial model yields an analytic proper prior which is very similar to the NSB and numerical reference priors.)

Machine Learning in Cambridge 2015

2015-02-26T20:00:00+00:00

This year we (Zoubin, together with David and myself) are again organizing a workshop event for the local Cambridge (UK) machine learning community. The schedule is available at the workshop homepage, Machine Learning in Cambridge 2015, and we also plan to make all talks available as video recordings after the event.

See you at the event!

Estimating Discrete Entropy, Part 2

2015-02-21T19:00:00+00:00

In the last part we have looked at the basic problem of discrete entropy estimation. In this article we will see a number of proposals of improved estimators.

Miller Correction

In 1955 George Miller proposed a simple correction to the naive plugin estimator $\hat{H}_N$ by adding the constant offset in the bias expression as follows.

$$\hat{H}_M = \hat{H}_N + \frac{K-1}{2n}.$$

This is an improvement over the plugin estimator but the added offset does not depend on the distribution but only on the sample size. We can do better.

(A variant of the Miller estimator for the infinite alphabet case is the so called Miller-Madow estimator in which the quantity $K$ is estimated from the data as well.)
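For concreteness, here is a minimal Python sketch of the plugin estimator and the Miller correction, in nats. The function names are hypothetical, and $K$ is assumed to be the number of bins passed in, including empty ones.

```python
import math

def plugin_entropy(counts):
    """Naive plugin estimate: entropy of the empirical distribution."""
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)

def miller_entropy(counts):
    """Plugin estimate plus the constant offset (K - 1) / (2n)."""
    n = sum(counts)
    K = len(counts)     # assumption: every bin is listed, empty bins included
    return plugin_entropy(counts) + (K - 1) / (2.0 * n)

counts = [7, 3, 0, 5, 1]    # made-up histogram, for illustration only
print("plugin:", plugin_entropy(counts))
print("Miller:", miller_entropy(counts))
```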

Jackknife Estimator

A classic method for bias correction is the jackknife resampling method due to (Quenouille, 1947), although the somewhat catchy name is due to John Tukey. (The literature on the jackknife methodology is quite classic now. A very readable modern summary of the jackknife methodology can be found in DasGupta's book. An older but still readable introduction is (Miller, 1974), PDF.)

In a nutshell, jackknife resampling methods are used to estimate bias and variance of estimators. They are typically simple to implement, and often computationally cheaper than the bootstrap. For bias reduction they can be very effective, in many cases knocking the bias down to $O(n^{-2})$.

The use of jackknife bias estimation to improve entropy estimation was suggested by (Zahl, 1977). The jackknife bias-corrected estimator of the plugin estimator is given as follows.

\begin{equation} \hat{H}_{J} = n \hat{H}_N - (n-1) \hat{H}^{(\cdot)}_N. \label{H:jackknife} \end{equation}

The quantity $\hat{H}^{(\cdot)}_N$ is the average of $n$ estimates obtained by leaving out a single observation. Thus, writing $\mathbb{h}=(h_1,\dots,h_K)$ for the histogram of bin counts on the full sample, and $\mathbb{h}_{\setminus i}$ for the histogram in which the count of the bin $k = X_i$ containing the $i$'th observation is reduced by one, we have

$$\hat{H}^{\setminus i}_N := \hat{H}_N(\mathbb{h}_{\setminus i}),$$

and the mean of these quantities,

$$\hat{H}^{(\cdot)}_N := \frac{1}{n} \sum_{i=1}^n \hat{H}^{\setminus i}_N.$$

Normally it would be expensive to compute $n$ leave-one-out estimates. Here, however, two tricks are possible. First, because the histogram is a sufficient statistic, all observations falling into the same bin yield the same leave-one-out estimate, so we need to compute only $K$ holdout estimates instead of $n$. Second, one can interleave computation in such a way that computing each holdout estimate is $O(1)$, reducing the overall computation of $(\ref{H:jackknife})$ to $O(K)$ runtime and no additional memory over the plugin estimate. In essence, the computational complexity is comparable to that of the inexpensive plugin estimate, making the jackknife estimator computationally cheap.
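The two tricks can be sketched in Python as follows. This is one possible way to organise the computation, not necessarily the one used for the experiments; the names `xlogx`, `plugin_entropy` and `jackknife_entropy` are illustrative. The sum $\sum_k h_k \log h_k$ is computed once, and removing one observation from bin $k$ changes only a single term of it, so each of the $K$ holdout estimates costs $O(1)$.

```python
import math

def xlogx(x):
    """x log x with the convention 0 log 0 = 0."""
    return x * math.log(x) if x > 0 else 0.0

def plugin_entropy(counts):
    n = sum(counts)
    return math.log(n) - sum(xlogx(h) for h in counts) / n

def jackknife_entropy(counts):
    """Jackknife bias-corrected plugin estimate in O(K) time."""
    n = sum(counts)
    if n < 2:
        raise ValueError("need at least two observations")
    s = sum(xlogx(h) for h in counts)      # shared sum over all bins
    h_full = math.log(n) - s / n           # plugin estimate on the full sample
    loo_mean = 0.0
    for h in counts:
        if h == 0:
            continue                       # no observation to leave out
        # Removing one observation from this bin changes a single term of s.
        s_loo = s - xlogx(h) + xlogx(h - 1)
        h_loo = math.log(n - 1) - s_loo / (n - 1)
        # The same holdout estimate arises h times among the n leave-one-out runs.
        loo_mean += h * h_loo / n
    return n * h_full - (n - 1) * loo_mean

print(plugin_entropy([3, 1, 0, 5]), jackknife_entropy([3, 1, 0, 5]))
```

Only a handful of scalars are kept besides the histogram itself, which matches the memory claim above.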

Grassberger Estimator

Another proposal for an improved estimator is due to Peter Grassberger. In (Grassberger, 2003) he derives two estimators based on an argument using analytic continuation which I have to admit is somewhat beyond my grasp. The better of the two estimators is the following:

$$\hat{H}_G = \log n - \frac{1}{n} \sum_{k=1}^K h_k G(h_k),$$

where the logarithm of the original naive estimator $\hat{H}_N$ has been replaced by a scalar function $G$, defined as

$$G(h) = \psi(h) + \frac{1}{2} (-1)^{h} \left(\psi(\frac{h+1}{2}) - \psi(\frac{h}{2})\right).$$

The function $\psi$ is the digamma function. (The function $G(h)$ is the solution of $G(h)=\psi(h)+(-1)^h \int^1_0 \frac{x^{h-1}}{x+1} \textrm{d}x$ given as equation $(30)$ in the paper.) Computationally this estimator is almost as efficient as the original plugin estimator, because for integer arguments the digamma function can be accurately approximated by an efficient series expansion.

When compared to the plugin estimator (in histogram count form), we can see an upwards correction of this estimator but also an interesting difference between even and odd histogram counts.

Unfortunately, the original derivation in Grassberger's paper is quite involved and beyond my full understanding. However, for practical purposes, among the computationally efficient estimators, the 2003 Grassberger estimator is probably the most useful and robust estimator.
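A minimal Python sketch of the estimator is given below, using only the standard library. Because the arguments of $\psi$ are integers or half-integers, the digamma function is evaluated here through the finite sums $\psi(m) = -\gamma + \sum_{j=1}^{m-1} 1/j$ and $\psi(m+\tfrac{1}{2}) = -\gamma - 2\log 2 + 2\sum_{j=1}^{m} 1/(2j-1)$ rather than through a series expansion; this is a simplification chosen for readability and costs $O(h)$ per evaluation. The names `digamma_half_integer`, `grassberger_G` and `grassberger_entropy` are illustrative.

```python
import math

EULER_GAMMA = 0.5772156649015329

def digamma_half_integer(two_x):
    """psi(x) for x = two_x / 2, where two_x is a positive integer."""
    if two_x % 2 == 0:
        m = two_x // 2                     # x = m is an integer
        return -EULER_GAMMA + sum(1.0 / j for j in range(1, m))
    m = (two_x - 1) // 2                   # x = m + 1/2
    return (-EULER_GAMMA - 2.0 * math.log(2.0)
            + 2.0 * sum(1.0 / (2 * j - 1) for j in range(1, m + 1)))

def grassberger_G(h):
    """G(h) = psi(h) + (1/2) (-1)^h (psi((h+1)/2) - psi(h/2)) for integer h >= 1."""
    sign = -1.0 if h % 2 else 1.0
    diff = digamma_half_integer(h + 1) - digamma_half_integer(h)
    return digamma_half_integer(2 * h) + 0.5 * sign * diff

def grassberger_entropy(counts):
    """log n - (1/n) sum_k h_k G(h_k); empty bins contribute nothing."""
    n = sum(counts)
    return math.log(n) - sum(h * grassberger_G(h) for h in counts if h > 0) / n

print([round(grassberger_G(h), 4) for h in range(1, 7)])
print(grassberger_entropy([3, 1, 0, 5]))
```

Working through the definition by hand gives $G(1) = -\gamma - \log 2$ and $G(2) = G(3) = 2 - \gamma - \log 2$, which illustrates the difference between even and odd counts mentioned above.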

Experiment

The following plots show a simple evaluation of some popular discrete entropy estimators. We assume a categorical distribution with $K=64$ outcomes with the probability vector $\mathbb{p}$ sampled from a symmetric Dirichlet distribution with hyperparameter $\alpha \in [0.25,5.0]$. We obtain $n=100$ samples from the distribution and estimate the entropy based on this sample. For each $\alpha$ we repeat this procedure 5,000 times to estimate the root mean squared error (RMSE) and bias.

We plot all the estimators discussed above, but also plot four additional estimators:

Hausser-Strimmer estimator, based on a shrinkage estimate,
Polynomial estimator due to Vinck et al.; this is equivalent to Zhiyi Zhang's estimator, but numerically simpler and more stable to evaluate. (I only mention this here because I have not found this mentioned elsewhere.)
Two Bayesian estimators (Bayes and NSB), to be discussed in the next post.

These experiments are not fully representative because the Dirichlet prior makes assumptions which may not be satisfied in your application. In particular, this experiment considers the well-sampled case ($n > K$), and simple bias correction methods work well in this regime (this was pointed out to me by Ilya and Jonas). However, some clear trends are visible:

The Plugin estimator fares badly on both RMSE and bias;
The Miller estimator fares less badly, suggesting that RMSE is affected mainly by bias; however, significant errors remain for small values of $\alpha$;
The Bayes estimator fares almost as badly as the plugin estimator, except for $\alpha=1/2$ (more on this point in the next post);
The Jackknife, Grassberger 2003, and NSB estimators provide excellent performance throughout the whole range of $\alpha$ values;
The performance of the Polynomial and Hausser estimates is mediocre.
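The experimental protocol described above can be sketched in Python as follows. This is a reduced illustration and not the code used to produce the plots: it compares only the plugin and Miller estimators, uses fewer repetitions than the 5,000 mentioned above, and samples the symmetric Dirichlet by normalising independent Gamma variates. All function names are illustrative.

```python
import math
import random

def plugin_entropy(counts):
    n = sum(counts)
    return math.log(n) - sum(h * math.log(h) for h in counts if h > 0) / n

def miller_entropy(counts):
    return plugin_entropy(counts) + (len(counts) - 1) / (2 * sum(counts))

def true_entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0)

def sample_dirichlet(rng, K, alpha):
    """Symmetric Dirichlet draw via normalised Gamma(alpha, 1) variates."""
    g = [rng.gammavariate(alpha, 1.0) for _ in range(K)]
    total = sum(g)
    return [x / total for x in g]

def run_experiment(estimator, alpha, K=64, n=100, trials=500, seed=0):
    """Return (RMSE, bias) of `estimator` over repeated draws of p and of the sample."""
    rng = random.Random(seed)
    errors = []
    for _ in range(trials):
        p = sample_dirichlet(rng, K, alpha)
        samples = rng.choices(range(K), weights=p, k=n)
        counts = [0] * K
        for s in samples:
            counts[s] += 1
        errors.append(estimator(counts) - true_entropy(p))
    bias = sum(errors) / trials
    rmse = math.sqrt(sum(e * e for e in errors) / trials)
    return rmse, bias

for name, est in [("plugin", plugin_entropy), ("miller", miller_entropy)]:
    print(name, run_experiment(est, alpha=1.0))
```

Using the same seed for every estimator means that all estimators are evaluated on identical samples, so differences in RMSE and bias are due to the estimators and not to sampling noise.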

In the next part we will be looking at Bayesian estimators.

Acknowledgements. I thank Il Memming Park, Jonas Peters, and Ilya Nemenman for reading a draft version of the article and providing very helpful feedback.

Estimating Discrete Entropy, Part 1

2015-02-07T14:00:00+00:00

Estimation of the entropy of a random variable is an important problem that has many applications. If you can estimate entropy accurately, you can also estimate mutual information, which allows you to find dependent random variables in large data sets. There are numerous applications.

The setting of discrete entropy estimation with a finite number of outcomes is as follows. There is an unknown categorical distribution over $K \geq 2$ different outcomes, defined by means of a probability vector $\mathbb{p} = (p_1,p_2,\dots,p_K)$, such that $p_k \geq 0$ and $\sum_k p_k = 1$. We are interested in the quantity

\begin{equation} H(\mathbb{p}) = -\sum_{k=1}^K p_k \log p_k,\label{eqn:Hdiscrete} \end{equation}

where $0 \log 0 = 0$ by convention.

Because the probability vector is unknown to us we cannot directly use $(\ref{eqn:Hdiscrete})$. Instead we assume that we observe $n$ samples $X_i$, $i=1,\dots,n$, from the categorical distribution in order to estimate $H(\mathbb{p})$.

Naive Plugin Estimator of the Discrete Entropy

The naive plugin estimator uses the frequency estimates of the categorical probabilities in the expression for the true entropy, that is,

\begin{equation} \hat{H}_N = - \sum_{k=1}^K \hat{p}_k \log \hat{p}_k,\label{Hplugin1} \end{equation}

where $\hat{p}_k = h_k / n$ are the maximum likelihood estimates of each probability $p_k$, and $h_k = \sum_{i=1}^n 1_{\{X_i = k\}}$ is simply the histogram over outcomes. The form $(\ref{Hplugin1})$ is equivalent to the simpler form

$$\hat{H}_N = \log n - \frac{1}{n} \sum_{k=1}^K h_k \log h_k.$$
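The equivalence follows from $\log \hat{p}_k = \log h_k - \log n$ together with $\sum_k h_k = n$. A small Python sketch makes both forms concrete; the function names and the labelling of outcomes as $0,\dots,K-1$ are assumptions made for illustration.

```python
import math
from collections import Counter

def histogram(samples, K):
    """Counts h_k for outcomes labelled 0, ..., K-1."""
    c = Counter(samples)
    return [c.get(k, 0) for k in range(K)]

def plugin_entropy_freq(counts):
    """Form using the frequency estimates p_k = h_k / n."""
    n = sum(counts)
    return -sum((h / n) * math.log(h / n) for h in counts if h > 0)

def plugin_entropy_counts(counts):
    """Equivalent form log n - (1/n) sum_k h_k log h_k."""
    n = sum(counts)
    return math.log(n) - sum(h * math.log(h) for h in counts if h > 0) / n

samples = [0, 2, 2, 1, 0, 2, 3, 2]
h = histogram(samples, K=5)
print(h, plugin_entropy_freq(h), plugin_entropy_counts(h))
```

Both functions skip empty bins, which implements the convention $0 \log 0 = 0$.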

Problems of the Naive Plugin Estimator

It has long been known, due to (Basharin, 1959) and (Harris, 1975), that the estimator $(\ref{Hplugin1})$ underestimates the true entropy $(\ref{eqn:Hdiscrete})$. In fact, we have for any distribution specified by $\mathbb{p}$ that

$$H(\mathbb{p}) - \mathbb{E}[\hat{H}_N] = \frac{K-1}{2n} - \frac{1}{12 n^2} \left(1-\sum_{k=1}^{K} \frac{1}{p_k}\right) + O(n^{-3}) \geq 0,$$

so that, in expectation, the true entropy is at least as large as what $\hat{H}_N$ claims it is. Why is this the case? There is a simple explanation illustrated by the following figure and description.

Let us only consider a single bin $k$ with true probability $p_k$. If we knew $p_k$ exactly, the contribution this bin makes to the true entropy of the distribution would be $-p_k \log p_k$. We do not know $p_k$ and instead estimate it using its frequency estimate $\hat{p}_k = h_k / n$. Marginally, the count $h_k$ follows a $\textrm{Binomial}(n,p_k)$ distribution, so $\hat{p}_k$ is a scaled Binomial random variable.

I have shown an empirical histogram of 50,000 samples from a $\textrm{Binomial}(1000,p_k)$ distribution in red, where $p_k=0.27$ in this case. As you can see, there is significant sampling variance about the true $p_k$, despite having seen 1,000 samples. It is, however, centered exactly at $p_k$ because $\hat{p}_k$ is an unbiased estimate of $p_k$, that is, we have $\mathbb{E} \hat{p}_k = p_k$. It also is approximately normally distributed, as can be clearly seen in the Gaussian shape of the red histogram.

When we now evaluate the function $f(x) = -x \log x$ we evaluate it at the slightly wrong place $\hat{p}_k$ instead of the true place $p_k$. Because $f$ is concave in this case, the famous Jensen's inequality tells us that

$$H = \sum_k f(p_k) = \sum_k f(\mathbb{E} \hat{p}_k) \geq \sum_k \mathbb{E} f(\hat{p}_k) = \mathbb{E} \sum_k f(\hat{p}_k) = \mathbb{E} \hat{H}_N,$$

so that for each $p_k$ the contribution to the entropy is underestimated on average. (This does not imply that each particular finite sample estimate is below the true entropy however.)
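For $K=2$ the expectation $\mathbb{E}[\hat{H}_N]$ can be computed exactly by summing over all possible counts, which gives a quick numerical check of the bias expansion. The following Python sketch does this for $n=100$ and $p=(0.27, 0.73)$; the sample size and the function names are choices made for illustration.

```python
import math

def binary_entropy(p):
    """Entropy (in nats) of the distribution (p, 1 - p)."""
    return -sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0)

def expected_plugin_entropy(n, p):
    """Exact E[H_N] for K = 2, summing over all counts h ~ Binomial(n, p)."""
    total = 0.0
    for h in range(n + 1):
        pmf = math.comb(n, h) * p**h * (1.0 - p)**(n - h)
        total += pmf * binary_entropy(h / n)
    return total

n, p = 100, 0.27
bias = binary_entropy(p) - expected_plugin_entropy(n, p)
first_order = (2 - 1) / (2 * n)
print(bias, first_order)
```

The exact gap $H - \mathbb{E}[\hat{H}_N]$ is positive and close to the leading term $(K-1)/(2n) = 0.005$, as Jensen's inequality and the expansion predict.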

In the next part we will take a look at some improved estimators of the discrete entropy.

Acknowledgements. I thank Il Memming Park and Jonas Peters for reading a draft version of the article and providing feedback.

Advanced Structured Prediction

2015-01-29T22:30:00+00:00

In December 2014, just in time for NIPS, MIT Press released an edited volume on structured prediction models and their applications in natural language processing, computer vision, and computational biology.

Advanced Structured Prediction, Editors Sebastian Nowozin, Peter V. Gehler, Jeremy Jancsary, Christoph H. Lampert, (MIT Press, Amazon)

The volume offers an overview of the recent research on structured prediction in order to make the work accessible to a broader research community. The chapters, by leading researchers in the field, cover a range of topics, including research trends, the linear programming relaxation approach, innovations in probabilistic modeling, recent theoretical progress, and resource-aware learning.

Contributors

Jonas Behr, Yutian Chen, Fernando De La Torre, Justin Domke, Peter V. Gehler, Andrew E. Gelfand, Sébastien Giguère, Amir Globerson, Fred A. Hamprecht, Minh Hoai, Tommi Jaakkola, Jeremy Jancsary, Joseph Keshet, Marius Kloft, Vladimir Kolmogorov, Christoph H. Lampert, François Laviolette, Xinghua Lou, Mario Marchand, André F. T. Martins, Ofer Meshi, Sebastian Nowozin, George Papandreou, Daniel Prusa, Gunnar Rätsch, Amélie Rolland, Bogdan Savchynskyy, Stefan Schmidt, Thomas Schoenemann, Gabriele Schweikert, Ben Taskar, Sinisa Todorovic, Max Welling, David Weiss, Thomas Werner, Alan Yuille, Stanislav Zivny.

Streaming Mean and Variance Computation

2015-01-25T21:30:00+00:00

Given a sequence of observed data we would often like to estimate simple quantities like the mean and variance.

Sometimes the data is available in a streaming setting, that is, we are given one sample at a time. For example, this is the case when

the number of samples is a priori unknown,
we have to perform some stopping test after each sample,
the number of samples is very large and we cannot store all samples.

More formally, given weighted observations $X_1$, $X_2$, $\dots$, with $X_i \in \mathbb{R}$, and $w_1$, $w_2$, $\dots$, with $w_i \geq 0$ we would like to calculate simple statistics like the weighted mean or weighted variance of the sample without having to store all samples, and by processing them one-by-one.

In this situation we can compute the mean and variance of a sample (and, more generally, any higher-order moments) using a streaming algorithm. Many possibilities exist but because of the incremental computation particular attention needs to be paid to numerical stability. If we were to ignore numerical accuracy we could use a simple derivation to show that the following updates for $i=1,2,\dots$ are correct, when initializing $S^{(0)} = T^{(0)} = U^{(0)} = 0$:

$$S^{(i)} = S^{(i-1)} + w_i$$

$$T^{(i)} = T^{(i-1)} + w_i X_i$$

$$U^{(i)} = U^{(i-1)} + w_i X_i^2$$

Then $\hat{\mu} = T^{(n)} / S^{(n)}$ is the weighted sample mean, and $\hat{\mathbb{V}} = \frac{n}{(n-1) S^{(n)}} (U^{(n)} - S^{(n)} \hat{\mu}^2)$ is an unbiased estimate of the weighted variance.

The problem with this naive derivation arises when $n$ is very large. Then in all three updates the summation may add quantities of very different magnitude, leading to large round-off errors. By the way, this can even arise when one is computing the simple sum of many numbers, and a classic solution in that case is Kahan summation.

A clever solution to this problem for streaming mean and variance computation was proposed by West in 1979. In his algorithm the summed quantities are controlled to be on average of comparable size. (It is not the only alternative; for a detailed numerical study of possible options, see the paper linked below.)

The West algorithm supports mean and variance computation for nonnegatively weighted samples $(w_i, X_i)$ with $w_i \geq 0$, $X_i \in \mathbb{R}$. The original paper is

D.H.D. West, "Updating Mean and Variance Estimates: An Improved Method" (publisher link), Comm. of the ACM, Vol. 22, Issue 9, 532--535, 1979.

It outputs

The weighted unbiased mean estimate, $\hat{\mu} = (\sum_i w_i X_i) / (\sum_i w_i)$,
The weighted unbiased variance estimate, $\hat{\mathbb{V}} = \left(\sum_i w_i (X_i - \hat{\mu})^2\right) / (\frac{n-1}{n} \sum_i w_i)$.

Here is an implementation for the Julia programming language.

type MeanVarianceAccumulator
    sumw::Float64
    wmean::Float64
    t::Float64
    n::Int

    function MeanVarianceAccumulator()
        new(0.0, 0.0, 0.0, 0)
    end
end
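function observe!(mvar::MeanVarianceAccumulator, value, weight)
    @assert weight >= 0.0
    # Deviation of the new observation from the current weighted mean
    q = value - mvar.wmean
    temp_sumw = mvar.sumw + weight
    # Increment of the weighted mean caused by the new observation
    r = q*weight / temp_sumw

    mvar.wmean += r
    # t accumulates the weighted sum of squared deviations from the mean
    mvar.t += q*r*mvar.sumw
    mvar.sumw = temp_sumw
    mvar.n += 1

    nothing
end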
count(mvar::MeanVarianceAccumulator) = mvar.n
mean(mvar::MeanVarianceAccumulator) = mvar.wmean
var(mvar::MeanVarianceAccumulator) = (mvar.t*mvar.n)/(mvar.sumw*(mvar.n-1))
std(mvar::MeanVarianceAccumulator) = sqrt(var(mvar))

You would call it as follows (tested with Julia version 0.3.5):

X = [5.0, -1.5, 3.33]
w = [0.5, 1.0, 0.1]

n = length(X)
mu_exact = sum(w.*X) / sum(w)
V_exact = sum(w .* ((X .- mu_exact).^2)) / (((n-1)/n) * sum(w))

mvar = MeanVarianceAccumulator()
for i=1:n
    observe!(mvar, X[i], w[i])
end
mean(mvar), mu_exact, var(mvar), V_exact

This gives the correct output (running mean, mean, running variance, variance):

(0.8331250000000003,0.8331249999999999,13.826563476562498,13.8265634765625)
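To illustrate why the naive updates are problematic, the following Python sketch compares them with a direct port of the `observe!` update above on data with a large common offset. The data set (four unit-weight values near $10^9$ whose exact variance is 30) is chosen for illustration, and the function names are not part of the Julia code.

```python
def naive_variance(xs, ws):
    """Accumulate S, T, U as in the naive updates, then combine at the end."""
    S = T = U = 0.0
    for x, w in zip(xs, ws):
        S += w
        T += w * x
        U += w * x * x
    n = len(xs)
    mu = T / S
    return n / ((n - 1) * S) * (U - S * mu * mu)

def west_variance(xs, ws):
    """Same quantity via West's update, mirroring observe! above."""
    sumw = wmean = t = 0.0
    for x, w in zip(xs, ws):
        q = x - wmean                 # deviation from the current mean
        temp_sumw = sumw + w
        r = q * w / temp_sumw         # increment of the mean
        wmean += r
        t += q * r * sumw             # weighted sum of squared deviations
        sumw = temp_sumw
    n = len(xs)
    return t * n / (sumw * (n - 1))

xs = [1e9 + 4.0, 1e9 + 7.0, 1e9 + 13.0, 1e9 + 16.0]
ws = [1.0, 1.0, 1.0, 1.0]
print(naive_variance(xs, ws), west_variance(xs, ws))
```

In double precision the naive formula subtracts two numbers of magnitude around $4 \cdot 10^{18}$ that agree in nearly all of their digits, so the difference is dominated by round-off and the result is far from 30. The West update only ever works with deviations from the running mean and recovers the variance accurately on this example.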

Alternative algorithms and variants for higher-order moments can be found on the excellent Wikipedia page on the topic.

Addendum: (October 2015) A recent paper by (Meng, 2015) contains a variant of the above algorithm for the unweighted case to compute the first four central moments in a numerically stable manner. Meng provides a simple implementation requiring only 24 floating point operations per observation.

Acknowledgements. I thank Amit Adam for reading a draft and providing comments that improved clarity.

The Beginning

2015-01-25T21:00:00+00:00

This is the start of my blog. This will be a quite technical blog and therefore address a more specialized audience.

The articles will cover topics in the area of machine learning, statistics, maybe some computer vision, let's see. I plan to publish one article every two weeks, but let us see how that goes.

Sample size \(n\)	Significant digits to report
\(2 \leq n \leq 6\)	1
\(7 \leq n \leq 100\)	2
\(101 \leq n \leq 10,000\)	3
\(10,001 \leq n \leq 10^6\)	4
\(n > 10^6\)	5
