Multilevel Splitting

Sebastian Nowozin - Fri 10 July 2015

This article is about multilevel splitting, a method for estimating the probability of rare events.

Estimating the probability of rare events is important in many fields. One vivid example is the study of the reliability of systems; imagine, for example, that we are responsible for building a mechanical structure such as a bridge and aim to design it to last one hundred years. To provide any kind of guarantee we need a model of what could happen in these 100 years: how the bridge will be used during that time, what weight it will have to bear, how strong winds and floods may be, how corrosion and other processes deteriorate the structure, and so on. Considering all these factors may only be possible approximately, via a simulation of the structure under the different effects.

For concreteness let's say we denote by \(X\) the random variable that represents the maximum force applied to the bridge during its 100-year lifetime. Each simulation allows us to obtain a sample \(X_i \sim P\) of this force, where \(P\) is a probabilistic model of everything that can happen during the 100 years. Given that we designed the bridge to withstand a certain force, we now want to make statements of the form

$$P(X \geq \delta) \leq \epsilon.$$

Often we want the probability of something bad happening (the event \(X \geq \delta\)) to be exceptionally small, say \(\epsilon = 10^{-9}\).

Another common example is the computation of P-values, where we observe a sample \(x\) and compute a test statistic \(t = T(x)\). Given a null model in the form of a distribution \(P(X)\) we are interested in the P-value, that is, the probability \(P(T(X) \geq t)\) of the event \(T(X) \geq t\). This number is the probability under the null of observing a test statistic at least as extreme as the one actually observed. Using the multilevel splitting idea we can hope to accurately compute the P-value as long as we can run an MCMC chain on the null model. More general P-values for composite null models, such as the posterior predictive P-value, are also computable. So if this sounds good, how does multilevel splitting work and why is it needed in the first place?

In the absence of an analytic form for \(P\), a naive simulation approach is to repeatedly draw samples \(X_i \sim P\) and to count how often the bad event happens. For rare events such as the one above this does not work well: if we exactly met the guarantee of the above example, \(\epsilon = 10^{-9}\), we would on average have to draw around \(1/\epsilon = 10^9\) samples just to see a single bad event. And because we would like to estimate the rare event probability, not merely observe the event once, we need even more samples.
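
To make the inefficiency concrete, here is a minimal sketch of the naive estimator in Julia, for the standard Normal tail probabilities that also serve as the example at the end of this article; the function name and parameters are illustrative.

using Distributions

# Naive Monte Carlo: draw n samples and count how often the event occurs.
function naive_tail_estimate(q, n)
    target = Normal(0.0, 1.0)
    count(x -> x >= q, rand(target, n)) / n
end

naive_tail_estimate(4.0, 10^7)   # feasible: truth is ccdf(Normal(), 4.0) ≈ 3.2e-5
naive_tail_estimate(16.5, 10^7)  # returns 0.0; the event is never observed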

There are a number of custom methods for the accurate estimation of rare event probabilities. The remainder of the article discusses multilevel splitting, but at this point I would like to mention that another popular set of methods for rare events is based on adaptive importance sampling, which is described in detail in Rubinstein and Kroese's book on Monte Carlo methods.

Multilevel Splitting

For addressing the problem of estimating rare event probabilities, John von Neumann had a better idea than naive simulation: multilevel splitting. The first published description of multilevel splitting is due to Kahn and Harris in 1951, who attribute the idea to John von Neumann.

The basic idea of multilevel splitting is to steer an iterative simulation process towards the rare event region by removing samples far away from the rare event and splitting samples closer to the rare event.

The application considered in the 1951 paper is interesting in that it clearly relates to nuclear weapons research:

"We wish to estimate the probability that a particle is transmitted through a shield, when this probability is of the order of \(10^{-6}\) to \(10^{-10}\), and we wish to do this by sampling about a thousand life histories." ... "In one method of applying this, one defines regions of importance in the space being studied, and, when the sampled particle goes from a less important to a more important region, it is split into two independent particles, each one-half the weight of the original."

Back in 1951 the algorithm was somewhat ad hoc but effective. In a more recent 2011 paper, Guyader, Hengartner, and Matzner-Løber propose a more practical variant of the same idea and provide theoretical results.

Setup

The general setup is as follows. We have a distribution \(P\) defining our system, with realizations \(X \sim P\) taking values in a set \(\mathcal{X}\). A continuous map \(\phi: \mathcal{X} \to \mathbb{R}\) defines the quantity of interest, and we are interested in computing the probability \(P(\phi(X) \geq q)\). To this end we assume we can approximately simulate from \(P\) using a Markov chain, which is typically possible even in complex models.

The basic idea of the original 1951 algorithm is to fix a ladder of levels \(-\infty = L_0 < L_1 < L_2 < \dots < L_k = q\). Because the events \(\{\phi(X) \geq L_i\}\) are nested and the conditioning event for \(L_0 = -\infty\) is certain, we can formally write the rare event probability as the telescoping product

$$P(\phi(X) \geq q) = \prod_{i=1}^k P(\phi(X) \geq L_i \:|\: \phi(X) \geq L_{i-1}).$$

The above product can be estimated term-by-term as follows: we use a set of \(N\) particles \((X_1,\dots,X_N)\) and simulate these according to \(X_i \sim P(X)\). Then we estimate the fraction

$$P(\phi(X) \geq L_1 \:|\: \phi(X) \geq L_0) = P(\phi(X) \geq L_1) \approx \frac{\sum_{i=1}^N 1_{\{\phi(X_i) \geq L_1\}}}{N}.$$

Afterwards we discard all particles with \(\phi(X_i) < L_1\) and use the remaining particles to resample a set of \(N\) particles (the splitting). Finally, we update all particles using a number of steps of our MCMC kernel, but this time restricted to \(\phi(X_i) \geq L_1\), that is, we reject all proposals that would violate this condition. This is one level, and for the multilevel scheme we repeat the above procedure with the next level. Eventually, when we reach the final level \(L_k\), we take the product of the estimated probabilities as the estimate of the rare event probability. Upon reaching the final level the surviving particles are properly distributed conditional on the restriction \(\phi(X) \geq q\).
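
To make this fixed-level scheme concrete, here is a minimal sketch in Julia for \(P(X \geq 5)\) with \(X \sim \mathcal{N}(0,1)\) and \(\phi\) the identity; the hand-picked level ladder, particle count \(N\), kernel width, and number of MCMC steps \(T\) are illustrative choices, not part of the method.

using Distributions

# Fixed-level multilevel splitting for P(X >= q), X ~ N(0,1), phi = identity.
function fixed_level_splitting(levels; N = 1000, T = 10)
    target = Normal(0.0, 1.0)
    K = Normal(0.0, 0.5)  # symmetric random-walk proposal kernel
    X = rand(target, N)   # initial particles X_i ~ P
    phat = 1.0
    for L in levels
        survivors = X[X .>= L]
        phat *= length(survivors) / N      # estimated conditional probability
        isempty(survivors) && return 0.0   # all particles died; estimate fails
        # Splitting: resample N particles from the survivors
        X = survivors[rand(1:length(survivors), N)]
        # Metropolis updates restricted to phi(X) >= L
        for i in 1:N, t in 1:T
            y = X[i] + rand(K)
            if log(rand()) <= logpdf(target, y) - logpdf(target, X[i]) && y >= L
                X[i] = y
            end
        end
    end
    phat
end

fixed_level_splitting(0.5:0.5:5.0)  # truth: ccdf(Normal(), 5.0) ≈ 2.9e-7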

The above algorithm is effective but has the major drawback that a ladder of levels must be fixed a priori. It would be more practical to have an automatic method to create these levels or to get rid of them entirely. The algorithm of Guyader et al. achieves this automatic selection by keeping the particles sorted according to \(\phi\), with the lowest particle defining the current level, at the cost of the algorithm having a random runtime.
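
To see where the estimate comes from in this one-at-a-time scheme (this is exactly the estimator used in the example code below): each iteration discards and regenerates only the single lowest of the \(N\) particles, so every conditional probability in the product is estimated by \((N-1)/N\), and after \(m\) levels the estimate is

$$\hat{p} = \left(1 - \frac{1}{N}\right)^m.$$

Solving \((1 - 1/N)^m = p\) for \(m\) shows that roughly \(N \log(1/p)\) iterations are needed to reach a rare event of probability \(p\), which explains the random (and, for small \(p\), long) runtime.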

The 2011 paper is quite rich: it also contains an approximate confidence interval for the true probability, an analysis of the random runtime, and an interesting application to estimating the false positive rate of watermark detection schemes (which ideally should be very small). Also, a variant of their method can solve for the quantile, that is, given \(p\) in \(p = P(\phi(X) \geq q)\), solve for \(q\); a sketch of this variant follows the example below. (Unfortunately, as is often the case with statistics and applied math papers, the algorithm (in Section 3.2) is not presented as clearly as it would be in a typical CS or ML paper.)

Example

The following is an implementation in the Julia language that estimates \(P(X \geq 16.5)\) where \(X \sim \mathcal{N}(0,1)\) is a standard Normal random variable.

using Distributions

N = 2000              # number of particles
T = 10                # number of MCMC steps per level
q = 16.5              # threshold whose exceedance probability we estimate
target = Normal(0.0, 1.0)
K = Normal(0.0, 0.2)  # symmetric random-walk proposal kernel

m = 1
X = sort(rand(target, N))  # initial particles, sorted ascending
L = X[1]                   # current level: the lowest particle

while L < q   # as long as the lowest particle is below q
    global X, L, m  # assign to globals inside the loop (needed in Julia 1.x scripts)

    # Splitting: replace the lowest particle with a copy of a random survivor
    X[1] = X[rand(2:N)]

    # Move the copy with T Metropolis steps restricted to stay above L
    for t = 1:T
        y = X[1] + rand(K)
        log_alpha = logpdf(target, y) - logpdf(target, X[1])
        if log(rand()) <= log_alpha && y > L
            X[1] = y
        end
    end
    X = sort(X)
    L = X[1]
    m += 1
end
# Each level retains N-1 of N particles, contributing a factor (1 - 1/N)
phat = (1.0 - 1.0/N)^(m - 1)
# Estimate, Truth
phat, ccdf(target, q)

Giving the output

(1.4581487078794118e-61, 1.8344630031647276e-61)

where the first number is the estimate and the second number is the ground truth, which in this case is known analytically. The relative estimation accuracy is remarkable, given that this event occurs on average only once in roughly \(10^{61}\) samples. For this simulation a total of \(m = 280,092\) level-update iterations were performed until the algorithm stopped, consistent with the expected number of iterations \(N \log(1/p) \approx 2.8 \times 10^5\).
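
Finally, the quantile variant mentioned earlier can be sketched by reusing the loop above: instead of stopping when the lowest particle passes a fixed threshold \(q\), we stop once the running estimate \((1 - 1/N)^m\) drops below the given probability \(p\) and return the current lowest particle. The following is only a minimal sketch of this idea under the same assumptions as the example above, not the exact algorithm from Section 3.2 of the paper.

using Distributions

# Sketch of the quantile variant: given p, estimate q with P(X >= q) ≈ p.
function quantile_sketch(p; N = 2000, T = 10)
    target = Normal(0.0, 1.0)
    K = Normal(0.0, 0.2)      # symmetric random-walk proposal kernel
    X = sort(rand(target, N))
    m = 0
    while (1.0 - 1.0/N)^m > p
        L = X[1]
        X[1] = X[rand(2:N)]   # splitting: clone a surviving particle
        for t = 1:T           # Metropolis steps restricted to stay above L
            y = X[1] + rand(K)
            if log(rand()) <= logpdf(target, y) - logpdf(target, X[1]) && y > L
                X[1] = y
            end
        end
        X = sort(X)
        m += 1
    end
    X[1]                      # the lowest particle approximates the quantile
end

quantile_sketch(1e-6)  # truth: cquantile(Normal(), 1e-6) ≈ 4.75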

Conclusion

Multilevel splitting is a useful algorithm for estimating the probability of rare events and the recent algorithm of Guyader et al. is practical in that it can be implemented on top of an arbitrary MCMC sampler.

There are caveats, however. In the above example, the problem structure is almost ideal for the application of multilevel splitting: a slowly varying continuous function \(\phi\) whose level sets are topologically connected. This means that the MCMC sampler can mix easily in the restricted subsets and the resulting rare event probabilities can be accurately estimated. If these assumptions are not satisfied the algorithm may fail to work, and current research addresses these more general situations; see, for example, this recent paper by Walter.

In summary, although some care is required when applying multilevel splitting to real problems, it is likely to be orders of magnitude more efficient than naive approaches.