This article is about *multilevel splitting*, a method for estimating the
probability of rare events.

Estimating the probability of *rare events* is important in many fields.
One vivid example is in the study of reliability of systems; imagine for
example, that we are responsible for building a mechanical structure such as a
bridge and we aim to design it to last one hundred years.
To provide any kind of guarantee we need to have a model of what could happen
in these 100 years, for example how the bridge will be used during that time,
what weight it will have to bear, how strong winds and floods may be, how
corrosion and other processes deteriorate the structure, etc.
Considering all these factors may only be possible approximately via a
*simulation* of the structure under different effects.

For concreteness let's say we denote by \(X\) the random variable that represents the maximum force that is applied to the bridge during the 100 years lifetime. Each simulation allows us to obtain a sample \(X_i \sim P\) of this force, where \(P\) is a probabilistic model of everything that can happen during the 100 years. Given that we designed the bridge to widthstand a certain force, the question is now to make statements of the form

Often we want the probability of something bad happening (the event \(X \geq \delta\)) to be exceptionally small, say \(\epsilon = 10^{-9}\).

Another common example is the computation of P-values,
where we observe a sample \(x\) and compute a *test statistic* \(t=T(x)\). Given
a *null model* in the form of a distribution \(P(X)\) we are interested in the
*P-value*, that is, the probability of the event \(P(T(X) \geq t)\). This
number is the probability under the null of observing a test statistic at
least as extreme as the one actually observed.
Using the multilevel splitting idea we can hope to accurately compute the
P-value as long as we can run an MCMC chain on the null model.
Also, more general P-values for composite null models, such as the *posterior
predictive P-value* are computable.
So if this sounds good, how does multilevel splitting work and why is it
needed in the first place?

In the absence of an analytic form for \(P\), a naive simulation approach is to repeatedly draw samples \(X_i \sim P\) and to count how often the bad event happens. For rare events as the one above this does not work very well and if we would exactly meet the guarantee of the above example, \(\epsilon = 10^{-9}\), then we would on average have to draw around \(1/\epsilon = 10^9\) samples just to see a single bad event. But because we would like to estimate the rare event probability we need even more samples.

There are a number of custom methods for accurate estimation of rare event
probabilities. The remainder of the article discusses *multilevel splitting*,
but at this point I would like to mention that another popular set of methods
for rare events is based on adaptive importance sampling which is described in
detail in Rubinstein and Kroese's
book on Monte Carlo
methods.

# Multilevel Splitting

John von Neumann had an idea
better than naive simulation on how to address the problem of estimating rare
event probabilities. He named his solution *multilevel splitting*. The first
published description of multilevel splitting is due to Kahn and Harris in
1951 (who
attribute it to John von Neumann).

The basic idea of multilevel splitting is to steer an iterative simulation
process towards the rare event region by removing samples far away from the
rare event and *splitting* samples closer to the rare event.

The application considered in the 1951 paper is interesting in this regard in that it clearly relates to nuclear weapon research:

"We wish to estimate the probability that a particle is transmitted through a shield, when this probability is of the order of \(10^{-6}\) to \(10^{-10}\), and we wish to do this by sampling about a thousand life histories." ... "In one method of applying this, one defines regions of importance in the space being studied, and, when the sampled particle goes from a less important to a more important region, it is split into two independent particles, each one-half the weight of the original."

Back in 1951 the algorithm was somewhat adhoc but effective. In a recent 2011 paper by Guyader, Hengartner, and Matzner-Lober the authors propose a more practical variant of the same idea and provide theoretical results.

## Setup

The general setup is as follows. We have a distribution \(P\) defining our system. We have \(X \in \mathcal{X}\) for the realizations \(X \sim P\). A continuous map \(\phi: \mathcal{X} \to \mathbb{R}\) defines the quantity of interest. We are interested in computing the probability \(P(\phi(X) \geq q)\). To this end we assume we can approximately simulate from \(P\) using a Markov chain, which is typically possible even in complex models.

The basic idea of the original 1951 algorithm is to fix a set of levels \(-\infty = L_0 < L_1 < L_2 < \dots < L_k = q\). Then we can formally write

The above product can be estimated term-by-term as follows: we use a set of \(N\) particles \((X_1,\dots,X_N)\) and simulate these according to \(X_i \sim P(X)\). Then we estimate the fraction

Afterwards we discard all particles with \(\phi(X_i) < L_1\) and use the remaining
particles to resample a set of \(N\) particles (the *splitting*).
Finally, we update all particles using a number of steps of our MCMC kernel,
but this time restricted to \(\phi(X_i) \geq L_1\), that is, we reject all
proposals that would violate this condition.
This is one level, and for the multilevel scheme we repeat the above procedure
with the next level. Eventually, when we reach the final level \(L_k\), we take
the product of the estimated probabilities as the estimate of the rare event
probability. Upon reaching the final level the surviving particles are
properly distributed conditional on the restriction \(\phi(X) \geq q\).

The above algorithm is effective but has the major drawback of having to fix a
ladder of levels apriori. It would be more practical to instead have an
automatic method to create these levels or to get rid of them entirely. The
algorithm of *Guyader et al.* achieves this automatic selection by keeping the
particles sorted according to \(\phi\), with the lowest particle defining the
current level, at the cost of having a random runtime of the algorithm.

The 2011 paper is quite rich in that it also contains an approximate
confidence interval for the true probability as well as an analysis of the
random runtime and an interesting application of estimating the false positive
rate of watermark detection schemes (which ideally should be very small).
Also, a variant of their method can solve for the *quantile*, that is, given
\(p\) in \(p = P(\phi(X) \geq q)\), solve for \(q\).
(Unfortunately, in the paper, as is often the case with many statistics and
applied math papers, the algorithm (in Section 3.2) is not presented very
clearly compared to a typical CS or ML paper.)

## Example

The following is an implementation in the Julia language that estimates \(P(X \geq 16.5)\) where \(X \sim \mathcal{N}(0,1)\) is a standard Normal random variable.

```
using Distributions
N=2000 # number of particles
T=10 # number of MCMC steps
q=16.5 # quantile
target=Normal(0.0, 1.0)
K=Normal(0.0, 0.2) # Markov kernel
m=1
X=sort(rand(target, N))
L=X[1]
while L < q # as long as there are particles below q
X[1] = X[rand(2:N)]
# Run a Markov chain on the lowermost sample
for t=1:T
y = X[1] + rand(K)
log_alpha = logpdf(target, y) - logpdf(target, X[1])
if log(rand()) <= log_alpha && y > L
X[1] = y
end
end
X = sort(X)
L = X[1]
m += 1
end
phat = (1.0-1.0/N)^(m-1)
# Estimate, Truth
phat, ccdf(target, q)
```

Giving the output

```
(1.4581487078794118e-61,1.8344630031647276e-61)
```

where the first number is the estimate and the second number is the ground truth, known in this case analytically. The relative estimation accuracy in this case is remarkably, given that this event occurs on average only once every \(10^{61}\) samples. For this simulation a total of \(m=280,092\) sample updates have been performed until the algorithm stopped.

## Conclusion

Multilevel splitting is a useful algorithm for estimating the probability of rare events and the recent algorithm of Guyader et al. is practical in that it can be implemented on top of an arbitrary MCMC sampler.

There are caveats, however. In the above example, the problem structure is almost ideal for the application of multilevel splitting: a slowly varying continuous function \(\phi\) whose level sets are topologically connected. This means that the MCMC sampler can mix easily in the restricted subsets and the resulting rare event probabilities can be accurately estimated. If these assumptions are not satisfied the algorithm may fail to work, and current research addresses these more general situations, see, for example this recent paper by Walter.

In summary, although some care is required for the application of multilevel splitting to real problems it is likely to be orders of magnitude more efficient than naive approaches.