History of Monte Carlo Methods - Part 2

Sebastian Nowozin - Fri 30 October 2015 -

This is the second part of a three-part post. The first part covered the early history of Monte Carlo and the rejection sampling method.

Part 2

In this part we are going to look at importance sampling and sequential Monte Carlo.

The video files are also available for offline viewing in MP4/H.264, WebM/VP8, and WebM/VP9 formats.


Transcript

(This is a slightly edited and link-annotated transcript of the audio in the above video. As this is spoken text, it is not as polished as my writing.)

Speaker: Okay, so rejection sampling works only for short chain lengths. In light of this finding, the next method, sequential importance sampling, was introduced independently by two groups: by John Michael Hammersley, who actually did his PhD here in Cambridge and then moved to Oxford to become a professor, and by Marshall Rosenbluth and his wife Arianna Rosenbluth. They worked independently and gave the method different names. I think Hammersley called it inversely restricted sampling and the Rosenbluths called it biased sampling, and neither name really stuck. Nowadays it is called sequential importance sampling. In different communities it is also known as the growth method or the Rosenbluth method.

How does this method work? It is based on an idea that was suggested by the audience just before (in the first part). Remember when we are growing these chains step by step and we make a step that would force us to reject the sample? Well, we could simply avoid making that step. We could say, "Look, there are two possibilities that would not lead to a rejection, so why not just take one of those?"

Of course the method is not fail-safe. In some situations, no matter what we do, if we keep growing we will still run into trouble. The method is still myopic: we still only make one step at a time. But the real problem with this method is that we no longer sample uniformly from the set that interests us. In fact, we favor more compact configurations, and what Hammersley, Morton, and the Rosenbluths realized was a method to systematically compensate for that bias.

So let me first talk about how this is done in general and then how it is done in our specific example. In general, we have this expectation expression that we want to approximate. And what they said is: well, we assign one weight to each sample; if every weight were one, that would be the original expectation. But we choose the weight for each sample in such a way that we compensate for the bias, for the fact that we favor some configurations. So whenever we favor certain configurations, we down-weight them, and whenever a configuration is rare under our sampling but we do generate it, we up-weight it.
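
As an aside that is not in the talk itself: in formulas, if we want the expectation of a function f under a target distribution p but can only draw samples from a proposal q, the importance sampling identity reads

```latex
\mathbb{E}_{x \sim p}[f(x)]
  = \mathbb{E}_{x \sim q}\!\left[\frac{p(x)}{q(x)}\, f(x)\right]
  \approx \frac{1}{N} \sum_{i=1}^{N} w_i\, f(x_i),
\qquad w_i = \frac{p(x_i)}{q(x_i)}, \quad x_i \sim q.
```

Samples that q over-produces receive small weights and samples that q rarely produces receive large weights, which is exactly the down-weighting and up-weighting described above.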

In practice, we generate a number of samples and every sample carries a weight. In this particular instance it works as follows: we count the number of possibilities we have at each step, so we basically decompose the sampling distribution. In the first step we have four possibilities, four free points adjacent to the current one; at the next step only three are available; and so we just unroll our decisions. Here we have only two choices available, all right? Because we grow the chain sequentially, the probability of generating that particular configuration also decomposes sequentially, and the final chain has this product as its probability of being generated. What they simply do is say: "Okay, this is the distribution from which we generate the sample, but we want to generate it uniformly at random, so we also decompose the weight and build the weight as the inverse of that probability." When we weight the samples by these weights we systematically de-bias the sampling distribution, so we get unbiased estimates of the expectation that we are interested in.
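
Here is a minimal Python sketch of that growth procedure for self-avoiding walks on the 2D lattice; the function name and details are my own illustration for this transcript, not the demo code used in the talk:

```python
import random

def grow_saw(n_steps):
    """Grow one self-avoiding walk via the growth (Rosenbluth) method.

    At every step we pick uniformly among the *free* neighbouring
    lattice sites and multiply the weight by the number of free sites,
    which is the inverse of the per-step sampling probability. Weighted
    averages over many such walks therefore de-bias the estimate back
    towards the uniform distribution over self-avoiding walks. A walk
    that gets trapped in a dead end receives weight 0.
    """
    walk = [(0, 0)]
    visited = {(0, 0)}
    weight = 1.0
    for _ in range(n_steps):
        x, y = walk[-1]
        free = [(x + dx, y + dy)
                for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if (x + dx, y + dy) not in visited]
        if not free:            # trapped: no admissible continuation
            return walk, 0.0
        weight *= len(free)     # compensate for restricting the choice
        site = random.choice(free)
        walk.append(site)
        visited.add(site)
    return walk, weight
```

An expectation over chains is then estimated as the weighted average, the sum of w_i f(walk_i) divided by the sum of the w_i.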

Let us take a look at where we were with the rejection sampler. This is the limit of what we could do with a rejection sampler. Now, with the growth method, we can go to significantly longer chains, to a chain length of 60, and then, again, the uncertainty estimates, the confidence intervals, blow up. So why is this? Why do the uncertainty intervals blow up in this improved method as well? Well, the thing is, the weights that we compute become very unbalanced. Even though we generate maybe a few thousand samples, only a few of them will hold a significant share of the total weight.

So here is a visualization of that. Here, I grow 50 chains in parallel, one step at a time, and I show you the normalized weights at each step. In the beginning everything is uniform, because in the beginning every chain has an equal number of possibilities. But over time, as I grow more and more, as I append more elements, the weights become very unbalanced, so that after 100 steps only about five elements have weights significantly different from zero. And this means we actually do not have 50 samples in this case, we only have five, and our estimates become very poor. This only gets amplified when you have a few thousand samples. One way to measure the quality of the samples we have generated is to ask, "Okay, I have generated, say, 5,000 samples with weights. How much are these worth for computing expectations, in terms of unweighted samples?"

Because unweighted samples are optimal, there is a quality measure you can compute, an estimated quantity called the effective sample size, which measures exactly the worth of a weighted sample set. And for the plot I have just shown you, with 5,000 samples, you see that it drops and drops and drops until it is almost down to one. So that is a real problem.
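
A common estimator of the effective sample size (there are several variants; this one is often attributed to Kish, and is my addition to the transcript, not a formula from the slides) is the squared sum of the weights divided by the sum of the squared weights:

```python
def effective_sample_size(weights):
    """Kish's effective sample size: (sum of w)^2 / (sum of w^2).

    Equals N when all N weights are identical (perfectly balanced)
    and approaches 1 when a single sample carries almost all of the
    total weight, matching the collapse described above.
    """
    total = sum(weights)
    return total * total / sum(w * w for w in weights)
```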

As you might guess, the next step is another improved Monte Carlo method, and indeed it improves on that; it is generally known as the Sequential Monte Carlo method. The idea is quite simple and natural, and it has been reinvented many times in different communities. The original paper is from 1959, but the idea has been reinvented in the signal processing community as the particle filter; it has been used and pioneered in computer vision by our Andrew Blake for tracking objects and their contours; and it is used across many different communities, often under very different names. But generally, Sequential Monte Carlo is the preferred name. The basic idea is quite simple. The problem is unbalanced weights, so we have to prevent the weights from becoming unbalanced at each step. The way we are going to do this is by introducing a process which removes samples that have low weight and duplicates samples that have high weight. That is called re-sampling.

So say, at a certain time step, we grow all the chains in parallel, say 50 chains in parallel; in this example it is only six chains. And we observe that the weights are unbalanced. Then we remove some of the low-weight instances and duplicate some of the high-weight instances, as shown here. The algorithm that corresponds to this is the same as before, just weighted sequential importance sampling, but we grow all the samples in parallel and monitor the weights. And if the weights get into trouble, if they become unbalanced, we enforce balanced weights again by removing low-weight samples and duplicating others.
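
A minimal sketch of such a re-sampling step, again my own Python illustration rather than code from the talk, could look like this. Multinomial re-sampling is the simplest scheme; practical SMC implementations often prefer systematic or residual re-sampling for lower variance:

```python
import random

def resample(samples, weights):
    """Multinomial re-sampling.

    Draws N indices with probability proportional to the weights, so
    high-weight samples tend to be duplicated and low-weight samples
    tend to be dropped, and then resets all weights to uniform.
    """
    n = len(samples)
    picked = random.choices(range(n), weights=weights, k=n)
    new_samples = [samples[i] for i in picked]
    new_weights = [1.0] * n   # balanced again after re-sampling
    return new_samples, new_weights
```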

Attendee: Question.

Speaker: Yes.

Attendee: In your little white chain example, the weights are going to be high when at every step you can take three choices, right? Like three times three times three, and that is going to get bigger. The long ones do not go near each other. So this is going to bias in favor of things that just sort of go off into the distance, rather than all the curly-wurly things, is that right?

Speaker: Right, exactly. Because the sampling distribution is biased towards compact configurations, the weights have to undo that bias by favoring the configurations that walk off and are no longer compact.

Attendee: But isn't that bad? Because then essentially we'll be dominated by all the ones that go off into the distance and we won't get any curly-wurly ones.

Speaker: It would be bad, but remember the sampling distribution that I showed above the weights: that sampling distribution has exactly the opposing bias. We generate from that distribution and we want the weights to compensate for its bias. So the samples that we get are more compact than they should be, and that is why we have to down-weight these, so they get a low weight, right? And we have to up-weight those samples that are not compact but go out into basically long chains.

Attendee: I did not get that, sorry. But never mind. I believe you.

Speaker: Okay. So we apply that re-sampling operation to compensate for the unbalanced-weights effect, and I plot the effective sample size again. Whenever the effective sample size drops below 2,500 in this case, I perform this re-sampling operation, reset the weights to uniform, and thereby push the effective sample size back up to 5,000. You see that, basically, I can control how unbalanced the weights become. Here is another visualization, the same kind of plot I had before, and now the red arrows indicate whenever I reset the weights to uniform and perform this re-sampling operation, so I can always ensure that my weights stay close to the uniform distribution.
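
Putting the pieces together, the adaptive scheme just described might look like the following sketch. It assumes a hypothetical extend_one_step helper that makes one growth move and returns the updated walk together with its incremental weight (the number of free neighbours, or 0 at a dead end), plus the effective_sample_size and resample helpers sketched earlier; the N/2 threshold mirrors the 2,500-out-of-5,000 rule from the talk:

```python
def smc_saw(n_particles, n_steps, extend_one_step):
    """Sequential Monte Carlo for growing self-avoiding walks.

    Grows n_particles walks in parallel and re-samples whenever the
    effective sample size drops below half the number of particles.
    (A production version would also guard against the rare case of
    all weights becoming zero.)
    """
    particles = [[(0, 0)] for _ in range(n_particles)]
    weights = [1.0] * n_particles
    for _ in range(n_steps):
        for i in range(n_particles):
            particles[i], step_weight = extend_one_step(particles[i])
            weights[i] *= step_weight   # 0 for a trapped walk
        if effective_sample_size(weights) < n_particles / 2:
            particles, weights = resample(particles, weights)
    return particles, weights
```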

So let us compare that again. This was the plot without re-sampling; you see that the uncertainty estimates indicate that our estimates are very unreliable. And this is with re-sampling. The whole family of Sequential Monte Carlo approaches really is state of the art. It scales almost without limit; people have used it to generate chains with over a million bonds. It is state of the art for any kind of probabilistic model that you can decompose sequentially: time series models, hidden Markov models, state space models, dynamic Bayesian networks. For all these kinds of models these methods are applicable and highly efficient.