History of Monte Carlo Methods - Part 3

Sebastian Nowozin - Fri 13 November 2015 -

This is the third part of a three part post. The first part covered the early history of Monte Carlo and the rejection sampling method, the second part covered sequential Monte Carlo.

Part 3

In this part we are going to look at Markov chain Monte Carlo.

The video file is also available for offline viewing as MP4/H.264 (114MB), WebM/VP8 (110MB), and WebM/VP9 (72MB) file.

(Click on the slide to advance, or use the previous/next buttons.)

(Also note there are three additional video visualization below.)

Transcript

(This is a slightly edited and link-annotated transcript of the audio in the above video. As this is spoken text, it is not as polished as my writing.)

Speaker: So this was one of family of Monte Carlo methods. I have too few time remaining but a little bit of time to talk about a completely different family of Monte Carlo methods and you may have heard this abbreviation before. It is called MCMC.

MCMC stands for Markov chain Monte Carlo and it is completely different from importance sampling. The basic difference is instead of growing a configuration or weighting configurations, I always have a certain state and I manipulate that state iteratively, and if I do this long enough then I have obtained a sample that is uniformly distributed. I will get into the details in a minute. I first want to talk briefly about the history.

It was invented by Marshall Rosenbluth but it is called the Metropolis algorithm. Why is that? Well, there were five authors on the paper. The first of which was Nicholas Metropolis. And I roughly sized the pictures according to contributions to the paper.

There are two different historical accounts of how the method came about and they agree on that Edward Teller posed the mathematical problem, and Marshall Rosenbluth solved it, and Arianna Rosenbluth implemented it, and the other two authors did not do anything. They ordered the author names alphabetically and Nicholas Metropolis happened to be head of the group at that time. In any case, two interesting things about it, all of these authors did not use the method in their following research. Also the method is now very, very popular. Actually, Marshall Rosenbluth afterwards founded the field of plasma physics, so he completely went into a different direction. At the turn of the 21st century, Jack Dongarra and Francis Sullivan, two researchers in scientific computing, were asking to compile a list of the top 10 algorithms of the 20th century and this was one of them in their list. And quicksort was another one. So it is really an important algorithm. First, the intuition, then a little bit formalization and how it applies to our problem.

So, here is the intuition. What the method does: it constructs a directed graph where each possible state is a node, and there are simple modifications you can make indicated by these directed arcs to transform one state into another state. For example, in our chain case, we could bend the chain at a certain node or we could maybe change the last state a little bit or something. Some simple transformations you can perform to transform one state into another state. Only a few for each state, only a few arcs that leave each state. And then it performs a random walk on this graph in such a way that if you perform the random walk long enough and then stop it, you are uniformly distributed across the whole graph, or according to some target probability distribution.

The graph is much too large to explicit construct, right? It has exponentially many configurations on our case, but it still it is able to perform a random walk on this graph. It is called Markov chain Monte Carlo because the basic concept is a Markov chain and I want to quickly introduce that concept to you. So here is a simple graph with just three states, A, B, and C. And imagine that you are standing at State B and you follow the very simple rule of sequentially walking along that graph. You see all the edges that leave your current state. They have numbers associated to them and these numbers sum up to one. So if you stand on State B, you have a 40% probability to move to State A, a 50% probability to move to State C, and a 10% probability to stay in State B. So you just follow that rule in each time step and you arrive at a new state.

(This is the video I showed, which is not visible in the slides above. The video file is also available as MP4/H.264 file (1MB).)

And when you do that and you record how often you reach a certain state, then this is a histogram you would obtain. So it seems to converge to some values after just a few hundred steps. And in fact, the limit distribution that you obtain ultimately is given by these numbers here. And it does not depend where you start, but it is completely independent of where you start.

The Metropolis algorithm solves the inverse problem. The inverse problem is the following. So here we had the rules how to walk and we just recalled that the limit distribution. What if you have a graph structure and you have some target probabilities that you want to realize? How should you put numbers at the edges to reach that limiting distribution? That was the problem that was posed by Edward Teller in essence.

Here we have that target distribution and the Metropolis algorithm is a constructive way to choose its transition probabilities. It assumes that you have a base Markov chain, so some basic random walk on the graph. So let us say, just uniformly go over all outgoing edges on that graph. And we could follow that random walk, right? It would not be the same limit distribution that we are interested in but we could follow that random walk and we would get a different limit distribution. And what it now does is whenever we transition from one state to the other, in addition, modulates that decision and has the option to reject that step. It can only accept or reject steps proposed by the base Markov chain. The final Markov chain of the Metropolis algorithm is this chain \(T\) which is the base Markov chain multiplied with its acceptance rate. And the acceptance rate is calculated according to that formula which has a quite simple interpretation.

The acceptance probability is high when the target probability, \(\pi_j\), is high but your current probability is low, when it's more likely to be in a target state or vice versa, if it is unlikely, you are more likely to stay at the current configuration. Or if the base chain pushes you to some other state, you divide by that probability to compensate for that bias of the base chain. So that is sort of the numerator and denominator have the two effects. And the remaining probability, everything that is sort of modulated down, that is the reject probability to stay in the current state.

(This is the video I showed, which is not visible in the slides above. The video file is also available as MP4/H.264 file (1MB).)

So let us do that simple calculation for our example here. For that limit distribution, we would obtain that values. And let us take a look at whether we converge to that limiting distribution. So the limiting distribution was 0.7, 0.1, 0.2, it converges slower than in the example before, but ultimately, we are guaranteed to converge to the limit distribution that we setup. It is not a unique way to construct such a Markov chain but it is a constructive way to do so.

In our (self-avoiding random walk) example, we want to walk on this graph. And what we are going to do is we just allow simple transformations, we pick a random element on that chain and bend it 90 degrees in a random fashion. So that gives us the arcs on that graph and you can imagine for a chain of a certain length, there are only so many ways which you can enumerate to bending the elements by 90 degrees. And if we happen to bend it in such a way that it is actually no longer self-avoiding, we can remove that and add that back to the probability of staying in a certain state. And remember we wanted to sample uniformly overall the states on the graph. So we just plug that into the acceptance rate calculation of the Metropolis algorithm and the \(\pi\), because it is uniformly distributed, just cancels out of that rate. So it is a very simple calculation.

And in practice, it is even simpler to implement that. We have a certain state. We just initialize it with any state we want. We propose a random modification and we accept or reject that, and then we have a new state, and we iterate, we accept or reject that. So we keep doing that and after a certain amount of time, we just keep the number of samples that we have generated and we can compute our expectations with that sample that we have generated. They are no longer independent samples because we have always modified them a little bit but they are a set of samples that we have generated. And this is the estimate, again, same curve that we had before, this time, with the MCMC approach. And I am almost out time but I want to take the last three slides here to show you our last method, the final method of the talk. Any questions on MCMC so far?

Attendee: If we said we accept or reject. We make a bend and we accept them. How do you decide whether to accept or reject, just whether it crosses itself.

Speaker: If it crosses itself, it is no longer a valid state, and we immediately reject. If it does not cross itself, you compute the acceptance probability by that formula and the acceptance probability maybe 0.8, and then you roll a random number between zero and one, uniformly distributed, and if that number is below the acceptance rate, then you accept. If it is above the acceptance rate that you have calculated, you reject. So if the acceptance rate is one, you always accept. If the acceptance rate is 0.5, you flip a coin uniformly to accept or reject.

Attendee: How would I calculate the acceptance rate in the self-avoiding random walk problem?

Speaker: That depends on this graph structure here. So every state in the graph has a certain set of allowed changes, right? For some state which is very compact in number of allowed changes becomes smaller. For some longer chain, you can basically bend it at any element, in any direction and it would still be a self-avoiding work.

Attendee: We have to check them all to count how many were self-intersecting?

Speaker: Yes, in some case you can remove it, I mean, not in this case. In some cases, you always have a probability mass everywhere and then not like a hard constraint like self-avoiding, then it cancels out as well. But in this case, we have to enumerate all of them.

(A warning in this edit: I cannot recommend learning about the simulated annealing by browsing the web as there is lots of misinformation around or special cases are described as simulated annealing, or the base Markov chain is not a reversible Markov chain, etc. See the references at the end of this transcript for good links if you want to learn about the method.)

The final method is called Simulated Annealing. It is a method to convert a Markov chain into an optimization method. As simple as that, a Markov chain into an optimization method. It was proposed in 1983 by Scott Kirkpatrick and co-workers and it is a very simple and often quite effective optimization method. And simple to implement. It can optimize over complex state spaces. And for that reason, it is very popular. So this Science paper that they published in 1983 has 28,000 citations (now 35,000, October 2015). And interestingly, Scott Kirkpatrick later in the '90s worked at the IBM T.J. Watson Research Center and there, at least he writes this, he invented the first pen-based tablet computer. So it is nice to see that he is innovative on quite different levels.

So how does it work? Say we have a function that we want to optimize. A very simple function here. There are only 40 possible inputs to that function. So in that case, we could simply enumerate all the 40 possible states and pick the one that is maximal, so we want to maximize that function. But imagine you have a different problem with exponentially many states so we can no longer list them and this is just for illustration. But imagine instead of 40, you would have 2 to the power of 40 or something. What we are going to do is we convert that function into a probability distribution and we do that by what is called a Gibbs distribution. So just a simple formula, where \(Z\) is a normalizing constant and the formula depends on the parameter \(T\), the so called temperature parameter. If the temperature parameter is very high, you divide that function value by a very large value and the argument almost does not matter. So the function value does not matter. In this case, the temperature is 100 and you see that the resulting distribution is almost uniform. Maybe hard to see but it is almost uniform because the temperature is quite high compared to the function value.

Attendee: What is \(Z\)?

Speaker: \(Z\) is a normalizing constant. So it is just a sum over all the possible configurations. But it is a constant and it just depends on \(T\). And interestingly, if we apply this Metropolis algorithm, the \(Z\) constant cancels out, it is not really important. You do not even need to write it down. You can just write \(P\) is proportional to \(X\), (\(P \propto X\)) or something, the constant is just a normalizing constant. If I decrease the temperature, you see, so this is temperature 10, temperature 4, 1, and now I decrease it even further, 0.1, the distribution puts more and more mass on the function values that are higher; and basically, what the simulated annealing does it runs a Markov chain, a Metropolis chain, for example. But, while it runs it, it modifies it by decreasing the temperature.

So it tries to shift all the probability mass in our current state as well to the states that have high function value. And how it does it, well, it chooses its schedule, a temperature schedule so on the x-axis here I have the steps that I take with the Markov chain and on the y-axis I have the temperature that I use, and I just decrease the temperature here, a geometric schedule so I just modify the temperature on each step with 0.99 or something.

For very high temperatures, the Markov chain is basically just a purely random walk; it does not even look at the function value. For very low temperatures, it basically is a local search algorithm; it only accepts improvements in the function value. But, for intermediate temperature values, it is something in between so it tries to optimize but it can still escape local minima.

So that is intuition. There's some theory to it actually, in another famous paper by Stuart and Don Geman, a 1984 paper which is actually famous for a different reason because they proposed another famous Monte Carlo method they Gibbs sampler in that paper but in that very paper they also have some theory of simulated annealing and they prove that if you decrease the temperature slow enough the probability is one to obtain the optimal state. But that optimal schedule is too slow in practice so you cannot use it and that is why we are still stuck with the geometric schedule.

Last minute, let us do simulated annealing and I go back to the more complicated model where we actually have this two types of elements: the black ones that attract each other and the white ones which are neutral. And you see here I plotted whenever two black elements are close to each other on this 2-D grid; I plotted a red line, and I am going to try to optimize the number of red lines so I try to get as many red connections as possible.

(This is the video I showed, which is not visible in the slides above. The video file is also available as MP4/H.264 file (5MB).)

And that is really a model for protein folding, folding in such a way that there are many black to black noncovalent bonds. So here, is an animation of performing simulated annealing exactly with that proposal that I had, bending it at 90 degrees left or right at a random position and, I do 100,000 steps and I show you every 100th step. At high temperatures, this is quite high temperatures still you see it is very stretched out, there are not very many compact structures but as the optimization proceeds the temperature decreases and it favors more and more compact configuration. So I think already at a step of, like now, you would see already quite some compact structures appearing.

So this is a purely random walk. I think it goes until 1,000. I can skip it in the interest of time but, I can show you the result. This is the configuration that we have obtained with 100,000 steps in our Markov chain in simulated annealing and for that model problem there is a paper that analyzes different model problems and the optimal configuration is known, the ground state is known and the ground state is slightly better. It has 23 connections. We only have 21 but actually, with such a simple method we have obtained a quite good solution and that is really the essence of why simulated annealing is popular.

We can often get quite far with allegedly fewer effort both in implementation and runtime. Although, there may be a better method in a specific domain, that is optimized for that. And let us reflect a little bit on what we did. We have solved a rather complicated problem like folding this 2D protein with a very, very simple method with just a random walk that accepts or rejects simple modifications.

And that is basically it. Before I have my last slide, I just want to say a little bit about the literature if you are interested in the Monte Carlo Methods I can highly recommend the first book and if you are interested in the black and white pictures I showed of the people that are relevant to the invention of the Monte Carlo method I can recommend the last book which is the autobiography of Stanislaw Ulam, it is very interesting.

So, thank you very much for your attention.

References

Here are some of the introductionary book references mentioned in the talk.

The historical context and anecdotes are mostly from the autobiography of Stan Ulam, Adventures of a Mathematician. The book is accessible to anyone with a basic high school math background. See also this kind 1978 review of the book.

A great, now somewhat dated introduction to Monte Carlo methods is Jun Liu's Monte Carlo Strategies in Scientific Computing. I learned Monte Carlo through this book and it has a spot in my bookshelf that is at arm-length from my desk.

The Liu book is somewhat dated and it covers a lot of ground; a slightly more formal but up-to-date Monte Carlo book is Art Owen's upcoming book Monte Carlo theory, methods and examples, which is excellent.

A highly accessible and very well written introduction to Markov chains and simple Monte Carlo methods is Olle Häggström's Finite Markov Chains and Algorithmic Applications. I recommend it if you want a most gentle introduction to the theory behind MCMC. Still the most authoritative reference on MCMC is the Handbook of Markov Chain Monte Carlo. In particular, the first twelve chapters cover are on general methodology and contain a wealth of information not found in other textbooks.