Becoming a Bayesian, Part 2

Sebastian Nowozin - Sat 02 May 2015 -

This post continues the previous post, part 1, outlining my criticisms of a ''naive'' subjective Bayesian viewpoint:

  1. The consequences of model misspecification.
  2. The ''model first computation last'' approach (the topic of this post).
  3. Denial of methods of classical statistics.

The ''Model First Computation Last'' Approach

Without a model (not necessarily probabilistic) we cannot learn anything. This is true for science, but it is also true for any machine learning system. The model may be very general and make only a few general assumptions (e.g. ''the physical laws remain constant over time and space''), or it may be highly specific (e.g. ''\(X \sim \mathcal{N}(\mu,1)\)''), but we need a model in order to relate observations to quantities of interest.
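As a minimal worked example of relating observations to a quantity of interest (assuming, purely for illustration, a conjugate prior \(\mu \sim \mathcal{N}(0, \tau^2)\) on the mean of the second model above): after observing \(x_1, \dots, x_n\) the posterior is

\[
\mu \mid x_{1:n} \sim \mathcal{N}\!\left( \frac{\tau^2 \sum_{i=1}^n x_i}{n\tau^2 + 1}, \; \frac{\tau^2}{n\tau^2 + 1} \right),
\]

that is, the model is what turns raw observations into a distribution over the unknown quantity \(\mu\).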

But in contrast to science, when we build machine learning systems we are also engineers. We do not build models in isolation or on a whiteboard; we build them to run on our current technology.

Many Bayesians adhere to a strict separation of model and inference procedure; that is, the model is independent of any inference procedure. They argue convincingly that the goal of inference is to approximate the posterior under the assumed model, and that for each model there exist a large variety of possible approximate inference methods that can be applied, such as Markov chain Monte Carlo (MCMC), importance sampling, mean field, belief propagation, etc. By selecting a suitable inference procedure, different accuracy and runtime trade-offs can be realized. In this viewpoint, the model comes first and computation comes last, once the model is in place.
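Here is a minimal sketch of this plug-and-play ideal, using the toy Gaussian-mean model from above (all function names, such as `log_joint`, `posterior_mean_grid`, and `posterior_mean_is`, are my own hypothetical choices, not from any particular library): the same unnormalized log-joint is handed to two interchangeable inference routines with different cost and accuracy.

```python
import numpy as np

def log_joint(mu, x, tau=3.0):
    """Unnormalized log p(mu, x): x_i ~ N(mu, 1) with prior mu ~ N(0, tau^2)."""
    return -0.5 * (mu / tau) ** 2 - 0.5 * np.sum((x - mu) ** 2)

def posterior_mean_grid(x, grid=np.linspace(-10.0, 10.0, 2001)):
    """Near-exact posterior mean by brute-force quadrature (feasible in 1-D only)."""
    logw = np.array([log_joint(m, x) for m in grid])
    w = np.exp(logw - logw.max())
    return np.sum(grid * w) / np.sum(w)

def posterior_mean_is(x, n=5000, seed=0):
    """Self-normalized importance sampling with a broad N(0, 5^2) proposal."""
    rng = np.random.default_rng(seed)
    mu = rng.normal(0.0, 5.0, size=n)
    logw = np.array([log_joint(m, x) for m in mu])
    logw += 0.5 * (mu / 5.0) ** 2          # minus log q(mu), up to an additive constant
    w = np.exp(logw - logw.max())
    return np.sum(mu * w) / np.sum(w)

x = np.random.default_rng(1).normal(2.0, 1.0, size=20)
print(posterior_mean_grid(x), posterior_mean_is(x))   # same model, two inference routines
```

In the ideal story, swapping `posterior_mean_is` for `posterior_mean_grid` (or for an MCMC sampler) changes only the cost and accuracy of the answer, never the model.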

In practice this beautiful story does not play out very often. More commonly, instead of spending time building and refining a model, time is spent tuning the parameters of the inference procedure (a small sketch after the list shows how many knobs even a toy sampler exposes), such as:

  • MCMC: Markov kernel, diagnostics, burn-in, possible extensions (annealing, parallel tempering ladder, HMC parameters, etc.);
  • Importance sampling: selecting the proposal distribution, effective sample size, possible extensions (e.g. multiple importance sampling);
  • Mean field and belief propagation: message initialization, schedule, damping factor, convergence criterion.
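To make the first item concrete, here is a toy random-walk Metropolis sampler (my own illustration, not tied to any particular paper), targeting the same toy posterior as in the earlier sketch. Even in this minimal setting the step size, burn-in, thinning, and initialization all have to be chosen, and a crude acceptance-rate diagnostic already appears.

```python
import numpy as np

def metropolis(log_target, init=0.0, step=0.5, n_samples=5000,
               burn_in=1000, thin=2, seed=0):
    rng = np.random.default_rng(seed)
    x, lp = init, log_target(init)
    chain, accepted = [], 0
    for i in range(burn_in + n_samples * thin):
        prop = x + step * rng.normal()            # random-walk proposal; 'step' must be tuned
        lp_prop = log_target(prop)
        if np.log(rng.uniform()) < lp_prop - lp:  # Metropolis accept/reject
            x, lp = prop, lp_prop
            accepted += 1
        if i >= burn_in and (i - burn_in) % thin == 0:
            chain.append(x)
    rate = accepted / (burn_in + n_samples * thin)
    return np.array(chain), rate                  # acceptance rate as a crude diagnostic

# The same toy Gaussian-mean posterior as before (tau = 3).
x_obs = np.random.default_rng(1).normal(2.0, 1.0, size=20)
samples, rate = metropolis(lambda m: -0.5 * (m / 3.0) ** 2
                                     - 0.5 * np.sum((x_obs - m) ** 2))
print(samples.mean(), rate)
```

None of these choices are part of the whiteboard model, yet each of them affects the answer we actually compute.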

In fact, it seems to me that many works describing novel models ultimately also describe the inference procedures required to make their models work. I say this not to diminish the tremendous progress we as a community have made in probabilistic inference; it is just an observation that the separation of model and inference is not plug-and-play in practice. (Other pragmatic reasons for deviating from the subjective Bayesian viewpoint are provided in a paper by Goldstein.)

Suppose we have a probabilistic model and we are provided an approximate inference procedure for it. Let us draw a big box around these two components and call this the effective model, that is, the system that takes observations and produces some probabilistic output. How similar is this effective model to the model on our whiteboard? I know of only a few results that address this question, for example Ruozzi's analysis of the Bethe approximation.

Another practical example along these lines, given to me by Andrew Wilson, is to compare an analytically tractable model such as a Gaussian process with Gaussian noise against a richer but intractable model such as a Gaussian process with Student-t noise. The latter model is formally more expressive but requires approximate inference. In this case the approximate inference implicitly changes the model, and it is not at all clear whether it is worth giving up analytic tractability.
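For reference, the standard closed form that makes the Gaussian-noise case tractable: with kernel matrix \(K\) over the training inputs, kernel vector \(k_*\) between the training inputs and a test input \(x_*\), observations \(y\), and noise variance \(\sigma^2\), the predictive distribution at \(x_*\) is Gaussian with

\[
\mu_* = k_*^\top \left( K + \sigma^2 I \right)^{-1} y, \qquad
\sigma_*^2 = k(x_*, x_*) - k_*^\top \left( K + \sigma^2 I \right)^{-1} k_* .
\]

Replacing the Gaussian likelihood with a Student-t likelihood destroys this conjugacy: the posterior no longer has a closed form and must itself be approximated (for example by variational methods, expectation propagation, or MCMC), which is exactly where the effective model starts to drift away from the whiteboard model.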

Resource-Constrained Reasoning

It seems that, compared to machine learning, the field of artificial intelligence is somewhat ahead here: in 1987 Eric Horvitz presented a nice paper at UAI on reasoning and decision making under limited resources. Read liberally, the problem he described in 1987 of adhering to the orthodox (normative) view mirrors the issues faced today by large-scale probabilistic models in machine learning, namely that exact analysis is intractable in all but the simplest models, and that resource constraints are made explicit neither in the model nor in the inference procedure.

But some recent work gives me new hope that we will treat computation as a first-class citizen when building our models; here is some of that work from the computer vision and natural language processing communities:

Cheng Soon Ong pointed me to work on anytime probabilistic inference, which I am not familiar with, but the goal of having inference algorithms that adapt to the available resources is certainly desirable. The anytime setting is practically relevant in many applications, particularly in real-time systems.

All these works share the characteristic that they take a probabilistic model and an approximate inference procedure and construct a new "effective model" by entangling the two. The resulting model is tractable by construction and retains, to a large extent, the specification of the original intractable model. However, the separation between model and inference procedure is lost.
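As a caricature of this construction (my own toy example, not taken from any of the works above): take naive mean-field for an Ising-style binary denoising model and unroll a fixed number of damped updates into the model's forward pass. The result is tractable by construction, since its cost is fixed in advance, but the "model" is now the unrolled computation rather than the original distribution.

```python
import numpy as np

def unrolled_mean_field(y, coupling=0.8, unary=1.5, n_iters=10, damping=0.5):
    """y: noisy image with entries in {-1, +1}. Returns mean-field marginals m in [-1, 1]
    for p(x) ∝ exp(unary * sum_i y_i x_i + coupling * sum_{i~j} x_i x_j)."""
    m = np.zeros_like(y, dtype=float)                  # initialize all marginals at 0
    for _ in range(n_iters):                           # fixed budget: tractable by construction
        # sum of neighboring marginals on a 4-connected grid (periodic boundary, for brevity)
        nb = (np.roll(m, 1, 0) + np.roll(m, -1, 0) +
              np.roll(m, 1, 1) + np.roll(m, -1, 1))
        m_new = np.tanh(unary * y + coupling * nb)     # naive mean-field fixed-point update
        m = damping * m + (1.0 - damping) * m_new      # damping: yet another knob
    return m

rng = np.random.default_rng(0)
clean = np.ones((32, 32)); clean[:, 16:] = -1.0        # a simple two-region image
noisy = np.where(rng.random(clean.shape) < 0.2, -clean, clean)
denoised = np.sign(unrolled_mean_field(noisy))
print((denoised == clean).mean())                      # fraction of pixels recovered
```

The number of iterations, the damping factor, and the initialization are now part of the model specification itself, which is precisely the entanglement described above.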

This is a first step towards a computation-first approach, and I believe we will see more machine learning work that treats the available computational primitives and resources as no less important than the model specification itself.

Acknowledgements. I thank Jeremy Jancsary, Peter Gehler, Christoph Lampert, Andrew Wilson, and Cheng Soon Ong for feedback.