Becoming a Bayesian, Part 1

Sebastian Nowozin - Sun 19 April 2015

I have used probabilistic models for a number of years now and over this time I have used different paradigms to build my models, to estimate them from data, and to perform inference and predictions.

Overall I have slowly become a Bayesian; however, it has been a rough journey. When I say that "I became a Bayesian" I mean that my default view on a problem is now to think about a probabilistic model that relates observables to quantities of interest, together with suitable prior distributions for any unknowns present in this model. When it comes to solving the practical problem with a computer program, however, I am ready to depart from the model on my whiteboard whenever the advantages of doing so are large enough, for example in simplicity, runtime speed, or tractability. Some of my recent work reflects this pragmatism.

However, I have remained skeptical of a naive and unconditional adoption of the subjective Bayesian viewpoint. In particular, I object to the view that every model and every system ought to be Bayesian, or, at the very least, that any useful statistical system should admit an approximate Bayesian interpretation. In this post and the following two posts I will try to explain my skepticism.

There is a risk here of barking up the wrong tree by attacking a caricature of a Bayesian, which is not my intention. In fact, to be frank, every one of the researchers I have interacted with in the past few years holds a nuanced view of their principles and methods, is more often than not aware of those principles' limitations, and is willing to adjust if circumstances require it.

Let me summarize the subjective Bayesian viewpoint. In my experience this view of the world is arguably the most prevalent among Bayesians in the machine learning community, for example at NIPS and at machine learning summer schools.

The Subjective Bayesian Viewpoint

The subjective Bayesian viewpoint on any system under study is as follows:

  • Specify a probabilistic model relating what is known to what is unknown;
  • Specify a proper prior probability distribution over unknowns based on any information that is available to you;
  • Obtain the posterior distribution over unknowns given the known data (using Bayes rule);
  • Draw conclusions based on the posterior distribution; for example, solve a decision problem or select a model among the alternative models.

This approach is used exclusively, for any statistical problem that may arise. It is strongly advocated, for example by Lindley and in a paper by Michael Goldstein.

Alternative Bayesian views deviate from this recipe. For example, they may allow for improper prior distributions or instead aim to select uninformative prior distributions, or even select the prior as a function of the inferential question at hand.

Criticism

My main criticisms of a "naive" subjective Bayesian viewpoint relate to the following three points:

  1. The consequences of model misspecification.
  2. The "model first, computation last" approach.
  3. The denial of methods of classical statistics.

The Consequences of Model Misspecification

To model some system in the world we often use probabilistic models of the form

$$p(x;\theta),\qquad \theta \in \Theta,$$

where \(x \in \mathcal{X}\) is a random variable of interest and \(\Theta\) is the set of possible parameters \(\theta\). We are interested in \(p(x)\) and thus would like to find a suitable parameter given some observed data \(x_1, x_2, \dots, x_n \in \mathcal{X}\). Because we can never be entirely certain about our parameters we may represent our current beliefs through a posterior distribution \(p(\theta|x_1,\dots,x_n)\).
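
As a concrete, if deliberately simple, instance of this setup, the sketch below (my own illustration, not from the original text) uses a Bernoulli model \(p(x;\theta)\) with a conjugate Beta prior, for which the posterior over \(\theta\) is available in closed form.

```python
import numpy as np
from scipy import stats

# Bernoulli model p(x; theta) with theta in [0, 1] and a Beta(a0, b0) prior.
# After observing x_1, ..., x_n the posterior is Beta(a0 + #ones, b0 + #zeros).
rng = np.random.default_rng(0)
theta_true = 0.3
x = rng.binomial(1, theta_true, size=50)       # observed data x_1, ..., x_n

a0, b0 = 1.0, 1.0                              # uniform prior over theta
posterior = stats.beta(a0 + x.sum(), b0 + len(x) - x.sum())

print("posterior mean:", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))
```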

Misspecification is the case when no parameter in \(\Theta\) leads to a distribution \(p(x;\theta)\) that behaves like the true distribution. This is not exceptional; in fact most models of real-world systems are misspecified. It is also not a property of any particular inferential approach but rather a fundamental limitation of building expressive models given our limited knowledge. If we could observe all relevant quantities and knew their deterministic relationships we would not need a probabilistic model. Hence the need for a probabilistic model arises because we cannot observe everything and we do not know all the dependencies that exist in the real world. (Alas, as Andrew Wilson pointed out to me, the previous two sentences expose my deterministic world view.) So what can be said about this common case of misspecified models?
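
Before turning to that question, a small concrete picture may help. In the sketch below (my own illustration; the choice of distributions is arbitrary) the model family is Gaussian while the data actually come from a skewed exponential distribution, so no parameter \(\theta = (\mu, \sigma)\) reproduces the true behaviour, no matter how much data we collect.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=100_000)   # the "true" data-generating process

# Maximum likelihood fit within the (misspecified) Gaussian family; this is the
# member of the family closest to the truth in KL divergence.
fitted = stats.norm(x.mean(), x.std())

# Even with plenty of data the fitted model gets simple event probabilities wrong.
print("P(x > 3)  empirical:", (x > 3).mean())        # close to exp(-3) ~ 0.05
print("P(x > 3)  fitted   :", fitted.sf(3.0))        # noticeably smaller (~0.02)
print("P(x < 0)  empirical:", (x < 0).mean())        # exactly 0
print("P(x < 0)  fitted   :", fitted.cdf(0.0))       # clearly positive (~0.16)
```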

Let us talk about calibration of probabilities, and about what happens when your model is wrong. Informally, you are well-calibrated if you neither overestimate nor underestimate the probability of certain events. Crucially, this does not imply any particular degree of certainty, only that your uncertain statements (forecasted probabilities of events) are on average correct.

For any probabilistic model, being well-calibrated is a desirable goal, and there are various methods to assess calibration and to check the forecasts of your model. In a seminal 1982 paper, Dawid established a general theorem whose consequence (Section 4.3 of that paper) is to guarantee that a Bayesian using a parametric model will eventually be well-calibrated.
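
One simple and standard such check (sketched below in my own words, not taken from Dawid's paper) is to bin the forecast probabilities and compare each bin's average forecast with the frequency at which the event actually occurred.

```python
import numpy as np

def calibration_table(forecast_probs, outcomes, n_bins=10):
    """Compare average forecast probability with observed frequency per bin."""
    forecast_probs = np.asarray(forecast_probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (forecast_probs >= lo) & (forecast_probs < hi)
        if mask.any():
            rows.append((lo, hi, forecast_probs[mask].mean(),
                         outcomes[mask].mean(), int(mask.sum())))
    return rows

# A forecaster that is well-calibrated by construction: each outcome is drawn
# with exactly the stated probability, so forecast and frequency should agree.
rng = np.random.default_rng(2)
p = rng.uniform(size=20_000)
y = rng.binomial(1, p)
for lo, hi, f, obs, n in calibration_table(p, y):
    print(f"[{lo:.1f}, {hi:.1f})  mean forecast {f:.2f}  observed freq {obs:.2f}  (n={n})")
```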

Dawid's result is reassuring, except there is one catch: it does not apply when the model is misspecified. Unfortunately, in most practical applications of probabilistic modelling, misspecification is the rule rather than the exception ("all models are wrong"). We could hope for a "graceful degradation", in that we remain at least approximately calibrated. But this is not the case.

Calibration and Misspecification

In the misspecified case, there are simple examples due to Brad DeLong and Cosma Shalizi where the beliefs in a parametric model do not converge and become less calibrated over time. In their example two things happen at the same time that should not go together: the beliefs become very confident, yet a single new observation can swing the belief to the other extreme, again with high confidence.
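
The flavour of these examples is easy to reproduce. The sketch below is my own reconstruction in the spirit of Shalizi's version; the specific numbers are illustrative. The model admits only two hypotheses, \(N(+1,1)\) and \(N(-1,1)\), while the data actually come from a much wider \(N(0,10^2)\). The posterior log-odds then change by \(2x_i\) per observation, so the posterior sits at a confident extreme most of the time, yet single large observations can and do flip it to the opposite extreme.

```python
import numpy as np
from scipy import stats
from scipy.special import expit

rng = np.random.default_rng(3)
x = rng.normal(0.0, 10.0, size=200)            # true data: N(0, 10^2)

# Misspecified model: only two hypotheses, N(+1, 1) and N(-1, 1), equal prior odds.
# The log posterior odds of theta=+1 vs theta=-1 after x_1..x_n; each observation
# contributes 2 * x_i, so a single draw of magnitude ~10-30 can overturn an
# otherwise confident posterior.
log_odds = np.cumsum(stats.norm.logpdf(x, loc=+1.0, scale=1.0)
                     - stats.norm.logpdf(x, loc=-1.0, scale=1.0))
posterior_plus = expit(log_odds)               # posterior probability of theta=+1

favours_plus = (posterior_plus > 0.5).astype(int)
flips = int(np.sum(np.diff(favours_plus) != 0))
extreme = np.mean((posterior_plus > 0.99) | (posterior_plus < 0.01))
print("flips between the two hypotheses:", flips)
print("fraction of time at >99% confidence in either one:", extreme)
```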

Improving the model?

One can object that in these examples, and more generally, one should revise the model to more accurately reflect the system under study. But then, in order not to end up in an infinite loop of model improvement, how do we determine when to stop? Indeed, how do we even determine the accuracy of the model? Model evidence cannot be used for this purpose, as it is conditioned on the set of models under consideration. (In fact, in DeLong's example the evidence would assure us that everything is fine.) The answers to how models can be criticized and improved are not simple, and quite likely not Bayesian.

Andrew Gelman and Cosma Shalizi discuss this issue and others in a position paper, and I find myself agreeing with their assessment that there is no answer to wrong model assumptions within the (strictly) subjective Bayesian viewpoint:

"We fear that a philosophy of Bayesian statistics as subjective, inductive inference can encourage a complacency about picking or averaging over existing models rather than trying to falsify and go further. Likelihood and Bayesian inference are powerful, and with great power comes great responsibility. Complex models can and should be checked and falsified."

Non-parametric Models to the Rescue?

Another objection is that this is all well-known and hence we should use non-parametric models which endow us with prior support over essentially all reasonable alternatives.

Unfortunately, while the resulting models are richer and practically useful in real applications, we may now face other problems: even when there is prior support for the true model, simple properties like consistency (which were guaranteed to hold in the parametric case) can no longer be taken for granted. The current literature and basic results on this topic are nicely summarized in Section 20.12 of DasGupta's book.

Conclusion

Misspecification is not a Bayesian problem; it applies equally to other estimation approaches, see for example the book by White for the case of maximum likelihood estimation. However, a subjective Bayesian has no Bayesian means to test for the presence of misspecification, and that makes it hard to deal with its consequences.

There are some ideas for applying Bayesian inference in a misspecification-aware manner, for example the Safe Bayesian approach, and an interesting analysis of approximate Bayesian inference using the bootstrap in a relatively unknown paper by Fushiki.

Are these alternatives practical, and do they somehow overcome the misspecification problem? To be frank, I am not aware of any satisfactory solution; common practice seems to be careful model criticism using tools such as predictive model checking and graphical inspection. But these require first acknowledging the problem.

When the model is wrong, it would be reassuring to have:

  • a reliable diagnostic and quantification of how wrong it is (say, an estimate of \(D(q\|p^*)\), where \(q\) is the true distribution and \(p^*\) the model actually used), and
  • a test for whether the type of model error present matters for making certain predictions (say, an error bound on the deviation of certain expectations, \(\mathbb{E}_q[f(x)] - \mathbb{E}_{p^*}[f(x)]\) for a given function \(f\)); a rough sketch of such a check appears below.

To me it appears the (pure) subjective Bayesian paradigm cannot provide the above.
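
For the second of the two points above, here is a rough, entirely non-Bayesian sketch (my own illustration, with an arbitrary choice of \(q\), \(p^*\) and \(f\)) of how one might at least estimate the deviation for a particular functional of interest, using held-out data as a stand-in for the true distribution \(q\).

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative setup: q is exponential, p* is a Gaussian fitted by maximum
# likelihood, and f(x) = 1{x > 3} is the prediction we actually care about.
train = rng.exponential(1.0, size=10_000)
holdout = rng.exponential(1.0, size=10_000)     # held-out samples from q

mu_hat, sigma_hat = train.mean(), train.std()   # fitted (misspecified) model p*
model_samples = rng.normal(mu_hat, sigma_hat, size=100_000)

f = lambda z: (z > 3.0).astype(float)
gap = f(holdout).mean() - f(model_samples).mean()
print("estimated E_q[f] - E_p*[f]:", gap)       # prediction error due to model error
```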

Addendum

Andrew Wilson pointed out to me that in most statistical problems we cannot know the true distribution, even in principle. I agree, and indeed pursuing such an elusive ideal may divert our attention away from the practical issue of building a model that is good enough for the task at hand. I entirely agree with taking such a pragmatic stance, and this follows Francis Bacon's ideal of assessing the worth of a model (a scientific theory in his case) not by an abstract ideal of truthfulness, but instead by its utility.

In machine learning and most industrial applications building the model is easy because we merely focus on predictive performance, which can be reliably assessed using holdout data. For scientific discovery, however, things are more subtle: our goal is to establish the truth of certain statements with sufficient confidence, but this truth is only a conditional truth, conditioned on the assumptions we have to make.

A Bayesian makes all assumptions explicit and then proceeds by formally treating them as true, correctly inferring the consequences. A classical/frequentist approach also makes assumptions by positing a model, but may then be able to make statements that hold uniformly over all possibilities encoded in the model. Therefore, in my mind the Bayesian is an optimist, believing entirely in their assumptions, whereas the classical statistician is more pessimistic, believing in the model but providing worst-case results over all possibilities it allows. Misspecification affects both approaches.

If you want to continue reading, the second part of this post is now available.

Acknowledgements. I thank Jeremy Jancsary, Peter Gehler, Christoph Lampert, and Andrew Wilson for feedback.