I have used probabilistic models for a number of years now and over this time I have used different paradigms to build my models, to estimate them from data, and to perform inference and predictions.

Overall I have slowly become a Bayesian; however, it has been a rough journey. When I say that "I became a Bayesian" I mean that my default view on a problem is now to think about a probabilistic model that relates observables to quantities of interest, and about suitable prior distributions for any unknowns present in this model. When it comes to solving the practical problem using a computer program, however, I am ready to depart from the model on my whiteboard whenever the advantages of doing so are large enough, for example in simplicity, runtime speed, or tractability. Some recent work to that end:

- Our work on informed sampling for generative computer vision models with Varun, Matthew and Peter, where we argue for a generative and Bayesian approach to computer vision problems;
- Our Bayesian NMR work (and here) with Andrew Wilson and collaborators from the chemistry department at the University of Cambridge, in which we took a fully Bayesian viewpoint, with great success over conventional NMR Fourier analysis;
- Our work on using GPs for structured prediction with Sebastien, Novi, and Zoubin, which was motivated by the struggle to scale up a conceptually satisfying model;
- My work on maximum expected utility in some structured prediction models at CVPR 2014, which was motivated by applying basic decision theory, but ended up trying to cope with resulting intractabilities.

However, I have remained skeptical of the naive and unconditional adoption of
the *subjective Bayesian viewpoint*.
In particular, I object to the view that every model and every system
ought to be Bayesian, or, at the very least, that any useful statistical
system should have an approximate Bayesian interpretation.
In this post and the following two posts I will try to explain my skepticism.

There is a risk of barking up the wrong tree by attacking a caricature of a Bayesian here, which is not my intention. In fact, to be frank, every one of the researchers I have interacted with in the past few years holds a nuanced view of their principles and methods and more often than not is aware of their principles' limitations and willing to adjust if circumstances require it.

Let me summarize the subjective Bayesian viewpoint. In my experience this view of the world is arguably the most prevalent among Bayesians in the machine learning community, for example at NIPS and at machine learning summer schools.

### The Subjective Bayesian Viewpoint

The subjective Bayesian viewpoint on any system under study is as follows:

- Specify a probabilistic model relating what is known to what is unknown;
- Specify a proper prior probability distribution over unknowns based on any information that is available to you;
- Obtain the posterior distribution over unknowns given the known data (using Bayes rule);
- Draw conclusions based on the posterior distribution; for example, solve a decision problem or select a model among the alternative models.

This approach is used exclusively for any statistical problem that may arise. This approach is strongly advocated, for example by Lindley and in a paper by Michael Goldstein.
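To make the recipe concrete, here is its simplest instance, a conjugate Beta-Bernoulli model; the data and the Beta(2, 2) prior are my own illustrative choices:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Steps 1+2: model x_i ~ Bernoulli(theta), with a proper Beta(2, 2)
# prior over the unknown theta.
a0, b0 = 2.0, 2.0

# The known data: 50 binary observations.
x = rng.binomial(1, 0.7, size=50)

# Step 3: Bayes' rule. The Beta prior is conjugate to the Bernoulli
# likelihood, so the posterior is again a Beta distribution.
a_post = a0 + x.sum()
b_post = b0 + len(x) - x.sum()
posterior = stats.beta(a_post, b_post)

# Step 4: draw conclusions from the posterior, e.g. a point estimate
# and a 95% credible interval.
print("posterior mean:", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))
```

Everything downstream (decisions, predictions) is then computed from `posterior`, never from the data directly.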

Alternative Bayesian views deviate from this recipe. For example, they may
allow for *improper* prior distributions or instead aim to select
uninformative prior distributions, or even select the prior as a function of
the inferential question at hand.

# Criticism

My main criticisms of a ''naive'' subjective Bayesian viewpoint relate to the following three points:

- The consequences of model misspecification.
- The ''model first, computation last'' approach.
- Denial of methods of classical statistics.

## The Consequences of Model Misspecification

To model some system in the world we often use probabilistic models of the form

\[
p(x;\theta), \qquad \theta \in \Theta,
\]

where \(x \in \mathcal{X}\) is a random variable of interest and \(\Theta\) is the set of possible parameters \(\theta\). We are interested in \(p(x)\) and thus would like to find a suitable parameter given some observed data \(x_1, x_2, \dots, x_n \in \mathcal{X}\). Because we can never be entirely certain about our parameters we may represent our current beliefs through a posterior distribution \(p(\theta|x_1,\dots,x_n)\).

*Misspecification* is the case when no parameter in \(\Theta\) leads to a
distribution \(p(x;\theta)\) that behaves like the true distribution.
This is not exceptional; in fact, most models of real-world systems are
misspecified.
It also is not a property of any inferential approach but rather a fundamental
limitation of building expressive models given our limited knowledge. If we
could observe all relevant quantities and know their deterministic relationships we
would not need a probabilistic model.
Hence the need for a probabilistic model arises because we cannot observe
everything and we do not know all the dependencies that exist in the real
world. (Alas, as Andrew Wilson pointed out to me, the previous two sentences
expose my deterministic world view.)
So what can be said about this common case of misspecified models?

Let us talk about calibration of probabilities, and about what happens when your
model is wrong.
Informally, you are *well-calibrated* if you neither overestimate nor
underestimate the probability of certain events.
Crucially, this does not imply a degree of certainty, only that your
uncertain statements (forecasted probabilities of events) are on average
correct.

For any probabilistic model, being well-calibrated is a desirable goal. There are various methods to assess calibration and to check the forecasts of your model. In 1982 Dawid, in a seminal paper, established a general theorem whose consequence (in Section 4.3 of that paper) is to guarantee that a Bayesian using a parametric model will eventually be well-calibrated.
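One such assessment can be sketched as a reliability check: bin the forecast probabilities and compare each bin's average forecast with the empirical frequency of the event. A minimal sketch, using a synthetic forecaster that is well-calibrated by construction:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic setting: forecast probabilities p_i for binary events y_i.
# The events are drawn as y_i ~ Bernoulli(p_i), so this forecaster is
# well-calibrated by construction.
p = rng.uniform(0.0, 1.0, size=100_000)
y = rng.binomial(1, p)

# Reliability check: within each forecast bin, the empirical event
# frequency should match the average forecast probability.
edges = np.linspace(0.0, 1.0, 11)
for b in range(10):
    mask = (p >= edges[b]) & (p < edges[b + 1])
    print(f"bin [{edges[b]:.1f}, {edges[b + 1]:.1f}): "
          f"forecast {p[mask].mean():.3f}, empirical {y[mask].mean():.3f}")
```

A miscalibrated forecaster shows up as a systematic gap between the two columns; plotting one against the other gives the familiar reliability diagram.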

This is reassuring, except there is one catch: it does not apply in the case when the model is misspecified. Unfortunately, in most practical applications of probabilistic modelling, misspecification is the rule rather than the exception (''All models are wrong''). We could hope for a ''graceful degradation'', in that we are still at least approximately calibrated. But this is not the case.

### Calibration and Misspecification

In the misspecified case, there are simple examples due to Brad DeLong and Cosma Shalizi where beliefs in a parametric model do not converge and become less calibrated over time. In their example two contradictory things happen at the same time: the beliefs become very confident, yet a single new observation revises the belief to the opposite extreme, which is again held with confidence.
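A few lines suffice to simulate an example in this spirit; the specific numbers below are my own choice. The true coin is fair, but the model class contains only two near-deterministic coins:

```python
import numpy as np

rng = np.random.default_rng(2)

# True data-generating process: a fair coin. Neither candidate model
# matches it, so the model class is misspecified.
x = rng.binomial(1, 0.5, size=20)

# Two near-deterministic candidate models with equal prior probability.
eps = 1e-6
theta_a, theta_b = 1.0 - eps, eps  # P(x=1) under model A and model B

log_odds = 0.0  # log P(A | data) - log P(B | data)
posteriors = []
for t, xi in enumerate(x, 1):
    log_odds += np.log(theta_a if xi else 1.0 - theta_a)
    log_odds -= np.log(theta_b if xi else 1.0 - theta_b)
    p_a = 1.0 / (1.0 + np.exp(-log_odds))
    posteriors.append(p_a)
    print(f"after {t:2d} observations: P(A | data) = {p_a:.6f}")
```

Whenever heads and tails are not exactly balanced, the posterior sits within about \(10^{-6}\) of 0 or 1, and it jumps between these extremes as the running difference of heads and tails crosses zero: confident beliefs, revised wholesale by single observations.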

### Improving the model?

One can object that in these examples, and more generally, one should revise the model to more accurately reflect the system under study. But then, in order not to end up in an infinite loop of model improvement, how do we determine when to stop? Indeed, how do we even determine the accuracy of the model? Model evidence cannot be used to this end, as it is conditioned on the set of models under consideration. (In fact, in DeLong's example the evidence would assure us that everything is fine.) The answers to how models can be criticised and improved are not simple, and quite likely not Bayesian.

Andrew Gelman and Cosma Shalizi discuss this issue and others in a position paper, and I find myself agreeing with their assessment that there is no answer to wrong model assumptions within the (strictly) subjective Bayesian viewpoint:

"We fear that a philosophy of Bayesian statistics as subjective, inductive inference can encourage a complacency about picking or averaging over existing models rather than trying to falsify and go further. Likelihood and Bayesian inference are powerful, and with great power comes great responsibility. Complex models can and should be checked and falsified."

### Non-parametric Models to the Rescue?

Another objection is that this is all well-known and hence we should use non-parametric models which endow us with prior support over essentially all reasonable alternatives.

Unfortunately, while the resulting models are richer and practically
useful in real applications, we may now face other problems: even when there
is prior support for the true model, simple properties like *consistency*
(which were guaranteed to hold in the parametric case) can no longer be taken
for granted. The current
literature and basic results on this topic are nicely summarized in
Section 20.12 of DasGupta's
book.

### Conclusion

Misspecification is not a specifically Bayesian problem; it applies equally to other estimation approaches, for example to maximum likelihood estimation, see the book by White. However, a subjective Bayesian has no Bayesian means to test for the presence of misspecification, and that makes it hard to deal with its consequences.

There are some ideas for applying Bayesian inference in a
misspecification-aware manner, for example the *Safe
Bayesian* approach, and an
interesting analysis of approximate Bayesian inference using the Bootstrap in
a relatively unknown paper of
Fushiki.

Are these alternatives practical and do they somehow overcome the misspecification problem? To be frank, I am not aware of any satisfactory solution and common practice seems to be a careful model criticism using tools such as predictive model checking and graphical inspection. But these require first acknowledging the problem.
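A predictive model check can be sketched as follows: fit the model, simulate replicated data sets from it, and compare a test statistic on the real data against its distribution over the replicates. The setup below is my own toy construction, with a plug-in fit standing in for a full posterior to keep it short:

```python
import numpy as np

rng = np.random.default_rng(3)

# Real data from a skewed distribution; our model is a (misspecified)
# Gaussian, which is symmetric.
x = rng.exponential(1.0, size=200)

# Plug-in fit, a shortcut standing in for a full posterior.
mu, sigma = x.mean(), x.std()

# Compare a test statistic (sample skewness) between the observed data
# and replicated data sets simulated from the fitted model.
def skew(z):
    return np.mean(((z - z.mean()) / z.std()) ** 3)

t_obs = skew(x)
t_rep = np.array([skew(rng.normal(mu, sigma, size=len(x)))
                  for _ in range(1000)])
p_value = np.mean(t_rep >= t_obs)
print(f"observed skewness {t_obs:.2f}, predictive p-value {p_value:.3f}")
```

The replicates from the symmetric model almost never reach the skewness of the real data, so the tiny predictive p-value flags the misspecification, without any reference to a competing model.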

When the model is wrong, it would be reassuring to have:

- a reliable diagnostic and quantification of how wrong it is (say, an estimate of \(D(q\|p^*)\), where \(q\) is the true distribution and \(p^*\) the model's distribution), and
- a test for whether the type of model error present will matter for making certain predictions (say, an error bound on the deviation of certain expectations, \(\mathbb{E}_q[f(x)] - \mathbb{E}_{p^*}[f(x)]\) for a given function \(f\)).

To me it appears the (pure) subjective Bayesian paradigm cannot provide the above.

### Addendum

Andrew Wilson pointed out to me that in most statistical problems we
cannot know the *true distribution*, even in principle. I agree: pursuing
such an elusive ideal may divert our attention away from
the practical issue of building a model good enough for the task at hand.
This pragmatic stance follows Francis
Bacon's ideal of judging the worth of a model (a scientific theory in his
case) not by an abstract ideal of truthfulness, but by its utility.

In machine learning and most industrial applications, building the model is
*easy* because we merely focus on predictive performance, which can be
reliably assessed using holdout data.
For scientific discovery, however, things are more subtle: our goal is
to establish the truth of certain statements with sufficient confidence,
yet this truth is only a conditional truth, conditioned on the assumptions we
have to make.

A Bayesian makes all assumptions explicit and then proceeds by formally treating them as truth, correctly inferring the consequences. A classical/frequentist approach also makes assumptions by positing a model, but then may be able to make statements that hold uniformly over all possibilities encoded in the model. Therefore, in my mind the Bayesian is an optimist, believing entirely in their assumptions, whereas the classical approach is more pessimistic, believing in their model but then providing worst-case results over all possibilities. Misspecification affects both approaches.

If you want to continue reading, the second part of this post is now available.

*Acknowledgements*. I thank Jeremy Jancsary,
Peter Gehler,
Christoph Lampert, and
Andrew Wilson for feedback.