Bayesian P-Values

Sebastian Nowozin - Sat 27 June 2015 -

P-Values (see also Jim Berger's page on p-values) are probably one of the most misunderstood concepts in statistics and certainly have been abused in statistical practice. Originally proposed as an informal diagnostic by Ronald Fisher, there are many reasons for the bad reputation of p-values, and in many relevant situations good alternatives such as Bayes factors can and should be used instead. One key objection to p-values is that although they provide statistical evidence against an assumed hypothesis, this does not imply that the deviation from the hypothesis is large or relevant. In practice, the largest criticisms are not related to the p-value itself but related to the widespread misunderstanding of p-values and the arbitrariness of accepting formal tests of significance based on p-values in scientific discourse.

In this article I am not going to defend p-values, also because others have done a good job at giving a modern explanation of their benefits in context, as well as refuting some common criticisms, for example the article Two cheers for P-values? by Stephen Senn and the more recent In defense of P values by Paul Murtaugh.

Instead, I will consider a situation which often arises in practice.


Suppose you have decided on a probabilistic model \(P(X)\) or \(P(X|\theta)\), where \(\theta\) is some unknown parameter. With decided I mean that we actually commit and ship our model and we no longer entertain alternative models. Alternatives could be too expensive computationally or it could be too difficult to accurately specify these alternative models. (For example, a more complicated model may involve additional latent variables for which it is difficult to elicit prior beliefs.)

Given such a model but no assumed alternative, and some observed data \(X\), can we identify whether "the model fits the data"? This problem is the classic goodness of fit problem and classical statistics has a repertoire of methods for standard models. These methods have their own problems in that they are often unsatisfactory case-by-case studies or strong results are obtained only in asymptotia. However, it would be too easy to just criticise these methods. The real question is whether the problem they address is an important one, and what alternatives should be used, especially from a Bayesian viewpoint.

Prediction versus Scientific Theories

In machine learning, at least in its widespread current industrial use, we are most often concerned with building predictive models that automatically make decisions such as showing the right advertisements, classifying spam emails, etcetera.

This current focus on prediction may shift in the future, for example due to a revival in artificial intelligence systems or in general more autonomous agent type systems which do not have a single clearly defined prediction task.

But as it currently stands, model checking and goodness of fit is not so relevant for building predictive models.

First, even when the observation does not comply with model assumptions, your prediction may still be correct, in which case the non-compliance does not matter. I.e. the p-value does not use a decision-theoretic viewpoint that includes a task-dependent utility; cf. Watson and Holmes. To know whether the model is "correct" or not may not be important at all for prediction, but even likewise within science, as summarized by Bruce Hill in this comment,

"A major defect of the classical view of hypothesis testing, [...], is that it attempts to test only whether the model is true. This came out of the tradition in physics, where models such as Newtonian mechanics, the gas laws, fluid dynamics, and so on, come so close to being "true" in the sense of fitting (much of) the data, that one tends to neglect issues about the use of the model. However, in typical statistical problems (especially in the biological and social sciences but not exclusively so) one is almost certain a priori that the model taken literally is false in a non-trivial way, and so one is instead concerned whether the magnitude of discrepancies is sufficiently small so that the model can be employed for some specific purpose."

Second, if the deviation from modelling assumptions leads to incorrect predictions you would detect this through simple analysis of incorrect predictions using ground truth holdout data, not through fancy model checking. Checking accuracy of predictions is easy with annotated ground truth data, and is the bread-and-butter basic tool of machine learning.

The only useful application of model checking for predictive systems that I could think of are systems in which a conservative "prefer-not-to-predict" option exists, so that observations which are violating model assumptions could be excluded from further automated processing. Yet, much of this potential benefit may already be accessible through posterior uncertainty of the model. Only the subset of instances for which the model is certain but its predictions are wrong could profit from this special treatment.

In contrast to prediction, in science we build models not purely for prediction, but as a formal approximation to reality. Here I see that model checking is crucial, because it allows falsification of scientific hypotheses, leading hopefully to improved scientific understanding in the form of new models. One historically efficient method to falsify a scientific model is to check the predictions it makes, so a scientific model must normally also be a "predictive model". This viewpoint of establishing a model not just for making good predictions but also to understand mechanisms of reality also seems closer to the field of statistics.

The above separation of prediction versus science is of course not a simple dichotomy, but just a preference of the practitioner.

Bayesian Viewpoints?

So then, what is the Bayesian viewpoint here? The answer is that some well respected figures in the field accept frequentist tests and p-values as a method to criticise and attempt to falsify Bayesian models. One example can be seen in a recent article by Andrew Gelman and Cosma Shalizi where mechanisms to falsify a Bayesian model a discussed, stating

"The main point where we disagree with many Bayesians is that we do not think that Bayesian methods are generally useful for giving the posterior probability that a model is true, or the probability for preferring model A over model B, or whatever. Bayesian inference is good for deductive inference within a model, but for evaluating a model, we prefer to compare it to data (what Cox and Hinkley , 1974, call "pure significance testing") without requiring that a new model be there to beat it."

(They use pure significance tests and frequentist predictive checks, but no p-values in that paper.)

Another example is an article by Susie Bayarri and James Berger, where "Bayesian p-values" are discussed.

A third and maybe more popular pragmatic Bayesian stance is summarized in Bruce Hill's comment on Gelman, Meng, and Stern's article on posterior predictive testing,

"Like many others, I have come to regard the classical p-value as a useful diagnostic device, particularly in screening large numbers of possibly meaningful treatment comparisons. It is one of many ways quickly to alert oneself to some of the important features of a data set. However, in my opinion it is not particularly suited for careful decision-making in serious problems, or even for hypothesis testing. Its primary function is to alert one to the need for making such a more careful analysis, and perhaps to search for better models. Whether one wishes actually to go beyond the p-value depends upon, among other things, the importance of the problem, whether the quality of the data and information about the model and a priori distributions is sufficiently high for such an analysis to be worthwhile, and ultimately upon the perceived utility of such an analysis."

Not arguing by reference to authorities, but given the broad spectrum of contributions of Andrew Gelman, Bruce Hill, and James Berger (many of us learned Bayesian methods from the books Bayesian Data Analysis and Statistical Decision Theory and Bayesian Analysis), it should be clear that if they take frequentist tests and p-values seriously in statistical practice, they may actually be useful.

So let's look again at our goodness-of-fit problem.

Simple Models (simple null model)

The p-value can provide a useful diagnostic of goodness of fit. For the case of a simple model \(P(X)\) with an observation \(X \in \mathcal{X}\) we can pick a test statistic \(T: \mathcal{X} \to \mathbb{R}\) where high values indicate unlikely outcomes, and then compute

$$p_{\textrm{classic}} = \textrm{Pr}_{X' \sim P}(T(X') \geq T(X)),$$

that is, the probability of observing a \(T(X')\) greater than the actually observed \(T(X)\), given the assumed model \(P(X)\). This probability is the p-value and if the probability of observing a more extreme test statistic is small we should righly be suspicious of the assumed model. The choice of test statistic \(T\) is the only degree of freedom and has to be made given the model.

This is the classic p-value and its formal definition is completely unambiguous. One important observation is that if we assume the null hypothesis is true and we treat the p-value as a random variable, then this random variable is uniformly distributed, for any sample size.

Latent Variable Models (composite null model)

Now assume a slightly more general setting, where we have a model \(P(X|\theta)\), and \(\theta \in \Theta\) is some unknown parameter of the model which is not observed.

Because it is not observed, the above definition does not apply. We could apply the definition only if we knew \(\theta\). Classic methods assume that we have an estimator \(\hat{\theta}\) so that we can evaluate the p-value on \(P(X|\hat{\theta})\), fixing the parameter to a value hopefully close to it's true value. The key problem with this approach is that the p-value in general will no longer be uniformly distributed. This diminishes its value as a diagnostic for model misspecification. (Another alternative is to take the supremum probability over all possible parameters, again yielding a non-uniformly distributed p-value under the null.)

Bayesians to the rescue! Twice!

First, assume we would like to compute a p-value in the above setting. What would a Bayesian do? Of course, he would integrate over the unknown parameter, using a prior. This yields the so called posterior predictive p-value going back to the work of Guttman. Assuming a prior \(P(\theta)\) we compute the posterior predictive p-value as

$$p_{\textrm{post}} = \mathbb{E}_{\theta \sim P(\theta|X)}[ \textrm{Pr}_{X' \sim P(X'|\theta)}(T(X') \geq T(X))],$$

where \(P(\theta|X) \propto P(X|\theta) P(\theta)\) is the proper posterior. The definition is simple: take the expectation of the ordinary p-value weighted by the parameter posterior. This definition is very general and typically easy to compute during posterior inference, i.e. it is quite practical computationally.

Unfortunately, it is also overly conservative, as explained in the JASA paper ``Asymptotic Distribution of P Values in Composite Null Models'' by Robins, van der Vaart, and Ventura from 2000. Intuitively this is because the observed data \(X\) is used twice, a violation of the likelihood principle of Bayesian statistics: first it is used to obtain the posterior \(P(\theta|X)\), and then it is used again to compute the p-value.

Bayesians to the rescue again! This time it is Susie Bayarri and Jim Berger, and in their JASA paper P values for Composite Null Models they introduce two alternative p-values which exactly "undo" the effect of using the data twice by conditioning on the information already observed. (I will not discuss the U-conditional predictive p-value proposed by Bayarri and Berger.) Here is the basic idea: let \(X\) be the observed data and \(t=T(X)\) the test statistic. We then define the partial posterior,

$$P(\theta|X \setminus t) \propto \frac{P(X|\theta) P(\theta)}{P(t|\theta)}.$$

To understand this definition remember that random variables are functions from the sample space to another set. Hence, conditioning on \(t\) means that we condition on the event \(\{X' \in \mathcal{X} | T(X') = t\}\). The partial posterior predictive p-value is now defined as

$$p_{\textrm{ppost}} = \mathbb{E}_{\theta \sim P(\theta|X \setminus t)}[ \textrm{Pr}_{X' \sim P(X'|\theta)}(T(X') \geq T(X))].$$

Bayarri and Berger, as well as Robins, van der Vaart, and Ventura analyze the properties of this particular p-value and show that is asymptotically uniformly distributed and thus is neither conservative nor anti-conservative.

If you are a Bayesian and consider providing a general model-fit diagnostic in the absence of a formal alternative hypothesis this partial posterior predictive p-value is the method to use.

However, there are two drawbacks I can see that have affected it's usefulness for me:

  1. It is much harder to compute. Whereas the posterior predictive p-value can be well approximated even with naive Monte Carlo as soon as normal posterior inference is achieved, this is not the case for the partial posterior predictive p-value. The reason is that \(P(t|\theta)\), although typically an univariate density in the test statistic, is the integral over potentially complicated sets in \(\mathcal{X}\), that is \(P(t|\theta) = \int_{\mathcal{X}} 1_{\{T(X)=t\}} P(X|\theta) \,\textrm{d}X\). I have not seen generally applicable methods to compute \(p_{\textrm{ppost}}\) efficiently so far.
  2. The nice results of Bayarri and Berger do not extend to so called discrepancy statistics as proposed by Xiaoli Meng in his 1994 paper. These more general test statistics include the parameter, i.e. we use \(T(X,\theta)\) instead of just \(T(X)\). Why is this useful? For example, and I found this a very useful test statistic, you can directly use the likelihood of the model itself as a test statistic: \(T(X,\theta) = -P(X|\theta)\).

Enough thoughts, let's get our hands dirty with a simple experiment.


We take a simple composite null setting as follows. Our assumed model is

$$X_i \sim \mathcal{N}(\mu, \sigma^2),\qquad i=1,\dots,n.$$

We get to observe \(X=(X_1,\dots,X_n)\) and know \(\sigma\) but consider \(\mu\) unknown.

After some observations we would like to assess whether our model is accurate in light of the data. To this end we would like to use the P-values described above. We will need two ingredients: we need to define a test statistic and we need to work out the posterior inference in our model.

For the test statistic we actually use a generalized test statistic (discrepancy variable in Meng's vocabulary) as

$$T(X,\mu) = - \prod_{i=1}^n p(X_i|\mu) = -\prod_{i=1}^n \mathcal{N}(X_i ; \mu, \sigma^2).$$

For the posterior inference, as Bayesians we place a prior on \(\mu\) and we select

$$\mu \sim \mathcal{N}(\mu_0, \sigma_0).$$

The Bayesian analysis is particularly straightforward in this case, as this note by Kevin Murphy details. In particular, after observing \(n\) samples \(X=(X_1,\dots,X_n)\) the posterior on \(\mu\) has a simple closed form as

$$p(\mu|X) = \mathcal{N}(\mu_n, \sigma^2_n),$$


$$\sigma^2_n = \frac{1}{\frac{n}{\sigma^2}+\frac{1}{\sigma^2_0}},$$


$$\mu_n = \sigma^2_n \left(\frac{\mu_0}{\sigma^2_0} + \frac{n \bar{x}}{\sigma^2}\right),$$

where \(\bar{x} = \frac{1}{n} \sum_i X_i\) is the sample average.

From this simple form of the posterior distribution we can derive the closed form partial posterior \(P(\mu|X\setminus t)\) as well (not shown here, but essentially using known properties of the \(\chi^2\) distribution). Here is a picture of the posterior \(P(\mu|X)\) and the partial posterior \(P(\mu | X \setminus t)\), where the data \(X\) actually comes from the assumed model with true \(\mu=4.5\) and \(n=10\). Interestingly the partial posterior is more concentrated (which makes sense from the theory derived in Robins et al.).

Posterior and partial posterior in mu

Let us generate data from the assumed prior and model and see how our p-values behave. Because the null model is then correct, we can hope that the resulting p-values will be uniformly distributed. Indeed, if they were perfectly uniformly distributed they would be proper frequentist p-values. Because of the paper of Robins et al. we know that they will only be asymptotically uniformly distributed as \(n \to \infty\). But here we are also outside the theory because our test statistic \(T(X,\mu)\) includes the unknown parameter \(\mu\). So, walking on thin theory, let's verify the distribution for \(n=10\) by taking a histogram over \(10^6\) replicates.

Distribution over P-values

This looks good, and the partial posterior predictive p-value is more uniformly distributed obtaining better frequentist properties, in line with the claims in Bayarri and Berger and in Robins et al.

Finally, let us check with data from a model that is different from the assumed model. Here I sample from \(\mathcal{N}(\mu, s^2)\), where \(s \in [0,2]\). For \(s=1\) this is the assumed model, but ideally we can refute the model for values that differ from one by detecting this deviation through a p-value close to zero. The plot below shows, for each \(s\), the average p-value over 1000 replicates.

Deviation experiment

Clearly for \(s < 0.6\) or so we can reliably discover that our assumed model is problematic. Interestingly the partial posterior predictive p-value has significantly more power, in line with the theory.

For \(s > 1\) however, our p-value goes to one! How can this be? Well, remember that the choice of test statistic determines which deviations from our assumptions we can detect and that the p-value cannot verify the correctness of our assumed model but instead may only provide one-sided evidence against the model. With our current test statistic clearly this significant deviation passes undetected. We could replace our test statistic using the negative of our current test statistic and would be able detect the above deviation for \(s > 1\), but this implicitly more or less starts the process of thinking about alternative models, a point Bruce Hill mentioned above.

If we would like to consider alternative models we should ideally consider them in a formal way, and as a result we would be better off using a fully Bayesian approach over an enlarged model class.