Becoming a Bayesian, Part 3

Sebastian Nowozin - Fri 15 May 2015

This post continues the previous posts, part 1 and part 2, outlining my criticisms of a "naive" subjective Bayesian viewpoint:

  1. The consequences of model misspecification.
  2. The "model first, computation last" approach.
  3. Denial of the value of classical statistics, discussed in this post.

Denial of the Value of Classical Statistics

Suppose for the sake of a simple example that our task is to estimate the unknown mean \(\mu\) of an unknown probability distribution \(P\) with bounded support over the real line. To this end we receive a sequence of \(n\) iid samples \(X_1\), \(X_2\), \(\dots\), \(X_n\).

Now suppose that after receiving these \(n\) samples I do not use the obvious sample mean estimator but instead take only the first sample \(X_1\) and estimate \(\hat{\mu} = X_1\). Is this a good estimator? Intuition tells us that it is not, because it ignores part of the useful input data, namely all \(X_i\) with \(i > 1\), but how can we analyze this formally?

From a subjective Bayesian viewpoint the likelihood principle does not permit us to ignore evidence which is already available. If we posit a model \(P(X_i|\theta)\) and a prior \(P(\theta)\) we have to work with the posterior

$$P(\theta | X_1,\dots,X_n) \propto P(\theta) \prod_{i=1}^{n} P(X_i|\theta).$$

Therefore our estimator \(\hat{\mu}=X_1\) cannot correspond to the Bayesian posterior mean of any non-trivial model and prior. This is of course a very strict viewpoint, and one may object that we can talk about properties of the sequence of posteriors \(P(\theta | X_1)\), \(P(\theta | X_1, X_2)\), etc. But even in this more generous view, after observing all samples we are not permitted to ignore part of them. (If you are still not convinced, consider the estimator defined by \(\hat{\mu} = X_2\) if \(X_1 > 0\), and \(\hat{\mu} = X_3\) otherwise.) So Bayesian statistics does not offer us a method to analyze our proposed estimator.
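To make this concrete, consider a small worked example of my own choosing (a conjugate Normal model, not something required by the argument above): a likelihood \(X_i \sim \mathcal{N}(\theta, \sigma^2)\) with known \(\sigma^2\) and a prior \(\theta \sim \mathcal{N}(\mu_0, \sigma_0^2)\). The posterior mean is

$$\mathbb{E}[\theta \mid X_1,\dots,X_n] = \frac{\frac{\mu_0}{\sigma_0^2} + \frac{1}{\sigma^2}\sum_{i=1}^{n} X_i}{\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}},$$

which weights every observation equally; no choice of \(\mu_0\), \(\sigma_0^2\), or \(\sigma^2\) recovers the estimator \(\hat{\mu}=X_1\).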

A classical statistician can analyze pretty much arbitrary procedures, including silly ones like the \(\hat{\mu}\) we proposed. The analysis may be technically difficult or apply only in the asymptotic regime, but it does not rule out any estimator a priori. Typical results take the form of a derivation of the bias or variance of the estimator. In our case we have an unbiased estimate of the mean, \(\mathbb{E}[\hat{\mu}]-\mu = 0\). As for the variance, because we only ever use the first sample, the variance \(\mathbb{V}[\hat{\mu}]\) remains constant even as \(n \to \infty\), so the estimator is inconsistent, a clear indication that our \(\hat{\mu}\) is a bad estimator.
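A quick simulation illustrates this frequentist analysis. This is only a sketch of the claims above; the uniform distribution on \([0,1]\) is an arbitrary choice of a bounded-support \(P\).

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean = 0.5          # mean of Uniform(0,1), our arbitrary bounded-support P
reps, n = 100000, 100    # repeated experiments, samples per experiment

X = rng.uniform(0.0, 1.0, size=(reps, n))
first_sample = X[:, 0]        # silly estimator: mu_hat = X_1
sample_mean = X.mean(axis=1)  # standard estimator: mu_hat = (1/n) sum_i X_i

# Both estimators are unbiased, but only the sample mean has a variance
# that shrinks with n; the variance of X_1 stays at Var(X) = 1/12.
print("bias (X_1):          %+.4f" % (first_sample.mean() - true_mean))
print("bias (sample mean):  %+.4f" % (sample_mean.mean() - true_mean))
print("var  (X_1):           %.4f" % first_sample.var())   # ~1/12, independent of n
print("var  (sample mean):   %.4f" % sample_mean.var())    # ~1/(12 n)
```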

Another typical result is a confidence interval for a parameter of interest. One can argue that confidence intervals do not exactly answer the question of interest (that is, whether the parameter really is in the given interval), but when they are what is wanted, they can sometimes also be obtained from a Bayesian analysis.

There are cases in which existing statistical procedures can be reinterpreted from a Bayesian viewpoint. This is achieved by proposing a model and prior such that inferences under this model and prior exactly or approximately match the answers of the existing procedure, or at least have satisfying frequentist properties. Two such cases are the following:

  • Matching priors, where in some cases it is possible to establish an exact equivalence for simple parametric models without latent variables. A recent example, even for a non-parametric model, is the Good-Turing estimator for the missing mass, where an asymptotic equivalence between the classic Good-Turing estimator and a Bayesian non-parametric model is established.
  • Reference priors, a generalization of the Jeffreys prior, in which the prior is constructed to be least informative. Here least informative is in the sense that when you sample from the prior and consider the resulting posterior using the sample, the divergence to the original prior should be large in expectation; that is, samples from the prior should be able to change your beliefs to the maximum possible extent (a small worked example follows this list). When it is possible to derive reference priors, they typically have excellent frequentist robustness properties and are useful default prior choices. Unfortunately, in models with multiple parameters there is no unique reference prior, and the set of known reference priors generally seems to be quite small. This problematic case-by-case state of affairs is nicely summarized in recent work on overall objective priors.
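As a small worked example of the reference prior idea (the Bernoulli model below is my own illustration, not taken from the works mentioned above): for the single-parameter Bernoulli model \(P(X=1\,|\,\theta)=\theta\), the reference prior coincides with the Jeffreys prior,

$$\pi(\theta) \propto \sqrt{I(\theta)} = \theta^{-1/2}(1-\theta)^{-1/2}, \qquad \text{i.e.}\quad \theta \sim \mathrm{Beta}(\tfrac{1}{2},\tfrac{1}{2}),$$

and after observing \(k\) successes in \(n\) trials the posterior is \(\mathrm{Beta}(k+\tfrac{1}{2},\, n-k+\tfrac{1}{2})\). The resulting credible intervals are known to have good frequentist coverage, which is the kind of frequentist robustness mentioned above.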

Should we care at all about these classic notions of estimator quality? I have seen Bayesians dismiss properties such as unbiasedness and consistency as unimportant, but I cannot understand this stance. For example, an unbiased estimator operating on iid sampled data immediately yields a scalable parallel estimator applicable to the big data setting: estimate the quantity of interest separately on each part of the data, then average the estimates (a sketch follows below). This is a practical and useful consequence of the unbiasedness property. Similarly, consistency is at least a guarantee that when more data is available the quality of your inferences improves, and this should be of interest to anyone whose goal it is to build systems that can learn. (There do exist some results on Bayesian posterior consistency; for a summary see Chapter 20 of DasGupta's book.)
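Here is a minimal sketch of the parallelization argument; the exponential data, the number of shards, and the use of the sample mean as the per-shard estimator are my own choices for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.exponential(scale=2.0, size=1_000_000)  # iid data with true mean 2.0

# Split into disjoint shards; in a real system each shard would live on a
# different machine and be processed in parallel.
shards = np.array_split(data, 16)

# Each worker computes an unbiased per-shard estimate of the mean ...
per_shard_estimates = [shard.mean() for shard in shards]

# ... and the coordinator simply averages them. Because each per-shard
# estimate is unbiased, their average is unbiased as well.
combined = np.mean(per_shard_estimates)

print("combined estimate: ", combined)      # close to 2.0
print("full-data estimate:", data.mean())   # identical here, since shards are equal-sized
```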

Let me summarize. Bayesian estimators are often superior to the alternatives. But the set of procedures yielding Bayesian estimates is strictly smaller than the set of all statistical procedures. We need methods to analyze the larger set, and in particular to characterize the subset of useful estimators, where what counts as useful is application-dependent.

Acknowledgements. I thank Jeremy Jancsary, Peter Gehler, Christoph Lampert, and Cheng Soon-Ong for feedback.