This post continues part 1 and part 2 of my criticism of a ''naive'' subjective Bayesian viewpoint, covering three issues:

- The consequences of model misspecification.
- The ''model first computation last'' approach.
- Denial of the methods of classical statistics (this post).

## Denial of the Value of Classical Statistics

Suppose for the sake of a simple example that our task is to estimate the unknown mean \(\mu\) of an unknown probability distribution \(P\) with bounded support over the real line. To this end we receive a sequence of \(n\) iid samples \(X_1\), \(X_2\), \(\dots\), \(X_n\).

Now suppose that *after* receiving these \(n\) samples I do not use the obvious
sample mean estimator but I take only the first sample \(X_1\) and estimate
\(\hat{\mu} = X_1\).
Is this a good estimator?
Intuition tells us that it is not, because it ignores part of the useful input
data, namely \(X_i\) for any \(i > 1\), but how can we formally analyze this?

From a subjective Bayesian viewpoint the likelihood principle does not permit us to ignore evidence which is already available. If we posit a model \(P(X_i|\theta)\) and a prior \(P(\theta)\), we have to work with the posterior

\[
P(\theta \mid X_1, \dots, X_n) \;\propto\; P(\theta) \prod_{i=1}^{n} P(X_i \mid \theta),
\]

which conditions on *all* \(n\) observed samples.

Therefore our estimator \(\hat{\mu}=X_1\) cannot correspond to a Bayesian
posterior mean for any non-trivial choice of model and prior.
This is of course a very strict viewpoint and one may object that we
*can* talk about properties of the sequence of posteriors \(P(\theta | X_1)\),
\(P(\theta | X_1, X_2)\), etc.
But even in this generous view, *after* observing all samples we are not
permitted to ignore part of them. (If you are still not convinced, consider
the estimator defined by \(\hat{\mu} = X_2\) if \(X_1 > 0\), and \(\hat{\mu} = X_3\)
otherwise.)
So Bayesian statistics does not offer us a method to analyze our proposed
estimator.

A classical statistician *can* analyze pretty much arbitrary procedures,
including ones of the silly type \(\hat{\mu}\) that we proposed.
The analysis may be technically difficult or apply only in the asymptotic
regime, but it does not rule out any estimator a priori.
Typical results may take the form of a derivation of the variance or bias of
the estimator.
In our case we have an
*unbiased*
estimate of the mean, \(\mathbb{E}[\hat{\mu}]-\mu = 0\).
As for the variance, because we only take the first sample, even as \(n \to
\infty\) the variance \(\mathbb{V}[\hat{\mu}]\) remains constant, so the
estimator is
*inconsistent*, a clear
indication that our \(\hat{\mu}\) is a bad estimator.
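
A quick simulation makes the inconsistency visible (a sketch in Python with NumPy; the Uniform(0, 1) distribution, sample sizes, and trial count are arbitrary choices for illustration): the variance of \(\hat{\mu}=X_1\) stays near \(1/12\) no matter how large \(n\) gets, while the variance of the sample mean shrinks like \(1/n\).

```python
import numpy as np

rng = np.random.default_rng(0)

def estimator_variances(n, trials=20000):
    """Monte Carlo variance of the 'first sample' estimator and of the
    sample mean, for n iid Uniform(0, 1) samples (true mean 0.5)."""
    X = rng.uniform(0.0, 1.0, size=(trials, n))
    var_first = X[:, 0].var()        # mu_hat = X_1 ignores n - 1 samples
    var_mean = X.mean(axis=1).var()  # the obvious sample mean
    return var_first, var_mean

for n in (10, 100, 1000):
    print(n, estimator_variances(n))
```

The first column of output stays flat around \(1/12 \approx 0.083\) while the second keeps shrinking, which is exactly the frequentist diagnosis of inconsistency.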

Another typical result takes the form of a confidence interval for a parameter of interest. One can argue that confidence intervals do not exactly answer the question of interest (that is, whether the parameter really lies in the given interval), but when they are of interest, one can sometimes obtain them from a Bayesian analysis as well.
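
To make the repeated-sampling guarantee concrete, here is a coverage check of the textbook normal-approximation interval (a sketch; the 95% level, the Uniform(0, 1) data, and the sample size are assumptions of the example): over repeated draws, the interval should contain the true mean in roughly 95% of trials, and this long-run frequency is precisely what a confidence interval promises.

```python
import numpy as np

rng = np.random.default_rng(1)

def coverage(n=50, trials=5000, z=1.96, mu=0.5):
    """Fraction of trials in which the 95% normal-approximation CI
    mean +/- z * s / sqrt(n) covers the true mean of Uniform(0, 1)."""
    hits = 0
    for _ in range(trials):
        x = rng.uniform(0.0, 1.0, size=n)
        half_width = z * x.std(ddof=1) / np.sqrt(n)
        hits += abs(x.mean() - mu) <= half_width
    return hits / trials

print(coverage())  # empirically close to the nominal 0.95
```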

In some cases an existing statistical procedure can be reinterpreted from a Bayesian viewpoint. This is achieved by proposing a model and prior such that inferences under them exactly or approximately match the answers of the existing procedure, or at least have satisfying frequentist properties. Two such cases are the following:

- Matching priors, where in some cases it is possible to establish an exact equivalence for simple parametric models without latent variables. One recent example, even for a non-parametric model, is the Good-Turing estimator for the missing mass, where an asymptotic equivalence between the classic Good-Turing estimator and a Bayesian non-parametric model is established.
- Reference priors, a generalization of the Jeffreys prior, in which the prior is constructed to be least informative. Here ''least informative'' means that when you sample from the prior and form the posterior given that sample, the expected divergence from the posterior to the original prior should be large; that is, samples should be able to change your beliefs to the maximum possible extent. When reference priors can be derived, they typically have excellent frequentist robustness properties and are useful default prior choices. Unfortunately, in models with multiple parameters there is no unique reference prior, and generally the set of known reference priors seems to be quite small. This problematic case-by-case state of affairs is nicely summarized in this recent work on overall objective priors.
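
The ''least informative'' criterion behind reference priors can be written compactly (a standard formulation, stated here for a single observation and omitting the usual asymptotic limit and regularity conditions): the reference prior maximizes the expected divergence from posterior to prior, equivalently the mutual information between parameter and data,

\[
\pi^{\star} \;=\; \arg\max_{\pi} \; \mathbb{E}_{X}\!\left[ \mathrm{KL}\!\left( P(\theta \mid X) \,\|\, \pi(\theta) \right) \right],
\]

where the expectation is over the marginal \(P(X) = \int P(X \mid \theta)\, \pi(\theta)\, \mathrm{d}\theta\) and the posterior \(P(\theta \mid X)\) is computed under \(\pi\).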

Should we care at all about these classic notions of estimator quality?
I have seen *Bayesians* dismiss properties such as *unbiasedness*
and *consistency* as unimportant, but I cannot understand this stance.
For example, an unbiased estimator operating on iid sampled data immediately
yields a scalable parallel estimator applicable to the big data setting:
simply estimate the quantity of interest separately on subsets of the data,
then average the estimates. This is a practical and useful consequence of the
unbiasedness property. Similarly, *consistency* at least guarantees that
the quality of your inferences improves when more data becomes available,
and this should be of interest to anyone whose goal is to build systems
which can learn. (There do exist some results on Bayesian posterior
consistency, for a summary see Chapter 20 of
DasGupta's
book.)
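
The parallelization argument can be sketched in a few lines (Python with NumPy; the shard count and data source are arbitrary choices for illustration): each shard computes its own unbiased estimate independently, possibly on a separate machine, and averaging the shard estimates again gives an unbiased estimate.

```python
import numpy as np

rng = np.random.default_rng(2)

def combine_shard_estimates(data, num_shards):
    """Map-reduce style estimation: one unbiased estimate per shard,
    then a simple average of the shard estimates."""
    shards = np.array_split(data, num_shards)
    estimates = [shard.mean() for shard in shards]  # embarrassingly parallel
    return float(np.mean(estimates))

data = rng.uniform(0.0, 1.0, size=100_000)  # true mean 0.5
print(combine_shard_estimates(data, num_shards=16))
```

Because each shard estimate is unbiased, their average is unbiased too; no coordination between shards is needed beyond the final averaging step.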

Let me summarize.
Bayesian estimators are often superior to alternatives.
But the set of procedures yielding Bayesian estimates is strictly smaller than
the set of all statistical procedures.
We need methods to analyze the larger set, in particular to characterize the
subset of useful estimators, where *useful* is application dependent.

*Acknowledgements*. I thank Jeremy Jancsary,
Peter Gehler,
Christoph Lampert, and
Cheng Soon-Ong for feedback.