How good are your beliefs? Part 1: Scoring Rules

Sebastian Nowozin - Fri 04 September 2015 -

This article is the first of two on proper scoring rules, a specific type of loss function defined on probability distributions or functions of probability distributions.

If this article sparks your interest, I recommend the gentle introduction to scoring rules in the context of decision theory in Chapter 10 of Parmigiani and Inoue's "Decision Theory" book; it is a great book to have on your data science bookshelf in any case, and it deservedly won the DeGroot prize in 2009.

Scoring Rules

Consider the following forecasting setting. Given a set of possible outcomes \(\mathcal{X}\) and a class of probability measures \(\mathcal{P}\) defined on a suitably constructed \(\sigma\)-algebra, we consider a forecaster who makes a forecast in the form of a probability distribution \(P \in \mathcal{P}\). After the forecast is fixed, a realization \(x \in \mathcal{X}\) is revealed and we would like to assess the quality of the prediction made by the forecaster.

A scoring rule is a function \(S\) such that \(S(P,x)\) is taken to mean the quality of the forecast. Hence the function has the form \(S: \mathcal{P} \times \mathcal{X} \to \mathbb{R} \cup \{-\infty,\infty\}\). There are two variants popular in the literature: positively-oriented scoring rules assign higher values to better forecasts, while negatively-oriented scoring rules behave like loss functions, taking smaller values for better forecasts.

A proper scoring rule has desirable behaviour, to be made precise shortly. Let us first think about what could be desirable in a scoring rule. Intuitively we would like to make "cheating" difficult, that is, if we really subjectively believe in \(P\), we should have no incentive to report any deviation from \(P\) in order to achieve a better score. Formally, we first define the expected score under distribution \(Q\),

$$S(P,Q) = \mathbb{E}_{x \sim Q}[S(P,x)].$$

So if \(P \in \mathcal{P}\) is our true belief, then we should demand that (for negatively-oriented scores)

$$S(P,P) \leq S(Q,P),\qquad \forall Q \in \mathcal{P}.$$

For strictly proper scoring rules the above inequality holds strictly except for \(Q=P\). For a proper scoring rule the above inequality means that in expectation the lowest possible score can be achieved by faithfully reporting our true beliefs. Therefore, a rational forecaster who aims to minimize expected score (loss) is going to report his beliefs.
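To make this concrete, here is a minimal numerical sketch for a binary outcome, using the negatively-oriented log-probability score discussed later in this article (the belief value 0.7 and the evaluation grid are arbitrary choices of mine): the expected score is minimized exactly by reporting the true belief.

```python
import numpy as np

# True belief: outcome 1 occurs with probability q.
q = 0.7

def expected_log_score(p, q):
    """Expected negatively-oriented log score E_{x~q}[-log p(x)]
    for a binary outcome when we report p but believe q."""
    return -(q * np.log(p) + (1.0 - q) * np.log(1.0 - p))

reports = np.linspace(0.01, 0.99, 99)
scores = expected_log_score(reports, q)
print(reports[np.argmin(scores)])  # 0.7: truthful reporting minimizes the expected score
```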

Key uses of scoring rules are:

  • Evaluating the predictive performance of a model;
  • Eliciting probabilities;
  • Estimating parameters.

Let us look briefly at the different uses.

Model Evaluation

To assess model performance, we simply use the scoring rule as a loss function and measure the predictive performance on a holdout data set.
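As a minimal sketch of what this looks like in practice (the arrays below are made-up toy values, not from the original article), we can report the mean log score on the holdout set:

```python
import numpy as np

# Predicted class probabilities on a holdout set (one row per example,
# rows sum to one) and the realized class labels; toy values.
probs = np.array([[0.8, 0.2],
                  [0.3, 0.7],
                  [0.6, 0.4]])
labels = np.array([0, 1, 1])

# Mean negatively-oriented log score (lower is better):
# the average of -log p_{true class} over the holdout examples.
log_score = -np.mean(np.log(probs[np.arange(len(labels)), labels]))
print(log_score)
```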

Probability Elicitation

For probability elicitation we can use a scoring rule as follows: we ask a user to make predictions and we tell him that we will reward him proportionally to the value achieved by the scoring rule once the prediction can be scored. Assuming that the user is rational and aims to maximize his reward, if we use a proper scoring rule, then he can maximize his expected reward by making predictions according to the true beliefs he holds. However, while the existence of a strictly proper scoring rule roughly means that elicitation of a quantity is possible, more efficient methods for probability elicitation may exist. In fact, Simon French and David Rios Insua argue in their book Statistical Decision Theory, page 76, that

"de Finetti (1974; 1975) and others have championed the use of scoring rules to elicit probabilities of events. ... Scoring rules are important in de Finetti's development of subjective probability, but it is not clear that they have a practical use in statistical or decision analysis. ... Scoring rules could provide a very expensive method of eliciting probabilities. In training probability assessors, however, they can have a practical use."

If you wonder what more efficient alternatives French and Rios Insua have in mind: they propose several methods to elicit probabilities, such as an idealized "probability wheel" the user can configure and spin, and a sequence of proposed gambles used to find a fair value the user accepts.

In general it seems to me (as an outsider to this field) that probability elicitation is as much about theoretically sound methods as it is about human psychology and biases, and how to avoid them. The human aspect of probability elicitation is discussed in Roger Cooke's book-length monograph on the topic, and in the recent study (Goldstein and Rothschild, "Lay understanding of probability distributions", 2014) (thanks to Ian Kash for pointing me to this study!).

Estimation

For parameter estimation we perform empirical risk minimization on a probabilistic model using the scoring rule as a loss function, an approach dating back to (Pfanzagl, 1969). This is a special case of M-estimation but generalizes maximum likelihood estimation (MLE), where the log-probability scoring rule is used.

If the model class contains the true generating model this yields a consistent estimator, but for misspecified models this can yield answers different from the MLE, and these answers may be preferable. For example, if the model assumptions are violated and for every choice of parameter the model puts low density on some observations, then these observations influence the MLE severely, because the log-probability scoring rule assigns them a large penalty. Using a suitable scoring rule cannot prevent misspecification of course, but the consequences can be made less severe.

It should also be said that for estimation problems the log-prob scoring rule is the most principled in that it is the only one that can be justified from the likelihood principle.
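To illustrate the robustness point, here is a hedged sketch (the toy data, the outlier, and the truncation level are my own choices): we fit a Poisson rate to count data containing a gross outlier, once by minimizing the mean log score (equivalent to the MLE) and once by minimizing the mean Brier score (introduced below) over the discrete outcomes.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

# Toy count data with one gross outlier; the Poisson model is misspecified.
data = np.array([2, 3, 1, 2, 4, 3, 2, 50])

def mean_log_score(lam):
    # Negatively-oriented log score: average of -log p(x).
    return -poisson.logpmf(data, lam).mean()

def mean_brier_score(lam, k_max=200):
    # Brier score over discrete outcomes, truncating the support at k_max.
    ks = np.arange(k_max + 1)
    p = poisson.pmf(ks, lam)
    onehot = (data[:, None] == ks[None, :]).astype(float)
    return ((onehot - p[None, :]) ** 2).sum(axis=1).mean()

mle = minimize_scalar(mean_log_score, bounds=(0.1, 60.0), method="bounded").x
brier = minimize_scalar(mean_brier_score, bounds=(0.1, 60.0), method="bounded").x
print(mle, brier)  # the MLE is dragged toward the outlier, the Brier fit much less so
```

Here the single extreme count receives vanishingly small probability for any reasonable rate, so the log score penalizes it heavily, while the Brier penalty for any single observation is bounded.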

Scoring Rule Examples

Here are a few examples of common and not so common scoring rules both for discrete and continuous outcomes.

Scoring Rule Example: Brier Score

This scoring rule was historically the first, proposed by Glenn Wilson Brier (1913-1998) in his seminal work (Brier, "Verification of Forecasts Expressed in Terms of Probability", 1950) as a means to verify weather forecasts.

Given a discrete outcome set \(\{1,2,\dots,K\}\) the forecaster specifies a distribution \(P=(p_1,\dots,p_K)\) with \(p_i \geq 0\) and \(\sum_i p_i = 1\). Then, when an outcome \(j\) is realized we score the forecaster according to the Brier score,

$$S_B(P,j) = \sum_{i=1}^K (1_{\{i=j\}} - p_i)^2.$$
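In code, assuming a dense probability vector p and a zero-based outcome index j (a sketch of mine, not Brier's original formulation):

```python
import numpy as np

def brier_score(p, j):
    """Brier score S_B(P, j) = sum_i (1{i=j} - p_i)^2 for a
    probability vector p and realized (zero-based) outcome index j."""
    onehot = np.zeros_like(p)
    onehot[j] = 1.0
    return np.sum((onehot - p) ** 2)

print(brier_score(np.array([0.7, 0.2, 0.1]), 0))  # 0.3^2 + 0.2^2 + 0.1^2 = 0.14
```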

The Brier score is extensively discussed in (DeGroot and Fienberg, 1983) and they show that it can be decomposed into two terms measuring calibration and refinement, respectively. Here, refinement measures the information available to discriminate between different outcomes that is contained in the prediction.

For the binary case, the definitive work is (Buja, Stuetzle, Shen, 2005), in which a class of scoring rules based on the Beta distribution is proposed which generalizes both the Brier score and the log-probability score.

Scoring Rule Example: Log-Probability

The most common scoring rule in estimation problems is the log-probability, also known as the log-loss in machine learning. Maximum likelihood estimation can be seen as optimizing the log-probability scoring rule.

For the discrete outcome case it is given simply by

$$S_{\textrm{log}}(P,i) = -\log p_i.$$

If \(p_i = 0\) the score is \(S_{\textrm{log}}(P,i) = \infty\). The log-probability is a proper scoring rule, but what really distinguishes it is that it is local: when outcome \(j\) is realized, only the predicted value \(p_j\) is used to compute the score. Intuitively this is a desirable property, because if \(j\) happens, why should we care about the precise distribution of probability mass over the other events?

It turns out that this local property is unique to the log-probability scoring rule. (For the result and proof see Theorem 10.1 in Parmigiani and Inoue's book.)
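A quick numerical illustration of locality (the two forecasts below are toy examples of mine): both place the same mass on the realized outcome, so the log score cannot distinguish them, while the non-local Brier score does.

```python
import numpy as np

# Two forecasts with identical mass 0.5 on the realized outcome j=0,
# but different allocations over the remaining outcomes.
p1 = np.array([0.5, 0.5, 0.0])
p2 = np.array([0.5, 0.25, 0.25])

print(-np.log(p1[0]), -np.log(p2[0]))  # identical log scores: 0.693... twice

def brier_score(p, j):
    onehot = np.zeros_like(p)
    onehot[j] = 1.0
    return np.sum((onehot - p) ** 2)

print(brier_score(p1, 0), brier_score(p2, 0))  # Brier scores differ: 0.5 vs 0.375
```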

Scoring Rule Example: Energy Statistic

This scoring rule is for predicting a distribution in \(\mathbb{R}^d\) and is defined for \(\beta \in (0,2)\), realization \(x \in \mathbb{R}^d\), and distribution \(P\) on \(\mathbb{R}^d\) as

$$S_E(P,x) = \mathbb{E}_{X \sim P}[\|X-x\|^\beta] - \frac{1}{2} \mathbb{E}_{X,X' \sim P}[\|X-X'\|^\beta].$$

This score has an intuitive interpretation: the score is the expected distance to the realization minus half the expected pairwise sample distance. Let us think about a few cases: if \(P\) is a point mass, then the first term is just the distance to the realization and the second term is zero; and in the limit \(\beta \to 2\) the score becomes \(\|x - \mathbb{E}_P[X]\|^2\), the squared Euclidean distance to the mean of \(P\), as can be seen by expanding the squared norms. The original definition is from (Gneiting and Raftery, 2007), except for the sign change, but it is based on Szekely's energy statistic, which also independently found its way into machine learning through the Hilbert-Schmidt independence criterion.

For \(\beta \in (0,2)\) the energy score is a strictly proper scoring function for all Borel measures with finite moment \(\mathbb{E}_P[\|X\|^\beta] < \infty\).

Here is a visualization, where \(P = \mathcal{N}([0,0]^T, \textrm{diag}([1/2, 5/2]))\) is represented by 10k samples and the red marker corresponds to the realization \(x\). Here we have \(\beta=1\). We can see that the Euclidean nature of the scoring rule seems to dominate the anisotropic distribution \(P\), that is, a realization that is unlikely under our belief distribution (leftmost plot) achieves a lower (better) score than a realization with higher density (second plot from the left).

Figure: the energy score for \(\beta = 1\).

As a practical matter, the energy score is simple to evaluate even when you have only predictive Monte Carlo samples of your model, in contrast to the log-probability rule, which requires the normalizing constant of the predictive distribution.
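Concretely, here is a plain Monte Carlo estimator of the energy score from predictive samples, a sketch assuming the sample set is small enough that the pairwise distance computation fits in memory (the distribution below matches the visualization above):

```python
import numpy as np

def energy_score(samples, x, beta=1.0):
    """Monte Carlo estimate of the energy score S_E(P, x) from samples
    of P (shape (n, d)) and a realization x (shape (d,))."""
    n = samples.shape[0]
    term1 = (np.linalg.norm(samples - x, axis=1) ** beta).mean()
    pairwise = np.linalg.norm(samples[:, None, :] - samples[None, :, :], axis=2) ** beta
    term2 = pairwise.sum() / (n * (n - 1))  # exclude the n zero diagonal entries
    return term1 - 0.5 * term2

rng = np.random.default_rng(0)
samples = rng.normal([0.0, 0.0], np.sqrt([0.5, 2.5]), size=(1000, 2))
print(energy_score(samples, np.array([0.0, 0.0]), beta=1.0))
```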

Scoring Rule Example: Check Loss

The check loss, also known as quantile loss or tick loss, is a loss function used for quantile regression, where we would like to learn a model that directly predicts a quantile of a distribution, but we are given only samples of the distribution at training time.

This scoring rule is somewhat different in that it scores a specific property of the belief distribution, namely a quantile. Being proper here means that the lowest expected loss is achieved by predicting the corresponding quantile of your belief. (Interestingly, proper scoring rules exist only for certain functions of the distribution, see (Gneiting, 2009).)

You may already know a special case of the check loss: under the absolute value loss, your expected risk is minimized by the median of your belief distribution, that is, the \(\frac{1}{2}\)-quantile. The check loss generalizes this to a richer family of loss functions whose expected-risk minimizers correspond to arbitrary quantiles, not just the median. Thus, instead of scoring an entire belief distribution \(P\), we only score its quantile statistics.

The check loss is defined as

$$S_{\textrm{c}}(r,x,\alpha) = (x-r) (\alpha - 1_{\{x \leq r\}}),$$

where \(r\) is our predicted \(\alpha\)-quantile and \(x \sim Q\) is a sample from the true unknown distribution \(Q\).

Plotting this loss explains the names check loss and tick loss, because it looks like two tilted lines. I show it for a sample realization \(x=5\); the horizontal axis denotes the quantile estimate \(r\).

Figure: the check loss for the realization \(x = 5\).

For any belief distribution, taking the minimum expected risk decision yields the matching quantile. For example, if your beliefs are distributed according to \(X \sim N(5,1)\), then you would consider the expected risk

$$R_{\alpha}(r) = \mathbb{E}_{X \sim N(5,1)}[S_c(r,X,\alpha)].$$

This convolves the check loss function with the belief distribution, here a Gaussian. The minimizer over \(r\) of this expected risk corresponds to your optimal decision.

Figure: expected risk under the check loss.

The above plot marks the 10/50/90 quantiles and these correspond to the minimizers of the expected risks of the respective check losses.
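A sketch verifying this numerically (the sample size, seed, and search bounds are arbitrary choices of mine): for each \(\alpha\) we minimize a Monte Carlo estimate of the expected check loss and compare the minimizer to the corresponding Gaussian quantile.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(5.0, 1.0, size=100_000)  # samples from the belief N(5, 1)

def expected_check_loss(r, alpha):
    # Monte Carlo estimate of R_alpha(r) = E[(X - r)(alpha - 1{X <= r})].
    return np.mean((x - r) * (alpha - (x <= r)))

for alpha in (0.1, 0.5, 0.9):
    r_star = minimize_scalar(lambda r: expected_check_loss(r, alpha),
                             bounds=(0.0, 10.0), method="bounded").x
    # the minimizer closely matches the alpha-quantile of N(5, 1)
    print(alpha, r_star, norm.ppf(alpha, loc=5.0, scale=1.0))
```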

Conclusion

The above is only a small peek into the vast literature on scoring rules. If you are mathematically inclined, I highly recommend (Gneiting and Raftery, 2007) as an enjoyable further read and (Frongillo and Kash, 2015) for the most recent general results; everyone else may enjoy the book mentioned in the introduction.

In the second part we are going to put your forecasting skills to the test via an interactive quiz!

Acknowledgements. I thank Ian Kash for further insightful discussions on scoring rules and pointing me to relevant literature.