How to report uncertainty - Sebastian Nowozin's slow blog

Error bars and the $\pm$-notation are used to quantitatively convey uncertainty in experimental results. For example, you would often read statements like $140.7 \textrm{Hz} \pm 2.8 \textrm{ Hz SEM}$ in a paper to report both an experimental average and its uncertainty.

Unfortunately, in many fields (such as computer vision, and, to a lesser extent, machine learning) researchers often do not report uncertainty, or, if they do, they report it incorrectly.

Of course, dear reader, I am sure you always do report it properly, so the following remarks may only serve as a reminder of your common practice.

First, when reporting a quantitative measurement of uncertainty, it is important to establish the goal of doing so. The two popular goals are as follows.

1. Convey Variability

Here the focus is on the variability itself. For example, take a look at this table of food intake of US teenagers. The variability among the participants of the study is reported through the standard deviation, the square root of the variance.

The reason why the standard deviation (SD) is preferred over the variance is that the SD is on the same scale as the original values. That is, if the original measurements were in $\textrm{Hz}$, the standard deviation is also in $\textrm{Hz}$, whereas the variance is in the squared unit, $\textrm{Hz}^2$.

One easy question you can ask yourself when thinking about the results you would like to report in an experiment is this: Do you expect the error bars to shrink with more available data? If your goal is to convey variability they would not shrink but remain of a certain size, no matter how many samples are available.

The correct wording to report this type of uncertainty is something similar to

"We report the mean and one unit standard deviation."
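The question above, whether error bars shrink with more data, can be checked with a small simulation. The following is a minimal sketch, assuming a hypothetical normally distributed population with a spread of 10 units (the population, its parameters, and the helper `sample_sd` are illustrative and not from any study cited here).

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

TRUE_SD = 10.0  # hypothetical population spread, chosen only for illustration


def sample_sd(n):
    """Draw n values from the hypothetical population and return their sample SD."""
    values = [random.gauss(50.0, TRUE_SD) for _ in range(n)]
    return statistics.stdev(values)


# The sample SD estimates the population spread, so it settles near TRUE_SD
# instead of shrinking as the sample grows.
sd_by_n = {n: sample_sd(n) for n in (500, 5_000, 50_000)}
for n, sd in sd_by_n.items():
    print(f"n = {n:>6}: SD = {sd:.2f}")
```

The printed values should stay close to the population spread for every sample size, which is what conveying variability means.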

2. Convey Uncertainty about an Unknown Parameter

Here the focus is on your remaining uncertainty about a fixed quantity which does not vary. For example, take a look at Table 1 in Ogden et al., 2004 where the average weight of US children is reported. Together with the mean weight in pounds the authors report the standard error of the mean. (Sometimes this is just called standard error.)

Here the reported uncertainty measures how well we know the average weight. It is related to the standard deviation $\sigma$ by means of

$$\textrm{SEM} = \frac{\sigma}{\sqrt{n}},$$

where $n$ is the sample size of the experiment. For example, in Table 1 of the above paper the authors report that between 1963 and 1965, for boys of age 6 years living in the USA, the average weight was $\hat{\mu}=48.4$ pounds with a standard error of the mean of $\textrm{SEM}=0.3$ pounds and a sample size of $n=575$. Solving the above formula for $\sigma$ immediately gives

$$\sigma \approx \sqrt{n} \textrm{SEM} = \sqrt{575} \cdot 0.3 \approx 7.19.$$
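The conversion between SEM and standard deviation works in both directions. Here is a minimal sketch; the function names `sd_from_sem` and `sem_from_sd` are made up for this illustration.

```python
import math


def sem_from_sd(sd, n):
    """Standard error of the mean: SEM = sigma / sqrt(n)."""
    return sd / math.sqrt(n)


def sd_from_sem(sem, n):
    """Invert the formula above to recover the standard deviation."""
    return sem * math.sqrt(n)


# Numbers from the example: SEM = 0.3 pounds, n = 575.
sigma = sd_from_sem(0.3, 575)
print(f"sigma is roughly {sigma:.2f} pounds")
```

Because the published SEM is itself rounded to one digit, the recovered $\sigma$ is only a rough estimate.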

What is the use of the standard error? Because of the central limit theorem for independent samples, the standard error provides an approximate 95% confidence interval for the unknown true mean of the population, as

$$[\hat{\mu} - 1.96 \textrm{SEM}, \hat{\mu} + 1.96 \textrm{SEM}].$$

Using the above numbers we then know that with 95% confidence over the sampling variation the true average weight $\mu \in [47.8,49.0]$. (Note that for a single experiment this does not mean we cover the true value with a certain probability, because either we cover it or we do not cover it. The 95% probability is the probability associated to a (hypothetical) repetition of the experiment.)
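The interval is a one-line computation. The sketch below uses a hypothetical helper `normal_ci` and relies on the normal approximation from the central limit theorem; for very small samples a wider quantile from the t-distribution would usually be more appropriate.

```python
def normal_ci(mean, sem, z=1.96):
    """Approximate confidence interval for the true mean.

    z = 1.96 is the normal quantile that gives roughly 95% coverage.
    """
    return mean - z * sem, mean + z * sem


# Numbers from the example: mean 48.4 pounds, SEM 0.3 pounds.
low, high = normal_ci(48.4, 0.3)
print(f"95% CI: [{low:.1f}, {high:.1f}]")
```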

The correct wording to report this type of uncertainty is

"We report the average of $n=123$ samples and the standard error of the mean."

How many digits to report?

When writing out numbers a natural question that arises is how many significant digits to report. Richard Clymo has some advice on how many digits to report.

Most bioscientists need to report mean values, yet many have little idea of how many digits are significant, and at what point further digits are mere random junk. Thus a recent report that the mean of 17 values was 3.863 with a standard error of the mean (SEM) of 2.162 revealed only that none of the seven authors understood the limitations of their work. The simple rule derived here by experiment for restricting a mean value to its significant digits (sig-digs) is this: the last sig-dig in the mean value is at the same decimal decade as the first sig-dig (the first non-zero) in the SEM. ... For the example above the reported values should be a mean of 4 with SEM 2.2. Routine application of these simple rules will often show that a result is not as compelling as one had hoped.

Let's compare with the numbers from before: the average weight was reported as 48.4 and the SEM as 0.3. The last significant digit in the mean is the four after the decimal point, which is in the same decimal decade as the first significant digit of the SEM. So the study did it right.

Clymo develops the following simple-to-follow rules for reporting the sample average and SEM:

Rule 1 (for determining the significant digits in the reported mean): the last significant digit in the mean is in the same decade as the first non-zero digit in the SEM.
Rule 2 (for determining significant digits in the reported SEM): depending on the sample size $n$, as per the following table:

Sample size $n$	Significant digits to report
$2 \leq n \leq 6$	1
$7 \leq n \leq 100$	2
$101 \leq n \leq 10,000$	3
$10,001 \leq n \leq 10^6$	4
$n > 10^6$	5
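The two rules are mechanical enough to automate. The following is a minimal sketch of one way to do so, not Clymo's own code; the names `sem_digits` and `clymo_round` are made up here, and the sketch assumes a strictly positive SEM.

```python
import math


def sem_digits(n):
    """Significant digits to report for the SEM, per the table above (Rule 2)."""
    if n < 2:
        raise ValueError("need at least two samples")
    if n <= 6:
        return 1
    if n <= 100:
        return 2
    if n <= 10_000:
        return 3
    if n <= 10**6:
        return 4
    return 5


def _fmt(x, decimals):
    # Round to `decimals` places; a negative value rounds to tens, hundreds, ...
    return f"{round(x, decimals):.{max(decimals, 0)}f}"


def clymo_round(mean, sem, n):
    """Return (mean, SEM) as strings rounded according to Rules 1 and 2."""
    # Decade of the first non-zero digit of the SEM, e.g. 0 for 2.162, -1 for 0.3.
    decade = math.floor(math.log10(sem))
    mean_str = _fmt(mean, -decade)                    # Rule 1
    sem_str = _fmt(sem, -decade + sem_digits(n) - 1)  # Rule 2
    return mean_str, sem_str


# The example from Clymo's quote: 17 values, mean 3.863, SEM 2.162.
print(clymo_round(3.863, 2.162, 17))
```

For the quoted example this yields a mean of 4 with SEM 2.2, matching the values given in the quote. The sketch does not handle the corner case where rounding pushes the SEM into the next decade (for example 0.96 rounded to one digit).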

Quiz

Ok, that is enough information. Let's practice.

Question 1

You sample the height of male students in a German school class (grade 6) in centimeters: 148, 148, 137, 152, 140, 149, 152, 152, 159, 155. Report your estimate of the population height (here the population is all German male students in grade 6).

Answer: $149\textrm{cm} \pm 2.1\textrm{cm}$ SEM. Explanation: we are interested in the population mean and hence would like to convey the remaining uncertainty of our estimate. The sample mean is $\hat{\mu} = 149.2\textrm{cm}$, the sample standard deviation is $6.579429\textrm{cm}$, and the sample size is $n=10$. This gives $\textrm{SEM} = 6.579429/\sqrt{10} \approx 2.080598\textrm{cm}$. Applying the above rules: Rule 1 tells us that the first non-zero digit of the SEM is in the $10^0$ decade, so we report $149\textrm{cm}$ as the mean. Rule 2 tells us that for a sample size of $n=10$ we should report two significant digits in the SEM, which rounds to $2.1\textrm{cm}$.
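The intermediate numbers can be reproduced with Python's standard library. This is a minimal sketch using the ten heights exactly as listed in the question; the variable names are illustrative.

```python
import math
import statistics

heights_cm = [148, 148, 137, 152, 140, 149, 152, 152, 159, 155]

n = len(heights_cm)
mean_height = statistics.mean(heights_cm)
sd_height = statistics.stdev(heights_cm)  # sample SD, divides by n - 1
sem_height = sd_height / math.sqrt(n)

print(f"n = {n}, mean = {mean_height:.1f} cm")
print(f"SD = {sd_height:.6f} cm, SEM = {sem_height:.6f} cm")
# Rounded for reporting: whole centimeters for the mean, two digits for the SEM.
print(f"report: {mean_height:.0f} cm +/- {sem_height:.1f} cm SEM")
```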

Question 2

You run a company and regularly send bills to customers for payment. You measure the time in days between sending the bill and receiving the payment: 10, 7, 10, 7, 12, 10, 8, 4, 15, 3, 9, 4. Report the average and variability.

Answer: $8 \pm 3.5$ days SD. Explanation: we are interested in the average time and the variability, so a standard deviation is appropriate. Rule 1 from Clymo still applies: the SEM is $3.52/\sqrt{12} \approx 1.02$ days, whose first non-zero digit is in the $10^0$ decade, so we round the sample mean of $8.25$ to $8$. Rule 2 does not apply (this is the standard deviation, not the SEM), but because we have rounded the mean it makes no sense to report the standard deviation of about $3.52$ with more than one digit beyond the last digit of the mean, giving $3.5$.
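As a check on the arithmetic, here is a minimal sketch for the payment data. The SEM is computed only to locate the decade used by Rule 1; the reported spread remains the standard deviation. Variable names are illustrative.

```python
import math
import statistics

days = [10, 7, 10, 7, 12, 10, 8, 4, 15, 3, 9, 4]

mean_days = statistics.mean(days)
sd_days = statistics.stdev(days)  # sample SD, divides by n - 1
sem_days = sd_days / math.sqrt(len(days))

# Decade of the first non-zero SEM digit decides where the mean is cut off.
decade = math.floor(math.log10(sem_days))

print(f"mean = {mean_days}, SD = {sd_days:.3f}, SEM = {sem_days:.3f}")
print(f"report: {round(mean_days, -decade):.0f} +/- {sd_days:.1f} days SD")
```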
