Streaming Mean and Variance Computation

Sebastian Nowozin - Sun 25 January 2015 -

Given a sequence of observed data we would often like to estimate simple quantities like the mean and variance.

Sometimes the data is available in a streaming setting, that is, we are given one sample at a time. For example, this is the case when

  • the number of samples is apriori unknown,
  • we have to perform some stopping test after each sample,
  • the number of samples is very large and we cannot store all samples.

More formally, given weighted observations \(X_1\), \(X_2\), \(\dots\), with \(X_i \in \mathbb{R}\), and \(w_1\), \(w_2\), \(\dots\), with \(w_i \geq 0\) we would like to calculate simple statistics like the weighted mean or weighted variance of the sample without having to store all samples, and by processing them one-by-one.

In this situation we can compute the mean and variance of a sample (and, more generally, any higher-order moments) using a streaming algorithm. Many possibilities exist but because of the incremental computation particular attention needs to be paid to numerical stability. If we were to ignore numerical accuracy we could use a simple derivation to show that the following updates for \(i=1,2,\dots\) are correct, when initializing \(S^{(0)} = T^{(0)} = U^{(0)} = 0\):

$$S^{(i+1)} = S^{(i)} + w_i$$
$$T^{(i+1)} = T^{(i)} + w_i X_i$$
$$U^{(i+1)} = U^{(i)} + w_i X_i^2$$

Then \(\hat{\mu} = T^{(n)} / S^{(n)}\) is the weighted sample mean, and \(\hat{\mathbb{V}} = \frac{n}{(n-1) S^{(n)}} (U^{(n)} - S^{(n)} \hat{\mu}^2)\) is an unbiased estimate of the weighted variance.

The problem with this naive derivation arises when \(n\) is very large. Then the in all three updates the summation may sum quantities of very different magnitude, leading to large round-off errors. By the way, this can even arise when one is computing the simple sum of many numbers, and a classic solution in that case is Kahan summation.

A clever solution to this problem for streaming mean and variance computation was proposed by West in 1979. In his algorithm the summed quantities are controlled to be on average of comparable size. (It is not the only alternative, for a detailed numerical study of possible options, see the paper linked below.)

The West algorithm supports mean and variance computation for positively weighted samples \((w_i, X_i)\) with \(w_i \geq 0\), \(X_i \in \mathbb{R}\) and the original paper is

It outputs

  1. The weighted unbiased mean estimate, \(\hat{\mu} = (\sum_i w_i X_i) / (\sum_i w_i)\),
  2. The weighted unbiased variance estimate, \(\hat{\mathbb{V}} = \left(\sum_i w_i (X_i - \mu)^2\right) / (\frac{n-1}{n} \sum_i w_i)\).

Here is an implementation for the Julia programming language.

type MeanVarianceAccumulator

    function MeanVarianceAccumulator()
        new(0.0, 0.0, 0.0, 0)
function observe!(mvar::MeanVarianceAccumulator, value, weight)
    @assert weight >= 0.0
    q = value - mvar.wmean
    temp_sumw = mvar.sumw + weight
    r = q*weight / temp_sumw

    mvar.wmean += r
    mvar.t += q*r*mvar.sumw
    mvar.sumw = temp_sumw
    mvar.n += 1

count(mvar::MeanVarianceAccumulator) = mvar.n
mean(mvar::MeanVarianceAccumulator) = mvar.wmean
var(mvar::MeanVarianceAccumulator) = (mvar.t*mvar.n)/(mvar.sumw*(mvar.n-1))
std(mvar::MeanVarianceAccumulator) = sqrt(var(mvar))

You would call it as follows (tested with Julia version 0.3.5):

X = [5.0, -1.5, 3.33]
w = [0.5, 1.0, 0.1]

n = length(X)
mu_exact = sum(w.*X) / sum(w)
V_exact = sum(w .* ((X .- mu_exact).^2)) / (((n-1)/n) * sum(w))

mvar = MeanVarianceAccumulator()
for i=1:n
    observe!(mvar, X[i], w[i])
mean(mvar), mu_exact, var(mvar), V_exact

This gives the correct output (running mean, mean, running variance, variance):


Alternative algorithms and variants for higher-order moments can be found on the excellent Wikipedia page on the topic.

Addendum: (October 2015) A recent paper by (Meng, 2015) contains a variant of the above algorithm for the unweighted case to compute the first four central moments in a numerically stable manner. Meng provides a simple implementation requiring only 24 floating point operations per observation.

Acknowledgements. I thank Amit Adam for reading a draft and providing comments that improved clarity.