Sebastian Nowozin's blog (http://www.nowozin.net/sebastian/blog/)

<h1>NIPS 2016 Generative Adversarial Training workshop talk (2016-12-10)</h1>
<p>The biggest AI conference of the year has just ended:
<a href="http://nips.cc/">NIPS</a> in Barcelona broke all records this year and the
program was exciting as always. It certainly remains my favorite conference
to attend.</p>
<p>One of the best things about NIPS are the numerous high-quality workshops;
this year <a href="http://lopezpaz.org/">David Lopez-Paz</a>,
<a href="https://twitter.com/alecrad">Alec Radford</a>, and
<a href="http://leon.bottou.org/">Léon Bottou</a> put together a workshop on
<a href="https://sites.google.com/site/nips2016adversarial/">Adversarial Training</a>,
with most of the content related to <em>generative adversarial networks</em> (GAN).</p>
<p>If you have not heard of GANs before, <a href="http://www.iangoodfellow.com/">Ian
Goodfellow</a> gave a detailed <a href="https://nips.cc/Conferences/2016/Schedule?showEvent=6202">tutorial on
GANs</a> earlier in the week, <a href="http://www.iangoodfellow.com/slides/2016-12-04-NIPS.pdf">slides
here</a>; GANs were certainly the hot topic of this year's NIPS.</p>
<h2><span class="math">\(f\)</span>-GAN Talk Slides</h2>
<p>I gave an invited talk at the GAN workshop on the NIPS 2016 paper on
<a href="http://www.nowozin.net/sebastian/papers/nowozin2016fgan.pdf">f-GAN</a>, authored
by <a href="https://www.microsoft.com/en-us/research/people/ryoto/">Ryota Tomioka</a>,
Botond Cseke, and myself.</p>
<p>Here is the slide deck I used during the talk.
<iframe
src="https://onedrive.live.com/embed?cid=6B87C0D396848478&resid=6B87C0D396848478%21282256&authkey=AMnj1btWAs8J-aU&em=2"
width="402" height="327" frameborder="0" scrolling="no"></iframe></p>
<p>Please let me know your feedback.</p>
<p><em>Acknowledgments.</em> This is joint work with Ryota Tomioka and Botond Cseke.</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = '//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: 'center'," +
" displayIndent: '0em'," +
" showMathMenu: true," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" +
" } " +
"}); ";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>

<h1>Book Review: Computer Age Statistical Inference (2016-11-23)</h1>
<p><img alt="Book cover: Computer Age Statistical
Inference" src="http://www.nowozin.net/sebastian/blog/images/computer-age-statistical-inference.jpg" /></p>
<p>A new book, <a href="http://www.cambridge.org/us/academic/subjects/statistics-probability/statistical-theory-and-methods/computer-age-statistical-inference-algorithms-evidence-and-data-science">Computer Age Statistical Inference: Algorithms, Evidence, and Data
Science</a> by
<a href="http://statweb.stanford.edu/~ckirby/brad/">Bradley Efron</a> and
<a href="http://web.stanford.edu/~hastie/">Trevor Hastie</a>, was released in July this
year. I finished reading it a few weeks ago and this is a short review from
the point of view of a machine learning researcher.</p>
<p>Living in Cambridge, I indulge myself every once in a while by taking a break
at the <a href="http://www.cambridge.org/about-us/visit-bookshop/history-bookshop">Cambridge University Press
bookstore</a>
on the market square; located just opposite King's College, it is the oldest
book shop in England.
Besides having an excellent collection of mathematics and computer science
books, at the entrance of the shop they showcase new releases from Cambridge
University Press.
Most of these new books fall outside my interest, but what a pleasure it was
to discover a new bold book on the broad theme of statistics in the modern
age, written by two experts in the field!
I took a look at the table of contents and a minute later purchased the book.</p>
<h1>Review</h1>
<p>The book examines statistics broadly through three lenses.</p>
<p><em>First</em>, it tells the history of the field of statistics, often with
interesting remarks about the prevalent views at the time a method was
invented.
<em>Second</em>, correlated with the chronological order, the authors classify
methods by their use of computation. Classic methods use little to no
computation but often lean on asymptotic arguments. Newer methods
are increasingly realistic in their assumptions but rely on heavy use of
machine computation.
<em>Third</em>, the flavour of the presented methods is interpreted as <em>Fisherian</em>,
<em>frequentist</em>, or <em>Bayesian</em>.</p>
<p>The terminology in the book is easily accessible to a person with basic
statistics training, perhaps with the exception of the word "<em>Inference</em>" in
the title.
In the book the authors use "inference" to describe the means by which
legitimacy of statistical results can be established.
This sense is different from the common use of the word in the machine
learning community, where it usually refers, in a broad sense, to "computing
the consequences of a model given observations".</p>
<p>From a machine learner's perspective the most interesting part of the book
is its demonstration of the wide applicability of the empirical Bayes
methodology in a number of generally relevant applications, including
<em>large-scale testing</em> and <em>deconvolution</em>.</p>
<p>Other benefits for someone with a machine learning background are the modern
view on classic methods such as resampling (bootstrap and jackknife),
the readable motivation for topics and applications that are popular in
statistics but not in machine learning (survival analysis, large-scale
testing, confidence intervals, etc.), and the historical remarks and
subjective commentary on developments in the field.</p>
<p>The subjective commentary in the <em>Epilogue</em> makes predictions about the field
of statistics and data science as a whole, with the main trends being a
branching out into applications and an increased reliance on computation.</p>
<h2>Criticism</h2>
<p>This is a wonderful book and many readers will enjoy it, as I did.
There are only two minor points where I feel the book could be improved.</p>
<p><em>First</em>, while the authors readily acknowledge that many topics could have
been added to the book, I feel that certain topics should have been included
due to their broad applicability and heavy use of computation in many
successful models:
variational Bayesian inference, approximate Bayesian computation (ABC),
kernel methods more generally, and Bayesian nonparametrics.
Perhaps variational inference and kernel methods have not reached the core
statistics community yet, but ABC and Bayesian nonparametrics originated
within that community and are only possible because of the massive
computation available today.</p>
<p><em>Second</em>, in the description of solutions to statistical problems throughout
the book there is a strong emphasis on empirical Bayes and the bootstrap,
perhaps at the expense of alternative approaches.</p>
<h1>Summary</h1>
<p>If you enjoy statistics, computation, or machine learning, get the book!
The breadth of topics and the independence between the chapters will make it
easy for you to find something interesting.</p>
<p><em>Acknowledgements.</em> Thanks to Diana Gillooly for corrections.</p>

<h1>Streaming Log-sum-exp Computation (2016-05-08)</h1>
<p>A common numerical operation in statistical computing is to compute</p>
<p>
<div class="math">$$\log \sum_{i=1}^n \exp x_i,$$</div>
</p>
<p>where <span class="math">\(x_i \in \mathbb{R}\)</span>, and <span class="math">\(n\)</span> is potentially very large.</p>
<p>We can implement the above computation by exponentiating each number, then
summing them, then taking a logarithm as follows (written in
<a href="http://julialang.org/">Julia</a>).</p>
<div class="highlight"><pre><span></span><span class="n">logsumexp_naive</span><span class="p">(</span><span class="n">X</span><span class="p">)</span> <span class="o">=</span> <span class="n">log</span><span class="p">(</span><span class="n">sum</span><span class="p">(</span><span class="n">exp</span><span class="p">(</span><span class="n">X</span><span class="p">)))</span>
</pre></div>
<p>When the above function returns a finite number, it is numerically
accurate. However, the computation is not robust: if one of the elements is
very large (say, larger than about 710 for double-precision IEEE floating point),
then <span class="math">\(\exp(x_i)\)</span> overflows to a floating point infinity and the entire
computation returns a floating point infinity as well.</p>
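<p>To make the failure mode concrete, here is a small Python sketch of the same
naive computation (an illustrative translation; the post's code is Julia). Note
one platform difference: where Julia's <code>exp</code> silently returns <code>Inf</code>,
Python's <code>math.exp</code> raises <code>OverflowError</code>, but the result is lost
either way.</p>

```python
import math

def logsumexp_naive(xs):
    # Exponentiate each element, sum, then take the logarithm.
    return math.log(sum(math.exp(x) for x in xs))

print(logsumexp_naive([1.0, 2.0, 3.0]))  # about 3.4076, accurate

# A single large element breaks the computation: math.exp(1000.0)
# overflows (Python raises OverflowError; IEEE semantics give Inf).
try:
    logsumexp_naive([1000.0, 1.0])
except OverflowError:
    print("naive version overflows")
```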
<h2>Standard Batch Solution</h2>
<p>The standard solution to this problem is to use the mathematical identity</p>
<p>
<div class="math">$$\log \sum_{i=1}^n \exp x_i = \alpha + \log \sum_{i=1}^n \exp (x_i - \alpha),$$</div>
</p>
<p>which holds for any <span class="math">\(\alpha \in \mathbb{R}\)</span>.
By selecting <span class="math">\(\alpha = \max_{i=1,\dots,n} x_i\)</span> no argument to the
<span class="math">\(\exp\)</span>-function will be larger than zero and the above naive computation can
be applied on the transformed numbers.
The code is as follows.</p>
<div class="highlight"><pre><span></span><span class="k">function</span><span class="nf"> logsumexp_batch</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
<span class="n">alpha</span> <span class="o">=</span> <span class="n">maximum</span><span class="p">(</span><span class="n">X</span><span class="p">)</span> <span class="c"># Find maximum value in X</span>
<span class="n">log</span><span class="p">(</span><span class="n">sum</span><span class="p">(</span><span class="n">exp</span><span class="p">(</span><span class="n">X</span><span class="o">-</span><span class="n">alpha</span><span class="p">)))</span> <span class="o">+</span> <span class="n">alpha</span>
<span class="k">end</span>
</pre></div>
<p>Code such as the above is used in almost all packages for performing
statistical computation and is described as the standard solution, see e.g.
<a href="https://hips.seas.harvard.edu/blog/2013/01/09/computing-log-sum-exp/">here</a>
and <a href="https://en.wikipedia.org/wiki/LogSumExp">here</a>.</p>
<p>However, there are the following problems:</p>
<ol>
<li>It requires two scans over the data array, one to find the maximum, one to
compute the summation. For modern systems and large input arrays the above
computation is memory-bandwidth limited so two memory scans mean twice the
runtime.</li>
<li>It requires knowledge of the number of elements in the sum prior to
computation.</li>
</ol>
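<p>For reference, the two-pass batch version above translates directly to Python
(an illustrative sketch of the same algorithm; the post uses Julia):</p>

```python
import math

def logsumexp_batch(xs):
    # Pass 1 over the data: find the maximum element.
    alpha = max(xs)
    # Pass 2: no argument to exp exceeds zero, so nothing overflows.
    return alpha + math.log(sum(math.exp(x - alpha) for x in xs))

print(logsumexp_batch([1000.0, 1000.0]))  # 1000.6931..., robust where the naive version fails
```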
<h2>Streaming log-sum-exp Computation</h2>
<p>The solution is to also compute the maximum element in a streaming manner and
to correct a running estimate whenever a new maximum is found.
I have not seen this solution elsewhere, but I hope you may find it useful.</p>
<p>First, here is the code.</p>
<div class="highlight"><pre><span></span><span class="k">function</span><span class="nf"> logsumexp_stream</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
<span class="n">alpha</span> <span class="o">=</span> <span class="o">-</span><span class="nb">Inf</span>
<span class="n">r</span> <span class="o">=</span> <span class="mf">0.0</span>
<span class="k">for</span> <span class="n">x</span> <span class="o">=</span> <span class="n">X</span>
<span class="k">if</span> <span class="n">x</span> <span class="o"><=</span> <span class="n">alpha</span>
<span class="n">r</span> <span class="o">+=</span> <span class="n">exp</span><span class="p">(</span><span class="n">x</span> <span class="o">-</span> <span class="n">alpha</span><span class="p">)</span>
<span class="k">else</span>
<span class="n">r</span> <span class="o">*=</span> <span class="n">exp</span><span class="p">(</span><span class="n">alpha</span> <span class="o">-</span> <span class="n">x</span><span class="p">)</span>
<span class="n">r</span> <span class="o">+=</span> <span class="mf">1.0</span>
<span class="n">alpha</span> <span class="o">=</span> <span class="n">x</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="n">log</span><span class="p">(</span><span class="n">r</span><span class="p">)</span> <span class="o">+</span> <span class="n">alpha</span>
<span class="k">end</span>
</pre></div>
<p>As you can see by glancing over the code, only one linear pass over the
input is required and we do not need to know the number of elements.</p>
<p>To understand how the code works, assume we maintain two quantities.
The first is the largest value seen after <span class="math">\(i\)</span> elements,</p>
<p>
<div class="math">$$\alpha_i := \max_{j = 1,\dots,i} x_j.$$</div>
</p>
<p>The second is the accumulated sum so far with the current maximum subtracted,</p>
<p>
<div class="math">$$r_i := \sum_{j=1}^i \exp(x_j - \alpha_i).$$</div>
</p>
<p>Now when we visit a new element <span class="math">\(x_{i+1}\)</span> there are two cases that can happen.
If <span class="math">\(x_{i+1} \leq \alpha_i\)</span> then <span class="math">\(\alpha_{i+1} = \alpha_i\)</span> and we simply update</p>
<p>
<div class="math">$$r_{i+1} = r_i + \exp(x_{i+1} - \alpha_{i+1}).$$</div>
</p>
<p>However, if we see a new largest element, we can write <span class="math">\(r_i\)</span> as</p>
<p>
<div class="math">$$r_i := \sum_{j=1}^i \exp(x_j - \alpha_i) = \exp(-\alpha_i) \sum_{j=1}^i \exp(x_j).$$</div>
</p>
<p>We correct this estimate to use the new maximum <span class="math">\(x_{i+1}\)</span>
and cancel the old maximum <span class="math">\(\alpha_i\)</span>,</p>
<p>
<div class="math">$$r'_{i+1} = \exp(\alpha_i - x_{i+1}) \, r_i.$$</div>
</p>
<p>The rescaling factor is always smaller than one, so this step cannot overflow.
Then we proceed to accumulate as normal to obtain</p>
<p>
<div class="math">$$r_{i+1} = r'_{i+1} + \exp(x_{i+1} - \alpha_{i+1}) = r'_{i+1} + 1.$$</div>
</p>
<p>The above code is as numerically robust as the commonly used batch version and
for large arrays can be twice as fast.</p>
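<p>For completeness, the streaming update translates to Python as follows (an
illustrative port of the Julia code above; a single pass, no prior knowledge of
<span class="math">\(n\)</span> required):</p>

```python
import math

def logsumexp_stream(xs):
    alpha = float("-inf")  # running maximum alpha_i
    r = 0.0                # running sum r_i = sum_j exp(x_j - alpha_i)
    for x in xs:
        if x <= alpha:
            # No new maximum: accumulate normally.
            r += math.exp(x - alpha)
        else:
            # New maximum: rescale the old sum by exp(alpha - x) (< 1, so no
            # overflow), then the new element contributes exp(x - x) = 1.
            r = r * math.exp(alpha - x) + 1.0
            alpha = x
    return math.log(r) + alpha

print(logsumexp_stream([1000.0, 1000.0]))  # 1000.6931..., matches the batch version
```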
<h2>Example</h2>
<p>Running</p>
<div class="highlight"><pre><span></span><span class="n">n</span> <span class="o">=</span> <span class="mi">10_000_000</span>
<span class="n">X</span> <span class="o">=</span> <span class="mf">500.0</span><span class="o">*</span><span class="n">randn</span><span class="p">(</span><span class="n">n</span><span class="p">)</span>
<span class="n">logsumexp_naive</span><span class="p">(</span><span class="n">X</span><span class="p">),</span> <span class="n">logsumexp_batch</span><span class="p">(</span><span class="n">X</span><span class="p">),</span> <span class="n">logsumexp_stream</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
</pre></div>
<p>gives the following output</p>
<div class="highlight"><pre><span></span><span class="p">(</span><span class="nb">Inf</span><span class="p">,</span><span class="mf">2686.7659554831052</span><span class="p">,</span><span class="mf">2686.7659554831052</span><span class="p">)</span>
</pre></div>
<h1>Where will Artificial Intelligence come from? (2016-04-20)</h1>
<p>Artificial Intelligence (AI) is making great strides, or at least
it appears so!
Almost no week passes by without some major announcements of new challenges
solved by AI technology or new products powered by AI.</p>
<p>Indeed, many quantifiable factors attest to an unprecedented level of activity:
capital investments, the number of academic papers, and the number of products involving AI technology have all risen steeply in the past five years.</p>
<p>Computers are already very capable at some specialized tasks that require
reasoning and other abilities that we typically associate with intelligence.
For example, computers can play a decent game of chess or can help us order
our holiday photos.
Despite this genuine progress, we are still a long way from human-level
intelligence because our best artificial intelligence systems are not
<em>general</em> purpose.
They cannot quickly adapt to novel tasks the way most humans can.</p>
<p>When talking about artificially intelligent systems there is a risk of
emphasizing humans too much. Computers are <em>already</em> more capable than any
human at many tasks, for example in numerical computation and search.
Yet, in discussions about artificial intelligence we emphasize the shrinking
set of abilities where humans still outperform machines.
For a nice and more balanced recent discussion on issues surrounding
artificial intelligence I recommend reading the <a href="http://edge.org/responses/q2015">edge
contributions towards the Edge 2015
question</a>.</p>
<p>As artificial intelligence continues to make progress, I would like to ask the
following question:</p>
<p><strong>Where will the next major advance towards general purpose artificial
intelligence come from?</strong></p>
<p>Below I list seven possible areas which I believe could be the answer to this
question; these answers are highly subjective and biased, and they may all be
wrong, but hopefully they contain some interesting pointers for everyone.</p>
<p>The point of this exercise is to show that there are many strands of active
research that could result in major AI advances.
So here they are: the seven areas from which a major general-purpose AI
breakthrough could come.</p>
<h1>1. Composable Differentiable Architectures (aka Deep Learning)</h1>
<p><em>Composable differentiable architectures</em> describes current state-of-the-art
deep learning systems.
Frameworks such as
<a href="http://caffe.berkeleyvision.org/">Caffe</a>,
<a href="http://deeplearning.net/software/theano/">Theano</a>,
<a href="http://torch.ch/">Torch</a>,
<a href="http://chainer.org/">Chainer</a>, all allow the specification of function
classes and the automatic composition and differentiation of such functions.
Because of this mix-and-match composability there is a frictionless and rapid
diffusion of components and (sub-)models across application domains.</p>
<p>This <em>commoditizes</em> machine learning and allows customization to specific
applications;
it commoditizes machine learning because the level of knowledge required to
leverage modern deep learning frameworks is low.
These deep learning frameworks also allow for easy
<em>customization</em> of the model to the application at hand.
Years ago, this <em>was</em> the unattained dream for graphical models, but today it
<em>is</em> achieved by deep learning frameworks, where bespoke models are built for
most applications.</p>
<p><img alt="Deep tree" src="http://www.nowozin.net/sebastian/blog/images/ai-deep-tree.png" /></p>
<p>But is it enough for general purpose AI? What is missing?</p>
<p>I believe there are two obstacles;
<em>first</em>, almost all deep learning systems
require large amounts of supervised data to work.
For high-value industrial applications this may be okay because the required
label data can be collected.
However, there is a long tail of useful applications where label data is rare
but unlabeled data is abundant.
Future AI systems need to be able to leverage this abundant data source.</p>
<p><em>Second</em>, what is missing are general architectures for reasoning, and an
intense search for such building blocks is currently taking place. Maybe
classic ideas from AI, such as <em>blackboard systems</em>, could be adapted and made
differentiable to enable reasoning, or maybe some entirely unexpected new
building block will appear.</p>
<p>Besides better models, the key novel technology to look out for in deep
learning is custom hardware and novel engineering abstractions.
Custom hardware could enable energy savings, or increased speed, or both.
Current deep learning piggybacks on GPU development funded largely by the
gaming industry. This is a great thing because developing a new GPU
generation such as Nvidia's new Pascal GPU requires very large research and
development budgets.
Novel engineering abstractions in the form of next generation deep learning
frameworks could enable automatic scalability, distributed computation, or
offer help in identifying the right architecture for the task.</p>
<p><em>Scalability</em> is important beyond just training speed. For example, consider
<a href="http://people.idsia.ch/~juergen/raw.html">basic estimates of the computing power of the human
brain</a> or the following quote from a
<a href="http://www.macleans.ca/society/science/the-meaning-of-alphago-the-ai-program-that-beat-a-go-champ/">recent interview with Geoff
Hinton</a>.</p>
<blockquote>
<p>"So in the brain, you have connections between the neurons called synapses,
and they can change. All your knowledge is stored in those synapses. You
have about 1,000-trillion synapses, 10 to the 15, it's a very big number. So
that's quite unlike the neural networks we have right now. They're far, far
smaller, the biggest ones we have right now have about a billion synapses.
That's about a million times smaller than the brain."</p>
</blockquote>
<p>This puts up a ballpark estimate for the number of primitive computational
units in the human brain, and it is quite reasonable to attempt to achieve
this scale.</p>
<p>One important fact to consider: the driving force behind applications of deep
learning is largely industry, and this will remain the case as long as it
pays dividends (it does so handsomely at the moment).</p>
<h1>2. Brain Simulations</h1>
<p>Understanding the brain and simulating it is what I think of as the
<em>safe route</em> to general AI.
We do not know whether it will take 5, 50, or 500 years, but it is likely
that we eventually will get there and be able to accurately simulate an
artificial brain which is functionally indistinguishable from a real human
brain.</p>
<p><img alt="Human brain" src="http://www.nowozin.net/sebastian/blog/images/ai-gauss-brain.png" /></p>
<p>Novel technology and approaches to study neural systems, such as
<a href="http://edge.org/conversation/ed_boyden-how-the-brain-is-computing-the-mind">optogenetics</a>,
<a href="http://www.nature.com/nnano/journal/v8/n2/full/nnano.2012.265.html">multi-electrode arrays</a>
and
<a href="https://en.wikipedia.org/wiki/Connectomics">connectomics</a>, eventually will
enable us to obtain a high-fidelity understanding of the brain.
Likewise, increase in computation and custom hardware will allow accelerated
simulation of neural models.</p>
<p>Most of the investments in this area of research are government funds, for
example through the large
US <a href="https://en.wikipedia.org/wiki/BRAIN_Initiative">BRAIN initiative</a> and the
<a href="https://en.wikipedia.org/wiki/Human_Brain_Project">Human Brain Project</a>, and
more general neuroscience funding.</p>
<h1>3. Algorithmic Information Theory and Universal Intelligence</h1>
<p>Whatever intelligence is, if we were to accept the possibility of a
mathematical theory for it, the closest contenders for such a theory are found
in a field called algorithmic information theory.
If you have not heard of algorithmic information theory before, <a href="https://en.wikipedia.org/wiki/Gregory_Chaitin">Gregory
Chaitin</a> recently wrote <a href="http://inference-review.com/article/doing-mathematics-differently">an
excellent essay on the conceptual roots of algorithmic information
theory</a> and
the general history of notions of complexity in science and mathematics.</p>
<p>One approach which leverages algorithmic information theory for general
artificial intelligence is the
<a href="https://en.wikipedia.org/wiki/AIXI">AIXI agent</a>, a
theory put forward by <a href="http://www.hutter1.net/">Marcus Hutter</a> that attempts
to be universal in the sense that it will successfully and optimally solve any
solvable task.
At its heart it is a <a href="http://www.nowpublishers.com/article/Details/MAL-049">Bayesian reinforcement learning
agent</a> where the
hypothesis space are possible programs of a <a href="https://en.wikipedia.org/wiki/Turing_machine">Turing
machine</a>.
It is an extension of an earlier idea to consider Turing machines for
predicting future symbols in an observed sequence. This idea,
<a href="https://en.wikipedia.org/wiki/Solomonoff's_theory_of_inductive_inference">Solomonoff induction</a>, was proposed by <a href="http://world.std.com/~rjs/tributes/vitanyi.html">Ray
Solomonoff</a>.
Because Turing machines are universal any computable hypothesis can be
entertained.
AIXI extends this idea from mere prediction of symbols to acting in an unknown
environment, that is, to <a href="https://en.wikipedia.org/wiki/Reinforcement_learning">reinforcement
learning</a>.</p>
<p><img alt="Alan Turing" src="http://www.nowozin.net/sebastian/blog/images/ai-turing.png" /></p>
<p>Grounding intelligence in Turing machines is very appealing: not only is it
universal, but it also allows the <a href="http://arxiv.org/abs/0712.3329">formal <em>definition</em> of universal
intelligence</a>.
In essence, reasoning and acting intelligently is reduced to formal
manipulation of a notion of complexity defined by programs on a Turing
machine. See also <a href="http://people.idsia.ch/~juergen/">Jürgen
Schmidhuber</a>'s <a href="http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=48FFC58A49C83119D49EB93C5AA2A975?doi=10.1.1.7.2717&rep=rep1&type=pdf">speed prior for Turing
machines</a>.</p>
<p>Despite this promise, so far we do not see impressive results achieved by AIXI
agents. Why not?
There are at least two obstacles:</p>
<ol>
<li>
<p>Universal Turing machines are not practically implementable and
<a href="https://arxiv.org/abs/1510.05572">approximating AIXI is hard</a>; there have
been some approximation attempts, e.g. in the work of <a href="http://www.aaai.org/Papers/JAIR/Vol40/JAIR-4004.pdf">(Veness et al., JAIR
2011)</a>, but at best
results have matched other reinforcement learning methods without enabling
novel applications that were out of reach before.
More recently <a href="http://people.idsia.ch/~juergen/">Jürgen Schmidhuber</a>
proposed a more practical integration of recurrent neural network models of
the world with algorithmic information theory in the form of <a href="http://arxiv.org/abs/1511.09249">RNN-based
AIs</a>.</p>
</li>
<li>
<p>The choice of Turing machine is not clear.
There is an infinite set of possible universal Turing machines and we could
reasonably hope that the particular choice would not influence the agent
efficiency except perhaps for some small overhead.
(For a related example, the <a href="https://en.wikipedia.org/wiki/Kolmogorov_complexity">Kolmogorov
complexity</a> of a sequence
is defined through a Turing machine, but whatever the choice of Turing
machine, there is an invariance property: only a constant overhead is
introduced compared to any other Turing machine.)
<a href="https://arxiv.org/abs/1510.04931">Unfortunately for AIXI this is not the
case</a>: the choice of Turing machine
can determine the behaviour of the AIXI agent entirely.
(This may also affect Bayesian reinforcement learning more generally: when
using a non-parametric prior process the choice of prior may determine more
than intended.)</p>
</li>
</ol>
<p>This recent negative result leaves AIXI in an interesting state at the moment.
It is clearly the most complete theory of universal agents we have at the
moment, see e.g. <a href="https://arxiv.org/abs/1202.6153">Hutter's own review from
2012</a>, but it may turn out to be entirely
subjective (if no "<em>natural</em>" Turing machine can be identified) or practically
unworkable.</p>
<h1>4. Artificial Life</h1>
<p>In the above section on brain simulation I argued that by understanding the
human brain and then simulating it we will eventually be able to attain
human-level intelligence.
However, we can start at a more basic level: by understanding and simulating a
synthetic form of chemistry we may be able to simulate artificial life.
Given a sufficiently rich environment such life may evolve to become
intelligent.</p>
<p>The field of <em>Artificial Life</em> (ALife) studies the formation and dynamics of
life itself on top of artificial simulations.
This life does not need to be intelligent, and in fact, so far no such
simulation has produced life with intelligence beyond that of a simple
organism. But it is clear that <a href="https://en.wikipedia.org/wiki/Tierra_%28computer_simulation%29">since the early
1990s</a>, by
any generally plausible definition of life (of which there are many and there
is some controversy), artificial life does indeed spontaneously form in
computer simulations and complex evolutionary dynamics such as symbiosis and
parasites do occur in these simulations.</p>
<p>For a dated but inspiring introduction to the field of artificial life
more generally, see <a href="http://adamilab.msu.edu/">Christoph Adami</a>'s <a href="http://www.springer.com/us/book/9781461272311">book on
Artificial Life</a>.
Adami also wrote a recent <a href="http://adamilab.blogspot.co.uk/2015/12/evolving-intelligence-with-little-help.html">article on evolving artificial
intelligence</a>
that highlights current research issues for the goal to evolve artificial
intelligence.</p>
<p>More fundamentally, in <a href="http://adamilab.blogspot.com/2013/02/your-conscious-you.html">another
article</a> Adami
argues, based on theoretical results in a field called <a href="https://en.wikipedia.org/wiki/Integrated_information_theory"><em>integrated
information
theory</em></a>
(which I had not heard of before), that one possible consequence may be that,
due to the complexity of general intelligence, it is not possible to design it;
instead, an evolutionary approach is needed.</p>
<p>Given that our goal is to evolve intelligence artificially, there are
fundamentally the following obstacles to producing useful general artificial
intelligence through artificial life:</p>
<p><img alt="Anomalocaris" src="http://www.nowozin.net/sebastian/blog/images/ai-anomalocaris.png" /></p>
<ol>
<li>
<p>The <em>big intelligence filter</em> hypothesis.
This hypothesis goes as follows: life may be abundant but intelligent life may
be exceedingly rare.
We currently do not know if intelligent life is rare or abundant in the
universe, but if it is rare, it may also be exceedingly rare in any simulation
of artificial life.
A related point is the <em>Fermi paradox</em>: our current understanding of
astrophysics suggests that we should likely have observed alien
civilizations by now, yet we have not. (See Tim Urban's
wonderful article on the <a href="http://waitbutwhy.com/2014/05/fermi-paradox.html">Fermi
paradox</a>.)
Even for life on our own planet we are not sure what triggered intelligence to
appear; one widely believed hypothesis is that it happened in a short time,
akin to a phase transition, due to a change in ocean oxygen levels 540 million
years ago, leading to the <a href="http://www.nature.com/news/what-sparked-the-cambrian-explosion-1.19379">Cambrian
explosion</a>.</p>
</li>
<li>
<p>Harnessing intelligence.
One of our closest genetic relatives, the chimpanzee, is clearly intelligent,
but harnessing this intelligence for something useful is difficult.
Now imagine a giant squid, swimming a kilometer deep within the ocean.
It is likely also intelligent, but we can hardly leverage this for anything
useful.
Who knows what form intelligent artificial life will take?
Will we be able to recognize this life as intelligent?
If we do, such life will likely be similar to encountering an alien species in
our universe: unlike anything you can imagine or predict beforehand.
That is to say, we may be able to achieve intelligent artificial life but may
still struggle to make it useful.
Even with full control over the simulation environment, a god-like position if
you will, making intelligent artificial life useful will likely require us at
least to decode its representation or "language" and to understand its
incentives well enough to communicate with such an intelligence and motivate
it to work for us.</p>
</li>
</ol>
<p><img alt="Giant squid" src="http://www.nowozin.net/sebastian/blog/images/ai-octopus-drawing.png" /></p>
<p>In summary, the evolutionary approach to constructing AIs is promising in the
long run and there are now several labs working on it
(the labs of <a href="http://adamilab.msu.edu/">Adami</a>,
<a href="http://www.evolvingai.org/">Clune</a>, and <a href="http://hintzelab.msu.edu/">Hintze</a>).</p>
<h1>5. Robotics and Autonomous Systems</h1>
<p>Autonomous robots are rapidly conquering novel applications in industry and
consumer space, such as in self-driving cars, agricultural robotics,
<a href="https://en.wikipedia.org/wiki/Industry_4.0">industry 4.0</a>, and drones.</p>
<p>The key enablers of this development are improved sensing technology (e.g.
low-cost depth sensors), increased compute and memory capacities, and improved
pattern recognition methods.
As these basic technologies mature, significant industry capital is being
invested to drive advanced autonomous robotics research.</p>
<p>Beyond the natural urge to be scared of autonomous machines, how could this
development lead to a breakthrough in artificial intelligence that cannot be
found in one of the constituent technologies?</p>
<p><img alt="Humanoid robot" src="http://www.nowozin.net/sebastian/blog/images/ai-robot.png" /></p>
<p>One line of thought in the field of <a href="https://en.wikipedia.org/wiki/Embodied_cognition">embodied
cognition</a> argues that
an intelligent system is conditioned on its environment in a fundamental way,
shaping the allocation of precious (evolutionary) resources in order to
maximally exploit the types of sensors and actuators available to it.
Therefore the specific nature of sensing and acting abilities is not ancillary
to intelligence but the main driving force that enables intelligence in the
first place.</p>
<p>If the above thesis were true, autonomous robots with modern sensors and
actuators would provide a rich enough <em>embodiment</em> for artificial
intelligence, and the lack of such an embodiment in other domains would likely
impede the emergence of general intelligence.</p>
<p>In the past decade, the European Union, through its robotics programme in the
7th Framework Programme (FP7, totalling more than 50 billion Euro for
2007-2013) has placed an emphasis on <em>combining</em> cognition with robotics at
the exclusion of funding research on artificial intelligence not involving
robotics.
However, many of the resulting <a href="http://cordis.europa.eu/fp7/ict/robotics/projects/understanding/learning_en.html">large research projects of that
time</a>
are more reflecting the ample funding availability rather than representing
progress on fundamental questions of cognition.</p>
<h1>6. Game Playing</h1>
<p>Games entertain humans; what could they do to enable artificial intelligence?</p>
<p>The answer: quite a lot!
Games are designed to challenge our intellect, involve interactions
between multiple agents, and are sufficiently abstract to be formalized.
A computer-implemented game can be as simple and abstract as tic-tac-toe,
chess, or Go, or as sophisticated and close to reality as the latest Grand
Theft Auto game.</p>
<p><img alt="Dice" src="http://www.nowozin.net/sebastian/blog/images/ai-dice.png" /></p>
<p>Therefore games are an almost ideal research vehicle to drive artificial
intelligence research.
<a href="http://engineering.nyu.edu/people/julian-togelius">Julian Togelius</a> argues
this point eloquently in <a href="http://togelius.blogspot.co.uk/2016/01/why-video-games-are-essential-for.html">a recent
article</a>.</p>
<p>In fact, there are now popular game playing competitions and platforms which
drive AI research:
the Stanford <a href="http://games.stanford.edu/">general game playing competition</a>,
the <a href="http://www.computerpokercompetition.org/">Computer Poker Competition</a>,
the <a href="http://webdocs.cs.ualberta.ca/~cdavid/starcraftaicomp/report2015.shtml">StarCraft AI Competition</a>,
the <a href="http://www.arcadelearningenvironment.org/">Atari 2600 Arcade Learning Environment</a>, and, most recently,
<a href="https://blogs.microsoft.com/next/2016/03/13/project-malmo-using-minecraft-build-intelligent-technology/">Microsoft's Minecraft AI environment
(Malmo)</a>.</p>
<p>It is likely that such platforms will provide diverse and challenging
environments for testing the abilities of artificial general intelligence
agents, thus accelerating research and enabling breakthroughs. Perhaps the
next breakthrough will be in the form of mastering another game.</p>
<h1>7. Knowledge Bases</h1>
<p>A knowledge base is a discrete representation of basic facts and relations
about entities.
Large-scale knowledge bases constructed semi-automatically from the web are
already incredibly useful commercially and they power search engine results and
personal assistants.</p>
<p><img alt="Couple" src="http://www.nowozin.net/sebastian/blog/images/ai-couple.png" /></p>
<p>In web search they provide highly accurate information for known entities in
all major search engines (e.g. <a href="https://en.wikipedia.org/wiki/Knowledge_Graph">Knowledge
Graph</a> and <a href="https://en.wikipedia.org/wiki/Knowledge_Vault">Knowledge
Vault</a> in Google,
<a href="http://arstechnica.com/information-technology/2012/06/inside-the-architecture-of-googles-knowledge-graph-and-microsofts-satori/">Satori</a> in Microsoft Bing).
To see an example, search for a well-known person, e.g. "Stanislaw Ulam"
(<a href="http://www.bing.com/search?q=Stanislaw+Ulam&go=Submit+Query&qs=bs&form=QBLH">results from Bing</a>, <a href="https://www.google.com/?gws_rd=ssl#safe=off&q=Stanislaw+Ulam">results from Google</a>) and observe the details displayed about the person.</p>
<p>In personal intelligent assistants such as
<a href="http://www.apple.com/ios/siri/">Apple Siri</a>,
<a href="https://www.google.com/landing/now/">Google Now</a>,
<a href="https://www.microsoft.com/en-us/mobile/experiences/cortana/">Microsoft Cortana</a>, or
<a href="https://www.amazon.com/gp/help/customer/display.html?nodeId=201549800">Amazon Alexa</a>
knowledge bases are responsible for providing facts and basic reasoning abilities. For
example, in order to answer queries such as "Who was the president following
Thomas Jefferson?" a basic natural language understanding ability and a large
knowledge base go a long way.</p>
<p>But can knowledge bases provide the substrate for artificial intelligence?
The <a href="https://en.wikipedia.org/wiki/Cyc">Cyc project</a>, started in 1984, and the
<a href="https://en.wikipedia.org/wiki/Open_Mind_Common_Sense">Open Mind Common Sense
project</a>, started in
1999, are both based on the belief that enabling artificial intelligence
requires encoding common sense reasoning, particularly the entities and
relationships of everyday life.
The hope was that knowledge encoded in this way would make reasoning and the
discovery of novel knowledge simpler.</p>
<p>It is fair to say that, while the (commercial) usefulness of knowledge bases
for intelligent applications is now well established, it is too early to say
whether general artificial intelligence would require reasoning on top of an
explicit symbolic knowledge base.
Perhaps a more continuous and non-symbolic representation of knowledge that
supports reasoning is sufficient.</p>
<h1>Conclusion</h1>
<p>The goal of artificial general intelligence (AGI) is challenging and exciting
on many levels.
In all likelihood artificial intelligence will make rapid progress in the next
decade, perhaps along the directions we just discussed.</p>
<p><em>Acknowledgements</em>. I thank
<a href="http://people.idsia.ch/~juergen/">Jürgen Schmidhuber</a>,
<a href="http://inverseprobability.com/">Neil Lawrence</a>,
<a href="http://research.microsoft.com/en-us/um/people/pkohli/">Pushmeet Kohli</a>,
<a href="http://adamilab.msu.edu/">Chris Adami</a>, and
<a href="http://www.cs.rhul.ac.uk/home/chrisw/">Chris Watkins</a>
for feedback, pointers to literature, and corrections on
a draft version of the article.</p>
<p><em>Image credits</em>.
The tree image is licensed CC-BY-SA by
<a href="http://adoomer.deviantart.com/art/Tree-of-life-190974117">adoomer</a>.
The brain image is a drawing of the brain of Gauss and is <a href="https://commons.wikimedia.org/wiki/File:PSM_V26_D768_Brain_of_gauss.jpg">public
domain</a>.
The Anomalocaris image is CC-BY-3.0 licensed art by <a href="https://commons.wikimedia.org/wiki/File:Anomalocaris_BW.jpg">Nobu
Tamura</a>.
The octopus image is CC-BY-2.0 licensed art by <a href="https://www.flickr.com/photos/bibliodyssey/">Paul
K</a>.
The robot image is licensed CC-BY-2.0 by
<a href="https://www.flickr.com/photos/striatic/245603625/">striatic</a>.
The dice image is public domain by Personeoneste.
The couple image is <a href="https://commons.wikimedia.org/wiki/File:Her-aide-de-camp-Early-19thc-Humorous.png">public
domain</a>.</p>The Best of Unpublished Machine Learning and Statistics Books2016-02-09T23:00:00+01:00Sebastian Nowozintag:www.nowozin.net,2016-02-09:sebastian/blog/the-best-of-unpublished-machine-learning-and-statistics-books.html<p>Nowadays authors in the fields of statistics and machine learning often choose
to write their books openly by publishing early draft versions.
For popular books this creates a lot of feedback and in the end clearly
improves the final book when it is published.</p>
<p>Here is a short list of very promising draft books.
Because completing a book is difficult, it is likely that some of these books
will never be finished.</p>
<h3><a href="http://www.deeplearningbook.org/">"Deep Learning"</a></h3>
<p>By Ian Goodfellow, Yoshua Bengio, and Aaron Courville.</p>
<p>Deep learning has revolutionized multiple applied pattern recognition fields
since 2011.
If you want to get started in applying deep learning methods, now is the time.</p>
<p>First, there are now <a href="https://docs.google.com/spreadsheets/d/1XvGfi3TxWm7kuQ0DUqYrO6cxva196UJDxKTxccFqb9U/edit#gid=0">many well-engineered
frameworks</a>
available that make learning and experimentation fun.</p>
<p>Second, there are good learning resources available. If you have a solid
background in basic machine learning and basic linear algebra, this book is
for you.
The field of deep learning is advancing so swiftly that the book will likely
not cover all the latest models and techniques by the time it is published.
At the moment, however, it is quite up-to-date and contains some of the
best-organized material on the topic of deep learning.</p>
<h3><a href="http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/">"Advanced Data Analysis from an Elementary Point of View"</a></h3>
<p>By Cosma Rohilla Shalizi.</p>
<p>This book covers a lot of ground, from classical statistics
(linear/additive/spline regression, resampling methods, density estimation,
dimensionality reduction) to some more recent topics (causality, graphical
models, models for dependent data).</p>
<p>One feature of this book is highlighted by the phrase "from an elementary
point of view" in the title:
it is quite accessible and the author genuinely cares about conveying
understanding, often in a delightfully casual tone.</p>
<p>See also the unfinished but great book written by
<a href="http://www.stat.cmu.edu/~cshalizi/">Cosma Shalizi</a> and
<a href="http://www.cs.bgu.ac.il/~karyeh/">Aryeh Kontorovich</a>,
<a href="http://www.stat.cmu.edu/~cshalizi/almost-none/">"Almost None of the Theory of Stochastic
Processes"</a>.</p>
<h3><a href="http://statweb.stanford.edu/~owen/mc/">"Monte Carlo theory, methods and examples"</a></h3>
<p>By Art Owen.</p>
<p>While Monte Carlo methods have advanced in the last decade, driven in
particular by the needs of large-scale Bayesian statistics, the most
comprehensive textbooks (Liu; Robert and Casella; Rubinstein and Kroese) are now dated.</p>
<p>Art Owen's book is a wonderful addition that will become a classic once it is
completed.
From first principles and with great depth Art Owen introduces the theory and
practice of Monte Carlo methods.
Chapters 8 to 10 contain a wealth of material not found in other textbook
treatments.
I am eagerly waiting for chapters 11 to 17.</p>
<h3><a href="http://ciml.info/">"A Course in Machine Learning"</a></h3>
<p>By Hal Daume III.</p>
<p>A basic course on a broad set of machine learning methods, with chapters on
structured prediction and Bayesian learning planned but not yet written.
Very accessible, with lots of pseudo-code.</p>
<h3><a href="http://alex.smola.org/drafts/thebook.pdf">"Introduction to Machine Learning"</a></h3>
<p>By Alex Smola and SVN Vishwanathan.</p>
<p>This draft was last updated in 2010 and probably will never be completed.
Chapter 3, however, which covers continuous optimization methods, is complete and
nicely written.</p>The Fair Price to Pay a Spy: An Introduction to the Value of Information2016-01-09T22:30:00+01:00Sebastian Nowozintag:www.nowozin.net,2016-01-09:sebastian/blog/the-fair-price-to-pay-a-spy-an-introduction-to-the-value-of-information.html<p><img alt="Spy image" src="http://www.nowozin.net/sebastian/blog/images/voi-spy-illustration.png" /></p>
<p>(This article covers the decision-theoretic concept of <em>value of information</em>
through a classic example.)</p>
<p>What is the value of a piece of information?</p>
<p>It depends.
Two factors determine the value of information:
first, whether the information is new to you;
second, whether the information causes you to change your decisions.</p>
<p>The first point is immediately clear, as you would be unwilling to pay
for information which you already know.
Information is understood here in the sense of probabilistic knowledge
represented by a probability distribution. As such, if the information keeps
your beliefs unchanged, it cannot have any value.</p>
<p>The second point is more subtle.
Only decisions and actions can have value, information itself has only
indirect value through the decisions and actions that it influences.
The consequence of a decision is a realized utility, which can be positive
or negative.
As a simple monetary example, imagine you buy a share of a company. Your
utility is then a function of the change in the share price. Information such
as insider information can lead to the belief that the share price will drop,
and thus to the decision to sell the share and realize the utility.
If the information you learn about the company does not change your decision
whether to sell the share or not, then it also cannot change the utility.
Therefore <em>value</em> is understood as a subjective but quantitative utility that
is realized at decision time.</p>
<h2>The Fair Price to Pay a Spy</h2>
<p>The following example is from one of the seminal papers on decision theory
and decision analysis, now in its 50th anniversary year(!),
<a href="http://dx.doi.org/10.1109/TSSC.1966.300074">(Howard, "Information Value Theory",
1966)</a>.
Unfortunately the paper is behind a paywall, but I will keep the presentation
below self-contained, and I have also taken the liberty of updating the exotic
notation used in the paper to a more modern form.</p>
<p>Imagine you run a construction company and the government advertises a
contract to build a large development.
The bidding is a lowest-price sealed bidding: every construction company submits a price at which it would construct the development in a technically acceptable manner.
You do not see any competing bids, and the lowest-price bid wins.</p>
<p>Leaving moral and legal concerns aside, how much would you pay a spy to reveal
to you the lowest competing bid prior to you making your bid?
We will follow <a href="https://profiles.stanford.edu/ronald-howard">Ronald Howard</a> in
answering this question using decision theory, thus putting a monetary value
on a piece of information.</p>
<p>The following are the key quantities in this problem:</p>
<ul>
<li><span class="math">\(E\)</span>, the expense to your company in constructing the development. It is a
random variable.</li>
<li><span class="math">\(L\)</span>, the lowest price among all competing bids. It is a random variable.</li>
<li><span class="math">\(B\)</span>, your bid. It is a decision variable under your control, not a random
variable.</li>
<li><span class="math">\(V\)</span>, the profit you realize, a random variable.</li>
</ul>
<p>The situation is represented using an <a href="https://en.wikipedia.org/wiki/Influence_diagram">influence
diagram</a> in the following
figure. (Incidentally, influence diagrams were also first formally published
by Ronald Howard, in (Howard and Matheson, "Influence diagrams", 1981); a
nice historical piece on them is available from Judea Pearl in <a href="http://ftp.cs.ucla.edu/pub/stat_ser/r326.pdf">(Pearl,
"Influence Diagrams - Historical and Personal Perspectives",
2005)</a>.)</p>
<p><img alt="Influence diagram for the construction company" src="http://www.nowozin.net/sebastian/blog/images/voi-spy.png" /></p>
<p>In the diagram the round nodes represent random variables, just like in
directed graphical models (Bayesian networks).
The rectangular node represents a decision node under our control, here the
bid <span class="math">\(B\)</span> we submit.
The diamond shaped utility node represents a value achieved, in our case the
profit <span class="math">\(V\)</span>.
The diagram alone is not enough; we also need to specify how our profit <span class="math">\(V\)</span> comes
about.</p>
<p>The first step in applying decision theory is to assume that everything is
known. So let us assume <span class="math">\(B\)</span>, <span class="math">\(E\)</span>, <span class="math">\(L\)</span> are known.
Then, it is easy to see whether we actually won the contract, i.e. whether
our bid is small enough, <span class="math">\(B < L\)</span>. If <span class="math">\(B \geq L\)</span>, we do not obtain the
contract and the profit is zero. (We assume here, for simplicity, that the
cost for making the bid is zero.)
If we won the bid, that is, if <span class="math">\(B < L\)</span> is true, then the profit is simply the
bid price minus our expenses, <span class="math">\(B - E\)</span>.
Therefore we have the profit as a function of <span class="math">\(B\)</span>, <span class="math">\(E\)</span>, and <span class="math">\(L\)</span> as
<div class="math">$$V = \left\{\begin{array}{cl}0,&\textrm{if $B \geq L$,}\\
B-E,&\textrm{if $B < L$.}\end{array}\right.$$</div>
The above expression can also be written using indicator notation as
<span class="math">\(V = \mathbb{1}_{\{B < L\}} \cdot (B-E)\)</span>.</p>
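<p>The profit rule above can be stated as a tiny function. This is a minimal sketch in Python; the function and variable names are my own:</p>

```python
def profit(B, E, L):
    """Realized profit V: the bid B wins only if it undercuts the lowest
    competing bid L, in which case the profit is the bid minus the expense E."""
    return B - E if B < L else 0.0
```

<p>For example, bidding 900 with expenses of 480 against a lowest competing bid of 1000 yields a profit of 420, while bidding 1100 loses the contract and yields nothing.</p>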
<p>But <span class="math">\(B\)</span>, <span class="math">\(E\)</span>, <span class="math">\(L\)</span> are not known. The second step in applying decision theory
is therefore to take expectations with respect to everything that is unknown
(<span class="math">\(E\)</span> and <span class="math">\(L\)</span> in our case) and to maximize utility with respect to all
decisions (<span class="math">\(B\)</span> in our case).
We do this in two steps. Let us first assume <span class="math">\(B\)</span> is fixed. Then we take the
expectation of the above expression with respect to the unknown <span class="math">\(E\)</span> and <span class="math">\(L\)</span>,</p>
<p>
<div class="math">$$\mathbb{E}[V | B] = \mathbb{E}_{E,L}[\mathbb{1}_{\{B < L\}} \cdot (B-E)].$$</div>
</p>
<p>Now we further assume independence of the cost <span class="math">\(E\)</span> and the lowest competing
bid <span class="math">\(L\)</span>, that is <span class="math">\(P(E,L) = P(E) \, P(L)\)</span>, a reasonable assumption.
Here is an example visualization of priors
<span class="math">\(P(E) = \textrm{Gamma}(\textrm{Shape}=80,\textrm{Scale}=6)\)</span> and
<span class="math">\(P(L) = \mathcal{N}(\mu=1100, \sigma=120)\)</span>. </p>
<p><img alt="Priors of cost and lowest bid" src="http://www.nowozin.net/sebastian/blog/images/voi-example.svg" /></p>
<p>Assuming independence we obtain</p>
<p>
<div class="math">\begin{eqnarray}
\mathbb{E}[V | B] & = & \mathbb{E}_{E,L}[\mathbb{1}_{\{B < L\}} \cdot (B-E)]\nonumber\\
& = & P(B < L) (B - \mathbb{E}_E[E]).\label{eqn:VgivenB}
\end{eqnarray}</div>
</p>
<p>The expression (\ref{eqn:VgivenB}) is intuitive: the expected profit is given
by the probability of winning the bidding times the difference between bid and
expected cost.
Here is a visualization for the above priors, with our bid <span class="math">\(B\)</span> on the
horizontal axis.</p>
<p><img alt="Expected profit as a function of our bid" src="http://www.nowozin.net/sebastian/blog/images/voi-expectedprofit.svg" /></p>
<p>You can see three regimes: 1. When <span class="math">\(P(B < L)\)</span> is very large (up to about
<span class="math">\(B=850\)</span>) the expected profit behaves linearly as <span class="math">\(B-\mathbb{E}_E[E]\)</span>, and if
we bid below our actual cost we realize a negative profit (loss).
2. When <span class="math">\(P(B < L)\)</span> is very small (above <span class="math">\(B=1300\)</span>) the expected profit drops to
zero. 3. Between <span class="math">\(B=850\)</span> and <span class="math">\(B=1300\)</span> we see the product expression resulting
in a nonlinear profit as a function of <span class="math">\(B\)</span>.</p>
<p>To finish the second step of applying decision theory we have to maximize
(\ref{eqn:VgivenB}) over our decision <span class="math">\(B\)</span>, yielding</p>
<p>
<div class="math">$$\mathbb{E}[V] = \max_{B} \mathbb{E}[V|B].$$</div>
</p>
<p>This tells us how to bid without the help of a spy:
in the above example figures, we obtain an expected profit
<span class="math">\(\mathbb{E}[V] = 421.8\)</span> for a bid of <span class="math">\(B=966.2\)</span>.</p>
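<p>These numbers are easy to reproduce: under the example priors, <span class="math">\(P(B &lt; L)\)</span> is a normal survival function and <span class="math">\(\mathbb{E}_E[E] = 80 \cdot 6 = 480\)</span> is the mean of the Gamma prior, so a simple grid search over bids recovers the optimum. A minimal sketch, assuming NumPy and SciPy are available; the bid grid is my own choice:</p>

```python
import numpy as np
from scipy.stats import norm

# Example priors from the article: E ~ Gamma(shape=80, scale=6), so
# E[E] = 80 * 6 = 480, and L ~ Normal(mu=1100, sigma=120).
mean_E = 80 * 6
bids = np.arange(480.0, 1600.0, 0.1)            # candidate bids B
win_prob = norm.sf(bids, loc=1100, scale=120)   # P(B < L)
expected_profit = win_prob * (bids - mean_E)    # E[V | B]

best = int(np.argmax(expected_profit))
print(round(bids[best]), round(expected_profit[best], 1))  # 966 421.8
```

<p>Note how flat the maximum is around the optimal bid: mis-bidding by a few units costs almost nothing in expectation.</p>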
<p>Revealing <span class="math">\(L\)</span> gives a large competitive advantage, but how much would we be
willing to pay a spy for this information?
To this end Howard introduces the concept of <em>clairvoyance</em> and
<em>value of information</em>.</p>
<p>In <em>clairvoyance</em> we consider what could happen if a clairvoyant appears and
offers us perfect information about <span class="math">\(L\)</span>.
If we knew <span class="math">\(L\)</span>, we could compute as before</p>
<p>
<div class="math">\begin{eqnarray}
\mathbb{E}[V | B, L] & = & P(B < L) (B - \mathbb{E}_E[E])\nonumber\\
& = & \mathbb{1}_{\{B < L\}} (B - \mathbb{E}_E[E]),\nonumber
\end{eqnarray}</div>
where the probability <span class="math">\(P(B < L)\)</span> is now deterministically one or zero, as <span class="math">\(B\)</span> is
our decision and <span class="math">\(L\)</span> is known.
We again maximize over our decision <span class="math">\(B\)</span>:
<div class="math">\begin{eqnarray}
\mathbb{E}[V | L] & = & \max_B \mathbb{E}[V | B,L]\nonumber\\
& = & \max_B \mathbb{1}_{\{B < L\}} (B - \mathbb{E}_E[E])\nonumber\\
& = & \left\{\begin{array}{cl}L - \mathbb{E}_E[E], & \textrm{if $L >
\mathbb{E}_E[E]$,}\\
0, & \textrm{otherwise (do not bid).}\end{array}\right.\nonumber
\end{eqnarray}</div>
The last step can be seen as follows: our bid <span class="math">\(B\)</span> should be above
our expected expenses <span class="math">\(\mathbb{E}_E[E]\)</span> otherwise we would incur a negative
profit but <span class="math">\(B\)</span> should also be as high as possible just below <span class="math">\(L\)</span>. Hence if
this is impossible (<span class="math">\(L \leq \mathbb{E}_E[E]\)</span>) we do not bid. Otherwise we bid
<span class="math">\(B=L-\epsilon\)</span> and realize the expected profit <span class="math">\(L-\mathbb{E}_E[E]\)</span>.</p>
<p>Ok, so this tells us how to bid when we know <span class="math">\(L\)</span>. But we do not know <span class="math">\(L\)</span> yet.
Instead we would like to put a value on the information about <span class="math">\(L\)</span>.
We do this by integrating out <span class="math">\(L\)</span>,</p>
<p>
<div class="math">$$\mathbb{E}_L[\mathbb{E}[V|L]]$$</div>
</p>
<p>(Howard introduces a special notation for
the above expression, but I am not a fan of it and will omit it here.)</p>
<p>The <em>value of information</em> (value of <span class="math">\(L\)</span>) is now defined as</p>
<p>
<div class="math">$$\textrm{EVPI}(L) = \mathbb{E}_L[\mathbb{E}[V|L]] - \mathbb{E}[V].$$</div>
</p>
<p>This quantity is again intuitive: the value of knowing <span class="math">\(L\)</span> is the expected
difference between the utility achieved with knowledge of <span class="math">\(L\)</span> and the expected
utility achieved without such knowledge.</p>
<p>The abbreviation <span class="math">\(\textrm{EVPI}\)</span> denotes the <a href="https://en.wikipedia.org/wiki/Expected_value_of_perfect_information"><em>expected value of perfect
information</em></a>,
a term that was introduced later and has become standard in decision analysis.</p>
<p>So how much is the knowledge of <span class="math">\(L\)</span> worth in our example?
We compute <div class="math">$$\mathbb{E}_L[\mathbb{E}[V|L]] \approx 620.0$$</div> with Monte Carlo
and we had <span class="math">\(\mathbb{E}[V] = 421.8\)</span> from earlier, hence
<div class="math">$$\textrm{EVPI}(L) \approx 620.0 - 421.8 = 198.2,$$</div>
is the maximum price we should pay our spy for telling us <span class="math">\(L\)</span> exactly.</p>
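<p>This number can be reproduced with a few lines of Monte Carlo. A sketch, assuming NumPy is available; the sample size and seed are my own choices, and the value 421.8 is taken from the earlier maximization:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
mean_E = 80 * 6                                # E[E] = 480 for Gamma(80, 6)
L = rng.normal(1100.0, 120.0, size=1_000_000)  # draws from the prior P(L)

# With clairvoyance we bid just below L whenever L exceeds the expected
# cost, so the expected value with information is E_L[max(L - E[E], 0)].
value_with_info = np.maximum(L - mean_E, 0.0).mean()   # close to 620.0

value_without_info = 421.8     # max_B E[V|B] from the earlier computation
evpi = value_with_info - value_without_info
print(round(evpi))             # close to 198
```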
<h2>The Fair Price to Pay an Expert</h2>
<p>The above was the original scenario described in Howard's paper.
In practice obtaining perfect knowledge is often infeasible.
But the above reasoning extends easily to the general case where we only
obtain partial information.</p>
<p>Here is an example for our setup: consider that we can ask an expert to
provide us with an estimate <span class="math">\(L'\)</span> of what the lowest bid <span class="math">\(L\)</span> could be.</p>
<p><img alt="Expert advice" src="http://www.nowozin.net/sebastian/blog/images/voi-experts-lovelornpoets.jpg" /></p>
<p>By assuming a probability model <span class="math">\(P(L' | L)\)</span> we can relate the true unknown
lowest bid <span class="math">\(L\)</span> to the expert's guess.</p>
<p>The influence diagram looks as follows:</p>
<p><img alt="Influence diagram for the construction company with expert advice" src="http://www.nowozin.net/sebastian/blog/images/voi-expert.png" /></p>
<h3>Recipe for Value of Information Computation</h3>
<p>To understand how the above derivation extends to this case, let us state a
recipe for computing the value of information:</p>
<ol>
<li>State the expected utility, conditioned on <em>decisions</em> and the <em>information
to be valued</em>.</li>
<li>Maximize the expression of step 1 over all <em>decisions</em>.</li>
<li>Marginalize the expression of step 2 over the <em>information to be valued</em>,
using your prior beliefs. The resulting expression is the expected utility
with information.</li>
<li>Start over: state the expected utility, conditioned only on <em>decisions</em>.</li>
<li>Maximize the expression of step 4 over all <em>decisions</em>. The resulting
expression is the expected utility without information.</li>
<li>Compute the value of information as the difference between the two expected
utilities (step 3 minus step 5).</li>
</ol>
<p>This recipe works for any single-step decision problem; any remaining
difficulties are computational.</p>
<h3>Application of the Recipe to our Example</h3>
<p>Here is its application to our generalized example:</p>
<ol>
<li>This is <span class="math">\(\mathbb{E}[V | L', B]\)</span>, obtained by marginalizing over <span class="math">\(E\)</span>
and <span class="math">\(L\)</span> in <span class="math">\(\mathbb{E}[V | L', B, E, L]\)</span>, where the marginal of <span class="math">\(L\)</span> is the posterior <span class="math">\(P(L|L')\)</span>
obtained by Bayes' rule.</li>
<li>Maximize over <span class="math">\(B\)</span>, obtaining <span class="math">\(\max_B \mathbb{E}[V | L', B]\)</span>.</li>
<li>Take the expectation over <span class="math">\(L'\)</span>, which is defined via <span class="math">\(P(L') = \mathbb{E}_{L}[P(L'|L)]\)</span>, yielding
<div class="math">\begin{equation}
\mathbb{E}_{L'}[\max_B \mathbb{E}[V | L', B]]\label{eqn:Ltick-withinfo}
\end{equation}</div>
</li>
<li>This is <span class="math">\(\mathbb{E}[V | B]\)</span> which is obtained by marginalizing over <span class="math">\(E\)</span> and
<span class="math">\(L\)</span>, here the marginal of <span class="math">\(L\)</span> is the prior <span class="math">\(P(L)\)</span>.</li>
<li>Maximize over <span class="math">\(B\)</span>, obtaining
<div class="math">\begin{equation}
\max_B \mathbb{E}[V | B].\label{eqn:Ltick-withoutinfo}
\end{equation}</div>
</li>
<li>The value of information is the difference between (\ref{eqn:Ltick-withinfo}) and (\ref{eqn:Ltick-withoutinfo}),
<div class="math">$$
\textrm{EVPI}(L') = \mathbb{E}_{L'}[\max_B \mathbb{E}[V | L', B]] - \max_B \mathbb{E}[V | B].$$</div>
</li>
</ol>
<p>To make the above example concrete, let us assume that our expert is unbiased
and we have
<div class="math">$$P(L'|L) = \mathcal{N}(L, \sigma),$$</div>
where <span class="math">\(\sigma > 0\)</span> is the standard deviation.
Computing <span class="math">\(\textrm{EVPI}(L')\)</span> as a function of <span class="math">\(\sigma\)</span> is possible by solving
the maximization and integration problems.</p>
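<p>With the normal prior on <span class="math">\(L\)</span> and the normal expert model, both the posterior <span class="math">\(P(L | L')\)</span> (by conjugacy) and the marginal of <span class="math">\(L'\)</span> are available in closed form, so the recipe can be carried out with Monte Carlo over <span class="math">\(L'\)</span> and a grid search over bids. A sketch, assuming NumPy and SciPy; the function name, sample size, and bid grid are my own choices, and the baseline 421.8 comes from the earlier computation:</p>

```python
import numpy as np
from scipy.stats import norm

def expert_value(sigma, n=2000, seed=0):
    """Value of a noisy expert estimate L', with prior L ~ N(1100, 120)
    and expert model L' | L ~ N(L, sigma), following the recipe above."""
    mu0, tau0, mean_E = 1100.0, 120.0, 480.0
    rng = np.random.default_rng(seed)
    # Step 3 outer expectation: the marginal of L' is N(mu0, sqrt(tau0^2 + sigma^2)).
    Lp = rng.normal(mu0, np.hypot(tau0, sigma), size=n)
    # The conjugate posterior P(L | L') is normal:
    w = tau0**2 / (tau0**2 + sigma**2)
    post_mean = mu0 + w * (Lp - mu0)
    post_std = tau0 * np.sqrt(1.0 - w)
    # Steps 1-2: maximize E[V | B, L'] = P(L > B | L') (B - E[E]) over a bid grid.
    B = np.arange(480.0, 1700.0, 1.0)
    ev = norm.sf((B[None, :] - post_mean[:, None]) / post_std) * (B - mean_E)
    value_with_info = ev.max(axis=1).mean()
    # Step 6: subtract the no-information value max_B E[V | B].
    return value_with_info - 421.8

# An almost exact expert is worth nearly the EVPI of 198.2,
# while a very noisy expert is worth much less.
print(expert_value(sigma=1.0), expert_value(sigma=240.0))
```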
<p>Using the same parameters as before and using Monte Carlo for the integration,
here is a visualization of the fair price to pay our expert.</p>
<p><img alt="Price for expert advice as a function of expert reliability" src="http://www.nowozin.net/sebastian/blog/images/voi-expert-confidence.svg" /></p>
<p>We can see that for <span class="math">\(\sigma \to 0\)</span> we recover the previous case of perfect
information as the expert provides increasingly accurate knowledge about <span class="math">\(L\)</span>
when <span class="math">\(\sigma\)</span> decreases.
Conversely, with increasing expert uncertainty the value of his expert advice
decreases.</p>
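<p>To make the Monte Carlo computation concrete, here is a minimal sketch. The
parameters are hypothetical stand-ins for the lottery example (a ticket price of
5 and a payout of 10 with probability one half), not the post's original numbers:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the lottery example: the payout L is 10 with
# probability 0.5 and 0 otherwise, a ticket costs 5, and the decision B
# is whether to buy the ticket.
outcomes = np.array([0.0, 10.0])
prior = np.array([0.5, 0.5])
price = 5.0

def evpi(sigma, n_samples=200_000):
    """Monte Carlo estimate of EVPI(L') for an expert with L' ~ N(L, sigma)."""
    # max_B E[V | B]: decide from the prior alone (passing earns 0).
    baseline = max(0.0, prior @ outcomes - price)
    # Sample L from the prior, then the expert's noisy report L'.
    L = rng.choice(outcomes, size=n_samples, p=prior)
    L_tick = L + sigma * rng.standard_normal(n_samples)
    # Posterior P(L | L') via Bayes' rule under the Gaussian likelihood.
    lik = np.exp(-0.5 * ((L_tick[:, None] - outcomes[None, :]) / sigma) ** 2)
    post = lik * prior
    post /= post.sum(axis=1, keepdims=True)
    # E_{L'}[ max_B E[V | L', B] ], averaged over the sampled reports.
    informed = np.maximum(0.0, post @ outcomes - price).mean()
    return informed - baseline

print(evpi(0.1))   # near-perfect expert: close to the perfect-information value 2.5
print(evpi(50.0))  # very noisy expert: the value of the advice is small
```

<p>Sweeping <span class="math">\(\sigma\)</span> over a grid of values reproduces the
shape of the curve in the plot below: the fair price decays smoothly from the
perfect-information value to zero as the expert becomes unreliable.</p>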
<h2>Computation</h2>
<p>(This was added in April 2016 after the original article was published.)</p>
<p>Computing the EVPI can be challenging because in many cases both the
maximization problem and the expectation are intractable analytically and
sample-based Monte Carlo approximations induce a non-negligible bias.</p>
<p>The recent work of <a href="http://arxiv.org/abs/1604.01120">(Takashi Goda, "Unbiased Monte Carlo estimation for the
expected value of partial perfect information",
arXiv:1604.01120)</a> addresses part of these
computational difficulties by applying a randomly truncated series to
de-bias the ordinary Monte Carlo estimate.
I have not performed any experiments with it, but it seems a potentially
useful method for value of information computation problems.</p>
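<p>To illustrate the idea, here is a sketch of the generic randomized-truncation
(Rhee&ndash;Glynn style) construction that such de-biasing builds on; this is not
Goda's estimator, and the toy problem and all parameters are my own illustrative
choices. The plug-in maximum of sample means is biased upwards, yet a randomly
truncated telescoping sum over increasing sample sizes is unbiased for the limit:</p>

```python
import numpy as np

rng = np.random.default_rng(1)

TRUE_MAX = 0.1  # toy problem: two actions with true means 0.0 and 0.1

def debiased_max(p=0.4):
    """One randomly truncated (Rhee-Glynn style) estimate of max_B E[V | B].

    Y_n is the biased plug-in maximum of sample means computed from 2^n
    common samples; summing the telescoping differences
    (Y_n - Y_{n-1}) / P(N >= n) up to a geometric level N gives an
    estimate that is unbiased for the limit in expectation."""
    means = np.array([0.0, TRUE_MAX])
    N = rng.geometric(p)                    # P(N = n) = p (1-p)^(n-1), n >= 1
    # Running per-action sums; a batch of k unit-variance draws is simulated
    # exactly as a single N(0, sqrt(k)) draw, keeping memory at O(1).
    sums = means + rng.standard_normal(2)   # the first 2^0 = 1 sample
    y_prev = sums.max()                     # Y_0
    est = y_prev
    for n in range(1, N + 1):
        k = 2 ** (n - 1)                    # fresh samples to reach 2^n total
        sums = sums + means * k + np.sqrt(k) * rng.standard_normal(2)
        y_n = (sums / 2 ** n).max()
        est += (y_n - y_prev) / (1 - p) ** (n - 1)  # weight by 1 / P(N >= n)
        y_prev = y_n
    return est

# Averaging many independent draws recovers the true maximum of 0.1,
# whereas the plug-in maximum from a handful of samples overestimates it.
estimates = [debiased_max() for _ in range(40_000)]
print(sum(estimates) / len(estimates))
```

<p>The truncation probability <code>p</code> trades bias removal against
variance; here <code>p &lt; 1/2</code> keeps the variance of the weighted
differences finite.</p>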
<h2>Summary</h2>
<p>From a formal decision theory point of view the <em>value of information</em> does not
occupy a special place. It just measures the difference between two different
expected utilities, given optimal decisions.</p>
<p>But the <em>value of information</em> appears in almost any statistical
decision task.
Here are two more examples.</p>
<p>In <em>active learning</em> we are interested in minimizing the amount of supervision
needed to learn to perform a task and we can obtain supervision (ground truth
class labels, for example) for instances of our choice at a cost.
By applying value of information we can select for supervision the instances
whose revealed label information brings the highest expected increase in
utility.</p>
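<p>As a minimal illustration of this selection rule (with a utility I choose for
exposition, not one from any particular active-learning system): under a 0-1
utility and independent instances, the expected gain from observing one binary
label with predictive probability <span class="math">\(p\)</span> is
<span class="math">\(\min(p, 1-p)\)</span>, so value of information reduces to
uncertainty sampling.</p>

```python
import numpy as np

def expected_utility_gain(p):
    """Value of observing one binary label with predictive probability p,
    under a 0-1 utility (an assumption chosen for illustration).

    Before labeling, the Bayes-optimal guess errs with probability
    min(p, 1 - p); once the label is revealed, the error on that instance
    is 0, so the expected utility increase is exactly min(p, 1 - p)."""
    p = np.asarray(p, dtype=float)
    return np.minimum(p, 1.0 - p)

# Predicted class-1 probabilities for four unlabeled instances.
probs = np.array([0.95, 0.55, 0.10, 0.48])
gains = expected_utility_gain(probs)
best = int(np.argmax(gains))  # the most uncertain instance is selected
```

<p>Richer utilities (for example, expected reduction of test risk) change the
gain formula but not the principle: query the instance whose revealed label
buys the largest expected increase in utility per unit cost.</p>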
<p>In <em>experimental design</em> we have to make choices about which information to
acquire, such as the number of patients to sample in a medical trial, or what
information to collect at different costs in a customer survey.
Value of information provides a way to make these choices, either statically
or, better, adaptively.</p>
<h3>Limitations</h3>
<p>While decision theory is rather uncontroversial, it is a normative theory:
it tells you how to derive decisions that are optimal and coherent
(rational).
There are two main limitations I would like to point out:</p>
<ol>
<li>As a normative theory it cannot claim to be a description of how humans (or
other intelligent agents) make decisions.</li>
<li>It assumes infinite reasoning resources on behalf of the acting agent.</li>
</ol>
<p>Both limitations are related of course in that real intelligent agents may
deviate from normative decision theory precisely because they are limited in
their reasoning abilities.
There are both normative and descriptive theories to address these
limitations.
On the normative side we have for example <a href="http://web.mit.edu/sjgershm/www/GershmanHorvitzTenenbaum15.pdf">computational
rationality</a>,
taking into account the computational costs of reasoning and deriving optimal
decisions within these constraints.
On the descriptive side we have for example <a href="https://en.wikipedia.org/wiki/Prospect_theory">prospect
theory</a>, aiming to describe
human decision making.</p>
<h3>Further Reading</h3>
<p>A great introduction to decision theory, including value of information, is
the very accessible
<a href="http://eu.wiley.com/WileyCDA/WileyTitle/productCd-047149657X.html">(Parmigiani and Inoue, "Decision Theory: Principles and Approaches",
2009)</a>.</p>
<p>Three classic textbooks on decision theoretic topics are
<a href="http://eu.wiley.com/WileyCDA/WileyTitle/productCd-047168029X.html">(DeGroot, "Optimal Statistical Decisions",
1970)</a>,
<a href="http://www.springer.com/us/book/9780387960982">(Berger, "Statistical Decision Theory and Bayesian Analysis",
1985)</a>, and
<a href="http://eu.wiley.com/WileyCDA/WileyTitle/productCd-047138349X.html">(Raiffa and Schlaifer, "Applied Statistical Decision Theory",
1961)</a>
(available as outrageously priced reprint-paperback by Wiley).</p>
<p><em>Acknowledgements</em>. The expert image is CC-BY-2.0 licensed art by
<a href="https://www.flickr.com/photos/lovelornpoets/6214449310/">lovelornpoets</a>.</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = '//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: 'center'," +
" displayIndent: '0em'," +
" showMathMenu: true," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" +
" } " +
"}); ";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>ICCV 2015, Day 42015-12-16T23:50:00+01:00Sebastian Nowozintag:www.nowozin.net,2015-12-16:sebastian/blog/iccv-2015-day-4.html<p>This article summarizes the fourth day of the <a href="http://pamitc.org/iccv15/">ICCV
2015</a> conference, the International Conference on
Computer Vision.
A summary of the <a href="http://www.nowozin.net/sebastian/blog/iccv-2015-day-1.html">first day</a>,
<a href="http://www.nowozin.net/sebastian/blog/iccv-2015-day-2.html">second day</a>, and
<a href="http://www.nowozin.net/sebastian/blog/iccv-2015-day-3.html">third day</a> is also available.</p>
<h2>ICCV 2017 and 2019</h2>
<p>ICCV 2017 will be in Venice, Italy.</p>
<p>For ICCV 2019 there was an open vote between Seoul (Korea) and Shanghai
(China), with Seoul winning.
Both proposals were strong, and because I have lived in Shanghai for two years I
favored that proposal, but I am confident that ICCV 2019 in Seoul will be
wonderful as well.</p>
<h2>Parties</h2>
<p>Computer vision is now fully recognized as having an impact in industry.
All large tech companies have invested heavily over the last three years or so,
and one visible result is the increased number of conference sponsors and
conference parties.</p>
<p>Conferences such as NIPS, CVPR, and ICCV now host invite-only open bar parties
with several hundred attendees; this year at ICCV there were parties by
Microsoft, Intel, Google, and Facebook.</p>
<p>Interestingly, they do not come across as recruiting events: there is perhaps a
minimal announcement, but otherwise people just chat over food and drinks.
It is more a show of strength and of goodwill towards the community,
demonstrating that computer vision is taken seriously and that the companies
are in good shape, much like banks investing in marble floors and shiny glass
facades to gain the trust of their customers.</p>
<h2>Interesting Papers</h2>
<h3>Polarized 3D: High-Quality Depth Sensing with Polarization Cues</h3>
<p>By Achuta Kadambi, Vage Taamazyan, Boxin Shi, and Ramesh Raskar.</p>
<p>Polarization of light is a rarely exploited cue for 3D reconstruction.
This work revisits shape-from-polarization and shows fine detail 3D
reconstruction from polarization information (with non-trivial
post-processing).</p>
<p><a href="http://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Kadambi_Polarized_3D_High-Quality_ICCV_2015_paper.pdf">Paper</a>.</p>ICCV 2015, Day 32015-12-16T01:00:00+01:00Sebastian Nowozintag:www.nowozin.net,2015-12-16:sebastian/blog/iccv-2015-day-3.html<p>This article summarizes the third day of the <a href="http://pamitc.org/iccv15/">ICCV
2015</a> conference, the International Conference on
Computer Vision.
A summary of the <a href="http://www.nowozin.net/sebastian/blog/iccv-2015-day-1.html">first day</a> and
<a href="http://www.nowozin.net/sebastian/blog/iccv-2015-day-2.html">second day</a> is also available.</p>
<h2>Interesting Papers</h2>
<h3>Registering Images to Untextured Geometry Using Average Shading Gradients</h3>
<p>By Tobias Ploetz and Stefan Roth.</p>
<p>This work considers the difficult problem of aligning an untextured 3D surface
to a real image of the same object, challenging because edges appear or
vanish depending on texture and lighting.</p>
<p>The authors propose an alignment procedure that uses efficiently computable
<em>average shading gradient</em> images that capture expected visible edges due to
shadows despite unknown light direction.</p>
<p><a href="http://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Plotz_Registering_Images_to_ICCV_2015_paper.pdf">Paper</a>.</p>
<h3>Robust Nonrigid Registration by Convex Optimization</h3>
<p>By Qifeng Chen and Vladlen Koltun.</p>
<p>The authors consider the problem of aligning two 3D shapes to each other, where
each shape may be corrupted by missing surfaces (non water-tight surfaces) and
undergo severe nonrigid deformations.
Previous work has proposed to minimize a specific geodesic distortion measure
over suitable classes of continuous transformations, however, this yields
difficult non-convex optimization problems.</p>
<p>Because the distortion measure is well motivated, this work proposes to
approximate it while simultaneously convexifying the problem. This is achieved by
representing the transformation nonparametrically through correspondences on
randomly sampled points.
While the original problem was continuous and non-convex, now it is a discrete
energy minimization problem that can be approximately solved using a standard
LP-based relaxation approach, where the authors use TRW-S.</p>
<p>What is surprising is how much the results improve on benchmark data sets; the
error is reduced by a factor of three compared to strong baseline methods.</p>
<p><a href="http://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Chen_Robust_Nonrigid_Registration_ICCV_2015_paper.pdf">Paper</a>.</p>ICCV 2015, Day 22015-12-15T01:20:00+01:00Sebastian Nowozintag:www.nowozin.net,2015-12-15:sebastian/blog/iccv-2015-day-2.html<p>This article summarizes the second day of the <a href="http://pamitc.org/iccv15/">ICCV
2015</a> conference, the International Conference on
Computer Vision.
A summary of the <a href="http://www.nowozin.net/sebastian/blog/iccv-2015-day-1.html">first day</a> is also available.</p>
<h2>Awards</h2>
<p>The following awards were given at ICCV 2015.</p>
<h3>Achievement awards</h3>
<ul>
<li>PAMI Distinguished Researcher Award (1): <strong>Yann LeCun</strong></li>
<li>PAMI Distinguished Researcher Award (2): <strong>David Lowe</strong></li>
<li>PAMI Everingham Prize Winner (1): <strong>Andrea Vedaldi</strong> for <a href="http://www.vlfeat.org/">VLFeat</a></li>
<li>PAMI Everingham Prize Winner (2): <strong>Daniel Scharstein</strong> and <strong>Rick Szeliski</strong> for the <a href="http://vision.middlebury.edu/stereo/data/">Middlebury Datasets</a></li>
</ul>
<h3>Paper awards</h3>
<ul>
<li>PAMI Helmholtz Prize (1): <strong>David Martin</strong>, <strong>Charles Fowlkes</strong>, <strong>Doron Tal</strong>, and <strong>Jitendra Malik</strong> for their ICCV 2001 paper "A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics".</li>
<li>PAMI Helmholtz Prize (2): <strong>Serge Belongie</strong>, <strong>Jitendra Malik</strong>, and <strong>Jan Puzicha</strong>, for their ICCV 2001 paper "Matching Shapes".</li>
<li>Marr Prize: <strong>Peter Kontschieder</strong>, <strong>Madalina Fiterau</strong>, <strong>Antonio Criminisi</strong>, and <strong>Samuel Rota Bulo</strong>, for <a href="http://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Kontschieder_Deep_Neural_Decision_ICCV_2015_paper.pdf">"Deep Neural Decision Forests"</a>.</li>
<li>Marr Prize honorable mention: <strong>Saining Xie</strong> and <strong>Zhuowen Tu</strong> for <a href="http://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Xie_Holistically-Nested_Edge_Detection_ICCV_2015_paper.pdf">"Holistically-Nested Edge Detection"</a>.</li>
</ul>
<h2>Interesting Papers</h2>
<p>The above Marr prize winning papers are very nice, but here I also want to
highlight three other papers I found interesting today.</p>
<h3>Fast R-CNN</h3>
<p>By Ross Girshick.</p>
<p>Since 2014 the standard object detection pipeline for natural images has been
the R-CNN system, which first extracts a set of object proposals and then scores
them using a convolutional neural network.
The two key weaknesses of the approach are: first, the separation between
proposal generation and scoring, which prevents joint training of model
parameters; and second, the separate scoring of each hypothesis, which leads to
significant runtime overhead.
This work and the follow-up work ("Faster R-CNN" at NIPS this year) address
both issues by proposing a joint model that is trained end-to-end, including
proposal generation, leading to a new state of the art in object detection.</p>
<p><a href="http://github.com/rbgirshick/fast-rcnn">Code</a>,
<a href="http://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Girshick_Fast_R-CNN_ICCV_2015_paper.pdf">paper</a>.</p>
<h3>Unsupervised Visual Representation Learning by Context Prediction</h3>
<p>By Carl Doersch, Abhinav Gupta, and Alexei A. Efros.</p>
<p>Supervised deep learning needs lots of labeled training data to achieve good performance.
This paper investigates whether we can create and train deep neural networks on
artificial tasks for which we can create large amounts of training data. In
particular, the paper proposes to predict where a certain patch appears within
the image. For this task, an almost infinite amount of training data is easily
created.
Perhaps surprisingly the resulting network, despite being trained on this
artificial task, has learned useful representations for real vision tasks such
as image classification.</p>
<p><a href="http://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Doersch_Unsupervised_Visual_Representation_ICCV_2015_paper.pdf">Paper</a>.</p>
<h3>Deep Fried Convnets</h3>
<p>By Zichao Yang, Marcin Moczulski, Misha Denil, Nando de Freitas, Alex Smola, Le Song,
and Ziyu Wang.</p>
<p>In deep convolutional networks the last few densely connected layers hold the
most parameters and thus account for most of the required memory at test time
and during training.
This work proposes to leverage the <em>fastfood</em> kernel approximation to replace
densely connected layers with specific efficient, low-parameter operations.</p>
<p>The empirical results are impressive and the fastfood justification is
plausible, but I wonder if this work may even provide a hint at a more general
approach to construct efficient neural network architectures by using arbitrary
dense but efficient matrix operations (FFT, DCT, Walsh-Hadamard, etcetera).</p>
<p><a href="http://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Yang_Deep_Fried_Convnets_ICCV_2015_paper.pdf">Paper</a>.</p>ICCV 2015, Day 12015-12-14T23:00:00+01:00Sebastian Nowozintag:www.nowozin.net,2015-12-14:sebastian/blog/iccv-2015-day-1.html<p><a href="http://pamitc.org/iccv15/">ICCV 2015</a>, the International Conference on
Computer Vision, is one of the premier venues for computer vision research,
together with the CVPR conference.
This ICCV is happening in Santiago, Chile, a beautiful city with amazing food.</p>
<p>The computer vision community is growing, and this ICCV is the largest so far
(1460 attendees, 525 papers). For a few years now computer vision has been
broadly relevant to industry, and no less than 22 companies are sponsoring
the conference.
The acceptance rate this year was 30.92%, with the acceptance rate for oral
presentations at 3.30%.
All papers of the conference are <a href="http://www.cv-foundation.org/openaccess/ICCV2015.py">available as open-access PDF
here</a>.</p>
<p>There was a lot of interesting work presented on the first day, but here is my
subjective selection of interesting work.</p>
<h3>Aligning Books and Movies</h3>
<p>By Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel
Urtasun, Antonio Torralba, and Sanja Fidler.</p>
<p>Movies and the books they are based on form a rich paired data source.
In this work the authors propose a recurrent neural network model to align
these two sources semantically.
The challenge is that movies and books are often substantially different, but
apparently modern recurrent neural networks have enough semantic discrimination
ability to enable such alignment.</p>
<p><a href="http://www.cs.toronto.edu/~mbweb/">Project page</a>,
<a href="http://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Zhu_Aligning_Books_and_ICCV_2015_paper.pdf">paper</a>.</p>
<h3>Convolutional Color Constancy</h3>
<p>By Jonathan Barron.</p>
<p>Color constancy deals with the correction of colors in digital images. While
there have been a large number of works in this area, the issue remains
challenging and important.</p>
<p>In this work the author convincingly demonstrates that common changes in colors
correspond to simple translation of a color histogram in a transformed 2D
histogram space. Then, the problem of correcting for these translations can be
posed as simply recognizing the true center position of the observed color
histogram and undoing the translation.</p>
<p><a href="http://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Barron_Convolutional_Color_Constancy_ICCV_2015_paper.pdf">Paper</a>.</p>
<h3>Self-Calibration of Optical Lenses</h3>
<p>By Michael Hirsch and Bernhard Schoelkopf.</p>
<p>Both cheap and expensive camera lenses suffer from many optical effects,
leading to deterioration in image quality.
This work proposes an automatic way to obtain non-parametric kernel estimates
of the point spread functions characterising a lens.
The resulting model can then be used to deblur images.
In effect, this allows better image quality even when using cheap lenses.</p>
<p><a href="http://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Hirsch_Self-Calibration_of_Optical_ICCV_2015_paper.pdf">Paper</a>.</p>
<p>The <a href="http://www.nowozin.net/sebastian/blog/iccv-2015-day-2.html">second day</a> is available now.</p>Ten Tips for Writing CS Papers, Part 22015-12-10T22:30:00+01:00Sebastian Nowozintag:www.nowozin.net,2015-12-10:sebastian/blog/ten-tips-for-writing-cs-papers-part-2.html<p>This continues the <a href="http://www.nowozin.net/sebastian/blog/ten-tips-for-writing-cs-papers-part-1.html">first part</a> on tips to write
computer science papers.</p>
<h2>6. Ideal Structure of a Paragraph</h2>
<p>A paper has different levels of formal structure:
sections, subsections, paragraphs, sentences.
It is important to ensure that the structure of the content aligns well with
the formal structure because the formal structure is readily perceived by the
reader, whereas the structure of the content is not.
With a good alignment we make it easy for the reader to have the right mental
model for the organization of the content; this enables a better navigation
and memory of the content.</p>
<p>An important consequence of good organization is that it minimizes
surprise for the reader.
In general you may want to surprise readers with how amazing your method or
achievements are, but not with the organization of the paper.</p>
<p>How to align the content with the formal structure?
There is more to say about this and I recommend the references at the end of
this article, but here I want to focus on the structure of one or multiple <em>paragraphs</em>.
The basic rules are:</p>
<ol>
<li>One paragraph should contain only a single idea or a single point of
argumentation.</li>
<li>The <em>beginning</em> and the <em>end</em> of a paragraph glue the paragraph into the
surrounding content.</li>
</ol>
<p>There is an ambiguity as to what constitutes a separate idea and indeed
paragraphs may be of quite different lengths.</p>
<p>To achieve a good structure, here is a recipe that works for me.
For a section I would like to write I make a list of
bullet points of things I want to say, with one bullet point being a single
idea or important point. Each point may have one or more dependencies on
other points and I use the dependencies to order the list.
Finally, I write one paragraph for each item on the list and I may add an
additional paragraph at the beginning and end of the section to connect the
section to the surrounding content.</p>
<p>I found that this recipe also makes my job as a writer easier because it
overcomes my writing inhibition in two ways.
First, I can start by simply making a list and this does not feel like
writing.
Second, once the ordering of ideas is clear, the actual writing becomes a lot
simpler.</p>
<p>Here is an example of a less-than-ideal paragraph from Section 2.3 in <a href="http://www.nowozin.net/sebastian/papers/gehler2008ikl-tr.pdf">(Gehler
and Nowozin,
2008)</a>.</p>
<blockquote>
<p>"As already mentioned to our knowledge (Argyriou et al., 2006) were the
first to note the possibility of an infinite set of base kernels and they
also stated the subproblem (Problem 1). We will defer the discussion of the
subproblem to the next section and shortly comment on the differences of the
Algorithm of (Argyriou et al., 2006) and the IKL Algorithm. We denote with
<span class="math">\(g\)</span> the objective value of a standard SVM classifier with loss function
<span class="math">\(L\)</span>."</p>
</blockquote>
<p>Let us reverse engineer the content of this paragraph, then restructure it.
The paragraph makes two points:
first, a connection to the work of (Argyriou et al., 2006).
Second, it establishes some notation.
So it should perhaps be split into two paragraphs.</p>
<p>For the first point, the beginning is also less than ideal: "as already
mentioned to our knowledge"; it is a bit redundant and apologetic to point out
that we already mentioned it and that we may not know better.
The second point, the notation, is okay by itself, but it is unclear why it
follows the first: is it done in order to enable the comparison between
approaches? We would need to read ahead to find out. (This is indeed the
case.)
Here is a proposed improvement:</p>
<blockquote>
<p>(Argyriou et al., 2006) first recognized the possibility of an infinite set
of base kernels and we now discuss the connection to our work.</p>
<p>To make the connection explicit we first establish the notation we will use
throughout the paper. We use <span class="math">\(g\)</span> to denote the objective value of a
standard SVM classifier, where <span class="math">\(L\)</span> is the loss function.</p>
</blockquote>
<p>It is simpler to read and makes it clear why we introduce the notation.
Also note the end and beginning of the two short paragraphs: the end of the
first paragraph tells you what comes next ("the connection to our work"), the
beginning of the second paragraph tells you how this is done (through
notation).
The flow between the two paragraphs is natural now and they could almost be
merged into one again with the single point of the resulting paragraph being
"the connection between (Argyriou et al.) and our work".</p>
<h2>7. Avoid Ambiguous Relative Pronouns (This, These, That, Which)</h2>
<p>When used properly, a relative pronoun, such as "this", "these", "that",
"which", can effectively refer to a previously mentioned noun,
and <em>that</em> has to be remembered by the reader.</p>
<p>In the previous sentence, which entity did "<em>that</em>" refer to?
Is it "a previously mentioned noun"? Or is it "a relative pronoun"?
Or is it the proper use?</p>
<p>Ambiguities of relative pronouns are common because the writer does not
experience the ambiguity. After all, it is clear to the writer what he refers
to.
Train yourself to recognize any potentially ambiguous relative pronoun,
ideally by using a highlighter to mark them in a printout.</p>
<p>To resolve the ambiguity the easiest solution is simply to add the noun it
refers to. For the above example, "that" would become "that noun".</p>
<p>(Another issue I ran into frequently is using "which" in cases
where "that" should have been used, such as in "We use an algorithm which is
efficient." I remember annoying a former American colleague of mine by using
"which" a bit too often. <a href="http://www.dailywritingtips.com/when-to-use-that-which-and-who/">Some advice is
available.</a>)</p>
<p>Here is a real example from an <a href="http://www.nowozin.net/sebastian/papers/nowozin2008freqgeo.pdf">ICDM 2008
paper</a> of
mine. I highlight all relative pronouns.</p>
<blockquote>
<p>Extracting such geometric patterns from molecular 3D
structures is one of the central topic in computational biology, and
numerous approaches have been proposed. Most of them are optimization
methods, <em>which</em> detect one pattern at a time by minimizing a loss function
(e.g., [14, 15, 6]).
<em>They</em> are different from our approach enumerating all patterns satisfying a
certain geometric criterion. In particular, <em>they</em> do not have a minimum
support constraint. Instead <em>they</em> try to find a motif that matches all
graphs.</p>
</blockquote>
<p>This is not the worst example but can be improved nevertheless. The first
"which" is best removed, the other relative pronouns are best clarified. Here
is a proposed improvement:</p>
<blockquote>
<p>Extracting such geometric patterns from molecular 3D
structures is one of the central topic in computational biology, and
numerous approaches have been proposed. Most of them are optimization
methods, <em>detecting</em> one pattern at a time by minimizing a loss function
(e.g., [14, 15, 6]).
<em>These optimization methods</em> are different from our approach enumerating all
patterns satisfying a certain geometric criterion. In particular, <em>other
methods</em> do not have a minimum support constraint <em>and instead</em> try to find a
motif that matches all graphs.</p>
</blockquote>
<h2>8. Provide Continuation Markers</h2>
<p><em>Continuation markers</em> are sentences or paragraphs, typically at the beginning
of sections, to tell the reader what will be presented next and to tell the
reader how it is relevant or how it relates to what has been presented
already. It provides structure and flow, connecting the different parts of
the paper.</p>
<p>Here is an example, from an <a href="http://www.nowozin.net/sebastian/papers/stuehmer2015toftracking.pdf">ICCV 2015
paper</a>:</p>
<blockquote>
<p>"3. Method</p>
<p>We now describe our model for tracking fast moving objects. While the
motion model is standard, the observation model for raw ToF captures is a
novel contribution."</p>
</blockquote>
<p>Note two elements here: first, there is an explicit statement of what will be
presented next (the model for tracking fast moving objects). Second, we
establish relevance with respect to the contribution.</p>
<p>There are two reasons why providing natural continuation markers
is important.
First, they enable navigation through the paper by allowing the reader to skip
sections more efficiently.
Second, without the necessary background it may take a reader multiple
readings to fully understand the paper. If you lose the reader,
providing a natural re-entry point makes it easier to continue reading the
paper despite a lack of understanding of some parts.</p>
<p>Both reasons are especially important for reviewers, a special type of reader.
Ideally the reviewer is an expert in the field already, so we would like to
make it easy for him to quickly navigate to relevant parts of the paper.
Less ideally, the reviewer is working under time pressure or without keen
interest in the work; in this case we would like to minimize misunderstanding
or missing important points during reading.</p>
<p>It is important to co-locate the continuation markers with the actual text
itself. It is not sufficient to provide a mini table-of-contents as part of
the introduction ("In Section 2 we present related work. In Section 3 we
present our method. etc.").</p>
<h2>9. Multiple Authors</h2>
<p>It is a reality that most computer science papers are authored by multiple
authors.
Coordinating the writing between multiple authors can be challenging on both
the level of content and in terms of technology.</p>
<p><em>In terms of content</em>, in my experience a recipe for disaster is to divide the
paper into parts and agree that "Author A will write the introduction, author
B will write the method, etcetera". The resulting draft will be incoherent
and everyone has an excuse for delaying their part due to perceived
dependencies ("I will write the method once the notation is defined in the
introduction", "I will write the introduction when we have results").</p>
<p>Also, when dividing up work this way the draft can be poorly balanced in terms
of relevant parts, as sub-authors tend to be assigned to the parts they have
contributed to the most, which provides an incentive to describe their own
contribution in too much detail (for example senior authors writing the
introduction will fill it discussing their past research agenda that led to
this work; the author writing about the implementation will want to go into
detail because it was really difficult to get it to work and people may miss
just how difficult it was, etcetera).</p>
<p>It is better to assign responsibility to a single author to write a full
draft, then iterate together over this draft.
There are two reasons why it is better: <em>first</em>, clear responsibility gets
stuff done; <em>second</em>, the draft will be more coherent with a more linear
flow of arguments.</p>
<p>The single author draft works best if the draft writer is an experienced
author because iterating on a poorly organized draft may take more effort than
a complete rewrite.
When iterating on a draft it is important to distinguish substantial from
minor changes.
<em>Minor changes</em> are changes that fix issues locally, such as adding a
sentence for clarification, changes of word order, typos, etc. These changes
are important but not urgent. Most accomplished authors I know prefer to
make these changes in passes through the full paper, much like <em>polishing</em>
the paper with each reading.</p>
<p><em>Substantial changes</em> are things like addition or removal of sections,
changing the order of the presentation, enlarging or shrinking the claimed
contribution, etcetera. Such changes can have large implications on the other
parts of the paper which need to be addressed and therefore such changes are
important and urgent because they require less time if made early.</p>
<p><em>In terms of technology</em>, I have frequently experienced problems due to the
diversity of authors and their working styles. Often some authors will be
senior authors with a proven but dated work setup, for example, not using
basic version control systems and being stuck in an inflexible editor that
mangles LaTeX every time it opens a file.
To be fair, these authors are often the most essential in terms of providing
feedback on the content of the paper and they may have little time available to
stay up to date with the latest tools. To address this problem with
technology, my recommendations are the following:</p>
<ol>
<li>Use a version control system: this should almost go without saying and even
if you are the sole author of a paper it is best to use a version control
system because it provides a simple method to back your work up.
But for multiple authors coordinating the writing of a paper without a version
control system is simply a waste of time and nerves of everyone involved.</li>
<li>Use a <em>friendly</em> version control system that provides a simple web
interface; <a href="http://bitbucket.org/">Bitbucket</a> is my favorite for paper
writing because it offers free private git repositories and allows you to
view changes in a neat timeline in the browser. While hardly surprising to
any git user, this feature is readily appreciated by everyone.
Also, for minor changes Bitbucket actually allows editing from within the
browser.</li>
<li>For yourself: when writing LaTeX write one sentence in a line and use
a line break after each sentence. This makes merging conflicts easier and
leads to fewer surprises with strange editors breaking long lines.
(I also found that this helps me to improve the organization of a paragraph
because every sentence now starts at the beginning of a line.)</li>
<li>When you need only high level feedback from your coauthors, sending them a
PDF for annotation via email may still be the most efficient way.</li>
</ol>
<h2>10. Authorship and Author Ordering</h2>
<p>Except for the writing itself, another common problem with multiple authors is
discussions about authorship and author ordering. While not related to
writing papers per se, I do want to share some remarks on this topic.
There are only a few common situations where debates about author ordering
arise. Here are a few examples, with the more frequent cases first:</p>
<ol>
<li>A small contributor or someone involved in early discussions wants to be a
co-author, but other authors disagree based on the amount of time they
contributed.</li>
<li>There is a PhD student, a post-doc, and a faculty author and in most
computer science venues the recognition is strongest for the first and last
author position. The post-doc feels he guided the student the most and so
deserves to be recognized, but the faculty member may feel differently based
on seniority or being the source of funding.</li>
<li>Two or more students contributed to a piece of work and see their
contribution as the strongest; this happens sometimes when a student
postpones a line of work and another student is continuing with the work,
directed by a joint supervisor.</li>
<li>Two or more senior authors feel that they started or guided the project the
most.</li>
</ol>
<p>Obviously there is no "right way" to handle all circumstances, and indeed
computer science handles authorship differently from, say, mathematics.
Of course everyone agrees that scientific authorship should imply substantial
contributions to the work, but that is about as ambiguous a statement as can
be made.
To be more concrete, here are some observations.</p>
<p><em>First</em>, some conflicts can be anticipated, for example the case of two
students. Here, it is best to discuss a possible publication and authorship
as soon as the second student gets involved. This discussion should be
summarized via email for future reference. Likewise for the case of the small
contributor, as soon as it is clear the work will end up in a publication a
discussion should help to set expectations, for example to offer authorship
only if additional work is invested.</p>
<p><em>Second</em>, as a young PhD student one naturally underestimates the implicit
future benefits that arise from co-authorship. For example the senior
co-authors may present the work at venues otherwise inaccessible, or the work
will lead to substantial future collaborations with the original co-authors.</p>
<p><em>Third</em>, when considering whether to include a small contributor as co-author,
the problem is most often not the co-authorship itself, but possible future
actions by the contributor after the paper is published (for example, giving
seminar talks about the paper). The other authors may then feel that the
credit and opportunities are taken away from them.
By discussing early not just the co-authorship itself but also which
future paper-related actions will be carried out by whom, these problems can be avoided.
For example, all authors may agree that seminar and job talks about the work
should only be presented by the lead author.</p>
<h1>Recommended Reading</h1>
<p>I have bought many books on writing, especially when I started my PhD.
But there is one that stands above all others, and if you are writing papers I
can recommend it to you, no matter whether you are just starting out or have
been writing for decades.</p>
<p>This book is <a href="http://www.scientific-writing.com/cos/o.x?c=/wbn/pagetree&func=view&rid=10476">"Scientific Writing: A Reader and Writer's
Guide"</a>
by <a href="http://www.scientific-writing.com/">Jean-Luc Lebrun</a>.</p>
<p><em>Acknowledgements</em>. Thanks to Jonathan Strahl for corrections to the article.
<a href="http://www.cs.cmu.edu/~andrewgw/">Andrew Wilson</a> was kind enough to point out
George Orwell's essay <a href="http://georgeorwellnovels.com/essays/politics-and-the-english-language/"><em>Politics and the English
Language</em></a>
as timeless advice on writing in English.</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = '//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: 'center'," +
" displayIndent: '0em'," +
" showMathMenu: true," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" +
" } " +
"}); ";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Ten Tips for Writing CS Papers, Part 12015-11-29T21:00:00+01:00Sebastian Nowozintag:www.nowozin.net,2015-11-29:sebastian/blog/ten-tips-for-writing-cs-papers-part-1.html<p>As a non-native English speaker I can relate to the challenge of writing
concise and clear English.
Scientific writing is particularly challenging because the audience is only
partially known at the time of writing: at best, the paper will still be read
in 10 or 20 years from the time of writing by people from all over the world.</p>
<p>Learning to write papers well takes a long time and is achieved mostly by
practice, that is, writing and publishing papers.
But to improve your writing at a faster pace you can actively reflect on
certain patterns and writing habits you may have.</p>
<p>Below I compiled a short list of some best practices from my own experience
and preference, with more following in a second part.
This list is by no means exhaustive and has a certain bias towards computer
science publications.
However, I hope it will serve as an inspiration to improve your writing.</p>
<p>I provide some examples of poor writing from published papers.
To avoid offending anyone, I select the examples from my own published papers.</p>
<h2>1. Use Simple Language</h2>
<p>Concepts and ideas in scientific papers can at times be complex but the
writing used to describe them should remain simple.
Simple writing has short sentences, a clear logical structure, and uses
minimal jargon. Writing papers is not poetry but still requires you to pay
attention to the language you use.</p>
<p>Computer science does not seem to have an overly large problem with complex
writing, possibly due to a large number of non-native English speakers. Or
perhaps there is a strong desire to be understood by the writers;
<a href="http://www.theatlantic.com/education/archive/2015/10/complex-academic-writing/412255/">other academic fields are more challenged</a>.</p>
<p>Yet, I have frequently seen non-native English speaking junior authors,
perhaps when writing their first paper, who attempt to copy style from their
native language. At least for native German speakers (like me),
this often leads to comparatively complex writing in terms of sentence
length and less than optimal didactics, such as presenting the abstract
before the concrete.</p>
<p>If still in doubt whether using simple language is a good idea, check this
<a href="http://www.improbable.com/ig/winners/#ig2015">Ig-Nobel-prize-winning</a> work:
<a href="http://www.ucd.ie/artspgs/semantics/ConsequencesErudite.pdf">(Oppenheimer, "Consequences of Erudite Vernacular Utilized Irrespective of Necessity:
Problems with Using Long Words
Needlessly", Applied Cognitive Psychology, 2006)</a>.</p>
<h2>2. State your Contribution</h2>
<p>The key contribution of most published papers falls into exactly one out of
the following three categories.</p>
<ol>
<li><em>Insight</em>: you have an explanation for something that is already there.</li>
<li><em>Performance</em>: you can do something better.</li>
<li><em>Capability</em>: you can do something that could not be done before.</li>
</ol>
<p>If you know which category your paper falls into, emphasize this aspect
early in the paper, ideally in the abstract. This sets the tone and
expectations for the remainder of the paper.</p>
<h2>3. See Everything as a Facet on the Contribution</h2>
<p>Every scientific paper claims a contribution over previous work.
Once you have stated the contribution clearly, the rest of the paper is there
just to support the contribution:
The <em>introduction</em> motivates the need for your contribution.
The <em>related work</em> section differentiates prior work against your claimed
contribution.
The <em>method</em> section typically provides a description of the contribution.
The <em>experiments</em> section verifies that your contribution works as advertised.
Etcetera.</p>
<p>The point is: the contribution anchors everything else in the paper.
If the contribution is clear, every part of the paper should make sense and
become a different facet or view onto the contribution.</p>
<p>There are two common ways in which this simple structure is violated, leading
to a poorly written paper.
The first way is to not clearly state the contribution, leaving it ambiguous
during the whole paper. In such papers some method may be described, some
experiments may be performed, but the higher goal does not emerge. At the end
of the paper, the reader may agree with all statements of the paper and still
wonder what he should make of it.</p>
<p>The second way to violate the structure is less severe: a long description of
another method or work is added to the paper. I have seen this frequently
with junior authors who have just learned about a cool method and want to
showcase their understanding. Such description may even be interesting to a
reader of the paper, but because it is orthogonal to the contribution of the
paper it has negative value and is best removed.</p>
<h2>4. Consider Using a Page-1 Figure</h2>
<p>Consider using an explanatory figure on page one of the paper.
This was started in the SIGGRAPH community with the <a href="http://ivizlab.sfu.ca/arya/Papers/ACM/SIGGRAPH-96/Storytelling%20in%20VR.pdf">work of Randy
Pausch</a>,
but has slowly spread to other communities.</p>
<p>The main purpose of a page one figure is to provide a gist of the paper, much
like a "visual abstract".
It highlights what is important and sets the right expectations.
It is also visually engaging and whets the appetite of the reader.</p>
<p>What makes a good page one figure?
1. <em>Simplicity</em>: You need to be able to understand it in 20 seconds or less.
2. <em>Being self-contained</em>: All relevant information should be in the figure or
the figure caption. The figure caption should be short.</p>
<p>Many papers benefit from the addition of a page one figure, but there are some
exceptions, for example in theory papers it could appear out of place.</p>
<h2>5. Avoid the Passive Voice</h2>
<p>You can write clear English in both the active and passive voice.
A historical note on this is available in <a href="http://www.biomedicaleditor.com/active-voice.html">this essay on active vs passive
voice in scientific
writing</a>:</p>
<blockquote>
<p>"More than a century ago, scientists typically wrote in an active style that
included the first-person pronouns I and we. Beginning in about the 1920s,
however, these pronouns became less common as scientists adopted a passive
writing style.</p>
<p>Considered to be objective, impersonal, and well suited to science writing,
the passive voice became the standard style for medical and scientific
journal publications for decades.</p>
<p>...</p>
<p>Nowadays, most medical and scientific style manuals support the active over
the passive voice."</p>
</blockquote>
<p>The reason for this change is simple: most people find text written in the
active voice easier to read and more engaging.
Duke university published a guide on scientific writing that contains a <a href="https://cgi.duke.edu/web/sciwriting/index.php?action=passive_voice">long
discussion on the active versus passive
voice</a>.</p>
<p>In my writing there are very few exceptions where a passive voice may be more
appropriate, for example when discussing prior work ("The relationship between
iron intake and lifespan of parrots was studied by Miller and Smith.")
or when discussing experimental results ("The test error remained small even
when the regularization strength was decreased."), but even for these two
examples we can find an alternative active formulation ("Miller and Smith
studied the relationship between iron intake and lifespan of parrots.") and
("Even when we decreased the regularization strength the test error remained
small.").
The use of the passive voice in these two exceptions conveys an impersonal
attitude that may be justified when discussing the work of others or reporting
(as opposed to interpreting) experimental results.</p>
<p>Here is a real example from a <a href="http://www.nowozin.net/sebastian/papers/nowozin2007actionclassification.pdf">ICCV 2007 paper</a>
of mine (page 4):</p>
<blockquote>
<p>The dual problem has a limited number of variables,
but a huge number of constraints. Such a linear program
<em>can be solved</em> efficiently by the constraint generation
technique: Starting with an empty hypothesis set, the hypothesis whose
constraint (6) <em>is violated</em> the most is identified and <em>added</em> iteratively.
Each time a hypothesis <em>is added</em>, the optimal solution <em>is updated</em>
by solving the restricted dual problem.</p>
</blockquote>
<p>I highlight all the passive formulations.
Here is a rewrite of the paragraph using only the active voice:</p>
<blockquote>
<p>The dual problem has a limited number of variables,
but a huge number of constraints.
<em>We can solve</em> such a linear program efficiently by the constraint generation
technique: Starting with an empty hypothesis set,
<em>we identify the hypothesis with the largest constraint violation in (6)</em>
and <em>add the hypothesis to the hypothesis set</em>.
Each time <em>we add</em> a hypothesis, <em>we also update</em> the optimal solution by
solving the restricted dual problem.</p>
</blockquote>
<p>I made a few minor changes such as changing the word order and adding the noun
("to the hypothesis set") for added clarity. I hope you agree that the second
version is easier to read.</p>
<p>The <a href="http://www.nowozin.net/sebastian/blog/ten-tips-for-writing-cs-papers-part-2.html">next part</a> is available now.</p>History of Monte Carlo Methods - Part 32015-11-13T21:30:00+01:00Sebastian Nowozintag:www.nowozin.net,2015-11-13:sebastian/blog/history-of-monte-carlo-methods-part-3.html<p>This is the third part of a three part post.
The <a href="http://www.nowozin.net/sebastian/blog/history-of-monte-carlo-methods-part-1.html">first part</a> covered the early history of Monte
Carlo and the rejection sampling method, the <a href="http://www.nowozin.net/sebastian/blog/history-of-monte-carlo-methods-part-2.html">second
part</a> covered sequential Monte Carlo.</p>
<h1>Part 3</h1>
<p>In this part we are going to look at Markov chain Monte Carlo.</p>
<p>The video file is also available for offline viewing as
<a href="/sebastian/videos/MonteCarlo-Part3.mp4">MP4/H.264</a> (114MB),
<a href="/sebastian/videos/MonteCarlo-Part3.vp8.webm">WebM/VP8</a> (110MB), and
<a href="/sebastian/videos/MonteCarlo-Part3.vp9.webm">WebM/VP9</a> (72MB) file.</p>
<video id="montecarlo-part3" class="video-js vjs-default-skin"
controls preload="auto"
width="639" height="360" data-setup="{}">
<source src="/sebastian/videos/MonteCarlo-Part3.mp4" type='video/mp4'>
<source src="/sebastian/videos/MonteCarlo-Part3.vp8.webm" type='video/webm'>
<p class="vjs-no-js">
To view this video please enable JavaScript, and
consider upgrading to a web browser that
<a href="http://videojs.com/html5-video-support/" target="_blank">supports HTML5 video</a>
</p>
</video>
<script src="http://vjs.zencdn.net/5.0.0/video.js"></script>
<p><iframe
src="https://onedrive.live.com/embed?cid=6B87C0D396848478&resid=6B87C0D396848478%21108438&authkey=AD-c_aaKMW9BSFI&em=2"
width="639" height="360" frameborder="0" scrolling="no"></iframe></p>
<p>(Click on the slide to advance, or use the previous/next buttons.)</p>
<p>(Also note there are three additional video visualizations below.)</p>
<h2>Transcript</h2>
<p>(This is a slightly edited and link-annotated transcript of the audio in the
above video. As this is spoken text, it is not as polished as my writing.)</p>
<p><strong>Speaker</strong>: So this was one family of Monte Carlo methods. I have only a
little time remaining, but enough to talk about a completely different
family of Monte Carlo methods, and you may have heard this abbreviation before.
It is called MCMC.</p>
<p>MCMC stands for Markov chain Monte Carlo and it is completely different from
<a href="http://statweb.stanford.edu/~owen/mc/Ch-var-is.pdf">importance sampling</a>.
The basic difference is instead of growing a configuration or weighting
configurations, I always have a certain state and I manipulate that state
iteratively, and if I do this long enough then I have obtained a sample that
is uniformly distributed. I will get into the details in a minute. I first
want to talk briefly about the history.</p>
<p>It was invented by <a href="https://en.wikipedia.org/wiki/Marshall_Rosenbluth">Marshall
Rosenbluth</a> but it is
called the Metropolis algorithm. Why is that? Well, there were five authors on
the paper. The first of which was Nicholas Metropolis. And I roughly sized the
pictures according to contributions to the paper.</p>
<p>There are two different historical accounts of how the method came about and
they agree that <a href="https://en.wikipedia.org/wiki/Edward_Teller">Edward
Teller</a> posed the mathematical
problem, and Marshall Rosenbluth solved it, and Arianna Rosenbluth implemented
it, and the other two authors did not do anything. They ordered the author
names alphabetically and <a href="https://en.wikipedia.org/wiki/Nicholas_Metropolis">Nicholas
Metropolis</a> happened to be
head of the group at that time. In any case, two interesting things about it:
none of these authors used the method in their subsequent research, and yet
the method is now very, very popular. Actually, Marshall Rosenbluth
afterwards founded the field of plasma physics, so he went in a completely
different direction. At the turn of the 21st century, Jack Dongarra and
Francis Sullivan, two researchers in scientific computing, were asked to
compile a <a href="https://www.siam.org/pdf/news/637.pdf">list of the top 10 algorithms of the 20th
century</a> and this was one
of them in their list. And quicksort was another one. So it is really an
important algorithm. First, the intuition, then a little bit formalization and
how it applies to our problem.</p>
<p>So, here is the intuition. What the method does: it constructs a directed
graph where each possible state is a node, and there are simple modifications
you can make indicated by these directed arcs to transform one state into
another state. For example, in our chain case, we could bend the chain at a
certain node or we could maybe change the last state a little bit or
something. Some simple transformations you can perform to transform one state
into another state. Only a few for each state, only a few arcs that leave each
state. And then it performs a random walk on this graph in such a way that if
you perform the random walk long enough and then stop it, you are uniformly
distributed across the whole graph, or according to some target probability
distribution.</p>
<p>The graph is much too large to construct explicitly, right? It has exponentially
many configurations in our case, but we are still able to perform a random
walk on this graph. It is called Markov chain Monte Carlo because the basic
concept is a <a href="https://en.wikipedia.org/wiki/Markov_chain">Markov chain</a> and I
want to quickly introduce that concept to you. So here is a simple graph with
just three states, <em>A</em>, <em>B</em>, and <em>C</em>. And imagine that you are standing at
State <em>B</em> and you follow the very simple rule of sequentially walking along
that graph. You see all the edges that leave your current state. They have
numbers associated to them and these numbers sum up to one. So if you stand on
State <em>B</em>, you have a 40% probability to move to State <em>A</em>, a 50% probability
to move to State <em>C</em>, and a 10% probability to stay in State <em>B</em>. So you just
follow that rule in each time step and you arrive at a new state.</p>
<p>(This is the video I showed, which is not visible in the slides above. The
video file is also available as <a href="/sebastian/videos/MarkovChain1.mp4">MP4/H.264 file</a> (1MB).)</p>
<video id="markovchain1" class="video-js vjs-default-skin"
controls preload="auto"
width="600" height="450" data-setup="{}">
<source src="/sebastian/videos/MarkovChain1.mp4" type='video/mp4'>
<p class="vjs-no-js">
To view this video please enable JavaScript, and
consider upgrading to a web browser that
<a href="http://videojs.com/html5-video-support/" target="_blank">supports HTML5 video</a>
</p>
</video>
<p>And when you do that and you record how often you reach a certain
state, then this is a histogram you would obtain. So it seems to converge to
some values after just a few hundred steps. And in fact, the limit
distribution that you obtain ultimately is given by these numbers here. And
it does not depend on where you start; it is completely independent of the
starting state.</p>
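<p>(As an editorial aside, the random walk described above is easy to simulate. Below is a minimal Python sketch; only the outgoing probabilities of state <em>B</em> (40% to <em>A</em>, 50% to <em>C</em>, 10% stay) are taken from the talk, while the rows for <em>A</em> and <em>C</em> are made-up values for illustration.)</p>

```python
import random

# Transition probabilities of a small 3-state Markov chain. Only the row
# for state "B" (40% to A, 50% to C, 10% stay) comes from the talk; the
# rows for "A" and "C" are hypothetical values chosen for illustration.
P = {
    "A": {"A": 0.3, "B": 0.4, "C": 0.3},
    "B": {"A": 0.4, "B": 0.1, "C": 0.5},
    "C": {"A": 0.5, "B": 0.2, "C": 0.3},
}

def step(state):
    """Take one random-walk step along the outgoing edges of `state`."""
    r, acc = random.random(), 0.0
    for nxt, p in P[state].items():
        acc += p
        if r < acc:
            return nxt
    return nxt  # guard against floating-point rounding

def visit_frequencies(start, nsteps):
    """Record how often the walk visits each state."""
    counts = dict.fromkeys(P, 0)
    state = start
    for _ in range(nsteps):
        state = step(state)
        counts[state] += 1
    return {s: c / nsteps for s, c in counts.items()}

freq = visit_frequencies("B", 100_000)
```

<p>(Running the walk from either starting state yields approximately the same visit frequencies, illustrating that the limit distribution is independent of the start.)</p>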
<p>The Metropolis algorithm solves the inverse problem. The inverse problem is
the following. So here we had the rules for how to walk and we observed the
resulting limit distribution. What if you have a graph structure and you have some
target probabilities that you want to realize? How should you put numbers at
the edges to reach that limiting distribution? That was the problem that was
posed by Edward Teller in essence.</p>
<p>Here we have that target distribution and the Metropolis algorithm is a
constructive way to choose its transition probabilities. It assumes that you
have a base Markov chain, so some basic random walk on the graph. So let us
say, just uniformly go over all outgoing edges on that graph. And we could
follow that random walk, right? It would not be the same limit distribution
that we are interested in but we could follow that random walk and we would
get a different limit distribution. And what the Metropolis algorithm now does
is, whenever the base chain proposes a transition from one state to another, it
modulates that decision and has the option to reject that step. It can only accept or reject steps
proposed by the base Markov chain. The final Markov chain of the Metropolis
algorithm is this chain <span class="math">\(T\)</span>, which is the base Markov chain multiplied by the
acceptance rate. And the acceptance rate is calculated according to that
formula which has a quite simple interpretation.</p>
<p>The acceptance probability is high when the target probability of the proposed
state, <span class="math">\(\pi_j\)</span>, is high but the probability of your
current state is low; vice versa, if the proposed state is unlikely, you are
more likely to stay at the current configuration. And if the base chain pushes
you towards some other state, you divide by that proposal probability to
compensate for the bias of the base chain. So the numerator and the denominator
capture these two effects. And the remaining probability, everything that is
modulated down, is the rejection probability of staying in the current state.</p>
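<p>(As an editorial aside, the accept/reject rule above is short in code. The following Python sketch targets the 0.7, 0.1, 0.2 limit distribution from the example; the symmetric base chain is a made-up choice for which the proposal probabilities cancel in the acceptance rate.)</p>

```python
import random

def metropolis_step(state, pi, propose):
    """One Metropolis step with a symmetric base chain: accept the
    proposed move with probability min(1, pi[j] / pi[i])."""
    j = propose(state)
    if random.random() < min(1.0, pi[j] / pi[state]):
        return j       # accept the proposal
    return state       # reject: stay in the current state

# Target distribution over three states (0.7, 0.1, 0.2 as in the talk).
pi = {"A": 0.7, "B": 0.1, "C": 0.2}
states = list(pi)

def propose(state):
    # Symmetric base chain: pick one of the other two states uniformly,
    # so the proposal probabilities cancel in the acceptance rate.
    return random.choice([s for s in states if s != state])

counts = dict.fromkeys(states, 0)
state = "B"
for _ in range(200_000):
    state = metropolis_step(state, pi, propose)
    counts[state] += 1
```

<p>(The empirical visit frequencies converge to the target distribution regardless of the starting state, as guaranteed by the construction.)</p>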
<p>(This is the video I showed, which is not visible in the slides above. The
video file is also available as <a href="/sebastian/videos/MarkovChain2.mp4">MP4/H.264 file</a> (1MB).)</p>
<video id="markovchain2" class="video-js vjs-default-skin"
controls preload="auto"
width="600" height="450" data-setup="{}">
<source src="/sebastian/videos/MarkovChain2.mp4" type='video/mp4'>
<p class="vjs-no-js">
To view this video please enable JavaScript, and
consider upgrading to a web browser that
<a href="http://videojs.com/html5-video-support/" target="_blank">supports HTML5 video</a>
</p>
</video>
<p>So let us do that simple calculation for our example here. For that limit
distribution, we would obtain these values. And let us take a look at whether
we converge to that limiting distribution. So the limiting distribution was
0.7, 0.1, 0.2; it converges more slowly than in the example before, but
ultimately we are guaranteed to converge to the limit distribution that we set
up. This is not the unique way to construct such a Markov chain, but it is a
constructive way to do so.</p>
<p>In our (self-avoiding random walk) example, we want to walk on this graph. And
what we are going to do is we just allow simple transformations, we pick a
random element on that chain and bend it 90 degrees in a random fashion. So
that gives us the arcs on that graph and you can imagine for a chain of a
certain length, there are only so many ways, which you can enumerate, to bend
the elements by 90 degrees. And if we happen to bend it in such a way that it
is actually no longer self-avoiding, we can remove that move and add it back to
the probability of staying in a certain state. And remember we wanted to
sample uniformly over all the states on the graph. So we just plug that into
the acceptance rate calculation of the Metropolis algorithm and the <span class="math">\(\pi\)</span>,
because it is uniformly distributed, just cancels out of that rate. So it is a
very simple calculation.</p>
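<p>(As an editorial aside, here is a small Python sketch of such a sampler. It uses a pivot-style move, bending the tail of the chain by 90 degrees around a randomly chosen element and rejecting any move that breaks self-avoidance; the exact move set and the chain length are illustrative choices, not necessarily those from the talk.)</p>

```python
import random

def rotate(p, pivot, sign):
    """Rotate lattice point p by +/-90 degrees around pivot."""
    dx, dy = p[0] - pivot[0], p[1] - pivot[1]
    if sign > 0:
        return (pivot[0] - dy, pivot[1] + dx)
    return (pivot[0] + dy, pivot[1] - dx)

def pivot_step(walk):
    """Propose bending the tail of the walk 90 degrees at a random
    element; reject the move if the result is no longer self-avoiding."""
    k = random.randrange(1, len(walk) - 1)
    sign = random.choice([-1, 1])
    pivot = walk[k]
    head = walk[:k + 1]
    tail = [rotate(p, pivot, sign) for p in walk[k + 1:]]
    if set(head) & set(tail):   # self-intersection: reject the move
        return walk
    return head + tail

# Start from a straight chain of length 10 and run the chain.
walk = [(i, 0) for i in range(10)]
for _ in range(2000):
    walk = pivot_step(walk)
```

<p>(Because the target distribution is uniform and the move is its own inverse, the acceptance probability reduces to 1 for every valid proposal, matching the cancellation of <span class="math">\(\pi\)</span> described above.)</p>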
<p>And in practice, it is even simpler to implement that. We have a certain
state. We just initialize it with any state we want. We propose a random
modification and we accept or reject that, and then we have a new state, and
we iterate, we accept or reject that. So we keep doing that and after a
certain amount of time, we just keep the number of samples that we have
generated and we can compute our expectations with that sample that we have
generated. They are no longer independent samples because we have always
modified them a little bit but they are a set of samples that we have
generated. And this is the estimate, again, same curve that we had before,
this time, with the MCMC approach. And I am almost out of time but I want to take
the last three slides here to show you our last method, the final method of
the talk. Any questions on MCMC so far?</p>
<p><strong>Attendee</strong>: You said we accept or reject. We make a bend and we accept
it. How do you decide whether to accept or reject, just by whether it crosses
itself?</p>
<p><strong>Speaker</strong>: If it crosses itself, it is no longer a valid state, and we
immediately reject. If it does not cross itself, you compute the acceptance
probability by that formula and the acceptance probability maybe 0.8, and then
you roll a random number between zero and one, uniformly distributed, and if
that number is below the acceptance rate, then you <em>accept</em>. If it is above the
acceptance rate that you have calculated, you <em>reject</em>. So if the acceptance
rate is one, you always accept. If the acceptance rate is 0.5, you flip a coin
uniformly to accept or reject.</p>
<p><strong>Attendee</strong>: How would I calculate the acceptance rate in the self-avoiding
random walk problem?</p>
<p><strong>Speaker</strong>: That depends on this graph structure here. So every state in the
graph has a certain set of allowed changes, right? For a state which is very
compact, the number of allowed changes becomes smaller. For some longer
chain, you can basically bend it at any element, in any direction, and it would
still be a self-avoiding walk.</p>
<p><strong>Attendee</strong>: We have to check them all to count how many were
self-intersecting?</p>
<p><strong>Speaker</strong>: Yes, in some cases you can avoid it, I mean, not in this case. In
some cases, you always have probability mass everywhere, rather than a
hard constraint like self-avoidance, and then it cancels out as well. But in this
case, we have to enumerate all of them.</p>
<p>(A warning in this edit: I cannot recommend learning about simulated
annealing by browsing the web, as there is lots of misinformation around:
special cases are described as simulated annealing, or the base Markov chain
is not a reversible Markov chain, etc. See the references at the end of this
transcript for good links if you want to learn about the method.)</p>
<p>The final method is called <em>Simulated Annealing</em>.
It is a method to convert a Markov chain into an optimization method. As
simple as that, a Markov chain into an optimization method. It was proposed in
1983 by <a href="http://www.cs.huji.ac.il/~kirk/">Scott Kirkpatrick</a> and co-workers
and it is a very simple and often quite effective optimization method. And
simple to implement. It can optimize over complex state spaces. And for that
reason, it is very popular. So this
<a href="http://www.csb.pitt.edu/BBSI/2006/jc_talk/cheng.pdf">Science paper that they published in
1983</a> has 28,000
citations (now 35,000, October 2015). And interestingly, Scott Kirkpatrick
later in the '90s worked at the IBM T.J. Watson Research Center and there, at
least so he writes, he invented the first pen-based tablet computer. So it
is nice to see that he is innovative on quite different levels.</p>
<p>So how does it work? Say we have a function that we want to optimize. A very
simple function here. There are only 40 possible inputs to that function. So
in that case, we could simply enumerate all the 40 possible states and pick
the one that is maximal, so we want to maximize that function. But imagine you
have a different problem with exponentially many states so we can no longer
list them and this is just for illustration. But imagine instead of 40, you
would have 2 to the power of 40 or something. What we are going to do is we
convert that function into a probability distribution and we do that by what
is called a Gibbs distribution. So just a simple formula, where <span class="math">\(Z\)</span> is a
normalizing constant and the formula depends on the parameter <span class="math">\(T\)</span>, the so called
temperature parameter. If the temperature parameter is very high, you divide
that function value by a very large value and the argument almost does not
matter. So the function value does not matter. In this case, the temperature
is 100 and you see that the resulting distribution is almost uniform. Maybe
hard to see but it is almost uniform because the temperature is quite high
compared to the function value.</p>
<p><strong>Attendee</strong>: What is <span class="math">\(Z\)</span>?</p>
<p><strong>Speaker</strong>: <span class="math">\(Z\)</span> is a normalizing constant. So it is just a sum over all the
possible configurations. But it is a constant and it just depends on <span class="math">\(T\)</span>. And
interestingly, if we apply this Metropolis algorithm, the <span class="math">\(Z\)</span> constant cancels
out, it is not really important. You do not even need to write it down. You
can just write that <span class="math">\(P\)</span> is proportional to the
exponentiated function value, <span class="math">\(P(x) \propto \exp(f(x)/T)\)</span>, and the
constant is just a normalizing constant. If I decrease the temperature, you
see, so this is temperature 10, temperature 4, 1, and now I decrease it even
further, 0.1, the distribution puts more and more mass on the function values
that are higher; and basically, what simulated annealing does is run a
Markov chain, a Metropolis chain for example, but while it runs it, it
modifies the chain by decreasing the temperature.</p>
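<p>The Gibbs distribution is easy to compute for such a small state space. Here is a minimal sketch; the toy objective over 40 states is hypothetical, not the one from the slides:</p>

```python
import numpy as np

def gibbs_distribution(f_values, T):
    """Turn function values into a Gibbs distribution at temperature T:
    p(x) = exp(f(x) / T) / Z, with Z summing over all states."""
    logits = f_values / T
    logits = logits - logits.max()  # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()  # dividing by Z; it cancels in the Metropolis rule anyway

# A toy objective over 40 states (hypothetical, for illustration only).
f = np.sin(np.linspace(0.0, 3.0 * np.pi, 40)) + np.linspace(0.0, 1.0, 40)

for T in [100.0, 10.0, 1.0, 0.1]:
    p = gibbs_distribution(f, T)
    # High T: almost uniform (max close to 1/40); low T: peaked at the maximum.
    print(T, round(p.max(), 4))
```

At temperature 100 the largest probability is close to the uniform value 1/40; at temperature 0.1 most of the mass sits on the states with the highest function values.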
<p>So it tries to shift all the probability mass, and our current state as
well, to the states that have high function value. And how does it do that?
Well, it chooses a temperature schedule: on the x-axis here I have the steps
that I take with the Markov chain and on the y-axis I have the temperature
that I use, and I just decrease the temperature here with a geometric
schedule, so I multiply the temperature by 0.99 or so on each step.</p>
<p>For very high temperatures, the Markov chain is basically just a purely random
walk; it does not even look at the function value. For very low temperatures,
it basically is a local search algorithm; it only accepts improvements in the
function value. But, for intermediate temperature values, it is something in
between so it tries to optimize but it can still escape local minima.</p>
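<p>The whole procedure fits in a few lines. The following is a minimal sketch of simulated annealing with a Metropolis acceptance rule and a geometric schedule; the toy objective and the parameter values are illustrative, not the ones from the slides:</p>

```python
import math
import random

def simulated_annealing(f, n_states, steps=100_000, T0=100.0, cooling=0.99):
    """Maximize f over the states {0, ..., n_states - 1}.

    Random-neighbor proposals, Metropolis acceptance, and a geometric
    temperature schedule T <- cooling * T after every step."""
    x = random.randrange(n_states)
    best = x
    T = T0
    for _ in range(steps):
        y = (x + random.choice([-1, 1])) % n_states  # propose a neighbor
        delta = f(y) - f(x)
        # Metropolis rule: always accept improvements; accept a worse
        # state with probability exp(delta / T).
        if delta >= 0 or random.random() < math.exp(delta / T):
            x = y
        if f(x) > f(best):
            best = x
        T = max(cooling * T, 1e-12)  # geometric cooling, clamped to avoid T = 0
    return best

# A toy unimodal objective over 40 states (hypothetical, for illustration).
print(simulated_annealing(lambda i: -((i - 23) ** 2), 40))
```

At high temperature the chain behaves like a random walk; once the temperature has decayed it only accepts improvements and settles at the maximizer.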
<p>So that is the intuition. There is actually some theory to it, in another
famous paper by <a href="http://www.dam.brown.edu/people/geman/">Stuart</a> and <a href="http://cis.jhu.edu/people/faculty/geman/">Don Geman</a>, a <a href="http://www.csee.wvu.edu/~xinl/library/papers/infor/Geman_Geman.pdf">1984 paper</a> which is actually famous for a
different reason, because they proposed another famous Monte Carlo method, the
<a href="https://en.wikipedia.org/wiki/Gibbs_sampling">Gibbs sampler</a>, in that paper.
But in that very paper they also give some theory of simulated annealing, and
they prove that if you decrease the temperature slowly enough, you obtain the
optimal state with probability one. But that optimal schedule is too slow in
practice, so you cannot use it, and that is why we are still stuck with the
geometric schedule.</p>
<p>In the last minute, let us do simulated annealing, and I go back to the more
complicated model where we actually have these two types of elements: the black
ones that attract each other and the white ones which are neutral. And you see,
whenever two black elements are close to each other on this 2-D grid I plotted
a red line, and I am going to try to optimize the number of red lines, so I
try to get as many red connections as possible.</p>
<p>(This is the video I showed, which is not visible in the slides above. The
video file is also available as <a href="/sebastian/videos/SimulatedAnnealing-HP2D-48.mp4">MP4/H.264 file</a> (5MB).)</p>
<video id="simulatedannealing" class="video-js vjs-default-skin"
controls preload="auto"
width="600" height="450" data-setup="{}">
<source src="/sebastian/videos/SimulatedAnnealing-HP2D-48.mp4" type='video/mp4'>
<p class="vjs-no-js">
To view this video please enable JavaScript, and
consider upgrading to a web browser that
<a href="http://videojs.com/html5-video-support/" target="_blank">supports HTML5 video</a>
</p>
</video>
<p>And that is really a model for protein folding, folding in such a way that
there are many black-to-black noncovalent bonds. So here is an animation of
performing simulated annealing exactly with the proposal that I had, bending
the chain 90 degrees left or right at a random position, and I do 100,000
steps and show you every 100th step. At high temperatures (this is still quite
a high temperature) you see it is very stretched out, there are not very many
compact structures, but as the optimization proceeds the temperature decreases
and it favors more and more compact configurations. So I think already at a
step like now, you see quite some compact structures appearing.</p>
<p>So this is a purely random walk. I think it goes until 1,000. I can skip it in
the interest of time but, I can show you the result. This is the configuration
that we have obtained with 100,000 steps in our Markov chain in simulated
annealing and for that model problem there is a paper that analyzes different
model problems and the optimal configuration is known, the ground state is
known and the ground state is slightly better. It has 23 connections. We only
have 21 but actually, with such a simple method we have obtained a quite good
solution and that is really the essence of why simulated annealing is popular.</p>
<p>We can often get quite far with arguably less effort, both in implementation
and runtime, although there may be a better method in a specific domain that
is optimized for that problem. And let us reflect a little bit on what we did. We have
solved a rather complicated problem like folding this 2D protein with a very,
very simple method with just a random walk that accepts or rejects simple
modifications.</p>
<p>And that is basically it. Before I get to my last slide, I just want to say
a little bit about the literature. If you are interested in Monte Carlo
methods I can highly recommend the first book, and if you are interested in
the black-and-white pictures I showed of the people relevant to the invention
of the Monte Carlo method, I can recommend the last book, which is the
autobiography of Stanislaw Ulam; it is very interesting.</p>
<p>So, thank you very much for your attention.</p>
<h2>References</h2>
<p>Here are some of the introductory book references mentioned in the talk.</p>
<p>The historical context and anecdotes are mostly from the autobiography of Stan
Ulam,
<a href="http://www.ucpress.edu/book.php?isbn=9780520071544">Adventures of a
Mathematician</a>.
The book is accessible to anyone with a basic high school math background.
See also this <a href="http://projecteuclid.org/euclid.bams/1183540384">kind 1978 review of the
book</a>.</p>
<p>A great, now somewhat dated introduction to Monte Carlo methods is <a href="http://www.people.fas.harvard.edu/~junliu/">Jun
Liu</a>'s <a href="http://www.springer.com/us/book/9780387763699">Monte Carlo Strategies in
Scientific Computing</a>. I
learned Monte Carlo through this book and it has a spot in my bookshelf that
is at arm-length from my desk.</p>
<p>The Liu book covers a lot of ground but is somewhat dated; a slightly more
formal and up-to-date Monte Carlo book is
<a href="http://statweb.stanford.edu/~owen/">Art Owen</a>'s upcoming book <a href="http://statweb.stanford.edu/~owen/mc/">Monte Carlo
theory, methods and examples</a>, which is
excellent.</p>
<p>A highly accessible and very well written introduction to Markov chains and
simple Monte Carlo methods is
<a href="http://www.math.chalmers.se/~olleh/">Olle Häggström</a>'s
<a href="http://www.cambridge.org/catalogue/catalogue.asp?isbn=0521813573">Finite Markov Chains and Algorithmic Applications</a>.
I recommend it if you want a most gentle introduction to the theory behind
MCMC.
Still the most authoritative reference on MCMC is the
<a href="http://www.mcmchandbook.net/HandbookTableofContents.html">Handbook of Markov Chain Monte Carlo</a>.
In particular, the first twelve chapters are on general methodology and
contain a wealth of information not found in other textbooks.</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = '//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: 'center'," +
" displayIndent: '0em'," +
" showMathMenu: true," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" +
" } " +
"}); ";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>History of Monte Carlo Methods - Part 22015-10-30T20:00:00+01:00Sebastian Nowozintag:www.nowozin.net,2015-10-30:sebastian/blog/history-of-monte-carlo-methods-part-2.html<p>This is the second part of a three part post.
The <a href="http://www.nowozin.net/sebastian/blog/history-of-monte-carlo-methods-part-1.html">last part</a> covered the early history of Monte
Carlo and the rejection sampling method.</p>
<h1>Part 2</h1>
<p>In this part we are going to look at importance sampling and sequential Monte
Carlo.</p>
<p>The video file is also available for offline viewing as
<a href="/sebastian/videos/MonteCarlo-Part2.mp4">MP4/H.264</a> (66MB),
<a href="/sebastian/videos/MonteCarlo-Part2.vp8.webm">WebM/VP8</a> (64MB), and
<a href="/sebastian/videos/MonteCarlo-Part2.vp9.webm">WebM/VP9</a> (44MB) file.</p>
<video id="montecarlo-part2" class="video-js vjs-default-skin"
controls preload="auto"
width="639" height="360" data-setup="{}">
<source src="/sebastian/videos/MonteCarlo-Part2.mp4" type='video/mp4'>
<source src="/sebastian/videos/MonteCarlo-Part2.vp8.webm" type='video/webm'>
<p class="vjs-no-js">
To view this video please enable JavaScript, and
consider upgrading to a web browser that
<a href="http://videojs.com/html5-video-support/" target="_blank">supports HTML5 video</a>
</p>
</video>
<script src="http://vjs.zencdn.net/5.0.0/video.js"></script>
<p><iframe
src="https://onedrive.live.com/embed?cid=6B87C0D396848478&resid=6B87C0D396848478%21108434&authkey=AJ6I7fTgWMVe7s8&em=2"
width="639" height="360" frameborder="0" scrolling="no"></iframe></p>
<p>(Click on the slide to advance, or use the previous/next buttons.)</p>
<h2>Transcript</h2>
<p>(This is a slightly edited and link-annotated transcript of the audio in the
above video. As this is spoken text, it is not as polished as my writing.)</p>
<p><strong>Speaker</strong>: Okay, so rejection sampling works only for short chain lengths.
And then, in light of this finding, the next method, sequential importance
sampling, was introduced independently by two groups.
<a href="https://en.wikipedia.org/wiki/John_Hammersley">John Michael Hammersley</a>,
who actually did his PhD here in Cambridge and then moved to Oxford to become
professor, and by <a href="https://en.wikipedia.org/wiki/Marshall_Rosenbluth">Marshall
Rosenbluth</a> and his wife
Arianna Rosenbluth. They did it independently and called it by different
names. I think Hammersley called it <em>inversely restricted sampling</em>
and the Rosenbluths called it <em>biased sampling</em>, and neither of these
names really stuck. Nowadays it is called <em>sequential importance sampling</em>.
In different communities, it is also called the <em>growth method</em> or the
<em>Rosenbluth method</em>.</p>
<p>How does this method work? It is based on the idea that was suggested by the
audience just before (in the first part). Remember when we are growing these
chains step by step and we make a step where we would have to reject the
sample? Well, we could just avoid making that step, all right. We could just
say, "Look, there are two possibilities where you would not have to reject the
sample. So why not just take one of those?"</p>
<p>Of course the method is not fail-safe. So in a situation like this one, no
matter what we do, if we keep growing we will still run into trouble.
Right, so the method is still myopic; we still only make one step at a
time. But the real problem with this method is that we no longer sample
uniformly from the set that interests us. In fact, we favor more compact
configurations, and what Hammersley, Morton, and the Rosenbluths realized
is a method to systematically compensate for that bias.</p>
<p>So let me first talk about how this in general is done and then how it is done
in our specific example. So in general, we have this expectation expression
that we want to approximate. And what they said is, well, we assign one weight
to each sample; if every weight were one, that would be the original
expectation. But we assign a weight to each sample and we choose the weight in
such a way that we compensate for the bias towards some
configurations. So whenever we favor certain configurations, we down-weight
them, and whenever some configurations are rare but we generate them, we
up-weight them.</p>
<p>In practice, we would generate a few samples, each with a weight. In this
particular instance it works as follows: we just count the number of
possibilities we have in each step, so we basically decompose the sampling
distribution. In the first step we have four possibilities, four free points
adjacent to the current one, and in the next step only three are available.
And so we basically just unroll our decisions; here we only have two choices
available, all right? Because we grow the chain sequentially, the probability
of generating that particular configuration also decomposes sequentially, so
the final chain has this probability of being generated. And what they simply
do is say, "Okay, this is the distribution by which we generate the sample,
but we want to generate it uniformly at random, so we also decompose the
weight and just build the weight as the inverse of this probability". So when
we weight these samples by their weights we systematically de-bias the
sampling distribution, and we get unbiased estimates of the expectation that
we are interested in.</p>
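<p>For the self-avoiding walk, the growth method with its compensating weights can be sketched as follows; the chain length and sample count below are arbitrary illustrative choices, not the ones from the slides:</p>

```python
import random

def rosenbluth_walk(n_steps):
    """Grow one 2D self-avoiding walk with the growth (Rosenbluth) method.

    At each step we choose uniformly among the free neighboring lattice
    sites; the weight is the product of the number of free choices, which
    compensates for the bias towards compact configurations."""
    walk = [(0, 0)]
    occupied = {(0, 0)}
    weight = 1.0
    for _ in range(n_steps):
        x, y = walk[-1]
        free = [(x + dx, y + dy)
                for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if (x + dx, y + dy) not in occupied]
        if not free:         # trapped: no extension possible, weight zero
            return walk, 0.0
        weight *= len(free)  # unroll the sequential sampling probability
        step = random.choice(free)
        walk.append(step)
        occupied.add(step)
    return walk, weight

# Weighted (de-biased) estimate of the mean squared end-point distance
# for walks of 20 steps; 2,000 samples is an arbitrary illustrative choice.
samples = [rosenbluth_walk(20) for _ in range(2000)]
total = sum(w for _, w in samples)
estimate = sum(w * (wk[-1][0] ** 2 + wk[-1][1] ** 2) for wk, w in samples) / total
print(round(estimate, 1))
```

The weighted estimate comes out noticeably larger than the value 20 a simple (non-self-avoiding) random walk would give, since self-avoidance stretches the chain out.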
<p>Let us take a look at where we were with the rejection sampler. This is the
limit of what we could do with a rejection sampler. Now, with the growth
method, we can go to significantly longer chain lengths, to a chain length of
60, and then again the uncertainty estimates, the confidence intervals, blow
up. So why is this? Why do the uncertainty intervals blow up in this improved
method? Well, the thing is, the weights that we compute become very
unbalanced. And even though we generate maybe a few thousand samples, only a
few of them will have a significant share of the weight.</p>
<p>So here is a visualization of that. Here, I grow 50 chains in parallel, one
step at a time, and I show you in each step the normalized weights. In the
beginning everything is uniform, because in the beginning everything has an
equal number of possibilities. But over time, as I grow more and more, as I
append more elements, the weights become very unbalanced, so that after 100
steps only five elements have weights significantly different from zero.
And this means we do not actually have 50 samples in this case, we only have
five, and our estimates become very poor. And this only amplifies when you
have a few thousand samples. One way to measure the quality of the samples we
have generated is to ask, "Okay, I have generated, say, 5,000 samples with
weights. How much are these worth in terms of computing expectations, in terms
of unweighted samples?"</p>
<p>Because unweighted samples are optimal, there is a quality measure that you
can compute (an estimated quantity) called the <a href="http://www.nowozin.net/sebastian/blog/effective-sample-size-in-importance-sampling.html">effective sample
size</a>,
which measures exactly how much a weighted sample set is worth. And for this
plot I have shown you now with 5,000 samples, you see that it drops and drops
and drops until it is almost close to one. So that is a real problem.</p>
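<p>One common definition of the effective sample size, the one usually used for importance sampling, is <span class="math">\(\mathrm{ESS} = (\sum_i w_i)^2 / \sum_i w_i^2\)</span>, which is easy to compute; a minimal sketch:</p>

```python
import numpy as np

def effective_sample_size(weights):
    """ESS = (sum_i w_i)^2 / (sum_i w_i^2).

    Equals n for perfectly uniform weights and approaches 1 when a
    single weight dominates all others."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

print(effective_sample_size([1.0] * 50))           # uniform: ESS is 50
print(effective_sample_size([1.0] + [1e-6] * 49))  # one dominant weight: ESS near 1
```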
<p>You guessed it: the next step would be another improved Monte Carlo method,
and indeed it improves on that. It is generally known as the
<a href="http://www.stats.ox.ac.uk/~doucet/doucet_defreitas_gordon_smcbookintro.pdf"><em>Sequential Monte Carlo method</em></a>.
The idea is quite simple and natural and it has been reinvented
many times in different communities. The <a href="http://scitation.aip.org/content/aip/journal/jcp/30/3/10.1063/1.1730021">original
paper</a>
is from 1959, but the method has been reinvented in the signal processing
community as the
<a href="https://en.wikipedia.org/wiki/Particle_filter">particle filter</a>, and it
has been reused and pioneered in computer vision by our Andrew Blake for
tracking objects and their contours. It is used across many different
communities, often under very different names, but generally Sequential Monte
Carlo is the preferred name. The basic idea here is quite simple. The
problem is unbalanced weights, so we have to prevent getting unbalanced
weights in each step. The way we are going to do this is by introducing a
process which removes samples that have low weight and duplicates samples
that have high weight. That is called re-sampling.</p>
<p>So say at one certain timestep we grow all the chains in parallel, say 50
chains in parallel; in this example it is only six chains in parallel.
And we have observed that the weights are unbalanced. Then we remove some of
the low-weight instances and duplicate some of the high-weight instances, as
shown here. The algorithm that corresponds to that is the same as before, just
weighted sequential importance sampling, but we grow all the samples in
parallel and monitor the weights. And if the weights are in trouble, if the
weights become unbalanced, we enforce balanced weights again by removing
low-weight samples and duplicating others.</p>
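<p>A minimal sketch of this re-sampling step, assuming multinomial re-sampling (one of several re-sampling schemes in use); the particles and weights below are made up for illustration:</p>

```python
import random

def resample(particles, weights):
    """Multinomial re-sampling: draw n particles with replacement in
    proportion to their weights, then reset all weights to uniform.
    Low-weight particles tend to disappear; high-weight particles are
    duplicated."""
    n = len(particles)
    survivors = random.choices(particles, weights=weights, k=n)
    return survivors, [1.0 / n] * n

# Six chains with very unbalanced weights (values are illustrative).
particles = ["a", "b", "c", "d", "e", "f"]
weights = [0.70, 0.24, 0.02, 0.02, 0.01, 0.01]
print(resample(particles, weights))
```

After this step the particle set concentrates on the high-weight chains while the weights are uniform again.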
<p><strong>Attendee</strong>: Question.</p>
<p><strong>Speaker</strong>: Yes.</p>
<p><strong>Attendee</strong>: In your little white chain example, the weights are going to be
high when at every step you can take three choices, right? Like three times
three times three. And that is going to get bigger. The long ones do not go
near each other. So this is going to bias in favor of things that just sort of
go off into the distance, all of them curly wurly things, is that right?</p>
<p><strong>Speaker</strong>: Right, exactly, because the sampling distribution biases towards
compact configurations the weights have to undo that bias by favoring the
configurations that walk off and are no longer compact.</p>
<p><strong>Attendee</strong>: But isn't that bad? Because then essentially we'll be dominated by
all the ones that go off into the distance and we won't get any curly wurly
ones.</p>
<p><strong>Speaker</strong>: It would be bad, but remember that the sampling distribution that
I had above the weight, the sampling distribution, has exactly that opposing
bias. We generate from that distribution and we want the weights to
compensate for its bias. So the actual samples that we get are more compact
than they should be, and that is why we have to downweight these, to give them
a low weight, right? And we have to upweight those samples that are not
compact but go out into basically long chains.</p>
<p><strong>Attendee</strong>: I did not get that, sorry. But never mind. I believe you.</p>
<p><strong>Speaker</strong>: Okay. So we do that re-sampling operation to compensate for
the unbalanced-weight effect, and I plot again the effective sample size;
whenever the effective sample size drops below 2,500, in this case, I perform
this re-sampling operation, reset the weights to uniform, and thereby force
the effective sample size to become 5,000 again. You see that basically I can
control how unbalanced the weights become. Here is another visualization, in
terms of the same plot that I had before; now the red arrows indicate whenever
I reset the weights to uniform, whenever I perform this re-sampling operation,
so I can always ensure that my weights are close to the uniform
distribution.</p>
<p>So let us compare that again. This was the plot without re-sampling; you
see that the uncertainty estimates indicate that our estimates are very
unreliable. And this is the plot with re-sampling. The whole family of
Sequential Monte Carlo approaches are really state-of-the-art methods. This
scales almost without limit; people have used it to generate chains with over
a million bonds. It is the state of the art for any kind of probabilistic
model that you can sequentially decompose, for example time series models,
hidden Markov models, state space models, dynamic Bayesian networks,
all these kind of models, these methods are applicable and highly efficient.</p>History of Monte Carlo Methods - Part 12015-10-16T21:30:00+02:00Sebastian Nowozintag:www.nowozin.net,2015-10-16:sebastian/blog/history-of-monte-carlo-methods-part-1.html<p>Some time ago in June 2013 I gave a <em>lab tutorial</em> on Monte Carlo methods at
Microsoft Research. These tutorials are seminar-talk length (45 minutes) but
are supposed to be light, accessible to a general computer science audience,
and fun.</p>
<p>In this tutorial I explain and illustrate a number of Monte Carlo methods
(rejection sampling, importance sampling, sequential Monte Carlo, Markov chain
Monte Carlo, and simulated annealing) on the same problem.
Although I am not exactly a comedian, in order to keep the tutorial fun I
peppered the talk with lots of historical anecdotes from the inventors of the
methods.</p>
<p>This is the first of three parts.</p>
<h1>Part 1</h1>
<p>The first part (17 minutes) covers the history of modern Monte Carlo methods,
their use in scientific computation, and one of the most basic Monte Carlo
methods, <em>rejection sampling</em>.</p>
<p>The video file is also available for offline viewing as
<a href="/sebastian/videos/MonteCarlo-Part1.mp4">MP4/H.264</a> (109MB),
<a href="/sebastian/videos/MonteCarlo-Part1.vp8.webm">WebM/VP8</a> (106MB), and
<a href="/sebastian/videos/MonteCarlo-Part1.vp9.webm">WebM/VP9</a> (74MB) file.</p>
<video id="montecarlo-part1" class="video-js vjs-default-skin"
controls preload="auto"
width="639" height="360" data-setup="{}">
<source src="/sebastian/videos/MonteCarlo-Part1.mp4" type='video/mp4'>
<source src="/sebastian/videos/MonteCarlo-Part1.vp8.webm" type='video/webm'>
<p class="vjs-no-js">
To view this video please enable JavaScript, and
consider upgrading to a web browser that
<a href="http://videojs.com/html5-video-support/" target="_blank">supports HTML5 video</a>
</p>
</video>
<script src="http://vjs.zencdn.net/5.0.0/video.js"></script>
<p><iframe
src="https://onedrive.live.com/embed?cid=6B87C0D396848478&resid=6B87C0D396848478%21108435&authkey=AHiRYMuM9wBIKos&em=2"
width="639" height="360" frameborder="0" scrolling="no"></iframe></p>
<p>(Click on the slide to advance, or use the previous/next buttons.)</p>
<h2>Transcript</h2>
<p>(This is a slightly edited and link-annotated transcript of the audio in the
above video. As this is spoken text, it is not as polished as my writing.)</p>
<p><strong>Speaker</strong>: Thank you all for coming to this lab tutorial. I know many of you
have used Monte Carlo techniques in your research or in your projects. Still,
I decided to keep the level of this tutorial very basic, and I will try to show
you a few different Monte Carlo methods and how they may be useful in your
research. I hope that after the talk you understand how these
methods can be applied and what limitations the different methods
have. And I will introduce the different methods in chronological order and
also say a little about the interesting history of how these methods were
invented.</p>
<p>But before I get to that, I first want to ask you, do you like to play
solitaire? I certainly do sometimes play solitaire and when you play a couple
of games, you realize that some games are actually not solvable. So some games
are just, no matter what you try, no matter what you do, they are just
provably not solvable. And so if you shuffle a random deck of 52 cards and lay
it out as a solitaire deck, it's a valid question to ask: with what probability
do you get a solvable game? That's the question. And it's precisely this
question, precisely this question for the game of <a href="https://en.wikipedia.org/wiki/Canfield_%28solitaire%29">Canfield
Solitaire</a> that has
led to the invention of the modern Monte Carlo methods.</p>
<p>One way to attack this problem, instead of trying analytic or mathematical
approaches that have to take into account all the rules of the game, is to
just take a deck of cards, play a hundred times after randomly shuffling, and
just look at how many times you come up with a solvable game. That would give
you a ballpark estimate of the probability. And that is precisely what this
man, <a href="https://en.wikipedia.org/wiki/Stanislaw_Ulam">Stanislaw
Ulam</a>, recognized is possible.</p>
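<p>The recipe is completely generic: repeat a random experiment many times and report the fraction of successes. As a sketch, here it is applied not to solitaire (whose rules would take too long to encode) but to the birthday problem, which has a known answer to check against:</p>

```python
import random

def monte_carlo_probability(experiment, n_trials=100_000):
    """Estimate P(success) as the fraction of successful random trials."""
    return sum(experiment() for _ in range(n_trials)) / n_trials

def shared_birthday(n_people=23):
    """One random experiment: do any two of n_people share a birthday?"""
    days = [random.randrange(365) for _ in range(n_people)]
    return len(set(days)) < n_people

# The exact answer for 23 people is about 0.507; the estimate comes close.
print(monte_carlo_probability(shared_birthday))
```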
<p>I want to say a few words about Stanislaw Ulam because he's so crucial to the
invention of Monte Carlo methods. He was born in today's Ukraine in a town
called <a href="https://en.wikipedia.org/wiki/Lviv">Lviv</a> (formerly Lemberg,
in Austria-Hungary). And he enjoyed a very good education; his family had
a good background. And he discovered very early in his life that he likes to
do mathematics. He was part of the <a href="https://en.wikipedia.org/wiki/Lw%C3%B3w_School_of_Mathematics">Lviv School of
Mathematics</a>,
which made many contributions to more abstract mathematics, for example vector
spaces, so he's known for some of these mathematical results. But then he had
to flee to the United States in the 1930s, where he became a professor of
mathematics and was recruited to <a href="http://www.atomicheritage.org/history/hydrogen-bomb-1950">Los
Alamos</a> to do
research on the hydrogen bomb: not on the first nuclear weapon, but on the
second design, the hydrogen bomb.</p>
<p>During that time, in 1946, working at Los Alamos, he had a breakdown. For a
couple of days he had a headache, then he collapsed and was taken to a
hospital. The doctors performed an <a href="https://en.wikipedia.org/wiki/Encephalitis">emergency
surgery</a> and removed part of his
skull, because it turned out he had a brain infection; the brain had swollen
and he would have died if the doctors had not performed this operation. And
the doctors told him: "You have to recover, you have to stay at home for half
a year, and do not do any mathematics."</p>
<p>He was obsessed with mathematics his whole life. So instead of doing
mathematics, he tried to pass the time playing Canfield Solitaire. And while
playing Canfield Solitaire he asked the question, "<a href="https://en.wikipedia.org/wiki/Canfield_%28solitaire%29#Solvability">Okay, what's the
probability to solve this
game?</a>"
and with his quite broad knowledge of mathematics he tried a few different
analytic attempts to come up with the answer to that question. But ultimately
he realized that it is much easier to get an estimate by just playing games
randomly.</p>
<p>And at that time he was already doing research at Los Alamos. He recognized
that this has applications as well for studying different scientific problems
such as <a href="https://en.wikipedia.org/wiki/Neutron_transport">neutron transport</a>,
which is essential to understand when designing nuclear weapons. So he is also
the inventor of the <a href="https://en.wikipedia.org/wiki/History_of_the_Teller%E2%80%93Ulam_design">first working hydrogen bomb
design</a>
together with <a href="https://en.wikipedia.org/wiki/Edward_Teller">Edward Teller</a>,
and the inventor of the Monte Carlo method, <a href="http://www.amstat.org/misc/TheMonteCarloMethod.pdf">published a few years
later</a> in
1949. He is also known for having performed probably the most laborious
manual computation ever undertaken (with Cornelius Everett), to disprove Edward
Teller's earlier nuclear weapon design, to show that it is not possible. So, a
very interesting history. I will talk a little bit more about him later.</p>
<p>So nowadays, with Monte Carlo methods I really mean any method where you
perform some random experiment, which is typically quite simple, and aggregate
the results into some inference about a more complex system. Today, Monte
Carlo methods are very popular for simulating complex systems, for example
models of physical or biological or chemical processes, weather forecasting,
and of course nuclear weapon design.
But also, just last week they were used to <a href="http://www.nature.com/nature/journal/v497/n7451/full/nature12162.html">simulate the HIV
capsid</a>,
a simulation of 64 million atoms and a major breakthrough in understanding
HIV. So Monte Carlo has huge applications in scientific simulations, and it
also has applications in doing inference in probabilistic models. The most
famous system there would be the <a href="http://www.mrc-bsu.cam.ac.uk/software/bugs/">BUGS
system</a>, also developed here in
Cambridge at the University, initially in the early '90s.
<a href="http://research.microsoft.com/en-us/um/cambridge/projects/infernet/">Infer.NET</a>
also supports Monte Carlo inference, and here at MSRC the <a href="http://research.microsoft.com/en-us/projects/filzbach/">Filzbach
system</a>
does as well. There is also a now quite popular system from Columbia
University called <a href="http://mc-stan.org/">STAN</a>. It's actually named STAN because
of Stanislaw Ulam.</p>
<p>Monte Carlo methods can also be used for optimization, so not just for
simulating but also for optimizing a cost function. We will see an example
later, but typically they are used where very complicated systems are
optimized, something like a circuit layout that has many interdependent
constraints. And they are also used for planning, for games, and for robotics,
where it is essential to approximate intractable quantities, to perform
planning under uncertainty, or where measurement noise makes it essential to
represent uncertainty faithfully. So there are many, many
different applications, too many to really list. I want to pick out one
application for the rest of the talk and illustrate it with different Monte
Carlo methods.</p>
<p>And that application is <a href="https://en.wikipedia.org/wiki/Protein_folding">protein
folding</a>. Protein folding
happens right now in your body, in every cell of your body. In every cell you
have a structure called the ribosome, and that is basically the factory in
your cell. It transforms information encoded in the DNA into one long linear
structure, the protein. And that structure is a long chain that folds
itself into very intricate three-dimensional structures. Very beautiful
structures arise, and it is really the three-dimensional shape that this long
chain folds into that determines the functional properties. So it is really
essential to understand this shape in order to make predictions about what
these molecules do. Folding can take anywhere between a few milliseconds and a
few hours, and I think the state of the art on a modern machine is to be able
to simulate accurately something like 60 nanoseconds per computer day. So we
are nowhere in reach of being able to accurately fold these structures. But
there is the <a href="https://folding.stanford.edu/">Stanford Folding@home</a> project, which
uses Monte Carlo methods, and I think right now they have something like a
hundred fifty thousand computers working on the problem of protein
folding. It is quite essential to understanding a couple of different
diseases.</p>
<p>We are not going to solve protein folding in this talk, so I am going to use a
slightly simplified model. One simplification is that you still have a
chain, but you say, "Okay, first the chain does not live in three dimensions,
it only lives on the plane." And we do not have many different amino acids, we
only have two: the black ones and the white ones. The white ones like water
and the black ones repel water, and so the chain
folds into something that has a black core and a white surrounding. In fact, I
am going to make it even simpler. I say, "Okay, it lives on the plane, but it
lives on the grid." So that is a further simplification. And for the next
few slides, I simplify one step further still: we only have the white
elements. </p>
<p>So that is the so-called <a href="https://en.wikipedia.org/wiki/Self-avoiding_walk">2D lattice self-avoiding random walk</a> model. You have
a certain length, say 48 bonds, 48 elements, and you have a self-avoiding
walk, so the walk is not allowed to cross itself. This is a very
simplified model, but already some interesting questions become very
hard or actually intractable. For example, if I fix the number of elements in
the walk, one question is: how many self-avoiding walks are there on the
plane? Another question is: okay, the number is finite, there are many
but finitely many possible combinations, so how do I generate such a
random walk uniformly? And a third question would be: okay, I am interested in some
average property, for example the average distance between the
two endpoints, so how do I compute an approximation to that average quantity?</p>
<p>These are really typical problems that can be addressed with Monte Carlo
methods:</p>
<ul>
<li><em>average quantities</em>,</li>
<li><em>counting problems</em>,</li>
<li><em>random sampling problems</em>.</li>
</ul>
<p>So that is what is going to be with us in the next few slides.</p>
<p>The first method is a very simple one. It is called <a href="https://en.wikipedia.org/wiki/Rejection_sampling"><em>rejection
sampling</em></a> and the idea is
really very simple to explain. We have this complicated set, the set of
all self-avoiding random walks of a certain length, and we want to generate
one element uniformly at random from that set. This is hard. So what we do
instead is consider a superset, the set of all random walks of that
length, where the walk is allowed to cross itself. It is very easy to
simulate from that set. So we just simulate from this orange set, from this
larger set, and whenever we end up outside the blue set we discard that
sample, and whenever we are inside the blue set we keep the
sample. And because we generate samples uniformly, we can just keep doing this
and collect a sample whenever we reach an element of the blue set.</p>
<p>In practice this works as follows: we start and we just keep appending steps in
a randomly chosen direction, one out of three say, and if we happen to cross
ourselves we can already discard that sample and start over. We keep every
sample that we can grow to the full length we want, collect maybe a thousand
of them, and compute whatever property we want from that sample set.</p>
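<p>The procedure just described can be sketched in a few lines of Python (my illustrative reconstruction, not code from the talk): the first step is fixed to the right, each subsequent step chooses one of the three non-reversing directions, and the walk restarts as soon as it revisits a site.</p>

```python
import random

def sample_saw(n, rng=random):
    """Rejection-sample one self-avoiding walk of n steps on the 2D grid.

    Restarts (rejects) as soon as the walk touches itself, as in the talk.
    """
    while True:  # keep trying until a full-length walk survives
        path = [(0, 0), (1, 0)]          # first step fixed to the right
        visited = {(0, 0), (1, 0)}
        ok = True
        for _ in range(n - 1):
            x, y = path[-1]
            px, py = path[-2]
            # the three candidate moves that do not immediately reverse
            steps = [(x + dx, y + dy)
                     for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                     if (x + dx, y + dy) != (px, py)]
            nxt = rng.choice(steps)
            if nxt in visited:           # walk crossed itself: early rejection
                ok = False
                break
            path.append(nxt)
            visited.add(nxt)
        if ok:
            return path

# a 20-step walk visits 21 distinct lattice sites
walk = sample_saw(20, random.Random(0))
assert len(set(walk)) == 21
```

<p>Because every accepted walk is kept with the same probability, the surviving samples are uniform over the set of self-avoiding walks of that length.</p>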
<p><strong>Attendee 1</strong>: May I ask a question?</p>
<p><strong>Speaker</strong>: Yes, sure.</p>
<p><strong>Attendee 1</strong>: What happens when instead you say, "Oh dear, I shouldn't have
gone down, I should have gone in a different direction"? Do you then just get a
biased sample or something?</p>
<p><strong>Speaker</strong>: You are anticipating the future. We are still in 1949.</p>
<p><strong>Attendee 1</strong>: But I thought this was the-- you said, right at the beginning,
that generating the ones that don't cross themselves is hard. </p>
<p><strong>Speaker</strong>: Yes and it would still be hard-- just bear with me for a few
slides, this is actually where it's leading to.</p>
<p><strong>Attendee 1</strong>: Okay.</p>
<p><strong>Speaker</strong>: But this is a simple method. And once we have generated the
set of samples, we can compute average properties. For example, the squared
extension, where you compute the distance in the plane between the two
end points; that is a model problem that people considered. More
generally, what we would like to do is compute expectations. So we have a
distribution <span class="math">\(\pi\)</span> over some state space and we would like to evaluate the
expectation of some quantity <span class="math">\(\phi(X)\)</span>, for example the distance between the two endpoints
of a given state. We want to compute a sum, and the sum contains
exponentially many terms in this case; we want to compute that sum as an
expectation, an average quantity. And the Monte Carlo idea is to simply
approximate that huge sum with exponentially many terms by
something that has only, say, a thousand or 10,000 terms, namely the samples we
generated.</p>
<p>Now let us actually do this rejection sampling as a
function of the chain length. I generate here 10 million
proposals, 10 million times I try, and I keep all the samples that
happened to be self-avoiding. Then I can plot the average distance, and
because it is an average of many terms I know that the central limit theorem
applies, so I can also plot <a href="https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Wilson_score_interval">confidence
intervals</a>.
So I not only get the inferences that I am interested in, I also get an
estimate of their reliability: a confidence interval that captures the
true value with a certain probability. </p>
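<p>The central-limit-theorem argument can be made concrete: given any batch of accepted sample values of the quantity of interest, the sample mean and a normal-approximation 95% confidence interval follow from the sample variance (a minimal sketch; the sample values below are made up for illustration).</p>

```python
import math

def mean_with_ci(samples, z=1.96):
    """Sample mean and normal-approximation 95% confidence interval (CLT)."""
    n = len(samples)
    mean = sum(samples) / n
    # unbiased sample variance
    var = sum((s - mean) ** 2 for s in samples) / (n - 1)
    half = z * math.sqrt(var / n)     # half-width of the interval
    return mean, (mean - half, mean + half)

# e.g. squared end-to-end distances of accepted walks (illustrative values)
mean, (lo, hi) = mean_with_ci([34.0, 40.0, 28.0, 38.0, 30.0])
assert lo < mean < hi
```

<p>As the acceptance rate drops, fewer samples enter the average, the variance term <code>var / n</code> grows, and the interval widens, which is exactly the blow-up seen in the plot.</p>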
<p>Okay, and it works until a chain length of thirty, so already quite long
chains; then the confidence intervals become larger because fewer and
fewer samples get accepted. I used 10 million attempts here, but actually similar
methods are very useful even with a few hundred attempts. This is a picture
from around that time in Los Alamos, where they performed the simulation manually
with a drawing device, basically on a sheet of paper. Whenever
they crossed from one type of material to another type of material, they would
change the wheels, roll a new random number, and then move and turn the device
in a random direction, and by doing this a few hundred times they got a global picture
of how the neutrons are scattered in the matter. Because everything was named
MANIAC, ENIAC, etc., and this idea came from Enrico Fermi, they called this
device the <a href="https://en.wikipedia.org/wiki/FERMIAC">FERMIAC</a>.</p>
<p>But anyway, another thing we can do is solve the counting problem. We can
estimate the acceptance rate: we have the number of attempts that we made and
the number of attempts that were accepted, that is, that happened to be self-avoiding,
and the ratio gives us the acceptance rate. We can then estimate the number of
self-avoiding walks simply as the product of this acceptance rate with the total
number of 2D walks that are not necessarily self-avoiding. That total is easy to
calculate because the first step is fixed to the right direction and we have
three possibilities in each subsequent step, so we have a simple formula for it,
and this gives an estimate of the number of self-avoiding walks.
Here is a plot of that, and in <a href="http://arxiv.org/abs/cond-mat/0409355">this paper I found from
2005</a> people have exhaustively
computed that number with clever enumeration methods up to a length of 59, but beyond
that the exact number is unknown. The estimate happens to agree very well with this
known ground truth.</p>
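<p>As a sketch of the counting estimate (again my reconstruction, not the talk's code): with the first step fixed to the right and three non-reversing choices per subsequent step there are <span class="math">\(3^{n-1}\)</span> proposal walks, so the number of self-avoiding walks whose first step goes right is approximately the acceptance rate times <span class="math">\(3^{n-1}\)</span>.</p>

```python
import random

def estimate_saw_count(n, attempts=200_000, rng=None):
    """Estimate the number of n-step self-avoiding walks whose first step
    goes right: acceptance rate times the 3**(n-1) non-reversing proposals."""
    rng = rng or random.Random(0)
    accepted = 0
    for _ in range(attempts):
        path = [(0, 0), (1, 0)]          # first step fixed to the right
        visited = {(0, 0), (1, 0)}
        for _ in range(n - 1):
            x, y = path[-1]
            px, py = path[-2]
            nxt = rng.choice([(x + dx, y + dy)
                              for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                              if (x + dx, y + dy) != (px, py)])
            if nxt in visited:           # self-intersection: reject
                break
            path.append(nxt)
            visited.add(nxt)
        else:
            accepted += 1
    return accepted / attempts * 3 ** (n - 1)

# For n = 4 the exact count is 25 (one quarter of the 100 four-step
# self-avoiding walks on the square lattice), and the estimate lands close.
est = estimate_saw_count(4)
assert 24.5 < est < 25.5
```

<p>Multiplying by four (for the four possible first-step directions) recovers the total count that the exhaustive-enumeration papers report.</p>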
<p><strong>Attendee 1</strong>: Quick question. Is that even with your early rejection
business?</p>
<p><strong>Speaker</strong>: Yes, that is with the early rejection.</p>
<p><strong>Attendee 1</strong>: Okay.</p>
<p><strong>Speaker</strong>: It is exactly with the rejection sampler here. So the acceptance
rate is from the rejection sampler; <span class="math">\(P\)</span> is estimated from the rejection
sampler.</p>
<p><strong>Attendee 1</strong>: What is the acceptance rate when you get to 30?</p>
<p><strong>Speaker</strong>: Again, next slide here.</p>
<p><strong>Speaker</strong>: I am impressed. One second. Let us first enjoy what we have
achieved; let us take a look at Monte Carlo and enjoy some sunshine. So the name
Monte Carlo: what first comes to mind is of course the casinos, right? And
the gambling, and that is indeed one of the origins of the name. But the
particular reason is that the person who suggested it, Nicholas Metropolis,
the colleague of Stanislaw Ulam, was very much amused by the stories Stan
told about his uncle, Michael Ulam, a wealthy businessman in
his hometown of Lviv, who switched to the finance industry and spent the
rest of his life gambling away his fortune in Monte Carlo. Nicholas
Metropolis found this so amusing that he insisted the method be called the
Monte Carlo method. So that is the real reason why it is called the Monte Carlo
method.</p>
<p>But it is not all sunny, and that is where we come to this slide, which shows the
acceptance rate as a function of the chain length. And you see that the simple
rejection sampler breaks down for long enough chains. Intuitively you can
understand this: when you grow the chain very long, the probability of crossing
yourself when you walk randomly becomes higher and higher, and the acceptance rate
becomes very, very small. I think for a million samples I had only something like 15 walks
accepted at a length of 30, and that is why the confidence intervals have
been blowing up: the estimates become unreliable. </p>
<p>The <a href="http://www.nowozin.net/sebastian/blog/history-of-monte-carlo-methods-part-2.html">next part</a> is available.</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = '//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: 'center'," +
" displayIndent: '0em'," +
" showMathMenu: true," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" +
" } " +
"}); ";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>The Julia language for Scientific Computing2015-10-02T22:30:00+02:00Sebastian Nowozintag:www.nowozin.net,2015-10-02:sebastian/blog/the-julia-language-for-scientific-computing.html<p><a href="http://julialang.org/">Julia</a> is a relatively new programming language with
the <a href="http://julialang.org/blog/2012/02/why-we-created-julia/">declared goal</a>
to become the leading language for scientific computing.</p>
<p>I have probably annoyed half of my colleagues by raving about how great the
language is and what it is good at.
Before we get to this, and in my defense, let me provide some context. I have
been developing in C and C++ for 20 years now, and have been using Matlab
and Python for over ten years. These are great languages and I can be
productive in each; in fact I continue to use them regularly.</p>
<p>Also, I tend to be quite conservative in terms of adopting new languages or
development tools: while learning a new language and environment is fun,
it also takes a lot of effort, and most languages/tools/libraries tend to come
and go rather quickly; every developer carries with them a graveyard of
tools and languages long gone.</p>
<p>Because of this short-lived nature of software, when someone approaches me
with a new language or tool I am skeptical by default, and my litmus test
question is usually how confident they are that this tool will still be around
in five years' time. This is of course unfair, but I prefer to invest my time
in learning things that have long term value. Which brings me to the point
that I firmly believe Julia is here to stay and in fact may even become a
popular language in scientific computing.</p>
<p>Enough rambling, let's get to the good parts.</p>
<p>I have been using Julia for the last 18 months now, both for work and pleasure.
Counting all code I wrote at work (just counting .jl files, no notebooks) I
see that I wrote more than 15k lines of Julia code in that time, including
several larger projects, ports of existing Matlab and C++ code, and interfaces
to C libraries.
In my experience Julia is ready for production in internal projects (as
opposed to shipping executable code to a customer), and in particular it is
very well suited to research-type projects.</p>
<p><a href="http://julialang.org/"><img src="images/julia.svg" alt="Logo of the Julia language" style="width: 240px; display: block; margin-left: auto; margin-right: auto;"/></a></p>
<h1>Julia</h1>
<p>Developing code for research projects is in many ways similar to developing
other software, but the key difference for me is that I need a quick
turnaround time from idea to result not just once but in multiple iterations,
sometimes changing the idea and implementation drastically.</p>
<p>In a very real sense most research projects should fail to achieve their
original goals; almost by definition research is beyond what is known to
work. If you only attempt known-to-work ideas it is not research.
If your project fails it is important to learn as much as possible from the
failure, that is, increasing the understanding of the problem and finding
suitable new research ideas, and quick iterations make this process fun.
The new ideas are often variants of earlier ideas and thus can reuse code. If
this code happens to be compact and flexible this translates directly into
productivity.</p>
<p>Matlab, R, and Python achieve this tight cycle of iterations quite
successfully, but in all three languages there is a price towards the later
iterations: to achieve a high-performance implementation, significant
parts of the code need to be rewritten in a more basic language
such as C++, which then needs to be interfaced to the rest of the code through
some interface specification.
For big high-value projects in industry with dedicated engineering support
the additional effort required is typically not a problem, but for individual
researchers it means hours and days spent writing additional code without
adding functionality.</p>
<p>This process is cumbersome, error-prone, and creates a strong coupling, making
further iterations of changing ideas and implementations slower.
(As an example, in my <a href="http://www.nowozin.net/sebastian/grante/">grante</a>
library I prototyped many algorithms in Matlab, then programmed them in C++,
then wrote a Matlab interface which by itself is almost 2,000 lines of C++
code.)</p>
<p><a href="http://julialang.org/">Julia</a> also achieves this tight cycle, but does not
require you to resort to compiled statically-typed languages such as C++ in
order to achieve high performance.
Using a single language maintains productivity both at the very beginning
(prototyping) and towards the later iterations (productization).</p>
<p>Productivity in Julia (roughly "scientific results per wallclock developer
time") is achieved through a number of features:</p>
<ul>
<li><em>compact syntax</em>, for example I can declare a function using <code>f(x) = 2x+5</code>.
As mentioned above, I see the advantage of a compact syntax not in the
keystrokes saved initially, but in lowering the barrier to future
understanding and modification as the code evolves.</li>
<li><em>optional type annotation</em>, the above function will work for <code>x</code> being an
integer, or a float, or anything that has a multiplication and addition with
integer arguments defined; in fact, I could write <code>f(x::Float64) = 2x+5</code> to
require that <code>x</code> is a float, but performance-wise they both yield the same
code. This means that I can be strict about types when I need to be, but have
the feel of a <a href="https://en.wikipedia.org/wiki/Dynamic_programming_language">dynamic programming
language</a>.</li>
<li><a href="http://jupyter.org/"><em>Jupyter notebook interface</em></a> for quick
think-implement-results cycles.</li>
<li><em>excellent default choices of numerical libraries</em>, dense linear algebra,
sparse linear algebra,
<a href="http://juliaopt.org/">numerical optimization libraries</a>,
<a href="http://julia.readthedocs.org/en/latest/stdlib/numbers/#bigfloats">arbitrary precision computation</a>,
<a href="http://julia.readthedocs.org/en/latest/stdlib/math/?highlight=bessel#mathematical-functions">special functions</a>,
<a href="http://julia.readthedocs.org/en/latest/stdlib/math/?highlight=fft#signal-processing">FFT</a>, etcetera, most of what you can wish for
in a technical computing environment is already there by default or in the
many numerical packages available. In terms of numerical optimization codes
Julia is probably one of the <a href="http://www.juliaopt.org/">best environments</a>
available. All these libraries are carefully chosen to be the best-in-class
for the functions that they implement.</li>
<li><a href="https://en.wikipedia.org/wiki/Foreign_function_interface"><em>foreign function interfaces</em></a>
to a number of languages:
<a href="http://docs.julialang.org/en/release-0.3/stdlib/c/">C and Fortran</a>,
<a href="https://github.com/Keno/Cxx.jl">C++</a> (unfortunately planned only for Julia 0.5),
<a href="https://github.com/stevengj/PyCall.jl">Python</a>,
<a href="https://github.com/lgautier/Rif.jl">R</a>,
<a href="https://github.com/JuliaLang/MATLAB.jl">Matlab</a>. This makes it relatively
easy to use code in any of these languages and I have used several Python
libraries without issues.</li>
<li><a href="http://julialang.org/benchmarks/"><em>high performance</em></a>, I regularly find my
first-attempt Julia code for a problem to be an order of magnitude faster
than the equivalent Matlab code. In fact, I <em>unlearned</em> a number of bad Matlab
programming patterns such as using <code>bsxfun</code> and vectorizing all code. Last
year I wrote Julia code for an <a href="https://en.wikipedia.org/wiki/R-tree">R-tree data
structure</a> to maintain a dynamic spatial
index. Doing this in Matlab/R/Python in a reasonably performant way would be
unthinkable! Instead you have to resort to <a href="https://pypi.python.org/pypi/Rtree/">wrapping native
libraries</a>.
In Julia it was fun to write and it is fast, and I could add the required
methods I needed for my application easily, including fancy filtering
iterators.</li>
<li><em>no separation between user and developer</em>, almost all of the base library
is implemented in Julia itself, and it is easy to find where things are. For
example, do you want to find out how two complex numbers are multiplied in
Julia's base library? Enter <code>methods(*)</code> and <a href="https://github.com/JuliaLang/julia/blob/c8ceeefcc1dc25953a644622a895a3adcbc80dad/base/complex.jl#L112">have a
look!</a>
This transparency makes it easy to learn good Julian style and extends further
to how code is run:
Want to see what machine code is executed when you call the <code>sqrt</code> function on
a single precision float?
Enter <code>code_native(sqrt, (Float32,))</code> and see</li>
</ul>
<div class="highlight"><pre><span></span><span class="na">.text</span>
<span class="nl">Filename:</span> <span class="nf">math.jl</span>
<span class="nf">Source</span> <span class="no">line</span><span class="p">:</span> <span class="mi">132</span>
<span class="nf">push</span> <span class="no">RBP</span>
<span class="nf">mov</span> <span class="no">RBP</span><span class="p">,</span> <span class="no">RSP</span>
<span class="nf">xorps</span> <span class="no">XMM1</span><span class="p">,</span> <span class="no">XMM1</span>
<span class="nf">ucomiss</span> <span class="no">XMM1</span><span class="p">,</span> <span class="no">XMM0</span>
<span class="nf">Source</span> <span class="no">line</span><span class="p">:</span> <span class="mi">132</span>
<span class="nf">ja</span> <span class="mi">6</span>
<span class="nf">sqrtss</span> <span class="no">XMM0</span><span class="p">,</span> <span class="no">XMM0</span>
<span class="nf">pop</span> <span class="no">RBP</span>
<span class="nf">ret</span>
<span class="nf">movabs</span> <span class="no">RAX</span><span class="p">,</span> <span class="mi">140269793784104</span>
<span class="nf">mov</span> <span class="no">RDI</span><span class="p">,</span> <span class="no">QWORD</span> <span class="no">PTR</span> <span class="p">[</span><span class="no">RAX</span><span class="p">]</span>
<span class="nf">movabs</span> <span class="no">RAX</span><span class="p">,</span> <span class="mi">140269778958624</span>
<span class="nf">mov</span> <span class="no">ESI</span><span class="p">,</span> <span class="mi">132</span>
<span class="nf">call</span> <span class="no">RAX</span>
</pre></div>
<p>Almost nothing is hidden from the eyes of the user and this makes it easy and
fun to look into the implementation.</p>
<h2>Weak parts</h2>
<p>Julia, while ready for serious use, is not yet at version 1.0 and lacks
several important features.
In my work, I found the following pieces missing (as of version 0.4).</p>
<ul>
<li><em>Simple single machine parallelism</em>. In C/C++/Fortran this would be
<a href="http://openmp.org/wp/">OpenMP</a> and in Matlab it is <code>parfor</code>.
While Julia does have good support for <a href="http://julia.readthedocs.org/en/latest/manual/parallel-computing/">distributed parallel
computing</a>,
it currently does not have simple single-machine parallelism.
In my experience using the distributed computing abstractions for single
machine parallelism has severe performance overheads because all data is
serialized and remote method invocations are used to execute code.
(Also, I found the use of <code>@everywhere</code> macros cumbersome.)
Apparently simple single-machine parallelism is difficult to implement but is
in the works, as shown in this <a href="https://www.youtube.com/watch?v=GvLhseZ4D8M">recent work by
Intel</a> presented at JuliaCon
2015.</li>
<li><em>Debugger</em>. Quite simply, a debugger is essential for larger projects where
errors can arise that are difficult to understand and debug without being able
to interactively inspect the context in which the error appeared.
Currently Julia has <a href="https://github.com/toivoh/Debug.jl">Debug.jl</a> which
provides debugging at <a href="http://www.gnu.org/software/gdb/">gdb</a> level in terms
of functionality.
But Julia lacks an interactive debugging capability on par with what is
available in Matlab or most C/C++ environments (actually, I am not sure about
<a href="https://wiki.python.org/moin/PythonDebuggingTools">Python debuggers</a> here, is
there a single popular tool?).
As far as I understand, this is planned for the 0.5 version of Julia.</li>
<li><em>Shipping/productization/static-compilation</em>. With this I mean the ability
to select the distribution mechanism for the software, in particular to select
whether all dependencies are included so that the software "will just run" on
the target system, and whether binaries or source code is delivered.
For most researchers and open-source programmers this is not an issue and the
Julia package system caters for all their needs, but I found it relevant in a
company environment because explaining to someone how they install Julia and a
piece of code takes a while, whereas for C++ I can typically easily send an
executable file and some library dependencies.
As far as I understand, static compilation is planned for a future version of
Julia.</li>
</ul>
<h1>Further Reading</h1>
<p>If you want to give Julia a spin, here are a few links:</p>
<ul>
<li><a href="http://julialang.org/downloads/">Official installer</a>, go for the 0.4
release (or the release candidates until the final release is available).</li>
<li><a href="https://github.com/stevengj/julia-mit/">Julia at MIT</a> helpful
getting-started information, including the best way to install it on Windows.</li>
<li><a href="http://tinyurl.com/JuliaLang">Julia cheat sheet</a>, courtesy of Ian
Hellstroem.</li>
<li><a href="https://gist.github.com/gizmaa/7214002">Basic plotting examples</a></li>
<li><a href="https://www.youtube.com/playlist?list=PLP8iPy9hna6Sdx4soiGrSefrmOPdUWixM">JuliaCon 2015 talk recordings</a></li>
</ul>
<p>Packages which I use frequently and can recommend:</p>
<ul>
<li><a href="https://github.com/stevengj/PyPlot.jl">PyPlot.jl</a>, wrapper around
<a href="http://matplotlib.org/">matplotlib</a> for all your plotting needs.</li>
<li><a href="https://github.com/JuliaStats/Distributions.jl">Distributions.jl</a>, common
probability distributions.</li>
<li><a href="https://github.com/JuliaLang/IJulia.jl">IJulia.jl</a> for interactive
browser notebooks.</li>
<li><a href="https://github.com/JuliaStats/DataFrames.jl">DataFrames.jl</a> for processing
tabular data.</li>
<li><a href="https://github.com/goedman/Stan.jl">Stan.jl</a>, an interface for the <a href="http://mc-stan.org/">Stan
probabilistic programming language</a>.</li>
<li><a href="https://github.com/JuliaOpt/JuMP.jl">JuMP.jl</a> a high-performance
mathematical modeling language with excellent solver integration.</li>
<li><a href="https://github.com/JuliaOpt/NLopt.jl">NLopt.jl</a> state-of-the-art
optimization solvers for non-linear minimization (L-BFGS and gradient-free
methods).</li>
<li><a href="https://github.com/JuliaGPU/OpenCL.jl">OpenCL.jl</a> an interface for
<a href="https://www.khronos.org/opencl/">OpenCL</a> similar to PyOpenCL. Not
feature-complete yet (e.g. images are missing).</li>
</ul>
<p>The <a href="http://pkg.julialang.org/pulse.html">Julia package ecosystem</a> has a lot
more packages, so if you are looking for a particular thing, have a look
there.</p>How good are your beliefs? Part 2: The Quiz2015-09-18T21:00:00+02:00Sebastian Nowozintag:www.nowozin.net,2015-09-18:sebastian/blog/how-good-are-your-beliefs-part-2-the-quiz.html<p>This post continues the previous post, <a href="http://www.nowozin.net/sebastian/blog/how-good-are-your-beliefs-part-1-scoring-rules.html">part 1</a> on
scoring rules. However, today we will be more hands on, testing your skill of
making good and well-calibrated predictions.</p>
<p>To this end, I will ask you several questions about numerical quantities and I
would like to hear an answer stated as a belief interval.
First, we consider scoring rules for intervals.</p>
<h2>Interval Scoring Rules</h2>
<p>Often the prediction or elicitation of a full probability distribution is
cumbersome due to the many degrees of freedom a distribution has.</p>
<p>Therefore, in practice we can instead ask our model or users for intervals.
This carries the implicit assumption of unimodal beliefs, which may not be
satisfied in important tasks, but has the advantage of requiring only two
numbers to be elicited.</p>
<p>Given an interval forecast <span class="math">\([L,U]\)</span>, where <span class="math">\(U > L > 0\)</span>, and <span class="math">\(x > 0\)</span> is a
realization, <a href="https://www.csss.washington.edu/~raftery/Research/PDF/Gneiting2007jasa.pdf">(Gneiting and Raftery,
2007)</a>
define the following interval scoring rule for <span class="math">\(\alpha \in (0,\frac{1}{2})\)</span>,</p>
<p>
<div class="math">$$S_{\textrm{int}}(L,U,x,\alpha) = \alpha (U-L) + 1_{\{x < L\}} \, (L-x)
+ 1_{\{x > U\}} \, (x-U).$$</div>
</p>
<p>This is a proper scoring rule for intervals constructed from the sum of two
quantile losses at the <span class="math">\(\alpha\)</span>-quantile and the <span class="math">\((1-\alpha)\)</span>-quantile.
However, it has the problem that if the score is used in different contexts
where the quantities <span class="math">\(x\)</span> are of very different scales, then the resulting
scores also carry this scale and are not comparable.</p>
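<p>As a small illustration (a sketch of mine, not code from the cited paper), this interval score can be written directly as the sum of the two quantile (pinball) losses it is constructed from:</p>

```python
def pinball(q, x, tau):
    """Quantile (pinball) loss for predicting the tau-quantile as q."""
    return (x - q) * tau if x >= q else (q - x) * (1 - tau)

def interval_score(L, U, x, alpha):
    """Sum of the quantile losses at the alpha- and (1-alpha)-quantiles;
    equals alpha*(U - L) plus a unit-weight penalty for missing the interval."""
    return pinball(L, x, alpha) + pinball(U, x, 1 - alpha)

# an interval that covers x costs only alpha times its width
s = interval_score(200.0, 510.0, 300.0, 0.1)
assert abs(s - 0.1 * (510.0 - 200.0)) < 1e-9
```

<p>The scale-dependence is visible here: both the width term and the miss penalties are measured in the units of <code>x</code>.</p>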
<p>To achieve a scale-free interval scoring rule, we propose the following
<em>scale-free interval scoring rule</em>.</p>
<p>
<div class="math">$$S_{\textrm{sf}}(L,U,x,\alpha) = \alpha \log(U/L)
+ 1_{\{x < L\}} \log(L/x) + 1_{\{x > U\}} \log(x/U).$$</div>
</p>
<p>The rule is negatively oriented, thus acting as a loss function.
This scoring rule is <em>proper</em> and is minimized in expectation over <span class="math">\(X\)</span>
if we set <span class="math">\(L = F^{-1}(\alpha)\)</span> and <span class="math">\(U = F^{-1}(1-\alpha)\)</span> where <span class="math">\(F\)</span> is the
cumulative distribution function of <span class="math">\(X\)</span> so that <span class="math">\(L\)</span> and <span class="math">\(U\)</span> become the
<span class="math">\(\alpha\)</span>-quantile and the <span class="math">\((1-\alpha)\)</span>-quantile. (You can find a short proof
that this is a proper scoring rule in an appendix to this article.)</p>
<h2>Quiz</h2>
<p>The following quiz tests your ability to make well-calibrated but uncertain
assessments.
(This also means that the quiz becomes somewhat pointless if you resort to
Google or Wikipedia searches.)
The quiz contains twelve items, and each item asks for a number, assuming there
is a single true answer. Please pay attention to the units being asked for.
Your knowledge regarding the different items is likely quite variable and for
some questions you may have a good idea (your beliefs are concentrated),
whereas for some other questions you may be more uncertain.</p>
<p>Because of this uncertainty the quiz does not ask you for your best guess but
instead asks for an interval in an attempt to elicit your subjective beliefs.
The lower number should be chosen such that you
consider it 10 percent likely that the truth is below this number. The upper
number is a 90 percent quantile and should be chosen such that there is a 10
percent chance that the truth is above this number.</p>
<p>For example, say the question is "Maximum horsepower of a 2015 Audi R8 car
(horsepower)". Given my limited knowledge of cars I know that the Audi R8 is
likely a quite powerful car so I would provide maybe an interval of 200 to
510. How I arrive at this is up to me, for example, I may consider that a car
manufacturer may want to break the magic "500 horsepower" mark for marketing
purposes. Fixing this interval, the truth is revealed. The truth is 570
horsepower, and the above scale-free interval loss would be 0.205.</p>
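<p>The score for this example can be reproduced with a few lines of Python (an illustrative sketch, with <span class="math">\(\alpha = 0.1\)</span> matching the 10%/90% quantiles used in the quiz):</p>

```python
import math

def scale_free_interval_score(L, U, x, alpha):
    """Scale-free interval score from the post: negatively oriented
    (lower is better), defined for positive L < U and positive truth x."""
    score = alpha * math.log(U / L)       # width term in log space
    if x < L:
        score += math.log(L / x)          # penalty for overshooting below
    elif x > U:
        score += math.log(x / U)          # penalty for undershooting above
    return score

# The Audi R8 example: interval [200, 510], truth 570, alpha = 0.1
s = scale_free_interval_score(200.0, 510.0, 570.0, 0.1)
print(round(s, 3))  # 0.205
```

<p>Because everything is computed from ratios, multiplying <code>L</code>, <code>U</code>, and <code>x</code> by a common constant leaves the score unchanged, which is exactly the scale-freeness the post argues for.</p>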
<p>For the interval score a <em>lower score is better</em>, that is, the score is
negatively-oriented and behaves like a loss function.
Here is an illustration of different intervals and their scores for the
example. I plot the true value 570 as a solid green vertical line, and the
intervals are green if they cover the truth and red otherwise. The score is
shown next to each interval.</p>
<p><a href="http://www.nowozin.net/sebastian/blog/images/interval-rule.svg"><img alt="Example interval predictions" src="http://www.nowozin.net/sebastian/blog/images/interval-rule.png" /></a></p>
<p>Have fun, and feel free to comment or suggest new questions/answer in the
comment field.</p>
<style>
#calibform label { width:auto; }
#calibform label.error,#calibform input.submit { margin-left:253px; }
table.calibration {
table-layout: auto;
border-collapse: collapse;
width: 660px;
}
#calibrationform form { border: none; }
form.calibration tr, td { padding:7px; border:0px; }
tr.row0 { background-color:#cccccc; border:0px; }
tr.row1 { background-color:#909090; border:0px; }
tr.correctRow { background-color:#90ff90; border:0px; }
tr.incorrectRow { background-color:#ff9090; border:0px; }
.row-question { width: 400px; }
.row-q10 { width: 80px; }
.row-q90 { width: 80px; }
.row-score { width: 40px; }
</style>
<iframe name="hidden_iframe" id="hidden_iframe" style="display:none;"></iframe>
<div id="calibformdiv"></div>
<div id="calibresultdiv"></div>
<script src="http://code.jquery.com/jquery-1.11.1.min.js"></script>
<script src="http://jqueryvalidation.org/files/dist/jquery.validate.min.js"></script>
<!--<script src="http://jqueryvalidation.org/files/dist/additional-methods.min.js"></script>-->
<script>
$.validator.addMethod("positive", function(value, element) {
return this.optional(element) || parseFloat(value) > 0.0;
}, "Please enter a positive scalar");
$.validator.addMethod("largerThan", function(value, element, param) {
var target = $( param );
if (this.settings.onfocusout) {
target.off( ".validate-equalTo" ).on( "blur.validate-equalTo",
function() { $( element ).valid(); });
}
return parseFloat(value) > parseFloat(target.val());
}, "90% quantile must be strictly larger than 10% quantile");
/* List of question/answer pairs */
var calibquestions = [
{
"Q": "Height of the Mount Everest above sea level (meters)",
"A": 8850.0
},
{
"Q": "Year the first album of The Beatles was released (four digit year)",
"A": 1963.0
},
{
"Q": "World record distance achieved in the longest human throw of an object without any velocity-aiding features (meters)",
"A": 406.3
},
{
"Q": "Most recent (2013) scientific estimate for the number of cells in an average adult human body (in trillions, 10 to the power of 12)",
"A": 37.2
},
{
"Q": "Gold price on the 7th September 1977 (US dollar per kg)",
"A": 4747.06
},
{
"Q": "Birth year of Sir Isaac Newton (four digit year)",
"A": 1643.0
},
{
"Q": "Scientific estimate for the average number of eggs an industrial hen lays during its lifetime (number of eggs)",
"A": 530.0
},
{
"Q": "Current scientific estimate (2013) of the age of the universe (billion years)",
"A": 13.82
},
{
"Q": "Mass of a Bee hummingbird, the lightest known bird (grams)",
"A": 2.0
},
{
"Q": "Guinness world record for the longest kiss as of 2015, rounded to the nearest hour (hours)",
"A": 59.0
},
{
"Q": "World population in 2050 as projected by the UN (as of 2015, in billions)",
"A": 9.7
},
{
"Q": "Number of officially confirmed moons in our solar system (as of 2015, count)",
"A": 146
}
];
// Scale-free interval scoring function (proper)
// This function is a special case of the general family proposed in
// (Gneiting and Raftery, 2007).
function score_sfi(alpha, L, U, x) {
var res = - alpha * Math.log(U/L);  // negated width penalty: narrow intervals score better
if (x < L) {
res += Math.log(x/L);  // truth fell below the interval
} else if (x > U) {
res += Math.log(U/x);  // truth fell above the interval
}
return -res;  // negate so that lower scores are better (loss orientation)
}
$.validator.setDefaults({
submitHandler: function() {
// Disable submit button
$("input#submitbutton").prop('disabled', true);
$("th#Qmsg").append("Score");
// 1. Compute scoring function
var avg_score = 0.0;
for (var i = 0; i < calibquestions.length; ++i) {
var L = parseFloat($('input#Qlow'+i).val());
var U = parseFloat($('input#Qhigh'+i).val());
var x = calibquestions[i].A;
var score = score_sfi(0.1, L, U, x);
avg_score += score;
$("tr#RowQ"+i).removeClass(i % 2 == 0 ? "row0" : "row1");
if (x >= L && x <= U) {
$("tr#RowQ"+i).addClass("correctRow");
} else {
$("tr#RowQ"+i).addClass("incorrectRow");
}
$('td#Qmsg'+i).append(" " + score.toFixed(4));
}
avg_score /= calibquestions.length;
$("#calibresultdiv").append($("<p/>",
{ text: "Average score: " + avg_score.toFixed(4) }));
return true;
}
});
function randperm(n) {
var rsort = new Array(n);
for (var i = 0; i < rsort.length; ++i) {
rsort[i] = i;
}
for (var i = 0; i < rsort.length; ++i) {
var si = i + Math.floor(Math.random() * (rsort.length - i));
var tmp = rsort[i];
rsort[i] = rsort[si];
rsort[si] = tmp;
}
return rsort;
}
$().ready(function() {
var form = $("<form/>", {
name: "calibrationform",
action: "https://script.google.com/macros/s/AKfycbzK161Q-fZCqRZzOtpO03-qJCWhYaBheNtR2Vt7iWd_Xt6R_rg/exec",
method: "get",
target: "hidden_iframe"
});
var fieldset = $("<fieldset/>");
var rulelist = {};
var tab = $("<table/>", { class: "calibration" });
var thead = $("<thead/>");
// note: jQuery's append() returns the parent, so build the row first
var th_row = $("<tr/>");
th_row.append($("<th/>", { class: "row-question" }).append("Question"));
th_row.append($("<th/>", { class: "row-q10" }).append("10% quantile"));
th_row.append($("<th/>", { class: "row-q90" }).append("90% quantile"));
th_row.append($("<th/>", { id: "Qmsg" }));
thead.append(th_row);
tab.append(thead);
var tbody = $("<tbody/>");
tab.append(tbody);
var P = randperm(calibquestions.length);
for (var i1 = 0; i1 < calibquestions.length; ++i1) {
var i = P[i1];
var trow = $("<tr/>", {
class: i1 % 2 == 0 ? "row0" : "row1",
id: "RowQ"+i });
trow.append($("<td/>").append($("<label/>",
{ class: "row-question", for: "Qlow"+i, text: calibquestions[i].Q })));
trow.append($("<td/>").append($("<input/>",
{ class: "row-q10", type: "text", name: "Qlow"+i, id: "Qlow"+i })));
trow.append($("<td/>").append($("<input/>",
{ class: "row-q90", type: "text", name: "Qhigh"+i, id: "Qhigh"+i })));
trow.append($("<td/>", { class: "row-score", id: "Qmsg"+i }));
tbody.append(trow);
rulelist["Qlow"+i] = { required: true, number: true, positive: true };
rulelist["Qhigh"+i] = { required: true, number: true, positive: true,
largerThan: "#Qlow"+i };
}
fieldset.append(tab);
fieldset.append($("<p/>", { text: "Optional comment" }).append(
$("<input/>", { type: "text", name: "Comment", id: "Comment" })));
fieldset.append($("<input/>",
{ id: "submitbutton", type: "submit", value: "Check answers" }));
form.append(fieldset);
$("#calibformdiv").append(form);
form.validate({
errorPlacement: function(error, element) {
//error.appendTo(element.parent().next());
error.appendTo(element.parent());
},
highlight: function(element, errorClass) {
$(element).addClass(errorClass).parent().prev().children("select").addClass(errorClass);
},
rules: rulelist
});
});
</script>
<p>Based on my informal testing with a few volunteers, for the above questionnaire
the following seems like a reasonable subjective scale for the average score:</p>
<ul>
<li><span class="math">\(0\)</span> to <span class="math">\(0.1\)</span>, expert</li>
<li><span class="math">\(0.1\)</span> to <span class="math">\(0.2\)</span>, proficient</li>
<li><span class="math">\(0.2\)</span> to <span class="math">\(0.5\)</span>, good</li>
<li><span class="math">\(0.5\)</span> to <span class="math">\(1.0\)</span>, medium</li>
<li>above <span class="math">\(1\)</span>, fair</li>
</ul>
<p>As for calibration, you should ideally have around eight to ten out of the
twelve questions showing as green, because the quantile range should have 80
percent coverage.
(Most people who do not work with probability on a regular basis will have
lower coverage because of overconfidence.)</p>
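<p>As a small sanity check on the "eight to ten" range (a sketch of my own, assuming the twelve questions are independent and each interval covers the truth with probability 0.8 under perfect calibration), the chance of seeing between eight and ten green rows follows from the binomial distribution:</p>

```javascript
// Binomial probability mass function, with C(n, k) computed iteratively
function binomialPmf(n, k, p) {
  var c = 1;
  for (var i = 0; i < k; ++i) { c = c * (n - i) / (i + 1); }  // C(n, k)
  return c * Math.pow(p, k) * Math.pow(1 - p, n - k);
}

// Probability that 8, 9 or 10 of the 12 intervals cover the truth
var pr = 0;
for (var k = 8; k <= 10; ++k) { pr += binomialPmf(12, k, 0.8); }
console.log(pr.toFixed(2)); // "0.65"
```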
<p><em>Acknowledgements</em>. I thank <a href="http://johnwinn.org/"><em>John Winn</em></a> for the
original calibration experiment he conducted in 2014 which inspired this
article, <a href="https://www.stat.washington.edu/tilmann/"><em>Tilmann Gneiting</em></a> for
commenting on the scale-free quantile score, <a href="http://files.is.tue.mpg.de/pgehler/"><em>Peter
Gehler</em></a> for feedback and providing
further questions,
<a href="http://www.ong-home.my/"><em>Cheng Soon Ong</em></a> for comments that improved clarity
of the article,
<a href="http://research.microsoft.com/en-us/people/iankash/"><em>Ian Kash</em></a> for
explaining scoring rules, <a href="http://cdann.net/"><em>Christoph Dann</em></a> and <em>Juan Gao</em>
for feedback on the questionnaire.</p>
<h3>Appendix: Propriety of the Scale-free Interval Scoring Rule</h3>
<p>The following is a proof that the scale-free interval scoring rule is proper.
We will use the result from <a href="https://www.csss.washington.edu/~raftery/Research/PDF/Gneiting2007jasa.pdf">(Gneiting and Raftery,
2007)</a> and show that our scoring rule is a special case.</p>
<p>First, consider the general form of a scoring rule for an <span class="math">\(\alpha\)</span>-quantile
from <em>Theorem 6</em> in Gneiting and Raftery; for a choice <span class="math">\(r\)</span> and realization <span class="math">\(x\)</span>
this takes the form</p>
<p>
<div class="math">\begin{equation}
S(r,x,\alpha) = \alpha s(r) + (s(x) - s(r)) \, 1_{\{x \leq r\}} + h(x).
\label{eqn:Squantile}
\end{equation}</div>
</p>
<p>Gneiting and Raftery show that for any nondecreasing function <span class="math">\(s\)</span> and an
arbitrary function <span class="math">\(h\)</span> this yields a proper scoring rule for quantiles.
In fact, it is known that any proper scoring rule for quantiles has to be of the
form <span class="math">\((\ref{eqn:Squantile})\)</span>, see Theorem 3.3 in <a href="http://arxiv.org/abs/0912.0902">(Gneiting,
2009)</a>.
In the Gneiting and Raftery JASA paper the authors propose the choices
<span class="math">\(s(y)=y\)</span> and <span class="math">\(h(y)=-\alpha y\)</span>.
But here, in order to achieve a scale-free rule we propose to use</p>
<p>
<div class="math">$$s(y) = \log y,\qquad h(y) = -\alpha \log y.$$</div>
</p>
<p>We obtain the specialization of <span class="math">\((\ref{eqn:Squantile})\)</span> as</p>
<p>
<div class="math">\begin{eqnarray}
S_{\textrm{q}}(r,x,\alpha)
& = & \alpha \log r + (\log x - \log r) \,1_{\{x \leq r\}} - \alpha \log x\nonumber\\
& = & \alpha \log (r/x) + 1_{\{x \leq r\}} \, \log (x/r).\label{eqn:qscore}
\end{eqnarray}</div>
</p>
<p>Because <span class="math">\(s\)</span> is a non-decreasing function this is a proper scoring rule for
quantiles.
This quantile loss looks as follows (compare to the check loss figure
earlier), for different quantiles (<span class="math">\(x=5\)</span> is the sample realization, and the
horizontal axis denotes our quantile estimate).</p>
<p><img alt="Scale-free quantile loss" src="http://www.nowozin.net/sebastian/blog/images/scoringrules-quantile-rule-scalefree.svg" /></p>
<p>The expected risk plot has a different shape compared to the check loss that
we have seen earlier, but note that the minimizer again corresponds to the
right quantiles of the <span class="math">\(N(5,1)\)</span> belief distribution.</p>
<p><img alt="Integrated risk under the scale-free quantile loss" src="http://www.nowozin.net/sebastian/blog/images/scoringrules-quantile-rule-scalefree-example.svg" /></p>
<p>By <em>Corollary 1</em> in (Gneiting and Raftery, 2007), a sum of proper
quantile scoring rules is again a proper scoring rule. To obtain a scoring
rule for intervals we combine the rules for the <span class="math">\(\alpha\)</span>-quantile and the <span class="math">\((1-\alpha)\)</span>-quantile
to obtain (after some rewriting of terms)</p>
<p>
<div class="math">\begin{eqnarray}
S_{\textrm{sf}}(L,U,x,\alpha)
& = & -S_{\textrm{q}}(L,x,\alpha) - S_{\textrm{q}}(U,x,1-\alpha)\nonumber\\
& = & \alpha \log(U/L) + 1_{\{x < L\}} \log(L/x) + 1_{\{x > U\}} \log(x/U).\nonumber
\end{eqnarray}</div>
</p>
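<p>As a numeric cross-check of this decomposition, the following sketch (helper names are my own) verifies that the scale-free interval score equals minus the sum of the two quantile scores, both for a realization outside and inside the interval:</p>

```javascript
// Quantile score S_q(r, x, alpha) = alpha*log(r/x) + 1{x <= r}*log(x/r)
function qScore(r, x, alpha) {
  return alpha * Math.log(r / x) + (x <= r ? Math.log(x / r) : 0);
}

// Scale-free interval score S_sf(L, U, x, alpha)
function sfScore(L, U, x, alpha) {
  return alpha * Math.log(U / L)
    + (x < L ? Math.log(L / x) : 0)
    + (x > U ? Math.log(x / U) : 0);
}

// Check S_sf = -S_q(L, x, alpha) - S_q(U, x, 1 - alpha)
var a = 0.1;
[570, 300].forEach(function (x) {
  var diff = sfScore(200, 510, x, a) + qScore(200, x, a) + qScore(510, x, 1 - a);
  console.log(Math.abs(diff) < 1e-12); // true
});
```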
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = '//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: 'center'," +
" displayIndent: '0em'," +
" showMathMenu: true," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" +
" } " +
"}); ";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>How good are your beliefs? Part 1: Scoring Rules2015-09-04T22:00:00+02:00Sebastian Nowozintag:www.nowozin.net,2015-09-04:sebastian/blog/how-good-are-your-beliefs-part-1-scoring-rules.html<p>This article is the first of two on <em>proper scoring rules</em>,
a specific type of loss function defined on probability distributions or
functions of probability distributions.</p>
<p>If this article sparks your interest, I recommend the gentle introduction to
scoring rules in the context of decision theory in Chapter 10 of <a href="http://eu.wiley.com/WileyCDA/WileyTitle/productCd-047149657X.html">Parmigiani
and Inoue's "Decision Theory"
book</a>,
which is a great book to have on your data science bookshelf in any case and
it deservedly won the <a href="https://bayesian.org/awards/DeGrootPrize.html">DeGroot
prize</a> in 2009.</p>
<h2>Scoring Rules</h2>
<p>Consider the following forecasting setting.
Given a set of possible outcomes <span class="math">\(\mathcal{X}\)</span> and a class of probability
measures <span class="math">\(\mathcal{P}\)</span> defined on a suitably constructed <span class="math">\(\sigma\)</span>-algebra,
we consider a <em>forecaster</em> which makes a forecast in the form of a probability
distribution <span class="math">\(P \in \mathcal{P}\)</span>.
After the forecast is fixed, a realization <span class="math">\(x \in \mathcal{X}\)</span> is revealed and
we would like to assess the quality of the prediction made by the forecaster.</p>
<p>A <em>scoring rule</em> is a function <span class="math">\(S\)</span> such that <span class="math">\(S(P,x)\)</span> is taken to mean the
<em>quality</em> of the forecast. Hence the function has the form
<span class="math">\(S: \mathcal{P} \times \mathcal{X} \to \mathbb{R} \cup \{-\infty,\infty\}\)</span>.
There are two variants popular in the literature: the <em>positively-oriented</em>
scoring rules assign higher values to better forecasts, the
<em>negatively-oriented</em> scoring rules behave like loss functions, taking smaller
values for better forecasts.</p>
<p>A <em>proper</em> scoring rule has desirable behaviour, to be made precise shortly.
Let us first think what could be desirable in a scoring rule. Intuitively we
would like to make "cheating" difficult, that is, if we really subjectively
believe in <span class="math">\(P\)</span>, we should have no incentive to report any deviation from <span class="math">\(P\)</span>
in order to achieve a better score.
Formally, we first define the <em>expected score</em> under distribution <span class="math">\(Q\)</span>,</p>
<p>
<div class="math">$$S(P,Q) = \mathbb{E}_{x \sim Q}[S(P,x)].$$</div>
</p>
<p>Thus, if <span class="math">\(P \in \mathcal{P}\)</span> is our true belief, reporting
any other <span class="math">\(Q\)</span> should not do better in expectation; that is,
we demand that (for negatively-oriented scores)</p>
<p>
<div class="math">$$S(P,P) \leq S(Q,P),\qquad \forall P,Q \in \mathcal{P}.$$</div>
</p>
<p>For <em>strictly proper</em> scoring rules the above inequality holds strictly except
for <span class="math">\(Q=P\)</span>.
For a proper scoring rule this inequality means that, in expectation, the
lowest possible score is achieved by faithfully reporting our true
beliefs. Therefore, a rational forecaster who aims to minimize the expected
score (loss) will report his true beliefs.</p>
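<p>A tiny numeric illustration of propriety (a sketch of my own, not part of the formal development): for a Bernoulli belief with <span class="math">\(q = 0.7\)</span> and the log-probability score introduced later, a grid search over reported probabilities shows the expected score is minimized by reporting the belief itself.</p>

```javascript
// Expected (negatively-oriented) log score when the belief is
// Bernoulli(q) and we report Bernoulli(p)
function expectedLogLoss(q, p) {
  return -(q * Math.log(p) + (1 - q) * Math.log(1 - p));
}

var q = 0.7, best = 0, bestLoss = Infinity;
for (var p = 0.01; p < 1.0; p += 0.01) {  // grid over possible reports
  var loss = expectedLogLoss(q, p);
  if (loss < bestLoss) { bestLoss = loss; best = p; }
}
console.log(best.toFixed(2)); // "0.70": honest reporting is optimal
```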
<p>Key uses of scoring rules are:</p>
<ul>
<li>Evaluating the predictive performance of a model;</li>
<li>Eliciting probabilities;</li>
<li>Using them for parameter estimation.</li>
</ul>
<p>Let us look briefly at the different uses.</p>
<h3>Model Evaluation</h3>
<p>For <em>assessing the model performance</em>, we simply use the scoring rule as a loss
function and measure the predictive performance on a holdout data set.</p>
<h3>Probability Elicitation</h3>
<p>For <em>probability elicitation</em> we can use a scoring rule as follows: we ask a
user to make predictions and we tell him that we will reward him
proportionally to the value achieved by the scoring rule once the prediction
can be scored. Assuming that the user is <em>rational</em> and aims to maximize his
reward, if we use a proper scoring rule, then he can maximize his expected
reward by making predictions according to the true beliefs he holds.
However, while the existence of a strictly proper scoring rule roughly means
that elicitation of a quantity is possible, more efficient methods for
probability elicitation may exist. In fact, <a href="http://www2.warwick.ac.uk/fac/sci/statistics/staff/academic-research/french/">Simon
French</a> and <a href="http://www.davidriosinsua.org/">David Rios Insua</a>
argue in their book <a href="http://eu.wiley.com/WileyCDA/WileyTitle/productCd-0470711051.html">Statistical Decision
Theory</a>,
page 76, that</p>
<blockquote>
<p>"de Finetti (1974; 1975) and others have championed the use of <em>scoring
rules</em> to elicit probabilities of events. ...
Scoring rules are important in de Finetti's development of subjective
probability, but it is not clear that they have a practical use in
statistical or decision analysis. ...
Scoring rules could provide a very expensive method of eliciting
probabilities. In training probability assessors, however, they can have a
practical use."</p>
</blockquote>
<p>If you wonder what more efficient alternatives French and Insua have in mind,
they do propose several methods to elicit probabilities, such as an idealized
"probability wheel" the user can configure and spin, and a sequence of
proposed gambles in order to find a fair value accepted by the user.</p>
<p>In general it seems to me (as an outsider of this field), that probability
elicitation is as much about theoretically sound methods as it is about human
psychology and biases, and how to avoid them. The human aspect of probability
elicitation is discussed in the <a href="http://www.rff.org/people/profile/roger-m-cooke">Roger
Cooke</a>'s <a href="https://books.google.com/books?isbn=0195362373">book-length
monograph</a> on the topic, and
the recent study of <a href="http://journal.sjdm.org/13/131029/jdm131029.pdf">(Goldstein and Rothschild, "Lay understanding of
probability distributions",
2014)</a> (thanks to Ian Kash
for pointing me to this study!).</p>
<h3>Estimation</h3>
<p>For <em>parameter estimation</em> we perform empirical risk minimization on a
probabilistic model using the scoring rule as a loss function, an approach
dating back to <a href="http://link.springer.com/article/10.1007/BF02613654">(Pfanzagl,
1969)</a>. This is a
special case of <a href="https://en.wikipedia.org/wiki/M-estimator">M-estimation</a> but
generalizes maximum likelihood estimation (MLE), where the log-probability
scoring rule is used.</p>
<p>If the model class contains the true generating model this yields a
<a href="https://en.wikipedia.org/wiki/Consistent_estimator"><em>consistent estimator</em></a>
but for misspecified models it can yield answers different from the MLE, and
these answers may be preferable: for example, if model assumptions are
violated and, for any choice of parameter, the model places low
density on some observations, then these observations tend to influence the
MLE severely because the log-probability scoring rule assigns them a large penalty.
Using a suitable scoring rule cannot prevent misspecification, of course, but
it can make the consequences less severe.</p>
<p>It should also be said that for estimation problems the log-probability scoring rule
is the most principled, in that it is the only one that can be justified from
the <a href="https://projecteuclid.org/euclid.lnms/1215466210#toc">likelihood principle</a>.</p>
<h2>Scoring Rule Examples</h2>
<p>Here are a few examples of common and not so common scoring rules both for
discrete and continuous outcomes.</p>
<h3>Scoring Rule Example: Brier Score</h3>
<p>This scoring rule was historically the first, proposed by
<a href="http://imsc.pacificclimate.org/awards_brier.shtml">Glenn Wilson Brier</a>
(1913-1998) in his seminal work
<a href="http://docs.lib.noaa.gov/rescue/mwr/078/mwr-078-01-0001.pdf">(Brier, "Verification of Forecasts Expressed in Terms of Probability",
1950)</a>
as a means to verify weather forecasts.</p>
<p>Given a discrete outcome set <span class="math">\(\{1,2,\dots,K\}\)</span> the forecaster specifies a
distribution <span class="math">\(P=(p_1,\dots,p_K)\)</span> with <span class="math">\(p_i \geq 0\)</span> and <span class="math">\(\sum_i p_i = 1\)</span>.
Then, when an outcome <span class="math">\(j\)</span> is realized we score the forecaster according to
the <em>Brier score</em>,</p>
<p>
<div class="math">$$S_B(P,j) = \sum_{i=1}^K (1_{\{i=j\}} - p_i)^2.$$</div>
</p>
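<p>A direct JavaScript transcription of the formula above (a sketch; <code>p</code> is the forecast vector and <code>j</code> the index of the realized outcome):</p>

```javascript
function brierScore(p, j) {
  var s = 0.0;
  for (var i = 0; i < p.length; ++i) {
    var y = (i === j) ? 1.0 : 0.0;  // one-hot encoding of the outcome
    s += (y - p[i]) * (y - p[i]);
  }
  return s;
}

// A confident correct forecast scores low, a confident wrong one high:
console.log(brierScore([0.9, 0.05, 0.05], 0).toFixed(3)); // "0.015"
console.log(brierScore([0.9, 0.05, 0.05], 1).toFixed(3)); // "1.715"
```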
<p>The Brier score is extensively discussed in <a href="http://www.dtic.mil/cgi-bin/GetTRDoc?AD=ADA121924">(DeGroot and Fienberg,
1983)</a> and they show that
it can be decomposed into two terms measuring <em>calibration</em> and <em>refinement</em>,
respectively. Here, <em>refinement</em> measures the information
available to discriminate between different outcomes that is contained in the
prediction.</p>
<p>For the case with binary classes, the definitive work is <a href="http://www-stat.wharton.upenn.edu/~buja/PAPERS/paper-proper-scoring.pdf">(Buja, Stuetzle, Shen,
2005)</a>
in which a class of scoring rules is proposed based on the Beta distribution
which generalizes both the Brier score and the log-probability score.</p>
<h3>Scoring Rule Example: Log-Probability</h3>
<p>The most common scoring rule in estimation problems is the log-probability,
also known as the log-loss in machine learning.
Maximum likelihood estimation can be seen as optimizing the log-probability
scoring rule.</p>
<p>For the discrete outcome case it is given simply by</p>
<p>
<div class="math">$$S_{\textrm{log}}(P,i) = -\log p_i.$$</div>
</p>
<p>If <span class="math">\(p_i = 0\)</span> the score <span class="math">\(S_{\textrm{log}}(P,i) = \infty\)</span>.
The log-probability is a proper scoring rule, but what really distinguishes it
is that it is <em>local</em>: when outcome <span class="math">\(j\)</span> is realized, only the predicted
value <span class="math">\(p_j\)</span> is used to compute the score.
Intuitively this is a desirable property because if <span class="math">\(j\)</span> happens, why should we
care about the precise distribution of probability mass for the other events?</p>
<p>It turns out that this <em>local</em> property is unique to the log-probability
scoring rule. (For the result and proof see Theorem 10.1 in Parmigiani and
Inoue's book.)</p>
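<p>The locality property is easy to see in code: two forecasts that agree on the realized outcome receive the same log-probability score, no matter how they spread mass over the remaining outcomes (a small sketch with a hypothetical <code>logScore</code> helper):</p>

```javascript
function logScore(p, i) {
  // negatively-oriented log-probability score; infinite when p_i = 0
  return p[i] > 0 ? -Math.log(p[i]) : Infinity;
}

// Same mass on outcome 0, different mass elsewhere: identical scores.
// (The Brier score, by contrast, would distinguish these two forecasts.)
console.log(logScore([0.5, 0.3, 0.2], 0) === logScore([0.5, 0.1, 0.4], 0)); // true
```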
<h3>Scoring Rule Example: Energy Statistic</h3>
<p>This scoring rule is for predicting a distribution in <span class="math">\(\mathbb{R}^d\)</span> and is
defined for <span class="math">\(\beta \in (0,2)\)</span>, realization <span class="math">\(x \in \mathbb{R}^d\)</span>, and distribution <span class="math">\(P\)</span> on <span class="math">\(\mathbb{R}^d\)</span> as</p>
<p>
<div class="math">$$S_E(P,x) = \mathbb{E}_{X \sim P}[\|X-x\|^\beta] - \frac{1}{2} \mathbb{E}_{X,X' \sim P}[\|X-X'\|^\beta].$$</div>
</p>
<p>This score has an intuitive interpretation: the score is the expected distance
to the realization minus half the expected pairwise sample distance.
Let us think about a few cases: if <span class="math">\(P\)</span> is a point mass, then the first term is
just the distance to the realization and the second term is zero; in
particular for <span class="math">\(\beta \to 2\)</span> the score recovers the squared Euclidean norm
loss.
The original definition is from <a href="https://www.csss.washington.edu/~raftery/Research/PDF/Gneiting2007jasa.pdf">(Gneiting and Raftery,
2007)</a> except for the sign change, but is based on
Szekely's <a href="http://personal.bgsu.edu/~mrizzo/energy.htm">energy statistic</a>
which also independently found its way into machine learning through the
<a href="http://kyb.mpg.de/fileadmin/user_upload/files/publications/attachments/indepHS140_3437%5B0%5D.pdf">Hilbert-Schmidt independence
criterion</a>.</p>
<p>For <span class="math">\(\beta \in (0,2)\)</span> the energy score is a strictly proper scoring function
for all Borel measures with finite moment <span class="math">\(\mathbb{E}_P[\|X\|^\beta]
< \infty\)</span>.</p>
<p>Here is a visualization, where <span class="math">\(P = \mathcal{N}([0,0]^T, \textrm{diag}([1/2,
5/2]))\)</span> is given by the 10k samples and the red marker corresponds to the
realization <span class="math">\(x\)</span>. Here we have <span class="math">\(\beta=1\)</span>. We can see that the Euclidean
nature of the scoring rule seems to dominate the anisotropic distribution <span class="math">\(P\)</span>,
that is, a realization that is unlikely under our belief distribution
(leftmost plot) achieves a lower score than a sample with higher density
(second leftmost plot).</p>
<p><img alt="Energy score for beta equal to
one" src="http://www.nowozin.net/sebastian/blog/images/scoringrules-energyscore-beta10-80dpi.png" /></p>
<p>As a practical matter, the energy score is simple to evaluate even when you
have only predictive Monte Carlo realizations of your model, compared to the
log-probability rule which requires the normalizer of the predictive
distribution.</p>
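<p>To make that point concrete, here is a minimal sketch of a Monte Carlo estimate of the energy score with <span class="math">\(\beta=1\)</span> from predictive samples alone, no density or normalizer needed (the function names are my own):</p>

```javascript
// Euclidean distance between two points in R^d
function dist(a, b) {
  var s = 0.0;
  for (var k = 0; k < a.length; ++k) { s += (a[k] - b[k]) * (a[k] - b[k]); }
  return Math.sqrt(s);
}

// Energy score estimate (beta = 1): mean distance of the samples to the
// realization x, minus half the mean pairwise distance between samples.
function energyScore(samples, x) {
  var n = samples.length, t1 = 0.0, t2 = 0.0;
  for (var i = 0; i < n; ++i) {
    t1 += dist(samples[i], x);
    for (var j = 0; j < n; ++j) { t2 += dist(samples[i], samples[j]); }
  }
  return t1 / n - 0.5 * t2 / (n * n);
}

// Point-mass belief at the origin: the score is just the distance to x
console.log(energyScore([[0, 0], [0, 0]], [3, 4])); // 5
```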
<h3>Scoring Rule: Check Loss</h3>
<p>The <em>check loss</em>, also known as <em>quantile loss</em> or <em>tick loss</em>, is a loss
function used for <a href="https://en.wikipedia.org/wiki/Quantile_regression">quantile
regression</a>, where we would
like to learn a model that directly predicts a
<a href="https://en.wikipedia.org/wiki/Quantile">quantile</a> of a distribution, but we
are given only samples of the distribution at training time.</p>
<p>This scoring rule is somewhat different in that a specific property of a
belief distribution is scored, namely the quantile of the distribution.
Being <em>proper</em> here means that the lowest expected loss is achieved by
predicting the corresponding quantile of your belief.
(Interestingly proper scoring rules exist only for some functions of the
distribution, see <a href="http://arxiv.org/abs/0912.0902">(Gneiting, 2009)</a>.)</p>
<p>You may know a special case of the check loss already:
when using an absolute value loss, your expected risk is minimized by taking
the median of your belief distribution, that is, the <span class="math">\(\frac{1}{2}\)</span>-quantile.
The <em>check loss</em> generalizes this to a richer family of loss functions such
that the expected minimizer corresponds to arbitrary quantiles, not just the
median.
Thus, instead of scoring an entire belief distribution <span class="math">\(P\)</span> we only score its
quantile statistics.</p>
<p>The check loss is defined as</p>
<p>
<div class="math">$$S_{\textrm{c}}(r,x,\alpha) = (x-r) (1_{\{x \leq r\}} - \alpha),$$</div>
</p>
<p>where <span class="math">\(r\)</span> is our predicted <span class="math">\(\alpha\)</span>-quantile and <span class="math">\(x \sim Q\)</span> is a sample from
the true unknown distribution <span class="math">\(Q\)</span>.</p>
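<p>A minimal sketch in JavaScript (my own helper names), using the negated score <span class="math">\(-S_c\)</span> as a nonnegative loss, as in the expected risk defined below: minimizing the average loss over samples recovers the empirical quantile.</p>

```javascript
// Negated check score: a nonnegative loss minimized at the alpha-quantile
function checkLoss(r, x, alpha) {
  return -(x - r) * ((x <= r ? 1 : 0) - alpha);
}

// Grid search for the minimizer of the average loss over a small sample;
// for alpha = 0.5 this recovers the empirical median.
var data = [1, 2, 3, 4, 5, 6, 7, 8, 9];
var best = 0, bestLoss = Infinity;
for (var r = 0; r <= 10; r += 0.1) {
  var total = 0.0;
  for (var i = 0; i < data.length; ++i) { total += checkLoss(r, data[i], 0.5); }
  if (total < bestLoss) { bestLoss = total; best = r; }
}
console.log(best.toFixed(1)); // "5.0", the median
```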
<p>Plotting this loss explains the name <em>check loss</em> and <em>tick loss</em>, because it
looks like two tilted lines.
I show it for a sample realization of <span class="math">\(x=5\)</span>, and the horizontal axis denotes
the quantile estimate.</p>
<p><img alt="Check loss, a popular quantile loss" src="http://www.nowozin.net/sebastian/blog/images/scoringrules-quantile-rule.svg" /></p>
<p>For any belief distribution, taking the minimum expected risk decision yields
the matching quantile.
For example, if your beliefs are distributed according to <span class="math">\(X \sim N(5,1)\)</span>,
then you would consider the expected risk</p>
<p>
<div class="math">$$R_{\alpha}(r,\alpha) = \mathbb{E}_{X \sim N(5,1)}[-S_c(r,X,\alpha)].$$</div>
</p>
<p>This convolves the check loss function with the belief distribution, in this
case corresponding to a Gaussian kernel.
The minimizer over <span class="math">\(r\)</span> of this expected risk function would correspond to your
optimal decision.</p>
<p><img alt="Integrated risk under the check loss" src="http://www.nowozin.net/sebastian/blog/images/scoringrules-quantile-rule-example.svg" /></p>
<p>The above plot marks the 10/50/90 quantiles and these correspond to the
minimizers of the expected risks of the respective check losses.</p>
<h2>Conclusion</h2>
<p>The above is only a small peek into the vast literature on scoring rules.
If you are mathematically inclined, I highly recommend <a href="https://www.csss.washington.edu/~raftery/Research/PDF/Gneiting2007jasa.pdf">(Gneiting and Raftery,
2007)</a>
as an enjoyable further read and <a href="http://www.cs.colorado.edu/~raf/media/papers/vec-props.pdf">(Frongillo and Kash,
2015)</a> for the
most recent general results; everyone else may enjoy the book mentioned in the
introduction.</p>
<p>In the second part we are going to put your forecasting skills to the test via
an interactive quiz!</p>
<p><em>Acknowledgements</em>. I thank <a href="http://research.microsoft.com/en-us/people/iankash/"><em>Ian
Kash</em></a> for further
insightful discussions on scoring rules and pointing me to relevant
literature.</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = '//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: 'center'," +
" displayIndent: '0em'," +
" showMathMenu: true," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" +
" } " +
"}); ";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Machine Learning for Intelligent Image and Video Processing (ICCV 2015 Workshop)2015-09-02T23:30:00+02:00Sebastian Nowozintag:www.nowozin.net,2015-09-02:sebastian/blog/machine-learning-for-intelligent-image-and-video-processing-iccv-2015-workshop.html<p><a href="http://ei.is.tuebingen.mpg.de/person/mhirsch">Michael Hirsch</a> and myself are
organizing a <a href="http://ml4ip-iccv2015.is.tuebingen.mpg.de">workshop on the topic of machine learning for image and video
processing</a>
as part of the <a href="http://pamitc.org/iccv15/">ICCV 2015 programme</a>.</p>
<p>The workshop takes place on the 17th December 2015 in Santiago, Chile, right
after the main ICCV conference.</p>
<h2>Call for Contributions</h2>
<p>Image processing methods are highly relevant in a large variety of
industrial and consumer applications. Traditionally some of the
successful methods have been derived based on a careful consideration
of the particular imaging modality and task, or on an ad-hoc basis by
image processing practitioners. More recently statistical machine
learning models have been proposed for tasks such as denoising,
deblurring, inpainting, etc., often leading to significant gains in
image quality. Machine learning methods require training data to learn
about the image statistics and the task, and challenges arise in how
this data should be collected and how ground truth is obtained.</p>
<p>The goal of this workshop is to bring together researchers from the
image processing and machine learning community to discuss all issues
related to machine learning models for image processing applications.</p>
<p>We invite submission of papers on relevant topics including, but not
limited to, the following areas:</p>
<ul>
<li>Statistical modelling of image processing tasks</li>
<li>Runtime and data efficiency</li>
<li>Tractable estimation</li>
<li>Deep learning for image processing applications</li>
<li>Procedures to obtain ground truth data sets</li>
</ul>
<p>In all of these areas the ICCV community has been at the forefront of developing
new ideas, and we hope to continue this development through this workshop.</p>
<h3>Keynote Speakers</h3>
<p>Join us for an exciting program including invited talks by:</p>
<ul>
<li><a href="https://users.soe.ucsc.edu/~milanfar/">Peyman Milanfar</a>, Google</li>
<li><a href="http://www.visinf.tu-darmstadt.de/vi_people/sroth/sroth.en.jsp">Stefan Roth</a>, TU Darmstadt</li>
</ul>
<h3>Important Dates</h3>
<ul>
<li>Submission deadline: Friday, September 25th, 2015</li>
<li>Author Notification: Friday, October 16th, 2015</li>
<li>Final version of submission: Friday, October 23rd, 2015</li>
</ul>
<h3>Submission Instructions</h3>
<ul>
<li>Papers should be in ICCV style</li>
<li>Maximum paper length is 6 pages</li>
<li>Papers will be reviewed in a double blind process</li>
<li>Accepted papers are not published as part of the IEEE Proceedings but
appear unofficially on the workshop website</li>
</ul>
<p>Accepted papers will be presented at the poster session with an
additional poster spotlight presentation. One author of every accepted
paper has to attend the workshop to present the poster and give the spotlight talk.</p>
<h3>Organizers</h3>
<ul>
<li>Sebastian Nowozin, Microsoft Research, Cambridge, UK</li>
<li>Michael Hirsch, Max Planck Institute for Intelligent Systems, Germany</li>
</ul>
<p>Please find further details at the <a href="http://ml4ip-iccv2015.is.tuebingen.mpg.de">workshop
website</a> or send me an email in
case you have any questions.</p>Effective Sample Size in Importance Sampling2015-08-21T21:30:00+02:00Sebastian Nowozintag:www.nowozin.net,2015-08-21:sebastian/blog/effective-sample-size-in-importance-sampling.html<p>In this article we will look at a practically important measure of efficiency
in importance sampling, the so-called <em>effective sample size</em> (ESS) estimate.
This measure was proposed by <a href="http://www.decode.com/management/">Augustine
Kong</a> in 1992 in a technical report which
until recently has been difficult to locate online, but after getting in
contact with the University of Chicago I am pleased that the report is now
available (again):</p>
<ul>
<li>Augustine Kong, "A Note on Importance Sampling using Standardized Weights",
Technical Report 348,
<a href="https://galton.uchicago.edu/techreports/tr348.pdf">PDF</a>,
Department of Statistics, University of Chicago, July 1992.</li>
</ul>
<p>Before we discuss the usefulness of the effective sample size, let us first
define the notation and context for importance sampling.</p>
<p><a href="http://en.wikipedia.org/wiki/Importance_sampling">Importance sampling</a> is one
of the most generally applicable methods for sampling from otherwise intractable
distributions.
In machine learning and statistics importance sampling is regularly used for
sampling from distributions in low dimensions (say, up to maybe 20
dimensions).
The general idea of importance sampling has been extended since the 1950s
to the sequential setting and the resulting class of modern
<a href="http://www.stats.ox.ac.uk/~doucet/doucet_defreitas_gordon_smcbookintro.pdf"><em>Sequential Monte Carlo</em> (SMC) methods</a>
constitute the state-of-the-art Monte Carlo methods in many important time
series modeling applications.</p>
<p>The general idea of importance sampling is as follows.
We are interested in computing an expectation,
<div class="math">$$\mu = \mathbb{E}_{X \sim p}[h(X)] = \int h(x) p(x) \,\textrm{d}x.$$</div>
If we can sample from <span class="math">\(p\)</span> directly, the standard Monte Carlo estimate
applies: we draw <span class="math">\(X_i \sim p\)</span>, <span class="math">\(i=1,\dots,n\)</span>, and use
<div class="math">$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^n h(X_i).$$</div>
</p>
<p>In many applications we cannot directly sample from <span class="math">\(p\)</span>.
In this case importance sampling can still be applied by sampling from a
tractable proposal distribution <span class="math">\(q\)</span>, with <span class="math">\(X_i \sim q\)</span>, <span class="math">\(i=1,\dots,n\)</span>, then
reweighting the sample using the ratio <span class="math">\(p(X_i)/q(X_i)\)</span>, leading to the
standard importance sampling estimate
<div class="math">$$\tilde{\mu} = \frac{1}{n} \sum_{i=1}^n \frac{p(X_i)}{q(X_i)} h(X_i).$$</div>
</p>
<p>In case <span class="math">\(p\)</span> is known only up to an unknown normalizing constant, the so called
<em>self-normalized importance sampling estimate</em> can be used.
Denoting the weights by <span class="math">\(w(X_i) = \frac{p(X_i)}{q(X_i)}\)</span> it is defined as
<div class="math">$$\bar{\mu} = \frac{\frac{1}{n} \sum_{i=1}^n w(X_i) h(X_i)}{
\frac{1}{n} \sum_{i=1}^n w(X_i)}.$$</div>
</p>
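<p>For concreteness, here is a minimal Python/NumPy sketch of the self-normalized estimate. The target, proposal, and integrand are hypothetical choices for illustration: a standard Normal target treated as known only up to a constant, a Normal proposal with mean 1 and standard deviation 2, and <span class="math">\(h(x) = x^2\)</span>, so the true value is <span class="math">\(\mu = 1\)</span>.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Target p: standard Normal, treated as known only up to a constant.
# Proposal q: Normal(1, 2^2).  Integrand h(x) = x^2, so mu = E_p[h(X)] = 1.
def log_p_unnorm(x):
    return -0.5 * x**2

def log_q_unnorm(x):
    return -0.5 * ((x - 1.0) / 2.0)**2

n = 100_000
x = rng.normal(1.0, 2.0, size=n)              # X_i ~ q
log_w = log_p_unnorm(x) - log_q_unnorm(x)     # log of p(X_i)/q(X_i), up to a constant
w = np.exp(log_w - log_w.max())               # stabilized unnormalized weights
h = x**2

mu_bar = np.sum(w * h) / np.sum(w)            # self-normalized IS estimate
print(mu_bar)                                 # close to 1
```

<p>Any constant factors dropped from <span class="math">\(p\)</span> or <span class="math">\(q\)</span> multiply all weights equally and cancel in the ratio, which is exactly why the self-normalized form is useful.</p>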
<p>The quality of this estimate chiefly depends on how well the proposal
distribution <span class="math">\(q\)</span> matches the form of <span class="math">\(p\)</span>. Because <span class="math">\(p\)</span> is difficult to sample
from, it is typically also difficult to make a precise statement about how
well <span class="math">\(q\)</span> approximates it.</p>
<p>The effective sample size solves this issue: it can be used after or during
importance sampling to provide a quantitative measure of the quality of the
estimated mean.
Even better, the estimate is expressed on a natural scale: its worth in samples
from <span class="math">\(p\)</span>. That is, if we use <span class="math">\(n=1000\)</span> samples <span class="math">\(X_i \sim q\)</span> and obtain an ESS
of, say, 350, then this indicates that the quality of our estimate is about the
same as if we had used 350 direct samples <span class="math">\(X_i \sim p\)</span>. This justifies
the name <em>effective sample size</em>.</p>
<p>Since the late 1990s the effective sample size has been widely used as a reliable
diagnostic in importance sampling and sequential Monte Carlo applications.
Sometimes it even informs the algorithm during sampling; for example, one can
continue an importance sampling method until a certain ESS has been reached.
Another example is during SMC where the ESS is often used to decide whether
operations such as resampling or rejuvenation are performed.</p>
<h2>Definition</h2>
<p>Two alternative but equivalent definitions exist. Assume normalized weights
<span class="math">\(w_i \geq 0\)</span> with <span class="math">\(\sum_{i=1}^n w_i = 1\)</span>. Then, the original definition of
the effective sample size estimate is by Kong, popularized by <a href="http://www.people.fas.harvard.edu/~junliu/">Jun
Liu</a> in
<a href="https://stat.duke.edu/~scs/Courses/Stat376/Papers/ConvergeRates/LiuMetropolized1996.pdf">this
paper</a>, as
<div class="math">$$\textrm{ESS} = \frac{n}{1 + \textrm{Var}_q(W)},$$</div>
where <span class="math">\(W\)</span> is the importance weight rescaled to unit mean, whose variance is
estimated from the normalized weights as <span class="math">\(\textrm{Var}_q(W) \approx \frac{1}{n} \sum_{i=1}^n (n w_i - 1)^2 = n \sum_{i=1}^n w_i^2 - 1\)</span>.
The alternative form emerged later (I did not manage to find its first use
precisely), and has the form
<div class="math">$$\textrm{ESS} = \frac{1}{\sum_{i=1}^n w_i^2}.$$</div>
</p>
<p>When the weights are unnormalized, we define <span class="math">\(\tilde{w}_i = w_i /
(\sum_{i=1}^n w_i)\)</span> and see that
<div class="math">$$\textrm{ESS} = \frac{1}{\sum_{i=1}^n \tilde{w}_i^2}
= \frac{(\sum_{i=1}^n w_i)^2}{\sum_{i=1}^n w_i^2}.$$</div>
</p>
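<p>The equivalence of the normalized and unnormalized forms is easy to check numerically; the following Python snippet uses arbitrary positive weights purely for illustration.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.gamma(2.0, 1.0, size=1000)        # arbitrary positive unnormalized weights

w_tilde = w / w.sum()                     # normalized weights
ess_normalized = 1.0 / np.sum(w_tilde**2)
ess_unnormalized = w.sum()**2 / np.sum(w**2)

assert np.isclose(ess_normalized, ess_unnormalized)
```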
<p>As is often the case in numerical computation for probabilistic models, the
quantities are stored in the log-domain, i.e. we store <span class="math">\(\log w_i\)</span>
instead of <span class="math">\(w_i\)</span> and evaluate the above expressions in log-space.</p>
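<p>Concretely, the unnormalized form can be evaluated entirely from log-weights with the log-sum-exp trick; the sketch below never exponentiates a large log-weight directly.</p>

```python
import numpy as np

def logsumexp(a):
    # numerically stable log(sum(exp(a)))
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

def ess_from_log_weights(log_w):
    # ESS = (sum_i w_i)^2 / sum_i w_i^2, computed in log-space
    log_w = np.asarray(log_w)
    return np.exp(2.0 * logsumexp(log_w) - logsumexp(2.0 * log_w))

# Log-weights far too large for a direct exp() still work;
# equal weights give the maximal ESS, n = 4.
log_w = np.array([1000.0, 1000.0, 1000.0, 1000.0])
print(ess_from_log_weights(log_w))
```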
<h2>Example</h2>
<p>As a simple example we set the target distribution to be a
<span class="math">\(\textrm{StudentT}(0,\nu)\)</span> with <span class="math">\(\nu=8\)</span> degrees of freedom, and the proposal
to be a Normal <span class="math">\(\mathcal{N}(\mu,16)\)</span>.
We then visualize the ESS as a function of the shift <span class="math">\(\mu\)</span> of the Normal
proposal. The effective sample size should be highest when the proposal is
centered at the true mean (zero) and decrease as the shift moves away from it.</p>
<p><img alt="ESS example, StudentT target, Normal proposal" src="http://www.nowozin.net/sebastian/blog/images/ess-demo-1.svg" /></p>
<p>This is indeed what happens in the above plot and, not shown, the estimated
variance from the ESS agrees with the variance over many replicates.</p>
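<p>The experiment is easy to reproduce. Here is a Python sketch under the assumption that <span class="math">\(\mathcal{N}(\mu,16)\)</span> denotes variance 16 (standard deviation 4); the exact ESS values depend on the random draws, but the qualitative shape matches the plot.</p>

```python
import numpy as np
from math import lgamma, log, pi

rng = np.random.default_rng(2)
nu = 8.0

def log_t_pdf(x, nu):
    # log density of a Student-t distribution with nu degrees of freedom
    return (lgamma((nu + 1.0) / 2.0) - lgamma(nu / 2.0)
            - 0.5 * log(nu * pi)
            - (nu + 1.0) / 2.0 * np.log1p(x**2 / nu))

def ess_for_shift(mu, n=20_000, sigma=4.0):
    x = rng.normal(mu, sigma, size=n)                            # X_i ~ N(mu, sigma^2)
    log_q = -0.5 * ((x - mu) / sigma)**2 - log(sigma) - 0.5 * log(2.0 * pi)
    log_w = log_t_pdf(x, nu) - log_q
    w = np.exp(log_w - log_w.max())                              # stabilized weights
    return w.sum()**2 / np.sum(w**2)

ess_values = [ess_for_shift(m) for m in (0.0, 4.0, 8.0)]
print([round(e) for e in ess_values])    # highest at zero shift, decaying with distance
```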
<h2>Derivation</h2>
<p>The following derivation is from Kong's technical report; however, to make it
self-contained and accessible I have fleshed out some details and added explanations
inline.</p>
<p>We start with an expression for <span class="math">\(\textrm{Var}(\bar{\mu})\)</span>.
This is a variance of a ratio expression with positive denominator; hence we
can apply the multivariate delta method for ratio expressions (see appendix
below) to obtain an asymptotic approximation.
Following Kong's original notation we define <span class="math">\(W_i = w(X_i)\)</span> and <span class="math">\(W=W_1\)</span>, as
well as <span class="math">\(Z_i = h(X_i) w(X_i)\)</span> and <span class="math">\(Z = Z_1\)</span>.
Then we have the asymptotic delta method approximation
<div class="math">\begin{eqnarray}
\textrm{Var}_q(\bar{\mu}) & \approx &
\frac{1}{n}\left[\frac{\textrm{Var}_q(Z)}{(\mathbb{E}_q W)^2}
- 2 \frac{\mathbb{E}_q Z}{(\mathbb{E}_q W)^3} \textrm{Cov}_q(Z,W)
+ \frac{(\mathbb{E}_q Z)^2}{(\mathbb{E}_q W)^4}
\textrm{Var}_q(W)\right].\label{eqn:delta1}
\end{eqnarray}</div>
We can simplify this somewhat intimidating expression by realizing that
<div class="math">$$\mathbb{E}_q W = \int \frac{p(x)}{q(x)} q(x) \,\textrm{d}x
= \int p(x) \,\textrm{d}x = 1.$$</div>
(For the unnormalized case the derivation result is the same because the ratio
<span class="math">\(\bar{\mu}\)</span> does not depend on the normalization constant.)
Then we can simplify <span class="math">\((\ref{eqn:delta1})\)</span> to
<div class="math">\begin{eqnarray}
& = &
\frac{1}{n}\left[\textrm{Var}_q(Z)
- 2 (\mathbb{E}_q Z) \textrm{Cov}_q(Z,W)
+ (\mathbb{E}_q Z)^2 \textrm{Var}_q(W)\right].\label{eqn:delta2}
\end{eqnarray}</div>
The next step is to realize that
<span class="math">\(\mathbb{E}_q Z = \int w(x) h(x) q(x) \,\textrm{d}x
= \int \frac{p(x)}{q(x)} q(x) h(x) \,\textrm{d}x
= \int h(x) p(x) \,\textrm{d}x = \mu.\)</span>
Thus <span class="math">\((\ref{eqn:delta2})\)</span> further simplifies to
<div class="math">\begin{eqnarray}
& = &
\frac{1}{n}\big[\underbrace{\textrm{Var}_q(Z)}_{\textrm{(B)}}
- 2 \mu \underbrace{\textrm{Cov}_q(Z,W)}_{\textrm{(A)}}
+ \mu^2 \textrm{Var}_q(W)\big].
\label{eqn:delta3}
\end{eqnarray}</div>
</p>
<p>This is great progress, but we need to nibble on this expression some more.
Let us consider the parts (A) and (B), in this order.</p>
<p><strong>(A)</strong>. To simplify this expression we can leverage the <a href="http://en.wikipedia.org/wiki/Covariance#Definition">definition of the
covariance</a> and then apply
the known relations of our special expectations.
This yields:
<div class="math">\begin{eqnarray}
\textrm{(A)} = \textrm{Cov}_q(Z,W) & = &
\mathbb{E}_q[\underbrace{Z}_{= W H} W]
- \underbrace{(\mathbb{E}_q Z)}_{= \mu}
\underbrace{(\mathbb{E}_q W)}_{= 1}\nonumber\\
& = & \mathbb{E}_q[H W^2] - \mu\nonumber\\
& = & \mathbb{E}_p[H W] - \mu.\label{eqn:A1}
\end{eqnarray}</div>
Note the change of measure from <span class="math">\(q\)</span> to <span class="math">\(p\)</span> in the last step.
To break down the expectation of the product further we use the known rules
about expectations, namely <span class="math">\(\textrm{Cov}(X,Y) = \mathbb{E}[XY] -
(\mathbb{E}X)(\mathbb{E}Y)\)</span>, which leads us to
<div class="math">\begin{eqnarray}
\textrm{(A)} = \textrm{Cov}_q(Z,W) & = &
\textrm{Cov}_p(H,W) + \mu \mathbb{E}_p W - \mu.\label{eqn:A2}
\end{eqnarray}</div>
</p>
<p><strong>(B)</strong>. First we expand the variance by its definition, then simplify.
<div class="math">$$\textrm{Var}_q(Z) = \textrm{Var}_q(W H)
= \mathbb{E}_q[W^2 H^2] - (\underbrace{\mathbb{E}_q[WH]}_{= \mu})^2
= \mathbb{E}_p[W H^2] - \mu^2.$$</div>
</p>
<p>For approaching <span class="math">\(\mathbb{E}_p[W H^2]\)</span> we need to leverage the second-order
delta method (see appendix) which gives the following approximation,
<div class="math">\begin{eqnarray}
\mathbb{E}_p[W H^2] & \approx &
(\mathbb{E}_p W)\underbrace{(\mathbb{E}_p H)^2}_{= \mu^2}
+ 2 \underbrace{\mathbb{E}_p[H]}_{\mu} \textrm{Cov}_p(W,H)
+ (\mathbb{E}_p W) \textrm{Var}_p(H)\nonumber\\
& = & (\mathbb{E}_p W) \mu^2 + 2 \mu \textrm{Cov}_p(W,H)
+ (\mathbb{E}_p W) \textrm{Var}_p(H).
\label{eqn:B1}
\end{eqnarray}</div>
</p>
<p>Ok, almost done. We now leverage our work to harvest:
<div class="math">\begin{eqnarray}
\textrm{Var}_q(\bar{\mu}) & \approx &
\frac{1}{n}\big[\underbrace{\textrm{Var}_q(Z)}_{\textrm{(B)}}
- 2 \mu \underbrace{\textrm{Cov}_q(Z,W)}_{\textrm{(A)}}
+ \mu^2 \textrm{Var}_q(W)\big]\nonumber\\
& \approx & \frac{1}{n}\big[
\left(
(\mathbb{E}_p W) \mu^2 + 2 \mu \textrm{Cov}_p(W,H)
+ (\mathbb{E}_p W) \textrm{Var}_p(H) - \mu^2
\right)\nonumber\\
& & \qquad
- 2 \mu \left(\textrm{Cov}_p(H,W) + \mu\mathbb{E}_p W - \mu\right)
\nonumber\\
& & \qquad + \mu^2 \textrm{Var}_q(W)
\big]\nonumber\\
& = & \frac{1}{n}\left[\mu^2 \left(
1 + \textrm{Var}_q(W) - \mathbb{E}_p W\right)
+ (\mathbb{E}_p W) \textrm{Var}_p(H)\right].\label{eqn:H1}
\end{eqnarray}</div>
</p>
<p>Finally, we can reduce <span class="math">\((\ref{eqn:H1})\)</span> further by
<div class="math">$$\mathbb{E}_p W = \mathbb{E}_q[W^2]
= \textrm{Var}_q(W) + (\mathbb{E}_q W)^2
= \textrm{Var}_q(W) + 1.$$</div>
For the other term we have
<div class="math">$$\frac{1}{n} \textrm{Var}_p(H) = \textrm{Var}_p(\hat{\mu}).$$</div>
This simplifies <span class="math">\((\ref{eqn:H1})\)</span> to the following satisfying expression.
<div class="math">$$\textrm{Var}_q(\bar{\mu}) \approx \textrm{Var}_p(\hat{\mu})
(1 + \textrm{Var}_q(W)).$$</div>
This reads as "the variance of the self-normalized importance sampling
estimate is approximately equal to the variance of the simple Monte Carlo
estimate times <span class="math">\(1 + \textrm{Var}_q(W)\)</span>."</p>
<p>Therefore, when taking <span class="math">\(n\)</span> samples to compute <span class="math">\(\bar{\mu}\)</span> the <em>effective
sample size</em> is estimated as
<div class="math">$$\textrm{ESS} = \frac{n}{1 + \textrm{Var}_q(W)}.$$</div>
</p>
<p>Two comments:</p>
<ol>
<li>We can estimate <span class="math">\(\textrm{Var}_q(W)\)</span> by the sample variance of the
normalized importance weights.</li>
<li>This estimate does not depend on the integrand <span class="math">\(h\)</span>.</li>
</ol>
<p>The simpler form of the ESS estimate can be obtained by estimating
<div class="math">\begin{eqnarray}
\textrm{Var}_q(W) & \approx & \frac{1}{n} \sum_{i=1}^n (n w_i - 1)^2
\nonumber\\
& = & n \sum_{i=1}^n w_i^2 - 1,\nonumber
\end{eqnarray}</div>
using <span class="math">\(\sum_{i=1}^n w_i = 1\)</span>, which yields
<div class="math">$$\textrm{ESS} = \frac{n}{1 + n \sum_i w_i^2 - 1}
= \frac{1}{\sum_{i=1}^n w_i^2}.$$</div>
</p>
<h2>Conclusion</h2>
<p>Monte Carlo methods such as importance sampling and Markov chain Monte Carlo
can fail when the proposal distribution is not suitably chosen. Therefore,
we should always employ diagnostics, and for importance sampling the effective
sample size diagnostic has become the standard due to its simplicity,
intuitive interpretation, and robustness in practical applications.</p>
<p>However, the effective sample size can fail, for example when all proposal
samples fall in a region where the target distribution has little probability mass.
In that case the weights would be approximately equal and the ESS close to
optimal, failing to diagnose the mismatch between proposal and target
distribution. This is, in a way, unavoidable: if we never get to see a
high-probability region of the target distribution, the low value of our samples is
hard to recognize.</p>
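<p>A concrete way to see this failure mode is a proposal that misses an entire mode of the target (a hypothetical example of my choosing): the weights are nearly constant, so the ESS looks excellent, yet the estimate of the target mean is badly wrong.</p>

```python
import numpy as np

rng = np.random.default_rng(3)

# Target: equal mixture of N(0, 1) and N(100, 1), so E_p[X] = 50.
# Proposal: N(0, 1) -- it never sees the second mode.
def p(x):
    return 0.5 * (np.exp(-0.5 * x**2)
                  + np.exp(-0.5 * (x - 100.0)**2)) / np.sqrt(2.0 * np.pi)

def q(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

n = 10_000
x = rng.normal(0.0, 1.0, size=n)      # all samples land near the first mode
w = p(x) / q(x)                       # nearly constant (about 0.5) everywhere

ess = w.sum()**2 / np.sum(w**2)
mu_bar = np.sum(w * x) / np.sum(w)    # self-normalized estimate of E_p[X] = 50

print(round(ess), round(mu_bar, 2))   # ESS close to n = 10000, estimate near 0
```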
<p>For another discussion on importance sampling diagnostics and an alternative
derivation, see <a href="http://statweb.stanford.edu/~owen/mc/Ch-var-is.pdf">Section 9.3 in Art Owen's upcoming Monte Carlo
book</a>.
Among many interesting things in that chapter, he proposes an effective sample
size statistic specific to the particular integrand <span class="math">\(h\)</span>. For this, redefine
the weights as
<div class="math">$$w_h(X_i) = \frac{\frac{p(X_i)}{q(X_i)} |h(X_i)|}{
\sum_{i=1}^n \frac{p(X_i)}{q(X_i)} |h(X_i)|},$$</div>
then use the usual <span class="math">\(1/\sum_i w_h(X_i)^2\)</span> estimate. This variant is more
accurate because it takes the integrand into account.</p>
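<p>This variant is a one-line change in code; the sketch below uses arbitrary illustrative weights and the hypothetical integrand <span class="math">\(h(x)=x^2\)</span>.</p>

```python
import numpy as np

rng = np.random.default_rng(4)

w = rng.gamma(2.0, 1.0, size=5000)        # stand-ins for p(x_i)/q(x_i)
x = rng.normal(size=5000)
h = x**2                                  # the integrand of interest

wh = w * np.abs(h)                        # fold |h| into the weights ...
wh = wh / wh.sum()                        # ... and normalize
ess_h = 1.0 / np.sum(wh**2)               # integrand-specific ESS

ess_plain = w.sum()**2 / np.sum(w**2)     # ordinary ESS, for comparison
print(ess_h, ess_plain)
```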
<p><em>Addendum:</em> <a href="http://arxiv.org/abs/1602.03572">This paper</a> by Martino, Elvira,
and Louzada, takes a detailed look at variations of the effective sample size
statistic.</p>
<h2>Appendix: The Multivariate Delta Method</h2>
<p>The <a href="http://en.wikipedia.org/wiki/Delta_method"><em>delta method</em></a> is a classic
method used in asymptotic statistics to obtain limiting expressions for the
mean and variance of functions of random variables. It can be seen as the
statistical analog of the Taylor approximation to a function.</p>
<p>The multivariate extension is also classic, and the following theorem can be
found in many works; I picked the one given as Theorem 3.7 in
<a href="http://www.stat.purdue.edu/~dasgupta/">DasGupta's</a>
<a href="http://www.springer.com/mathematics/probability/book/978-0-387-75970-8">book on asymptotic
statistics</a>
(by the way, this book is a favorite of mine for its accessible presentation
of many practical results in classical statistics).
A more advanced and specialized book on expansions beyond the delta method is
<a href="https://math.uwaterloo.ca/statistics-and-actuarial-science/people-profiles/christopher-small">Christopher
Small</a>'s
<a href="https://books.google.co.uk/books?id=uXexXLoZnZAC">book on the topic</a>.</p>
<h3>Delta Method for Distributions</h3>
<p><em><strong>Theorem</strong> (Multivariate Delta Method for Distributions).</em> Suppose
<span class="math">\(\{T_n\}\)</span> is a sequence of <span class="math">\(k\)</span>-dimensional random vectors such that
<div class="math">$$\sqrt{n}(T_n - \theta) \stackrel{\mathcal{L}}{\rightarrow}
\mathcal{N}_k(0,\Sigma(\theta)).$$</div>
Let <span class="math">\(g:\mathbb{R}^k \to \mathbb{R}^m\)</span> be once differentiable at <span class="math">\(\theta\)</span> with
the gradient vector <span class="math">\(\nabla g(\theta)\)</span>. Then
<div class="math">$$\sqrt{n}(g(T_n) - g(\theta)) \stackrel{\mathcal{L}}{\rightarrow}
\mathcal{N}_m(0, \nabla g(\theta)^T \Sigma(\theta) \nabla g(\theta))$$</div>
provided <span class="math">\(\nabla g(\theta)^T \Sigma(\theta) \nabla g(\theta)\)</span> is positive
definite.</p>
<p>This simply says that if we have a vector <span class="math">\(T\)</span> of random variables and we know
that <span class="math">\(T\)</span> converges asymptotically to a Normal, then we can make a similar
statement about the convergence of <span class="math">\(g(T)\)</span>.</p>
<p>For the effective sample size derivation we will need to instantiate this
theorem for a special case of <span class="math">\(g\)</span>, namely where <span class="math">\(g: \mathbb{R}^2 \to
\mathbb{R}\)</span> and <span class="math">\(g(x,y) = \frac{x}{y}\)</span>. Let's quickly do that.
We have
<div class="math">$$\nabla g(x,y) = \left(\begin{array}{c}
\frac{1}{y} \\
-\frac{x}{y^2}\end{array}\right).$$</div>
We further define <span class="math">\(X_i \sim P_X\)</span>, <span class="math">\(Y_i \sim P_Y\)</span> iid, <span class="math">\(X=X_1\)</span>, <span class="math">\(Y=Y_1\)</span>,
<div class="math">$$T_n=\left(\begin{array}{c} \frac{1}{n}\sum_{i=1}^n X_i\\
\frac{1}{n} \sum_{i=1}^n Y_i\end{array}\right),\qquad
\theta=\left(\begin{array}{c} \mathbb{E}X\\
\mathbb{E}Y\end{array}\right),$$</div>
assuming our sequence <span class="math">\(\frac{1}{n} \sum_{i=1}^n X_i \to \mathbb{E}X\)</span> and
<span class="math">\(\frac{1}{n} \sum_{i=1}^n Y_i \to \mathbb{E}Y\)</span>.
For the covariance matrix we know that the empirical average of <span class="math">\(n\)</span> iid
samples has a variance that shrinks as <span class="math">\(1/n\)</span>, that is
<div class="math">$$\textrm{Var}(\frac{1}{n}\sum_{i=1}^n X_i)
= \frac{1}{n^2} \textrm{Var}(\sum_{i=1}^n X_i)
= \frac{1}{n^2} \sum_{i=1}^n \textrm{Var}(X_i)
= \frac{1}{n} \textrm{Var}(X),$$</div>
and <a href="http://en.wikipedia.org/wiki/Covariance#Properties">similar for the
covariance</a>, so we have
<div class="math">$$\Sigma(\theta) = \frac{1}{n} \left(\begin{array}{cc}
\textrm{Var}(X) & \textrm{Cov}(X,Y)\\
\textrm{Cov}(X,Y) & \textrm{Var}(Y)\end{array}\right).$$</div>
Applying the above theorem we have for the resulting one-dimensional
transformed variance
<div class="math">\begin{eqnarray}
B(\theta) & := & \nabla g(\theta)^T \Sigma(\theta) \nabla g(\theta)\nonumber\\
& = & \frac{1}{n} \left(\begin{array}{c}
\frac{1}{\mathbb{E}Y} \\
-\frac{\mathbb{E}X}{(\mathbb{E}Y)^2}\end{array}\right)^T
\left(\begin{array}{cc}
\textrm{Var}(X) & \textrm{Cov}(X,Y)\\
\textrm{Cov}(X,Y) & \textrm{Var}(Y)\end{array}\right)
\left(\begin{array}{c}
\frac{1}{\mathbb{E}Y} \\
-\frac{\mathbb{E}X}{(\mathbb{E}Y)^2}\end{array}\right)\nonumber\\
& = & \frac{1}{n} \left[
\frac{1}{(\mathbb{E}Y)^2} \textrm{Var}(X)
- 2 \frac{\mathbb{E}X}{(\mathbb{E}Y)^3} \textrm{Cov}(X,Y)
+ \frac{(\mathbb{E}X)^2}{(\mathbb{E}Y)^4} \textrm{Var}(Y)
\right].\nonumber
\end{eqnarray}</div>
</p>
<p>One way to interpret the quantity <span class="math">\(B(\theta)\)</span> is that the limiting variance of
the ratio <span class="math">\(X/Y\)</span> depends both on the variances of <span class="math">\(X\)</span> and of <span class="math">\(Y\)</span>, but crucially
it depends most sensitively on <span class="math">\(\mathbb{E}Y\)</span> because this quantity appears in
the denominator: small values of <span class="math">\(Y\)</span> have large effects on <span class="math">\(X/Y\)</span>.</p>
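<p>The quality of this approximation is easy to probe by simulation. The sketch below compares <span class="math">\(B(\theta)\)</span> against the empirical variance of the ratio of sample means; the distributions are illustrative choices, with <span class="math">\(X\)</span> and <span class="math">\(Y\)</span> independent so that the covariance term vanishes.</p>

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 200, 20_000
ex, ey = 2.0, 5.0              # E X = 2, E Y = 5; Var X = Var Y = 1
                               # X and Y independent, so Cov(X, Y) = 0

# Delta-method prediction B(theta) for Var(mean(X) / mean(Y))
b_theta = (1.0 / n) * (1.0 / ey**2 + ex**2 / ey**4)

# Empirical variance of the ratio of sample means over many replicates
xbar = rng.normal(ex, 1.0, size=(reps, n)).mean(axis=1)
ybar = rng.normal(ey, 1.0, size=(reps, n)).mean(axis=1)
empirical = np.var(xbar / ybar)

print(b_theta, empirical)      # close agreement
```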
<p>This is an asymptotic expression which is based on the assumption that both
<span class="math">\(X\)</span> and <span class="math">\(Y\)</span> are concentrated around the mean so that the linearization of <span class="math">\(g\)</span>
around the mean will incur a small error. As such, this approximation may
deteriorate if the variance of <span class="math">\(X\)</span> or <span class="math">\(Y\)</span> is large so that the linear
approximation of <span class="math">\(g\)</span> deviates from the actual values of <span class="math">\(g\)</span>.</p>
<p>(For an exact expansion of the expectation of a ratio, see <a href="http://www.faculty.biol.ttu.edu/Rice/ratio-derive.pdf">this 2009
note</a> by <a href="http://www.faculty.biol.ttu.edu/Rice/">Sean
Rice</a>.)</p>
<h3>Second-order Delta Method</h3>
<p>The above delta method can be extended to higher-order by a <a href="http://en.wikipedia.org/wiki/Taylor%27s_theorem#Taylor.27s_theorem_for_multivariate_functions">multivariate
Taylor
expansion</a>.
I give the following result without proof.</p>
<p><em><strong>Theorem</strong> (Second-order Multivariate Delta Method).</em> Let <span class="math">\(T\)</span> be a
<span class="math">\(k\)</span>-dimensional random vector such that <span class="math">\(\mathbb{E} T = \theta\)</span>. Let
<span class="math">\(g:\mathbb{R}^k \to \mathbb{R}\)</span> be twice differentiable at <span class="math">\(\theta\)</span> with
<a href="http://en.wikipedia.org/wiki/Hessian_matrix">Hessian</a> <span class="math">\(H(\theta)\)</span>. Then
<div class="math">$$\mathbb{E} g(T) \approx
g(\theta) + \frac{1}{2} \textrm{tr}(\textrm{Cov}(T) \: H(\theta)).$$</div>
</p>
<p>For the proof of the effective sample size we need to apply this theorem to
the function <span class="math">\(g(X,Y)=XY^2\)</span> so that
<div class="math">$$H(X,Y)=\left[\begin{array}{cc} 0 & 2Y\\ 2Y & 2X\end{array}\right].$$</div>
Then the above result gives
<div class="math">$$\mathbb{E} g(X,Y) \approx
(\mathbb{E}X)(\mathbb{E}Y)^2 + 2 (\mathbb{E}Y) \textrm{Cov}(X,Y)
+ (\mathbb{E}X) \textrm{Var}(Y).$$</div>
</p>
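<p>As a sanity check, this approximation can be compared against a Monte Carlo estimate. For jointly Gaussian <span class="math">\((X,Y)\)</span> (an illustrative choice) the two agree up to sampling noise.</p>

```python
import numpy as np

rng = np.random.default_rng(6)

# Jointly Gaussian (X, Y) with a nonzero covariance; g(x, y) = x * y^2
mean = np.array([1.0, 2.0])
cov = np.array([[1.0, 0.5],
                [0.5, 2.0]])
xy = rng.multivariate_normal(mean, cov, size=1_000_000)
x, y = xy[:, 0], xy[:, 1]

mc = np.mean(x * y**2)                         # Monte Carlo estimate of E[g(X, Y)]

ex, ey = mean
vy, cxy = cov[1, 1], cov[0, 1]
delta = ex * ey**2 + 2.0 * ey * cxy + ex * vy  # second-order delta approximation

print(mc, delta)                               # close agreement
```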
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = '//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: 'center'," +
" displayIndent: '0em'," +
" showMathMenu: true," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" +
" } " +
"}); ";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Reverse Search2015-08-07T21:30:00+02:00Sebastian Nowozintag:www.nowozin.net,2015-08-07:sebastian/blog/reverse-search.html<p>One of my all-time favorite algorithms is <em>reverse search</em> proposed by
<a href="http://cgm.cs.mcgill.ca/~avis/">David Avis</a> and
<a href="http://www.inf.ethz.ch/personal/fukudak/">Komei Fukuda</a> in 1992,
<a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.26.4487&rep=rep1&type=pdf">PDF</a>.</p>
<p>Reverse search is a method for solving enumeration problems, that is,
problems where you would like to list a finite set of typically
combinatorially related elements.
Strictly speaking, reverse search is not a single algorithm; rather, it is a general construction
principle that is applicable to a wide variety of problems and often leads to
optimal algorithms for enumeration problems.</p>
<p>Problems in which reverse search is applicable often have the flavour where
the elements have a <em>natural partial order</em> (such as sets, sequences, graphs
where we can define subsets, subsequences, and subgraphs), or where there is a
natural neighborhood relation between elements which can be used to traverse
from one element to the other (such as the linear programming bases considered
in the Avis and Fukuda examples).</p>
<p>The reverse search construction leads to a structured search space that is
also suitable for combinatorial search and optimization algorithms. For
example, we can often readily use the resulting enumeration tree in
<a href="http://en.wikipedia.org/wiki/Branch_and_bound">branch-and-bound search</a>
methods.
I made heavy use of this possibility during my PhD a few years ago in my
work with Koji Tsuda, and reverse search is the workhorse in my CVPR 2007,
ICCV 2007, and ICDM 2008 papers. (Needless to say, I have fond memories of
it, but even now I regularly see applications of the reverse search idea.)
In the following, my presentation will differ quite a bit from the Avis and
Fukuda paper.</p>
<h2>Basic Idea</h2>
<p>At its core reverse search is a method to organize all elements to be
enumerated into a tree where the nodes in the tree each represent a single
element.
Each element appears exactly once in the tree and by traversing the tree from
the root we can enumerate all elements exactly once.</p>
<p>Here is the <em>recipe</em>:</p>
<ol>
<li>Define a ``reduction'' operation which takes an enumeration element and
reduces it to a simpler one. This defines an enumeration tree.</li>
<li>Invert the reduction operation.</li>
<li>Enumerate all elements, starting from the root.</li>
</ol>
<p>Let us illustrate this recipe first on a simple example: enumerating subsets
of a given set. Say we are given the set <span class="math">\(\{1,2,3\}\)</span> and would like to
enumerate subsets. To define the reduction operation we simply say ``remove
the largest integer from the set''. Formally, this defines a function
<span class="math">\(f\)</span> from the set of sets to the set of sets. Here is an illustration:</p>
<p><img alt="Set of three integers and reduction operation" src="http://www.nowozin.net/sebastian/blog/images/rsearch-123set-f.svg" /></p>
<p>Now we consider the inverse map <span class="math">\(f^{-1}\)</span>, which maps each set to the
collection of sets that reduce to it. Here is an illustration:</p>
<p><img alt="Inverse reduction operation" src="http://www.nowozin.net/sebastian/blog/images/rsearch-123set-finv.svg" /></p>
<p>The inverse defines an enumeration strategy: we start at <span class="math">\(\emptyset\)</span> and
evaluate <span class="math">\(f^{-1}(\emptyset) = \{\{1\}, \{2\}, \{3\}\}\)</span>. For each set element
we now recurse. This enumerates all elements in the tree exactly once.</p>
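<p>This strategy is only a few lines of code. Here is a short Python sketch (for illustration; any language works): the inverse reduction <span class="math">\(f^{-1}\)</span> appends any element larger than the current maximum of the set.</p>

```python
def reverse_search_subsets(ground_set):
    # Enumerate all subsets by inverting the reduction "remove the
    # largest element": the children of S append any element larger
    # than max(S), so every subset is reached exactly once.
    elems = sorted(ground_set)

    def dfs(s):
        yield s
        lo = s[-1] if s else None
        for v in elems:
            if lo is None or v > lo:
                yield from dfs(s + [v])

    yield from dfs([])

subsets = list(reverse_search_subsets([1, 2, 3]))
print(subsets)          # all 2^3 = 8 subsets, each exactly once
```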
<p>The above recipe has the following practical advantages:</p>
<ol>
<li>Reverse search often yields a simple algorithm.</li>
<li>Typically there is no additional memory or bookkeeping required beyond the
recursion call stack, so that the total memory required is <span class="math">\(O(r)\)</span> where <span class="math">\(r\)</span>
is the recursion depth.</li>
<li>Yields an <em>output-linear</em>, <em>polynomial-delay</em> enumeration algorithm, which
means that the total time complexity is linear in the number of items
enumerated and for each item only polynomial time is needed. (This slightly
unconventional notion of complexity makes sense for enumeration problems
because the answer is often exponential in the size of the input.)</li>
<li>Often yields optimal enumeration algorithms in terms of memory and runtime.</li>
<li>The resulting algorithms are trivially parallelizable over the enumeration
tree.</li>
</ol>
<p>Ok, the above was a trivial example, let us look at a more complicated example.</p>
<h2>Example: Enumerating all Connected Subgraphs</h2>
<p>Let us consider a non-trivial application of the reverse search idea:
enumerating all connected subgraphs of a given graph.</p>
<p>To apply the recipe, what could the <em>reduction operation</em> look like?
Intuitively, we are given a connected graph and we could remove a single
vertex from the graph, thereby making it smaller. By removing one vertex at a
time we would eventually arrive at the empty graph.</p>
<p>But given a graph, how do we determine which vertex to remove?
For this, let us assume all vertices in the given graph have a unique integer
index. Then, given such a graph, we attempt to remove the highest-index
vertex, just as in the set example above. Here we hit a complication:
upon removal of the vertex the graph may become disconnected.
For example, consider the chain graph <span class="math">\(1-3-2\)</span>. Here the vertex labeled <span class="math">\(3\)</span>
would be removed, yielding two disconnected components, which violates the
requirement of enumerating only connected subgraphs. Therefore we simply say:
``Remove the highest-index vertex such that the resulting graph remains
connected''.</p>
<p>Here is an example of the reduction operation in action on the following
simple cycle graph:</p>
<p><img alt="Cycle graph with four nodes" src="http://www.nowozin.net/sebastian/blog/images/rsearch-1324.svg" /></p>
<p>The enumeration tree of all fourteen connected subgraphs (counting the empty
graph as well) looks as follows. Here each arrow is the application of one
reduction operation.</p>
<p><img alt="All connected subgraphs of the cycle graph with four nodes" src="http://www.nowozin.net/sebastian/blog/images/rsearch-enumtree-1324.svg" /></p>
<p>Looking at the above tree, you can note the following:</p>
<ul>
<li>The graph <span class="math">\(1-4-2\)</span> has the highest vertex <span class="math">\(4\)</span> but this cannot be removed
because it would yield a disconnected subgraph; therefore the reduction
operation removes <span class="math">\(2\)</span> instead.</li>
<li>By construction, there is a unique path from every graph to the root.</li>
<li>By construction only connected subgraphs are present in the tree, and each
such graph is present exactly once.</li>
</ul>
<p>In order to enumerate all connected subgraphs, we have to <em>invert</em> the
arrows of this graph. That is, we have to invert the reduction operation and
given a graph we have to generate all child nodes in the reversed graph.
This reversal is what gives <em>reverse search</em> its name.</p>
<p>The inverse operation is described as follows: ``given a connected subgraph,
add a vertex which will become the highest-index vertex and whose removal
retains a connected graph''. This is quite a mouthful but luckily
the actual implementation is simple.</p>
<p>Here is a <a href="http://julialang.org/">Julia</a> implementation.</p>
<div class="highlight"><pre><span></span><span class="k">using</span> <span class="n">LightGraphs</span>
<span class="n">is_connected1</span><span class="p">(</span><span class="n">g</span><span class="p">::</span><span class="n">Graph</span><span class="p">)</span> <span class="o">=</span> <span class="n">nv</span><span class="p">(</span><span class="n">g</span><span class="p">)</span> <span class="o"><=</span> <span class="mi">1</span> <span class="o">?</span> <span class="n">true</span> <span class="p">:</span> <span class="n">is_connected</span><span class="p">(</span><span class="n">g</span><span class="p">)</span>
<span class="n">is_removable</span><span class="p">(</span><span class="n">g</span><span class="p">::</span><span class="n">Graph</span><span class="p">,</span> <span class="n">vset</span><span class="p">::</span><span class="n">IntSet</span><span class="p">,</span> <span class="n">rmv</span><span class="p">)</span> <span class="o">=</span>
<span class="n">is_connected1</span><span class="p">(</span><span class="n">induced_subgraph</span><span class="p">(</span><span class="n">g</span><span class="p">,</span> <span class="n">setdiff</span><span class="p">(</span><span class="n">vset</span><span class="p">,</span> <span class="n">rmv</span><span class="p">)))</span>
<span class="n">rm_vertex</span><span class="p">(</span><span class="n">g</span><span class="p">::</span><span class="n">Graph</span><span class="p">,</span> <span class="n">vset</span><span class="p">::</span><span class="n">IntSet</span><span class="p">)</span> <span class="o">=</span>
<span class="n">maximum</span><span class="p">(</span><span class="n">filter</span><span class="p">(</span><span class="n">rmv</span> <span class="o">-></span> <span class="n">is_removable</span><span class="p">(</span><span class="n">g</span><span class="p">,</span> <span class="n">vset</span><span class="p">,</span> <span class="n">rmv</span><span class="p">),</span> <span class="n">vset</span><span class="p">))</span>
<span class="k">function</span><span class="nf"> connsubgraphs</span><span class="p">(</span><span class="n">g</span><span class="p">::</span><span class="n">Graph</span><span class="p">)</span>
<span class="k">function</span><span class="nf"> _connsubgraphs</span><span class="p">(</span><span class="n">vset</span><span class="p">::</span><span class="n">IntSet</span><span class="p">)</span>
<span class="n">produce</span><span class="p">(</span><span class="n">copy</span><span class="p">(</span><span class="n">vset</span><span class="p">))</span> <span class="c"># output current subgraph vertex set</span>
<span class="c"># Generate child nodes of the current subgraph.</span>
<span class="c"># Consider all vertices not yet in graph</span>
<span class="k">for</span> <span class="n">add_vi</span> <span class="o">=</span> <span class="n">filter</span><span class="p">(</span><span class="n">v</span> <span class="o">-></span> <span class="o">!</span><span class="k">in</span><span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">vset</span><span class="p">),</span> <span class="n">vertices</span><span class="p">(</span><span class="n">g</span><span class="p">))</span>
<span class="n">push!</span><span class="p">(</span><span class="n">vset</span><span class="p">,</span> <span class="n">add_vi</span><span class="p">)</span> <span class="c"># Add new vertex</span>
<span class="k">if</span> <span class="n">is_connected1</span><span class="p">(</span><span class="n">induced_subgraph</span><span class="p">(</span><span class="n">g</span><span class="p">,</span> <span class="n">vset</span><span class="p">))</span> <span class="o">&&</span>
<span class="n">add_vi</span> <span class="o">==</span> <span class="n">rm_vertex</span><span class="p">(</span><span class="n">g</span><span class="p">,</span> <span class="n">vset</span><span class="p">)</span>
<span class="c"># Recurse</span>
<span class="n">_connsubgraphs</span><span class="p">(</span><span class="n">vset</span><span class="p">)</span>
<span class="k">end</span>
<span class="n">setdiff!</span><span class="p">(</span><span class="n">vset</span><span class="p">,</span> <span class="n">add_vi</span><span class="p">)</span> <span class="c"># Remove new vertex</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="k">function</span><span class="nf"> _connsubgraphs</span><span class="p">()</span>
<span class="n">_connsubgraphs</span><span class="p">(</span><span class="n">IntSet</span><span class="p">())</span>
<span class="k">end</span>
<span class="n">Task</span><span class="p">(</span><span class="n">_connsubgraphs</span><span class="p">)</span>
<span class="k">end</span>
<span class="n">g</span> <span class="o">=</span> <span class="n">Graph</span><span class="p">(</span><span class="mi">4</span><span class="p">)</span>
<span class="n">add_edge!</span><span class="p">(</span><span class="n">g</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span>
<span class="n">add_edge!</span><span class="p">(</span><span class="n">g</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">add_edge!</span><span class="p">(</span><span class="n">g</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">add_edge!</span><span class="p">(</span><span class="n">g</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">4</span><span class="p">)</span>
<span class="n">S</span> <span class="o">=</span> <span class="n">collect</span><span class="p">(</span><span class="n">connsubgraphs</span><span class="p">(</span><span class="n">g</span><span class="p">))</span>
</pre></div>
<p>Note the key statements between the <code>push!</code> and <code>setdiff!</code> lines that
govern the recursion.
In the if-condition we check that the new graph remains connected and that the
added vertex is exactly the one the reduction operation would remove.</p>
<p>The above code uses the Julia
<a href="http://julia.readthedocs.org/en/latest/manual/control-flow/#tasks-aka-coroutines">producer-consumer</a>
pattern.
When run, it produces the following output, identical to the above diagram.</p>
<div class="highlight"><pre><span></span><span class="mi">14</span><span class="o">-</span><span class="n">element</span> <span class="n">Array</span><span class="p">{</span><span class="kt">Any</span><span class="p">,</span><span class="mi">1</span><span class="p">}:</span>
<span class="n">IntSet</span><span class="p">([])</span>
<span class="n">IntSet</span><span class="p">([</span><span class="mi">1</span><span class="p">])</span>
<span class="n">IntSet</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">])</span>
<span class="n">IntSet</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">])</span>
<span class="n">IntSet</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">])</span>
<span class="n">IntSet</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">])</span>
<span class="n">IntSet</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">4</span><span class="p">])</span>
<span class="n">IntSet</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">4</span><span class="p">])</span>
<span class="n">IntSet</span><span class="p">([</span><span class="mi">2</span><span class="p">])</span>
<span class="n">IntSet</span><span class="p">([</span><span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">])</span>
<span class="n">IntSet</span><span class="p">([</span><span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">])</span>
<span class="n">IntSet</span><span class="p">([</span><span class="mi">2</span><span class="p">,</span> <span class="mi">4</span><span class="p">])</span>
<span class="n">IntSet</span><span class="p">([</span><span class="mi">3</span><span class="p">])</span>
<span class="n">IntSet</span><span class="p">([</span><span class="mi">4</span><span class="p">])</span>
</pre></div>
<h2>Conclusion</h2>
<p>Reverse search is a general recipe to construct tree-structured enumeration
methods useful for enumerating combinatorial sets and optimization over them.</p>
<p>In fact, it is so useful that some authors have reinvented reverse search
without noticing. For example, the popular <a href="http://cs.ucsb.edu/~xyan/papers/gSpan-short.pdf">gSpan
algorithm</a> of Yan and Han,
published in 2002, defines a clever total ordering on labeled graphs
essentially in order to be able to define the reduction operation needed in
reverse search.</p>
<p>So, check it out, the <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.26.4487&rep=rep1&type=pdf">Avis and Fukuda
paper</a>
is very rich and well worth a read! (If you prefer a different presentation
similar to the one above but more technical, have a look at my PhD thesis.)</p>
<p><em>Acknowledgements</em>. I thank <a href="http://tsudalab.org/en/member/koji_tsuda/">Koji
Tsuda</a> for reading a draft version
of the article and providing feedback.</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = '//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: 'center'," +
" displayIndent: '0em'," +
" showMathMenu: true," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" +
" } " +
"}); ";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Stochastic Computation Graphs2015-07-24T22:00:00+02:00Sebastian Nowozintag:www.nowozin.net,2015-07-24:sebastian/blog/stochastic-computation-graphs.html<p>This post is about a recent arXiv submission entitled
<a href="http://arxiv.org/abs/1506.05254">Gradient Estimation Using Stochastic Computation
Graphs</a>, and authored by
<a href="http://www.eecs.berkeley.edu/~joschu/">John Schulman</a>,
Nicolas Heess,
<a href="http://thphn.com/">Theophane Weber</a>, and
<a href="http://www.cs.berkeley.edu/~pabbeel/">Pieter Abbeel</a>.</p>
<p>In a nutshell this paper generalizes the <a href="https://en.wikipedia.org/wiki/Backpropagation">backpropagation
algorithm</a> to allow
<em>differentiation through expectations</em>, that is, to compute unbiased estimates
of</p>
<p>
<div class="math">$$\frac{\partial}{\partial \theta} \mathbb{E}_{x \sim q(x|\theta)}[f(x,\theta)].$$</div>
</p>
<p>The paper also provides a nice calculus on directed graphs that allows quick
derivation of unbiased gradient estimates.
The basic technical results in the paper have been known and used in various
communities before and the arXiv submission properly discusses these.</p>
<p>But dismissing the paper as not novel would miss the point, in much the same
way as stating that backpropagation is ``just an application
of the chain rule of differentiation''.
Instead, the contribution of the current paper is in the practical utility of
the graphical calculus and a rich catalogue of machine learning problems where
the computation of unbiased gradients of expectations is useful.</p>
<p>In typical statistical point estimation tasks,
<a href="https://en.wikipedia.org/wiki/Bias_of_an_estimator"><em>unbiasedness</em></a> is often
less important than the expected risk.
Here, however, it is crucial.
This is because the applications where stochastic computation graphs are
useful involve <em>optimization</em> over <span class="math">\(\theta\)</span> and <a href="https://en.wikipedia.org/wiki/Stochastic_approximation">stochastic
approximation</a>
methods such as <a href="https://en.wikipedia.org/wiki/Stochastic_gradient_descent">stochastic gradient
methods</a> can only
be justified theoretically in the case of unbiased gradient estimates.</p>
<h3>A Neat Derivative Trick</h3>
<p>To get an idea of the flavour of derivatives involving expectations, let us
look at a simpler case explained in Section 2.1 of the paper.
Its proof also contains a neat trick worth knowing.
The setting is as above, but inside the expectation we have only <span class="math">\(f(x)\)</span> instead of <span class="math">\(f(x,\theta)\)</span>.
The ``trick'' is in the identity (obvious in retrospect),
<div class="math">$$\frac{\partial}{\partial \theta} p(x|\theta) =
p(x|\theta) \frac{\partial}{\partial \theta} \log p(x|\theta).$$</div>
</p>
<p>This allows us to establish
<div class="math">\begin{eqnarray}
\frac{\partial}{\partial \theta} \mathbb{E}_{x \sim p(x|\theta)}[f(x)]
& = & \frac{\partial}{\partial \theta} \int p(x|\theta) f(x) \,\textrm{d}x\nonumber\\
& = & \int \frac{\partial}{\partial \theta} p(x|\theta) f(x) \,\textrm{d}x\nonumber\\
& = & \int p(x|\theta) f(x) \frac{\partial}{\partial \theta} \log p(x|\theta) \,\textrm{d}x\nonumber\\
& = & \mathbb{E}_{x \sim p(x|\theta)}[f(x) \frac{\partial}{\partial \theta} \log p(x|\theta)].\nonumber
\end{eqnarray}</div>
</p>
<p>In this case the derivation was straightforward but for multiple expectations
a derivation based on this elementary definition of the expectation is
cumbersome and error-prone. Stochastic computation graphs allow a much
quicker derivation of the derivative.</p>
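<p>The identity can be checked exactly when the distribution has finitely many
outcomes, with no sampling at all. Here is a small sketch (my own illustration,
not code from the paper) that enumerates the two outcomes of a Bernoulli
variable with success probability <span class="math">\(\theta\)</span>:</p>

```python
# Exact check of the log-derivative ("score function") identity
#   d/dtheta E[f(x)] = E[f(x) * d/dtheta log p(x|theta)]
# for a Bernoulli(theta) variable, by enumerating both outcomes.
theta = 0.3
f = {0: 2.0, 1: 5.0}            # arbitrary cost per outcome
p = {0: 1.0 - theta, 1: theta}  # Bernoulli probabilities

# d/dtheta log p(x|theta) = x/theta - (1-x)/(1-theta)
score = {0: -1.0 / (1.0 - theta), 1: 1.0 / theta}

# Left-hand side: d/dtheta [(1-theta) f(0) + theta f(1)] = f(1) - f(0)
lhs = f[1] - f[0]

# Right-hand side: expectation of the estimator, computed by enumeration
rhs = sum(p[x] * f[x] * score[x] for x in (0, 1))

print(lhs, rhs)  # both approximately 3.0 (up to floating point)
```

<p>The same enumeration check works for any finite distribution; for continuous
distributions one replaces the sum by sampling, as in the examples below in the
paper.</p>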
<h2>Stochastic Computation Graphs</h2>
<p>Stochastic computation graphs are directed acyclic graphs that encode the
dependency structure of computation to be performed. The graphical notation
generalizes directed graphical models.
Here is an example graph.</p>
<p><img alt="Stochastic computation graph of problem (1) in Schulman et al." src="http://www.nowozin.net/sebastian/blog/images/stochastic-computation-graphs-example1.svg" /></p>
<p>There are three (or four) types of nodes in a stochastic computation graph:</p>
<ol>
<li><em>Input nodes</em>. These are the fixed parameters we would like to compute the
derivative of. In the example graph this is the <span class="math">\(\theta\)</span> node; input nodes are
drawn without any container. While technically it is possible to have graphs
without input nodes, in order to compute gradients the graph should include at
least one input node.</li>
<li><em>Deterministic nodes</em>. These compute a deterministic function of their
parents. In the above graph this is the case for the <span class="math">\(x\)</span> and <span class="math">\(f\)</span> nodes.</li>
<li><em>Stochastic nodes</em>. These nodes specify a random variable through a
distribution conditional on their parents. In the above graph this is true
for the <span class="math">\(y\)</span> node, and the circle mirrors the notation used in directed
graphical models.</li>
<li><em>Cost nodes</em>. These are a subset of the deterministic nodes in the graph
whose range is the real numbers. In the above graph the node <span class="math">\(f\)</span> is a cost
node. I draw them shaded; this is not the case in the original paper.</li>
</ol>
<p>The entire stochastic computation graph specifies a single objective function
whose arguments are the input nodes: the scalar objective is the sum of all
cost nodes, taken in expectation over all stochastic nodes in the graph.</p>
<p>Therefore the above graph has the objective function
<div class="math">$$F(\theta) = \mathbb{E}_{y \sim p(y|x(\theta))}[f(y)].$$</div>
</p>
<h3>Derivative Calculus</h3>
<p>The notation used in the paper is a bit heavy and (for my taste at least) a
bit too custom, but here it is.
Let <span class="math">\(\Theta\)</span> be the set of input nodes, <span class="math">\(\mathcal{C}\)</span> the set of cost nodes,
and <span class="math">\(\mathcal{S}\)</span> be the set of stochastic nodes.
The notation <span class="math">\(u \prec v\)</span> denotes that there exists a directed path from <span class="math">\(u\)</span> to
<span class="math">\(v\)</span> in the graph.
The notation <span class="math">\(u \prec^D v\)</span> denotes that there exists a path whose nodes are all
deterministic with the exception of the last node <span class="math">\(v\)</span>, which may be of any
type.
We write <span class="math">\(\hat{c}\)</span> for a sample realization of a cost node <span class="math">\(c\)</span>.
The final notation needed for the result is
<div class="math">$$\textrm{DEPS}_v = \{ w \in \Theta \cup \mathcal{S} | w \prec^D v\}.$$</div>
</p>
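<p>The relation <span class="math">\(\prec^D\)</span> and the sets <span class="math">\(\textrm{DEPS}_v\)</span> can be computed
mechanically for small graphs. Here is a small sketch (my own illustration,
not code from the paper) for the example graph above, where <span class="math">\(x\)</span> and <span class="math">\(f\)</span> are
deterministic and <span class="math">\(y\)</span> is stochastic:</p>

```python
# Compute DEPS_v = { w in inputs ∪ stochastic | w ≺^D v } for a small
# stochastic computation graph. A path w -> ... -> v qualifies when all
# intermediate nodes are deterministic (w and v themselves may be any type).
edges = {"theta": ["x"], "x": ["y"], "y": ["f"], "f": []}
kind = {"theta": "input", "x": "det", "y": "stoch", "f": "det"}

def reaches_via_det(w, v):
    """True if there is a path w -> ... -> v with deterministic interior."""
    stack, seen = [w], set()
    while stack:
        u = stack.pop()
        for nxt in edges[u]:
            if nxt == v:
                return True
            if kind[nxt] == "det" and nxt not in seen:  # interior must be det
                seen.add(nxt)
                stack.append(nxt)
    return False

def deps(v):
    return {w for w in edges
            if kind[w] in ("input", "stoch") and reaches_via_det(w, v)}

print(deps("f"))  # {'y'}: theta is blocked by the stochastic node y
print(deps("y"))  # {'theta'}: the path theta -> x -> y has deterministic interior
```

<p>Note that <span class="math">\(\theta \prec^D f\)</span> does <em>not</em> hold because the stochastic node <span class="math">\(y\)</span>
sits on the only path; this is exactly the situation in which part (B) of the
gradient formula contributes nothing.</p>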
<p>The key result of the paper, Theorem 1, is now stated as follows:
<div class="math">$$\frac{\partial}{\partial \theta} \mathbb{E}\left[\sum_{c \in \mathcal{C}} c\right]
= \mathbb{E}\Bigg[\underbrace{\sum_{w \in \mathcal{S}, \theta \prec^D w} \left(
\frac{\partial}{\partial \theta} \log p(w|\textrm{DEPS}_w)
\right) \sum_{c \in \mathcal{C}, w \prec c} \hat{c}}_{\textrm{(A)}}
+ \underbrace{\sum_{c \in \mathcal{C}, \theta \prec^D c} \frac{\partial}{\partial \theta}
c(\textrm{DEPS}_c)}_{\textrm{(B)}}\Bigg].$$</div>
</p>
<p>The two parts, (A) and (B), can be interpreted as follows. If we only have
deterministic computation, so that <span class="math">\(\mathcal{S} = \emptyset\)</span>, as in an ordinary
feedforward neural network for example, then part (B) is just the ordinary
derivative and we have to apply the chain rule to that expression.
Part (A) originates from the stochastic nodes, and the consequences that
follow from the stochastic nodes are absorbed in the sample realizations
<span class="math">\(\hat{c}\)</span>.</p>
<p>It takes a bit of practice to apply Theorem 1 quickly to a given graph; I
found it easier to instead execute, manually on a piece of paper,
Algorithm 1 of the paper, which generalizes backpropagation and builds the
derivative node by node by traversing the graph backwards.</p>
<h2>Example</h2>
<p>To understand the basic technique I illustrate the stochastic computation
graph technique on the concrete graph above, which is problem (1) in the paper
(Section 2.3), but I make the example concrete.</p>
<p><img alt="Stochastic computation graph of problem (1) in Schulman et al." src="http://www.nowozin.net/sebastian/blog/images/stochastic-computation-graphs-example1.svg" /></p>
<p>
<div class="math">$$x(\theta) = (\theta-1)^2,$$</div>
<div class="math">$$y(x) \sim \mathcal{N}(x,1),$$</div>
<div class="math">$$f(y) = \left(y-\frac{5}{2}\right)^2.$$</div>
</p>
<p>Before we apply Theorem 1 to the graph, here is how the problem actually
looks. First, the objective is <span class="math">\(F(\theta) = \mathbb{E}_{y \sim p(y|x(\theta))}[f(y)]\)</span>.
This objective is just an ordinary one-dimensional deterministic function.</p>
<p><img alt="True objective to be minimized" src="http://www.nowozin.net/sebastian/blog/images/stochastic-computation-graphs-Ef.svg" /></p>
<p>The true gradient of the objective is also just an ordinary function.
You can see three zero-crossings at approximately -0.6, 1, and 2.6,
corresponding to two local minima and a saddle-point of the objective function.</p>
<p><img alt="True gradient of objective" src="http://www.nowozin.net/sebastian/blog/images/stochastic-computation-graphs-grad.svg" /></p>
<p>For this simple example we can find a closed form expression for <span class="math">\(F(\theta)\)</span>,
but for general stochastic computation graphs we are not able to evaluate
<span class="math">\(F(\theta)\)</span> and instead only sample values <span class="math">\(\hat{F}_1, \hat{F}_2, \dots\)</span> which
are unbiased estimates of the true <span class="math">\(F(\theta)\)</span>.
By taking averages of a few samples, say 100, we can improve the
accuracy of our estimates.
In order to minimize <span class="math">\(F(\theta)\)</span> over <span class="math">\(\theta\)</span> our goal is to sample unbiased
gradients as well.
The unbiased sample gradients look as follows, for <span class="math">\(1\)</span> sample (shown in green)
and for averages of <span class="math">\(100\)</span> samples (shown in red), evaluated at <span class="math">\(100\)</span> points
equispaced along the <span class="math">\(\theta\)</span> axis shown.</p>
<p><img alt="Sample gradient of objective" src="http://www.nowozin.net/sebastian/blog/images/stochastic-computation-graphs-gradsample.svg" /></p>
<p>To derive the unbiased gradient estimate we apply Theorem 1.
From the summation (A) we will only have one term because our graph contains
only one stochastic node, namely <span class="math">\(y\)</span>.
We will not have any term from (B) as there is no deterministic path from
<span class="math">\(\theta\)</span> to <span class="math">\(f\)</span>.
Therefore we have</p>
<p>
<div class="math">$$\frac{\partial}{\partial \theta} \mathbb{E}_{y \sim p(y|x(\theta))}[f(y)]
= \mathbb{E}_{y \sim p(y|x(\theta))}\left[\frac{\partial}{\partial \theta}
\log p(y|x(\theta)) \hat{f}\right].$$</div>
</p>
<p>For the logarithm we need to differentiate the log-likelihood of the Normal
distribution and compute
<div class="math">\begin{eqnarray}
\frac{\partial x}{\partial \theta} \frac{\partial}{\partial x} \log p(y|x(\theta))
& = &
\frac{\partial x}{\partial \theta} \frac{\partial}{\partial x} \left[
- \frac{(y-x(\theta))^2}{2} - \frac{1}{2} \log 2\pi \right]\nonumber\\
& = &
\frac{\partial x}{\partial \theta} (y-x(\theta))\nonumber\\
& = &
2(\theta - 1)(y - x(\theta)).\nonumber
\end{eqnarray}</div>
</p>
<p>So the overall unbiased gradient estimator is
<div class="math">$$\mathbb{E}\left[\frac{\partial}{\partial \theta} \log p(y|x(\theta)) \hat{f}\right]
= \mathbb{E}[2(\theta-1)(\hat{y}-\hat{x}) \hat{f}].$$</div>
And the last expression in the expectation is the estimate for a single sample
realization.</p>
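<p>For this toy problem the closed form is available, namely
<span class="math">\(F(\theta) = 1 + ((\theta-1)^2 - \frac{5}{2})^2\)</span>, so the single-sample
estimator can be validated against the true gradient. Here is a sketch (my own
illustration; function names are made up):</p>

```python
import random

def sample_gradient(theta, rng):
    """Single-sample unbiased gradient estimate 2(theta-1)(y-x) f-hat."""
    x = (theta - 1.0) ** 2
    y = rng.gauss(x, 1.0)                 # y ~ N(x, 1)
    f = (y - 2.5) ** 2                    # sample realization f-hat
    return 2.0 * (theta - 1.0) * (y - x) * f

def true_gradient(theta):
    """Closed-form gradient of F(theta) = 1 + ((theta-1)^2 - 5/2)^2,
    available only because this toy problem is analytically tractable."""
    x = (theta - 1.0) ** 2
    return 2.0 * (x - 2.5) * 2.0 * (theta - 1.0)

rng = random.Random(0)
theta = 2.5
n = 200_000
mc = sum(sample_gradient(theta, rng) for _ in range(n)) / n
print(true_gradient(theta), mc)  # -1.5 and a nearby Monte Carlo average
```

<p>The roots of the closed-form gradient, <span class="math">\(\theta = 1 \pm \sqrt{5/2}\)</span> and
<span class="math">\(\theta = 1\)</span>, match the zero-crossings near <span class="math">\(-0.6\)</span>, <span class="math">\(1\)</span>, and <span class="math">\(2.6\)</span> in the
plots above.</p>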
<h2>Variational Bayesian Neural Networks</h2>
<p>One important application of being able to compute gradients of expectation
objectives is the approximate variational Bayesian posterior inference of
neural network parameters.</p>
<p>The original pioneering work of applying variational Bayes (aka mean field
inference) to neural network learning is <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.119.5194">this 1993 paper of Hinton and van
Camp</a>.
Recently this has made a revival in particular through the appearance of
<a href="http://www.cs.princeton.edu/courses/archive/fall11/cos597C/reading/Blei2011.pdf">stochastic variational inference
methods</a>
around 2011, including a <a href="http://papers.nips.cc/paper/4329-practical-variational-inference-for-neural-networks-spotlight.pdf">paper of Alex
Graves</a>.
Many works followed up on this lead, for example <a href="http://dpkingma.com/wordpress/wp-content/uploads/2014/10/iclr14_vae.pdf">Kingma and
Welling</a>,
<a href="http://arxiv.org/abs/1401.4082">Rezende et al., ICML 2014</a>, <a href="http://arxiv.org/abs/1505.05424">Blundell et al.,
ICML 2015</a>, and <a href="http://arxiv.org/abs/1402.0030">Mnih and
Gregor</a>. They use different estimators of the
gradient with varying quality and the SCG paper provides a nice overview of
the bigger picture.</p>
<p>In any case, here is a visualization of prototypical variational Bayes
learning for feedforward neural networks. A normal feedforward neural network
training objective yields the following computation graph, without any
stochastic nodes.</p>
<p><img alt="Feedforward neural network training objective computation graph" src="http://www.nowozin.net/sebastian/blog/images/scg-nn1.svg" /></p>
<p>Here we have a fixed weight vector <span class="math">\(w\)</span> with a regularizer <span class="math">\(R(w)\)</span>.
We have <span class="math">\(n\)</span> training instances and each input <span class="math">\(x_i\)</span> produces a network output,
<span class="math">\(P_i(x_i,w)\)</span>, for example a distribution over class labels. Together with a
known ground truth label <span class="math">\(y_i\)</span> this yields a loss <span class="math">\(\ell_i(P_i,y_i)\)</span>, for
example the cross-entropy loss.
If we use a likelihood based loss and a regularizer derived from a prior, i.e.
<span class="math">\(R(w)=-\log P(w)\)</span> the training objective becomes just regularized maximum
likelihood estimation.</p>
<p>
<div class="math">$$F(w) = -\log P(w) - \sum_{i=1}^n \log P(y_i|x_i;w).$$</div>
</p>
<p>The <a href="https://en.wikipedia.org/wiki/Variational_Bayesian_methods">variational
Bayes</a> training
objective yields the following slightly extended <em>stochastic</em> computation
graph.</p>
<p><img alt="Variational Bayes neural network training objective stochastic computation graph" src="http://www.nowozin.net/sebastian/blog/images/scg-nn2.svg" /></p>
<p>Here <span class="math">\(w\)</span> is still a network parameter, but it is now a stochastic vector, <span class="math">\(w
\sim Q(w|\theta)\)</span> and <span class="math">\(\theta\)</span> becomes the parameter we would like to learn.
The additional cost node <span class="math">\(H\)</span> arises from the entropy of the approximating
posterior distribution <span class="math">\(Q\)</span>. (An interesting detail: in principle we would not
need an arrow <span class="math">\(w \to H\)</span> because we can compute <span class="math">\(H(Q)\)</span>. However, if we allow
this arrow, then we can use a Monte Carlo approximation of the entropy for
approximating families which do not have an analytic entropy expression.)
The training objective becomes:</p>
<p>
<div class="math">$$F(\theta) = \mathbb{E}_{w \sim Q(w|\theta)}\left[-\log P(w) + \log Q(w|\theta) - \sum_{i=1}^n \log P(y_i|x_i;w)\right].$$</div>
</p>
<p>The stochastic computation graph rules can now be used to derive the unbiased
gradient estimate.</p>
<p>
<div class="math">$$\frac{\partial}{\partial \theta} F(\theta) =
\mathbb{E}_{w \sim Q(w|\theta)}\left[
\frac{\partial}{\partial \theta} \log Q(w|\theta) \left(
-\log P(w) + \log Q(w|\theta) - \sum_{i=1}^n \log P(y_i|x_i;w)
\right)\right].$$</div>
</p>
<p>This is now quite practical: the expectation can be approximated using simple
Monte Carlo samples of <span class="math">\(w\)</span> values using the current approximating posterior
<span class="math">\(Q(w|\theta)\)</span>. Because the gradient is unbiased we can improve the
approximation by running standard stochastic gradient methods.</p>
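<p>To see this estimator in action without a full network, consider a
deliberately tiny instance (my own toy example, not from the paper): no data
(<span class="math">\(n = 0\)</span>), prior <span class="math">\(P(w) = \mathcal{N}(0,1)\)</span>, and posterior family
<span class="math">\(Q(w|\theta) = \mathcal{N}(\theta,1)\)</span>. Then
<span class="math">\(F(\theta) = \textrm{KL}(Q\|P) = \theta^2/2\)</span> with gradient <span class="math">\(\theta\)</span>, so the
score-function estimator can be checked against the known answer:</p>

```python
import random

def grad_sample(theta, rng):
    """Single-sample score-function gradient of
    F(theta) = E_{w ~ N(theta,1)}[ -log P(w) + log Q(w|theta) ]
    with P = N(0,1); the log-normalizers cancel inside the bracket."""
    w = rng.gauss(theta, 1.0)
    score = w - theta                                 # d/dtheta log N(w|theta,1)
    bracket = 0.5 * w * w - 0.5 * (w - theta) ** 2    # -log P(w) + log Q(w|theta)
    return score * bracket

rng = random.Random(1)
theta = 1.0
n = 100_000
mc = sum(grad_sample(theta, rng) for _ in range(n)) / n
print(mc)  # close to the exact gradient, theta = 1.0
```

<p>With likelihood terms included, the bracket simply gains the
<span class="math">\(-\sum_i \log P(y_i|x_i;w)\)</span> summand; the structure of the estimator is
unchanged.</p>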
<h2>Additional Applications</h2>
<p>The paper contains a large number of machine learning applications, but there
are many others.
Here is one I find useful.</p>
<p><img alt="Experimental design stochastic computation graph" src="http://www.nowozin.net/sebastian/blog/images/scg-experimental-design.svg" /></p>
<p><em>Experimental design.</em> In <a href="http://eu.wiley.com/WileyCDA/WileyTitle/productCd-047149657X.html">Bayesian experimental
design</a> we
make a choice that influences our future measurements, and we would like to
make these choices in such a way that we maximize the future expected
utility or minimize the expected loss.</p>
For this we use a model of how our choices relate to the information we will
capture and to how valuable that information will be. Because this is just
decision theory and the idea is general, let me be more concrete. Let us
assume the objective function
<div class="math">$$\mathbb{E}_{z \sim p(z)}[\mathbb{E}_{x \sim p(x|z,\theta)}[\ell(\tilde{z}(x,\theta), z)]].$$</div>
Here <span class="math">\(\theta\)</span> is our design parameter, <span class="math">\(z\)</span> is the true state we are interested
in with a prior <span class="math">\(p(z)\)</span>.
The measurement process produces <span class="math">\(x \sim p(x|z,\theta)\)</span>.
We have an estimator <span class="math">\(\tilde{z}(x,\theta)\)</span> and a loss function which compares
the estimated value against the true state.
The full objective function is then the expected loss of our estimator
<span class="math">\(\tilde{z}\)</span> as a function of the design parameters <span class="math">\(\theta\)</span>.
The above expression looks a bit convoluted but this structure appears
frequently when the type of information that is collected can be controlled.
One example application of this: <span class="math">\(z\)</span> could represent user behaviour and
<span class="math">\(\theta\)</span> some subset of questions we could ask that user to learn more about
their behaviour. We then assume a model <span class="math">\(p(x|z,\theta)\)</span> of how the user would
provide answers <span class="math">\(x\)</span> given questions <span class="math">\(\theta\)</span> and behaviour <span class="math">\(z\)</span>. This allows
us to build an estimator <span class="math">\(\tilde{z}(x,\theta)\)</span>. The design objective then tries
to find the most informative set of questions to ask.</p>
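<p>To make the structure concrete, consider a toy instance of my own (not from
the paper): <span class="math">\(z \sim \mathcal{N}(0,1)\)</span>, a measurement
<span class="math">\(x \sim \mathcal{N}(z, \sigma^2)\)</span> whose noise level <span class="math">\(\sigma\)</span> plays the role of
the design parameter <span class="math">\(\theta\)</span>, the posterior mean
<span class="math">\(\tilde{z}(x) = x/(1+\sigma^2)\)</span> as estimator, and squared loss. The expected
loss then has the closed form <span class="math">\(\sigma^2/(1+\sigma^2)\)</span>, against which a Monte
Carlo evaluation of the design objective can be checked:</p>

```python
import random

def design_objective_mc(sigma, n, rng):
    """Monte Carlo estimate of E_z E_{x|z}[ (ztilde(x) - z)^2 ] for
    prior z ~ N(0,1), measurement x ~ N(z, sigma^2), and the
    posterior-mean estimator ztilde(x) = x / (1 + sigma^2)."""
    shrink = 1.0 / (1.0 + sigma * sigma)
    total = 0.0
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        x = rng.gauss(z, sigma)
        total += (shrink * x - z) ** 2
    return total / n

rng = random.Random(0)
sigma = 1.0
mc = design_objective_mc(sigma, 100_000, rng)
exact = sigma**2 / (1.0 + sigma**2)  # posterior variance, here 0.5
print(exact, mc)  # 0.5 and a nearby Monte Carlo estimate
```

<p>In a real design problem no closed form is available, and one would instead
differentiate the Monte Carlo objective with respect to the design parameters
using the stochastic computation graph rules above.</p>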
<p><em>Acknowledgements</em>. I thank <a href="http://ei.is.tuebingen.mpg.de/person/mschober">Michael
Schober</a> for discussions about
the paper and Nicolas Heess for feedback on this article.</p>
Multilevel Splitting2015-07-10T22:50:00+02:00Sebastian Nowozintag:www.nowozin.net,2015-07-10:sebastian/blog/multilevel-splitting.html<p>This article is about <em>multilevel splitting</em>, a method for estimating the
probability of rare events.</p>
<p>Estimating the probability of <em>rare events</em> is important in many fields.
One vivid example is in the study of reliability of systems; imagine for
example, that we are responsible for building a mechanical structure such as a
bridge and we aim to design it to last one hundred years.
To provide any kind of guarantee we need to have a model of what could happen
in these 100 years, for example how the bridge will be used during that time,
what weight it will have to bear, how strong winds and floods may be, how
corrosion and other processes deteriorate the structure, etc.
Considering all these factors may only be possible approximately via a
<em>simulation</em> of the structure under different effects.</p>
<p>For concreteness let's say we denote by <span class="math">\(X\)</span> the random variable that represents the
maximum force that is applied to the bridge during the 100 years lifetime.
Each simulation allows us to obtain a sample <span class="math">\(X_i \sim P\)</span> of this force, where
<span class="math">\(P\)</span> is a probabilistic model of everything that can happen during the 100
years.
Given that we designed the bridge to withstand a certain force, the task
is now to make statements of the form
<div class="math">$$P(X \geq \delta) \leq \epsilon.$$</div>
Often we want the probability of something bad happening (the event <span class="math">\(X \geq
\delta\)</span>) to be exceptionally small, say <span class="math">\(\epsilon = 10^{-9}\)</span>.</p>
<p>Another common example is the <a href="http://www.nowozin.net/sebastian/blog/bayesian-p-values.html">computation of P-values</a>,
where we observe a sample <span class="math">\(x\)</span> and compute a <em>test statistic</em> <span class="math">\(t=T(x)\)</span>. Given
a <em>null model</em> in the form of a distribution <span class="math">\(P(X)\)</span> we are interested in the
<em>P-value</em>, that is, the probability <span class="math">\(P(T(X) \geq t)\)</span>. This
number is the probability under the null of observing a test statistic at
least as extreme as the one actually observed.
Using the multilevel splitting idea we can hope to accurately compute the
P-value as long as we can run an MCMC chain on the null model.
Also, more general P-values for composite null models, such as the <a href="http://www.nowozin.net/sebastian/blog/bayesian-p-values.html"><em>posterior
predictive P-value</em></a> are computable.
So if this sounds good, how does multilevel splitting work and why is it
needed in the first place?</p>
<p>In the absence of an analytic form for <span class="math">\(P\)</span>, a naive simulation approach is to
repeatedly draw samples <span class="math">\(X_i \sim P\)</span> and to count how often the bad event
happens. For rare events such as the one above this does not work well: if
we were to exactly meet the guarantee of the above example, <span class="math">\(\epsilon =
10^{-9}\)</span>, then we would on average have to draw around <span class="math">\(1/\epsilon = 10^9\)</span>
samples just to see a single bad event.
And because we want not merely to observe the bad event but to estimate its
probability, we need even more samples.</p>
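<p>This failure mode is easy to reproduce. The following Python sketch (an
illustration, not code from the article) uses a standard normal for
<span class="math">\(P\)</span>:</p>

```python
import random

random.seed(1)

def naive_rare_event_estimate(delta, n):
    """Naive Monte Carlo: draw X_i ~ N(0,1) and count how often X_i >= delta."""
    hits = sum(1 for _ in range(n) if random.gauss(0.0, 1.0) >= delta)
    return hits / n

# For a moderately rare event this still works (true probability ~ 1.35e-3):
print(naive_rare_event_estimate(3.0, 1_000_000))
# For delta = 16.5 (true probability ~ 1.8e-61) every draw misses and the
# estimate is a useless zero:
print(naive_rare_event_estimate(16.5, 1_000_000))
```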
<p>There are a number of custom methods for accurate estimation of rare event
probabilities. The remainder of the article discusses <em>multilevel splitting</em>,
but at this point I would like to mention that another popular set of methods
for rare events is based on adaptive importance sampling which is described in
detail in <a href="http://eu.wiley.com/WileyCDA/WileyTitle/productCd-0470177942.html">Rubinstein and Kroese's
book on Monte Carlo
methods</a>.</p>
<h1>Multilevel Splitting</h1>
<p><a href="http://en.wikipedia.org/wiki/John_von_Neumann">John von Neumann</a> had an idea
better than naive simulation on how to address the problem of estimating rare
event probabilities. He named his solution <em>multilevel splitting</em>. The first
published description of multilevel splitting is due to <a href="https://dornsifecms.usc.edu/assets/sites/520/docs/kahnharris.pdf">Kahn and Harris in
1951</a> (who
attribute it to John von Neumann).</p>
<p>The basic idea of multilevel splitting is to steer an iterative simulation
process towards the rare event region by removing samples far away from the
rare event and <em>splitting</em> samples closer to the rare event.</p>
<p>The application considered in the 1951 paper is interesting in this regard in
that it clearly relates to nuclear weapon research:</p>
<blockquote>
<p>"We wish to estimate the probability that a particle is transmitted through a
shield, when this probability is of the order of <span class="math">\(10^{-6}\)</span> to <span class="math">\(10^{-10}\)</span>, and
we wish to do this by sampling about a thousand life histories."
...
"In one method of applying this, one defines regions of importance in the
space being studied, and, when the sampled particle goes from a less
important to a more important region, it is split into two independent
particles, each one-half the weight of the original."</p>
</blockquote>
<p>Back in 1951 the algorithm was somewhat ad hoc, but effective.
In a recent 2011 <a href="http://perso.univ-rennes2.fr/system/files/users/guyader_a/ghm2.pdf">paper by Guyader, Hengartner, and
Matzner-Lober</a>
the authors propose a more practical variant of the same idea and provide
theoretical results.</p>
<h2>Setup</h2>
<p>The general setup is as follows. We have a distribution <span class="math">\(P\)</span> defining our system.
We have <span class="math">\(X \in \mathcal{X}\)</span> for the realizations <span class="math">\(X \sim P\)</span>.
A continuous map <span class="math">\(\phi: \mathcal{X} \to \mathbb{R}\)</span> defines the quantity of interest.
We are interested in computing the probability <span class="math">\(P(\phi(X) \geq q)\)</span>.
To this end we assume we can approximately simulate from <span class="math">\(P\)</span> using a <a href="http://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo">Markov
chain</a>, which is
typically possible even in complex models.</p>
<p>The basic idea of the original 1951 algorithm is to fix a set of levels
<span class="math">\(-\infty = L_0 < L_1 < L_2 < \dots < L_k = q\)</span>. Then we can formally write
<div class="math">$$P(\phi(X) \geq q) = \prod_{i=1}^k P(\phi(X) \geq L_i \:|\: \phi(X) \geq L_{i-1}).$$</div>
</p>
<p>The above product can be estimated term-by-term as follows: we use a set of
<span class="math">\(N\)</span> particles <span class="math">\((X_1,\dots,X_N)\)</span> and simulate these according to <span class="math">\(X_i \sim P(X)\)</span>.
Then we estimate the fraction
<div class="math">$$P(\phi(X) \geq L_1 \:|\: \phi(X) \geq L_0) = P(\phi(X) \geq L_1)
\approx \frac{\sum_{i=1}^N 1_{\{\phi(X_i) \geq L_1\}}}{N}.$$</div>
Afterwards we discard all particles with <span class="math">\(\phi(X_i) < L_1\)</span> and use the remaining
particles to resample a set of <span class="math">\(N\)</span> particles (the <em>splitting</em>).
Finally, we update all particles using a number of steps of our MCMC kernel,
but this time restricted to <span class="math">\(\phi(X_i) \geq L_1\)</span>, that is, we reject all
proposals that would violate this condition.
This is one level, and for the multilevel scheme we repeat the above procedure
with the next level. Eventually, when we reach the final level <span class="math">\(L_k\)</span>, we take
the product of the estimated probabilities as the estimate of the rare event
probability. Upon reaching the final level the surviving particles are
properly distributed conditional on the restriction <span class="math">\(\phi(X) \geq q\)</span>.</p>
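<p>As an illustration, here is a small Python sketch of this fixed-ladder scheme
for a standard normal target with <span class="math">\(\phi(x) = x\)</span>; the level
ladder, particle count, and Metropolis kernel below are arbitrary choices for
the toy problem, not prescribed by the algorithm:</p>

```python
import math
import random

random.seed(2)

def multilevel_splitting(levels, n_particles=2000, mcmc_steps=20, step=0.5):
    """Fixed-ladder multilevel splitting estimate of P(X >= levels[-1]) for
    X ~ N(0,1) with phi(x) = x: estimate each conditional probability,
    resample (split) the survivors, then decorrelate them with Metropolis
    moves restricted to the current level set."""
    logpdf = lambda x: -0.5 * x * x  # standard normal log-density, up to a constant
    particles = [random.gauss(0.0, 1.0) for _ in range(n_particles)]
    estimate = 1.0
    for level in levels:
        survivors = [x for x in particles if x >= level]
        if not survivors:
            return 0.0
        # Estimate P(phi(X) >= L_i | phi(X) >= L_{i-1}) by the survivor fraction.
        estimate *= len(survivors) / n_particles
        # Split: resample N particles from the survivors ...
        particles = [random.choice(survivors) for _ in range(n_particles)]
        # ... and update each with Metropolis steps, rejecting proposals
        # that would fall below the current level.
        for i in range(n_particles):
            x = particles[i]
            for _ in range(mcmc_steps):
                y = x + random.gauss(0.0, step)
                if y >= level and math.log(random.random()) <= logpdf(y) - logpdf(x):
                    x = y
            particles[i] = x
    return estimate

p_hat = multilevel_splitting([1.0, 2.0, 3.0, 4.0])
print(p_hat)  # ground truth: P(X >= 4) is about 3.17e-5
```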
<p>The above algorithm is effective but has the major drawback of having to fix a
ladder of levels a priori. It would be more practical to have an automatic
method to create these levels, or to get rid of them entirely. The
algorithm of <em>Guyader et al.</em> achieves this automatic selection by keeping the
particles sorted according to <span class="math">\(\phi\)</span>, with the lowest particle defining the
current level, at the cost of giving the algorithm a random runtime.</p>
<p>The 2011 paper is quite rich in that it also contains an approximate
confidence interval for the true probability as well as an analysis of the
random runtime and an interesting application of estimating the false positive
rate of watermark detection schemes (which ideally should be very small).
Also, a variant of their method can solve for the <em>quantile</em>, that is, given
<span class="math">\(p\)</span> in <span class="math">\(p = P(\phi(X) \geq q)\)</span>, solve for <span class="math">\(q\)</span>.
(Unfortunately, as is often the case in statistics and applied mathematics
papers, the algorithm (Section 3.2) is presented less clearly than it would
be in a typical CS or ML paper.)</p>
<h2>Example</h2>
<p>The following is an implementation in the <a href="http://julialang.org/">Julia
language</a> that estimates <span class="math">\(P(X \geq 16.5)\)</span> where <span class="math">\(X \sim
\mathcal{N}(0,1)\)</span> is a standard Normal random variable.</p>
<div class="highlight"><pre><span></span><span class="k">using</span> <span class="n">Distributions</span>
<span class="n">N</span><span class="o">=</span><span class="mi">2000</span> <span class="c"># number of particles</span>
<span class="n">T</span><span class="o">=</span><span class="mi">10</span> <span class="c"># number of MCMC steps</span>
<span class="n">q</span><span class="o">=</span><span class="mf">16.5</span> <span class="c"># quantile</span>
<span class="n">target</span><span class="o">=</span><span class="n">Normal</span><span class="p">(</span><span class="mf">0.0</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">)</span>
<span class="n">K</span><span class="o">=</span><span class="n">Normal</span><span class="p">(</span><span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.2</span><span class="p">)</span> <span class="c"># Markov kernel</span>
<span class="n">m</span><span class="o">=</span><span class="mi">1</span>
<span class="n">X</span><span class="o">=</span><span class="n">sort</span><span class="p">(</span><span class="n">rand</span><span class="p">(</span><span class="n">target</span><span class="p">,</span> <span class="n">N</span><span class="p">))</span>
<span class="n">L</span><span class="o">=</span><span class="n">X</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="k">while</span> <span class="n">L</span> <span class="o"><</span> <span class="n">q</span> <span class="c"># as long as there are particles below q</span>
<span class="n">X</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">X</span><span class="p">[</span><span class="n">rand</span><span class="p">(</span><span class="mi">2</span><span class="p">:</span><span class="n">N</span><span class="p">)]</span>
<span class="c"># Run a Markov chain on the lowermost sample</span>
<span class="k">for</span> <span class="n">t</span><span class="o">=</span><span class="mi">1</span><span class="p">:</span><span class="n">T</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">X</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="n">rand</span><span class="p">(</span><span class="n">K</span><span class="p">)</span>
<span class="n">log_alpha</span> <span class="o">=</span> <span class="n">logpdf</span><span class="p">(</span><span class="n">target</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span> <span class="o">-</span> <span class="n">logpdf</span><span class="p">(</span><span class="n">target</span><span class="p">,</span> <span class="n">X</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
<span class="k">if</span> <span class="n">log</span><span class="p">(</span><span class="n">rand</span><span class="p">())</span> <span class="o"><=</span> <span class="n">log_alpha</span> <span class="o">&&</span> <span class="n">y</span> <span class="o">></span> <span class="n">L</span>
<span class="n">X</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">y</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">sort</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
<span class="n">L</span> <span class="o">=</span> <span class="n">X</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="n">m</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">end</span>
<span class="n">phat</span> <span class="o">=</span> <span class="p">(</span><span class="mf">1.0</span><span class="o">-</span><span class="mf">1.0</span><span class="o">/</span><span class="n">N</span><span class="p">)</span><span class="o">^</span><span class="p">(</span><span class="n">m</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="c"># Estimate, Truth</span>
<span class="n">phat</span><span class="p">,</span> <span class="n">ccdf</span><span class="p">(</span><span class="n">target</span><span class="p">,</span> <span class="n">q</span><span class="p">)</span>
</pre></div>
<p>Giving the output</p>
<div class="highlight"><pre><span></span>(1.4581487078794118e-61,1.8344630031647276e-61)
</pre></div>
<p>where the first number is the estimate and the second number is the ground
truth, known in this case analytically.
The relative estimation accuracy in this case is remarkable, given that this
event occurs on average only once every <span class="math">\(10^{61}\)</span> samples. For this
simulation a total of <span class="math">\(m=280,092\)</span> sample updates were performed before the
algorithm stopped.</p>
<h2>Conclusion</h2>
<p>Multilevel splitting is a useful algorithm for estimating the probability of
rare events and the recent algorithm of Guyader et al. is practical in that it
can be implemented on top of an arbitrary MCMC sampler.</p>
<p>There are caveats, however. In the above example, the problem structure is
almost ideal for the application of multilevel splitting: a slowly varying
continuous function <span class="math">\(\phi\)</span> whose level sets are topologically connected. This
means that the MCMC sampler can mix easily in the restricted subsets and the
resulting rare event probabilities can be accurately estimated.
If these assumptions are not satisfied the algorithm may fail to work, and
current research addresses these more general situations, see, for example
<a href="http://arxiv.org/abs/1507.00919">this recent paper by Walter</a>.</p>
<p>In summary, although some care is required for the application of multilevel
splitting to real problems it is likely to be orders of magnitude more
efficient than naive approaches.</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = '//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: 'center'," +
" displayIndent: '0em'," +
" showMathMenu: true," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" +
" } " +
"}); ";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Bayesian P-Values2015-06-27T00:15:00+02:00Sebastian Nowozintag:www.nowozin.net,2015-06-27:sebastian/blog/bayesian-p-values.html<p><a href="https://en.wikipedia.org/wiki/P-value">P-Values</a> (see also <a href="https://stat.duke.edu/~berger/p-values.html">Jim Berger's page
on p-values</a>) are probably one of
the most misunderstood concepts in statistics and certainly have been abused
in statistical practice.
Originally proposed as an informal diagnostic by <a href="http://en.wikipedia.org/wiki/Ronald_Fisher">Ronald Fisher</a>, the p-value
has earned its bad reputation for many reasons, and in many relevant situations
good alternatives such as <a href="http://amstat.tandfonline.com/doi/abs/10.1080/01621459.1995.10476572">Bayes
factors</a>
can and should be used instead.
One key objection to p-values is that although they provide statistical
evidence against an assumed hypothesis, this does not imply that the deviation
from the hypothesis is large or relevant.
In practice, the largest criticisms are not related to the p-value itself but
related to the widespread misunderstanding of p-values and the arbitrariness
of accepting formal tests of significance based on p-values in
scientific discourse.</p>
<p>In this article I am not going to defend p-values, also because others have
done a good job at giving a modern explanation of their benefits in context,
as well as refuting some common criticisms, for example the article <a href="http://www.dcscience.net/Senn-Two-cheers-2001.pdf">Two
cheers for P-values?</a> by
<a href="http://www.gla.ac.uk/schools/mathematicsstatistics/staff/stephensenn/">Stephen
Senn</a>
and the more recent <a href="http://science.oregonstate.edu/~murtaugh/files/ecol-2014.pdf">In defense of P
values</a> by <a href="http://stat.oregonstate.edu/content/murtaugh-paul">Paul
Murtaugh</a>.</p>
<p>Instead, I will consider a situation which often arises in practice.</p>
<h2>Setup</h2>
<p>Suppose you have decided on a probabilistic model <span class="math">\(P(X)\)</span> or <span class="math">\(P(X|\theta)\)</span>,
where <span class="math">\(\theta\)</span> is some unknown parameter.
With <em>decided</em> I mean that we actually commit and ship our model and we no
longer entertain alternative models.
Alternatives could be too expensive computationally or it could be too
difficult to accurately specify these alternative models. (For example, a
more complicated model may involve additional latent variables for which it is
difficult to elicit prior beliefs.)</p>
<p>Given such a model but no assumed alternative, and some observed data <span class="math">\(X\)</span>, can
we identify whether "the model fits the data"?
This problem is the classic <a href="https://en.wikipedia.org/wiki/Goodness_of_fit">goodness of
fit</a> problem and classical
statistics has a repertoire of methods for standard models. These methods
have their own problems in that they are often unsatisfactory case-by-case
studies or strong results are obtained only in asymptotia.
However, it would be too easy to just criticise these methods.
The real question is whether the problem they address is an important one, and
what alternatives should be used, especially from a Bayesian viewpoint.</p>
<h2>Prediction versus Scientific Theories</h2>
<p>In <em>machine learning</em>, at least in its widespread current industrial use, we
are most often concerned with building predictive models that automatically
make decisions such as showing the right advertisements, classifying spam
emails, etcetera.</p>
<p>This current focus on prediction may shift in the future, for example due to a
revival in artificial intelligence systems or in general more autonomous agent
type systems which do not have a single clearly defined prediction task.</p>
<p>But as it currently stands, model checking and goodness of fit is not so
relevant for building predictive models.</p>
<p><em>First</em>, even when the observation does not comply with model assumptions, your
prediction may still be correct, in which case the non-compliance does not
matter. That is, the p-value does not take a decision-theoretic viewpoint that
includes a task-dependent utility; cf. <a href="http://arxiv.org/abs/1402.6118">Watson and
Holmes</a>.
Knowing whether the model is "correct" may not be important at all for
prediction, and likewise even within science, as summarized by Bruce Hill in
<a href="http://www3.stat.sinica.edu.tw/statistica/oldpdf/A6n41.pdf">this comment</a>,</p>
<blockquote>
<p>"A major defect of the classical view of hypothesis testing, [...],
is that it attempts to test only whether the model is true.
This came out of the tradition in physics, where models such as Newtonian
mechanics, the gas laws, fluid dynamics, and so on, come so close to being
"true" in the sense of fitting (much of) the data, that one tends to neglect
issues about the use of the model. However, in typical statistical problems
(especially in the biological and social sciences but not exclusively so)
one is almost certain a priori that the model taken literally is false in a
non-trivial way, and so one is instead concerned whether the magnitude of
discrepancies is sufficiently small so that the model can be employed for
some specific purpose."</p>
</blockquote>
<p><em>Second</em>, if the deviation from modelling assumptions leads to incorrect
predictions you would detect this through simple analysis of incorrect
predictions using ground truth holdout data, not through fancy model checking.
Checking accuracy of predictions is easy with annotated ground truth data, and
is the bread-and-butter basic tool of machine learning.</p>
<p>The only useful application of model checking for predictive systems that I
could think of are systems in which a conservative "prefer-not-to-predict"
option exists, so that observations which are violating model assumptions
could be excluded from further automated processing. Yet, much of this
potential benefit may already be accessible through posterior uncertainty of
the model. Only the subset of instances for which the model is certain but
its predictions are wrong could profit from this special treatment.</p>
<p>In contrast to prediction, in science we build models not purely for
prediction, but as a formal approximation to reality. Here I see that model
checking is crucial, because it allows falsification of <em>scientific</em>
hypotheses, leading hopefully to improved scientific understanding in the form
of new models.
One historically efficient method to falsify a scientific model is to check
the predictions it makes, so a scientific model must normally also be a
"predictive model".
This viewpoint of establishing a model not just for making good predictions
but also to understand mechanisms of reality also seems closer to the field of
statistics.</p>
<p>The above separation of prediction versus science is of course not a
simple dichotomy, but just a preference of the practitioner.</p>
<h2>Bayesian Viewpoints?</h2>
<p>So then, what is the Bayesian viewpoint here?
The answer is that some well respected figures in the field accept frequentist
tests and p-values as a method to criticise and attempt to falsify Bayesian
models.
One example can be seen in a <a href="http://www.stat.columbia.edu/~gelman/research/published/philosophy_chapter.pdf">recent article</a>
by <a href="http://andrewgelman.com/">Andrew Gelman</a> and <a href="http://www.stat.cmu.edu/~cshalizi/">Cosma
Shalizi</a> where mechanisms to falsify a
Bayesian model are discussed, stating</p>
<blockquote>
<p>"The main point where we disagree with many Bayesians is that we do not
think that Bayesian methods are generally useful for giving the posterior
probability that a model is true, or the probability for preferring model A
over model <em>B</em>, or whatever. Bayesian inference is good for deductive
inference within a model, but for evaluating a model, we prefer to compare
it to data (what Cox and Hinkley, 1974, call "pure significance testing")
without requiring that a new model be there to beat it."</p>
</blockquote>
<p>(They use pure significance tests and frequentist predictive checks, but no
p-values in that paper.)</p>
<p>Another example is <a href="http://amstat.tandfonline.com/doi/abs/10.1080/01621459.2000.10474309">an
article</a>
by <a href="http://www.uv.es/bayarri/">Susie Bayarri</a> and <a href="https://stat.duke.edu/~berger/">James
Berger</a>, where "Bayesian p-values" are
discussed.</p>
<p>A third and maybe more popular pragmatic Bayesian stance is summarized in
Bruce Hill's comment on <a href="http://www3.stat.sinica.edu.tw/statistica/oldpdf/A6n41.pdf">Gelman, Meng, and Stern's article on posterior
predictive
testing</a>,</p>
<blockquote>
<p>"Like many others, I have come to regard the classical p-value as a useful
diagnostic device, particularly in screening large numbers of possibly
meaningful treatment comparisons. It is one of many ways quickly to alert
oneself to some of the important features of a data set. However, in my
opinion it is not particularly suited for careful decision-making in serious
problems, or even for hypothesis testing. Its primary function is to
alert one to the need for making such a more careful analysis, and
perhaps to search for better models. Whether one wishes actually to go
beyond the p-value depends upon, among other things, the importance of the
problem, whether the quality of the data and information about the model and
a priori distributions is sufficiently high for such an analysis to be
worthwhile, and ultimately upon the perceived utility of such an analysis."</p>
</blockquote>
<p>Not arguing by reference to authorities, but given the broad spectrum of
contributions of Andrew Gelman, Bruce Hill, and James Berger (many of us
learned Bayesian methods from the books <a href="http://www.stat.columbia.edu/~gelman/book/">Bayesian Data
Analysis</a> and <a href="http://www.springer.com/us/book/9780387960982">Statistical
Decision Theory and Bayesian
Analysis</a>), it should be clear
that if they take frequentist tests and p-values seriously in statistical
practice, they may actually be useful.</p>
<p>So let's look again at our goodness-of-fit problem.</p>
<h2>Simple Models (simple null model)</h2>
<p>The p-value can provide a useful diagnostic of goodness of fit.
For the case of a simple model <span class="math">\(P(X)\)</span> with an observation <span class="math">\(X \in \mathcal{X}\)</span>
we can pick a test statistic <span class="math">\(T: \mathcal{X} \to \mathbb{R}\)</span> where high values
indicate unlikely outcomes, and then compute
<div class="math">$$p_{\textrm{classic}} = \textrm{Pr}_{X' \sim P}(T(X') \geq T(X)),$$</div>
that is, the probability of observing a <span class="math">\(T(X')\)</span> greater than the actually
observed <span class="math">\(T(X)\)</span>, given the assumed model <span class="math">\(P(X)\)</span>.
This probability is the p-value and if the probability of observing a more
extreme test statistic is small we should rightly be suspicious of the assumed
model.
The choice of test statistic <span class="math">\(T\)</span> is the only degree of freedom and has to be
made given the model.</p>
<p>This is the classic p-value and its formal definition is completely
unambiguous.
One important observation is that if we assume the null hypothesis is true and
we treat the p-value as a random variable, then this random variable is
uniformly distributed, for any sample size.</p>
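<p>This uniformity is easy to verify by simulation. In the sketch below (an
illustration with an assumed standard normal null and <span class="math">\(T(x) = x\)</span>)
the p-value has a closed form via the normal survival function:</p>

```python
import math
import random

random.seed(3)

def normal_sf(x):
    """Pr(X >= x) for X ~ N(0,1), via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

# Simple null P(X) = N(0,1) with test statistic T(x) = x: the p-value of
# an observation x is Pr(T(X') >= T(x)) = normal_sf(x).  When the null is
# true, the p-value itself is distributed Uniform(0,1):
p_values = [normal_sf(random.gauss(0.0, 1.0)) for _ in range(200_000)]

print(sum(p_values) / len(p_values))                     # close to 1/2
print(sum(p <= 0.05 for p in p_values) / len(p_values))  # close to 0.05
```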
<h2>Latent Variable Models (composite null model)</h2>
<p>Now assume a slightly more general setting, where we have a model
<span class="math">\(P(X|\theta)\)</span>, and <span class="math">\(\theta \in \Theta\)</span> is some unknown parameter of the model
which is not observed.</p>
<p>Because it is not observed, the above definition does not apply.
We could apply the definition only if we knew <span class="math">\(\theta\)</span>.
Classic methods assume that we have an estimator <span class="math">\(\hat{\theta}\)</span> so that we can
evaluate the p-value on <span class="math">\(P(X|\hat{\theta})\)</span>, fixing the parameter to a value
hopefully close to its true value.
The key problem with this approach is that the p-value in general will no
longer be uniformly distributed.
This diminishes its value as a diagnostic for model misspecification.
(Another alternative is to take the supremum probability over all possible
parameters, again yielding a non-uniformly distributed p-value under the
null.)</p>
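<p>The loss of uniformity of the plug-in p-value can be seen in a small
simulation; the model, estimator, and test statistic below are illustrative
choices, not from the papers discussed here:</p>

```python
import math
import random

random.seed(4)

def normal_cdf(x):
    """Phi(x), the standard normal CDF."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

# Composite null: x_1..x_n ~ N(theta, 1) with theta unknown.  The plug-in
# recipe estimates theta by the sample mean and computes the p-value of
# T(X) = max_i x_i under N(theta_hat, 1), where in closed form
# Pr(max of n draws >= t) = 1 - Phi(t - theta_hat)^n.
n, reps = 5, 50_000
p_values = []
for _ in range(reps):
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]  # the null is true (theta = 0)
    theta_hat = sum(xs) / n
    p_values.append(1.0 - normal_cdf(max(xs) - theta_hat) ** n)

# Were the p-value uniform, about 5% would fall at or below 0.05; because
# theta_hat is fitted to the same data, the plug-in p-value is conservative
# here and far fewer do:
print(sum(p <= 0.05 for p in p_values) / reps)
```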
<p>Bayesians to the rescue! Twice!</p>
<p>First, assume we would like to compute a p-value in the above setting. What
would a Bayesian do? Of course, he would integrate over the unknown
parameter, using a prior.
This yields the so called <a href="http://projecteuclid.org/euclid.aos/1176325622">posterior predictive
p-value</a> going back to
the <a href="http://www.jstor.org/stable/2984569">work of Guttman</a>.
Assuming a prior <span class="math">\(P(\theta)\)</span> we compute the <em>posterior predictive p-value</em> as
<div class="math">$$p_{\textrm{post}} = \mathbb{E}_{\theta \sim P(\theta|X)}[
\textrm{Pr}_{X' \sim P(X'|\theta)}(T(X') \geq T(X))],$$</div>
where <span class="math">\(P(\theta|X) \propto P(X|\theta) P(\theta)\)</span> is the proper posterior.
The definition is simple: take the expectation of the ordinary p-value
weighted by the parameter posterior.
This definition is very general and typically easy to compute during posterior
inference, i.e. it is quite practical computationally.</p>
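<p>For a model with a tractable posterior, <span class="math">\(p_{\textrm{post}}\)</span> is a plain
Monte Carlo average. The sketch below uses an assumed conjugate toy model
(normal data with a normal prior on the mean), which is my illustrative choice
and not from the article:</p>

```python
import math
import random

random.seed(5)

def normal_cdf(x):
    """Phi(x), the standard normal CDF."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def posterior_predictive_pvalue(xs, draws=20_000):
    """Monte Carlo posterior predictive p-value for the toy model
    x_1..x_n ~ N(theta, 1) with prior theta ~ N(0, 1); the posterior is
    then N(sum(xs)/(n+1), 1/(n+1)) and the test statistic is T(X) = max."""
    n, t = len(xs), max(xs)
    post_mean = sum(xs) / (n + 1)
    post_sd = math.sqrt(1.0 / (n + 1))
    total = 0.0
    for _ in range(draws):
        theta = random.gauss(post_mean, post_sd)  # theta ~ P(theta|X)
        # Pr(T(X') >= t | theta) for a replicate of n draws, in closed form:
        total += 1.0 - normal_cdf(t - theta) ** n
    return total / draws

xs = [random.gauss(0.0, 1.0) for _ in range(20)]  # data consistent with the model
print(posterior_predictive_pvalue(xs))
```

<p>Note the double use of the data: <code>xs</code> determines both the posterior
over <code>theta</code> and the observed statistic <code>t</code>, which is exactly
the source of the conservativeness discussed next.</p>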
<p>Unfortunately, it is also overly conservative, as explained in the JASA paper
<a href="http://www.biostat.harvard.edu/robins/p-values.pdf">"Asymptotic Distribution of P Values in Composite Null
Models"</a> by Robins, van
der Vaart, and Ventura from 2000.
Intuitively this is because the observed data <span class="math">\(X\)</span> is used twice, a violation
of the <a href="https://projecteuclid.org/euclid.lnms/1215466210#toc">likelihood principle of Bayesian
statistics</a>:
first it is used to obtain the posterior <span class="math">\(P(\theta|X)\)</span>, and then it is used
again to compute the p-value.</p>
<p>Bayesians to the rescue again!
This time it is Susie Bayarri and Jim Berger, and in their JASA paper <a href="http://amstat.tandfonline.com/doi/abs/10.1080/01621459.2000.10474309">P
values for Composite Null
Models</a>
they introduce two alternative p-values which exactly "undo" the effect of
using the data twice by conditioning on the information already observed.
(I will not discuss the <em>U-conditional predictive p-value</em> proposed by Bayarri
and Berger.)
Here is the basic idea: let <span class="math">\(X\)</span> be the observed data and <span class="math">\(t=T(X)\)</span> the test
statistic. We then define the <em>partial posterior</em>,
<div class="math">$$P(\theta|X \setminus t) \propto \frac{P(X|\theta) P(\theta)}{P(t|\theta)}.$$</div>
To understand this definition remember that random variables are functions
from the sample space to another set. Hence, conditioning on <span class="math">\(t\)</span> means that
we condition on the event <span class="math">\(\{X' \in \mathcal{X} | T(X') = t\}\)</span>.
The partial posterior predictive p-value is now defined as
<div class="math">$$p_{\textrm{ppost}} = \mathbb{E}_{\theta \sim P(\theta|X \setminus t)}[
\textrm{Pr}_{X' \sim P(X'|\theta)}(T(X') \geq T(X))].$$</div>
Bayarri and Berger, as well as Robins, van der Vaart, and Ventura analyze the
properties of this particular p-value and show that it is asymptotically
uniformly distributed, and thus neither conservative nor anti-conservative.</p>
<p>If you are a Bayesian and consider providing a general model-fit diagnostic in
the absence of a formal alternative hypothesis, then this partial posterior
predictive p-value is the method to use.</p>
<p>However, there are two drawbacks I can see that have affected its usefulness
for me:</p>
<ol>
<li>It is much harder to compute. Whereas the posterior predictive p-value can
be well approximated even with naive Monte Carlo as soon as normal
posterior inference is achieved, this is not the case for the partial
posterior predictive p-value. The reason is that <span class="math">\(P(t|\theta)\)</span>, although
typically a univariate density in the test statistic, is an integral over
potentially complicated sets in <span class="math">\(\mathcal{X}\)</span>, that is
<span class="math">\(P(t|\theta) = \int_{\mathcal{X}} 1_{\{T(X)=t\}} P(X|\theta) \,\textrm{d}X\)</span>.
I have not seen generally applicable methods to compute
<span class="math">\(p_{\textrm{ppost}}\)</span> efficiently so far.</li>
<li>The nice results of Bayarri and Berger do not extend to the so-called
<em>discrepancy statistics</em> proposed by Xiao-Li Meng in his <a href="http://projecteuclid.org/euclid.aos/1176325622">1994
paper</a>. These more general
test statistics include the parameter, i.e. we use <span class="math">\(T(X,\theta)\)</span> instead of
just <span class="math">\(T(X)\)</span>. Why is this useful? For example, and I found this a very
useful test statistic, you can directly use the likelihood of the model
itself as a test statistic: <span class="math">\(T(X,\theta) = -P(X|\theta)\)</span>.</li>
</ol>
<p>Enough thoughts, let's get our hands dirty with a simple experiment.</p>
<h2>Experiment</h2>
<p>We take a simple composite null setting as follows. Our assumed model is
<div class="math">$$X_i \sim \mathcal{N}(\mu, \sigma^2),\qquad i=1,\dots,n.$$</div>
We get to observe <span class="math">\(X=(X_1,\dots,X_n)\)</span> and know <span class="math">\(\sigma\)</span> but consider <span class="math">\(\mu\)</span>
unknown.</p>
<p>After some observations we would like to assess whether our model is accurate
in light of the data.
To this end we would like to use the p-values described above.
We will need two ingredients: we need to define a test statistic and we need
to work out the posterior inference in our model.</p>
<p>For the test statistic we actually use a generalized test statistic
(discrepancy variable in Meng's vocabulary) as
<div class="math">$$T(X,\mu) = - \prod_{i=1}^n p(X_i|\mu)
= -\prod_{i=1}^n \mathcal{N}(X_i ; \mu, \sigma^2).$$</div>
</p>
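<p>As a small illustration (a Python sketch; the helper name is mine), this discrepancy variable is just the negative product of normal densities evaluated at the current parameter value:</p>

```python
import numpy as np

def T_discrepancy(x, mu, sigma):
    """Discrepancy variable T(X, mu) = -prod_i N(x_i; mu, sigma^2):
    the negative likelihood of the data at the parameter mu."""
    dens = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return -np.prod(dens)
```

Larger values of <span class="math">\(T\)</span> (likelihood closer to zero) indicate data that are surprising under the model at that parameter value.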
<p>For the posterior inference, as Bayesians we place a prior on <span class="math">\(\mu\)</span> and we
select
<div class="math">$$\mu \sim \mathcal{N}(\mu_0, \sigma^2_0).$$</div>
The Bayesian analysis is particularly straightforward in this case, as this
<a href="http://www.cs.ubc.ca/~murphyk/Papers/bayesGauss.pdf">note by Kevin Murphy</a>
details.
In particular, after observing <span class="math">\(n\)</span> samples <span class="math">\(X=(X_1,\dots,X_n)\)</span> the posterior
on <span class="math">\(\mu\)</span> has a simple closed form as
<div class="math">$$p(\mu|X) = \mathcal{N}(\mu_n, \sigma^2_n),$$</div>
with
<div class="math">$$\sigma^2_n = \frac{1}{\frac{n}{\sigma^2}+\frac{1}{\sigma^2_0}},$$</div>
and
<div class="math">$$\mu_n = \sigma^2_n \left(\frac{\mu_0}{\sigma^2_0} +
\frac{n \bar{x}}{\sigma^2}\right),$$</div>
where <span class="math">\(\bar{x} = \frac{1}{n} \sum_i X_i\)</span> is the sample average.</p>
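<p>These update equations are a two-liner in code. Here is a sketch (Python; the helper name is mine):</p>

```python
import numpy as np

def posterior_mu(x, sigma, mu0, sigma0):
    """Closed-form posterior N(mu_n, sigma_n^2) over the unknown mean mu,
    given known noise level sigma and prior N(mu0, sigma0^2)."""
    n = len(x)
    var_n = 1.0 / (n / sigma**2 + 1.0 / sigma0**2)
    mu_n = var_n * (mu0 / sigma0**2 + n * np.mean(x) / sigma**2)
    return mu_n, var_n
```

With an uninformative prior (large <span class="math">\(\sigma_0\)</span>) the posterior mean approaches the sample average <span class="math">\(\bar{x}\)</span> and the posterior variance approaches <span class="math">\(\sigma^2/n\)</span>, as expected.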
<p>From this simple form of the posterior distribution we can derive the closed
form partial posterior <span class="math">\(P(\mu|X\setminus t)\)</span> as well (not shown here, but
essentially using known properties of the <span class="math">\(\chi^2\)</span> distribution).
Here is a picture of the posterior <span class="math">\(P(\mu|X)\)</span> and the partial posterior
<span class="math">\(P(\mu | X \setminus t)\)</span>, where the data <span class="math">\(X\)</span> actually
comes from the assumed model with true <span class="math">\(\mu=4.5\)</span> and <span class="math">\(n=10\)</span>.
Interestingly the partial posterior is more concentrated (which makes sense
from the theory derived in Robins et al.).</p>
<p><img alt="Posterior and partial posterior in mu" src="http://www.nowozin.net/sebastian/blog/images/pvalue-posterior.svg" /></p>
<p>Let us generate data from the assumed prior and model and see how our p-values
behave. Because the null model is then correct, we can hope that the
resulting p-values will be uniformly distributed.
Indeed, if they were perfectly uniformly distributed they would be proper
frequentist p-values.
From the results of Robins et al. we know that they will only be
asymptotically uniformly distributed as <span class="math">\(n \to \infty\)</span>. But here we are also
outside the theory because our test statistic <span class="math">\(T(X,\mu)\)</span> includes the unknown
parameter <span class="math">\(\mu\)</span>.
So, walking on thin theory, let's verify the distribution for <span class="math">\(n=10\)</span> by taking
a histogram over <span class="math">\(10^6\)</span> replicates.</p>
<p><img alt="Distribution over P-values" src="http://www.nowozin.net/sebastian/blog/images/pvalue-histogram.svg" /></p>
<p>This looks good, and the partial posterior predictive p-value is more
uniformly distributed, and thus has better frequentist properties, in line with
the claims in Bayarri and Berger and in Robins et al.</p>
<p>Finally, let us check with data from a model that is different from the
assumed model. Here I sample from <span class="math">\(\mathcal{N}(\mu, s^2)\)</span>, where <span class="math">\(s \in
[0,2]\)</span>. For <span class="math">\(s=1\)</span> this is the assumed model, but ideally we can refute the
model for values that differ from one by detecting this deviation through a
p-value close to zero.
The plot below shows, for each <span class="math">\(s\)</span>, the average p-value over 1000 replicates.</p>
<p><img alt="Deviation experiment" src="http://www.nowozin.net/sebastian/blog/images/pvalue-deviation-sensitivity.svg" /></p>
<p>Clearly for <span class="math">\(s < 0.6\)</span> or so we can reliably discover that our assumed model is
problematic. Interestingly the partial posterior predictive p-value has
significantly more power, in line with the theory.</p>
<p>For <span class="math">\(s > 1\)</span> however, our p-value goes to one! How can this be?
Well, remember that the choice of test statistic determines which deviations
from our assumptions we can detect and that the p-value cannot verify the
correctness of our assumed model but instead may only provide one-sided
evidence against the model.
With our current test statistic, this significant deviation clearly passes
undetected.
We could replace our test statistic using the negative of our current test
statistic and would be able detect the above deviation for <span class="math">\(s > 1\)</span>, but this
implicitly more or less starts the process of <em>thinking about alternative
models</em>, a point Bruce Hill mentioned above.</p>
<p>If we would like to consider alternative models we should ideally consider
them in a formal way, and as a result we would be better off using a fully
Bayesian approach over an enlarged model class.</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = '//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: 'center'," +
" displayIndent: '0em'," +
" showMathMenu: true," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" +
" } " +
"}); ";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>The Entropy of a Normal Distribution2015-06-13T23:30:00+02:00Sebastian Nowozintag:www.nowozin.net,2015-06-13:sebastian/blog/the-entropy-of-a-normal-distribution.html<p>The <a href="http://en.wikipedia.org/wiki/Multivariate_normal_distribution">multivariate normal
distribution</a>
is one of the most important probability distributions for multivariate data.
In this post we will look at the entropy of this distribution and how to
estimate the entropy given an iid sample.</p>
<p>For a multivariate normal distribution in <span class="math">\(k\)</span> dimensions with
mean vector <span class="math">\(\mathbf{\mu} \in \mathbb{R}^k\)</span> and covariance matrix
<span class="math">\(\mathbf{\Sigma}\)</span> we have the density function</p>
<p>
<div class="math">$$f(\mathbf{x};\mathbf{\mu},\mathbf{\Sigma}) =
\frac{1}{\sqrt{(2\pi)^k |\mathbf{\Sigma}|}}
\exp\left(-\frac{1}{2} (\mathbf{x}-\mathbf{\mu})^T
\mathbf{\Sigma}^{-1} (\mathbf{x}-\mathbf{\mu})\right).$$</div>
</p>
<p>For this density, the differential entropy takes the simple form</p>
<p>
<div class="math">\begin{equation}
H = \frac{k}{2} + \frac{k}{2} \log(2\pi)
+ \frac{1}{2} \log |\mathbf{\Sigma}|.\label{eqn:Hnormal}
\end{equation}</div>
</p>
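<p>A quick way to convince yourself of this expression is to code it up directly. The following is a minimal sketch (in Python; the function name is mine):</p>

```python
import numpy as np

def normal_entropy(Sigma):
    """Differential entropy (in nats) of N(mu, Sigma); it does not depend on mu."""
    k = Sigma.shape[0]
    sign, logdet = np.linalg.slogdet(Sigma)
    assert sign > 0.0, "Sigma must be positive definite"
    return 0.5 * k * (1.0 + np.log(2.0 * np.pi)) + 0.5 * logdet
```

For <span class="math">\(k=1\)</span> and unit variance this reduces to the familiar <span class="math">\(\frac{1}{2}\log(2\pi e) \approx 1.419\)</span> nats.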
<p>In practice we are often provided with a sample</p>
<p>
<div class="math">$$\mathbf{x}_i \sim \mathcal{N}(\mathbf{\mu},\mathbf{\Sigma}),
\quad i=1,\dots,n,$$</div>
</p>
<p>without knowledge of either <span class="math">\(\mathbf{\mu}\)</span> or <span class="math">\(\mathbf{\Sigma}\)</span>.
We are then interested in estimating the entropy of the distribution from the
sample.</p>
<h2>Plugin Estimator</h2>
<p>The simplest method to estimate the entropy is to first estimate the mean as
the empirical mean,</p>
<p>
<div class="math">$$\hat{\mathbf{\mu}} =
\frac{1}{n} \sum_{i=1}^n \mathbf{x}_i,$$</div>
</p>
<p>and the sample covariance as</p>
<p>
<div class="math">$$\hat{\mathbf{\Sigma}} =
\frac{1}{n-1} \sum_{i=1}^n (\mathbf{x}_i - \hat{\mathbf{\mu}})
(\mathbf{x}_i - \hat{\mathbf{\mu}})^T.$$</div>
</p>
<p>Given these two estimates we simply use equation <span class="math">\((\ref{eqn:Hnormal})\)</span> on
<span class="math">\(\mathcal{N}(\hat{\mathbf{\mu}},\hat{\mathbf{\Sigma}})\)</span>. (We can also use
<span class="math">\(\mathcal{N}(\mathbf{0},\hat{\mathbf{\Sigma}})\)</span> instead as the entropy is
invariant under translation.)</p>
<p>This is called a <em>plugin estimate</em> because we first estimate parameters of a
distribution, then plug these into the analytic expression for the quantity of
interest.</p>
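<p>As a sketch of the plugin recipe (in Python, with samples stored in columns, matching the convention of the Julia code later in this post; the function name is mine):</p>

```python
import numpy as np

def entropy_plugin(X):
    """Plugin entropy estimate. X has shape (k, n), samples in columns."""
    k, n = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)
    Sigma_hat = (Xc @ Xc.T) / (n - 1)      # sample covariance
    logdet = np.linalg.slogdet(Sigma_hat)[1]
    return 0.5 * k * (1.0 + np.log(2.0 * np.pi)) + 0.5 * logdet

# On a large sample the estimate approaches the true entropy.
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 100_000))      # true Sigma = I_2
true_H = 1.0 + np.log(2.0 * np.pi)         # (k/2)(1 + log 2*pi) for k = 2
est_H = entropy_plugin(X)
```

For small <span class="math">\(n\)</span>, however, the behavior is worse, which is the point of the next sections.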
<p>It turns out that the plugin estimator systematically underestimates the true
entropy and that one can use improved estimators.
This is not unusual: plugin estimates are often biased or otherwise
deficient.
In the case of estimating the entropy of an unknown normal
distribution, however, the known results are especially beautiful.
In particular,</p>
<ul>
<li>there exist unbiased estimators,</li>
<li>there exists an estimator that is a <a href="http://en.wikipedia.org/wiki/Minimum-variance_unbiased_estimator">uniformly minimum variance unbiased
estimator</a>
(within a restricted class, see below),</li>
<li>this estimator is also a (generalized) Bayesian estimator under the
squared-loss, with an improper prior distribution.</li>
</ul>
<p>Hence, for this case, a single estimator is satisfactory from both a Bayesian
and frequentist viewpoint, and moreover it is easily computable.</p>
<p>Great, we will look at this estimator, but first look at an earlier work that
studies a simpler case.</p>
<h2>Ahmed and Gokhale, 1989</h2>
<p>A <a href="http://en.wikipedia.org/wiki/Minimum-variance_unbiased_estimator">UMVUE
estimator</a>
for the problem of a zero-mean Normal distribution
<span class="math">\(\mathcal{N}(\mathbf{0},\Sigma)\)</span> has been found by <a href="http://ee364b.googlecode.com/svn-history/r24/trunk/references/00030996.pdf">(Ahmed and Gokhale,
1989)</a>.
This is a restricted case: while the entropy does not depend on the mean of
the distribution, an unknown mean does affect the estimation of the sample
covariance matrix.</p>
<p>For a sample their estimator is</p>
<p>
<div class="math">$$\hat{H}_{\textrm{AG}} =
\frac{k}{2} \log(e\pi)
+ \frac{1}{2} \log \left|\sum_{i=1}^n \mathbf{x}_i \mathbf{x}_i^T\right|
- \frac{1}{2} \sum_{j=1}^k \psi\left(\frac{n+1-j}{2}\right),$$</div>
</p>
<p>where <span class="math">\(\psi\)</span> is the <a href="http://en.wikipedia.org/wiki/Digamma_function">digamma
function</a>.</p>
<p>If you know the mean of your distribution (so you can center your data to
ensure <span class="math">\(\mu=0\)</span>), this estimator provides a big improvement over the plugin
estimate. Here is an example in mean squared error and bias, where <span class="math">\(\Sigma
\sim \textrm{Wishart}(\nu,I_k)\)</span> and <span class="math">\(\mathbf{x}_i \sim
\mathcal{N}(\mathbf{0},\Sigma)\)</span>, with <span class="math">\(k=3\)</span> and <span class="math">\(n=20\)</span>.
The plot below shows a Monte Carlo result with <span class="math">\(80,000\)</span> replicates.</p>
<p><img alt="Ahmed and Gokhale estimator" src="http://www.nowozin.net/sebastian/blog/images/normal-entropy-ag.svg" /></p>
<p>As promised, we can observe a big improvement over the plugin estimate, and we
also see that the Ahmed Gokhale estimator is indeed unbiased.</p>
<p>Here is a <a href="http://julialang.org/">Julia</a> implementation.</p>
<div class="highlight"><pre><span></span><span class="k">function</span><span class="nf"> entropy_ag</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
<span class="c"># X is a (k,n) matrix, samples in columns</span>
<span class="n">k</span> <span class="o">=</span> <span class="n">size</span><span class="p">(</span><span class="n">X</span><span class="p">,</span><span class="mi">1</span><span class="p">)</span>
<span class="n">n</span> <span class="o">=</span> <span class="n">size</span><span class="p">(</span><span class="n">X</span><span class="p">,</span><span class="mi">2</span><span class="p">)</span>
<span class="n">C</span> <span class="o">=</span> <span class="n">zeros</span><span class="p">(</span><span class="n">k</span><span class="p">,</span><span class="n">k</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span><span class="o">=</span><span class="mi">1</span><span class="p">:</span><span class="n">n</span>
<span class="n">C</span> <span class="o">+=</span> <span class="n">X</span><span class="p">[:,</span><span class="n">i</span><span class="p">]</span><span class="o">*</span><span class="n">X</span><span class="p">[:,</span><span class="n">i</span><span class="p">]</span><span class="o">'</span>
<span class="k">end</span>
<span class="n">H</span> <span class="o">=</span> <span class="mf">0.5</span><span class="o">*</span><span class="n">k</span><span class="o">*</span><span class="p">(</span><span class="mf">1.0</span> <span class="o">+</span> <span class="n">log</span><span class="p">(</span><span class="nb">pi</span><span class="p">))</span> <span class="o">+</span> <span class="mf">0.5</span><span class="o">*</span><span class="n">logdet</span><span class="p">(</span><span class="n">C</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span><span class="o">=</span><span class="mi">1</span><span class="p">:</span><span class="n">k</span>
<span class="n">H</span> <span class="o">-=</span> <span class="mf">0.5</span><span class="o">*</span><span class="n">digamma</span><span class="p">(</span><span class="mf">0.5</span><span class="o">*</span><span class="p">(</span><span class="n">n</span><span class="o">+</span><span class="mi">1</span><span class="o">-</span><span class="n">i</span><span class="p">))</span>
<span class="k">end</span>
<span class="n">H</span>
<span class="k">end</span>
</pre></div>
<p>Because the case of a known mean is maybe less interesting, we go
straight to the general case.</p>
<h2>Misra, Singh, and Demchuk, 2005</h2>
<p>In <a href="http://www.sciencedirect.com/science/article/pii/S0047259X03001787">(Misra, Singh, and Demchuk,
2005)</a>
(here is the
<a href="http://www.researchgate.net/profile/Neeraj_Misra4/publication/23644689_Estimation_of_the_entropy_of_a_multivariate_normal_distribution/links/00b7d51df6ad392c72000000.pdf">PDF</a>)
the authors do a thorough job of analyzing the general case.
Besides a detailed bias and risk analysis, the paper proposes two estimators for
the general case:</p>
<ul>
<li>A UMVUE estimator in a restricted class of estimators, which is a slight
variation of the Ahmed and Gokhale estimator;</li>
<li>A shrinkage estimator in a larger class, which is proven to dominate the
UMVUE estimator in the restricted class.</li>
</ul>
<p>The authors are apparently unaware of the work of Ahmed and Gokhale.
For their UMVUE estimator <span class="math">\(\hat{H}_{\textrm{MSD}}\)</span> they use the matrix</p>
<p>
<div class="math">$$S = \sum_{i=1}^n (\mathbf{x}_i-\hat{\mu})(\mathbf{x}_i-\hat{\mu})^T$$</div>
</p>
<p>and define</p>
<p>
<div class="math">\begin{equation}
\hat{H}_{\textrm{MSD}} =
\frac{k}{2} \log(e\pi)
+ \frac{1}{2} \log |S|
- \frac{1}{2} \sum_{j=1}^k \psi\left(\frac{n-j}{2}\right).
\label{Hmsd}
\end{equation}</div>
</p>
<p>Can you spot the differences from the Ahmed and Gokhale estimator? There are
two: the matrix <span class="math">\(S\)</span> is centered using the <em>sample mean</em> <span class="math">\(\hat{\mu}\)</span>, and, to
adjust for the use of the sample mean for centering, the argument to the
digamma function is shifted by <span class="math">\(1/2\)</span>.</p>
<p>Here is a <a href="http://julialang.org/">Julia</a> implementation.</p>
<div class="highlight"><pre><span></span><span class="k">function</span><span class="nf"> entropy_msd</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
<span class="c"># X is a (k,n) matrix, samples in columns</span>
<span class="n">k</span> <span class="o">=</span> <span class="n">size</span><span class="p">(</span><span class="n">X</span><span class="p">,</span><span class="mi">1</span><span class="p">)</span>
<span class="n">n</span> <span class="o">=</span> <span class="n">size</span><span class="p">(</span><span class="n">X</span><span class="p">,</span><span class="mi">2</span><span class="p">)</span>
<span class="n">Xbar</span> <span class="o">=</span> <span class="n">mean</span><span class="p">(</span><span class="n">X</span><span class="p">,</span><span class="mi">2</span><span class="p">)</span>
<span class="n">S</span> <span class="o">=</span> <span class="n">zeros</span><span class="p">(</span><span class="n">k</span><span class="p">,</span><span class="n">k</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span><span class="o">=</span><span class="mi">1</span><span class="p">:</span><span class="n">n</span>
<span class="n">S</span> <span class="o">+=</span> <span class="p">(</span><span class="n">X</span><span class="p">[:,</span><span class="n">i</span><span class="p">]</span><span class="o">-</span><span class="n">Xbar</span><span class="p">)</span><span class="o">*</span><span class="p">(</span><span class="n">X</span><span class="p">[:,</span><span class="n">i</span><span class="p">]</span><span class="o">-</span><span class="n">Xbar</span><span class="p">)</span><span class="o">'</span>
<span class="k">end</span>
<span class="n">res</span> <span class="o">=</span> <span class="mf">0.5</span><span class="o">*</span><span class="n">k</span><span class="o">*</span><span class="p">(</span><span class="mf">1.0</span> <span class="o">+</span> <span class="n">log</span><span class="p">(</span><span class="nb">pi</span><span class="p">))</span> <span class="o">+</span> <span class="mf">0.5</span><span class="o">*</span><span class="n">logdet</span><span class="p">(</span><span class="n">S</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span><span class="o">=</span><span class="mi">1</span><span class="p">:</span><span class="n">k</span>
<span class="n">res</span> <span class="o">-=</span> <span class="mf">0.5</span><span class="o">*</span><span class="n">digamma</span><span class="p">(</span><span class="mf">0.5</span><span class="o">*</span><span class="p">(</span><span class="n">n</span><span class="o">-</span><span class="n">i</span><span class="p">))</span>
<span class="k">end</span>
<span class="n">res</span>
<span class="k">end</span>
</pre></div>
<h3>Outline of Derivation</h3>
<p>The key result that is used for deriving both the MSD and the AG estimator is
a lemma due to <a href="http://www.stat.illinois.edu/statnews/wijsman_memoriam.htm">Robert
Wijsman</a> from 1957
(<a href="http://projecteuclid.org/euclid.aoms/1177706969">PDF</a>).</p>
<p>Wijsman proved a result relating the determinants of two matrices:
the covariance matrix <span class="math">\(\Sigma\)</span> of a multivariate Normal distribution, and
the empirical outer product matrix <span class="math">\(X X^T\)</span> of a sample <span class="math">\(X \in
\mathbb{R}^{k\times n}\)</span> (samples in columns) from that Normal.
In Lemma 3 of the above paper he showed that, in distribution,</p>
<p>
<div class="math">$$\frac{|X X^T|}{|\Sigma|} \sim \prod_{i=1}^k \chi_{n-i+1}^2,$$</div>
a product of <span class="math">\(k\)</span> independent <span class="math">\(\chi^2\)</span> random variables.</p>
<p>By taking the logarithm of this equation we can relate the central quantity in
the differential entropy, namely <span class="math">\(\log |\Sigma|\)</span> to the log-determinant of the
sample outer product matrix.</p>
<p>The sample outer product matrix of a zero-mean multivariate Normal sample with
<span class="math">\(n \geq k\)</span> is known to be distributed according to a Wishart distribution,
with many known analytic properties.
Using the known properties of the Wishart and <span class="math">\(\chi^2\)</span> distributions one can
then derive the AG and MSD estimators and prove their unbiasedness.</p>
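<p>Wijsman's identity is also easy to check numerically. The sketch below (in Python, with SciPy's digamma; the variable names are mine) compares the Monte Carlo mean of <span class="math">\(\log|XX^T| - \log|\Sigma|\)</span> against the analytic value <span class="math">\(\sum_{i=1}^k \left(\log 2 + \psi\left(\frac{n-i+1}{2}\right)\right)\)</span> implied by the fact <span class="math">\(\mathbb{E}[\log \chi^2_\nu] = \log 2 + \psi(\nu/2)\)</span>:</p>

```python
import numpy as np
from scipy.special import digamma

k, n, reps = 3, 20, 20_000
rng = np.random.default_rng(1)
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 0.5]])
L = np.linalg.cholesky(Sigma)

# E[log chi^2_nu] = log 2 + psi(nu/2), so Wijsman's lemma predicts this mean gap:
analytic = sum(np.log(2.0) + digamma((n - i + 1) / 2.0) for i in range(1, k + 1))

acc = 0.0
for _ in range(reps):
    X = L @ rng.standard_normal((k, n))    # sample with covariance Sigma
    acc += np.linalg.slogdet(X @ X.T)[1]
mc_gap = acc / reps - np.linalg.slogdet(Sigma)[1]
```

The Monte Carlo gap should agree with the analytic value up to sampling noise, and correcting for exactly this gap is what makes the AG and MSD estimators unbiased.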
<h3>Generalized Bayes</h3>
<p>Misra, Singh, and Demchuk also show that their MSD estimator is the mean of a
posterior that arises from a full Bayesian treatment with an improper prior.
This prior is shown, in Theorem 2.3 of Misra et al. (2005), to be</p>
<p>
<div class="math">$$\pi(\mu,\Sigma) = \frac{1}{|\Sigma|^{(k+1)/2}}.$$</div>
</p>
<p>This is a most satisfying result: a frequentist-optimal estimator in a large
class of possible estimators is shown to also be a Bayes estimator for a
suitable matching prior.</p>
<p>Because the posterior is proper for <span class="math">\(n \geq k\)</span>, one could also use the
proposed prior to derive posterior credible regions for the entropy, and most
likely this is a good choice in that it could achieve good coverage
properties.</p>
<h2>Brewster-Zidek estimator</h2>
<p>Going even further, Misra and coauthors also show that while the MSD estimator
is optimal in the class of <em>affine-equivariant</em> estimators, when one enlarges
the class of possible estimators there exist estimators which uniformly
dominate the MSD estimator by achieving a lower risk.</p>
<p>They propose a shrinkage estimator, termed the <em>Brewster-Zidek estimator</em>,
which I give here without further details.</p>
<p>
<div class="math">$$\hat{H}_{BZ} = \frac{k}{2} \log(2 e \pi)
+ \frac{1}{2} \log |S + YY^T| + \frac{1}{2}(\log T - d(T))$$</div>
</p>
<p>
<div class="math">$$d(r) = \frac{\int_r^1 t^{\frac{n-k}{2}-1} (1-t)^{\frac{k}{2}-1}
\left[\log t + k \log 2 + \sum_{i=1}^k \psi\left(\frac{n-i+1}{2}\right)
\right]
\textrm{d}t}{\int_r^1 t^{\frac{n-k}{2}-1}(1-t)^{\frac{k}{2}-1}
\textrm{d}t}$$</div>
</p>
<p>
<div class="math">$$T = |S| |S+YY^T|^{-1}$$</div>
</p>
<p>
<div class="math">$$Y = \sqrt{n} \hat{\mu}$$</div>
</p>
<p>Here is a Julia implementation using numerical integration for evaluating
<span class="math">\(d(r)\)</span>.</p>
<div class="highlight"><pre><span></span><span class="k">function</span><span class="nf"> entropy_bz</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
<span class="c"># X is a (p,n) matrix, samples in columns</span>
<span class="n">p</span> <span class="o">=</span> <span class="n">size</span><span class="p">(</span><span class="n">X</span><span class="p">,</span><span class="mi">1</span><span class="p">)</span>
<span class="n">n</span> <span class="o">=</span> <span class="n">size</span><span class="p">(</span><span class="n">X</span><span class="p">,</span><span class="mi">2</span><span class="p">)</span>
<span class="n">Bfun</span><span class="p">(</span><span class="n">t</span><span class="p">)</span> <span class="o">=</span> <span class="n">t</span><span class="o">^</span><span class="p">((</span><span class="n">n</span><span class="o">-</span><span class="n">p</span><span class="p">)</span><span class="o">/</span><span class="mi">2</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="n">t</span><span class="p">)</span><span class="o">^</span><span class="p">(</span><span class="n">p</span><span class="o">/</span><span class="mi">2</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="k">function</span><span class="nf"> Afun</span><span class="p">(</span><span class="n">t</span><span class="p">)</span>
<span class="n">res</span> <span class="o">=</span> <span class="n">log</span><span class="p">(</span><span class="n">t</span><span class="p">)</span> <span class="o">+</span> <span class="n">p</span><span class="o">*</span><span class="n">log</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span><span class="o">=</span><span class="mi">1</span><span class="p">:</span><span class="n">p</span>
<span class="n">res</span> <span class="o">+=</span> <span class="n">digamma</span><span class="p">(</span><span class="mf">0.5</span><span class="o">*</span><span class="p">(</span><span class="n">n</span><span class="o">-</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">))</span>
<span class="k">end</span>
<span class="n">res</span> <span class="o">*</span> <span class="n">Bfun</span><span class="p">(</span><span class="n">t</span><span class="p">)</span>
<span class="k">end</span>
<span class="n">A</span><span class="p">(</span><span class="n">r</span><span class="p">::</span><span class="kt">Float64</span><span class="p">)</span> <span class="o">=</span> <span class="n">quadgk</span><span class="p">(</span><span class="n">Afun</span><span class="p">,</span> <span class="n">r</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span>
<span class="n">B</span><span class="p">(</span><span class="n">r</span><span class="p">::</span><span class="kt">Float64</span><span class="p">)</span> <span class="o">=</span> <span class="n">quadgk</span><span class="p">(</span><span class="n">Bfun</span><span class="p">,</span> <span class="n">r</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span>
<span class="n">d</span><span class="p">(</span><span class="n">r</span><span class="p">)</span> <span class="o">=</span> <span class="n">A</span><span class="p">(</span><span class="n">r</span><span class="p">)</span> <span class="o">/</span> <span class="n">B</span><span class="p">(</span><span class="n">r</span><span class="p">)</span>
<span class="n">Xbar</span> <span class="o">=</span> <span class="n">mean</span><span class="p">(</span><span class="n">X</span><span class="p">,</span><span class="mi">2</span><span class="p">)</span>
<span class="n">Xs</span> <span class="o">=</span> <span class="n">sqrt</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="o">*</span><span class="n">Xbar</span>
<span class="n">S</span> <span class="o">=</span> <span class="n">zeros</span><span class="p">(</span><span class="n">p</span><span class="p">,</span><span class="n">p</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span><span class="o">=</span><span class="mi">1</span><span class="p">:</span><span class="n">n</span>
<span class="n">S</span> <span class="o">+=</span> <span class="p">(</span><span class="n">X</span><span class="p">[:,</span><span class="n">i</span><span class="p">]</span><span class="o">-</span><span class="n">Xbar</span><span class="p">)</span><span class="o">*</span><span class="p">(</span><span class="n">X</span><span class="p">[:,</span><span class="n">i</span><span class="p">]</span><span class="o">-</span><span class="n">Xbar</span><span class="p">)</span><span class="o">'</span>
<span class="k">end</span>
<span class="n">T</span> <span class="o">=</span> <span class="n">det</span><span class="p">(</span><span class="n">S</span><span class="p">)</span><span class="o">/</span><span class="n">det</span><span class="p">(</span><span class="n">S</span><span class="o">+</span><span class="n">Xs</span><span class="o">*</span><span class="n">Xs</span><span class="o">'</span><span class="p">)</span>
<span class="n">dBZ</span> <span class="o">=</span> <span class="n">logdet</span><span class="p">(</span><span class="n">S</span> <span class="o">+</span> <span class="n">Xs</span><span class="o">*</span><span class="n">Xs</span><span class="o">'</span><span class="p">)</span> <span class="o">-</span> <span class="n">d</span><span class="p">(</span><span class="n">T</span><span class="p">)</span> <span class="o">+</span> <span class="n">log</span><span class="p">(</span><span class="n">T</span><span class="p">)</span>
<span class="mf">0.5</span><span class="o">*</span><span class="p">(</span><span class="n">p</span><span class="o">*</span><span class="p">(</span><span class="mi">1</span><span class="o">+</span><span class="n">log</span><span class="p">(</span><span class="mi">2</span><span class="o">*</span><span class="nb">pi</span><span class="p">))</span><span class="o">+</span><span class="n">dBZ</span><span class="p">)</span>
<span class="k">end</span>
</pre></div>
<h2>Shoot-out</h2>
<p>Remember the zero-mean case? Let us start with this case.
I use <span class="math">\(k=3\)</span> and <span class="math">\(n=20\)</span> as before, and <span class="math">\(\Sigma \sim \textrm{Wishart}(\nu,I_k)\)</span>.
Then samples are generated as
<span class="math">\(\mathbf{x}_i \sim \mathcal{N}(\mathbf{0},\Sigma)\)</span>.
All numbers are from <span class="math">\(80,000\)</span> replications of the full procedure.</p>
<p><img alt="Four estimators" src="http://www.nowozin.net/sebastian/blog/images/normal-entropy-all-zeromean.svg" /></p>
<p>What the plot above shows is that the AG estimator, which is UMVUE
for this special case, dominates the MSD estimator.
Both unbiased estimators are indeed unbiased.
In terms of risk the Brewster-Zidek estimator is indistinguishable from the
MSD estimator.</p>
<p>Now, what about <span class="math">\(\mu \neq \mathbf{0}\)</span>?
Here, for the simulation the setting is as before, but the mean is
<span class="math">\(\mu \sim \mathcal{N}(\mathbf{0},2I)\)</span>, so that samples are distributed as
<span class="math">\(\mathbf{x}_i \sim \mathcal{N}(\mu,\Sigma)\)</span>.</p>
<p><img alt="Four estimators" src="http://www.nowozin.net/sebastian/blog/images/normal-entropy-all.svg" /></p>
<p>The result shows that the AG estimator becomes useless when its assumption is
violated, as is to be expected. (Interestingly, if we instead feed the
scaled sample covariance matrix <span class="math">\(n \hat{\Sigma}\)</span> to the AG estimator, the
result remains reasonable but is biased; that is, it has lost its UMVUE property.)
The MSD and Brewster-Zidek estimators are virtually
indistinguishable and both appear to be unbiased in this case.</p>
<h2>Conclusion</h2>
<p>Estimating the entropy of a multivariate Normal distribution from a sample has
a satisfying solution, the MSD estimator <span class="math">\((\ref{Hmsd})\)</span>, which can be robustly
used in all circumstances. It is computationally efficient, and with
sufficient samples, <span class="math">\(n \geq k\)</span>, the Bayesian interpretation also provides a
proper posterior distribution over <span class="math">\(\mu\)</span> and <span class="math">\(\Sigma\)</span> which can be used to
derive a posterior distribution over the entropy.</p>
<p><em>Acknowledgements</em>. I thank
<a href="http://ei.is.tuebingen.mpg.de/person/jpeters">Jonas Peters</a> for reading a
draft version of the article and providing feedback.</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = '//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: 'center'," +
" displayIndent: '0em'," +
" showMathMenu: true," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" +
" } " +
"}); ";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>A quick summary of CVPR 20152015-06-11T22:00:00+02:00Sebastian Nowozintag:www.nowozin.net,2015-06-11:sebastian/blog/a-quick-summary-of-cvpr-2015.html<p><a href="http://www.pamitc.org/cvpr15/">CVPR 2015</a>, "Computer Vision and Pattern
Recognition", is the main conference of the computer vision community and has
just finished.
Unfortunately, I was only able to stay for the three main conference days, but
here is my short, subjective summary.</p>
<p>For an overview of individual research papers, see <a href="http://cs.stanford.edu/people/karpathy/cvpr2015papers/">this excellent summary
page</a> by <a href="http://cs.stanford.edu/people/karpathy/">Andrej
Karpathy</a>.</p>
<p>From the papers I have seen at the conference my personal favorite is Barron
et al., "Fast Bilateral-Space Stereo for Synthetic Defocus",
<a href="http://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Barron_Fast_Bilateral-Space_Stereo_2015_CVPR_paper.pdf">PDF here</a>.
I liked it for a number of reasons. First, this research has already been
successfully deployed in a high-profile product, and the presentation of the
work was excellent. Second, the flavour of this work is to take a data
structure (the permutohedral lattice) which has been used successfully
for one problem (bilateral filtering), and use it to solve a more difficult
problem (disparity from stereo) within the domain of the data structure. This
general idea may be useful in other contexts. Truth be told, I never
liked pixels as a representation of image data, and many statistical models
are just awkward to specify at the pixel level; for this reason we as a
community often use higher-level representations such as superpixels or region
proposals. This paper offers an alternative: a regular representation that is
more aligned with the semantic content of the image can be used to solve the
problem in such a way that a solution at the pixel level can still be
reconstructed.</p>
<h2>Research Trends</h2>
<ul>
<li><strong>Deep Learning and Convolutional Neural Networks</strong>. Since the seminal ECCV
2012 workshop presentation by <a href="http://www.cs.toronto.edu/~kriz/">Alex
Krizhevsky</a> that announced the ImageNet
results and was <a href="http://books.nips.cc/papers/files/nips25/NIPS2012_0534.pdf">published as a NIPS paper the same
year</a>, the
computer vision community has rapidly adopted convolutional networks, and
some of the largest vision labs have developed toolkits that democratized this
technology, such as <a href="http://caffe.berkeleyvision.org/">Caffe</a>; existing
toolkits such as <a href="http://torch.ch/">Torch</a> and
<a href="http://deeplearning.net/software/theano/">Theano</a> are also widely used.
In effect, I estimate that around 30 percent of all papers used
convolutional networks or features derived from them in their work, often
substantially increasing predictive performance on the given task.
Significant research directions remain open to everyone, but it is fair to
say that standard convnets are now a mature vision technology regularly used
by large parts of the community.</li>
<li><strong>Rich Linguistic Outputs</strong>. Automatic image captioning is now feasible.
There is an <a href="https://pdollar.wordpress.com/2015/01/21/image-captioning/">excellent
summary</a> of the
many works at <a href="https://pdollar.wordpress.com">Piotr Dollar's blog</a> and also
in another <a href="http://blogs.technet.com/b/machinelearning/archive/2014/11/18/rapid-progress-in-automatic-image-captioning.aspx">summary by John
Platt</a>.
Many of these works are enabled by the recent
<a href="http://mscoco.org/">Microsoft COCO dataset</a> and by <a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/">recurrent neural
networks</a>.</li>
</ul>
<h2>Non-Research Trends and the IEEE Controversy</h2>
<ul>
<li><strong>Growth in attendance</strong>. Attendance exceeded 2,400 people,
continuing the rapid growth of the computer vision community.</li>
<li><strong>More code published</strong>. On almost every second poster there was a
<a href="http://github.com/">github</a> URL and the licenses are generally very liberal
(MIT, BSD, etc.) so as to permit wide distribution; this is great as it
further accelerates the speed at which efforts can be redirected towards
promising approaches.</li>
<li><strong>IEEE splits from CVPR</strong>. The conference has always been organized in part
by <a href="http://www.ieee.org/">IEEE</a> in various capacities as an insurer,
organizer, and publisher. However, with traditional publishing models becoming
obsolete, with examples of independent conferences and journals in the
machine learning community (<a href="http://nips.cc/">NIPS</a>,
<a href="http://icml.cc/">ICML</a>, and <a href="http://jmlr.org/">JMLR</a>), and considering that
CVPR is one of the premier conferences in all of computer science, the power
balance has shifted away from IEEE towards the computer vision community; as
a result, over the last few years the ties with IEEE have been weakened and
now seem to be lost. To be fair, following CVPR 2011, IEEE has moved and
negotiated a fairer deal, with CVPR papers made available
<a href="http://www.cv-foundation.org/openaccess/">open-access</a> since
CVPR 2013, and allowing co-sponsoring arrangements with the
<a href="http://www.cv-foundation.org/">Computer Vision Foundation</a>.
But now, after threats made by IEEE, it was voted at the PAMI-TC
meeting that, starting with CVPR 2016, the Computer Vision Foundation will
take over the functions previously carried out by the IEEE.
More details will be announced shortly, I am sure.
Whether this has any repercussions for the
<a href="http://www.computer.org/web/tpami">TPAMI journal</a> is unclear at this
point, but before making threats and taking actions that serve as a catalyst
for community action, IEEE would be wise to consider what has happened to Springer's
<em>Machine Learning</em> journal in 2001 and the events that led to the founding
of the <a href="http://en.wikipedia.org/wiki/Journal_of_Machine_Learning_Research"><em>Journal of Machine Learning
Research</em></a>,
a <a href="http://blogs.law.harvard.edu/pamphlet/2012/03/06/an-efficient-journal/">very successful
experiment</a>.</li>
</ul>Demosaicing2015-05-29T23:00:00+02:00Sebastian Nowozintag:www.nowozin.net,2015-05-29:sebastian/blog/demosaicing.html<p>This article describes the basic problem of image demosaicing and a recent work
of mine providing a new dataset for demosaicing research.</p>
<p><a href="http://en.wikipedia.org/wiki/Demosaicing">Image demosaicing</a> is a procedure
used in almost all digital cameras.
From your smartphone camera to the top-of-the-line digital SLR cameras, they
all use a demosaicing algorithm to convert the captured sensor information into
a color image.
So what is this algorithm doing and why is it needed?</p>
<h2>Why do we need Demosaicing?</h2>
<p>Modern imaging sensors are based on semiconductors which have a large number of
photo-sensitive sensor elements, called <em>sensels</em>.
When a quantum of light hits a sensel, it creates an electric charge.
The amount of charge created depends on the energy of the photon, which in
turn depends on the wavelength of the incident light.
Unfortunately, in current imaging sensors, once the electric charge is created
it is no longer possible to deduce the color of the light.
(The exception is the <a href="http://www.foveon.com/">Foveon sensor</a>, which uses a
layered silicon design in which photons of lower energy (red)
penetrate into deeper silicon layers than photons of higher energy
(green and blue).)</p>
<p>To produce color images current sensors therefore do not record all wavelengths
equally at each sensor element.
Instead, each element has its own color filter.
A typical modern sensor uses three distinct filter types, each most sensitive
to a particular range of wavelengths.
The three types are abbreviated red (R), green (G), and blue (B), although in
reality they remain sensitive to all wavelengths.
For a detailed plot of the wavelength sensitivities, <a href="http://www.dxomark.com/About/In-depth-measurements/Measurements/Color-sensitivity">this
page</a>
has a nice graph.</p>
<p>Each sensor element therefore records only one measurement: the charge related
to a certain range of wavelengths.
It does not record the full color information.
To reproduce an image suitable for human consumption we require three
measurements, such as red/green/blue values. (This is a simplification; in
real systems the concept of a color space is used: a camera records in a
camera-specific color space which is then transformed into a perceptual color
space such as sRGB.)</p>
<p>The most popular arrangement of color filters is the so-called Bayer filter,
which has the layout shown below.</p>
<p><img alt="Bayer RGGB color filter array" src="http://www.nowozin.net/sebastian/blog/images/demosaicing-bayer.png" /></p>
<p><em>Image demosaicing</em> is the process of recovering the missing colors at each
sensor element.
For example, in the top left sensel of the above figure only the blue response
is measured and we need to recover the value of the green and red responses at
this spatial location.</p>
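To make the forward process concrete, here is a small illustrative sketch (my own, not from the post) that simulates the mosaicing operation; the 2-by-2 layout is an assumption chosen so that the top-left sensel measures blue, as described in the text.

```python
import numpy as np

# Assumed channel layout: the top-left sensel measures blue (0=R, 1=G, 2=B).
# This is one of the four phases of the Bayer pattern.
PATTERN = np.array([[2, 1],
                    [1, 0]])

def bayer_mosaic(rgb):
    """Simulate a Bayer sensor: keep only one color measurement per pixel."""
    H, W, _ = rgb.shape
    rows, cols = np.indices((H, W))
    chan = PATTERN[rows % 2, cols % 2]   # which channel each sensel sees
    return rgb[rows, cols, chan]         # (H, W) single-channel mosaic
```

Demosaicing is then the inverse problem: given the (H, W) mosaic and the known pattern, recover the full (H, W, 3) color image.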
<p>In principle, why should this even be possible?
Because images of the natural world are slowly changing across the sensor,
we can use color information from adjacent sensels (but different filter types)
to provide the missing information.</p>
<h2>Challenges for Demosaicing Algorithms</h2>
<p>The above description is correct in that all demosaicing algorithms use
correlations among spatially close sensels to restore the missing information.
However, there are around three dozen publicly available demosaicing
algorithms and probably many more proprietary ones.
Beside differences in resource requirements and complexity, these algorithms
also differ widely in their demosaicing performance.</p>
<p>Without considering implementation concerns for a moment, what makes a good
demosaicing algorithm?
A good demosaicing method has the following desirable properties:</p>
<ul>
<li>Visually pleasing demosaiced output images;</li>
<li>No visible high-frequency artifacts (<em>zippering</em>), no visible color artifacts;</li>
<li>Robustness to noise present in the input;</li>
<li>Applicable to different color filter array layouts (not just <a href="http://en.wikipedia.org/wiki/Bayer_filter">Bayer</a>).</li>
</ul>
<p>To achieve this, a demosaicing algorithm has to be highly adapted to the
statistics of natural images.
That is, it has to have an understanding of typical image components such as
textures, edges, smooth surfaces, etcetera.</p>
<h2>Research Dataset</h2>
<p>One approach to image demosaicing is to treat it as a statistical regression
problem.
By learning about natural image statistics from ground truth data, one should
be able -- given sufficient data -- to approach the best possible demosaicing
performance.</p>
<p>The problem is, perhaps surprisingly, that there are no suitable datasets.
Current comparisons of demosaicing algorithms in the literature resort to two
approaches to provide results for their algorithms:</p>
<ol>
<li>Use a small set of Kodak images that were scanned onto Photo-CDs (remember
those?) from analogue film in the mid-1990s. To me it is unclear whether
this scanning involved demosaicing, and whether the properties of analogue
film are an adequate proxy for digital imaging sensors.</li>
<li>Download sRGB images from the Internet and remove color channels to obtain a
mosaiced image. But all these images have been demosaiced already, so we
merely measure the closeness of one demosaicing algorithm to another one.</li>
</ol>
<p>This state of affairs is appalling on the one hand, but on the other it is
certainly challenging to improve on, if only because currently no sensor can
capture ground truth easily.
There have been ideas to obtain ground truth using a Foveon sensor or by using
a global switchable color filter and multiple captures.
The first idea (using a Foveon camera) sounds feasible but the noise and
sensitivity characteristics of a Foveon sensor are quite different from popular
CFA-CMOS sensors.
The second idea sounds ideal but would only work in a static lab setup.</p>
<p>We introduce the <a href="http://research.microsoft.com/en-us/um/cambridge/projects/msrdemosaic/">Microsoft Research Demosaicing
Dataset</a>,
our attempt at providing a suitable dataset.
Our dataset is described in detail in an <a href="http://www.nowozin.net/sebastian/papers/khashabi2014demosaicing.pdf">IEEE TIP
paper</a>.
The dataset contains 500 images captured by ourselves containing both indoor
and outdoor imagery. Here are some example images.</p>
<p><img alt="MSR Demosaicing dataset example images" src="http://www.nowozin.net/sebastian/blog/images/msr-demosaicing-images.jpg" /></p>
<p>How did we overcome the problem of creating suitable ground truth images?
The basic idea is as follows: it is difficult to capture ground truth for
demosaicing for a full image sensor, but if we group multiple sensels into one
virtual sensel then we can interpret this group as possessing all necessary
color information.
That is, we simultaneously reduce the image resolution and perform demosaicing.
The paper describes multiple proposals for how to do this in a technically
sound manner, but to see it visually, here is an example of downsampling using
3-by-3 sensel blocks on a Bayer filter.</p>
<p><img alt="Bayer RGGB 3-by-3 downsampling" src="http://www.nowozin.net/sebastian/blog/images/demosaicing-oddblock.png" /></p>
<p>As you can see, within each 3-by-3 sensel block we may have an unequal number
of measurements of each color, but the spatial distribution of sensels of
different types is uniform in the sense that their center of gravity is the
center of the 3-by-3 block.
This is not the case in general: when averaging 4-by-4 blocks of
Bayer measurements, for example, the red sensels have a higher density in the
upper left corner of each block.</p>
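To make the grouping idea concrete, here is a hypothetical Python sketch of the simplest variant: per-channel averaging within non-overlapping 3-by-3 blocks. The paper's actual procedures are more careful; the function and variable names here are my own.

```python
import numpy as np

def block_average_ground_truth(mosaic, pattern, block=3):
    """Form one 'virtual' RGB sensel per non-overlapping block x block tile
    by averaging, within each tile, the sensels of each color separately.
    pattern: 2x2 array of channel indices (0=R, 1=G, 2=B), e.g. RGGB."""
    H, W = mosaic.shape
    h, w = H // block, W // block
    rows, cols = np.indices((H, W))
    chan = pattern[rows % 2, cols % 2]      # channel seen by each sensel
    out = np.zeros((h, w, 3))
    for c in range(3):
        mask = (chan == c).astype(float)
        # per-tile sums of the masked measurements and of the sensel counts
        vals = (mosaic * mask)[:h * block, :w * block]
        vals = vals.reshape(h, block, w, block).sum(axis=(1, 3))
        cnts = mask[:h * block, :w * block]
        cnts = cnts.reshape(h, block, w, block).sum(axis=(1, 3))
        out[:, :, c] = vals / cnts   # every 3x3 tile contains all three colors
    return out
```

With an odd block size over a 2-by-2 pattern, every tile contains at least one sensel of each color, so the division is always well-defined.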
<h2>Algorithm Comparison</h2>
<p>So how do common algorithms (and our novel algorithm) fare on our benchmark
data set?
Performance is typically measured in terms of the mean squared error of
the predicted image intensities. The most common measurement is the <a href="http://en.wikipedia.org/wiki/Peak_signal-to-noise_ratio">peak
signal to noise
ratio</a> measured in
decibels (dB), where higher is better.
We also report another performance measure based on a perceptual similarity
metric, the structural similarity index (SSIM), which compares mean and
variance statistics in image blocks; again, a higher score means a better
demosaiced image.</p>
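PSNR in particular is straightforward to compute; a small sketch (assuming a peak intensity of 1.0 for normalized images):

```python
import numpy as np

def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in decibels; higher is better.
    `peak` is the maximum possible intensity (1.0 for normalized images)."""
    ref = np.asarray(reference, dtype=np.float64)
    est = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((ref - est) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```

For example, a uniform error of 0.1 everywhere on a normalized image gives an MSE of 0.01 and hence a PSNR of 20 dB.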
<p>The top algorithms achieve the following performance. I also include bilinear
interpolation as a baseline method.</p>
<table>
<thead>
<tr>
<th align="left">Method</th>
<th align="center">PSNR (dB)</th>
<th align="center">SSIM</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Bilinear interpolation</td>
<td align="center">30.86</td>
<td align="center">0.882</td>
</tr>
<tr>
<td align="left">Non-Local means</td>
<td align="center">38.42</td>
<td align="center">0.978</td>
</tr>
<tr>
<td align="left">Contour stencils</td>
<td align="center"><strong>39.41</strong></td>
<td align="center"><strong>0.980</strong></td>
</tr>
<tr>
<td align="left">RTF (our method)</td>
<td align="center">39.39</td>
<td align="center"><strong>0.980</strong></td>
</tr>
</tbody>
</table>
<p>Hence we achieve a result comparable to the state of the art.
The experiments become interesting when we perform simultaneous denoising and
demosaicing.
Performing both operations simultaneously is desirable in a real imaging
pipeline because they both happen at the same stage of the processing.
For the task of simultaneous denoising and demosaicing the results tell a
different story.</p>
<table>
<thead>
<tr>
<th align="left">Method</th>
<th align="center">PSNR (dB)</th>
<th align="center">SSIM</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Bilinear interpolation</td>
<td align="center">30.40</td>
<td align="center">0.859</td>
</tr>
<tr>
<td align="left">Non-Local means</td>
<td align="center">36.46</td>
<td align="center">0.949</td>
</tr>
<tr>
<td align="left">Contour stencils</td>
<td align="center">37.17</td>
<td align="center">0.953</td>
</tr>
<tr>
<td align="left">RTF (our method)</td>
<td align="center"><strong>37.78</strong></td>
<td align="center"><strong>0.961</strong></td>
</tr>
</tbody>
</table>
<p>In the paper we compare more than a dozen methods.
The proposed method achieves an improved demosaicing performance of over 0.5 dB
in realistic conditions, which is visually significant.
Our method is based on the non-parametric <a href="http://www.nowozin.net/sebastian/papers/jancsary2012rtf.pdf">regression tree field model
(RTF)</a>
which we have published earlier; essentially this is a Gaussian conditional
random field (CRF) with very rich potential functions defined by regression
trees. Due to its high capacity it can learn a lot about image statistics
relevant to demosaicing.</p>
<p>The next best method is the <a href="http://www.ipol.im/pub/art/2012/g-dwcs/">contour stencils method of
Getreuer</a>.
This method performs smoothing and completion of values along a graph defined
on the sensor positions. While the method works well, it is manually designed
for the Bayer pattern and may not be easily generalized to arbitrary color
filter arrays.</p>
<h2>Outlook</h2>
<p>Demosaicing for the Bayer layout is largely solved, but for novel color filter
array layouts there currently is no all-around best method. While our
machine learning approach is feasible and leads to high quality demosaicing
results, the current loss functions used (such as peak signal to noise ratio
(PSNR) and structural similarity (SSIM)) are not sufficiently aligned with
human perception to accurately measure image quality, in particular for
zippering artifacts along edge structures.
Whatever demosaicing method is adopted, it is beneficial to simultaneously
perform demosaicing and denoising, because either task becomes more difficult
if performed in isolation.</p>Becoming a Bayesian, Part 32015-05-15T21:00:00+02:00Sebastian Nowozintag:www.nowozin.net,2015-05-15:sebastian/blog/becoming-a-bayesian-part-3.html<p>This post continues the previous post, <a href="http://www.nowozin.net/sebastian/blog/becoming-a-bayesian-part-1.html">part 1</a> and
<a href="http://www.nowozin.net/sebastian/blog/becoming-a-bayesian-part-2.html">part 2</a>,
outlining my criticism towards a ''naive'' subjective Bayesian viewpoint:</p>
<ol>
<li><a href="http://www.nowozin.net/sebastian/blog/becoming-a-bayesian-part-1.html">The consequences of model misspecification</a>.</li>
<li><a href="http://www.nowozin.net/sebastian/blog/becoming-a-bayesian-part-2.html">The ''model first computation last'' approach</a>.</li>
<li>Denial of methods of classical statistics, in this post.</li>
</ol>
<h2>Denial of the Value of Classical Statistics</h2>
<p>Suppose for the sake of a simple example that our task is to estimate the
unknown mean <span class="math">\(\mu\)</span> of an unknown probability distribution <span class="math">\(P\)</span> with bounded
support over the real line.
To this end we receive a sequence of <span class="math">\(n\)</span> iid samples
<span class="math">\(X_1\)</span>, <span class="math">\(X_2\)</span>, <span class="math">\(\dots\)</span>, <span class="math">\(X_n\)</span>.</p>
<p>Now suppose that <em>after</em> receiving these <span class="math">\(n\)</span> samples I do not use the obvious
sample mean estimator but I take only the first sample <span class="math">\(X_1\)</span> and estimate
<span class="math">\(\hat{\mu} = X_1\)</span>.
Is this a good estimator?
Intuition tells us that it is not, because it ignores part of the useful input
data, namely <span class="math">\(X_i\)</span> for any <span class="math">\(i > 1\)</span>, but how can we formally analyze this?</p>
<p>From a subjective Bayesian viewpoint the <a href="http://projecteuclid.org/download/pdf_1/euclid.lnms/1215466214">likelihood
principle</a>
does not permit us to ignore evidence which is already available.
If we posit a model <span class="math">\(P(X_i|\theta)\)</span> and a prior <span class="math">\(P(\theta)\)</span> we have to work
with the posterior
<div class="math">$$P(\theta | X_1,\dots,X_n) \propto
\prod_{i=1,\dots,n} P(X_i|\theta) P(\theta).$$</div>
Therefore our estimator <span class="math">\(\hat{\mu}=X_1\)</span> cannot correspond to a Bayesian
posterior mean of any non-trivial model for all parameters.
This is of course a very strict viewpoint and one may object that we
<em>can</em> talk about properties of the sequence of posteriors <span class="math">\(P(\theta | X_1)\)</span>,
<span class="math">\(P(\theta | X_1, X_2)\)</span>, etc.
But even in this generous view, <em>after</em> observing all samples we are not
permitted to ignore part of them. (If you are still not convinced, consider
the estimator defined by <span class="math">\(\hat{\mu} = X_2\)</span> if <span class="math">\(X_1 > 0\)</span>, and <span class="math">\(\hat{\mu} = X_3\)</span>
otherwise.)
So Bayesian statistics does not offer us a method to analyze our proposed
estimator.</p>
<p>A classical statistician <em>can</em> analyze pretty much arbitrary procedures,
including ones of the silly type <span class="math">\(\hat{\mu}\)</span> that we proposed.
The analysis may be technically difficult or apply only in the asymptotic
regime, but it does not rule out any estimator a priori.
Typical results may take the form of a derivation of the variance or bias of
the estimator.
In our case we have an
<a href="http://en.wikipedia.org/wiki/Bias_of_an_estimator"><em>unbiased</em></a>
estimate of the mean, <span class="math">\(\mathbb{E}[\hat{\mu}]-\mu = 0\)</span>.
As for the variance, because we only take the first sample, even as <span class="math">\(n \to
\infty\)</span> the variance <span class="math">\(\mathbb{V}[\hat{\mu}]\)</span> remains constant, so the
estimator is
<a href="http://en.wikipedia.org/wiki/Consistent_estimator"><em>inconsistent</em></a>, a clear
indication that our <span class="math">\(\hat{\mu}\)</span> is a bad estimator.</p>
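The inconsistency is easy to demonstrate numerically; here is a small Monte Carlo sketch (my own illustration, with an arbitrary bounded distribution standing in for \(P\)):

```python
import numpy as np

rng = np.random.default_rng(0)
reps = 20_000

def estimator_variances(n):
    """Compare Var[X_1] (the silly estimator) with Var[sample mean]."""
    X = rng.uniform(-1.0, 1.0, size=(reps, n))  # bounded support, true mean 0
    var_first = X[:, 0].var()         # mu_hat = X_1
    var_mean = X.mean(axis=1).var()   # mu_hat = sample mean
    return var_first, var_mean

vf_small, vm_small = estimator_variances(5)
vf_large, vm_large = estimator_variances(500)
# vf_* stays near 1/3 for any n; vm_* shrinks like 1/(3n)
```

The variance of \(\hat{\mu} = X_1\) does not change as \(n\) grows, while the variance of the sample mean shrinks towards zero: exactly the inconsistency the classical analysis predicts.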
<p>Another typical result is in the form of a <a href="http://en.wikipedia.org/wiki/Confidence_interval">confidence
interval</a> of a parameter of
interest.
One can argue that confidence intervals are <a href="http://en.wikipedia.org/wiki/Credible_interval#Confidence_interval">not exactly answering the
question of
interest</a>
(that is, whether the parameter really is in the given interval), but if they
are of interest, one can sometimes <a href="http://onlinelibrary.wiley.com/doi/10.1111/rssb.12080/abstract">obtain them also from a Bayesian
analysis</a>.</p>
<p>There exist cases where existing statistical procedures can be reinterpreted
from a Bayesian viewpoint. This is achieved by proposing a model and prior
such that inferences under this model and prior exactly or approximately match
the answers of the existing procedure or at least have satisfying frequentist
properties.
Two cases of this are the following:</p>
<ul>
<li><a href="http://utstat.toronto.edu/pub/reid/research/vaneeden.pdf">Matching priors</a>,
where in some cases it is possible to establish an exact equivalence for
simple parametric models without latent variables.
One recent example for even a non-parametric model is the <a href="http://arxiv.org/abs/1401.0303">Good-Turing estimator
for the missing mass</a>, where an asymptotic
equivalence between the classic <a href="http://en.wikipedia.org/wiki/Good%E2%80%93Turing_frequency_estimation">Good-Turing
estimator</a>
and a Bayesian non-parametric model is established.</li>
<li><a href="http://projecteuclid.org/euclid.aos/1236693154">Reference priors</a>, a
generalization of the Jeffreys prior, in which the prior is constructed to be
least informative. Here, least informative is in the sense that when you
sample from the prior and consider the resulting posterior using the sample,
the divergence to the original prior should be large in expectation; that is,
samples from the prior should be able to change your beliefs to the maximum
possible extent. When it is possible to derive reference priors, these
typically have excellent frequentist robustness properties, and are useful
default prior choices.
Unfortunately, in models with multiple parameters there is no unique reference
prior, and generally the set of <a href="http://www.stats.org.uk/priors/noninformative/YangBerger1998.pdf">known reference
priors</a>
seems to be quite small.
This problematic case-by-case state of affairs is nicely summarized in this recent work
on <a href="http://projecteuclid.org/euclid.ba/1422556416">overall objective priors</a>.</li>
</ul>
<p>Should we care at all about these classical notions of estimator
quality?
I have seen <em>Bayesians</em> dismiss properties such as <em>unbiasedness</em>
and <em>consistency</em> as unimportant, but I cannot understand this stance.
For example, an unbiased estimator operating on iid sampled data immediately
implies a scalable parallel estimator applicable to the big data setting,
simply by separately estimating the quantity of interest, then taking the
average of estimates. This is a practical and useful consequence of the
unbiasedness property. Similarly, <em>consistency</em> is at least a guarantee that
as more data becomes available the quality of your inferences improves,
and this should be of interest to anyone whose goal it is to build systems
which can learn. (There do exist some results on Bayesian posterior
consistency, for a summary see Chapter 20 of
<a href="http://www.stat.purdue.edu/~dasgupta/">DasGupta's</a>
<a href="http://www.springer.com/mathematics/probability/book/978-0-387-75970-8">book</a>.)</p>
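The parallelization argument can be sketched in a few lines (my own illustration; the shard count and the distribution are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.exponential(scale=2.0, size=1_000_000)  # iid samples, true mean 2.0

# Each "worker" computes the unbiased sample mean on its own shard;
# averaging the per-shard estimates yields an unbiased combined estimate.
shards = np.array_split(data, 8)
per_shard_estimates = [shard.mean() for shard in shards]
combined = float(np.mean(per_shard_estimates))
```

With equally sized shards, the average of the per-shard means coincides with the global sample mean, so nothing is lost by distributing the computation.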
<p>Let me summarize.
Bayesian estimators are often superior to alternatives.
But the set of procedures yielding Bayesian estimates is strictly smaller than
the set of all statistical procedures.
We need methods to analyze the larger set, in particular to characterize the
subset of useful estimators, where <em>useful</em> is application dependent.</p>
<p><em>Acknowledgements</em>. I thank <a href="http://www.jancsary.net/">Jeremy Jancsary</a>,
<a href="http://files.is.tue.mpg.de/pgehler/">Peter Gehler</a>,
<a href="http://pub.ist.ac.at/~chl/">Christoph Lampert</a>, and
<a href="http://www.ong-home.my/">Cheng Soon-Ong</a> for feedback.</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = '//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: 'center'," +
" displayIndent: '0em'," +
" showMathMenu: true," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" +
" } " +
"}); ";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Becoming a Bayesian, Part 22015-05-02T18:30:00+02:00Sebastian Nowozintag:www.nowozin.net,2015-05-02:sebastian/blog/becoming-a-bayesian-part-2.html<p>This post continues the previous post, <a href="http://www.nowozin.net/sebastian/blog/becoming-a-bayesian-part-1.html">part 1</a>,
outlining my criticism towards a ''naive'' subjective Bayesian viewpoint:</p>
<ol>
<li><a href="http://www.nowozin.net/sebastian/blog/becoming-a-bayesian-part-1.html">The consequences of model misspecification</a>.</li>
<li>The ''model first computation last'' approach, in this post.</li>
<li><a href="http://www.nowozin.net/sebastian/blog/becoming-a-bayesian-part-3.html">Denial of methods of classical statistics</a>.</li>
</ol>
<h2>The ''Model First Computation Last'' approach</h2>
<p>Without a model (not necessarily probabilistic) we cannot learn anything.
This is true for science, but it is also true for any machine learning system.
The model may be very general and make only a few general assumptions (e.g.
''the physical laws remain constant over time and space''),
or it may be highly specific (e.g. ''<span class="math">\(X \sim \mathcal{N}(\mu,1)\)</span>''),
but we need a model in order to relate observations to quantities of interest.</p>
<p>But in contrast to science, when we build machine learning systems we are also
engineers. We build models not in isolation or on a piece of whiteboard, but
instead we build them to run on our current technology.</p>
<p>Many <em>Bayesians</em> adhere to a strict separation of model and inference
procedure; that is, the model is independent of any inference procedure.
They argue convincingly that the goal of inference is to approximate the
posterior under the assumed model, and that for each model there exists a large
variety of possible approximate inference methods that can be applied, such as
Markov chain Monte Carlo (MCMC), importance sampling, mean field, belief
propagation, etc.
By selecting a suitable inference procedure, different accuracy and runtime
trade-offs can be realized.
In this viewpoint, the <em>model comes first and computation comes last</em>, once the
model is in place.</p>
<p>In practice this beautiful story does not play out very often.
What is more common is that instead of spending time building and refining a
model, time is spent on tuning the parameters of inference procedures, such
as:</p>
<ul>
<li><em>MCMC</em>: Markov kernel, diagnostics, burn-in, possible extensions (annealing,
parallel tempering ladder, HMC parameters, etc.);</li>
<li>Importance sampling: selecting the proposal distribution, effective sample
size, possible extensions (e.g. multiple importance sampling);</li>
<li>Mean field and belief propagation: message initialization, schedule, damping
factor, convergence criterion.</li>
</ul>
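<p>To make this tuning burden concrete, here is a minimal sketch of my own (not from the post) of random-walk Metropolis targeting a standard normal; the proposal scale alone, one of the knobs listed above, moves the acceptance rate between the extremes:</p>

```python
import numpy as np

def metropolis(n_steps, proposal_scale, rng):
    """Random-walk Metropolis targeting a standard normal N(0, 1)."""
    x, accepts, chain = 0.0, 0, []
    for _ in range(n_steps):
        proposal = x + proposal_scale * rng.normal()
        # log acceptance ratio for the N(0, 1) target density
        if np.log(rng.uniform()) < 0.5 * (x ** 2 - proposal ** 2):
            x, accepts = proposal, accepts + 1
        chain.append(x)
    return np.array(chain), accepts / n_steps

rng = np.random.default_rng(0)
rates = {s: metropolis(4000, s, rng)[1] for s in (0.1, 2.4, 50.0)}
# tiny steps are almost always accepted but mix slowly; huge steps are almost
# always rejected; 2.4 is in the range commonly recommended for this target
```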
<p>In fact, it seems to me that many works describing novel models ultimately
also describe the inference procedures that are required to make these models
work.
I say this not to diminish the tremendous progress we as a community have
made in probabilistic inference; it is just an observation that the separation
of model and inference is not plug-and-play in practice.
(Other pragmatic reasons for deviating from the subjective Bayesian viewpoint
are provided in a <a href="http://projecteuclid.org/euclid.ba/1340371036">paper by
Goldstein</a>.)</p>
<p>Suppose we have a probabilistic model and we are provided an
approximate inference procedure for it.
Let us draw a big box around these two components and call this the
<em>effective model</em>, that is, the system that takes observations and produces
some probabilistic output.
How similar is this effective model to the model on our whiteboard?
I know of only very few results, for example <a href="http://papers.nips.cc/paper/4649-the-bethe-partition-function-of-log-supermodular-graphical-models">Ruozzi's analysis of the Bethe
approximation</a>.</p>
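<p>As a toy illustration of this gap between the whiteboard model and the effective model, consider a two-variable binary model (my own example, not from the post): the exact marginal is available by enumeration, while damped mean-field updates truncated after a fixed number of sweeps define the "effective" marginal, which can differ substantially:</p>

```python
import itertools, math

THETA, W = (0.2, -0.1), 1.0  # unary fields and coupling (illustrative values)

def exact_mean(theta=THETA, w=W):
    # E[x1] under p(x1, x2) proportional to exp(th1*x1 + th2*x2 + w*x1*x2),
    # with x1, x2 in {-1, +1}; exact by enumerating all four states
    Z, m1 = 0.0, 0.0
    for x1, x2 in itertools.product((-1, 1), repeat=2):
        p = math.exp(theta[0] * x1 + theta[1] * x2 + w * x1 * x2)
        Z, m1 = Z + p, m1 + x1 * p
    return m1 / Z

def mean_field_mean(theta=THETA, w=W, iters=200, damping=0.5):
    # damped mean-field fixed-point updates, truncated after `iters` sweeps
    m1 = m2 = 0.0
    for _ in range(iters):
        m1 = (1 - damping) * m1 + damping * math.tanh(theta[0] + w * m2)
        m2 = (1 - damping) * m2 + damping * math.tanh(theta[1] + w * m1)
    return m1

# With a strong coupling the truncated mean-field "effective model" reports a
# very different marginal than the model we actually wrote down.
exact, mf = exact_mean(), mean_field_mean()
```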
<p>Another practical example along these lines, given to me by Andrew Wilson, is
to compare an analytically tractable model such as a Gaussian process against
a richer but intractable model such as a Gaussian process with Student-t
noise. The latter model is certainly more expressive formally, but requires
approximate inference.
In this case the approximate inference implicitly changes the model, and it is
not at all clear whether it is worth giving up analytic tractability.</p>
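<p>To see what is given up, here is a sketch (with my own toy numbers) of the closed-form Gaussian-process posterior under Gaussian noise; with Student-t noise this two-line linear-algebra solution is no longer available and one must approximate:</p>

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    # squared-exponential kernel matrix between two 1-D input arrays
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

X = np.array([0.0, 1.0, 2.0])    # toy training inputs
y = np.array([0.0, 0.8, 0.9])    # toy observations
noise_var = 0.01                 # Gaussian noise: the posterior is exact

K = rbf(X, X) + noise_var * np.eye(len(X))
Xstar = np.array([0.0, 1.5])
Kstar = rbf(Xstar, X)
Kinv_Kstar = np.linalg.solve(K, Kstar.T)
post_mean = Kstar @ np.linalg.solve(K, y)
post_var = rbf(Xstar, Xstar).diagonal() - np.sum(Kstar.T * Kinv_Kstar, axis=0)
```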
<h3>Resource-Constrained Reasoning</h3>
<p>It seems that when compared to machine learning, the field of artificial
intelligence is somewhat ahead; in 1987 <a href="http://research.microsoft.com/en-us/um/people/horvitz/">Eric
Horvitz</a> had a nice
paper at UAI on <a href="http://arxiv.org/abs/1304.2759">reasoning and decision making under limited
resources</a>. Read liberally, the problems
of adhering to the orthodox (normative) view that he described in 1987 seem to
mirror the current issues faced by large scale probabilistic models used in
machine learning, namely that exact analysis in any but the simplest models is
intractable and resource constraints are not made explicit in the model or the
inference procedures.</p>
<p>But some recent work gives me new hope that we will treat computation as
a first-class citizen when building our models; here is some of that work from
the computer vision and natural language processing community:</p>
<ul>
<li><a href="http://stat.fsu.edu/~abarbu/">Adrian Barbu's</a> <a href="http://stat.fsu.edu/~abarbu/papers/denoise_tip.pdf">active random
fields</a> from 2009, where
he explicitly considers the effects of using a suboptimal inference procedure in
graphical models.</li>
<li>Stoyanov, Ropson, and Eisner's work on <a href="http://www.cs.jhu.edu/~jason/papers/stoyanov+al.aistats11.pdf">predicting with approximate
inference
procedures</a> at
AISTATS 2011; although this is an empirical risk minimization approach.</li>
<li><a href="http://users.cecs.anu.edu.au/~jdomke/">Justin Domke's</a> work on <a href="http://users.cecs.anu.edu.au/~jdomke/papers/2013pami.pdf">unrolling
approximate inference
procedures</a> and
training the resulting models end-to-end using backpropagation.</li>
</ul>
<p>Cheng Soon Ong pointed me to <a href="https://www.aaai.org/Papers/Workshops/2002/WS-02-15/WS02-15-002.pdf">work on anytime probabilistic
inference</a>,
which I am not familiar with, but the goal of having inference algorithms
which adapt to the available resources is certainly desirable. The <a href="http://en.wikipedia.org/wiki/Anytime_algorithm">anytime
setting</a> is practically
relevant in many applications, particularly in real-time systems.</p>
<p>All these works share the characteristic that they take a probabilistic model
and approximate inference procedure and construct a new "effective
model" by entangling the model and inference.
By doing so the resulting model is tractable by construction and retains to a
large extent the specification of the original intractable model.
However, the separation between model and inference procedure is lost.</p>
<p>This is the first step towards a <em>computation first approach</em>, and I believe
we will see more machine learning works which recognize available
computational primitives and resources as equally important to the model
specification itself.</p>
<p><em>Acknowledgements</em>. I thank <a href="http://www.jancsary.net/">Jeremy Jancsary</a>,
<a href="http://files.is.tue.mpg.de/pgehler/">Peter Gehler</a>,
<a href="http://pub.ist.ac.at/~chl/">Christoph Lampert</a>,
<a href="http://www.cs.cmu.edu/~andrewgw/">Andrew Wilson</a>, and
<a href="http://www.ong-home.my/">Cheng Soon-Ong</a> for feedback.</p>
<h1>Becoming a Bayesian, Part 1</h1>
<p><em>2015-04-19, Sebastian Nowozin</em></p>
<p>I have used probabilistic models for a number of years now and over this time
I have used different paradigms to build my models, to estimate them from
data, and to perform inference and predictions.</p>
<p>Overall I have slowly become a Bayesian; however, it has been a rough journey.
When I say that "I became a Bayesian" I mean that my default view on problems
now is to think about a probabilistic model that relates observables to
quantities of interest and of suitable prior distributions for any unknowns
that are present in this model.
When it comes to solving the practical problem using a
computer program however, I am ready to depart from the model on my whiteboard
whenever the advantages to do so are large enough, for example in simplicity,
runtime speed, tractability, etc.
Some recent work to that end:</p>
<ul>
<li>Our work on <a href="http://arxiv.org/abs/1402.0859">informed sampling for generative computer vision
models</a> with
<a href="http://ps.is.tue.mpg.de/person/jampani">Varun</a>,
<a href="http://ps.is.tue.mpg.de/person/loper">Matthew</a> and
<a href="http://files.is.tue.mpg.de/pgehler/">Peter</a>, where we argue for a generative
and Bayesian approach to computer vision problems;</li>
<li>Our <a href="http://arxiv.org/abs/1402.3580">Bayesian NMR work</a> (and
<a href="http://www.nowozin.net/sebastian/papers/wu2014porousmedia.pdf">here</a>) with
<a href="http://www.cs.cmu.edu/~andrewgw/">Andrew Wilson</a> and collaborators from the
chemistry department at the University of Cambridge, where we took a fully Bayesian
viewpoint, with great success over conventional NMR Fourier analysis;</li>
<li>Our work on using <a href="http://www.nowozin.net/sebastian/papers/bratieres2014scalablegpstruct.pdf">GPs for structured
prediction</a>
with <a href="http://mlg.eng.cam.ac.uk/sebastien/">Sebastien</a>,
<a href="http://www.sussex.ac.uk/Users/nq28/">Novi</a>, and
<a href="http://mlg.eng.cam.ac.uk/zoubin/">Zoubin</a>, which was motivated by the
struggle to scale up a conceptually satisfying model.</li>
<li>My work on <a href="http://www.nowozin.net/sebastian/papers/nowozin2014intersectionoverunion.pdf">maximum expected utility in some structured prediction
models</a>
at CVPR 2014, which was motivated by applying basic decision theory, but ended
up trying to cope with resulting intractabilities.</li>
</ul>
<p>However, I have remained skeptical of the naive and unconditional adoption of
the <em>subjective Bayesian viewpoint</em>.
In particular, I object to the viewpoint that every model and every system
ought to be Bayesian, or to the view that, at the very least, if a statistical
system is useful then it should have an approximate Bayesian interpretation.
In this post and the following two posts I will try to explain my skepticism.</p>
<p>There is a risk of barking up the wrong tree by attacking a caricature of a
Bayesian here, which is not my intention. In fact, to be frank, every one of
the researchers I have interacted with in the past few years holds a nuanced
view of their principles and methods and more often than not is aware of their
principles' limitations and willing to adjust if circumstances require it.</p>
<p>Let me summarize the subjective Bayesian viewpoint.
In my experience this view of the world is arguably the most prevalent among
Bayesians in the machine learning community, for example at NIPS and at
machine learning summer schools.</p>
<h3>The Subjective Bayesian Viewpoint</h3>
<p>The subjective Bayesian viewpoint on any system under study is as
follows:</p>
<ul>
<li>Specify a probabilistic model relating what is known to what is unknown;</li>
<li>Specify a proper prior probability distribution over unknowns based on any
information that is available to you;</li>
<li>Obtain the posterior distribution over unknowns given the known data
(using <a href="http://en.wikipedia.org/wiki/Bayes%27_theorem">Bayes
rule</a>);</li>
<li>Draw conclusions based on the posterior distribution; for example, solve a
decision problem or select a model among the alternative models.</li>
</ul>
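<p>The four steps above can be sketched end-to-end in a conjugate toy problem (my own example, not from the post): a proper Beta prior on a coin's bias, a Bernoulli likelihood, an exact posterior via Bayes' rule, and a decision read off the posterior:</p>

```python
from fractions import Fraction

# Steps 1-2: model x_i ~ Bernoulli(theta), proper prior theta ~ Beta(2, 2)
a, b = Fraction(2), Fraction(2)

# Step 3: observe 7 heads and 3 tails; conjugacy gives the exact posterior
heads, tails = 7, 3
a_post, b_post = a + heads, b + tails        # posterior is Beta(9, 5)
posterior_mean = a_post / (a_post + b_post)  # 9/14

# Step 4: a decision based on the posterior, e.g. bet on heads if P > 1/2
bet_on_heads = posterior_mean > Fraction(1, 2)
```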
<p>This approach is used exclusively for any statistical problem that may
arise.
It is strongly advocated, for example, by
<a href="http://www.jstor.org/discover/10.2307/2681060">Lindley</a> and
in a <a href="http://projecteuclid.org/euclid.ba/1340371036">paper by Michael
Goldstein</a>.</p>
<p>Alternative Bayesian views deviate from this recipe. For example, they may
allow for <em>improper</em> prior distributions or instead aim to select
uninformative prior distributions, or even select the prior as a function of
the inferential question at hand.</p>
<h1>Criticism</h1>
<p>My main criticisms of a ''naive'' subjective Bayesian viewpoint relate
to the following three points:</p>
<ol>
<li>The consequences of model misspecification.</li>
<li><a href="http://www.nowozin.net/sebastian/blog/becoming-a-bayesian-part-2.html">The ''model first computation last'' approach</a>.</li>
<li><a href="http://www.nowozin.net/sebastian/blog/becoming-a-bayesian-part-3.html">Denial of methods of classical statistics</a>.</li>
</ol>
<h2>The Consequences of Model Misspecification</h2>
<p>To model some system in the world we often use probabilistic models of the
form
<div class="math">$$p(x;\theta),\qquad \theta \in \Theta,$$</div>
where <span class="math">\(x \in \mathcal{X}\)</span> is a random variable of interest and <span class="math">\(\Theta\)</span> is the
set of possible parameters <span class="math">\(\theta\)</span>. We are interested in <span class="math">\(p(x)\)</span> and thus
would like to find a suitable parameter given some observed data <span class="math">\(x_1, x_2,
\dots, x_n \in \mathcal{X}\)</span>. Because we can never be entirely certain about
our parameters we may represent our current beliefs through a posterior
distribution <span class="math">\(p(\theta|x_1,\dots,x_n)\)</span>.</p>
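<p>For a model of this form the posterior can sometimes be written down exactly; a minimal sketch (prior and data are my own illustrative numbers) for <span class="math">\(x \sim \mathcal{N}(\theta,1)\)</span> with a standard normal prior on <span class="math">\(\theta\)</span>:</p>

```python
# Conjugate update for x_i ~ N(theta, 1) with prior theta ~ N(0, 1).
# The posterior precision adds one unit per observation; the posterior mean
# is the precision-weighted average of the prior mean and the data.
data = [1.2, 0.8, 1.0]             # illustrative observations
prior_mean, prior_prec = 0.0, 1.0

post_prec = prior_prec + len(data)             # 1 + n, unit likelihood precision
post_mean = (prior_prec * prior_mean + sum(data)) / post_prec
post_var = 1.0 / post_prec
```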
<p><em>Misspecification</em> is the case when no parameter in <span class="math">\(\Theta\)</span> leads to a
distribution <span class="math">\(p(x;\theta)\)</span> that behaves like the true distribution.
This is not exceptional; in fact, most models of real-world systems are
misspecified.
It also is not a property of any inferential approach but rather a fundamental
limitation of building expressive models given our limited knowledge. If we
could observe all relevant quantities and know their deterministic relationships we
would not need a probabilistic model.
Hence the need for a probabilistic model arises because we cannot observe
everything and we do not know all the dependencies that exist in the real
world. (Alas, as Andrew Wilson pointed out to me, the previous two sentences
expose my deterministic world view.)
So what can be said about this common case of misspecified models?</p>
<p>Let us talk about calibration of probabilities, and what happens in case your
model is wrong.
Informally, you are <em>well-calibrated</em> if you neither overestimate nor
underestimate the probability of certain events.
Crucially, this does not imply a degree of certainty, only that your
uncertain statements (forecasted probabilities of events) are on average
correct.</p>
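<p>Calibration in this sense can be checked empirically by binning forecasts and comparing predicted to observed frequencies; a minimal sketch of my own, using a synthetic forecaster that is well-calibrated by construction:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
forecast = rng.uniform(0.05, 0.95, size=n)   # forecast probabilities
outcome = rng.uniform(size=n) < forecast     # outcomes drawn to match forecasts

# reliability-diagram data: per bin, mean forecast vs. empirical frequency
edges = np.linspace(0.0, 1.0, 11)
bin_idx = np.digitize(forecast, edges) - 1
gaps = []
for b in range(10):
    mask = bin_idx == b
    if mask.sum() > 100:
        gaps.append(abs(outcome[mask].mean() - forecast[mask].mean()))
max_gap = max(gaps)   # small gap in every bin => well-calibrated on average
```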
<p>For any probabilistic model, being well-calibrated is a desirable goal.
There are <a href="http://en.wikipedia.org/wiki/Scoring_rule">various methods</a> to
assess calibration and to check the forecasts of your model.
In 1982 <a href="http://www.statslab.cam.ac.uk/~apd/">Dawid</a>, in a
<a href="http://www.jstor.org/stable/2287720">seminal paper</a>,
established a general theorem whose consequence (in Section 4.3 of that paper)
is to guarantee that a Bayesian using a parametric model will eventually be
well-calibrated.</p>
<p>This is reassuring, except there is one catch:
it does not apply in the case when the model is misspecified.
Unfortunately, in most practical applications of probabilistic modelling,
misspecification is the rule rather than the exception (''All models are
wrong'').
We could hope for a ''graceful degradation'', in that we are still at least
approximately calibrated. But this is not the case.</p>
<h3>Calibration and Misspecification</h3>
<p>In the misspecified case, there are <a href="http://delong.typepad.com/sdj/2013/01/cosma-shalizi-vs-the-fen-dwelling-bayesians.html">simple
examples</a>
due to <a href="http://delong.typepad.com/sdj/">Brad Delong</a> and <a href="http://www.stat.cmu.edu/~cshalizi/">Cosma
Shalizi</a>
where beliefs in a parametric model do not converge and
become less-calibrated over time.
In their example two contradicting things happen at the same time:
the beliefs become very confident, yet a single new observation revises the
belief to the other extreme, again confident.</p>
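<p>A stylized version of this phenomenon is easy to simulate (my own hand-picked numbers, in the spirit of their example): two candidate models <span class="math">\(\mathcal{N}(+1,0.5^2)\)</span> and <span class="math">\(\mathcal{N}(-1,0.5^2)\)</span>, while the data stream has empirical mean zero, so neither model is right; the posterior probability lurches between near-certainties:</p>

```python
import math

def posterior_trajectory(data, sigma=0.5):
    """P(model A | data) for A = N(+1, sigma^2) versus B = N(-1, sigma^2),
    with prior probability 1/2 on each model."""
    log_odds, trajectory = 0.0, []
    for obs in data:
        # per-observation log-likelihood ratio of A over B: 2*obs / sigma^2
        log_odds += 2.0 * obs / sigma ** 2
        trajectory.append(1.0 / (1.0 + math.exp(-log_odds)))
    return trajectory

# A data stream with empirical mean zero: neither model is correct.
traj = posterior_trajectory([1.0, -1.0, -1.0, 1.0])
# belief in A after each observation: ~0.9997, 0.5, ~0.0003, 0.5
```

A single observation swings the belief from near-certainty in one model to near-certainty in the other, exactly the pathology described above.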
<h3>Improving the model?</h3>
<p>One can object that in these examples, and more generally, one should revise
the model to more accurately reflect the system under study.
But then, in order not to end up in an infinite loop of trying to improve
a model, how to determine when to stop?
Actually, how to even determine the accuracy of the model?
<a href="http://en.wikipedia.org/wiki/Marginal_likelihood">Model evidence</a> cannot be
used to this end, as it is conditioned on the set of possible models being
used. (In fact, in Delong's example the evidence would assure us that
everything is fine.)
The answers to how models can be criticised and improved are not simple, and
quite likely not Bayesian.</p>
<p>Andrew Gelman and Cosma Shalizi discuss this issue and others in a <a href="http://www.stat.columbia.edu/~gelman/research/published/philosophy_chapter.pdf">position
paper</a>,
and I find myself agreeing with their assessment that there is no answer to
wrong model assumptions within the (strictly) subjective Bayesian viewpoint:</p>
<blockquote>
<p>"We fear that a philosophy of Bayesian statistics as subjective, inductive
inference can encourage a complacency about picking or averaging over existing
models rather than trying to falsify and go further.
Likelihood and Bayesian inference are powerful, and with great power comes
great responsibility. Complex models can and should be checked and
falsified."</p>
</blockquote>
<h3>Non-parametric Models to the Rescue?</h3>
<p>Another objection is that this is all well-known and hence we should use
non-parametric models which endow us with prior support over essentially all
reasonable alternatives.</p>
<p>Unfortunately, while the resulting models are richer and are practically
useful in real applications, we now may have other problems: even when there
is prior support for the true model simple properties like <em>consistency</em>
(which were guaranteed to hold in the parametric case) <a href="http://projecteuclid.org/euclid.aos/1176349830">can no longer be taken
for granted</a>. The current
literature and basic results on this topic are nicely summarized in
Section 20.12 of <a href="http://www.stat.purdue.edu/~dasgupta/">DasGupta's</a>
<a href="http://www.springer.com/mathematics/probability/book/978-0-387-75970-8">book</a>.</p>
<h3>Conclusion</h3>
<p>Misspecification is not a Bayesian problem, and applies equally to other
estimation approaches, for example in the case of maximum likelihood
estimation see the <a href="https://books.google.com/books?isbn=0521574463">book by
White</a>.
However, a subjective Bayesian has no Bayesian means to test for the presence
of misspecification and that makes it hard to deal with the consequences.</p>
<p>There are some ideas for applying Bayesian inference in a
misspecification-aware manner, for example the <a href="http://homepages.cwi.nl/~pdg/ftp/alt12longer.pdf"><em>Safe
Bayesian</em></a> approach, and an
interesting analysis of approximate Bayesian inference using the Bootstrap in
a relatively unknown <a href="http://projecteuclid.org/euclid.bj/1126126768">paper of
Fushiki</a>.</p>
<p>Are these alternatives practical and do they somehow overcome the
misspecification problem? To be frank, I am not aware of any satisfactory
solution and common practice seems to be a careful model criticism using tools
such as predictive model checking and graphical inspection. But these require
first acknowledging the problem.</p>
<p>When the model is wrong, it would be reassuring to have:</p>
<ul>
<li>a reliable diagnostic and quantification on how wrong it is (say, an
estimate <span class="math">\(D(q\|p^*)\)</span> where <span class="math">\(q\)</span> is the true distribution), and</li>
<li>a test for whether the type of model error present will matter for making
certain predictions (say, an error bound on the deviation of certain
expectations, <span class="math">\(\mathbb{E}_q[f(x)] - \mathbb{E}_{p^*}[f(x)]\)</span> for a given
function <span class="math">\(f\)</span>).</li>
</ul>
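<p>The first item can at least be approximated numerically when the true distribution is known, as in a simulation study; the sketch below (my own toy setup) estimates <span class="math">\(D(q\|p^*)\)</span> between a bimodal truth and its moment-matched Gaussian fit by integration on a grid:</p>

```python
import numpy as np

def normal_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

x = np.linspace(-12.0, 12.0, 4001)
dx = x[1] - x[0]
q = 0.5 * normal_pdf(x, -2.0, 1.0) + 0.5 * normal_pdf(x, 2.0, 1.0)  # "truth"
p_star = normal_pdf(x, 0.0, 5.0)  # moment-matched fit: mean 0, variance 1 + 2^2

# Riemann-sum estimate of D(q || p*); positive because the fit misses bimodality
kl = float(np.sum(q * np.log(q / p_star)) * dx)
```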
<p>To me it appears the (pure) subjective Bayesian paradigm cannot provide the
above.</p>
<h3>Addendum</h3>
<p>Andrew Wilson pointed out to me that in most cases of statistical problems we
cannot know the <em>true distribution</em>, even in principle. I agree, and
indeed pursuing such an elusive ideal may divert our attention away from the
practical issue of building a model good enough for the task at hand.
I entirely agree with taking such a pragmatic stance, which follows Francis
Bacon's ideal of assessing the worth of a model (a scientific theory in his
case) not by an abstract ideal of truthfulness, but instead by its utility.</p>
<p>In machine learning and most industrial applications building the model is
<em>easy</em> because we merely focus on predictive performance which can be
reliably assessed using holdout data.
For scientific discovery however, things are more subtle in that our goal is
in establishing the truth of certain statements with sufficient confidence;
but this truth is only a conditional truth, conditioned on assumptions we have
to make.</p>
<p>A Bayesian makes all assumptions explicit and then proceeds by formally
treating them as truth, correctly inferring the consequences.
A classical/frequentist approach also makes assumptions by positing a model,
but then may be able to make statements that hold uniformly over all
possibilities encoded in the model.
Therefore, in my mind the Bayesian is an optimist, believing entirely in their
assumptions, whereas the classical approach is more pessimistic, believing
in their model but then providing worst-case results over all possibilities.
Misspecification affects both approaches.</p>
<p>If you want to continue reading, <a href="http://www.nowozin.net/sebastian/blog/becoming-a-bayesian-part-2.html">the second part of this
post</a> is now available.</p>
<p><em>Acknowledgements</em>. I thank <a href="http://www.jancsary.net/">Jeremy Jancsary</a>,
<a href="http://files.is.tue.mpg.de/pgehler/">Peter Gehler</a>,
<a href="http://pub.ist.ac.at/~chl/">Christoph Lampert</a>, and
<a href="http://www.cs.cmu.edu/~andrewgw/">Andrew Wilson</a> for feedback.</p>
<h1>Extended Formulations</h1>
<p><em>2015-04-05, Sebastian Nowozin</em></p>
<p>An amazing fact in high dimensions is this:
<em>Projecting a simple convex set described by a small number of inequalities
can create a complicated convex set with an exponential number of inequalities.</em></p>
<p>It is amazing because it contradicts our everyday human experience.
We are most familiar with projections of objects in three dimensions down to
two dimensions, namely when objects cast shadows, like this:</p>
<p><img alt="Object with shadow, by Cloud Nines
Designs" src="http://www.nowozin.net/sebastian/blog/images/cloudninesdesigns-object.jpg" />
(Image courtesy of <a href="http://cloudninesdesigns.deviantart.com/art/Abstract-Vector-Object-117257819">Cloud Nines
Designs</a>.)</p>
<p>In three dimensions any polyhedral object, when projected onto a plane,
becomes <em>simpler</em>, i.e. the number of facets stays the same or becomes
smaller. Think of a three dimensional cube that casts a shadow. The cube has
six facets but <a href="http://www.etudes.ru/en/etudes/teni/">its shadow has four or
six</a>, depending on the position of the
light and plane.
<em>[Edit and correction, July 2015: Thanks to reader Paul (comment below), I
have been made aware that it is not true that the number of facets cannot
increase when projecting from three dimensions onto the plane.
A great example is provided by <a href="http://www.pokutta.com/Homepage/Bio.html">Sebastian
Pokutta</a>, where a convex 3D
polytope with six facets projects onto the 2D plane as an octagon with eight
facets. Thanks Paul!]</em></p>
<p>Now, how can I convince you that a convex set can become more complex when
projected? Here is an impressive example.</p>
<h2>Ben-Tal/Nemirovski Polyhedron</h2>
<p>The following example is from
<a href="http://pubsonline.informs.org/doi/abs/10.1287/moor.26.2.193.10561">(Ben-Tal, Nemirovski,
2001)</a>,
(<a href="http://www2.isye.gatech.edu/~nemirovs/ApprLor_fin.pdf">PDF</a>).
In this paper the authors are motivated by approximating certain second order
cones using extended polyhedral formulations, in order to be able to perform
<a href="http://en.wikipedia.org/wiki/Robust_optimization">robust optimization</a> using
linear programming.
As a special case of their results I select the problem of approximating a
unit disk in the 2D plane. (The following is a specialization of equation (8)
in the paper.)</p>
<p>First, let us fix some notation. Let
<span class="math">\(x=(x_1,x_2,\dots,x_n,\alpha_1,\dots,\alpha_m) \in \mathbb{R}^{n+m}\)</span> be a
vector, where <span class="math">\(x_1\)</span> to <span class="math">\(x_n\)</span> represent the <em>basic dimensions</em> and <span class="math">\(\alpha_1\)</span> to
<span class="math">\(\alpha_m\)</span> represent the <em>extended dimensions</em>. For any set <span class="math">\(\mathcal{E}
\subseteq \mathbb{R}^{n+m}\)</span> we define the <em>projection</em> as
<div class="math">$$\textrm{proj}_x(\mathcal{E}) = \{ (x_1,\dots,x_n) \:|\:
\exists (\alpha_1,\dots,\alpha_m):
(x_1,\dots,x_n,\alpha_1,\dots,\alpha_m) \in \mathcal{E} \}.$$</div>
This corresponds to the familiar notion of a projection.</p>
<p>For the 2D unit disk the following is an extended polyhedral formulation,
parametrized by an integer accuracy parameter <span class="math">\(k \geq 2\)</span>. The formulation
has the basic dimensions <span class="math">\(x_1\)</span> and <span class="math">\(x_2\)</span>, and the extended dimensions
<span class="math">\(\mathbf{\alpha}=(\xi_j,\eta_j)_{j=0,\dots,k}\)</span>.
Defining the constants <span class="math">\(c_j = \cos(\pi / 2^{j})\)</span>, <span class="math">\(s_j = \sin(\pi / 2^j)\)</span>,
and <span class="math">\(t_j = \tan(\pi / 2^j)\)</span> the polyhedral set <span class="math">\(\mathcal{E}_k\)</span> is given by the
following intersection of linear inequality and equality constraints.
<div class="math">\begin{eqnarray}
\xi_0 - x_1 & \geq & 0,\nonumber\\
\xi_0 + x_1 & \geq & 0,\nonumber\\
\eta_0 - x_2 & \geq & 0,\nonumber\\
\eta_0 + x_2 & \geq & 0,\nonumber\\
\xi_j - c_{j+1} \xi_{j-1} - s_{j+1} \eta_{j-1} & = & 0,
\qquad\textrm{for $j=1,\dots,k$,}\nonumber\\
\eta_j + s_{j+1} \xi_{j-1} - c_{j+1} \eta_{j-1} & \geq & 0,
\qquad\textrm{for $j=1,\dots,k$,}\nonumber\\
\eta_j - s_{j+1} \xi_{j-1} + c_{j+1} \eta_{j-1} & \geq & 0,
\qquad\textrm{for $j=1,\dots,k$,}\nonumber\\
\xi_k & \leq & 1,\nonumber\\
\eta_k - t_{k+1} \xi_k & \leq & 0.\nonumber
\end{eqnarray}</div>
</p>
<p>Note that the set <span class="math">\(\mathcal{E}_k\)</span> can be described by <span class="math">\(6+3k\)</span> sparse linear
constraints. The intersection of these convex constraint sets is of course
again a convex set. Thus, the description of the set takes <span class="math">\(O(k)\)</span> space,
where <span class="math">\(k\)</span> is the approximation parameter.</p>
<p>If we write <span class="math">\(\mathcal{D}_k := \textrm{proj}_{x_1,x_2} \mathcal{E}_k\)</span> for the
projection onto the first two dimensions, the following figure illustrates
just how remarkably accurate the formulation is as we increase <span class="math">\(k\)</span>.</p>
<p><img alt="Ben-Tal/Nemirovski polyhedral approximation to the unit
disk" src="http://www.nowozin.net/sebastian/blog/images/bental-nemirovski.svg" /></p>
<p>How accurate is it? Ben-Tal and Nemirovski say that a set <span class="math">\(\mathcal{D}\)</span> is an
<span class="math">\(\epsilon\)</span>-approximation to a set <span class="math">\(\mathcal{L}\)</span> if <span class="math">\(\mathcal{L} \subseteq
\mathcal{D}\)</span> and if for all <span class="math">\(x \in \mathcal{D}\)</span> it holds that
<span class="math">\((\frac{1}{1+\epsilon} x) \in \mathcal{L}\)</span>.
They then show that the above formulation is an <span class="math">\(\epsilon_k\)</span>-approximation,
where
<div class="math">$$\epsilon_k = \frac{1}{\cos(\pi / 2^{k+1})} - 1 = O(\frac{1}{4^k}).$$</div>
That is, despite having a compact description in <span class="math">\(O(k)\)</span> space the accuracy
improves exponentially.
In the basic dimensions the set <span class="math">\(\mathcal{D}_k\)</span> has exponentially many facets
and cannot be described compactly through a polynomial sized collection of
linear inequalities.
(The paper further generalizes the above results to the family of
<span class="math">\(d\)</span>-dimensional Lorentz cones.)</p>
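<p>Because the equality constraints determine the <span class="math">\(\xi_j\)</span> exactly and the two inequalities only lower-bound <span class="math">\(\eta_j\)</span>, greedily taking the smallest admissible <span class="math">\(\eta_j\)</span> at each step gives a simple check that certifies membership in <span class="math">\(\mathcal{D}_k\)</span>. Here is a sketch I wrote to play with the construction; it exploits that each step is a rotation by <span class="math">\(\pi/2^{j+1}\)</span> followed by taking an absolute value:</p>

```python
import math

def in_Dk(x1, x2, k):
    """Certify (x1, x2) in D_k by choosing the smallest admissible eta_j."""
    xi, eta = abs(x1), abs(x2)          # minimal admissible xi_0, eta_0
    for j in range(1, k + 1):
        t = math.pi / 2 ** (j + 1)      # rotate by pi / 2^(j+1), fold by abs
        xi, eta = (math.cos(t) * xi + math.sin(t) * eta,
                   abs(math.cos(t) * eta - math.sin(t) * xi))
    return xi <= 1.0 and eta <= math.tan(math.pi / 2 ** (k + 1)) * xi

def eps(k):
    # the accuracy epsilon_k = 1/cos(pi / 2^(k+1)) - 1 from the paper
    return 1.0 / math.cos(math.pi / 2 ** (k + 1)) - 1.0

# points of norm <= 1 pass the check; points of norm > 1 + eps(k) fail it
```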
<h2>Is there a Recurring Pattern?</h2>
<p>The abstract idea behind obtaining complicated structures in one space by
means of something <em>like an extended formulation</em> can be found in other
domains; for example, in probabilistic <a href="http://en.wikipedia.org/wiki/Graphical_model">graphical
models</a>.</p>
<p>Suppose we would like to specify a potentially complicated probability
distribution <span class="math">\(P(X)\)</span>.
Akin to an extended formulation we may proceed as follows. We define an
extended set of random variables <span class="math">\(\alpha\)</span> and a distribution <span class="math">\(P(\alpha)\)</span>.
We then couple both spaces by means of a conditional specification,
<span class="math">\(P(X|\alpha) P(\alpha)\)</span>.
We then <em>project</em>, that is, marginalize out, the extended dimensions <span class="math">\(\alpha\)</span>
to obtain
<div class="math">$$P(X) = \int P(X|\alpha) P(\alpha) \,\textrm{d}\alpha.$$</div>
In practice this construction is often used in the form of a
<a href="http://en.wikipedia.org/wiki/Bayesian_network">hierarchical graphical model</a>,
for example when using a <a href="http://sumsar.net/blog/2013/12/t-as-a-mixture-of-normals/">Normal mixture to define a student T
distribution</a>.</p>
<p>The increase in flexibility of the resulting marginal distribution can be as
impressive as for the above polyhedral sets: for example, if <span class="math">\(P(X|\alpha)\)</span> is
a Normal distribution and <span class="math">\(P(\alpha)\)</span> is a distribution over Normal
parameters, then the infinite Normal mixture can essentially represent any
absolutely continuous distribution.</p>
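<p>This scale-mixture construction is easy to check numerically. The following sketch (my own illustration, not part of the original post) draws the extended variable (a Gamma-distributed precision) and then samples the conditional Normal; marginalizing out the precision, here simply by sampling it, yields a Student-t distribution with <span class="math">\(\nu\)</span> degrees of freedom.</p>

```python
import math
import random

random.seed(0)

def student_t_sample(nu):
    """One draw from a Student-t via the Normal scale-mixture construction."""
    # Extended variable: precision lam ~ Gamma(shape=nu/2, rate=nu/2);
    # random.gammavariate takes (shape, scale), and scale = 1/rate.
    lam = random.gammavariate(nu / 2.0, 2.0 / nu)
    # Conditional: x | lam ~ Normal(0, 1/sqrt(lam))
    return random.gauss(0.0, 1.0 / math.sqrt(lam))

# Marginalizing out lam (here: by Monte Carlo) gives a t distribution with
# nu degrees of freedom; for nu > 2 its variance is nu / (nu - 2).
nu = 5.0
xs = [student_t_sample(nu) for _ in range(200_000)]
var = sum(x * x for x in xs) / len(xs)  # should be close to 5/3
```

<p>This is exactly ancestral sampling in the two-variable hierarchical model: sample <span class="math">\(\alpha\)</span> (here the precision), then sample <span class="math">\(X\)</span> given it.</p>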
<p>Another observation, which may be just a coincidence, but maybe there is more
to it: the extended formulation construction in both cases suggests a
practical implementation. In the polyhedral set this was through linear
programming in the extended space, for the graphical model it would be
ancestral sampling or MCMC inference.</p>
<p>This leaves me with the following questions:</p>
<ul>
<li>Are there more examples of similar constructions (extension, coupling,
projection)?</li>
<li>What is the shared mathematical structure behind this similarity (e.g.
permitting a projection operation that leads to complexity in the basic
dimensions that no longer admits a compact description in this space)?</li>
</ul>
<p>Feedback very much welcome :-)</p>
<h2>Conclusion</h2>
<p>I first learned of extended formulations from <a href="http://www.springer.com/business+%26+management/production/book/978-0-387-29959-4">this book by Pochet and
Wolsey</a>,
who pioneered the technique for practical scheduling optimization problems.
(Yes, I had enough time for tinkering during my PhD to take such creative diversions.)
A recent summary of extended formulations for <em>combinatorial optimization</em>
problems is <a href="http://integer.tepper.cmu.edu/webpub/ExtendedFormulation.pdf">Conforti, Cornuejols, Zambelli,
2012</a>.</p>
<p>Many so called <em>higher-order interactions</em> in computer vision random field
models are representable as extended formulations, a point I elaborated on in
a <a href="http://users.cecs.anu.edu.au/~julianm/cvpr2011/slides/nowozin.pdf">talk</a> I
gave at the <a href="http://cseweb.ucsd.edu/~jmcauley/cvpr2011.html">Inference in Graphical Models with Structured Potentials
workshop</a> at the CVPR 2011
conference. Another relevant work is <a href="http://link.springer.com/article/10.1007/s10107-003-0397-3">Miller and Wolsey,
2003</a>.</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = '//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: 'center'," +
" displayIndent: '0em'," +
" showMathMenu: true," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" +
" } " +
"}); ";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>How to report uncertainty2015-03-19T22:30:00+01:00Sebastian Nowozintag:www.nowozin.net,2015-03-19:sebastian/blog/how-to-report-uncertainty.html<p><a href="http://en.wikipedia.org/wiki/Error_bar">Error bars</a> and the
<a href="http://en.wikipedia.org/wiki/Plus-minus_sign"><span class="math">\(\pm\)</span>-notation</a>
are used to quantitatively convey uncertainty in experimental results.
For example, you would often read statements like <span class="math">\(140.7 \textrm{Hz} \pm 2.8
\textrm{ Hz SEM}\)</span> in a paper to report both an experimental average and its
uncertainty.</p>
<p>Unfortunately, in many fields (such as computer vision, and, to a lesser
extent, machine learning) researchers often do not report uncertainty or if
they do, <a href="http://archpedi.jamanetwork.com/article.aspx?articleid=510667">they may do it
wrong</a>.</p>
<p>Of course, dear reader, I am sure <em>you</em> always do report it properly, so the
following remarks may only serve as a reminder to your common practice.</p>
<p>First, when reporting a quantitative measurement of uncertainty, it is
important to establish the goal of doing so. The two popular goals are as
follows.</p>
<h2>1. Convey Variability</h2>
<p>Here the focus is on the variability itself.
For example, take a look at this table of <a href="http://www.nature.com/ijo/journal/v27/n7/fig_tab/0802297t1.html">food intake of
US teenagers</a>.
The variability among the participants of the study is reported through the
<a href="http://en.wikipedia.org/wiki/Standard_deviation">standard deviation</a>, the
square root of the variance.</p>
<p>The reason why the standard deviation (SD) is preferred over the variance is
that the SD is on the same scale as the original values. That is, if the
original measurements were in <span class="math">\(\textrm{Hz}\)</span> the standard deviation is also in
units of <span class="math">\(\textrm{Hz}\)</span>, whereas the variance is in squared units (<span class="math">\(\textrm{Hz}^2\)</span>).</p>
<p>One easy question you can ask yourself when thinking about the results you
would like to report in an experiment is this:
Do you expect the error bars to shrink with more available data?
If your goal is to convey variability they would <em>not</em> shrink but remain of a
certain size, no matter how many samples are available.</p>
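<p>A quick simulation makes this distinction concrete. The sketch below uses hypothetical data, reusing the <span class="math">\(140.7 \textrm{ Hz}\)</span> and <span class="math">\(2.8 \textrm{ Hz}\)</span> numbers from the introductory example: the sample SD stabilizes around the population value as <span class="math">\(n\)</span> grows, while the SEM keeps shrinking like <span class="math">\(1/\sqrt{n}\)</span>.</p>

```python
import math
import random
import statistics

random.seed(1)

def sd_and_sem(n):
    # Hypothetical measurements from a population with mean 140.7 and SD 2.8
    xs = [random.gauss(140.7, 2.8) for _ in range(n)]
    sd = statistics.stdev(xs)       # conveys variability; does not shrink with n
    return sd, sd / math.sqrt(n)    # SEM conveys uncertainty; shrinks ~ 1/sqrt(n)

sd_small, sem_small = sd_and_sem(100)
sd_large, sem_large = sd_and_sem(10_000)
```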
<p>The correct wording to report this type of uncertainty is something similar to</p>
<blockquote>
<p>"We report the mean and one unit standard deviation."</p>
</blockquote>
<h2>2. Convey Uncertainty about an Unknown Parameter</h2>
<p>Here the focus is on your remaining uncertainty about a fixed quantity which
does not vary.
For example, take a look at <a href="http://loristalter.com/wp-content/uploads/CDC_1960_2002.pdf">Table 1
in Ogden et al.,
2004</a> where the
average weight of US children is reported.
Together with the mean weight in pounds the authors report the <a href="http://en.wikipedia.org/wiki/Standard_error">standard error
of the mean</a>. (Sometimes this is
just called standard error.)</p>
<p>Here the reported value quantifies our remaining uncertainty about the average
weight. It is related to the standard deviation <span class="math">\(\sigma\)</span> by means of</p>
<p>
<div class="math">$$\textrm{SEM} = \frac{\sigma}{\sqrt{n}},$$</div>
</p>
<p>where <span class="math">\(n\)</span> is the sample size of the experiment. For example, in Table 1 of
the above paper the authors report that between 1963 and 1965 for boys of age
6 years living in the USA the average weight was <span class="math">\(\hat{\mu}=48.4\)</span> pounds with
<span class="math">\(\textrm{SEM}=0.3\)</span> standard error of the mean and a sample size of <span class="math">\(n=575\)</span>.
Using the above formula this immediately gives</p>
<p>
<div class="math">$$\sigma \approx \sqrt{n} \textrm{SEM} = \sqrt{575} \cdot 0.3 \approx 7.19.$$</div>
</p>
<p>What is the use of the standard error?
Because of the <a href="http://en.wikipedia.org/wiki/Central_limit_theorem">central limit
theorem</a> for independent
samples the standard error provides approximate confidence intervals for the
unknown true mean of the population, as</p>
<p>
<div class="math">$$[\hat{\mu} - 1.96 \textrm{SEM}, \hat{\mu} + 1.96 \textrm{SEM}].$$</div>
</p>
<p>Using the above numbers we then know that with 95% confidence over the
sampling variation the true average weight <span class="math">\(\mu \in [47.8,49.0]\)</span>.
(Note that for a <em>single experiment</em> this does not mean we
<a href="http://en.wikipedia.org/wiki/Coverage_probability">cover</a> the true
value with a certain probability, because either we cover it or we do not
cover it. The 95% probability is the probability associated to a
(hypothetical) repetition of the experiment.)</p>
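<p>Recovering <span class="math">\(\sigma\)</span> and the confidence interval from the reported numbers takes only a line each; a minimal sketch using the values from Table 1:</p>

```python
import math

mu_hat, sem, n = 48.4, 0.3, 575   # values reported in Table 1 of Ogden et al.

# Recover the standard deviation from the standard error: sigma = sqrt(n) * SEM
sigma = math.sqrt(n) * sem        # approximately 7.19

# Approximate 95% confidence interval for the unknown true mean
ci = (mu_hat - 1.96 * sem, mu_hat + 1.96 * sem)   # approximately (47.8, 49.0)
```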
<p>The correct wording to report this type of uncertainty is</p>
<blockquote>
<p>"We report the average of <span class="math">\(n=123\)</span> samples and the standard error of the
mean."</p>
</blockquote>
<h2>How many digits to report?</h2>
<p>When writing out numbers a natural question that arises is how many
significant digits to report.
<a href="http://arxiv.org/abs/1301.1034">Richard Clymo has some advice</a> on how many
digits to report.</p>
<blockquote>
<p>Most bioscientists need to report mean values, yet many have little idea of
how many digits are significant, and at what point further digits are mere
random junk. Thus a recent report that the mean of 17 values was 3.863 with
a standard error of the mean (SEM) of 2.162 revealed only that none of the
seven authors understood the limitations of their work.
The simple rule derived here by experiment for restricting a mean value to
its significant digits (sig-digs) is this: the last sig-dig in the mean
value is at the same decimal decade as the first sig-dig (the first
non-zero) in the SEM.
...
For the example above the reported values should be a mean of 4 with SEM
2.2. Routine application of these simple rules will often show that a result
is not as compelling as one had hoped.</p>
</blockquote>
<p>Let's compare with the numbers from before: the average weight was reported as
48.4 and the SEM as 0.3.
The last significant digit in the mean is the 4 after the decimal point,
which is in the same decimal decade (tenths) as the first significant digit of the SEM.
So the study did it right.</p>
<p>Clymo develops the following simple-to-follow rules for reporting the sample
average and SEM:</p>
<ol>
<li>
<p><em>Rule 1</em> (for determining the significant digits in the reported mean):
the <em>last</em> significant digit in the mean is in the same decade as the <em>first</em>
non-zero digit in the SEM.</p>
</li>
<li>
<p><em>Rule 2</em> (for determining significant digits in the reported SEM):
depending on the sample size <span class="math">\(n\)</span>, as per the following table:</p>
</li>
</ol>
<table>
<thead>
<tr>
<th align="left">Sample size <span class="math">\(n\)</span></th>
<th align="center">Significant digits to report</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left"><span class="math">\(2 \leq n \leq 6\)</span></td>
<td align="center">1</td>
</tr>
<tr>
<td align="left"><span class="math">\(7 \leq n \leq 100\)</span></td>
<td align="center">2</td>
</tr>
<tr>
<td align="left"><span class="math">\(101 \leq n \leq 10,000\)</span></td>
<td align="center">3</td>
</tr>
<tr>
<td align="left"><span class="math">\(10,001 \leq n \leq 10^6\)</span></td>
<td align="center">4</td>
</tr>
<tr>
<td align="left"><span class="math">\(n > 10^6\)</span></td>
<td align="center">5</td>
</tr>
</tbody>
</table>
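<p>The two rules are mechanical enough to automate. The helper below is my own sketch (the function names are hypothetical, not from Clymo's paper):</p>

```python
import math

def sem_sig_digits(n):
    # Rule 2: number of significant digits for the SEM, by sample size n
    if n <= 6:
        return 1
    if n <= 100:
        return 2
    if n <= 10_000:
        return 3
    if n <= 10**6:
        return 4
    return 5

def round_sig(x, digits):
    # Round x to the given number of significant digits
    decade = math.floor(math.log10(abs(x)))
    return round(x, digits - 1 - decade)

def clymo_report(mean, sem, n):
    # Rule 1: the last sig-dig of the mean lies in the decade of the
    # first non-zero digit of the SEM
    decade = math.floor(math.log10(abs(sem)))
    return round(mean, -decade), round_sig(sem, sem_sig_digits(n))
```

<p>For Clymo's example of a mean of 3.863 with SEM 2.162 from <span class="math">\(n=17\)</span> values, this returns 4 and 2.2, matching the quoted advice.</p>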
<h2>Quiz</h2>
<p>Ok, that is enough information. Let's practice.</p>
<h3>Question 1</h3>
<p>You sample the height of male students in a German school class (grade 6)
in centimeters: 148, 148, 137, 152, 140, 149, 152, 152, 159, 155. Report your
estimate of the population mean height (here the population is all German male
students in grade 6).</p>
<p><em>Answer</em>: <span class="math">\(149\textrm{cm} \pm 2.1\textrm{cm}\)</span> SEM.
Explanation: we are interested in the population mean and hence would like to
convey the remaining uncertainty of our estimate.
The sample mean is <span class="math">\(\hat{\mu} = 149.2\textrm{cm}\)</span>, the sample standard
deviation is <span class="math">\(6.579429\textrm{cm}\)</span>, and the sample size is <span class="math">\(n=10\)</span>. This gives
a <span class="math">\(\textrm{SEM} = 6.579429/\sqrt{10} \approx 2.080598\)</span>. Applying the above
rules: Rule 1 tells us that the first significant digit of the SEM is in the <span class="math">\(10^0\)</span>
decade, so we report <span class="math">\(149\textrm{cm}\)</span> as the mean. Rule 2 tells us that for a
sample size of <span class="math">\(n=10\)</span> we should report two significant digits in the SEM, which
rounds to <span class="math">\(2.1\textrm{cm}\)</span>.</p>
<h3>Question 2</h3>
<p>You run a company and regularly send bills to customers for payment.
You measure the time in days between sending the bill and receiving the
payment: 10, 7, 10, 7, 12, 10, 8, 4, 15, 3, 9, 4. Report the average and
variability.</p>
<p><em>Answer</em>: <span class="math">\(8 \pm 3.5\)</span> SD.
Explanation: we are interested in the average time and its variability, so a
standard deviation is appropriate. Rule 1 from Clymo still applies and we
round the sample mean of <span class="math">\(8.25\)</span> to <span class="math">\(8\)</span>. Rule 2 does not
apply (this is the standard deviation, not the SEM), but since the mean is
reported to the ones digit it makes no sense to report the SD with more than
one additional digit.</p>
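<p>Both quiz answers can be checked with a few lines of standard-library Python (a sketch, not part of the original quiz):</p>

```python
import math
import statistics

# Question 1: heights, report mean and SEM
heights = [148, 148, 137, 152, 140, 149, 152, 152, 159, 155]
mu = statistics.mean(heights)            # 149.2
sd = statistics.stdev(heights)           # about 6.5794 (n-1 denominator)
sem = sd / math.sqrt(len(heights))       # about 2.0806 -> report 149 +/- 2.1 SEM

# Question 2: payment delays, report mean and SD
delays = [10, 7, 10, 7, 12, 10, 8, 4, 15, 3, 9, 4]
mu2 = statistics.mean(delays)            # 8.25 -> report 8
sd2 = statistics.stdev(delays)           # about 3.52 -> report 8 +/- 3.5 SD
```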
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = '//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: 'center'," +
" displayIndent: '0em'," +
" showMathMenu: true," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" +
" } " +
"}); ";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Estimating Discrete Entropy, Part 32015-03-07T16:00:00+01:00Sebastian Nowozintag:www.nowozin.net,2015-03-07:sebastian/blog/estimating-discrete-entropy-part-3.html<p>In the last two parts (<a href="http://www.nowozin.net/sebastian/blog/estimating-discrete-entropy-part-1.html">part one</a>,
<a href="http://www.nowozin.net/sebastian/blog/estimating-discrete-entropy-part-2.html">part two</a>) we looked at the problem of
entropy estimation and several popular estimators.</p>
<p>In this final article we will take a look at two Bayesian approaches to the
problem.</p>
<h2>Bayesian Estimator due to Wolpert and Wolf</h2>
<p>The first Bayesian approach to entropy estimation was proposed by David
Wolpert and David Wolf in 1995 in their paper "Estimating functions of
probability distributions from a finite set of samples", published in Physical
Review E, Vol. 52, No. 6, 1995,
<a href="http://journals.aps.org/pre/abstract/10.1103/PhysRevE.52.6841">publisher link</a>, and
a longer <a href="http://www.santafe.edu/media/workingpapers/93-07-046.pdf">tech report from
1993</a>.</p>
<p>The idea is a simple and elegant piece of Bayesian reasoning: specify a model relating
the known observations to the unknown quantity, then compute the posterior
distribution over the entropy given the observations.</p>
<p>The model is the following <a href="http://en.wikipedia.org/wiki/Dirichlet-multinomial_distribution">Dirichlet-Multinomial
model</a>,
assuming a given non-negative vector <span class="math">\(\mathbb{\alpha} \in \mathbb{R}^K_+\)</span>,</p>
<ol>
<li><span class="math">\(\mathbb{p} \sim \textrm{Dirichlet}(\mathbb{\alpha})\)</span>,</li>
<li><span class="math">\(x_i \sim \textrm{Categorical}(\mathbb{p})\)</span>, <span class="math">\(i=1,\dots,n\)</span>, iid.</li>
</ol>
<p>We define, for each bin <span class="math">\(k \in \{1,2,\dots,K\}\)</span>, the count</p>
<p>
<div class="math">$$n_k = \sum_{i=1}^n 1_{\{x_i = k\}},$$</div>
</p>
<p>so that <span class="math">\((n_1,n_2,\dots,n_K)\)</span> is a histogram over <span class="math">\(K\)</span> outcomes, which is
distributed according to a multinomial distribution.
Then, due to <a href="http://en.wikipedia.org/wiki/Conjugate_prior"><em>conjugacy</em></a>, the
posterior over the unknown distribution <span class="math">\(\mathbb{p}\)</span> is again a Dirichlet
distribution and given as</p>
<p>
<div class="math">$$P(\mathbb{p} | x_1,\dots,x_n) = \textrm{Dirichlet}(\alpha_1 + n_1, \dots,
\alpha_K + n_K).$$</div>
</p>
<p>We can now attempt to compute the squared-error optimal point estimate of the
entropy under this posterior. One of the main contributions of Wolpert and
Wolf is to provide a family of results that enable moment computations of the
Shannon entropy under the Dirichlet distribution.</p>
<p>In particular, with <span class="math">\(n = \sum_{k=1}^K n_k\)</span> and <span class="math">\(\alpha = \sum_{k=1}^K
\alpha_k\)</span>, they provide the posterior mean of the entropy as</p>
<p>
<div class="math">$$\hat{H}_{\textrm{Bayes}} = \mathbb{E}[H(\mathbb{p}) | n_1,\dots,n_K] =
\psi(n + \alpha + 1)
- \sum_{k=1}^K \frac{n_k+\alpha_k}{n+\alpha} \psi(n_k + \alpha_k + 1),$$</div>
</p>
<p>where <span class="math">\(\psi\)</span> is the <a href="http://en.wikipedia.org/wiki/Digamma_function">digamma
function</a>.
This expression is efficient to compute, and similarly the second moment and
hence the variance of <span class="math">\(H(p)\)</span> under the posterior can be computed efficiently.</p>
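<p>For integer pseudo-counts, such as the Laplace choice <span class="math">\(\alpha_k = 1\)</span> among the standard options discussed next, the digamma function reduces to a harmonic number, <span class="math">\(\psi(m) = -\gamma + \sum_{i=1}^{m-1} 1/i\)</span>, so the estimator can be sketched with the standard library alone. A minimal implementation of my own, reporting entropy in nats:</p>

```python
import math

EULER_GAMMA = 0.5772156649015329

def psi_int(m):
    # Digamma at a positive integer: psi(m) = -gamma + H_{m-1}
    return -EULER_GAMMA + sum(1.0 / i for i in range(1, m))

def bayes_entropy(counts, a=1):
    # Posterior mean of H (in nats) under a symmetric Dirichlet(a) prior,
    # for an integer pseudo-count a (Wolpert & Wolf)
    n = sum(counts)
    A = a * len(counts)
    return psi_int(n + A + 1) - sum(
        (nk + a) / (n + A) * psi_int(nk + a + 1) for nk in counts
    )
```

<p>As a sanity check, for nearly uniform data the estimate approaches <span class="math">\(\log K\)</span>, and for data concentrated in one bin it stays close to zero.</p>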
<p>The only open question is how to select the prior vector of <span class="math">\(\mathbb{\alpha}\)</span>.
In absence of further information about the distribution we can assume
symmetry.
Then there are four common options,</p>
<ol>
<li><span class="math">\(\alpha_k = 1\)</span>, due to Bayes in 1763 and Laplace in 1812.</li>
<li><span class="math">\(\alpha_k = 1/K\)</span>, due to
<a href="http://www.actuaries.org.uk/system/files/documents/pdf/0285-0334.pdf">Perks in 1947</a>.</li>
<li><span class="math">\(\alpha_k = 1/2\)</span>, due to Jeffreys in 1946 and 1961.</li>
<li><span class="math">\(\alpha_k = 0\)</span>, due to Haldane in 1948. This yields an <a href="http://en.wikipedia.org/wiki/Prior_probability#Improper_priors">improper
prior</a>.</li>
</ol>
<p>It may not be clear which choice is the best, but I found an interesting
discussion in a paper by <a href="http://www.aaai.org/ocs/index.php/IJCAI/IJCAI11/paper/viewFile/3292/3802">de Campos and
Benavoli</a>.
Further down in this article we will be better equipped to assess the above
choices.</p>
<p>Independent of the choice of the prior parameter Wolpert and Wolf are very
optimistic about their model and highlight the advantages that come from the
Bayesian approach:</p>
<blockquote>
<p>"One of the strength of Bayesian analysis is its power for dealing with such
small-data cases. In particular, not only are Bayesian estimators in many
respects more 'reasonable' than non-Bayesian estimators for small data, they
also naturally provide error bars to govern one's use of their results.
...
In addition, the Bayesian formalism automatically tells you when it is unsure
of its estimate, through its error bars."</p>
</blockquote>
<p>Also, on the empirical performance they comment,</p>
<blockquote>
<p>"... for all N the Bayes estimator has a smaller mean-squared error than the
frequency-counts estimator."</p>
</blockquote>
<p>And indeed, also asymptotically the prior has support for every possible
distribution, so consistency of the estimated entropy is guaranteed as
<span class="math">\(n\to\infty\)</span>.</p>
<p>All good then?</p>
<p>Here is the comparison of the squared error and bias of various Bayes
estimators with different choices of prior <span class="math">\(\alpha\)</span>. The plot shows, like in
the previous article, the performance when evaluated on data generated from a
<em>different</em> Dirichlet prior. Each value on the x-axis is a different
generating distribution, but the prior of the estimator remains fixed.</p>
<p><img alt="RMSE and bias experiments for different alpha
hyperparameters" src="http://www.nowozin.net/sebastian/blog/images/entropy-estimation-3-exp.svg" /></p>
<p>While all of the Bayes estimators perform better than the plugin estimator,
overall they all fare quite badly:
error and bias are low only at the matching <span class="math">\(\alpha\)</span> value, but
deteriorate quickly at other values of <span class="math">\(\alpha\)</span>.</p>
<p>How can this be the case?</p>
<h2>Nemenman-Shafee-Bialek</h2>
<p>In 2002 <a href="http://arxiv.org/abs/physics/0108025">Nemenman, Shafee, and Bialek</a>
recognized that the innocent looking Dirichlet-Multinomial model implies a
very concentrated prior belief over the entropy of the distribution:</p>
<blockquote>
<p>"Thus a seemingly innocent choice of the prior ... leads to a disaster:
fixing <span class="math">\(\alpha\)</span> specifies the entropy almost uniquely. Furthermore, the
situation persists even after we observe some data: until the distribution
is well sampled, our estimate of the entropy is dominated by the prior!"</p>
</blockquote>
<h3>The Implied Beliefs over the Entropy</h3>
<p>The following experiment visualizes this: each of the following histograms
shows the <em>implied prior</em> over <span class="math">\(H(\mathbb{p})\)</span>.
To create each histogram, I fixed <span class="math">\(K\)</span> and <span class="math">\(\alpha\)</span> and take 1,000,000 samples
of distributions <span class="math">\(\mathbb{p}\)</span>, then record its entropy.
In each histogram plot the x-axis covers exactly the full range over
possible entropies.</p>
<p><img alt="Induced prior on the entropy, K=2" src="http://www.nowozin.net/sebastian/blog/images/entropy-estimation-3-prior-K2.svg" /></p>
<p>For the case <span class="math">\(K=2\)</span> everything looks fine: the implied prior spreads well over
the entire range of possible entropies.
But look what happens for <span class="math">\(K=10\)</span> and <span class="math">\(K=100\)</span>:</p>
<p><img alt="Induced prior on the entropy, K=10" src="http://www.nowozin.net/sebastian/blog/images/entropy-estimation-3-prior-K10.svg" />
<img alt="Induced prior on the entropy, K=100" src="http://www.nowozin.net/sebastian/blog/images/entropy-estimation-3-prior-K100.svg" /></p>
<p>Here, the implied prior clearly concentrates sharply. (The least possible
concentration of the entropy is achieved using Perks' choice of <span class="math">\(\alpha =
1/K\)</span>.)
In fact, there is no choice of <span class="math">\(\alpha\)</span> for which the prior belief over the
very quantity to be estimated does not concentrate as <span class="math">\(K \to \infty\)</span>.
If we have no reason to believe that the entropy really is in the range where
the prior dictates it should be, then this is a bad prior.</p>
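<p>This concentration effect is easy to reproduce with a small simulation (a stdlib-only sketch of my own; the Dirichlet is sampled by normalizing Gamma draws):</p>

```python
import math
import random

random.seed(0)

def sample_dirichlet(alpha, K):
    # A symmetric Dirichlet sample via normalized Gamma variates
    g = [random.gammavariate(alpha, 1.0) for _ in range(K)]
    s = sum(g)
    return [x / s for x in g]

def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0)

def implied_entropy_sd(K, alpha, draws=2000):
    # Spread of the implied prior belief over H(p)
    hs = [entropy(sample_dirichlet(alpha, K)) for _ in range(draws)]
    m = sum(hs) / draws
    return math.sqrt(sum((h - m) ** 2 for h in hs) / draws)

sd_K2 = implied_entropy_sd(2, 1.0)
sd_K100 = implied_entropy_sd(100, 1.0)
```

<p>With <span class="math">\(\alpha = 1\)</span>, the spread of the implied entropy prior drops sharply from <span class="math">\(K=2\)</span> to <span class="math">\(K=100\)</span>, even though the range of attainable entropies grows from <span class="math">\(\log 2\)</span> to <span class="math">\(\log 100\)</span>.</p>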
<p>How did Nemenman, Shafee, and Bialek solve this problem?</p>
<h2>NSB estimator</h2>
<p>They construct a mixture-of-Dirichlet prior by defining a hyperprior on
<span class="math">\(\alpha\)</span> itself. The hyperprior <span class="math">\(P(\alpha)\)</span> is chosen such that</p>
<p>
<div class="math">$$P(\alpha) \propto \frac{\textrm{d} \mathbb{E}[H|\alpha]}{\textrm{d} \alpha}.$$</div>
</p>
<p>Let us take a look at how this can be derived.
Nemenman and collaborators first show that under the Dirichlet-Multinomial
model the expected entropy is a strictly monotonic continuous function in
<span class="math">\(\alpha\)</span>, and therefore it is invertible.
Let us define the shorthand <span class="math">\(g^{-1}(\alpha) := \mathbb{E}[H|\alpha]\)</span> as the function
that takes <span class="math">\(\alpha\)</span> to the expected entropy.
Now, by the <a href="http://en.wikipedia.org/wiki/Random_variable#Functions_of_random_variables">transformation formula for random
variables</a>,
we have the induced density</p>
<p>
<div class="math">$$P_{\alpha}(\alpha) = P_H(g^{-1}(\alpha))
\cdot \left|\frac{\textrm{d} g^{-1}(\alpha)}{\textrm{d} \alpha}\right|.$$</div>
</p>
<p>If we assume that <span class="math">\(P(H|\alpha)\)</span> is highly concentrated (which, as the
above plots show, holds at least for large <span class="math">\(K\)</span>), then
<span class="math">\(P_H(g^{-1}(\alpha)) \approx P(H|\alpha)\)</span>, and we want this density to be
constant. Hence, we have</p>
<p>
<div class="math">$$P_{\alpha}(\alpha) \propto
\left|\frac{\textrm{d} g^{-1}(\alpha)}{\textrm{d} \alpha}\right|.$$</div>
</p>
<p>Because the right hand side is positive, with <span class="math">\(g^{-1}(\alpha) =
\mathbb{E}[H|\alpha]\)</span> this yields exactly the original expression above.
This expression has an analytic solution which, properly normalized, is</p>
<p>
<div class="math">$$P(\alpha) = \frac{1}{\log K} \left(K \psi_1(K \alpha + 1)
- \psi_1(\alpha+1)\right),$$</div>
</p>
<p>where <span class="math">\(\psi_1\)</span> is the <a href="http://en.wikipedia.org/wiki/Trigamma_function">trigamma
function</a>.</p>
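<p>Since the trigamma function enters only through this formula, a standard-library sketch suffices (my own helper; <code>trigamma</code> sums the series <span class="math">\(\psi_1(x) = \sum_{k \geq 0} (x+k)^{-2}\)</span> with an asymptotic tail correction):</p>

```python
import math

def trigamma(x):
    # psi_1(x) = sum_{k >= 0} 1/(x+k)^2, truncated after 200 terms
    # with an asymptotic correction for the remaining tail
    s = sum(1.0 / (x + k) ** 2 for k in range(200))
    y = x + 200.0
    return s + 1.0 / y + 1.0 / (2.0 * y * y) + 1.0 / (6.0 * y ** 3)

def nsb_hyperprior(alpha, K):
    # P(alpha) = (K psi_1(K alpha + 1) - psi_1(alpha + 1)) / log K
    return (K * trigamma(K * alpha + 1.0) - trigamma(alpha + 1.0)) / math.log(K)

# The density should integrate to (approximately) one; trapezoid over [0, 50]
h = 0.01
vals = [nsb_hyperprior(i * h, 10) for i in range(5001)]
total = h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))   # close to 1
```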
<p><img alt="NSB Dirichlet-mixture prior for entropy estimation, K=2,10, and
100" src="http://www.nowozin.net/sebastian/blog/images/entropy-estimation-3-nsb.svg" /></p>
<p>Let us look at the implied priors over the entropy when using the NSB prior.
They are much more uniform now:</p>
<p><img alt="Implied entropy beliefs when using the NSB Dirichlet-mixture prior, K=2,10, and
100" src="http://www.nowozin.net/sebastian/blog/images/entropy-estimation-3-nsb-entropy-hist.svg" /></p>
<p>This uniformity results in the NSB estimator having excellent robustness
properties and small bias. It is probably the best general purpose discrete
entropy estimator available.
One drawback however is the increased computational cost: in order to compute
the estimator we need to solve a 1D integral numerically over <span class="math">\(\alpha\)</span>.
Each pointwise evaluation of the integrand function corresponds to computing
<span class="math">\(\hat{H}_{\textrm{Bayes}}\)</span> for a fixed value of <span class="math">\(\alpha\)</span>.
High accuracy requires several hundred such evaluations, and this may be
prohibitively expensive in some applications (for example, decision tree
induction). </p>
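<p>Putting the pieces together, here is a compact stdlib-only sketch of the NSB point estimate. This is my own implementation, not the authors' code: the digamma and trigamma approximations and the fixed log-spaced grid over <span class="math">\(\alpha\)</span> are simplifications.</p>

```python
import math
from math import lgamma, log

def digamma(x):
    # Recurrence into the asymptotic region, then a short series
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    return r + log(x) - 1.0 / (2.0 * x) - 1.0 / (12.0 * x * x) + 1.0 / (120.0 * x ** 4)

def trigamma(x):
    # Series for psi_1 with an asymptotic tail correction
    s = sum(1.0 / (x + k) ** 2 for k in range(200))
    y = x + 200.0
    return s + 1.0 / y + 1.0 / (2.0 * y * y) + 1.0 / (6.0 * y ** 3)

def bayes_mean_entropy(counts, a):
    # E[H | data, alpha], the Wolpert-Wolf posterior mean, in nats
    t = sum(counts) + len(counts) * a
    return digamma(t + 1.0) - sum(
        (nk + a) / t * digamma(nk + a + 1.0) for nk in counts)

def log_evidence(counts, a):
    # log P(data | alpha) of the Dirichlet-multinomial model
    # (the alpha-independent multinomial coefficient is dropped)
    n, K = sum(counts), len(counts)
    return (lgamma(K * a) - lgamma(n + K * a)
            + sum(lgamma(nk + a) - lgamma(a) for nk in counts))

def nsb_entropy(counts):
    K = len(counts)
    # Log-spaced grid over alpha in [1e-3, 1e3]
    grid = [10.0 ** (-3.0 + 6.0 * i / 400.0) for i in range(401)]
    logw = []
    for a in grid:
        prior = K * trigamma(K * a + 1.0) - trigamma(a + 1.0)  # unnormalized NSB prior
        # + log(a) accounts for the log-spaced grid measure d(log alpha)
        logw.append(log(prior) + log_evidence(counts, a) + log(a))
    m = max(logw)
    w = [math.exp(v - m) for v in logw]
    return sum(wi * bayes_mean_entropy(counts, a)
               for wi, a in zip(w, grid)) / sum(w)
```

<p>On a histogram of 100 samples spread evenly over <span class="math">\(K=10\)</span> bins, the estimate lands close to the true value of <span class="math">\(\log 10 \approx 2.30\)</span> nats.</p>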
<h3>Addendum: Undersampled Regime</h3>
<p>After a comment from Ilya Nemenman on the previous version of this article, I
also did an experiment in the undersampled regime (<span class="math">\(N < K\)</span>), where we observe
fewer outcomes than there are bins. I am glad I did perform this experiment!</p>
<p>I select <span class="math">\(N=100\)</span> and <span class="math">\(K=2000\)</span>, with <span class="math">\(500\)</span> replicates and compare the same
methods as in the second part of the article. The results are as follows.</p>
<p><img alt="RMSE and bias experiments for undersampled
regime" src="http://www.nowozin.net/sebastian/blog/images/entropy-estimation-3-undersampled-exp.svg" /></p>
<p>Almost all estimators perform very poorly in this setting, with the naive
Miller correction even being off the chart.
Only the NSB and the Hausser-Strimmer estimator can be considered usable in
this severely undersampled regime, with clear preference towards the NSB
estimator.</p>
<p><a href="http://www.physics.emory.edu/home/people/faculty/nemenman-ilya.html">Ilya
Nemenman</a>,
the inventor of the NSB estimator, was kind enough to share his feedback on
these experiments with me and to allow me to post them here:</p>
<blockquote>
<p>I am glad to hear that NSB estimator did well on this test. It's also not
surprising that HS estimator did rather well too -- in some sense, it's a
frequentist version of NSB. Both NSB and HS perform shrinking towards the
uniform distribution (infinite pseudocounts or "alpha" in your notation),
and then they lift the shrinkage as <span class="math">\(N\)</span> grows. However, HS shrinks much
stronger than NSB does. As a result, HS performs very well for large entropy
(large alpha) distributions, and worse for lower entropies. It's probably
possible to set up a frequentist shrinkage estimator that would shrink
towards entropy being half of the maximum value, or shrink towards the
maximum value, but less strongly than HS — I think that such an estimator
would do better over the whole range of alpha. In practice, the strong
shrinkage imposed by HS becomes problematic when the alphabet size is very
large, say <span class="math">\(2^{150}\)</span>, which is what one gets when one takes a 30ms long
spike train and discretizes it at 0.2 ms resolution (yes spike = 1, no spike
=0). We had numbers like this in our 2008 PLoS Comp Bio paper. With entropy
of <span class="math">\(\approx 15\)</span> bits, alphabet size of <span class="math">\(2^{150}\)</span>, and 100-1000 samples, NSB
may work (more on this below), and HS will shrink towards 150 bits, and will
likely overestimate. One way to see this problem is to realize that, in your
comparison plots, once you use <span class="math">\(\alpha > 1\)</span>, the entropy
is nearly the maximum possible entropy. This is why HS works well there, but
fails for <span class="math">\(\alpha \ll 1\)</span>, where the entropy is substantially smaller than the
maximum. If you were to replot the data putting the true entropy of the
analyzed probability distribution (rather than alpha) on the horizontal
axis, this will be visible, I think.</p>
</blockquote>
<p>He continues,</p>
<blockquote>
<p>A key point for both NSB and HS is that both may work in the regime of
<span class="math">\(N \sim \sqrt{K}\)</span> (better yet, <span class="math">\(\sim \sqrt{2^{H/2}}\)</span>). On the contrary,
most other estimators you analyzed work well only up to <span class="math">\(N \sim 2^H\)</span> (unless
I am missing something important). This is because NSB and HS require not
good sampling of the underlying distribution, but coincidences in the data
only. They estimate entropy, effectively, by inverting the usual birthday
paradox, and using the frequency of coincidences to measure the diversity of
data. One can illustrate this by pushing <span class="math">\(K\)</span> to even larger values in your
last plot, 10000 or even more, if you limit yourself to smaller alpha.</p>
</blockquote>
<p>These comments are very insightful and show that my earlier discussion and
results were, in a way, limited to the simple case where we have a reasonable
number of samples per bin. The case Ilya considers in his work is the
severely undersampled regime.</p>
<p>One difficulty in producing the plot he suggests, with the true entropy of the
analyzed distribution along the x-axis, is that it would require an additional
binning operation along that axis, so I have not produced this plot yet.</p>
<h3>Reference Prior, Anyone?</h3>
<p>I wonder whether the NSB prior is a simplification of a full <a href="http://projecteuclid.org/euclid.aos/1236693154">reference
prior</a>
treatment. This is not exactly the standard setting of reference priors
because we are interested in a function (the entropy) of our random variables,
so there is an additional indirection. But I believe it could work as
follows: find in the space of all priors on <span class="math">\(\alpha\)</span> the prior that maximizes
the KL divergence between implied entropy prior and entropy posterior.</p>
<p>Using the numerical method suggested in the paper above, I obtained a
numerical reference prior (with one additional ABC approximation for the
likelihood) for <span class="math">\(K=2\)</span> and this closely matches the NSB prior.</p>
<p><img alt="Numerically obtained reference prior for K=2" src="http://www.nowozin.net/sebastian/blog/images/entropy-estimation-3-refprior.svg" /></p>
<p>(Interestingly, I recently discovered this work
on <a href="http://projecteuclid.org/euclid.ba/1422556416">overall objective priors</a>
in which their <em>hierarchical reference prior</em> approach for the
Dirichlet-Multinomial model yields an analytic proper prior which is very
similar to the NSB and numerical reference priors.)</p>
<h3>Further Reading</h3>
<p>As you hopefully have noticed, the problem of discrete entropy estimation is
quite rich and still actively being worked on.
Current work focuses on the case where the distribution is countably infinite.
The distribution of English words in popular usage is one example: there are
infinitely many possible words, but a total lexical
order implies a countable discrete distribution.</p>
<p>A great up-to-date overview of discrete entropy estimation, including a
summary of the current work on the countably infinite case is given in this
<a href="https://memming.wordpress.com/2014/02/09/a-guide-to-discrete-entropy-estimators/">article</a>
by <a href="http://www.memming.com/">Memming Park</a>.</p>
<p>For a general introduction to the difficulties of the entropy estimation
problem, this
<a href="http://stat.columbia.edu/~liam/research/pubs/info_est-nc.pdf">2003 paper</a> by
<a href="http://www.stat.columbia.edu/~liam/">Liam Paninski</a> is still the best entry
point.
Another <a href="http://arxiv.org/abs/cond-mat/0403192">very nice overview on entropy
estimation</a> is due to Thomas Schürmann
in 2004.
To me the best introduction to the family of Bayesian estimators is <a href="http://arxiv.org/abs/1302.0328">(Archer,
Park, Pillow, 2013)</a>.</p>
<p>If you wonder why I care about entropy estimation: <a href="http://www.nowozin.net/sebastian/papers/nowozin2012infogain.pdf">my ICML 2012
paper</a> was
the application that originally led me to consider the problem.</p>
<p><em>Acknowledgements</em>. I thank <a href="http://www.memming.com/">Il Memming Park</a>,
<a href="http://ei.is.tuebingen.mpg.de/person/jpeters">Jonas Peters</a>, and
<a href="http://www.physics.emory.edu/home/people/faculty/nemenman-ilya.html">Ilya Nemenman</a>
for reading a draft version of the article and providing very helpful feedback.</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = '//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: 'center'," +
" displayIndent: '0em'," +
" showMathMenu: true," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" +
" } " +
"}); ";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Machine Learning in Cambridge 20152015-02-26T20:00:00+01:00Sebastian Nowozintag:www.nowozin.net,2015-02-26:sebastian/blog/machine-learning-in-cambridge-2015.html<p>This year we (<a href="http://mlg.eng.cam.ac.uk/zoubin/">Zoubin</a>, together with
<a href="http://lopezpaz.org/">David</a> and myself) are again organizing a workshop
event for the local Cambridge (UK) machine learning community.
The schedule is available at the workshop homepage,
<a href="http://research.microsoft.com/en-us/um/cambridge/events/CamML2015/">Machine Learning in Cambridge
2015</a>, and
we also plan to make all talks available as video recordings after the event.</p>
<p>See you at the event!</p>Estimating Discrete Entropy, Part 22015-02-21T19:00:00+01:00Sebastian Nowozintag:www.nowozin.net,2015-02-21:sebastian/blog/estimating-discrete-entropy-part-2.html<p>In the <a href="http://www.nowozin.net/sebastian/blog/estimating-discrete-entropy-part-1.html">last part</a> we have looked at the
basic problem of discrete entropy estimation.
In this article we will see a number of proposals of improved estimators.</p>
<h3>Miller Correction</h3>
<p>In 1955 George Miller proposed a simple correction to the naive plugin
estimator <span class="math">\(\hat{H}_N\)</span> by adding the constant offset in the bias expression as
follows.</p>
<p>
<div class="math">$$\hat{H}_M = \hat{H}_N + \frac{K-1}{2n}.$$</div>
</p>
<p>This is an improvement over the plugin estimator, but the added offset does not
depend on the distribution itself, only on the alphabet size <span class="math">\(K\)</span> and the sample size <span class="math">\(n\)</span>. We can do better.</p>
<p>(A variant of the Miller estimator for the infinite alphabet case is the
so-called Miller-Madow estimator, in which the quantity <span class="math">\(K\)</span> is estimated from the
data as well.)</p>
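<p>As a concrete illustration, here is a minimal Python sketch of the Miller correction on top of the plugin estimate (the function names are mine, and the histogram is assumed to list all <span class="math">\(K\)</span> bins, empty ones included):</p>

```python
import math

def entropy_plugin(counts):
    # Naive plugin estimate in nats: H_N = log n - (1/n) sum_k h_k log h_k.
    n = sum(counts)
    return math.log(n) - sum(h * math.log(h) for h in counts if h > 0) / n

def entropy_miller(counts):
    # Miller correction: add the leading bias term (K - 1) / (2n).
    n = sum(counts)
    K = len(counts)  # assumes counts covers all K bins, empty ones included
    return entropy_plugin(counts) + (K - 1) / (2 * n)
```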
<h3>Jackknife Estimator</h3>
<p>A classic method for bias correction is the
<a href="http://en.wikipedia.org/wiki/Jackknife_resampling">jackknife</a> <a href="http://en.wikipedia.org/wiki/Resampling_(statistics)">resampling
method</a> due to
<a href="http://www.jstor.org/discover/10.2307/2332914">(Quenouille, 1947)</a>, although
the somewhat catchy name is due to <a href="http://en.wikipedia.org/wiki/John_Tukey">John
Tukey</a>.
(The literature on the jackknife methodology is quite classic now. A very
readable modern summary of the jackknife methodology can be found in
<a href="http://www.stat.purdue.edu/~dasgupta/">DasGupta's</a>
<a href="http://www.springer.com/mathematics/probability/book/978-0-387-75970-8">book</a>.
An older but still readable introduction is <a href="http://biomet.oxfordjournals.org/content/61/1/1.short">(Miller,
1974)</a>,
<a href="http://www.csee.wvu.edu/~xinl/library/papers/math/statistics/jackknife.pdf">PDF</a>.)</p>
<p>In a nutshell, jackknife resampling methods are used to estimate bias and
variance of estimators. They are typically simple to implement,
and often computationally cheaper than the bootstrap.
For the bias reduction application, they often manage to reduce bias
considerably, often knocking the bias down to <span class="math">\(O(n^{-2})\)</span>.</p>
<p>The use of jackknife bias estimation to improve entropy estimation was
suggested by <a href="http://www.jstor.org/discover/10.2307/1936227">(Zahl, 1977)</a>.
The jackknife bias-corrected estimator of the plugin estimator is given as
follows.</p>
<p>
<div class="math">\begin{equation}
\hat{H}_{J} = n \hat{H}_N - (n-1) \hat{H}^{(\cdot)}_N.
\label{H:jackknife}
\end{equation}</div>
</p>
<p>The quantity <span class="math">\(\hat{H}^{(\cdot)}_N\)</span> is the average of <span class="math">\(n\)</span> estimates
obtained by leaving out a single observation. Thus, writing
<span class="math">\(\mathbb{h}=(h_1,\dots,h_K)\)</span> for the histogram of bin counts on the full
sample, and <span class="math">\(\mathbb{h}_{\setminus i}\)</span> for the histogram in which the count of the bin <span class="math">\(k\)</span> with
<span class="math">\(X_i=k\)</span> is reduced by one, we have</p>
<p>
<div class="math">$$\hat{H}^{\setminus i}_N := \hat{H}_N(\mathbb{h}_{\setminus i}),$$</div>
</p>
<p>and the mean of these quantities,</p>
<p>
<div class="math">$$\hat{H}^{(\cdot)}_N :=
\frac{1}{n} \sum_{i=1}^n \hat{H}^{\setminus i}_N.$$</div>
</p>
<p>Interestingly, normally it would be expensive to compute <span class="math">\(n\)</span> leave-one-out
estimates. Here however, two tricks are possible: First, because the
histogram is a sufficient statistic, we need to compute only <span class="math">\(K\)</span> holdout
estimates instead of <span class="math">\(n\)</span>. Second, one can interleave computation in such a
way that computing each holdout estimate is <span class="math">\(O(1)\)</span>, reducing the overall
computation of <span class="math">\((\ref{H:jackknife})\)</span> to <span class="math">\(O(K)\)</span> runtime and no additional
memory over the plugin estimate. In essence, the computational complexity is
comparable to that of the inexpensive plugin estimate, making the jackknife
estimator computationally cheap.</p>
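<p>A short Python sketch of this computation (an illustrative implementation, names mine): only the <span class="math">\(K\)</span> distinct holdout estimates are formed, each in constant time by updating the sum <span class="math">\(S = \sum_k h_k \log h_k\)</span>:</p>

```python
import math

def entropy_jackknife(counts):
    """Jackknife bias-corrected plugin estimate,
    H_J = n * H_N - (n - 1) * mean of the leave-one-out estimates."""
    n = sum(counts)
    S = sum(h * math.log(h) for h in counts if h > 0)
    plugin = math.log(n) - S / n
    # Because the histogram is a sufficient statistic, all h_k samples of
    # bin k yield the same holdout estimate; removing one sample from bin k
    # changes only the k-th term of S, so each holdout estimate is O(1).
    holdout_mean = 0.0
    for h in counts:
        if h == 0:
            continue
        S_minus = S - h * math.log(h)
        if h > 1:
            S_minus += (h - 1) * math.log(h - 1)
        holdout_mean += h * (math.log(n - 1) - S_minus / (n - 1))
    holdout_mean /= n
    return n * plugin - (n - 1) * holdout_mean
```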
<h3>Grassberger Estimator</h3>
<p>Another proposal for an improved estimator is due to
<a href="http://www.ucalgary.ca/complexity/people/faculty/peter">Peter Grassberger</a>.
In <a href="http://arxiv.org/abs/physics/0307138">(Grassberger, 2003)</a>
he derives two estimators based on an argument using analytic continuation
which I have to admit is somewhat beyond my grasp.
The better of the two estimators is the following:</p>
<p>
<div class="math">$$\hat{H}_G = \log n - \frac{1}{n} \sum_{k=1}^K h_k G(h_k),$$</div>
</p>
<p>where the logarithm in the original naive estimator <span class="math">\(\hat{H}_N\)</span> has been
replaced by a scalar function <span class="math">\(G\)</span>, defined as</p>
<p>
<div class="math">$$G(h) = \psi(h) + \frac{1}{2} (-1)^{h}
\left(\psi(\frac{h+1}{2}) - \psi(\frac{h}{2})\right).$$</div>
</p>
<p>The function <span class="math">\(\psi\)</span> is the <a href="http://en.wikipedia.org/wiki/Digamma_function">digamma
function</a>.
(The function <span class="math">\(G(h)\)</span> is the solution of <span class="math">\(G(h)=\psi(h)+(-1)^h \int^1_0
\frac{x^{h-1}}{x+1} \textrm{d}x\)</span> given as equation <span class="math">\((30)\)</span> in <a href="http://arxiv.org/pdf/physics/0307138v2.pdf">the
paper</a>.)
Computationally this estimator is almost as efficient as the original plugin
estimator, because for integer arguments the digamma function can be
accurately approximated by an <a href="http://en.wikipedia.org/wiki/Digamma_function#Computation_and_approximation">efficient series
expansion</a>.</p>
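<p>In fact, one can avoid the digamma function entirely: from the integral form of <span class="math">\(G\)</span> one obtains <span class="math">\(G(1) = -\gamma - \log 2\)</span> and the recurrence <span class="math">\(G(h+1) = G(h) + 2/h\)</span> for odd <span class="math">\(h\)</span>, <span class="math">\(G(h+1) = G(h)\)</span> for even <span class="math">\(h\)</span>, where <span class="math">\(\gamma\)</span> is the Euler-Mascheroni constant. A Python sketch using this recurrence (an illustrative implementation, not code from the paper):</p>

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def grassberger_G(h):
    # G(1) = -gamma - log 2; then G(h+1) = G(h) + 2/h for odd h,
    # and G(h+1) = G(h) for even h (so G stays flat at every odd step up).
    g = -EULER_GAMMA - math.log(2.0)
    for j in range(1, h):
        if j % 2 == 1:
            g += 2.0 / j
    return g

def entropy_grassberger(counts):
    # H_G = log n - (1/n) sum_k h_k G(h_k), in nats.
    n = sum(counts)
    return math.log(n) - sum(h * grassberger_G(h) for h in counts if h > 0) / n
```

<p>For repeated use one would tabulate <span class="math">\(G\)</span> up to the largest count once; the per-call loop above is kept for clarity.</p>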
<p>When compared to the plugin estimator (in <a href="http://www.nowozin.net/sebastian/blog/estimating-discrete-entropy-part-1.html">histogram count
form</a>), we can see an upwards
correction of this estimator but also an interesting difference between even
and odd histogram counts.</p>
<p><img alt="xlogx compared to xGx" src="http://www.nowozin.net/sebastian/blog/images/entropy-estimation-2-gb.svg" /></p>
<p>As noted above, the original derivation in Grassberger's paper is quite
involved.
However, for practical purposes, among the computationally efficient
estimators, the 2003 Grassberger estimator is probably the most useful and
robust one.</p>
<h2>Experiment</h2>
<p>The following plots show a simple evaluation of some popular discrete entropy
estimators. We assume a categorical distribution with <span class="math">\(K=64\)</span> outcomes with
the probability vector <span class="math">\(\mathbb{p}\)</span> sampled from a symmetric Dirichlet
distribution with hyperparameter <span class="math">\(\alpha \in [0.25,5.0]\)</span>.
We obtain <span class="math">\(n=100\)</span> samples from the distribution and estimate the entropy based
on this sample. For each <span class="math">\(\alpha\)</span> we repeat this procedure 5,000 times to
estimate the root mean squared error (RMSE) and bias.</p>
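<p>The evaluation loop is easy to reproduce. The following Python sketch estimates RMSE and bias for the plugin estimator alone, with far fewer repetitions than the 5,000 used for the plots below (all names are mine; the Dirichlet draw is realized as normalized Gamma variates):</p>

```python
import math
import random
from collections import Counter

def plugin_entropy(counts):
    # Naive plugin estimate, H_N = log n - (1/n) sum_k h_k log h_k (nats).
    n = sum(counts)
    return math.log(n) - sum(h * math.log(h) for h in counts if h > 0) / n

def rmse_and_bias(alpha, K=64, n=100, repeats=200, seed=0):
    """Monte Carlo RMSE and bias of the plugin estimator for
    p ~ Dirichlet(alpha, ..., alpha) with K outcomes and n samples."""
    rng = random.Random(seed)
    errors = []
    for _ in range(repeats):
        # Sample p from a symmetric Dirichlet via normalized Gamma draws.
        g = [rng.gammavariate(alpha, 1.0) for _ in range(K)]
        z = sum(g)
        p = [x / z for x in g]
        true_h = -sum(pk * math.log(pk) for pk in p if pk > 0)
        # Draw n categorical samples and form the histogram of counts.
        sample = rng.choices(range(K), weights=p, k=n)
        hist = Counter(sample)
        est = plugin_entropy([hist.get(k, 0) for k in range(K)])
        errors.append(est - true_h)
    bias = sum(errors) / repeats
    rmse = math.sqrt(sum(e * e for e in errors) / repeats)
    return rmse, bias

rmse, bias = rmse_and_bias(alpha=1.0, repeats=100)
```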
<p>We plot all the estimators discussed above, but also plot four additional
estimators:</p>
<ul>
<li><a href="http://arxiv.org/abs/0811.3579">Hausser-Strimmer estimator</a>, based on
a shrinkage estimate,</li>
<li><a href="http://www.martinvinck.com/page2/assets/PRE_entropy00_subm_rev_00.pdf">Polynomial estimator</a>
due to Vinck et al.; this is equivalent to <a href="http://math2.uncc.edu/~zzhang/papers/ZhangNUCO2012.pdf">Zhiyi Zhang's
estimator</a>,
but numerically simpler and more stable to evaluate. (I only mention this
here because I have not found this mentioned elsewhere.)</li>
<li>Two Bayesian estimators (Bayes and NSB), to be discussed in the next post.</li>
</ul>
<p><img alt="RMSE and bias experiments for seven different discrete entropy
estimators" src="http://www.nowozin.net/sebastian/blog/images/entropy-estimation-2-exp.svg" /></p>
<p>These experiments are not fully representative because the Dirichlet prior
makes assumptions which may not be satisfied in your application; in
particular, this experiment considers the well-sampled case (<span class="math">\(n > K\)</span>), and
simple bias correction methods work well in this regime (this was pointed out
to me by Ilya and Jonas). However, some clear trends are visible:</p>
<ul>
<li>The Plugin estimator fares badly on both RMSE and bias;</li>
<li>The Miller estimator fares less badly, suggesting that the RMSE is driven mainly by
bias; however, significant errors remain for small values of <span class="math">\(\alpha\)</span>;</li>
<li>The Bayes estimator fares almost as badly as the plugin estimator, except for
<span class="math">\(\alpha=1/2\)</span>. More on this point in the next post;</li>
<li>The Jackknife, Grassberger 2003, and NSB estimators provide excellent
performance throughout the whole range of <span class="math">\(\alpha\)</span> values.</li>
<li>The performance of the Polynomial and Hausser estimators is mediocre.</li>
</ul>
<p>In the next part we will be looking at Bayesian estimators.</p>
<p><em>Acknowledgements</em>. I thank <a href="http://www.memming.com/">Il Memming Park</a>,
<a href="http://ei.is.tuebingen.mpg.de/person/jpeters">Jonas Peters</a>, and
<a href="http://www.physics.emory.edu/home/people/faculty/nemenman-ilya.html">Ilya Nemenman</a>
for reading a draft version of the article and providing very helpful feedback.</p>
Estimating Discrete Entropy, Part 12015-02-07T14:00:00+01:00Sebastian Nowozintag:www.nowozin.net,2015-02-07:sebastian/blog/estimating-discrete-entropy-part-1.html<p>Estimation of the
<a href="http://en.wikipedia.org/wiki/Entropy_%28information_theory%29">entropy</a> of a
random variable is an important problem that has many applications.
If you can estimate entropy accurately, you can also estimate <a href="http://en.wikipedia.org/wiki/Mutual_information">mutual
information</a>, which allows
you to find dependent random variables in large data sets.
There are <a href="http://en.wikipedia.org/wiki/Mutual_information#Applications_of_mutual_information">numerous
applications</a>.</p>
<p>The setting of discrete entropy estimation with a finite number of outcomes is
as follows.
There is an unknown categorical distribution over <span class="math">\(K \geq 2\)</span> different
outcomes, defined by means of a probability vector
<span class="math">\(\mathbb{p} = (p_1,p_2,\dots,p_K)\)</span>, such that <span class="math">\(p_k \geq 0\)</span> and
<span class="math">\(\sum_k p_k = 1\)</span>.
We are interested in the quantity</p>
<p>
<div class="math">\begin{equation}
H(\mathbb{p}) = -\sum_{k=1}^K p_k \log p_k,\label{eqn:Hdiscrete}
\end{equation}</div>
</p>
<p>where <span class="math">\(0 \log 0 = 0\)</span> by convention.</p>
<p>Because the probability vector is unknown to us we cannot directly use
<span class="math">\((\ref{eqn:Hdiscrete})\)</span>.
Instead we assume that we observe <span class="math">\(n\)</span> samples <span class="math">\(X_i\)</span>, <span class="math">\(i=1,\dots,n\)</span>, from the
categorical distribution in order to estimate <span class="math">\(H(\mathbb{p})\)</span>.</p>
<h3>Naive Plugin Estimator of the Discrete Entropy</h3>
<p>The naive plugin estimator uses the frequency estimates of the categorical
probabilities in the expression for the true entropy, that is,</p>
<p>
<div class="math">\begin{equation}
\hat{H}_N = - \sum_{k=1}^K \hat{p}_k \log \hat{p}_k,\label{Hplugin1}
\end{equation}</div>
</p>
<p>where <span class="math">\(\hat{p}_k = h_k / n\)</span> are the maximum likelihood estimates of each
probability <span class="math">\(p_k\)</span>, and <span class="math">\(h_k = \sum_{i=1}^n 1_{\{X_i = k\}}\)</span> is simply the
histogram over outcomes.
The form <span class="math">\((\ref{Hplugin1})\)</span> is equivalent to the simpler form</p>
<p>
<div class="math">$$\hat{H}_N = \log n - \frac{1}{n} \sum_{k=1}^K h_k \log h_k.$$</div>
</p>
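<p>Both forms are straightforward to implement. A short Python sketch (function names mine) that also makes their equivalence easy to verify:</p>

```python
import math

def entropy_plugin(counts):
    """Plugin estimate in histogram-count form (in nats):
    H_N = log n - (1/n) * sum_k h_k log h_k, with 0 log 0 = 0."""
    n = sum(counts)
    return math.log(n) - sum(h * math.log(h) for h in counts if h > 0) / n

def entropy_plugin_direct(counts):
    # Equivalent direct form: -sum_k p_hat_k log p_hat_k with p_hat_k = h_k / n.
    n = sum(counts)
    return -sum((h / n) * math.log(h / n) for h in counts if h > 0)
```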
<h3>Problems of the Naive Plugin Estimator</h3>
<p>It has long been known, due to <a href="http://epubs.siam.org/doi/abs/10.1137/1104033">(Basharin,
1959)</a> and <a href="http://www.dtic.mil/cgi-bin/GetTRDoc?AD=ADA020217">(Harris,
1975)</a> that the estimator
<span class="math">\((\ref{Hplugin1})\)</span> underestimates the true entropy <span class="math">\((\ref{eqn:Hdiscrete})\)</span>.
In fact, we have for any distribution specified by <span class="math">\(\mathbb{p}\)</span> that</p>
<p>
<div class="math">$$H(\mathbb{p}) - \mathbb{E}[\hat{H}_N] =
\frac{K-1}{2n}
- \frac{1}{12 n^2} \left(1-\sum_{k=1}^{K} \frac{1}{p_k}\right)
+ O(n^{-3}) \geq 0,$$</div>
</p>
<p>so that most often the true entropy is at least as large as what <span class="math">\(\hat{H}_N\)</span>
claims it is.
Why is this the case? There is a simple explanation illustrated by
the following figure and description.</p>
<p><img alt="xlogx bias explanation" src="http://www.nowozin.net/sebastian/blog/images/entropy-estimation-1-bias.svg" /></p>
<p>Let us only consider a single bin <span class="math">\(k\)</span> with true probability <span class="math">\(p_k\)</span>.
If we would know <span class="math">\(p_k\)</span> exactly, the contribution this bin makes to the true
entropy of the distribution is <span class="math">\(-p_k \log p_k\)</span>.
We do not know <span class="math">\(p_k\)</span> and instead estimate it using its frequency estimate
<span class="math">\(\hat{p}_k = h_k / n\)</span>. The marginal distribution of <span class="math">\(\hat{p}_k\)</span> is a
<a href="http://en.wikipedia.org/wiki/Binomial_distribution">Binomial distribution</a>.</p>
<p>I have shown an empirical histogram of 50,000 samples from a
<span class="math">\(\textrm{Binomial}(1000,p_k)\)</span> distribution in red, where <span class="math">\(p_k=0.27\)</span> in this
case. As you can see, there is significant sampling variance about the true
<span class="math">\(p_k\)</span>, despite having seen 1,000 samples. It is however exactly centered at
<span class="math">\(p_k\)</span> because <span class="math">\(\hat{p}_k\)</span> is an <em>unbiased</em> estimate of <span class="math">\(p_k\)</span>, that is we have
<span class="math">\(\mathbb{E} \hat{p}_k = p_k\)</span>. It also is <a href="http://en.wikipedia.org/wiki/Binomial_distribution#Normal_approximation">approximately
normally
distributed</a>,
as can be clearly seen in the Gaussian shape of the red histogram.</p>
<p>When we now evaluate the function <span class="math">\(f(x) = -x \log x\)</span> we evaluate it at the
slightly wrong place <span class="math">\(\hat{p}_k\)</span> instead of the true place <span class="math">\(p_k\)</span>.
Because <span class="math">\(f\)</span> is concave in this case, the famous <a href="http://en.wikipedia.org/wiki/Jensen%27s_inequality">Jensen's
inequality</a> tells us that</p>
<p>
<div class="math">$$H = \sum_k f(p_k) = \sum_k f(\mathbb{E} \hat{p}_k) \geq \sum_k \mathbb{E}
f(\hat{p}_k) = \mathbb{E} \sum_k f(\hat{p}_k) = \mathbb{E} \hat{H}_N,$$</div>
</p>
<p>so that for each <span class="math">\(p_k\)</span> the contribution to the entropy is underestimated on
average. (This does not imply that each particular finite sample estimate is
below the true entropy however.)</p>
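<p>This underestimation is easy to see numerically. The following Python sketch repeats the single-bin experiment from the figure (<span class="math">\(p_k = 0.27\)</span>, <span class="math">\(n = 1000\)</span>) and compares the average of <span class="math">\(f(\hat{p}_k)\)</span> with <span class="math">\(f(p_k)\)</span>:</p>

```python
import math
import random

def f(x):
    # Contribution of a single bin to the entropy: f(x) = -x log x, f(0) = 0.
    return -x * math.log(x) if x > 0 else 0.0

p, n, repeats = 0.27, 1000, 5000
rng = random.Random(1)
total = 0.0
for _ in range(repeats):
    h = sum(rng.random() < p for _ in range(n))  # one Binomial(n, p) draw
    total += f(h / n)
mean_f = total / repeats
# By Jensen's inequality for the concave f, E[f(p_hat)] <= f(p), so
# mean_f should come out slightly below f(0.27) = 0.3535...
```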
<p>In the next part we will take a look at some improved estimators of the
discrete entropy.</p>
<p><em>Acknowledgements</em>. I thank <a href="http://www.memming.com/">Il Memming Park</a> and
<a href="http://ei.is.tuebingen.mpg.de/person/jpeters">Jonas Peters</a>
for reading a draft version of the article and providing feedback.</p>
Advanced Structured Prediction2015-01-29T22:30:00+01:00Sebastian Nowozintag:www.nowozin.net,2015-01-29:sebastian/blog/advanced-structured-prediction.html<p>In December 2014, just in time for NIPS, <a href="http://mitpress.mit.edu/">MIT Press</a>
released an edited volume on structured prediction models and their
applications in natural language processing, computer vision, and
computational biology.</p>
<p><img alt="Advanced Structured Prediction cover image" src="http://www.nowozin.net/sebastian/blog/images/asp-cover.jpg" /></p>
<p><a href="http://mitpress.mit.edu/books/advanced-structured-prediction">Advanced Structured Prediction</a>,
Editors Sebastian Nowozin, Peter V. Gehler, Jeremy Jancsary,
Christoph H. Lampert,
(<a href="http://mitpress.mit.edu/">MIT Press</a>,
<a href="http://www.amazon.com/Advanced-Structured-Prediction-Information-Processing/dp/0262028379">Amazon</a>)</p>
<p>The volume offers an overview of the recent research on structured prediction
in order to make the work accessible to a broader research community. The
chapters, by leading researchers in the field, cover a range of topics,
including research trends, the linear programming relaxation approach,
innovations in probabilistic modeling, recent theoretical progress, and
resource-aware learning.</p>
<h3>Contributors</h3>
<p><a href="https://www1.ethz.ch/bsse/cbg/people/behrj">Jonas Behr</a>,
<a href="http://yutianchen.com/">Yutian Chen</a>,
<a href="http://www.cs.cmu.edu/~ftorre/">Fernando De La Torre</a>,
<a href="http://users.cecs.anu.edu.au/~jdomke/">Justin Domke</a>,
<a href="http://files.is.tue.mpg.de/pgehler/">Peter V. Gehler</a>,
<a href="http://labs.yahoo.com/author/agelfand/">Andrew E. Gelfand</a>,
<a href="http://graal.ift.ulaval.ca/">Sébastien Giguère</a>,
<a href="http://www.cs.huji.ac.il/~gamir/">Amir Globerson</a>,
<a href="http://hci.iwr.uni-heidelberg.de/Staff/fhamprec/">Fred A. Hamprecht</a>,
<a href="http://www3.cs.stonybrook.edu/~minhhoai/">Minh Hoai</a>,
<a href="http://people.csail.mit.edu/tommi/">Tommi Jaakkola</a>,
<a href="http://www.jancsary.net/">Jeremy Jancsary</a>,
<a href="http://u.cs.biu.ac.il/~jkeshet/">Joseph Keshet</a>,
<a href="http://www2.informatik.hu-berlin.de/~kloftmar/">Marius Kloft</a>,
<a href="http://pub.ist.ac.at/~vnk/">Vladimir Kolmogorov</a>,
<a href="http://pub.ist.ac.at/~chl/">Christoph H. Lampert</a>,
<a href="http://www2.ift.ulaval.ca/~laviolette/">François Laviolette</a>,
<a href="http://www.xinghua-lou.org/">Xinghua Lou</a>,
<a href="http://www2.ift.ulaval.ca/~mmarchand/">Mario Marchand</a>,
<a href="http://www.cs.cmu.edu/~afm/Home.html">André F. T. Martins</a>,
<a href="http://ttic.uchicago.edu/~meshi/">Ofer Meshi</a>,
<a href="http://www.nowozin.net/">Sebastian Nowozin</a>,
<a href="http://ttic.uchicago.edu/~gpapan/">George Papandreou</a>,
<a href="http://cmp.felk.cvut.cz/~prusapa1/">Daniel Prusa</a>,
<a href="http://cbio.mskcc.org/directory/gunnar-ratsch/index.html">Gunnar Rätsch</a>,
<a href="http://graal.ift.ulaval.ca/">Amélie Rolland</a>,
<a href="http://hci.iwr.uni-heidelberg.de/Staff/bsavchyn/">Bogdan Savchynskyy</a>,
<a href="http://ipa.iwr.uni-heidelberg.de/dokuwiki/doku.php?id=schmidt">Stefan Schmidt</a>,
<a href="http://user.phil-fak.uni-duesseldorf.de/~tosch/">Thomas Schoenemann</a>,
<a href="http://homepages.inf.ed.ac.uk/gschweik/">Gabriele Schweikert</a>,
<a href="http://www.bentaskar.com/">Ben Taskar</a>,
<a href="http://web.engr.oregonstate.edu/~sinisa/">Sinisa Todorovic</a>,
<a href="http://www.mlplatform.nl/researchgroups/machine-learning-group-university-of-amsterdam/">Max Welling</a>,
<a href="http://scholar.google.com/citations?user=wMV9MiMAAAAJ&hl=en">David Weiss</a>,
<a href="http://cmp.felk.cvut.cz/~werner/">Thomas Werner</a>,
<a href="http://www.stat.ucla.edu/~yuille/">Alan Yuille</a>,
<a href="http://www.cs.ox.ac.uk/Stanislav.Zivny/homepage/">Stanislav Zivny</a>.</p>Streaming Mean and Variance Computation2015-01-25T21:30:00+01:00Sebastian Nowozintag:www.nowozin.net,2015-01-25:sebastian/blog/streaming-mean-and-variance-computation.html<p>Given a sequence of observed data we would often like to estimate simple
quantities like the mean and variance.</p>
<p>Sometimes the data is available in a <em>streaming</em> setting, that is, we are
given one sample at a time. For example, this is the case when</p>
<ul>
<li>the number of samples is apriori unknown,</li>
<li>we have to perform some stopping test after each sample,</li>
<li>the number of samples is very large and we cannot store all samples.</li>
</ul>
<p>More formally, given weighted observations <span class="math">\(X_1\)</span>, <span class="math">\(X_2\)</span>, <span class="math">\(\dots\)</span>, with <span class="math">\(X_i
\in \mathbb{R}\)</span>, and <span class="math">\(w_1\)</span>, <span class="math">\(w_2\)</span>, <span class="math">\(\dots\)</span>, with <span class="math">\(w_i \geq 0\)</span> we would like to
calculate simple statistics like the weighted mean or weighted variance of the
sample without having to store all samples, and by processing them one-by-one.</p>
<p>In this situation we can compute the mean and variance of a sample (and, more
generally, any higher-order moments) using a <a href="http://en.wikipedia.org/wiki/Streaming_algorithm">streaming
algorithm</a>.
Many possibilities exist but because of the incremental computation particular
attention needs to be paid to numerical stability.
If we were to ignore numerical accuracy we could use a simple derivation to
show that the following updates for <span class="math">\(i=1,2,\dots\)</span> are correct, when
initializing <span class="math">\(S^{(0)} = T^{(0)} = U^{(0)} = 0\)</span>:</p>
<p>
<div class="math">$$S^{(i+1)} = S^{(i)} + w_i$$</div>
<div class="math">$$T^{(i+1)} = T^{(i)} + w_i X_i$$</div>
<div class="math">$$U^{(i+1)} = U^{(i)} + w_i X_i^2$$</div>
</p>
<p>Then <span class="math">\(\hat{\mu} = T^{(n)} / S^{(n)}\)</span> is the weighted sample mean, and
<span class="math">\(\hat{\mathbb{V}} = \frac{n}{(n-1) S^{(n)}} (U^{(n)} - S^{(n)} \hat{\mu}^2)\)</span>
is an unbiased estimate of the weighted variance.</p>
<p>The problem with this naive derivation arises when <span class="math">\(n\)</span> is very large. Then,
in all three updates, the summation may add quantities of very different
magnitude, leading to <a href="http://www.validlab.com/goldberg/paper.pdf">large round-off
errors</a>.
By the way, this can even arise when one is computing the simple sum of many
numbers, and a classic solution in that case is <a href="http://en.wikipedia.org/wiki/Kahan_summation_algorithm">Kahan
summation</a>.</p>
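<p>For reference, here is a short Python sketch of Kahan's compensated summation: it carries the low-order bits lost in each addition forward into the next one.</p>

```python
def kahan_sum(values):
    total = 0.0
    c = 0.0  # running compensation for lost low-order bits
    for v in values:
        y = v - c            # subtract the error carried from the last step
        t = total + y        # add; low-order bits of y may be lost here
        c = (t - total) - y  # (t - total) recovers the rounded part of y
        total = t
    return total

# Summing one million copies of 0.1 naively accumulates visible round-off;
# the compensated sum does not.
naive = 0.0
for _ in range(1_000_000):
    naive += 0.1
compensated = kahan_sum(0.1 for _ in range(1_000_000))
```

<p>(Python's standard library also offers <code>math.fsum</code> for accurate plain sums.)</p>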
<p>A clever solution to this problem for streaming mean and variance computation
was proposed by West in 1979.
In his algorithm the summed quantities are controlled to be on average of
comparable size. (It is not the only alternative, for a detailed numerical
study of possible options, see the paper linked below.)</p>
<p>The West algorithm supports mean and variance computation for
positively weighted samples <span class="math">\((w_i, X_i)\)</span> with <span class="math">\(w_i \geq 0\)</span>, <span class="math">\(X_i \in
\mathbb{R}\)</span> and the original paper is</p>
<ul>
<li>D.H.D. West, <a href="http://people.xiph.org/~tterribe/tmp/homs/West79-_Updating_Mean_and_Variance_Estimates-_An_Improved_Method.pdf">"Updating Mean and Variance Estimates: An Improved
Method"</a>
(<a href="http://dl.acm.org/citation.cfm?id=359153">publisher link</a>),
Comm. of the ACM, Vol. 22, Issue 9, 532--535, 1979.</li>
</ul>
<p>It outputs</p>
<ol>
<li>The weighted unbiased mean estimate, <span class="math">\(\hat{\mu} = (\sum_i w_i X_i) / (\sum_i w_i)\)</span>,</li>
<li>The weighted unbiased variance estimate, <span class="math">\(\hat{\mathbb{V}} = \left(\sum_i w_i (X_i - \hat{\mu})^2\right) / (\frac{n-1}{n} \sum_i w_i)\)</span>.</li>
</ol>
<p>Here is an implementation for the <a href="http://julialang.org/">Julia programming
language</a>.</p>
<div class="highlight"><pre><span></span><span class="k">type</span><span class="nc"> MeanVarianceAccumulator</span>
<span class="n">sumw</span><span class="p">::</span><span class="kt">Float64</span>
<span class="n">wmean</span><span class="p">::</span><span class="kt">Float64</span>
<span class="n">t</span><span class="p">::</span><span class="kt">Float64</span>
<span class="n">n</span><span class="p">::</span><span class="kt">Int</span>
<span class="k">function</span><span class="nf"> MeanVarianceAccumulator</span><span class="p">()</span>
<span class="nb">new</span><span class="p">(</span><span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="k">function</span><span class="nf"> observe</span><span class="o">!</span><span class="p">(</span><span class="n">mvar</span><span class="p">::</span><span class="n">MeanVarianceAccumulator</span><span class="p">,</span> <span class="n">value</span><span class="p">,</span> <span class="n">weight</span><span class="p">)</span>
<span class="p">@</span><span class="nb">assert</span> <span class="n">weight</span> <span class="o">>=</span> <span class="mf">0.0</span>
<span class="n">q</span> <span class="o">=</span> <span class="n">value</span> <span class="o">-</span> <span class="n">mvar</span><span class="o">.</span><span class="n">wmean</span>
<span class="n">temp_sumw</span> <span class="o">=</span> <span class="n">mvar</span><span class="o">.</span><span class="n">sumw</span> <span class="o">+</span> <span class="n">weight</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">q</span><span class="o">*</span><span class="n">weight</span> <span class="o">/</span> <span class="n">temp_sumw</span>
<span class="n">mvar</span><span class="o">.</span><span class="n">wmean</span> <span class="o">+=</span> <span class="n">r</span>
<span class="n">mvar</span><span class="o">.</span><span class="n">t</span> <span class="o">+=</span> <span class="n">q</span><span class="o">*</span><span class="n">r</span><span class="o">*</span><span class="n">mvar</span><span class="o">.</span><span class="n">sumw</span>
<span class="n">mvar</span><span class="o">.</span><span class="n">sumw</span> <span class="o">=</span> <span class="n">temp_sumw</span>
<span class="n">mvar</span><span class="o">.</span><span class="n">n</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="n">nothing</span>
<span class="k">end</span>
<span class="n">count</span><span class="p">(</span><span class="n">mvar</span><span class="p">::</span><span class="n">MeanVarianceAccumulator</span><span class="p">)</span> <span class="o">=</span> <span class="n">mvar</span><span class="o">.</span><span class="n">n</span>
<span class="n">mean</span><span class="p">(</span><span class="n">mvar</span><span class="p">::</span><span class="n">MeanVarianceAccumulator</span><span class="p">)</span> <span class="o">=</span> <span class="n">mvar</span><span class="o">.</span><span class="n">wmean</span>
<span class="n">var</span><span class="p">(</span><span class="n">mvar</span><span class="p">::</span><span class="n">MeanVarianceAccumulator</span><span class="p">)</span> <span class="o">=</span> <span class="p">(</span><span class="n">mvar</span><span class="o">.</span><span class="n">t</span><span class="o">*</span><span class="n">mvar</span><span class="o">.</span><span class="n">n</span><span class="p">)</span><span class="o">/</span><span class="p">(</span><span class="n">mvar</span><span class="o">.</span><span class="n">sumw</span><span class="o">*</span><span class="p">(</span><span class="n">mvar</span><span class="o">.</span><span class="n">n</span><span class="o">-</span><span class="mi">1</span><span class="p">))</span>
<span class="n">std</span><span class="p">(</span><span class="n">mvar</span><span class="p">::</span><span class="n">MeanVarianceAccumulator</span><span class="p">)</span> <span class="o">=</span> <span class="n">sqrt</span><span class="p">(</span><span class="n">var</span><span class="p">(</span><span class="n">mvar</span><span class="p">))</span>
</pre></div>
<p>You would call it as follows (tested with Julia version 0.3.5):</p>
<div class="highlight"><pre><span></span><span class="n">X</span> <span class="o">=</span> <span class="p">[</span><span class="mf">5.0</span><span class="p">,</span> <span class="o">-</span><span class="mf">1.5</span><span class="p">,</span> <span class="mf">3.33</span><span class="p">]</span>
<span class="n">w</span> <span class="o">=</span> <span class="p">[</span><span class="mf">0.5</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">]</span>
<span class="n">n</span> <span class="o">=</span> <span class="n">length</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
<span class="n">mu_exact</span> <span class="o">=</span> <span class="n">sum</span><span class="p">(</span><span class="n">w</span><span class="o">.*</span><span class="n">X</span><span class="p">)</span> <span class="o">/</span> <span class="n">sum</span><span class="p">(</span><span class="n">w</span><span class="p">)</span>
<span class="n">V_exact</span> <span class="o">=</span> <span class="n">sum</span><span class="p">(</span><span class="n">w</span> <span class="o">.*</span> <span class="p">((</span><span class="n">X</span> <span class="o">.-</span> <span class="n">mu_exact</span><span class="p">)</span><span class="o">.^</span><span class="mi">2</span><span class="p">))</span> <span class="o">/</span> <span class="p">(((</span><span class="n">n</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span><span class="o">/</span><span class="n">n</span><span class="p">)</span> <span class="o">*</span> <span class="n">sum</span><span class="p">(</span><span class="n">w</span><span class="p">))</span>
<span class="n">mvar</span> <span class="o">=</span> <span class="n">MeanVarianceAccumulator</span><span class="p">()</span>
<span class="k">for</span> <span class="n">i</span><span class="o">=</span><span class="mi">1</span><span class="p">:</span><span class="n">n</span>
<span class="n">observe!</span><span class="p">(</span><span class="n">mvar</span><span class="p">,</span> <span class="n">X</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">w</span><span class="p">[</span><span class="n">i</span><span class="p">])</span>
<span class="k">end</span>
<span class="n">mean</span><span class="p">(</span><span class="n">mvar</span><span class="p">),</span> <span class="n">mu_exact</span><span class="p">,</span> <span class="n">var</span><span class="p">(</span><span class="n">mvar</span><span class="p">),</span> <span class="n">V_exact</span>
</pre></div>
<p>This gives the correct output (running mean, exact mean, running variance, exact variance):</p>
<div class="highlight"><pre><span></span><span class="p">(</span><span class="mf">0.8331250000000003</span><span class="p">,</span><span class="mf">0.8331249999999999</span><span class="p">,</span><span class="mf">13.826563476562498</span><span class="p">,</span><span class="mf">13.8265634765625</span><span class="p">)</span>
</pre></div>
<p>Alternative algorithms and variants for higher-order moments can be found on
the excellent <a href="http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance">Wikipedia page on the
topic</a>.</p>
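<p>To see why such incremental and two-pass algorithms are worth the trouble, here is a small illustration (with made-up data) of how the textbook "sum of squares" shortcut fails in floating point: when the mean is large relative to the spread, the two sums agree in all their leading digits and their difference loses essentially every significant digit.</p>

```julia
# Textbook one-pass formula: var = (sum(x.^2) - sum(x)^2/n) / (n-1).
# With a large offset the subtraction suffers catastrophic cancellation
# in Float64, while the two-pass computation remains accurate.
X = 1e9 .+ [4.0, 7.0, 13.0, 16.0]    # exact variance is 30.0
n = length(X)
naive    = (sum(X.^2) - sum(X)^2/n) / (n - 1)   # cancellation: garbage result
two_pass = sum((X .- sum(X)/n).^2) / (n - 1)    # stable: close to 30.0
```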
<p><strong>Addendum</strong> (October 2015): A recent paper, <a href="http://arxiv.org/abs/1510.04923">(Meng,
2015)</a>, contains a variant of the above
algorithm for the unweighted case that computes the first four central moments
in a numerically stable manner. Meng provides a simple implementation
requiring only 24 floating point operations per observation.</p>
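<p>For reference, the analogous unweighted recurrences for the first four central moments can be sketched in the same style as the accumulator above. This uses the standard one-pass higher-order update formulas (as on the Wikipedia page linked above), not Meng's exact 24-operation formulation:</p>

```julia
# One-pass accumulator for the first four central moments (unweighted).
# A sketch using the standard higher-order recurrences; names are illustrative.
type MomentAccumulator
    n::Int
    mean::Float64
    M2::Float64   # running sum of squared deviations from the mean
    M3::Float64   # running third central sum
    M4::Float64   # running fourth central sum
end
MomentAccumulator() = MomentAccumulator(0, 0.0, 0.0, 0.0, 0.0)

function observe!(m::MomentAccumulator, x)
    n1 = m.n
    m.n += 1
    delta = x - m.mean
    delta_n = delta / m.n
    delta_n2 = delta_n*delta_n
    term1 = delta*delta_n*n1
    m.mean += delta_n
    m.M4 += term1*delta_n2*(m.n*m.n - 3*m.n + 3) + 6*delta_n2*m.M2 - 4*delta_n*m.M3
    m.M3 += term1*delta_n*(m.n - 2) - 3*delta_n*m.M2
    m.M2 += term1
    nothing
end

variance(m::MomentAccumulator) = m.M2 / (m.n - 1)
skewness(m::MomentAccumulator) = sqrt(m.n) * m.M3 / m.M2^1.5
kurtosis(m::MomentAccumulator) = m.n * m.M4 / (m.M2*m.M2) - 3.0  # excess kurtosis
```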
<p><em>Acknowledgements</em>. I thank Amit Adam for reading a draft and providing
comments that improved clarity.</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = '//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: 'center'," +
" displayIndent: '0em'," +
" showMathMenu: true," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" +
" } " +
"}); ";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>The Beginning2015-01-25T21:00:00+01:00Sebastian Nowozintag:www.nowozin.net,2015-01-25:sebastian/blog/the-beginning.html<p>This is the start of my blog.
It will be a rather technical blog and therefore address a more specialized
audience.</p>
<p>The articles will cover topics in machine learning, statistics, and
perhaps some computer vision.
I plan to publish one article every two weeks, but we will see how that goes.</p>