Is your algorithm confident enough? How to measure uncertainty in neural networks

When machine learning techniques are used in “mission critical” applications, the acceptable margin of error becomes significantly lower.

Imagine that your model is driving a car, assisting a doctor or even just interacting directly with a (perhaps easily annoyed) end user. In these cases, you’ll want to ensure that you can be confident in the predictions your model makes before acting on them.

Measuring prediction uncertainty grows more important by the day, as fuzzy systems become an increasing part of our fuzzy lives.

Here’s the good news: There are several techniques for measuring uncertainty in neural networks, and some of them are very easy to implement! First, let’s get a feel for what we’re about to measure.

Photo credit: Juan Rumimpunu.

Putting a number on uncertainty.

When you make models of the world, your models cannot always provide accurate answers.

This is partly due to the fact that models are simplifications of a seriously complicated world. Since some information is unknown, the predictions from your model are subject to some degree of uncertainty.

Parts of our world (and the ways we measure it) are simply chaotic. Some things happen randomly, and this randomness is also a source of uncertainty in your model’s predictions.

Prediction uncertainty can be divided into 3 categories:

1. Model uncertainty.

Model uncertainty comes from “ignorance” of the problem. That is, model uncertainty quantifies the things that could be correctly captured by the model but aren’t.

Yoel and Inbar from Taboola provide a fun example:

You want to build a model that gets a picture of an animal, and predicts if that animal will try to eat you. You trained the model on pictures of lions and giraffes. Now you show it a zombie. Since the model wasn’t trained on pictures of zombies, the uncertainty will be high. If trained on enough pictures of zombies, this uncertainty will decrease.

Model uncertainty is sometimes also referred to as epistemic or structural uncertainty. Measuring it is considered a particularly challenging area of statistics. One reason is that principled techniques like Bayesian model averaging become very costly as models grow more complex.
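To get a feel for why this gets expensive, here is the quantity that Bayesian model averaging tries to compute. The notation is an assumption made for illustration: w stands for the model weights, D for the training data, and x* and y* for a new input and its prediction.

```latex
p(y^* \mid x^*, \mathcal{D}) = \int p(y^* \mid x^*, w)\, p(w \mid \mathcal{D})\, dw
```

The integral runs over every possible setting of the weights. For a network with millions of parameters it cannot be computed exactly, so it has to be approximated.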

2. Model misspecification.

If your model produces good predictions during training and validation but not during evaluation (or in production), it might be misspecified.

Model misspecification uncertainty captures scenarios where your model is making predictions on new data with very different patterns from the training data.
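Here is a toy illustration of the idea, not taken from any real project. The hypothetical helpers below fit a straight line to a curved relationship on a narrow input range and then evaluate it on inputs far outside that range. All names and numbers are assumptions chosen for the sketch.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def mean_abs_error(line, xs, ys):
    """Average absolute difference between the line and the true values."""
    slope, intercept = line
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)


def true_process(x):
    return x ** 2  # the real relationship is curved, the model is a line


train_x = [i / 100 for i in range(101)]        # inputs in [0, 1]
shifted_x = [3 + i / 100 for i in range(101)]  # inputs in [3, 4]
train_y = [true_process(x) for x in train_x]
shifted_y = [true_process(x) for x in shifted_x]

line = fit_line(train_x, train_y)
train_error = mean_abs_error(line, train_x, train_y)
shifted_error = mean_abs_error(line, shifted_x, shifted_y)
print(f"error on training range: {train_error:.3f}")
print(f"error on shifted range:  {shifted_error:.3f}")
```

The line looks like a decent model as long as the inputs resemble the training data, and falls apart once they don’t.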

3. Inherent noise.

This is uncertainty produced by noise present in the dataset. It could be attributed to imperfect measurement techniques or an inherent randomness in the thing being measured.

Imagine your dataset contains 2 images of cards facing down. You’re feeling optimistic and you want to build a model to predict the suit and value of each card. The first card is labeled as ace of spades and the other is labeled as 8 of hearts. Here, the exact same features (an image of a card facing down) can be linked to different predictions (either ace of spades or 8 of hearts). Therefore, this dataset is subject to lots of inherent noise.

Inherent noise is also sometimes called aleatoric or statistical uncertainty. The amount of inherent noise is linked to the Bayes error rate, which is the lowest error rate that any classifier can achieve on a given problem. As you can imagine, the lowest possible error rate that a model can achieve is tightly linked to the amount of error produced by noise in the data itself.
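To make this concrete for the card example, here is a minimal sketch that computes the lowest achievable error on a small labeled dataset. The function name bayes_error_rate and the data layout are assumptions for illustration. Strictly speaking, the Bayes error rate is defined over the underlying data distribution, so this is only its empirical counterpart on a finite dataset with discrete features.

```python
from collections import Counter, defaultdict


def bayes_error_rate(samples):
    """Lowest achievable error on a list of (features, label) pairs.

    For each distinct feature value, the best any classifier can do is
    predict the most common label seen with that value.
    """
    labels_by_features = defaultdict(Counter)
    for features, label in samples:
        labels_by_features[features][label] += 1

    # Every sample that does not carry the majority label for its
    # features is an error no classifier can avoid.
    unavoidable = sum(
        sum(counts.values()) - max(counts.values())
        for counts in labels_by_features.values()
    )
    return unavoidable / len(samples)


# The two face-down cards look identical but carry different labels.
cards = [("face_down", "ace of spades"), ("face_down", "8 of hearts")]
print(bayes_error_rate(cards))  # 0.5
```

No matter how clever the model is, it can get at most one of the two cards right, because the features carry no information that separates them.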

These concepts lean heavily on Bayesian statistics. I have outlined these ideas in a simple way, but that only scratches the surface of these deep topics.

To learn more about uncertainty measures in Bayesian neural networks, I recommend taking a look at this article by Felix Laumann. For an in-depth explanation of Bayesian statistics in the context of data science, Ankit Rathi has written a series of great articles on the subject.

Photo credit: Avi Richards

Implementing uncertainty.

At this point, you may be thinking: “That sounds good, but how do I implement uncertainty in my model?”

Bayesian neural networks integrate uncertainty by default, and they are generally more robust to overfitting and better at handling smaller datasets. However, the toolchain for building Bayesian neural networks is still emerging and the models tend to be more computationally costly, both during training and when making predictions.

Also, migrating your work to a probabilistic model (like a Bayesian neural network) is going to be annoying.

In the long run, probabilistic deep learning will likely become the default. For now though, practical techniques to integrate the probabilistic perspective into our existing work are a good first step!

Monte Carlo dropout.

A couple of years ago, Yarin Gal and Zoubin Ghahramani from the University of Cambridge found a way to approximate model uncertainty without changing the structure or optimization techniques of the neural network.

Here’s the short version: By using dropout before each weight layer at test time and running your predictions for several iterations, you can approximate Bayesian uncertainty. They call this process Monte Carlo dropout:

  1. You feed an input to your model.

  2. You predict for several iterations on that single input, each time disabling small parts of the neural network randomly.

  3. You take the mean output value. This is your prediction. Finally, you measure the variance between iterations. This is the model uncertainty.

Intuitively, I think of it like this: The more your prediction fluctuates with tiny structural changes to the model, the more uncertain that prediction is.
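The steps above can be written down compactly. The notation here is an assumption for illustration: T is the number of iterations and f(x; ŵ_t) is the network output for input x under the t-th random dropout mask.

```latex
\hat{\mu}(x) = \frac{1}{T} \sum_{t=1}^{T} f(x; \hat{w}_t),
\qquad
\hat{\sigma}^2(x) = \frac{1}{T} \sum_{t=1}^{T} \bigl( f(x; \hat{w}_t) - \hat{\mu}(x) \bigr)^2
```

The mean is the prediction and the variance is the approximated model uncertainty. This only captures the spread between iterations as described in this article, not any additional noise term.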

Implementing Monte Carlo dropout is really easy. Here, I start with a simple dense network for the MNIST problem built with Keras. By default, dropout layers are only enabled during training. To enable the dropout layers at test time, pass training=True when calling each dropout layer.

algoritm-5.png
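Here is a minimal sketch of what such a network could look like, assuming TensorFlow’s bundled Keras and the functional API. The layer sizes, the dropout rate and the function name build_mc_dropout_model are illustrative assumptions, not necessarily the values used in the original example.

```python
import tensorflow as tf


def build_mc_dropout_model(dropout_rate=0.5):
    """Dense MNIST classifier whose dropout layers stay active at test time."""
    inputs = tf.keras.Input(shape=(784,))  # flattened 28x28 images

    # Passing training=True when calling a Dropout layer keeps it active
    # during inference as well as during training.
    x = tf.keras.layers.Dropout(dropout_rate)(inputs, training=True)
    x = tf.keras.layers.Dense(512, activation="relu")(x)
    x = tf.keras.layers.Dropout(dropout_rate)(x, training=True)
    x = tf.keras.layers.Dense(512, activation="relu")(x)
    x = tf.keras.layers.Dropout(dropout_rate)(x, training=True)
    outputs = tf.keras.layers.Dense(10, activation="softmax")(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Because training=True is fixed when the layers are called, every call to model.predict() draws a fresh dropout mask, so repeated predictions on the same input will differ slightly.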

Next, we need a custom prediction function which can predict iteratively and return the mean and variance of those iterations. In this example, we measure the standard deviation instead of the variance, because it’s expressed in the same units as the mean.

algorithm-6.png
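A sketch of such a prediction function, assuming NumPy, could look like the following. The name predict_with_uncertainty and its parameters are assumptions for illustration. It accepts any callable that makes one stochastic forward pass, and the toy model below is only a hypothetical stand-in so the example runs without a trained network.

```python
import numpy as np


def predict_with_uncertainty(stochastic_predict, x, n_iter=100):
    """Run n_iter stochastic forward passes and return (mean, std)."""
    # Resulting shape: (n_iter, batch_size, n_outputs)
    samples = np.stack([np.asarray(stochastic_predict(x)) for _ in range(n_iter)])
    # The mean over iterations is the prediction, the standard deviation
    # over iterations is the approximated uncertainty.
    return samples.mean(axis=0), samples.std(axis=0)


# Toy stand-in for a dropout-enabled model (hypothetical, illustration only).
rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 3))


def toy_dropout_model(x, rate=0.5):
    mask = rng.random(x.shape) >= rate          # randomly drop input units
    return (x * mask / (1.0 - rate)) @ weights  # rescale the kept units


x = np.ones((1, 4))
mean, std = predict_with_uncertainty(toy_dropout_model, x, n_iter=50)
print(mean.shape, std.shape)  # (1, 3) (1, 3)
```

With a Keras model whose dropout layers were called with training=True, the first argument could be something like lambda batch: model.predict(batch, verbose=0).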

Now you’re ready to predict with approximated uncertainty:

algoritm-7.png

Photo credit: Tim Gouw

Problems and critique.

As we saw, using Monte Carlo dropout is really easy. Maybe even too easy. The technique was critiqued by Ian Osband from DeepMind, who noted that the predictive uncertainty of a simple model with Monte Carlo dropout did not decrease with more data. This raises the question of whether it is an inaccurate approximation of Bayesian uncertainty or whether there are underlying assumptions that need to be made clearer.

For more on that issue, Sebastian Schöner has written a great blog post summarizing the critique.

In my workplace at Kanda, we have had mixed experiences with the effectiveness of Monte Carlo dropout.

For a simple fully connected model trained on MNIST like my previous example, the uncertainty approximations behaved as expected: When presented with noise instead of a handwritten digit, the approximated uncertainty was higher. We found that 50–100 iterations of Monte Carlo dropout produced satisfactory results.

Later, we had a scenario where we needed to run an image recognition task locally on a smartphone as part of an AR application. We used transfer learning, building a classifier on top of the NASNet Mobile architecture.

First, running 100 iterations of NASNet on a smartphone is not a good idea.

Even with heavy parallelization, we were only realistically able to run ~20 iterations on the device in order to provide a prediction in good time.

Secondly, the uncertainty estimates were inaccurate. When fed a picture of random noise, the uncertainty was surprisingly low. It is worth noting that we only implemented dropout in the densely connected part of the classifier which sat on top of NASNet. If you’d like to share intuitions about what went wrong here, I’d be very interested to read a response from you!

Conclusion.

First, we had a look at why it’s important to quantify uncertainty in machine learning models. Then, I introduced you to 3 different ways of thinking about prediction uncertainty: Model uncertainty, model misspecification and inherent noise.

Monte Carlo dropout is an easy-to-implement technique to approximate Bayesian uncertainty, but there is some disagreement as to whether the approximation is indeed accurate. Practically, I have found Monte Carlo dropout effective for simpler models, but have had some problems with the approach for complex models, both in terms of accuracy and performance.

Integrating Bayesian probability in machine learning will only become more important in the future, and I look forward to seeing more probabilistic techniques become part of the toolchain.

That’s all folks! I hope you enjoyed the article. I certainly feel like I learned something while writing it. Please feel free to leave your feedback and add me on LinkedIn where I (occasionally) post interesting stuff.


Daniel Rothman
Machine Learning Engineer