Estimating model uncertainty
Last updated on 2024-12-19
Estimated time: 40 minutes
Overview
Questions
- What is model uncertainty, and how can it be categorized?
- How do uncertainty estimation methods intersect with OOD detection methods?
- What are the computational challenges of estimating model uncertainty?
- When is uncertainty estimation useful, and what are its limitations?
- Why is OOD detection often preferred over traditional uncertainty estimation techniques in modern applications?
Objectives
- Define and distinguish between aleatoric and epistemic uncertainty in machine learning models.
- Explore common techniques for estimating aleatoric and epistemic uncertainty.
- Understand why OOD detection has become a widely adopted approach in many real-world applications.
- Compare and contrast the goals and computational costs of uncertainty estimation and OOD detection.
- Summarize when and where different uncertainty estimation methods are most useful.
How confident is my model? Will it generalize to new data?
Understanding how confident a model is in its predictions is a valuable tool for building trustworthy AI systems, especially in high-stakes settings like healthcare or autonomous vehicles. Model uncertainty estimation focuses on quantifying the model’s confidence and is often used to identify predictions that require further review or caution.
Sources of uncertainty
At its core, model uncertainty starts with the data itself, as all models learn to form embeddings (feature representations) of the data. Uncertainty in the data—whether from inherent randomness or insufficient coverage—propagates through the model’s embeddings, leading to uncertainty in the outputs.
1) Aleatoric (Random) uncertainty
Aleatoric or random uncertainty is the inherent noise in the data that cannot be reduced, even with more data (whether additional observations or additional features). Examples include:
- Inconsistent readings from faulty sensors (e.g., modern image sensors exhibit “thermal noise” or “shot noise”, where pixel values randomly fluctuate even under constant lighting)
- Random crackling/static in recordings
- Human errors in data entry
- Any aspect of the data that is unpredictable
Methods for addressing aleatoric uncertainty
Since aleatoric/random uncertainty is generally considered inherent (unless you upgrade sensors or otherwise remove the source of the randomness), methods to address it focus on measuring the degree of noise or uncertainty.
- Predictive variance in linear regression: The ability to derive error bars or prediction intervals in traditional regression comes from the assumption that the errors (residuals) follow a normal distribution and are homoskedastic (error variance stays relatively constant across different values of the predictors). A minimal prediction-interval sketch appears after this list.
  - In contrast, deep learning models are highly non-linear and have millions (or billions) of parameters. The mapping between inputs and outputs is not a simple linear equation but a complex, multi-layer function. In addition, deep learning models can overfit common classes and underfit rarer classes. Because of these factors, errors in deep learning applications are rarely normally distributed and homoskedastic.
- Heteroskedastic models: Use specialized loss functions that allow the model to predict the noise level in the data directly. These models are particularly critical in fields like robotics, where sensor noise varies significantly depending on environmental conditions. It is possible to build this functionality into both linear models and modern deep learning models. However, these methods may require some calibration, as ground truth measurements of noise usually aren't available. A heteroskedastic loss sketch appears after this list.
  - Example application: Managing hospital reporting inconsistencies.
  - Reference: Kendall, A., & Gal, Y. (2017). "What uncertainties do we need in Bayesian deep learning for computer vision?"
- Data augmentation and perturbation analysis: Assess variability in predictions by adding noise to the input data and observing how much the model's outputs change. A highly sensitive change in predictions may indicate underlying noise or instability in the data. For instance, in image classification, augmenting training data with synthetic noise can help the model better handle real-world imperfections stemming from sensor artifacts. A perturbation-analysis sketch appears after this list.
  - Example application: Handling motion blur in tumor detection.
  - Reference: Shorten, C., & Khoshgoftaar, T. M. (2019). "A survey on image data augmentation for deep learning."
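To make the predictive-variance idea concrete, here is a minimal sketch using statsmodels (a library choice not prescribed by this lesson) on synthetic data; the 95% prediction intervals it reports are only valid under the normality and homoskedasticity assumptions described above.

```python
# Minimal sketch: prediction intervals from ordinary least squares.
# Synthetic data for illustration; in practice, check residual normality
# and homoskedasticity before trusting these intervals.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=100)
y = 2.0 * X + 1.0 + rng.normal(0, 1.5, size=100)   # linear signal + Gaussian noise

X_design = sm.add_constant(X)                       # add intercept column
model = sm.OLS(y, X_design).fit()

X_new = sm.add_constant(np.linspace(0, 10, 5))
pred = model.get_prediction(X_new)
frame = pred.summary_frame(alpha=0.05)              # 95% intervals
print(frame[["mean", "obs_ci_lower", "obs_ci_upper"]])
```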
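Next, a minimal sketch of a heteroskedastic model, assuming PyTorch and a toy regression network (the architecture and data are illustrative): the network predicts both a mean and a variance, and a Gaussian negative log-likelihood loss lets it learn the noise level directly.

```python
# Minimal sketch of a heteroskedastic regression head in PyTorch.
# The network predicts a mean and a variance for each input and is
# trained with a Gaussian negative log-likelihood loss.
import torch
import torch.nn as nn

class HeteroskedasticNet(nn.Module):
    def __init__(self, in_dim=4, hidden=32):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mean_head = nn.Linear(hidden, 1)
        self.log_var_head = nn.Linear(hidden, 1)   # log-variance for numerical stability

    def forward(self, x):
        h = self.body(x)
        mean = self.mean_head(h)
        var = torch.exp(self.log_var_head(h))      # ensure variance is positive
        return mean, var

model = HeteroskedasticNet()
loss_fn = nn.GaussianNLLLoss()                     # expects (input, target, var)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(64, 4)                             # toy batch
y = torch.randn(64, 1)

optimizer.zero_grad()
mean, var = model(x)
loss = loss_fn(mean, y, var)                       # penalizes both error and miscalibrated noise
loss.backward()
optimizer.step()
```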
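And here is a minimal perturbation-analysis sketch, assuming a scikit-learn regressor on synthetic data (the model and noise scale are illustrative): the inputs are jittered several times and the spread of the resulting predictions flags unstable samples.

```python
# Minimal sketch of perturbation analysis: repeatedly add small Gaussian
# noise to the inputs and measure how much the model's predictions change.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] * 2 + rng.normal(scale=0.1, size=200)     # toy target

model = RandomForestRegressor(random_state=0).fit(X, y)

n_repeats, noise_scale = 20, 0.05
perturbed_preds = np.stack([
    model.predict(X + rng.normal(scale=noise_scale, size=X.shape))
    for _ in range(n_repeats)
])
sensitivity = perturbed_preds.std(axis=0)             # per-sample prediction variability
print("Most unstable samples:", np.argsort(sensitivity)[-5:])
```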
2) Subjectivity and ill-defined problems
- Overlapping classes or ambiguous labels due to subjective interpretations.
- Ambiguous or conflicting text inputs.
Methods for addressing subjectivity and ill-defined problems
- Reframe the problem: If the overlap or subjectivity stems from an ill-posed problem, reframing the task can help. Example: Instead of classifying “happy” vs. “neutral” expressions (which overlap), predict the intensity of happiness on a scale of 0–1. For medical images, shift from hard “benign vs. malignant” classifications to predicting risk scores.
- Consensus-based labeling (inter-annotator agreement): Aggregate labels from multiple annotators to reduce subjectivity and quantify ambiguity. Use metrics like Cohen’s kappa or Fleiss’ kappa to measure agreement between annotators. Example: In medical imaging (e.g., tumor detection), combining expert radiologists’ opinions can reduce subjective bias in labeling.
- Probabilistic labeling or soft targets: Instead of using hard labels (e.g., 0 or 1), assign probabilistic labels to account for ambiguity in the data. Example: If 70% of annotators labeled an image as “happy” and 30% as “neutral,” you can label it as [0.7, 0.3] instead of forcing a binary decision.
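To make the consensus-labeling and soft-target ideas concrete, here is a minimal sketch with made-up annotator votes: it measures inter-annotator agreement with Cohen's kappa (via scikit-learn) and turns raw votes into probabilistic labels.

```python
# Minimal sketch: measure inter-annotator agreement and build soft labels.
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators for 8 images (0 = neutral, 1 = happy)
annotator_a = [1, 1, 0, 1, 0, 1, 1, 0]
annotator_b = [1, 0, 0, 1, 0, 1, 0, 0]
print("Cohen's kappa:", cohen_kappa_score(annotator_a, annotator_b))

# Soft targets: fraction of annotators choosing each class per example
votes = np.stack([annotator_a, annotator_b])            # shape: (n_annotators, n_examples)
p_happy = votes.mean(axis=0)                            # probability of class "happy"
soft_labels = np.stack([1 - p_happy, p_happy], axis=1)  # columns: [neutral, happy]
print(soft_labels)
```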
3) Epistemic uncertainty
Epistemic (ep·i·ste·mic) is an adjective that means, “relating to knowledge or to the degree of its validation.”
Epistemic uncertainty refers to gaps in the model’s knowledge about the data distribution, which can be reduced with more data or improved models. Epistemic uncertainty can arise due to:
- Out-of-distribution (OOD) data:
  - Tabular: Classifying user behavior from a new region not included in training data. Predicting hospital demand during a rare pandemic with limited historical data. Applying a model trained on one location to another.
  - Image: Recognizing a new species in wildlife monitoring. Detecting a rare or unseen obstacle in automated driving. A model trained on high-resolution images but tested on low-resolution inputs.
  - Text: Queries about topics completely outside the model’s domain (e.g., financial queries in a healthcare chatbot). Interpreting slang or idiomatic expressions unseen during training.
- Sparse or insufficient data in feature space:
  - Tabular: High-dimensional data with many missing or sparsely sampled features (e.g., genomic datasets).
  - Image: Limited labeled examples for rare diseases in medical imaging datasets.
  - Text: Rare domain-specific terminology.
Methods for addressing epistemic uncertainty
Epistemic uncertainty arises from the model’s lack of knowledge about certain regions of the data space. Techniques to address this uncertainty include:
- Collect more data: Easier said than done! Focus on gathering data from underrepresented scenarios or regions of the feature space, particularly areas where the model exhibits high uncertainty (e.g., rare medical conditions, edge cases in autonomous driving). This directly reduces epistemic uncertainty by expanding the model’s knowledge base.
- Active learning: Use model uncertainty estimates to prioritize uncertain or ambiguous samples for annotation, enabling more targeted data collection. An entropy-based selection sketch appears after this list.
- Ensemble models: These involve training multiple models on the same data, each starting with different initializations or random seeds. The ensemble’s predictions are aggregated, and the variance in their outputs reflects uncertainty. This approach works well because different models often capture different aspects of the data. For example, if all models agree, the prediction is confident; if they disagree, there is uncertainty. Ensembles are effective but computationally expensive, as they require training and evaluating multiple models. A small ensemble sketch appears after this list.
- Bayesian neural networks: These networks incorporate probabilistic layers to model uncertainty directly in the weights of the network. Instead of assigning a single deterministic weight to each connection, Bayesian neural networks assign distributions to these weights, reflecting the uncertainty about their true values. During inference, these distributions are sampled multiple times to generate predictions, which naturally include uncertainty estimates. While Bayesian neural networks are theoretically rigorous and align well with the goal of epistemic uncertainty estimation, they are computationally expensive and challenging to scale to large datasets or deep architectures, because calculating or approximating posterior distributions over all parameters becomes intractable as model size grows. Approximations such as variational inference or Monte Carlo sampling are often used instead, but they can introduce inaccuracies, making Bayesian approaches less practical for many modern applications. Despite these challenges, Bayesian neural networks remain valuable in research contexts where precise uncertainty quantification is needed, or in domains where computational resources are less of a concern. A Monte Carlo dropout sketch, one cheap approximation, appears after this list.
  - Example application: Detecting rare tumor types in radiology.
  - Reference: Blundell, C., et al. (2015). “Weight uncertainty in neural networks.”
- Out-of-distribution detection: Identifies inputs that fall significantly outside the training distribution, flagging areas where the model’s predictions are unreliable. Many OOD methods produce continuous scores, such as Mahalanobis distance or energy-based scores, which measure how novel or dissimilar an input is from the training data. These scores can be interpreted as a form of epistemic uncertainty, providing insight into how unfamiliar an input is. However, OOD detection focuses on distinguishing ID from OOD inputs rather than offering confidence estimates for predictions on ID inputs.
  - Example application: Flagging out-of-scope queries in chatbot systems.
  - Reference: Hendrycks, D., & Gimpel, K. (2017). “A baseline for detecting misclassified and out-of-distribution examples in neural networks.”
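As a first sketch, here is a minimal uncertainty-based active-learning step, assuming a scikit-learn classifier on synthetic data (the pool sizes and model are illustrative): unlabeled samples are ranked by predictive entropy and the most uncertain ones are queued for annotation.

```python
# Minimal sketch of uncertainty-based active learning: rank unlabeled
# samples by predictive entropy and send the most uncertain ones for labeling.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
labeled, unlabeled = np.arange(100), np.arange(100, 1000)   # small labeled pool

model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])

probs = model.predict_proba(X[unlabeled])
entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)      # predictive entropy per sample

query_idx = unlabeled[np.argsort(entropy)[-10:]]             # 10 most uncertain samples
print("Samples to send to annotators:", query_idx)
```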
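Next, a minimal ensemble sketch, again with an illustrative scikit-learn model and synthetic data: several identically configured networks are trained with different random seeds, and disagreement between their predicted probabilities serves as the uncertainty signal.

```python
# Minimal sketch of an ensemble-based uncertainty estimate using several
# models trained with different random seeds.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Train an ensemble of identically configured models with different seeds
ensemble = [
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=seed).fit(X, y)
    for seed in range(5)
]

# Collect each member's predicted probability for class 1
probs = np.stack([m.predict_proba(X)[:, 1] for m in ensemble])  # (n_models, n_samples)

mean_prob = probs.mean(axis=0)        # ensemble prediction
uncertainty = probs.std(axis=0)       # disagreement between members = uncertainty
print("Most uncertain samples:", np.argsort(uncertainty)[-5:])
```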
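Finally, a minimal Monte Carlo dropout sketch in PyTorch, a cheap approximation to a Bayesian neural network (the architecture and number of samples are illustrative): dropout is kept active at prediction time, and the spread across repeated stochastic forward passes is treated as epistemic uncertainty.

```python
# Minimal sketch of Monte Carlo dropout: keep dropout active at inference
# and treat the spread of repeated stochastic predictions as uncertainty.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 64), nn.ReLU(),
    nn.Dropout(p=0.2),              # dropout layer reused at inference time
    nn.Linear(64, 1),
)

def mc_dropout_predict(model, x, n_samples=50):
    model.train()                   # keep dropout layers stochastic
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.std(dim=0)   # mean prediction, epistemic spread

x = torch.randn(8, 10)              # toy batch of 8 inputs
mean, std = mc_dropout_predict(model, x)
print(std.squeeze())                # higher std = more model uncertainty
```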
Why is OOD detection widely adopted?
Among epistemic uncertainty methods, OOD detection has become a widely adopted approach in real-world applications due to its ability to efficiently identify inputs that fall outside the training data distribution, where predictions are inherently unreliable. Many OOD detection techniques produce continuous scores that quantify the novelty or dissimilarity of inputs, which can be interpreted as a form of uncertainty. This makes OOD detection not only effective at rejecting anomalous inputs but also useful for prioritizing inputs based on their predicted risk.
For example, in autonomous vehicles, OOD detection can help flag unexpected scenarios (e.g., unusual objects on the road) in near real-time, enabling safer decision-making. Similarly, in NLP, OOD methods are used to identify queries or statements that deviate from a model’s training corpus, such as out-of-context questions in a chatbot system. In the next couple of episodes, we’ll see how to implement various OOD strategies.
Identify aleatoric and epistemic uncertainty
For each scenario below, identify the sources of aleatoric and epistemic uncertainty. Provide specific examples based on the context of the application.
- Tabular data example: Hospital resource allocation during seasonal flu outbreaks and pandemics.
- Image data example: Tumor detection in radiology images.
- Text data example: Chatbot intent recognition.
- Hospital resource allocation
  - Aleatoric uncertainty: Variability in seasonal flu demand; inconsistent local reporting.
  - Epistemic uncertainty: Limited data for rare pandemics; incomplete understanding of emerging health crises.
- Tumor detection in radiology images
  - Aleatoric uncertainty: Imaging artifacts such as noise or motion blur.
  - Epistemic uncertainty: Limited labeled data for rare tumor types; novel imaging modalities.
- Chatbot intent recognition
  - Aleatoric uncertainty: Noise in user queries such as typos or speech-to-text errors.
  - Epistemic uncertainty: Lack of training data for queries from out-of-scope domains; ambiguity due to unclear or multi-intent queries.
Summary
Uncertainty estimation is a critical component of building reliable and trustworthy machine learning models, especially in high-stakes applications. By understanding the distinction between aleatoric uncertainty (inherent data noise) and epistemic uncertainty (gaps in the model’s knowledge), practitioners can adopt tailored strategies to improve model robustness and interpretability.
- Aleatoric uncertainty is irreducible noise in the data itself. Addressing this requires models that can predict variability, such as heteroskedastic loss functions, or strategies like data augmentation to make models more resilient to imperfections.
- Epistemic uncertainty arises from the model’s incomplete understanding of the data distribution. It can be mitigated through methods like Monte Carlo dropout, Bayesian neural networks, ensemble models, and out-of-distribution (OOD) detection. Among these methods, OOD detection has become a cornerstone for handling epistemic uncertainty in practical applications. Its ability to flag anomalous or out-of-distribution inputs makes it an essential tool for ensuring model predictions are reliable in real-world scenarios.
- In many cases, collecting more data and employing active learning can directly address the root causes of epistemic uncertainty.
When choosing a method, it’s important to consider the trade-offs in computational cost, model complexity, and the type of uncertainty being addressed. Together, these techniques form a powerful toolbox, enabling models to better navigate uncertainty and maintain trustworthiness in dynamic environments. By combining these approaches strategically, practitioners can ensure that their systems are not only accurate but also robust, interpretable, and adaptable to the challenges of real-world data.