Uncertainty Quantification for Multimodal Large Language Models with Incoherence-adjusted Semantic Volume
TL;DR Highlight
An uncertainty measurement framework that proactively detects queries where multimodal LLMs are likely to be wrong — without external tools — and auto-routes them to experts or larger models.
Who Should Read
ML engineers deploying multimodal LLMs (image/audio/video) in production and building pipelines to escalate model errors to humans or larger models. Devs operating services where reliability matters, like medical image analysis or multimodal QA.
Core Mechanics
- Measures uncertainty as a single score by combining the semantic variance across multiple generated answers (Semantic Volume) with a consistency signal derived from the model's own token probabilities (Incoherence Score)
- Works using only the MLLM's own embeddings and token probabilities — no external NLI models or reward models needed. Applies to image/audio/video without separate implementation
- The two signals complement different error types: Semantic Volume (U) catches cases where the model is confident in each sample yet the samples are semantically diverse, while Quadratic Entropy (Q, the incoherence score) catches responses to which the model itself assigns low sequence probability
- Applicable to black-box APIs like GPT-4o and Claude 3.5 Haiku using a small open-source model (Llava-v1.5-13b) as a local proxy
- With only k=5 samples it achieves AUROC 0.87 vs. 0.63 for single-sample methods, at ~1000x lower computational overhead than Semantic Entropy
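The two signals above can be sketched on toy data. The Gram-matrix log-determinant form of Semantic Volume follows the document's Algorithm 1 summary; the toy embeddings are synthetic and for illustration only:

```python
import numpy as np

def semantic_volume(Phi, eps=1e-6):
    """U: log-volume spanned by k L2-normalized response embeddings, shape [k, d]."""
    k = Phi.shape[0]
    G = Phi @ Phi.T + eps * np.eye(k)  # regularized Gram matrix [k, k]
    _, logdet = np.linalg.slogdet(G)
    return logdet / (2 * k)

def incoherence(p):
    """Q: mean improbability of the k sampled responses."""
    return float(np.mean(1 - np.asarray(p)))

rng = np.random.default_rng(0)
# nearly identical embeddings (consistent answers) -> Gram matrix close to
# rank 1, so the spanned volume collapses and U is very negative
tight = rng.normal(size=(1, 16)) + 0.01 * rng.normal(size=(5, 16))
tight /= np.linalg.norm(tight, axis=1, keepdims=True)
# unrelated embeddings (semantically diverse answers) -> near-orthogonal
# unit vectors, Gram matrix close to identity, so U is close to 0
loose = rng.normal(size=(5, 16))
loose /= np.linalg.norm(loose, axis=1, keepdims=True)
assert semantic_volume(tight) < semantic_volume(loose)
```

High-probability, mutually consistent samples drive both U and Q down; either diverse samples or low-probability samples push the combined score up.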
Evidence
- Average AUROC 0.81 on image-text benchmarks, consistently ahead of runner-up (Eigenscore 0.80), 78.7 vs 77.5 on adversarial dataset AdVQA
- CPC (confidence linearity) average 90.4, over 11% higher than runner-up — uncertainty score is linearly proportional to actual error rate
- ECE (calibration error) average 0.062, lowest of all (Semantic Entropy 0.211, Eigenscore 0.227) — min-max normalization alone makes it usable as an error probability proxy
- On image generation (AnyGPT): Pearson correlation with CLIP score 81.5 vs PUNC (image generation-specific method) 44.0
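The ECE claim above can be checked mechanically with a standard binned ECE on min-max-normalized scores. This is a generic sketch on synthetic data, not the paper's exact evaluation protocol:

```python
import numpy as np

def ece(scores, errors, n_bins=10):
    """Expected Calibration Error of uncertainty scores against binary
    error indicators (1 = the model's answer was wrong)."""
    scores = np.asarray(scores, dtype=float)
    errors = np.asarray(errors, dtype=float)
    # min-max normalize into [0, 1] so scores read as error probabilities
    s = (scores - scores.min()) / (scores.max() - scores.min() + 1e-12)
    bins = np.minimum((s * n_bins).astype(int), n_bins - 1)
    total, e = len(s), 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            # |mean predicted error prob - observed error rate|, bin-weighted
            e += mask.sum() / total * abs(s[mask].mean() - errors[mask].mean())
    return e

# synthetic, perfectly calibrated scores -> ECE near 0
rng = np.random.default_rng(1)
s = rng.uniform(size=20000)
err = (rng.uniform(size=20000) < s).astype(float)
assert ece(s, err) < 0.05
```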
How to Apply
- In MLLM inference pipelines, sample k=5-10 answers, compute the UMPIRE score from the last-layer EOS token embedding (ϕ) and token probabilities (p), and route to a larger model or human when the threshold is exceeded — k=5 is 10x faster than k=50 while still far outperforming single-sample methods
- For black-box APIs like GPT-4o, run a small local open-source MLLM (Llava-7b etc.) as a proxy to compute UMPIRE scores on black-box responses — works without proxy fine-tuning
- Use the adaptive α strategy with a small number of unlabeled instances (1-5% of total) to automatically set α — computed from the median ratio of U and Q, achieves near-optimal results without a labeled dev set
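The adaptive α step can be sketched as follows. The "median ratio of U and Q" wording above leaves the exact ratio underspecified; taking the absolute value of U (which is a log-determinant and typically negative) so that α stays positive is an assumption of this sketch:

```python
import numpy as np

def adaptive_alpha(U_vals, Q_vals, eps=1e-9):
    """Set alpha from a small unlabeled calibration set (1-5% of instances)
    as the median |U|/Q ratio. Using |U| to keep alpha positive is an
    assumption; only (U, Q) pairs are needed, no labels."""
    ratios = np.abs(np.asarray(U_vals)) / (np.asarray(Q_vals) + eps)
    return float(np.median(ratios))

def umpire_score(U, Q, alpha):
    """Incoherence-adjusted semantic volume."""
    return U + alpha * Q

# hypothetical (U, Q) pairs from a few unlabeled calibration queries
U_cal = [-0.8, -1.2, -0.5]
Q_cal = [0.3, 0.4, 0.2]
alpha = adaptive_alpha(U_cal, Q_cal)  # median of [2.67, 3.0, 2.5] -> ~2.67
```

This keeps the two terms on a comparable scale without a labeled dev set: α is fixed once from the calibration batch, then reused for all subsequent queries.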
Code Example
import torch
import numpy as np

def compute_umpire(model, tokenizer, query, k=10, eps=1e-6, alpha=None):
    """
    UMPIRE uncertainty score computation (Algorithm 1)
    model: HuggingFace MLLM (LLaVA, etc.)
    query: preprocessed multimodal input (image+text, etc.)
    k: number of sampled responses
    """
    embeddings = []  # will hold k vectors of dimension d
    probs = []       # will hold k sequence probabilities
    for _ in range(k):
        with torch.no_grad():
            # stochastic sampling with temperature=1
            output = model.generate(
                **query,
                do_sample=True,
                temperature=1.0,
                top_p=0.9,
                output_hidden_states=True,
                output_scores=True,
                return_dict_in_generate=True,
            )
        # last-layer hidden state of the final (EOS) token, L2-normalized:
        # hidden_states is indexed [generation step][layer], each [batch, seq, d]
        last_hidden = output.hidden_states[-1][-1][0, -1, :]
        phi = last_hidden / last_hidden.norm()
        embeddings.append(phi.float().cpu().numpy())
        # joint log-probability of the generated sequence
        transition_scores = model.compute_transition_scores(
            output.sequences, output.scores, normalize_logits=True
        )
        log_prob = transition_scores[0].sum().item()
        probs.append(np.exp(log_prob))
    Phi = np.stack(embeddings)  # [k, d]
    p = np.array(probs)         # [k]
    # Semantic Volume (U): log-determinant of the regularized Gram matrix
    G = Phi @ Phi.T + eps * np.eye(k)  # Gram matrix [k, k]
    sign, logdet = np.linalg.slogdet(G)
    U = logdet / (2 * k)
    # Quadratic Entropy (Q) = incoherence score
    Q = np.mean(1 - p)
    # automatic alpha setting (adaptive): ratio of U and Q
    if alpha is None:
        alpha = U / (Q + 1e-9)  # in practice, use the batch median ratio
    # UMPIRE score: incoherence-adjusted semantic volume
    V = U + alpha * Q
    return V
# usage example: high V means uncertain -> escalation
# threshold = 0.5 # set after min-max normalization with unlabeled set
# if compute_umpire(model, tokenizer, query) > threshold:
#     route_to_human_or_larger_model(query)
Terminology
Related Resources
Original Abstract
Despite their capabilities, Multimodal Large Language Models (MLLMs) may produce plausible but erroneous outputs, hindering reliable deployment. Accurate uncertainty metrics could enable escalation of unreliable queries to human experts or larger models for improved performance. However, existing uncertainty metrics have practical constraints, such as being designed only for specific modalities, reliant on external tools, or computationally expensive. We introduce UMPIRE, a training-free uncertainty quantification framework for MLLMs that works efficiently across various input and output modalities without external tools, relying only on the models' own internal modality features. UMPIRE computes the incoherence-adjusted semantic volume of sampled MLLM responses for a given task instance, effectively capturing both the global semantic diversity of samples and the local incoherence of responses based on internal model confidence. We propose uncertainty desiderata for MLLMs and provide theoretical analysis motivating UMPIRE's design. Extensive experiments show that UMPIRE consistently outperforms baseline metrics in error detection and uncertainty calibration across image, audio, and video-text benchmarks, including adversarial and out-of-distribution settings. We also demonstrate UMPIRE's generalization to non-text output tasks, including image and audio generation.