Uncertainty Quantification for Multimodal Large Language Models with Incoherence-adjusted Semantic Volume
TL;DR Highlight
An uncertainty measurement framework that proactively detects queries where multimodal LLMs are likely to be wrong — without external tools — and auto-routes them to experts or larger models.
Who Should Read
ML engineers deploying multimodal LLMs (image/audio/video) in production and building pipelines to escalate model errors to humans or larger models. Devs operating services where reliability matters, like medical image analysis or multimodal QA.
Core Mechanics
- Measures uncertainty as a single score by combining the semantic variance across multiple generated answers (Semantic Volume) with a consistency signal derived from the model's own token probabilities (Incoherence Score)
- Works using only the MLLM's own embeddings and token probabilities — no external NLI models or reward models needed. Applies to image/audio/video without separate implementation
- The two signals complement different error types: Semantic Volume (U) catches cases where the model is confident in each sample yet the samples are semantically diverse, while Quadratic Entropy (Q, the incoherence score) catches responses to which the model itself assigns low sequence probability
- Applicable to black-box APIs like GPT-4o and Claude 3.5 Haiku using a small open-source model (Llava-v1.5-13b) as a local proxy
- With only k=5 samples it achieves AUROC 0.87 vs. 0.63 for single-sample methods, at ~1000x lower computational overhead than Semantic Entropy
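The two signals above can be sketched on toy data. The Gram-matrix log-determinant form of Semantic Volume follows the document's Algorithm 1 summary; the toy embeddings are synthetic and for illustration only:

```python
import numpy as np

def semantic_volume(Phi, eps=1e-6):
    """U: log-volume spanned by k L2-normalized response embeddings, shape [k, d]."""
    k = Phi.shape[0]
    G = Phi @ Phi.T + eps * np.eye(k)  # regularized Gram matrix [k, k]
    _, logdet = np.linalg.slogdet(G)
    return logdet / (2 * k)

def incoherence(p):
    """Q: mean improbability of the k sampled responses."""
    return float(np.mean(1 - np.asarray(p)))

rng = np.random.default_rng(0)
# nearly identical embeddings (consistent answers) -> Gram matrix close to
# rank 1, so the spanned volume collapses and U is very negative
tight = rng.normal(size=(1, 16)) + 0.01 * rng.normal(size=(5, 16))
tight /= np.linalg.norm(tight, axis=1, keepdims=True)
# unrelated embeddings (semantically diverse answers) -> near-orthogonal
# unit vectors, Gram matrix close to identity, so U is close to 0
loose = rng.normal(size=(5, 16))
loose /= np.linalg.norm(loose, axis=1, keepdims=True)
assert semantic_volume(tight) < semantic_volume(loose)
```

High-probability, mutually consistent samples drive both U and Q down; either diverse samples or low-probability samples push the combined score up.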
Evidence
- Average AUROC 0.81 on image-text benchmarks, consistently ahead of runner-up (Eigenscore 0.80), 78.7 vs 77.5 on adversarial dataset AdVQA
- CPC (confidence linearity) average 90.4, over 11% higher than runner-up — uncertainty score is linearly proportional to actual error rate
- ECE (calibration error) average 0.062, lowest of all (Semantic Entropy 0.211, Eigenscore 0.227) — min-max normalization alone makes it usable as an error probability proxy
- On image generation (AnyGPT): Pearson correlation with CLIP score 81.5 vs PUNC (image generation-specific method) 44.0
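The ECE claim above can be checked mechanically with a standard binned ECE on min-max-normalized scores. This is a generic sketch on synthetic data, not the paper's exact evaluation protocol:

```python
import numpy as np

def ece(scores, errors, n_bins=10):
    """Expected Calibration Error of uncertainty scores against binary
    error indicators (1 = the model's answer was wrong)."""
    scores = np.asarray(scores, dtype=float)
    errors = np.asarray(errors, dtype=float)
    # min-max normalize into [0, 1] so scores read as error probabilities
    s = (scores - scores.min()) / (scores.max() - scores.min() + 1e-12)
    bins = np.minimum((s * n_bins).astype(int), n_bins - 1)
    total, e = len(s), 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            # |mean predicted error prob - observed error rate|, bin-weighted
            e += mask.sum() / total * abs(s[mask].mean() - errors[mask].mean())
    return e

# synthetic, perfectly calibrated scores -> ECE near 0
rng = np.random.default_rng(1)
s = rng.uniform(size=20000)
err = (rng.uniform(size=20000) < s).astype(float)
assert ece(s, err) < 0.05
```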
How to Apply
- In MLLM inference pipelines, sample k=5-10 answers, compute the UMPIRE score from the last-layer EOS token embedding (ϕ) and token probabilities (p), and route to a larger model or human when the threshold is exceeded — k=5 is 10x faster than k=50 while still far outperforming single-sample methods
- For black-box APIs like GPT-4o, run a small local open-source MLLM (Llava-7b etc.) as a proxy to compute UMPIRE scores on black-box responses — works without proxy fine-tuning
- Use the adaptive α strategy with a small number of unlabeled instances (1-5% of total) to automatically set α — computed from the median ratio of U and Q, achieves near-optimal results without a labeled dev set
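The adaptive α step can be sketched as follows. The "median ratio of U and Q" wording above leaves the exact ratio underspecified; taking the absolute value of U (which is a log-determinant and typically negative) so that α stays positive is an assumption of this sketch:

```python
import numpy as np

def adaptive_alpha(U_vals, Q_vals, eps=1e-9):
    """Set alpha from a small unlabeled calibration set (1-5% of instances)
    as the median |U|/Q ratio. Using |U| to keep alpha positive is an
    assumption; only (U, Q) pairs are needed, no labels."""
    ratios = np.abs(np.asarray(U_vals)) / (np.asarray(Q_vals) + eps)
    return float(np.median(ratios))

def umpire_score(U, Q, alpha):
    """Incoherence-adjusted semantic volume."""
    return U + alpha * Q

# hypothetical (U, Q) pairs from a few unlabeled calibration queries
U_cal = [-0.8, -1.2, -0.5]
Q_cal = [0.3, 0.4, 0.2]
alpha = adaptive_alpha(U_cal, Q_cal)  # median of [2.67, 3.0, 2.5] -> ~2.67
```

This keeps the two terms on a comparable scale without a labeled dev set: α is fixed once from the calibration batch, then reused for all subsequent queries.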
Code Example
import torch
import numpy as np

def compute_umpire(model, tokenizer, query, k=10, eps=1e-6, alpha=None):
    """
    UMPIRE uncertainty score computation (Algorithm 1)
    model: HuggingFace MLLM (LLaVA, etc.)
    query: preprocessed multimodal input (image+text, etc.)
    k: number of sampled responses
    """
    embeddings = []  # will hold k vectors of dimension d
    probs = []       # will hold k sequence probabilities
    for _ in range(k):
        with torch.no_grad():
            # stochastic sampling with temperature=1
            output = model.generate(
                **query,
                do_sample=True,
                temperature=1.0,
                top_p=0.9,
                output_hidden_states=True,
                output_scores=True,
                return_dict_in_generate=True,
            )
        # last-layer hidden state of the final (EOS) token, L2-normalized:
        # hidden_states is indexed [generation step][layer], each [batch, seq, d]
        last_hidden = output.hidden_states[-1][-1][0, -1, :]
        phi = last_hidden / last_hidden.norm()
        embeddings.append(phi.float().cpu().numpy())
        # joint log-probability of the generated sequence
        transition_scores = model.compute_transition_scores(
            output.sequences, output.scores, normalize_logits=True
        )
        log_prob = transition_scores[0].sum().item()
        probs.append(np.exp(log_prob))
    Phi = np.stack(embeddings)  # [k, d]
    p = np.array(probs)         # [k]
    # Semantic Volume (U): log-determinant of the regularized Gram matrix
    G = Phi @ Phi.T + eps * np.eye(k)  # Gram matrix [k, k]
    sign, logdet = np.linalg.slogdet(G)
    U = logdet / (2 * k)
    # Quadratic Entropy (Q) = incoherence score
    Q = np.mean(1 - p)
    # automatic alpha setting (adaptive): ratio of U and Q
    if alpha is None:
        alpha = U / (Q + 1e-9)  # in practice, use the batch median ratio
    # UMPIRE score: incoherence-adjusted semantic volume
    V = U + alpha * Q
    return V
# usage example: high V means uncertain -> escalation
# threshold = 0.5 # set after min-max normalization with unlabeled set
# if compute_umpire(model, tokenizer, query) > threshold:
#     route_to_human_or_larger_model(query)
Terminology
Related Resources
Original Abstract
Despite their capabilities, Multimodal Large Language Models (MLLMs) may produce plausible but erroneous outputs, hindering reliable deployment. Accurate uncertainty metrics could enable escalation of unreliable queries to human experts or larger models for improved performance. However, existing uncertainty metrics have practical constraints, such as being designed only for specific modalities, reliant on external tools, or computationally expensive. We introduce UMPIRE, a training-free uncertainty quantification framework for MLLMs that works efficiently across various input and output modalities without external tools, relying only on the models' own internal modality features. UMPIRE computes the incoherence-adjusted semantic volume of sampled MLLM responses for a given task instance, effectively capturing both the global semantic diversity of samples and the local incoherence of responses based on internal model confidence. We propose uncertainty desiderata for MLLMs and provide theoretical analysis motivating UMPIRE's design. Extensive experiments show that UMPIRE consistently outperforms baseline metrics in error detection and uncertainty calibration across image, audio, and video-text benchmarks, including adversarial and out-of-distribution settings. We also demonstrate UMPIRE's generalization to non-text output tasks, including image and audio generation.