The Cost of Reasoning: Chain-of-Thought Induces Overconfidence in Vision-Language Models
TL;DR Highlight
A paper analyzing how CoT reasoning improves accuracy but breaks the model's uncertainty estimation, making it confidently wrong
Who Should Read
ML engineers deploying VLMs in high-stakes environments like medical image analysis or autonomous driving. Developers implementing wrong-prediction detection logic while using CoT or Thinking models.
Core Mechanics
- CoT (Chain-of-Thought) prompting improves accuracy but significantly degrades token-probability-based uncertainty estimation (MSP, Perplexity, MTE) ranking quality
- 'Implicit Answer Conditioning' phenomenon: as the reasoning trace converges toward a conclusion, the probability of the final answer token is artificially inflated, even when that answer is wrong
- The effect is stronger in models with built-in reasoning like Qwen3-VL-8B-Thinking. MTE PRR plummets from 0.413 to -0.084 on MathVista
- Consistency (sampling multiple times and using majority vote for uncertainty) remains stable under CoT and even improves in some cases
- Masking the final answer word in reasoning traces with [MASK] recovers MSP PRR from 0.141 to 0.475, a more than 3x improvement with no change in accuracy
- Reasoning length itself isn't the cause: longer traces actually correlate with lower confidence
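The token-probability metrics named above (MSP, Perplexity, MTE) are all computed from the per-token distributions of the generated answer, which is exactly what implicit answer conditioning inflates. A minimal sketch of standard formulations (the paper's exact definitions may differ; `answer_uncertainty` and its toy inputs are illustrative):

```python
import math

def answer_uncertainty(dists, chosen):
    """Token-probability uncertainty scores for a generated answer.

    dists:  per-step probability distributions over the vocabulary
            (plain lists of floats summing to 1; illustrative only)
    chosen: index of the token actually generated at each step
    """
    logps = [math.log(d[c]) for d, c in zip(dists, chosen)]
    msp = math.exp(sum(logps))                       # joint prob. of the answer tokens
    perplexity = math.exp(-sum(logps) / len(logps))  # higher = more uncertain
    # Mean Token Entropy: average entropy of each step's full distribution
    mte = -sum(sum(p * math.log(p) for p in d if p > 0) for d in dists) / len(dists)
    return msp, perplexity, mte

# Two fully uniform steps over a 2-token vocabulary:
msp, ppl, mte = answer_uncertainty([[0.5, 0.5], [0.5, 0.5]], [0, 1])
# msp = 0.25, perplexity = 2.0, mte = ln 2 ≈ 0.693
```

Answer conditioning pushes the chosen-token probabilities toward 1 regardless of correctness, which drives MSP up and Perplexity and MTE down, hence the degraded rankings.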
Evidence
- Qwen3-VL-8B-Thinking: MTE PRR drops from 0.735 to 0.086 on Oxford Pets, from 0.413 to -0.084 on MathVista
- OK-VQA with answer masking: MSP PRR recovers from 0.141 to 0.475 (3.4x), Spearman correlation increases by 0.17-0.28
- Consistency on MathVista: PRR improves from 0.345 to 0.684 with CoT, and to 0.767 with the Thinking model
- Repeated reasoning experiments: after round 1, answer token log-probability converges to near 0 (i.e., probability near 1, complete certainty) but accuracy doesn't increase
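PRR (Prediction Rejection Ratio) measures how well an uncertainty score ranks wrong answers: 1.0 means rejecting the most-uncertain samples first removes errors as fast as an oracle, 0 means no better than random, and negative values (as in the MTE numbers above) mean the score is anti-correlated with correctness. One common formulation, sketched here (the paper's exact variant may differ):

```python
def prr(uncertainty, errors):
    """Prediction Rejection Ratio.

    uncertainty: one score per sample (higher = more uncertain)
    errors:      1 if the sample was answered wrongly, 0 if correctly
    (Assumes at least one error; the oracle area is 0 otherwise.)
    """
    n = len(errors)
    base_risk = sum(errors) / n

    def rejection_area(order):
        # Keep the m "most trustworthy" samples for m = 1..n and accumulate
        # how much the error rate among kept samples beats the base rate.
        area, wrong_kept = 0.0, 0
        for m, i in enumerate(order, start=1):
            wrong_kept += errors[i]
            area += base_risk - wrong_kept / m
        return area / n

    metric_order = sorted(range(n), key=lambda i: uncertainty[i])  # least uncertain first
    oracle_order = sorted(range(n), key=lambda i: errors[i])       # correct first
    return rejection_area(metric_order) / rejection_area(oracle_order)

# Uncertainty perfectly ranks the two wrong answers last -> PRR = 1.0
print(prr([0.1, 0.2, 0.9, 0.8], [0, 0, 1, 1]))
```

Reversing the uncertainty scores in the example yields PRR = -1.0, the fully anti-correlated case the Thinking-model results approach.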
How to Apply
- In VLM pipelines using CoT or Thinking models where you're implementing uncertainty-based abstention (answer refusal), replace MSP/Perplexity with Consistency (K=10 sample majority vote) as your uncertainty metric
- If you must use token-probability-based confidence, treat the number of times the final answer string appears in the reasoning trace as a warning signal, or add a post-processing step that masks those occurrences and re-scores the answer
- When upgrading to Thinking models in safety-critical systems (medical/autonomous driving), don't be fooled by accuracy improvements alone: re-evaluate your uncertainty estimation approach as well
Code Example
```python
# Consistency-based uncertainty estimation (majority vote over K samples)
import re
from collections import Counter

def get_consistency_score(model, prompt, image, K=10):
    """
    Use Consistency as the uncertainty metric instead of token probabilities
    in CoT settings. High consistency = low uncertainty = trustworthy answer.
    """
    answers = []
    for _ in range(K):
        # Stochastic sampling with temperature > 0
        output = model.generate(
            prompt=prompt,
            image=image,
            temperature=1.0,
            do_sample=True,
        )
        # Extract only the final answer from the <answer> tag,
        # excluding the reasoning trace
        match = re.search(r'<answer>(.*?)</answer>', output, re.DOTALL)
        if match:
            answers.append(match.group(1).strip().lower())
    if not answers:
        return 0.0, None
    # Majority-vote answer and its agreement rate
    majority_answer = Counter(answers).most_common(1)[0][0]
    consistency = sum(1 for a in answers if a == majority_answer) / K
    # Abstain if consistency is low (e.g., below 0.5)
    return consistency, majority_answer

# Usage example
score, answer = get_consistency_score(model, prompt, image, K=10)
if score < 0.5:
    print(f"Uncertain (consistency={score:.2f}); abstaining")
else:
    print(f"Answer: {answer} (confidence={score:.2f})")
```
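The mask-and-re-score idea from "How to Apply" can be prototyped with a simple string pass: count how often the final answer leaks into the reasoning trace, and produce a masked trace for a second scoring pass through the model. The re-scoring call itself is model-specific and omitted; `mask_answer_in_trace` is an illustrative helper, not the paper's implementation:

```python
import re

def mask_answer_in_trace(trace, answer, mask="[MASK]"):
    """Replace every occurrence of the final answer string in the reasoning
    trace with a mask token. The occurrence count doubles as a warning
    signal for implicit answer conditioning."""
    pattern = re.compile(re.escape(answer), re.IGNORECASE)
    masked, n_occurrences = pattern.subn(mask, trace)
    return masked, n_occurrences

trace = "The object is round, so it must be a circle. Therefore the answer is circle."
masked, n = mask_answer_in_trace(trace, "circle")
# n = 2; re-scoring MSP on the masked trace is what recovered PRR in the ablation
```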
Original Abstract
Vision-language models (VLMs) are increasingly deployed in high-stakes settings where reliable uncertainty quantification (UQ) is as important as predictive accuracy. Extended reasoning via chain-of-thought (CoT) prompting or reasoning-trained models has become ubiquitous in modern VLM pipelines, yet its effect on UQ reliability remains poorly understood. We show that reasoning consistently degrades the quality of most uncertainty estimates, even when it improves task accuracy. We identify implicit answer conditioning as the primary mechanism: as reasoning traces converge on a conclusion before the final answer is generated, token probabilities increasingly reflect consistency with the model's own reasoning trace rather than uncertainty about correctness. In effect, the model becomes overconfident in its answer. In contrast, agreement-based consistency remains robust and often improves under reasoning, making it a practical choice for uncertainty estimation in reasoning-enabled VLMs.