The Cost of Reasoning: Chain-of-Thought Induces Overconfidence in Vision-Language Models
TL;DR Highlight
A paper analyzing how CoT reasoning improves accuracy but breaks the model's uncertainty estimation, making it confidently wrong
Who Should Read
ML engineers deploying VLMs in high-stakes environments like medical image analysis or autonomous driving. Developers implementing wrong-prediction detection logic while using CoT or Thinking models.
Core Mechanics
- CoT (Chain-of-Thought) prompting improves accuracy but significantly degrades token-probability-based uncertainty estimation (MSP, Perplexity, MTE) ranking quality
- 'Implicit Answer Conditioning' phenomenon: as the reasoning trace converges toward a conclusion, the probability of the final answer token is artificially inflated, even when that answer is wrong
- The effect is stronger in models with built-in reasoning like Qwen3-VL-8B-Thinking. MTE PRR plummets from 0.413 to -0.084 on MathVista
- Consistency (sampling multiple times and using majority vote for uncertainty) remains stable under CoT and even improves in some cases
- Masking the final answer word in reasoning traces with [MASK] recovers MSP PRR from 0.141 to 0.475, a more than 3x improvement with no change in accuracy
- Reasoning length itself isn't the cause: longer traces actually correlate with lower confidence
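The token-probability metrics named above (MSP, Perplexity, MTE) are all computed from the per-token distributions of the generated answer, which is exactly what implicit answer conditioning inflates. A minimal sketch of standard formulations (the paper's exact definitions may differ; `answer_uncertainty` and its toy inputs are illustrative):

```python
import math

def answer_uncertainty(dists, chosen):
    """Token-probability uncertainty scores for a generated answer.

    dists:  per-step probability distributions over the vocabulary
            (plain lists of floats summing to 1; illustrative only)
    chosen: index of the token actually generated at each step
    """
    logps = [math.log(d[c]) for d, c in zip(dists, chosen)]
    msp = math.exp(sum(logps))                       # joint prob. of the answer tokens
    perplexity = math.exp(-sum(logps) / len(logps))  # higher = more uncertain
    # Mean Token Entropy: average entropy of each step's full distribution
    mte = -sum(sum(p * math.log(p) for p in d if p > 0) for d in dists) / len(dists)
    return msp, perplexity, mte

# Two fully uniform steps over a 2-token vocabulary:
msp, ppl, mte = answer_uncertainty([[0.5, 0.5], [0.5, 0.5]], [0, 1])
# msp = 0.25, perplexity = 2.0, mte = ln 2 ≈ 0.693
```

Answer conditioning pushes the chosen-token probabilities toward 1 regardless of correctness, which drives MSP up and Perplexity and MTE down, hence the degraded rankings.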
Evidence
- Qwen3-VL-8B-Thinking: MTE PRR drops from 0.735 to 0.086 on Oxford Pets, from 0.413 to -0.084 on MathVista
- OK-VQA with answer masking: MSP PRR recovers from 0.141 to 0.475 (3.4x), Spearman correlation increases by 0.17-0.28
- Consistency on MathVista: PRR improves from 0.345 to 0.684 with CoT, and to 0.767 with the Thinking model
- Repeated reasoning experiments: after round 1, answer token log-probability converges to near 0 (i.e., probability near 1, complete certainty) but accuracy doesn't increase
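PRR (Prediction Rejection Ratio) measures how well an uncertainty score ranks wrong answers: 1.0 means rejecting the most-uncertain samples first removes errors as fast as an oracle, 0 means no better than random, and negative values (as in the MTE numbers above) mean the score is anti-correlated with correctness. One common formulation, sketched here (the paper's exact variant may differ):

```python
def prr(uncertainty, errors):
    """Prediction Rejection Ratio.

    uncertainty: one score per sample (higher = more uncertain)
    errors:      1 if the sample was answered wrongly, 0 if correctly
    (Assumes at least one error; the oracle area is 0 otherwise.)
    """
    n = len(errors)
    base_risk = sum(errors) / n

    def rejection_area(order):
        # Keep the m "most trustworthy" samples for m = 1..n and accumulate
        # how much the error rate among kept samples beats the base rate.
        area, wrong_kept = 0.0, 0
        for m, i in enumerate(order, start=1):
            wrong_kept += errors[i]
            area += base_risk - wrong_kept / m
        return area / n

    metric_order = sorted(range(n), key=lambda i: uncertainty[i])  # least uncertain first
    oracle_order = sorted(range(n), key=lambda i: errors[i])       # correct first
    return rejection_area(metric_order) / rejection_area(oracle_order)

# Uncertainty perfectly ranks the two wrong answers last -> PRR = 1.0
print(prr([0.1, 0.2, 0.9, 0.8], [0, 0, 1, 1]))
```

Reversing the uncertainty scores in the example yields PRR = -1.0, the fully anti-correlated case the Thinking-model results approach.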
How to Apply
- In VLM pipelines using CoT or Thinking models where you're implementing uncertainty-based abstention (answer refusal), replace MSP/Perplexity with Consistency (K=10 sample majority vote) as your uncertainty metric
- If you must use token-probability-based confidence, treat the number of times the final answer string appears in the reasoning trace as a warning signal, or add a post-processing step that masks those occurrences and re-scores the answer
- When upgrading to Thinking models in safety-critical systems (medical/autonomous driving), don't be fooled by accuracy improvements alone: re-evaluate your uncertainty estimation approach as well
Code Example
```python
# Consistency-based uncertainty estimation (majority vote over K samples)
import re
from collections import Counter

def get_consistency_score(model, prompt, image, K=10):
    """
    Use Consistency as the uncertainty metric instead of token probabilities
    in CoT settings. High consistency = low uncertainty = trustworthy answer.
    """
    answers = []
    for _ in range(K):
        # Stochastic sampling with temperature > 0
        output = model.generate(
            prompt=prompt,
            image=image,
            temperature=1.0,
            do_sample=True,
        )
        # Extract only the final answer from the <answer> tag,
        # excluding the reasoning trace
        match = re.search(r'<answer>(.*?)</answer>', output, re.DOTALL)
        if match:
            answers.append(match.group(1).strip().lower())
    if not answers:
        return 0.0, None
    # Majority-vote answer and its agreement rate
    majority_answer = Counter(answers).most_common(1)[0][0]
    consistency = sum(1 for a in answers if a == majority_answer) / K
    # Abstain if consistency is low (e.g., below 0.5)
    return consistency, majority_answer

# Usage example
score, answer = get_consistency_score(model, prompt, image, K=10)
if score < 0.5:
    print(f"Uncertain (consistency={score:.2f}); abstaining")
else:
    print(f"Answer: {answer} (confidence={score:.2f})")
```
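The mask-and-re-score idea from "How to Apply" can be prototyped with a simple string pass: count how often the final answer leaks into the reasoning trace, and produce a masked trace for a second scoring pass through the model. The re-scoring call itself is model-specific and omitted; `mask_answer_in_trace` is an illustrative helper, not the paper's implementation:

```python
import re

def mask_answer_in_trace(trace, answer, mask="[MASK]"):
    """Replace every occurrence of the final answer string in the reasoning
    trace with a mask token. The occurrence count doubles as a warning
    signal for implicit answer conditioning."""
    pattern = re.compile(re.escape(answer), re.IGNORECASE)
    masked, n_occurrences = pattern.subn(mask, trace)
    return masked, n_occurrences

trace = "The object is round, so it must be a circle. Therefore the answer is circle."
masked, n = mask_answer_in_trace(trace, "circle")
# n = 2; re-scoring MSP on the masked trace is what recovered PRR in the ablation
```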
Original Abstract
Vision-language models (VLMs) are increasingly deployed in high-stakes settings where reliable uncertainty quantification (UQ) is as important as predictive accuracy. Extended reasoning via chain-of-thought (CoT) prompting or reasoning-trained models has become ubiquitous in modern VLM pipelines, yet its effect on UQ reliability remains poorly understood. We show that reasoning consistently degrades the quality of most uncertainty estimates, even when it improves task accuracy. We identify implicit answer conditioning as the primary mechanism: as reasoning traces converge on a conclusion before the final answer is generated, token probabilities increasingly reflect consistency with the model's own reasoning trace rather than uncertainty about correctness. In effect, the model becomes overconfident in its answer. In contrast, agreement-based consistency remains robust and often improves under reasoning, making it a practical choice for uncertainty estimation in reasoning-enabled VLMs.