Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs
TL;DR Highlight
Chain-of-Thought reasoning decreases accuracy across 17 models on image-based spatial reasoning tasks.
Who Should Read
ML engineers developing services that analyze spatial relationships (location, direction, distance, etc.) within images using multimodal LLMs, or AI application developers who default to CoT prompting.
Core Mechanics
- CoT (Chain-of-Thought) prompting, despite its well-known gains on math and logic tasks, reduces accuracy by an average of 3% in visual spatial reasoning.
- 7 out of 8 Multimodal Reasoning Models (MRMs) trained with Reinforcement Learning (RL) performed *worse* at spatial reasoning than their base Qwen2.5-VL-7B-Instruct model—expensive training can be counterproductive.
- ViGoRL-7B-Spatial, specifically trained for spatial reasoning, also underperformed its base model (−2%), as did TreeVGR (−1.57%); Vision-G1 (+0.6%) was the sole exception.
- The No-Image++ experiment—replacing images with a gray screen and adding a 'cannot determine from the image' option—showed MRMs confidently selecting incorrect answers based on textual knowledge alone, even fabricating spatial coordinates.
- GPT-5 and GPT-5-nano also show +0.65% and +1.23% higher accuracy with Non-CoT prompting compared to CoT, mirroring the trend observed in open-source models. GPT-4o and GPT-4.1-mini show minimal CoT gains (under 0.5%) that don't justify the added inference cost.
- Models with concise, non-repetitive CoT traces (GPT family, ~350 characters) experience less performance degradation than open-source models with lengthy, looping traces (~3600 characters). Verbose reasoning is suspected to induce hallucinations.
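The "verbose traces hallucinate" observation suggests a cheap pipeline guardrail: flag CoT outputs that are unusually long or that loop. Below is a minimal sketch, assuming you log the raw trace text; the character threshold and the 5-gram repetition heuristic are illustrative choices of ours, not measurements from the paper.

```python
from collections import Counter

def cot_trace_risk(trace: str, max_chars: int = 1000, ngram: int = 5) -> dict:
    """Flag CoT traces that are overly long or highly repetitive (looping)."""
    words = trace.split()
    ngrams = [" ".join(words[i:i + ngram]) for i in range(len(words) - ngram + 1)]
    repeats = max(Counter(ngrams).values(), default=0)
    return {
        "chars": len(trace),
        "too_long": len(trace) > max_chars,
        "max_ngram_repeats": repeats,
        "looping": repeats >= 3,  # same 5-gram appearing 3+ times
    }

verbose = "the box is left of the sphere so " * 40   # long, looping trace
concise = "The red box is to the left of the blue sphere."
print(cot_trace_risk(verbose)["looping"], cot_trace_risk(concise)["looping"])  # → True False
```

Traces that trip either flag are candidates for re-running with a direct-answer (Non-CoT) prompt.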
Evidence
- Across 17 models and 13 spatial benchmarks, CoT prompting resulted in an average 3% accuracy decrease compared to Non-CoT. Qwen2.5-VL-7B: Non-CoT 62.68% vs CoT 59.68%.
- GThinker-7B was the one large outlier, dropping −23.14% under Non-CoT prompts: instead of following the direct-answer instruction, it repeatedly emitted `tool_call` tokens until hitting the maximum token limit.
- In the No-Image++ experiment, MRM accuracy for selecting 'cannot determine from the image' was: GThinker 5.55%, R1-Onevision 11.22%, Vision-R1 7.29%—below random chance. The base Qwen2.5-VL-7B achieved 76.41%.
- Qwen3-VL-8B-Thinking (a model enhanced for spatial awareness) showed Non-CoT outperforming CoT on 8 out of 13 datasets, with an average difference of +0.64%.
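The reported comparisons reduce to simple per-model deltas (CoT accuracy minus Non-CoT accuracy). A sketch using the Qwen2.5-VL-7B figures quoted above, where the `results` dict is a stand-in for your own evaluation logs:

```python
# Non-CoT vs CoT accuracies (%) as reported above; extend with your own runs.
results = {"Qwen2.5-VL-7B": {"non_cot": 62.68, "cot": 59.68}}

def cot_delta(acc: dict) -> float:
    """Positive delta means CoT helped; negative means CoT hurt."""
    return round(acc["cot"] - acc["non_cot"], 2)

for model, acc in results.items():
    print(model, cot_delta(acc))  # → Qwen2.5-VL-7B -3.0
```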
How to Apply
- When processing 'object location/direction/distance' questions in multimodal apps, switch from CoT prompts to direct-answer prompts (Non-CoT). For example, configure the system prompt without 'think' tags: 'You are a spatial-reasoning assistant. Answer the question directly.'
- If your spatial reasoning pipeline uses MRMs such as GThinker or ViGoRL, compare their performance against the shared base model (Qwen2.5-VL-7B-Instruct) run with Non-CoT prompts; the base model may match or beat them while reducing both cost and latency.
- To reduce model hallucinations in visually-critical functions (e.g., robot navigation, object relationship extraction in images), incorporate a No-Image++-style internal reliability test into your QA pipeline—input a blank image and include a 'cannot determine' option.
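The last bullet can be sketched as a small probe harness. Everything here is illustrative scaffolding around the idea, not the paper's exact protocol: `build_probe` and `passed_probe` are hypothetical helpers, and the blank-image path is a placeholder for a solid-gray image you supply.

```python
CANNOT = "Cannot determine from the image"

def build_probe(question: str, options: list[str], blank_image: str) -> list[dict]:
    """Build a No-Image++-style message: blank image + 'cannot determine' option."""
    opts = options + [CANNOT]
    letters = "ABCDEFGH"
    text = question + "\nOptions:\n" + "\n".join(
        f"{letters[i]}. {o}" for i, o in enumerate(opts)
    )
    return [
        {"role": "user", "content": [
            {"type": "image", "image": blank_image},  # solid gray placeholder
            {"type": "text", "text": text},
        ]},
    ]

def passed_probe(model_answer: str) -> bool:
    """The model should refuse to answer when the image carries no information."""
    return CANNOT.lower() in model_answer.lower()

probe = build_probe("Where is the red box relative to the blue sphere?",
                    ["Left", "Right", "Above", "Below"], "gray.png")
print(passed_probe("E. Cannot determine from the image"))  # → True
```

Run the probe periodically in CI: a model that answers "Left" for a blank image is relying on textual priors, the failure mode the No-Image++ experiment exposes.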
Code Example
# Non-CoT prompt example (spatial reasoning task)
base_system_prompt = "You are a spatial-reasoning assistant. The user asks a question, and the Assistant solves it."
# CoT prompt (do not use - performance degradation in spatial tasks)
cot_system_prompt = (
    "You are a spatial-reasoning assistant. "
    "First output the thinking process in <think></think> tags "
    "and then output the final answer in <answer></answer> tags."
)
# Recommended: Call directly with Non-CoT
messages = [
    {"role": "system", "content": base_system_prompt},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "<your_image_path_or_url>"},
            {"type": "text", "text": "Where is the red box relative to the blue sphere?\nOptions:\nA. Left\nB. Right\nC. Above\nD. Below\nPlease select the correct answer (letter and option text) from the options above."},
        ],
    },
]
# No-Image++ reliability test: blank image + 'cannot determine' option
def add_cannot_determine_option(options: list[str]) -> list[str]:
    return options + ["Cannot determine from the image"]
# If the model does not choose 'cannot determine' when given a blank image, treat that as a hallucination-risk signal.
Related Papers
Can LLMs model real-world systems in TLA+?
A benchmark study systematically showing that when LLMs write TLA+ specifications, they pass syntax checks well but their behavioral conformance with the real system reaches only about 46%, revealing the practical limits of AI-based formal verification.
Natural Language Autoencoders: Turning Claude's Thoughts into Text
Anthropic introduces NLA, a technique that converts the numeric vectors (activations) inside an LLM into directly readable natural language, a new advance in interpretability research into what the AI is actually thinking.
ProgramBench: Can language models rebuild programs from scratch?
A new benchmark measuring whether LLMs can reimplement real software such as FFmpeg, SQLite, and a PHP interpreter from scratch using only documentation; even the best model achieves a 95%+ pass rate on only 3% of tasks.
MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents
Split a request into three tickets and Claude/GPT will simply write the security-vulnerable code 53–86% of the time.
Refusal in Language Models Is Mediated by a Single Direction
Open-source chat models encode safety as a single vector direction, and removing it disables safety fine-tuning.
Show HN: A new benchmark for testing LLMs for deterministic outputs
Structured Output Benchmark assesses LLM JSON handling across seven metrics, revealing performance beyond schema compliance.
Original Abstract
Multimodal Reasoning Models (MRMs) leveraging Chain-of-Thought (CoT) based thinking have revolutionized mathematical and logical problem-solving. However, we show that this paradigm struggles with generalized spatial intelligence. We perform a comprehensive evaluation of seventeen models across thirteen spatial benchmarks and identify a critical gap: CoT prompting consistently degrades performance in visual spatial reasoning. Furthermore, through a novel No-Image++ ablation, we demonstrate that MRMs and CoT prompted MLMs suffer from severe shortcut learning, and hallucinate visual details from textual priors even when the image is absent. These findings challenge the efficacy of text-only CoT for spatial tasks and underscore the need for vision-centric reasoning paradigms.