Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs
TL;DR Highlight
Chain-of-Thought reasoning decreases accuracy across 17 models on image-based spatial reasoning tasks.
Who Should Read
ML engineers developing services that analyze spatial relationships (location, direction, distance, etc.) within images using multimodal LLMs, or AI application developers who default to CoT prompting.
Core Mechanics
- CoT (Chain-of-Thought) prompting, despite its well-known gains on math and logic tasks, reduces accuracy by an average of 3% in visual spatial reasoning.
- 7 out of 8 Multimodal Reasoning Models (MRMs) trained with Reinforcement Learning (RL) performed *worse* at spatial reasoning than their base Qwen2.5-VL-7B-Instruct model—expensive training can be counterproductive.
- ViGoRL-7B-Spatial, specifically trained for spatial reasoning, also underperformed its base model (−2%), as did TreeVGR (−1.57%); Vision-G1 (+0.6%) was the sole exception.
- The No-Image++ experiment—replacing images with a gray screen and adding a 'cannot determine from the image' option—showed MRMs confidently selecting incorrect answers based on textual knowledge alone, even fabricating spatial coordinates.
- GPT-5 and GPT-5-nano also show +0.65% and +1.23% higher accuracy with Non-CoT prompting compared to CoT, mirroring the trend observed in open-source models. GPT-4o and GPT-4.1-mini show minimal CoT gains (under 0.5%) that don't justify the added inference cost.
- Models with concise, non-repetitive CoT traces (GPT family, ~350 characters) experience less performance degradation than open-source models with lengthy, looping traces (~3600 characters). Verbose reasoning is suspected to induce hallucinations.
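The "verbose traces hallucinate" observation suggests a cheap pipeline guardrail: flag CoT outputs that are unusually long or that loop. Below is a minimal sketch, assuming you log the raw trace text; the character threshold and the 5-gram repetition heuristic are illustrative choices of ours, not measurements from the paper.

```python
from collections import Counter

def cot_trace_risk(trace: str, max_chars: int = 1000, ngram: int = 5) -> dict:
    """Flag CoT traces that are overly long or highly repetitive (looping)."""
    words = trace.split()
    ngrams = [" ".join(words[i:i + ngram]) for i in range(len(words) - ngram + 1)]
    repeats = max(Counter(ngrams).values(), default=0)
    return {
        "chars": len(trace),
        "too_long": len(trace) > max_chars,
        "max_ngram_repeats": repeats,
        "looping": repeats >= 3,  # same 5-gram appearing 3+ times
    }

verbose = "the box is left of the sphere so " * 40   # long, looping trace
concise = "The red box is to the left of the blue sphere."
print(cot_trace_risk(verbose)["looping"], cot_trace_risk(concise)["looping"])  # → True False
```

Traces that trip either flag are candidates for re-running with a direct-answer (Non-CoT) prompt.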
Evidence
- Across 17 models and 13 spatial benchmarks, CoT prompting resulted in an average 3% accuracy decrease compared to Non-CoT. Qwen2.5-VL-7B: Non-CoT 62.68% vs CoT 59.68%.
- GThinker-7B was the one large outlier, dropping −23.14% under Non-CoT prompts: instead of following the direct-answer instruction, it repeatedly emitted `tool_call` tokens until hitting the maximum token limit.
- In the No-Image++ experiment, MRM accuracy for selecting 'cannot determine from the image' was: GThinker 5.55%, R1-Onevision 11.22%, Vision-R1 7.29%—below random chance. The base Qwen2.5-VL-7B achieved 76.41%.
- Qwen3-VL-8B-Thinking (a model enhanced for spatial awareness) showed Non-CoT outperforming CoT on 8 out of 13 datasets, with an average difference of +0.64%.
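The reported comparisons reduce to simple per-model deltas (CoT accuracy minus Non-CoT accuracy). A sketch using the Qwen2.5-VL-7B figures quoted above, where the `results` dict is a stand-in for your own evaluation logs:

```python
# Non-CoT vs CoT accuracies (%) as reported above; extend with your own runs.
results = {"Qwen2.5-VL-7B": {"non_cot": 62.68, "cot": 59.68}}

def cot_delta(acc: dict) -> float:
    """Positive delta means CoT helped; negative means CoT hurt."""
    return round(acc["cot"] - acc["non_cot"], 2)

for model, acc in results.items():
    print(model, cot_delta(acc))  # → Qwen2.5-VL-7B -3.0
```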
How to Apply
- When processing 'object location/direction/distance' questions in multimodal apps, switch from CoT prompts to direct-answer prompts (Non-CoT). For example, configure the system prompt without 'think' tags: 'You are a spatial-reasoning assistant. Answer the question directly.'
- If your spatial reasoning pipeline uses MRMs such as GThinker or ViGoRL, compare their performance against the shared base model (Qwen2.5-VL-7B-Instruct) run with Non-CoT prompts; the base model may match or beat them while reducing both cost and latency.
- To reduce model hallucinations in visually-critical functions (e.g., robot navigation, object relationship extraction in images), incorporate a No-Image++-style internal reliability test into your QA pipeline—input a blank image and include a 'cannot determine' option.
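The last bullet can be sketched as a small probe harness. Everything here is illustrative scaffolding around the idea, not the paper's exact protocol: `build_probe` and `passed_probe` are hypothetical helpers, and the blank-image path is a placeholder for a solid-gray image you supply.

```python
CANNOT = "Cannot determine from the image"

def build_probe(question: str, options: list[str], blank_image: str) -> list[dict]:
    """Build a No-Image++-style message: blank image + 'cannot determine' option."""
    opts = options + [CANNOT]
    letters = "ABCDEFGH"
    text = question + "\nOptions:\n" + "\n".join(
        f"{letters[i]}. {o}" for i, o in enumerate(opts)
    )
    return [
        {"role": "user", "content": [
            {"type": "image", "image": blank_image},  # solid gray placeholder
            {"type": "text", "text": text},
        ]},
    ]

def passed_probe(model_answer: str) -> bool:
    """The model should refuse to answer when the image carries no information."""
    return CANNOT.lower() in model_answer.lower()

probe = build_probe("Where is the red box relative to the blue sphere?",
                    ["Left", "Right", "Above", "Below"], "gray.png")
print(passed_probe("E. Cannot determine from the image"))  # → True
```

Run the probe periodically in CI: a model that answers "Left" for a blank image is relying on textual priors, the failure mode the No-Image++ experiment exposes.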
Code Example
# Non-CoT prompt example (spatial reasoning task)
base_system_prompt = "You are a spatial-reasoning assistant. The user asks a question, and the Assistant solves it."
# CoT prompt (do not use - performance degradation in spatial tasks)
cot_system_prompt = (
    "You are a spatial-reasoning assistant. "
    "First output the thinking process in <think></think> tags "
    "and then output the final answer in <answer></answer> tags."
)
# Recommended: Call directly with Non-CoT
messages = [
    {"role": "system", "content": base_system_prompt},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "<your_image_path_or_url>"},
            {"type": "text", "text": "Where is the red box relative to the blue sphere?\nOptions:\nA. Left\nB. Right\nC. Above\nD. Below\nPlease select the correct answer (letter and option text) from the options above."},
        ],
    },
]
# No-Image++ reliability test: blank image + 'cannot determine' option
def add_cannot_determine_option(options: list[str]) -> list[str]:
    return options + ["Cannot determine from the image"]
# If the model does not choose 'cannot determine' when given a blank image, treat that as a hallucination-risk signal.
Related Papers
Can LLMs model real-world systems in TLA+?
A benchmark study systematically showing that when LLMs write TLA+ specifications, they pass syntax checks well but their behavioral conformance with the real system reaches only about 46%, revealing the practical limits of AI-based formal verification.
Natural Language Autoencoders: Turning Claude's Thoughts into Text
Anthropic introduces NLA, a technique that converts the numeric vectors (activations) inside an LLM into directly readable natural language, a new advance in interpretability research into what the AI is actually thinking.
ProgramBench: Can language models rebuild programs from scratch?
A new benchmark measuring whether LLMs can reimplement real software such as FFmpeg, SQLite, and a PHP interpreter from scratch using only documentation; even the best model achieves a 95%+ pass rate on only 3% of tasks.
MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents
Split a request into three tickets and Claude/GPT will simply write the security-vulnerable code 53–86% of the time.
Refusal in Language Models Is Mediated by a Single Direction
Open-source chat models encode safety as a single vector direction, and removing it disables safety fine-tuning.
Show HN: A new benchmark for testing LLMs for deterministic outputs
Structured Output Benchmark assesses LLM JSON handling across seven metrics, revealing performance beyond schema compliance.
Original Abstract
Multimodal Reasoning Models (MRMs) leveraging Chain-of-Thought (CoT) based thinking have revolutionized mathematical and logical problem-solving. However, we show that this paradigm struggles with generalized spatial intelligence. We perform a comprehensive evaluation of seventeen models across thirteen spatial benchmarks and identify a critical gap: CoT prompting consistently degrades performance in visual spatial reasoning. Furthermore, through a novel No-Image++ ablation, we demonstrate that MRMs and CoT prompted MLMs suffer from severe shortcut learning, and hallucinate visual details from textual priors even when the image is absent. These findings challenge the efficacy of text-only CoT for spatial tasks and underscore the need for vision-centric reasoning paradigms.