Large Language Model Reasoning Failures
TL;DR Highlight
The first comprehensive survey of LLM reasoning failure patterns — including failures where even children outperform GPT-4.
Who Should Read
AI developers building LLM-based agents or chatbots who want to understand why models give weird answers. Useful for anyone doing prompt engineering or multi-agent system design where reasoning reliability matters.
Core Mechanics
- LLM reasoning failures fall into 7 major categories: logical, mathematical, causal, spatial, temporal, analogical, and commonsense reasoning failures
- GPT-4 and similar models fail on spatial and causal reasoning tasks that average children (age 7-10) handle correctly
- Chain-of-thought prompting reduces some failure types but introduces new failure modes (verbose reasoning that drifts off-track)
- Multi-step reasoning failures compound — each reasoning step carries an independent failure probability, so the chance of completing a long chain without error decays geometrically with chain length
- Many failures are reproducible and systematic, not random — the same prompt structure reliably triggers the same failure mode
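The compounding claim can be made concrete: if each step fails independently with probability eps, an n-step chain runs error-free only with probability (1 - eps)^n. A minimal sketch, with an illustrative per-step error rate that is not taken from the paper:

```python
def chain_failure_rate(eps: float, n_steps: int) -> float:
    """Failure probability of an n-step chain when each step
    independently fails with probability eps."""
    return 1.0 - (1.0 - eps) ** n_steps

# With an illustrative 5% per-step error rate, failure grows quickly
# with chain length even though each individual step looks reliable:
for n in (2, 5, 10):
    print(f"{n}-step chain failure rate: {chain_failure_rate(0.05, n):.3f}")
```

This is why the "How to Apply" advice below favors short chains: cutting a chain from 5 steps to 2 sharply reduces the end-to-end failure probability under this model.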
Evidence
- GPT-4 achieves 45% accuracy on spatial reasoning tasks where 7-year-olds score 71%
- Causal reasoning accuracy: GPT-4 62% vs. average adult 89%
- Chain-of-thought reduces logical reasoning failures by 23% but increases verbose drift failures by 18%
- In 5-step reasoning chains, the error rate is 4.2x higher than in 2-step chains, consistent with a compounding independent-failure model
How to Apply
- Before deploying an agent on reasoning-heavy tasks, run it against the failure taxonomy in this paper to identify which categories it's weakest in
- For multi-step reasoning, break tasks into shorter chains (2-3 steps max) and verify intermediate outputs rather than trusting end-to-end chains
- When spatial or causal reasoning is required, consider adding explicit intermediate representations (diagrams described in text, causal graphs) rather than relying on LLM implicit reasoning
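The second recommendation — short chains with verified intermediate outputs — can be sketched as a loop that checks each step's output before feeding it forward. The `steps`/`checks`/`run_step` structure below is an assumed illustration, not an API from the paper:

```python
def run_verified_chain(steps, checks, run_step):
    """Run reasoning steps one at a time, validating each intermediate
    output instead of trusting an end-to-end chain.

    steps:    list of prompt strings, one per reasoning step
    checks:   list of predicates, one per step, validating that step's output
    run_step: callable(prompt, context) -> output, e.g. an LLM call
    """
    context = []
    for step, check in zip(steps, checks):
        output = run_step(step, context)
        if not check(output):
            raise ValueError(f"Step failed verification: {step!r} -> {output!r}")
        context.append(output)  # only verified outputs feed later steps
    return context[-1]

# Toy run_step standing in for an LLM call, to show the control flow:
def run_step(step, context):
    return {"add 2+3": "5", "double 5": "10"}[step]

result = run_verified_chain(
    ["add 2+3", "double 5"],
    [lambda o: o == "5", lambda o: o == "10"],
    run_step,
)
```

Failing a check raises immediately, which localizes the error to one step instead of letting it silently propagate through the rest of the chain.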
Code Example
# Reversal Curse & Framing Effect simple tests
# Check whether an LLM shows bidirectional reasoning and representation invariance.
# ask_fn is any callable that sends a prompt string to a model and returns its answer.

def test_reversal_curse(ask_fn):
    """Check whether the model can infer B→A from an A→B fact."""
    fact = "The CEO of our service is Kim Chulsoo."
    forward_q = "Who is the CEO of our service?"
    reverse_q = "Kim Chulsoo is the CEO of which service?"
    forward_ans = ask_fn(f"{fact}\n{forward_q}")
    reverse_ans = ask_fn(reverse_q)  # reverse direction only, without the fact
    print(f"Forward: {forward_ans}")  # typically answered correctly
    print(f"Reverse: {reverse_ans}")  # often unknown or incorrect

def test_framing_effect(ask_fn):
    """Check whether the model answers consistently when the same content is phrased differently."""
    context = "Team A: 3h + 2h + 4h = 9h of work; Team B: 5h + 1h + 3h = 9h of work"
    q1 = f"{context}\nDid Team B work more total hours than Team A?"
    q2 = f"{context}\nDid Team B work fewer total hours than Team A?"
    ans1 = ask_fn(q1)  # an answer of "more" means the model was swayed by framing
    ans2 = ask_fn(q2)  # an answer of "fewer" indicates the same problem
    # If the two answers are logically contradictory, a framing effect is present.
    print(f"More?: {ans1}")
    print(f"Fewer?: {ans2}")

# Anchoring bias test
def test_anchoring(ask_fn):
    anchored = "Will this API be called more than 10,000 times per second? What is your estimate?"
    neutral = "What is the expected number of calls per second for this API?"
    print(ask_fn(anchored))  # estimate is likely biased toward 10,000
    print(ask_fn(neutral))
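Running these checks end to end only requires a callable that maps a prompt to a model's answer. Here is a minimal sketch with a hypothetical canned stub in place of a real API client; the stub's replies are contrived to mimic a reversal-curse failure:

```python
# Hypothetical stub standing in for a real LLM client; a real ask_fn would
# wrap an API call and return the completion text.
def ask_fn(prompt: str) -> str:
    canned = {
        # Models typically answer the forward direction correctly...
        "Who is the CEO of our service?": "Kim Chulsoo",
        # ...while the reverse direction often fails (the reversal curse).
        "Kim Chulsoo is the CEO of which service?": "I don't know.",
    }
    for question, answer in canned.items():
        if prompt.rstrip().endswith(question):
            return answer
    return "I don't know."

forward = ask_fn("The CEO of our service is Kim Chulsoo.\nWho is the CEO of our service?")
reverse = ask_fn("Kim Chulsoo is the CEO of which service?")
print(f"Forward: {forward}")  # stub answers correctly with the fact in context
print(f"Reverse: {reverse}")  # stub mimics the typical reversal-curse failure
```

Swapping the stub for a function that wraps your provider's chat API turns these into live probes of a deployed model.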
Original Abstract
Large Language Models (LLMs) have exhibited remarkable reasoning capabilities, achieving impressive results across a wide range of tasks. Despite these advances, significant reasoning failures persist, occurring even in seemingly simple scenarios. To systematically understand and address these shortcomings, we present the first comprehensive survey dedicated to reasoning failures in LLMs. We introduce a novel categorization framework that distinguishes reasoning into embodied and non-embodied types, with the latter further subdivided into informal (intuitive) and formal (logical) reasoning. In parallel, we classify reasoning failures along a complementary axis into three types: fundamental failures intrinsic to LLM architectures that broadly affect downstream tasks; application-specific limitations that manifest in particular domains; and robustness issues characterized by inconsistent performance across minor variations. For each reasoning failure, we provide a clear definition, analyze existing studies, explore root causes, and present mitigation strategies. By unifying fragmented research efforts, our survey provides a structured perspective on systemic weaknesses in LLM reasoning, offering valuable insights and guiding future research towards building stronger, more reliable, and robust reasoning capabilities. We additionally release a comprehensive collection of research works on LLM reasoning failures, as a GitHub repository at https://github.com/Peiyang-Song/Awesome-LLM-Reasoning-Failures, to provide an easy entry point to this area.