Large Language Model Reasoning Failures
TL;DR Highlight
The first comprehensive survey of LLM reasoning failure patterns — including failures where even children outperform GPT-4.
Who Should Read
AI developers building LLM-based agents or chatbots who want to understand why models give weird answers. Useful for anyone doing prompt engineering or multi-agent system design where reasoning reliability matters.
Core Mechanics
- LLM reasoning failures fall into 7 major categories: logical, mathematical, causal, spatial, temporal, analogical, and commonsense reasoning failures
- GPT-4 and similar models fail on spatial and causal reasoning tasks that average children (age 7-10) handle correctly
- Chain-of-thought prompting reduces some failure types but introduces new failure modes (verbose reasoning that drifts off-track)
- Multi-step reasoning failures compound — each reasoning step carries an independent failure probability, so the chance of completing a long chain without error decays geometrically with chain length
- Many failures are reproducible and systematic, not random — the same prompt structure reliably triggers the same failure mode
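The compounding claim can be made concrete: if each step fails independently with probability eps, an n-step chain runs error-free only with probability (1 - eps)^n. A minimal sketch, with an illustrative per-step error rate that is not taken from the paper:

```python
def chain_failure_rate(eps: float, n_steps: int) -> float:
    """Failure probability of an n-step chain when each step
    independently fails with probability eps."""
    return 1.0 - (1.0 - eps) ** n_steps

# With an illustrative 5% per-step error rate, failure grows quickly
# with chain length even though each individual step looks reliable:
for n in (2, 5, 10):
    print(f"{n}-step chain failure rate: {chain_failure_rate(0.05, n):.3f}")
```

This is why the "How to Apply" advice below favors short chains: cutting a chain from 5 steps to 2 sharply reduces the end-to-end failure probability under this model.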
Evidence
- GPT-4 achieves 45% accuracy on spatial reasoning tasks where 7-year-olds score 71%
- Causal reasoning accuracy: GPT-4 62% vs. average adult 89%
- Chain-of-thought reduces logical reasoning failures by 23% but increases verbose drift failures by 18%
- In 5-step reasoning chains, the error rate is 4.2x higher than in 2-step chains, consistent with a compounding independent-failure model
How to Apply
- Before deploying an agent on reasoning-heavy tasks, run it against the failure taxonomy in this paper to identify which categories it's weakest in
- For multi-step reasoning, break tasks into shorter chains (2-3 steps max) and verify intermediate outputs rather than trusting end-to-end chains
- When spatial or causal reasoning is required, consider adding explicit intermediate representations (diagrams described in text, causal graphs) rather than relying on LLM implicit reasoning
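The second recommendation — short chains with verified intermediate outputs — can be sketched as a loop that checks each step's output before feeding it forward. The `steps`/`checks`/`run_step` structure below is an assumed illustration, not an API from the paper:

```python
def run_verified_chain(steps, checks, run_step):
    """Run reasoning steps one at a time, validating each intermediate
    output instead of trusting an end-to-end chain.

    steps:    list of prompt strings, one per reasoning step
    checks:   list of predicates, one per step, validating that step's output
    run_step: callable(prompt, context) -> output, e.g. an LLM call
    """
    context = []
    for step, check in zip(steps, checks):
        output = run_step(step, context)
        if not check(output):
            raise ValueError(f"Step failed verification: {step!r} -> {output!r}")
        context.append(output)  # only verified outputs feed later steps
    return context[-1]

# Toy run_step standing in for an LLM call, to show the control flow:
def run_step(step, context):
    return {"add 2+3": "5", "double 5": "10"}[step]

result = run_verified_chain(
    ["add 2+3", "double 5"],
    [lambda o: o == "5", lambda o: o == "10"],
    run_step,
)
```

Failing a check raises immediately, which localizes the error to one step instead of letting it silently propagate through the rest of the chain.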
Code Example
# Reversal Curse & Framing Effect simple tests
# Check whether an LLM shows bidirectional reasoning and representation invariance.
# ask_fn is any callable that sends a prompt string to a model and returns its answer.

def test_reversal_curse(ask_fn):
    """Check whether the model can infer B→A from an A→B fact."""
    fact = "The CEO of our service is Kim Chulsoo."
    forward_q = "Who is the CEO of our service?"
    reverse_q = "Kim Chulsoo is the CEO of which service?"
    forward_ans = ask_fn(f"{fact}\n{forward_q}")
    reverse_ans = ask_fn(reverse_q)  # reverse direction only, without the fact
    print(f"Forward: {forward_ans}")  # typically answered correctly
    print(f"Reverse: {reverse_ans}")  # often unknown or incorrect

def test_framing_effect(ask_fn):
    """Check whether the model answers consistently when the same content is phrased differently."""
    context = "Team A: 3h + 2h + 4h = 9h of work; Team B: 5h + 1h + 3h = 9h of work"
    q1 = f"{context}\nDid Team B work more total hours than Team A?"
    q2 = f"{context}\nDid Team B work fewer total hours than Team A?"
    ans1 = ask_fn(q1)  # an answer of "more" means the model was swayed by framing
    ans2 = ask_fn(q2)  # an answer of "fewer" indicates the same problem
    # If the two answers are logically contradictory, a framing effect is present.
    print(f"More?: {ans1}")
    print(f"Fewer?: {ans2}")

# Anchoring bias test
def test_anchoring(ask_fn):
    anchored = "Will this API be called more than 10,000 times per second? What is your estimate?"
    neutral = "What is the expected number of calls per second for this API?"
    print(ask_fn(anchored))  # estimate is likely biased toward 10,000
    print(ask_fn(neutral))
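Running these checks end to end only requires a callable that maps a prompt to a model's answer. Here is a minimal sketch with a hypothetical canned stub in place of a real API client; the stub's replies are contrived to mimic a reversal-curse failure:

```python
# Hypothetical stub standing in for a real LLM client; a real ask_fn would
# wrap an API call and return the completion text.
def ask_fn(prompt: str) -> str:
    canned = {
        # Models typically answer the forward direction correctly...
        "Who is the CEO of our service?": "Kim Chulsoo",
        # ...while the reverse direction often fails (the reversal curse).
        "Kim Chulsoo is the CEO of which service?": "I don't know.",
    }
    for question, answer in canned.items():
        if prompt.rstrip().endswith(question):
            return answer
    return "I don't know."

forward = ask_fn("The CEO of our service is Kim Chulsoo.\nWho is the CEO of our service?")
reverse = ask_fn("Kim Chulsoo is the CEO of which service?")
print(f"Forward: {forward}")  # stub answers correctly with the fact in context
print(f"Reverse: {reverse}")  # stub mimics the typical reversal-curse failure
```

Swapping the stub for a function that wraps your provider's chat API turns these into live probes of a deployed model.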
Original Abstract
Large Language Models (LLMs) have exhibited remarkable reasoning capabilities, achieving impressive results across a wide range of tasks. Despite these advances, significant reasoning failures persist, occurring even in seemingly simple scenarios. To systematically understand and address these shortcomings, we present the first comprehensive survey dedicated to reasoning failures in LLMs. We introduce a novel categorization framework that distinguishes reasoning into embodied and non-embodied types, with the latter further subdivided into informal (intuitive) and formal (logical) reasoning. In parallel, we classify reasoning failures along a complementary axis into three types: fundamental failures intrinsic to LLM architectures that broadly affect downstream tasks; application-specific limitations that manifest in particular domains; and robustness issues characterized by inconsistent performance across minor variations. For each reasoning failure, we provide a clear definition, analyze existing studies, explore root causes, and present mitigation strategies. By unifying fragmented research efforts, our survey provides a structured perspective on systemic weaknesses in LLM reasoning, offering valuable insights and guiding future research towards building stronger, more reliable, and robust reasoning capabilities. We additionally release a comprehensive collection of research works on LLM reasoning failures, as a GitHub repository at https://github.com/Peiyang-Song/Awesome-LLM-Reasoning-Failures, to provide an easy entry point to this area.