TopoBench: Benchmarking LLMs on Hard Topological Reasoning
TL;DR Highlight
Even top models like GPT-4 solve only ~25% of hard topological grid puzzles — not because of weak reasoning, but because they fail to extract spatial constraints from the puzzle representation.
Who Should Read
AI researchers studying spatial reasoning and math problem-solving in LLMs, and benchmark designers looking for tasks that expose fundamental reasoning gaps.
Core Mechanics
- Introduced TopoBench, a benchmark of six topological grid-puzzle families across three difficulty levels, testing global spatial invariants such as connectivity, loop closure, and region symmetry
- Top models including GPT-4 solve only ~25% of hard instances, with two puzzle families nearly unsolved
- Failure mode analysis reveals the bottleneck is not reasoning ability per se, but the extraction of spatial constraint information from problem descriptions
- Models can reason correctly once constraints are extracted properly — the perception/extraction step is the weak link
- Performance gap between easy and hard problems is driven by constraint complexity, not reasoning chain length
- Topology problems are particularly hard because they require maintaining global topological structure, not just local rules
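The global-versus-local distinction above can be made concrete with a small illustrative sketch (not the benchmark's own code; the function name and adjacency representation are assumptions for illustration). In a Bridges-style puzzle, each island's bridge count is a local rule, but connectivity of the whole bridge network is a global invariant that no single cell can verify on its own:

```python
from collections import deque

def islands_connected(adjacency):
    """Check the global invariant that all islands form one connected
    component. `adjacency` maps island id -> set of island ids joined
    to it by at least one bridge. Illustrative helper, not part of the
    TopoBench release."""
    if not adjacency:
        return True
    start = next(iter(adjacency))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nbr in adjacency[node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return len(seen) == len(adjacency)

# Every local degree constraint can be satisfied while the global
# check fails: two island pairs, each fully bridged internally.
disconnected = {0: {1}, 1: {0}, 2: {3}, 3: {2}}
print(islands_connected(disconnected))  # False
```

A model that only verifies local bridge counts would accept the `disconnected` configuration; maintaining the global structure requires tracking the entire graph.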
Evidence
- GPT-4 and similar top models solve ~25% of hard topology puzzles
- When spatial constraints are provided in structured form (bypassing extraction), solve rates increase substantially
- Error analysis of 750 annotated chain-of-thought traces confirms >70% of failures trace back to incorrect constraint extraction, not reasoning errors
- Targeted interventions show that premature commitment and constraint forgetting directly impair solving, while repeated reasoning is a benign by-product of search
- Performance degrades consistently as topological complexity increases
How to Apply
- Use this benchmark to diagnose spatial reasoning gaps in your LLM — low scores indicate constraint extraction issues, not just reasoning weaknesses
- For applications requiring spatial reasoning (robotics planning, geometry proofs), pre-process problems to make constraints explicit before passing to the LLM
- Consider fine-tuning on topology problems as a way to improve spatial constraint extraction ability
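As a concrete sketch of the pre-processing step suggested above, the helper below (a hypothetical function, not from the benchmark) turns an ASCII Bridges grid into an explicit list of constraints the LLM no longer has to extract itself — each island's position and required bridge count:

```python
def extract_island_constraints(ascii_grid):
    """Turn an ASCII Bridges grid into explicit constraints:
    a list of (row, col, required_bridge_count) tuples.
    Hypothetical pre-processing helper; digits mark islands and
    '.' marks water, as in the Code Example section."""
    constraints = []
    for r, row in enumerate(ascii_grid.strip().split('\n')):
        for c, ch in enumerate(row):
            if ch.isdigit():
                constraints.append((r, c, int(ch)))
    return constraints

grid = """3..1.
..4.3"""
print(extract_island_constraints(grid))
# [(0, 0, 3), (0, 3, 1), (1, 2, 4), (1, 4, 3)]
```

The resulting tuples can be serialized into the prompt, mirroring the paper's finding that solve rates rise when constraints are supplied in structured form.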
Code Example
# Example of converting a Bridges puzzle to IntFormat
# ASCII original:
# 3..1.
# ..4.3
# .....
# .....
# 4.4.1
# Encoding rules: 4=water(.), 5=island(1), 6=island(2), 7=island(3), 8=island(4)
def ascii_bridges_to_intformat(ascii_grid):
    mapping = {'.': '4', '1': '5', '2': '6', '3': '7', '4': '8'}
    rows = ascii_grid.strip().split('\n')
    result = []
    for row in rows:
        encoded = ','.join(mapping.get(c, c) for c in row)
        result.append(encoded)
    return '\n'.join(result)
ascii_input = """3..1.
..4.3
.....
.....
4.4.1"""
print(ascii_bridges_to_intformat(ascii_input))
# Output:
# 7,4,4,5,4
# 4,4,8,4,7
# 4,4,4,4,4
# 4,4,4,4,4
# 8,4,8,4,5
# JSON 2D array version (IntFormat JSON)
import json

def ascii_bridges_to_intformat_json(ascii_grid):
    mapping = {'.': '4', '1': '5', '2': '6', '3': '7', '4': '8'}
    rows = ascii_grid.strip().split('\n')
    grid = [[mapping.get(c, c) for c in row] for row in rows]
    return json.dumps(grid)

print(ascii_bridges_to_intformat_json(ascii_input))
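To sanity-check the encoding, an assumed companion decoder (the function name and its role here are illustrative, not part of the benchmark release) can round-trip IntFormat back to the ASCII grid:

```python
def intformat_to_ascii_bridges(int_grid):
    """Invert the IntFormat encoding: 4 -> water '.', 5-8 -> islands
    '1'-'4'. Assumed companion helper for round-trip checks."""
    inverse = {'4': '.', '5': '1', '6': '2', '7': '3', '8': '4'}
    rows = int_grid.strip().split('\n')
    return '\n'.join(
        ''.join(inverse[c] for c in row.split(',')) for row in rows
    )

encoded = "7,4,4,5,4\n4,4,8,4,7"
print(intformat_to_ascii_bridges(encoded))
# 3..1.
# ..4.3
```

Asserting `intformat_to_ascii_bridges(ascii_bridges_to_intformat(grid)) == grid.strip()` is a cheap guard against mapping typos when generating benchmark inputs.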
Original Abstract
Solving topological grid puzzles requires reasoning over global spatial invariants such as connectivity, loop closure, and region symmetry, and remains challenging for even the most powerful large language models (LLMs). To study these abilities under controlled settings, we introduce TopoBench, a benchmark of six puzzle families across three difficulty levels. We evaluate strong reasoning LLMs on TopoBench and find that even frontier models solve fewer than one quarter of hard instances, with two families nearly unsolved. To investigate whether these failures stem from reasoning limitations or from difficulty extracting and maintaining spatial constraints, we annotate 750 chain-of-thought traces with an error taxonomy that surfaces four candidate causal failure modes, then test them with targeted interventions simulating each error type. These interventions show that certain error patterns, like premature commitment and constraint forgetting, have a direct impact on the ability to solve the puzzle, while repeated reasoning is a benign effect of search. Finally, we study mitigation strategies including prompt guidance, cell-aligned grid representations, and tool-based constraint checking, finding that the bottleneck lies in extracting constraints from spatial representations and not in reasoning over them. Code and data are available at github.com/mayug/topobench-benchmark.