TopoBench: Benchmarking LLMs on Hard Topological Reasoning
TL;DR Highlight
Even top models like GPT-4 solve only ~25% of hard topological grid puzzles — not because of weak reasoning, but because they fail to extract spatial constraints from the puzzle representation.
Who Should Read
AI researchers studying spatial reasoning and math problem-solving in LLMs, and benchmark designers looking for tasks that expose fundamental reasoning gaps.
Core Mechanics
- Introduced TopoBench, a benchmark of six topological grid-puzzle families across three difficulty levels, testing global spatial invariants such as connectivity, loop closure, and region symmetry
- Top models including GPT-4 solve only ~25% of hard instances, with two puzzle families nearly unsolved
- Failure mode analysis reveals the bottleneck is not reasoning ability per se, but the extraction of spatial constraint information from problem descriptions
- Models can reason correctly once constraints are extracted properly — the perception/extraction step is the weak link
- Performance gap between easy and hard problems is driven by constraint complexity, not reasoning chain length
- Topology problems are particularly hard because they require maintaining global topological structure, not just local rules
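The global-versus-local distinction above can be made concrete with a small illustrative sketch (not the benchmark's own code; the function name and adjacency representation are assumptions for illustration). In a Bridges-style puzzle, each island's bridge count is a local rule, but connectivity of the whole bridge network is a global invariant that no single cell can verify on its own:

```python
from collections import deque

def islands_connected(adjacency):
    """Check the global invariant that all islands form one connected
    component. `adjacency` maps island id -> set of island ids joined
    to it by at least one bridge. Illustrative helper, not part of the
    TopoBench release."""
    if not adjacency:
        return True
    start = next(iter(adjacency))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nbr in adjacency[node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return len(seen) == len(adjacency)

# Every local degree constraint can be satisfied while the global
# check fails: two island pairs, each fully bridged internally.
disconnected = {0: {1}, 1: {0}, 2: {3}, 3: {2}}
print(islands_connected(disconnected))  # False
```

A model that only verifies local bridge counts would accept the `disconnected` configuration; maintaining the global structure requires tracking the entire graph.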
Evidence
- GPT-4 and similar top models solve ~25% of hard topology puzzles
- When spatial constraints are provided in structured form (bypassing extraction), solve rates increase substantially
- Error analysis of 750 annotated chain-of-thought traces confirms >70% of failures trace back to incorrect constraint extraction, not reasoning errors
- Targeted interventions show that premature commitment and constraint forgetting directly impair solving, while repeated reasoning is a benign by-product of search
- Performance degrades consistently as topological complexity increases
How to Apply
- Use this benchmark to diagnose spatial reasoning gaps in your LLM — low scores indicate constraint extraction issues, not just reasoning weaknesses
- For applications requiring spatial reasoning (robotics planning, geometry proofs), pre-process problems to make constraints explicit before passing to the LLM
- Consider fine-tuning on topology problems as a way to improve spatial constraint extraction ability
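As a concrete sketch of the pre-processing step suggested above, the helper below (a hypothetical function, not from the benchmark) turns an ASCII Bridges grid into an explicit list of constraints the LLM no longer has to extract itself — each island's position and required bridge count:

```python
def extract_island_constraints(ascii_grid):
    """Turn an ASCII Bridges grid into explicit constraints:
    a list of (row, col, required_bridge_count) tuples.
    Hypothetical pre-processing helper; digits mark islands and
    '.' marks water, as in the Code Example section."""
    constraints = []
    for r, row in enumerate(ascii_grid.strip().split('\n')):
        for c, ch in enumerate(row):
            if ch.isdigit():
                constraints.append((r, c, int(ch)))
    return constraints

grid = """3..1.
..4.3"""
print(extract_island_constraints(grid))
# [(0, 0, 3), (0, 3, 1), (1, 2, 4), (1, 4, 3)]
```

The resulting tuples can be serialized into the prompt, mirroring the paper's finding that solve rates rise when constraints are supplied in structured form.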
Code Example
# Example of converting a Bridges puzzle to IntFormat
# ASCII original:
# 3..1.
# ..4.3
# .....
# .....
# 4.4.1
# Encoding rules: 4=water(.), 5=island(1), 6=island(2), 7=island(3), 8=island(4)
def ascii_bridges_to_intformat(ascii_grid):
    mapping = {'.': '4', '1': '5', '2': '6', '3': '7', '4': '8'}
    rows = ascii_grid.strip().split('\n')
    result = []
    for row in rows:
        encoded = ','.join(mapping.get(c, c) for c in row)
        result.append(encoded)
    return '\n'.join(result)
ascii_input = """3..1.
..4.3
.....
.....
4.4.1"""
print(ascii_bridges_to_intformat(ascii_input))
# Output:
# 7,4,4,5,4
# 4,4,8,4,7
# 4,4,4,4,4
# 4,4,4,4,4
# 8,4,8,4,5
# JSON 2D array version (IntFormat JSON)
import json

def ascii_bridges_to_intformat_json(ascii_grid):
    mapping = {'.': '4', '1': '5', '2': '6', '3': '7', '4': '8'}
    rows = ascii_grid.strip().split('\n')
    grid = [[mapping.get(c, c) for c in row] for row in rows]
    return json.dumps(grid)

print(ascii_bridges_to_intformat_json(ascii_input))
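To sanity-check the encoding, an assumed companion decoder (the function name and its role here are illustrative, not part of the benchmark release) can round-trip IntFormat back to the ASCII grid:

```python
def intformat_to_ascii_bridges(int_grid):
    """Invert the IntFormat encoding: 4 -> water '.', 5-8 -> islands
    '1'-'4'. Assumed companion helper for round-trip checks."""
    inverse = {'4': '.', '5': '1', '6': '2', '7': '3', '8': '4'}
    rows = int_grid.strip().split('\n')
    return '\n'.join(
        ''.join(inverse[c] for c in row.split(',')) for row in rows
    )

encoded = "7,4,4,5,4\n4,4,8,4,7"
print(intformat_to_ascii_bridges(encoded))
# 3..1.
# ..4.3
```

Asserting `intformat_to_ascii_bridges(ascii_bridges_to_intformat(grid)) == grid.strip()` is a cheap guard against mapping typos when generating benchmark inputs.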
Original Abstract
Solving topological grid puzzles requires reasoning over global spatial invariants such as connectivity, loop closure, and region symmetry, and remains challenging for even the most powerful large language models (LLMs). To study these abilities under controlled settings, we introduce TopoBench, a benchmark of six puzzle families across three difficulty levels. We evaluate strong reasoning LLMs on TopoBench and find that even frontier models solve fewer than one quarter of hard instances, with two families nearly unsolved. To investigate whether these failures stem from reasoning limitations or from difficulty extracting and maintaining spatial constraints, we annotate 750 chain-of-thought traces with an error taxonomy that surfaces four candidate causal failure modes, then test them with targeted interventions simulating each error type. These interventions show that certain error patterns, like premature commitment and constraint forgetting, have a direct impact on the ability to solve the puzzle, while repeated reasoning is a benign effect of search. Finally, we study mitigation strategies including prompt guidance, cell-aligned grid representations, and tool-based constraint checking, finding that the bottleneck lies in extracting constraints from spatial representations and not in reasoning over them. Code and data are available at github.com/mayug/topobench-benchmark.