TopoBench: 위상수학적(Topological) 추론에서 LLM 벤치마킹

TopoBench: Benchmarking LLMs on Hard Topological Reasoning

Mar 12, 2026•Mayug Maniparambil, Nils Hoehing, Janak Kapuriya +5•View PDF

TL;DR Highlight

GPT-5-mini 같은 최강 모델도 어려운 위상 퍼즐을 25%밖에 못 풀고, 실패 원인은 추론 능력 부족이 아니라 공간 제약 조건 추출 실패라는 걸 밝힌 연구.

Who Should Read

LLM의 공간/논리 추론 한계를 이해하고 싶은 AI 엔지니어. 특히 구조화된 상태 추적이 필요한 에이전트나 퍼즐 솔버를 개발 중인 개발자.

Core Mechanics

6가지 위상 퍼즐(Flow Free, Bridges, Loopy 등)로 구성된 TopoBench 벤치마크 공개 - 쉬운 단계는 인간이 쉽게 풀지만 LLM은 어려운 단계에서 25% 미만 정확도
GPT-5-mini-high가 hard tier에서 0.24, 최고 오픈소스인 DeepSeek V3.2는 0.10에 불과 - Galaxies·Loopy 두 퍼즐은 hard에서 거의 전 모델이 0점
에러 분석 결과: 자주 나타나는 에러(Repeated Reasoning 33%)가 실제 실패 원인이 아니고, 드문 에러(Constraint Forgetting 4%)가 오히려 정확도를 10~11pp 떨어뜨리는 주범
Premature Commitment(초반에 틀린 경로 고수)가 Bridges에서 -20.8pp, Undead에서 -11.3pp 정확도 하락 유발 - 가장 치명적인 실패 패턴
입력 포맷을 ASCII에서 IntFormat(정수 인코딩)으로 바꾸면 Bridges/Galaxies에서 최대 +38pp 향상 - 토크나이저가 그리드 셀 경계를 제대로 처리 못하는 게 문제
외부 툴로 구조화된 제약 정보(JSON 형태 degree 수, 연결 컴포넌트)를 제공하면 hard Bridges 정확도 40%→50% 향상, 반면 ASCII 그리드를 툴로 재렌더링하면 오히려 성능 감소

Evidence

GPT-5-mini-high hard tier 평균 정확도 0.24, DeepSeek V3.2 0.10 - 인간이 쉽게 푸는 easy tier 퍼즐도 비프론티어 모델은 0.03~0.07 수준
Constraint Forgetting 개입 실험: Bridges -10.6pp, Undead -11.3pp 정확도 하락 (95% Wilson CI 기준 유의미) - 관찰 빈도는 고작 4%에 불과
IntFormat 입력 포맷: Gemini 3 Flash Bridges +37.3pp, DeepSeek V3.2 Galaxies +28.0pp 향상 (ASCII 대비)
툴 ablation: make_move+render_board만 사용 시 정확도 26%(기준선 40%보다 하락), state_summary 추가 시 42%로 회복, 전체 구조화 툴 사용 시 50% 달성

How to Apply

LLM에게 그리드/표 형태 데이터를 넘길 때 ASCII 대신 콤마로 구분된 정수 인코딩(IntFormat)이나 JSON 2D 배열 형태로 변환하면 파싱 오류를 줄일 수 있다. 특히 BPE 토크나이저가 셀 경계를 무너뜨리는 문제를 방지.
에이전트가 상태 추적이 필요한 작업(보드 게임, 레이아웃 계산 등)을 할 때 현재 상태를 시각적 렌더링이 아닌 구조화된 JSON(남은 degree, 연결 컴포넌트 등 사전 계산값)으로 제공하는 툴을 설계하라. ASCII 그리드 재출력은 오히려 성능을 떨어뜨린다.
Chain-of-thought 에러 분석 시 '자주 나타나는 에러 = 중요한 에러'가 아님을 주의. 실제 인과 관계는 controlled intervention(에러를 의도적으로 주입해 정확도 변화 측정)으로만 확인 가능하다. 프롬프트 레벨 전략 지시(플래닝, 백트래킹 지시)는 reasoning 모델에서 효과 없음.

Code Example

snippet

# IntFormat으로 Bridges 퍼즐 변환 예시
# ASCII 원본:
# 3..1.
# ..4.3
# .....
# .....
# 4.4.1

# 인코딩 규칙: 4=water(.), 5=island(1), 6=island(2), 7=island(3), 8=island(4)
def ascii_bridges_to_intformat(ascii_grid):
    mapping = {'.': '4', '1': '5', '2': '6', '3': '7', '4': '8'}
    rows = ascii_grid.strip().split('\n')
    result = []
    for row in rows:
        encoded = ','.join(mapping.get(c, c) for c in row)
        result.append(encoded)
    return '\n'.join(result)

ascii_input = """3..1.
..4.3
.....
.....
4.4.1"""

print(ascii_bridges_to_intformat(ascii_input))
# 출력:
# 7,4,4,5,4
# 4,4,8,4,7
# 4,4,4,4,4
# 4,4,4,4,4
# 8,4,8,4,5

# JSON 2D 배열 버전 (IntFormat JSON)
import json
def ascii_bridges_to_intformat_json(ascii_grid):
    mapping = {'.': '4', '1': '5', '2': '6', '3': '7', '4': '8'}
    rows = ascii_grid.strip().split('\n')
    grid = [[mapping.get(c, c) for c in row] for row in rows]
    return json.dumps(grid)

print(ascii_bridges_to_intformat_json(ascii_input))

Terminology

Chain-of-Thought (CoT)모델이 답을 바로 내놓지 않고 중간 추론 과정을 단계별로 출력하는 방식. 수학 문제 풀 때 풀이 과정 보여주는 것과 같음.

BPE (Byte Pair Encoding)LLM이 텍스트를 토큰으로 쪼개는 방식. 자주 나오는 문자 조합을 하나의 토큰으로 묶어서 처리하는데, 그리드 데이터에서는 셀 경계를 가로질러 묶어버려 공간 구조가 깨지는 문제 발생.

Premature Commitment퍼즐 풀이 초반에 틀린 선택을 하고 그 경로를 계속 고수하는 현상. 체스에서 첫 수를 잘못 두고 계속 그 방향으로만 생각하는 것과 비슷.

Constraint Forgetting퍼즐 규칙을 어기는 행동을 명시적으로 실행하는 실수. 예: 브리지 퍼즐에서 이미 최대 연결된 섬에 다리를 하나 더 추가하는 것.

LLM-as-judgeLLM이 다른 LLM의 출력을 평가·분류하는 방식. 사람 대신 GPT 같은 모델이 채점관 역할을 하는 것.

pass@1모델이 한 번만 시도했을 때 정답을 맞추는 비율. 여러 번 시도해서 하나라도 맞추는 pass@k와 달리 단발성 성능을 측정.

Wilson score CI통계에서 비율(정확도)의 신뢰 구간을 계산하는 방법. 샘플이 적을 때도 정확한 구간을 추정할 수 있어 벤치마크에서 자주 사용.

MoE (Mixture of Experts)모델 전체를 항상 다 쓰지 않고, 입력에 따라 일부 '전문가' 서브네트워크만 활성화하는 아키텍처. 총 파라미터는 크지만 실제 연산은 훨씬 적게 함.

Related Resources

https://github.com/mayug/topobench-benchmark

Original Abstract (Expand)

Solving topological grid puzzles requires reasoning over global spatial invariants such as connectivity, loop closure, and region symmetry and remains challenging for even the most powerful large language models (LLMs). To study these abilities under controlled settings, we introduce TopoBench, a benchmark of six puzzle families across three difficulty levels. We evaluate strong reasoning LLMs on TopoBench and find that even frontier models solve fewer than one quarter of hard instances, with two families nearly unsolved. To investigate whether these failures stem from reasoning limitations or from difficulty extracting and maintaining spatial constraints, we annotate 750 chain of thought traces with an error taxonomy that surfaces four candidate causal failure modes, then test them with targeted interventions simulating each error type. These interventions show that certain error patterns like premature commitment and constraint forgetting have a direct impact on the ability to solve the puzzle, while repeated reasoning is a benign effect of search. Finally we study mitigation strategies including prompt guidance, cell-aligned grid representations and tool-based constraint checking, finding that the bottleneck lies in extracting constraints from spatial representations and not in reasoning over them. Code and data are available at github.com/mayug/topobench-benchmark.