Further human + AI + proof assistant work on Knuth's "Claude Cycles" problem
TL;DR Highlight
A post describing how the 'Claude Cycles' problem, posed by mathematician Donald Knuth, was tackled through collaboration among human experts, AI (LLMs), and formal proof assistants like Lean, demonstrating AI's real potential to contribute meaningfully to mathematical research.
Who Should Read
Developers and researchers curious about how far AI can be used in mathematical reasoning or formal verification. Especially those interested in proof assistants like Lean or Coq, or AI for mathematics.
Core Mechanics
- This post covers a collaborative approach using human mathematicians + LLMs (large language models) + formal proof assistants (e.g., Lean) to solve a mathematical problem called 'Claude Cycles' posed by legendary computer scientist Donald Knuth.
- While the original tweet itself could not be loaded, community comments and surrounding context indicate that it reports further progress beyond previous work, showing that this kind of human-AI collaboration is producing real results in pure mathematics research.
- LLMs are characterized as strong at 'broad but shallow search' — meaning that when an expert sets the direction, LLMs excel at rapidly exploring a wide possibility space and proposing candidate ideas.
- Formal proof assistants like Lean and Coq are software tools that allow mathematical proofs to be written in a machine-verifiable form. Using these tools to verify proof ideas suggested by AI can reliably filter out errors.
- Some in the community predicted that applying AlphaGo-style reinforcement learning (RL) to Lean's syntax tree would eventually prove more powerful than LLMs, since RL over the syntax tree enables reasoning across much longer time scales.
- There was an observation that a professional mathematician's toolkit consists of roughly 10 core tricks, and if these tricks could be encoded as latent vectors (abstract representations inside AI models), AI could greatly accelerate mathematical research.
- A sober counterpoint also appeared: AI handles repetitive expert-level tasks well when guided by specialists, but still has blind spots on truly difficult and complex problems.
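To make the proof-assistant point concrete, here is a toy Lean 4 proof of a trivial statement (the sum of two even naturals is even). It is purely illustrative and unrelated to the actual 'Claude Cycles' work; it only shows the kind of machine-checkable artifact that lets Lean reliably filter out errors in AI-proposed proofs.

```lean
-- Toy example: if a and b are both even, so is a + b.
-- Lean refuses to accept the theorem unless every step checks.
theorem even_add_even (a b : Nat)
    (ha : ∃ k, a = 2 * k) (hb : ∃ k, b = 2 * k) :
    ∃ k, a + b = 2 * k := by
  obtain ⟨m, hm⟩ := ha   -- unpack the witness for a
  obtain ⟨n, hn⟩ := hb   -- unpack the witness for b
  exact ⟨m + n, by omega⟩ -- linear arithmetic closes the goal
```

If an LLM proposed a wrong witness or a broken step, the Lean compiler would reject the file, which is exactly the filtering role described above.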
Evidence
- A witty comment went viral: "AI will win a Fields Medal (the highest honor in mathematics) before it takes on the role of a McDonald's manager." The argument: while using AI for mathematics may seem like using a brain as a hammer to tighten a screw, LLMs' strength in broad-but-shallow search actually makes them a good fit for mathematical research.
- One prediction held that AlphaGo-style reinforcement learning applied to Lean's syntax tree would become the dominant approach instead of LLMs, as RL-based methods can search over much longer time scales and are better suited to complex proofs.
- A realistic comment noted that it is unsurprising AI performs well when guided by experts: AI handles experts' "lazy work" effectively, but still has blind spots on truly hard problems.
- One commenter said it was hard to tell whether the thread participants were bots or humans, a meta-observation on how deeply AI has become involved in mathematics community discussions.
- Other comments wondered whether anyone would tackle P≠NP this way, and asked what this means for ordinary people, reflecting that this type of research still largely remains within specialist communities.
How to Apply
- When you need mathematical proofs or algorithm-correctness verification, build a two-stage pipeline: generate draft proof ideas with an LLM, then verify them with a proof assistant like Lean or Coq to mechanically catch errors.
- Rather than trying to solve complex math problems with an LLM alone, design a role-sharing structure in which a domain expert (or an expert-level prompt) sets the direction and the LLM explores candidate paths; this yields far more reliable results.
- If you are interested in the AlphaGo-style RL + formal proof tool combination, use DeepMind's AlphaProof and related papers as references and experiment with reinforcement-learning agents in the Lean environment. This field is currently advancing rapidly.
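The two-stage pipeline above can be sketched in Python. This is a minimal sketch under stated assumptions: `lean_check` assumes a `lean` binary is on PATH, and in a real setup stage 1 would call an LLM API rather than use fixed strings; `filter_verified` is the only part exercised here, with a toy checker standing in for Lean.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, Iterable

def lean_check(source: str) -> bool:
    """Stage 2 (real version): write a candidate proof to a .lean file
    and ask the Lean compiler to check it. Assumes `lean` is on PATH."""
    with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
        f.write(source)
        path = f.name
    try:
        result = subprocess.run(["lean", path], capture_output=True, timeout=60)
        return result.returncode == 0
    finally:
        Path(path).unlink(missing_ok=True)

def filter_verified(candidates: Iterable[str],
                    checker: Callable[[str], bool]) -> list[str]:
    """Keep only candidates the proof assistant accepts, mechanically
    filtering out LLM hallucinations."""
    return [c for c in candidates if checker(c)]

# Stage 1 would come from an LLM; fixed strings here for illustration.
candidates = [
    "theorem t : 1 + 1 = 2 := rfl",  # valid
    "theorem t : 1 + 1 = 3 := rfl",  # invalid, should be filtered out
]
# Toy checker standing in for lean_check (which needs Lean installed):
toy_checker = lambda src: "= 2" in src
print(filter_verified(candidates, toy_checker))
```

The design point is the separation of concerns: the generator may be unreliable, but because the checker is trusted, the pipeline's output is only as wrong as the checker, never as wrong as the LLM.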
Terminology
Related Papers
Can LLMs model real-world systems in TLA+?
A benchmark study systematically showing that when LLMs write TLA+ specifications, they pass syntax checks well but achieve only about 46% conformance with actual system behavior, illustrating the practical limits of AI-based formal verification.
Natural Language Autoencoders: Turning Claude's Thoughts into Text
Anthropic released NLA, a technique that converts the numeric vectors (activations) inside an LLM into directly readable natural language, a new advance in interpretability research into what AI is actually "thinking."
ProgramBench: Can language models rebuild programs from scratch?
A new benchmark measuring whether LLMs can reimplement real software such as FFmpeg, SQLite, and a PHP interpreter from scratch using only documentation; even the best model cleared the 95%-pass threshold on only 3% of tasks.
MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents
When a task is split into three tickets, even Claude/GPT will simply write security-vulnerable code 53–86% of the time.
Refusal in Language Models Is Mediated by a Single Direction
Open-source chat models encode safety as a single vector direction, and removing it disables safety fine-tuning.
Show HN: A new benchmark for testing LLMs for deterministic outputs
Structured Output Benchmark assesses LLM JSON handling across seven metrics, revealing performance beyond schema compliance.