Structured Reasoning for Large Language Models
TL;DR Highlight
A training framework that explicitly separates LLM reasoning into three stages — generation → verification → revision — reducing tokens by 50% while improving accuracy
Who Should Read
ML engineers looking to reduce unnecessary self-reflection loops and token waste in reasoning models (o1, DeepSeek-R1 family). Fine-tuning practitioners who want to improve CoT quality in math/code problem-solving pipelines.
Core Mechanics
- Over 50% of reasoning tokens in existing LRMs (Large Reasoning Models) are spent on Verification & Revision, yet the actual incorrect→correct fix rate is only 1–7% — the rest is just redundant confirmation of already correct answers
- SCR explicitly separates reasoning into three tags — <answer> (generation) → <critic> (verification) → <revised> (revision) — structuring each stage to be independently trainable
- Dynamic Termination Supervision (DTS): if the critic outputs 'T' (correct answer confirmed), an [EOS] token is inserted immediately for early termination; if 'F', revision is forced — eliminating unnecessary repetition loops
- Two-stage RL strategy: Stage 1 trains only initial generation + self-verification (masking revision), Stage 2 focuses on optimizing revision — preventing interference between learning signals
- On Qwen2.5-7B, output tokens on MATH500 drop from 4069 → 1332 (approx. 67% reduction), and accuracy on AIME25 improves from SFT+GRPO 11.67% → SCR 13.67%
- Self-verification accuracy improves dramatically over the base model: Qwen2.5-7B F1 52.09 → 75.36, and Qwen2.5-3B surpasses Llama 8B Base despite being a smaller model
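The Dynamic Termination Supervision rule above can be sketched as a small stopping check, a minimal sketch assuming the tag format described in this summary (the helper names `critic_verdict` and `should_terminate` are illustrative, not from the paper):

```python
import re

def critic_verdict(partial_output: str):
    """Return 'T'/'F' if a closed <critic> block is present, else None."""
    m = re.search(r'<critic>(.*?)</critic>', partial_output, re.DOTALL)
    if not m:
        return None
    body = m.group(1).strip()
    # The critic is required to end with a single T/F judgment symbol
    return body[-1] if body and body[-1] in ('T', 'F') else None

def should_terminate(partial_output: str) -> bool:
    """DTS rule: stop decoding as soon as the critic confirms correctness ('T')."""
    return critic_verdict(partial_output) == 'T'
```

In training data this same condition decides where the [EOS] token is inserted; at inference a decoding loop would call `should_terminate` after each closed `</critic>` tag.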
Evidence
- Average output tokens on MATH500: SFT+GRPO 4069 → SCR 1332 (approx. 67% reduction); on Olympiad: 3103 → 1118 (64% reduction)
- Qwen2.5-7B on AIME25: Base 6.00% → SCR 13.67%; Qwen2.5-3B average accuracy: Base 34.05% → SCR 39.75% (+5.70%)
- Llama3.1-8B overall average: Base 24.71% → SCR 34.25% (+9.54%), the largest improvement margin
- Self-verification F1: Llama3.1-8B Base 30.75 → SCR-Stage I 55.81; Qwen2.5-7B Base 52.09 → SCR-Stage I 75.36
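For reference, a self-verification F1 of the kind reported above can be computed as standard binary F1 over the model's T/F judgments against gold correctness labels. Which symbol the paper treats as the positive class is not stated in this summary, so the `positive='F'` default below (detecting incorrect answers) is an assumption:

```python
def binary_f1(preds, golds, positive='F'):
    """Binary F1 over T/F verdicts; `positive` marks the class scored as positive."""
    tp = sum(p == positive and g == positive for p, g in zip(preds, golds))
    fp = sum(p == positive and g != positive for p, g in zip(preds, golds))
    fn = sum(p != positive and g == positive for p, g in zip(preds, golds))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```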
How to Apply
- When constructing fine-tuning data, synthesize structured trajectories in the <answer>...</answer><critic>...</critic><revised>...</revised> tag format — generate SFT data by inserting EOS when the critic ends with 'T', or adding a revised block when it ends with 'F'
- Split RL training into two stages: in Stage 1, apply gradient masking to the <revised> block and optimize only initial generation + verification; in Stage 2, apply revision rewards across the full trajectory (successful correction: +, unnecessary revision: −, corrupting a correct answer: strongly −)
- To test immediately via prompting, use the 'Prompt Used in SCR' format — construct a system prompt that generates an answer, makes a T/F judgment in <critic>, revises in <revised> if F, and stops immediately if T
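The data-synthesis recipe in the first bullet can be sketched as a single assembly function. This is a hedged illustration, not the paper's code: the helper name and the EOS string are assumptions (the EOS token is model-specific), and whether an [EOS] also follows the `<revised>` block is an inference from the format:

```python
def build_scr_trajectory(answer, critic, verdict, revised=None, eos='<|endoftext|>'):
    """Assemble one structured SFT example in the SCR tag format.
    verdict 'T': terminate right after the critic (DTS) by appending EOS.
    verdict 'F': append a <revised> block with the corrected solution."""
    traj = f"<answer>{answer}</answer><critic>{critic} {verdict}</critic>"
    if verdict == 'T':
        return traj + eos
    if revised is None:
        raise ValueError("verdict 'F' requires a revised solution")
    return traj + f"<revised>{revised}</revised>" + eos
```

For Stage 1 of the RL schedule, the tokens belonging to the `<revised>...</revised>` span of such trajectories would be gradient-masked so only generation and verification receive learning signal.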
Code Example
# SCR inference prompt (ready to use)
system_prompt = """
You are a helpful AI assistant.
For each question, you must first solve the problem and put your complete solution inside <answer>...</answer>. The final result must be wrapped in \\boxed{}.
Then you must evaluate your own solution inside <critic>...</critic>.
At the end of the <critic>, you must give the final judgment using only one single symbol: T or F.
T means the answer is correct. F means the answer is incorrect.
If the final judgment is F, you must give a corrected solution inside <revised>...</revised>, and the final result must also be wrapped in \\boxed{}.
If the final judgment is T, you must stop and give no further output.
"""
# Parsing example
import re

def parse_scr_output(text):
    answer = re.search(r'<answer>(.*?)</answer>', text, re.DOTALL)
    critic = re.search(r'<critic>(.*?)</critic>', text, re.DOTALL)
    revised = re.search(r'<revised>(.*?)</revised>', text, re.DOTALL)
    # The critic block must end with a single T/F judgment symbol
    verdict = 'T' if critic and critic.group(1).strip().endswith('T') else 'F'
    if revised:
        final = revised.group(1)
    else:
        final = answer.group(1) if answer else None  # guard against a missing <answer> block
    return {
        'initial': answer.group(1) if answer else None,
        'verdict': verdict,
        'final': final,
        'was_revised': revised is not None,
    }
Terminology
Related Resources
Original Abstract
Large language models (LLMs) achieve strong performance by generating long chains of thought, but longer traces always introduce redundant or ineffective reasoning steps. One typical behavior is that they often perform unnecessary verification and revisions even if they have reached the correct answers. This limitation stems from the unstructured nature of reasoning trajectories and the lack of targeted supervision for critical reasoning abilities. To address this, we propose Structured Reasoning (SCR), a framework that decouples reasoning trajectories into explicit, evaluable, and trainable components. We mainly implement SCR using a Generate-Verify-Revise paradigm. Specifically, we construct structured training data and apply Dynamic Termination Supervision to guide the model in deciding when to terminate reasoning. To avoid interference between learning signals for different reasoning abilities, we adopt a progressive two-stage reinforcement learning strategy: the first stage targets initial generation and self-verification, and the second stage focuses on revision. Extensive experiments on three backbone models show that SCR substantially improves reasoning efficiency and self-verification. Besides, compared with existing reasoning paradigms, it reduces output token length by up to 50%.