Structured Reasoning for Large Language Models
TL;DR Highlight
A training framework that explicitly separates LLM reasoning into three stages — generation → verification → revision — reducing tokens by 50% while improving accuracy
Who Should Read
ML engineers looking to reduce unnecessary self-reflection loops and token waste in reasoning models (o1, DeepSeek-R1 family). Fine-tuning practitioners who want to improve CoT quality in math/code problem-solving pipelines.
Core Mechanics
- Over 50% of reasoning tokens in existing LRMs (Large Reasoning Models) are spent on Verification & Revision, yet the actual incorrect→correct fix rate is only 1–7% — the rest is just redundant confirmation of already correct answers
- SCR explicitly separates reasoning into three tags — <answer> (generation) → <critic> (verification) → <revised> (revision) — structuring each stage to be independently trainable
- Dynamic Termination Supervision (DTS): if the critic outputs 'T' (correct answer confirmed), an [EOS] token is inserted immediately for early termination; if 'F', revision is forced — eliminating unnecessary repetition loops
- Two-stage RL strategy: Stage 1 trains only initial generation + self-verification (masking revision), Stage 2 focuses on optimizing revision — preventing interference between learning signals
- On Qwen2.5-7B, output tokens on MATH500 drop from 4069 → 1332 (approx. 67% reduction), and accuracy on AIME25 improves from SFT+GRPO 11.67% → SCR 13.67%
- Self-verification accuracy improves dramatically over the base model: Qwen2.5-7B F1 52.09 → 75.36, and Qwen2.5-3B surpasses Llama 8B Base despite being a smaller model
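The Dynamic Termination Supervision rule above can be sketched as a small stopping check, a minimal sketch assuming the tag format described in this summary (the helper names `critic_verdict` and `should_terminate` are illustrative, not from the paper):

```python
import re

def critic_verdict(partial_output: str):
    """Return 'T'/'F' if a closed <critic> block is present, else None."""
    m = re.search(r'<critic>(.*?)</critic>', partial_output, re.DOTALL)
    if not m:
        return None
    body = m.group(1).strip()
    # The critic is required to end with a single T/F judgment symbol
    return body[-1] if body and body[-1] in ('T', 'F') else None

def should_terminate(partial_output: str) -> bool:
    """DTS rule: stop decoding as soon as the critic confirms correctness ('T')."""
    return critic_verdict(partial_output) == 'T'
```

In training data this same condition decides where the [EOS] token is inserted; at inference a decoding loop would call `should_terminate` after each closed `</critic>` tag.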
Evidence
- Average output tokens on MATH500: SFT+GRPO 4069 → SCR 1332 (approx. 67% reduction); on Olympiad: 3103 → 1118 (64% reduction)
- Qwen2.5-7B on AIME25: Base 6.00% → SCR 13.67%; Qwen2.5-3B average accuracy: Base 34.05% → SCR 39.75% (+5.70%)
- Llama3.1-8B overall average: Base 24.71% → SCR 34.25% (+9.54%), the largest improvement margin
- Self-verification F1: Llama3.1-8B Base 30.75 → SCR-Stage I 55.81; Qwen2.5-7B Base 52.09 → SCR-Stage I 75.36
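For reference, a self-verification F1 of the kind reported above can be computed as standard binary F1 over the model's T/F judgments against gold correctness labels. Which symbol the paper treats as the positive class is not stated in this summary, so the `positive='F'` default below (detecting incorrect answers) is an assumption:

```python
def binary_f1(preds, golds, positive='F'):
    """Binary F1 over T/F verdicts; `positive` marks the class scored as positive."""
    tp = sum(p == positive and g == positive for p, g in zip(preds, golds))
    fp = sum(p == positive and g != positive for p, g in zip(preds, golds))
    fn = sum(p != positive and g == positive for p, g in zip(preds, golds))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```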
How to Apply
- When constructing fine-tuning data, synthesize structured trajectories in the <answer>...</answer><critic>...</critic><revised>...</revised> tag format — generate SFT data by inserting EOS when the critic ends with 'T', or adding a revised block when it ends with 'F'
- Split RL training into two stages: in Stage 1, apply gradient masking to the <revised> block and optimize only initial generation + verification; in Stage 2, apply revision rewards across the full trajectory (successful correction: +, unnecessary revision: −, corrupting a correct answer: strongly −)
- To test immediately via prompting, use the 'Prompt Used in SCR' format — construct a system prompt that generates an answer, makes a T/F judgment in <critic>, revises in <revised> if F, and stops immediately if T
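The data-synthesis recipe in the first bullet can be sketched as a single assembly function. This is a hedged illustration, not the paper's code: the helper name and the EOS string are assumptions (the EOS token is model-specific), and whether an [EOS] also follows the `<revised>` block is an inference from the format:

```python
def build_scr_trajectory(answer, critic, verdict, revised=None, eos='<|endoftext|>'):
    """Assemble one structured SFT example in the SCR tag format.
    verdict 'T': terminate right after the critic (DTS) by appending EOS.
    verdict 'F': append a <revised> block with the corrected solution."""
    traj = f"<answer>{answer}</answer><critic>{critic} {verdict}</critic>"
    if verdict == 'T':
        return traj + eos
    if revised is None:
        raise ValueError("verdict 'F' requires a revised solution")
    return traj + f"<revised>{revised}</revised>" + eos
```

For Stage 1 of the RL schedule, the tokens belonging to the `<revised>...</revised>` span of such trajectories would be gradient-masked so only generation and verification receive learning signal.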
Code Example
# SCR inference prompt (ready to use)
system_prompt = """
You are a helpful AI assistant.
For each question, you must first solve the problem and put your complete solution inside <answer>...</answer>. The final result must be wrapped in \\boxed{}.
Then you must evaluate your own solution inside <critic>...</critic>.
At the end of the <critic>, you must give the final judgment using only one single symbol: T or F.
T means the answer is correct. F means the answer is incorrect.
If the final judgment is F, you must give a corrected solution inside <revised>...</revised>, and the final result must also be wrapped in \\boxed{}.
If the final judgment is T, you must stop and give no further output.
"""
# Parsing example
import re

def parse_scr_output(text):
    answer = re.search(r'<answer>(.*?)</answer>', text, re.DOTALL)
    critic = re.search(r'<critic>(.*?)</critic>', text, re.DOTALL)
    revised = re.search(r'<revised>(.*?)</revised>', text, re.DOTALL)
    # The critic block must end with a single T/F judgment symbol
    verdict = 'T' if critic and critic.group(1).strip().endswith('T') else 'F'
    if revised:
        final = revised.group(1)
    else:
        final = answer.group(1) if answer else None  # guard against a missing <answer> block
    return {
        'initial': answer.group(1) if answer else None,
        'verdict': verdict,
        'final': final,
        'was_revised': revised is not None,
    }
Terminology
Related Resources
Original Abstract
Large language models (LLMs) achieve strong performance by generating long chains of thought, but longer traces always introduce redundant or ineffective reasoning steps. One typical behavior is that they often perform unnecessary verification and revisions even if they have reached the correct answers. This limitation stems from the unstructured nature of reasoning trajectories and the lack of targeted supervision for critical reasoning abilities. To address this, we propose Structured Reasoning (SCR), a framework that decouples reasoning trajectories into explicit, evaluable, and trainable components. We mainly implement SCR using a Generate-Verify-Revise paradigm. Specifically, we construct structured training data and apply Dynamic Termination Supervision to guide the model in deciding when to terminate reasoning. To avoid interference between learning signals for different reasoning abilities, we adopt a progressive two-stage reinforcement learning strategy: the first stage targets initial generation and self-verification, and the second stage focuses on revision. Extensive experiments on three backbone models show that SCR substantially improves reasoning efficiency and self-verification. Besides, compared with existing reasoning paradigms, it reduces output token length by up to 50%.