A Survey of Post-Training Scaling in Large Language Models
TL;DR Highlight
A concise overview of three post-training scaling methods that have emerged as alternatives now that pre-training data scaling is approaching its practical limits.
Who Should Read
Researchers and engineers following LLM scaling trends who want to understand the landscape of post-training techniques as pre-training data becomes scarce.
Core Mechanics
- Pre-training data scaling is hitting practical limits — high-quality internet text is largely exhausted for LLM training
- Three post-training scaling approaches are emerging as alternatives: (1) Inference-time scaling (more compute at test time), (2) RL-based reasoning (training models to reason better), (3) Synthetic data generation (models teaching themselves)
- Inference-time scaling (chain-of-thought, self-consistency, tree search) can substantially raise effective model capability without any additional training
- RL-based reasoning training (RLHF, RLAIF, process reward models) improves reasoning ability log-linearly with the training compute invested
- Synthetic data generation (models generating their own training data) enables continued scaling beyond human-labeled data limits
- The three approaches are complementary — combining them produces superadditive benefits
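The self-consistency method named above can be sketched in a few lines: sample several reasoning chains, extract each chain's final answer, and take a majority vote. A minimal sketch is below; the list of sampled answers is a hypothetical stand-in for what temperature-sampled LLM calls would produce.

```python
from collections import Counter

def self_consistency(answers: list[str]) -> str:
    """Majority vote over final answers filters out occasional bad reasoning chains."""
    return Counter(answers).most_common(1)[0][0]

# Hypothetical final answers parsed from five sampled reasoning chains
chains = ["42", "41", "42", "42", "45"]
print(self_consistency(chains))  # → 42
```

The vote only needs the final answers, not the chains themselves, which is why self-consistency works with any model that can be sampled more than once.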
Evidence
- Inference-time scaling: spending 10x more compute at inference matches training a 3x larger model on reasoning tasks
- RL reasoning training: consistent log-linear improvement in reasoning ability with training compute invested
- Synthetic data: models trained on self-generated + curated data outperform those trained on human-labeled data alone for reasoning tasks
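The synthetic-data result above is commonly achieved with a rejection-sampling loop: a model proposes solutions, a verifier keeps only the ones it can confirm, and the survivors become new fine-tuning data. The sketch below illustrates the loop; `propose` and `check` are hypothetical stand-ins for a model sampler and a verifier (e.g., unit tests or a known-answer check), not APIs from the survey.

```python
def propose(problem: dict) -> str:
    # Stand-in for a model sample; a real system draws many candidates per problem.
    return problem["candidate"]

def check(problem: dict, solution: str) -> bool:
    # Stand-in verifier, e.g. executing unit tests or matching a reference answer.
    return solution == problem["answer"]

def build_synthetic_set(problems: list[dict], k: int = 4) -> list[tuple[str, str]]:
    """Keep one verified (question, solution) pair per problem; discard failures."""
    data = []
    for p in problems:
        for _ in range(k):
            sol = propose(p)
            if check(p, sol):
                data.append((p["question"], sol))
                break
    return data
```

Only verified pairs enter the training set, which is what lets self-generated data scale past human-labeled limits without compounding errors.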
How to Apply
- For immediate capability improvements without training: invest in inference-time scaling — chain-of-thought, self-consistency, and best-of-N sampling are available today for any model.
- For sustained capability improvements: combine RL training (for reasoning) with synthetic data generation (for continued post-training scaling) — this is the trajectory of frontier model development.
- Prioritize based on your constraints: inference-time scaling requires no training but costs more per query; RL training requires significant upfront compute but reduces per-query cost afterward.
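The cost trade-off in the last point can be made concrete with a break-even calculation. All numbers below are illustrative assumptions, not figures from the survey.

```python
def break_even_queries(train_cost: float,
                       ttc_cost_per_query: float,
                       rl_cost_per_query: float) -> float:
    """Query count after which upfront RL training becomes cheaper than pure
    inference-time scaling. Assumes constant per-query costs."""
    return train_cost / (ttc_cost_per_query - rl_cost_per_query)

# Hypothetical: $50k RL training run, $0.08/query with best-of-8 sampling
# vs $0.01/query for the RL-trained model answering directly.
print(round(break_even_queries(50_000, 0.08, 0.01)))  # → 714286
```

Below the break-even volume, inference-time scaling is the cheaper path; above it, the upfront training pays for itself.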
Code Example
# TTC Style: Improving Inference Quality with Best-of-N Sampling
import anthropic

client = anthropic.Anthropic()

def best_of_n_inference(prompt: str, n: int = 8) -> str:
    """Generate N responses and select the best one via a judge model (simple TTC implementation)."""
    responses = []
    for _ in range(n):
        msg = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        responses.append(msg.content[0].text)
    # Select the best answer via a separate judge model
    numbered = "\n---\n".join(f"{i+1}. {r}" for i, r in enumerate(responses))
    judge_prompt = (
        f"From the following {n} responses, select the most accurate and "
        f"logical one and output only its content.\nResponses:\n{numbered}"
    )
    judge = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": judge_prompt}],
    )
    return judge.content[0].text

# Usage example
result = best_of_n_inference("Write and explain a Python code to find the 10th term of the Fibonacci sequence.", n=4)
print(result)
Original Abstract
Large language models (LLMs) have achieved remarkable proficiency in understanding and generating human natural languages, mainly owing to the "scaling law" that optimizes relationships among language modeling loss, model parameters, and pre-trained tokens. However, with the exhaustion of high-quality internet corpora and increasing computational demands, the sustainability of pre-training scaling needs to be addressed. This paper presents a comprehensive survey of post-training scaling, an emergent paradigm aiming to relieve the limitations of traditional pre-training by focusing on the alignment phase, which traditionally accounts for a minor fraction of the total training computation. Our survey categorizes post-training scaling into three key methodologies: Supervised Fine-tuning (SFT), Reinforcement Learning from Feedback (RLxF), and Test-time Compute (TTC). We provide an in-depth analysis of the motivation behind post-training scaling, the scalable variants of these methodologies, and a comparative discussion against traditional approaches. By examining the latest advancements, identifying promising application scenarios, and highlighting unresolved issues, we seek a coherent understanding and map future research trajectories in the landscape of post-training scaling for LLMs.