A Survey of Post-Training Scaling in Large Language Models
TL;DR Highlight
A concise overview of three post-training scaling methods that have emerged as alternatives now that pre-training data scaling is approaching its practical limits.
Who Should Read
Researchers and engineers following LLM scaling trends who want to understand the landscape of post-training techniques as pre-training data becomes scarce.
Core Mechanics
- Pre-training data scaling is hitting practical limits — high-quality internet text is largely exhausted for LLM training
- Three post-training scaling approaches are emerging as alternatives: (1) Inference-time scaling (more compute at test time), (2) RL-based reasoning (training models to reason better), (3) Synthetic data generation (models teaching themselves)
- Inference-time scaling (chain-of-thought, self-consistency, tree search) can substantially raise effective model capability without any additional training
- RL-based reasoning training (RLHF, RLAIF, process reward models) improves reasoning ability log-linearly with the training compute invested
- Synthetic data generation (models generating their own training data) enables continued scaling beyond human-labeled data limits
- The three approaches are complementary — combining them produces superadditive benefits
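The self-consistency method named above can be sketched in a few lines: sample several reasoning chains, extract each chain's final answer, and take a majority vote. A minimal sketch is below; the list of sampled answers is a hypothetical stand-in for what temperature-sampled LLM calls would produce.

```python
from collections import Counter

def self_consistency(answers: list[str]) -> str:
    """Majority vote over final answers filters out occasional bad reasoning chains."""
    return Counter(answers).most_common(1)[0][0]

# Hypothetical final answers parsed from five sampled reasoning chains
chains = ["42", "41", "42", "42", "45"]
print(self_consistency(chains))  # → 42
```

The vote only needs the final answers, not the chains themselves, which is why self-consistency works with any model that can be sampled more than once.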
Evidence
- Inference-time scaling: spending 10x more compute at inference matches training a 3x larger model on reasoning tasks
- RL reasoning training: consistent log-linear improvement in reasoning ability with training compute invested
- Synthetic data: models trained on self-generated + curated data outperform those trained on human-labeled data alone for reasoning tasks
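The synthetic-data result above is commonly achieved with a rejection-sampling loop: a model proposes solutions, a verifier keeps only the ones it can confirm, and the survivors become new fine-tuning data. The sketch below illustrates the loop; `propose` and `check` are hypothetical stand-ins for a model sampler and a verifier (e.g., unit tests or a known-answer check), not APIs from the survey.

```python
def propose(problem: dict) -> str:
    # Stand-in for a model sample; a real system draws many candidates per problem.
    return problem["candidate"]

def check(problem: dict, solution: str) -> bool:
    # Stand-in verifier, e.g. executing unit tests or matching a reference answer.
    return solution == problem["answer"]

def build_synthetic_set(problems: list[dict], k: int = 4) -> list[tuple[str, str]]:
    """Keep one verified (question, solution) pair per problem; discard failures."""
    data = []
    for p in problems:
        for _ in range(k):
            sol = propose(p)
            if check(p, sol):
                data.append((p["question"], sol))
                break
    return data
```

Only verified pairs enter the training set, which is what lets self-generated data scale past human-labeled limits without compounding errors.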
How to Apply
- For immediate capability improvements without training: invest in inference-time scaling — chain-of-thought, self-consistency, and best-of-N sampling are available today for any model.
- For sustained capability improvements: combine RL training (for reasoning) with synthetic data generation (for continued post-training scaling) — this is the trajectory of frontier model development.
- Prioritize based on your constraints: inference-time scaling requires no training but costs more per query; RL training requires significant upfront compute but reduces per-query cost afterward.
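The cost trade-off in the last point can be made concrete with a break-even calculation. All numbers below are illustrative assumptions, not figures from the survey.

```python
def break_even_queries(train_cost: float,
                       ttc_cost_per_query: float,
                       rl_cost_per_query: float) -> float:
    """Query count after which upfront RL training becomes cheaper than pure
    inference-time scaling. Assumes constant per-query costs."""
    return train_cost / (ttc_cost_per_query - rl_cost_per_query)

# Hypothetical: $50k RL training run, $0.08/query with best-of-8 sampling
# vs $0.01/query for the RL-trained model answering directly.
print(round(break_even_queries(50_000, 0.08, 0.01)))  # → 714286
```

Below the break-even volume, inference-time scaling is the cheaper path; above it, the upfront training pays for itself.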
Code Example
# TTC Style: Improving Inference Quality with Best-of-N Sampling
import anthropic

client = anthropic.Anthropic()

def best_of_n_inference(prompt: str, n: int = 8) -> str:
    """Generate N responses and select the best one via a judge model (simple TTC implementation)."""
    responses = []
    for _ in range(n):
        msg = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        responses.append(msg.content[0].text)
    # Select the best answer via a separate judge model
    numbered = "\n---\n".join(f"{i+1}. {r}" for i, r in enumerate(responses))
    judge_prompt = (
        f"From the following {n} responses, select the most accurate and "
        f"logical one and output only its content.\nResponses:\n{numbered}"
    )
    judge = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": judge_prompt}],
    )
    return judge.content[0].text

# Usage example
result = best_of_n_inference("Write and explain a Python code to find the 10th term of the Fibonacci sequence.", n=4)
print(result)
Original Abstract
Large language models (LLMs) have achieved remarkable proficiency in understanding and generating human natural languages, mainly owing to the "scaling law" that optimizes relationships among language modeling loss, model parameters, and pre-trained tokens. However, with the exhaustion of high-quality internet corpora and increasing computational demands, the sustainability of pre-training scaling needs to be addressed. This paper presents a comprehensive survey of post-training scaling, an emergent paradigm aiming to relieve the limitations of traditional pre-training by focusing on the alignment phase, which traditionally accounts for a minor fraction of the total training computation. Our survey categorizes post-training scaling into three key methodologies: Supervised Fine-tuning (SFT), Reinforcement Learning from Feedback (RLxF), and Test-time Compute (TTC). We provide an in-depth analysis of the motivation behind post-training scaling, the scalable variants of these methodologies, and a comparative discussion against traditional approaches. By examining the latest advancements, identifying promising application scenarios, and highlighting unresolved issues, we seek a coherent understanding and map future research trajectories in the landscape of post-training scaling for LLMs.