Long-form RewardBench: Evaluating Reward Models for Long-form Generation
TL;DR Highlight
The first benchmark specifically for evaluating reward models on long-form generation, addressing the gap left by existing Reward Model benchmarks that cover only short texts.
Who Should Read
Researchers evaluating Reward Models and RLHF pipelines, and teams that need to assess model quality on long-form generation tasks.
Core Mechanics
- Identified a critical gap: existing Reward Model (RM) benchmarks focus on short texts, missing the challenges of evaluating long-form generation
- Created the first dedicated benchmark for evaluating RMs on long-form generation, with instruction and preference data collected through a multi-stage process
- Long-form evaluation requires different quality criteria: coherence, consistency, completeness, and narrative flow sustained across many tokens
- RMs trained on short-text preferences fail to generalize to long-form quality assessment
- The benchmark covers five long-form subtasks: QA, RAG, Chat, Writing, and Reasoning
- Provides fine-grained quality annotations beyond simple preference labels
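To make the data shape concrete, here is a minimal sketch of what a long-form preference record with fine-grained annotations might look like. The field names and annotation keys are assumptions for illustration, not the benchmark's actual schema.

```python
# Hypothetical record layout; field names are assumptions,
# not the benchmark's actual schema.
from dataclasses import dataclass, field

@dataclass
class LongFormPreferencePair:
    task: str            # one of the five subtasks: QA, RAG, Chat, Writing, Reasoning
    instruction: str
    chosen: str          # preferred long-form response
    rejected: str        # dispreferred long-form response
    # Fine-grained quality scores beyond the binary preference label
    annotations: dict = field(default_factory=dict)

pair = LongFormPreferencePair(
    task="Writing",
    instruction="Write a 3,000-word essay on urban planning.",
    chosen="(full 3,000-word essay text)",
    rejected="(essay that stops after the introduction)",
    annotations={"coherence": 4, "consistency": 5, "completeness": 3},
)
```

A structure like this lets an evaluator check not only whether an RM ranks the pair correctly but also which quality dimension it got wrong.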
Evidence
- Across 20+ mainstream reward models, both classifiers and generative models show significantly degraded performance on long-form evaluation compared with short-form
- A novel Long-form Needle-in-a-Haystack test reveals that reward modeling performance correlates with both the position of an error within a response and overall response length, with distinct patterns for classifier vs. generative models
- Existing benchmarks contain almost no long responses (>2048 tokens)
- Classifiers generalize better than generative models trained on the same data; models trained only on short-text preferences underperform on long-form tasks despite strong instruction following
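The Needle-in-a-Haystack idea can be sketched as a simple probe: inject the same error sentence at different positions in an otherwise clean response and check whether the RM's score drops at each position. This is a simplified illustration of the concept, not the paper's protocol; `score_fn` is a hypothetical stand-in for any reward model mapping (instruction, response) to a scalar.

```python
# Sketch of a Needle-in-a-Haystack-style probe (simplified illustration;
# `score_fn` is a hypothetical (instruction, response) -> float reward model).
def needle_probe(score_fn, instruction, clean_response, error_sentence):
    """Return (position, detected) pairs: did the score drop when the
    error sentence was injected at that sentence position?"""
    sentences = clean_response.split(". ")
    baseline = score_fn(instruction, clean_response)
    results = []
    for pos in range(len(sentences) + 1):
        corrupted = ". ".join(sentences[:pos] + [error_sentence] + sentences[pos:])
        results.append((pos, score_fn(instruction, corrupted) < baseline))
    return results
```

Plotting detection rate against position (and against total response length) is what surfaces the position/length correlation the paper reports.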
How to Apply
- Use this benchmark to check whether your Reward Model can actually judge long-form outputs
- If your RM scores poorly on this benchmark, add long-form preference pairs to its training data
- For any RLHF pipeline targeting long-form generation, validate RM quality on this dataset before deploying
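The pre-deployment check above can be reduced to a simple gate in the pipeline. This is a hypothetical sketch: the per-task accuracy dict and the 0.75 threshold are illustrative assumptions, not values from the paper.

```python
# Hypothetical pre-deployment gate for an RLHF pipeline.
# The 0.75 threshold is an arbitrary illustration, not from the paper.
def validate_rm_for_long_form(accuracy_by_task, threshold=0.75):
    """Raise if the RM's benchmark accuracy on any subtask falls below
    the threshold; otherwise return True."""
    failing = {t: a for t, a in accuracy_by_task.items() if a < threshold}
    if failing:
        raise ValueError(f"RM not ready for long-form RLHF; weak subtasks: {failing}")
    return True
```

Running this before PPO/GRPO training makes "the RM can judge long outputs" an explicit, enforced precondition rather than an assumption.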
Code Example
# Example prompt using Generative Reward Model in Selection mode
# (Much higher accuracy than Scoring mode)
prompt_template = """
You are an expert evaluator. Given the following instruction and multiple responses,
select the BEST response that is most helpful, accurate, and comprehensive.
Instruction:
{instruction}
Response A:
{response_a}
Response B:
{response_b}
Response C:
{response_c}
Which response is the best? Answer with only 'A', 'B', or 'C'.
"""
# ❌ Scoring mode to avoid (causes tie-handling issues)
bad_prompt = """
Rate the following response on a scale of 1-10.
Response: {response}
Score:
"""
# ✅ Recommended Selection mode
from openai import OpenAI

client = OpenAI()

def select_best_response(instruction, responses):
    # The template above has exactly three slots, so this expects
    # exactly three responses (A, B, C)
    labeled = {chr(65 + i): r for i, r in enumerate(responses)}  # A, B, C...
    prompt = prompt_template.format(
        instruction=instruction,
        **{f'response_{k.lower()}': v for k, v in labeled.items()},
    )
    result = client.chat.completions.create(
        model="gpt-4.1-2025-04-14",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1,
    )
    best_key = result.choices[0].message.content.strip()
    return labeled.get(best_key)
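In practice a generative judge does not always emit a bare letter (outputs like "A." or " b" are common), so it helps to normalize the answer before looking it up. This helper is an assumed robustness addition for illustration, not part of the benchmark:

```python
# Hypothetical helper (not from the paper): normalize a generative
# judge's one-token answer before using it as a selection label.
import re

def parse_choice(raw_answer, num_responses):
    """Return the chosen label ('A', 'B', ...) or None if unparseable."""
    match = re.match(r"\s*([A-Za-z])\b", raw_answer)
    if not match:
        return None
    label = match.group(1).upper()
    valid = {chr(65 + i) for i in range(num_responses)}
    return label if label in valid else None
```

Feeding the model's raw output through `parse_choice` before indexing into the labeled responses avoids silent `None` returns on slightly malformed answers.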
Original Abstract
The widespread adoption of reinforcement learning-based alignment highlights the growing importance of reward models. Various benchmarks have been built to evaluate reward models in various domains and scenarios. However, a significant gap remains in assessing reward models for long-form generation, despite its critical role in real-world applications. To bridge this, we introduce Long-form RewardBench, the first reward modeling testbed specifically designed for long-form generation. Our benchmark encompasses five key subtasks: QA, RAG, Chat, Writing, and Reasoning. We collected instruction and preference data through a meticulously designed multi-stage data collection process, and conducted extensive experiments on 20+ mainstream reward models, including both classifiers and generative models. Our findings reveal that current models still lack long-form reward modeling capabilities. Furthermore, we designed a novel Long-form Needle-in-a-Haystack Test, which revealed a correlation between reward modeling performance and the error's position within a response, as well as the overall response length, with distinct characteristics observed between classification and generative models. Finally, we demonstrate that classifiers exhibit better generalizability compared to generative models trained on the same data. As the first benchmark for long-form reward modeling, this work aims to offer a robust platform for visualizing progress in this crucial area.