Long-form RewardBench: Evaluating Reward Models for Long-form Generation
TL;DR Highlight
The first benchmark specifically for evaluating reward models on long-form generation, addressing the gap left by existing Reward Model benchmarks that cover only short texts.
Who Should Read
Researchers evaluating Reward Models and RLHF pipelines, and teams that need to assess model quality on long-form generation tasks.
Core Mechanics
- Identified a critical gap: existing Reward Model (RM) benchmarks focus on short texts, missing the challenges of evaluating long-form generation
- Created the first dedicated benchmark for evaluating RMs on long-form generation, with instruction and preference data collected through a multi-stage process
- Long-form evaluation requires different quality criteria: coherence, consistency, completeness, and narrative flow sustained across many tokens
- RMs trained on short-text preferences fail to generalize to long-form quality assessment
- The benchmark covers five long-form subtasks: QA, RAG, Chat, Writing, and Reasoning
- Provides fine-grained quality annotations beyond simple preference labels
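To make the data shape concrete, here is a minimal sketch of what a long-form preference record with fine-grained annotations might look like. The field names and annotation keys are assumptions for illustration, not the benchmark's actual schema.

```python
# Hypothetical record layout; field names are assumptions,
# not the benchmark's actual schema.
from dataclasses import dataclass, field

@dataclass
class LongFormPreferencePair:
    task: str            # one of the five subtasks: QA, RAG, Chat, Writing, Reasoning
    instruction: str
    chosen: str          # preferred long-form response
    rejected: str        # dispreferred long-form response
    # Fine-grained quality scores beyond the binary preference label
    annotations: dict = field(default_factory=dict)

pair = LongFormPreferencePair(
    task="Writing",
    instruction="Write a 3,000-word essay on urban planning.",
    chosen="(full 3,000-word essay text)",
    rejected="(essay that stops after the introduction)",
    annotations={"coherence": 4, "consistency": 5, "completeness": 3},
)
```

A structure like this lets an evaluator check not only whether an RM ranks the pair correctly but also which quality dimension it got wrong.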
Evidence
- Across 20+ mainstream reward models, both classifiers and generative models show significantly degraded performance on long-form evaluation compared with short-form
- A novel Long-form Needle-in-a-Haystack test reveals that reward modeling performance correlates with both the position of an error within a response and overall response length, with distinct patterns for classifier vs. generative models
- Existing benchmarks contain almost no long responses (>2048 tokens)
- Classifiers generalize better than generative models trained on the same data; models trained only on short-text preferences underperform on long-form tasks despite strong instruction following
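The Needle-in-a-Haystack idea can be sketched as a simple probe: inject the same error sentence at different positions in an otherwise clean response and check whether the RM's score drops at each position. This is a simplified illustration of the concept, not the paper's protocol; `score_fn` is a hypothetical stand-in for any reward model mapping (instruction, response) to a scalar.

```python
# Sketch of a Needle-in-a-Haystack-style probe (simplified illustration;
# `score_fn` is a hypothetical (instruction, response) -> float reward model).
def needle_probe(score_fn, instruction, clean_response, error_sentence):
    """Return (position, detected) pairs: did the score drop when the
    error sentence was injected at that sentence position?"""
    sentences = clean_response.split(". ")
    baseline = score_fn(instruction, clean_response)
    results = []
    for pos in range(len(sentences) + 1):
        corrupted = ". ".join(sentences[:pos] + [error_sentence] + sentences[pos:])
        results.append((pos, score_fn(instruction, corrupted) < baseline))
    return results
```

Plotting detection rate against position (and against total response length) is what surfaces the position/length correlation the paper reports.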
How to Apply
- Use this benchmark to check whether your Reward Model can actually judge long-form outputs
- If your RM scores poorly on this benchmark, add long-form preference pairs to its training data
- For any RLHF pipeline targeting long-form generation, validate RM quality on this dataset before deploying
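The pre-deployment check above can be reduced to a simple gate in the pipeline. This is a hypothetical sketch: the per-task accuracy dict and the 0.75 threshold are illustrative assumptions, not values from the paper.

```python
# Hypothetical pre-deployment gate for an RLHF pipeline.
# The 0.75 threshold is an arbitrary illustration, not from the paper.
def validate_rm_for_long_form(accuracy_by_task, threshold=0.75):
    """Raise if the RM's benchmark accuracy on any subtask falls below
    the threshold; otherwise return True."""
    failing = {t: a for t, a in accuracy_by_task.items() if a < threshold}
    if failing:
        raise ValueError(f"RM not ready for long-form RLHF; weak subtasks: {failing}")
    return True
```

Running this before PPO/GRPO training makes "the RM can judge long outputs" an explicit, enforced precondition rather than an assumption.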
Code Example
# Example prompt using Generative Reward Model in Selection mode
# (Much higher accuracy than Scoring mode)
prompt_template = """
You are an expert evaluator. Given the following instruction and multiple responses,
select the BEST response that is most helpful, accurate, and comprehensive.
Instruction:
{instruction}
Response A:
{response_a}
Response B:
{response_b}
Response C:
{response_c}
Which response is the best? Answer with only 'A', 'B', or 'C'.
"""
# ❌ Scoring mode to avoid (causes tie-handling issues)
bad_prompt = """
Rate the following response on a scale of 1-10.
Response: {response}
Score:
"""
# ✅ Recommended Selection mode
from openai import OpenAI

client = OpenAI()

def select_best_response(instruction, responses):
    # The template above has exactly three slots, so this expects
    # exactly three responses (A, B, C)
    labeled = {chr(65 + i): r for i, r in enumerate(responses)}  # A, B, C...
    prompt = prompt_template.format(
        instruction=instruction,
        **{f'response_{k.lower()}': v for k, v in labeled.items()},
    )
    result = client.chat.completions.create(
        model="gpt-4.1-2025-04-14",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1,
    )
    best_key = result.choices[0].message.content.strip()
    return labeled.get(best_key)
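In practice a generative judge does not always emit a bare letter (outputs like "A." or " b" are common), so it helps to normalize the answer before looking it up. This helper is an assumed robustness addition for illustration, not part of the benchmark:

```python
# Hypothetical helper (not from the paper): normalize a generative
# judge's one-token answer before using it as a selection label.
import re

def parse_choice(raw_answer, num_responses):
    """Return the chosen label ('A', 'B', ...) or None if unparseable."""
    match = re.match(r"\s*([A-Za-z])\b", raw_answer)
    if not match:
        return None
    label = match.group(1).upper()
    valid = {chr(65 + i) for i in range(num_responses)}
    return label if label in valid else None
```

Feeding the model's raw output through `parse_choice` before indexing into the labeled responses avoids silent `None` returns on slightly malformed answers.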
Original Abstract
The widespread adoption of reinforcement learning-based alignment highlights the growing importance of reward models. Various benchmarks have been built to evaluate reward models in various domains and scenarios. However, a significant gap remains in assessing reward models for long-form generation, despite its critical role in real-world applications. To bridge this, we introduce Long-form RewardBench, the first reward modeling testbed specifically designed for long-form generation. Our benchmark encompasses five key subtasks: QA, RAG, Chat, Writing, and Reasoning. We collected instruction and preference data through a meticulously designed multi-stage data collection process, and conducted extensive experiments on 20+ mainstream reward models, including both classifiers and generative models. Our findings reveal that current models still lack long-form reward modeling capabilities. Furthermore, we designed a novel Long-form Needle-in-a-Haystack Test, which revealed a correlation between reward modeling performance and the error's position within a response, as well as the overall response length, with distinct characteristics observed between classification and generative models. Finally, we demonstrate that classifiers exhibit better generalizability compared to generative models trained on the same data. As the first benchmark for long-form reward modeling, this work aims to offer a robust platform for visualizing progress in this crucial area.