VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
TL;DR Highlight
An 8B-scale judge model that scores the correctness of each solution step in image+text reasoning problems; it plugs into existing models to boost reasoning performance by up to 8.4 points.
Who Should Read
ML engineers operating or improving math/reasoning pipelines with multimodal LLMs (image+text), especially teams looking to enhance inference quality through Best-of-N sampling.
Core Mechanics
- PRM (Process Reward Model), which scores accuracy at each individual step rather than the entire solution, consistently outperforms ORM (outcome-only model) and Self-Consistency in Best-of-N settings
- VisualPRM400K, a multimodal process supervision dataset of 400K samples, was constructed via an automated pipeline — expected accuracy (mc) is calculated by sampling multiple completions at each step
- Applying the Best-of-8 strategy to InternVL2.5-8B, MiniCPM-V2.6, Qwen2.5-VL-7B, and InternVL2.5-78B improves overall performance by 8.4, 8.0, 3.7, and 5.9 points respectively
- Using existing open-source MLLMs (including InternVL2.5-78B) as judge models yields almost no improvement — caused by a positivity bias where most steps are judged as correct
- VisualPRM computes all step scores in a single forward pass, resulting in significantly lower inference latency than autoregressive MLLM-as-Judge approaches
- VisualProcessBench, an evaluation benchmark containing 2,866 samples and 26,950 human-annotated step-level correctness labels, is publicly released
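The expected-accuracy (mc) labeling behind VisualPRM400K can be sketched in a few lines. This is a toy illustration, not the actual pipeline: `rollout_fn` is a hypothetical stand-in for sampling a completion from a step prefix and checking whether its final answer is correct.

```python
import random

def expected_accuracy(rollout_fn, prefix, k=16):
    """Estimate mc for a step prefix: the fraction of k sampled
    completions that reach the correct final answer."""
    return sum(rollout_fn(prefix) for _ in range(k)) / k

# Toy rollout: pretend completions from this prefix succeed ~70% of the time.
random.seed(0)
toy_rollout = lambda prefix: random.random() < 0.7

mc = expected_accuracy(toy_rollout, prefix="Step 1: ...", k=16)
label = "+" if mc > 0 else "-"  # the paper treats steps with mc > 0 as correct
```

In the real pipeline each rollout is a full model generation conditioned on the image, question, and the steps so far, which is why building 400K samples requires an automated setup.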
Evidence
- In Best-of-8, VisualPRM outperforms ORM by 1.5 points and Self-Consistency by 2.4 points (based on InternVL2.5-8B); scaling to N=128 widens the gap to 4.3 and 3.1 points respectively
- VisualProcessBench F1 scores: VisualPRM 62.0 vs GPT-4o 60.3 vs Gemini-2.0-Flash 62.3 — an 8B open-source model achieves step verification capability on par with proprietary models
- Also effective on text-only benchmarks — Qwen2.5-7B MATH-500 +6.1pt, GPQA-Diamond +5.0pt; InternVL2.5-8B MATH-500 +9.4pt
- As N scales from 8 to 128, VisualPRM shows consistent performance gains (InternVL2.5-8B: 41.2→44.0), while ORM plateaus or degrades after N=64
How to Apply
- Sample N responses (8–32) from an existing multimodal model (InternVL2.5, Qwen2.5-VL, etc.) at temperature=0.7 for the same problem, then use VisualPRM-8B to compute the average step score for each solution and select the highest-scoring answer — inference quality improves without any model fine-tuning
- For aggregating step scores, 'average' is the most stable method; avoid 'max' aggregation, as it concentrates scores on easy early steps and degrades performance
- If you need your own process supervision data, you can replicate the VisualPRM400K pipeline: Monte Carlo sample 16 completions per step to estimate expected accuracy (mc), and treat steps with mc>0 as correct
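The aggregation advice above can be made concrete with a small sketch (the function names here are illustrative, not from the VisualPRM codebase):

```python
def aggregate(step_scores, method="average"):
    """Combine per-step correctness scores into one solution score.
    'average' is the most stable choice; 'max' can be dominated by
    an easy early step and hide later errors."""
    if method == "average":
        return sum(step_scores) / len(step_scores)
    if method == "min":
        return min(step_scores)  # strictest: one bad step sinks the solution
    if method == "max":
        return max(step_scores)
    raise ValueError(f"unknown method: {method}")

# A solution with a confident first step but a likely-wrong last step:
scores = [0.95, 0.80, 0.15]
avg = aggregate(scores, "average")  # penalizes the bad step
mx = aggregate(scores, "max")       # 0.95 -- hides it entirely
```

With 'max' aggregation, a solution that opens with a trivially correct step outranks a solution that is moderately confident throughout, which is exactly the failure mode the paper warns about.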
Code Example
# Best-of-N selection example (pseudocode)
# load_model, generate, and get_token_prob are placeholders for the
# actual loading/inference code of your serving stack.

# 1. Generate N candidates with the policy model
policy_model = load_model("InternVL2.5-8B")
candidates = [
    policy_model.generate(image, question, temperature=0.7)
    for _ in range(8)  # N=8
]

# 2. Score each candidate's steps with VisualPRM
prm = load_model("VisualPRM-8B")

def score_response(image, question, solution_steps):
    step_scores = []
    for i, step in enumerate(solution_steps):
        # Score each step by the probability the PRM assigns to the '+' token
        prob_correct = prm.get_token_prob(
            image, question, solution_steps[:i + 1], token="+"
        )
        step_scores.append(prob_correct)
    return sum(step_scores) / len(step_scores)  # average aggregation

# 3. Select the highest-scoring candidate
scores = [score_response(image, question, c.steps) for c in candidates]
best_response = candidates[scores.index(max(scores))]
# VisualPRM prompt format (multi-turn chat)
# Turn 1: <image> + question + step_0
# Turn 2: step_1 → model predicts '+' or '-'
# Turn N: step_n → model predicts '+' or '-'
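The multi-turn layout above can be sketched as a message-building helper. This is a guess at the shape of the conversation, not the official chat template; the authoritative format is defined by the VisualPRM-8B model card and tokenizer config.

```python
def build_prm_turns(question, steps):
    """Hypothetical helper: arrange an image+question and solution steps
    into the multi-turn layout described above. The placeholder '+'
    assistant turns mark where the model's step verdicts would go."""
    turns = [{"role": "user", "content": f"<image>\n{question}\n{steps[0]}"}]
    for step in steps[1:]:
        turns.append({"role": "assistant", "content": "+"})  # placeholder verdict
        turns.append({"role": "user", "content": step})
    return turns

turns = build_prm_turns(
    "What is 2 + 3?",
    ["Step 1: Add the two numbers.", "Step 2: The answer is 5."],
)
```

In practice you would feed these turns through the model's chat template and read the probability of '+' versus '-' at each assistant position.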
Original Abstract
We introduce VisualPRM, an advanced multimodal Process Reward Model (PRM) with 8B parameters, which improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs) across different model scales and families with Best-of-N (BoN) evaluation strategies. Specifically, our model improves the reasoning performance of three types of MLLMs and four different model scales. Even when applied to the highly capable InternVL2.5-78B, it achieves a 5.9-point improvement across seven multimodal reasoning benchmarks. Experimental results show that our model exhibits superior performance compared to Outcome Reward Models and Self-Consistency during BoN evaluation. To facilitate the training of multimodal PRMs, we construct a multimodal process supervision dataset VisualPRM400K using an automated data pipeline. For the evaluation of multimodal PRMs, we propose VisualProcessBench, a benchmark with human-annotated step-wise correctness labels, to measure the abilities of PRMs to detect erroneous steps in multimodal reasoning tasks. We hope that our work can inspire more future research and contribute to the development of MLLMs. Our model, data, and benchmark are released in https://internvl.github.io/blog/2025-03-13-VisualPRM/.