VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
TL;DR Highlight
An 8B-scale judge model that scores the correctness of each solution step in image+text reasoning problems; it plugs into existing models to boost reasoning performance by up to 8.4 points.
Who Should Read
ML engineers operating or improving math/reasoning pipelines with multimodal LLMs (image+text), especially teams looking to enhance inference quality through Best-of-N sampling.
Core Mechanics
- PRM (Process Reward Model), which scores accuracy at each individual step rather than the entire solution, consistently outperforms ORM (outcome-only model) and Self-Consistency in Best-of-N settings
- VisualPRM400K, a multimodal process supervision dataset of 400K samples, was constructed via an automated pipeline — expected accuracy (mc) is calculated by sampling multiple completions at each step
- Applying the Best-of-8 strategy to InternVL2.5-8B, MiniCPM-V2.6, Qwen2.5-VL-7B, and InternVL2.5-78B improves overall performance by 8.4, 8.0, 3.7, and 5.9 points respectively
- Using existing open-source MLLMs (including InternVL2.5-78B) as judge models yields almost no improvement — caused by a positivity bias where most steps are judged as correct
- VisualPRM computes all step scores in a single forward pass, resulting in significantly lower inference latency than autoregressive MLLM-as-Judge approaches
- VisualProcessBench, an evaluation benchmark containing 2,866 samples and 26,950 human-annotated step-level correctness labels, is publicly released
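The expected-accuracy (mc) labeling behind VisualPRM400K can be sketched in a few lines. This is a toy illustration, not the actual pipeline: `rollout_fn` is a hypothetical stand-in for sampling a completion from a step prefix and checking whether its final answer is correct.

```python
import random

def expected_accuracy(rollout_fn, prefix, k=16):
    """Estimate mc for a step prefix: the fraction of k sampled
    completions that reach the correct final answer."""
    return sum(rollout_fn(prefix) for _ in range(k)) / k

# Toy rollout: pretend completions from this prefix succeed ~70% of the time.
random.seed(0)
toy_rollout = lambda prefix: random.random() < 0.7

mc = expected_accuracy(toy_rollout, prefix="Step 1: ...", k=16)
label = "+" if mc > 0 else "-"  # the paper treats steps with mc > 0 as correct
```

In the real pipeline each rollout is a full model generation conditioned on the image, question, and the steps so far, which is why building 400K samples requires an automated setup.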
Evidence
- In Best-of-8, VisualPRM outperforms ORM by 1.5 points and Self-Consistency by 2.4 points (based on InternVL2.5-8B); scaling to N=128 widens the gap to 4.3 and 3.1 points respectively
- VisualProcessBench F1 scores: VisualPRM 62.0 vs GPT-4o 60.3 vs Gemini-2.0-Flash 62.3 — an 8B open-source model achieves step verification capability on par with proprietary models
- Also effective on text-only benchmarks — Qwen2.5-7B MATH-500 +6.1pt, GPQA-Diamond +5.0pt; InternVL2.5-8B MATH-500 +9.4pt
- As N scales from 8 to 128, VisualPRM shows consistent performance gains (InternVL2.5-8B: 41.2→44.0), while ORM plateaus or degrades after N=64
How to Apply
- Sample N responses (8–32) from an existing multimodal model (InternVL2.5, Qwen2.5-VL, etc.) at temperature=0.7 for the same problem, then use VisualPRM-8B to compute the average step score for each solution and select the highest-scoring answer — inference quality improves without any model fine-tuning
- For aggregating step scores, 'average' is the most stable method; avoid 'max' aggregation, as it concentrates scores on easy early steps and degrades performance
- If you need your own process supervision data, you can replicate the VisualPRM400K pipeline: Monte Carlo sample 16 completions per step to estimate expected accuracy (mc), and treat steps with mc>0 as correct
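The aggregation advice above can be made concrete with a small sketch (the function names here are illustrative, not from the VisualPRM codebase):

```python
def aggregate(step_scores, method="average"):
    """Combine per-step correctness scores into one solution score.
    'average' is the most stable choice; 'max' can be dominated by
    an easy early step and hide later errors."""
    if method == "average":
        return sum(step_scores) / len(step_scores)
    if method == "min":
        return min(step_scores)  # strictest: one bad step sinks the solution
    if method == "max":
        return max(step_scores)
    raise ValueError(f"unknown method: {method}")

# A solution with a confident first step but a likely-wrong last step:
scores = [0.95, 0.80, 0.15]
avg = aggregate(scores, "average")  # penalizes the bad step
mx = aggregate(scores, "max")       # 0.95 -- hides it entirely
```

With 'max' aggregation, a solution that opens with a trivially correct step outranks a solution that is moderately confident throughout, which is exactly the failure mode the paper warns about.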
Code Example
# Best-of-N selection example (pseudocode)
# load_model, generate, and get_token_prob are placeholders for the
# actual loading/inference code of your serving stack.

# 1. Generate N candidates with the policy model
policy_model = load_model("InternVL2.5-8B")
candidates = [
    policy_model.generate(image, question, temperature=0.7)
    for _ in range(8)  # N=8
]

# 2. Score each candidate's steps with VisualPRM
prm = load_model("VisualPRM-8B")

def score_response(image, question, solution_steps):
    step_scores = []
    for i, step in enumerate(solution_steps):
        # Score each step by the probability the PRM assigns to the '+' token
        prob_correct = prm.get_token_prob(
            image, question, solution_steps[:i + 1], token="+"
        )
        step_scores.append(prob_correct)
    return sum(step_scores) / len(step_scores)  # average aggregation

# 3. Select the highest-scoring candidate
scores = [score_response(image, question, c.steps) for c in candidates]
best_response = candidates[scores.index(max(scores))]
# VisualPRM prompt format (multi-turn chat)
# Turn 1: <image> + question + step_0
# Turn 2: step_1 → model predicts '+' or '-'
# Turn N: step_n → model predicts '+' or '-'
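The multi-turn layout above can be sketched as a message-building helper. This is a guess at the shape of the conversation, not the official chat template; the authoritative format is defined by the VisualPRM-8B model card and tokenizer config.

```python
def build_prm_turns(question, steps):
    """Hypothetical helper: arrange an image+question and solution steps
    into the multi-turn layout described above. The placeholder '+'
    assistant turns mark where the model's step verdicts would go."""
    turns = [{"role": "user", "content": f"<image>\n{question}\n{steps[0]}"}]
    for step in steps[1:]:
        turns.append({"role": "assistant", "content": "+"})  # placeholder verdict
        turns.append({"role": "user", "content": step})
    return turns

turns = build_prm_turns(
    "What is 2 + 3?",
    ["Step 1: Add the two numbers.", "Step 2: The answer is 5."],
)
```

In practice you would feed these turns through the model's chat template and read the probability of '+' versus '-' at each assistant position.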
Original Abstract
We introduce VisualPRM, an advanced multimodal Process Reward Model (PRM) with 8B parameters, which improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs) across different model scales and families with Best-of-N (BoN) evaluation strategies. Specifically, our model improves the reasoning performance of three types of MLLMs and four different model scales. Even when applied to the highly capable InternVL2.5-78B, it achieves a 5.9-point improvement across seven multimodal reasoning benchmarks. Experimental results show that our model exhibits superior performance compared to Outcome Reward Models and Self-Consistency during BoN evaluation. To facilitate the training of multimodal PRMs, we construct a multimodal process supervision dataset VisualPRM400K using an automated data pipeline. For the evaluation of multimodal PRMs, we propose VisualProcessBench, a benchmark with human-annotated step-wise correctness labels, to measure the abilities of PRMs to detect erroneous steps in multimodal reasoning tasks. We hope that our work can inspire more future research and contribute to the development of MLLMs. Our model, data, and benchmark are released in https://internvl.github.io/blog/2025-03-13-VisualPRM/.