Empowering Reliable Visual-Centric Instruction Following in MLLMs
TL;DR Highlight
We created a benchmark and 10k-sample fine-tuning datasets (SFT and DPO) to verify whether multimodal models actually reference images — existing evaluations could be passed without any image at all.
Who Should Read
ML engineers and researchers who evaluate or fine-tune multimodal LLMs (models that process both images and text), especially those who need to distinguish a model that 'genuinely looks at the image to answer' from one that 'answers from text patterns alone.'
Core Mechanics
- Existing benchmarks like MM-IFEval can be passed by satisfying text conditions alone without any image — models game the eval through language habits, not true visual understanding
- The VC-IFEngine pipeline automatically generates 10 types of visual constraints (Spatial, Attribute, Comparative, etc.) aligned to the actual image content
- 10k SFT training data (VC-IFInstruct) and 10k DPO (preference learning) data (VC-IFDPO) will be publicly released
- DPO rejected samples are generated by either 'removing some constraints' or 'editing the image with Stable Diffusion' — the 100% constraint removal approach yields the best performance
- Even the strongest open-source model Qwen2.5-VL-32B only achieves 67.3% on VC-IFEval — current models' visual instruction following is weaker than expected
- Evaluation combines two methods — 'direct GPT-4o judgment' and 'response comparison with/without image' — to filter out language bias
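The constraint-generation step can be pictured as a small selector over image annotations. The sketch below is an illustrative reconstruction, not the VC-IFEngine implementation: the `ImageFacts` fields and the template wordings are hypothetical, and only the four category names quoted above come from the summary.

```python
from dataclasses import dataclass, field

@dataclass
class ImageFacts:
    """Minimal image annotations an engine might extract upstream (hypothetical)."""
    objects: list = field(default_factory=list)    # e.g. ["dog", "ball"]
    positions: dict = field(default_factory=dict)  # e.g. {"dog": "left"}
    colors: dict = field(default_factory=dict)     # e.g. {"ball": "red"}

def applicable_constraints(facts: ImageFacts) -> list:
    """Select constraint categories the image can actually support and
    instantiate each template concretely (the 'rewrite concretely' step)."""
    out = []
    if facts.positions:
        obj, pos = next(iter(facts.positions.items()))
        out.append(("Spatial", f"Mention that the {obj} appears on the {pos} side."))
    if facts.colors:
        obj, color = next(iter(facts.colors.items()))
        out.append(("Attribute", f"Describe the {obj} as {color}."))
    if len(facts.objects) >= 2:
        a, b = facts.objects[:2]
        out.append(("Comparative", f"Compare the sizes of the {a} and the {b}."))
    if facts.objects:
        out.append(("Counting", f"State that exactly {len(facts.objects)} salient objects are visible."))
    return out
```

Constraints whose preconditions the image cannot satisfy (e.g. Comparative with a single object) are simply skipped, so every generated constraint stays grounded in actual image content.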
Evidence
- Qwen2.5-VL-7B-Instruct: VC-IFEval score 57.3% → 63.0% after VC-IFInstruct SFT, then 66.1% after DPO (+8.8%p improvement)
- LLaVA-NeXT-Llama3-8B: 50.1% → 53.3% after SFT, then 60.2% after DPO (+10.1%p improvement)
- Comparative evaluation human agreement 92%, direct evaluation human agreement 90% — automated evaluation reliability verified
- 80% of VC-IFDPO preference data validated by both annotators as correct chosen/rejected pairs, IAA Cohen's κ = 0.86
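The κ = 0.86 figure is presumably plain two-rater Cohen's kappa (observed agreement corrected for chance agreement); a minimal reference implementation:

```python
def cohens_kappa(labels_a, labels_b):
    """Two-rater Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_o is the
    observed agreement rate and p_e the agreement expected by chance.
    Undefined (division by zero) when p_e == 1, i.e. both raters are constant."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    classes = set(labels_a) | set(labels_b)
    p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n) for c in classes)
    return (p_o - p_e) / (1 - p_e)
```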
How to Apply
- To check whether your model actually references the image: generate responses to the same question with and without the image, then ask GPT-4o to judge 'Influenced / Not influenced'
- When creating data with visual constraints: choose from 10 categories — Spatial (position), Attribute (color/texture), Comparative, Counting, etc. — select those applicable to the image and rewrite them concretely
- For DPO rejected samples, 'removing constraints' is more effective than 'removing the image' — deleting the image actually weakens the visual grounding learning signal
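The constraint-removal recipe for rejected samples can be sketched as follows. This is a hypothetical reconstruction, not the paper's exact format: `respond_fn` stands in for any model call, and the prompt layout is illustrative.

```python
import random

def make_dpo_pair(instruction, constraints, chosen_response, respond_fn,
                  removal_ratio=1.0, seed=0):
    """Build a (chosen, rejected) DPO pair. The rejected answer is generated
    from a weakened instruction with `removal_ratio` of the visual constraints
    dropped (1.0 = drop all, the best-performing setting per the summary),
    so it likely violates the full constraint set."""
    rng = random.Random(seed)
    n_keep = round(len(constraints) * (1 - removal_ratio))
    kept = rng.sample(constraints, n_keep)

    def fmt(cs):
        return instruction + ("\nConstraints: " + "; ".join(cs) if cs else "")

    rejected = respond_fn(fmt(kept))  # answer unaware of the dropped constraints
    return {"prompt": fmt(constraints), "chosen": chosen_response, "rejected": rejected}
```

Note the prompt always carries the full constraint list; only the generation of the rejected response sees the weakened version, which keeps the preference signal focused on constraint adherence rather than image presence.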
Code Example
# VC-IFEval-style comparative evaluation prompt (measuring image influence)
comparative_judge_prompt = """
You are evaluating whether access to the IMAGE substantively influenced the model's answer.
You will be given the question and two answers:
- Answer A: produced WITH image available.
- Answer B: produced WITHOUT image.
Guidelines:
- If Answer A contains details that plausibly come from visual evidence (objects, layout, colors, counts, attributes)
and such details are missing/incorrect in Answer B, or the final conclusions differ BECAUSE of visual cues,
judge it as "Influenced".
- If both answers are essentially the same in conclusions and key details (only minor wording differs),
judge "Not influenced".
Question: {question}
Answer A (WITH image): {answer_with_image}
Answer B (WITHOUT image): {answer_without_image}
Return exactly one word: Influenced or Not influenced.
"""
# Direct evaluation prompt (checking constraint compliance)
direct_judge_prompt = """
You are asked to judge whether the AI assistant's response fully complies with each listed constraint.
1. Each judgment should be grounded in the visual evidence provided by the image.
2. Assign 1 point if completely satisfied; assign 0 otherwise.
<start of response> {prediction} <end of response>
<start of constraint list> {constraints} <end of constraint list>
Output format: Judgement: ... Summary: constraint_1: x/1, constraint_2: x/1, ...
"""Terminology
Original Abstract
Evaluating the instruction-following (IF) capabilities of Multimodal Large Language Models (MLLMs) is essential for rigorously assessing how faithfully model outputs adhere to user-specified intentions. Nevertheless, existing benchmarks for evaluating MLLMs' instruction-following capability primarily focus on verbal instructions in the textual modality. These limitations hinder a thorough analysis of instruction-following capabilities, as they overlook the implicit constraints embedded in the semantically rich visual modality. To address this gap, we introduce VC-IFEval, a new benchmark accompanied by a systematically constructed dataset that evaluates MLLMs' instruction-following ability under multimodal settings. Our benchmark systematically incorporates vision-dependent constraints into instruction design, enabling a more rigorous and fine-grained assessment of how well MLLMs align their outputs with both visual input and textual instructions. Furthermore, by fine-tuning MLLMs on our dataset, we achieve substantial gains in visual instruction-following accuracy and adherence. Through extensive evaluation across representative MLLMs, we provide new insights into the strengths and limitations of current models.