Empowering Reliable Visual-Centric Instruction Following in MLLMs
TL;DR Highlight
We created a benchmark and 10k-sample fine-tuning datasets (SFT and DPO) to verify whether multimodal models actually reference images — existing evaluations could be passed without any image at all.
Who Should Read
ML engineers and researchers who evaluate or fine-tune multimodal LLMs (models that process both images and text), especially those who need to distinguish a model that 'genuinely looks at the image to answer' from one that 'answers from text patterns alone.'
Core Mechanics
- Existing benchmarks like MM-IFEval can be passed by satisfying text conditions alone without any image — models game the eval through language habits, not true visual understanding
- The VC-IFEngine pipeline automatically generates 10 types of visual constraints (Spatial, Attribute, Comparative, etc.) aligned to the actual image content
- 10k SFT training data (VC-IFInstruct) and 10k DPO (preference learning) data (VC-IFDPO) will be publicly released
- DPO rejected samples are generated by either 'removing some constraints' or 'editing the image with Stable Diffusion' — the 100% constraint removal approach yields the best performance
- Even the strongest open-source model Qwen2.5-VL-32B only achieves 67.3% on VC-IFEval — current models' visual instruction following is weaker than expected
- Evaluation combines two methods — 'direct GPT-4o judgment' and 'response comparison with/without image' — to filter out language bias
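The constraint-generation step can be pictured as a small selector over image annotations. The sketch below is an illustrative reconstruction, not the VC-IFEngine implementation: the `ImageFacts` fields and the template wordings are hypothetical, and only the four category names quoted above come from the summary.

```python
from dataclasses import dataclass, field

@dataclass
class ImageFacts:
    """Minimal image annotations an engine might extract upstream (hypothetical)."""
    objects: list = field(default_factory=list)    # e.g. ["dog", "ball"]
    positions: dict = field(default_factory=dict)  # e.g. {"dog": "left"}
    colors: dict = field(default_factory=dict)     # e.g. {"ball": "red"}

def applicable_constraints(facts: ImageFacts) -> list:
    """Select constraint categories the image can actually support and
    instantiate each template concretely (the 'rewrite concretely' step)."""
    out = []
    if facts.positions:
        obj, pos = next(iter(facts.positions.items()))
        out.append(("Spatial", f"Mention that the {obj} appears on the {pos} side."))
    if facts.colors:
        obj, color = next(iter(facts.colors.items()))
        out.append(("Attribute", f"Describe the {obj} as {color}."))
    if len(facts.objects) >= 2:
        a, b = facts.objects[:2]
        out.append(("Comparative", f"Compare the sizes of the {a} and the {b}."))
    if facts.objects:
        out.append(("Counting", f"State that exactly {len(facts.objects)} salient objects are visible."))
    return out
```

Constraints whose preconditions the image cannot satisfy (e.g. Comparative with a single object) are simply skipped, so every generated constraint stays grounded in actual image content.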
Evidence
- Qwen2.5-VL-7B-Instruct: VC-IFEval score 57.3% → 63.0% after VC-IFInstruct SFT, then 66.1% after DPO (+8.8%p improvement)
- LLaVA-NeXT-Llama3-8B: 50.1% → 53.3% after SFT, then 60.2% after DPO (+10.1%p improvement)
- Comparative evaluation human agreement 92%, direct evaluation human agreement 90% — automated evaluation reliability verified
- 80% of VC-IFDPO preference data validated by both annotators as correct chosen/rejected pairs, IAA Cohen's κ = 0.86
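The κ = 0.86 figure is presumably plain two-rater Cohen's kappa (observed agreement corrected for chance agreement); a minimal reference implementation:

```python
def cohens_kappa(labels_a, labels_b):
    """Two-rater Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_o is the
    observed agreement rate and p_e the agreement expected by chance.
    Undefined (division by zero) when p_e == 1, i.e. both raters are constant."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    classes = set(labels_a) | set(labels_b)
    p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n) for c in classes)
    return (p_o - p_e) / (1 - p_e)
```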
How to Apply
- To check whether your model actually references the image: generate responses to the same question with and without the image, then ask GPT-4o to judge 'Influenced / Not influenced'
- When creating data with visual constraints: choose from 10 categories — Spatial (position), Attribute (color/texture), Comparative, Counting, etc. — select those applicable to the image and rewrite them concretely
- For DPO rejected samples, 'removing constraints' is more effective than 'removing the image' — deleting the image actually weakens the visual grounding learning signal
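The constraint-removal recipe for rejected samples can be sketched as follows. This is a hypothetical reconstruction, not the paper's exact format: `respond_fn` stands in for any model call, and the prompt layout is illustrative.

```python
import random

def make_dpo_pair(instruction, constraints, chosen_response, respond_fn,
                  removal_ratio=1.0, seed=0):
    """Build a (chosen, rejected) DPO pair. The rejected answer is generated
    from a weakened instruction with `removal_ratio` of the visual constraints
    dropped (1.0 = drop all, the best-performing setting per the summary),
    so it likely violates the full constraint set."""
    rng = random.Random(seed)
    n_keep = round(len(constraints) * (1 - removal_ratio))
    kept = rng.sample(constraints, n_keep)

    def fmt(cs):
        return instruction + ("\nConstraints: " + "; ".join(cs) if cs else "")

    rejected = respond_fn(fmt(kept))  # answer unaware of the dropped constraints
    return {"prompt": fmt(constraints), "chosen": chosen_response, "rejected": rejected}
```

Note the prompt always carries the full constraint list; only the generation of the rejected response sees the weakened version, which keeps the preference signal focused on constraint adherence rather than image presence.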
Code Example
# VC-IFEval-style comparative evaluation prompt (measuring image influence)
comparative_judge_prompt = """
You are evaluating whether access to the IMAGE substantively influenced the model's answer.
You will be given the question and two answers:
- Answer A: produced WITH image available.
- Answer B: produced WITHOUT image.
Guidelines:
- If Answer A contains details that plausibly come from visual evidence (objects, layout, colors, counts, attributes)
and such details are missing/incorrect in Answer B, or the final conclusions differ BECAUSE of visual cues,
judge it as "Influenced".
- If both answers are essentially the same in conclusions and key details (only minor wording differs),
judge "Not influenced".
Question: {question}
Answer A (WITH image): {answer_with_image}
Answer B (WITHOUT image): {answer_without_image}
Return exactly one word: Influenced or Not influenced.
"""
# Direct evaluation prompt (checking constraint compliance)
direct_judge_prompt = """
You are asked to judge whether the AI assistant's response fully complies with each listed constraint.
1. Each judgment should be grounded in the visual evidence provided by the image.
2. Assign 1 point if completely satisfied; assign 0 otherwise.
<start of response> {prediction} <end of response>
<start of constraint list> {constraints} <end of constraint list>
Output format: Judgement: ... Summary: constraint_1: x/1, constraint_2: x/1, ...
"""Terminology
Original Abstract
Evaluating the instruction-following (IF) capabilities of Multimodal Large Language Models (MLLMs) is essential for rigorously assessing how faithfully model outputs adhere to user-specified intentions. Nevertheless, existing benchmarks for evaluating MLLMs' instruction-following capability primarily focus on verbal instructions in the textual modality. These limitations hinder a thorough analysis of instruction-following capabilities, as they overlook the implicit constraints embedded in the semantically rich visual modality. To address this gap, we introduce VC-IFEval, a new benchmark accompanied by a systematically constructed dataset that evaluates MLLMs' instruction-following ability under multimodal settings. Our benchmark systematically incorporates vision-dependent constraints into instruction design, enabling a more rigorous and fine-grained assessment of how well MLLMs align their outputs with both visual input and textual instructions. Furthermore, by fine-tuning MLLMs on our dataset, we achieve substantial gains in visual instruction-following accuracy and adherence. Through extensive evaluation across representative MLLMs, we provide new insights into the strengths and limitations of current models.