MMR-Bench: A Comprehensive Benchmark for Multimodal LLM Routing
TL;DR Highlight
A benchmark showing that automatic per-query model routing can match, and even exceed, the strongest single model's accuracy at roughly 33% of its cost
Who Should Read
ML infrastructure engineers operating a mix of models such as GPT-5, Gemini, and Claude while optimizing cost-to-performance ratios, and especially AI service developers handling multimodal queries whose difficulty varies wildly across OCR, VQA, and math reasoning.
Core Mechanics
- Per-query routing alone matches the strongest single model's accuracy at roughly 33% of its cost: simple OCR goes to Qwen2.5-VL-3B ($0.03/1M tokens) and complex math reasoning to GPT-5 ($10/1M tokens), automatically
- Text-only routers cannot judge visual complexity (OCR density, charts, spatial reasoning), so they misroute visually hard queries to cheap models
- Adaptive fusion of image+text modalities outperforms both text-only and image-only routers — gap is especially large for low-resolution or complex scenes
- Matrix Factorization-based router is most stable across diverse workloads — KNN/KMeans only strong on specific datasets and fragile to budget changes
- Multimodal-trained router outperforms the strongest single model on text-only benchmarks (GSM8K, MMLU, ARC) even with image channel masked to zero — no retraining needed
- Provides offline result tables for 10 models × 11,000 instances including GPT-5, Gemini 2.5 Pro/Flash, Claude 3.7 Sonnet, Qwen2.5-VL 3B/7B/72B, InternVL3-78B
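The cost figure in the first bullet comes from shifting most query volume to cheap models; a back-of-envelope sketch makes the arithmetic concrete (the traffic mix below is illustrative, not the paper's measured routing distribution):

```python
# Hypothetical post-routing traffic mix (illustrative only, not from the paper)
mix = {"qwen-3b": 0.35, "qwen-7b": 0.15, "gemini-flash": 0.20, "gpt-5": 0.30}
cost = {"qwen-3b": 0.03, "qwen-7b": 0.07, "gemini-flash": 2.00, "gpt-5": 10.0}  # $/1M tokens

blended = sum(mix[m] * cost[m] for m in mix)  # expected $/1M tokens under routing
ratio = blended / cost["gpt-5"]               # vs. sending everything to GPT-5
print(f"blended ${blended:.2f}/1M tokens, {ratio:.0%} of GPT-5-only cost")
# → blended $3.42/1M tokens, 34% of GPT-5-only cost
```

Even with 30% of traffic still going to the most expensive model, the blended cost lands near a third of the single-model baseline, because the cheap models are two orders of magnitude cheaper.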
Evidence
- ~33% of the strongest single model's cost at the same accuracy (Pareto frontier comparison, Figure 4)
- Multimodal trained router on text-only benchmarks: GSM8K 94.5→96.7, MMLU 91.2→92.4, ARC 65.7→66.7 (all surpassing strongest single model)
- Cross-dataset evaluation: router peak score exceeds strongest single model on OCR (0.7234 vs 0.7062), VQA (0.8012 vs 0.7936), Math (0.7914 vs 0.7592)
- LinearMFRouter overall nAUC 0.7042 / Peak Score 0.7533, surpassing Best Single Model's peak score 0.7412
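The nAUC numbers above are easiest to read as a normalized area under the cost-accuracy curve; that exact definition is an assumption here (the paper's metric may differ), but a minimal version looks like:

```python
import numpy as np

def normalized_auc(costs, accuracies):
    """Area under the accuracy-vs-cost curve with cost rescaled to [0, 1].

    Assumes nAUC = normalized area under the cost-accuracy tradeoff curve;
    the paper's exact definition is not reproduced here.
    """
    order = np.argsort(costs)
    c = np.asarray(costs, dtype=float)[order]
    a = np.asarray(accuracies, dtype=float)[order]
    c = (c - c[0]) / (c[-1] - c[0])  # normalize cost axis to [0, 1]
    # Trapezoidal integration of accuracy over normalized cost
    return float(((a[1:] + a[:-1]) / 2 * np.diff(c)).sum())

# Flat 0.70 accuracy at every budget yields an nAUC of 0.70
print(normalized_auc([0.1, 1.0, 10.0], [0.70, 0.70, 0.70]))
```

Under this reading, a higher nAUC means the router sustains accuracy across the whole budget range rather than only at one operating point.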
How to Apply
- In API servers handling queries of widely varying difficulty: extract text+image embeddings via CLIP, apply adaptive fusion, and route via a KMeans/LinearMF router as middleware; training data consists of each model's per-query accuracy and cost records
- For B2B services with variable budgets, default to Matrix Factorization router; for structurally repetitive domains like math, KNN router achieves higher peak scores at lower cost
- Train a multimodal router once and handle text-only requests by masking image embeddings to zero — one router for unified multi/single modal operation without retraining
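The Matrix Factorization router recommended above can be approximated with a low-rank linear model over query embeddings. This is a hedged sketch, not the paper's LinearMFRouter: the class name, rank-truncation step, and cost penalty below are illustrative choices. It fits per-model utility predictors by least squares, truncates the weight matrix to rank k, and routes by argmax of predicted utility minus a normalized cost penalty.

```python
import numpy as np

class LinearMFRouter:
    """Low-rank linear router sketch: U_hat = X @ W, with W truncated to
    rank k (a stand-in for the paper's LinearMFRouter, whose exact
    formulation is not reproduced here)."""

    def __init__(self, costs, rank=8, cost_weight=0.5):
        self.costs = np.asarray(costs, dtype=float)
        self.rank = rank
        self.cost_weight = cost_weight

    def fit(self, embeddings, utilities):
        # Least-squares fit of per-model utilities, with a bias column
        X = np.hstack([embeddings, np.ones((len(embeddings), 1))])
        W, *_ = np.linalg.lstsq(X, utilities, rcond=None)
        # Truncate to rank k via SVD (the "matrix factorization" step)
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        k = min(self.rank, len(s))
        self.W = (U[:, :k] * s[:k]) @ Vt[:k]
        return self

    def route(self, embedding):
        x = np.append(embedding, 1.0)
        score = x @ self.W - self.cost_weight * (self.costs / self.costs.max())
        return int(np.argmax(score))

# Toy demo: model 0 is accurate when the first feature is positive
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
U = np.stack([(X[:, 0] > 0).astype(float), (X[:, 0] <= 0).astype(float)], axis=1)
router = LinearMFRouter(costs=[1.0, 1.0], rank=2).fit(X, U)
```

The same `fit`/`route` interface works for the fused embeddings produced in the code example below, so the two router families can be swapped behind one middleware API.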
Code Example
import numpy as np
from sklearn.cluster import KMeans
import clip
import torch
# Extract text/image embeddings with CLIP
model, preprocess = clip.load("ViT-B/32", device="cpu")
def get_embeddings(image, text):
    img_input = preprocess(image).unsqueeze(0)
    txt_input = clip.tokenize([text], truncate=True)
    with torch.no_grad():
        img_emb = model.encode_image(img_input).float()
        txt_emb = model.encode_text(txt_input).float()
    # L2-normalize so the norms are comparable across modalities
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return txt_emb.numpy()[0], img_emb.numpy()[0]

def adaptive_fuse(txt_emb, img_emb, eta=5.0, alpha=0.5, beta=0.5):
    """Adaptive fusion from the MMR-Bench paper (Eq. S6)"""
    # Compute softmax weights based on norm-based confidence
    c = np.array([np.linalg.norm(txt_emb), np.linalg.norm(img_emb)])
    weights = np.exp(eta * c) / np.exp(eta * c).sum()
    w_txt, w_img = weights
    # Weighted sum + product (cross-modal agreement) + difference (mismatch)
    fused = (
        (w_txt * txt_emb + w_img * img_emb)
        + alpha * (txt_emb * img_emb)
        + beta * (txt_emb - img_emb)
    )
    return fused / np.linalg.norm(fused)
# Model zoo (cost: $/1M output tokens)
MODEL_ZOO = {
    "qwen-3b": {"cost": 0.03},
    "qwen-7b": {"cost": 0.07},
    "gemini-flash": {"cost": 2.00},
    "gpt-5": {"cost": 10.0},
}
class KMeansRouter:
    def __init__(self, n_clusters=8):
        # Explicit n_init avoids the changed-default warning in newer scikit-learn
        self.kmeans = KMeans(n_clusters=n_clusters, init='k-means++', n_init=10)
        self.cluster_model_map = {}

    def fit(self, embeddings, utilities, cost_weight=0.5):
        """utilities: (n_samples, n_models) accuracy per model"""
        self.kmeans.fit(embeddings)
        costs = np.array([v["cost"] for v in MODEL_ZOO.values()])
        for h in range(self.kmeans.n_clusters):
            mask = self.kmeans.labels_ == h
            # Mean utility per model within the cluster, minus a normalized cost penalty
            scores = utilities[mask].mean(0) - cost_weight * (costs / costs.max())
            self.cluster_model_map[h] = list(MODEL_ZOO.keys())[np.argmax(scores)]

    def route(self, embedding):
        cluster = self.kmeans.predict(embedding.reshape(1, -1))[0]
        return self.cluster_model_map[cluster]
# --- At inference time ---
# txt_emb, img_emb = get_embeddings(image, query)
# fused = adaptive_fuse(txt_emb, img_emb)
# selected = router.route(fused)
# print(f"→ {selected} (${MODEL_ZOO[selected]['cost']}/1M tokens)")
# For text-only queries, mask image embedding with zeros (no retraining needed)
# fused = adaptive_fuse(txt_emb, np.zeros_like(txt_emb))
Original Abstract
Multimodal large language models (MLLMs) have advanced rapidly, yet heterogeneity in architecture, alignment strategies, and efficiency means that no single model is uniformly superior across tasks. In practical deployments, workloads span lightweight OCR to complex multimodal reasoning; using one MLLM for all queries either over-provisions compute on easy instances or sacrifices accuracy on hard ones. Query-level model selection (routing) addresses this tension, but extending routing from text-only LLMs to MLLMs is nontrivial due to modality fusion, wide variation in computational cost across models, and the absence of a standardized, budget-aware evaluation. We present MMR-Bench, a unified benchmark that isolates the multimodal routing problem and enables comparison under fixed candidate sets and cost models. MMR-Bench provides (i) a controlled environment with modality-aware inputs and variable compute budgets, (ii) a broad suite of vision-language tasks covering OCR, general VQA, and multimodal math reasoning, and (iii) strong single-model reference, oracle upper bounds, and representative routing policies. Using MMR-Bench, we show that incorporating multimodal signals improves routing quality. Empirically, these cues improve the cost-accuracy frontier and enable the routed system to exceed the strongest single model's accuracy at roughly 33% of its cost. Furthermore, policies trained on a subset of models and tasks generalize zero-shot to new datasets and text-only benchmarks without retuning, establishing MMR-Bench as a foundation for studying adaptive multimodal model selection and efficient MLLM deployment. The code will be available at: https://github.com/Hunter-Wrynn/MMR-Bench.