MMR-Bench: A Comprehensive Benchmark for Multimodal LLM Routing
TL;DR Highlight
A benchmark showing that automatic per-query model routing can match, and even exceed, the strongest single model's accuracy at roughly 33% of its cost
Who Should Read
ML infrastructure engineers operating a mix of models such as GPT-5, Gemini, and Claude while optimizing cost-to-performance ratios, and especially AI service developers handling multimodal queries whose difficulty varies wildly across OCR, VQA, and math reasoning.
Core Mechanics
- Per-query routing alone matches the strongest single model's accuracy at roughly 33% of its cost: simple OCR goes to Qwen2.5-VL-3B ($0.03/1M tokens) and complex math reasoning to GPT-5 ($10/1M tokens), automatically
- Text-only routers cannot judge visual complexity (OCR density, charts, spatial reasoning), so they misroute visually hard queries to cheap models
- Adaptive fusion of image+text modalities outperforms both text-only and image-only routers — gap is especially large for low-resolution or complex scenes
- Matrix Factorization-based router is most stable across diverse workloads — KNN/KMeans only strong on specific datasets and fragile to budget changes
- Multimodal-trained router outperforms the strongest single model on text-only benchmarks (GSM8K, MMLU, ARC) even with image channel masked to zero — no retraining needed
- Provides offline result tables for 10 models × 11,000 instances including GPT-5, Gemini 2.5 Pro/Flash, Claude 3.7 Sonnet, Qwen2.5-VL 3B/7B/72B, InternVL3-78B
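The cost figure in the first bullet comes from shifting most query volume to cheap models; a back-of-envelope sketch makes the arithmetic concrete (the traffic mix below is illustrative, not the paper's measured routing distribution):

```python
# Hypothetical post-routing traffic mix (illustrative only, not from the paper)
mix = {"qwen-3b": 0.35, "qwen-7b": 0.15, "gemini-flash": 0.20, "gpt-5": 0.30}
cost = {"qwen-3b": 0.03, "qwen-7b": 0.07, "gemini-flash": 2.00, "gpt-5": 10.0}  # $/1M tokens

blended = sum(mix[m] * cost[m] for m in mix)  # expected $/1M tokens under routing
ratio = blended / cost["gpt-5"]               # vs. sending everything to GPT-5
print(f"blended ${blended:.2f}/1M tokens, {ratio:.0%} of GPT-5-only cost")
# → blended $3.42/1M tokens, 34% of GPT-5-only cost
```

Even with 30% of traffic still going to the most expensive model, the blended cost lands near a third of the single-model baseline, because the cheap models are two orders of magnitude cheaper.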
Evidence
- ~33% of the strongest single model's cost at the same accuracy (Pareto frontier comparison, Figure 4)
- Multimodal trained router on text-only benchmarks: GSM8K 94.5→96.7, MMLU 91.2→92.4, ARC 65.7→66.7 (all surpassing strongest single model)
- Cross-dataset evaluation: router peak score exceeds strongest single model on OCR (0.7234 vs 0.7062), VQA (0.8012 vs 0.7936), Math (0.7914 vs 0.7592)
- LinearMFRouter overall nAUC 0.7042 / Peak Score 0.7533, surpassing Best Single Model's peak score 0.7412
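The nAUC numbers above are easiest to read as a normalized area under the cost-accuracy curve; that exact definition is an assumption here (the paper's metric may differ), but a minimal version looks like:

```python
import numpy as np

def normalized_auc(costs, accuracies):
    """Area under the accuracy-vs-cost curve with cost rescaled to [0, 1].

    Assumes nAUC = normalized area under the cost-accuracy tradeoff curve;
    the paper's exact definition is not reproduced here.
    """
    order = np.argsort(costs)
    c = np.asarray(costs, dtype=float)[order]
    a = np.asarray(accuracies, dtype=float)[order]
    c = (c - c[0]) / (c[-1] - c[0])  # normalize cost axis to [0, 1]
    # Trapezoidal integration of accuracy over normalized cost
    return float(((a[1:] + a[:-1]) / 2 * np.diff(c)).sum())

# Flat 0.70 accuracy at every budget yields an nAUC of 0.70
print(normalized_auc([0.1, 1.0, 10.0], [0.70, 0.70, 0.70]))
```

Under this reading, a higher nAUC means the router sustains accuracy across the whole budget range rather than only at one operating point.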
How to Apply
- In API servers handling queries of widely varying difficulty: extract text+image embeddings via CLIP, apply adaptive fusion, and route via a KMeans/LinearMF router as middleware; training data consists of each model's per-query accuracy and cost records
- For B2B services with variable budgets, default to Matrix Factorization router; for structurally repetitive domains like math, KNN router achieves higher peak scores at lower cost
- Train a multimodal router once and handle text-only requests by masking image embeddings to zero — one router for unified multi/single modal operation without retraining
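The Matrix Factorization router recommended above can be approximated with a low-rank linear model over query embeddings. This is a hedged sketch, not the paper's LinearMFRouter: the class name, rank-truncation step, and cost penalty below are illustrative choices. It fits per-model utility predictors by least squares, truncates the weight matrix to rank k, and routes by argmax of predicted utility minus a normalized cost penalty.

```python
import numpy as np

class LinearMFRouter:
    """Low-rank linear router sketch: U_hat = X @ W, with W truncated to
    rank k (a stand-in for the paper's LinearMFRouter, whose exact
    formulation is not reproduced here)."""

    def __init__(self, costs, rank=8, cost_weight=0.5):
        self.costs = np.asarray(costs, dtype=float)
        self.rank = rank
        self.cost_weight = cost_weight

    def fit(self, embeddings, utilities):
        # Least-squares fit of per-model utilities, with a bias column
        X = np.hstack([embeddings, np.ones((len(embeddings), 1))])
        W, *_ = np.linalg.lstsq(X, utilities, rcond=None)
        # Truncate to rank k via SVD (the "matrix factorization" step)
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        k = min(self.rank, len(s))
        self.W = (U[:, :k] * s[:k]) @ Vt[:k]
        return self

    def route(self, embedding):
        x = np.append(embedding, 1.0)
        score = x @ self.W - self.cost_weight * (self.costs / self.costs.max())
        return int(np.argmax(score))

# Toy demo: model 0 is accurate when the first feature is positive
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
U = np.stack([(X[:, 0] > 0).astype(float), (X[:, 0] <= 0).astype(float)], axis=1)
router = LinearMFRouter(costs=[1.0, 1.0], rank=2).fit(X, U)
```

The same `fit`/`route` interface works for the fused embeddings produced in the code example below, so the two router families can be swapped behind one middleware API.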
Code Example
import numpy as np
from sklearn.cluster import KMeans
import clip
import torch
# Extract text/image embeddings with CLIP
model, preprocess = clip.load("ViT-B/32", device="cpu")
def get_embeddings(image, text):
    img_input = preprocess(image).unsqueeze(0)
    txt_input = clip.tokenize([text], truncate=True)
    with torch.no_grad():
        img_emb = model.encode_image(img_input).float()
        txt_emb = model.encode_text(txt_input).float()
    # L2-normalize so the norms are comparable across modalities
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return txt_emb.numpy()[0], img_emb.numpy()[0]

def adaptive_fuse(txt_emb, img_emb, eta=5.0, alpha=0.5, beta=0.5):
    """Adaptive fusion from the MMR-Bench paper (Eq. S6)"""
    # Compute softmax weights based on norm-based confidence
    c = np.array([np.linalg.norm(txt_emb), np.linalg.norm(img_emb)])
    weights = np.exp(eta * c) / np.exp(eta * c).sum()
    w_txt, w_img = weights
    # Weighted sum + product (cross-modal agreement) + difference (mismatch)
    fused = (
        (w_txt * txt_emb + w_img * img_emb)
        + alpha * (txt_emb * img_emb)
        + beta * (txt_emb - img_emb)
    )
    return fused / np.linalg.norm(fused)
# Model zoo (cost: $/1M output tokens)
MODEL_ZOO = {
    "qwen-3b": {"cost": 0.03},
    "qwen-7b": {"cost": 0.07},
    "gemini-flash": {"cost": 2.00},
    "gpt-5": {"cost": 10.0},
}
class KMeansRouter:
    def __init__(self, n_clusters=8):
        # Explicit n_init avoids the changed-default warning in newer scikit-learn
        self.kmeans = KMeans(n_clusters=n_clusters, init='k-means++', n_init=10)
        self.cluster_model_map = {}

    def fit(self, embeddings, utilities, cost_weight=0.5):
        """utilities: (n_samples, n_models) accuracy per model"""
        self.kmeans.fit(embeddings)
        costs = np.array([v["cost"] for v in MODEL_ZOO.values()])
        for h in range(self.kmeans.n_clusters):
            mask = self.kmeans.labels_ == h
            # Mean utility per model within the cluster, minus a normalized cost penalty
            scores = utilities[mask].mean(0) - cost_weight * (costs / costs.max())
            self.cluster_model_map[h] = list(MODEL_ZOO.keys())[np.argmax(scores)]

    def route(self, embedding):
        cluster = self.kmeans.predict(embedding.reshape(1, -1))[0]
        return self.cluster_model_map[cluster]
# --- At inference time ---
# txt_emb, img_emb = get_embeddings(image, query)
# fused = adaptive_fuse(txt_emb, img_emb)
# selected = router.route(fused)
# print(f"→ {selected} (${MODEL_ZOO[selected]['cost']}/1M tokens)")
# For text-only queries, mask image embedding with zeros (no retraining needed)
# fused = adaptive_fuse(txt_emb, np.zeros_like(txt_emb))
Original Abstract
Multimodal large language models (MLLMs) have advanced rapidly, yet heterogeneity in architecture, alignment strategies, and efficiency means that no single model is uniformly superior across tasks. In practical deployments, workloads span lightweight OCR to complex multimodal reasoning; using one MLLM for all queries either over-provisions compute on easy instances or sacrifices accuracy on hard ones. Query-level model selection (routing) addresses this tension, but extending routing from text-only LLMs to MLLMs is nontrivial due to modality fusion, wide variation in computational cost across models, and the absence of a standardized, budget-aware evaluation. We present MMR-Bench, a unified benchmark that isolates the multimodal routing problem and enables comparison under fixed candidate sets and cost models. MMR-Bench provides (i) a controlled environment with modality-aware inputs and variable compute budgets, (ii) a broad suite of vision-language tasks covering OCR, general VQA, and multimodal math reasoning, and (iii) strong single-model reference, oracle upper bounds, and representative routing policies. Using MMR-Bench, we show that incorporating multimodal signals improves routing quality. Empirically, these cues improve the cost-accuracy frontier and enable the routed system to exceed the strongest single model's accuracy at roughly 33% of its cost. Furthermore, policies trained on a subset of models and tasks generalize zero-shot to new datasets and text-only benchmarks without retuning, establishing MMR-Bench as a foundation for studying adaptive multimodal model selection and efficient MLLM deployment. The code will be available at: https://github.com/Hunter-Wrynn/MMR-Bench.