MMR-Bench: A Comprehensive Benchmark for Multimodal LLM Routing

Jan 25, 2026 • Haoxuan Ma, Guannan Lai, Han-Jia Ye

TL;DR Highlight

A benchmark showing that routing, which automatically picks a suitable AI model for each query, can match the strongest single model's accuracy at roughly 33% of its cost.

Who Should Read

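ML infrastructure engineers who run a mix of models such as GPT-5, Gemini, and Claude and need to optimize performance per dollar. Especially relevant for developers of AI services that handle multimodal queries of widely varying difficulty, such as OCR, VQA, and math reasoning.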

Core Mechanics

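Routing each query to a different model is enough to match the strongest single model's accuracy at roughly 33% of its cost: simple OCR goes to Qwen2.5-VL-3B ($0.03/1M) and complex math reasoning goes to GPT-5 ($10/1M), with the split made automatically.
A text-only router cannot judge visual complexity such as OCR density, charts, or spatial reasoning, so it mistakenly sends such queries to cheap models.
Combining the image and text modalities through adaptive fusion beats both text-only and image-only routers; the gap is largest for low-resolution inputs and complex scenes.
Routers based on Matrix Factorization (predicting model performance in a low-dimensional latent space) are the most stable across diverse workloads; KNN and KMeans are strong only on specific datasets and are sensitive to budget changes.
A router trained on multimodal data still outperforms the strongest single model on text-only benchmarks (GSM8K, MMLU, ARC) when the image channel is masked to zero, with no retraining required.
An offline results table is provided covering 10 models (including GPT-5, Gemini 2.5 Pro/Flash, Claude 3.7 Sonnet, Qwen2.5-VL 3B/7B/72B, and InternVL3-78B) × 11,000 instances.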
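
To make the Matrix Factorization idea above concrete, here is a minimal sketch of a low-rank linear utility predictor. It is not the paper's LinearMFRouter: the function names `fit_low_rank_router` and `route`, the ridge-then-truncated-SVD fitting procedure, and the `lam` cost penalty are all assumptions chosen for illustration. The inputs stand in for the offline results table: one embedding per query and one observed utility per (query, model) pair.

```python
import numpy as np

def fit_low_rank_router(X, U, rank=2, ridge=1e-3):
    """X: (n, d) query embeddings; U: (n, M) observed per-model utilities.

    Returns (W, V) such that predicted utilities are X @ W @ V.T.
    Hypothetical helper: fits a ridge regression, then truncates it to `rank`.
    """
    d = X.shape[1]
    # Full (d, M) coefficient matrix from ridge regression
    B = np.linalg.solve(X.T @ X + ridge * np.eye(d), X.T @ U)
    # Truncated SVD splits B into query-side and model-side latent factors
    P, s, Qt = np.linalg.svd(B, full_matrices=False)
    W = P[:, :rank] * s[:rank]   # (d, rank): maps an embedding into the latent space
    V = Qt[:rank].T              # (M, rank): one latent vector per model
    return W, V

def route(x, W, V, costs, lam=0.0):
    """Return the model index maximizing predicted utility minus lam * normalized cost."""
    pred = (x @ W) @ V.T
    return int(np.argmax(pred - lam * costs / costs.max()))
```

Sweeping `lam` from 0 upward moves the router from "always pick the predicted-best model" toward "always pick the cheapest model", which is one simple way to trace a cost-accuracy curve under different budgets.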

Evidence

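The same accuracy is reached at about 33% of the strongest single model's cost (Pareto frontier comparison, Figure 4).
Text-only benchmark performance of the multimodally trained router (strongest single model → router): GSM8K 94.5→96.7, MMLU 91.2→92.4, ARC 65.7→66.7; the router is higher on all three.
In cross-dataset evaluation, the router's peak score is higher than the strongest single model's: OCR (0.7234 vs 0.7062), VQA (0.8012 vs 0.7936), Math (0.7914 vs 0.7592).
LinearMFRouter reaches an overall average nAUC of 0.7042 and a Peak Score of 0.7533, above the Best Single Model's peak score of 0.7412 (with the Best Single Model's nAUC taken as the 1.0 reference).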
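
The three metrics quoted above (nAUC, peak score, and the relative cost needed to match the best single model) can be computed from a list of router operating points. The sketch below is an illustration under stated assumptions, not the benchmark's official scoring code: the function name `curve_metrics` is hypothetical, the area is normalized by the cost span of the given points, and the matching cost is read off the operating points without interpolation.

```python
import math
import numpy as np

def curve_metrics(costs, accs, best_single_cost, best_single_acc):
    """Return (nAUC, peak score, quality-neutral cost) for a cost-accuracy curve."""
    costs = np.asarray(costs, dtype=float)
    accs = np.asarray(accs, dtype=float)
    order = np.argsort(costs)
    costs, accs = costs[order], accs[order]

    # Trapezoidal area under the curve, divided by the cost span (assumed normalization)
    area = float(np.sum((accs[1:] + accs[:-1]) / 2.0 * np.diff(costs)))
    nauc = area / (costs[-1] - costs[0])

    # Peak score: best accuracy reached anywhere on the curve
    peak = float(accs.max())

    # Cheapest operating point that matches the best single model, relative to its cost
    reached = costs[accs >= best_single_acc]
    qnc = float(reached.min() / best_single_cost) if reached.size else math.inf
    return nauc, peak, qnc
```

For example, a router whose cheapest point matching the best single model costs 4 units, against a best single model that costs 10 units, has a relative matching cost of 0.4.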

How to Apply

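In an API server that handles queries of widely varying difficulty, add a middleware layer that extracts text and image embeddings with CLIP, applies adaptive fusion, and selects a model with a KMeans or LinearMF router. The training data consists of each model's recorded accuracy and cost.
For a B2B service with variable budgets, adopt a Matrix Factorization router as the default; in structurally repetitive domains such as math problems, a KNN router can reach a high peak score at lower cost.
Once a multimodal router is trained, text-only requests without images can be handled without retraining by masking the image embedding to zero, so a single router serves both multimodal and single-modality traffic.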
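
A KNN router along the lines described above could look like the following minimal sketch. The class name `KNNRouter`, the use of cosine similarity, the neighbor count `k`, and the normalized cost penalty are all assumptions for illustration rather than the paper's implementation. Training data is the recorded per-model accuracy for each past query plus a per-model cost.

```python
import numpy as np

class KNNRouter:
    """Hypothetical k-nearest-neighbor router over fused query embeddings."""

    def __init__(self, model_names, costs, k=5, cost_weight=0.5):
        self.model_names = list(model_names)
        self.costs = np.asarray(costs, dtype=float)
        self.k = k
        self.cost_weight = cost_weight

    def fit(self, embeddings, utilities):
        """embeddings: (n, d); utilities: (n, n_models) recorded accuracy per model."""
        E = np.asarray(embeddings, dtype=float)
        # Store unit-length rows so a dot product equals cosine similarity
        self.E = E / np.linalg.norm(E, axis=1, keepdims=True)
        self.U = np.asarray(utilities, dtype=float)
        return self

    def route(self, embedding):
        q = np.asarray(embedding, dtype=float)
        q = q / np.linalg.norm(q)
        sims = self.E @ q
        neighbors = np.argsort(-sims)[: self.k]
        # Mean neighbor accuracy per model, minus a penalty proportional to relative cost
        scores = self.U[neighbors].mean(axis=0) - self.cost_weight * self.costs / self.costs.max()
        return self.model_names[int(np.argmax(scores))]
```

Because the prediction is a local average, this kind of router tends to do well when similar queries recur, which is consistent with the note above about structurally repetitive domains.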

Code Example

import numpy as np
from sklearn.cluster import KMeans
import clip
import torch

# Extract text and image embeddings with CLIP
model, preprocess = clip.load("ViT-B/32", device="cpu")

def get_embeddings(image, text):
    img_input = preprocess(image).unsqueeze(0)
    txt_input = clip.tokenize([text], truncate=True)
    with torch.no_grad():
        img_emb = model.encode_image(img_input).float()
        txt_emb = model.encode_text(txt_input).float()
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return txt_emb.numpy()[0], img_emb.numpy()[0]

def adaptive_fuse(txt_emb, img_emb, eta=5.0, alpha=0.5, beta=0.5):
    """Adaptive fusion from the MMR-Bench paper (Eq. S6)"""
    # Softmax weights computed from norm-based confidence
    c = np.array([np.linalg.norm(txt_emb), np.linalg.norm(img_emb)])
    weights = np.exp(eta * c) / np.exp(eta * c).sum()
    w_txt, w_img = weights

    # Weighted sum + product (cross-modal agreement) + difference (mismatch)
    fused = (w_txt * txt_emb + w_img * img_emb) \
          + alpha * (txt_emb * img_emb) \
          + beta  * (txt_emb - img_emb)
    return fused / np.linalg.norm(fused)

# Model zoo (cost: $ per 1M output tokens)
MODEL_ZOO = {
    "qwen-3b":      {"cost": 0.03},
    "qwen-7b":      {"cost": 0.07},
    "gemini-flash": {"cost": 2.00},
    "gpt-5":        {"cost": 10.0},
}

class KMeansRouter:
    def __init__(self, n_clusters=8):
        self.kmeans = KMeans(n_clusters=n_clusters, init='k-means++')
        self.cluster_model_map = {}

    def fit(self, embeddings, utilities, cost_weight=0.5):
        """utilities: (n_samples, n_models) 각 모델의 정답률"""
        self.kmeans.fit(embeddings)
        costs = np.array([v["cost"] for v in MODEL_ZOO.values()])
        for h in range(self.kmeans.n_clusters):
            mask = self.kmeans.labels_ == h
            scores = utilities[mask].mean(0) - cost_weight * (costs / costs.max())
            self.cluster_model_map[h] = list(MODEL_ZOO.keys())[np.argmax(scores)]

    def route(self, embedding):
        cluster = self.kmeans.predict(embedding.reshape(1, -1))[0]
        return self.cluster_model_map[cluster]

# --- At inference time ---
# txt_emb, img_emb = get_embeddings(image, query)
# fused = adaptive_fuse(txt_emb, img_emb)
# selected = router.route(fused)
# print(f"→ {selected} (${MODEL_ZOO[selected]['cost']}/1M tokens)")

# For a text-only query, mask the image embedding to zero (no retraining needed)
# fused = adaptive_fuse(txt_emb, np.zeros_like(txt_emb))
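
One property of the snippet above is worth checking by hand. Because `get_embeddings` L2-normalizes both embeddings, the norm-based confidence in `adaptive_fuse` is the same for text and image, so the softmax weights come out equal; they only diverge when one modality is masked to zero. The sketch below isolates that weight step. The helper name `modality_weights` and the toy 2-dimensional vectors are assumptions for illustration.

```python
import numpy as np

def modality_weights(txt_emb, img_emb, eta=5.0):
    """Softmax over eta * L2 norm, mirroring the weight step of adaptive_fuse."""
    c = np.array([np.linalg.norm(txt_emb), np.linalg.norm(img_emb)])
    z = eta * c
    w = np.exp(z - z.max())  # subtract the max for numerical stability
    return w / w.sum()

unit_txt = np.array([0.6, 0.8])  # unit length, like a normalized CLIP text embedding
unit_img = np.array([1.0, 0.0])  # unit length, like a normalized CLIP image embedding

# Both modalities present: equal norms give equal weights
w_both = modality_weights(unit_txt, unit_img)

# Image masked to zero: almost all weight shifts to the text embedding
w_text_only = modality_weights(unit_txt, np.zeros_like(unit_txt))
```

If per-query weighting between modalities is wanted, a confidence signal other than the norm of already-normalized embeddings would be needed; this summary does not say which signal the paper uses.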

Terminology

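LLM Routing: A technique that looks at a query and automatically decides which AI model should handle it. Easy questions go to cheap models and hard questions go to powerful ones.

MLLM: A multimodal large language model that understands text and images together. Examples are models such as GPT-4V, Claude 3.7 Sonnet, and Gemini that can look at a picture and answer questions about it.

nAUC: The normalized area under the performance curve over the whole cost range. It indicates how consistently a router performs across many budgets rather than at one specific budget.

Matrix Factorization: A technique that represents instances and models in a low-dimensional latent space in order to predict performance. It follows the same principle as factorizing the user-movie matrix in the Netflix recommender system.

Pareto frontier: The boundary of points that are optimal on both the cost and performance axes without sacrificing either. It is the set of points offering higher performance at the same cost, or lower cost at the same performance.

QNC (Quality-Neutral Cost): The relative cost needed to match the strongest single model's accuracy. A value of 0.33 means equal performance at 33% of the cost; infinity (∞) means the router never catches up no matter how much is spent.

Adaptive Fusion: A method that combines text and image embeddings by estimating the reliability of each modality and weighting them dynamically, instead of taking a simple average. For queries where the image matters more, the weight on the image embedding is raised automatically.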
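
As a small illustration of the Pareto frontier definition above, the sketch below filters a set of (cost, accuracy) operating points down to the non-dominated ones. The function name `pareto_frontier` and the example numbers are hypothetical and not taken from the paper.

```python
def pareto_frontier(points):
    """points: iterable of (cost, accuracy) pairs.

    Returns the non-dominated points sorted by increasing cost: a point is kept
    only if it is more accurate than every point that costs the same or less.
    """
    frontier = []
    best_acc = float("-inf")
    # Sort by cost; among equal costs, visit the most accurate point first
    for cost, acc in sorted(points, key=lambda p: (p[0], -p[1])):
        if acc > best_acc:
            frontier.append((cost, acc))
            best_acc = acc
    return frontier

# Made-up operating points: (cost, accuracy)
example = [(1, 0.5), (2, 0.4), (3, 0.7), (3, 0.6), (10, 0.7)]
frontier = pareto_frontier(example)
```

Here `(2, 0.4)` is dropped because a cheaper point is already more accurate, and `(10, 0.7)` is dropped because `(3, 0.7)` reaches the same accuracy at lower cost.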

Related Resources

https://github.com/Hunter-Wrynn/MMR-Bench

Original Abstract

Multimodal large language models (MLLMs) have advanced rapidly, yet heterogeneity in architecture, alignment strategies, and efficiency means that no single model is uniformly superior across tasks. In practical deployments, workloads span lightweight OCR to complex multimodal reasoning; using one MLLM for all queries either over-provisions compute on easy instances or sacrifices accuracy on hard ones. Query-level model selection (routing) addresses this tension, but extending routing from text-only LLMs to MLLMs is nontrivial due to modality fusion, wide variation in computational cost across models, and the absence of a standardized, budget-aware evaluation. We present MMR-Bench, a unified benchmark that isolates the multimodal routing problem and enables comparison under fixed candidate sets and cost models. MMR-Bench provides (i) a controlled environment with modality-aware inputs and variable compute budgets, (ii) a broad suite of vision-language tasks covering OCR, general VQA, and multimodal math reasoning, and (iii) strong single-model reference, oracle upper bounds, and representative routing policies. Using MMR-Bench, we show that incorporating multimodal signals improves routing quality. Empirically, these cues improve the cost-accuracy frontier and enable the routed system to exceed the strongest single model's accuracy at roughly 33% of its cost. Furthermore, policies trained on a subset of models and tasks generalize zero-shot to new datasets and text-only benchmarks without retuning, establishing MMR-Bench as a foundation for studying adaptive multimodal model selection and efficient MLLM deployment. The code will be available at: https://github.com/Hunter-Wrynn/MMR-Bench.