MixLLM: Dynamic Routing in Mixed Large Language Models

Feb 9, 2025 • Xinyuan Wang, Yanchi Liu, Wei Cheng +5

TL;DR Highlight

A routing system that automatically selects the best model for each query from a pool of LLMs, reaching 97% of GPT-4's quality at 24% of the cost.

Who Should Read

Backend and ML engineers who mix several LLM APIs and want to balance quality against cost, especially teams looking to cut their usage of expensive models such as GPT-4.

Core Mechanics

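Builds a router that, for each query, picks the best model from several LLM candidates (GPT-4, GPT-3.5, Llama, and others) by weighing quality, cost, and latency together.
Uses the InsTag model to attach domain tags (e.g., 'Computer Science', 'Legal') to queries, then applies unsupervised fine-tuning to a BERT encoder to produce embeddings specialized for routing.
Keeps an independent lightweight prediction model (Random Forest, MLP, or KNN) per LLM to predict response quality and cost in advance, so adding a new model does not require retraining the whole system.
Applies a latency penalty when queries pile up on a particular LLM, automatically preventing bottlenecks.
Supports continual learning after deployment from user feedback (thumbs up/down), using a contextual bandit (context-aware reinforcement learning) approach.
With Llama 3.1 8B/70B added, reaches 98.55% of GPT-4's quality at 16.79% of the cost, showing that it adapts flexibly as the LLM pool grows.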
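
The "one independent predictor per LLM" idea can be illustrated with a small sketch. This is not the authors' implementation: the class names `KNNQualityPredictor` and `PredictorPool`, the 2-dimensional embeddings, and the quality values are all hypothetical, and a pure-Python k-nearest-neighbour regressor stands in for the Random Forest, MLP, or KNN models mentioned above. The point it shows is that registering a new LLM only trains that LLM's predictor and leaves the others untouched.

```python
import math


class KNNQualityPredictor:
    """Tiny k-nearest-neighbour regressor over query embeddings (illustrative only)."""

    def __init__(self, k=3):
        self.k = k
        self.samples = []  # list of (embedding, observed_quality)

    def add(self, embedding, quality):
        self.samples.append((list(embedding), float(quality)))

    def predict(self, embedding):
        if not self.samples:
            raise ValueError("predictor has no training samples")
        # Average the observed quality of the k closest training queries.
        nearest = sorted(self.samples, key=lambda s: math.dist(s[0], embedding))[: self.k]
        return sum(q for _, q in nearest) / len(nearest)


class PredictorPool:
    """One independent predictor per LLM, keyed by model name (hypothetical helper)."""

    def __init__(self):
        self.predictors = {}

    def add_llm(self, name, training_pairs, k=3):
        predictor = KNNQualityPredictor(k)
        for embedding, quality in training_pairs:
            predictor.add(embedding, quality)
        self.predictors[name] = predictor  # existing predictors are not retrained

    def remove_llm(self, name):
        self.predictors.pop(name, None)

    def predict_all(self, embedding):
        return {name: p.predict(embedding) for name, p in self.predictors.items()}


# Synthetic toy data: 2-D "embeddings" with made-up quality scores.
pool = PredictorPool()
pool.add_llm("gpt-4", [([0.0, 0.0], 0.9), ([1.0, 1.0], 0.8)], k=1)
pool.add_llm("gpt-3.5-turbo", [([0.0, 0.0], 0.7), ([1.0, 1.0], 0.3)], k=1)
before = pool.predict_all([0.1, 0.0])

# Adding a new candidate trains only its own predictor.
pool.add_llm("llama-3.1-70b", [([0.0, 0.0], 0.85)], k=1)
after = pool.predict_all([0.1, 0.0])
```

Because each predictor only sees its own LLM's data, removing a model is equally cheap: `pool.remove_llm("gpt-3.5-turbo")` drops one entry and nothing else changes.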

Evidence

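Reaches 97.25% of GPT-4's quality at 24.18% of GPT-4's cost (the best baseline, OptLLM, reaches 96.39% quality at 32.94% cost).
After adding the Llama 3.1 models, reaches 98.55% of GPT-4's quality at 16.79% of the cost.
Tag-enhanced embeddings give a 5.72% relative improvement in response quality in the low-cost regime (53.14% → 56.18%).
With 70% of the data used for online learning, performance improves by 2.22% with refined feedback and 1.31% with binary feedback.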
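
A quick arithmetic check helps when reading these figures. The sketch below uses only the numbers quoted above; the helper name `relative_gain` is hypothetical. It suggests that the 5.72% figure is a relative gain (the absolute difference is about 3 percentage points), and it derives how much cheaper MixLLM is than OptLLM, a comparison the summary does not state directly.

```python
def relative_gain(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Tag-enhanced embeddings: 53.14% -> 56.18% response quality.
tag_gain = relative_gain(53.14, 56.18)

# Cost relative to GPT-4: OptLLM 32.94% vs. MixLLM 24.18%.
cost_cut_vs_optllm = -relative_gain(32.94, 24.18)

# Quality relative to GPT-4: MixLLM 97.25% vs. OptLLM 96.39% (percentage points).
quality_gap = 97.25 - 96.39

print(f"tag gain: {tag_gain:.2f}% relative")
print(f"cost reduction vs OptLLM: {cost_cut_vs_optllm:.1f}% relative")
print(f"quality gap vs OptLLM: {quality_gap:.2f} points")
```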

How to Apply

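If you already run GPT-4, GPT-3.5-turbo, and Llama 3.1 side by side, train a lightweight quality predictor for each model and add a routing layer that scores candidates from the query embedding plus the predicted quality and cost.
If your service collects user satisfaction feedback (thumbs up/down), attach contextual-bandit online learning so routing accuracy improves over time.
If a particular LLM API slows down at certain times of day, include a waiting-time-based exponential penalty (s_pen) in the routing score so traffic automatically shifts to less congested models.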
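
The online-learning step can be sketched with a generic LinUCB-style arm per LLM. This is a textbook linear bandit, not necessarily the paper's exact update rule; the class name `LinUCBArm`, the mapping of thumbs up/down to rewards 1.0/0.0, the exploration weight, and the 2-dimensional context vector are all assumptions made for illustration.

```python
import math


class LinUCBArm:
    """Linear bandit arm for one LLM: ridge regression plus an uncertainty bonus."""

    def __init__(self, dim, alpha=0.5):
        self.alpha = alpha  # exploration weight (assumed value)
        # A_inv starts as the identity, the inverse of the ridge prior A = I.
        self.A_inv = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim

    def _matvec(self, x):
        return [sum(a * xi for a, xi in zip(row, x)) for row in self.A_inv]

    def ucb(self, x):
        theta = self._matvec(self.b)  # theta = A^{-1} b
        mean = sum(t * xi for t, xi in zip(theta, x))
        a_inv_x = self._matvec(x)
        # Uncertainty bonus sqrt(x^T A^{-1} x); shrinks as the arm sees similar queries.
        bonus = math.sqrt(max(sum(v * xi for v, xi in zip(a_inv_x, x)), 0.0))
        return mean + self.alpha * bonus

    def update(self, x, reward):
        # Sherman-Morrison rank-one update of A^{-1} after A += x x^T.
        a_inv_x = self._matvec(x)
        denom = 1.0 + sum(v * xi for v, xi in zip(a_inv_x, x))
        n = len(x)
        for i in range(n):
            for j in range(n):
                self.A_inv[i][j] -= a_inv_x[i] * a_inv_x[j] / denom
        for i in range(n):
            self.b[i] += reward * x[i]


# Toy run: one query context, one feedback event per model.
arms = {"gpt-4": LinUCBArm(2), "gpt-3.5-turbo": LinUCBArm(2)}
x = [1.0, 0.0]
arms["gpt-3.5-turbo"].update(x, 1.0)  # user gave a thumbs up
arms["gpt-4"].update(x, 0.0)          # user gave a thumbs down
choice = max(arms, key=lambda name: arms[name].ucb(x))
```

In a full router the `ucb` value would be combined with the cost and latency terms rather than used alone; it is isolated here to show how feedback shifts the estimate.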

Code Example

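# Core MixLLM routing logic (simplified sketch)
import numpy as np
from sklearn.ensemble import RandomForestRegressor

class MixLLMRouter:
    def __init__(self, llm_candidates, lambda_=1.4, alpha=0.01, beta=0.1):
        """
        llm_candidates: [{'name': 'gpt-4', 'price_in': 0.03, 'price_out': 0.06}, ...]
        lambda_: quality vs. cost priority (larger values favor quality)
        alpha: weight of the uncertainty (exploration) bonus
        beta: weight of the latency penalty
        """
        self.llms = llm_candidates
        self.lambda_ = lambda_
        self.alpha = alpha
        self.beta = beta
        # One independent quality predictor per LLM
        self.quality_predictors = {llm['name']: RandomForestRegressor() for llm in llm_candidates}
        # Current waiting time per LLM in seconds; update from live measurements
        self.waiting_times = {llm['name']: 0.0 for llm in llm_candidates}

    def fit(self, llm_name, embeddings, qualities):
        # Train one LLM's predictor on (query embedding, observed quality) pairs.
        # Every predictor must be fitted before route() is called.
        self.quality_predictors[llm_name].fit(embeddings, qualities)

    def score(self, query_embedding, llm_name, predicted_quality, predicted_cost):
        # 1. Quality-cost trade-off score
        lam = self.lambda_
        s_trade = (lam / (lam + 1)) * predicted_quality - (1 / (lam + 1)) * predicted_cost

        # 2. Uncertainty bonus (exploration)
        # A real implementation would use a LinUCB-style inverse covariance matrix
        s_unc = 0.01  # simplified constant

        # 3. Latency penalty (long waiting times suppress selection)
        gamma, xi, tau = 0.1, 0.8, 30.0
        w = self.waiting_times[llm_name]
        s_pen = np.exp(gamma * (w - xi * tau))

        return s_trade + self.alpha * s_unc - self.beta * s_pen

    def route(self, query_embedding):
        scores = {}
        for llm in self.llms:
            name = llm['name']
            # Predict quality and cost for this LLM
            pred_quality = self.quality_predictors[name].predict([query_embedding])[0]
            # Simplified cost estimate from the listed prices, assuming they are
            # per 1K tokens and the request uses 1K input and 1K output tokens.
            # MixLLM itself uses a learned per-LLM cost predictor.
            pred_cost = llm['price_in'] + llm['price_out']
            scores[name] = self.score(query_embedding, name, pred_quality, pred_cost)

        # Select the LLM with the highest score
        best_llm = max(scores, key=scores.get)
        return best_llm

# Usage example
llms = [
    {'name': 'gpt-4', 'price_in': 0.03, 'price_out': 0.06},
    {'name': 'gpt-3.5-turbo', 'price_in': 0.001, 'price_out': 0.002},
    {'name': 'llama-3.1-70b', 'price_in': 0.0009, 'price_out': 0.0009},
]
router = MixLLMRouter(llms, lambda_=1.4)
# Fit each predictor on your own (embedding, quality) data first, then route:
# router.fit('gpt-4', train_embeddings, train_qualities_gpt4)
# selected_llm = router.route(query_embedding)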
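
To see how the latency penalty redirects traffic, the sketch below evaluates the same score formula as the router above on hand-picked numbers. The quality, cost, and waiting-time values are illustrative assumptions, not measurements, and the uncertainty bonus is left out for brevity. The function names `latency_penalty` and `final_score` are hypothetical.

```python
import math


def latency_penalty(wait_s, gamma=0.1, xi=0.8, tau=30.0):
    """exp(gamma * (wait - xi * tau)): equals 1 at wait = xi * tau and grows exponentially beyond it."""
    return math.exp(gamma * (wait_s - xi * tau))


def final_score(quality, cost, wait_s, lam=1.4, beta=0.1):
    """Quality-cost trade-off minus the weighted latency penalty (no exploration term)."""
    trade = (lam / (lam + 1)) * quality - (1 / (lam + 1)) * cost
    return trade - beta * latency_penalty(wait_s)


# A strong, expensive model with an empty queue vs. a 45-second queue,
# and a weaker, cheaper fallback with an empty queue.
idle = final_score(quality=0.9, cost=0.09, wait_s=0.0)
busy = final_score(quality=0.9, cost=0.09, wait_s=45.0)
fallback = final_score(quality=0.7, cost=0.003, wait_s=0.0)

print(f"strong model, idle: {idle:.3f}")
print(f"strong model, busy: {busy:.3f}")
print(f"cheap fallback:     {fallback:.3f}")
```

With these values the strong model wins while idle, but once its queue passes the `xi * tau` threshold the penalty dominates and the fallback scores higher, which is the load-shedding behavior described in the How to Apply section.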

Terminology

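Contextual Bandit: A reinforcement learning technique named after casino slot machines (bandits). It looks at the context (the query content), picks one of several options (LLMs), and learns to maximize the reward (response quality).

LLM Routing: A technique that automatically dispatches each incoming question to the AI model best able to answer it. Similar to sending an order to the right chef depending on the type of cuisine.

Continual Learning: Continuing to learn from new data and feedback after deployment. Like continuing to grow through work experience after graduating from school.

InsTag: A model that automatically attaches tags to LLM instructions. It expresses a query's topic and characteristics as tags, much as a librarian puts a classification label on each book.

t-SNE: A technique for visualizing high-dimensional data (embeddings) on a 2D plane. It compresses the data so that similar points end up close together, making clusters visible to the eye.

OOD (Out-of-Domain): A situation where the input comes from a new domain that was absent from the training data. Similar to suddenly asking a chef trained only in Korean food to cook Mexican dishes.

Policy Gradient: A reinforcement learning method that adjusts the probability with which a neural network selects each action. If the outcome is good, the probability of that choice is raised; if it is bad, the probability is lowered.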

Original Abstract

Large Language Models (LLMs) exhibit potential artificial generic intelligence recently, however, their usage is costly with high response latency. Given mixed LLMs with their own strengths and weaknesses, LLM routing aims to identify the most suitable model for each query in the stream to maximize response quality and minimize cost and latency. However, the challenges involve: (1) dynamic trade-offs among quality, cost, and latency; (2) enabling continual learning in deployed systems; and (3) navigating a varying (e.g., new LLM addition or old LLM removal) set of LLM candidates over time. To bridge these gaps, we develop MixLLM, a dynamic contextual-bandit-based routing system for query-LLM assignment. Specifically, we first leverage query tags to enhance query embeddings for the routing task. Next, we design lightweight prediction models to estimate the response qualities and costs of queries over LLMs. We then devise a meta-decision maker to choose the query-LLM assignments to best tradeoff response quality, cost, and latency. Finally, the system benefits from continual training, allowing it to adapt to evolving queries and user feedback over time. Our extensive experiments show that MixLLM achieves the best trade-offs in response quality, cost, and latency (97.25% of GPT-4's quality at 24.18% of the cost under the time constraint).