When Does Combining Language Models Help? A Co-Failure Ceiling on Routing, Voting, and Mixture-of-Agents Across 67 Frontier Models | AI Paper Digest

TL;DR Highlight

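Even when you combine multiple LLMs, the rate at which every model is wrong at the same time (β) sets the performance ceiling, and the pairwise correlation coefficient (ρ) that the industry relies on does not predict that ceiling.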

Who Should Read

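MLOps and backend developers who ensemble several models such as GPT-5, Claude, and Gemini, or who design routing layers. AI architects who need to judge how much a model combination actually helps compared with the single best model.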

Core Mechanics

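The accuracy ceiling of any model combination (routing, voting, cascade, MoA) is set by β, the rate at which every model is wrong on the same query. No policy whose output is one of the member models' answers can exceed 1−β.
The pairwise error correlation ρ that the industry uses to decide whether an ensemble will help cannot identify β mathematically. Error laws with the same ρ can have completely different β (Prop. 3).
Across 67 models (including GPT-5.5, Claude Opus 4.8, Gemini 3.1 Pro, Grok-4.3, DeepSeek V4, Qwen3.7-Max, and Kimi K2.7), even a single-factor copula calibrated with tetrachoric (latent-variable) correlations underprices the actual co-failure tail by about 2.5×.
On MATH-500, β=0.052, whereas the single-factor copula predicts 0.021 and even the full 67×67 Gaussian copula predicts 0.023, less than half the observed value. The gap widens as the number of models grows.
Task format determines the regime. Asking the exact same GPQA-Diamond questions as multiple choice gives β≈0, but removing the answer options and asking for free responses raises β sharply to 0.127.
At matched quality, a low-ρ heterogeneous ensemble beats high-ρ Self-MoA (repeated sampling from the same model). Naively mixing models of unequal quality, however, ends up worse than the best single model.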
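The non-identifiability claim (Prop. 3) can be illustrated with a small toy case. The sketch below is a textbook-style construction with three hypothetical models, not necessarily the one used in the paper: three error laws share the same per-model error rate and the same pairwise correlation, yet differ in β. All names in the code are illustrative assumptions.

```python
from fractions import Fraction
from itertools import combinations, product
from math import sqrt

N_MODELS = 3  # hypothetical toy pool, not the paper's 67 models
PATTERNS = list(product((0, 1), repeat=N_MODELS))  # 1 = that model is wrong


def uniform_law(patterns):
    """Put equal probability on each listed error pattern."""
    return {pat: Fraction(1, len(patterns)) for pat in patterns}


LAWS = {
    "independent": uniform_law(PATTERNS),
    "even parity": uniform_law([p for p in PATTERNS if sum(p) % 2 == 0]),
    "odd parity": uniform_law([p for p in PATTERNS if sum(p) % 2 == 1]),
}


def error_rates(law):
    """Marginal error rate of each model."""
    return [sum(pr for pat, pr in law.items() if pat[i]) for i in range(N_MODELS)]


def pairwise_rho(law):
    """Pearson correlation between the error indicators of each model pair."""
    rates = error_rates(law)
    rhos = []
    for i, j in combinations(range(N_MODELS), 2):
        both_wrong = sum(pr for pat, pr in law.items() if pat[i] and pat[j])
        cov = both_wrong - rates[i] * rates[j]
        var_i = rates[i] * (1 - rates[i])
        var_j = rates[j] * (1 - rates[j])
        rhos.append(float(cov) / sqrt(float(var_i * var_j)))
    return rhos


def beta(law):
    """Probability that every model is wrong at once."""
    return law.get((1,) * N_MODELS, Fraction(0))


for name, law in LAWS.items():
    print(f"{name:12s} rates={error_rates(law)} rho={pairwise_rho(law)} beta={beta(law)}")
```

In all three laws every model errs half the time and every pairwise ρ is 0, but β is 1/8, 0, and 1/4 respectively, so ρ alone cannot tell these cases apart.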

Evidence

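On MATH-500 (67 models, n=330), the empirical β is 0.052, while the tetrachoric-calibrated single-factor prediction is 0.021 and the full-Σ Gaussian copula prediction is 0.023. The observed value is at least about 2.25× larger than either Gaussian copula prediction (k=17 all-wrong events).
The underpricing ratio rises monotonically with pool size k (here k is the number of models, not the all-wrong count), from 1.0 at k=2 to a median of 2.5 at k=67 (5% to 95% band: 2.1 to 2.7). This points to pool size itself, rather than any particular model lineup, as the cause.
A TF-IDF+domain logistic router realizes only 9% of the oracle gain G (the 95% CI includes 0), and an LLM-as-router (GPT-5-mini) realizes 0% of G. Learned routers recover almost none of the oracle gain.
On the same GPQA-Diamond questions, multiple choice gives β≈0 (0/130), but switching to free response gives β=0.127 (10/79, CP [0.062, 0.220]). Changing only the format opens the co-failure tail. The five-LLM judge panel has κ between 0.73 and 0.92.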
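The Clopper-Pearson (CP) intervals quoted above can be checked without any third-party dependency. The following is a minimal sketch, assuming a two-sided 95% interval and solving the exact binomial tail equations by bisection; the function names are illustrative and are not taken from the paper's tooling.

```python
import math


def binom_pmf(i, n, p):
    """Binomial probability P(X = i), computed in log space for stability."""
    if p <= 0.0:
        return 1.0 if i == 0 else 0.0
    if p >= 1.0:
        return 1.0 if i == n else 0.0
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
        + i * math.log(p) + (n - i) * math.log1p(-p)
    )
    return math.exp(log_pmf)


def upper_tail(k, n, p):
    """P(X >= k); increasing in p."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))


def lower_tail(k, n, p):
    """P(X <= k); decreasing in p."""
    return sum(binom_pmf(i, n, p) for i in range(0, k + 1))


def solve(f, target, increasing, iters=60):
    """Bisection on [0, 1] for f(p) = target, with f monotone in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Two-sided exact interval for a binomial proportion k/n."""
    lower = 0.0 if k == 0 else solve(lambda p: upper_tail(k, n, p), alpha / 2, True)
    upper = 1.0 if k == n else solve(lambda p: lower_tail(k, n, p), alpha / 2, False)
    return lower, upper


# Free-response GPQA-Diamond: 10 all-wrong queries out of 79.
print(clopper_pearson(10, 79))
# Multiple-choice GPQA-Diamond: 0 all-wrong queries out of 130.
print(clopper_pearson(0, 130))
```

If the interval reported in the source is indeed two-sided at 95%, the first call should land close to the quoted [0.062, 0.220]. The second call shows that 0/130 still leaves a nonzero upper bound on β.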

How to Apply

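Before deploying a new multi-model system, run beta_certificate.py (a $0 tool). Count the all-wrong queries K out of n on a benchmark query set you already have, and a Clopper-Pearson confidence bound gives you, at no cost, the maximum gain that no router, vote, or cascade can exceed.
When evaluating ensemble benefit, measure β directly instead of pairwise ρ. Even if ρ looks low, a large β leaves little room for an ensemble to help. On open-ended tasks in particular (math, code, free response), β>0 is almost certain, so check first whether you are in the ceiling-bound regime.
If you use Self-MoA (repeated sampling from the same model), replacing it with a low-ρ heterogeneous ensemble of models with similar accuracy can be expected to improve accuracy by +0.027 on MMLU-Pro (positive in all 60 partitions). Mixing models of differing quality at random, however, lowers performance instead.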
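The quantities used in these steps can all be read off a 0/1 outcome matrix. The sketch below is illustrative only: the matrix, the router choices, and the function name `summarize_pool` are made up for the example, and the oracle gain G is assumed to mean (1 − K/n) minus the single best model's accuracy.

```python
def summarize_pool(outcomes, router_choice=None):
    """outcomes[q][m] is 1 if model m answered query q correctly, else 0."""
    n = len(outcomes)
    n_models = len(outcomes[0])

    # K counts queries on which every model is wrong (an all-zero row).
    K = sum(1 for row in outcomes if not any(row))
    per_model_acc = [sum(row[m] for row in outcomes) / n for m in range(n_models)]
    a_sb = max(per_model_acc)          # single best model's accuracy
    oracle_acc = 1 - K / n             # empirical ceiling 1 - beta
    oracle_gain = oracle_acc - a_sb    # assumed definition of G

    summary = {"K": K, "n": n, "a_sb": a_sb,
               "oracle_acc": oracle_acc, "oracle_gain": oracle_gain}

    if router_choice is not None:
        # router_choice[q] is the index of the model the router picked for query q.
        router_acc = sum(outcomes[q][router_choice[q]] for q in range(n)) / n
        summary["router_acc"] = router_acc
        summary["realized_fraction"] = (
            (router_acc - a_sb) / oracle_gain if oracle_gain > 0 else None
        )
    return summary


# Hypothetical toy data: 8 queries, 3 models.
toy_outcomes = [
    [1, 1, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 0],  # all-wrong query: no selection policy can recover it
    [1, 0, 1],
    [1, 1, 1],
]
toy_router = [0, 0, 1, 0, 0, 0, 0, 0]  # hypothetical router picks

print(summarize_pool(toy_outcomes, toy_router))
```

The resulting K, n, and a_sb are the three inputs that `beta_certificate` expects. Note that a_sb is picked in-sample here, so on a small query set it is an optimistic estimate.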

Code Example

# Core logic of beta_certificate.py (implements Prop. 1)
# Inputs: the all-wrong count K from your outcome matrix, the total number of queries n,
# and the single best model's accuracy asb.
# A Clopper-Pearson lower bound on beta yields an upper limit on the gain
# that no router or ensemble can exceed.

from scipy import stats

def beta_certificate(K: int, n: int, asb: float, delta: float = 0.05) -> float:
    """
    K: number of all-wrong events (queries that every model gets wrong)
    n: total number of queries
    asb: accuracy of the single best model
    delta: significance level (default 0.05, i.e. a one-sided 95% bound)

    Returns: upper bound on the gain of any selection policy (95% confidence by default)
    """
    if not 0 <= K <= n:
        raise ValueError("K must satisfy 0 <= K <= n")

    # One-sided Clopper-Pearson lower bound on beta
    if K == 0:
        beta_lo = 0.0
    else:
        beta_lo = stats.beta.ppf(delta, K, n - K + 1)

    # Upper bound on the achievable gain: (1 - beta_lo) - asb
    max_gain_cert = (1 - beta_lo) - asb

    print(f"All-wrong count K={K}, n={n}")
    print(f"Beta lower bound (Clopper-Pearson): {beta_lo:.4f}")
    print(f"Ceiling: 1 - beta_lo = {1 - beta_lo:.4f}")
    print(f"Single-best accuracy (asb): {asb:.4f}")
    print(f"Max certified gain any policy can achieve: {max_gain_cert:.4f}")

    if max_gain_cert <= 0:
        print("→ Certified: no ensemble or router can beat the single best model ($0 test)")
    else:
        print(f"→ Gain upper bound: {max_gain_cert:.4f} (no ensemble gain above this is possible)")

    return max_gain_cert

# Example: MATH-500, 67 models
# K=17 all-wrong, n=330 queries, single-best=0.836
beta_certificate(K=17, n=330, asb=0.836)

Terminology

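β (co-failure rate): The rate at which every model gets the same question wrong. When this value is large, accuracy cannot exceed 1−β no matter how good the router or ensemble is.

ρ (pairwise error correlation): The correlation coefficient describing how strongly two models tend to get the same questions wrong. A low value is usually taken to mean that ensembling will help; the paper argues that this inference is wrong.

Mixture-of-Agents (MoA): An approach that merges the answers of several LLMs into a final answer. For example, another model synthesizes the answers from GPT, Claude, and Gemini.

Routing: A layer that decides which model each query is sent to, for example easy questions to a cheap model and hard questions to a stronger one.

Cascade: An approach that tries a cheap model first and escalates to a stronger model when the cheap one is not confident. Often used to cut cost.

Clopper-Pearson interval: A statistical method for computing a confidence interval for the probability of a binary event (success/failure) that remains valid for small samples. Here it is used to obtain a confidence interval for the all-wrong rate β.

Tetrachoric correlation: The correlation between the latent continuous variables assumed to underlie two binary (0/1) variables. It is more accurate than a plain Pearson correlation for such data, but is often ignored in LLM evaluation research.

Gaussian copula: A tool for modeling the joint probability distribution of several variables. Just as it underestimated CDO risk during the 2008 financial crisis, the paper shows that it also underestimates the probability of simultaneous LLM failure.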
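To make the single-factor Gaussian copula concrete, the sketch below computes the all-wrong probability it implies. It assumes one shared correlation ρ between every pair of latent scores and uses made-up error rates. It does not reproduce the paper's tetrachoric calibration, and the function name and parameters are illustrative.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def single_factor_all_wrong(error_rates, rho, half_width=8.0, steps=4000):
    """All-wrong probability under a one-factor Gaussian copula.

    Model i errs when sqrt(rho) * F + sqrt(1 - rho) * eps_i falls below
    Phi^{-1}(error_rates[i]), where F is a shared standard normal factor and
    eps_i are independent standard normals. Conditional on F the errors are
    independent, so the joint probability is a one-dimensional integral over
    F, evaluated here with the trapezoid rule.
    """
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must be in [0, 1)")
    thresholds = [STD_NORMAL.inv_cdf(p) for p in error_rates]
    a, b = math.sqrt(rho), math.sqrt(1.0 - rho)
    h = 2.0 * half_width / steps

    total = 0.0
    for i in range(steps + 1):
        f = -half_width + i * h
        conditional = 1.0
        for t in thresholds:
            conditional *= STD_NORMAL.cdf((t - a * f) / b)
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * STD_NORMAL.pdf(f) * conditional
    return total * h


# Hypothetical error rates for a five-model pool.
rates = [0.20, 0.30, 0.25, 0.15, 0.40]
for rho in (0.0, 0.3, 0.6, 0.9):
    print(rho, single_factor_all_wrong(rates, rho))
```

At ρ=0 the result reduces to the product of the error rates, and it grows with ρ while staying below the smallest individual error rate. The paper's finding is that the observed β exceeds what this family predicts even after calibration.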

Related Resources

arXiv link to the paper

Original Abstract

Multi-model LLM systems such as routing, voting, cascades, fusion, and mixture-of-agents are used to beat single-model accuracy. We show that their gain is capped by a quantity the field rarely reports. For any policy whose output is one member model answer, accuracy cannot exceed one minus beta, where beta is the rate at which every model is wrong on the same query. In contrast, the usual diagnostic, average pairwise error correlation rho, cannot identify beta: error laws with identical marginals and pairwise correlations can have different all-wrong rates. A Clopper-Pearson bound on beta gives a finite-sample certificate on the largest gain any router, vote, or cascade could deliver before training a router. Across 67 models from 21 providers, a tetrachoric-calibrated single-factor model still underprices the all-wrong tail: on open-ended mathematics, observed beta is 0.052 versus 0.023 under the full 67-model Gaussian copula, about 2.5 times underpricing, with 90 percent CI 1.7 to 3.4 and k equals 17. The effect recurs on execution-graded code, where beta is 0.079. Re-asking the same GPQA-Diamond questions in free-response rather than multiple-choice form reopens the tail, with beta 0.127 and a five-judge panel with kappa 0.73 to 0.92, locating co-failure in answer format rather than subject. At matched quality, low-rho heterogeneous ensembles beat high-rho Self-MoA, but on checkable tasks in our pool, combining models rarely beats the single best model without a strong query-level routing signal. Gains come from models failing on different questions, not from adding more models.