Mixture-of-Agents: Stacking Multiple LLMs in Layers to Surpass GPT-4o

Mixture-of-Agents Enhances Large Language Model Capabilities

Jun 7, 2024 • Junlin Wang, Jue Wang, Ben Athiwaratkun +2

TL;DR Highlight

An ensemble technique that stacks multiple open-source LLMs in layers and, without any fine-tuning, outperforms GPT-4o on benchmarks such as AlpacaEval 2.0.

Who Should Read

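AI engineers and backend developers who want to improve response quality by combining multiple LLMs, especially those who run production LLM pipelines and need the most performance per unit of cost.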

Core Mechanics

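LLMs exhibit "collaborativeness": a model's response quality improves when it can reference other models' responses, even when those responses come from weaker models.
MoA structure: in each layer, several LLMs (Proposers) generate answers independently → the LLMs in the next layer (Aggregators) synthesize them → this repeats N times.
Using only open-source models, MoA reaches 65.1% on AlpacaEval 2.0, 7.6 percentage points above GPT-4o (57.5%).
It needs no fine-tuning and works purely through the prompt interface, so any LLM can be plugged in.
Model diversity is key: combining different models consistently beats sampling the same model several times (with n=6, multiple proposers score 61.3% vs. 56.7% for a single proposer).
WizardLM excels as a Proposer but is a poor fit as an Aggregator: the qualities that matter differ by role.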
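The layered flow described above can be sketched offline with stub agents. This is a minimal illustration rather than the paper's implementation: `Agent`, `with_references`, `run_moa`, and `make_stub` are hypothetical names, and the prompt wording is a placeholder for the real Aggregate-and-Synthesize prompt.

```python
from typing import Callable, List

# An "agent" here is any function mapping a prompt string to a response string.
# Real deployments would wrap LLM API calls; these stubs are hypothetical.
Agent = Callable[[str], str]

def with_references(query: str, references: List[str]) -> str:
    """Build the prompt a layer sees: the original query plus prior-layer outputs."""
    if not references:
        return query
    numbered = "\n".join(f"{i + 1}. {r}" for i, r in enumerate(references))
    return f"{query}\n\nResponses from previous layer:\n{numbered}"

def run_moa(query: str, proposer_layers: List[List[Agent]], aggregator: Agent) -> str:
    references: List[str] = []
    for layer in proposer_layers:
        prompt = with_references(query, references)
        # Every agent in a layer sees the same prompt and answers independently.
        references = [agent(prompt) for agent in layer]
    # The final aggregator synthesizes the last layer's outputs.
    return aggregator(with_references(query, references))

def make_stub(name: str) -> Agent:
    # Stub agent: reports its name and how many numbered references it received.
    def agent(prompt: str) -> str:
        n_refs = sum(1 for line in prompt.splitlines() if line[:1].isdigit())
        return f"{name} saw {n_refs} refs"
    return agent

layers = [[make_stub("a"), make_stub("b"), make_stub("c")] for _ in range(2)]
final = run_moa("What is MoA?", layers, make_stub("agg"))
print(final)
```

The first layer sees only the query, while every later layer and the aggregator see all three outputs of the layer before them, which is the property the stubs make visible.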

Evidence

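AlpacaEval 2.0 LC win rate: MoA 65.1% and MoA w/ GPT-4o 65.7% vs. GPT-4o 57.5% (gaps of 7.6 and 8.2 percentage points, respectively)
MT-Bench: MoA 9.25 and MoA w/ GPT-4o 9.40 vs. GPT-4 Turbo 9.31 and GPT-4o 9.19
Cost efficiency: MoA-Lite reaches 59.3% on AlpacaEval at the same cost as GPT-4o, about 4% higher than GPT-4 Turbo at less than half the cost
MATH dataset: with Llama-3-70B-Instruct, accuracy moves 0.456 → 0.584 → 0.578 across layers 1 to 3, a large gain at layer 2 that roughly holds at layer 3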
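The percentage-point gaps quoted above are plain differences of LC win rates. The short check below recomputes them from the figures in this summary; the dictionary and the `gap_vs` helper are illustrative names, not part of any benchmark tooling.

```python
# LC win rates (%) on AlpacaEval 2.0 as quoted in this summary.
lc_win_rate = {
    "MoA": 65.1,
    "MoA w/ GPT-4o": 65.7,
    "MoA-Lite": 59.3,
    "GPT-4o": 57.5,
}

def gap_vs(baseline: str) -> dict:
    """Percentage-point gap of every other entry relative to the baseline."""
    base = lc_win_rate[baseline]
    # Round to one decimal to hide floating-point noise in the subtraction.
    return {
        name: round(score - base, 1)
        for name, score in lc_win_rate.items()
        if name != baseline
    }

gaps = gap_vs("GPT-4o")
print(gaps)
```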

How to Apply

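You can start right away with the Together AI API: call Qwen1.5-110B, LLaMA-3-70B, WizardLM-8x22B, Mixtral-8x22B, dbrx, and similar models in parallel as Proposers, then merge their outputs with the Aggregate-and-Synthesize prompt.
If cost is a concern, use the MoA-Lite strategy: two layers plus a cheaper Aggregator (Qwen1.5-72B) can still match or exceed GPT-4o.
Choosing an Aggregator: place a model with strong synthesis ability (Qwen1.5-110B, GPT-4o) in the final layer, and use models like WizardLM, which are good at offering diverse perspectives, only as Proposers.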
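When Proposers are called in parallel in a production pipeline, one slow or failing model should not sink the whole request. The sketch below shows one possible way to tolerate that with `asyncio.gather(..., return_exceptions=True)` and a per-call timeout. It is an assumption about how you might harden the fan-out, not something the paper prescribes, and `fan_out`, `ok`, `broken`, and `slow` are hypothetical stand-ins for real API calls.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

# Hypothetical proposer type: an async function from prompt to response text.
Proposer = Callable[[str], Awaitable[str]]

async def fan_out(prompt: str, proposers: Dict[str, Proposer], timeout_s: float = 30.0) -> List[str]:
    """Call all proposers concurrently and keep only the responses that succeeded."""
    async def guarded(fn: Proposer) -> str:
        # Bound each call so a stalled proposer cannot block the layer forever.
        return await asyncio.wait_for(fn(prompt), timeout=timeout_s)

    results = await asyncio.gather(
        *(guarded(fn) for fn in proposers.values()),
        return_exceptions=True,  # failures come back as exception objects
    )
    return [r for r in results if isinstance(r, str)]

async def ok(prompt: str) -> str:
    return f"answer to: {prompt}"

async def broken(prompt: str) -> str:
    raise RuntimeError("simulated API failure")

async def slow(prompt: str) -> str:
    await asyncio.sleep(1.0)
    return "too late"

survivors = asyncio.run(
    fan_out("hi", {"ok": ok, "broken": broken, "slow": slow}, timeout_s=0.05)
)
print(survivors)
```

Dropping failed responses trades some diversity for availability; whether that is acceptable depends on how many Proposers remain for the Aggregator to work with.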

Code Example

import asyncio
from openai import AsyncOpenAI

# Aggregate-and-Synthesize prompt (Table 1 of the paper)
AGGREGATE_PROMPT = """You have been provided with a set of responses from various open-source models to the latest user query. Your task is to synthesize these responses into a single, high-quality response. It is crucial to critically evaluate the information provided in these responses, recognizing that some of it may be biased or incorrect. Your response should not simply replicate the given answers but should offer a refined, accurate, and comprehensive reply to the instruction. Ensure your response is well-structured, coherent, and adheres to the highest standards of accuracy and reliability.

Responses from models:
{responses}"""

client = AsyncOpenAI(base_url="https://api.together.xyz/v1", api_key="YOUR_KEY")

PROPOSER_MODELS = [
    "Qwen/Qwen1.5-72B-Chat",
    "meta-llama/Llama-3-70b-chat-hf",
    "mistralai/Mixtral-8x22B-Instruct-v0.1",
]
AGGREGATOR_MODEL = "Qwen/Qwen1.5-110B-Chat"

async def get_response(model, messages):
    resp = await client.chat.completions.create(
        model=model, messages=messages, max_tokens=2048
    )
    return resp.choices[0].message.content

async def moa_pipeline(user_query: str, num_layers: int = 2):
    messages = [{"role": "user", "content": user_query}]

    for _ in range(num_layers - 1):
        # Call every Proposer in parallel on the current messages
        responses = await asyncio.gather(
            *[get_response(m, messages) for m in PROPOSER_MODELS]
        )
        # Build the next layer's input: the aggregation prompt goes in as a
        # system message so the user query stays the latest turn
        formatted = "\n".join([f"{i+1}. {r}" for i, r in enumerate(responses)])
        aggregate_content = AGGREGATE_PROMPT.format(responses=formatted)
        messages = [
            {"role": "system", "content": aggregate_content},
            {"role": "user", "content": user_query},
        ]

    # Final layer: the Aggregator synthesizes the last set of responses
    final = await get_response(AGGREGATOR_MODEL, messages)
    return final

# Run
if __name__ == "__main__":
    result = asyncio.run(moa_pipeline("Explain the concept of mixture of experts."))
    print(result)

Terminology

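MoA (Mixture-of-Agents): An approach that stacks multiple LLMs in layers, with each later layer synthesizing the outputs of the layer before it. It resembles a team where several employees write drafts and the team lead produces the final version.

Proposer: An LLM whose role in MoA is to generate the initial responses. Its job is to supply diverse perspectives, so it does not have to be the top-performing model.

Aggregator: An LLM whose role is to take the answers from several Proposers and merge them into one high-quality answer. Synthesis ability matters here, so stronger models are a better fit.

AlpacaEval 2.0: A benchmark that automatically evaluates LLM response quality. It uses GPT-4 as the baseline and reports how often a model's response is preferred over GPT-4's, expressed as a rate (LC win rate).

LC win rate: Length-Controlled win rate. A win-rate metric that removes response-length bias, so that writing longer answers does not confer an advantage.

MoE (Mixture-of-Experts): A technique that places several expert sub-networks inside a single model and activates only some of them depending on the input. MoA extends this idea to the level of whole models.

FLASK: A benchmark that evaluates LLMs on 12 fine-grained skills (correctness, factuality, completeness, and so on), enabling per-skill analysis rather than a single score.

TTFT (Time to First Token): The time a model takes to emit the first token of its response. A drawback of MoA is that TTFT grows as layers are added.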
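Why TTFT grows with depth can be seen from a back-of-the-envelope model: the final Aggregator cannot emit a token until every earlier layer has produced its complete responses. The estimate below is a rough sketch under the assumption that calls within a layer run in parallel; `estimate_moa_ttft` is a hypothetical helper and the latencies are made-up numbers, not measurements.

```python
from typing import List

def estimate_moa_ttft(proposer_layer_latencies_s: List[List[float]], aggregator_ttft_s: float) -> float:
    """Rough TTFT estimate for a layered pipeline with parallel calls per layer.

    Each inner list holds the full-generation latencies of one proposer layer.
    A layer finishes when its slowest proposer finishes, and the final
    aggregator cannot emit a token before every earlier layer is done.
    """
    waiting = sum(max(layer) for layer in proposer_layer_latencies_s)
    return waiting + aggregator_ttft_s

# Made-up latencies in seconds, purely illustrative.
two_layer = estimate_moa_ttft([[4.0, 6.0, 5.0]], aggregator_ttft_s=0.5)
three_layer = estimate_moa_ttft([[4.0, 6.0, 5.0], [5.0, 7.0, 6.0]], aggregator_ttft_s=0.5)
print(two_layer, three_layer)
```

Under this model each extra proposer layer adds the latency of its slowest member, which is why a two-layer configuration such as MoA-Lite is attractive when responsiveness matters.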

Original Abstract

Recent advances in large language models (LLMs) demonstrate substantial capabilities in natural language understanding and generation tasks. With the growing number of LLMs, how to harness the collective expertise of multiple LLMs is an exciting open direction. Toward this goal, we propose a new approach that leverages the collective strengths of multiple LLMs through a Mixture-of-Agents (MoA) methodology. In our approach, we construct a layered MoA architecture wherein each layer comprises multiple LLM agents. Each agent takes all the outputs from agents in the previous layer as auxiliary information in generating its response. MoA models achieves state-of-art performance on AlpacaEval 2.0, MT-Bench and FLASK, surpassing GPT-4 Omni. For example, our MoA using only open-source LLMs is the leader of AlpacaEval 2.0 by a substantial gap, achieving a score of 65.1% compared to 57.5% by GPT-4 Omni.