MCP-Universe: A Comprehensive Evaluation Framework for Benchmarking LLMs Against Real MCP Servers

MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers

Aug 20, 2025 • Ziyang Luo, Zhiqi Shen, Wenzhuo Yang +7

TL;DR Highlight

A benchmark for LLM agents built on real MCP servers has arrived, and even GPT-5 reaches only a 43.7% success rate on it.

Who Should Read

Backend and AI developers who are building MCP-based AI agents or evaluating whether to adopt them. Engineers who want to compare which LLMs hold up in real tool-use scenarios.

Core Mechanics

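GPT-5 (43.7%), Grok-4 (33.3%), Claude-4.0-Sonnet (29.4%): even the top models post low success rates in real MCP environments
Covers 6 domains (location navigation, GitHub repository management, financial analysis, 3D design, browser automation, web search) spanning 11 MCP servers, with 231 tasks in total
Confirms a 'long-context problem': context tokens grow rapidly as the number of interaction steps increases. In Location Navigation, 16 steps exceed 80K tokens
Identifies an 'unknown-tools problem': LLMs often do not know the exact usage of an MCP server. Example: with the Yahoo Finance API, setting start_date and end_date to the same day causes an error
Adding an exploration phase improves performance in some domains: Claude-4.0-Sonnet gains 7.5 percentage points in financial analysis, and GPT-4.1 gains 7.7 percentage points in browser automation
The enterprise agent Cursor actually scores lower than a plain ReAct framework (26.4% vs 29.4%)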
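
The long-context item in the list above follows from how a ReAct-style loop works: each step's prompt re-sends every earlier observation, so input size grows with the step count. A minimal sketch of that growth, assuming a 2,000-token base prompt and a flat 5,000 tokens per tool observation (both are illustrative numbers chosen for this example, not measurements from the paper):

```python
def prompt_tokens_at_step(step, base_tokens=2_000, tokens_per_observation=5_000):
    """Input tokens for the prompt sent at a given 1-indexed step.

    Assumes every earlier observation stays in the context verbatim.
    Both defaults are illustrative assumptions, not figures from the paper.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    return base_tokens + (step - 1) * tokens_per_observation


def total_input_tokens(steps, **kwargs):
    """Cumulative input tokens sent across all steps of one episode."""
    return sum(prompt_tokens_at_step(s, **kwargs) for s in range(1, steps + 1))


if __name__ == "__main__":
    for step in (1, 4, 8, 16):
        print(step, prompt_tokens_at_step(step), total_input_tokens(step))
```

Under these assumptions a single prompt grows linearly with the step count, while the cumulative input across an episode grows quadratically, which is why long tasks become expensive quickly.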

Evidence

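Even the best performer, GPT-5, succeeds on only 43.72% of the 231 tasks; every other model is below 35%
Connecting additional, irrelevant MCP servers lowers performance: Claude-4.0-Sonnet on Location Navigation drops by half, from 22.2% to 11.1%
Format compliance (format evaluators) passes 80 to 98% of the time, but actual content accuracy (static and dynamic evaluators) falls to the 40 to 65% range
o3 is the most efficient at an average of 4.82 steps, while GPT-5 uses 8.22 steps yet has the highest success rate, suggesting that more steps of reasoning tend to pay off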
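
The gap between format compliance and content accuracy is easier to see with the three evaluator kinds side by side. The following is a minimal sketch of the idea only: the function names, the JSON response shape, and the `fetch_ground_truth` callable are hypothetical and are not the MCP-Universe framework's actual API.

```python
import json
from typing import Any, Callable


def format_evaluator(response: str, required_keys: set) -> bool:
    """Pass if the response is a JSON object containing every required key."""
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)


def static_evaluator(response: str, key: str, expected: Any) -> bool:
    """Pass if a time-invariant field matches a fixed, pre-recorded answer."""
    return format_evaluator(response, {key}) and json.loads(response)[key] == expected


def dynamic_evaluator(response: str, key: str,
                      fetch_ground_truth: Callable[[], Any]) -> bool:
    """Pass if a time-sensitive field matches ground truth fetched at evaluation time."""
    return static_evaluator(response, key, fetch_ground_truth())


if __name__ == "__main__":
    answer = '{"ticker": "XOM", "close": 101.5}'  # made-up agent response
    print(format_evaluator(answer, {"ticker", "close"}))
    print(static_evaluator(answer, "ticker", "XOM"))
    # In practice the callable would query a live data source; here it is a stub.
    print(dynamic_evaluator(answer, "close", lambda: 101.5))
```

A response can pass the format check and still fail the static or dynamic check, which is the pattern the numbers above describe.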

How to Apply

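When building an MCP agent, run an exploration phase first so the model learns the tool specs, then perform the actual task; this can improve performance, especially in information retrieval and reasoning domains
Performance drops as more MCP servers are connected, so connect only the servers each task needs (a minimization strategy)
For multi-step tasks where context grows rapidly, try inserting a summarization agent partway through to compress tokens. This can backfire in the browser automation and financial domains, so test per domain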
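
The last two recommendations above can be sketched in a few lines. This is an illustration under stated assumptions: the registry contents and server names are placeholders, and `summarize` stands in for whatever summarization agent or LLM call is used; none of these names come from the paper or the framework.

```python
# Hypothetical registry: server names are placeholders for illustration.
SERVERS_BY_DOMAIN = {
    "location_navigation": ["maps-server"],
    "financial_analysis": ["finance-server", "calculator-server"],
    "web_searching": ["search-server", "fetch-server"],
}


def select_servers(task_domains, registry=SERVERS_BY_DOMAIN):
    """Return only the servers the task's domains need, de-duplicated in first-seen order."""
    selected = []
    for domain in task_domains:
        if domain not in registry:
            raise KeyError(f"no servers registered for domain {domain!r}")
        for server in registry[domain]:
            if server not in selected:
                selected.append(server)
    return selected


def compact_history(history, keep_last, summarize):
    """Replace all but the last `keep_last` entries with one summary entry.

    `summarize` is a caller-supplied function (for example an LLM call)
    that maps a list of strings to a single shorter string.
    """
    if keep_last < 0:
        raise ValueError("keep_last must be >= 0")
    if len(history) <= keep_last:
        return list(history)
    cut = len(history) - keep_last
    older, recent = history[:cut], history[cut:]
    return [summarize(older)] + recent


if __name__ == "__main__":
    print(select_servers(["financial_analysis"]))
    steps = ["obs 1", "obs 2", "obs 3", "obs 4"]
    print(compact_history(steps, keep_last=2,
                          summarize=lambda xs: f"summary of {len(xs)} steps"))
```

Because summarization discards detail, it is worth measuring per domain whether the compacted history still contains what the agent needs, in line with the caveat above.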

Code Example

# MCP-Universe benchmark quick-start example
# GitHub: https://github.com/SalesforceAIResearch/MCP-Universe

# 1. Install
# pip install mcpuniverse

# 2. ReAct + Exploration agent pattern (explored in the paper)
SYSTEM_PROMPT = """
Phase 1 - Exploration:
Before solving the task, freely interact with the available MCP tools.
Try calling each tool with sample inputs to understand:
- Required parameters and their formats
- Valid input ranges and constraints
- Expected output structure

Phase 2 - Exploitation (ReAct):
Now use the tool knowledge you gained to solve the actual task.
Format:
Thought: <reasoning based on observations>
Action: <tool call with correct parameters>
Observation: <tool result>
... repeat until task complete
Final Answer: <structured response>
"""

# 3. Key lesson: watch for constraints such as date parameters
# BAD (triggers an error in the Yahoo Finance MCP server)
# get_historical_stock_prices(ticker='XOM', start_date='2023-12-01', end_date='2023-12-01')

# GOOD
# get_historical_stock_prices(ticker='XOM', start_date='2023-11-30', end_date='2023-12-01')
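
One way to keep an agent from issuing the BAD call above is to normalize the date range before the tool call is made. The helper below is a hypothetical sketch, not part of MCP-Universe or the Yahoo Finance MCP server; it assumes ISO-formatted dates and that moving the start back by one day is acceptable for the task (non-trading days may need further handling).

```python
from datetime import date, timedelta


def widen_same_day_range(start_date: str, end_date: str) -> tuple:
    """Return ISO dates where start is strictly before end.

    If both dates are equal, move the start back one day, mirroring
    the GOOD call above. Raises ValueError if start is after end.
    """
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    if start > end:
        raise ValueError("start_date must not be after end_date")
    if start == end:
        start = end - timedelta(days=1)
    return start.isoformat(), end.isoformat()


if __name__ == "__main__":
    print(widen_same_day_range("2023-12-01", "2023-12-01"))
```

The returned pair can then be passed as start_date and end_date to the tool call.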

Terminology

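MCP: A protocol that connects AI to external tools and data in a standardized way. Like USB-C, it lets any AI plug into any service using the same spec.

ReAct: An agent pattern in which the LLM solves a problem by repeating Thought → Action → Observation, much like a detective reasoning from accumulated clues.

execution-based evaluator: An evaluation method that verifies answers directly with code instead of an LLM judge. Even when real-time data changes, it automatically fetches the latest correct answer for comparison.

long-horizon reasoning: The ability to reach a final goal by calling tools in sequence over many steps. It applies to complex tasks where no single call yields the answer and repeated searching and computation are required.

dynamic evaluator: An automatic evaluator for tasks whose answers change over time (e.g., flight prices, real-time stock prices). It collects the ground truth at evaluation time and compares against it.

LLM-as-a-judge: An evaluation method in which another LLM scores the quality of a model's response. It is fast and cheap, but it suffers from style bias and hallucination, which makes it unsuitable for tasks involving real-time data.

context window: The maximum length of text an LLM can process at once. As a conversation grows, the model may forget earlier content or fail with an error.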

Original Abstract

The Model Context Protocol has emerged as a transformative standard for connecting large language models to external data sources and tools, rapidly gaining adoption across major AI providers and development platforms. However, existing benchmarks are overly simplistic and fail to capture real application challenges such as long-horizon reasoning and large, unfamiliar tool spaces. To address this critical gap, we introduce MCP-Universe, the first comprehensive benchmark specifically designed to evaluate LLMs in realistic and hard tasks through interaction with real-world MCP servers. Our benchmark encompasses 6 core domains spanning 11 different MCP servers: Location Navigation, Repository Management, Financial Analysis, 3D Design, Browser Automation, and Web Searching. To ensure rigorous evaluation, we implement execution-based evaluators, including format evaluators for agent format compliance, static evaluators for time-invariant content matching, and dynamic evaluators that automatically retrieve real-time ground truth for temporally sensitive tasks. Through extensive evaluation of leading LLMs, we find that even SOTA models such as GPT-5 (43.72%), Grok-4 (33.33%) and Claude-4.0-Sonnet (29.44%) exhibit significant performance limitations. In addition, our benchmark poses a significant long-context challenge for LLM agents, as the number of input tokens increases rapidly with the number of interaction steps. Moreover, it introduces an unknown-tools challenge, as LLM agents often lack familiarity with the precise usage of the MCP servers. Notably, enterprise-level agents like Cursor cannot achieve better performance than standard ReAct frameworks. Beyond evaluation, we open-source our extensible evaluation framework with UI support, enabling researchers and practitioners to seamlessly integrate new agents and MCP servers while fostering innovation in the rapidly evolving MCP ecosystem.