MCP-RADAR: A Multi-Dimensional Benchmark for Evaluating Tool Use Capabilities in Large Language Models
TL;DR Highlight
The first benchmark measuring tool-use capabilities of 10 LLMs (GPT-5, Gemini, Claude, etc.) across 6 domains and 507 tasks in an MCP environment.
Who Should Read
Backend/AI engineers developing MCP-based AI agents or comparing LLM tool calling performance. Teams that need benchmark data on how well models handle real tasks like email, file management, and terminal operations.
Core Mechanics
- Closed-source models (GPT-5, Gemini-2.5-Pro) lead significantly in mathematical reasoning, but the gap narrows to under 10 percentage points in web search, where every model scores below 30% accuracy
- GPT-4o achieves an 84.5% DTSR (per-turn tool-call success rate) in file management but only 43% final accuracy, showing that tool execution ability does not equal problem-solving ability
- Atomic tool design (one tool = one function) yields higher accuracy than complex multi-function tools
- ReAct vs Concise system prompts: Gemini benefits from verbose prompts while GPT performs better with concise ones
Evidence
- Gemini-2.5-Pro tops web search accuracy at 29.8% while Llama-4 trails at 0.8%; open-source models average 10.8% vs 20.7% for closed-source
- GPT-4o in file management: 84.5% DTSR vs 43% final accuracy, a gap of over 40 percentage points quantifying the disconnect between tool execution and problem solving
- In mathematical reasoning, Gemini-2.5-Pro leads by a wide margin; all models struggle with multi-step web tasks
How to Apply
- Design MCP tools following the 'atomic tool' principle, where each tool does one thing: LLMs compose simple tools more accurately than they invoke complex multi-function ones.
- Match system prompt verbosity to the target model: Gemini benefits from verbose ReAct-style prompts, while GPT models work better with concise tool descriptions. Test both styles before committing to one.
- Don't assume a high per-turn tool success rate implies high task completion; always evaluate end-to-end task accuracy separately.
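The gap between per-turn tool success and end-to-end accuracy is easy to measure separately. A minimal sketch, assuming a hypothetical transcript format (the field names `tool_calls`, `ok`, and `final_answer_correct` are illustrative, not MCP-RADAR's schema):

```python
# Hypothetical evaluation sketch: per-turn tool success (DTSR-style)
# vs. end-to-end task accuracy. Field names are illustrative.

def per_turn_success_rate(tasks):
    """Fraction of individual tool calls that executed without error."""
    calls = [call for task in tasks for call in task["tool_calls"]]
    return sum(call["ok"] for call in calls) / len(calls)

def end_to_end_accuracy(tasks):
    """Fraction of tasks whose final answer matched the gold answer."""
    return sum(task["final_answer_correct"] for task in tasks) / len(tasks)

tasks = [
    {"tool_calls": [{"ok": True}, {"ok": True}], "final_answer_correct": False},
    {"tool_calls": [{"ok": True}, {"ok": False}], "final_answer_correct": False},
    {"tool_calls": [{"ok": True}], "final_answer_correct": True},
]
# 4 of 5 calls succeeded (0.8), yet only 1 of 3 tasks was actually solved.
```

Tracking both numbers per domain is what surfaces disconnects like GPT-4o's in file management.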
Code Example
# MCP-RADAR style tool definition example (atomic tool principle applied)
# Bad example: too many functions in a single tool
bad_tool = {
    "name": "EmailManager",
    "description": "Send, read, draft, delete, label emails and manage attachments",
    "inputs": ["action", "to", "subject", "body", "labels", "attachments", ...]
}
# Good example: separated atomically
good_tools = [
    {
        "name": "SendEmail",
        "description": "Send a single email to one or more recipients.",
        "inputs": ["to", "subject", "body"]
    },
    {
        "name": "DraftEmail",
        "description": "Save an email as a draft without sending.",
        "inputs": ["to", "subject", "body"]
    },
    {
        "name": "LabelEmail",
        "description": "Add or remove a label from an existing email by message_id.",
        "inputs": ["message_id", "label", "action"]
    }
]
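A side benefit of atomic schemas is that calls become trivial to check mechanically. A minimal sketch of a registry-and-validation step (the `registry` dict and `validate_call` helper are illustrative, not part of MCP-RADAR or the MCP spec):

```python
# Illustrative dispatcher over atomic tool schemas: validate that a call
# supplies exactly the declared inputs before executing it.

good_tools = [
    {"name": "SendEmail", "inputs": ["to", "subject", "body"]},
    {"name": "DraftEmail", "inputs": ["to", "subject", "body"]},
    {"name": "LabelEmail", "inputs": ["message_id", "label", "action"]},
]

registry = {tool["name"]: tool for tool in good_tools}

def validate_call(name, args):
    """Return True iff the tool exists and args match its declared inputs."""
    tool = registry.get(name)
    return tool is not None and set(args) == set(tool["inputs"])
```

With a multi-function tool like `EmailManager`, the valid argument set depends on the `action` value, so this kind of uniform check is no longer possible.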
# Concise system prompt example (outperforms ReAct-style prompts for some models, e.g., GPT)
system_prompt = """
You are a helpful assistant with access to MCP tools.
Rules:
- ALWAYS use tools to complete tasks. Do NOT answer from memory.
- Select the most specific tool for the task.
- Format your final answer as: <answer>[YOUR ANSWER]</answer>
"""
# Evaluation metric calculation example
def compute_cre(token_used, token_min, token_max):
    """Computational Resource Efficiency (lower means more token-efficient)."""
    return (token_used - token_min) / (token_max - token_min + 1e-9)
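A quick sanity check of the CRE formula with made-up token counts:

```python
# Illustrative usage of the CRE metric; token counts are made up.
def compute_cre(token_used, token_min, token_max):
    return (token_used - token_min) / (token_max - token_min + 1e-9)

# A model that used 500 tokens when the best run took 200 and the worst 1000:
cre = compute_cre(500, 200, 1000)  # ~0.375: closer to 0 is more token-efficient
```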
def check_fuzzy_match(pred_tool, pred_args, gt_tool, gt_args):
    """Fuzzy-match accuracy: the tool name and key parameters must both match."""
    return pred_tool == gt_tool and pred_args == gt_args
Original Abstract
As Large Language Models (LLMs) evolve from passive text generators to active reasoning agents capable of interacting with external tools, the Model Context Protocol (MCP) has emerged as a key standardized framework for dynamic tool discovery and orchestration. Despite its widespread industry adoption, existing evaluation methods do not adequately assess tool utilization capabilities under this new paradigm. To address this gap, this paper introduces MCP-RADAR, the first comprehensive benchmark specifically designed to evaluate LLM performance within the MCP framework. MCP-RADAR features a challenging dataset of 507 tasks spanning six domains: mathematical reasoning, web search, email, calendar, file management, and terminal operations. It quantifies performance based on two primary criteria: answer correctness and operational accuracy. To closely emulate real-world usage, our evaluation employs both authentic MCP tools and high-fidelity simulations of official tools. Unlike traditional benchmarks that rely on subjective human evaluation or binary success metrics, MCP-RADAR adopts objective, quantifiable measurements across multiple task domains, including computational resource efficiency and the number of successful tool-invocation rounds. Our evaluation of leading closed-source and open-source LLMs reveals distinct capability profiles and highlights a significant trade-off between accuracy and efficiency. Our findings provide actionable insights for both LLM developers and tool creators, establishing a standardized methodology applicable to the broader LLM agent ecosystem. All implementations, configurations, and datasets are publicly available at https://anonymous.4open.science/r/MCPRadar-B143.