Tool Attention Is All You Need: Eliminating the MCP/Tools Tax with Dynamic Tool Gating and Lazy Schema Loading

TL;DR Highlight

A middleware technique that uses intent-based dynamic filtering to cut, by 95%, the tens of thousands of tokens MCP agents waste on unneeded tool schemas every turn.

Who Should Read

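Backend and AI engineers who are putting LangGraph- or MCP-based AI agents into production and seeing token costs explode or performance degrade from context pollution. Especially relevant for teams running multi-tool agents connected to ten or more MCP servers.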

Core Mechanics

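MCP re-injects the full JSON schema of every connected tool on every conversation turn, so connecting just 4 to 6 servers wastes 15k to 55k tokens per turn. This overhead is called the 'Tools Tax'.
As the Tools Tax accumulates, LLM reasoning quality collapses sharply once context utilization passes 70%. Symptoms include hallucinated tool parameters, confusion between similar tools, and loss of multi-step plans.
Tool Attention is middleware built from three components: (1) an ISO score that embeds the user intent and each tool description with sentence-transformers and computes their similarity, (2) a state-aware gating function that checks preconditions (authentication status, whether a previous tool's output exists, and so on), and (3) a two-phase Lazy Schema Loader that loads full schemas only for the selected top-k tools.
Phase 1 keeps only a short summary (≤60 tokens) of every tool in context at all times; Phase 2 injects the full JSON schema only for the top-k tools that passed gating on that turn. The summary pool recorded an 84% cache hit rate.
If the model calls a tool that gating mistakenly left out, the after_model hook returns a 'tool_not_available' error, which catches the miss deterministically. In 78% of the turns where this gate fired, the model recovered on the next turn.
As a security side effect, the approach also helps defend against Tool Poisoning Attacks (attacks that hijack agent control through malicious tool descriptions). A poisoned tool whose cosine similarity to the user intent is low is gated out automatically.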
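
The scoring and gating steps described above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the bag-of-words `embed` function is a toy stand-in for a sentence-transformers encoder, and `select_tools`, `is_allowed`, `SUMMARIES`, and `state` are hypothetical names introduced here. Only the threshold value and the idea of combining a cosine score with a precondition check come from the text.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; stands in for a real sentence encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def select_tools(query, summaries, is_allowed, threshold=0.28, top_k=2):
    """Keep tools whose score clears the threshold AND whose precondition holds."""
    q = embed(query)
    scored = []
    for tool_id, summary in summaries.items():
        score = cosine(q, embed(summary))
        if score >= threshold and is_allowed(tool_id):
            scored.append((score, tool_id))
    # Highest score first; tool id breaks ties so the order is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [tool_id for _, tool_id in scored[:top_k]]


# Hypothetical three-tool catalog and agent state, for illustration only.
SUMMARIES = {
    "github_search_issues": "search github issues by label assignee and state",
    "slack_post_message": "post a message to a slack channel",
    "db_query": "run sql query against database",
}
state = {"slack_token": None}


def is_allowed(tool_id):
    """State-aware gate: Slack tools require a token in the agent state."""
    if tool_id.startswith("slack_"):
        return state["slack_token"] is not None
    return True


active = select_tools("search github issues with the bug label", SUMMARIES, is_allowed)
print(active)
```

With the Slack token unset, a Slack-related query selects nothing even though its similarity score is high, which is the behavior the state-aware gate is meant to provide.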

Evidence

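On a simulated 120-tool, six-server benchmark, per-turn tool tokens fell from 47,312 to 2,368, a directly measured 95.0% reduction. Effective context utilization rose 3.8x, from 0.24 to 0.91.
Cost comparison on the same tasks (projected, not measured on live agents): versus the naive Full-Schema approach, 86% lower cost, 52% lower P50 latency, and a task success rate 22 pp higher (72% → 94%).
Static Pruning (manually selecting 30 tools) actually lowered the success rate from 72% to 58%, because there is no recovery path once a needed tool has been excluded.
Ablation: removing Lazy Loading caused the largest loss, -10.3 pp in success rate. Replacing the encoder with TF-IDF cost -8.1 pp. Upgrading MiniLM to MPNet gained only +0.4 pp, which does not justify the extra cost.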
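
The headline ratios follow directly from the figures quoted above; a short check of the arithmetic, using only numbers from this section (the variable names are introduced here for illustration):

```python
# Figures quoted in the Evidence section.
naive_tokens = 47_312   # per-turn tool tokens, full-schema injection
gated_tokens = 2_368    # per-turn tool tokens with Tool Attention

reduction = (naive_tokens - gated_tokens) / naive_tokens
utilization_gain = 0.91 / 0.24        # effective context utilization ratio
success_gain_pp = 94 - 72             # projected task success, percentage points

print(f"token reduction:  {reduction:.1%}")
print(f"utilization gain: {utilization_gain:.1f}x")
print(f"success gain:     +{success_gain_pp} pp")
```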

How to Apply

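In a LangGraph agent, plug an IntentRouter into the before_model hook. Index the tool summaries with FAISS, embed the user query on every turn, select only the top-k tools, and inject schemas for those tools alone. Set the threshold θ by sweeping over 100 to 200 (query, correct tool) pairs and picking the point that maximizes F1 (usually 0.22 to 0.32).
Retrieval quality drops sharply when tool names and descriptions are short and uninformative. Using the provided summarize_tool.py utility to rewrite existing MCP tools/list output into user-intent-centered sentences ('Search GitHub issues by label and assignee') improved retrieval F1 by 8 pp and shortened summaries by 63%.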
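Once a multi-server setup grows past about 50 tools, the Full-Schema approach pushes ρ (effective context utilization) below the 70% fracture point. That is the right moment to adopt Tool Attention, and benchmark.py in the GitHub repository measures your current token footprint in under 30 seconds without any API calls.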
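
The threshold sweep mentioned above can be approximated as follows. This is a sketch under stated assumptions: each calibration item is reduced to a similarity score plus a flag saying whether that tool was the correct one, the scores in `PAIRS` are made up for illustration, and `f1_at` and `sweep` are hypothetical helper names rather than functions from the repository.

```python
def f1_at(threshold, scored_pairs):
    """F1 of the rule 'select the tool when score >= threshold'."""
    tp = sum(1 for score, relevant in scored_pairs if score >= threshold and relevant)
    fp = sum(1 for score, relevant in scored_pairs if score >= threshold and not relevant)
    fn = sum(1 for score, relevant in scored_pairs if score < threshold and relevant)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def sweep(scored_pairs, thresholds):
    """Return the threshold with the highest F1 (lowest one wins on ties)."""
    return max(thresholds, key=lambda t: f1_at(t, scored_pairs))


# Made-up (similarity score, is_correct_tool) pairs; a real calibration set
# would hold 100 to 200 labeled queries scored by the actual encoder.
PAIRS = [
    (0.62, True), (0.45, True), (0.33, True), (0.29, True),
    (0.31, False), (0.27, False), (0.21, False), (0.15, False), (0.08, False),
]

# Candidate thresholds 0.10, 0.12, ..., 0.50.
THRESHOLDS = [round(0.10 + 0.02 * i, 2) for i in range(21)]

best = sweep(PAIRS, THRESHOLDS)
print(best, round(f1_at(best, PAIRS), 3))
```

On real data the F1 curve is usually flatter than in this toy set, so it is worth inspecting the values around the maximum rather than trusting a single point.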

Code Example

# Quick start: setting up the Tool Attention middleware
from sentence_transformers import SentenceTransformer
from vector_store import ToolVectorStore
from lazy_loader import LazySchemaLoader
from intent_router import IntentRouter
from tool_attention import ToolAttention
import tiktoken

# 1. Define the tool catalog (write summaries around user intent)
tools = [
    {"id": "github_search_issues", "summary": "Search GitHub issues by label, assignee, and state"},
    {"id": "slack_post_message",   "summary": "Post a message to a Slack channel"},
    {"id": "db_query",             "summary": "Query a database with SQL"},
    # ... all 120 tools
]

# 2. Initialize the components
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
store = ToolVectorStore(dim=384)  # all-MiniLM-L6-v2 produces 384-dimensional vectors
store.add_tools(tools, encoder)

loader = LazySchemaLoader(registry_path="./schemas")  # holds one <tool_id>.json per tool

router = IntentRouter(
    store=store,
    encoder=encoder,
    threshold=0.28,   # calibrated to the F1-maximizing point
    top_k=10
)

enc = tiktoken.get_encoding("cl100k_base")
ta = ToolAttention(
    store=store,
    loader=loader,
    router=router,
    token_counter=lambda s: len(enc.encode(s))
)

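# 3. Run on every turn (before_model hook)
user_query = "Find the Slack messages about last week's CSAT drop and create a Jira ticket"

# Placeholder agent state so the snippet is self-contained; a real agent supplies its own.
agent_state = {"github_token": None}

# Precondition check (e.g., verify authentication state)
def precondition_check(tool_id):
    if "github_write" in tool_id:
        return agent_state.get("github_token") is not None
    return True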

result = ta.before_model(user_query, precondition_check=precondition_check)

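print(f"Phase-1 tokens (summary pool): {result.phase1_tokens}")
print(f"Phase-2 tokens (full schemas): {result.phase2_tokens}")
print(f"Selected tools: {result.active_ids}")
# → injects only ~2.4k tokens instead of the full 47k (figures from the reported benchmark)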

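# 4. Hallucination gate after the model responds
def gate_tool_call(model_response, active_ids):
    """Return a structured error if the model called a tool that was gated out."""
    requested_tool = model_response.get("tool_call")
    error = ta.after_model(active_ids, requested_tool)
    if error:
        # Hand a structured error back to the model; the source reports
        # recovery on the next turn in 78% of such cases.
        return {"error": "tool_not_available", "available": active_ids}
    return None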
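
The `LazySchemaLoader` used above is imported from the project, and its internals are not shown here. As a rough illustration of the two-phase idea only, the following stand-in keeps summaries in memory and reads a tool's full JSON schema from disk the first time it is promoted. The class name `TwoPhaseLoader`, its methods, and the schema contents are assumptions made for this sketch, not the repository's API.

```python
import json
import tempfile
from pathlib import Path


class TwoPhaseLoader:
    """Hypothetical stand-in: summaries always available, schemas read on demand."""

    def __init__(self, registry_path, summaries):
        self.registry = Path(registry_path)
        self.summaries = summaries      # tool_id -> short summary (Phase 1)
        self._cache = {}                # tool_id -> parsed schema
        self.disk_reads = 0

    def phase1(self):
        """Compact summary pool that stays in context on every turn."""
        return [{"id": tid, "summary": text} for tid, text in self.summaries.items()]

    def phase2(self, active_ids):
        """Full schemas, loaded only for the tools that passed gating."""
        schemas = []
        for tool_id in active_ids:
            if tool_id not in self._cache:
                path = self.registry / f"{tool_id}.json"
                self._cache[tool_id] = json.loads(path.read_text(encoding="utf-8"))
                self.disk_reads += 1
            schemas.append(self._cache[tool_id])
        return schemas


# Build a throwaway registry with two placeholder schemas.
registry = Path(tempfile.mkdtemp())
for tool_id in ("db_query", "slack_post_message"):
    schema = {"name": tool_id, "inputSchema": {"type": "object", "properties": {}}}
    (registry / f"{tool_id}.json").write_text(json.dumps(schema), encoding="utf-8")

demo_loader = TwoPhaseLoader(
    registry,
    {"db_query": "Query a database with SQL",
     "slack_post_message": "Post a message to a Slack channel"},
)
loaded = demo_loader.phase2(["db_query"])
demo_loader.phase2(["db_query"])  # second call is served from the in-memory cache
print([s["name"] for s in loaded], demo_loader.disk_reads)
```

Caching parsed schemas in memory is a design choice of this sketch; whether the real loader caches, and at which layer, is not stated in the source.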

Terminology

MCP (Model Context Protocol): A standard interface for connecting AI agents to external tool servers. Like a USB port, it lets any agent plug into any tool server as long as both follow the MCP specification.

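Tools Tax: The unnecessary token cost that builds up because MCP resends the full specification of every connected tool on every conversation turn. A structural waste that accumulates as the conversation gets longer.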

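KV Cache: A cache in GPU memory where a transformer stores previously computed results. The more tool-schema tokens there are, the larger the cache grows, which puts pressure on GPU memory and slows responses.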

ISO Score (Intent-Schema Overlap): A semantic similarity score between the user's intent and a tool description. It measures, as cosine similarity, how close the two sentences are in vector space.

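Lazy Schema Loading: A pattern that loads data only when it is needed. Instead of placing every tool schema in context up front, only the tools likely to be used are loaded on that turn.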

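Tool Poisoning Attack: A security vulnerability in which instructions are hidden inside a malicious tool description, so that the agent acts on the attacker's intent merely by seeing that tool.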

FAISS: A high-speed vector similarity search library created by Facebook. It quickly finds the most similar vectors among thousands to millions of candidates.

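sentence-transformers: An open-source library that converts sentences into vectors that capture their meaning. It is a collection of models trained so that sentences with similar meanings produce vectors pointing in similar directions.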

Original Abstract

The Model Context Protocol (MCP) has become a common interface for connecting large language model (LLM) agents to external tools, but its reliance on stateless, eager schema injection imposes a hidden per-turn overhead (the MCP Tax or Tools Tax) that practitioner reports place between roughly 10k and 60k tokens in typical multi-server deployments. This payload inflates the key-value cache, is associated with reasoning degradation as context utilization approaches published fracture points around 70%, and turns token budgets into a recurring operational cost. We introduce Tool Attention, a middleware-layer mechanism that generalizes the "Attention Is All You Need" paradigm from self-attention over tokens to gated attention over tools. Tool Attention combines (i) an Intent Schema Overlap (ISO) score from sentence embeddings, (ii) a state-aware gating function enforcing preconditions and access scopes, and (iii) a two-phase lazy schema loader that keeps a compact summary pool in context and promotes full JSON schemas only for top-k gated tools. We evaluate on a simulated 120-tool, six-server benchmark whose per-server token counts are calibrated to public audits of real MCP deployments. In this simulation, Tool Attention directly reduces measured per-turn tool tokens by 95.0% (47.3k -> 2.4k) and raises effective context utilization (a token-ratio quantity) from 24% to 91%. End-to-end figures for task success, latency, cost, and reasoning quality are reported as projections derived from the measured token counts combined with published deployment telemetry; they are not measured on live LLM agents, and we mark projected values explicitly throughout. Taken together, the results support a simple thesis: protocol-level efficiency, not raw context length, is a binding constraint on scalable agentic systems. The code for this work is accessible at https://github.com/asadani/tool-attention