MCP server that reduces Claude Code context consumption by 98%
TL;DR Highlight
When MCP tool calls return raw, verbose output, that output consumes the context window quickly. This is a pattern for compressing tool outputs before they reach the LLM.
Who Should Read
Developers building MCP servers or AI agent pipelines where large, verbose tool call outputs waste context window space.
Core Mechanics
- The core problem: MCP tools often return raw, verbose output (full API responses, stack traces, large data blobs) that gets dumped directly into the LLM's context window.
- This wastes context tokens on information the LLM doesn't need, increases latency, and raises costs.
- The solution pattern: add a 'summarizer' or 'filter' layer between tool execution and the context that transforms raw outputs into compact, LLM-relevant summaries before they are inserted.
- For structured data (JSON API responses), extract only the fields the agent actually needs. For errors, keep only the error type and message. For large text, summarize.
- This can be implemented as a middleware layer in the MCP server, or as a post-processing step in the agent loop.
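The filter layer described in this list can be sketched roughly as follows. This is a minimal Python illustration, not the context-mode implementation: the helper names (`filter_fields`, `compress_error`, `compress_tool_output`) are hypothetical, and the rule that a Python-style traceback can be reduced to its last line is an assumption made for the example.

```python
import json


def filter_fields(payload: dict, keep: list[str]) -> dict:
    """Keep only the top-level fields the agent needs (hypothetical helper)."""
    return {key: payload[key] for key in keep if key in payload}


def compress_error(trace: str) -> str:
    """Reduce a stack trace to its last non-empty line.

    Assumption: the final line carries the error type and message,
    as it does in Python tracebacks.
    """
    lines = [line.strip() for line in trace.splitlines() if line.strip()]
    return lines[-1] if lines else ""


def compress_tool_output(raw: str, keep: list[str]) -> str:
    """Middleware step: compact raw tool output before it enters the context."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        # Not JSON: compress tracebacks, pass other text through unchanged.
        if raw.lstrip().startswith("Traceback"):
            return compress_error(raw)
        return raw
    if isinstance(payload, dict):
        # Compact separators avoid spending tokens on whitespace.
        return json.dumps(filter_fields(payload, keep), separators=(",", ":"))
    return raw


# Usage: a verbose API response shrinks to the two fields the agent asked for.
raw_response = json.dumps(
    {"id": 7, "status": "open", "body": "x" * 1000, "headers": {"etag": "abc"}}
)
compact = compress_tool_output(raw_response, keep=["id", "status"])
```

The same function could be called from inside an MCP tool handler or from the agent loop after the tool returns; where it lives is a design choice, not something the source prescribes.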
Evidence
- The author demonstrated context window savings of 60–80% on common tool outputs (API responses, file reads, terminal output) by applying output compression.
- HN commenters shared similar patterns they'd developed — some using a small/fast LLM as the summarizer to avoid adding latency.
- One commenter noted this is essentially the 'perception-action loop' problem in cognitive architectures: you need selective attention, not full sensory input.
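The small-model summarizer pattern the commenters describe might look like the sketch below. It is an illustration under assumptions: `summarize_if_large`, the `threshold_chars` default, and `fake_summarizer` are hypothetical, and the summarizer is passed in as a plain callable so that no particular model API is implied.

```python
from typing import Callable


def summarize_if_large(
    raw: str,
    summarize: Callable[[str], str],
    threshold_chars: int = 2000,  # hypothetical cutoff; tune per tool
) -> str:
    """Pass small outputs through untouched; route large ones to a cheaper summarizer."""
    if len(raw) <= threshold_chars:
        # Small outputs skip the summarizer, so they incur no extra latency.
        return raw
    summary = summarize(raw)
    # Guard against a summarizer that returns something no shorter than its input.
    return summary if len(summary) < len(raw) else raw


def fake_summarizer(text: str) -> str:
    """Stand-in for a call to a small, fast model: keeps the first line only."""
    return text.splitlines()[0]


# Usage: only the oversized output is sent to the summarizer.
large_output = "build failed in step 3\n" + "log line\n" * 1000
short_output = "ok"
summarized = summarize_if_large(large_output, fake_summarizer)
untouched = summarize_if_large(short_output, fake_summarizer)
```

Skipping the summarizer below a size threshold is one way to address the latency concern raised in the comments, since most small outputs never trigger a second model call.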
How to Apply
- When building MCP servers, define an output schema for each tool that specifies what fields are relevant for the agent — filter raw outputs to only those fields.
- For tools that return large text (documentation, file contents), add a max_tokens parameter and truncate with a summary suffix rather than passing the full text.
- Consider using a fast/cheap LLM (like GPT-4o-mini or Haiku) as a summarizer for tool outputs before they reach the main agent model — the cost savings on the main model usually outweigh the summarizer cost.
- Log both the raw tool output and the compressed version during development to tune your compression strategies.
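The truncation and logging steps in this list can be sketched as follows. This is a minimal Python example under stated assumptions: `truncate_to_budget` and `log_compression` are hypothetical names, and the four-characters-per-token ratio is a rough heuristic standing in for a real tokenizer.

```python
import logging

logger = logging.getLogger("tool_output")

CHARS_PER_TOKEN = 4  # assumption: rough heuristic, not a real tokenizer


def truncate_to_budget(text: str, max_tokens: int) -> str:
    """Cut text to an approximate token budget and append a note on what was dropped."""
    max_chars = max_tokens * CHARS_PER_TOKEN
    if len(text) <= max_chars:
        return text
    omitted = len(text) - max_chars
    return f"{text[:max_chars]}\n[truncated: {omitted} characters omitted]"


def log_compression(tool_name: str, raw: str, compressed: str) -> float:
    """Log raw and compressed sizes; return the fraction of characters saved."""
    saved = 1 - len(compressed) / len(raw) if raw else 0.0
    logger.debug(
        "%s: raw=%d chars, compressed=%d chars, saved=%.0f%%",
        tool_name, len(raw), len(compressed), saved * 100,
    )
    return saved


# Usage: truncate a large file read, then record how much was saved.
file_contents = "a" * 100
trimmed = truncate_to_budget(file_contents, max_tokens=10)
savings = log_compression("read_file", file_contents, trimmed)
```

During development the full raw and compressed texts could be logged alongside the sizes, which makes it easier to see whether the compression dropped anything the agent needed.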
Code Example
# MCP-only installation (tool use only)
claude mcp add context-mode -- npx -y context-mode
# Plugin Marketplace installation (includes auto-routing hook + slash command)
/plugin marketplace add mksglu/claude-context-mode
/plugin install context-mode@claude-context-mode