Claude Token Counter, now with model comparisons
TL;DR Highlight
Anthropic’s Claude Opus 4.7 consumes up to 46% more tokens than its predecessor on the same input due to a tokenizer change, effectively raising costs.
Who Should Read
Developers operating services with the Claude API, particularly backend/AI developers considering or already using Opus 4.7 and needing precise cost impact analysis.
Core Mechanics
- Simon Willison’s Claude Token Counter now compares token counts across models, simultaneously supporting Opus 4.7, Opus 4.6, Sonnet 4.6, and Haiku 4.5.
- Claude Opus 4.7 marks Anthropic’s first model to undergo a tokenizer change, potentially converting the same input into 1.0 to 1.35 times more tokens.
- Testing with a system prompt revealed Opus 4.7 generated 1.46 times more tokens than Opus 4.6, exceeding Anthropic’s stated range of 1.35x.
- Despite maintaining the same pricing ($5 per million input tokens, $25 per million output tokens as Opus 4.6), the increased token count results in a real cost increase of over 40%.
- Testing with a high-resolution image (3456x2234 pixels, 3.7MB PNG) showed Opus 4.7 generating 3.01 times more tokens than Opus 4.6, due to enhanced Vision capabilities supporting images up to 2,576 pixels.
- Conversely, smaller images (682x318) showed negligible token differences between Opus 4.7 (314 tokens) and 4.6 (310 tokens), indicating the increase stems from high-resolution support, not the tokenizer itself.
- A 15MB, 30-page text-centric PDF resulted in Opus 4.7 generating 60,934 tokens versus 56,482 for 4.6, a 1.08x difference—a smaller increase than observed with images.
- The token counting API requires a Claude API key and allows pre-checking expected token counts for each model by specifying the model ID.
Evidence
- "Critics labeled the tokenizer change a ‘money grab,’ citing Anthropic’s lack of transparency regarding the reasons or methodology behind the alteration. Technical counterarguments suggest the change could be an intentional design for performance improvements, potentially improving inference quality by breaking down text into more meaningful units. Speculation also arose about replacing the tokenizer with a smaller learning model, similar to Byte Latent Transformer. Data from tokens.billchambers.me/leaderboard shows large-scale comparisons between 4.6 and 4.7, with one user reporting a 40% increase in tokens for their prompts. Practical experience reveals that token costs escalate in agent systems due to re-transmitting the entire context (including previous tool call results) upon timeouts, potentially consuming three times the tokens for a failed API call. Developers are responding by maintaining the default model in Claude CLI as 4.6 and using the `--model claude-opus-4-7` flag only when necessary, and by downsampling high-resolution images before upload."
How to Apply
- "If considering migrating to Opus 4.7, pre-measure the token cost increase for your existing system prompts and representative inputs using Simon Willison’s Claude Token Counter (https://tools.simonwillison.net/claude-token-counter). If upgrading image processing pipelines to Opus 4.7, pre-resize images to 682x318 if high resolution isn’t essential to maintain token costs comparable to Opus 4.6. When using Claude CLI or API, separate models based on task complexity to manage costs, using Sonnet 4.6 or Haiku 4.5 as defaults and specifying `--model claude-opus-4-7` only for complex tasks. For agent systems, monitor tokens at both the token and action levels; track whether side effects actually executed to reduce unnecessary re-attempts and minimize token waste."
Terminology
Related Papers
Jamesob's guide to running SOTA LLMs locally
2천 달러짜리 RTX 3090 한 장부터 4만 달러짜리 RTX PRO 6000 4장 셋업까지, 로컬에서 최신 LLM을 직접 돌리는 방법을 하드웨어 선택·구성·실행 설정까지 통째로 정리한 실전 가이드다.
Faster embeddings: how we rebuilt the ONNX path in Manticore
Manticore Search가 기존 SentenceTransformers/Candle 백엔드를 ONNX Runtime으로 교체해 텍스트 임베딩 생성 속도를 평균 14배 향상시켰다. 별도 모델 서비스 없이 DB 내부에서 직접 임베딩을 처리하는 구조에서 INSERT 속도가 곧 임베딩 속도이기 때문에 이 개선은 실질적인 ingest 처리량 향상으로 직결된다.
Asymmetric Quantization: Near-Lossless Retrieval with 97% Storage Reduction
멀티벡터 검색 모델의 문서 벡터를 1비트 이진값으로 압축하고 쿼리 벡터만 int8로 유지하는 비대칭 양자화 기법으로, 스토리지를 97% 줄이면서 검색 품질 손실을 0.61점(NDCG@10 기준)에 그치게 만든 실제 프로덕션 적용 사례다.
Show HN: Bash4LLM+ – A lightweight, dependency-free Bash wrapper for LLM APIs
Python이나 Node.js 없이 순수 Bash만으로 Groq 등 OpenAI 호환 LLM API를 호출할 수 있는 단일 스크립트 도구로, Termux(Android)를 포함한 모든 Unix 환경에서 동작한다.
Wayfinder Router: deterministic routing of queries between local and hosted LLM
프롬프트의 복잡도를 모델 호출 없이 오프라인으로 점수화해서 간단한 쿼리는 로컬 모델로, 어려운 쿼리는 유료 모델로 자동 라우팅하는 CLI 도구다. LLM 비용을 줄이면서도 응답 품질을 유지하고 싶은 개발자에게 유용하다.
Apple Neural Engine: Architecture, Programming, and Performance
Apple 기기에 내장된 AI 전용 칩인 ANE(Apple Neural Engine)를 리버스 엔지니어링으로 분석한 302페이지짜리 기술 문서로, Core ML 아래 숨겨진 내부 구조와 직접 접근 경로를 처음으로 공개한다.