Claude Token Counter, now with model comparisons
TL;DR Highlight
Anthropic’s Claude Opus 4.7 consumes up to 46% more tokens than its predecessor on the same input due to a tokenizer change, effectively raising costs.
Who Should Read
Developers operating services with the Claude API, particularly backend/AI developers considering or already using Opus 4.7 and needing precise cost impact analysis.
Core Mechanics
- Simon Willison’s Claude Token Counter now compares token counts across models, simultaneously supporting Opus 4.7, Opus 4.6, Sonnet 4.6, and Haiku 4.5.
- Claude Opus 4.7 is Anthropic’s first model to undergo a tokenizer change; the same input can tokenize to between 1.0x and 1.35x as many tokens.
- Testing with a system prompt showed Opus 4.7 producing 1.46 times as many tokens as Opus 4.6, exceeding the upper bound of Anthropic’s stated 1.35x range.
- Despite unchanged pricing ($5 per million input tokens, $25 per million output tokens, the same as Opus 4.6), the higher token counts amount to a real cost increase of over 40%.
- Testing with a high-resolution image (3456x2234 pixels, 3.7MB PNG) showed Opus 4.7 producing 3.01 times as many tokens as Opus 4.6, attributable to enhanced vision support for images up to 2,576 pixels.
- Conversely, a smaller image (682x318 pixels) showed a negligible difference between Opus 4.7 (314 tokens) and 4.6 (310 tokens), indicating the increase stems from high-resolution support rather than the tokenizer itself.
- A 15MB, 30-page text-centric PDF resulted in Opus 4.7 generating 60,934 tokens versus 56,482 for 4.6, a 1.08x difference—a smaller increase than observed with images.
- The token counting API requires a Claude API key and lets you pre-check the expected token count for each model by specifying its model ID.
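A minimal sketch of the pre-check described above, using the Anthropic Python SDK’s `messages.count_tokens` endpoint. The model IDs follow the article’s `claude-opus-4-7` naming and are assumptions, as is the helper name `cost_ratio`; the ratio math at the bottom uses the article’s own PDF measurement.

```python
# Sketch: pre-checking token counts per model, then computing the relative
# cost impact. Model IDs below follow the article's `--model claude-opus-4-7`
# naming and are assumptions, not confirmed identifiers.
import os

def cost_ratio(new_tokens: int, old_tokens: int) -> float:
    """Relative token (and therefore cost) multiplier at identical pricing."""
    return round(new_tokens / old_tokens, 2)

def count_for_models(system_prompt: str, user_text: str, models: list[str]) -> dict[str, int]:
    # Requires a Claude API key; the token counting endpoint does not run inference.
    import anthropic  # pip install anthropic
    client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
    counts = {}
    for model in models:
        resp = client.messages.count_tokens(
            model=model,
            system=system_prompt,
            messages=[{"role": "user", "content": user_text}],
        )
        counts[model] = resp.input_tokens
    return counts

# The article's PDF measurement: 60,934 tokens on 4.7 vs 56,482 on 4.6.
print(cost_ratio(60934, 56482))  # -> 1.08, matching the reported 1.08x
```

Running `count_for_models` on your own system prompt with both model IDs and feeding the two counts into `cost_ratio` gives the per-workload cost multiplier before any migration.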
Evidence
- Critics labeled the tokenizer change a “money grab,” citing Anthropic’s lack of transparency about the reasons and methodology behind it.
- Technical counterarguments suggest the change could be an intentional design for performance, potentially improving inference quality by breaking text into more meaningful units; there is also speculation about replacing the tokenizer with a small learned model, similar to Byte Latent Transformer.
- Data from tokens.billchambers.me/leaderboard shows large-scale comparisons between 4.6 and 4.7, with one user reporting a 40% token increase for their prompts.
- In practice, token costs escalate in agent systems because the entire context (including previous tool call results) is re-transmitted on timeouts, so a single failed API call can consume roughly three times the tokens.
- Developers are responding by keeping 4.6 as the default model in Claude CLI and passing `--model claude-opus-4-7` only when necessary, and by downsampling high-resolution images before upload.
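The agent-system blow-up above comes from re-sending the full context on every retry. A minimal sketch of action-level token accounting that makes this visible; `TokenLedger` and the action name are hypothetical, and the 20,000-token context is an illustrative figure, not from the article.

```python
# Sketch: track token spend per action, so retries that re-transmit the whole
# context show up at the action level rather than hiding in an aggregate total.
from dataclasses import dataclass, field

@dataclass
class TokenLedger:
    """Records input tokens per attempt, keyed by action name (hypothetical helper)."""
    attempts: dict = field(default_factory=dict)

    def record(self, action: str, input_tokens: int) -> None:
        self.attempts.setdefault(action, []).append(input_tokens)

    def total(self, action: str) -> int:
        return sum(self.attempts[action])

ledger = TokenLedger()
for _ in range(3):                 # two timeouts + one success, full context each time
    ledger.record("fetch_report", 20_000)
print(ledger.total("fetch_report"))  # -> 60000: the ~3x blow-up described above
```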
How to Apply
- Before migrating to Opus 4.7, pre-measure the token cost increase for your existing system prompts and representative inputs with Simon Willison’s Claude Token Counter (https://tools.simonwillison.net/claude-token-counter).
- If upgrading an image-processing pipeline to Opus 4.7 and high resolution isn’t essential, pre-resize images to 682x318 to keep token costs comparable to Opus 4.6.
- In Claude CLI or the API, split models by task complexity to manage costs: default to Sonnet 4.6 or Haiku 4.5 and specify `--model claude-opus-4-7` only for complex tasks.
- In agent systems, monitor usage at both the token and action levels, and track whether side effects actually executed to reduce unnecessary re-attempts and minimize token waste.
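The pre-resize step can be sketched as below. The 682x318 bounding box is taken from the article’s small-image comparison, not an official Anthropic threshold; `fit_within` and `downsample` are hypothetical helper names, and Pillow is imported lazily so the size math works without it installed.

```python
# Sketch: downsample images before upload so Opus 4.7 token counts stay close
# to 4.6 (assumes the 682x318 bound from the article's small-image test).

def fit_within(width: int, height: int, max_w: int = 682, max_h: int = 318) -> tuple[int, int]:
    """Target size preserving aspect ratio, never upscaling."""
    scale = min(max_w / width, max_h / height, 1.0)
    return round(width * scale), round(height * scale)

def downsample(path: str, out_path: str, box: tuple[int, int] = (682, 318)) -> None:
    # Lazy import so the pure helper above is usable without Pillow installed.
    from PIL import Image  # pip install pillow
    with Image.open(path) as im:
        im.thumbnail(box)  # in-place, aspect-ratio-preserving, never upscales
        im.save(out_path)

# The article's 3456x2234 PNG would land at:
print(fit_within(3456, 2234))  # -> (492, 318)
```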
Terminology
Related Papers
Training an LLM in Swift, Part 1: Taking matrix mult from Gflop/s to Tflop/s
A detailed walkthrough of implementing matrix multiplication kernels from scratch in Swift on Apple Silicon, optimizing step by step through CPU, SIMD, AMX, and GPU (Metal) to take performance from Gflop/s to Tflop/s. A rare resource for developers who want to build the core computations of LLM training from the ground up, without frameworks, and feel out Apple Silicon’s performance limits.
Removing fsync from our local storage engine
FractalBits shared the design of an SSD-only KV storage engine that removes fsync, achieving roughly 65% higher write performance under identical conditions. The core idea is combining preallocation, O_DIRECT, and a journal aligned to the SSD’s atomic write unit to avoid fsync’s metadata overhead.
Google Chrome silently installs a 4 GB AI model on your device without consent
Google Chrome was found to automatically download the 4GB Gemini Nano model file without user consent, re-downloading it even after deletion. Concerns have been raised about potential GDPR violations and the environmental cost of applying this across billions of devices.
How OpenAI delivers low-latency voice AI at scale
OpenAI redesigned its WebRTC stack to serve real-time voice AI to over 900 million users, detailing the design decisions and trade-offs of a relay + transceiver split architecture.
Efficient Test-Time Inference via Deterministic Exploration of Truncated Decoding Trees
Deterministic Leaf Enumeration (DLE) cuts self-consistency’s redundant sampling by deterministically exploring a tree of possible sequences, simultaneously improving math/code reasoning performance and speed.