Running Gemma 4 locally with LM Studio's new headless CLI and Claude Code
TL;DR Highlight
This article explains how to run the Google Gemma 4 26B-A4B model locally on macOS using LM Studio 0.4.0's lms CLI and integrate it with Claude Code. Thanks to the MoE architecture, it can run at 51 tok/s on a 48GB MacBook Pro, enabling coding tasks without API costs.
Who Should Read
Developers who want to adopt local models instead of cloud AI due to API costs or data privacy concerns. Specifically, developers who have an Apple Silicon Mac with 48GB or more of memory and are using AI coding tools like Claude Code.
Core Mechanics
- Cloud AI APIs come with rate limits, costs, privacy risks, and network latency; local models avoid these, offering zero API costs, no data leaving the machine, and stable availability.
- Google Gemma 4 is not a single model but a family of four: E2B and E4B (optimized for on-device use), 26B-A4B (MoE), and 31B (Dense). E2B and E4B support audio input, and the 31B Dense model posts the highest benchmark scores at 85.2% on MMLU Pro and 89.2% on AIME 2026.
- The 26B-A4B model uses a Mixture of Experts (MoE) architecture, which selectively activates only a subset of the total parameters: it has 128 routed experts plus 1 shared expert, but activates only 8 experts (about 3.8B parameters) per token. This gives inference costs at the level of a 4B dense model with much higher quality.
- The effective performance of 26B-A4B is estimated at roughly the level of a 10B dense model (sqrt(26B × 4B) ≈ 10.2B); it scores 82.6% on MMLU Pro and 88.3% on AIME 2026, approaching 31B Dense (85.2% and 89.2%). Its Elo score (~1441) is also comparable to models requiring 400B~1000B parameters, such as Qwen 3.5 397B-A17B and Kimi-K2.5.
- On a 14-inch MacBook Pro M4 Pro (48GB unified memory), Gemma 4 26B-A4B operates at 51 tok/s and supports a 256K context window, vision input, native function/tool calling, and configurable thinking modes.
- LM Studio 0.4.0 introduces llmster (a standalone inference engine separated from the desktop app) and the lms CLI, enabling model download, loading, and serving entirely from the terminal without a GUI. It can also be used on headless servers, in CI/CD pipelines, and over SSH sessions.
- Key new features of LM Studio 0.4.0 include the llmster daemon (background service), the lms CLI, parallel request handling (continuous batching of simultaneous requests), a stateful REST API (a /v1/chat endpoint that maintains conversation history), and MCP integration.
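The expert routing described above can be sketched in a few lines of Python. This is an illustrative toy (random gate scores; the expert count and top-k come from the article, while splitting the router into a plain top-k over scores is a simplification), not the actual Gemma 4 router:

```python
import math
import random

N_ROUTED = 128   # routed experts (per the article)
N_SHARED = 1     # always-active shared expert
TOP_K = 8        # experts activated per token (~3.8B active params)

def route(gate_scores, k=TOP_K):
    """Pick the top-k routed experts by gate score. (Softmax is omitted:
    taking top-k over raw scores selects the same experts.)"""
    ranked = sorted(range(len(gate_scores)),
                    key=lambda i: gate_scores[i], reverse=True)
    return ranked[:k]

random.seed(0)
scores = [random.gauss(0, 1) for _ in range(N_ROUTED)]  # stand-in gate logits
active = route(scores)
print(f"active experts per token: {len(active)} routed + {N_SHARED} shared")

# Effective-capacity rule of thumb quoted in the article:
eff = math.sqrt(26e9 * 4e9)
print(f"effective dense-equivalent: ~{eff / 1e9:.1f}B params")
```

Only the 8 selected experts (plus the shared one) run a forward pass per token, which is why the per-token compute tracks a 4B dense model even though all 26B parameters sit in memory.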
Evidence
- "A commenter shared a setup for connecting Gemma 4 26B-A4B to Claude Code using a llama.cpp server on an M1 Max 64GB MacBook. They pointed out that Gemma 4 26B-A4B is about twice as fast at token generation as Qwen3.5 35B-A3B (40 tok/s), but significantly lags behind on the tau2 benchmark (which measures agentic task capability), 68% vs 81%. Therefore, it may not be suitable for heavy agentic tasks that require many tool calls."
How to Apply
- If you want to reduce API costs or avoid sending code/data to external servers, install LM Studio 0.4.0 or later, download and load Gemma 4 26B-A4B with the `lms` CLI, and serve it as an OpenAI-compatible API to replace Claude Code's backend model with a free local model.
- If you have less than 48GB of memory or need faster speeds, remember that all of an MoE model's weights must be loaded into memory even though only a few experts are active per token, and choose your model accordingly. For example, in a 32GB environment, consider Gemma 4 E4B or a smaller quantized variant, or use Ollama or llama.cpp with a lightweight quantized GGUF such as `unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL` to reduce memory usage.
- When applying local models to agentic coding tasks that require many tool calls, keep in mind that Gemma 4 26B-A4B scores only 68% on the tau2 benchmark versus 81% for Qwen3.5 35B-A3B. For such workloads, it is better to evaluate coding-specialized models like Qwen3-coder first.
- If you want to integrate local LLMs into headless servers or CI/CD environments, you can choose to use LM Studio 0.4.0's headless mode (lms CLI + llmster daemon) or directly run a llama.cpp server. llama.cpp can easily launch a server with `llama-server --reasoning auto --fit on -hf <model name> --temp 1.0`.
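A rough way to sanity-check the memory guidance above: Q4_K-class GGUF quantizations store on the order of 4.5 bits per weight (a ballpark figure, not an exact spec), and the whole MoE checkpoint must be resident even though only ~3.8B parameters are active per token. A quick estimate in Python, with the headroom value being a rough assumption for KV cache, runtime, and OS:

```python
def gguf_size_gb(n_params, bits_per_weight=4.5):
    """Rough GGUF file size; Q4_K-class quants average ~4.5 bits/weight."""
    return n_params * bits_per_weight / 8 / 1e9

def fits(n_params, ram_gb, overhead_gb=8.0):
    """Leave headroom for KV cache, runtime, and the OS (rough assumption)."""
    return gguf_size_gb(n_params) + overhead_gb <= ram_gb

weights_gb = gguf_size_gb(26e9)  # ~14.6 GB for 26B params at ~Q4
print(f"Gemma 4 26B-A4B @ ~Q4: ~{weights_gb:.1f} GB of weights")
print(f"fits in 48 GB: {fits(26e9, 48)}, fits in 32 GB: {fits(26e9, 32)}")
```

By this estimate, a Q4 quant of the 26B model squeezes into a 32GB machine with modest context, which matches the article's suggestion to reach for a Q4_K GGUF on smaller machines, while higher-precision quants and long contexts push you toward the 48GB class.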
Code Example
# Run Gemma 4 26B-A4B local server using llama.cpp + Swival
$ llama-server \
--reasoning auto \
--fit on \
-hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL \
--temp 1.0 --top-p 0.95 --top-k 64
# Run Swival agent in a separate terminal
$ uvx swival --provider llamacpp
# If running with Ollama
$ ollama run gemma4:26b
Terminology
Related Papers
Training an LLM in Swift, Part 1: Taking matrix mult from Gflop/s to Tflop/s
A detailed walkthrough of implementing matrix multiplication kernels from scratch in Swift on Apple Silicon, optimizing step by step through CPU, SIMD, AMX, and GPU (Metal) to push performance from Gflop/s to Tflop/s. A rare resource for developers who want to implement the core computations of LLM training from the ground up without frameworks and feel out Apple Silicon's performance limits firsthand.
Removing fsync from our local storage engine
FractalBits shares the design of an SSD-only KV storage engine built without fsync, achieving roughly 65% higher write performance under identical conditions. The core idea is to avoid fsync's metadata overhead by combining preallocation, O_DIRECT, and a journal aligned to the SSD's atomic write unit.
Google Chrome silently installs a 4 GB AI model on your device without consent
Google Chrome was found to automatically download the 4GB Gemini Nano model file without user consent, re-downloading it even after deletion. Concerns have been raised about potential GDPR violations and the environmental cost when this is applied across billions of devices.
How OpenAI delivers low-latency voice AI at scale
OpenAI redesigned its WebRTC stack to serve real-time voice AI to over 900 million users, detailing the design decisions and trade-offs of a relay + transceiver split architecture.
Efficient Test-Time Inference via Deterministic Exploration of Truncated Decoding Trees
Deterministic Leaf Enumeration (DLE) cuts self-consistency’s redundant sampling by deterministically exploring a tree of possible sequences, simultaneously improving math/code reasoning performance and speed.