Running Gemma 4 locally with LM Studio's new headless CLI and Claude Code
TL;DR Highlight
This article explains how to run the Google Gemma 4 26B-A4B model locally on macOS using LM Studio 0.4.0's lms CLI and integrate it with Claude Code. Thanks to the MoE architecture, it can run at 51 tok/s on a 48GB MacBook Pro, enabling coding tasks without API costs.
Who Should Read
Developers who want to adopt local models instead of cloud AI due to API costs or data privacy concerns. Specifically, developers who have an Apple Silicon Mac with 48GB or more of memory and are using AI coding tools like Claude Code.
Core Mechanics
- Cloud AI APIs have issues with rate limits, costs, privacy, and network latency, which local models can replace with the advantages of zero API costs, no data leakage, and stable availability.
- Google Gemma 4 is not a single model but consists of four model families: E2B, E4B (optimized for on-device use), 26B-A4B (MoE), and 31B (Dense). E2B and E4B support audio input, and the 31B Dense model achieves the highest benchmark scores of 85.2% on MMLU Pro and 89.2% on AIME 2026.
- The 26B-A4B model uses the MoE (Mixture of Experts, an architecture that selectively activates only some of the total parameters) approach, having 128 + 1 shared experts, but only activates 8 experts (3.8B parameters) per token. This results in inference costs at the level of a 4B dense model but with much higher quality.
- The effective performance of 26B-A4B is estimated to be around the level of a 10B dense model (sqrt(26B × 4B) ≈ 10B), scoring 82.6% on MMLU Pro and 88.3% on AIME 2026, approaching the performance of 31B Dense (85.2%, 89.2%). Based on Elo scores, it is also comparable to Qwen 3.5 397B-A17B or Kimi-K2.5, which require 400B~1000B parameters (~1441).
- On a 14-inch MacBook Pro M4 Pro (48GB unified memory), Gemma 4 26B-A4B operates at 51 tok/s and supports a 256K context window, vision input, native function/tool calling, and configurable thinking modes.
- LM Studio 0.4.0 introduces llmster (a standalone inference engine separated from the desktop app) and the lms CLI, enabling model download, loading, and serving solely through the terminal without a GUI. It can also be used in headless servers, CI/CD pipelines, and SSH sessions.
- Key new features of LM Studio 0.4.0 include the llmster daemon (background service), lms CLI, parallel request processing (simultaneous requests in consecutive batches), stateful REST API (/v1/chat endpoint, maintaining conversation history), and MCP integration.
Evidence
- "A commenter shared a setup for connecting Gemma 4 26B-A4B to Claude Code using a llama.cpp server on a M1 Max 64GB MacBook. They pointed out that Gemma 4 26B-A4B is about twice as fast at token generation than Qwen3.5 35B-A3B (40 tok/s), but significantly lags behind in the tau2 benchmark (agent task capability measurement) with 68% vs 81%. Therefore, it may not be suitable for heavy agentic tasks that require many tool calls."
How to Apply
- If you want to reduce API costs or avoid sending code/data to external servers, install LM Studio 0.4.0 or later, download and load Gemma 4 26B-A4B with the `lms` CLI, and serve it as an OpenAI-compatible API to replace Claude Code's backend model with a free local model.
- If you have less than 48GB of memory or need faster speeds, adjust your model selection considering that the entire MoE model weights are loaded into memory. For example, in a 32GB environment, consider Gemma 4 E4B or smaller quantized versions, or use Ollama or llama.cpp to reduce memory usage with lightweight quantized GGUF files like `unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL`.
- When applying local models to agentic coding tasks that require many tool calls, keep in mind that Gemma 4 26B-A4B is vulnerable with 68% on the tau2 benchmark compared to Qwen3.5 35B-A3B (81%). It is better to evaluate coding-specialized models like Qwen3-coder first for such purposes.
- If you want to integrate local LLMs into headless servers or CI/CD environments, you can choose to use LM Studio 0.4.0's headless mode (lms CLI + llmster daemon) or directly run a llama.cpp server. llama.cpp can easily launch a server with `llama-server --reasoning auto --fit on -hf <model name> --temp 1.0`.
Code Example
# Run Gemma 4 26B-A4B local server using llama.cpp + Swival
$ llama-server \
--reasoning auto \
--fit on \
-hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL \
--temp 1.0 --top-p 0.95 --top-k 64
# Run Swival agent in a separate terminal
$ uvx swival --provider llamacpp
# If running with Ollama
$ ollama run gemma4:26bTerminology
Related Papers
Jamesob's guide to running SOTA LLMs locally
2천 달러짜리 RTX 3090 한 장부터 4만 달러짜리 RTX PRO 6000 4장 셋업까지, 로컬에서 최신 LLM을 직접 돌리는 방법을 하드웨어 선택·구성·실행 설정까지 통째로 정리한 실전 가이드다.
Faster embeddings: how we rebuilt the ONNX path in Manticore
Manticore Search가 기존 SentenceTransformers/Candle 백엔드를 ONNX Runtime으로 교체해 텍스트 임베딩 생성 속도를 평균 14배 향상시켰다. 별도 모델 서비스 없이 DB 내부에서 직접 임베딩을 처리하는 구조에서 INSERT 속도가 곧 임베딩 속도이기 때문에 이 개선은 실질적인 ingest 처리량 향상으로 직결된다.
Asymmetric Quantization: Near-Lossless Retrieval with 97% Storage Reduction
멀티벡터 검색 모델의 문서 벡터를 1비트 이진값으로 압축하고 쿼리 벡터만 int8로 유지하는 비대칭 양자화 기법으로, 스토리지를 97% 줄이면서 검색 품질 손실을 0.61점(NDCG@10 기준)에 그치게 만든 실제 프로덕션 적용 사례다.
Show HN: Bash4LLM+ – A lightweight, dependency-free Bash wrapper for LLM APIs
Python이나 Node.js 없이 순수 Bash만으로 Groq 등 OpenAI 호환 LLM API를 호출할 수 있는 단일 스크립트 도구로, Termux(Android)를 포함한 모든 Unix 환경에서 동작한다.
Wayfinder Router: deterministic routing of queries between local and hosted LLM
프롬프트의 복잡도를 모델 호출 없이 오프라인으로 점수화해서 간단한 쿼리는 로컬 모델로, 어려운 쿼리는 유료 모델로 자동 라우팅하는 CLI 도구다. LLM 비용을 줄이면서도 응답 품질을 유지하고 싶은 개발자에게 유용하다.
Apple Neural Engine: Architecture, Programming, and Performance
Apple 기기에 내장된 AI 전용 칩인 ANE(Apple Neural Engine)를 리버스 엔지니어링으로 분석한 302페이지짜리 기술 문서로, Core ML 아래 숨겨진 내부 구조와 직접 접근 경로를 처음으로 공개한다.