Ollama is now powered by MLX on Apple Silicon in preview
TL;DR Highlight
Ollama has switched its inference backend on Apple Silicon from llama.cpp to Apple's MLX framework, delivering up to roughly 2x faster inference. On M5 chips it also leverages the GPU Neural Accelerator, bringing meaningful performance gains to coding-agent workflows.
Who Should Read
Developers running coding agents such as Claude Code or Codex, or other local LLMs, on Apple Silicon Macs. Especially relevant for MacBook or Mac Studio users with 32GB or more of unified memory.
Core Mechanics
- Starting with Ollama 0.19, the inference backend on macOS has switched from llama.cpp (GGUF format) to MLX, a machine learning framework created by Apple. MLX is optimized for Apple Silicon's unified memory architecture, taking full advantage of the fact that the CPU, GPU, and Neural Engine share a single pool of memory.
- On M5 Max, the Prefill speed (the rate at which the prompt is processed before the first token is generated) improved by approximately 57%, from 1154 tokens/s in version 0.18 to 1810 tokens/s in version 0.19. Decode speed (the rate at which tokens are generated) improved by approximately 93%, from 58 tokens/s to 112 tokens/s. The test model was Alibaba's Qwen3.5-35B-A3B.
- On M5, M5 Pro, and M5 Max chips, the newly added GPU Neural Accelerator further boosts both TTFT (Time To First Token, the latency before the first response) and token generation speed.
- This update introduces support for NVFP4, a 4-bit floating-point quantization format developed by NVIDIA. It reduces memory usage and on-disk size while preserving model accuracy better than the existing Q4_K_M format. Because NVFP4 is the same format used primarily in cloud production environments, local results can be compared directly with production (a quick on-disk size check is sketched after this list).
- The caching system has been significantly improved. Caches can now be reused across multiple conversations, reducing memory usage and increasing cache-hit rates for tools like Claude Code that share system prompts. The release also adds an 'intelligent checkpoint' feature that saves snapshots at appropriate positions within prompts, and a smarter eviction policy that keeps shared prefixes around longer even after old branches are deleted (a way to observe the prefix cache from the outside is sketched after this list).
- The officially recommended model in this preview release is the Qwen3.5-35B-A3B NVFP4 version, tuned for coding tasks. It requires 32GB or more of unified memory and is primarily targeted at use cases integrating with coding agents like Claude Code or OpenClaw.
- Future updates were teased, including easier ways to import custom fine-tuned models into Ollama and expansion of supported architectures. Currently, only specific architectures take the MLX path.
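As a quick sanity check of the storage claim above, you can pull the NVFP4 and Q4_K_M tags of the same model and compare their on-disk sizes with `ollama list`. This is a minimal sketch: the tag names follow the comparison table in the Code Example section, and the availability of both tags in the preview library is an assumption.
# Pull both quantizations of the same model (tag names as in the table below;
# availability in the preview library is assumed)
ollama pull qwen3.5:35b-a3b-nvfp4
ollama pull qwen3.5:35b-a3b-q4_K_M
# 'ollama list' reports each downloaded model's on-disk size
ollama list | grep qwen3.5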
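One way to see the improved prefix caching from the outside is to send the same long system prompt twice through Ollama's standard HTTP API and compare the reported prompt-evaluation time; the second call should mostly hit the cache. The endpoint and timing fields below are Ollama's regular API, but treat this as a sketch, since the exact checkpoint behavior is internal to the new backend (system.txt stands in for a long system prompt; jq is required).
# Send the same long system prompt twice and compare prompt_eval_duration
# (reported in nanoseconds); the second call should reuse the cached prefix
SYSTEM=$(cat system.txt)
for i in 1 2; do
  curl -s http://localhost:11434/api/chat -d "$(jq -n \
    --arg sys "$SYSTEM" --arg user "Request $i: summarize the repo layout" \
    '{model: "qwen3.5:35b-a3b-coding-nvfp4", stream: false,
      messages: [{role: "system", content: $sys},
                 {role: "user", content: $user}]}')" \
  | jq '{prompt_eval_count, prompt_eval_duration, eval_count}'
done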
Evidence
- "A user shared benchmark results from an M4 Pro + 48GB RAM environment. Comparing the same Qwen3.5-35B-A3B model across formats, NVFP4 (PromptEval 13.2 t/s, Decode 66.5 t/s) was about 2x faster than Q4_K_M (6.6 t/s, 30.0 t/s), while int4 (59.4 t/s, 84.4 t/s) was the fastest overall. However, the user noted they did not verify quality differences. There were observations that LM Studio has supported MLX for a long time, and some users shared experiences where GGUF format consistently produced better benchmark results — not by a large margin, but a difference exists. Some viewed Ollama's MLX adoption as belated. A user running Qwen 70B 4-bit with llama.cpp on an M2 Max 96GB expressed hope that the MLX transition would improve memory handling, while also being curious about actual performance comparisons with GGUF paths and large models — this comment also reconfirmed that Ollama had been using llama.cpp internally. An M4 Max 48GB RAM user reported that even simple queries like 'Hello world' took 6–25 seconds, which appears to be due to the model going through a 'thinking' process rather than the inference itself. Given the nature of coding-focused models, thinking may be enabled by default, so checking the configuration was advised. A majority expressed optimism about the future of on-device LLMs, citing reasons such as privacy, elimination of external API dependencies, distributed data center demand, and power savings. However, some comments pointed out the practical limitation that running coding agents comfortably on a 16GB RAM Mac is still difficult."
How to Apply
- "If you want to use Claude Code integrated with a local LLM on an M1/M2/M3/M4 Mac (32GB or more), you can get started right away by upgrading to Ollama 0.19 and running the command `ollama launch claude --model qwen3.5:35b-a3b-coding-nvfp4`. Compared to version 0.18, the Decode speed is approximately 2x faster, noticeably reducing wait times in agent loops. If you're running RAG or agent workflows on a local Mac that repeatedly process long system prompts or large contexts (50k+ tokens), the new intelligent cache checkpoint feature reduces the cost of repeatedly processing the same prefix — simply upgrading is enough to see the benefit. If you're on a newer chip like M5 Max, prioritize using NVFP4 quantized models. Quality is more stable than int4, and since it's the same format used in cloud production (NVIDIA GPU servers), local test results can be applied directly to production. If you're already using llama.cpp or LM Studio with GGUF models and are satisfied, note that community benchmarks suggest minor quality differences exist, so rather than switching immediately, it's better to compare quality directly with the same model before deciding."
Code Example
# Integrate with Claude Code
ollama launch claude --model qwen3.5:35b-a3b-coding-nvfp4
# Integrate with OpenClaw
ollama launch openclaw --model qwen3.5:35b-a3b-coding-nvfp4
# Chat directly
ollama run qwen3.5:35b-a3b-coding-nvfp4
# Measure performance (using --verbose flag)
ollama run qwen3.5:35b-a3b-nvfp4 "calculate fibonacci numbers in a one-line bash script" --verbose
# Performance comparison by format (based on M4 Pro 48GB)
# Model                    PromptEvalRate  EvalRate
# qwen3.5:35b-a3b-q4_K_M    6.6 t/s        30.0 t/s
# qwen3.5:35b-a3b-nvfp4    13.2 t/s        66.5 t/s
# qwen3.5:35b-a3b-int4     59.4 t/s        84.4 t/s
Terminology
Related Papers
Training an LLM in Swift, Part 1: Taking matrix mult from Gflop/s to Tflop/s
A detailed walkthrough of implementing matrix multiplication kernels from scratch in Swift on Apple Silicon, optimizing step by step across CPU, SIMD, AMX, and GPU (Metal) to push performance from Gflop/s to Tflop/s. A rare resource for developers who want to implement the core operations of LLM training from the ground up without a framework and get a feel for Apple Silicon's performance limits.
Removing fsync from our local storage engine
FractalBits shares the design of an SSD-only KV storage engine built without fsync, achieving roughly 65% higher write performance under identical conditions. The key idea is avoiding fsync's metadata overhead by combining preallocation, O_DIRECT, and a journal aligned to the SSD's atomic write unit.
Google Chrome silently installs a 4 GB AI model on your device without consent
Google Chrome was found to automatically download the 4GB Gemini Nano model file without user consent, and to re-download it even after deletion. Concerns have been raised about a possible GDPR violation and the environmental cost of applying this across billions of devices.
How OpenAI delivers low-latency voice AI at scale
OpenAI redesigned its WebRTC stack to serve real-time voice AI to over 900 million users, detailing the design decisions and trade-offs of a relay + transceiver split architecture.
Efficient Test-Time Inference via Deterministic Exploration of Truncated Decoding Trees
Deterministic Leaf Enumeration (DLE) cuts self-consistency’s redundant sampling by deterministically exploring a tree of possible sequences, simultaneously improving math/code reasoning performance and speed.