Google Gemma 4 Runs Natively on iPhone with Full Offline AI Inference
TL;DR Highlight
Google's open-source model Gemma 4 now runs fully on-device on iPhone, with local inference and no cloud dependency, showing that on-device AI has moved past the experimental stage into practical use.
Who Should Read
iOS/Android developers looking to add AI features to mobile apps or considering edge AI solutions with privacy and offline requirements.
Core Mechanics
- Google's open-source model family, Gemma 4, can now run inference completely locally and offline on iPhone, with no API calls or cloud dependency.
- In benchmarks, the flagship 31B variant in the lineup performed comparably to Qwen 3.5's 27B model, while carrying roughly 4 billion more parameters.
- The lineup also includes lightweight variants designed specifically for mobile deployment: E2B (2 billion parameters) and E4B (4 billion parameters). Google recommends E2B even for its own apps, a choice driven by memory and thermal constraints.
- To get started, download the 'Google AI Edge Gallery' app from the App Store and select the desired model variant. Local inference works immediately, with no additional setup or account required.
- Google AI Edge Gallery is not just a text chat interface but a platform that includes image recognition, voice interaction, and an extensible Skills framework. It is positioned as a foundation for developers to experiment with on-device AI.
- Inference runs on the iPhone's GPU via Metal, and reported response latency is low in practice. On an iPhone 16 Pro, measured figures were a prefill speed of 231 t/s, a decode speed of 16 t/s, and a time to first token of 1.16 seconds (a rough back-of-envelope latency model built from these figures is sketched after this list).
- The ability to operate offline has practical value in enterprise use cases such as field work, medical environments, and settings where data privacy regulations make cloud processing impossible.
- Community members note that the same models can also be run on Android via AICore or llama.cpp.
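
To put the throughput figures above in concrete terms, here is a minimal back-of-envelope sketch in plain Swift (no external dependencies) that turns the reported iPhone 16 Pro numbers — 231 t/s prefill, 16 t/s decode, roughly 1.16 s to first token — into an end-to-end latency estimate for a given prompt and response length. It is an estimate derived from the quoted figures, not a measurement.

```swift
import Foundation

/// Rough latency model built from the reported iPhone 16 Pro figures:
/// prefill 231 tokens/s, decode 16 tokens/s, ~1.16 s time to first token.
struct OnDeviceLatencyEstimate {
    let prefillTokensPerSecond = 231.0
    let decodeTokensPerSecond = 16.0

    /// Time until the first output token appears, dominated by prefilling the prompt.
    func timeToFirstToken(promptTokens: Int) -> Double {
        Double(promptTokens) / prefillTokensPerSecond
    }

    /// Total time to stream a complete response.
    func totalResponseTime(promptTokens: Int, outputTokens: Int) -> Double {
        timeToFirstToken(promptTokens: promptTokens)
            + Double(outputTokens) / decodeTokensPerSecond
    }
}

let estimate = OnDeviceLatencyEstimate()
// A ~250-token prompt prefills in about 1.1 s, close to the reported 1.16 s TTFT.
print(estimate.timeToFirstToken(promptTokens: 250))                       // ≈ 1.08 s
// A 200-token answer then streams for another ~12.5 s at 16 tokens/s.
print(estimate.totalResponseTime(promptTokens: 250, outputTokens: 200))   // ≈ 13.6 s
```

The practical takeaway: at 16 tokens/s decode, response length dominates perceived latency, so mobile UX built on these variants benefits from short, targeted outputs.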
Evidence
- The choice of GPU (Metal) drew criticism. One comment read, 'It seems they gave up on compiling custom attention kernels for Apple's dedicated NPU, the ANE (Apple Neural Engine), and worked around it with Metal.' Metal is easier to port to, but it consumes considerably more battery than the dedicated NPU, and some dismissed the release as a flashy tech demo until an ANE backend is written. (For context, a sketch of how iOS apps normally request the ANE through Core ML follows this list.)
- A developer built and released 'pucky', an offline code-generation app that runs Gemma 4 on iPhone, on GitHub (https://github.com/blixt/pucky). The 4B model is technically runnable but automatically falls back to 2B due to memory constraints; the app generates a single TypeScript file and compiles it with oxc. The author notes that it is unlikely to pass App Store review and has to be built directly from Xcode.
- Several people shared experiences of being blocked by Apple's guideline 2.5.2 when trying to ship apps that bundle local LLMs to the App Store. This points to Apple restricting LLM use within App Store apps and raises a practical concern that distribution paths for on-device AI apps may be limited.
- Gemma 4's model architecture also drew criticism: 'Gemma 4 tends to activate almost all of its weights, resulting in high power consumption.' Commenters pointed out that it is less efficient than Qwen3-coder, whose MoE (Mixture of Experts) design activates only about 3 billion parameters at a time, and concluded that a lot of performance is still being left on the table. (A rough per-token compute comparison is sketched after this list.)
- There were also warnings about the reliability of small models. One user reported that the model confidently gave the wrong answer ('Yes, it is') when asked, 'Is it okay to give an avocado to a dog?' The case is a reminder that using small on-device models directly for medical or safety-related judgments is dangerous.
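
On the ANE-versus-Metal point above: the Gemma port discussed here ships its own Metal backend and does not go through Core ML, but Core ML is how iOS apps normally express a preference for the Neural Engine. The minimal sketch below only illustrates that compute-unit preference; the compiled-model path is a hypothetical placeholder, and whether layers actually land on the ANE depends on operator support, which is exactly the kernel problem the comment describes.

```swift
import CoreML

// Illustration only: the Gemma port above uses its own Metal backend, not Core ML.
// This shows how a Core ML-based app would ask for the Apple Neural Engine.
let config = MLModelConfiguration()
if #available(iOS 16.0, macOS 13.0, *) {
    // Prefer CPU + ANE; layers the ANE cannot run still fall back to CPU.
    config.computeUnits = .cpuAndNeuralEngine
} else {
    // Let Core ML choose among CPU, GPU, and ANE on older systems.
    config.computeUnits = .all
}

// "GemmaBlock.mlmodelc" is a hypothetical compiled-model path for illustration.
let modelURL = URL(fileURLWithPath: "GemmaBlock.mlmodelc")
do {
    let model = try MLModel(contentsOf: modelURL, configuration: config)
    _ = model // run predictions against `model` here
} catch {
    print("Model failed to load: \(error)")
}
```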
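
On the dense-versus-MoE efficiency point, the difference comes down to how many parameters each generated token actually touches. A common rule of thumb puts decode-time compute at roughly 2 FLOPs per active parameter, and energy on a phone scales roughly with that compute. The sketch below applies the rule to the parameter counts quoted in this digest (the 4B mobile variant and the 31B flagship as dense models, versus ~3B active parameters for the MoE model); the rule and the resulting numbers are approximations, not measurements of Gemma or Qwen.

```swift
import Foundation

/// Rule of thumb: decoding one token costs roughly 2 FLOPs per *active* parameter,
/// so a model that activates fewer weights per token does less work (and burns
/// less battery) for each token it emits.
func gflopsPerToken(activeParameters: Double) -> Double {
    2 * activeParameters / 1e9
}

let denseMobile   = gflopsPerToken(activeParameters: 4.0e9)    // dense 4B variant
let denseFlagship = gflopsPerToken(activeParameters: 31.0e9)   // dense 31B flagship
let moeActive     = gflopsPerToken(activeParameters: 3.0e9)    // MoE, ~3B active per token

print("dense 4B   ≈ \(denseMobile) GFLOPs/token")      // ≈ 8
print("dense 31B  ≈ \(denseFlagship) GFLOPs/token")    // ≈ 62
print("MoE ~3B    ≈ \(moeActive) GFLOPs/token")        // ≈ 6
```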
How to Apply
- If you are building enterprise field apps in domains such as healthcare, finance, or defense, where data privacy regulations rule out cloud AI APIs, adopting Google AI Edge Gallery with the Gemma 4 E2B/E4B variants as an on-device inference base can satisfy both regulatory compliance and the need for AI features.
- If you plan to embed AI features in an iOS app, you may run into Apple's guideline 2.5.2 when submitting to the App Store, so review TestFlight or enterprise distribution paths in advance. As community developers have done, building directly from Xcode or sideloading are also workable alternatives.
- When wiring an LLM into a mobile app, model size selection matters. Given memory and thermal limits, E2B is more realistic than E4B; in practice, attempting 4B often falls back automatically to 2B due to memory constraints, so it is safer to design the UX around E2B from the start.
- If battery consumption from Gemma 4 on-device inference is a concern, add logic that limits usage in battery-sensitive scenarios (e.g., minimize background processing, keep contexts short) until the current GPU (Metal) backend gains ANE (Apple Neural Engine) support. A minimal sketch of this kind of device-condition gating, which also covers the E2B/E4B choice from the previous item, follows this list.
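
The following is a minimal sketch, in plain Swift, of the resource-aware gating suggested in the last two items: pick E2B or E4B from installed RAM and skip or defer inference under Low Power Mode or thermal pressure. The 8 GB RAM threshold and the variant identifiers are illustrative assumptions, not figures from Google; the APIs used are standard Foundation (`ProcessInfo.physicalMemory`, `thermalState`, `isLowPowerModeEnabled`).

```swift
import Foundation

enum GemmaVariant: String {
    case e2b = "gemma-4-E2B"   // illustrative identifiers, not official file names
    case e4b = "gemma-4-E4B"
}

/// Pick a model variant from installed RAM. The 8 GB threshold is an assumption
/// for illustration; tune it against real on-device measurements.
func chooseVariant() -> GemmaVariant {
    let ramBytes = ProcessInfo.processInfo.physicalMemory
    return ramBytes >= 8 * 1024 * 1024 * 1024 ? .e4b : .e2b
}

/// Decide whether to run local inference right now, based on battery and heat.
func shouldRunInferenceNow() -> Bool {
    let info = ProcessInfo.processInfo
    // Skip heavy GPU work when the user has enabled Low Power Mode.
    guard !info.isLowPowerModeEnabled else { return false }
    // Back off once the device is already hot; Metal inference will only make it worse.
    switch info.thermalState {
    case .serious, .critical: return false
    default: return true
    }
}

let variant = chooseVariant()
if shouldRunInferenceNow() {
    print("Run \(variant.rawValue) with a short context")
} else {
    print("Defer inference or fall back to a canned response")
}
```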
Terminology
Related Papers
Training an LLM in Swift, Part 1: Taking matrix mult from Gflop/s to Tflop/s
A detailed walkthrough of implementing matrix multiplication kernels by hand in Swift on Apple Silicon, optimizing step by step across CPU, SIMD, AMX, and the GPU (Metal) to take performance from Gflop/s to Tflop/s. A rare resource for developers who want to implement the core operations of LLM training from scratch, without frameworks, and get a feel for Apple Silicon's performance limits.
Removing fsync from our local storage engine
FractalBits shares the design of an SSD-only KV storage engine built without fsync, achieving roughly 65% higher write performance under otherwise identical conditions. The key is a structure that combines preallocation, O_DIRECT, and a journal aligned to the SSD's atomic write unit to avoid fsync's metadata overhead.
Google Chrome silently installs a 4 GB AI model on your device without consent
Google Chrome was found to silently download a 4 GB Gemini Nano model file without user consent, and to re-download it even after deletion. The finding raises questions about possible GDPR violations and the environmental cost of rolling this out to billions of devices.
How OpenAI delivers low-latency voice AI at scale
OpenAI redesigned its WebRTC stack to serve real-time voice AI to over 900 million users, detailing the design decisions and trade-offs of a relay + transceiver split architecture.
Efficient Test-Time Inference via Deterministic Exploration of Truncated Decoding Trees
Deterministic Leaf Enumeration (DLE) cuts self-consistency’s redundant sampling by deterministically exploring a tree of possible sequences, simultaneously improving math/code reasoning performance and speed.