Show HN: Real-time AI (audio/video in, voice out) on an M3 Pro with Gemma E2B
TL;DR Highlight
We open-sourced a real-time multimodal AI speech-and-video conversation system that runs entirely locally on an Apple Silicon M3 Pro, no internet connection required. It is drawing attention for handling speech recognition, video understanding, and TTS simultaneously with zero cloud costs.
Who Should Read
Developers who want to build their own on-device AI voice assistant, and backend/ML engineers who want to run a multimodal pipeline locally without paying for cloud AI APIs.
Core Mechanics
- This project (Parlor) is a real-time multimodal conversation system that takes microphone and camera input and responds with voice, with all processing done entirely on the user's device.
- It uses Google's recently released Gemma 3n E2B model for language understanding and video recognition, with the LiteRT-LM runtime providing GPU acceleration.
- Kokoro is used for TTS (text-to-speech). It operates with the MLX backend on Mac and the ONNX backend on Linux.
- The pipeline streams microphone and camera data from the browser to a FastAPI server over WebSocket, and the server returns the synthesized audio to the browser over the same WebSocket.
- The browser runs Silero VAD (a voice activity detection model) to enable hands-free conversation without push-to-talk, and supports barge-in so the user can interrupt the AI mid-speech (a barge-in sketch appears under Code Example below).
- TTS is streamed sentence by sentence, so audio playback starts before the full response has been generated, cutting perceived latency (see the sketch after this list).
- The motivation was to eliminate server costs and make an English-learning service sustainable. Six months ago an RTX 5090 was needed for real-time processing; now an M3 Pro suffices.
- The developer particularly emphasized Gemma 3n E2B's multilingual support, which lets users freely mix their native language with the language they are learning in one conversation, a feature especially useful for language learning.
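To make the turn loop concrete, here is a minimal sketch of the server side of this pipeline, assuming hypothetical generate_reply() and synthesize() helpers standing in for the LLM and TTS runtimes; Parlor's actual code will differ:

# Sketch: WebSocket turn loop with sentence-by-sentence TTS flushing.
import re
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

async def generate_reply(audio: bytes):
    """Hypothetical: yields response text tokens from the local LLM."""
    yield "Placeholder reply. "

def synthesize(sentence: str) -> bytes:
    """Hypothetical: returns PCM audio for one sentence from the TTS engine."""
    return b""

@app.websocket("/ws")
async def conversation(ws: WebSocket):
    await ws.accept()
    try:
        while True:
            # Browser-side VAD detects end of speech, then sends buffered PCM.
            audio = await ws.receive_bytes()
            # Flush TTS sentence by sentence as LLM tokens arrive, so
            # playback starts before the full response is generated.
            buffer = ""
            async for token in generate_reply(audio):
                buffer += token
                *done, buffer = SENTENCE_END.split(buffer)
                for sentence in done:
                    await ws.send_bytes(synthesize(sentence))
            if buffer.strip():
                await ws.send_bytes(synthesize(buffer))
    except WebSocketDisconnect:
        pass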
Evidence
- "During offline environment testing, a bug was discovered where the page would stop at 'loading...' when localhost was first opened with the internet disconnected. One user reported that it works normally if the page is loaded once while connected to the internet and then the connection is disconnected. I was impressed with the fast performance, including video input, on an M4 Pro 48GB.\n\nA developer who is creating a similar project shared that Gemma 4 E2B is still too heavy despite being E2B, and they are using the Qwen 0.8B model instead. This demonstrates the trade-off between model size and real-time responsiveness is a real barrier.\n\nMultiple comments agreed that the latency of Kokoro TTS is very low, and one developer commented, 'Apple should have used this in Siri,' criticizing Apple for falling behind in technology.\n\nA comment shared information that only the text portion of Gemma 4 E2B can be fine-tuned. They shared their experience of fine-tuning it into an 'AI that talks like a pirate' along with a related video link, and also noted that the TTS portion cannot be fine-tuned.\n\nSome users reported that the voice recognition speed of Gemma E2B does not reach real-time on hardware such as M1 Max (64GB), RTX 5060 Ti (16GB), and Snapdragon 8 Gen 2, and asked for solutions. This suggests that performance may not be guaranteed in environments other than M3 Pro."
How to Apply
- If you need a hands-free workshop assistant or a voice AI for long drives, launch Parlor as a local server and open a browser to get a voice assistant for timers, calculations, memo search, and so on, with no internet and no push-to-talk.
- If you run an English-learning service or a multilingual conversation app and cloud API costs are a burden, you can eliminate server costs by adopting Parlor's structure (FastAPI + WebSocket + Gemma 3n E2B + Kokoro) as an on-device pipeline.
- If you want to steer the Gemma 3n E2B model's response style toward a specific domain, apply text fine-tuning to teach it the desired tone or response patterns. Keep in mind that fine-tuning does not reach the TTS side; it applies only to the text generation stage.
- If Gemma 3n E2B is too heavy for a low-spec environment (M1 or lower, a GPU with 16GB or less, etc.), as the comments note, try swapping in a smaller model like Qwen 0.8B and measure per-turn latency to confirm the trade-off (see the sketch below).
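As a starting point for that comparison, here is a minimal timing harness, assuming a hypothetical run_turn() wrapper around whichever runtime you load each model into; the model identifiers are placeholders, not official names:

# Sketch: measure per-turn latency across candidate models.
import time
import statistics

def run_turn(model_name: str, audio: bytes) -> str:
    """Hypothetical: runs one audio-in/text-out turn on the named model."""
    return ""

def benchmark(model_name: str, audio: bytes, runs: int = 10) -> None:
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        run_turn(model_name, audio)
        latencies.append(time.perf_counter() - start)
    print(f"{model_name}: median {statistics.median(latencies) * 1000:.0f} ms, "
          f"max {max(latencies) * 1000:.0f} ms")

sample = b"\x00" * 32000  # stand-in: 1 s of silent 16 kHz 16-bit PCM
for name in ("gemma-3n-e2b", "qwen-0.8b"):  # hypothetical identifiers
    benchmark(name, sample)

Feed each candidate identical recorded utterances and compare the medians; anything near or above your target turn time rules the model out for real-time use.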
Code Example
# Architecture flow (excerpt from README)
Browser (mic + camera)
│
│ WebSocket (audio PCM + JPEG frames)
▼
FastAPI server
├── Gemma 3n E2B via LiteRT-LM (GPU) → understands speech + vision
└── Kokoro TTS (MLX on Mac, ONNX on Linux) → speaks back
│
│ WebSocket (streamed audio chunks)
▼
Browser (playback + transcript)
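The barge-in mentioned under Core Mechanics could be handled server-side roughly as in this minimal sketch, assuming the browser sends a JSON {"type": "interrupt"} control message when its VAD hears the user speaking over playback; the message shape and helper names are hypothetical, not Parlor's actual protocol:

# Sketch: cancel the in-flight reply when the user barges in.
import asyncio
import json
from fastapi import FastAPI, WebSocket

app = FastAPI()

def synthesize(sentence: str) -> bytes:
    """Hypothetical: returns PCM audio for one sentence from the TTS engine."""
    return b""

@app.websocket("/ws")
async def conversation_with_barge_in(ws: WebSocket):
    await ws.accept()
    cancel = asyncio.Event()

    async def speak(sentences):
        # Send audio sentence by sentence, checking for barge-in in between.
        for s in sentences:
            if cancel.is_set():
                return  # user interrupted: drop the rest of the reply
            await ws.send_bytes(synthesize(s))

    while True:
        msg = json.loads(await ws.receive_text())
        if msg.get("type") == "interrupt":
            cancel.set()  # abandon the current reply
        elif msg.get("type") == "utterance":
            cancel.clear()
            # Reply text would come from the LLM; fixed strings for brevity.
            asyncio.create_task(speak(["Okay.", "Here is the answer."]))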
# Installation and execution (based on README)
git clone https://github.com/fikrikarim/parlor
cd parlor
cp .env.example .env
# Modify necessary settings in .env
pip install -r requirements.txt
uvicorn src.main:app --reload
Related Papers
Training an LLM in Swift, Part 1: Taking matrix mult from Gflop/s to Tflop/s
A detailed walkthrough of implementing matrix multiplication kernels in Swift on Apple Silicon, optimizing step by step across CPU, SIMD, AMX, and GPU (Metal) to take performance from Gflop/s to Tflop/s. A rare resource for developers who want to implement the core operations of LLM training from scratch, without frameworks, and feel out Apple Silicon's performance limits.
Removing fsync from our local storage engine
FractalBits shared how it built an SSD-only KV storage engine without fsync, achieving roughly 65% higher write performance under identical conditions. The core design combines preallocation, O_DIRECT, and a journal aligned to the SSD's atomic write unit to avoid fsync's metadata overhead.
Google Chrome silently installs a 4 GB AI model on your device without consent
Google Chrome was found to auto-download the 4GB Gemini Nano model file without user consent, re-downloading it even after deletion. The discovery has raised possible GDPR violations and concerns about the environmental cost when applied across billions of devices.
How OpenAI delivers low-latency voice AI at scale
OpenAI redesigned its WebRTC stack to serve real-time voice AI to over 900 million users, detailing the design decisions and trade-offs of a relay + transceiver split architecture.
Efficient Test-Time Inference via Deterministic Exploration of Truncated Decoding Trees
Deterministic Leaf Enumeration (DLE) cuts self-consistency’s redundant sampling by deterministically exploring a tree of possible sequences, simultaneously improving math/code reasoning performance and speed.