MobileKernelBench: Can LLMs Write Efficient Kernels for Mobile Devices?

Mar 12, 2026 • Xingze Zou, Jing Wang, Yuhua Zheng +8

TL;DR Highlight

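A study that benchmarks whether LLMs can automatically generate C++ kernels for the mobile inference engine MNN, and that reaches a 93.7% compilation success rate with MoKA, a multi-agent system.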

Who Should Read

Mobile ML engineers responsible for optimizing on-device AI inference, and developers building LLM-based code automation pipelines.

Core Mechanics

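Even recent LLMs such as GPT-5 and Claude-Sonnet-4.5 show compilation failure rates above 54% when generating mobile kernels; the main cause is a lack of MNN-specific framework knowledge.
Neither LoRA fine-tuning nor GRPO (reinforcement learning) improves results much: training data for mobile inference frameworks is so scarce that fine-tuning alone hits a ceiling.
MoKA (a multi-agent system) separates three roles, Coder, Debugger, and Accelerator, and refines kernels iteratively in a plan-and-execute loop.
The Debugger uses tree-sitter to parse compilation errors and maps the repository structure, so it can also fix cross-file dependency errors.
The Accelerator takes on-device profiling results and automatically proposes hardware optimizations such as SIMD vectorization and cache blocking, reaching up to a 6.82x speedup on LayerNorm2D.
MobileKernelBench is released with 190 tasks covering 95 ONNX operators; its paired PyTorch/ONNX format ensures cross-framework compatibility.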
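
The Coder, Debugger, and Accelerator split described above can be pictured as a small control loop. The sketch below is a minimal illustration under stated assumptions, not MoKA's implementation: the `Agents` container, the callable signatures, the `BuildResult` fields, `INITIAL_PLAN`, and the round limits are hypothetical names chosen for this example, and each callable stands in for an LLM call or an on-device build.

```python
from dataclasses import dataclass
from typing import Callable, Optional

INITIAL_PLAN = "write an initial implementation"  # hypothetical first instruction


@dataclass
class BuildResult:
    compiled: bool
    correct: bool
    latency_ms: Optional[float]  # None when the kernel did not run
    log: str = ""


@dataclass
class Agents:
    # Hypothetical interfaces; each callable would wrap an LLM or a build step.
    coder: Callable[[str, str], str]                # (task, plan) -> code
    debugger: Callable[[str, str], str]             # (code, error log) -> repair plan
    accelerator: Callable[[str, float, list], str]  # (code, latency, history) -> plan
    build_and_run: Callable[[str], BuildResult]     # code -> build/verify/profile result


def refine_kernel(task, agents, max_debug_rounds=3, max_opt_rounds=10):
    """Return (best_code, best_latency_ms), or (None, None) if nothing passes."""
    code = agents.coder(task, INITIAL_PLAN)
    best_code, best_latency = None, None
    history = []  # optimisation plans already tried
    for round_idx in range(max_opt_rounds + 1):
        # Debug phase: repair until the kernel compiles and is correct.
        result = agents.build_and_run(code)
        for _ in range(max_debug_rounds):
            if result.compiled and result.correct:
                break
            plan = agents.debugger(code, result.log)
            code = agents.coder(task, plan)
            result = agents.build_and_run(code)
        if not (result.compiled and result.correct):
            break  # give up on this candidate, keep the best so far
        if best_latency is None or result.latency_ms < best_latency:
            best_code, best_latency = code, result.latency_ms
        if round_idx == max_opt_rounds:
            break  # optimisation budget spent
        # Optimise phase: exactly one new plan per round, applied to the best kernel.
        plan = agents.accelerator(best_code, best_latency, history)
        history.append(plan)
        code = agents.coder(task, plan)
    return best_code, best_latency
```

Keeping the best verified kernel separate from the current candidate means a failed or slower optimisation never replaces a working one.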

Evidence

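MoKA compilation success rate (CSR): 93.7%, versus 46.3% for the single-query Claude-Sonnet-4.5 baseline (+47.4 percentage points).
MoKA fast1.5 (share of kernels at least 1.5x faster than the default MNN implementation): 27.4%, versus 4.7% for the baseline and 5.3% for pass@10.
MoKA functional correctness rate (FCR): 75.3%, versus 47.9% for Claude pass@10 (+27.4 percentage points) and 34.2% for a single query (+41.1 percentage points).
LayerNorm2D case: 10 rounds of iterative optimization yield up to a 6.82x speedup and 2.82x on average.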
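
The three metrics above are simple ratios over benchmark tasks. The sketch below shows one way to compute them from a hypothetical per-task record (`TaskResult` and its fields are names invented for this example). It assumes every rate uses the full task set as its denominator and that "at least 1.5x" means `speedup >= 1.5`; the paper's exact bookkeeping may differ. The four sample records are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskResult:
    compiled: bool
    correct: bool
    speedup: Optional[float] = None  # baseline latency / generated-kernel latency


def csr(results):
    """Compilation success rate: fraction of tasks whose kernel builds."""
    return sum(1 for r in results if r.compiled) / len(results)


def fcr(results):
    """Functional correctness rate: fraction of tasks that build and match the reference."""
    return sum(1 for r in results if r.compiled and r.correct) / len(results)


def fast_p(results, p):
    """Fraction of tasks whose kernel is correct and at least p times faster."""
    hits = sum(
        1
        for r in results
        if r.compiled and r.correct and r.speedup is not None and r.speedup >= p
    )
    return hits / len(results)


# Made-up sample: four tasks, three compile, two are correct, one is 2x faster.
sample = [
    TaskResult(compiled=True, correct=True, speedup=2.0),
    TaskResult(compiled=True, correct=True, speedup=1.1),
    TaskResult(compiled=True, correct=False),
    TaskResult(compiled=False, correct=False),
]
print(csr(sample), fcr(sample), fast_p(sample, 1.5))
```

With 190 tasks, the reported rates are consistent with whole-task counts (for example, 178/190 rounds to 93.7%), though that reading is an inference rather than something the summary states.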

How to Apply

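When developing kernels for a mobile inference engine (MNN, NCNN, and so on), a multi-agent loop that separates roles in the order Coder → Debugger → Accelerator can produce much higher-quality code than plain repeated prompting.
When compilation-error debugging needs to be automated, injecting context into the prompt, namely error locations parsed with tree-sitter plus the repository tree, can greatly reduce the LLM's API hallucinations.
When building a performance-optimization agent, constrain the prompt so that each iteration finds "exactly one bottleneck and proposes exactly one optimization"; this shrinks the search space and lets history-based self-reflection work effectively.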
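
The context-injection idea in the second item can be sketched without any parser dependency. The example below is a simplified stand-in, not the paper's tree-sitter pipeline: it uses a regular expression over GCC/Clang-style diagnostics (`file:line:col: error: message`) to locate errors, splits them into local and cross-file groups, and renders a plain repository tree for the prompt. The function names, the dictionary keys, and the sample paths are hypothetical.

```python
import re

# Matches GCC/Clang-style diagnostics such as "src/a.cpp:12:5: error: message".
DIAG_RE = re.compile(
    r"^(?P<file>[^:\n]+):(?P<line>\d+):(?P<col>\d+): error: (?P<msg>.+)$",
    re.MULTILINE,
)


def parse_errors(log):
    """Extract file, line, column, and message for each error in a compiler log."""
    return [
        {
            "file": m["file"],
            "line": int(m["line"]),
            "col": int(m["col"]),
            "msg": m["msg"].strip(),
        }
        for m in DIAG_RE.finditer(log)
    ]


def render_tree(paths):
    """Render '/'-separated relative paths as an indented tree, one entry per line."""
    lines, seen = [], set()
    for path in sorted(paths):
        parts = path.split("/")
        for depth in range(len(parts)):
            prefix = tuple(parts[: depth + 1])
            if prefix in seen:
                continue
            seen.add(prefix)
            is_dir = depth < len(parts) - 1
            lines.append("  " * depth + parts[depth] + ("/" if is_dir else ""))
    return "\n".join(lines)


def build_debug_context(log, repo_paths, current_file):
    """Bundle the repository tree with errors split into local and cross-file groups."""
    errors = parse_errors(log)
    return {
        "repo_tree": render_tree(repo_paths),
        "local_errors": [e for e in errors if e["file"] == current_file],
        "crossfile_errors": [e for e in errors if e["file"] != current_file],
    }
```

The resulting dictionary can be serialized into the Debugger prompt so the model sees where each error sits relative to the file it is allowed to edit.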

Code Example

# MoKA Accelerator prompt example (based on Appendix B.3 of the paper)

accelerator_prompt = """
You are an expert in model deployment, proficient in PyTorch and C++ programming,
and familiar with the coding style of the MNN framework.
Your task is to analyse the performance bottlenecks of the following MNN operator code
and propose optimisation methods to accelerate it.

Then identify **exactly one** highest-impact speed bottleneck,
propose **exactly one** optimisation method and propose a modification plan.

Operator information: {op_info}
Current implementation: {code_book}
Current performance: {performance}
History optimisation info: {history_optmz_info}

Requirements:
- Return **one and only one** optimisation method -- the largest expected speedup.
- Keep fields brief; avoid lists of alternatives, disclaimers, or generic advice.
- Avoid the totally same optimizations that have already been attempted.

Output format (JSON):
{{
  "bottleneck": "<max 100 words>",
  "optimisation_method": "<max 100 words>",
  "modification_plan": "<max 100 words>"
}}
"""

# Debugger compilation-error prompt example
debugger_prompt = """
Operator information: {op_info}
Current implementation: {code_book}
Compilation errors: {compile_error}

Analyze the errors and provide suggestions.
Note:
- Only provide semantic suggestions (in text).
- For cross-file errors, refer to the relevant code snippets and adjust only the current code.
- Do NOT suggest modifications to other MNN framework files.

Output format (JSON):
{{
  "local_error_suggestion": [],
  "crossfile_error_suggestion": []
}}
"""
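
To show how a template like these might be used, the sketch below fills a shortened copy of the Accelerator template with `str.format` and validates a reply. The doubled braces `{{ }}` are what keep the JSON skeleton literal during formatting. The `fill_prompt` and `parse_reply` helpers, the sample values, and the hand-written reply are all hypothetical; no model is called.

```python
import json

# Shortened copy of the Accelerator template; placeholders match the original names.
TEMPLATE = """Operator information: {op_info}
Current implementation: {code_book}
Current performance: {performance}
History optimisation info: {history_optmz_info}

Output format (JSON):
{{
  "bottleneck": "<max 100 words>",
  "optimisation_method": "<max 100 words>",
  "modification_plan": "<max 100 words>"
}}
"""

REQUIRED_KEYS = ("bottleneck", "optimisation_method", "modification_plan")


def fill_prompt(op_info, code_book, performance, history):
    """Fill the template; braces inside the substituted values are left untouched."""
    history_text = "\n".join("- " + item for item in history) or "(none)"
    return TEMPLATE.format(
        op_info=op_info,
        code_book=code_book,
        performance=performance,
        history_optmz_info=history_text,
    )


def parse_reply(reply):
    """Pull the outermost JSON object out of a reply and check the expected keys."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    data = json.loads(reply[start : end + 1])
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError("reply is missing keys: " + ", ".join(missing))
    return {key: str(data[key]) for key in REQUIRED_KEYS}


# Made-up inputs and a hand-written reply, for illustration only.
prompt = fill_prompt("LayerNorm2D", "void run() { }", "1.0 ms", [])
reply = (
    'Sure. {"bottleneck": "scalar inner loop", '
    '"optimisation_method": "SIMD vectorization", '
    '"modification_plan": "process four floats per iteration"}'
)
plan = parse_reply(reply)
```

Appending each accepted `optimisation_method` to the history list before the next call is one way to support the "avoid repeated optimizations" requirement in the prompt.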

Terminology

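MNN (Mobile Neural Network): A deep learning inference engine for mobile devices built by Alibaba. A lightweight framework for running AI models quickly on smartphones.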

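ONNX: A standard file format for exchanging models between deep learning frameworks (PyTorch, TensorFlow, and others). Like the USB standard, it is a common specification that exists for compatibility.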

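Kernel: A low-level piece of code that performs the actual computation on a GPU or CPU. A function that implements an operation such as matrix multiplication or convolution, optimized for the target hardware.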

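GRPO (Group Relative Policy Optimization): A reinforcement-learning training method that uses reward signals to train a model to generate better code. Feedback arrives as a score rather than as the correct answer.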

CSR: Compilation Success Rate. The share of LLM-generated code that builds without compilation errors.

FCR: Functional Correctness Rate. The share of generated code that compiles and also produces correct output values.

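SIMD (Single Instruction, Multiple Data): A CPU technique that processes several data elements with a single instruction, for example adding 8 numbers at once. ARM NEON is the SIMD instruction set used on mobile.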

Android NDK: A development toolkit that lets Android apps use C/C++ code. It is needed to build high-performance native code for mobile.

Original Abstract

Large language models (LLMs) have demonstrated remarkable capabilities in code generation, yet their potential for generating kernels specifically for mobile devices remains largely unexplored. In this work, we extend the scope of automated kernel generation to the mobile domain to investigate the central question: Can LLMs write efficient kernels for mobile devices? To enable systematic investigation, we introduce MobileKernelBench, a comprehensive evaluation framework comprising a benchmark prioritizing operator diversity and cross-framework interoperability, coupled with an automated pipeline that bridges the host-device gap for on-device verification. Leveraging this framework, we conduct extensive evaluation on the CPU backend of Mobile Neural Network (MNN), revealing that current LLMs struggle with the engineering complexity and data scarcity inherent to mobile frameworks; standard models and even fine-tuned variants exhibit high compilation failure rates (over 54%) and negligible performance gains due to hallucinations and a lack of domain-specific grounding. To overcome these limitations, we propose the Mobile Kernel Agent (MoKA), a multi-agent system equipped with repository-aware reasoning and a plan-and-execute paradigm. Validated on MobileKernelBench, MoKA achieves state-of-the-art performance, boosting compilation success to 93.7% and enabling 27.4% of generated kernels to deliver measurable speedups over native libraries.