c-CRAB: A Benchmark for AI Code Review Agents

Code Review Agent Benchmark

Mar 24, 2026 • Yuntong Zhang, Zhiyuan Pan, Imam Nur Bani Yusuf +3

TL;DR Highlight

Evaluates code review agents with executable tests instead of text similarity: Claude Code reaches 32.1% and the four tools combined reach 41.5%, a large gap from human review (100%).

Who Should Read

Developers who evaluate or build AI-based code review tools, and engineers who design automated code quality pipelines.

Core Mechanics

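Current state-of-the-art code review agents (Claude Code 32.1%, Devin 24.8%, PR-Agent 23.1%, Codex 20.1%) fall far short of human review (100%); even combined, the four reach only 41.5%.
Automated tools find Robustness and Testing issues well but fall seriously short on Maintainability (7.9% to 27%), Design, and Documentation, because they do not know repository-specific conventions.
Claude Code has the highest pass rate on Robustness (75%), but it also posts far more comments than the other tools, 7.3 per PR, which increases the burden on developers.
Automated tools and human reviewers cover different aspects, so they complement rather than replace each other.
Documenting repository-specific context, for example in AGENTS.md, is a key direction for improving automated tool performance.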
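
The gap between the individual scores and the combined score is easier to see with sets. The four individual pass rates above add up to roughly 100%, yet the combined figure is 41.5%, which suggests the tools largely pass the same tests. The following is a minimal sketch, assuming each tool's result is available as a set of passed test IDs; the tool names, test IDs, and total are made-up illustrations, not data from the paper.

```python
from typing import Dict, Set


def pass_rate(passed: Set[str], total: int) -> float:
    """Fraction of benchmark tests that one tool's reviews led to passing."""
    return len(passed) / total


def union_pass_rate(results: Dict[str, Set[str]], total: int) -> float:
    """Fraction of tests passed by at least one tool (the combined score)."""
    covered = set().union(*results.values())
    return len(covered) / total


# Hypothetical toy data: 10 tests, three tools with heavily overlapping passes.
TOTAL_TESTS = 10
results = {
    "tool_a": {"t1", "t2", "t3"},
    "tool_b": {"t2", "t3", "t4"},
    "tool_c": {"t1", "t3"},
}

individual = {name: pass_rate(p, TOTAL_TESTS) for name, p in results.items()}
combined = union_pass_rate(results, TOTAL_TESTS)

print(individual)  # per-tool pass rates
print(combined)    # union pass rate, far below the sum of the individual rates
```

In this toy case the individual rates sum to 0.8 while the union is only 0.4, because most passes are shared between tools.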

Evidence

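Starting from 671 PRs in SWE-CARE, a four-stage filter (review filtering, Docker environment setup, natural-language-to-test conversion, agent validation) yields a final set of 184 PRs, 234 tests, and 67 repositories.
Execution-based evaluation: human review comments are converted into fail-then-pass tests. This overcomes a limitation of BLEU, ROUGE, and embedding similarity, which can give a score of zero to a comment that raises the same issue in different words.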

How to Apply

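Use the c-CRAB dataset (github.com/c-CRAB-Benchmark) to measure the performance of your own code review agent.
Document the repository's coding conventions and architecture rules in AGENTS.md; this can be expected to improve automated tool performance.
Combine automated review (strong on Robustness and Testing) with human review (strong on Maintainability and Design) to build a complementary workflow.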
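
One way to picture the complementary workflow is a simple routing rule over review categories. The sketch below is an assumption for illustration only: the category names come from the findings above, but the routing policy, the function name `plan_review`, and the choice to send unknown categories to humans are hypothetical and not part of the paper.

```python
from typing import Dict, List

# Categories where automated tools were reported to perform well.
AUTOMATED_STRENGTHS = {"Robustness", "Testing"}


def plan_review(categories: List[str]) -> Dict[str, List[str]]:
    """Assign each review category to automated or human review.

    Anything outside the automated strengths (including unknown
    categories) goes to a human reviewer by default.
    """
    plan: Dict[str, List[str]] = {"automated": [], "human": []}
    for category in categories:
        owner = "automated" if category in AUTOMATED_STRENGTHS else "human"
        plan[owner].append(category)
    return plan


plan = plan_review(
    ["Robustness", "Design", "Testing", "Documentation", "Maintainability"]
)
print(plan)
```

Defaulting to human review is the conservative choice here, since the reported weak spots of the tools are exactly the categories that depend on repository-specific knowledge.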

Terminology

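c-CRAB (Code-CRAB): an execution-based benchmark for evaluating code review, built by converting human reviews into executable tests.

fail-then-pass test: a test that fails on the original code and passes after the fix, encoding the review issue as an oracle.

LLM-as-a-Judge: an approach in which an LLM evaluates another LLM's output; excluded from c-CRAB because of bias and instability.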

Original Abstract

Software engineering agents have shown significant promise in writing code. As AI agents permeate code writing, and generate huge volumes of code automatically -- the matter of code quality comes front and centre. As the automatically generated code gets integrated into huge code-bases -- the issue of code review and broadly quality assurance becomes important. In this paper, we take a fresh look at the problem and curate a code review dataset for AI agents to work with. Our dataset called c-CRAB (pronounced see-crab) can evaluate agents for code review tasks. Specifically given a pull-request (which could be coming from code generation agents or humans), if a code review agent produces a review, our evaluation framework can assess the reviewing capability of the code review agents. Our evaluation framework is used to evaluate the state of the art today -- the open-source PR-agent, as well as commercial code review agents from Devin, Claude Code, and Codex. Our c-CRAB dataset is systematically constructed from human reviews -- given a human review of a pull request instance we generate corresponding tests to evaluate the code review agent generated reviews. Such a benchmark construction gives us several insights. Firstly, the existing review agents taken together can solve only around 40% of the c-CRAB tasks, indicating the potential to close this gap by future research. Secondly, we observe that the agent reviews often consider different aspects from the human reviews -- indicating the potential for human-agent collaboration for code review that could be deployed in future software teams. Last but not the least, the agent generated tests from our data-set act as a held out test-suite and hence quality gate for agent generated reviews. What this will mean for future collaboration of code generation agents, test generation agents and code review agents -- remains to be investigated.