Large Language Models for Code Generation: A Comprehensive Survey of Challenges, Techniques, Evaluation, and Applications

Mar 3, 2025 • Nam Huynh, Beiyu Lin

TL;DR Highlight

A survey paper that covers LLM-based code generation end to end: its limitations, fine-tuning techniques, evaluation metrics, and real-world applications.

Who Should Read

Backend and full-stack developers adopting AI code generation tools (GitHub Copilot, CodeLlama, etc.) in production, and ML engineers who fine-tune code generation LLMs or build evaluation pipelines for them.

Core Mechanics

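Four core problems in LLM code generation: massive compute requirements (Llama 3.1-405B took about 31 million GPU hours to train), syntactic and semantic errors (with semantic errors at over 50%), bias (38.92% of GPT-4-generated code shows gender bias), and security vulnerabilities (about 40% of Copilot-generated code is vulnerable).
Working around resource constraints: under the same compute budget, smaller 7B/13B models sometimes outperform a 70B model by 5 to 15%. Bigger is not automatically better.
Three fine-tuning techniques: (1) tuning on domain-specific datasets, where LoRA (which updates only a small number of parameters) improves EM@10 by 25.4% over in-context learning (ICL); (2) reinforcement learning from execution feedback (RLEF), with which Llama 3.1 70B solves 37.5% of competitive programming problems (up from the previous best of 29%); (3) Chain-of-Thought prompting, which improves small-model performance by more than 130%.
AceCoder prompting: retrieves similar code examples and includes them in the prompt, improving Pass@1 by 56.4% on MBPP, 70.7% on MBJP, and 88.4% on MBJSP.
Metric choice matters: BLEU, which measures text similarity, is a poor fit for code. Prefer code-specific metrics such as pass@k (checks whether the generated code actually runs and passes its tests), CodeBLEU (compares syntax, semantics, and data flow), and ICE-Score (an LLM does the grading).
Current benchmarks: HumanEval (164 Python problems), ClassEval (100 class-level tasks), SWE-bench (2,294 real GitHub issues, of which even Claude 2 resolves only 1.96%), and BigCodeBench (1,140 practical tasks that exercise libraries).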
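
The pass@k metric recommended above can be sketched as follows. This is a minimal sketch that assumes the unbiased estimator introduced alongside HumanEval (generate n samples per problem, count the c that pass, and estimate the chance that a random size-k subset contains at least one passing sample); the summary does not state which estimator each surveyed paper used. The function names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, of which c pass all tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k failing samples: every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 - P(all k drawn samples fail) = 1 - C(n - c, k) / C(n, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; `results` holds one (n, c) pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical run: 10 samples per problem, two problems.
score = benchmark_pass_at_k([(10, 3), (10, 0)], k=1)
```

Sampling more than k completions per problem and applying the estimator gives a lower-variance number than literally drawing k samples once.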

Evidence

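RLEF applied to Llama 3.1 70B: 37.5% solve rate on CodeContests competitive programming (up from 29% for the previous SOTA, AlphaCodium).
LoRA fine-tuning: on CoNaLa, EM@10 improves by 25.4% and CodeBLEU by 22.8% over ICL; on CodeAlpacaPy, EM@10 improves by 150% and CodeBLEU by 29.8%.
Data pruning (removing unneeded training data): training on only 1% of the data still improves HumanEval performance by 4.1% and comes close to training on the full dataset.
ClarifyGPT framework: raises average performance from 68.02% to 75.75% for GPT-4 and from 58.55% to 67.22% for ChatGPT.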
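
The figures above mix before/after scores with percentage improvements, and the summary does not always say whether a percentage is absolute (points) or relative. As a small worked example, the sketch below computes both readings for the entries that report a before and an after value. The helper name `gains` is illustrative, not from the paper.

```python
def gains(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after - before
    relative = absolute / before * 100.0
    return absolute, relative


# Before/after values as reported in the Evidence list.
reported = {
    "ClarifyGPT on GPT-4": (68.02, 75.75),
    "ClarifyGPT on ChatGPT": (58.55, 67.22),
    "RLEF vs. previous SOTA on CodeContests": (29.0, 37.5),
}

for name, (before, after) in reported.items():
    absolute, relative = gains(before, after)
    print(f"{name}: +{absolute:.2f} points, +{relative:.1f}% relative")
```

An 8.5-point gain from 29% to 37.5% is roughly a 29% relative gain, so the same result can look very different depending on the convention.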

How to Apply

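When security matters: for GPT-3.5/GPT-4, the RCI (Recursive Criticism and Improvement) prompting technique is the most effective. Avoid 'persona'-style prompts, which produce the most security vulnerabilities.
Getting better code out of small models: apply the CodePLAN pattern, in which Chain-of-Thought prompting has the model produce a solution plan first and the code second. This can improve pass@1 by more than 130% on a small model, without needing a large one.
Building a RAG-based code search system: as in RepoRift, injecting GitHub repository context and re-ranking results with a multi-stream ensemble can reach a Success@10 of 78.2%. The pattern to borrow is using context augmentation to handle ambiguous user queries and vocabulary mismatch.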
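
The re-ranking idea in the last item can be illustrated with a small sketch. The summary does not describe RepoRift's actual ensemble, so this uses reciprocal rank fusion (RRF) as a stand-in for combining several retrieval streams. The stream names and file paths are hypothetical, and `success_at_k` assumes the usual definition (the fraction of queries whose correct item appears in the top k).

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into a single ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Items ranked highly in any stream contribute the most.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def success_at_k(ranked: list[list[str]], relevant: list[str], k: int = 10) -> float:
    """Fraction of queries whose correct item appears in the top k results."""
    hits = sum(1 for results, gold in zip(ranked, relevant) if gold in results[:k])
    return hits / len(relevant)


# Hypothetical retrieval streams for a single query.
keyword_stream = ["utils/io.py", "parser.py", "cli.py"]
embedding_stream = ["parser.py", "loader.py", "utils/io.py"]
augmented_query_stream = ["parser.py", "utils/io.py", "tests/test_parser.py"]

fused = reciprocal_rank_fusion(
    [keyword_stream, embedding_stream, augmented_query_stream]
)
```

RRF needs only ranks, not comparable scores, which makes it a convenient baseline when the streams come from different retrievers.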

Code Example

# AceCoder-style prompt: include similar examples to improve code generation quality

system_prompt = """
You are an expert programmer. I will provide you with:
1. Similar code examples for reference
2. The programming task requirements
3. Expected test cases

Generate correct, efficient code based on the examples and requirements.
"""

def build_acecoder_prompt(task_description, similar_examples, test_cases):
    prompt = f"""
## Similar Code Examples (for reference):
{similar_examples}

## Task Requirements:
{task_description}

## Expected Test Cases:
{test_cases}

## Your Code:
"""
    return prompt

# Chain-of-Thought (CodePLAN-style) prompt
def build_cot_code_prompt(task_description):
    prompt = f"""
Task: {task_description}

Step 1 - Solution Plan:
First, let me think through the approach:
- What inputs do I need to handle?
- What are the edge cases?
- What algorithm/data structure should I use?
- What are the step-by-step logical steps?

Step 2 - Implementation:
Now let me write the code based on the plan above:
"""
    return prompt

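# RCI (Recursive Criticism and Improvement) prompt for hardening generated code
def build_rci_security_prompt(initial_code, security_context):
    # security_context describes where the code runs (e.g. "public web API"),
    # so the critique can focus on the threats that matter there.
    prompt = f"""
Review the following code for security vulnerabilities.

Security context: {security_context}

```
{initial_code}
```

Critique:
- Check for SQL injection, buffer overflow, path traversal (CWE-22), integer overflow (CWE-190)
- Identify any unsafe input handling

Improved secure version:
"""
    return prompt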
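
The RCI prompt above covers a single critique pass. A minimal sketch of the full loop (generate, critique, rewrite, repeat) might look like the following. It assumes a hypothetical `llm` callable that maps a prompt string to a completion string; the prompt wording and the fixed number of rounds are illustrative choices, not the setup used in the surveyed study.

```python
from typing import Callable


def rci_refine(task: str, llm: Callable[[str], str], rounds: int = 2) -> str:
    """Generate code, then alternate critique and rewrite for `rounds` iterations."""
    if rounds < 0:
        raise ValueError("rounds must be non-negative")
    code = llm(f"Write code for the following task:\n{task}")
    for _ in range(rounds):
        critique = llm(
            "Review the following code for security vulnerabilities "
            "and list each problem you find.\n\n" + code
        )
        code = llm(
            "Rewrite the code so that every problem in the critique is fixed. "
            "Return only the code.\n\n"
            f"Code:\n{code}\n\nCritique:\n{critique}"
        )
    return code


def canned_llm(prompt: str) -> str:
    """Hypothetical stand-in so the loop can be dry-run without a real model."""
    if prompt.startswith("Review"):
        return "- user input is concatenated into the SQL string"
    if prompt.startswith("Rewrite"):
        return 'cursor.execute("SELECT * FROM users WHERE id = ?", (user_id,))'
    return 'cursor.execute("SELECT * FROM users WHERE id = " + user_id)'


final_code = rci_refine("Look up a user by id", canned_llm, rounds=1)
```

In practice `canned_llm` would be replaced by a call to whatever model client is in use. Each round costs two extra model calls, so the number of rounds is a trade-off between latency and hardening.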

Terminology

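pass@k: A metric that counts a problem as solved if at least one of the k code samples the LLM generates passes the tests. pass@1 is the probability of getting it right on the first attempt.

CodeBLEU: A metric for automatically scoring code quality. Beyond plain text similarity, it also compares the code's syntactic structure (AST) and data flow.

RLEF: Short for Reinforcement Learning from Execution Feedback. A training method that actually executes the code and uses the outcome (success or failure) as the reward signal for improving the model.

LoRA: A lightweight fine-tuning technique that, instead of retraining the whole model, trains only a pair of small low-rank matrices added alongside the frozen weights. It saves substantial memory and time.

PEFT: Short for Parameter-Efficient Fine-Tuning. An umbrella term for techniques such as LoRA and Prompt Tuning that tune a model efficiently by updating only a subset of its parameters.

quantization: A technique that reduces memory use by lowering the numeric precision of model weights (e.g., 32-bit → 4-bit). Lower precision means less memory but also a slight drop in quality.

SWE-bench: A code generation benchmark built from real GitHub issues and PRs. Rather than writing standalone functions, the task is fixing bugs in real open-source projects, which makes the difficulty realistic.

Chain-of-Thought: A prompting technique that has the LLM write out its reasoning step by step instead of answering immediately. It significantly improves accuracy on complex problems.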
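
To make the LoRA and quantization entries concrete, here is a back-of-the-envelope sketch of the savings involved. The layer size (4096 x 4096), the rank (8), and the 7-billion parameter count are illustrative assumptions, not figures from the survey, and the memory estimate covers weights only (no activations, optimizer state, or quantization metadata).

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters when a d_out x d_in weight gets a rank-`rank` update.

    LoRA learns B (d_out x rank) and A (rank x d_in); the base weight stays frozen.
    """
    return rank * (d_out + d_in)


def weight_memory_gb(num_params: float, bits: int) -> float:
    """Approximate memory needed to store the weights alone, in gigabytes."""
    return num_params * bits / 8 / 1e9


# Assumed example: one 4096 x 4096 projection matrix with rank-8 adapters.
full = 4096 * 4096
lora = lora_trainable_params(4096, 4096, rank=8)
print(f"full: {full:,} params, LoRA: {lora:,} params ({lora / full:.2%})")

# Assumed example: a 7-billion-parameter model at different precisions.
for bits in (32, 16, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(7e9, bits):.1f} GB")
```

Under these assumptions the adapter trains well under 1% of the layer's parameters, and going from 32-bit to 4-bit weights cuts weight storage by a factor of eight.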

Original Abstract

Large Language Models (LLMs) have demonstrated their remarkable capabilities in numerous fields. This survey focuses on how LLMs empower users, regardless of their technical background, to use human languages to automatically generate executable code. We begin with understanding LLMs' limitations and challenges in automated code generation. Subsequently, we review various fine-tuning techniques designed to enhance both the performance and adaptability of LLMs in code generation tasks. We then review the existing metrics and benchmarks for evaluations to assess model performance based on fine-tuning techniques. Finally, we explore the applications of LLMs (e.g. CodeLlama, GitHub Copilot, ToolGen) in code generation tasks to illustrate their roles and functionalities. This survey provides a comprehensive overview of LLMs for code generation, helps researchers in diverse fields better understand the current state-of-the-art technologies, and offers the potential of effectively leveraging LLMs for code generation tasks.