Large Language Models for Code Generation: A Comprehensive Survey of Challenges, Techniques, Evaluation, and Applications
TL;DR Highlight
A survey covering everything from LLM code generation limitations to fine-tuning techniques, evaluation metrics, and real-world applications.
Who Should Read
Backend/full-stack developers bringing AI coding tools (GitHub Copilot, CodeLlama, etc.) into production, or ML engineers building fine-tuning or evaluation pipelines for code-generation LLMs.
Core Mechanics
- Four major problems in LLM code generation: enormous compute requirements (Llama 3.1-405B required ~31M GPU hours to train), syntax/semantic errors (semantic errors account for 50%+ of failures), bias (38.92% of GPT-4-generated code exhibited gender bias), and security vulnerabilities (~40% of Copilot-generated code contains security issues)
- Resource-constraint finding: within the same compute budget, 7B/13B models sometimes outperform 70B models by 5-15%
- 3 fine-tuning techniques: (1) domain-specific dataset tuning: LoRA shows a 25.4% EM@10 improvement over full fine-tuning; (2) execution-feedback RL (RLEF): Llama 3.1 70B solves 37.5% of CodeContests competitive-programming problems; (3) Chain-of-Thought: 130%+ improvement on small models
- AceCoder prompting: retrieve similar code examples and include them in the prompt, yielding Pass@1 improvements of 56.4% (MBPP), 70.7% (MBJP), and 88.4% (MBJSP)
- Evaluation metrics matter: BLEU is unsuitable for code; prefer code-specific metrics such as pass@k (actual execution), CodeBLEU (syntax + semantics + data flow), and ICE-Score (LLM-based evaluation)
- Current benchmarks: HumanEval (164 Python problems), ClassEval (100 class-level tasks), SWE-bench (2,294 real GitHub issues; Claude 2 solves only 1.96%), BigCodeBench (1,140 tasks of practical library use)
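The pass@k metric above is conventionally computed with the unbiased estimator introduced alongside HumanEval: generate n samples per problem, count the c that pass the tests, and estimate pass@k as 1 - C(n-c, k) / C(n, k). A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples generated per problem, c passing."""
    if n - c < k:
        return 1.0  # every size-k draw must include at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n = 200 samples and c = 20 passing, pass@1 reduces to c/n
print(round(pass_at_k(200, 20, 1), 6))  # 0.1
```

Averaging this estimate over all benchmark problems gives the reported pass@k score; naively sampling k outputs and checking them directly would yield a higher-variance estimate.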
Evidence
- RLEF with Llama 3.1 70B: 37.5% solve rate on CodeContests competitive programming (vs. the AlphaCodium SOTA of 29%)
- LoRA fine-tuning on the CoNaLa dataset: EM@10 +25.4%, CodeBLEU +22.8% vs. in-context learning (ICL)
- Data pruning: training on only 1% of the data still improved HumanEval by 4.1%, nearly matching training on the full dataset
- ClarifyGPT framework: raised GPT-4's average performance from 68.02% to 75.75%, and ChatGPT's from 58.55% to 67.22%
How to Apply
- For security-critical code generation: RCI (Recursive Criticism and Improvement) prompting is most effective with GPT-3.5/GPT-4. Avoid 'persona'-style prompts; they produce the most security vulnerabilities.
- To improve small-model code generation quality: apply the CodePLAN pattern (have the model first generate a 'solution plan' via CoT, then generate the code); this yields 130%+ pass@1 improvement without needing large models.
- For RAG-based code retrieval: inject GitHub repository context and re-rank with a multi-stream ensemble; this achieves Success@10 of 78.2%.
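As a toy illustration of the retrieve-and-rerank step (the multi-stream ensemble in the survey is far more involved; the function name and the surface-similarity scoring here are my own illustration, not from the paper), candidate snippets can be scored against the query and the top-k kept:

```python
from difflib import SequenceMatcher

def rerank_snippets(query: str, snippets: list[str], k: int = 3) -> list[str]:
    """Toy re-ranker: score each candidate by surface similarity to the
    query and return the top-k. A real pipeline would ensemble several
    retrieval streams (lexical, embedding-based, structural)."""
    scored = sorted(
        snippets,
        key=lambda s: SequenceMatcher(None, query, s).ratio(),
        reverse=True,
    )
    return scored[:k]

corpus = [
    "def read_json(path): ...",
    "def parse_csv(path): ...",
    "def connect_db(url): ...",
]
print(rerank_snippets("parse csv", corpus, k=1))  # ['def parse_csv(path): ...']
```

In practice the retrieved snippets would then be injected into the prompt, as in the AceCoder-style builder shown in the Code Example section.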
Code Example
# AceCoder style prompt: Improve code generation quality by including similar examples
system_prompt = """
You are an expert programmer. I will provide you with:
1. Similar code examples for reference
2. The programming task requirements
3. Expected test cases
Generate correct, efficient code based on the examples and requirements.
"""
def build_acecoder_prompt(task_description, similar_examples, test_cases):
    prompt = f"""
## Similar Code Examples (for reference):
{similar_examples}
## Task Requirements:
{task_description}
## Expected Test Cases:
{test_cases}
## Your Code:
"""
    return prompt
# Chain-of-Thought (CodePLAN style) prompt
def build_cot_code_prompt(task_description):
    prompt = f"""
Task: {task_description}
Step 1 - Solution Plan:
First, let me think through the approach:
- What inputs do I need to handle?
- What are the edge cases?
- What algorithm/data structure should I use?
- What are the step-by-step logical steps?
Step 2 - Implementation:
Now let me write the code based on the plan above:
"""
    return prompt
# RCI (Recursive Criticism and Improvement) security hardening prompt
def build_rci_security_prompt(initial_code, security_context):
    prompt = f"""
Review the following code for security vulnerabilities.
Security context: {security_context}
```
{initial_code}
```
Critique:
- Check for SQL injection, buffer overflow, path traversal (CWE-22), integer overflow (CWE-190)
- Identify any unsafe input handling
Improved secure version:
"""
    return prompt
Original Abstract
Large Language Models (LLMs) have demonstrated their remarkable capabilities in numerous fields. This survey focuses on how LLMs empower users, regardless of their technical background, to use human languages to automatically generate executable code. We begin with understanding LLMs' limitations and challenges in automated code generation. Subsequently, we review various fine-tuning techniques designed to enhance both the performance and adaptability of LLMs in code generation tasks. We then review the existing metrics and benchmarks for evaluations to assess model performance based on fine-tuning techniques. Finally, we explore the applications of LLMs (e.g. CodeLlama, GitHub Copilot, ToolGen) in code generation tasks to illustrate their roles and functionalities. This survey provides a comprehensive overview of LLMs for code generation, helps researchers in diverse fields better understand the current state-of-the-art technologies, and offers the potential of effectively leveraging LLMs for code generation tasks.