Large Language Models for Code Generation: A Comprehensive Survey of Challenges, Techniques, Evaluation, and Applications
TL;DR Highlight
A survey covering everything from LLM code generation limitations to fine-tuning techniques, evaluation metrics, and real-world applications.
Who Should Read
Backend/full-stack developers bringing AI coding tools (GitHub Copilot, CodeLlama, etc.) into production, or ML engineers building fine-tuning or evaluation pipelines for code-generation LLMs.
Core Mechanics
- Four major problems in LLM code generation: enormous compute requirements (Llama 3.1-405B required ~31M GPU hours to train), syntax/semantic errors (semantic errors account for 50%+ of failures), bias (38.92% of GPT-4-generated code exhibited gender bias), and security vulnerabilities (~40% of Copilot-generated code contains security issues)
- Resource-constraint finding: within the same compute budget, 7B/13B models sometimes outperform 70B models by 5-15%
- 3 fine-tuning techniques: (1) domain-specific dataset tuning: LoRA shows a 25.4% EM@10 improvement over full fine-tuning; (2) execution-feedback RL (RLEF): Llama 3.1 70B solves 37.5% of CodeContests competitive-programming problems; (3) Chain-of-Thought: 130%+ improvement on small models
- AceCoder prompting: retrieve similar code examples and include them in the prompt, yielding Pass@1 improvements of 56.4% (MBPP), 70.7% (MBJP), and 88.4% (MBJSP)
- Evaluation metrics matter: BLEU is unsuitable for code; prefer code-specific metrics such as pass@k (actual execution), CodeBLEU (syntax + semantics + data flow), and ICE-Score (LLM-based evaluation)
- Current benchmarks: HumanEval (164 Python problems), ClassEval (100 class-level tasks), SWE-bench (2,294 real GitHub issues; Claude 2 solves only 1.96%), BigCodeBench (1,140 tasks of practical library use)
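The pass@k metric above is conventionally computed with the unbiased estimator introduced alongside HumanEval: generate n samples per problem, count the c that pass the tests, and estimate pass@k as 1 - C(n-c, k) / C(n, k). A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples generated per problem, c passing."""
    if n - c < k:
        return 1.0  # every size-k draw must include at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n = 200 samples and c = 20 passing, pass@1 reduces to c/n
print(round(pass_at_k(200, 20, 1), 6))  # 0.1
```

Averaging this estimate over all benchmark problems gives the reported pass@k score; naively sampling k outputs and checking them directly would yield a higher-variance estimate.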
Evidence
- RLEF with Llama 3.1 70B: 37.5% solve rate on CodeContests competitive programming (vs. the AlphaCodium SOTA of 29%)
- LoRA fine-tuning on the CoNaLa dataset: EM@10 +25.4%, CodeBLEU +22.8% vs. in-context learning (ICL)
- Data pruning: training on only 1% of the data still improved HumanEval by 4.1%, nearly matching training on the full dataset
- ClarifyGPT framework: raised GPT-4's average performance from 68.02% to 75.75%, and ChatGPT's from 58.55% to 67.22%
How to Apply
- For security-critical code generation: RCI (Recursive Criticism and Improvement) prompting is most effective with GPT-3.5/GPT-4. Avoid 'persona'-style prompts; they produce the most security vulnerabilities.
- To improve small-model code generation quality: apply the CodePLAN pattern (have the model first generate a 'solution plan' via CoT, then generate the code); this yields 130%+ pass@1 improvement without needing large models.
- For RAG-based code retrieval: inject GitHub repository context and re-rank with a multi-stream ensemble; this achieves Success@10 of 78.2%.
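As a toy illustration of the retrieve-and-rerank step (the multi-stream ensemble in the survey is far more involved; the function name and the surface-similarity scoring here are my own illustration, not from the paper), candidate snippets can be scored against the query and the top-k kept:

```python
from difflib import SequenceMatcher

def rerank_snippets(query: str, snippets: list[str], k: int = 3) -> list[str]:
    """Toy re-ranker: score each candidate by surface similarity to the
    query and return the top-k. A real pipeline would ensemble several
    retrieval streams (lexical, embedding-based, structural)."""
    scored = sorted(
        snippets,
        key=lambda s: SequenceMatcher(None, query, s).ratio(),
        reverse=True,
    )
    return scored[:k]

corpus = [
    "def read_json(path): ...",
    "def parse_csv(path): ...",
    "def connect_db(url): ...",
]
print(rerank_snippets("parse csv", corpus, k=1))  # ['def parse_csv(path): ...']
```

In practice the retrieved snippets would then be injected into the prompt, as in the AceCoder-style builder shown in the Code Example section.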
Code Example
# AceCoder style prompt: Improve code generation quality by including similar examples
system_prompt = """
You are an expert programmer. I will provide you with:
1. Similar code examples for reference
2. The programming task requirements
3. Expected test cases
Generate correct, efficient code based on the examples and requirements.
"""
def build_acecoder_prompt(task_description, similar_examples, test_cases):
    prompt = f"""
## Similar Code Examples (for reference):
{similar_examples}
## Task Requirements:
{task_description}
## Expected Test Cases:
{test_cases}
## Your Code:
"""
    return prompt
# Chain-of-Thought (CodePLAN style) prompt
def build_cot_code_prompt(task_description):
    prompt = f"""
Task: {task_description}
Step 1 - Solution Plan:
First, let me think through the approach:
- What inputs do I need to handle?
- What are the edge cases?
- What algorithm/data structure should I use?
- What are the step-by-step logical steps?
Step 2 - Implementation:
Now let me write the code based on the plan above:
"""
    return prompt
# RCI (Recursive Criticism and Improvement) security hardening prompt
def build_rci_security_prompt(initial_code, security_context):
    prompt = f"""
Review the following code for security vulnerabilities.
Security context: {security_context}
```
{initial_code}
```
Critique:
- Check for SQL injection, buffer overflow, path traversal (CWE-22), integer overflow (CWE-190)
- Identify any unsafe input handling
Improved secure version:
"""
    return prompt
Original Abstract
Large Language Models (LLMs) have demonstrated their remarkable capabilities in numerous fields. This survey focuses on how LLMs empower users, regardless of their technical background, to use human languages to automatically generate executable code. We begin with understanding LLMs' limitations and challenges in automated code generation. Subsequently, we review various fine-tuning techniques designed to enhance both the performance and adaptability of LLMs in code generation tasks. We then review the existing metrics and benchmarks for evaluations to assess model performance based on fine-tuning techniques. Finally, we explore the applications of LLMs (e.g. CodeLlama, GitHub Copilot, ToolGen) in code generation tasks to illustrate their roles and functionalities. This survey provides a comprehensive overview of LLMs for code generation, helps researchers in diverse fields better understand the current state-of-the-art technologies, and offers the potential of effectively leveraging LLMs for code generation tasks.