Usage of Large Language Model for Code Generation Tasks: A Review

Jan 1, 2025 • S. Bistarelli, Marco Fiore, Ivan Mercanti +1

TL;DR Highlight

A survey paper summarizing the current state and limitations of LLM-based code generation.

Who Should Read

Developers and researchers who want a comprehensive overview of where LLM code generation stands today and what challenges remain.

Core Mechanics

Covers the full landscape of LLM-based code generation: models, benchmarks, prompting strategies, and evaluation metrics
Identifies key limitations: poor performance on complex multi-file tasks, limited understanding of project-level context, and hallucinated APIs
Benchmarks like HumanEval are saturated and no longer discriminate between top models; harder benchmarks are needed
Security and correctness of generated code remain open problems — models frequently produce vulnerable or subtly wrong code
Highlights the gap between benchmark performance and real-world usability
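
The hallucinated-API limitation listed above can be partly screened for automatically. The following is a minimal sketch, not a method from the survey: the helper name find_missing_attributes is hypothetical, it only handles plain "import x" / "import x as y" statements, and it actually imports each named module, so it should only be pointed at modules you trust.

```python
import ast
import importlib


def find_missing_attributes(source: str) -> list[str]:
    """Return dotted names (e.g. 'math.sqroot') that do not resolve on an imported module."""
    tree = ast.parse(source)

    # Map local alias -> real module name for simple, non-dotted imports.
    modules = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if "." not in alias.name:
                    modules[alias.asname or alias.name] = alias.name

    missing = set()
    for node in ast.walk(tree):
        # Only direct `module.attr` accesses are checked in this sketch.
        if isinstance(node, ast.Attribute) and isinstance(node.value, ast.Name):
            module_name = modules.get(node.value.id)
            if module_name is None:
                continue
            try:
                module = importlib.import_module(module_name)
            except ImportError:
                # The module itself cannot be imported in this environment.
                missing.add(module_name)
                continue
            if not hasattr(module, node.attr):
                missing.add(f"{module_name}.{node.attr}")
    return sorted(missing)


# Hand-written stand-in for model output (an assumption, not real model output).
generated = "import math\nprint(math.sqroot(4))\nprint(math.sqrt(4))\n"
print(find_missing_attributes(generated))
```

A check like this catches only names that are missing at the top level of a module; wrong argument lists or hallucinated methods on objects would need type checking or actual test runs.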

Evidence

Comprehensive review of dozens of models and benchmarks published through 2024
HumanEval pass@1 rates for top models now exceed 90%, indicating benchmark saturation
Multiple studies show high rates of security vulnerabilities in LLM-generated code
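
For readers who want to reproduce pass@1-style numbers, here is a small sketch of the unbiased pass@k estimator that was introduced alongside HumanEval: generate n samples per problem, count the c samples that pass all tests, and compute 1 - C(n-c, k) / C(n, k). Whether every study covered by the survey used exactly this estimator is not stated here, and the function name pass_at_k is an illustrative choice.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem, from n samples of which c pass all tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples fail, every size-k subset contains a passing sample.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no passing sample, complemented.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers only: 10 samples for a problem, 3 of them correct.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

A benchmark score is then the mean of this per-problem value over all problems.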

How to Apply

Use this survey as a reference when deciding which code generation model or prompting strategy to adopt for your use case.
Don't rely solely on HumanEval scores when comparing models — look at harder benchmarks like SWE-bench or LiveCodeBench.
Always run static analysis and security scanners on LLM-generated code before merging to production.
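
As a rough illustration of the static-analysis step above, the sketch below walks the syntax tree of generated Python code and flags calls on a small deny-list. The list RISKY_CALLS and both function names are assumptions made for this example; it is not a substitute for a dedicated scanner such as Bandit, which covers far more patterns.

```python
import ast

# Illustrative deny-list only (an assumption for this sketch).
RISKY_CALLS = {"eval", "exec", "os.system", "pickle.loads", "subprocess.call"}


def dotted_name(node: ast.AST) -> str:
    """Rebuild a dotted name such as 'os.system' from a call target, or '' if it is not a plain name."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return ""


def flag_risky_calls(source: str) -> list[tuple[int, str]]:
    """Return (line number, call name) pairs for every deny-listed call in source."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name in RISKY_CALLS:
                findings.append((node.lineno, name))
    return sorted(findings)


# Hand-written stand-in for model output (an assumption, not real model output).
sample = "import os\nuser = input()\nos.system('ls ' + user)\nresult = eval(user)\n"
for lineno, name in flag_risky_calls(sample):
    print(f"line {lineno}: call to {name}")
```

Because the check is purely syntactic, aliased imports (for example "from os import system") slip past it, which is one reason to run a full scanner in CI rather than rely on a script like this.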

Code Example

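# Example of improving code generation quality with few-shot prompts (OpenAI API)
# Requires the openai package (v1 or later); the client reads the API key
# from the OPENAI_API_KEY environment variable.
import openai

client = openai.OpenAI()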

prompt = """
Please write a function based on the examples below.

Example 1:
# Returns the sum of two numbers
def add(a, b):
    return a + b

Example 2:
# Returns the product of two numbers
def multiply(a, b):
    return a * b

Now implement the following:
# A function that returns the difference between the maximum and minimum values in a list
"""

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are an expert Python developer. Write clear and efficient code."},
        {"role": "user", "content": prompt}
    ],
    temperature=0.2  # Low temperature is recommended for code generation
)

print(response.choices[0].message.content)
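
The reply printed above still has to be checked for correctness. One possible follow-up, sketched here under several assumptions, is to run the returned function against a few input/output cases for the max-minus-min task. The helper passes_all is hypothetical, the reply is assumed to be plain Python source (any markdown fences already stripped), and exec runs arbitrary code, so real model output belongs in a sandbox.

```python
import types


def passes_all(candidate_source: str, cases: list[tuple[list, object]]) -> bool:
    """Return True if the last function defined in candidate_source satisfies every case."""
    namespace: dict = {}
    try:
        # WARNING: exec runs arbitrary code; use a sandbox for real model output.
        exec(candidate_source, namespace)
        functions = [v for v in namespace.values() if isinstance(v, types.FunctionType)]
        if not functions:
            return False
        candidate = functions[-1]  # assume the last definition is the answer
        return all(candidate(arg) == expected for arg, expected in cases)
    except Exception:
        # Syntax errors, runtime errors, and wrong arity all count as failures.
        return False


# Hand-written stand-in for a model reply (an assumption, not real model output).
reply = "def value_range(values):\n    return max(values) - min(values)\n"
cases = [([3, 9, 4], 6), ([5], 0), ([-2, 2], 4)]
print(passes_all(reply, cases))
```

Counting how many of several sampled replies pass such a check gives the per-problem counts that metrics like pass@k are computed from.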

Terminology

pass@k: A code generation metric: the probability that at least one of k generated solutions passes all tests. pass@1 is the most commonly reported.

HumanEval: OpenAI's code generation benchmark with 164 function-writing problems; now considered too easy for frontier models.

SWE-bench: A harder benchmark requiring models to resolve real GitHub issues in large codebases, testing project-level code understanding.

Hallucinated API: When an LLM generates code that calls a function or method that doesn't exist in the actual library.