CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis
TL;DR Highlight
Salesforce's open-source code-generation LLM family, up to 16.1B parameters — decomposing complex problems into multi-step instructions boosts pass rates by roughly 9-11 percentage points.
Who Should Read
Backend/ML engineers developing AI code assistants or LLM-based code generation pipelines. Developers wanting to fine-tune open-source code models or optimize prompt strategies.
Core Mechanics
- Releases CODEGEN models in 4 sizes (350M–16.1B parameters) as open source, including JAXFORMER training library and checkpoints
- Sequential training on natural language (ThePile) → multi-language code (BigQuery) → Python-only (BigPython) improves Python code quality at each step
- CODEGEN-MONO 16.1B achieves 29.28% pass@1 on HumanEval, equal to or better than Codex 12B (28.81%) — even the 6.1B model at 26.13% is close to Codex 12B
- Breaking complex problems into multiple steps (multi-turn) raises pass rates by roughly 9-11 percentage points over single-turn instructions — observed consistently across all model sizes
- Releases MTPB (Multi-Turn Programming Benchmark) — 115 problems decomposed into 3+ turns, the first multi-turn code generation benchmark
- Lower prompt perplexity (i.e., the model finds the instruction more predictable) correlates with higher code-generation success; multi-turn decomposition consistently lowers perplexity vs single-turn
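Prompt perplexity here is just the exponentiated average negative log-likelihood of the prompt's tokens under the model. A minimal sketch (pure function; the per-token log-probabilities are hypothetical — in practice you would read them off a causal LM's logits):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(average negative log-probability) over the prompt tokens."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# Toy numbers: a prompt the model finds more predictable (higher log-probs)
# yields lower perplexity.
single_turn = [-2.5, -2.2, -2.4, -2.1]  # hypothetical per-token log-probs
multi_turn = [-2.0, -1.9, -2.1, -1.8]
print(perplexity(single_turn))  # higher perplexity
print(perplexity(multi_turn))   # lower perplexity
```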
Evidence
- HumanEval pass@1: CODEGEN-MONO 16.1B 29.28% vs Codex 12B 28.81% — competitive at similar scale
- MTPB pass rate: multi-turn 47.34% vs single-turn 38.74% for 16.1B — an ~8.6%p improvement; the 350M model shows 16.98% vs 5.75%, an 11.23%p gap
- MTPB avg prompt perplexity: multi-turn 8.05 vs single-turn 10.25 (16.1B) — confirms higher model comprehension with multi-turn decomposition
- MBPP benchmark pass@1: CODEGEN-MONO 16.1B 35.28%, 6.1B 32.48% — far surpassing INCODER 6B (21.30%)
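The pass@1 numbers above use the standard unbiased pass@k estimator (introduced with HumanEval, not specific to this paper): generate n samples per problem, count the c that pass the tests, and estimate the chance that at least one of k drawn samples is correct:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations (c of which pass), is correct."""
    if n - c < k:  # too few failures to fill k draws -> always at least one pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 3, 1))  # 0.3 — for k=1 this reduces to c/n
```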
How to Apply
- For complex feature requests, decompose into sub-tasks and request sequentially instead of a single prompt — e.g., instead of 'implement linear regression', do '① import numpy ② create data matrix ③ calculate weights ④ output predictions' in order
- Load CODEGEN checkpoints from Hugging Face (e.g., Salesforce/codegen-16B-mono) as a base for domain-specific fine-tuning — expect significantly higher Python task performance than generic GPT-J
- When building a code assistant chatbot, accumulate (concatenate) previous turn prompts + generated code as context for the next turn — leverages multi-turn capability and handles complex problems much better than simple single-response mode
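When accumulating turns this way, decoder-only models often keep generating past the requested snippet (e.g., inventing the next instruction comment themselves). A simple guard — an illustrative helper, not part of the CODEGEN release — is to cut the decoded output at the first line that starts a new comment:

```python
def truncate_completion(full_output: str, context: str) -> str:
    """Keep only the newly generated code, cut at the next '#' comment line.

    full_output: decoded model output (prompt + completion)
    context:     the prompt that was fed in (a prefix of full_output)
    """
    completion = full_output[len(context):]
    kept = []
    for line in completion.splitlines():
        if line.lstrip().startswith("#"):  # model began a new instruction on its own
            break
        kept.append(line)
    return "\n".join(kept).rstrip()

sample = "# Sort nums.\nnums.sort()\nprint(nums)\n# Invented next step\nx = 1"
print(truncate_completion(sample, "# Sort nums.\n"))
# → nums.sort()
#   print(nums)
```

The truncated snippet is what you append to the running context before issuing the next turn's prompt.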
Code Example
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load CODEGEN model (Hugging Face)
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-2B-mono")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-2B-mono")
# Multi-turn approach: accumulate previous turn results as context
context = "# Import libraries.\nimport numpy as np\n"
# Turn 1
prompt_1 = "# Define a list named 'nums' with values [3, 1, 4, 1, 5, 9, 2, 6].\n"
inputs = tokenizer(context + prompt_1, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, temperature=0.2, do_sample=True)
generated_1 = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Turn 1 output:\n", generated_1)
# Turn 2: include previous turn result in context
prompt_2 = "# Sort 'nums' in descending order and print the result.\n"
context_2 = generated_1 + "\n"
inputs_2 = tokenizer(context_2 + prompt_2, return_tensors="pt")
outputs_2 = model.generate(**inputs_2, max_new_tokens=50, temperature=0.2, do_sample=True)
generated_2 = tokenizer.decode(outputs_2[0], skip_special_tokens=True)
print("Turn 2 output:\n", generated_2)
Original Abstract
Program synthesis strives to generate a computer program as a solution to a given problem specification, expressed with input-output examples or natural language descriptions. The prevalence of large language models advances the state-of-the-art for program synthesis, though limited training resources and data impede open access to such models. To democratize this, we train and release a family of large language models up to 16.1B parameters, called CODEGEN, on natural language and programming language data, and open source the training library JAXFORMER. We show the utility of the trained model by demonstrating that it is competitive with the previous state-of-the-art on zero-shot Python code generation on HumanEval. We further investigate the multi-step paradigm for program synthesis, where a single program is factorized into multiple prompts specifying subproblems. To this end, we construct an open benchmark, Multi-Turn Programming Benchmark (MTPB), consisting of 115 diverse problem sets that are factorized into multi-turn prompts. Our analysis on MTPB shows that the same intent provided to CODEGEN in multi-turn fashion significantly improves program synthesis over that provided as a single turn. We make the training library JAXFORMER and model checkpoints available as open source contribution: https://github.com/salesforce/CodeGen.