CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis
TL;DR Highlight
Salesforce's open-source code-generation LLM family, up to 16.1B parameters — decomposing complex problems into multi-step instructions boosts pass rates by roughly 9-11 percentage points.
Who Should Read
Backend/ML engineers developing AI code assistants or LLM-based code generation pipelines. Developers wanting to fine-tune open-source code models or optimize prompt strategies.
Core Mechanics
- Releases CODEGEN models in 4 sizes (350M–16.1B parameters) as open source, including JAXFORMER training library and checkpoints
- Sequential training on natural language (ThePile) → multi-language code (BigQuery) → Python-only (BigPython) improves Python code quality at each step
- CODEGEN-MONO 16.1B achieves 29.28% pass@1 on HumanEval, equal to or better than Codex 12B (28.81%) — even the 6.1B model at 26.13% is close to Codex 12B
- Breaking complex problems into multiple steps (multi-turn) raises pass rates by roughly 9-11 percentage points over single-turn instructions — observed consistently across all model sizes
- Releases MTPB (Multi-Turn Programming Benchmark) — 115 problems decomposed into 3+ turns, the first multi-turn code generation benchmark
- Lower prompt perplexity (i.e., the model finds the instruction more predictable) correlates with higher code-generation success; multi-turn decomposition consistently lowers perplexity vs single-turn
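Prompt perplexity here is just the exponentiated average negative log-likelihood of the prompt's tokens under the model. A minimal sketch (pure function; the per-token log-probabilities are hypothetical — in practice you would read them off a causal LM's logits):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(average negative log-probability) over the prompt tokens."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# Toy numbers: a prompt the model finds more predictable (higher log-probs)
# yields lower perplexity.
single_turn = [-2.5, -2.2, -2.4, -2.1]  # hypothetical per-token log-probs
multi_turn = [-2.0, -1.9, -2.1, -1.8]
print(perplexity(single_turn))  # higher perplexity
print(perplexity(multi_turn))   # lower perplexity
```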
Evidence
- HumanEval pass@1: CODEGEN-MONO 16.1B 29.28% vs Codex 12B 28.81% — competitive at similar scale
- MTPB pass rate: multi-turn 47.34% vs single-turn 38.74% for 16.1B — an ~8.6%p improvement; the 350M model shows 16.98% vs 5.75%, an 11.23%p gap
- MTPB avg prompt perplexity: multi-turn 8.05 vs single-turn 10.25 (16.1B) — confirms higher model comprehension with multi-turn decomposition
- MBPP benchmark pass@1: CODEGEN-MONO 16.1B 35.28%, 6.1B 32.48% — far surpassing INCODER 6B (21.30%)
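The pass@1 numbers above use the standard unbiased pass@k estimator (introduced with HumanEval, not specific to this paper): generate n samples per problem, count the c that pass the tests, and estimate the chance that at least one of k drawn samples is correct:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations (c of which pass), is correct."""
    if n - c < k:  # too few failures to fill k draws -> always at least one pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 3, 1))  # 0.3 — for k=1 this reduces to c/n
```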
How to Apply
- For complex feature requests, decompose into sub-tasks and request sequentially instead of a single prompt — e.g., instead of 'implement linear regression', do '① import numpy ② create data matrix ③ calculate weights ④ output predictions' in order
- Load CODEGEN checkpoints from Hugging Face (e.g., Salesforce/codegen-16B-mono) as a base for domain-specific fine-tuning — expect significantly higher Python task performance than generic GPT-J
- When building a code assistant chatbot, accumulate (concatenate) previous turn prompts + generated code as context for the next turn — leverages multi-turn capability and handles complex problems much better than simple single-response mode
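When accumulating turns this way, decoder-only models often keep generating past the requested snippet (e.g., inventing the next instruction comment themselves). A simple guard — an illustrative helper, not part of the CODEGEN release — is to cut the decoded output at the first line that starts a new comment:

```python
def truncate_completion(full_output: str, context: str) -> str:
    """Keep only the newly generated code, cut at the next '#' comment line.

    full_output: decoded model output (prompt + completion)
    context:     the prompt that was fed in (a prefix of full_output)
    """
    completion = full_output[len(context):]
    kept = []
    for line in completion.splitlines():
        if line.lstrip().startswith("#"):  # model began a new instruction on its own
            break
        kept.append(line)
    return "\n".join(kept).rstrip()

sample = "# Sort nums.\nnums.sort()\nprint(nums)\n# Invented next step\nx = 1"
print(truncate_completion(sample, "# Sort nums.\n"))
# → nums.sort()
#   print(nums)
```

The truncated snippet is what you append to the running context before issuing the next turn's prompt.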
Code Example
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load CODEGEN model (Hugging Face)
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-2B-mono")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-2B-mono")
# Multi-turn approach: accumulate previous turn results as context
context = "# Import libraries.\nimport numpy as np\n"
# Turn 1
prompt_1 = "# Define a list named 'nums' with values [3, 1, 4, 1, 5, 9, 2, 6].\n"
inputs = tokenizer(context + prompt_1, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, temperature=0.2, do_sample=True)
generated_1 = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Turn 1 output:\n", generated_1)
# Turn 2: include previous turn result in context
prompt_2 = "# Sort 'nums' in descending order and print the result.\n"
context_2 = generated_1 + "\n"
inputs_2 = tokenizer(context_2 + prompt_2, return_tensors="pt")
outputs_2 = model.generate(**inputs_2, max_new_tokens=50, temperature=0.2, do_sample=True)
generated_2 = tokenizer.decode(outputs_2[0], skip_special_tokens=True)
print("Turn 2 output:\n", generated_2)
Original Abstract
Program synthesis strives to generate a computer program as a solution to a given problem specification, expressed with input-output examples or natural language descriptions. The prevalence of large language models advances the state-of-the-art for program synthesis, though limited training resources and data impede open access to such models. To democratize this, we train and release a family of large language models up to 16.1B parameters, called CODEGEN, on natural language and programming language data, and open source the training library JAXFORMER. We show the utility of the trained model by demonstrating that it is competitive with the previous state-of-the-art on zero-shot Python code generation on HumanEval. We further investigate the multi-step paradigm for program synthesis, where a single program is factorized into multiple prompts specifying subproblems. To this end, we construct an open benchmark, Multi-Turn Programming Benchmark (MTPB), consisting of 115 diverse problem sets that are factorized into multi-turn prompts. Our analysis on MTPB shows that the same intent provided to CODEGEN in multi-turn fashion significantly improves program synthesis over that provided as a single turn. We make the training library JAXFORMER and model checkpoints available as open source contribution: https://github.com/salesforce/CodeGen.