DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence
TL;DR Highlight
How the open-source DeepSeek-Coder series (1.3B-33B) surpasses GPT-3.5-Turbo on code benchmarks, and how its 6.7B model matches CodeLlama-34B at a fraction of the size.
Who Should Read
Developers building code autocomplete, code generation, or Copilot-style tools. AI engineers looking for a top-performing open-source code model.
Core Mechanics
- Open-source code model series with 1.3B-33B parameters. Trained on 2 trillion tokens in 87 programming languages. Allows both research and commercial use.
- Training data structured at repository level, not file level — files reordered by topological sort of import/dependency, reflecting actual project structure. This significantly improves cross-file code completion performance.
- FIM (Fill-In-the-Middle) training applied at a 50% ratio to strengthen code infilling. A 100% FIM rate improves infilling further but degrades ordinary left-to-right code generation, so 50% is the trade-off the authors settled on.
- Context length extended to 16K tokens — handles long files and multi-file scenarios. Implemented with RoPE scaling factor adjustment.
- DeepSeek-Coder-Instruct 33B surpasses GPT-3.5-Turbo on HumanEval. Only open-source model to exceed GPT-3.5-Turbo on LeetCode contest problems.
- DeepSeek-Coder-v1.5 continues pre-training from a general LLM (DeepSeek-LLM-7B) — maintains coding performance while significantly improving math reasoning and natural language understanding.
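The repository-level data ordering above can be sketched as a topological sort over import edges: files that others depend on come first in the training sequence. This is an illustrative reconstruction, not the paper's actual pipeline; the module names, the import-matching regex, and the helper function are all made up for the example.

```python
import re
from collections import defaultdict, deque

def topo_order_files(files):
    """Order repo files so dependencies precede dependents (Kahn's algorithm).

    files: dict mapping module name -> source code. Hypothetical helper for
    illustration; cycles (circular imports) would simply be dropped here.
    """
    names = set(files)
    deps = defaultdict(set)          # module -> modules it imports
    for name, src in files.items():
        for m in re.findall(r"^(?:from|import)\s+(\w+)", src, re.MULTILINE):
            if m in names and m != name:
                deps[name].add(m)
    indegree = {n: len(deps[n]) for n in names}
    rdeps = defaultdict(set)         # module -> modules that import it
    for n, ds in deps.items():
        for d in ds:
            rdeps[d].add(n)
    queue = deque(sorted(n for n in names if indegree[n] == 0))
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for dependent in sorted(rdeps[n]):
            indegree[dependent] -= 1
            if indegree[dependent] == 0:
                queue.append(dependent)
    return order

repo = {
    "utils": "def helper(): ...",
    "model": "import utils\ndef net(): ...",
    "train": "import model\nimport utils\n# training loop",
}
print(topo_order_files(repo))  # utils before model before train
```

Concatenating files in this order lets the model see a dependency's definitions before the code that uses them, mirroring how a developer reads a project.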
Evidence
- HumanEval multilingual average: DeepSeek-Coder-Base 33B 50.3% vs CodeLlama-Base 34B 41.0% (+9.3%p). The 6.7B model at 44.7% beats CodeLlama 34B with roughly 5x fewer parameters.
- LeetCode Contest 180 problems: DeepSeek-Coder-Instruct 33B 27.8% vs GPT-3.5-Turbo 23.3%. CodeLlama-Instruct 34B at only 9.4%.
- FIM code completion (Single-Line Infilling average): DeepSeek-Coder-Base 7B 80.7% vs CodeLlama-Base 13B 75.5% — smaller model beats larger model.
- DeepSeek-Coder-v1.5 math reasoning: GSM8K 62.4% (up from 43.2% for the code-only 6.7B model, +19.2%p), MATH 24.7% (up from 19.2%, +5.5%p)
How to Apply
- For code autocomplete tools, recommend DeepSeek-Coder-Base 6.7B — provide the context before and after the cursor in FIM format to fill in the middle code. Token format: `<｜fim▁begin｜>prefix_code<｜fim▁hole｜>suffix_code<｜fim▁end｜>`
- For LLM solving complex coding problems, CoT (Chain-of-Thought) prompts are essential — add instructions like 'first write a step-by-step explanation then write the code' to improve LeetCode hard problem performance
- For multi-file projects requiring context, use BM25 to search for relevant files and build a cross-file context within 512 tokens — dramatically improves cross-file completion accuracy (Python EM: 9.53% → 16.14%)
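The BM25 retrieval step above can be sketched with a from-scratch Okapi BM25 scorer; the tokenizer, the `k1`/`b` defaults, and the sample file snippets are illustrative assumptions, not the paper's exact setup.

```python
import math
import re
from collections import Counter

def bm25_rank(query, docs, k1=1.2, b=0.75):
    """Rank candidate files for a query snippet with Okapi BM25 scoring."""
    tok = lambda s: re.findall(r"\w+", s.lower())
    corpus = [tok(d) for d in docs]
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    df = Counter()                     # document frequency per term
    for d in corpus:
        df.update(set(d))
    scores = []
    for d in corpus:
        tf = Counter(d)                # term frequency in this file
        s = 0.0
        for t in tok(query):
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return sorted(range(N), key=lambda i: -scores[i])

files = [
    "def load_config(path): ...",                          # config I/O
    "class UserRepo:\n    def get_user(self, uid): ...",   # DB access
    "def send_email(addr, body): ...",                     # mailer
]
order = bm25_rank("fetch user by uid from repo", files)
print(order)  # index of the most relevant file first
```

In practice you would take the top-ranked snippets and truncate the concatenation to the 512-token budget before prepending it to the completion prompt.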
Code Example
# DeepSeek-Coder FIM (Fill-In-the-Middle) usage example
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-6.7b-base", trust_remote_code=True)
# Fill in the middle code using FIM format
prefix = "def bubble_sort(arr):\n    n = len(arr)\n    for i in range(n):\n"
suffix = "\n    return arr"
# PSM mode: <｜fim▁begin｜>prefix<｜fim▁hole｜>suffix<｜fim▁end｜>
input_text = f"<｜fim▁begin｜>{prefix}<｜fim▁hole｜>{suffix}<｜fim▁end｜>"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))
# LeetCode-style CoT prompt example
problem = """Given two integer arrays nums1 and nums2, return their intersection.
Please complete the code below to solve the above problem:
```python
class Solution:
    def intersection(self, nums1, nums2):
```"""
# Add CoT
cot_prompt = problem + "\nYou need first to write a step-by-step outline and then write the code."
print(cot_prompt)
Terminology
Related Resources
Original Abstract
The rapid development of large language models has revolutionized code intelligence in software development. However, the predominance of closed-source models has restricted extensive research and development. To address this, we introduce the DeepSeek-Coder series, a range of open-source code models with sizes from 1.3B to 33B, trained from scratch on 2 trillion tokens. These models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task with a 16K window to enhance code generation and infilling. Our extensive evaluations demonstrate that DeepSeek-Coder not only achieves state-of-the-art performance among open-source code models across multiple benchmarks but also surpasses existing closed-source models like Codex and GPT-3.5. Furthermore, DeepSeek-Coder models are under a permissive license that allows for both research and unrestricted commercial use.