DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence
TL;DR Highlight
How the open-source DeepSeek-Coder series (1.3B-33B) surpasses GPT-3.5-Turbo on code benchmarks, and how its 6.7B model matches CodeLlama-34B at a fraction of the size.
Who Should Read
Developers building code autocomplete, code generation, or Copilot-style tools. AI engineers looking for a top-performing open-source code model.
Core Mechanics
- Open-source code model series with 1.3B-33B parameters. Trained on 2 trillion tokens in 87 programming languages. Allows both research and commercial use.
- Training data structured at repository level, not file level — files reordered by topological sort of import/dependency, reflecting actual project structure. This significantly improves cross-file code completion performance.
- FIM (Fill-In-the-Middle) training applied at a 50% ratio to strengthen code infilling. A 100% FIM rate improves infilling further but degrades ordinary left-to-right code generation, so 50% is the trade-off the authors settled on.
- Context length extended to 16K tokens — handles long files and multi-file scenarios. Implemented with RoPE scaling factor adjustment.
- DeepSeek-Coder-Instruct 33B surpasses GPT-3.5-Turbo on HumanEval. Only open-source model to exceed GPT-3.5-Turbo on LeetCode contest problems.
- DeepSeek-Coder-v1.5 continues pre-training from a general LLM (DeepSeek-LLM-7B) — maintains coding performance while significantly improving math reasoning and natural language understanding.
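The repository-level data ordering above can be sketched as a topological sort over import edges: files that others depend on come first in the training sequence. This is an illustrative reconstruction, not the paper's actual pipeline; the module names, the import-matching regex, and the helper function are all made up for the example.

```python
import re
from collections import defaultdict, deque

def topo_order_files(files):
    """Order repo files so dependencies precede dependents (Kahn's algorithm).

    files: dict mapping module name -> source code. Hypothetical helper for
    illustration; cycles (circular imports) would simply be dropped here.
    """
    names = set(files)
    deps = defaultdict(set)          # module -> modules it imports
    for name, src in files.items():
        for m in re.findall(r"^(?:from|import)\s+(\w+)", src, re.MULTILINE):
            if m in names and m != name:
                deps[name].add(m)
    indegree = {n: len(deps[n]) for n in names}
    rdeps = defaultdict(set)         # module -> modules that import it
    for n, ds in deps.items():
        for d in ds:
            rdeps[d].add(n)
    queue = deque(sorted(n for n in names if indegree[n] == 0))
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for dependent in sorted(rdeps[n]):
            indegree[dependent] -= 1
            if indegree[dependent] == 0:
                queue.append(dependent)
    return order

repo = {
    "utils": "def helper(): ...",
    "model": "import utils\ndef net(): ...",
    "train": "import model\nimport utils\n# training loop",
}
print(topo_order_files(repo))  # utils before model before train
```

Concatenating files in this order lets the model see a dependency's definitions before the code that uses them, mirroring how a developer reads a project.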
Evidence
- HumanEval multilingual average: DeepSeek-Coder-Base 33B 50.3% vs CodeLlama-Base 34B 41.0% (+9.3%p). The 6.7B model at 44.7% beats CodeLlama 34B with roughly 5x fewer parameters.
- LeetCode Contest 180 problems: DeepSeek-Coder-Instruct 33B 27.8% vs GPT-3.5-Turbo 23.3%. CodeLlama-Instruct 34B at only 9.4%.
- FIM code completion (Single-Line Infilling average): DeepSeek-Coder-Base 7B 80.7% vs CodeLlama-Base 13B 75.5% — smaller model beats larger model.
- DeepSeek-Coder-v1.5 math reasoning: GSM8K 62.4% (up from 43.2% for the code-only 6.7B model, +19.2%p), MATH 24.7% (up from 19.2%, +5.5%p)
How to Apply
- For code autocomplete tools, recommend DeepSeek-Coder-Base 6.7B — provide the context before and after the cursor in FIM format to fill in the middle code. Token format: `<｜fim▁begin｜>prefix_code<｜fim▁hole｜>suffix_code<｜fim▁end｜>`
- For LLM solving complex coding problems, CoT (Chain-of-Thought) prompts are essential — add instructions like 'first write a step-by-step explanation then write the code' to improve LeetCode hard problem performance
- For multi-file projects requiring context, use BM25 to search for relevant files and build a cross-file context within 512 tokens — dramatically improves cross-file completion accuracy (Python EM: 9.53% → 16.14%)
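The BM25 retrieval step above can be sketched with a from-scratch Okapi BM25 scorer; the tokenizer, the `k1`/`b` defaults, and the sample file snippets are illustrative assumptions, not the paper's exact setup.

```python
import math
import re
from collections import Counter

def bm25_rank(query, docs, k1=1.2, b=0.75):
    """Rank candidate files for a query snippet with Okapi BM25 scoring."""
    tok = lambda s: re.findall(r"\w+", s.lower())
    corpus = [tok(d) for d in docs]
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    df = Counter()                     # document frequency per term
    for d in corpus:
        df.update(set(d))
    scores = []
    for d in corpus:
        tf = Counter(d)                # term frequency in this file
        s = 0.0
        for t in tok(query):
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return sorted(range(N), key=lambda i: -scores[i])

files = [
    "def load_config(path): ...",                          # config I/O
    "class UserRepo:\n    def get_user(self, uid): ...",   # DB access
    "def send_email(addr, body): ...",                     # mailer
]
order = bm25_rank("fetch user by uid from repo", files)
print(order)  # index of the most relevant file first
```

In practice you would take the top-ranked snippets and truncate the concatenation to the 512-token budget before prepending it to the completion prompt.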
Code Example
# DeepSeek-Coder FIM (Fill-In-the-Middle) usage example
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-6.7b-base", trust_remote_code=True)
# Fill in the middle code using FIM format
prefix = "def bubble_sort(arr):\n    n = len(arr)\n    for i in range(n):\n"
suffix = "\n    return arr"
# PSM mode: <｜fim▁begin｜>prefix<｜fim▁hole｜>suffix<｜fim▁end｜>
input_text = f"<｜fim▁begin｜>{prefix}<｜fim▁hole｜>{suffix}<｜fim▁end｜>"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0]))
# LeetCode-style CoT prompt example
problem = """Given two integer arrays nums1 and nums2, return their intersection.
Please complete the code below to solve the above problem:
```python
class Solution:
    def intersection(self, nums1, nums2):
```"""
# Add CoT
cot_prompt = problem + "\nYou need first to write a step-by-step outline and then write the code."
print(cot_prompt)
Terminology
Related Resources
Original Abstract
The rapid development of large language models has revolutionized code intelligence in software development. However, the predominance of closed-source models has restricted extensive research and development. To address this, we introduce the DeepSeek-Coder series, a range of open-source code models with sizes from 1.3B to 33B, trained from scratch on 2 trillion tokens. These models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task with a 16K window to enhance code generation and infilling. Our extensive evaluations demonstrate that DeepSeek-Coder not only achieves state-of-the-art performance among open-source code models across multiple benchmarks but also surpasses existing closed-source models like Codex and GPT-3.5. Furthermore, DeepSeek-Coder models are under a permissive license that allows for both research and unrestricted commercial use.