daVinci-Env: Open SWE Environment Synthesis at Scale
TL;DR Highlight
An open-source pipeline that auto-generates 45,320 Docker environments for SWE agent training, enabling a Qwen2.5-72B-based model to reach 66.0% on SWE-bench Verified, state of the art among Qwen2.5-series models.
Who Should Read
Researchers training SWE (software engineering) agents and teams looking to build reproducible training pipelines for code agent fine-tuning.
Core Mechanics
- Identified the bottleneck in SWE agent training: lack of diverse, reproducible execution environments to run and verify code changes
- Built an open-source pipeline that automatically generates 45,320 Docker environments for SWE training tasks
- Each environment includes the necessary dependencies, repo setup, and test suite to verify agent-generated code
- Using this training infrastructure, Qwen2.5-based models (32B and 72B) reach 62.4% and 66.0% on SWE-bench Verified, state of the art among the Qwen2.5 series
- The pipeline is fully reproducible and released as open source
- Scales automatically to diverse repositories: the released environments span over 12.8k Python repositories
Evidence
- 45,320 Docker environments successfully generated and validated
- Models trained on this data achieve 62.4% (OpenSWE-32B) and 66.0% (OpenSWE-72B) on SWE-bench Verified, the best results among Qwen2.5-series models
- Environment generation runs fully automated, without manual intervention, via a multi-agent synthesis pipeline deployed across a 64-node distributed cluster
- Open-source release enables reproducibility across research groups
How to Apply
- Use the released pipeline to generate execution environments for any GitHub repository you want to include in SWE training
- The Docker environments handle dependency management automatically — just point it at a repo and commit range
- Fine-tune your base model on the generated environment/task pairs for SWE agent training
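The three steps above might be wired together as in the sketch below. The `generate_environment` function and the resulting spec fields are hypothetical placeholders standing in for the released pipeline's actual entry point; only the `/testbed` working directory and the `openswe-python-*` base images come from the generated Dockerfile structure shown next.

```python
# Hypothetical driver loop: point the pipeline at repos + commit ranges.
# generate_environment() is a stand-in name, not the pipeline's real API.

def generate_environment(repo_url: str, commit_range: str) -> dict:
    """Describe the Docker environment the pipeline would build for a repo."""
    repo_name = repo_url.rstrip("/").rsplit("/", 1)[-1]
    return {
        "image": f"openswe-python-3.12-{repo_name}",  # per-repo image (assumed naming)
        "repo": repo_url,
        "commits": commit_range,
        "workdir": "/testbed",  # repo is COPY'd into /testbed in the Dockerfile
    }


repos = [
    ("https://github.com/psf/requests", "v2.31.0..v2.32.0"),  # illustrative entry
]
envs = [generate_environment(url, rng) for url, rng in repos]
print(envs[0]["workdir"])  # /testbed
```

Each resulting spec then pairs with a task (issue + test patch) for fine-tuning.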
Code Example
# Example of basic Dockerfile structure generated by the OpenSWE Dockerfile agent
# (using openswe-python-{version} base image)
FROM openswe-python-3.12
# Inject repository via COPY without network access (avoids GitHub rate limits)
COPY repo /testbed
WORKDIR /testbed/
# Install packages after activating conda environment (bash -lc required)
ENV DEBIAN_FRONTEND=noninteractive
RUN bash -lc 'pip install -r requirements.txt'
RUN bash -lc 'pip install -e .' # Editable install so tests run against the local checkout
RUN bash -lc 'pip install pytest'
# Example evaluation script (exit code capture required)
# #!/bin/bash
# . /opt/conda/etc/profile.d/conda.sh
# conda activate testbed
# cd /testbed
# git apply -v --allow-empty - <<'EOF_114329324912'
# [CONTENT OF TEST PATCH]
# EOF_114329324912
# echo ">>>>> Start Test Output"
# pytest --no-header -rA --tb=no -p no:cacheprovider tests/test_target.py
# rc=$?
# echo ">>>>> End Test Output"
# echo "OPENSWE_EXIT_CODE=$rc"Terminology
Related Resources
Original Abstract
Training capable software engineering (SWE) agents demands large-scale, executable, and verifiable environments that provide dynamic feedback loops for iterative code editing, test execution, and solution refinement. However, existing open-source datasets remain limited in scale and repository diversity, while industrial solutions are opaque with unreleased infrastructure, creating a prohibitive barrier for most academic research groups. We present OpenSWE, the largest fully transparent framework for SWE agent training in Python, comprising 45,320 executable Docker environments spanning over 12.8k repositories, with all Dockerfiles, evaluation scripts, and infrastructure fully open-sourced for reproducibility. OpenSWE is built through a multi-agent synthesis pipeline deployed across a 64-node distributed cluster, automating repository exploration, Dockerfile construction, evaluation script generation, and iterative test analysis. Beyond scale, we propose a quality-centric filtering pipeline that characterizes the inherent difficulty of each environment, filtering out instances that are either unsolvable or insufficiently challenging and retaining only those that maximize learning efficiency. With $891K spent on environment construction and an additional $576K on trajectory sampling and difficulty-aware curation, the entire project represents a total investment of approximately $1.47 million, yielding about 13,000 curated trajectories from roughly 9,000 quality guaranteed environments. Extensive experiments validate OpenSWE's effectiveness: OpenSWE-32B and OpenSWE-72B achieve 62.4% and 66.0% on SWE-bench Verified, establishing SOTA among Qwen2.5 series. Moreover, SWE-focused training yields substantial out-of-domain improvements, including up to 12 points on mathematical reasoning and 5 points on science benchmarks, without degrading factual recall.