daVinci-Env: Open SWE Environment Synthesis at Scale
TL;DR Highlight
An open-source pipeline that auto-generates 45,320 Docker environments for SWE agent training, enabling a Qwen2.5-72B-based model to reach 66.0% on SWE-bench Verified, state of the art among Qwen2.5-series models.
Who Should Read
Researchers training SWE (software engineering) agents and teams looking to build reproducible training pipelines for code agent fine-tuning.
Core Mechanics
- Identified the bottleneck in SWE agent training: lack of diverse, reproducible execution environments to run and verify code changes
- Built an open-source pipeline that automatically generates 45,320 Docker environments for SWE training tasks
- Each environment includes the necessary dependencies, repo setup, and test suite to verify agent-generated code
- Using this training infrastructure, Qwen2.5-based models (32B and 72B) reach 62.4% and 66.0% on SWE-bench Verified, state of the art among the Qwen2.5 series
- The pipeline is fully reproducible and released as open source
- Scales automatically to diverse repositories: the released environments span over 12.8k Python repositories
Evidence
- 45,320 Docker environments successfully generated and validated
- Models trained on this data achieve 62.4% (OpenSWE-32B) and 66.0% (OpenSWE-72B) on SWE-bench Verified, the best results among Qwen2.5-series models
- Environment generation runs fully automated, without manual intervention, via a multi-agent synthesis pipeline deployed across a 64-node distributed cluster
- Open-source release enables reproducibility across research groups
How to Apply
- Use the released pipeline to generate execution environments for any GitHub repository you want to include in SWE training
- The Docker environments handle dependency management automatically — just point it at a repo and commit range
- Fine-tune your base model on the generated environment/task pairs for SWE agent training
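The three steps above might be wired together as in the sketch below. The `generate_environment` function and the resulting spec fields are hypothetical placeholders standing in for the released pipeline's actual entry point; only the `/testbed` working directory and the `openswe-python-*` base images come from the generated Dockerfile structure shown next.

```python
# Hypothetical driver loop: point the pipeline at repos + commit ranges.
# generate_environment() is a stand-in name, not the pipeline's real API.

def generate_environment(repo_url: str, commit_range: str) -> dict:
    """Describe the Docker environment the pipeline would build for a repo."""
    repo_name = repo_url.rstrip("/").rsplit("/", 1)[-1]
    return {
        "image": f"openswe-python-3.12-{repo_name}",  # per-repo image (assumed naming)
        "repo": repo_url,
        "commits": commit_range,
        "workdir": "/testbed",  # repo is COPY'd into /testbed in the Dockerfile
    }


repos = [
    ("https://github.com/psf/requests", "v2.31.0..v2.32.0"),  # illustrative entry
]
envs = [generate_environment(url, rng) for url, rng in repos]
print(envs[0]["workdir"])  # /testbed
```

Each resulting spec then pairs with a task (issue + test patch) for fine-tuning.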
Code Example
# Example of basic Dockerfile structure generated by the OpenSWE Dockerfile agent
# (using openswe-python-{version} base image)
FROM openswe-python-3.12
# Inject repository via COPY without network access (avoids GitHub rate limits)
COPY repo /testbed
WORKDIR /testbed/
# Install packages after activating conda environment (bash -lc required)
ENV DEBIAN_FRONTEND=noninteractive
RUN bash -lc 'pip install -r requirements.txt'
RUN bash -lc 'pip install -e .' # Editable install so tests run against the local checkout
RUN bash -lc 'pip install pytest'
# Example evaluation script (exit code capture required)
# #!/bin/bash
# . /opt/conda/etc/profile.d/conda.sh
# conda activate testbed
# cd /testbed
# git apply -v --allow-empty - <<'EOF_114329324912'
# [CONTENT OF TEST PATCH]
# EOF_114329324912
# echo ">>>>> Start Test Output"
# pytest --no-header -rA --tb=no -p no:cacheprovider tests/test_target.py
# rc=$?
# echo ">>>>> End Test Output"
# echo "OPENSWE_EXIT_CODE=$rc"Terminology
Related Resources
Original Abstract
Training capable software engineering (SWE) agents demands large-scale, executable, and verifiable environments that provide dynamic feedback loops for iterative code editing, test execution, and solution refinement. However, existing open-source datasets remain limited in scale and repository diversity, while industrial solutions are opaque with unreleased infrastructure, creating a prohibitive barrier for most academic research groups. We present OpenSWE, the largest fully transparent framework for SWE agent training in Python, comprising 45,320 executable Docker environments spanning over 12.8k repositories, with all Dockerfiles, evaluation scripts, and infrastructure fully open-sourced for reproducibility. OpenSWE is built through a multi-agent synthesis pipeline deployed across a 64-node distributed cluster, automating repository exploration, Dockerfile construction, evaluation script generation, and iterative test analysis. Beyond scale, we propose a quality-centric filtering pipeline that characterizes the inherent difficulty of each environment, filtering out instances that are either unsolvable or insufficiently challenging and retaining only those that maximize learning efficiency. With $891K spent on environment construction and an additional $576K on trajectory sampling and difficulty-aware curation, the entire project represents a total investment of approximately $1.47 million, yielding about 13,000 curated trajectories from roughly 9,000 quality guaranteed environments. Extensive experiments validate OpenSWE's effectiveness: OpenSWE-32B and OpenSWE-72B achieve 62.4% and 66.0% on SWE-bench Verified, establishing SOTA among Qwen2.5 series. Moreover, SWE-focused training yields substantial out-of-domain improvements, including up to 12 points on mathematical reasoning and 5 points on science benchmarks, without degrading factual recall.