CollabCoder: Plan-Code Co-Evolution via Collaborative Decision-Making for Efficient Code Generation
TL;DR Highlight
A multi-agent framework that co-evolves plans and code, achieving 11-20% higher accuracy while making 4-10 fewer API calls per execution than existing methods.
Who Should Read
AI engineers designing or improving LLM-based code generation pipelines, especially developers looking to boost the performance of debugging agents on complex programming problems.
Core Mechanics
- The core problem with existing multi-agent code generation systems is that they keep fixing the code even when the plan itself is wrong. CollabCoder introduces a CDM (Collaborative Decision-Making) module that dynamically decides, at each iteration, whether to update the plan or the code.
- CDM runs three analyses in parallel (plan analysis, code analysis, and plan-code consistency analysis) and reaches a consensus decision weighted by each analysis's confidence (wπ=0.4, wc=0.3, walign=0.3).
- The RT (Reasoning Trajectory) module accumulates past debugging history to guide the next correction. Unlike existing methods that debug from scratch every time, it remembers failure patterns and thereby reduces repeated mistakes.
- Code-specialized models (Seed-Coder-8B, Qwen2.5-Coder-32B) choose code-level modifications 2-3 times more often, while general-purpose models (GPT-4o mini) choose plan-level modifications far more frequently: the debugging strategy adapts automatically to the model's characteristics.
- CollabCoder's benefits are most pronounced on hard, competitive-programming-level problems. The gap is small on easy tiers, but on the hard tier (rating 1600-1800) CollabCoder solves 7 problems versus MapCoder's 3 and CodeSIM's 5.
- The same trend holds for the latest frontier models such as GPT-5.2 and Qwen3-Coder-Next (80B): the accuracy gap narrows, but CollabCoder consistently wins on API calls and token usage.
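The CDM consensus described above can be sketched as a weighted vote over the three analyses. The weights and the plan-vs-code decision follow the paper; the function name, vote representation, and tie-breaking rule are hypothetical plumbing for illustration.

```python
# Hypothetical sketch of a CollabCoder-style CDM step: each of the three
# analyses votes "plan" or "code", and votes are combined with the
# confidence weights from the paper (wπ=0.4, wc=0.3, walign=0.3).

WEIGHTS = {"plan_analysis": 0.4, "code_analysis": 0.3, "alignment_analysis": 0.3}

def cdm_decide(votes: dict[str, str]) -> str:
    """votes maps each analysis name to its verdict: 'plan' or 'code'.

    Returns which module to update next, by weighted consensus.
    """
    scores = {"plan": 0.0, "code": 0.0}
    for analysis, verdict in votes.items():
        scores[verdict] += WEIGHTS[analysis]
    # Ties break toward code-level fixes, the cheaper edit (an assumption;
    # the summary does not specify tie handling).
    return "plan" if scores["plan"] > scores["code"] else "code"

# Example: plan analysis blames the plan, the other two blame the code,
# so the weighted vote is 0.4 (plan) vs 0.6 (code).
decision = cdm_decide({
    "plan_analysis": "plan",       # weight 0.4
    "code_analysis": "code",       # weight 0.3
    "alignment_analysis": "code",  # weight 0.3
})
print(decision)  # code
```

In a real pipeline each verdict would come from a separate LLM call over the current plan, code, and test failures; here they are passed in directly so the consensus arithmetic is easy to inspect.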
Evidence
- On LiveCodeBench and xCodeEval with GPT-4o mini, CollabCoder achieved 6.6-7.1%p higher Pass@1 than MapCoder and 4.7-5.3%p higher than CodeSIM, while cutting token consumption by 57% vs. MapCoder and 42% vs. CodeSIM.
- On LiveCodeBench with an inference budget of 10 API calls, CollabCoder reached 33.93% vs. MapCoder's 30.36% and CodeSIM's 31.25%. At budget t=5, CollabCoder solved 44/90 problems while Reflexion stagnated at 37/90 and Best-of-N at 33/90.
- On the basic benchmarks (HumanEval, MBPP) with Qwen2.5-Coder-32B as the base, CollabCoder averaged 82.50% vs. CodeSIM's 80.22% and MapCoder's 79.84%, using 4.12 API calls, less than half of MapCoder's 9.05.
- Ablations: removing CDM lowered Seed-Coder-8B's average accuracy by 4.24%p, and removing RT lowered it by 3.36%p. Each module contributes independently, and the best performance comes from using both together.
How to Apply
- If your existing debugging loop repeatedly modifies only the code, add a step at each iteration that asks, via a separate LLM call, whether the fault lies in the plan or in the implementation. You can approximate CollabCoder's CDM simply by prompting for the three perspectives (plan analysis, code analysis, consistency analysis) and deciding by majority vote.
- Give your debugging agent a history memory. Summarizing at each iteration what modifications were attempted and why they failed (the Reasoning Trajectory) and including that summary in the next prompt can reduce how often the same mistake is repeated.
- When using code-specialized models (e.g., Qwen2.5-Coder), weight plan updates more heavily. The paper reports that these models tend to fix only the code even when the plan is wrong, so deliberately raising wπ (e.g., to 0.5 or higher) can better induce plan-level revisions.
Code Example
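The paper's own code is not reproduced here; below is a minimal, hypothetical sketch of the overall loop. A Reasoning Trajectory accumulates "what was tried and why it failed" entries, and each iteration's reviser sees that history before choosing whether to fix the plan or the code. The class and function names, entry format, and the `run_tests`/`revise` callback interfaces are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class ReasoningTrajectory:
    """Accumulates debugging history so later iterations see past failures."""
    entries: list[str] = field(default_factory=list)

    def record(self, iteration: int, target: str, failure: str) -> None:
        self.entries.append(
            f"iter {iteration}: revised {target}; still failing because {failure}"
        )

    def as_prompt_section(self) -> str:
        if not self.entries:
            return "No prior attempts."
        return "Previous attempts:\n" + "\n".join(self.entries)

def debug_loop(run_tests, revise, max_iters: int = 5) -> bool:
    """Sketch of a CollabCoder-style loop.

    run_tests() -> (passed: bool, failure_summary: str)
    revise(history: str) -> "plan" or "code"  # the CDM step, abstracted away
    """
    rt = ReasoningTrajectory()
    for i in range(1, max_iters + 1):
        ok, failure = run_tests()
        if ok:
            return True
        # The reviser receives the accumulated history in its prompt,
        # instead of debugging from scratch each iteration.
        target = revise(rt.as_prompt_section())
        rt.record(i, target, failure)
    return False
```

In practice `revise` would wrap the weighted CDM consensus over three LLM analyses and then actually regenerate the chosen artifact; here it is a callback so the history-threading structure stands on its own.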
Terminology
Related Papers
Show HN: adamsreview – better multi-agent PR reviews for Claude Code
An open-source plugin for Claude Code that runs up to 7 parallel sub-agents, each reviewing a PR from a different perspective, and even applies automatic fixes. It claims to catch more real bugs than the built-in /review or CodeRabbit, but the community has raised skepticism about its complexity and practical value.
How Fast Does Claude, Acting as a User Space IP Stack, Respond to Pings?
An experiment that had Claude Code parse raw IP packets and construct ICMP echo replies so that it actually responds to pings: a fun case that pushes the "Markdown is code and the LLM is the processor" idea all the way down to the network stack.
Show HN: Git for AI Agents
A version-control tool that automatically tracks every tool call made by AI coding agents (Claude Code, etc.) and can even blame which prompt wrote which line of code.
Principles for agent-native CLIs
An article laying out principles for designing CLI tools that AI agents can use well; as agents invoke CLIs ever more frequently, this style of design is becoming practically important.
Agent-harness-kit scaffolding for multi-agent workflows (MCP, provider-agnostic)
A scaffolding tool that orchestrates multiple AI agents collaborating in divided roles, letting you assemble a multi-agent pipeline quickly with zero configuration, Vite-style.
Show HN: Tilde.run – Agent sandbox with a transactional, versioned filesystem
A tool that provides an isolated sandbox where AI agents can touch real production data and still roll back, unifying GitHub/S3/Google Drive into a single version-controlled filesystem.
Related Resources
Original Abstract
Automated code generation remains a persistent challenge in software engineering, as conventional multi-agent frameworks are often constrained by static planning, isolated execution, high computational overhead, and limited adaptability to complex tasks. This paper introduces CollabCoder, a novel Plan-Code Co-Evolution framework that improves code generation through dynamic multi-agent collaboration. The core idea is to design a collaborative decision-making process between the plan module and the code module to decide which module should be executed for the debugging process. Extensive experiments on widely used benchmarks demonstrate that CollabCoder consistently improves code quality and robustness across tasks. Importantly, CollabCoder achieves performance comparable to or exceeding current state-of-the-art methods while reducing computational overhead, with efficiency gains becoming more pronounced as benchmark difficulty increases. On the more challenging LiveCodeBench and xCodeEval benchmarks, our approach improves performance by 11-20% over strong baselines while reducing the number of API calls by an average of 4-10 per execution.