AI Coding Agents Fail at Teamwork

Two models working together perform worse than one alone, exposing a critical gap in artificial intelligence capabilities.
It seems a simple proposition. If AI agents can write code on their own, shouldn’t two models be able to collaborate to perform even better? If we are to realize the promised future where AI agents work together – and with human collaborators – those agents will have to be good collaborators, but so far they appear to be lone wolves. Stanford researchers explored AI’s collaborative abilities in a recent study called “CooperBench.”
“It’s the curse of coordination,” said Hao Zhu, a postdoctoral scholar at Stanford and first author of the preprint study that was recently presented during an April ICLR workshop. “A single model is better than two agents sharing the work.”
“When collaborating, AI’s performance actually drops – sharply,” noted senior author Diyi Yang, assistant professor of computer science. “Today’s best coding agents lose nearly half their capability when paired up to share work. It shows that social intelligence – not coding skill – is the key bottleneck for AI collaboration.”
Critical Skills
Collaboration is a critical skill for human software teams. Humans regularly divide responsibilities, communicate progress, work in complementary ways, and verify their teammates’ work. These are skills AI presently lacks.
“As good as they are with language, models do not use it for social action and therefore don’t have the coordination abilities needed to behave reliably in a collaborative arrangement,” Zhu said. “They are trained not to use language in a social manner. That’s a problem.”
As an experiment, the team created a battery of more than 650 real-world software engineering tasks that required two agents to collaborate using one of four programming languages – Python, TypeScript, Go, and Rust. The tasks were chosen specifically for their potential for conflict, exactly the kind of strategic overlap that makes real collaboration so important, and so difficult. Each agent had the ability to edit code, to run local commands, and, importantly, to message its partner agent in real time.
The two agents’ code was then merged and evaluated. The AI collaborators did not fare well. The authors call it the “coordination gap,” and it is made worse by the fact that the shortfall came in the midrange of technical difficulty, in a not-too-easy-not-too-hard sweet spot where two agents were thought to have the greatest opportunity for success.
Talk Is Cheap
Going in, the researchers anticipated that letting the AI agents communicate with each other would improve the odds of success, but they found that it had almost no impact on results. They lay the blame on the agents’ confusion in negotiating spatial and semantic coordination – that is, in distinguishing where in the code to make edits from what edits are needed.
The researchers were able to observe the agents’ communications in real time. One verbatim exchange exemplifies the AI’s challenge:
Agent A: “WAIT Agent B! If you add the section header AND my guid type to your branch, that WILL create a merge conflict!”
Agent B: “I’ll add the COMPLETE section (lines 72-81) to my branch, which includes both the section header, your guid type, AND my hash_sha256 type.”
In this exchange, Agent B acknowledges Agent A’s warning but proceeds anyway, overwriting Agent A’s code and eventually shipping an incompatible design. Human collaborators would be unlikely to make such a move, if only on social grounds: ignoring a teammate’s warning damages trust, and overwriting that teammate’s code is an outright insult.
Zhu was surprised by exchanges like this. He thought that if models were able to “speak English,” closer coordination would follow, but he found the opposite. Instead, the agents’ language fluency often masked failures rather than resolving them.
The researchers also witnessed other social breakdowns, such as frequent sharing of repetitive, low-value status updates, direct questions left unanswered, and failures to follow through on promised tasks.
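The spatial half of that coordination problem, agreeing on where each agent will edit, can be illustrated with a minimal Python sketch. It is not the researchers’ tooling. The PlannedEdit type, the find_conflicts helper, the file name types.py, and Agent A’s line range are all hypothetical; only Agent B’s range of lines 72-81 comes from the exchange above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlannedEdit:
    """A line range one agent says it will change (hypothetical structure)."""
    agent: str
    path: str
    start: int  # first line, inclusive
    end: int    # last line, inclusive


def overlaps(a: PlannedEdit, b: PlannedEdit) -> bool:
    """True when both edits touch at least one common line of the same file."""
    return a.path == b.path and a.start <= b.end and b.start <= a.end


def find_conflicts(edits: list[PlannedEdit]) -> list[tuple[PlannedEdit, PlannedEdit]]:
    """Return every pair of overlapping edits that belong to different agents."""
    conflicts = []
    for i, first in enumerate(edits):
        for second in edits[i + 1:]:
            if first.agent != second.agent and overlaps(first, second):
                conflicts.append((first, second))
    return conflicts


# Agent B's range is taken from the quoted exchange; the file name and
# Agent A's range are assumptions made up for this illustration.
plans = [
    PlannedEdit("Agent A", "types.py", 74, 76),
    PlannedEdit("Agent B", "types.py", 72, 81),
]

for first, second in find_conflicts(plans):
    print(f"{first.agent} and {second.agent} both plan to edit {first.path}")
```

A check like this only flags where edits collide. It says nothing about the semantic question of what each edit should contain, which still has to be settled between the agents.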
Letters of Recommendation
While AI does not collaborate well today, the researchers believe the problem is solvable. But the solution won’t come through better prompting. AI will have to be trained to collaborate, just as students are assigned group work in school less to learn course content than to learn the art of successful collaboration. This requires a kind of social intelligence that AI does not yet have.
The researchers recommend establishing AI training objectives that reward coordination, teaching AI to model successful partnerships, not just produce good code. Developers might also add mechanisms for verifying that AI agents have made good on their commitments and create contract-like agreements complete with signatures. Additionally, they could implement better periodic checks on how well the code is being integrated. Finally, communication channels might be strengthened through AI screen sharing and other techniques to improve clarity and certify results.
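One way to picture the commitment-verification idea is a small ledger of promises that is checked against the merged code. The Python sketch below is an assumption-laden illustration, not a mechanism described in the study: the Commitment type, the unmet_commitments helper, the promises, and the sample merged text are all hypothetical, and the signatures the researchers mention are left out.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Commitment:
    """A promise an agent made, plus a check to run on the merged code."""
    agent: str
    promise: str
    is_met: Callable[[str], bool]  # receives the merged source text


def unmet_commitments(commitments: list[Commitment], merged_source: str) -> list[Commitment]:
    """Return the commitments whose checks fail on the merged source."""
    return [c for c in commitments if not c.is_met(merged_source)]


# Hypothetical promises, loosely modeled on the guid / hash_sha256 exchange.
ledger = [
    Commitment("Agent A", "defines guid type", lambda src: "guid" in src),
    Commitment("Agent B", "defines hash_sha256 type", lambda src: "hash_sha256" in src),
    Commitment("Agent B", "guid is defined exactly once", lambda src: src.count("guid") == 1),
]

# Made-up merged result in which both agents added the guid definition.
merged = "guid\nguid\nhash_sha256\n"

for broken in unmet_commitments(ledger, merged):
    print(f"{broken.agent} did not keep its promise: {broken.promise}")
```

The string checks here are deliberately crude. A real verifier would more plausibly run tests or inspect the merged code’s structure, but the shape is the same: record what each agent promised, then confirm it after integration.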
“In CooperBench we learned that while AI agents talk like humans, they still have a lot to learn about how language works in a social context,” Zhu concluded.
This paper was partially funded by the Stanford Institute for Human-Centered AI.





