AI Coding Agents Fail at Teamwork | Stanford HAI

Date: June 01, 2026
Topics: Generative AI, Machine Learning

Two models working together perform worse than one alone, exposing a critical gap in artificial intelligence capabilities.

It seems a simple proposition. If AI agents can write code on their own, shouldn’t two models be able to collaborate to perform even better? If we are to realize the promised future where AI agents work together – and with human collaborators – they will have to be good collaborators, but so far they appear to be lone wolves. Stanford researchers explored AI’s collaborative abilities in a recent study called “CooperBench.”

“It’s the curse of coordination,” said Hao Zhu, a postdoctoral scholar at Stanford and first author of the preprint study that was recently presented during an April ICLR workshop. “A single model is better than two agents sharing the work.”

“When collaborating, AI’s performance actually drops – sharply,” noted senior author Diyi Yang, assistant professor of computer science. “Today’s best coding agents lose nearly half their capability when paired up to share work. It shows that social intelligence – not coding skill – is the key bottleneck for AI collaboration.”

Critical Skills

Collaboration is a critical skill for human software teams. Humans regularly divide responsibilities, communicate progress, work in complementary ways, and verify their teammates’ work. These are skills AI presently lacks.

“As good as they are with language, models do not use it for social action and therefore don’t have the coordination abilities needed to behave reliably in a collaborative arrangement,” Zhu said. “They are trained not to use language in a social manner. That’s a problem.”

As an experiment, the team created a battery of more than 650 real-world software engineering tasks that required two agents to collaborate in one of four programming languages – Python, TypeScript, Go, and Rust. The tasks were chosen specifically for their potential for conflict, exactly the kind of strategic overlap that makes real collaboration so important, and so difficult. Each agent could edit code, run local commands, and, importantly, message its collaborating agent in real time.

The two agents’ code was then merged and evaluated. The AI collaborators did not fare well. The authors call the shortfall the “coordination gap,” and it is made worse by the fact that it came in the midrange of technical difficulty, a not-too-easy, not-too-hard sweet spot where two agents were thought to have the greatest opportunity for success.
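One way to picture the comparison is as simple arithmetic over per-task outcomes. The sketch below assumes the gap is the solo success rate minus the paired success rate; the study's exact definition may differ. The `TaskOutcome` record, its field names, and the four toy outcomes are all hypothetical and are not results from CooperBench.

```python
from dataclasses import dataclass


@dataclass
class TaskOutcome:
    """Hypothetical record of one task; field names are illustrative."""
    solo_passed: bool  # a single agent completed the whole task
    pair_passed: bool  # the merged work of two agents passed evaluation


def success_rate(flags):
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def coordination_gap(outcomes):
    """Return (solo_rate, pair_rate, absolute_gap, relative_drop).

    Assumption: gap = solo rate - pair rate. This is an illustrative
    definition, not necessarily the one used in the paper.
    """
    solo = success_rate(o.solo_passed for o in outcomes)
    pair = success_rate(o.pair_passed for o in outcomes)
    gap = solo - pair
    relative = gap / solo if solo else 0.0
    return solo, pair, gap, relative


# Made-up outcomes for four tasks, not data from the study.
toy = [
    TaskOutcome(True, True),
    TaskOutcome(True, False),
    TaskOutcome(False, False),
    TaskOutcome(False, False),
]
solo, pair, gap, relative = coordination_gap(toy)
print(f"solo={solo:.2f} pair={pair:.2f} gap={gap:.2f} relative drop={relative:.0%}")
```

Reporting the relative drop alongside the absolute gap keeps the comparison meaningful when the solo success rate is itself low.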

Talk Is Cheap

Going in, the researchers anticipated that letting the AI agents communicate with each other would improve the odds of success, but they found that it had almost no impact on results. They lay the blame on the agents’ difficulty negotiating two kinds of coordination: spatial (where in the code to make edits) and semantic (what edits are needed).

The researchers were able to observe the agents’ communications in real time. One verbatim exchange exemplifies the AI’s challenge: 

  • Agent A: “WAIT Agent B! If you add the section header AND my guid type to your branch, that WILL create a merge conflict!”  

  • Agent B: “I’ll add the COMPLETE section (lines 72-81) to my branch, which includes both the section header, your guid type, AND my hash_sha256 type.”

In this exchange, Agent B disregards Agent A’s warning and overwrites Agent A’s code. It acknowledges Agent A’s concerns, but proceeds anyway, eventually shipping an incompatible design. Human collaborators are unlikely to make such a move on purely social grounds – it would be detrimental to the trust of the relationship to ignore Agent A’s warning and an outright insult to overwrite its code.
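The spatial side of this failure can be checked mechanically before anyone commits. The following is a minimal sketch, not the researchers' tooling: it assumes each agent announces the file and line range it plans to edit, and flags any shared lines. The `PlannedEdit` type, the file path, and Agent A's line range are hypothetical; only Agent B's range (lines 72-81) comes from the exchange above.

```python
from typing import NamedTuple, Optional


class PlannedEdit(NamedTuple):
    """Hypothetical announcement of the region an agent intends to edit."""
    agent: str
    path: str
    start: int  # first line touched, inclusive
    end: int    # last line touched, inclusive


def overlap(a: PlannedEdit, b: PlannedEdit) -> Optional[range]:
    """Return the shared line range, or None if the edits do not collide."""
    if a.path != b.path:
        return None
    lo, hi = max(a.start, b.start), min(a.end, b.end)
    return range(lo, hi + 1) if lo <= hi else None


# Agent B's range is taken from the quoted message; the path and
# Agent A's range are assumptions made for illustration.
edit_a = PlannedEdit("A", "src/types.py", 74, 76)
edit_b = PlannedEdit("B", "src/types.py", 72, 81)

shared = overlap(edit_a, edit_b)
if shared is not None:
    print(f"Likely merge conflict on lines {shared.start}-{shared.stop - 1}")
```

A check like this only catches the "where" problem. Two agents can edit disjoint lines and still ship incompatible designs, which is the semantic half of the coordination failure.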

Zhu was surprised by exchanges like this. He thought that if models were able to “speak English,” closer coordination would follow, but he found the opposite. Instead, the agents’ language fluency often masked failures rather than resolved them. 

The researchers also witnessed other social breakdowns, such as frequently sharing repetitive, low-value status updates, leaving direct questions unanswered, and failing to follow through on promised tasks.

Letters of Recommendation

While AI does not collaborate well today, the researchers believe that it is a solvable problem. But the fix won’t come through better prompting. AI will have to be trained to collaborate, just as people are asked in school to collaborate less as a way to learn course content than to learn the art of successful collaboration. This requires a different kind of social intelligence that AI does not yet have.

The researchers recommend establishing AI training objectives that reward coordination, teaching AI to model successful partnerships and not just produce good code. Developers might also include new mechanisms for verifying that AI agents have made good on their commitments and create contract-like agreements complete with signatures. Additionally, they could implement better periodic checks on how well the code is being integrated. Finally, communication channels might be strengthened through AI screen sharing and other techniques to improve clarity and certify results.
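The article does not say how signed, verifiable commitments would be built. As one possible reading, the sketch below has an agent sign a promise about which lines it will edit, then checks the lines it actually touched against that promise. Everything here is an assumption made for illustration: the commitment fields, the shared-secret HMAC scheme, the file path, and the line numbers.

```python
import hashlib
import hmac
import json


def sign_commitment(secret: bytes, commitment: dict) -> str:
    """Sign a commitment with HMAC-SHA256 over its sorted-key JSON form."""
    payload = json.dumps(commitment, sort_keys=True).encode("utf-8")
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, commitment: dict, signature: str) -> bool:
    """Check that the commitment was not altered after it was signed."""
    expected = sign_commitment(secret, commitment)
    return hmac.compare_digest(expected, signature)


def kept_commitment(commitment: dict, touched_lines: dict) -> bool:
    """True if every edited line lies inside the promised file and range."""
    allowed = range(commitment["start"], commitment["end"] + 1)
    for path, lines in touched_lines.items():
        if path != commitment["path"]:
            return False
        if any(line not in allowed for line in lines):
            return False
    return True


# Hypothetical promise: Agent B agrees to stay within lines 77-81.
secret = b"agent-b-demo-key"  # demo key only, not a real credential
promise = {"agent": "B", "path": "src/types.py", "start": 77, "end": 81}
signature = sign_commitment(secret, promise)

# Hypothetical record of what Agent B later edited.
actual = {"src/types.py": [72, 73, 78]}

print("signature valid:", verify_signature(secret, promise, signature))
print("commitment kept:", kept_commitment(promise, actual))
```

A shared-secret HMAC only shows that the promise was not changed after signing. A real multi-agent setup would more likely need per-agent asymmetric keys so that one agent cannot forge another's commitments.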

“In CooperBench we learned that while AI agents talk like humans, they still have a lot to learn about how language works in a social context,” Zhu concluded.

This paper was partially funded by the Stanford Institute for Human-Centered AI.

Contributor(s)
Andrew Myers
