Last December, during a meandering walk near the Mississippi River in New Orleans after the 2022 NeurIPS Conference, Stanford associate professor of computer science and psychology Noah D. Goodman and PhD student Eric Zelikman stumbled upon an idea that could change how large language models (LLMs) solve tasks: They needed to try guiding LLMs to solve problems the way people do — by breaking down hard tasks into smaller pieces and solving them one at a time.
“I'm not sure I thought it actually would work,” recalled Goodman. But the idea did work, and far better than they had hoped. In a new paper, their team, which includes PhD students Qian Huang and Gabriel Poesia and was co-led by Stanford Graduate School of Education assistant professor Nick Haber, showed that LLMs using Parsel performed 75 percent better than baselines on competition-level coding problems. Parsel is a natural language processing framework the team proposed that automatically solves many small problems and combines their solutions to solve a large one.
The result came as a surprise to the team, given that before the walk in New Orleans, they had designed Parsel as a tool to help students learn how to code.
Now, a tool for teaching could actually be used to significantly advance the capabilities of LLMs. Before the Parsel framework, complex code written by LLMs was prone to failure because a single mistake would cause the entire program to break. Leveraging Parsel means that LLMs can finally write successful multi-part code based on the same algorithmic reasoning style that human programmers use, and all that’s needed is natural language as input.
Parsing Into Parts
To use Parsel as a tool for education, a student would start by typing plain English descriptions of what a new program must do to accomplish a task. From those descriptions, Parsel identifies which parts are related and need to run together in sequence, and works through them starting with the simplest. Finally, Parsel generates different versions of these coded parts and tests each one until it lands on a version that satisfies everything the student requested.
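The workflow described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Parsel's actual implementation: the `SPECS` dictionary, the `passes` and `assemble` helpers, and the hard-coded candidate strings are all hypothetical, and the candidates stand in for implementations that a language model would generate. The sketch orders the parts so that the simplest ones come first, then tries candidates for each part until one passes that part's tests.

```python
from graphlib import TopologicalSorter

# Hypothetical specs: each part has a plain-English description, the parts it
# depends on, input/output tests, and candidate implementations. A real system
# would sample the candidates from a language model; here they are hard-coded.
SPECS = {
    "square": {
        "description": "Return x multiplied by itself.",
        "deps": [],
        "tests": [((3,), 9), ((-2,), 4)],
        "candidates": [
            "def square(x):\n    return x + x\n",  # buggy candidate
            "def square(x):\n    return x * x\n",  # correct candidate
        ],
    },
    "sum_of_squares": {
        "description": "Return the sum of the squares of the numbers in xs.",
        "deps": ["square"],
        "tests": [(([1, 2, 3],), 14), (([],), 0)],
        "candidates": [
            "def sum_of_squares(xs):\n    return sum(xs)\n",  # buggy candidate
            "def sum_of_squares(xs):\n    return sum(square(x) for x in xs)\n",
        ],
    },
}


def passes(fn, tests):
    """Return True if fn gives the expected output for every test case."""
    try:
        return all(fn(*args) == expected for args, expected in tests)
    except Exception:
        return False


def assemble(specs):
    """Pick, for each part, the first candidate that passes that part's tests."""
    # Dependencies come before the parts that use them (simplest parts first).
    graph = {name: spec["deps"] for name, spec in specs.items()}
    namespace, chosen = {}, {}
    for name in TopologicalSorter(graph).static_order():
        spec = specs[name]
        for source in spec["candidates"]:
            trial = dict(namespace)  # already-accepted parts stay available
            try:
                exec(source, trial)
            except Exception:
                continue
            fn = trial.get(name)
            if callable(fn) and passes(fn, spec["tests"]):
                namespace, chosen[name] = trial, source
                break
        else:
            raise ValueError(f"no candidate for {name!r} passed its tests")
    return namespace, chosen


namespace, chosen = assemble(SPECS)
```

Because each part is checked against its own tests, a buggy candidate for one part is discarded without throwing away working parts elsewhere in the program.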
In this way, Parsel does the heavy lifting in generating code with correct syntax and allows students to focus on the bigger picture. “What we struggle to teach kids in introductory computer science is this idea of algorithmic decomposition, and syntax often gets in the way of learning that core skill,” said Goodman.
But the researchers realized that LLMs have the opposite problem. While they can easily generate the syntax for a given programming language, they struggle to use algorithmic reasoning to build complex programs with many parts. This means that every line of code they generate is an opportunity to mess up. “Some piece is going to break,” said Haber.
To find out if this kind of reasoning would help the performance of LLMs on competitive coding tasks, the researchers prompted LLMs to first create a higher-level sketch with step-by-step instructions before diving into the problem. Then, the LLMs used the sketch to generate Parsel code — a natural language decomposition of the task into function descriptions and test cases — to run the task.
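To make the idea of a decomposition into function descriptions and test cases concrete, here is a minimal Python sketch. The notation in `SKETCH` is a simplified, hypothetical format invented for this example and should not be read as Parsel's exact syntax; `FunctionSpec` and `parse_sketch` are likewise illustrative names. Each definition line gives a function name and a plain-English description, lines containing `->` are input/output test cases for the most recent function, and indentation marks a helper that the enclosing function relies on.

```python
import re
from dataclasses import dataclass, field


@dataclass
class FunctionSpec:
    name: str
    signature: str
    description: str
    tests: list = field(default_factory=list)     # (input text, expected text)
    children: list = field(default_factory=list)  # names of helper functions


# Hypothetical, simplified notation (an assumption made for this example).
SKETCH = """\
sum_of_squares(xs): return the sum of the squares of the numbers in xs
[1, 2, 3] -> 14
  square(x): return x multiplied by itself
  3 -> 9
"""

DEFINITION = re.compile(r"^(\w+)\((.*?)\):\s*(.+)$")


def parse_sketch(text):
    """Turn the indented sketch into a dictionary of FunctionSpec objects."""
    specs = {}
    stack = []      # (indent, spec) pairs for the enclosing functions
    current = None  # most recently defined function; tests attach to it
    for raw in text.splitlines():
        if not raw.strip():
            continue
        indent = len(raw) - len(raw.lstrip())
        line = raw.strip()
        match = DEFINITION.match(line)
        if match:
            name, args, description = match.groups()
            current = FunctionSpec(name, f"{name}({args})", description)
            specs[name] = current
            # Drop functions at the same or deeper indentation; what remains
            # on top of the stack, if anything, is the parent.
            while stack and stack[-1][0] >= indent:
                stack.pop()
            if stack:
                stack[-1][1].children.append(name)
            stack.append((indent, current))
        elif " -> " in line and current is not None:
            given, expected = line.split(" -> ", 1)
            current.tests.append((given.strip(), expected.strip()))
    return specs


specs = parse_sketch(SKETCH)
```

Each resulting `FunctionSpec` carries what a code generator would need for one piece of the task: a description to implement, test cases to check candidates against, and the helpers it is expected to call.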
They soon found that their LLMs not only did better than all previous models on a variety of competition-level coding problems from the APPS (Automated Programming Progress Standard) dataset, but could also successfully generate step-by-step movement plans for an embodied robot or even a mathematical proof.
“This sort of reasoning that we're forcing it to do is something quite domain general … we demonstrated interesting results around coding in particular, but I think there are a lot of directions,” said Haber.
A Wide-Open Future
Nothing quite like the Parsel framework had ever been attempted before, according to the scholars. “Up to the point that Parsel existed, I don't believe anyone thought it was currently possible to generate these kinds of programs from entirely natural language,” said Zelikman.
Moving forward, Goodman, Haber, Zelikman, and their colleagues are excited to continue working on Parsel as a tool for computer science education. “The education side is really exciting,” Zelikman emphasized. “We’re going to do more work developing that and seeing how it can be made more accessible to students.”
They also plan to continue testing Parsel to see how much it can help LLMs solve complex tasks that are more reflective of what programmers do in the real world. Haber noted that while it was exciting that they were able to show such dramatic improvements, they were limited by the datasets available and the difficulty of being the first ones to define a measure of success for such a pioneering new framework. Most prior work focused on coding problems that would normally be solved with a single function, which are not representative of real-world programming.
In the future, the team expects Parsel to evolve and expand beyond education and coding improvements for LLMs. “It certainly leads me to dream pretty wildly with where the next five to 10 years might take this,” said Haber. “You might imagine that these are things that can code with people, that can offload a lot of the dirty work in creating programs, and somehow free up people's ability to be thinking on a very different level when they're creating.”
Stanford HAI’s mission is to advance AI research, education, policy and practice to improve the human condition.