
TextGrad: AutoGrad for Text

Scholars develop a new framework that optimizes compound AI systems by backpropagating large language model feedback.


TextGrad backpropagates text feedback from the output of a language model to all of the earlier components to optimize variables in various systems.

AI is undergoing a paradigm shift, with breakthroughs achieved by systems that orchestrate multiple large language models (LLMs) and other sophisticated components. Complex components such as LLMs, tool users, and simulators are connected to solve many real-world problems. However, building and optimizing these systems has been fairly ad hoc. How often have you had to go back and forth between ChatGPT and your code to solve a problem?

What’s missing: a principled, easy-to-use, and flexible optimizer for complex AI systems. Deep learning faced a similar hurdle in its infancy: Neural networks were difficult to optimize until backpropagation and autograd came to save the day. 

Read the study: TextGrad: Automatic "Differentiation" via Text

 

Inspired by this, we propose and design TextGrad, a system for automatic differentiation via text. TextGrad offers a PyTorch-like API designed to be intuitive: with a few lines of code, you can automatically transform the “reason step by step” prompt you use to classify your data into a more sophisticated one tailored to your specific application.

In particular, TextGrad does this by backpropagating text feedback from the output of a language model to all of the earlier components, optimizing all kinds of variables in various systems. Everything in TextGrad is text, which means we use language models to 1) evaluate outputs, 2) criticize them, and 3) update the inputs.
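To make these three steps concrete, here is a minimal conceptual sketch in plain Python. The llm() helper, the prompts, and the control flow are purely illustrative assumptions, not TextGrad’s internals:

```python
def llm(prompt: str) -> str:
    """Hypothetical helper: send a prompt to a chat model and return its reply."""
    raise NotImplementedError

def textual_gradient_step(task: str, output: str) -> str:
    # 1) Evaluate: judge the current output in natural language.
    evaluation = llm(f"Task: {task}\nResponse: {output}\n"
                     "Evaluate how well the response solves the task.")
    # 2) Criticize: turn the evaluation into feedback -- the textual "gradient".
    feedback = llm(f"Evaluation: {evaluation}\n"
                   "Explain concretely how the response could be improved.")
    # 3) Update: rewrite the response using the feedback.
    return llm(f"Response: {output}\nFeedback: {feedback}\n"
               "Rewrite the response so that it incorporates the feedback.")
```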


In classical neural networks we use the back-propagation method to learn the weights that allow us to make good predictions (a). TextGrad (b) shows that it is possible to train more complex AI systems following a similar idea: backpropagating textual feedback. Backpropagated feedback is explicit, as the language model suggests that “this response can be improved by…”.

In a new preprint, we demonstrate the power of TextGrad on many different problems. We show how to achieve state-of-the-art performance on GPQA (PhD-level question answering) and LeetCode Hard (difficult programming problems). We tackle impactful scientific problems by optimizing molecules for drug discovery and improving patient outcomes by optimizing treatment plans. We find that TextGrad works in many domains out of the box, without modifying the framework.

TextGrad: The missing piece in LLM optimization 


The analogy with PyTorch: TextGrad shares PyTorch abstractions to provide a flexible framework in a familiar setting.

PyTorch is the most popular framework for building complex neural networks; many factors made it successful over the years, and one of them is the flexibility and “friendliness” of its syntax. TextGrad provides a PyTorch equivalent for optimizing text pipelines. Let’s say you have a classification task: how do you use GPT-4 to solve it? Well, you write a system prompt, run some data through it, check the results, update the prompt, and repeat.

TextGrad provides an API that follows PyTorch syntax and allows users to optimize any prompt or result using only textual feedback, provided by a (potentially different) language model. We can optimize the prompt to a language model using just a handful of sample data. In addition, TextGrad also allows language models to self-refine their responses, with evaluations provided by any potentially black-box function, such as language models themselves or the outputs of code interpreters.

In general, sample code that optimizes a system prompt looks like this:


An example of TextGrad code to optimize a system prompt
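Since the example above is an image, here is a hedged sketch of what such a script can look like with the textgrad package. The class and argument names (BlackboxLLM, TGD, TextLoss, role_description) follow the public API, but exact signatures may differ across versions, and train_set and the prompts are illustrative assumptions:

```python
import textgrad as tg

# Engine that produces the textual "gradients" (feedback).
tg.set_backward_engine("gpt-4o", override=True)

# The system prompt is the variable we want to optimize.
system_prompt = tg.Variable(
    "You are a helpful assistant. Think step by step.",
    requires_grad=True,
    role_description="system prompt for the classification task",
)

# The model whose behavior the prompt controls, and an optimizer over the prompt.
model = tg.BlackboxLLM("gpt-3.5-turbo", system_prompt=system_prompt)
optimizer = tg.TGD(parameters=[system_prompt])

# Loss: a language-model judge that critiques the model's predictions.
loss_fn = tg.TextLoss(
    "Evaluate whether the response answers the question correctly; "
    "point out concrete problems if it does not."
)

for question in train_set:  # train_set: a small batch of task examples (assumed)
    question_var = tg.Variable(
        question, requires_grad=False, role_description="question to the model"
    )
    prediction = model(question_var)
    loss = loss_fn(prediction)
    loss.backward()   # backpropagate textual feedback to the system prompt
    optimizer.step()  # rewrite the system prompt using the feedback
```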

Here is how we improved the prompt for ChatGPT (GPT-3.5-turbo) on one of the popular benchmarks in language modeling. TextGrad optimization allowed us to go from 78% to 92% accuracy with only a few iterations of optimization. If you want to reproduce this and do more tinkering with TextGrad, we have a notebook ready for you.


An example of prompt optimization with TextGrad

Test-time Optimization

Beyond optimizing a model’s prompt, we can also optimize its response. When can this be useful? One example is question answering: we get a first reply from a model and then want to refine that answer. Another example is code generation: let’s say we ask GPT-4 to solve a coding problem. Is the first solution GPT-4 comes up with correct? Does it have a good runtime?

To address these issues, we use TextGrad to provide textual feedback and perform multiple iterations of optimization. This allows us to improve the original code and fix bugs introduced in the first generation.
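As a concrete illustration, here is a hedged sketch of test-time optimization with the textgrad package, where the variable being optimized is the model’s answer itself rather than the prompt. Class and method names follow the public API but may differ across versions, and the question and evaluation instruction are illustrative:

```python
import textgrad as tg

tg.set_backward_engine("gpt-4o", override=True)

# Step 1: get an initial answer from the model.
model = tg.BlackboxLLM("gpt-4o")
question = tg.Variable(
    "Write a function that returns the k-th largest element of a list.",
    requires_grad=False,
    role_description="coding question",
)
answer = model(question)
answer.set_role_description("code that solves the question")

# Step 2: define a loss (an LLM judge) and an optimizer over the answer.
loss_fn = tg.TextLoss(
    "Check the code for correctness and runtime; point out bugs and inefficiencies."
)
optimizer = tg.TGD(parameters=[answer])

# Step 3: a few iterations of textual "gradient descent" on the answer.
for _ in range(3):
    loss = loss_fn(answer)
    loss.backward()   # textual feedback flows back to the answer
    optimizer.step()  # the answer is rewritten using that feedback
```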


An example of a code update and its gradient. We have some code at iteration t that contains a bug; the gradients provided by TextGrad suggest a way to fix the problem, and the problem is fixed at iteration t+1.

We tested TextGrad on a very hard benchmark based on LeetCode problems, meaning the solutions we generate are submitted to the LeetCode platform for evaluation on unseen test cases. Our results suggest that this is an effective way of solving coding problems.

Exciting Scientific Applications

LLMs are trained on huge amounts of data, giving them a broad background of scientific knowledge. However, they sometimes fall short when dealing with highly specialized problems. This is where integration with external resources developed by domain experts, like computational tools or high-fidelity databases, becomes invaluable. With TextGrad, we can easily combine the broad knowledge base of LLMs with the specialized capabilities of scientific tools so that researchers can leverage the strengths of both.

Drug Discovery: With TextGrad, we can optimize chemical structures for two key properties: druglikeness (how easily the molecule will be absorbed by the body) and binding affinity (how tightly it will bind to a target protein). Druglikeness is measured using the QED score, which ranges from 0 to 1, with 1 being the most druglike; binding affinity is measured using the Vina score, where more negative is better. Since binding affinity generally favors larger molecules while druglikeness favors smaller ones, optimizing both is an important but challenging task.

We can describe a chemical structure using a textual description known as a SMILES string, and calculate the druglikeness and binding affinity using chemoinformatics tools. In short, the variable is a molecule, and the loss is that molecule’s druglikeness (QED score) and binding affinity (Vina score). 
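To make this concrete, here is a hedged sketch of how such a loss could be computed from a SMILES string. QED comes from RDKit; the vina_score helper is hypothetical and stands in for a docking tool such as AutoDock Vina, and wiring the resulting text back into a TextGrad variable is omitted:

```python
from rdkit import Chem
from rdkit.Chem import QED

def vina_score(smiles: str) -> float:
    """Hypothetical helper: dock the molecule against the target protein
    (e.g., with AutoDock Vina) and return the binding affinity in kcal/mol."""
    raise NotImplementedError

def molecule_feedback(smiles: str) -> str:
    """Turn the two objectives into a textual 'loss' for TextGrad to criticize."""
    mol = Chem.MolFromSmiles(smiles)   # parse the SMILES string
    qed = QED.qed(mol)                 # druglikeness in [0, 1], higher is better
    affinity = vina_score(smiles)      # binding affinity, more negative is better
    return (f"Molecule: {smiles}. QED (druglikeness) = {qed:.2f}; "
            f"Vina score (binding affinity) = {affinity:.1f} kcal/mol. "
            "Suggest modifications that improve both objectives.")
```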


An example of TextGrad for molecule optimization. The QED and Vina scores measure druglikeness and binding affinity respectively, and are calculated using chemoinformatics software. Using this information, TextGrad provides gradients that suggest better chemical structures.


Left: druglikeness and binding affinity distribution of molecules before and after 10 iterations of TextGrad optimization, compared with clinically approved drugs for the same target proteins. Right: example optimization trajectory showing 10 iterations of TextGrad, with the properties of clinically approved drugs for comparison.

After a couple of iterations, TextGrad produces molecules with a distribution of druglikeness and binding affinity similar to that of the molecules in DrugBank, a database of clinically approved drugs. Although we should keep in mind that drugs are optimized for more properties than just druglikeness and binding affinity, we’re excited that TextGrad can help integrate LLMs with cheminformatics tools, opening up new avenues in drug discovery.

Radiotherapy Treatment Planning: We use TextGrad to optimize radiotherapy treatment plans, which determine the necessary dose of radiation and pinpoint the exact locations that need treatment. In particular, the goal of treatment planning is to deliver the prescribed dose to the planning target volume (PTV), which encompasses the tumor and an additional margin to account for uncertainties in planning or treatment delivery, while protecting critical normal tissues, known as organs at risk (OARs), from receiving unsafe doses. Human planners often use a trial-and-error approach, iteratively adjusting optimization hyperparameters based on the results of the optimization process until the plans meet clinical requirements. This makes the process inefficient, time-consuming, and costly.

TextGrad optimizes radiotherapy treatment plans by iteratively providing gradients to the planning system to balance the tradeoff between the PTV and the OARs. The following example illustrates the gradients from TextGrad that were used to improve the planning system.


Text gradients on how to improve a radiotherapy treatment plan.
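To give a feel for how such a loop could be wired up, below is a heavily hedged sketch. The run_planning_system function is hypothetical, the objective weights and prompts are illustrative, and the predecessors argument is our assumption for how an external computation is linked into TextGrad’s graph; the actual setup in the paper may differ:

```python
import textgrad as tg

def run_planning_system(weight_text: str) -> str:
    """Hypothetical helper: run the dose-optimization engine with the given
    objective weights and return a text summary of the resulting dose-volume
    metrics for the PTV and each organ at risk (OAR)."""
    raise NotImplementedError

# The variable: a textual description of the planning hyperparameters.
weights = tg.Variable(
    "PTV coverage weight: 1.0; OAR sparing weights: bladder 0.5, rectum 0.5",
    requires_grad=True,
    role_description="objective weights for the treatment planning system",
)

# Loss: an LLM judge that reads the dose metrics and critiques the tradeoff.
loss_fn = tg.TextLoss(
    "Assess whether the PTV receives the prescribed dose while the OARs stay "
    "below safe limits, and explain how the objective weights should change."
)
optimizer = tg.TGD(parameters=[weights])

for _ in range(5):
    metrics = tg.Variable(
        run_planning_system(weights.value),
        requires_grad=False,
        predecessors=[weights],  # assumed way to link the metrics back to the weights
        role_description="dose-volume metrics of the current plan",
    )
    loss = loss_fn(metrics)
    loss.backward()   # textual feedback flows back to the weights
    optimizer.step()  # the weights description is rewritten using the feedback
```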

Across five patients, we compared the performance of TextGrad to that of clinicians and found that TextGrad achieves performance comparable to, and sometimes better than, clinicians. We should keep in mind that clinicians optimize their plans for a much larger set of criteria and for other patient information, such as previous treatments and disease history. Still, it is promising that TextGrad can optimize these two important properties, which can improve patient outcomes.

Looking Forward

As the paradigm of AI shifts from training individual models to optimizing compound systems involving multiple interacting LLM components and tools, we need a new generation of automated optimizers. TextGrad opens up fascinating opportunities for training large compound AI systems. We are working on extending TextGrad’s capabilities to handle more complex agents, such as those built with frameworks like LangChain. We are excited to release this package and cannot wait to see what other researchers and practitioners will build on top of it.

Additional resources:

Paper authors: Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, James Zou
