
TextGrad: AutoGrad for Text

Scholars develop a new framework that optimizes compound AI systems by backpropagating large language model feedback.


TextGrad backpropagates text feedback from the output of a language model to all of the earlier components to optimize variables in various systems.

AI is undergoing a paradigm shift, with breakthroughs achieved by systems that orchestrate multiple large language models (LLMs) and other sophisticated components. Complex components such as LLMs, tool users, and simulators are connected to solve many real-world problems. However, building and optimizing these systems has been fairly ad hoc. How often have you had to go back and forth between ChatGPT and your code to solve a problem?

What’s missing: a principled, easy-to-use, and flexible optimizer for complex AI systems. Deep learning faced a similar hurdle in its infancy: Neural networks were difficult to optimize until backpropagation and autograd came to save the day. 

Read the study: TextGrad: Automatic "Differentiation" via Text

 

Inspired by this, we propose and design TextGrad, a system for automatic differentiation via text. TextGrad offers a PyTorch-like API designed to be intuitive: with a few lines of code, you can automatically transform the “reason step by step” prompt you use to classify your data into a more sophisticated one tailored to your specific application.

In particular, TextGrad does this by backpropagating text feedback from the output of a language model to all of the earlier components, optimizing all kinds of variables in various systems. Everything in TextGrad is text, which means we use language models to 1) evaluate outputs, 2) criticize them, and 3) update the inputs.
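To make these three steps concrete, here is a minimal conceptual sketch in plain Python. The llm() helper, the prompts, and the control flow are purely illustrative assumptions, not TextGrad’s internals:

```python
def llm(prompt: str) -> str:
    """Hypothetical helper: send a prompt to a chat model and return its reply."""
    raise NotImplementedError

def textual_gradient_step(task: str, output: str) -> str:
    # 1) Evaluate: judge the current output in natural language.
    evaluation = llm(f"Task: {task}\nResponse: {output}\n"
                     "Evaluate how well the response solves the task.")
    # 2) Criticize: turn the evaluation into feedback -- the textual "gradient".
    feedback = llm(f"Evaluation: {evaluation}\n"
                   "Explain concretely how the response could be improved.")
    # 3) Update: rewrite the response using the feedback.
    return llm(f"Response: {output}\nFeedback: {feedback}\n"
               "Rewrite the response so that it incorporates the feedback.")
```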


In classical neural networks we use the back-propagation method to learn the weights that allow us to make good predictions (a). TextGrad (b) shows that it is possible to train more complex AI systems following a similar idea: backpropagating textual feedback. Backpropagated feedback is explicit, as the language model suggests that “this response can be improved by…”.

In a new preprint, we demonstrate the power of TextGrad on many different problems. We show how to achieve state-of-the-art performance on GPQA (PhD-level question answering) and LeetCode Hard (difficult programming problems). We tackle impactful scientific problems by optimizing molecules for drug discovery and improving patient outcomes by optimizing treatment plans. We find that TextGrad works in many domains out of the box, without modifying the framework.

TextGrad: The missing piece in LLM optimization 


The analogy with PyTorch: TextGrad shares PyTorch abstractions to provide a flexible framework in a familiar setting.

PyTorch is the most popular framework for building complex neural networks; many factors made it successful over the years, and one of them is the flexibility and “friendliness” of its syntax. TextGrad provides a PyTorch equivalent for optimizing text pipelines. Let’s say you have a classification task: how do you use GPT-4 to solve it? Well, you write a system prompt, run some data through it, check the results, update the prompt, and repeat.

TextGrad provides an API that follows PyTorch syntax and allows users to optimize any prompt or result using only textual feedback, provided by a (potentially different) language model. We can optimize the prompt to a language model using just a handful of sample data. In addition, TextGrad also allows language models to self-refine their responses, with evaluations provided by any potentially black-box function, such as language models themselves or the outputs of code interpreters.

In general, sample code that optimizes a system prompt looks like this:


An example of TextGrad code to optimize a system prompt
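Since the example above is an image, here is a hedged sketch of what such a script can look like with the textgrad package. The class and argument names (BlackboxLLM, TGD, TextLoss, role_description) follow the public API, but exact signatures may differ across versions, and train_set and the prompts are illustrative assumptions:

```python
import textgrad as tg

# Engine that produces the textual "gradients" (feedback).
tg.set_backward_engine("gpt-4o", override=True)

# The system prompt is the variable we want to optimize.
system_prompt = tg.Variable(
    "You are a helpful assistant. Think step by step.",
    requires_grad=True,
    role_description="system prompt for the classification task",
)

# The model whose behavior the prompt controls, and an optimizer over the prompt.
model = tg.BlackboxLLM("gpt-3.5-turbo", system_prompt=system_prompt)
optimizer = tg.TGD(parameters=[system_prompt])

# Loss: a language-model judge that critiques the model's predictions.
loss_fn = tg.TextLoss(
    "Evaluate whether the response answers the question correctly; "
    "point out concrete problems if it does not."
)

for question in train_set:  # train_set: a small batch of task examples (assumed)
    question_var = tg.Variable(
        question, requires_grad=False, role_description="question to the model"
    )
    prediction = model(question_var)
    loss = loss_fn(prediction)
    loss.backward()   # backpropagate textual feedback to the system prompt
    optimizer.step()  # rewrite the system prompt using the feedback
```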

Here is how we improved the prompt for ChatGPT (GPT-3.5-turbo) on one of the popular benchmarks in language modeling. TextGrad optimization allowed us to go from 78% to 92% accuracy with only a few iterations of optimization. If you want to reproduce this and do more tinkering with TextGrad, we have a notebook ready for you.


An example of prompt optimization with TextGrad

Test-time Optimization

Beyond optimizing a model’s prompt, we can also optimize its response. When can this be useful? One example is question answering: we get a first reply from a model and then want to refine that answer. Another example is code generation: let’s say we ask GPT-4 to solve a coding problem. Is the first solution GPT-4 comes up with correct? Does it have a good runtime?

To address these issues, we use TextGrad to provide textual feedback and perform multiple iterations of optimization. This allows us to improve the original code and fix bugs introduced in the first generation.
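As a concrete illustration, here is a hedged sketch of test-time optimization with the textgrad package, where the variable being optimized is the model’s answer itself rather than the prompt. Class and method names follow the public API but may differ across versions, and the question and evaluation instruction are illustrative:

```python
import textgrad as tg

tg.set_backward_engine("gpt-4o", override=True)

# Step 1: get an initial answer from the model.
model = tg.BlackboxLLM("gpt-4o")
question = tg.Variable(
    "Write a function that returns the k-th largest element of a list.",
    requires_grad=False,
    role_description="coding question",
)
answer = model(question)
answer.set_role_description("code that solves the question")

# Step 2: define a loss (an LLM judge) and an optimizer over the answer.
loss_fn = tg.TextLoss(
    "Check the code for correctness and runtime; point out bugs and inefficiencies."
)
optimizer = tg.TGD(parameters=[answer])

# Step 3: a few iterations of textual "gradient descent" on the answer.
for _ in range(3):
    loss = loss_fn(answer)
    loss.backward()   # textual feedback flows back to the answer
    optimizer.step()  # the answer is rewritten using that feedback
```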


An example of a code update and its gradient. We have some code at iteration t that contains a bug; the gradients provided by TextGrad suggest a way to fix the problem, and the problem is fixed at iteration t+1.

We tested TextGrad on a very hard benchmark based on LeetCode problems, meaning the solutions we generate are submitted to the LeetCode platform for evaluation on unseen test cases. Our results suggest that this is an effective way of solving coding problems.

Exciting Scientific Applications

LLMs are trained on huge amounts of data, giving them a broad background of scientific knowledge. However, they sometimes fall short when dealing with highly specialized problems. This is where integration with external resources developed by domain experts, like computational tools or high-fidelity databases, becomes invaluable. With TextGrad, we can easily combine the broad knowledge base of LLMs with the specialized capabilities of scientific tools so that researchers can leverage the strengths of both.

Drug Discovery: With TextGrad, we can optimize chemical structures for two key properties: druglikeness (how easily the molecule will be absorbed by the body) and binding affinity (how tightly it will bind to a target protein). Druglikeness is measured using the QED score, which ranges from 0 to 1, with 1 being the most druglike; binding affinity is measured using the Vina score, where more negative is better. Since binding affinity generally favors larger molecules while druglikeness favors smaller ones, optimizing both is an important but challenging task.

We can describe a chemical structure using a textual description known as a SMILES string, and calculate the druglikeness and binding affinity using chemoinformatics tools. In short, the variable is a molecule, and the loss is that molecule’s druglikeness (QED score) and binding affinity (Vina score). 
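To make this concrete, here is a hedged sketch of how such a loss could be computed from a SMILES string. QED comes from RDKit; the vina_score helper is hypothetical and stands in for a docking tool such as AutoDock Vina, and wiring the resulting text back into a TextGrad variable is omitted:

```python
from rdkit import Chem
from rdkit.Chem import QED

def vina_score(smiles: str) -> float:
    """Hypothetical helper: dock the molecule against the target protein
    (e.g., with AutoDock Vina) and return the binding affinity in kcal/mol."""
    raise NotImplementedError

def molecule_feedback(smiles: str) -> str:
    """Turn the two objectives into a textual 'loss' for TextGrad to criticize."""
    mol = Chem.MolFromSmiles(smiles)   # parse the SMILES string
    qed = QED.qed(mol)                 # druglikeness in [0, 1], higher is better
    affinity = vina_score(smiles)      # binding affinity, more negative is better
    return (f"Molecule: {smiles}. QED (druglikeness) = {qed:.2f}; "
            f"Vina score (binding affinity) = {affinity:.1f} kcal/mol. "
            "Suggest modifications that improve both objectives.")
```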


An example of TextGrad for molecule optimization. The QED and Vina scores measure druglikeness and binding affinity respectively, and are calculated using chemoinformatics software. Using this information, TextGrad provides gradients that suggest better chemical structures.


Left: druglikeness and binding affinity distribution of molecules before and after 10 iterations of TextGrad optimization, compared with clinically approved drugs for the same target proteins. Right: example optimization trajectory showing 10 iterations of TextGrad, with the properties of clinically approved drugs for comparison.

After a couple of iterations, TextGrad produces molecules with a distribution of druglikeness and binding affinity similar to that of the molecules in DrugBank, a database of clinically approved drugs. Although we should keep in mind that drugs are optimized for more properties than just druglikeness and binding affinity, we’re excited that TextGrad can help integrate LLMs with cheminformatics tools, opening up new avenues in drug discovery.

Radiotherapy Treatment Planning: We use TextGrad to optimize radiotherapy treatment plans, which determine the necessary dose of radiation and pinpoint the exact locations that need treatment. In particular, the goal of treatment planning is to deliver the prescribed dose to the planning target volume (PTV), which encompasses the tumor and an additional margin to account for uncertainties in planning or treatment delivery, while protecting critical normal tissues, known as organs at risk (OARs), from receiving unsafe doses. Human planners often use a trial-and-error approach, iteratively adjusting optimization hyperparameters based on the results of the optimization process until the plans meet clinical requirements. This makes the process inefficient, time-consuming, and costly.

TextGrad optimizes radiotherapy treatment plans by iteratively providing gradients to the planning system to balance the tradeoff between the PTV and the OARs. The following example illustrates the gradients from TextGrad that were used to improve the planning system.


Text gradients on how to improve a radiotherapy treatment plan.
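To give a feel for how such a loop could be wired up, below is a heavily hedged sketch. The run_planning_system function is hypothetical, the objective weights and prompts are illustrative, and the predecessors argument is our assumption for how an external computation is linked into TextGrad’s graph; the actual setup in the paper may differ:

```python
import textgrad as tg

def run_planning_system(weight_text: str) -> str:
    """Hypothetical helper: run the dose-optimization engine with the given
    objective weights and return a text summary of the resulting dose-volume
    metrics for the PTV and each organ at risk (OAR)."""
    raise NotImplementedError

# The variable: a textual description of the planning hyperparameters.
weights = tg.Variable(
    "PTV coverage weight: 1.0; OAR sparing weights: bladder 0.5, rectum 0.5",
    requires_grad=True,
    role_description="objective weights for the treatment planning system",
)

# Loss: an LLM judge that reads the dose metrics and critiques the tradeoff.
loss_fn = tg.TextLoss(
    "Assess whether the PTV receives the prescribed dose while the OARs stay "
    "below safe limits, and explain how the objective weights should change."
)
optimizer = tg.TGD(parameters=[weights])

for _ in range(5):
    metrics = tg.Variable(
        run_planning_system(weights.value),
        requires_grad=False,
        predecessors=[weights],  # assumed way to link the metrics back to the weights
        role_description="dose-volume metrics of the current plan",
    )
    loss = loss_fn(metrics)
    loss.backward()   # textual feedback flows back to the weights
    optimizer.step()  # the weights description is rewritten using the feedback
```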

Across five patients, we compared the performance of TextGrad to that of clinicians and found that TextGrad achieves performance comparable to, and sometimes better than, clinicians. We should keep in mind that clinicians optimize their plans for a much larger set of criteria and for other patient information, such as previous treatments and disease history. Still, it is promising that TextGrad can optimize these two important properties, which can improve patient outcomes.

Looking Forward

As the paradigm of AI shifts from training individual models to optimizing compound systems involving multiple interacting LLM components and tools, we need a new generation of automated optimizers. TextGrad opens up fascinating opportunities for training large compound AI systems. We are working on extending TextGrad’s capabilities to handle more complex agents, such as those built with frameworks like LangChain. We are excited to release this package and cannot wait to see what other researchers and practitioners will build on top of it.

Additional resources:

Paper authors: Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, James Zou
