
Stanford HAI offers critical resources for faculty and students to continue groundbreaking research across the vast AI landscape.

The ML community is rapidly exploring techniques for prompting language models (LMs) and for stacking them into pipelines that solve complex tasks. Unfortunately, existing LM pipelines are typically implemented using hard-coded “prompt templates”, i.e. lengthy strings discovered via trial and error. Toward a more systematic approach for developing and optimizing LM pipelines, we introduce DSPy, a programming model that abstracts LM pipelines as text transformation graphs, or imperative computational graphs where LMs are invoked through declarative modules. DSPy modules are parameterized, meaning they can learn how to apply compositions of prompting, finetuning, augmentation, and reasoning techniques. We design a compiler that will optimize any DSPy pipeline to maximize a given metric, by creating and collecting demonstrations. We conduct two case studies, showing that succinct DSPy programs can express and optimize pipelines that reason about math word problems, tackle multi-hop retrieval, answer complex questions, and control agent loops. Within minutes of compiling, DSPy can automatically produce pipelines that outperform out-of-the-box few-shot prompting as well as expert-created demonstrations for GPT-3.5 and Llama2-13b-chat. On top of that, DSPy programs compiled for relatively small LMs like 770M parameter T5 and Llama2-13b-chat are competitive with many approaches that rely on large and proprietary LMs like GPT-3.5 and on expert-written prompt chains. DSPy is available at https://github.com/stanfordnlp/dspy
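
A minimal sketch of the programming model described above, using placeholder data: a chain-of-thought module is declared with the signature "question -> answer" rather than a hand-written prompt, and the compiler (here, the BootstrapFewShot teleprompter) bootstraps few-shot demonstrations that score well under a simple exact-match metric. The model name, training examples, and metric are illustrative, and constructor names vary somewhat across DSPy releases, so treat this as a sketch rather than canonical usage.

    import dspy
    from dspy.teleprompt import BootstrapFewShot

    # Assumption: an OpenAI API key is available in the environment;
    # any LM supported by DSPy could be configured here instead.
    lm = dspy.OpenAI(model="gpt-3.5-turbo")
    dspy.settings.configure(lm=lm)

    # A declarative module: the signature states what the module does;
    # the concrete prompt is left to the compiler.
    qa = dspy.ChainOfThought("question -> answer")

    # Tiny illustrative training set and metric (placeholders, not the paper's data).
    trainset = [
        dspy.Example(question="What is 6 times 7?", answer="42").with_inputs("question"),
        dspy.Example(question="Who wrote Hamlet?", answer="William Shakespeare").with_inputs("question"),
    ]

    def exact_match(example, pred, trace=None):
        return example.answer.strip().lower() == pred.answer.strip().lower()

    # "Compiling" collects bootstrapped demonstrations that maximize the metric.
    compiled_qa = BootstrapFewShot(metric=exact_match).compile(qa, trainset=trainset)

    print(compiled_qa(question="What is 9 times 8?").answer)

BootstrapFewShot is only one of the optimizers DSPy ships; the same program can be recompiled with a different optimizer or a different LM without changing the module definitions.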

This brief underscores the safety risks inherent in custom fine-tuning of large language models.


Small models get better, regulation moves to the states, and more.

Although data visualizations are a relatively recent invention, most people are expected to know how to read them. How do current machine learning systems compare with people when performing tasks involving data visualizations? Prior work evaluating machine data visualization understanding has relied upon weak benchmarks that do not resemble the tests used to assess these abilities in humans. We evaluated several state-of-the-art algorithms on data visualization literacy assessments designed for humans, and compared their responses to multiple cohorts of human participants with varying levels of experience with high school-level math. We found that these models systematically underperform all human cohorts and are highly sensitive to small changes in how they are prompted. Among the models we tested, GPT-4V most closely approximates human error patterns, but gaps remain between all models and humans. Our findings highlight the need for stronger benchmarks for data visualization understanding to advance artificial systems towards human-like reasoning about data visualizations.
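
For context on what such an evaluation involves, the snippet below shows how a single multiple-choice chart-reading item might be posed to GPT-4V through the OpenAI chat completions API. The chart image, question, answer options, and model identifier are illustrative stand-ins, not items or settings from the assessments used in the study.

    import base64
    from openai import OpenAI

    client = OpenAI()  # Assumption: OPENAI_API_KEY is set in the environment.

    # Placeholder chart image; the real assessments use curated visualization items.
    with open("line_chart.png", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    question = (
        "Roughly what value does the line reach in 2010? "
        "(A) 10  (B) 25  (C) 40  (D) 55. Answer with a single letter."
    )

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # illustrative GPT-4V model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
        max_tokens=5,
    )
    print(response.choices[0].message.content)

Because the paper reports that responses are highly sensitive to small changes in prompting, a careful evaluation would run each item under several phrasings and aggregate the results.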

This brief highlights the benefits of open foundation models and calls for greater focus on their marginal risks.
