Stanford
University
  • Stanford Home
  • Maps & Directions
  • Search Stanford
  • Emergency Info
  • Terms of Use
  • Privacy
  • Copyright
  • Trademarks
  • Non-Discrimination
  • Accessibility
© Stanford University.  Stanford, California 94305.
Christopher Potts | Stanford HAI

Stay Up To Date

Get the latest news, advances in research, policy work, and education program updates from HAI in your inbox weekly.

Sign Up For Latest News

Navigate
  • About
  • Events
  • Careers
  • Search
Participate
  • Get Involved
  • Support HAI
  • Contact Us
Skip to content
  • About

    • About
    • People
    • Get Involved with HAI
    • Support HAI
    • Subscribe to Email
  • Research

    • Research
    • Fellowship Programs
    • Grants
    • Student Affinity Groups
    • Centers & Labs
    • Research Publications
    • Research Partners
  • Education

    • Education
    • Executive and Professional Education
    • Government and Policymakers
    • K-12
    • Stanford Students
  • Policy

    • Policy
    • Policy Publications
    • Policymaker Education
    • Student Opportunities
  • AI Index

    • AI Index
    • AI Index Report
    • Global Vibrancy Tool
    • People
  • News
  • Events
  • Industry
  • Centers & Labs
peopleFaculty

Christopher Potts

Professor of Linguistics and, by courtesy, of Computer Science, Stanford University

External Bio
Latest Work
Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs
Krista Opsahl-Ong, Michael J Ryan, Josh Purtell, David Broman, Christopher Potts, Matei Zaharia, Omar Khattab
Nov 14
Research
Your browser does not support the video tag.

Language Model Programs, i.e. sophisticated pipelines of modular language model (LM) calls, are increasingly advancing NLP tasks, but they require crafting prompts that are jointly effective for all modules. We study prompt optimization for LM programs, i.e. how to update these prompts to maximize a downstream metric without access to module-level labels or gradients. To make this tractable, we factorize our problem into optimizing the free-form instructions and few-shot demonstrations of every module and introduce several strategies to craft task-grounded instructions and navigate credit assignment across modules. Our strategies include (i) program- and data-aware techniques for proposing effective instructions, (ii) a stochastic mini-batch evaluation function for learning a surrogate model of our objective, and (iii) a meta-optimization procedure in which we refine how LMs construct proposals over time. Using these insights we develop MIPRO, a novel algorithm for optimizing LM programs. MIPRO outperforms baseline optimizers on five of seven diverse multi-stage LM programs using a best-in-class open-source model (Llama-3-8B), by as high as 13% accuracy. We have released our new optimizers and benchmark in DSPy at [http://dspy.ai](http://dspy.ai).

RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
Christopher Potts, Jing Huang, Zhengxuan Wu, Mor Geva, Atticus Geiger
Aug 14
Research
Your browser does not support the video tag.

Individual neurons participate in the representation of multiple high-level concepts. To what extent can different interpretability methods successfully disentangle these roles? To help address this question, we introduce RAVEL (Resolving Attribute–Value Entanglements in Language Models), a dataset that enables tightly controlled, quantitative comparisons between a variety of existing interpretability methods. We use the resulting conceptual framework to define the new method of Multi-task Distributed Alignment Search (MDAS), which allows us to find distributed representations satisfying multiple causal criteria. With Llama2-7B as the target language model, MDAS achieves state-of-the-art results on RAVEL, demonstrating the importance of going beyond neuron-level analyses to identify features distributed across activations. We release our benchmark at https://github.com/ explanare/ravel.

pyvene: A Library for Understanding and Improving PyTorch Models via Interventions
Zhengxuan Wu, Atticus Geiger, Jing Huang, Noah Goodman, Christopher Potts, Aryaman Arora, Zheng Wang
Jun 01
Research
Your browser does not support the video tag.

Interventions on model-internal states are fundamental operations in many areas of AI, including model editing, steering, robustness, and interpretability. To facilitate such research, we introduce pyvene, an open-source Python library that supports customizable interventions on a range of different PyTorch modules. pyvene supports complex intervention schemes with an intuitive configuration format, and its interventions can be static or include trainable parameters. We show how pyvene provides a unified and extensible framework for performing interventions on neural models and sharing the intervened upon models with others. We illustrate the power of the library via interpretability analyses using causal abstraction and knowledge localization. We publish our library through Python Package Index (PyPI) and provide code, documentation, and tutorials at ‘https://github.com/stanfordnlp/pyvene‘.

Share
Link copied to clipboard!

All Related

DSPy: Compiling Declarative Language Model Calls into State-of-the-Art Pipelines
Omar Khattab, Matei Zaharia, Christopher Potts
Jan 16, 2024
Research
Your browser does not support the video tag.

The ML community is rapidly exploring techniques for prompting language models (LMs) and for stacking them into pipelines that solve complex tasks. Unfortunately, existing LM pipelines are typically implemented using hard-coded “prompt templates”, i.e. lengthy strings discovered via trial and error. Toward a more systematic approach for developing and optimizing LM pipelines, we introduce DSPy, a programming model that abstracts LM pipelines as text transformation graphs, or imperative computational graphs where LMs are invoked through declarative modules. DSPy modules are parameterized, meaning they can learn how to apply compositions of prompting, finetuning, augmentation, and reasoning techniques. We design a compiler that will optimize any DSPy pipeline to maximize a given metric, by creating and collecting demonstrations. We conduct two case studies, showing that succinct DSPy programs can express and optimize pipelines that reason about math word problems, tackle multi-hop retrieval, answer complex questions, and control agent loops. Within minutes of compiling, DSPy can automatically produce pipelines that outperform out-of-the-box few-shot prompting as well as expert-created demonstrations for GPT-3.5 and Llama2-13b-chat. On top of that, DSPy programs compiled for relatively small LMs like 770M parameter T5 and Llama2-13b-chat are competitive with many approaches that rely on large and proprietary LMs like GPT-3.5 and on expert-written prompt chains. DSPy is available at https://github.com/stanfordnlp/dspy

DSPy: Compiling Declarative Language Model Calls into State-of-the-Art Pipelines

Omar Khattab, Matei Zaharia, Christopher Potts
Jan 16, 2024

The ML community is rapidly exploring techniques for prompting language models (LMs) and for stacking them into pipelines that solve complex tasks. Unfortunately, existing LM pipelines are typically implemented using hard-coded “prompt templates”, i.e. lengthy strings discovered via trial and error. Toward a more systematic approach for developing and optimizing LM pipelines, we introduce DSPy, a programming model that abstracts LM pipelines as text transformation graphs, or imperative computational graphs where LMs are invoked through declarative modules. DSPy modules are parameterized, meaning they can learn how to apply compositions of prompting, finetuning, augmentation, and reasoning techniques. We design a compiler that will optimize any DSPy pipeline to maximize a given metric, by creating and collecting demonstrations. We conduct two case studies, showing that succinct DSPy programs can express and optimize pipelines that reason about math word problems, tackle multi-hop retrieval, answer complex questions, and control agent loops. Within minutes of compiling, DSPy can automatically produce pipelines that outperform out-of-the-box few-shot prompting as well as expert-created demonstrations for GPT-3.5 and Llama2-13b-chat. On top of that, DSPy programs compiled for relatively small LMs like 770M parameter T5 and Llama2-13b-chat are competitive with many approaches that rely on large and proprietary LMs like GPT-3.5 and on expert-written prompt chains. DSPy is available at https://github.com/stanfordnlp/dspy

Foundation Models
Natural Language Processing
Machine Learning
Your browser does not support the video tag.
Research
Evaluating Human and Machine Understanding of Data Visualizations
Arnav Verma, Kushin Mukherjee, Christopher Potts, Elisa Kreiss, Judith Fan
Jan 01, 2024
Research
Your browser does not support the video tag.

Although data visualizations are a relatively recent invention, most people are expected to know how to read them. How do current machine learning systems compare with people when performing tasks involving data visualizations? Prior work evaluating machine data visualization understanding has relied upon weak benchmarks that do not resemble the tests used to assess these abilities in humans. We evaluated several state-of-the-art algorithms on data visualization literacy assessments designed for humans, and compared their responses to multiple cohorts of human participants with varying levels of experience with high school-level math. We found that these models systematically underperform all human cohorts and are highly sensitive to small changes in how they are prompted. Among the models we tested, GPT-4V most closely approximates human error patterns, but gaps remain between all models and humans. Our findings highlight the need for stronger benchmarks for data visualization understanding to advance artificial systems towards human-like reasoning about data visualizations.

Evaluating Human and Machine Understanding of Data Visualizations

Arnav Verma, Kushin Mukherjee, Christopher Potts, Elisa Kreiss, Judith Fan
Jan 01, 2024

Although data visualizations are a relatively recent invention, most people are expected to know how to read them. How do current machine learning systems compare with people when performing tasks involving data visualizations? Prior work evaluating machine data visualization understanding has relied upon weak benchmarks that do not resemble the tests used to assess these abilities in humans. We evaluated several state-of-the-art algorithms on data visualization literacy assessments designed for humans, and compared their responses to multiple cohorts of human participants with varying levels of experience with high school-level math. We found that these models systematically underperform all human cohorts and are highly sensitive to small changes in how they are prompted. Among the models we tested, GPT-4V most closely approximates human error patterns, but gaps remain between all models and humans. Our findings highlight the need for stronger benchmarks for data visualization understanding to advance artificial systems towards human-like reasoning about data visualizations.

Foundation Models
Your browser does not support the video tag.
Research
A Moderate Proposal for Radically Better AI-powered Web Search
Christopher Potts
Jul 06, 2021
news

Large language models could give us instant answers, but at a cost to trust. Stanford scholars propose an alternative.

A Moderate Proposal for Radically Better AI-powered Web Search

Christopher Potts
Jul 06, 2021

Large language models could give us instant answers, but at a cost to trust. Stanford scholars propose an alternative.

Natural Language Processing
Machine Learning
news