The Conference and Workshop on Neural Information Processing Systems (NeurIPS) runs Nov. 28 to Dec. 9 this year, with some 2,906 accepted papers. Stanford contributions include work on knowledge graphs, multimodal models, reinforcement learning, imitation learning, activation compression, in-context learning, and more. The schedule is not yet finalized, so more sessions and posters will be added (and if we missed your contribution, please let us know below).
Here is the latest Stanford research you’ll see if you attend:
Keynote

Emmanuel Candès is the Barnum-Simons Chair in Mathematics and Statistics at Stanford University, and Professor of Electrical Engineering (by courtesy). His research interests lie at the interface of statistics, information theory, signal processing and computational mathematics. He received his Ph.D. in statistics from Stanford University in 1998. Candès has received several awards including the Alan T. Waterman Award from NSF, which is the highest honor bestowed by NSF on early-career scientists, and the MacArthur Fellowship, popularly known as the ‘genius award’. He has given over 80 plenary lectures at major international conferences, not only in mathematics and statistics but in many other areas as well, including biomedical imaging and solid-state physics. He was elected to the National Academy of Sciences and to the American Academy of Arts and Sciences in 2014. Learn more

Chelsea Finn (Stanford) · Fanny Yang · Hongseok Namkoong · Masashi Sugiyama · Jacob Eisenstein · Jonas Peters · Rebecca Roelofs · Shiori Sagawa (Stanford) · Pang Wei Koh (Stanford) · Yoonho Lee (Stanford)
This workshop brings together domain experts and ML researchers working on mitigating distribution shifts in real-world applications. Distribution shifts—where a model is deployed on a data distribution different from what it was trained on—pose significant robustness challenges in real-world ML applications. Such shifts are often unavoidable in the wild and have been shown to substantially degrade model performance in applications such as biomedicine, wildlife conservation, sustainable development, robotics, education, and criminal justice. For example, models can systematically fail when tested on patients from different hospitals or people from different demographics. This workshop aims to convene a diverse set of domain experts and methods-oriented researchers working on distribution shifts. We are broadly interested in methods, evaluations and benchmarks, and theory for distribution shifts, and we are especially interested in work on distribution shifts that arise naturally in real-world application contexts. Learn more

Yuxi Li · Emma Brunskill (Stanford) · Minmin Chen · Omer Gottesman · Lihong Li · Yao Liu · Zhiwei Tony Qin · Matthew Taylor
Discover how to improve the adoption of RL in practice by discussing key research problems, the state of the art, and success stories, insights, and lessons learned with regard to practical RL algorithms, practical issues, and applications with leading experts from both academia and industry. Learn more

Pan Lu · Swaroop Mishra · Sean Welleck · Yuhuai Wu · Hannaneh Hajishirzi · Percy Liang (Stanford)
Mathematical reasoning is a unique aspect of human intelligence and a fundamental building block for scientific and intellectual pursuits. However, learning mathematics is often a challenging human endeavor that relies on expert instructors to create, teach and evaluate mathematical material. From an educational perspective, AI systems that aid in this process offer increased inclusion and accessibility, efficiency, and understanding of mathematics. Moreover, building systems capable of understanding, creating, and using mathematics offers a unique setting for studying reasoning in AI. This workshop will investigate the intersection of mathematics education and AI. Learn more

Elizabeth Wood · Adji Bousso Dieng · Aleksandrina Goeva · Alex X Lu · Anshul Kundaje (Stanford) · Chang Liu · Debora Marks · Ed Boyden · Eli N Weinstein · Lorin Crawford · Mor Nitzan · Romain Lopez · Tamara Broderick · Ray Jones · Wouter Boomsma · Yixin Wang

Ryan Gardner · Gino Perrotta · Corey Lowman · Casey Richardson · Andrew Newman · Jared Markowitz · Nathan Drenkow · Bart Paulhamus · Ashley J Llorens · Todd Neller · Raman Arora · Bo Li · Mykel J Kochenderfer (Stanford)
Reconnaissance Blind Chess (RBC) is like chess except that a player cannot see her opponent's pieces in general. Rather, each player chooses a 3x3 square of the board to privately observe each turn. State-of-the-art algorithms, including those used to create agents for previous games like chess, Go, and poker, break down in Reconnaissance Blind Chess for several reasons, including the imperfect information, absence of obvious abstractions, and lack of common knowledge. Build the best bot for this challenge of making strong decisions in competitive multi-agent scenarios in the face of uncertainty! Learn more

Weihua Hu (Stanford) · Matthias Fey · Hongyu Ren (Stanford) · Maho Nakata · Yuxiao Dong · Jure Leskovec (Stanford)
Enabling effective and efficient machine learning (ML) over large-scale graph data (e.g., graphs with billions of edges) can have a huge impact on both industrial and scientific applications. At KDD Cup 2021, we organized the OGB Large-Scale Challenge (OGB-LSC), where we provided large and realistic graph ML tasks. Our KDD Cup attracted significant attention from the graph ML community (more than 500 team registrations across the globe), spurring the development of innovative methods that yielded significant performance breakthroughs. However, the problem of machine learning over large graphs is not solved yet, and it is important for the community to engage in a focused, multi-year effort in this area (like ImageNet and MS-COCO). Here we propose an annual ML challenge around large-scale graph datasets, which will drive forward method development and allow for tracking progress. We propose the 2nd OGB-LSC (referred to as OGB-LSC 2022) around the OGB-LSC datasets. Our proposed challenge consists of three tracks, covering core graph ML tasks of node-level prediction (academic paper classification with 240 million nodes), link-level prediction (knowledge graph completion with 90 million entities), and graph-level prediction (molecular property prediction with 4 million graphs). Importantly, we have updated two out of the three datasets based on the lessons learned from our KDD Cup, so that the resulting datasets are more challenging and realistic. Our datasets are extensively validated through our baseline analyses and last year’s KDD Cup. We also provide baseline code as well as a Python package to easily load the datasets and evaluate model performance. Learn more

Qian Huang · Hongyu Ren · Jure Leskovec (All Stanford)
The few-shot knowledge graph (KG) completion task aims to perform inductive reasoning over the KG: given only a few support triplets of a new relation ⋈ (e.g., (chop, ⋈, kitchen), (read, ⋈, library)), predict the query triplets of the same unseen relation ⋈, e.g., (sleep, ⋈, ?), with the existing knowledge in the KG. Current approaches cast the problem in a meta-learning framework, where the model needs to be first jointly trained over many training few-shot tasks, each being defined by its own relation, so that learning/prediction on the target few-shot task can be effective. However, in real-world KGs, curating many training tasks is a challenging ad hoc process. Here we propose Connection Subgraph Reasoner (CSR), which can make predictions for the target few-shot task directly without the need for pretraining on a human-curated set of training tasks. The key to CSR is that we explicitly model a shared connection subgraph between support and query triplets, as inspired by the principle of eliminative induction. To adapt to a specific KG, we design a corresponding self-supervised pretraining scheme with the objective of reconstructing automatically sampled connection subgraphs. Our pretrained model can then be directly applied to target few-shot tasks without the need for training few-shot tasks. Extensive experiments on real KGs, including NELL, FB15K-237, and ConceptNet, demonstrate the effectiveness of our framework: we show that even a learning-free implementation of CSR can already perform competitively with existing methods on target few-shot tasks; with pretraining, CSR can achieve significant gains of up to 56% on the more challenging inductive few-shot tasks where the entities are also unseen during (pre)training. Learn more

Xiang Li (Stanford) · John Thickstun · Ishaan Gulrajani · Percy Liang (Stanford) · Tatsunori Hashimoto (Stanford)
Controlling the behavior of language models (LMs) without retraining is a major open problem in natural language generation. While recent works have demonstrated successes on controlling simple sentence attributes (e.g., sentiment), there has been little progress on complex, fine-grained controls (e.g., syntactic structure). To address this challenge, we develop a new non-autoregressive language model based on continuous diffusions that we call Diffusion-LM. Building upon the recent successes of diffusion models in continuous domains, Diffusion-LM iteratively denoises a sequence of Gaussian vectors into word vectors, yielding a sequence of intermediate latent variables. To control its generation, we iteratively perform gradient updates on these intermediate variables. Diffusion-LM has three properties that enable complex, fine-grained controllable text generation: the continuous nature of diffusion models enables gradient-based control; the non-autoregressive generation order enables more complex, global controls; and incremental denoising induces a coarse-to-fine hierarchy, which facilitates control at multiple granularities. We demonstrate successful control of Diffusion-LM for six challenging fine-grained control tasks, significantly outperforming prior work. Learn more
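The gradient-guided denoising loop described above can be sketched in a few lines. This is a toy illustration with a hypothetical denoiser, schedule, and quadratic control objective, not the paper's actual model:

```python
import numpy as np

# Toy sketch of Diffusion-LM-style controllable generation. A sequence of
# Gaussian latents is iteratively denoised into "word vectors", and at each
# step the latent is nudged by the gradient of a control objective.
# The denoiser, step count, and target are all made-up stand-ins.

rng = np.random.default_rng(0)
seq_len, dim, steps = 8, 4, 50
target = np.ones(dim)            # hypothetical control: pull vectors toward `target`

def denoise(x, t):
    # stand-in for the learned denoiser: shrink the latent slightly each step
    return x * (1.0 - 1.0 / steps)

def control_grad(x):
    # gradient of -||x - target||^2 / 2, i.e., a pull toward the target
    return target - x

x = rng.normal(size=(seq_len, dim))      # start from pure Gaussian noise
for t in reversed(range(steps)):
    x = denoise(x, t)                    # one denoising step
    x = x + 0.1 * control_grad(x)        # gradient update on the latent

# after denoising plus guidance, the latents sit near the control target
print(np.abs(x - target).mean())
```

The point of the sketch is the interleaving: control is applied to the continuous intermediate latents, not to discrete tokens, which is what makes gradient-based steering possible.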

Rylan Schaeffer · Mikail Khona · Ila Rani Fiete
Research in Neuroscience, as in many scientific disciplines, is undergoing a renaissance based on deep learning. Unique to Neuroscience, deep learning models can be used not only as a tool but interpreted as models of the brain. The central claims of recent deep learning-based models of brain circuits are that they make novel predictions about neural phenomena or shed light on the fundamental functions being optimized. We show, through the case study of grid cells in the entorhinal-hippocampal circuit, that one may get neither. We begin by reviewing the principles of grid cell mechanism and function obtained from first-principles modeling efforts, then rigorously examine the claims of deep learning models of grid cells. Using large-scale hyperparameter sweeps and theory-driven experimentation, we demonstrate that the results of such models may be more strongly driven by particular, non-fundamental, and post-hoc implementation choices than fundamental truths about neural circuits or the loss function(s) they might optimize. We discuss why these models cannot be expected to produce accurate models of the brain without the addition of substantial amounts of inductive bias, an informal No Free Lunch result for Neuroscience. Based on first-principles work, we provide hypotheses for what additional loss functions will produce grid cells more robustly. In conclusion, caution and consideration, together with biological knowledge, are warranted in building and interpreting deep learning models in Neuroscience. Learn more

Michihiro Yasunaga (Stanford) · Antoine Bosselut · Hongyu Ren (Stanford) · Xikun Zhang (Stanford) · Christopher D Manning (Stanford) · Percy Liang (Stanford) · Jure Leskovec (Stanford)
Pretraining a language model (LM) on text helps various downstream NLP tasks. Recent works show that a knowledge graph (KG) can complement text data, offering structured background knowledge and a scaffold useful for reasoning. However, these works are not pretrained to learn a deep fusion of the two modalities at scale, limiting the potential to acquire fully joint representations of text and KG. Here we propose DRAGON (Deep Bidirectional Language-Knowledge Pretraining), a self-supervised approach to pretraining a deeply joint language-knowledge model from raw text and KG at scale. Specifically, our model takes pairs of text segments and relevant KG subgraphs as input and bidirectionally fuses information from both modalities. We pretrain this model by unifying two self-supervised reasoning objectives, masked language modeling and KG link prediction. DRAGON outperforms existing LMs and LM+KG models on diverse downstream tasks including question answering across general and biomedical domains, with a +5% absolute gain on average across the board. In particular, DRAGON achieves notable performance on complex reasoning about language and knowledge (+10% on questions involving long context or multi-step reasoning) and low-resource QA (+8% on OBQA and RiddleSense), and new state-of-the-art results on various BioNLP tasks. Learn more

Divyansh Garg · Skanda Vaidyanath (Stanford) · Kuno Kim (Stanford) · Jiaming Song (Stanford) · Stefano Ermon (Stanford)
Learning policies that effectively utilize language instructions in complex, multi-task environments is an important problem in imitation learning. While it is possible to condition on the entire language instruction directly, such an approach could suffer from generalization issues. To encode complex instructions into skills that can generalize to unseen instructions, we propose Learning Interpretable Skill Abstractions (LISA), a hierarchical imitation learning framework that can learn diverse, interpretable skills from language-conditioned demonstrations. LISA uses vector quantization to learn discrete skill codes that are highly correlated with language instructions and the behavior of the learned policy. In navigation and robotic manipulation environments, LISA is able to outperform a strong non-hierarchical baseline in the low-data regime and compose learned skills to solve tasks containing unseen long-range instructions. Our method demonstrates a more natural way to condition on language in sequential decision-making problems and achieves interpretable and controllable behavior with the learned skills. Learn more

Yiding Jiang · Evan Liu (Stanford) · Benjamin Eysenbach · J. Zico Kolter · Chelsea Finn (Stanford)
Identifying statistical regularities in solutions to some tasks in multi-task reinforcement learning can accelerate the learning of new tasks. Skill learning offers one way of extracting these regularities by decomposing pre-collected experience into a sequence of skills. A popular approach to skill learning is maximizing the likelihood of the pre-collected experience with latent variable models. However, there are often many different solutions that maximize the likelihood equally well, including degenerate solutions. To address this underspecification, we propose a new objective that combines the maximum likelihood objective with a penalty on the description length of the skills. This penalty incentivizes the skills to maximally identify and extract common structure from the experiences. We demonstrate the effectiveness of our method on a multi-task benchmark from prior work. Further, while most prior works in the offline multi-task setting focus on low-dimensional tasks, we demonstrate that our method can scale to challenging tasks with image observations. Additionally, the acquired skills can be used to solve downstream tasks with up to 8x fewer samples, as compared with skills acquired through maximizing likelihood. Learn more
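The role of the description-length penalty can be illustrated with a toy tie-breaking example. The skill names, the fixed likelihood term, and the entropy-code length below are all hypothetical, not the paper's objective:

```python
import math
from collections import Counter

# Toy sketch of a "likelihood + description length" trade-off: two candidate
# skill decompositions fit the data equally well, and the description-length
# penalty breaks the tie in favor of the one with more shared structure.

def description_length(skill_sequence):
    # code length in bits under an entropy code over skill frequencies
    counts = Counter(skill_sequence)
    total = sum(counts.values())
    return -sum(c * math.log2(c / total) for c in counts.values())

nll = 10.0                                  # assume equal data fit for both candidates
candidates = {
    "reuses skills": ["goto", "grasp", "goto", "grasp", "goto", "grasp"],
    "one-off skills": ["s1", "s2", "s3", "s4", "s5", "s6"],
}

beta = 1.0                                  # penalty strength (hypothetical)
scores = {name: nll + beta * description_length(seq)
          for name, seq in candidates.items()}
best = min(scores, key=scores.get)
print(best, scores)
```

A sequence that reuses two skills costs 6 bits to encode, while six one-off skills cost about 15.5 bits, so the penalized objective prefers the decomposition with common structure even though both fit equally well.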

Yongchan Kwon · James Zou (Stanford)
The Shapley value is a popular approach for measuring the influence of individual features. While Shapley feature attribution is built upon desiderata from game theory, some of its constraints may be less natural in certain machine learning settings, leading to unintuitive model interpretation. In particular, the Shapley value uses the same weight for all marginal contributions, i.e., it gives the same importance when a large number of other features are given as when a small number of other features are given. This property can be problematic if larger feature sets are more or less informative than smaller feature sets. Our work performs a rigorous analysis of the potential limitations of Shapley feature attribution. We identify simple settings where the Shapley value is mathematically suboptimal by assigning larger attributions for less influential features. Motivated by this observation, we propose WeightedSHAP, which generalizes the Shapley value and learns which marginal contributions to focus on directly from data. On several real-world datasets, we demonstrate that the influential features identified by WeightedSHAP are better able to recapitulate the model's predictions compared to the features identified by the Shapley value. Learn more
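The core idea, reweighting marginal contributions by coalition size, fits in a short exact computation. The value function below is an illustrative toy, and the "learned" weights are a hand-picked stand-in for what WeightedSHAP would fit from data:

```python
from itertools import combinations
from math import factorial

# Toy sketch of size-weighted feature attribution: the Shapley value averages
# a feature's marginal contributions with fixed per-coalition-size weights;
# WeightedSHAP's idea is to learn those per-size weights instead.

n = 3
features = range(n)

def v(S):
    # hypothetical set value: feature 0 matters only in the full coalition
    S = set(S)
    score = 0.5 * len(S & {1, 2})
    if 0 in S and len(S) == 3:
        score += 1.0
    return score

def attribution(i, size_weights):
    # sum over S subset of N\{i} of size_weights[|S|] * (v(S + {i}) - v(S))
    others = [j for j in features if j != i]
    total = 0.0
    for s in range(len(others) + 1):
        for S in combinations(others, s):
            total += size_weights[s] * (v(set(S) | {i}) - v(S))
    return total

# Shapley's fixed weights: s!(n-1-s)!/n! for a coalition of size s
shapley_w = [factorial(s) * factorial(n - 1 - s) / factorial(n) for s in range(n)]
# a hypothetical learned alternative that emphasizes large coalitions
large_w = [0.0, 0.0, 1.0]

print(attribution(0, shapley_w), attribution(0, large_w))
```

Here feature 0 contributes only when all other features are present; uniform Shapley weighting dilutes its attribution to 1/3, while weights concentrated on large coalitions credit it fully, which is exactly the kind of mismatch the paper analyzes.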

Victor Weixin Liang (Stanford) · Yuhui Zhang (Stanford) · Yongchan Kwon · Serena Yeung (Stanford) · James Zou (Stanford)
We present the modality gap, an intriguing geometric phenomenon of the representation space of multimodal models. Specifically, we show that different data modalities (e.g., images and text) are embedded at arm's length in their shared representation in multimodal models such as CLIP. Our systematic analysis demonstrates that this gap is caused by a combination of model initialization and contrastive learning optimization. In model initialization, we show empirically and theoretically that the representation of a common deep neural network is restricted to a narrow cone. As a consequence, in a multimodal model with two encoders, the representations of the two modalities are clearly apart when the model is initialized. During optimization, contrastive learning keeps the different modalities separated by a certain distance, which is influenced by the temperature parameter in the loss function. Our experiments further demonstrate that varying the modality gap distance has a significant impact on improving the model's downstream zero-shot classification performance and fairness. Learn more
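The cone effect and the resulting gap are easy to reproduce with random features. The sketch below uses random unit vectors concentrated around two random center directions as stand-ins for the two encoders at initialization; it is an illustration of the geometry, not the paper's analysis of CLIP:

```python
import numpy as np

# Toy sketch of the "modality gap" measurement: each modality's normalized
# embeddings fall in a narrow cone around a center direction, so the two
# modality centroids sit clearly apart even though each cone is tight.

rng = np.random.default_rng(0)
dim, n = 64, 200

def cone_embeddings(center, noise_scale, n):
    # unit vectors concentrated around a given center direction
    x = center + noise_scale * rng.normal(size=(n, dim))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

img_center = rng.normal(size=dim)   # stand-in for the image encoder's cone axis
txt_center = rng.normal(size=dim)   # stand-in for the text encoder's cone axis
img = cone_embeddings(img_center, 0.1, n)
txt = cone_embeddings(txt_center, 0.1, n)

# modality gap: distance between the two modality centroids,
# compared against the within-modality spread
gap = np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0))
spread = np.linalg.norm(img - img.mean(axis=0), axis=1).mean()
print(gap, spread)
```

Because two independent random directions in high dimension are nearly orthogonal, the centroid-to-centroid distance is an order of magnitude larger than the spread within either cone, mirroring the gap observed in real multimodal models at initialization.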

Shivam Garg · Dimitris Tsipras · Gregory Valiant · Percy Liang (All Stanford)
In-context learning is the ability of a model to condition on a prompt sequence consisting of in-context examples (input-output pairs corresponding to some task) along with a new query input, and generate the corresponding output. Crucially, in-context learning happens only at inference time, without any parameter updates to the model. While large language models such as GPT-3 exhibit some ability to perform in-context learning, it is unclear what the relationship is between tasks on which this succeeds and what is present in the training data. To investigate this, we consider the problem of training a model to in-context learn a function class (e.g., linear functions): given data derived from some functions in the class, can we train a model (e.g., a Transformer) to in-context learn most functions from that class? We show empirically that standard Transformers can be trained from scratch to perform in-context learning of linear functions; that is, the trained model is able to learn unseen linear functions from in-context examples with performance comparable to the optimal least squares estimator. In fact, in-context learning is possible even under two forms of distribution shift: (i) between the training data of the Transformer and inference-time prompts, and (ii) between the in-context examples and the query input during inference. We also show that we can train Transformers to in-context learn more complex function classes: sparse linear functions, where the model outperforms least squares and nearly matches the performance of Lasso, and two-layer neural networks, where the model performs comparably to neural networks trained on in-context examples using gradient descent. Learn more
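The experimental setup is easy to reproduce on the data side. The sketch below generates one in-context prompt for a random linear function and computes the least-squares reference that the trained Transformer is compared against; the dimensions and example counts are illustrative, and the Transformer itself is not shown:

```python
import numpy as np

# Toy sketch of the in-context linear-regression setup: a prompt is a set of
# (x, w.x) example pairs plus a query input, and the optimal least-squares
# estimator fit on the examples is the reference predictor.

rng = np.random.default_rng(0)
dim, n_examples = 5, 20

w = rng.normal(size=dim)                  # an "unseen" linear function f(x) = w.x
xs = rng.normal(size=(n_examples, dim))   # in-context example inputs
ys = xs @ w                               # their outputs
x_query = rng.normal(size=dim)            # the new query input

# least-squares estimate of w from only the in-context examples
w_hat, *_ = np.linalg.lstsq(xs, ys, rcond=None)
prediction = w_hat @ x_query

print(prediction, w @ x_query)
```

With noiseless examples and more examples than dimensions, least squares recovers the function exactly; the paper's finding is that a Transformer trained from scratch on such prompts matches this estimator on unseen functions.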

Jue Wang · Binhang Yuan · Luka Rimanic · Yongjun He · Tri Dao (Stanford) · Beidi Chen (Stanford) · Christopher Ré (Stanford) · Ce Zhang
Communication compression is a crucial technique for modern distributed learning systems to alleviate their communication bottlenecks over slower networks. Despite recent intensive studies of gradient compression for data-parallel-style training, compressing the activations for models trained with pipeline parallelism is still an open problem. In this paper, we propose AQ-SGD, a novel activation compression algorithm for communication-efficient pipeline-parallel training over slow networks. Different from previous efforts in activation compression, instead of compressing activation values directly, AQ-SGD compresses the changes of the activations. This allows us to show, to the best of our knowledge for the first time, that one can still achieve an O(1/T) convergence rate for non-convex objectives under activation compression, without making assumptions on gradient unbiasedness that do not hold for deep learning models with non-linear activation functions. We then show that AQ-SGD can be optimized and implemented efficiently, without additional end-to-end runtime overhead. We evaluated AQ-SGD to fine-tune language models with up to 1.5 billion parameters, compressing activations to 2-4 bits. AQ-SGD provides up to 4.3× end-to-end speedup in slower networks, without sacrificing model quality. Moreover, we also show that AQ-SGD can be combined with state-of-the-art gradient compression algorithms to enable end-to-end communication compression: all communications between machines, including model gradients, forward activations, and backward gradients, are compressed into lower precision. This provides up to 4.9× end-to-end speedup, without sacrificing model quality. Learn more
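The "compress the change, not the value" idea can be shown with a crude uniform quantizer. This is a toy illustration of why delta compression works when activations drift slowly during training, not the paper's algorithm or quantizer:

```python
import numpy as np

# Toy sketch of delta-based activation compression: instead of quantizing the
# activation itself each step, quantize the *change* since the last
# communicated version and let the receiver accumulate the changes.

rng = np.random.default_rng(0)

def quantize(x, bits=4):
    # crude uniform quantizer over the tensor's own range (illustrative only)
    lo, hi = x.min(), x.max()
    if hi == lo:
        return x
    levels = 2 ** bits - 1
    q = np.round((x - lo) / (hi - lo) * levels)
    return lo + q / levels * (hi - lo)

activation_state = np.zeros(16)     # receiver's running copy of the activation
for step in range(50):
    # activations drift slowly between steps as the model trains
    true_activation = np.sin(np.linspace(0, 3, 16) + 0.01 * step)
    delta = true_activation - activation_state
    activation_state = activation_state + quantize(delta)   # send quantized change

error = np.abs(activation_state - true_activation).max()
print(error)
```

Because each step's delta is small, the quantizer's range shrinks over time and the receiver's copy tracks the true activation closely even at 4 bits; quantizing the raw activation at the same bit width would keep a fixed, much larger error.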

Binhang Yuan · Yongjun He · Tianyi Zhang (Stanford) · Jared Davis (Stanford) · Tri Dao (Stanford) · Beidi Chen (Stanford) · Percy Liang (Stanford) · Christopher Ré (Stanford) · Ce Zhang
Training foundation models, such as GPT-3 and PaLM, can be extremely expensive, often involving tens of thousands of GPUs running continuously for months. These models are typically trained in specialized clusters featuring fast, homogeneous interconnects and using carefully designed software systems that support both data parallelism and model/pipeline parallelism. Such dedicated clusters can be costly and difficult to obtain. Can we instead leverage the much greater amount of decentralized, heterogeneous, and lower-bandwidth interconnected compute? Previous works examining the heterogeneous, decentralized setting focus on relatively small models that can be trained in a purely data parallel manner. State-of-the-art schemes for model-parallel foundation model training, such as Megatron, only consider the homogeneous data center setting. In this paper, we present the first study of training large foundation models with model parallelism in a decentralized regime over a heterogeneous network. Our key technical contribution is a scheduling algorithm that allocates different computational "tasklets" in the training of foundation models to a group of decentralized GPU devices connected by a slow heterogeneous network. We provide a formal cost model and further propose an efficient evolutionary algorithm to find the optimal allocation strategy. We conduct extensive experiments that represent different scenarios for learning over geo-distributed devices simulated using real-world network measurements. In the most extreme case, across 8 different cities spanning 3 continents, our approach is 4.8× faster than prior state-of-the-art training systems (Megatron). Learn more

Mohamad Kazem Shirani Faradonbeh · Mohamad Sadegh Shirani Faradonbeh (Stanford) · Mohsen Bayati (Stanford)
Diffusion processes that evolve according to linear stochastic differential equations are an important family of continuous-time dynamic decision-making models. Optimal policies are well-studied for them, under full certainty about the drift matrices. However, little is known about data-driven control of diffusion processes with uncertain drift matrices, as conventional discrete-time analysis techniques are not applicable. In addition, while the task can be viewed as a reinforcement learning problem involving an exploration-exploitation trade-off, ensuring system stability is a fundamental component of designing optimal policies. We establish that the popular Thompson sampling algorithm learns optimal actions fast, incurring only a square-root-of-time regret, and also stabilizes the system in a short time period. To the best of our knowledge, this is the first such result for Thompson sampling in a diffusion process control problem. We validate our theoretical results through empirical simulations with real parameter matrices from two settings of airplane and blood glucose control. Moreover, we observe that Thompson sampling significantly improves (worst-case) regret, compared to the state-of-the-art algorithms, suggesting Thompson sampling explores in a more guarded fashion. Our theoretical analysis involves characterization of a certain "optimality manifold" that ties the local geometry of the drift parameters to the optimal control of the diffusion process. We expect this technique to be of broader interest. Learn more
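The sample-then-act principle of Thompson sampling that this paper carries over to continuous-time control is easiest to see in its classic Bernoulli-bandit form. The sketch below is a generic textbook example of that principle, unrelated to the paper's diffusion setting beyond the algorithm's core idea:

```python
import random

# Thompson sampling in its simplest (Bernoulli bandit) form: sample
# parameters from the posterior, act optimally for the sample, then update
# the posterior with the observed outcome.

random.seed(0)
true_rates = [0.3, 0.7]        # unknown success rates of two actions
wins = [1, 1]                  # Beta(1, 1) priors over each rate
losses = [1, 1]
pulls = [0, 0]

for _ in range(2000):
    # sample a plausible rate for each action from its Beta posterior
    samples = [random.betavariate(wins[a], losses[a]) for a in range(2)]
    a = samples.index(max(samples))      # act greedily w.r.t. the sample
    reward = random.random() < true_rates[a]
    wins[a] += reward
    losses[a] += 1 - reward
    pulls[a] += 1

print(pulls)
```

As the posteriors concentrate, the samples (and hence the actions) increasingly favor the better arm; the paper's contribution is showing that the same posterior-sampling recipe both learns and stabilizes linear diffusion processes.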

Alex Tamkin · Dat Nguyen · Salil Deshpande · Jesse Mu · Noah Goodman (All Stanford)
Models can fail in unpredictable ways during deployment due to task ambiguity, when multiple behaviors are consistent with the provided training data. An example is an object classifier trained on red squares and blue circles: when encountering blue squares, the intended behavior is undefined. We investigate whether pretrained models are better active learners, capable of disambiguating between the possible tasks a user may be trying to specify. Intriguingly, we find that better active learning is an emergent property of the pretraining process: pretrained models require up to 5 times fewer labels when using uncertainty-based active learning, while non-pretrained models see no or even negative benefit. We find these gains come from an ability to select examples with attributes that disambiguate the intended behavior, such as rare product categories or atypical backgrounds. These attributes are far more linearly separable in pretrained models' representation spaces than in non-pretrained models', suggesting a possible mechanism for this behavior. Learn more
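The uncertainty-based active learning loop used in such experiments follows a standard recipe: fit on the labeled set, score the unlabeled pool by uncertainty, query the most uncertain point. The sketch below runs that loop with a tiny logistic model on synthetic 2-D data; the data, model, and query budget are all hypothetical stand-ins for the paper's pretrained-feature setting:

```python
import numpy as np

# Toy sketch of uncertainty-based active learning: at each round, request a
# label for the pool example the current model is least sure about
# (prediction closest to 0.5).

rng = np.random.default_rng(0)
n_pool = 200
X = rng.normal(size=(n_pool, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # true labels (hidden until queried)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit(Xl, yl, steps=200, lr=0.5):
    # tiny logistic regression trained by gradient descent
    w = np.zeros(2)
    for _ in range(steps):
        w -= lr * (sigmoid(Xl @ w) - yl) @ Xl / len(yl)
    return w

# seed with two extreme points so both classes are almost surely represented
labeled = [int(np.argmax(X[:, 0])), int(np.argmin(X[:, 0]))]
for _ in range(10):
    w = fit(X[labeled], y[labeled])
    p = sigmoid(X @ w)
    uncertainty = -np.abs(p - 0.5)          # most uncertain = closest to 0.5
    uncertainty[labeled] = -np.inf          # never re-query a labeled point
    labeled.append(int(np.argmax(uncertainty)))

w = fit(X[labeled], y[labeled])
accuracy = ((sigmoid(X @ w) > 0.5) == (y > 0.5)).mean()
print(accuracy, len(labeled))
```

The paper's observation is about where this loop pays off: the same uncertainty heuristic selects genuinely disambiguating examples when run on pretrained representations, but yields little or no benefit on non-pretrained ones.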

Yash Chandak · Shiv Shankar · Nathaniel Bastian · Bruno da Silva · Emma Brunskill (Stanford) · Philip Thomas
Methods for sequential decision-making are often built upon a foundational assumption that the underlying decision process is stationary. This limits the application of such methods because real-world problems are often subject to changes due to external factors (passive non-stationarity), changes induced by interactions with the system itself (active non-stationarity), or both (hybrid non-stationarity). In this work, we take the first steps towards the fundamental challenge of on-policy and off-policy evaluation amidst structured changes due to active, passive, or hybrid non-stationarity. Towards this goal, we make a higher-order stationarity assumption such that non-stationarity results in changes over time, but the way changes happen is fixed. We propose OPEN, an algorithm that uses a double application of counterfactual reasoning and a novel importance-weighted instrument-variable regression to obtain both a lower-bias and a lower-variance estimate of the structure in the changes of a policy's past performances. Finally, we show promising results on how OPEN can be used to predict future performances for several domains inspired by real-world applications that exhibit non-stationarity. Learn more

Hung Le · Yue Wang · Akhilesh Gotmare · Silvio Savarese (Stanford) · Steven Chu Hong Hoi
Program synthesis or code generation aims to generate a program that satisfies a problem specification. Recent approaches using large-scale pretrained language models (LMs) have shown promising results, yet they have some critical limitations. In particular, they often follow a standard supervised learning procedure to train a code generation model from natural language problem descriptions and ground-truth programs only. Such a paradigm has largely ignored some important but potentially useful signals in the problem specification, such as unit tests, whether during training or inference, which results in poor performance when solving complex unseen coding tasks. To address the limitations, we propose CodeRL, a new framework to improve pretrained LMs for program synthesis tasks through deep reinforcement learning (RL). Specifically, during training, we treat the code-generating LM as an actor network, and introduce a critic network that is trained to predict the functional correctness of generated programs and provide dense feedback signals to the actor. During inference, we introduce a new generation procedure with a critical sampling strategy that allows a model to automatically regenerate programs based on feedback from example unit tests and critic scores. For the model backbones, we extend the encoder-decoder architecture of CodeT5 with enhanced learning objectives, larger model sizes, and better pretraining data. Our method not only achieves new SOTA results on the APPS benchmark, but also shows strong zero-shot capability with new SOTA results on the simpler MBPP benchmark. Learn more

Dylan Asmar · Mykel J Kochenderfer (Both Stanford)
The level of autonomy is increasing in systems spanning multiple domains, but these systems still experience failures. One way to mitigate the risk of failures is to integrate human oversight of the autonomous systems and rely on the human to take control when the autonomy fails. In this work, we formulate a method of collaborative decision making through action suggestions that improves action selection without taking control of the system. Our approach uses each suggestion efficiently by incorporating the implicit information shared through suggestions to modify the agent's belief and achieves better performance with fewer suggestions than naively following the suggested actions. We assume collaborative agents share the same objective and communicate through valid actions. By assuming the suggested action is dependent only on the state, we can incorporate the suggested action as an independent observation of the environment. The assumption of a collaborative environment enables us to use the agent's policy to estimate the distribution over action suggestions. We propose two methods that use suggested actions and demonstrate the approach through simulated experiments. The proposed methodology results in increased performance while also being robust to suboptimal suggestions. Learn more
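The central modeling move above, treating a suggested action as an independent observation of the state, reduces to a plain Bayes update. The sketch below does this for a hypothetical two-state problem with a made-up suggestion model; the paper applies the same idea inside a full POMDP:

```python
import numpy as np

# Toy sketch of incorporating a collaborator's action suggestion as an extra
# observation of the hidden state: assuming the suggestion depends only on
# the state, Bayes' rule applies directly to the agent's belief.

belief = np.array([0.5, 0.5])   # uniform prior over two hidden states

# P(suggested action | state): the collaborator mostly suggests the action
# that is appropriate for the true state (hypothetical numbers)
p_suggest = np.array([
    [0.9, 0.1],   # in state 0: suggests action 0 with prob 0.9
    [0.2, 0.8],   # in state 1: suggests action 1 with prob 0.8
])

def update(belief, suggested_action):
    # Bayes update: b'(s) proportional to P(suggestion | s) * b(s)
    posterior = p_suggest[:, suggested_action] * belief
    return posterior / posterior.sum()

belief = update(belief, suggested_action=1)
print(belief)
```

A single suggestion of action 1 shifts the belief to roughly (1/9, 8/9), so even without following the suggestion the agent has extracted most of its information, which is why the approach outperforms naively executing suggested actions.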

Viraj Mehta · Ian Char · Joseph Abbate · Rory Conlin · Mark Boyer · Stefano Ermon (Stanford) · Jeff Schneider · Willie Neiswanger (Stanford)
Many potential applications of reinforcement learning (RL) are stymied by the large numbers of samples required to learn an effective policy. This is especially true when applying RL to real-world control tasks, e.g. in the sciences or robotics, where executing a policy in the environment is costly. In popular RL algorithms, agents typically explore either by adding stochasticity to a reward-maximizing policy or by attempting to gather maximal information about environment dynamics without taking the given task into account. In this work, we develop a method that allows us to plan for exploration while taking both the task and the current knowledge about the dynamics into account. The key insight of our approach is to plan an action sequence that maximizes the expected information gain about the optimal trajectory for the task at hand. We demonstrate that our method learns strong policies with 2x fewer samples than strong exploration baselines and 200x fewer samples than model-free methods on a diverse set of low-to-medium dimensional control tasks in both the open-loop and closed-loop control settings. Learn more

Ibrahim Alabdulmohsin · Jessica Schrouff · Sanmi Koyejo (Stanford)
We propose a novel reduction-to-binary (R2B) approach that enforces demographic parity for multiclass classification with non-binary sensitive attributes via a reduction to a sequence of binary debiasing tasks. We prove that R2B satisfies optimality and bias guarantees and demonstrate empirically that it can lead to an improvement over two baselines: (1) treating multiclass problems as multilabel by debiasing labels independently and (2) transforming the features instead of the labels. Surprisingly, we also demonstrate that independent label debiasing yields competitive results in most (but not all) settings. We validate these conclusions on synthetic and real-world datasets from social science, computer vision, and healthcare. Learn more

Alexander Soen · Ibrahim Alabdulmohsin · Sanmi Koyejo (Stanford) · Yishay Mansour · Nyalleng Moorosi · Richard Nock · Ke Sun · Lexing Xie
We introduce a new family of techniques to post-process ("wrap") a black-box classifier in order to reduce its bias. Our technique builds on the recent analysis of improper loss functions whose optimization can correct any twist in prediction, unfairness being treated as a twist. In the post-processing, we learn a wrapper function, defined as an α-tree, which modifies the prediction. We provide two generic boosting algorithms to learn α-trees. We show that our modification has appealing properties in terms of composition of α-trees, generalization, interpretability, and KL divergence between modified and original predictions. We exemplify the use of our technique in three fairness notions: conditional value at risk, equality of opportunity, and statistical parity; and provide experiments on several readily available datasets. Learn more

Evan Liu · Moritz Stephan · Allen Nie · Chris Piech · Emma Brunskill · Chelsea Finn (All Stanford)
Creating interactive software, such as websites or games, is a particularly engaging way to learn computer science. However, teaching and giving feedback on such software is hard — standard approaches require instructors to hand-grade student-implemented interactive programs. As a result, online platforms that serve millions, like Code.org, are unable to provide any feedback on assignments for implementing interactive programs, which critically hinders students’ ability to learn. Recent work proposes to train reinforcement learning agents to interact with a student’s program, aiming to explore states indicative of errors. However, this approach only provides binary feedback of whether a program is correct or not, while students require finer-grained feedback on the specific errors in their programs to understand their mistakes. In this work, we show that exploring to discover errors can be cast as a meta-exploration problem. This enables us to construct a principled objective for discovering errors and an algorithm for optimizing this objective, which provides fine-grained feedback. We evaluate our approach on a set of 700K real anonymized student programs from a Code.org interactive assignment. Our approach provides feedback with 94.3% accuracy, improving over existing approaches by over 17.7% and coming within 1.5% of human-level accuracy. Learn more

Michael Poli (Stanford) · Stefano Massaroli · Federico Berto · Jinkyoo Park · Tri Dao (Stanford) · Christopher Ré (Stanford) · Stefano Ermon (Stanford)
Spectrum analysis provides one of the most effective paradigms for information-preserving dimensionality reduction in data: often, a simple description of naturally occurring signals can be obtained via a few terms of periodic basis functions. Neural operators designed for frequency domain learning, frequency domain models (FDMs), are based on complex-valued transforms, i.e., Fourier transforms (FT), and layers that perform computation on the spectrum and input data separately. This design introduces considerable computational overhead: for each layer, a forward and an inverse FT. Instead, this work introduces a blueprint for frequency domain learning through a single transform: transform once (T1). To enable efficient, direct learning in the frequency domain, we develop a variance-preserving weight initialization scheme and investigate various choices of transforms. Our results noticeably streamline the design process of FDMs, pruning redundant transforms and leading to speedups of 3x to 10x that increase with data resolution and model size. We perform extensive experiments on learning to solve partial differential equations, including incompressible Navier-Stokes, turbulent flows around airfoils, and high-resolution video of smoke dynamics. T1 models improve on the test performance of SOTA FDMs while requiring significantly less computation, with over 20% reduction in predictive error across tasks. Learn more

Boyang Deng · Sumith Kulal (Stanford) · Zhengyang Dong (Stanford) · Congyue Deng (Stanford) · Yonglong Tian · Jiajun Wu (Stanford)
Shape programs encode shape structures by representing object parts as subroutines and constructing the overall shape by composing these subroutines. This usually involves the reuse of subroutines for repeatable parts, enabling the modeling of correlations among shape elements such as geometric similarity. However, existing learning-based shape programs suffer from limited representation capacity because they use coarse geometry representations such as geometric primitives and low-resolution voxel grids. Further, their training requires manually annotated ground-truth programs, which are expensive to attain. We address these limitations by proposing Shape Programs with Repeatable Implicit Parts (ProGRIP). Using implicit functions to represent parts, ProGRIP greatly boosts the representation capacity of shape programs while preserving the higher-level structure of repetitions and symmetry. Meanwhile, we free ProGRIP from any inaccessible supervised training by devising a matching-based unsupervised training objective. Our empirical studies show that ProGRIP outperforms existing structured representations in shape reconstruction fidelity as well as segmentation accuracy of semantic parts. Learn more

Zixian Ma · Rose Wang · Michael Bernstein · Fei-Fei Li · Ranjay Krishna (All Stanford)
Modern multi-agent reinforcement learning frameworks rely on centralized training and reward shaping to perform well. However, centralized training and dense rewards are not readily available in the real world. Current multi-agent algorithms struggle to learn in the alternative setup of decentralized training or sparse rewards. To address these issues, we propose a self-supervised intrinsic reward called alignment, inspired by the self-organization principle in zoology. Similar to how animals collaborate in a decentralized manner with those in their vicinity, agents trained with alignment learn behaviors that match their neighbors' expectations. This allows the agents to learn collaborative behaviors without any external reward or centralized training. We demonstrate the efficacy of our approach across 6 tasks in the multi-agent particle and the complex Google Research football environments, comparing alignment to sparse and curiosity-based intrinsic rewards. When the number of agents increases, alignment scales well in all multi-agent tasks except for one where agents have different capabilities. We show that agent coordination improves through alignment because agents learn to divide tasks amongst themselves, break coordination symmetries, and confuse adversaries. These results identify tasks where alignment is a more useful strategy than curiosity-driven exploration for multi-agent coordination, enabling agents to do zero-shot coordination. Learn more

Axel Levy (Stanford) · Gordon Wetzstein (Stanford) · Julien N.P. Martel (Stanford) · Frederic Poitevin (SLAC) · Ellen Zhong
Cryo-electron microscopy (cryo-EM) is an imaging modality that provides unique insights into the dynamics of proteins and other building blocks of life. The algorithmic challenge of jointly estimating the poses, 3D structure, and conformational heterogeneity of a biomolecule from millions of noisy and randomly oriented 2D projections in a computationally efficient manner, however, remains unsolved. Our method, cryoFIRE, performs ab initio heterogeneous reconstruction with unknown poses in an amortized framework, thereby avoiding the computationally expensive step of pose search while enabling the analysis of conformational heterogeneity. Poses and conformation are jointly estimated by an encoder, while a physics-based decoder aggregates the images into an implicit neural representation of the conformational space. We show that our method can provide one order of magnitude speedup on datasets containing millions of images, without any loss of accuracy. We validate that the joint estimation of poses and conformations can be amortized over the size of the dataset. For the first time, we prove that an amortized method can extract interpretable dynamic information from experimental datasets. Learn more

Tailin Wu (Stanford) · Takashi Maruyama · Jure Leskovec (Stanford)
Simulating the time evolution of Partial Differential Equations (PDEs) of large-scale systems is crucial in many scientific and engineering domains such as fluid dynamics, weather forecasting and their inverse optimization problems. However, both classical solvers and recent deep learning-based surrogate models are typically extremely computationally intensive, because of their local evolution: they need to update the state of each discretized cell at each time step during inference. Here we develop Latent Evolution of PDEs (LE-PDE), a simple, fast and scalable method to accelerate the simulation and inverse optimization of PDEs. LE-PDE learns a compact, global representation of the system and efficiently evolves it fully in the latent space with learned latent evolution models. LE-PDE achieves speedup by having a much smaller latent dimension to update during long rollout as compared to updating in the input space. We introduce new learning objectives to effectively learn such latent dynamics to ensure long-term stability. We further introduce techniques for speeding up inverse optimization of boundary conditions for PDEs via backpropagation through time in latent space, and an annealing technique to address the non-differentiability and sparse interaction of boundary conditions. We test our method in a 1D benchmark of nonlinear PDEs, 2D Navier-Stokes flows into turbulent phase and an inverse optimization of boundary conditions in 2D Navier-Stokes flow. Compared to state-of-the-art deep learning-based surrogate models and other strong baselines, we demonstrate up to 128x reduction in the dimensions to update, and up to 15x improvement in speed, while achieving competitive accuracy. Learn more
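The latent-evolution idea can be sketched with linear maps: encode the discretized state once, roll the dynamics forward in a small latent space, and decode only when an output is needed. All shapes and the random linear maps below are illustrative assumptions, not the paper's learned models.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_latent = 1024, 16           # input grid size vs. compact latent size

# Stand-ins for a learned encoder, decoder, and latent evolution model.
encoder = rng.normal(size=(n_latent, n_cells)) / np.sqrt(n_cells)
decoder = rng.normal(size=(n_cells, n_latent)) / np.sqrt(n_latent)
latent_step = np.eye(n_latent) * 0.99  # toy stable latent dynamics

def rollout(u0, n_steps):
    z = encoder @ u0                   # encode once
    for _ in range(n_steps):
        z = latent_step @ z            # cheap per-step update in latent space
    return decoder @ z                 # decode once at the end

u0 = rng.normal(size=n_cells)
u_T = rollout(u0, n_steps=100)
print(u_T.shape)  # (1024,)
```

The speedup claim in the abstract follows from the shapes: each latent step costs on the order of n_latent², not n_cells², and decoding happens only when a full-resolution state is required.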

Fan-Yun Sun · Isaac Kauvar · Ruohan Zhang · Jiachen Li · Mykel J Kochenderfer · Jiajun Wu · Nick Haber (All Stanford)
Modeling multi-agent systems requires understanding how agents interact. Such systems are often difficult to model because they can involve a variety of types of interactions that layer together to drive rich social behavioral dynamics. Here we introduce a method for accurately modeling multi-agent systems. We present Interaction Modeling with Multiplex Attention (IMMA), a forward prediction model that uses a multiplex latent graph to represent multiple independent types of interactions and attention to account for relations of different strengths. We also introduce Progressive Layer Training, a training strategy for this architecture. We show that our approach outperforms state-of-the-art models in trajectory forecasting and relation inference, spanning three multi-agent scenarios: social navigation, cooperative task achievement, and team sports. We further demonstrate that our approach can improve zero-shot generalization and allows us to probe how different interactions impact agent behavior. Learn more

Yann Dubois · Stefano Ermon · Tatsunori Hashimoto · Percy Liang (All Stanford)
Despite the empirical successes of self-supervised learning (SSL) methods, it is unclear what characteristics of their representations lead to high downstream accuracies. In this work, we characterize properties that SSL representations should ideally satisfy. Specifically, we prove necessary and sufficient conditions such that for any task invariant to given data augmentations, probes (e.g., linear or MLP) trained on that representation attain perfect accuracy. These requirements lead to a unifying conceptual framework for improving existing SSL methods and deriving new ones. For contrastive learning, our framework prescribes simple but significant improvements to previous methods such as using asymmetric projection heads. For non-contrastive learning, we use our framework to derive a simple and novel objective. Our resulting SSL algorithms outperform baselines on standard benchmarks, including SwAV+multi-crops on linear probing of ImageNet. Learn more

Eric Nguyen · Karan Goel · Albert Gu · Gordon Downs · Preey Shah · Tri Dao · Stephen Baccus · Christopher Ré (All Stanford)
Visual data such as images and videos are typically modeled as discretizations of inherently continuous, multidimensional signals. Existing continuous-signal models attempt to exploit this fact by modeling the underlying signals of visual (e.g., image) data directly. However, they have not yet been able to achieve competitive performance on practical vision tasks such as large-scale image and video classification. Building on a recent line of work on deep state space models (SSMs), we propose S4ND, a new multidimensional SSM layer that extends SSMs' continuous-signal modeling ability to multidimensional data including images and videos. We show that S4ND can model large-scale visual data in 1D, 2D, and 3D as continuous multidimensional signals and demonstrate strong performance by simply swapping Conv2D and self-attention layers with S4ND layers in existing state-of-the-art models. On ImageNet-1k, S4ND exceeds the performance of a ViT baseline by 1.5% accuracy when training with a 1D sequence of patches, and matches ConvNeXt when modeling images in 2D. For videos, S4ND improves on an inflated 3D ConvNeXt in activity classification on HMDB-51 by 4% accuracy. S4ND implicitly learns global, continuous convolutional kernels that are resolution invariant by construction, providing an inductive bias that enables generalization across multiple resolutions. By developing a simple band-limiting modification to S4 to overcome aliasing, S4ND achieves strong zero-shot (unseen at test time) resolution performance, e.g. achieving 88.7% accuracy on CIFAR-10 when trained on 16×16 and tested on 32×32 images. When trained with progressive resizing, S4ND comes within ∼1% of a high-resolution model while training 22% faster. Learn more

Can Chang · Ni Mu · Jiajun Wu (Stanford) · Ling Pan · Huazhe Xu
A critical challenge in multi-agent reinforcement learning (MARL) is for multiple agents to efficiently accomplish complex, long-horizon tasks. The agents often have difficulties in cooperating on common goals, dividing complex tasks, and planning through several stages to make progress. We propose to address these challenges by guiding agents with programs designed for parallelization, since programs as a representation contain rich structural and semantic information, and are widely used as abstractions for long-horizon tasks. Specifically, we introduce Efficient Multi-Agent Reinforcement Learning with Parallel Program Guidance (E-MAPP), a novel framework that leverages parallel programs to guide multiple agents to efficiently accomplish goals that require planning over 10+ stages. E-MAPP integrates the structural information from a parallel program, promotes the cooperative behaviors grounded in program semantics, and improves the time efficiency via a task allocator. We conduct extensive experiments on a series of challenging, long-horizon cooperative tasks in the Overcooked environment. Results show that E-MAPP outperforms strong baselines in terms of the completion rate, time efficiency, and zero-shot generalization ability by a large margin. Learn more

Willie Neiswanger · Lantao Yu · Shengjia Zhao · Chenlin Meng · Stefano Ermon (All Stanford)
Bayesian optimization (BO) is a popular method for efficiently inferring optima of an expensive black-box function via a sequence of queries. Existing information-theoretic BO procedures aim to make queries that most reduce the uncertainty about optima, where the uncertainty is captured by Shannon entropy. However, an optimal measure of uncertainty would, ideally, factor in how we intend to use the inferred quantity in some downstream procedure. In this paper, we instead consider a generalization of Shannon entropy from work in statistical decision theory (DeGroot 1962, Rao 1984), which contains a broad class of uncertainty measures parameterized by a problem-specific loss function corresponding to a downstream task. We first show that special cases of this entropy lead to popular acquisition functions used in BO procedures such as knowledge gradient, expected improvement, and entropy search. We then show how alternative choices for the loss yield a flexible family of acquisition functions that can be customized for use in novel optimization settings. Additionally, we develop gradient-based methods to efficiently optimize our proposed family of acquisition functions, and demonstrate strong empirical performance on a diverse set of sequential decision making tasks, including variants of top-k optimization, multi-level set estimation, and sequence search. Learn more
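One of the special cases the abstract mentions, expected improvement (EI), is easy to write down. For a maximization problem with a Gaussian posterior N(mu, sigma²) at a candidate point and best observed value `best`, EI = (mu − best)·Φ(z) + sigma·φ(z) with z = (mu − best)/sigma. The sketch below is a standard textbook form, not the paper's generalized-entropy machinery; the inputs are illustrative.

```python
import math

def expected_improvement(mu, sigma, best):
    """Closed-form EI for a Gaussian posterior at one candidate point."""
    if sigma <= 0:
        return max(mu - best, 0.0)          # no uncertainty: improvement or nothing
    z = (mu - best) / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # standard normal CDF
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal PDF
    return (mu - best) * Phi + sigma * phi

ei = expected_improvement(mu=1.2, sigma=0.5, best=1.0)
print(ei)  # positive: the candidate plausibly beats the incumbent
```

Note how EI trades off exploitation (mu above `best`) against exploration (large sigma); the paper's generalized entropies recover this and other acquisition functions as loss-specific instances.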

Alexander Bergman (Stanford) · Petr Kellnhofer · Wang Yifan (Stanford) · Eric Chan (Stanford) · David Lindell · Gordon Wetzstein (Stanford)
Unsupervised learning of 3D-aware generative adversarial networks (GANs) using only collections of single-view 2D photographs has recently made rapid progress. These 3D GANs, however, have not been demonstrated for human bodies, and the generated radiance fields of existing frameworks are not directly editable, limiting their applicability in downstream tasks. We propose a solution to these challenges by developing a 3D GAN framework that learns to generate radiance fields of human bodies or faces in a canonical pose and warp them using an explicit deformation field into a desired body pose or facial expression. Using our framework, we demonstrate the first high-quality radiance field generation results for human bodies. Moreover, we show that our deformation-aware training procedure significantly improves the quality of generated bodies or faces when editing their poses or facial expressions compared to a 3D GAN that is not trained with explicit deformations. Learn more

Annie Chen (Stanford) · Archit Sharma (Stanford) · Sergey Levine · Chelsea Finn (Stanford)
Reinforcement learning algorithms are typically designed to learn a performant policy that can repeatedly and autonomously complete a task, typically starting from scratch. However, many real-world situations operate under a different set of assumptions: the goal might not be to learn a policy that can do the task repeatedly, but simply to perform a new task successfully once, ideally as quickly as possible, and while leveraging some prior knowledge or experience. For example, imagine a robot that is exploring another planet, where it cannot get help or supervision from humans. If it needs to navigate to a crater that it has never seen before in search of water, it does not really need to acquire a policy for reaching craters reliably, it only needs to reach this particular crater once. It must do so without the benefit of episodic resets and tackle a new, unknown terrain, but it can leverage prior experience it acquired on Earth. We formalize this problem setting, which we call single-life reinforcement learning (SLRL), where an agent must complete a task once while contending with some form of novelty in a single trial without interventions, given some prior data. In this setting, we find that algorithms designed for standard episodic reinforcement learning can struggle, as they have trouble recovering from novel states especially when informative rewards are not provided. Motivated by this observation, we also propose an algorithm, Q-weighted adversarial learning (QWALE), that addresses the dearth of supervision by employing a distribution matching strategy that leverages the agent's prior experience as guidance in novel situations. Our experiments on several single-life continuous control problems indicate that methods based on our distribution matching formulation are 20-60% more successful because they can more quickly recover from novel, out-of-distribution states. Learn more

Eldar Abraham · Karel D'Oosterlinck · Amir Feder · Yair Gat · Atticus Geiger (Stanford) · Christopher Potts (Stanford) · Roi Reichart · Zhengxuan Wu (Stanford)
The increasing size and complexity of modern ML systems has improved their predictive capabilities but made their behavior harder to explain. Many techniques for model explanation have been developed in response, but we lack clear criteria for assessing these techniques. In this paper, we cast model explanation as the causal inference problem of estimating causal effects of real-world concepts on the output behavior of ML models given actual input data. We introduce CEBaB, a new benchmark dataset for assessing concept-based explanation methods in Natural Language Processing (NLP). CEBaB consists of short restaurant reviews with human-generated counterfactual reviews in which an aspect (food, noise, ambiance, service) of the dining experience was modified. Original and counterfactual reviews are annotated with multiply-validated sentiment ratings at the aspect level and review level. The rich structure of CEBaB allows us to go beyond input features to study the effects of abstract, real-world concepts on model behavior. We use CEBaB to compare the quality of a range of concept-based explanation methods covering different assumptions and conceptions of the problem, and we seek to establish natural metrics for comparative assessments of these methods. Learn more

Anthony Corso (Stanford) · Sydney Katz (Stanford) · Craig Innes · Xin Du · Subramanian Ramamoorthy · Mykel J Kochenderfer (Stanford)
Modern autonomous systems rely on perception modules to process complex sensor measurements into state estimates. These estimates are then passed to a controller, which uses them to make safety-critical decisions. It is therefore important that we design perception systems to minimize errors that reduce the overall safety of the system. We develop a risk-driven approach to designing perception systems that accounts for the effect of perceptual errors on the performance of the fully integrated, closed-loop system. We formulate a risk function to quantify the effect of a given perceptual error on overall safety, and show how we can use it to design safer perception systems by including a risk-dependent term in the loss function and generating training data in risk-sensitive regions. We evaluate our techniques on a realistic vision-based aircraft detect-and-avoid application and show that risk-driven design reduces collision risk by 37% over a baseline system. Learn more
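The risk-dependent loss term described above can be sketched simply: weight each perception error by a risk value quantifying its effect on closed-loop safety. The squared-error base loss and the specific risk weights below are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def risk_weighted_loss(pred, target, risk_weight):
    """Mean of per-example perception errors scaled by a risk function."""
    per_example = (pred - target) ** 2     # base perception loss (assumption)
    return float(np.mean(risk_weight * per_example))

pred   = np.array([1.0, 2.0, 3.0])   # estimated states
target = np.array([1.5, 2.0, 2.0])   # true states
risk   = np.array([4.0, 1.0, 1.0])   # errors near unsafe states cost more

loss = risk_weighted_loss(pred, target, risk)
print(loss)
```

Training against such a loss pushes the perception model to spend its capacity where errors actually endanger the closed-loop system, which is the core of the risk-driven design idea.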

Ben Sorscher (Stanford) · Robert Geirhos · Shashank Shekhar · Surya Ganguli (Stanford) · Ari Morcos
Widely observed neural scaling laws, in which error falls off as a power of the training set size, model size, or both, have driven substantial performance improvements in deep learning. However, these improvements through scaling alone require considerable costs in compute and energy. Here we focus on the scaling of error with dataset size and show both in theory and practice that we can break beyond power law scaling and reduce it to exponential scaling instead if we have access to a high-quality data pruning metric that ranks the order in which training examples should be discarded to achieve any pruned dataset size. We then test this new exponential scaling prediction with pruned dataset size empirically, and indeed observe better than power-law scaling performance on ResNets trained on CIFAR-10, SVHN, and ImageNet. Given the importance of finding high-quality pruning metrics, we perform the first large-scale benchmarking study of 9 different data pruning metrics on ImageNet. We find most existing high-performing metrics scale poorly to ImageNet, while the best are computationally intensive and require labels for every image. We therefore developed a new simple, cheap and scalable self-supervised pruning metric that demonstrates comparable performance to the best supervised metrics. Overall, our work suggests that the discovery of good data-pruning metrics may provide a viable path forward to substantially improved neural scaling laws, thereby reducing the resource costs of modern deep learning. Learn more
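Mechanically, metric-based pruning as described above is just ranking and truncation: score every training example with the pruning metric and keep a fixed fraction of the highest-scoring ones. The random scores below stand in for a real metric (an assumption for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
n_examples = 1000
scores = rng.normal(size=n_examples)       # stand-in pruning metric: higher = keep

def prune(scores, keep_frac):
    """Return indices of the top keep_frac fraction of examples by score."""
    k = int(len(scores) * keep_frac)
    return np.argsort(scores)[::-1][:k]    # descending sort, take top-k

kept = prune(scores, keep_frac=0.3)
print(len(kept))  # 300
```

The paper's point is that the quality of `scores` is everything: with a good enough ranking, the error-vs-dataset-size curve can beat the usual power law.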

Jesse Mu (Stanford) · Victor Zhong · Roberta Raileanu · Minqi Jiang · Noah Goodman (Stanford) · Tim Rocktäschel · Edward Grefenstette
Reinforcement learning (RL) agents are particularly hard to train when rewards are sparse. One common solution is to use intrinsic rewards to encourage agents to explore their environment. However, recent intrinsic exploration methods often use state-based novelty measures which reward low-level exploration and may not scale to domains requiring more abstract skills. Instead, we explore natural language as a general medium for highlighting relevant abstractions in an environment. Unlike previous work, we evaluate whether language can improve over existing exploration methods by directly extending (and comparing to) competitive intrinsic exploration baselines: AMIGo (Campero et al., 2021) and NovelD (Zhang et al., 2021). These language-based variants outperform their non-linguistic forms by 23-46% across 13 challenging tasks from the MiniGrid and MiniHack environment suites. Learn more

Rishi Bommasani · Kathleen Creel · Ananya Kumar · Dan Jurafsky · Percy Liang (All Stanford)
As the scope of machine learning broadens, we observe a recurring theme of algorithmic monoculture: the same systems, or systems that share components (e.g., datasets, models), are deployed by multiple decision-makers. While sharing offers advantages like amortizing effort, it also has risks. We introduce and formalize one such risk, outcome homogenization, defined here as the extent to which particular individuals or groups experience the same outcomes across different deployments. If the same individuals or groups exclusively experience undesirable outcomes, this may institutionalize systemic exclusion and reinscribe social hierarchy. We relate algorithmic monoculture and outcome homogenization by proposing the component sharing hypothesis: if algorithmic systems are increasingly built on the same data or models, then they will increasingly homogenize outcomes. We test this hypothesis on algorithmic fairness benchmarks based on the US Census, demonstrating that increased data-sharing exacerbates homogenization, especially for small datasets. Further, given the current regime in AI of foundation models, i.e., pretrained models that can be adapted to myriad downstream tasks, we test whether model-sharing homogenizes outcomes across tasks. We observe mixed results: we find that for both vision and language settings, the specific methods for adapting a foundation model significantly influence the degree of outcome homogenization. We also identify societal challenges that inhibit the measurement, diagnosis, and rectification of outcome homogenization in deployed machine learning systems. Learn more
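One way to make the homogenization notion concrete, as a rough illustration rather than the paper's actual metric, is to compare the observed rate of individuals who receive an undesirable outcome from every deployment against the rate expected if deployments erred independently. The outcome matrix and the ratio below are illustrative assumptions.

```python
import numpy as np

# Rows: individuals; columns: deployed systems; 1 = favorable, 0 = undesirable.
outcomes = np.array([[1, 1, 0],
                     [0, 0, 0],
                     [1, 0, 1],
                     [0, 0, 0]])

# Observed fraction of people rejected by *all* systems.
fail_all_observed = np.mean(outcomes.sum(axis=1) == 0)

# Expected fraction if each system's rejections were independent.
per_model_fail = np.mean(outcomes == 0, axis=0)
fail_all_independent = np.prod(per_model_fail)

homogenization = fail_all_observed / fail_all_independent
print(homogenization)  # > 1 means outcomes co-occur more than chance predicts
```

A ratio well above 1, as in this toy matrix, is the kind of signal the component sharing hypothesis predicts when deployments share data or models.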

Tailin Wu (Stanford) · Megan Tjandrasuwita · Zhengxuan Wu (Stanford) · Xuelin Yang (Stanford) · Kevin Liu (Stanford) · Rok Sosic (Stanford) · Jure Leskovec (Stanford)
Humans have the remarkable ability to recognize and acquire novel visual concepts in a zero-shot manner. Given a high-level, symbolic description of a novel concept in terms of previously learned visual concepts and their relations, humans can recognize novel concepts without seeing any examples. Moreover, they can acquire new concepts by parsing and communicating symbolic structures using learned visual concepts and relations. Endowing these capabilities in machines is pivotal in improving their generalization capability at inference time. In this work, we introduce Zero-shot Concept Recognition and Acquisition (ZeroC), a neuro-symbolic architecture that can recognize and acquire novel concepts in a zero-shot way. ZeroC represents concepts as graphs of constituent concept models (as nodes) and their relations (as edges). To allow inference-time composition, we employ energy-based models (EBMs) to model concepts and relations. We design the ZeroC architecture so that it allows a one-to-one mapping between a symbolic graph structure of a concept and its corresponding EBM, which allows acquiring new concepts, communicating its graph structure, and applying it to classification and detection tasks at inference time. We introduce algorithms for learning and inference with ZeroC. We evaluate ZeroC on a challenging gridworld dataset which is designed to probe zero-shot concept recognition and acquisition, and demonstrate its capability. Learn more

Albert Gu (Stanford) · Karan Goel (Stanford) · Ankit Gupta · Christopher Ré (Stanford)
State space models (SSMs) have recently been shown to be very effective as a deep learning layer, offering a promising alternative to sequence models such as RNNs, CNNs, or Transformers. The first version to show this potential was the S4 model, which is particularly effective on tasks involving long-range dependencies by using a prescribed state matrix called the HiPPO matrix. While this has an interpretable mathematical mechanism for modeling long dependencies, it also requires a custom representation and algorithm that makes the model difficult to understand and implement. On the other hand, a recent variant of S4 called DSS showed that restricting the state matrix to be fully diagonal can still preserve the performance of the original model when using a specific initialization based on approximating S4's matrix. This work seeks to systematically understand how to parameterize and initialize diagonal state space models. While it follows from classical results that almost all SSMs have an equivalent diagonal form, we show that the initialization is critical for performance. First, we explain why DSS works mathematically, as the diagonal approximation to S4 surprisingly recovers the same dynamics in the limit of infinite state dimension. We then systematically describe various design choices in parameterizing and computing diagonal SSMs, and perform a controlled empirical study ablating the effects of these choices. Our final model S4D is a simple diagonal version of S4 whose kernel computation requires just 3 lines of code and performs comparably to S4 in almost all settings, with state-of-the-art results in image, audio, and medical time-series domains, and 85% average on the Long Range Arena benchmark. Learn more
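The "3 lines of code" claim comes from the fact that, with a diagonal state matrix, the length-L convolution kernel of an SSM reduces to a Vandermonde matrix-vector product: K[l] = Σₙ Cₙ·Bₙ·Aₙˡ. The sketch below shows that reduction with toy, hand-picked parameter values; the paper's actual initialization and discretization are not reproduced here.

```python
import numpy as np

def diagonal_ssm_kernel(A, B, C, L):
    """Kernel of a diagonal SSM: K[l] = sum_n C_n * B_n * A_n**l."""
    vandermonde = A[:, None] ** np.arange(L)  # (N, L) matrix of powers A_n**l
    return (B * C) @ vandermonde              # contract over the state dim -> (L,)

N, L = 4, 8
A = np.full(N, 0.9)       # toy diagonal (already-discretized) state matrix
B = np.ones(N)
C = np.ones(N) / N
K = diagonal_ssm_kernel(A, B, C, L)
print(K)  # geometric decay: K[l] = 0.9**l with these toy values
```

Once K is materialized, applying the layer to an input sequence is a single (FFT-based) convolution, which is what makes the diagonal parameterization so simple compared to S4's custom algorithm.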

Eric Zelikman (Stanford) · Yuhuai Wu · Jesse Mu (Stanford) · Noah Goodman (Stanford)
Generating step-by-step "chain-of-thought" rationales improves language model performance on complex reasoning tasks like mathematics or commonsense question answering. However, inducing language model rationale generation currently requires either constructing massive rationale datasets or sacrificing accuracy by using only few-shot inference. We propose a technique to iteratively leverage a small number of rationale examples and a large dataset without rationales, to bootstrap the ability to perform successively more complex reasoning. This technique, the "Self-Taught Reasoner" (STaR), relies on a simple loop: generate rationales to answer many questions, prompted with a few rationale examples; if the generated answers are wrong, try again to generate a rationale given the correct answer; fine-tune on all the rationales that ultimately yielded correct answers; repeat. We show that STaR significantly improves performance on multiple datasets compared to a model fine-tuned to directly predict final answers, and performs comparably to fine-tuning a 30× larger state-of-the-art language model on CommonsenseQA. Thus, STaR lets a model improve itself by learning from its own generated reasoning. Learn more
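The loop described above can be sketched in a few lines of Python. This is a hedged sketch: `generate` and `finetune` are hypothetical placeholders for calls to a real language model API, not names from the paper:

```python
def star_round(generate, finetune, problems):
    """One iteration of the STaR loop (sketch). `generate(question, hint=None)`
    returns (rationale, answer); `finetune(examples)` updates the model.
    Returns the rationales that led to correct answers."""
    keep = []
    for question, gold in problems:
        rationale, answer = generate(question)                 # few-shot rationale attempt
        if answer != gold:                                     # "rationalization": retry,
            rationale, answer = generate(question, hint=gold)  # giving the answer as a hint
        if answer == gold:                                     # keep only verified rationales
            keep.append((question, rationale, gold))
    finetune(keep)                                             # fine-tune on the successes
    return keep
```

Repeating this round with the freshly fine-tuned model is what bootstraps successively harder reasoning.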

Bahjat Kawar · Michael Elad · Stefano Ermon (Stanford) · Jiaming Song (Stanford)
Many interesting tasks in image restoration can be cast as linear inverse problems. A recent family of approaches for solving these problems uses stochastic algorithms that sample from the posterior distribution of natural images given the measurements. However, efficient solutions often require problem-specific supervised training to model the posterior, whereas unsupervised methods that are not problem-specific typically rely on inefficient iterative methods. This work addresses these issues by introducing Denoising Diffusion Restoration Models (DDRM), an efficient, unsupervised posterior sampling method. Motivated by variational inference, DDRM takes advantage of a pretrained denoising diffusion generative model for solving any linear inverse problem. We demonstrate DDRM's versatility on several image datasets for super-resolution, deblurring, inpainting, and colorization under various amounts of measurement noise. DDRM outperforms the current leading unsupervised methods on the diverse ImageNet dataset in reconstruction quality, perceptual quality, and runtime, being 5× faster than the nearest competitor. DDRM also generalizes well to natural images outside the distribution of the observed ImageNet training set. Learn more

Hung Le · Yue Wang · Akhilesh Gotmare · Silvio Savarese (Stanford) · Steven Chu Hong Hoi
Program synthesis, or code generation, aims to generate a program that satisfies a problem specification. Recent approaches using large-scale pretrained language models (LMs) have shown promising results, yet they have some critical limitations. In particular, they often follow a standard supervised learning procedure to train a code generation model from natural language problem descriptions and ground-truth programs only. Such a paradigm largely ignores some important but potentially useful signals in the problem specification, such as unit tests, during both training and inference, which results in poor performance when solving complex unseen coding tasks. To address these limitations, we propose CodeRL, a new framework to improve pretrained LMs for program synthesis tasks through deep reinforcement learning (RL). Specifically, during training, we treat the code-generating LM as an actor network, and introduce a critic network that is trained to predict the functional correctness of generated programs and provide dense feedback signals to the actor. During inference, we introduce a new generation procedure with a critical sampling strategy that allows a model to automatically regenerate programs based on feedback from example unit tests and critic scores. For the model backbones, we extend the encoder-decoder architecture of CodeT5 with enhanced learning objectives, larger model sizes, and better pretraining data. Our method not only achieves new SOTA results on the APPS benchmark, but also shows strong zero-shot capability with new SOTA results on the simpler MBPP benchmark. Learn more

Marvin Zhang · Sergey Levine · Chelsea Finn (Stanford)
While deep neural networks can attain good accuracy on in-distribution test points, many applications require robustness even in the face of unexpected perturbations in the input, changes in the domain, or other sources of distribution shift. We study the problem of test-time robustification, i.e., using the test input to improve model robustness. Recent prior works have proposed methods for test-time adaptation; however, they each introduce additional assumptions, such as access to multiple test points, that prevent widespread adoption. In this work, we aim to study and devise methods that make no assumptions about the model training process and are broadly applicable at test time. We propose a simple approach that can be used in any test setting where the model is probabilistic and adaptable: when presented with a test example, perform different data augmentations on the data point, and then adapt (all of) the model parameters by minimizing the entropy of the model's average, or marginal, output distribution across the augmentations. Intuitively, this objective encourages the model to make the same prediction across different augmentations, thus enforcing the invariances encoded in these augmentations, while also maintaining confidence in its predictions. In our experiments, we evaluate two baseline ResNet models, two robust ResNet-50 models, and a robust vision transformer model, and we demonstrate that this approach achieves accuracy gains of 1-8% over standard model evaluation and also generally outperforms prior augmentation and adaptation strategies. For the setting in which only one test point is available, we achieve state-of-the-art results on the ImageNet-C, ImageNet-R, and, among ResNet-50 models, ImageNet-A distribution shift benchmarks. Learn more
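The objective is easy to state concretely. Here is a minimal NumPy sketch of the marginal-entropy quantity; in the actual method the model then takes a gradient step on all its parameters against this objective, and the function name here is ours, not the paper's:

```python
import numpy as np

def marginal_entropy(aug_probs):
    """Entropy of the marginal (average) prediction over augmented copies
    of one test point. aug_probs: (n_augmentations, n_classes) softmax rows.
    Low values mean the augmented predictions agree and are confident."""
    marginal = aug_probs.mean(axis=0)
    return float(-np.sum(marginal * np.log(marginal + 1e-12)))
```

Predictions that agree across augmentations give a lower value than predictions that disagree, which is exactly the invariance-plus-confidence pressure described above.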

Lingjiao Chen · Matei Zaharia · James Zou (All Stanford)
Deployed machine learning (ML) models often encounter new user data that differs from their training data. Therefore, estimating how well a given model might perform on the new data is an important step toward reliable ML applications. This is very challenging, however, as the data distribution can change in flexible ways, and we may not have any labels on the new data, which is often the case in monitoring settings. In this paper, we propose a new distribution shift model, Sparse Joint Shift (SJS), which considers the joint shift of both labels and a few features. This unifies and generalizes several existing shift models, including label shift and sparse covariate shift, where only marginal feature or label distribution shifts are considered. We describe mathematical conditions under which SJS is identifiable. We further propose SEES, an algorithmic framework to characterize the distribution shift under SJS and to estimate a model's performance on new data without any labels. We conduct extensive experiments on several real-world datasets with various ML models. Across different datasets and distribution shifts, SEES achieves significant (up to an order of magnitude) shift estimation error improvements over existing approaches. Learn more

Muyang Li · Ji Lin · Chenlin Meng (Stanford) · Stefano Ermon (Stanford) · Song Han · Jun-Yan Zhu
Deep generative models excel at synthesizing photorealistic images and enable various image synthesis and editing applications. However, when editing a photo, existing methods tend to re-synthesize the entire output from scratch, including the unedited region, leading to a significant waste of computation, especially for minor editing operations. In this work, we present Spatially Sparse Inference (SSI), a general-purpose speedup technique that selectively performs computation for edited regions and is compatible with different types of generative models. Our key observation is that user editing is often incremental in the interactive setting. This allows us to precompute the feature maps of the original image. Given an edited image, we sparsely apply the filters to the edited regions while reusing the precomputed features for the unedited regions. Based on our algorithm, we propose the Sparse Incremental Generative Engine (SIGE) to convert the theoretical computation reduction into latency reduction on commonly used hardware. With 1.2% of the area in edited regions, our method reduces the computation of DDIM by 7.5× and GauGAN by 18× while preserving the visual fidelity. With SIGE, we accelerate the inference time of DDIM by 3.0× on an RTX 3090 and 6.6× on an Apple M1 Pro, and GauGAN by 4.2× on an RTX 3090 and 14× on an Apple M1 Pro. Learn more
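The caching idea can be illustrated with a toy per-tile recomputation. This is a deliberately simplified sketch with hypothetical names; real network layers have receptive fields, so the actual engine uses overlapping tiles and per-layer bookkeeping:

```python
import numpy as np

def sparse_update(cached, new_image, edited_mask, op, tile=8):
    """Re-run a per-tile operator only on tiles that overlap the edited
    region, reusing cached outputs everywhere else."""
    out = cached.copy()
    H, W = edited_mask.shape
    for i in range(0, H, tile):
        for j in range(0, W, tile):
            if edited_mask[i:i + tile, j:j + tile].any():   # recompute dirty tile
                out[i:i + tile, j:j + tile] = op(new_image[i:i + tile, j:j + tile])
    return out
```

When only a small fraction of tiles is dirty, the work done is proportional to the edit rather than to the whole image, which is the source of the reported speedups.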

Samar Khanna · Yezhen Cong · Chenlin Meng · Patrick Liu · Erik Rozi · Yutong He · Marshall Burke · David Lobell · Stefano Ermon (All Stanford)
Unsupervised pretraining methods for large vision models have been shown to enhance performance on downstream supervised tasks. Developing similar techniques for satellite imagery presents significant opportunities, as unlabelled data is plentiful and the inherent temporal and multispectral structure provides avenues to further improve existing pretraining strategies. In this paper, we present SatMAE, a pretraining framework for temporal or multispectral satellite imagery based on the Masked Autoencoder (MAE). To leverage temporal information, we include a temporal embedding along with independently masking image patches across time. In addition, we demonstrate that encoding multispectral data as groups of bands with distinct spectral positional encodings is beneficial. Our approach yields strong improvements over previous state-of-the-art techniques, both in terms of supervised learning performance on benchmark datasets (up to 7%) and transfer learning performance on downstream remote sensing tasks, including land cover classification (up to 14%) and semantic segmentation. Learn more

Xuechen Li (Stanford) · Daogao Liu · Tatsunori Hashimoto (Stanford) · Huseyin A. Inan · Janardhan Kulkarni · Yin Tat Lee · Abhradeep Guha Thakurta
Large pretrained models can be privately fine-tuned to achieve performance approaching that of non-private models. A common theme in these results is the surprising observation that high-dimensional models can achieve favorable privacy-utility trade-offs. This seemingly contradicts known results on the model-size dependence of differentially private convex learning and raises the following research question: When does the performance of differentially private learning not degrade with increasing model size? We identify that the magnitudes of gradients projected onto subspaces are a key factor that determines performance. To precisely characterize this for private convex learning, we introduce a condition on the objective that we term restricted Lipschitz continuity, and derive refined bounds for the excess empirical and population risks that are dimension-independent under additional conditions. We empirically show that in private fine-tuning of large language models, gradients obtained during fine-tuning are mostly controlled by a few principal components. This behavior is similar to conditions under which we obtain dimension-independent bounds in convex settings, and provides a possible explanation for recent successes in large-scale private fine-tuning. Learn more

Michael Zhang · Christopher Ré (All Stanford)
While large pretrained foundation models (FMs) have shown remarkable zero-shot classification robustness to dataset-level distribution shifts, their robustness to subpopulation or group shifts is relatively underexplored. We study this problem, and find that foundation models such as CLIP may not be robust to various group shifts. Across 9 robustness benchmarks, zero-shot classification with their embeddings results in gaps of up to 80.7 percentage points (pp) between average and worst-group accuracy. Unfortunately, existing methods to improve robustness require retraining, which can be prohibitively expensive on large foundation models. We also find that efficient ways to improve model inference (e.g., via adapters, lightweight networks that transform FM embeddings) do not consistently improve, and can sometimes hurt, group robustness compared to zero-shot classification. We therefore develop an adapter training strategy to effectively and efficiently improve FM group robustness. Our motivating observation is that while poor robustness results from groups in the same class being embedded far apart in the foundation model "embedding space," standard adapter training may not actually bring these points closer together. We thus propose contrastive adapting, which contrastively trains adapters to bring sample embeddings close to both their ground-truth class embeddings and same-class sample embeddings. Across the 9 robustness benchmarks, contrastive adapting consistently improves group robustness, raising worst-group accuracy by 8.5 to 56.0 pp over zero-shot. Our approach is also efficient, requiring no FM fine-tuning and only a fixed set of FM embeddings. On popular benchmarks such as Waterbirds and CelebA, this leads to worst-group accuracy comparable to state-of-the-art methods, while training <1% of the model parameters. Learn more
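One ingredient of this objective can be sketched as a cross-entropy over cosine similarities to class embeddings. This is a simplified, hypothetical rendering of the "pull toward class embeddings" piece only, not the paper's full loss, which also pulls same-class sample embeddings together:

```python
import numpy as np

def class_pull_loss(z, class_emb, labels, tau=0.1):
    """Cross-entropy over cosine similarities between adapted sample
    embeddings z (n, d) and class embeddings class_emb (k, d): each sample
    should score its ground-truth class embedding highest."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    c = class_emb / np.linalg.norm(class_emb, axis=1, keepdims=True)
    logits = z @ c.T / tau
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-logp[np.arange(len(z)), labels].mean())
```

Minimizing such a loss over an adapter's parameters moves same-class samples toward a shared anchor, which is the geometric intuition behind closing worst-group gaps.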

Mansheej Paul (Stanford) · Brett Larsen (Stanford) · Surya Ganguli (Stanford) · Jonathan Frankle · Gintare Karolina Dziugaite
A striking observation about iterative magnitude pruning (IMP; Frankle et al. 2020) is that—after just a few hundred steps of dense training—the method can find a sparse subnetwork that can be trained to the same accuracy as the dense network. However, the same does not hold at step 0, i.e., at random initialization. In this work, we seek to understand how this early phase of pretraining leads to a good initialization for IMP, through the lens of both the data distribution and the loss landscape geometry. Empirically, we observe that, holding the number of pretraining iterations constant, training on a small fraction of (randomly chosen) data suffices to obtain an equally good initialization for IMP. We additionally observe that by pretraining only on "easy" training data, we can decrease the number of steps necessary to find a good initialization for IMP, compared to training on the full dataset or a randomly chosen subset. Finally, we identify novel properties of the loss landscape of dense networks that are predictive of IMP performance, showing in particular that more examples being linearly mode connected in the dense network correlates well with good initializations for IMP. Combined, these results provide new insight into the role played by the early phase of training in IMP. Learn more

Allen Nie (Stanford) · Yannis Flet-Berliac · Deon Jordan · William Steenbergen · Emma Brunskill (Stanford)
Offline reinforcement learning (RL) can be used to improve future performance by leveraging historical data. There exist many different algorithms for offline RL, and it is well recognized that these algorithms, and their hyperparameter settings, can lead to decision policies with substantially differing performance. This prompts the need for pipelines that allow practitioners to systematically perform algorithm-hyperparameter selection for their setting. Critically, in most real-world settings, this pipeline must only involve the use of historical data. Inspired by statistical model selection methods for supervised learning, we introduce a task- and method-agnostic pipeline for automatically training, comparing, selecting, and deploying the best policy when the provided dataset is limited in size. In particular, our work highlights the importance of performing multiple data splits to produce more reliable algorithm-hyperparameter selection: while this is a common approach in supervised learning, to our knowledge, it has not been discussed in detail in the offline RL setting, and we show it can have substantial impacts when the dataset is small. Compared to alternate approaches, our proposed pipeline outputs higher-performing deployed policies from a broad range of offline policy learning algorithms and across various simulation domains in healthcare, education, and robotics. This work contributes toward the development of a general-purpose meta-algorithm for automatic algorithm-hyperparameter selection for offline RL. Learn more
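The multiple-splits idea can be sketched as follows. The names and the 80/20 split are illustrative assumptions; `train` and `evaluate` stand in for an offline RL learner and an off-policy evaluation estimator:

```python
import random

def select_over_splits(dataset, candidates, train, evaluate, n_splits=5, seed=0):
    """Average each algorithm-hyperparameter candidate's off-policy score
    over several random train/validation splits and return the best one."""
    rng = random.Random(seed)
    scores = {c: 0.0 for c in candidates}
    for _ in range(n_splits):
        data = list(dataset)
        rng.shuffle(data)                       # fresh split each repetition
        cut = int(0.8 * len(data))
        train_set, val_set = data[:cut], data[cut:]
        for c in candidates:
            scores[c] += evaluate(train(c, train_set), val_set) / n_splits
    return max(scores, key=scores.get)
```

Averaging over several splits, rather than trusting a single one, is the source of the reliability gains the paper emphasizes for small datasets.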

Xi Chen · Ali Ghadirzadeh (Stanford) · Tianhe Yu (Stanford) · Alex Yuan Gao · Jianhao Wang · Wenzhe Li · Liang Bin · Chelsea Finn (Stanford) · Chongjie Zhang
Offline reinforcement learning methods hold the promise of learning policies from pre-collected datasets without the need to query the environment for new samples. This setting is particularly well-suited for continuous control robotic applications, for which online data collection based on trial-and-error is costly and potentially unsafe. In practice, offline datasets are often heterogeneous, i.e., collected in a variety of scenarios, such as data from several human demonstrators or from policies that act with different purposes. Unfortunately, such datasets often contain action distributions with multiple modes and, in some cases, lack a sufficient number of high-reward trajectories, which renders offline policy training inefficient. To address this challenge, we propose to leverage a latent-variable generative model to represent high-advantage state-action pairs, leading to better adherence to the data distributions that contribute to solving the task, while maximizing reward via a policy over the latent variable. As we empirically show on a range of simulated locomotion, navigation, and manipulation tasks, our method, referred to as latent-variable advantage-weighted policy optimization (LAPO), improves the average performance of the next best-performing offline reinforcement learning methods by 49% on heterogeneous datasets, and by 8% on datasets with narrow and biased distributions. Learn more

Colin Wei · Yining Chen · Tengyu Ma (All Stanford)
A common lens to theoretically study neural net architectures is to analyze the functions they can approximate. However, the constructions from approximation theory often have unrealistic aspects, for example, reliance on infinite precision to memorize target function values. To address this issue, we propose a formal definition of statistically meaningful approximation which requires the approximating network to exhibit good statistical learnability. We present case studies on statistically meaningful approximation for two classes of functions: boolean circuits and Turing machines. We show that overparameterized feedforward neural nets can statistically meaningfully approximate boolean circuits with sample complexity depending only polynomially on the circuit size, not the size of the approximating network. In addition, we show that transformers can statistically meaningfully approximate Turing machines with computation time bounded by T, requiring sample complexity polynomial in the alphabet size, state space size, and log(T). Our analysis introduces new tools for generalization bounds that provide much tighter sample complexity guarantees than the typical VCdimension or normbased bounds, which may be of independent interest. Learn more

Jonathan N Lee (Stanford) · George Tucker · Ofir Nachum · Bo Dai · Emma Brunskill (Stanford)
Offline reinforcement learning (RL) is a promising paradigm where a learner leverages prior data to learn a good policy without interacting with the environment. A major challenge in applying such methods in practice is the lack of both theoretically principled and practical tools for model selection and evaluation. To address this, we study the problem of model selection in offline RL with value function approximation, where the learner is given a nested sequence of model classes to minimize squared Bellman error and must select among these to achieve the optimal balance of approximation and estimation error of the classes. We propose, to our knowledge, the first model selection algorithm for offline RL that achieves minimax rate-optimal oracle inequalities up to logarithmic factors. The algorithm, ModBE, takes as input the model classes and a base offline RL algorithm designed to minimize squared Bellman error. It successively eliminates model classes using a novel one-sided generalization test, finally returning a policy that competes with the performance of the best model class. In addition to its theoretical guarantees, it is conceptually simple and computationally efficient, amounting to calculating and comparing relative squared errors between classes. Finally, we demonstrate that it is capable of reliably selecting a good model class in small simulated experiments. Learn more

Megha Srivastava · Erdem Biyik · Suvir Mirchandani · Noah Goodman · Dorsa Sadigh (All Stanford)
Recent works on shared autonomy and assistive-AI technologies, such as assistive robotic teleoperation, seek to model and help human users with limited ability in a fixed task. However, these approaches often fail to account for humans' ability to adapt and eventually learn how to execute a control task themselves. Furthermore, in applications where it may be desirable for a human to intervene, these methods may inhibit the human's ability to learn how to succeed with full self-control. In this paper, we focus on the problem of assistive teaching of motor control tasks such as parking a car or landing an aircraft. Despite their ubiquitous role in humans' daily activities and occupations, motor tasks are rarely taught in a uniform way due to their high complexity and variance. We propose an AI-assisted teaching algorithm that leverages skill discovery methods from the reinforcement learning (RL) literature to (i) break down any motor control task into teachable skills, (ii) construct novel drill sequences, and (iii) individualize curricula for students with different capabilities. Through an extensive mix of synthetic and user studies on two motor control tasks (parking a car with a joystick and writing characters from the Balinese alphabet), we show that assisted teaching with skills improves student performance by around 40% compared to practicing full trajectories without skills, and that practicing with individualized drills can result in up to 25% further improvement. Learn more

Annie Xie · Fahim Tajwar · Archit Sharma · Chelsea Finn (All Stanford)
A long-term goal of reinforcement learning is to design agents that can autonomously interact and learn in the world. A critical challenge to such autonomy is the presence of irreversible states which require external assistance to recover from, such as when a robot arm has pushed an object off of a table. While standard agents require constant monitoring to decide when to intervene, we aim to design proactive agents that request human intervention only when needed. To this end, we propose an algorithm that can efficiently learn to detect and avoid states that are irreversible, and proactively ask for help in case the agent does enter them. On a suite of continuous control environments with unknown irreversible states, we find that our algorithm exhibits both better sample efficiency and better intervention efficiency compared to existing methods. Learn more

Chenlin Meng · Kristy Choi · Jiaming Song · Stefano Ermon (All Stanford)
Representing probability distributions by the gradient of their density functions has proven effective in modeling a wide range of continuous data modalities. However, this representation is not applicable in discrete domains where the gradient is undefined. To this end, we propose an analogous score function called the “Concrete score”, a generalization of the (Stein) score for discrete settings. Given a predefined neighborhood structure, the Concrete score of any input is defined by the rate of change of the probabilities with respect to local directional changes of the input. This formulation allows us to recover the (Stein) score in continuous domains when measuring such changes by the Euclidean distance, while using the Manhattan distance leads to our novel score function in discrete domains. Finally, we introduce a new framework to learn such scores from samples called Concrete Score Matching (CSM), and propose an efficient training objective to scale our approach to high dimensions. Empirically, we demonstrate the efficacy of CSM on density estimation tasks on a mixture of synthetic, tabular, and high-dimensional image datasets, and demonstrate that it performs favorably relative to existing baselines for modeling discrete data. Learn more
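For a small discrete distribution, one simple instantiation of this rate-of-change idea looks like the following. This is a sketch under our own simplifying assumptions (integer support, neighborhood {x-1, x+1}, out-of-range neighbors scored 0), not the paper's general definition:

```python
import numpy as np

def concrete_score(p, x):
    """Concrete-score-style statistic of a distribution p over {0,...,K-1}
    at point x: the relative rate of change of probability mass toward
    each neighbor in the neighborhood {x-1, x+1}."""
    return np.array([(p[n] - p[x]) / p[x] if 0 <= n < len(p) else 0.0
                     for n in (x - 1, x + 1)])
```

In continuous domains the analogous limit of such probability-ratio differences recovers the ordinary (Stein) score, which is the sense in which this generalizes it.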

Michael Poli (Stanford) · Winnie Xu (Stanford) · Stefano Massaroli · Chenlin Meng (Stanford) · Kuno Kim (Stanford) · Stefano Ermon (Stanford)
Many patterns in nature exhibit self-similarity: they can be compactly described via self-referential transformations. Such patterns commonly appear in natural and artificial objects, including molecules, shorelines, galaxies, and even images. In this work, we investigate the role of learning in the automated discovery of self-similarity and in its utilization for downstream tasks. To this end, we design a novel class of implicit operators, Neural Collages, which (1) represent data as the parameters of a self-referential, structured transformation, and (2) employ hypernetworks to amortize the cost of finding these parameters to a single forward pass. We investigate how to leverage the representations produced by Neural Collages in various tasks, including data compression and generation. Neural Collage image compressors are orders of magnitude faster than other self-similarity-based algorithms during encoding and offer compression rates competitive with implicit methods. Finally, we showcase applications of Neural Collages for fractal art and as deep generative models. Learn more

Yining Chen (Stanford) · Elan Rosenfeld · Mark Sellke (Stanford) · Tengyu Ma (Stanford) · Andrej Risteski
Domain generalization aims at performing well on unseen test environments with data from a limited number of training environments. Despite a proliferation of proposed algorithms for this task, assessing their performance both theoretically and empirically is still very challenging. Distributional matching algorithms such as (Conditional) Domain Adversarial Networks [Ganin et al., 2016; Long et al., 2018] are popular and enjoy empirical success, but they lack formal guarantees. Other approaches, such as Invariant Risk Minimization (IRM), require a prohibitively large number of training environments, linear in the dimension of the spurious feature space d_s, even on simple data models like the one proposed by [Rosenfeld et al., 2021]. Under a variant of this model, we show that ERM and IRM can fail to find the optimal invariant predictor with o(d_s) environments. We then present an iterative feature matching algorithm that is guaranteed, with high probability, to find the optimal invariant predictor after seeing only O(log d_s) environments. Our results provide the first theoretical justification for distribution-matching algorithms widely used in practice under a concrete non-trivial data model. Learn more

Jeff Z. HaoChen · Colin Wei · Ananya Kumar · Tengyu Ma (All Stanford)
Contrastive learning is a highly effective method for learning representations from unlabeled data. Recent works show that contrastive representations can transfer across domains, leading to simple state-of-the-art algorithms for unsupervised domain adaptation. In particular, a linear classifier trained to separate the representations on the source domain can also predict classes on the target domain accurately, even though the representations of the two domains are far from each other. We refer to this phenomenon as linear transferability. This paper analyzes when and why contrastive representations exhibit linear transferability in a general unsupervised domain adaptation setting. We prove that linear transferability can occur when data from the same class in different domains (e.g., photo dogs and cartoon dogs) are more related to each other than data from different classes in different domains (e.g., photo dogs and cartoon cats) are. Our analyses are in a realistic regime where the source and target domains can have unbounded density ratios and be weakly related, with distant representations across domains. Learn more

Peter Henderson · Mark Krass · Lucia Zheng · Neel Guha · Christopher D Manning · Dan Jurafsky · Daniel Ho (All Stanford)
One concern with the rise of large language models lies in their potential for significant harm, particularly from pretraining on biased, obscene, copyrighted, and private information. Emerging ethical approaches have attempted to filter pretraining material, but such approaches have been ad hoc and have failed to take context into account. We offer an approach to filtering grounded in law, which has directly addressed the trade-offs in filtering material. First, we gather and make available the Pile of Law, a 256GB dataset of open-source English-language legal and administrative data, covering court opinions, contracts, administrative rules, and legislative records. Pretraining on the Pile of Law may help with legal tasks that have the promise to improve access to justice. Second, we distill the legal norms that governments have developed to constrain the inclusion of toxic or private content into actionable lessons for researchers, and discuss how our dataset reflects these norms. Third, we show how the Pile of Law offers researchers the opportunity to learn such filtering rules directly from the data, providing an exciting new research direction in model-based processing. Learn more

Zelun Luo · Zane Durante · Linden Li · Wanze Xie · Ruochen Liu · Emily Jin · Zhuoyi Huang · Lun Li · Jiajun Wu · Juan Carlos Niebles · Ehsan Adeli · Fei-Fei Li (All Stanford)
Video-language models (VLMs), large models pretrained on numerous but noisy video-text pairs from the internet, have revolutionized activity recognition through their remarkable generalization and open-vocabulary capabilities. While complex human activities are often hierarchical and compositional, most existing tasks for evaluating VLMs focus only on high-level video understanding, making it difficult to accurately assess and interpret the ability of VLMs to understand complex and fine-grained human activities. Inspired by the recently proposed MOMA framework, we define activity graphs as a single universal representation of human activities that encompasses video understanding at the activity, sub-activity, and atomic action levels. We redefine activity parsing as the overarching task of activity graph generation, requiring understanding human activities across all three levels. To facilitate the evaluation of models on activity parsing, we introduce MOMA-LRG (Multi-Object Multi-Actor Language-Refined Graphs), a large dataset of complex human activities with activity graph annotations that can be readily transformed into natural language sentences. Lastly, we present a model-agnostic and lightweight approach to adapting and evaluating VLMs by incorporating structured knowledge from activity graphs into VLMs, addressing the individual limitations of language and graphical models. We demonstrate strong performance on few-shot activity parsing, and our framework is intended to foster future research in the joint modeling of videos, graphs, and language. Learn more

Tri Dao (Stanford) · Daniel Fu (Stanford) · Stefano Ermon (Stanford) · Atri Rudra · Christopher Ré (Stanford)
Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not achieve wall-clock speedup. We argue that a missing principle is making attention algorithms IO-aware: accounting for reads and writes between levels of GPU memory. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM. We analyze the IO complexity of FlashAttention, showing that it requires fewer HBM accesses than standard attention, and is optimal for a range of SRAM sizes. We also extend FlashAttention to block-sparse attention, yielding an approximate attention algorithm that is faster than any existing approximate attention method. FlashAttention trains Transformers faster than existing baselines: 3× speedup on GPT-2 (seq. length 1K) and 2.4× speedup on long-range arena (seq. length 1K-4K). FlashAttention also yields higher-quality models (0.7 better perplexity on GPT-2 and 6.4 points of lift on long-document classification) and entirely new capabilities: the first Transformers to achieve better-than-chance performance on the Path-X challenge (seq. length 16K, 61.4% accuracy) and Path-256 (seq. length 64K, 63.1% accuracy). Learn more
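The tiling-plus-online-softmax idea at the heart of the algorithm can be sketched in NumPy. This is an illustrative sketch of exact blocked attention only; the real kernel additionally manages HBM/SRAM data movement and the backward pass, and the function name and block size are our own choices:

```python
import numpy as np

def tiled_attention(Q, K, V, block=4):
    """Exact softmax attention computed one key/value block at a time with a
    running ("online") softmax, so the full N x N score matrix is never
    materialized."""
    n, d = Q.shape
    out = np.empty_like(V, dtype=float)
    for qs in range(0, n, block):
        q = Q[qs:qs + block]
        m = np.full(len(q), -np.inf)                 # running row-wise max
        l = np.zeros(len(q))                         # running softmax denominator
        acc = np.zeros((len(q), V.shape[1]))         # running numerator
        for ks in range(0, n, block):
            s = q @ K[ks:ks + block].T / np.sqrt(d)  # scores for this tile
            m_new = np.maximum(m, s.max(axis=1))
            scale = np.exp(m - m_new)                # rescale old statistics
            p = np.exp(s - m_new[:, None])
            acc = acc * scale[:, None] + p @ V[ks:ks + block]
            l = l * scale + p.sum(axis=1)
            m = m_new
        out[qs:qs + block] = acc / l[:, None]
    return out
```

Because each tile is processed once and only small per-row statistics are carried between tiles, the working set fits in fast on-chip memory, which is where the IO savings come from.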

Andy Shih · Dorsa Sadigh · Stefano Ermon (All Stanford)
Conditional inference on arbitrary subsets of variables is a core problem in probabilistic inference with important applications such as masked language modeling and image inpainting. In recent years, the family of Any-Order Autoregressive Models (AO-ARMs), which includes popular models such as XLNet, has shown breakthrough performance in arbitrary conditional tasks across a sweeping range of domains. But despite their success, in this paper we identify significant improvements to be made to previous formulations of AO-ARMs. First, we show that AO-ARMs suffer from redundancy in their probabilistic model, i.e., they define the same distribution in multiple different ways. We alleviate this redundancy by training on a smaller set of univariate conditionals that still maintains support for efficient arbitrary conditional inference. Second, we upweight the training loss for univariate conditionals that are evaluated more frequently during inference. Our method leads to improved performance with no compromises on tractability, giving state-of-the-art likelihoods in arbitrary conditional modeling on text (Text8), image (CIFAR-10, ImageNet-32), and continuous tabular data domains. Learn more

Huaxiu Yao · Caroline Choi · Yoonho Lee · Pang Wei Koh · Chelsea Finn (All Stanford)
Distribution shifts occur when the test distribution differs from the training distribution, and can considerably degrade performance of machine learning models deployed in the real world. While recent works have studied robustness to distribution shifts, distribution shifts arising from the passage of time have the additional structure of timestamp metadata. Real-world examples of such shifts are underexplored, and it is unclear whether existing models can leverage trends in past distribution shifts to reliably extrapolate into the future. To address this gap, we curate Wild-Time, a benchmark of 7 datasets that reflect temporal distribution shifts arising in a variety of real-world applications, including drug discovery, patient prognosis, and news classification. On these datasets, we systematically benchmark 13 approaches with various inductive biases. We evaluate methods in domain generalization, continual learning, self-supervised learning, and ensemble learning, which leverage timestamps to extract the common structure of the distribution shifts. We extend several domain-generalization methods to the temporal distribution shift setting by treating windows of time as different domains. Finally, we propose two evaluation strategies to evaluate model performance under temporal distribution shifts: evaluation with a fixed time split (Eval-Fix) and evaluation with a data stream (Eval-Stream). Eval-Fix, our primary evaluation strategy, aims to provide a simple evaluation protocol for the broader machine learning community, while Eval-Stream serves as a complementary benchmark for continual learning approaches. Our experiments demonstrate that existing methods are limited in tackling temporal distribution shift: across all settings, we observe an average performance drop of 20% from in-distribution to out-of-distribution data. Learn more
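The fixed-time-split protocol described above amounts to partitioning on a timestamp rather than at random. A minimal sketch, assuming a toy list-of-dicts dataset with a `t` timestamp field (not Wild-Time's actual API):

```python
def eval_fix_split(examples, split_time):
    """Partition examples by a fixed timestamp boundary: models are
    trained (and validated in-distribution) on the past portion, then
    evaluated out-of-distribution on the future portion."""
    past = [ex for ex in examples if ex["t"] < split_time]
    future = [ex for ex in examples if ex["t"] >= split_time]
    return past, future

examples = [
    {"t": 2016, "x": 1.0}, {"t": 2018, "x": 2.0},
    {"t": 2021, "x": 3.0}, {"t": 2022, "x": 4.0},
]
past, future = eval_fix_split(examples, split_time=2020)
```

Comparing accuracy on held-out `past` data against accuracy on `future` data is what surfaces the roughly 20% in-distribution to out-of-distribution drop the benchmark reports.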

Huaxiu Yao (Stanford) · Yiping Wang · Linjun Zhang · James Zou (Stanford) · Chelsea Finn (Stanford)
Improving the generalization of deep networks is an important open challenge, particularly in domains without plentiful data. The mixup algorithm improves generalization by linearly interpolating the input features of a pair of examples and their corresponding labels; these interpolated examples augment the original training dataset. Mixup has shown promising results in various classification tasks, but systematic analysis of mixup in regression remains underexplored. Using mixup directly on regression labels could result in arbitrarily wrong labels, since the linearity assumption behind mixup may not hold. In this paper, we propose a simple yet powerful algorithm, mixReg, to improve generalization on regression tasks. In contrast with vanilla mixup, which uses the same sampling probability for all example pairs, mixReg adjusts the sampling probability based on the similarity of labels. Our theoretical analysis further confirms that mixReg with label similarity obtains a smaller mean squared error than both vanilla mixup and mixup with feature similarity. Another benefit of mixReg is that it can improve out-of-distribution robustness, where the test distribution differs from the training distribution. By selectively interpolating examples with similar labels, it mitigates the effects of domain-associated information and encourages invariant predictors. We evaluate mixReg on eleven datasets, ranging from tabular to video data. Compared to the best prior approach, mixReg achieves 6.56%, 4.76%, and 5.14% improvements in in-distribution generalization, task generalization, and out-of-distribution robustness, respectively. Learn more
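The sampling-probability idea can be sketched as vanilla mixup with a label-similarity kernel over candidate partners. Everything below (the Gaussian kernel, the Beta-distributed mixing coefficient, the function name) is an illustrative assumption, not the paper's exact procedure:

```python
import math
import random

def mixreg_pair(xs, ys, i, bandwidth=1.0, alpha=2.0, rng=random):
    """For anchor example i, sample a partner j with probability that
    decays with label distance, then mixup-interpolate features and labels."""
    # Gaussian kernel over label distance (illustrative choice).
    weights = [math.exp(-(ys[i] - y) ** 2 / (2 * bandwidth ** 2)) for y in ys]
    weights[i] = 0.0  # never pair an example with itself
    # Sample j proportionally to the similarity weights.
    r = rng.random() * sum(weights)
    j, acc = 0, 0.0
    for idx, w in enumerate(weights):
        acc += w
        if r <= acc:
            j = idx
            break
    lam = rng.betavariate(alpha, alpha)  # standard mixup coefficient
    x_mix = lam * xs[i] + (1 - lam) * xs[j]
    y_mix = lam * ys[i] + (1 - lam) * ys[j]
    return x_mix, y_mix
```

With vanilla mixup every `j` would be equally likely; here pairs with nearby labels dominate, so the linear label interpolation stays close to plausible targets.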

Xuelin Yang (Stanford) · Jiayuan Mao · Xikun Zhang (Stanford) · Noah Goodman (Stanford) · Jiajun Wu (Stanford)
Building machines that can reason about physical events and their causal relationships is crucial for flexible interaction with the physical world. However, most existing physical and causal reasoning benchmarks are exclusively based on synthetically generated events and synthetic natural language descriptions of the causal relationships. This design brings up two issues. First, there is a lack of diversity in both event types and natural language descriptions; second, causal relationships based on manually defined heuristics are different from human judgments. To address both shortcomings, we present the CLEVRER-Humans benchmark, a video reasoning dataset for causal judgment of physical events with human labels. We employ two techniques to improve data collection efficiency: first, a novel iterative event cloze task to elicit a new representation of events in videos, which we term Causal Event Graphs (CEGs); second, a data augmentation technique based on neural language generative models. We convert the collected CEGs into questions and answers to be consistent with prior work. Finally, we study a collection of baseline approaches for CLEVRER-Humans question answering, highlighting the great challenges set forth by our benchmark. Learn more

Alex Tamkin (Stanford) · Gaurab Banerjee · Mohamed Owda (Stanford) · Vincent Liu · Shashank Rammoorthy (Stanford) · Noah Goodman (Stanford)
Universal self-supervised learning (SSL) algorithms hold enormous promise for making machine learning accessible to high-impact domains such as protein biology, manufacturing, and genomics. We present DABS 2.0: a set of improved datasets and algorithms for advancing research on universal SSL. We extend the recently introduced DABS benchmark with the addition of five real-world science and engineering domains: protein biology, bacterial genomics, multispectral satellite imagery, semiconductor wafers, and particle physics, bringing the total number of domains in the benchmark to twelve. We also propose a new universal SSL algorithm, Capri, and a generalized version of masked autoencoding, and apply both across all twelve domains: the most wide-ranging exploration of SSL yet. We find that multiple algorithms show gains across domains, outperforming previous baselines. In addition, we demonstrate the usefulness of DABS for scientific study of SSL by investigating the optimal corruption rate for each algorithm, showing that the best setting varies based on the domain. Code will be released here. Learn more
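The corruption step at the heart of masked autoencoding can be sketched domain-agnostically. The toy function below (its name, mask token, and 30% default rate are assumptions, not DABS's actual interface) randomly hides a fraction of the input and keeps the originals as reconstruction targets, which is the knob the corruption-rate study above varies per domain:

```python
import random

def mask_inputs(tokens, mask_token, corruption_rate=0.3, rng=random):
    """Randomly replace a fraction of input tokens with a mask token.
    A model is then trained to reconstruct the originals at masked
    positions; no loss is applied at unmasked positions."""
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < corruption_rate:
            corrupted.append(mask_token)
            targets.append(tok)   # reconstruction target
        else:
            corrupted.append(tok)
            targets.append(None)  # position excluded from the loss
    return corrupted, targets
```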

Joy Hsu · Jiajun Wu · Noah Goodman (All Stanford)
Euclidean geometry is among the earliest forms of mathematical thinking. While the geometric primitives underlying its constructions, such as perfect lines and circles, do not often occur in the natural world, humans rarely struggle to perceive and reason with them. Will computer vision models trained on natural images show the same sensitivity to Euclidean geometry? Here we explore this question by studying few-shot generalization in the universe of Euclidean geometry constructions. We introduce Geoclidean, a domain-specific language for Euclidean geometry, and use it to generate two datasets of geometric concept learning tasks for benchmarking generalization judgements of humans and machines. We find that humans are indeed sensitive to Euclidean geometry and generalize strongly from a few visual examples of a geometric concept. In contrast, low-level and high-level visual features from standard computer vision models pretrained on natural images do not support correct generalization. Thus Geoclidean represents a novel few-shot generalization benchmark for geometric concept learning, where the performance of humans and of AI models diverges. The Geoclidean framework and dataset are publicly available for download. Learn more

Mike Wu · Noah Goodman (All Stanford)
Probabilistic programs provide an expressive representation language for generative models. Given a probabilistic program, we are interested in the task of posterior inference: estimating a latent variable given a set of observed variables. Existing techniques for inference in probabilistic programs often require choosing many hyperparameters, are computationally expensive, and/or only work for restricted classes of programs. Here we formulate inference as masked language modeling: given a program, we generate a supervised dataset of variables and assignments, and randomly mask a subset of the assignments. We then train a neural network to unmask the random values, defining an approximate posterior distribution. By optimizing a single neural network across a range of programs we amortize the cost of training, yielding a "foundation" posterior able to do zero-shot inference for new programs. The foundation posterior can also be fine-tuned for a particular program and dataset by optimizing a variational inference objective. We show the efficacy of the approach, zero-shot and fine-tuned, on a benchmark of Stan programs. Learn more
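The data-generation step described above can be sketched with a toy program. Everything here (the two-line Gaussian program, the `"<MASK>"` sentinel, the dict trace format) is an illustrative assumption rather than the paper's actual pipeline:

```python
import random

def simulate_program(rng):
    """A toy generative program: a latent mean, then noisy observations."""
    mu = rng.gauss(0.0, 1.0)                       # latent variable
    obs = [rng.gauss(mu, 0.5) for _ in range(3)]   # observed variables
    return {"mu": mu, "obs": obs}

def make_training_example(rng):
    """Forward-simulate a trace, then mask the latent assignment so a
    network can be trained to 'unmask' it from the observations,
    i.e. amortized posterior inference as masked language modeling."""
    trace = simulate_program(rng)
    masked = dict(trace, mu="<MASK>")  # hide the latent assignment
    target = trace["mu"]               # supervision for the unmasking net
    return masked, target
```

Repeating this across many programs yields the supervised dataset on which a single shared network is trained, amortizing inference cost.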

Mehdi S. M. Sajjadi · Daniel Duckworth · Aravindh Mahendran · Sjoerd van Steenkiste · Filip Pavetic · Mario Lucic · Leonidas Guibas (Stanford) · Klaus Greff · Thomas Kipf
A compositional understanding of the world in terms of objects and their geometry in 3D space is considered a cornerstone of human cognition. Facilitating the learning of such a representation in neural networks holds promise for substantially improving labeled data efficiency. As a key step in this direction, we make progress on the problem of learning 3D-consistent decompositions of complex scenes into individual objects in an unsupervised fashion. We introduce Object Scene Representation Transformer (OSRT), a 3D-centric model in which individual object representations naturally emerge through novel view synthesis. OSRT scales to significantly more complex scenes with larger diversity of objects and backgrounds than existing methods. At the same time, it is multiple orders of magnitude faster at compositional rendering thanks to its light field parametrization and the novel Slot Mixer decoder. We believe this work will not only accelerate future architecture exploration and scaling efforts, but it will also serve as a useful tool for both the object-centric and neural scene representation learning communities. Learn more

Connor Lin (Stanford) · Niloy Mitra · Gordon Wetzstein (Stanford) · Leonidas Guibas (Stanford) · Paul Guerrero
Neural representations are popular for representing shapes as they can be used for data cleanup, model completion, shape editing, and shape synthesis. Current neural representations can be categorized as either overfitting to a single object instance, or representing a collection of objects. However, neither allows accurate editing of neural scene representations: on the one hand, methods that overfit objects achieve highly accurate reconstructions but do not support editing, as they do not generalize to unseen object configurations; on the other hand, methods that represent a family of objects with variations do generalize but produce approximate reconstructions. We propose NeuForm to combine the advantages of both overfitted and generalizable representations by adaptively overfitting a generalizable representation to regions where reliable data is available, while using the generalizable representation everywhere else. We achieve this with a carefully designed architecture and an approach that blends the network weights of the two representations. We demonstrate edits that successfully reconfigure parts of human-made shapes, such as chairs, tables, and lamps, while preserving the accuracy of an overfitted shape representation. We compare with two state-of-the-art competitors and demonstrate clear improvements in terms of plausibility and fidelity of the resultant edits. Learn more

Tong Mu · Yash Chandak · Tatsunori Hashimoto (Stanford) · Emma Brunskill (Stanford)
While there has been extensive work on learning from offline data for contextual multi-armed bandit settings, existing methods typically assume there is no environment shift: that the learned policy will operate in the same environmental process as that of data collection. However, this assumption may overly limit the use of these methods in many practical situations where there may be distribution shifts. In this work we propose Factored Distributionally Robust Optimization (Factored-DRO), which is able to separately handle distribution shifts in the context distribution and shifts in the reward-generating process. Prior work either ignores potential shifts in the context or considers context and reward shifts jointly; the latter can lead to performance that is too conservative, especially under certain forms of reward feedback, while the former does not account for context shift at all. Our Factored-DRO objective mitigates this by considering the shifts separately, and our proposed estimators are consistent and converge asymptotically. We also introduce a practical algorithm and demonstrate promising empirical results in environments based on real-world datasets, such as voting outcomes and scene classification. Learn more

Lingjiao Chen (Stanford) · Zhihua Jin · Evan Sabri Eyuboglu (Stanford) · Christopher Ré (Stanford) · Matei Zaharia (Stanford) · James Zou (Stanford)
Commercial ML APIs offered by providers such as Google, Amazon, and Microsoft have dramatically simplified ML adoption in many applications. Numerous companies and academics pay to use ML APIs for tasks such as object detection, OCR, and sentiment analysis. Different ML APIs tackling the same task can have very heterogeneous performance. Moreover, the ML models underlying the APIs also evolve over time. As ML APIs rapidly become a valuable marketplace and an integral part of analytics, it is critical to systematically study and compare different APIs with each other and to characterize how individual APIs change over time. However, this practically important topic is currently underexplored due to the lack of data. In this paper, we present HAPI (History of APIs), a longitudinal dataset of 1,761,417 instances of commercial ML API applications (involving APIs from Amazon, Google, IBM, Microsoft, and other providers) across diverse tasks including image tagging, speech recognition, and text mining from 2020 to 2022. Each instance consists of a query input for an API (e.g., an image or text) along with the API’s output prediction/annotation and confidence scores. HAPI is the first large-scale dataset of ML API usage and is a unique resource for studying ML-as-a-service (MLaaS). As examples of the types of analyses that HAPI enables, we show that ML APIs’ performance changes substantially over time: several APIs’ accuracies dropped on specific benchmark datasets. Even when an API’s aggregate performance stays steady, its error modes can shift across different subtypes of data between 2020 and 2022. Such changes can substantially impact entire analytics pipelines that use an ML API as a component. We further use HAPI to study commercial APIs’ performance disparities across demographic subgroups over time. HAPI can stimulate more research in the growing field of MLaaS. Learn more

Armin Thomas · Christopher Ré · Russell Poldrack (All Stanford)
Self-supervised learning techniques are celebrating immense success in natural language processing (NLP) by enabling models to learn from broad language data at unprecedented scales. Here, we aim to leverage the success of these techniques for mental state decoding, where researchers aim to identify specific mental states (e.g., the experience of anger or joy) from brain activity. To this end, we devise a set of novel self-supervised learning frameworks for neuroimaging data based on prominent learning frameworks in NLP. At their core, these frameworks learn the dynamics of brain activity by modeling sequences of activity akin to how NLP models sequences of text. We evaluate the frameworks by pretraining models on a broad neuroimaging dataset spanning functional Magnetic Resonance Imaging data from 11,980 experimental runs of 1,726 individuals across 34 datasets and subsequently adapting the pretrained models to two benchmark mental state decoding datasets. The pretrained models transfer well, generally outperforming baseline models trained from scratch, while models trained in a learning framework based on causal language modeling clearly outperform the others. Learn more

Ines Chami (Stanford) · Sami Abu-El-Haija · Bryan Perozzi · Christopher Ré (Stanford) · Kevin Murphy
There has been a surge of recent interest in graph representation learning (GRL). GRL methods have generally fallen into three main categories, based on the availability of labeled data. The first, network embedding, focuses on learning unsupervised representations of relational structure. The second, graph regularized neural networks, leverages graphs to augment neural network losses with a regularization objective for semi-supervised learning. The third, graph neural networks, aims to learn differentiable functions over discrete topologies with arbitrary structure. However, despite the popularity of these areas there has been surprisingly little work on unifying the three paradigms. Here, we aim to bridge the gap between network embedding, graph regularization and graph neural networks. We propose a comprehensive taxonomy of GRL methods, aiming to unify several disparate bodies of work. Specifically, we propose the GraphEDM framework, which generalizes popular algorithms for semi-supervised learning (e.g. GraphSAGE, GCN, GAT) and unsupervised learning (e.g. DeepWalk, node2vec) of graph representations into a single consistent approach. To illustrate the generality of GraphEDM, we fit over thirty existing methods into this framework. We believe that this unifying view both provides a solid foundation for understanding the intuition behind these methods, and enables future research in the area. Learn more

Ruocheng Wang (Stanford) · Yunzhi Zhang (Stanford) · Jiayuan Mao · Ran Zhang · Chin-Yi Cheng · Jiajun Wu (Stanford)
Human-designed visual manuals are crucial components in shape assembly activities. They provide step-by-step guidance on how we should move and connect different parts in a convenient and physically realizable way. While there has been an ongoing effort in building agents that perform assembly tasks, the information in human-designed manuals has been largely overlooked. We identify that this is due to 1) a lack of realistic 3D assembly objects that have paired manuals and 2) the difficulty of extracting structured information from purely image-based manuals. Motivated by this observation, we present IKEA-Manual, a dataset consisting of 102 IKEA objects paired with assembly manuals. We provide fine-grained annotations on the IKEA objects and assembly manuals, including decomposed assembly parts, assembly plans, manual segmentation, and 2D-3D correspondence between 3D parts and visual manuals. We illustrate the broad application of our dataset on three tasks related to shape assembly: assembly plan generation, part segmentation, and 3D part assembly. Learn more

Kailas Vodrahalli · Tobias Gerstenberg · James Zou (All Stanford)
In many practical applications of AI, an AI model is used as a decision aid for human users. The AI provides advice that a human (sometimes) incorporates into their decision-making process. The AI advice is often presented with some measure of "confidence" that the human can use to calibrate how much they depend on or trust the advice. In this paper, we present an initial exploration suggesting that showing AI models as more confident than they actually are, even when the original AI is well-calibrated, can improve human-AI performance (measured as the accuracy and confidence of the human's final prediction after seeing the AI advice). We first train a model to predict human incorporation of AI advice using data from thousands of human interactions. This enables us to explicitly estimate how to transform the AI's prediction confidence, making the AI uncalibrated, in order to improve the final human prediction. We empirically validate our results across four different tasks (dealing with images, text, and tabular data) involving hundreds of human participants. We further support our findings with simulation analysis. Our findings suggest the importance of jointly optimizing the human-AI system as opposed to the standard paradigm of optimizing the AI model alone. Learn more
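The kind of monotone confidence transformation studied here can be sketched as a simple exponent on the distance from the decision boundary. This particular functional form and its parameter are assumptions for illustration, not the transformation learned in the paper:

```python
def inflate_confidence(p, gamma=0.5):
    """Monotonically push a reported probability p in [0.5, 1] toward 1
    (for binary advice shown with its more-likely label).
    gamma < 1 inflates confidence; gamma = 1 leaves p unchanged.
    Illustrative only: applying this makes a calibrated AI uncalibrated."""
    assert 0.5 <= p <= 1.0
    # Rescale the distance from the 0.5 decision boundary with an exponent.
    return 0.5 + 0.5 * ((p - 0.5) / 0.5) ** gamma
```

The transform preserves the AI's ranking of its own predictions (it is strictly increasing) while exaggerating reported certainty, which is the lever the study shows can improve the human's final decision.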

Roxana Daneshjou · Mert Yuksekgonul · Zhuo Ran Cai · Roberto Novoa · James Zou (All Stanford)
For the deployment of artificial intelligence (AI) in high-risk settings, such as healthcare, methods that provide interpretability/explainability or allow fine-grained error analysis are critical. Many recent methods for interpretability/explainability and fine-grained error analysis use concepts: meta-labels that are semantically meaningful to humans. However, only a few datasets include concept-level meta-labels, and most of these meta-labels are relevant to natural images that do not require domain expertise. Previous densely annotated datasets in medicine focused on meta-labels relevant to a single disease, such as osteoarthritis or melanoma. In dermatology, skin disease is described using an established clinical lexicon that allows clinicians to describe physical exam findings to one another. To provide the first medical dataset densely annotated by domain experts with annotations useful across multiple disease processes, we developed SKINCON: a skin disease dataset densely annotated by dermatologists. SKINCON includes 3,230 images from the Fitzpatrick 17k skin disease dataset densely annotated with 48 clinical concepts, 22 of which have at least 50 images representing the concept. The concepts were chosen by two dermatologists considering the clinical descriptor terms used to describe skin lesions; examples include "plaque", "scale", and "erosion". These same concepts were also used to label 656 skin disease images from the Diverse Dermatology Images dataset, providing an additional external dataset with diverse skin tone representations. We review the potential applications of the SKINCON dataset, such as probing models, concept-based explanations, concept bottlenecks, error analysis, and slice discovery. Furthermore, we use SKINCON to demonstrate two of these use cases: debugging mistakes of an existing dermatology AI model with concepts and developing interpretable models with post-hoc concept bottleneck models. Learn more

Dilip Arumugam · Benjamin Van Roy (All Stanford)
The quintessential model-based reinforcement-learning agent iteratively refines its estimates or prior beliefs about the true underlying model of the environment. Recent empirical successes in model-based reinforcement learning with function approximation, however, eschew the true model in favor of a surrogate that, while ignoring various facets of the environment, still facilitates effective planning over behaviors. Recently formalized as the value equivalence principle, this algorithmic technique is perhaps unavoidable as real-world reinforcement learning demands consideration of a simple, computationally-bounded agent interacting with an overwhelmingly complex environment, whose underlying dynamics likely exceed the agent’s capacity for representation. In this work, we consider the scenario where agent limitations may entirely preclude identifying an exactly value-equivalent model, immediately giving rise to a trade-off between identifying a model that is simple enough to learn and only incurring bounded sub-optimality. To address this problem, we introduce an algorithm that, using rate-distortion theory, iteratively computes an approximately-value-equivalent, lossy compression of the environment which an agent may feasibly target in lieu of the true model. We prove an information-theoretic, Bayesian regret bound for our algorithm that holds for any finite-horizon, episodic sequential decision-making problem. Crucially, our regret bound can be expressed in one of two possible forms, providing a performance guarantee for finding either the simplest model that achieves a desired sub-optimality gap or, alternatively, the best model given a limit on agent capacity. Learn more

Dilip Arumugam (Stanford) · Satinder Singh
The Bayes-Adaptive Markov Decision Process (BAMDP) formalism pursues the Bayes-optimal solution to the exploration-exploitation trade-off in reinforcement learning. As the computation of exact solutions to Bayesian reinforcement-learning problems is intractable, much of the literature has focused on developing suitable approximation algorithms. In this work, before diving into algorithm design, we first define, under mild structural assumptions, a complexity measure for BAMDP planning. As efficient exploration in BAMDPs hinges upon the judicious acquisition of information, our complexity measure highlights the worst-case difficulty of gathering information and exhausting epistemic uncertainty. To illustrate its significance, we exhibit a computationally-intractable, exact planning algorithm that takes advantage of this measure to achieve more efficient planning. We then conclude by introducing a specific form of state abstraction that has the potential to reduce BAMDP complexity and gives rise to a computationally-tractable, approximate planning algorithm. Learn more

Workshops
Competitions
Posters
Did I miss your paper? Email me at shana2@stanford.edu to let me know.
Stanford HAI's mission is to advance AI research, education, policy, and practice to improve the human condition. Learn more.