Stanford
University
  • Stanford Home
  • Maps & Directions
  • Search Stanford
  • Emergency Info
  • Terms of Use
  • Privacy
  • Copyright
  • Trademarks
  • Non-Discrimination
  • Accessibility
© Stanford University.  Stanford, California 94305.
Natural Language Processing | Stanford HAI
Navigate
  • About
  • Events
  • Careers
  • Search
Participate
  • Get Involved
  • Support HAI
  • Contact Us

Stay Up To Date

Get the latest news, advances in research, policy work, and education program updates from HAI in your inbox weekly.

Sign Up For Latest News

Skip to content
  • About

    • About
    • People
    • Get Involved with HAI
    • Support HAI
    • Subscribe to Email
  • Research

    • Research
    • Fellowship Programs
    • Grants
    • Student Affinity Groups
    • Centers & Labs
    • Research Publications
    • Research Partners
  • Education

    • Education
    • Executive and Professional Education
    • Government and Policymakers
    • K-12
    • Stanford Students
  • Policy

    • Policy
    • Policy Publications
    • Policymaker Education
    • Student Opportunities
  • AI Index

    • AI Index
    • AI Index Report
    • Global Vibrancy Tool
    • People
  • News
  • Events
  • Industry
  • Centers & Labs
Back to Natural Language Processing

All Work Published on Natural Language Processing

ReMix: Optimizing Data Mixtures for Large Scale Imitation Learning
Joey Hejna, Chethan Anand Bhateja, Yichen Jiang, Karl Pertsch, Dorsa Sadigh
Sep 05, 2024
Research
Your browser does not support the video tag.

Increasingly large robotics datasets are being collected to train larger foundation models in robotics. However, despite the fact that data selection has been of utmost importance to scaling in vision and natural language processing (NLP), little work in robotics has questioned what data such models should actually be trained on. In this work we investigate how to weigh different subsets or "domains'' of robotics datasets during pre-training to maximize worst-case performance across all possible downstream domains using distributionally robust optimization (DRO). Unlike in NLP, we find that these methods are hard to apply out of the box due to varying action spaces and dynamics across robots. Our method, ReMix, employs early stopping and action normalization and discretization to counteract these issues. Through extensive experimentation on both the Bridge and OpenX datasets, we demonstrate that data curation can have an outsized impact on downstream performance. Specifically, domain weights learned by ReMix outperform uniform weights by over 40% on average and human-selected weights by over 20% on datasets used to train the RT-X models.

ReMix: Optimizing Data Mixtures for Large Scale Imitation Learning

Joey Hejna, Chethan Anand Bhateja, Yichen Jiang, Karl Pertsch, Dorsa Sadigh
Sep 05, 2024

Increasingly large robotics datasets are being collected to train larger foundation models in robotics. However, despite the fact that data selection has been of utmost importance to scaling in vision and natural language processing (NLP), little work in robotics has questioned what data such models should actually be trained on. In this work we investigate how to weigh different subsets or "domains'' of robotics datasets during pre-training to maximize worst-case performance across all possible downstream domains using distributionally robust optimization (DRO). Unlike in NLP, we find that these methods are hard to apply out of the box due to varying action spaces and dynamics across robots. Our method, ReMix, employs early stopping and action normalization and discretization to counteract these issues. Through extensive experimentation on both the Bridge and OpenX datasets, we demonstrate that data curation can have an outsized impact on downstream performance. Specifically, domain weights learned by ReMix outperform uniform weights by over 40% on average and human-selected weights by over 20% on datasets used to train the RT-X models.

Computer Vision
Robotics
Natural Language Processing
Your browser does not support the video tag.
Research
Language Models in the Classroom: Bridging the Gap Between Technology and Teaching
Instructors and students of CS293
Apr 09, 2025
News

Instructors and students from Stanford class CS293/EDUC473 address the failures of current educational technologies and outline how to empower both teachers and learners through collaborative innovation.

Language Models in the Classroom: Bridging the Gap Between Technology and Teaching

Instructors and students of CS293
Apr 09, 2025

Instructors and students from Stanford class CS293/EDUC473 address the failures of current educational technologies and outline how to empower both teachers and learners through collaborative innovation.

Education, Skills
Generative AI
Natural Language Processing
News
RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
Christopher Potts, Jing Huang, Zhengxuan Wu, Mor Geva, Atticus Geiger
Aug 14, 2024
Research
Your browser does not support the video tag.

Individual neurons participate in the representation of multiple high-level concepts. To what extent can different interpretability methods successfully disentangle these roles? To help address this question, we introduce RAVEL (Resolving Attribute–Value Entanglements in Language Models), a dataset that enables tightly controlled, quantitative comparisons between a variety of existing interpretability methods. We use the resulting conceptual framework to define the new method of Multi-task Distributed Alignment Search (MDAS), which allows us to find distributed representations satisfying multiple causal criteria. With Llama2-7B as the target language model, MDAS achieves state-of-the-art results on RAVEL, demonstrating the importance of going beyond neuron-level analyses to identify features distributed across activations. We release our benchmark at https://github.com/ explanare/ravel.

RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations

Christopher Potts, Jing Huang, Zhengxuan Wu, Mor Geva, Atticus Geiger
Aug 14, 2024

Individual neurons participate in the representation of multiple high-level concepts. To what extent can different interpretability methods successfully disentangle these roles? To help address this question, we introduce RAVEL (Resolving Attribute–Value Entanglements in Language Models), a dataset that enables tightly controlled, quantitative comparisons between a variety of existing interpretability methods. We use the resulting conceptual framework to define the new method of Multi-task Distributed Alignment Search (MDAS), which allows us to find distributed representations satisfying multiple causal criteria. With Llama2-7B as the target language model, MDAS achieves state-of-the-art results on RAVEL, demonstrating the importance of going beyond neuron-level analyses to identify features distributed across activations. We release our benchmark at https://github.com/ explanare/ravel.

Natural Language Processing
Your browser does not support the video tag.
Research
An Open-Source AI Agent for Doing Tasks on the Web
Katharine Miller
Mar 27, 2025
News

NNetNav learns how to navigate websites by mimicking childhood learning through exploration.

An Open-Source AI Agent for Doing Tasks on the Web

Katharine Miller
Mar 27, 2025

NNetNav learns how to navigate websites by mimicking childhood learning through exploration.

Machine Learning
Natural Language Processing
News
A Multi-Center Study on the Adaptability of a Shared Foundation Model for Electronic Health Records
Lin Lawrence Guo, Jason Fries, Nigam Shah, Ethan Steinberg, Scott Lanyon Fleming, Keith Morse, Catherine Aandilian, Jose Posada, Lillian Sung
Jun 27, 2024
Research
Your browser does not support the video tag.

Foundation models are transforming artificial intelligence (AI) in healthcare by providing modular components adaptable for various downstream tasks, making AI development more scalable and cost-effective. Foundation models for structured electronic health records (EHR), trained on coded medical records from millions of patients, demonstrated benefits including increased performance with fewer training labels, and improved robustness to distribution shifts. However, questions remain on the feasibility of sharing these models across hospitals and their performance in local tasks. This multi-center study examined the adaptability of a publicly accessible structured EHR foundation model (FMSM), trained on 2.57 M patient records from Stanford Medicine. Experiments used EHR data from The Hospital for Sick Children (SickKids) and Medical Information Mart for Intensive Care (MIMIC-IV). We assessed both adaptability via continued pretraining on local data, and task adaptability compared to baselines of locally training models from scratch, including a local foundation model. Evaluations on 8 clinical prediction tasks showed that adapting the off-the-shelf FMSMmatched the performance of gradient boosting machines (GBM) locally trained on all data while providing a 13% improvement in settings with few task-specific training labels. Continued pretraining on local data showed FMSM required fewer than 1% of training examples to match the fully trained GBM’s performance, and was 60 to 90% more sample-efficient than training local foundation models from scratch. Our findings demonstrate that adapting EHR foundation models across hospitals provides improved prediction performance at less cost, underscoring the utility of base foundation models as modular components to streamline the development of healthcare AI.

A Multi-Center Study on the Adaptability of a Shared Foundation Model for Electronic Health Records

Lin Lawrence Guo, Jason Fries, Nigam Shah, Ethan Steinberg, Scott Lanyon Fleming, Keith Morse, Catherine Aandilian, Jose Posada, Lillian Sung
Jun 27, 2024

Foundation models are transforming artificial intelligence (AI) in healthcare by providing modular components adaptable for various downstream tasks, making AI development more scalable and cost-effective. Foundation models for structured electronic health records (EHR), trained on coded medical records from millions of patients, demonstrated benefits including increased performance with fewer training labels, and improved robustness to distribution shifts. However, questions remain on the feasibility of sharing these models across hospitals and their performance in local tasks. This multi-center study examined the adaptability of a publicly accessible structured EHR foundation model (FMSM), trained on 2.57 M patient records from Stanford Medicine. Experiments used EHR data from The Hospital for Sick Children (SickKids) and Medical Information Mart for Intensive Care (MIMIC-IV). We assessed both adaptability via continued pretraining on local data, and task adaptability compared to baselines of locally training models from scratch, including a local foundation model. Evaluations on 8 clinical prediction tasks showed that adapting the off-the-shelf FMSMmatched the performance of gradient boosting machines (GBM) locally trained on all data while providing a 13% improvement in settings with few task-specific training labels. Continued pretraining on local data showed FMSM required fewer than 1% of training examples to match the fully trained GBM’s performance, and was 60 to 90% more sample-efficient than training local foundation models from scratch. Our findings demonstrate that adapting EHR foundation models across hospitals provides improved prediction performance at less cost, underscoring the utility of base foundation models as modular components to streamline the development of healthcare AI.

Natural Language Processing
Healthcare
Foundation Models
Your browser does not support the video tag.
Research
Chatbots, Like the Rest of Us, Just Want to Be Loved
Wired
Mar 05, 2025
Media Mention

A study led by Stanford HAI Faculty Fellow Johannes Eichstaedt reveals that large language models adapt their behavior to appear more likable when they are being studied, mirroring human tendencies to present favorably.

Chatbots, Like the Rest of Us, Just Want to Be Loved

Wired
Mar 05, 2025

A study led by Stanford HAI Faculty Fellow Johannes Eichstaedt reveals that large language models adapt their behavior to appear more likable when they are being studied, mirroring human tendencies to present favorably.

Natural Language Processing
Machine Learning
Generative AI
Foundation Models
Media Mention
1
2
3
4
5