Computer vision is enhancing machines’ ability to interpret and act on visual data, transforming sectors like healthcare, security, and manufacturing.

QuantiPhy is a new benchmark and training framework that evaluates whether AI can reason numerically about physical properties shown in video. QuantiPhy reveals that today's models struggle with basic estimates of size, speed, and distance, but it also offers a way forward.

We present a new method for deconstructing the class activation tokens of vision transformers into an overcomplete basis in which each basis vector is "monosemantic," i.e., associated with a single, human-interpretable conceptual description. We achieve this with a highly optimized and customized version of the K-SVD algorithm, which we call Double-Batch K-SVD (DBK-SVD). We demonstrate the efficacy of our approach on the SBU Captions dataset, using CLIP embeddings and comparing our results against a Sparse Autoencoder (SAE) baseline. Our method significantly outperforms the SAE in reconstruction loss, recovering approximately 2/3 of the original signal compared to 1/6 for the SAE. We introduce novel metrics for evaluating explanation faithfulness and specificity, showing that DBK-SVD produces more diverse and more specific concept descriptions. We thereby show empirically, for the first time, that the concepts arising in Vision Transformers can be disentangled under an additional sparsity constraint, something that had previously been questioned. Our research opens new avenues for model interpretability, failure mitigation, and downstream task domain transfer in vision transformer models. An interactive demo showcasing our results can be found at https://disentangling-sbucaptions.xyz, and we make our DBK-SVD implementation openly available at https://github.com/RomeoV/KSVD.jl.
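The decomposition structure the abstract describes is easy to sketch. Below is a minimal, hypothetical Python illustration of sparse coding over an overcomplete learned dictionary, using scikit-learn's mini-batch dictionary learning with OMP as a stand-in; the authors' actual method is the Julia DBK-SVD implementation linked above, and the array shapes and hyperparameters here are assumptions, not theirs.

```python
# Illustrative sketch only: decomposing ViT/CLIP class-token embeddings into an
# overcomplete, sparse basis. The paper's method is Double-Batch K-SVD
# (DBK-SVD, in Julia at github.com/RomeoV/KSVD.jl); this uses scikit-learn's
# mini-batch dictionary learning as an analogous stand-in. All shapes and
# hyperparameters below are hypothetical.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

# Hypothetical input: one CLIP class-token embedding per image, e.g. 512-dim.
rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 512))  # stand-in for real CLIP embeddings

# Overcomplete dictionary: more atoms than embedding dimensions, plus a
# sparsity constraint so each embedding uses only a few atoms ("concepts").
learner = MiniBatchDictionaryLearning(
    n_components=1024,               # overcomplete basis (1024 > 512)
    transform_algorithm="omp",       # sparse coding via orthogonal matching pursuit
    transform_n_nonzero_coefs=8,     # at most 8 active concepts per embedding
    batch_size=256,
    random_state=0,
)
codes = learner.fit_transform(X)     # (2000, 1024) sparse coefficients
atoms = learner.components_          # (1024, 512) candidate "concept" directions

# Reconstruction quality, analogous to the paper's reconstruction-loss metric
# (DBK-SVD reportedly recovers ~2/3 of the signal vs. ~1/6 for the SAE baseline).
X_hat = codes @ atoms
recovered = 1 - np.linalg.norm(X - X_hat) ** 2 / np.linalg.norm(X) ** 2
print(f"fraction of signal recovered: {recovered:.2f}")
```

On real embeddings, each learned atom would then be checked for "monosemanticity" by inspecting which images and captions activate it.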

This brief introduces a computer-vision approach to analyzing solar panel adoption in U.S. households that can help policymakers tailor incentive mechanisms.

"This is AI’s next frontier, and why 2025 was such a pivotal year," writes HAI Co-Director Fei-Fei Li.
"This is AI’s next frontier, and why 2025 was such a pivotal year," writes HAI Co-Director Fei-Fei Li.
Increasingly large robotics datasets are being collected to train larger foundation models in robotics. However, although data selection has been of utmost importance to scaling in vision and natural language processing (NLP), little work in robotics has questioned what data such models should actually be trained on. In this work we investigate how to weight different subsets, or "domains," of robotics datasets during pre-training to maximize worst-case performance across all possible downstream domains using distributionally robust optimization (DRO). Unlike in NLP, we find that these methods are hard to apply out of the box due to varying action spaces and dynamics across robots. Our method, ReMix, employs early stopping and action normalization and discretization to counteract these issues. Through extensive experimentation on both the Bridge and OpenX datasets, we demonstrate that data curation can have an outsized impact on downstream performance. Specifically, domain weights learned by ReMix outperform uniform weights by over 40% on average and human-selected weights by over 20% on datasets used to train the RT-X models.
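For readers unfamiliar with DRO-style data mixing, here is a minimal, hypothetical sketch of the reweighting idea: maintain a mixture weight per domain and multiplicatively upweight domains whose loss is highest, so pre-training hedges against the worst-case downstream domain. This is not the ReMix implementation; the toy losses, learning rate, and update schedule below are all assumptions.

```python
# Toy sketch of distributionally robust domain reweighting via
# exponentiated-gradient updates on the simplex of mixture weights.
# Not the authors' code; all quantities here are hypothetical.
import numpy as np

def dro_domain_weights(domain_losses: np.ndarray, eta: float = 0.5) -> np.ndarray:
    """Exponentiated-gradient ascent over domain mixture weights.

    domain_losses: (steps, n_domains) per-domain losses observed over time
    (a fixed toy array here; in practice these come from the training model).
    """
    n_domains = domain_losses.shape[1]
    w = np.full(n_domains, 1.0 / n_domains)   # start from a uniform mixture
    for losses in domain_losses:
        w = w * np.exp(eta * losses)          # upweight high-loss domains
        w /= w.sum()                          # project back onto the simplex
    return w

# Toy example: domain 2 is consistently hardest, so it receives the most weight.
rng = np.random.default_rng(0)
losses = rng.normal(loc=[1.0, 1.2, 2.0, 0.8], scale=0.1, size=(100, 4))
print(dro_domain_weights(losses))
```

The abstract's point is that this scheme breaks out of the box in robotics because action spaces and dynamics differ across robots, which is what ReMix's action normalization, discretization, and early stopping address.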

This white paper provides research-based, scientifically grounded recommendations for contextualizing calls to test the operational accuracy of facial recognition technology.

HAI founding co-director Fei-Fei Li has been named one of TIME's 2025 Persons of the Year. From ImageNet to her advocacy for human-centered AI, Dr. Li has been a guiding light in the field.

Using AI to analyze Google Street View images of damaged buildings across 16 states, Stanford researchers found that destroyed buildings in poor areas often remained empty lots for years, while those in wealthy areas were rebuilt bigger and better than before.


The Stanford HAI co-founder is recognized for breakthroughs that propelled computer vision and deep learning, and for championing human-centered AI and industry innovation.


With a new computer vision model that recognizes the real-world utility of objects in images, researchers at Stanford look to push the boundaries of robotics and AI.


Health care providers struggle to catch early signals of cognitive decline. AI and computational neuroscientist Ehsan Adeli’s innovative computer vision tools may offer a solution.


How early cognitive research funded by the NSF paved the way for today’s AI breakthroughs—and how AI is now inspiring new understandings of the human mind.
