Computer Vision | Stanford HAI

Computer Vision

Computer vision is enhancing machines’ ability to interpret and act on visual data, transforming sectors like healthcare, security, and manufacturing.

From Privacy to ‘Glass Box’ AI, Stanford Students Are Targeting Real-World Problems
Nikki Goth Itoi
Feb 27, 2026
News

An Amazon-backed fellowship will support 10 Stanford PhD students whose work explores everything from how we communicate to understanding disease and protecting our data.

News

From Privacy to ‘Glass Box’ AI, Stanford Students Are Targeting Real-World Problems

Nikki Goth Itoi
Generative AIHealthcarePrivacy, Safety, SecurityComputer VisionSciences (Social, Health, Biological, Physical)Feb 27

An Amazon-backed fellowship will support 10 Stanford PhD students whose work explores everything from how we communicate to understanding disease and protecting our data.

Finding Monosemantic Subspaces and Human-Compatible Interpretations in Vision Transformers through Sparse Coding
Romeo Valentin, Vikas Sindhwani, Sumeet Singh, Vincent Vanhoucke, Mykel Kochenderfer
Jan 01, 2025
Research

We present a new method of deconstructing class activation tokens of vision transformers into a new, overcomplete basis, where each basis vector is “monosemantic” and affiliated with a single, human-compatible conceptual description. We achieve this through the use of a highly optimized and customized version of the K-SVD algorithm, which we call Double-Batch K-SVD (DBK-SVD). We demonstrate the efficacy of our approach on the sbucaptions dataset, using CLIP embeddings and comparing our results to a Sparse Autoencoder (SAE) baseline. Our method significantly outperforms SAE in terms of reconstruction loss, recovering approximately 2/3 of the original signal compared to 1/6 for SAE. We introduce novel metrics for evaluating explanation faithfulness and specificity, showing that DBK-SVD produces more diverse and specific concept descriptions. We therefore show empirically for the first time that disentangling of concepts arising in Vision Transformers is possible, a statement that has previously been questioned when applying an additional sparsity constraint. Our research opens new avenues for model interpretability, failure mitigation, and downstream task domain transfer in vision transformer models. An interactive demo showcasing our results can be found at https://disentangling-sbucaptions.xyz, and we make our DBK-SVD implementation openly available at https://github.com/RomeoV/KSVD.jl.
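The authors' DBK-SVD implementation is in Julia (KSVD.jl, linked above). As a rough illustration of the underlying idea only — representing a signal sparsely over an overcomplete dictionary, the coding step that K-SVD alternates with dictionary updates — here is a minimal orthogonal-matching-pursuit sketch in Python. The dictionary, signal, and sparsity level are invented for the example and are not from the paper:

```python
import numpy as np

def sparse_code_omp(x, D, n_nonzero):
    """Greedy orthogonal matching pursuit: approximate x as a sparse
    combination of the columns ("atoms") of an overcomplete dictionary D."""
    residual = x.copy()
    support = []
    coeffs = np.zeros(D.shape[1])
    for _ in range(n_nonzero):
        # pick the atom most correlated with the current residual
        idx = int(np.argmax(np.abs(D.T @ residual)))
        if idx not in support:
            support.append(idx)
        # least-squares refit of x on the atoms selected so far
        sol, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        coeffs[:] = 0.0
        coeffs[support] = sol
        residual = x - D @ coeffs
    return coeffs

# toy overcomplete dictionary: identity atoms plus scaled Hadamard atoms
H = 0.5 * np.array([
    [1,  1,  1,  1],
    [1, -1,  1, -1],
    [1,  1, -1, -1],
    [1, -1, -1,  1],
], dtype=float)
D = np.hstack([np.eye(4), H])        # 4-dim signals, 8 unit-norm atoms
x = 2.0 * D[:, 0] - 1.5 * D[:, 1]    # ground-truth 2-sparse signal
a = sparse_code_omp(x, D, n_nonzero=2)  # selects atoms 0 and 1, exact recovery
```

In full K-SVD, this coding step alternates with an SVD-based update of each dictionary atom; the paper's contribution is a batched, optimized variant of that loop plus the monosemantic interpretation of the learned atoms.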

Using AI to Understand Residential Solar Power
Zhecheng Wang, Marie-Louise Arlt, Chad Zanocco, Arun Majumdar, Ram Rajagopal
Quick Read
Sep 28, 2023
Policy Brief

This brief introduces a computer-vision approach to analyzing solar panel adoption in U.S. households that can help policymakers tailor incentive mechanisms.

America's 250 Greatest Innovators: Celebrating The American Dream
Forbes
Feb 11, 2026
Media Mention

HAI Co-Director Fei-Fei Li was named one of America's 250 greatest innovators, alongside fellow Stanford affiliates Rodney Brooks, Carolyn Bertozzi, Daphne Koller, and Andrew Ng.

ReMix: Optimizing Data Mixtures for Large Scale Imitation Learning
Joey Hejna, Chethan Anand Bhateja, Yichen Jiang, Karl Pertsch, Dorsa Sadigh
Sep 05, 2024
Research

Increasingly large robotics datasets are being collected to train larger foundation models in robotics. However, despite the fact that data selection has been of utmost importance to scaling in vision and natural language processing (NLP), little work in robotics has questioned what data such models should actually be trained on. In this work we investigate how to weigh different subsets or “domains” of robotics datasets during pre-training to maximize worst-case performance across all possible downstream domains using distributionally robust optimization (DRO). Unlike in NLP, we find that these methods are hard to apply out of the box due to varying action spaces and dynamics across robots. Our method, ReMix, employs early stopping and action normalization and discretization to counteract these issues. Through extensive experimentation on both the Bridge and OpenX datasets, we demonstrate that data curation can have an outsized impact on downstream performance. Specifically, domain weights learned by ReMix outperform uniform weights by over 40% on average and human-selected weights by over 20% on datasets used to train the RT-X models.
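ReMix's full training pipeline is not reproduced here, but the core distributionally robust idea — shifting mixture weight toward domains whose training loss stays high, so the worst-case domain improves — can be sketched with a multiplicative-weights update. This is a generic Group-DRO-style heuristic, not the authors' implementation; the loss numbers and step size below are made up for illustration:

```python
import numpy as np

def dro_domain_weights(loss_history, eta=0.5):
    """Given a (steps, domains) array of per-domain training losses,
    run multiplicative-weights updates so that consistently high-loss
    domains end up upweighted in the data mixture."""
    n_domains = loss_history.shape[1]
    w = np.full(n_domains, 1.0 / n_domains)  # start from a uniform mixture
    for losses in loss_history:
        w *= np.exp(eta * losses)   # upweight domains with higher loss
        w /= w.sum()                # renormalize to a distribution
    return w

# toy example: domain 1 is consistently harder to fit than domain 0
history = np.array([[0.2, 1.0],
                    [0.3, 0.9],
                    [0.1, 1.1]])
w = dro_domain_weights(history)     # w[1] > w[0]: harder domain gets more weight
```

In the paper's setting the per-domain losses come from a small proxy model during pre-training, and the learned weights then govern how the large-scale robotics mixture is sampled.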

Evaluating Facial Recognition Technology: A Protocol for Performance Assessment in New Domains
Daniel E. Ho, Emily Black, Maneesh Agrawala, Fei-Fei Li
Deep Dive
Nov 01, 2020
White Paper

This white paper provides scientifically grounded recommendations for putting calls to test the operational accuracy of facial recognition technology into context.

All Work Published on Computer Vision

AI Can’t Do Physics Well – And That’s a Roadblock to Autonomy
Andrew Myers
Jan 26, 2026
News

QuantiPhy is a new benchmark and training framework that evaluates whether AI can numerically reason about physical properties in video images. QuantiPhy reveals that today’s models struggle with basic estimates of size, speed, and distance but offers a way forward.

Spatial Intelligence Is AI’s Next Frontier
TIME
Dec 11, 2025
Media Mention

"This is AI’s next frontier, and why 2025 was such a pivotal year," writes HAI Co-Director Fei-Fei Li.

The Architects of AI Are TIME’s 2025 Person of the Year
TIME
Dec 11, 2025
Media Mention

HAI founding co-director Fei-Fei Li has been named one of TIME's 2025 Persons of the Year. From ImageNet to her advocacy for human-centered AI, Dr. Li has been a guiding light in the field.

How Natural Disasters Exacerbate Inequity
Katharine Miller
Dec 10, 2025
News

Using AI to analyze Google Street View images of damaged buildings across 16 states, Stanford researchers found that destroyed buildings in poor areas often remained empty lots for years, while those in wealthy areas were rebuilt bigger and better than before.

Fei-Fei Li Wins Queen Elizabeth Prize for Engineering
Shana Lynch
Nov 07, 2025
News

The Stanford HAI co-founder is recognized for breakthroughs that propelled computer vision and deep learning, and for championing human-centered AI and industry innovation.

Saw, Sword, or Shovel: AI Spots Functional Similarities Between Disparate Objects
Andrew Myers
Oct 13, 2025
News

With a new computer vision model that recognizes the real-world utility of objects in images, researchers at Stanford look to push the boundaries of robotics and AI.
