What is a Vision Transformer (ViT)?

A Vision Transformer (ViT) is an adaptation of the transformer architecture, originally designed for language, to process and analyze images. Instead of treating images as grids of pixels through convolutional layers, ViTs divide images into small patches, convert them into sequences, and apply attention mechanisms to understand relationships between different parts of the image. ViTs have achieved competitive or superior performance compared to traditional convolutional neural networks on many computer vision tasks.

Stay Up To Date

Get the latest news, advances in research, policy work, and education program updates from HAI in your inbox weekly.

Sign Up For Latest News

Finding Monosemantic Subspaces and Human-Compatible Interpretations in Vision Transformers through Sparse Coding

Romeo Valentin, Vikas Sindhwan, Summeet Singh, Vincent Vanhoucke, Mykel Kochenderfer

Jan 01

Research

We present a new method of deconstructing class activation tokens of vision transformers into a new, overcomplete basis, where each basis vector is “monosemantic” and affiliated with a single, human-compatible conceptual description. We achieve this through the use of a highly optimized and customized version of the K-SVD algorithm, which we call Double-Batch K-SVD (DBK-SVD). We demonstrate the efficacy of our approach on the sbucaptions dataset, using CLIP embeddings and comparing our results to a Sparse Autoencoder (SAE) baseline. Our method significantly outperforms SAE in terms of reconstruction loss, recovering approximately 2/3 of the original signal compared to 1/6 for SAE. We introduce novel metrics for evaluating explanation faithfulness and specificity, showing that DBK-SVD produces more diverse and specific concept descriptions. We therefore show empirically for the first time that disentangling of concepts arising in Vision Transformers is possible, a statement that has previously been questioned when applying an additional sparsity constraint. Our research opens new avenues for model interpretability, failure mitigation, and downstream task domain transfer in vision transformer models. An interactive demo showcasing our results can be found at https://disentangling-sbucaptions.xyz, and we make our DBK-SVD implementation openly available at https://github.com/RomeoV/KSVD.jl.

Finding Monosemantic Subspaces and Human-Compatible Interpretations in Vision Transformers through Sparse Coding

Romeo Valentin, Vikas Sindhwan, Summeet Singh, Vincent Vanhoucke, Mykel Kochenderfer

Jan 01

Computer Vision

Research

Enroll in a Human-Centered AI Course

This AI program covers technical fundamentals, business implications, and societal considerations.

What is a Vision Transformer (ViT)?

Navigate

Participate

Stay Up To Date

Vision Transformer (ViT) mentioned at Stanford HAI

Finding Monosemantic Subspaces and Human-Compatible Interpretations in Vision Transformers through Sparse Coding

Finding Monosemantic Subspaces and Human-Compatible Interpretations in Vision Transformers through Sparse Coding

Enroll in a Human-Centered AI Course