General Few-shot Learners for Image Understanding and Generation
General Few-shot Learners for Image Understanding and Generation
Related Publications
Current societal trends reflect an increased mistrust in science and a lowered civic engagement that threaten to impair research that is foundational for ensuring public health and advancing health equity. One effective countermeasure to these trends lies in community-facing citizen science applications to increase public participation in scientific research, making this field an important target for artificial intelligence (AI) exploration. We highlight potentially promising citizen science AI applications that extend beyond individual use to the community level, including conversational large language models, text-to-image generative AI tools, descriptive analytics for analyzing integrated macro- and micro-level data, and predictive analytics. The novel adaptations of AI technologies for community-engaged participatory research also bring an array of potential risks. We highlight possible negative externalities and mitigations for some of the potential ethical and societal challenges in this field.
Current societal trends reflect an increased mistrust in science and a lowered civic engagement that threaten to impair research that is foundational for ensuring public health and advancing health equity. One effective countermeasure to these trends lies in community-facing citizen science applications to increase public participation in scientific research, making this field an important target for artificial intelligence (AI) exploration. We highlight potentially promising citizen science AI applications that extend beyond individual use to the community level, including conversational large language models, text-to-image generative AI tools, descriptive analytics for analyzing integrated macro- and micro-level data, and predictive analytics. The novel adaptations of AI technologies for community-engaged participatory research also bring an array of potential risks. We highlight possible negative externalities and mitigations for some of the potential ethical and societal challenges in this field.
We present a new method of deconstructing class activation tokens of vision transformers into a new, overcomplete basis, where each basis vector is “monosemantic” and affiliated with a single, human-compatible conceptual description. We achieve this through the use of a highly optimized and customized version of the K-SVD algorithm, which we call Double-Batch K-SVD (DBK-SVD). We demonstrate the efficacy of our approach on the sbucaptions dataset, using CLIP embeddings and comparing our results to a Sparse Autoencoder (SAE) baseline. Our method significantly outperforms SAE in terms of reconstruction loss, recovering approximately 2/3 of the original signal compared to 1/6 for SAE. We introduce novel metrics for evaluating explanation faithfulness and specificity, showing that DBK-SVD produces more diverse and specific concept descriptions. We therefore show empirically for the first time that disentangling of concepts arising in Vision Transformers is possible, a statement that has previously been questioned when applying an additional sparsity constraint. Our research opens new avenues for model interpretability, failure mitigation, and downstream task domain transfer in vision transformer models. An interactive demo showcasing our results can be found at https://disentangling-sbucaptions.xyz, and we make our DBK-SVD implementation openly available at https://github.com/RomeoV/KSVD.jl.
We present a new method of deconstructing class activation tokens of vision transformers into a new, overcomplete basis, where each basis vector is “monosemantic” and affiliated with a single, human-compatible conceptual description. We achieve this through the use of a highly optimized and customized version of the K-SVD algorithm, which we call Double-Batch K-SVD (DBK-SVD). We demonstrate the efficacy of our approach on the sbucaptions dataset, using CLIP embeddings and comparing our results to a Sparse Autoencoder (SAE) baseline. Our method significantly outperforms SAE in terms of reconstruction loss, recovering approximately 2/3 of the original signal compared to 1/6 for SAE. We introduce novel metrics for evaluating explanation faithfulness and specificity, showing that DBK-SVD produces more diverse and specific concept descriptions. We therefore show empirically for the first time that disentangling of concepts arising in Vision Transformers is possible, a statement that has previously been questioned when applying an additional sparsity constraint. Our research opens new avenues for model interpretability, failure mitigation, and downstream task domain transfer in vision transformer models. An interactive demo showcasing our results can be found at https://disentangling-sbucaptions.xyz, and we make our DBK-SVD implementation openly available at https://github.com/RomeoV/KSVD.jl.
Model-based reinforcement learning (MBRL) is a promising route to sampleefficient policy optimization. However, a known vulnerability of reconstructionbased MBRL consists of scenarios in which detailed aspects of the world are highly predictable, but irrelevant to learning a good policy. Such scenarios can lead the model to exhaust its capacity on meaningless content, at the cost of neglecting important environment dynamics. While existing approaches attempt to solve this problem, we highlight its continuing impact on leading MBRL methods —including DreamerV3 and DreamerPro — with a novel environment where background distractions are intricate, predictable, and useless for planning future actions. To address this challenge we develop a method for focusing the capacity of the world model through synergy of a pretrained segmentation model, a task-aware reconstruction loss, and adversarial learning. Our method outperforms a variety of other approaches designed to reduce the impact of distractors, and is an advance towards robust model-based reinforcement learning.
Model-based reinforcement learning (MBRL) is a promising route to sampleefficient policy optimization. However, a known vulnerability of reconstructionbased MBRL consists of scenarios in which detailed aspects of the world are highly predictable, but irrelevant to learning a good policy. Such scenarios can lead the model to exhaust its capacity on meaningless content, at the cost of neglecting important environment dynamics. While existing approaches attempt to solve this problem, we highlight its continuing impact on leading MBRL methods —including DreamerV3 and DreamerPro — with a novel environment where background distractions are intricate, predictable, and useless for planning future actions. To address this challenge we develop a method for focusing the capacity of the world model through synergy of a pretrained segmentation model, a task-aware reconstruction loss, and adversarial learning. Our method outperforms a variety of other approaches designed to reduce the impact of distractors, and is an advance towards robust model-based reinforcement learning.
Vafa et al. (2024) introduced a transformer-based econometric model, CAREER, that predicts a worker’s next job as a function of career history (an “occupation model”). CAREER was initially estimated (“pre-trained”) using a large, unrepresentative resume dataset, which served as a “foundation model,” and parameter estimation was continued (“fine-tuned”) using data from a representative survey. CAREER had better predictive performance than benchmarks. This paper considers an alternative where the resume-based foundation model is replaced by a large language model (LLM). We convert tabular data from the survey into text files that resemble resumes and fine-tune the LLMs using these text files with the objective to predict the next token (word). The resulting fine-tuned LLM is used as an input to an occupation model. Its predictive performance surpasses all prior models. We demonstrate the value of fine-tuning and further show that by adding more career data from a different population, fine-tuning smaller LLMs surpasses the performance of fine-tuning larger models.
Vafa et al. (2024) introduced a transformer-based econometric model, CAREER, that predicts a worker’s next job as a function of career history (an “occupation model”). CAREER was initially estimated (“pre-trained”) using a large, unrepresentative resume dataset, which served as a “foundation model,” and parameter estimation was continued (“fine-tuned”) using data from a representative survey. CAREER had better predictive performance than benchmarks. This paper considers an alternative where the resume-based foundation model is replaced by a large language model (LLM). We convert tabular data from the survey into text files that resemble resumes and fine-tune the LLMs using these text files with the objective to predict the next token (word). The resulting fine-tuned LLM is used as an input to an occupation model. Its predictive performance surpasses all prior models. We demonstrate the value of fine-tuning and further show that by adding more career data from a different population, fine-tuning smaller LLMs surpasses the performance of fine-tuning larger models.