Our AI4All task is to build AI models for zero-shot classification on the ChestMNIST and PneumoniaMNIST datasets. That means the models have to predict disease categories without being explicitly trained on those specific tasks. We'll be working with models like BioMedCLIP and SmolVLM, and experimenting with prompt engineering to improve their accuracy. The project definitely feels ambitious, but having seen the structure and resources laid out, it seems manageable as long as we work together.
We also learned more about prompt engineering—the idea that carefully designing the language prompts we feed into the models can significantly impact performance. It’s a space where medical knowledge, language, and AI all come together, and there’s a lot of room to be creative with how we approach it.
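To make that concrete, here is a minimal sketch of what prompt engineering could look like for PneumoniaMNIST. The class phrases and templates below are my own placeholder assumptions, not anything prescribed by the dataset or the models; the idea is simply that each template produces a different set of text candidates for a CLIP-style model to score against an image.

```python
# Hypothetical prompt templates for zero-shot pneumonia screening.
# None of these phrasings come from the dataset or model docs; they are
# starting points we would compare against each other.

CLASS_PHRASES = ["no signs of pneumonia", "pneumonia"]  # PneumoniaMNIST's two labels

PROMPT_TEMPLATES = [
    "a chest X-ray showing {}",
    "this is a chest X-ray of a patient with {}",
    "radiology image: findings consistent with {}",
]

def build_prompts(template: str) -> list[str]:
    """Fill one template with every class phrase, producing the text
    candidates that a CLIP-style model scores against each image."""
    return [template.format(phrase) for phrase in CLASS_PHRASES]

for template in PROMPT_TEMPLATES:
    print(build_prompts(template))
```

Swapping one template for another changes nothing about the model itself, only the text it compares images against, which is why the wording alone can move accuracy noticeably.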
Today felt like a real turning point – we made solid progress on our final projects and explored some of the broader ways AI is being applied.
The morning began with a focused session reviewing the tools we're using in our projects. Our mentors clarified key concepts like the distinction between classical machine learning, which typically relies on hand-engineered features, and deep learning, where models automatically extract features from raw data and perform classification end-to-end. This is exactly what makes deep learning so powerful for complex tasks like medical image analysis.
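As a small illustration of "end-to-end", here is a toy PyTorch classifier for 28x28 MedMNIST-style images. The architecture and layer sizes are placeholder assumptions for the sketch, not our project model; the point is just that raw pixels go in and class scores come out, with the features learned along the way.

```python
import torch
import torch.nn as nn

# A toy end-to-end classifier for 28x28 grayscale images (MedMNIST-style).
# The convolutional layers learn features from raw pixels, and the final
# linear layer performs classification -- no hand-engineered features.
class TinyCNN(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 28x28 -> 14x14
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 14x14 -> 7x7
        )
        self.classifier = nn.Linear(32 * 7 * 7, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)           # learned feature extraction
        x = x.flatten(start_dim=1)     # flatten for the linear head
        return self.classifier(x)      # class logits

model = TinyCNN(num_classes=2)
dummy_batch = torch.randn(8, 1, 28, 28)   # pretend batch of X-rays
print(model(dummy_batch).shape)           # torch.Size([8, 2])
```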
We also revisited supervised and unsupervised learning. This naturally led to a deeper discussion of CLIP (Contrastive Language-Image Pre-Training), the vision-language model from OpenAI that plays a big role in our project. CLIP learns to associate images and text using contrastive learning, pulling correct image-caption pairs closer together in feature space while pushing mismatched pairs apart. Once trained, it enables zero-shot classification: the model can predict labels it was never explicitly trained on, simply by comparing an image's embedding to the embeddings of textual prompts. This is especially valuable in medical AI, where new, unfamiliar cases often arise.
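Here is a rough sketch of what that comparison looks like in code, using the open_clip library and the public BiomedCLIP checkpoint (the model ID follows its Hugging Face model card). The prompts are my own assumptions, `sample_xray.png` is a placeholder file, and in practice we would batch the dataset rather than classify one image at a time.

```python
import torch
import open_clip
from PIL import Image

# Load the public BiomedCLIP checkpoint from the Hugging Face hub.
MODEL_ID = "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"
model, preprocess = open_clip.create_model_from_pretrained(MODEL_ID)
tokenizer = open_clip.get_tokenizer(MODEL_ID)
model.eval()

# Candidate text prompts -- these act as the "labels" for zero-shot classification.
prompts = ["a chest X-ray with no signs of pneumonia",
           "a chest X-ray showing pneumonia"]

image = preprocess(Image.open("sample_xray.png")).unsqueeze(0)  # placeholder image
text = tokenizer(prompts)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize so the dot product is cosine similarity in feature space.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for prompt, p in zip(prompts, probs[0].tolist()):
    print(f"{p:.3f}  {prompt}")
```

The model was never trained on "pneumonia vs. normal" as a task; the prediction falls out of which prompt's embedding sits closest to the image's embedding.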
My group decided to switch from ChestMNIST to BloodMNIST, a smaller and more manageable dataset. Running the larger dataset on Google Colab had been challenging, and the switch should make it much easier to collaborate and test our models without running up against Colab's resource limits.
In the afternoon, we had a faculty talk from Joon Park, whose work on generative agents made for one of the most engaging sessions so far. His project, Smallville, uses AI-driven agents that simulate believable human behavior in a sandbox environment. The agents can plan their days, have conversations, remember interactions, and even reflect on their experiences.
The whole system runs inside a pixelated, game-like interface that immediately reminded me of cozy games like Animal Crossing. It was fun and charming to watch these little AI agents go about their lives, socializing, making plans, and adapting to what happens around them, all powered by language models paired with memory and higher-level reflection. It was impressive to see how emergent, believable social behaviors could arise from that combination.