The Open-Source Movement Comes to Medical Datasets

Hoping to spur crowd-sourced AI applications in health care, Stanford’s AIMI center is expanding its free repository of datasets for researchers around the world.

Medical datasets can cost millions of dollars to acquire, which limits their use. A new platform at Stanford AIMI will offer datasets at no cost. | Laurence Dutton

In a move to democratize research on artificial intelligence and medicine, Stanford’s Center for Artificial Intelligence in Medicine and Imaging (AIMI) is dramatically expanding what is already the world’s largest free repository of AI-ready annotated medical imaging datasets.

Artificial intelligence has become an increasingly pervasive tool for interpreting medical images, from detecting tumors in mammograms and brain scans to analyzing ultrasound videos of a person’s pumping heart.

Many AI-powered devices now rival the accuracy of human doctors. Beyond simply spotting a likely tumor or bone fracture, some systems predict the course of a patient’s illness and make recommendations.

But AI tools have to be trained on expensive datasets of images that have been meticulously annotated by human experts. Because those datasets can cost millions of dollars to acquire or create, much of the research is being funded by big corporations that don’t necessarily share their data with the public.

“What drives this technology, whether you’re a surgeon or an obstetrician, is data,” says Matthew Lungren, co-director of AIMI and an assistant professor of radiology at Stanford. “We want to double down on the idea that medical data is a public good, and that it should be open to the talents of researchers anywhere in the world.”

Launched two years ago, AIMI has already acquired annotated datasets for more than 1 million images, many of them from the Stanford University Medical Center. Researchers can download those datasets at no cost and use them to train AI models that recommend certain kinds of action.

Now, AIMI has teamed up with Microsoft’s AI for Health program to launch a new platform that will be more automated, accessible, and visible. It will be capable of hosting and organizing scores of additional images from institutions around the world. Part of the idea is to create an open and global repository. The platform will also provide a hub for sharing research, making it easier to refine different models and identify differences between population groups. The platform can even offer cloud-based computing power, so researchers don’t have to worry about building resource-intensive local infrastructure for clinical machine learning.

Building a Research Ecosystem

The idea is to create an entire ecosystem for AI medical research, and not just for analyzing images. With the right datasets, people will also be able to explore important clinical use cases that go beyond the pixel data alone and draw on related multimodal companion data.

The center already has nine datasets containing more than 1 million images, and Lungren predicts that number will double within the next year. Two new datasets will be released with the new platform.

“This platform will have the largest diversity and volume of AI-ready medical datasets in the world,” he says.

In time, the platform will also offer standardized machine-learning tools and pre-trained models leveraging open-source data and common architectures (AI software in a box, as it were) to spur a wave of crowd-sourced AI research.

Democratizing the Tools

Because the data will be offered at no cost, researchers will be able to explore niche areas that large corporations might well overlook, such as medical problems that affect particular communities.

These diverse datasets will also make it easier for researchers to spot hidden biases in the data or in algorithms. Studies have shown that some AI models are more accurate for certain population groups than others, mainly because they were trained on data from patients at one location. Having datasets from many different communities will make it easier for researchers to detect those issues.

“We love that corporations are doing all this work, but we don’t love the fact that the opportunity to share information is asymmetric,” Lungren says. “If they amass data but then lock it down, they will be the only ones who can innovate, which would shut out the important contributions by computer scientists and clinicians around the world. That’s not a position we want to be in.”

Stanford HAI's mission is to advance AI research, education, policy and practice to improve the human condition.
