Amid the pandemic, scientists from epidemiology to engineering have raced to apply their myriad skills to the global health crisis. In just months, more than 160,000 research papers, book chapters, reports, and news articles have flooded the scientific literature and popular media.
“It’s the greatest surge in biomedical research on one topic the world has ever seen. Organizing and making sense of it all has been an unexpected challenge,” says Jake Lever, who recently completed two years of postdoctoral research at Stanford University.
To aid scholars, policymakers, and others hoping to better understand COVID-19, Lever and Stanford bioengineering professor Russ Altman, an associate director of the Stanford Institute for Human-Centered Artificial Intelligence, built CoronaCentral, a web-based dashboard of coronavirus-related articles that can help people unearth trends and reveal new avenues for research. The work was funded as part of the Chan Zuckerberg Biohub.
CoronaCentral applies machine learning to comb through, size up, and categorize virtually every published word about SARS-CoV-2, the virus that causes COVID-19, and two other closely related viruses that beget similar though less severe public health crises, SARS-CoV and MERS-CoV. The literature is organized into useful categories that ease analysis of content covering everything from therapeutics and forecasting to the latest trends: the long-term effects of the pandemic, inequality in care and resource allocation, and misinformation about the disease.
The dashboard offers helpful visualization widgets that quickly summarize articles by category, highlight trending research, and map articles by the geographic location where they were published. It also offers other summaries, such as papers about specific drug treatments, various vaccination approaches, and even patient risk factors including age, weight, and pre-existing conditions.
The dashboard is open to anyone; no member accounts or logins are required. Using Google Analytics, the researchers note that 5,000 unique users have visited the site since July 2020, with an uptick since they published a preprint of their PNAS paper. The trending articles page is the most visited, followed closely by articles looking at the “long haul” effects of COVID-19, vaccine articles, and pieces on psychology.
A New Approach
CoronaCentral’s AI-based approach combines natural language processing to read through written content — the abstracts of published papers and reports and so forth — and then sort them into categories ranging from epidemiology and clinical studies to the psychological effects and policy implications of the pandemic.
The authors then apply what they term “esteem metrics” to measure a given publication’s reach as quantified by mentions in social media, news, and citations in the academic literature. Such metrics allow the team to highlight trending papers that for one reason or another are gaining popularity. Lever asserts that these esteem metrics are not a gauge of scientific merit, only of a publication’s ability to garner attention.
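The paper does not spell out the exact formula behind these esteem metrics, but the idea of combining attention signals into a single ranking can be sketched as a weighted score. The field names and weights below are hypothetical, chosen only to illustrate how social media mentions, news coverage, and citations might be blended to surface trending papers:

```python
from dataclasses import dataclass

@dataclass
class PaperMetrics:
    """Attention counts for one publication (illustrative fields)."""
    tweets: int
    news_mentions: int
    citations: int

def esteem_score(m: PaperMetrics,
                 w_tweets: float = 1.0,
                 w_news: float = 5.0,
                 w_citations: float = 10.0) -> float:
    """Weighted sum of attention signals; the weights here are
    assumptions for illustration, not values from the paper."""
    return (w_tweets * m.tweets
            + w_news * m.news_mentions
            + w_citations * m.citations)

def trending(papers: dict[str, PaperMetrics], top_k: int = 3) -> list[str]:
    """Rank paper IDs by esteem score, highest first."""
    return sorted(papers, key=lambda pid: esteem_score(papers[pid]),
                  reverse=True)[:top_k]
```

As Lever notes, a score like this measures attention, not scientific merit: a heavily tweeted paper outranks a quietly cited one unless the citation weight is set high enough.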
CoronaCentral is not the first biological database promising to ease the process of sifting through research, but it is the first to off-load much of the time-consuming review process to AI. Manually reviewing all the papers on a given topic is a monumental effort for researchers, even when the subject is relatively narrow, says Lever, who got his start in this area creating similar resources for cancer. With the evolving scale and complexity of COVID-19 publications, however, the researcher’s job has become considerably more difficult. CoronaCentral’s powerful winnowing capabilities, search tools, and metrics offer a new way to speed the literature review process for researchers, journalists, policymakers, and virtually anyone else interested in understanding the latest science about COVID-19.
Setting the Baseline
Before the authors could create an algorithm able to comprehend sophisticated scientific language and categorize it effectively, they first needed a training database: a baseline of human interpretations against which the machine learning algorithm could judge its accuracy in categorizing a never-before-seen article.
That database comprised 3,200 articles that had been manually reviewed by humans and slotted according to a list of 38 categories ranging from risk factors, drug treatments, and vaccines to non-medical aspects of the pandemic, such as the economic effects of the disease. The machine learning algorithm then analyzed the text of the remaining 150,000-plus articles using the human labeling as a model.
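This is a multi-label text classification setup: each article can belong to several of the 38 categories at once, and a model trained on the 3,200 human-labeled examples assigns labels to the rest. The published system uses far more sophisticated language models than this, but a toy sketch of the workflow, using simple word-overlap scoring and invented category names, looks like:

```python
from collections import Counter, defaultdict

def tokenize(text: str) -> list[str]:
    """Crude whitespace tokenizer for the sketch."""
    return text.lower().split()

def train(labeled):
    """labeled: list of (abstract, set_of_categories) pairs.
    Count how often each token appears in each category's examples."""
    token_counts = defaultdict(Counter)
    cat_sizes = Counter()
    for text, cats in labeled:
        for cat in cats:
            token_counts[cat].update(tokenize(text))
            cat_sizes[cat] += 1
    return token_counts, cat_sizes

def predict(text, model, threshold=1.0):
    """Score each category by token overlap, normalized by the number
    of training examples; return every category above the threshold
    (multi-label: zero, one, or many categories may apply)."""
    token_counts, cat_sizes = model
    tokens = tokenize(text)
    labels = set()
    for cat, counts in token_counts.items():
        score = sum(counts[t] for t in tokens) / max(cat_sizes[cat], 1)
        if score >= threshold:
            labels.add(cat)
    return labels
```

The key design point the article describes survives even in this toy version: the human-labeled set defines the categories and supplies the ground truth, and the trained model then extends those judgments to the 150,000-plus unlabeled articles.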
Some interesting trends have already emerged from the database, Lever notes. For one, he found studies’ subject matter has evolved over time. In the earliest days of the pandemic, much of the literature was dedicated to forecasting models that predicted where, when, and how fast COVID-19 might spread. As the pandemic played out and vaccines arrived on the scene, he watched as basic research into clinical treatments, drug therapies, and vaccinations caught up to and eventually overtook forecasting. Recently, he’s noted an influx of studies looking at non-medical aspects of the pandemic, such as the long-term societal implications.
Most promising of all, Lever says, the lessons learned during their CoronaCentral experiment are not confined to COVID-19 but should generalize to other diseases and even to other non-medical, scientific fields.
“Wherever a topic generates a great quantity of published information, the analytical and organizing capabilities of artificial intelligence will have a strong future,” Lever says.
Stanford HAI's mission is to advance AI research, education, policy and practice to improve the human condition.