Agile NLP for Clinical Text: COVID-19 and Beyond

Date: June 01, 2021

With Trove, weakly supervised NLP of clinical text is fast, adaptive, shareable, and high performing.

In early 2020, just as the SARS-CoV-2 virus was arriving in the United States, a team of Stanford researchers wondered if the natural language processing (NLP) framework they were developing might be nimble enough to help triage COVID-19 patients who visited the Stanford Hospital emergency room.

“There’s lots of useful information in doctors’ notes and unstructured textual medical records, and we wanted a fast way to get it out, given the COVID-19 pandemic situation,” says Nigam Shah, professor of medicine (biomedical informatics) and of biomedical data science at Stanford University and an affiliated faculty member of the Stanford Institute for Human-Centered Artificial Intelligence.

Unlike most NLP frameworks, the team’s open-source framework, called Trove, doesn’t require expensive and time-consuming expert-labeled data to train machine learning models. Instead, Trove uses what’s called “weak supervision” to automatically classify entities in clinical text using publicly available ontologies (databases of biomedical information) and expert-generated rules. “There is no expectation that these ontologies and rules will do a perfect job of labeling a training set, but in fact they work quite well,” says Jason Fries, a research scientist in Shah’s lab who led the development of Trove.

In addition to saving time and money, Trove has several other advantages over traditional NLP: It is user-friendly enough for hospital data science teams to use; the rules it relies on can be amended as new scientific information comes along (without the need to manually relabel a training dataset); and it generates labeling functions that can be shared with other hospitals without violating patient privacy.

In a recent paper published in Nature Communications, Fries; Shah; Alison Callahan, another research scientist in Shah’s lab; and several other colleagues showed that Trove does a surprisingly good job of labeling chemicals, diseases, disorders, and drugs — as well as COVID-19 presenting symptoms and risk factors — in clinical text. “It starts to approach the performance you’d see if you paid a group of doctors to manually label the notes for you,” Fries says.

Read the full paper, "Ontology-driven weak supervision for clinical entity classification in electronic health records".

And the team’s COVID work demonstrated that machine learning tools can in fact move more quickly in response to the world, Fries says: “COVID was a nice testbed for showing that a lot of the advantages of weakly supervised learning actually became true advantages in that context.”

Trove’s Weak Supervision: Easy and High Performing

Fries has been developing weakly supervised NLP models for about six years. In fact, Trove uses an open-source weak supervision framework called Snorkel, on which Fries collaborated when he worked with Alex Ratner in Chris Ré’s lab. (Snorkel is also the basis for a startup, Snorkel.AI, for which Fries now consults.) But compared to off-the-shelf Snorkel, Fries says, “Trove is dramatically simplified to support complex medical NLP out of the box. Conceptually, it’s an easier sell.”

Some of Trove’s simplicity comes from skipping over the step in Snorkel that requires users to code a lot of custom rules. But what really drives Trove’s ease of use is its reliance on publicly available ontologies, which typical hospital data science teams who use the NCBO BioPortal are already familiar with. Users simply specify which ontologies or parts of ontologies to use for a labeling task. “Users can be a little sloppy,” Fries says. “They can select ontologies they believe are probably correct but aren’t certain about.” Trove then steps in and uses weak supervision to reason about the quality of the ontologies used, correct for those that are low quality, and then label the training data.
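To make the idea concrete, here is a minimal sketch in plain Python of how term lists drawn from ontologies could act as noisy labelers whose votes are combined. It is not Trove's or Snorkel's actual API: the helper names, the toy term lists, and the label convention are all assumptions made for illustration, and a simple majority vote stands in for the learned model that weak supervision frameworks use to estimate how reliable each source is.

```python
from collections import Counter

# Label convention assumed for this sketch: 1 = entity, 0 = not an entity, -1 = abstain.
ENTITY, NOT_ENTITY, ABSTAIN = 1, 0, -1


def make_ontology_lf(terms):
    """Build a labeling function from a set of ontology terms (hypothetical helper)."""
    vocab = {t.lower() for t in terms}

    def lf(token):
        # Vote ENTITY on a dictionary hit, otherwise stay silent.
        return ENTITY if token.lower() in vocab else ABSTAIN

    return lf


def stopword_lf(token):
    """A simple hand-written rule that votes against common function words."""
    return NOT_ENTITY if token.lower() in {"the", "with", "and", "of"} else ABSTAIN


def majority_vote(votes):
    """Combine votes, ignoring abstentions; ties and all-abstain yield ABSTAIN."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


# Toy term lists standing in for real ontologies (illustrative only).
source_a = {"fever", "pneumonia", "cough"}
source_b = {"fever", "cough", "cold"}  # imagined as a noisier source

LABELING_FUNCTIONS = [make_ontology_lf(source_a), make_ontology_lf(source_b), stopword_lf]


def weak_labels(tokens):
    """Return one combined weak label per token."""
    return [majority_vote([lf(tok) for lf in LABELING_FUNCTIONS]) for tok in tokens]


tokens = "Patient presents with fever and dry cough".split()
print(list(zip(tokens, weak_labels(tokens))))
```

In a real weak supervision pipeline, the majority vote would be replaced by a model that weighs each source by its estimated accuracy, and the resulting labels would then be used to train a downstream classifier.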

In their recent paper, Fries’ team looked at how well various versions of Trove did at labeling chemicals, diseases, disorders, and drugs in a standardized dataset compared to hand labeling. For example, when Trove relied on only the Unified Medical Language System, or UMLS, which contains over 100 ontologies, it outperformed previous state-of-the-art approaches by 1.5% without the burden of manual annotation. The biggest gains (10.9%) were made when additional, less accurate ontologies were added, demonstrating that such noisy sources can be yoked to increase performance. Trove’s performance improved still further when a few specific “task” rules (designed to correct for observed errors) were added on top of the ontologies.

“The results provide a very strong baseline for saying to hospitals, ‘If you take what you’re already doing with ontologies and supercharge it with Trove, your machine reading will start to approach the performance you’d see if you paid a ton of doctors to manually label clinical notes for you,’” Fries says.

Adaptable and Shareable Labeling Functions

One advantage of Trove is the ease with which users can incorporate new knowledge by adding new rules and ontologies. In traditional supervised machine learning, users are locked into the cost of relabeling data and regenerating training sets. “That’s unsustainable, especially in a situation like the COVID-19 pandemic, where your understanding of what’s going on changes from day to day,” Fries says. By using COVID-19 as a testbed, Fries and his colleagues found that Trove really delivered on this promise. “It is very much the case that you can easily modify this setup to incorporate new information,” Fries says.

Another advantage: Trove captures domain expertise in a way that can be shared. In supervised machine learning, developers who want to share an NLP model commonly do so by sharing the manually labeled dataset. But in the clinical context, such sharing runs up against privacy concerns because the labeled datasets typically contain patient information.

With Trove, developers don’t have to share any training data. They only share “labeling functions,” which are essentially rules or heuristics that tell users how to create their own training data. “There are no HIPAA concerns,” Fries says. “It removes a big bottleneck in clinical NLP in terms of exchanging datasets and the logistics of that process.”
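As a rough illustration of why labeling functions are easy to exchange and update, the sketch below stores one as a small JSON document that contains only a name and a list of terms, with no patient text. The spec format, the `build_lf` helper, and the example terms and note are hypothetical and are not Trove's actual file format or API.

```python
import json
import re


def build_lf(spec):
    """Turn a shareable spec (a plain dict) into a callable labeling function."""
    alternatives = "|".join(re.escape(term) for term in spec["terms"])
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)

    def lf(text):
        # Return the distinct matched terms, lowercased and sorted.
        return sorted({m.group(1).lower() for m in pattern.finditer(text)})

    return lf


# Site A: the spec holds rules only, so nothing from a patient record leaves the hospital.
spec = {"name": "covid_symptoms", "terms": ["fever", "cough", "shortness of breath"]}
shared = json.dumps(spec)

# Site B: load the shared rules, then extend them as new knowledge arrives.
# No previously labeled data needs to be relabeled by hand.
received = json.loads(shared)
received["terms"].append("loss of smell")
symptom_lf = build_lf(received)

# A made-up sentence standing in for a local clinical note.
note = "Reports fever, cough, and loss of smell for three days."
found = symptom_lf(note)
print(found)
```

This simple matcher ignores context such as negation ("denies cough"), which is the kind of gap that additional task-specific rules would need to cover.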

Trove and COVID

When the COVID-19 pandemic started, Fries’ team saw it as an opportunity for Trove to make a difference in the world as well as to test Trove’s advantages in a real-life setting. The researchers started by developing some labeling functions for COVID-19 symptoms using a public dataset called MIMIC. Using clinician notes from Stanford Hospital’s emergency department, they then showed that the weakly supervised model extracted symptoms of COVID-19 from patient notes at least as well as, if not a little better than, a hand-labeled MIMIC-based model would have. The initial results were published in NPJ Digital Medicine in July of 2020.

“We really did close the full loop,” Fries says. “We showed that Trove has all of these benefits: You can share the labeling functions; easily update them to incorporate information about COVID; and run Trove daily on a stream of notes coming out of emergency departments to track what’s happening around COVID.”

Trove’s Future

Fries says that just as the machine learning community has popularized “model zoos” for sharing and extending prepackaged models and datasets, he’d like to explore the idea of a “labeling function zoo” where researchers can share code for weakly supervised medical NLP.

Shah is also focused on Trove’s shareability. He hopes next steps for Trove might include running a challenge for identifying some type of information (such as socioeconomic status or homelessness or other social determinants of health) in unstructured clinical text. Every team would work with its own institution’s electronic health records but use shared labeling functions to generate a consensus level of performance. “Not only does Trove get around privacy,” Shah says. “It also helps with reproducible science by allowing others to verify that it actually works.”
