Shortcomings of Visualizations for Human-in-the-Loop Machine Learning

Date: October 09, 2023

While visualizations can help developers better design, train, and understand their models, new research shows gaps between ambitions and evidence.

Because machine learning models are built on data, it makes sense to use data visualization tools to help us interpret how those systems work.

For the last few years, some data visualization researchers have been doing just that, launching a field known as Visualization for Machine Learning, or VIS4ML. The goal: to provide human-in-the-loop domain experts with visualizations that will help them accomplish diverse tasks including designing, training, engineering, interpreting, assessing, and debugging ML models.

But when Hariharan Subramonyam, assistant professor at Stanford Graduate School of Education and a faculty fellow with Stanford’s Institute for Human-Centered AI, and his colleague Jessica Hullman of Northwestern University examined 52 recent VIS4ML research publications, they became concerned that researchers are overstating their accomplishments.

For example, Subramonyam says, researchers in this space are not testing VIS4ML tools in ecologically valid ways and are making inappropriately broad claims about their tools’ applicability. The team’s analysis, which has been accepted for publication at IEEE VIS, is available now on the preprint service arXiv.org.

“The VIS4ML community is trying to solve the problem of making ML models more interpretable,” Subramonyam says, “but the way they’re doing it has shortcomings.”

Lofty Aspirations for VIS4ML

VIS4ML researchers aspire to keep humans in the ML design loop because that will improve ML model performance, Subramonyam says. It’s an admirable goal, but also a difficult challenge. Many ML models are complex black-box models whose inner workings resist inspection. It will take brand-new data visualization tools to help humans understand what’s going on inside those black boxes, he says.

Some VIS4ML researchers have taken a laudable stab at inventing novel data visualization tools that offer a window into some aspects of ML models, Subramonyam says. For example, there are VIS4ML tools for creating a scatter plot that depicts clusters in high-dimensional data, with different colors for each of the categories an ML algorithm finds in a dataset – types of clothing in images, for example, as shown below. This allows an expert to spot items that are mislabeled and re-label them. Other tools might visualize the various layers of a convolutional network in a manner that users can understand, or visualize the nature of various possible features of an ML model so that an expert can make appropriate decisions about which features to include.

[Figure: A sample visualization showing a scatter plot of clothing types from an online clothing retailer.]

When an ML model categorizes thousands of items of clothing from several online shopping sites into 14 types of clothing (T-shirt, shirt, jacket, suit, dress, vest, etc.), it is correct only 61% of the time. In this visualization, the categories are color coded, allowing an expert to easily identify and re-label miscategorized items (red dots in a group of purple dots, for example). This type of scatter plot relies on a visualization algorithm that is good at showing clusters when they exist, but can also imply structure that doesn’t actually exist in the data, Hullman says.
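To make the re-labeling idea concrete, here is a minimal sketch of how "a red dot in a group of purple dots" could be flagged automatically. It is an illustration only, not the method used by any tool in the papers reviewed. The function name `flag_label_outliers`, the neighbor count `k`, and the two synthetic clusters are all assumptions, and the 2-D points stand in for whatever coordinates a projection algorithm would produce.

```python
import numpy as np

def flag_label_outliers(points, labels, k=5):
    """Return indices of points whose label differs from the majority
    label among their k nearest neighbors (hypothetical helper)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points.
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    flagged = []
    for i in range(len(points)):
        neighbors = np.argsort(dists[i])[:k]
        values, counts = np.unique(labels[neighbors], return_counts=True)
        majority = values[np.argmax(counts)]
        if majority != labels[i]:
            flagged.append(i)
    return flagged

# Synthetic stand-in for projected 2-D coordinates: two well-separated clusters.
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(20, 2))
cluster_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(20, 2))
points = np.vstack([cluster_a, cluster_b])
labels = np.array(["shirt"] * 20 + ["dress"] * 20)
labels[3] = "dress"  # simulate one mislabeled item inside the "shirt" cluster

flagged = flag_label_outliers(points, labels, k=5)
print(flagged)
```

Because a projection can imply structure that is not in the data, items flagged this way are best treated as candidates for an expert to review, not as confirmed labeling errors.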

The Generalizability Gap

While developing novel VIS4ML tools for human-in-the-loop ML is important work, Subramonyam and Hullman’s analysis yields some troubling findings: These tools are too often tested by a small set of experts, often those who were involved in designing the tools in the first place, and they are typically tested on only the most popular standard datasets. “The measure of each tool’s usefulness is quite narrow,” Subramonyam says.

In addition, only a third of the 52 VIS4ML papers reviewed went beyond asking an expert if a tool seemed useful and actually reported whether using the tool changed the performance of an ML model. Evidence in the other papers depended on hypothetical claims about a visualization tool’s potential benefits, essentially positing that the tool will improve model performance for any kind of model and dataset.
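What would it look like to report whether a tool changed model performance? One simple possibility, sketched below, is to compare accuracy on the same held-out examples before and after an expert uses the tool. The helper names `accuracy` and `performance_change` and the toy labels are placeholders chosen for illustration; they are not taken from the paper or from any of the tools it reviews.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("label lists must not be empty")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

def performance_change(true_labels, preds_before, preds_after):
    """Accuracy before and after the intervention, plus the difference."""
    before = accuracy(true_labels, preds_before)
    after = accuracy(true_labels, preds_after)
    return {"before": before, "after": after, "delta": after - before}

# Toy placeholder data: the same held-out items scored twice.
held_out_truth = ["shirt", "dress", "dress", "shirt", "vest"]
preds_before = ["shirt", "shirt", "dress", "shirt", "dress"]  # model before re-labeling
preds_after = ["shirt", "dress", "dress", "shirt", "dress"]   # model retrained after re-labeling

report = performance_change(held_out_truth, preds_before, preds_after)
print(report)
```

A result like this speaks only to the one model and dataset that were tested. Claims about other models or datasets would need their own measurements.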

“These papers make these claims without providing supporting evidence and without acknowledging their limitations and constraints,” Subramonyam says.

Recommendations

VIS4ML researchers should curtail the unsupported claims about their tools’ generalizability and be more transparent about their limitations, Subramonyam says.

If these researchers want to truly support human-in-the-loop ML, they need to more thoroughly evaluate VIS4ML tools and build a stronger evidence base for any claims of broad applicability. “Researchers need to connect the dots between the new tools and their usefulness in the real world,” Subramonyam says. To further that aim, he and Hullman set out some concrete guidelines for transparency in their paper.

In addition, Subramonyam says, there’s a need for closer collaboration between the people who are building these visualization solutions and the communities they hope to serve. “Human-centered AI is a multidisciplinary endeavor,” he says. “You can’t have tunnel vision where you build a visualization solution expecting it’s going to work in multiple domains and workflows without actually testing it in those domains and workflows.”

