They say a picture is worth a thousand words. But an image can’t “speak” to individuals with blindness or low vision (BLV) without a little help. In a world driven by visual imagery, especially online, this creates a barrier to access. The good news: When screen readers – software that reads the content of web pages to BLV people – come across an image, they will read any “alt-text” descriptions that the website creator added to the underlying HTML code, rendering the image accessible. The bad news: Few images are accompanied by adequate alt-text descriptions.
In fact, according to one study, alt-text descriptions are included with fewer than 6% of English-language Wikipedia images. And even in instances where websites do provide descriptions, they may be of no help to the BLV community. Imagine, for example, alt-text descriptions that list only the name of the photographer, the image’s file name, or a few keywords to aid with search. Or picture a home button that has the shape of a house but no alt-text saying “home.”
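To make the failure modes above concrete, here is a minimal sketch of an alt-text check written with Python's standard-library HTML parser. It is only an illustration, not a tool from the study: the `AltTextAuditor` class, the `audit_alt_text` helper, the file-name heuristic, and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser
import re

# Hypothetical heuristic: alt text that is nothing but a file name is unhelpful.
FILENAME_PATTERN = re.compile(r"^[\w\-. ]+\.(jpe?g|png|gif|webp|svg)$", re.IGNORECASE)


class AltTextAuditor(HTMLParser):
    """Collects (src, problem) pairs for <img> tags with missing or unhelpful alt text."""

    def __init__(self):
        super().__init__()
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src", "")
        alt = attributes.get("alt")
        if alt is None:
            # No alt attribute (or one with no value): a screen reader has nothing to read.
            self.problems.append((src, "missing alt attribute"))
        elif FILENAME_PATTERN.match(alt.strip()):
            self.problems.append((src, "alt text looks like a file name"))
        # An empty alt="" is not flagged: it conventionally marks a decorative image.


def audit_alt_text(html):
    """Return the list of problems found in an HTML string."""
    auditor = AltTextAuditor()
    auditor.feed(html)
    return auditor.problems


# Made-up sample markup: a house-shaped home button with no alt text,
# an image whose alt text is a file name, and one useful description.
sample_page = """
<a href="/"><img src="house.png"></a>
<img src="team.jpg" alt="IMG_2041.jpg">
<img src="spire.jpg" alt="A slate-covered church spire topped with a cross">
"""

for src, problem in audit_alt_text(sample_page):
    print(f"{src}: {problem}")
```

A check like this can only catch descriptions that are absent or obviously unhelpful. It cannot tell whether a description that is present actually serves a BLV reader.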
As a result of missing or unhelpful image descriptions, members of the BLV community are frequently left out of valuable social media interactions or unable to access essential information on websites that use images for site navigation or to convey meaning.
While we should encourage better tooling and interfaces to nudge people toward making images accessible, society’s failure to date to provide useful and accessible alt-text descriptions for every image on the internet points to the potential for an AI solution, says Elisa Kreiss, a graduate student in linguistics at Stanford University and a member of the Stanford Natural Language Processing Group. But image descriptions produced by natural language generation (NLG) haven’t yet proven beneficial to the BLV community. “There’s a disconnect between the models we have in computer science that are supposed to generate text from images and what actual users find to be useful,” she says.
In a new paper, Kreiss and her co-authors (including scholars from Stanford, Google Brain, and Columbia University) found that BLV users prefer image descriptions that take context into account. Because context can dramatically change the meaning of an image – e.g., a football player in a Nike ad versus in a story about traumatic brain injury – contextual information is vital for crafting alt-text descriptions that are useful. Yet existing metrics of image description quality don’t take context into account. These metrics are therefore steering the development of NLG image descriptions in a direction that will not improve image accessibility, Kreiss says.
Kreiss and her team also found that BLV users prefer longer alt-text descriptions over the concise descriptions typically promoted by prominent accessibility guidelines – a result that runs counter to expectations.
These findings highlight the need not only for new ways of training sophisticated language models, Kreiss says, but also for new ways of evaluating them to ensure they serve the needs of the communities they’ve been designed to help.
Measuring Image Descriptions’ Usefulness in Context
Computer scientists have long assumed that image descriptions should be objective and context independent, Kreiss says, but human-computer interaction research shows BLV users tend to prefer descriptions that are both subjective and context appropriate. “If the dog is cute or the sunny day is beautiful, depending on the context, the description might need to say so,” she says. And if the image appears on a shopping website versus a news blog, the alt-text description should reflect the particular context to help clarify its meaning.
Yet existing metrics for evaluating the quality of image descriptions focus on whether a description is a reasonable fit for the image regardless of the context in which it appears, Kreiss says. For example, current metrics might highly rate a soccer team’s photo description that reads “a soccer team playing on a field” whether it accompanies an article about cooperation (in which case the alt-text should include something about how the team cooperates), a story about the athletes’ unusual hairstyles (in which case the hairstyles should be described), or a report on the prevalence of advertising in soccer stadiums (in which case the advertising in the arena might be mentioned). If image descriptions are to better serve the needs of BLV users, Kreiss says, they must have greater context awareness.
To explore the importance of context, Kreiss and her colleagues hired Amazon Mechanical Turk workers to write image descriptions for 18 images, each of which appeared in three different Wikipedia articles. In addition to the soccer example cited above, the dataset included images such as a church spire linked to articles about roofs, building materials, and Christian crosses; and a mountain range and lake view associated with articles about montane (mountain slope) ecosystems, bodies of water, and orogeny (a specific way that mountains are formed). The researchers then showed the images to both sighted and BLV study participants and asked them to evaluate each description’s overall quality, imaginability (how well it helped users imagine the image), relevance (how well it captured relevant information), irrelevance (how much irrelevant information it added), and general “fit” (how well the image fit within the article).
The study revealed that BLV and sighted participants’ ratings were highly correlated. Knowing that the two populations were aligned in their assessments will be helpful when designing future NLG systems for generating image descriptions, Kreiss says. “The perspectives of people in the BLV community are essential, but often during system development we need much more data than we can get from the low-incidence BLV population.”
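As a rough illustration of what “highly correlated” means here, the sketch below computes a Pearson correlation between two groups' ratings of the same descriptions. The article does not say which statistic the researchers used, so Pearson is an assumption, and the ratings are made-up placeholders rather than data from the study.

```python
from math import sqrt


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length rating lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two ratings")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return covariance / (spread_x * spread_y)


# Made-up placeholder ratings (1-5 scale) for five descriptions; NOT study data.
blv_ratings = [4.5, 2.0, 3.5, 1.5, 5.0]
sighted_ratings = [4.0, 2.5, 3.0, 1.0, 4.5]

print(round(pearson_correlation(blv_ratings, sighted_ratings), 3))
```

A value near 1 means the two groups tend to rank the same descriptions as better or worse, which is the sense in which sighted ratings could supplement scarcer BLV ratings during system development.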
Another finding: Context matters. Participants’ ratings of an image description’s overall quality closely aligned with their ratings for relevance.
When it came to description length, BLV participants rated the quality of longer descriptions more highly than did sighted participants, a finding Kreiss considers surprising and worthy of further research. “Users’ preference for shorter or longer image descriptions might also depend on the context,” she notes. Figures in scientific papers, for example, might merit a longer description.
Steering Toward Better Metrics
Kreiss hopes her team’s research will promote metrics of image description quality that will better serve the needs of BLV users. In their paper, she and her colleagues found that two of the current methods (CLIPScore and SPURTS) were not capable of capturing context. CLIPScore, for example, only provides a compatibility score for an image and its description. And SPURTS evaluates the quality of the description text without reference to the image. While these metrics can evaluate the truthfulness of an image description, that is only a first step toward driving “useful” description generation, which also requires relevance (i.e., context dependence), Kreiss says.
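The structure of an image-description compatibility score can be sketched in a few lines. The version below follows the published CLIPScore definition as a rescaled, clipped cosine similarity between an image embedding and a text embedding, with 2.5 as the rescaling weight. The toy three-dimensional vectors stand in for real CLIP encoder outputs and are purely hypothetical, as are the function names.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clip_style_score(image_embedding, description_embedding, weight=2.5):
    """Rescaled, clipped cosine similarity between an image and a description.

    There is no argument for the surrounding article, so the score cannot
    change with the context in which the image appears.
    """
    similarity = cosine_similarity(image_embedding, description_embedding)
    return weight * max(similarity, 0.0)


# Toy vectors standing in for real encoder outputs (assumption, not real data).
image = [0.9, 0.1, 0.4]
description = [0.8, 0.2, 0.5]

print(clip_style_score(image, description))
```

Because the function takes only the image and the description as inputs, the same description of the same photo gets the same score whether the article is about cooperation, hairstyles, or stadium advertising.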
It was therefore unsurprising that CLIPScore’s ratings of the image descriptions in the researchers’ dataset did not correlate with the ratings by the BLV and sighted participants. Essentially, CLIPScore rated the description’s quality the same regardless of context. When the team added the text of the various Wikipedia articles to alter the way CLIPScore is computed, the correlation with human ratings improved somewhat – a proof of concept, Kreiss says, that referenceless evaluation metrics can be made context aware. She and her team are now working to create a metric that takes context into account from the get-go to make descriptions more accessible and more responsive to the community of people they are meant to serve.
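One way to picture a context-aware score is to blend how well a description fits the image with how well it fits the surrounding article. The sketch below is a hypothetical illustration of that idea only. It is not the formula the researchers used, and the `context_aware_score` function, the `context_weight` parameter, and the toy embeddings are all assumptions.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def context_aware_score(image_emb, description_emb, context_emb, context_weight=0.5):
    """Hypothetical blend of image fit and article fit (not the paper's metric)."""
    image_fit = max(cosine_similarity(image_emb, description_emb), 0.0)
    context_fit = max(cosine_similarity(context_emb, description_emb), 0.0)
    return (1 - context_weight) * image_fit + context_weight * context_fit


# Toy vectors standing in for real encoder outputs (assumption, not real data).
image = [0.9, 0.1, 0.4]
description = [0.8, 0.2, 0.5]          # imagine a description that mentions hairstyles
hairstyle_article = [0.7, 0.3, 0.6]    # article about the athletes' hairstyles
advertising_article = [0.1, 0.9, 0.2]  # article about stadium advertising

score_in_hairstyle_article = context_aware_score(image, description, hairstyle_article)
score_in_advertising_article = context_aware_score(image, description, advertising_article)

print(score_in_hairstyle_article > score_in_advertising_article)
```

With `context_weight=0` the blend reduces to an image-only score that is identical in every article. Any positive weight lets the same description score differently depending on where the image appears.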
“We want to work toward metrics that can lead us toward success in this very important social domain,” Kreiss says. “If we’re not starting with the right metrics, we’re not driving progress in the direction we want to go.”
“Context Matters for Image Descriptions for Accessibility: Challenges for Referenceless Evaluation Metrics” was accepted by the 2022 Empirical Methods in Natural Language Processing conference. Co-authors include Cynthia Bennett, a senior research scientist in Google’s People + AI Research Group; Columbia University undergraduate student and NLP researcher Shayan Hooshmand; Stanford computer science PhD student Eric Zelikman; Google Brain principal scientist Meredith Ringel Morris; and Stanford linguistics professor Christopher Potts.
Stanford HAI's mission is to advance AI research, education, policy, and practice to improve the human condition.