Generative Search Engines: Beware the Facade of Trustworthiness
Generative AI’s rapid rollout may soon alter the humble internet search. A query for how to change a tire will no longer yield a series of websites for our perusal. Instead, it will offer an AI-generated, fluently written and footnoted description of the steps: how to jack up your car, remove the lug nuts, and replace the old tire with the spare.
Indeed, several so-called generative search engines are already publicly available, including Microsoft’s Bing Chat.
But the race to release generative search engines poses risks. The large language model behind Bing Chat often hallucinates, making false statements that sound like fact. And false query responses could prove harmful if people rely on them without seeking verification. Indeed, changing a tire according to faulty generated instructions could literally prove life-threatening.
“If these tools are to be a primary way that we find information, they need to be reliable and trustable,” says Nelson Liu, a fourth-year graduate student in computer science at Stanford University. At the very least, Liu says, the statements provided by generative search engines should include citations of their sources, and those citations should legitimately support the statements they accompany. The question is: Do they?
In a new paper examining how four different generative search engines responded to 1,450 search queries, Liu and his colleagues offer a concerning answer: Only about 50% of the search engines’ generated statements are fully supported by their citations, and only about 75% of the provided citations truly support the statements they accompany.
Read the full study, Evaluating Verifiability in Generative Search Engines
“These systems aren’t yet ready for prime time,” Liu says. “Their statements have a facade of trustworthiness, but we should be taking this content with a healthy grain of salt.”
Verifiability of Generative Search
A generative search engine is sort of a cross between a traditional search engine and a purely generative large language model like ChatGPT. Like a traditional search engine, it retrieves websites through a conventional search process. And like ChatGPT, it can creatively write a narrative query response using its underlying large language model, which a traditional search engine cannot do. But unlike ChatGPT, a generative search engine is supposed to draw on content extracted from the top search hits rather than on the vast dataset used to train the underlying language model.
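In rough terms, this is a retrieval-augmented generation pipeline. The minimal Python sketch below illustrates the flow; web_search and llm_generate are hypothetical stand-ins, not the APIs of any of the systems studied.

```python
from dataclasses import dataclass

# Hypothetical stand-ins: neither function below is a real system's API.
@dataclass
class SearchHit:
    url: str
    snippet: str

def web_search(query: str, k: int = 3) -> list[SearchHit]:
    """Placeholder for the conventional retrieval step (a real web index)."""
    return [SearchHit(f"https://example.com/tires/{i}", f"snippet about '{query}' #{i}")
            for i in range(k)]

def llm_generate(prompt: str) -> str:
    """Placeholder for the underlying large language model."""
    return "To change a tire, first loosen the lug nuts [1], then jack up the car [2]."

def generative_search(query: str) -> str:
    # 1. Retrieve the top hits through a traditional search process.
    hits = web_search(query)
    # 2. Pack the retrieved snippets into the prompt, steering the model
    #    toward the search results rather than its training data alone.
    sources = "\n".join(f"[{i}] {h.url}: {h.snippet}" for i, h in enumerate(hits, 1))
    prompt = (
        "Answer the question using only the numbered sources below, "
        f"citing them as [n].\n\nSources:\n{sources}\n\nQuestion: {query}\nAnswer:"
    )
    # 3. Generate a fluent, citation-annotated response.
    return llm_generate(prompt)

print(generative_search("how to change a tire"))
```

Nothing in this prompting scheme guarantees that the model actually confines itself to the retrieved sources, which is exactly the gap the study measures.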
The problem is, there’s no easy way to know how much of a generative search engine’s response is purely generative (and potentially hallucinatory) and how much is supported by the underlying search results, Liu says. “It’s not clear how often these models directly rely on what was retrieved versus veering off and generating content from their own model.”
To explore that question, Liu and his colleagues tasked Amazon Mechanical Turk workers with evaluating search responses yielded by four generative search engines (Bing Chat, NeevaAI, perplexity.ai, and YouChat), with a focus on four features: fluency, perceived utility, citation recall (the proportion of generated statements that are fully supported by the citations provided), and citation precision (the proportion of citations that support their associated statements).
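Concretely, the two citation metrics reduce to simple proportions over annotator judgments. Here is a minimal Python sketch with toy, made-up annotations (illustrative only, not the paper’s data format):

```python
# Toy annotations (illustrative, not from the study). Each generated statement
# records its citations, whether each citation was judged to support it, and
# whether the citations, taken together, fully support the statement.
statements = [
    {"citations": [("A", True), ("B", True)], "fully_supported": True},
    {"citations": [("C", False)],             "fully_supported": False},
    {"citations": [],                         "fully_supported": False},
]

# Citation recall: share of generated statements fully supported by their citations.
recall = sum(s["fully_supported"] for s in statements) / len(statements)

# Citation precision: share of all provided citations that support
# their associated statement.
all_cites = [supports for s in statements for _, supports in s["citations"]]
precision = sum(all_cites) / len(all_cites) if all_cites else 0.0

print(f"citation recall = {recall:.2f}")      # 0.33 on this toy data
print(f"citation precision = {precision:.2f}")  # 0.67 on this toy data
```

On this toy data, the third statement drags recall down because it offers no citations at all, while citation C drags precision down because it fails to support its statement: the two failure modes the study found in the real systems.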
The result: All four search engines were rated as highly fluent (readable and understandable), and all scored well on perceived utility, Liu says. But, as noted above, citation recall and citation precision were poor on average – roughly 50% and 75%, respectively.
The systems do pretty well in settings where they can directly copy from web pages provided by the conventional search engine, Liu says. But they struggle with combining information from multiple sources. Sometimes they might write a fluent and seemingly useful response but provide few citations. On other occasions they might stitch together a sentence from one web page with a sentence from another web page to yield a sort of Franken-response that doesn't read very fluidly or look very convincing but is actually well supported by the citations. This leads to an intriguing result: “The statements that look the most convincing are the ones we should be most suspicious of,” Liu says. That is, the most fluent and seemingly useful generated statements are less likely to draw from the citations, while the clumsiest statements are more likely to be well supported.
Optimism: A Fix Is Possible
In generative search engines Liu sees an analog to Wikipedia, which used to be viewed with suspicion, but these days has gained a pretty solid reputation. “When I read Wikipedia, I almost never check the sources it cites. I just trust it,” Liu says. In contrast, Liu doesn’t yet feel that way about generative search engine results. “Providing sources for every generated statement is the bare minimum I need in order to slowly build trust in what these models are generating,” he says.
Fixing language models will be tough, Liu says, but he’s optimistic that generative search engines’ problems can be addressed with technological solutions. And at the very least, a traditional list of search results should be provided below a search engine’s generated response, allowing users to easily check the response’s veracity. NeevaAI already does this, and others may follow suit.
“If the research community rallies around a shared goal of making generative search engines more trustworthy, I think things will improve,” Liu says. “I’ve learned not to bet against progress in new technology.”