'Worse' AI Counterintuitively Enhances Human Decision Making and Performance
Can worse artificial intelligence (AI) actually lead to better human decision making? That’s the surprising takeaway in a new study from Stanford researchers, based on experiments analyzing thousands of human-AI interactions.
For the study, human users fielded either/or-style questions and indicated a confidence level in their answer choice. The human users were then presented with an AI model’s “advice,” likewise expressed as a confidence level, which favored one of the two possible answers. Based on this AI feedback, the human users then had the option of altering their initial answer and upping or lowering their confidence level.
Intriguingly, when the Stanford researchers programmed the AI to be “overconfident” by exaggerating its confidence levels, the overall performance of the human users improved significantly. The humans got more of the either/or questions right and also reported higher confidence levels in their correct answers. This improvement arose because, in the particular tasks used in the study, the AI performed better than human users and thus usually had “good” advice; yet human users underutilized the AI’s advice when it was stated at a more sedate, mathematically accurate level. Accordingly, overstating the AI’s confidence for both right and wrong advice cumulatively boosted human user success, even though it did on rare occasions lead humans astray into wrong answers.
The study’s authors cautiously note that carefully implemented worse AI can be beneficial to humans but only under certain conditions. The broader takeaway from the findings is that developers should keep human end users and their human notions of confidence more in mind when approaching human-AI collaboration.
“We have shown that ‘worse’ or uncalibrated AI, which seems quite confident in its predictions, ends up being more beneficial for the human user,” says Stanford HAI affiliate faculty James Zou, an assistant professor of biomedical data science and (by courtesy) of computer science and of electrical engineering.
Read the study, “Uncalibrated Models Can Improve Human-AI Collaboration”
Zou is the senior author of the study, which was posted on the scholarly preprint site arXiv. Kailas Vodrahalli, an electrical engineering PhD student at Stanford who is advised by Zou, is the study’s lead author, and Tobias Gerstenberg, an assistant professor of psychology at Stanford, is a co-author.
Human Notions of Confidence
As an explanation for this seemingly counterintuitive phenomenon, the Stanford research team looked to the behavior and psychology of the human users. The study revealed that users changed their answers and confidence levels far more often when the AI offered strong advice, with confidence levels in the 80 to 90 percent range. By the same token, users were less likely to alter their initial response when the AI model offered weaker predictions, with confidence levels only reaching into the 60 to 70 percent range. Faced with this equivocal advice, humans tended to ignore the “second opinion” provided by the AI. While this sounds reasonable, it becomes an issue when humans overestimate their own confidence or undervalue the AI and discard AI advice that is, in reality, more confident than they are.
The upshot, according to the Stanford researchers, is that humans can decide more effectively when presented with advice that aligns with their internal, skewed representation of confidence than when they have to parse highly calibrated, probabilistic predictions.
“An AI model that helps humans the most is not necessarily the AI model that is the best by its own standard metrics of accuracy,” says Zou. “We think this is partly because uncalibrated AI is actually more closely aligned with human notions of confidence, where you may need to be pretty sure of something in order for someone to seriously weigh your opinion.”
The findings could have important implications for designing AI systems intended to work alongside humans. These sorts of systems are becoming increasingly common as decision aids, for instance in guiding physician diagnoses in health care settings. Typically, AI systems have been trained in isolation to be as accurate as possible on fixed datasets. In this way, though, the systems are calibrated wholly independently of the human end user.
Because those end users might find a completely accurate report of the AI models’ uncertainty unenlightening, an uncalibrated approach that readily captures a user’s attention could bolster certain practical applications of AI.
“There is often a misalignment between how we train and develop AI versus how we actually use AI in practice,” says Zou. “This study shows that when developers are crafting AI systems that will interact with humans, it is important to have humans in the loop.”
For the study, Stanford researchers recruited diverse groups of 50 participants via the crowdworking platform Prolific to perform four different tasks. The tasks consisted of determining the correct art period for a painting, detecting the presence of sarcasm in passages of text, gauging whether a landmark occurs in one major city or another, and answering census-based demographic queries. One of the easier questions in the art task showed the user an image of the Mona Lisa. Users then manipulated a sliding scale to express their confidence about the correct art period, from “Definitely Romanticism” on the left side of the scale, through “Unsure” in the middle, to “Definitely Renaissance” (the correct answer) on the right.
After the study participants posited an initial assessment, an AI model chimed in with its own prediction of the right answer. For some task instances, this prediction was highly accurate and confident, while for others the AI prediction came with significant uncertainty. For example, the AI had high confidence about the location of the Golden Gate Bridge, but lower confidence about whether, say, a winding staircase was located in San Francisco or New York City. For about half of the participants, Zou and colleagues perturbed all the AI predictions to overstate their certainty, bumping a 70 percent degree of certainty to 80 percent, for instance.
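To make the idea of "overstating" a prediction concrete, here is a minimal sketch in Python. The article does not say which transformation the researchers used, so the log-odds scaling below, along with the names `overstate` and `ALPHA`, is an assumption chosen only because it reproduces the 70-to-80-percent example while leaving a 50 percent prediction unchanged.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability p, for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def overstate(p: float, alpha: float) -> float:
    """Push p away from 0.5 by scaling its log-odds by alpha.

    alpha > 1 exaggerates confidence; alpha == 1 leaves p unchanged.
    This is an illustrative transform, not the study's documented method.
    """
    return 1.0 / (1.0 + math.exp(-alpha * logit(p)))


# Hypothetical scale factor, picked so that 0.70 is displayed as 0.80.
ALPHA = logit(0.80) / logit(0.70)

if __name__ == "__main__":
    for p in (0.50, 0.60, 0.70, 0.90):
        print(f"calibrated {p:.2f} -> displayed {overstate(p, ALPHA):.2f}")
```

Under this assumed transform the AI never changes which answer it favors; only the stated strength of the advice moves, which matches the article's description of cases where the calibrated and uncalibrated AI agreed but differed in confidence.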
In identical task iterations where the calibrated and uncalibrated AI agreed, but had different levels of confidence, the overstated prediction tended to nudge human users toward more accurate and more confident final decisions. Overall, accuracy improved by a few percentage points across the tasks, from a low of about 2 percent to a high of 6 percent when relying on uncalibrated versus calibrated AI. With regard to raising confidence levels in answers, the performance improvement ranged from around 5 to 15 percent.
To double-check the findings, the Stanford research team ran computer simulations using their model of human decision making and AI feedback and arrived at the same results. Another check on the results came through recruiting an additional set of participants entirely from the United Kingdom, whereas the original participants lived just in the United States. Sure enough, the same results emerged, where uncalibrated AI enhanced decision making by human users.
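The flavor of such a simulation can be sketched with a toy model. This is not the researchers' model of human decision making: the threshold rule, the accuracy figures, and every name below are hypothetical, chosen only to mirror the behavior described in this article (users tend to ignore advice stated with modest confidence, and the AI is more accurate than the users).

```python
import math
import random


def overstate(p: float, alpha: float) -> float:
    """Scale the log-odds of p by alpha (an assumed, illustrative transform)."""
    log_odds = math.log(p / (1.0 - p))
    return 1.0 / (1.0 + math.exp(-alpha * log_odds))


def simulate(alpha: float, n_questions: int = 20000, threshold: float = 0.8,
             human_accuracy: float = 0.6, seed: int = 0) -> float:
    """Return the fraction of final answers that are correct.

    Toy assumptions: the AI's true confidence is uniform on [0.5, 0.99] and
    is calibrated (it is right with that probability); the simulated user
    adopts the AI's answer only if the displayed confidence reaches
    `threshold`, and otherwise keeps an initial answer that is right with
    probability `human_accuracy`.
    """
    rng = random.Random(seed)  # same seed -> same questions for every alpha
    n_correct = 0
    for _ in range(n_questions):
        ai_conf = rng.uniform(0.5, 0.99)
        ai_right = rng.random() < ai_conf
        human_right = rng.random() < human_accuracy
        displayed = overstate(ai_conf, alpha)
        n_correct += ai_right if displayed >= threshold else human_right
    return n_correct / n_questions


if __name__ == "__main__":
    # Hypothetical factor that displays a true 0.70 as 0.80.
    alpha = math.log(0.8 / 0.2) / math.log(0.7 / 0.3)
    print("calibrated advice:", simulate(alpha=1.0))
    print("overstated advice:", simulate(alpha=alpha))
```

In this toy setup the overstated advice wins because it clears the user's threshold on questions where the AI, at roughly 70 to 80 percent true confidence, is still more reliable than the user. If the simulated user were more accurate than the AI in that range, the same exaggeration would hurt, which is consistent with the authors' caution that the benefit holds only under certain conditions.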
Tailoring for Human Collaboration
Zou and colleagues caution that uncalibrated AI would likely be suitable only when AI systems are meant to collaborate with humans in a team setting. For AI systems intended to act on a largely autonomous basis, arriving at their own decisions and predictions, calibrated uncertainty would remain valuable.
The Stanford researchers acknowledge that working with “lesser” AI sounds misguided at best and even dangerous at worst. Hypothetically, an overconfident AI model could convince people to change their minds in the wrong direction and make a mistake, Zou says. But these kinds of errors are inevitable, Zou adds, because the ultimate decision is made by the human user: a fallible being with limitations, biases, and psychological leanings—including finding ambiguous advice unhelpful. Even the most well-calibrated AI, which offers vague and therefore unheeded advice, cannot prevent rare negative outcomes from occurring.
Altogether, Zou and colleagues hope their findings will foster more investigation and discussion about how to best tailor AI for human collaboration.
“I’m very excited about this work because it changes the conversation around AI,” says Zou. “Instead of optimizing AI for the standard AI metrics, we should train AI to directly optimize for downstream human performance.”