Clinical drug trials compare a treatment with a placebo and aim to determine the best course of action for patients. Given enough participants, such randomized control trials are the gold standard for determining causality: If the group receiving the drug improves more than the group receiving the placebo, it’s safe to assume that the better outcome is the result of the drug. With the knowledge gained from the experiment, drug makers can confidently roll out a drug to thousands or even millions of patients.
But this traditional approach to conducting experiments has a major downside: The group receiving the less beneficial treatment inevitably loses out.
The stakes are quite different in a drug trial than when testing versions of a company webpage or advertisement, but the essential problem of any A/B testing is the same. “We gain knowledge at the expense of this opportunity cost,” explains Mohsen Bayati, an associate professor at the Stanford Graduate School of Business. And the opportunity cost only increases with the number of different variants being tested — as, for example, in a chemotherapy drug trial, where each combination of drugs requires its own treatment group.
Bayati, along with Ramesh Johari, a Stanford professor of management science and engineering, and PhD students Nima Hamidi and Khashayar Khosravi, recently set out to see if they could improve the efficiency of these tests and lower opportunity costs. In a new paper, they found that a “greedy” algorithm could streamline tests that explore many options and still produce conclusive results.
The Value of Multi-Armed Bandits
An approach to reducing such opportunity costs has long been known, though it’s been used in practice only in the last decade — mainly in the online world by tech-savvy companies needing to quickly compare many combinations of variables. This experimental design (sometimes called sequential experimentation or adaptive experimentation) uses a set of ideas called the “multi-armed bandit,” named after a row of slot machines (or one-armed bandits). Each “arm” represents one of the options being explored, whether the options vary by product, headline, or whatever the experimenter wants to test.
Instead of allocating all participants or users evenly across all arms as you would in a traditional experiment — say, 100 users to condition A, 100 to B, and so on — the multi-armed bandit allocates just a few users into the different arms at a time and quickly adjusts subsequent allocations of users according to which arms are showing the most promising outcomes. As a result, by the end of an adaptive experiment, fewer people have been subjected to suboptimal versions of whatever was being tested. For example, if someone were to use this adaptive approach in a drug trial, Bayati explains, “we still find out which variant is the best, but more patients are on the good variant.”
That’s the basic idea — but how do you optimize the multi-armed bandit so that as many users as possible end up on the best arm? That’s the crux of all the research into multi-armed bandits, which goes back to the 1930s to model the trade-off between exploring different options to learn about their rewards and “exploiting” options whose rewards you already know. Different policies for this explore-exploit dilemma work best in different situations. The question Bayati and his colleagues investigated is what is the best policy when there are many more arms than two (for example, 10,000 arms, a scenario that is not uncommon in the online world).
Optimizing Bandits
Through mathematical analysis and simulations run on real data, the researchers made several discoveries, which they presented at the last NeurIPS conference. First, they proved that with a large number of arms, the best policy uses subsampling — for example, exploring just 100 arms instead of trying all 10,000 arms. Sure, trying every arm can find the very best arm — but the improvement over trying just a subsample is not nearly big enough to justify the enormous opportunity cost of so much exploration. “If you take a random subset, that’s enough,” Bayati says. “We show that the best out of the 100 is already very good.”
But what should you do with that relatively small number of arms? The answer to that question was the researchers’ second discovery: Their simulations showed the effectiveness of the greedy algorithm, which starts exploiting without a lot of exploration: It simply pulls each arm just once and then pulls the best of those arms every single time.
The researchers are still working to fully understand why the greedy algorithm works so well — but at least part of the answer seems to be that it eliminates most of the suboptimal arms in the early stages and doesn’t try them later on, just as their analysis shows. Instead, greedy narrows down its exploration to a small number of arms — and experiments only with those. And, as Bayati puts it, “The greedy algorithm benefits from free [costless] exploration”— meaning that the algorithm isn’t actively exploring, and yet it is nonetheless learning. “That’s the interesting part.”
Finally, Bayati and his team also theoretically proved a lower bound for any algorithm in these problems and proved that subsampling followed by greedy is optimal when the number of arms is at least as large as the square root of the number of users.
This set of insights is important for AI developers, the researchers say, because most reinforcement learning algorithms rely on a certain amount of exploration, and the multi-armed bandit is a special case of reinforcement learning.
“The takeaway is you can reduce the experimentation,” says Bayati. Amazon, Facebook, Uber, and many other companies routinely run extensive experiments on their many users. “We’re telling them that even if you just try the best options you have seen so far, you might learn much more than you have imagined.”
Stanford HAI's mission is to advance AI research, education, policy and practice to improve the human condition. Learn more.