A Consumer Reports for AI Services

Stanford researchers reveal the surprising heterogeneity of cost and accuracy in the machine learning as a service (MLaaS) market – and offer a tool to help consumers choose wisely.


These days, many companies turn to machine learning (ML) models to help them with tasks such as detecting fraudulent transactions, extracting key information from product reviews, or providing streamlined customer service.

But instead of developing their own machine learning models to do these things, most companies pay a fee to use ML services developed and provided by another company (Google, Amazon, or Microsoft, for example). Typically, they access these services by way of an API – an interface that enables secure submission of data.

Unfortunately, consumers in this machine learning as a service (MLaaS) market know very little about the nature or quality of the services they are purchasing. Whereas consumers of ordinary products like laptops, cars, or refrigerators can turn to Consumer Reports or Wirecutter to help them evaluate a product’s quality, there are no such resources for MLaaS consumers. For example, MLaaS consumers don’t know what data was used to train any given ML API, how accurate any given service’s predictions will be for their own data, or if paying more will yield better predictions. This leaves them with little basis for choosing among the different options, says James Zou, assistant professor of biomedical data science at Stanford University and member of the Stanford Institute for Human-Centered Artificial Intelligence.

To address this problem, Zou and his Stanford colleagues, including computer science graduate student Lingjiao Chen and associate professor Matei Zaharia, created HAPI (History of APIs), a dataset comprising ML predictions by multiple vendors’ ML APIs over time, as well as two tools that provide consumers with an automated way to choose the best API for a particular task, dataset, and budget.

The team’s analysis revealed that, for the same tasks, the accuracy and cost of different vendors’ ML services varied considerably. In addition, cost and accuracy were not always correlated, and companies’ offerings changed over time – not always for the better. Meanwhile, the team demonstrated that their tools can help consumers achieve greater accuracy at less cost.

In combination, HAPI and the Frugal tools should prove valuable for both the purchasers and vendors of MLaaS, Zou says. “We’re providing a sort of Consumer Reports for ML API users and an ML selection tool to help API providers improve their products.”

The Heterogeneous MLaaS Market

In the MLaaS market, the consumer, usually a business, chooses an ML API, submits its data, and then the API provides predictions based on that data. For example, the consumer might submit customer reviews of their products to get a sense of how many are positive or negative – a machine learning task known as sentiment analysis. Or they might submit troves of images to better label them with accurate tags – a machine learning task known as object recognition. An ML API might also provide speech recognition services for transcribing meetings in real time or providing closed captioning for videos.

Regardless of the dataset or the desired ML task, the same problem faces all MLaaS consumers: the opacity of the market. The ML models themselves are black boxes, and the major MLaaS providers’ accuracy claims are based on internal, proprietary benchmarks. “On their own, users can’t determine which product is going to work best for any given dataset, and they certainly don’t know if the product is changing for the better or worse over time,” Chen says.

To close this gap, Chen and his colleagues assembled and analyzed the HAPI dataset and found heterogeneous accuracy among major ML API services. “Different APIs can perform very differently on the same data,” Chen says. For example, some of the text and image analysis services offered by Google, Amazon, and Microsoft work much better on some datasets than others. And when it comes to price, ML services’ cost per query can differ by a factor of 10 – and more expensive services do not necessarily yield better results.

The HAPI team was also interested in understanding how ML APIs’ accuracy changes over time as models are updated. When they examined how each of several APIs performed at several different time points in the last few years, they found something surprising: After a model update, even if a model’s overall performance improved, it would often perform worse for some kinds of data. And often those changes had a disparate impact on certain populations. For example, updates to some speech recognition software increased overall accuracy even as performance worsened for people with an accent.

“Because of these findings, we recommend that ML API providers do more comprehensive and more diverse internal benchmarking,” Zou says.
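A minimal sketch of the kind of per-group check this recommendation implies is shown below. It is not the HAPI team’s code: the function name, the group labels, and every number are made up for illustration, and it assumes each prediction has already been marked correct or incorrect.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """records: iterable of (group, correct) pairs.

    Returns {group: accuracy}, plus an "overall" entry across all records.
    Assumes no group is itself named "overall".
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for group, correct in records:
        for key in (group, "overall"):
            hits[key] += int(correct)
            totals[key] += 1
    return {key: hits[key] / totals[key] for key in totals}

# Made-up outcomes for one API before and after a hypothetical model update.
before = ([("group_a", True)] * 70 + [("group_a", False)] * 10
          + [("group_b", True)] * 16 + [("group_b", False)] * 4)
after = ([("group_a", True)] * 78 + [("group_a", False)] * 2
         + [("group_b", True)] * 13 + [("group_b", False)] * 7)

old, new = accuracy_by_group(before), accuracy_by_group(after)

# Groups whose accuracy dropped, even if the overall number went up.
regressions = [g for g in old if g != "overall" and new[g] < old[g]]
print(old["overall"], new["overall"], regressions)
```

In this toy data the overall accuracy rises while one group’s accuracy falls, which is the pattern an aggregate benchmark alone would hide.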

Choosing ML APIs

The utility of an ML API depends on each data instance the user submits, which makes deciding which service to use extremely challenging. For example, the choice of ML API for image classification could depend on such things as the subject matter of the image, where it was taken, its resolution, or even time of day, Chen says.

To help consumers choose wisely, Chen and his colleagues developed FrugalML and FrugalMCT, two related frameworks that automate the process of choosing the most accurate and affordable ML API for any given datum (for example, an image or a bit of speech or text).

Here’s an example of how the Frugal pipeline works for image recognition: A user submits an image of a man playing tennis to a free ML API that spits out an initial prediction of what’s in that photo. For example, it might identify a person and a sports ball. Frugal, which has learned the strengths and weaknesses of several paid ML services, then evaluates which, if any, of those services is likely to yield a more accurate prediction and at what cost. If the anticipated improvement in accuracy falls within the user’s budget, Frugal then submits the image to the service that will yield more accurate tags – for example: person, sports ball, tennis racket.
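The flow described above can be sketched roughly as follows. This is an illustrative cascade under stated assumptions, not the FrugalML or FrugalMCT implementation: the `Service` type, the `cascade` function, the `expected_gain` estimator, the `min_gain` threshold, and the stub services and prices are all hypothetical. In particular, the toy gain function stands in for whatever the real framework learns about each paid service’s strengths and weaknesses.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical shape of a prediction: (labels, confidence in [0, 1]).
Prediction = Tuple[List[str], float]

@dataclass
class Service:
    name: str
    cost: float  # price per query, arbitrary units
    predict: Callable[[str], Prediction]

def cascade(item: str, free: Service, paid: List[Service],
            expected_gain: Callable[[Prediction, Service], float],
            budget: float, min_gain: float = 0.05):
    """Query the free service, then escalate to at most one paid service.

    Returns (service name used, prediction, amount spent).
    """
    base = free.predict(item)
    # Keep paid services that fit the budget and promise a worthwhile gain.
    candidates = [(expected_gain(base, s), s) for s in paid if s.cost <= budget]
    candidates = [(g, s) for g, s in candidates if g >= min_gain]
    if not candidates:
        return free.name, base, 0.0
    _, best = max(candidates, key=lambda pair: pair[0])
    return best.name, best.predict(item), best.cost

# Stub services standing in for real APIs; outputs and prices are invented.
free = Service("free", 0.0, lambda x: (["person", "sports ball"], 0.6))
paid_a = Service("paid_a", 1.0,
                 lambda x: (["person", "sports ball", "tennis racket"], 0.9))
paid_b = Service("paid_b", 10.0, lambda x: (["person", "tennis racket"], 0.95))

def toy_gain(base: Prediction, service: Service) -> float:
    # Placeholder for a learned estimator: lower base confidence
    # means more room for a paid service to improve the result.
    return (1.0 - base[1]) * (0.5 if service.name == "paid_a" else 0.6)

result = cascade("tennis.jpg", free, [paid_a, paid_b], toy_gain, budget=2.0)
cheap = cascade("tennis.jpg", free, [paid_a, paid_b], toy_gain, budget=0.5)
print(result[0], result[1][0], result[2])
print(cheap[0], cheap[2])
```

With a budget of 2.0 the sketch escalates to the affordable paid stub and returns the richer tag list. With a budget of 0.5 it keeps the free prediction and spends nothing.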

Chen’s team showed that FrugalMCT yielded more accurate predictions faster and more cheaply than any individual ML API could provide on its own. Indeed, Chen says, “In some cases, FrugalMCT reduced the cost by up to 90% to 95% while matching the accuracy of the best individual APIs.” Moreover, at the same cost, FrugalMCT yielded up to a 6% gain in accuracy over the best API, he says.

Chen thinks of the FrugalMCT framework as an API that could sit on top of other ML APIs – allowing consumers to leverage various services to meet their needs. 

What’s a Consumer to Do?

The HAPI website offers an “explore” option where consumers and providers of MLaaS can review the performance of various vendors’ ML APIs on different tasks at several different time points. FrugalML and FrugalMCT are currently available as research tools, and Chen would love to see them commercialized so that more people can benefit from improved accuracy and reduced cost while receiving the kind of customer support and customization a business entity can provide.

According to Chen, the team’s work will make it cheaper and easier for more people to participate in the MLaaS marketplace. 

“The end goal is to have the whole industry evolve to a better place, so everyone can benefit,” he says.

Stanford HAI’s mission is to advance AI research, education, policy and practice to improve the human condition.
