Artificial intelligence is enabling spectacular advances in fields from medicine to robotics, but it also generates worry about job losses, privacy, fairness and human accountability.
Small wonder that governments worldwide are fixated on policies to both stay competitive and head off dangers. In June alone, for example, U.S. lawmakers introduced seven separate AI bills.
But do policymakers and the public have accurate data? How do we even define AI, much less measure “progress” or competitiveness? Do we have any agreed-upon metrics about benefits and risks?
Those questions were the focus of a recent workshop convened by Stanford HAI and Stanford’s AI Index.
The AI Index may be the world’s most comprehensive public source of data on AI activity, investment and impact. Yet the message from this conference was just how hard it still is to know what’s going on.
We sat down with three of the AI Index’s creators – Saurabh Mishra of Stanford, Ray Perrault of SRI International and Jack Clark of OpenAI – to better understand the challenges.
Why is it hard to measure the progress and impact of AI, and why should we worry?
Perrault: Policymakers want to know what’s happening, but they need good information. People who want more government funding may have incentives to present one set of numbers to warn that we’re underinvesting in AI, while others might present a different set of numbers to claim we’re heavily funding AI and having great impact. So this requires careful thought about what we’re measuring.
The problem is that it’s difficult to put a boundary around what we mean by “artificial intelligence.” AI has borrowed ideas from many disciplines over time, including logic, linguistics and psychology. Machine learning draws many of its foundations from statistics and optimization, and today is being applied to a broad range of other fields, from bioinformatics to finance. Many of those advances are coming not from AI researchers but from people in the applying disciplines.
There’s nothing wrong with that – it’s progress. But it does raise challenges about how to measure investment and advances in artificial intelligence.
For example, should we think that an investment in self-driving cars is all about AI? AI is certainly important, but you can’t give it credit for the whole field. When you actually build a self-driving car, AI is a pretty small share of the total cost. Most applications of AI are driven by a mix of technologies.
How good are we at measuring performance?
Perrault: Strictly from a technical standpoint, there are different metrics of AI performance. These include the amount of data it takes to train a system, but also the amount of computation required and how well a model performs with real-world data that’s different from what it was trained on.
Speech recognition is much more practical now because it’s possible to collect vast numbers of speech samples for the systems to train on. The more data and the more computing power you can throw at the job, the better the results will be.
But accurate results are not the only measure of performance. Another might be: Is there a way to get the same job done, but with less data and computing power? Increasingly, authors of papers about new AI models indicate how much computing was necessary to get their results.
Clark: One way of cutting through the hype is by having better and more standardized metrics for what you want to achieve.
Imagine if car companies hadn’t standardized horsepower. One company could claim its engine had 100,000 “foxpower,” while another claimed its engine had 700,000 “flypower.” That’s a little like where AI is today. It can be very challenging to compare the performance of a system from task to task, or to compare different systems on the same task, because you use different standards to evaluate them.
You can have a system that’s useful but will use enough energy to boil the ocean, or you can have a system that’s just kind of useful but runs on a triple-A battery. You need to talk about those systems in the same universe.
Mishra: Another metric of progress is in avoiding bias. We know that facial recognition systems are more accurate with some racial groups than others. The National Institute of Standards and Technology has developed a set of systemic evaluation methods to compare the bias of competing facial recognition systems, and it has published reports showing that every system has problems. But those kinds of in-depth standardized measurements and evaluations are still rare in other domains impacted by AI.
Clark: But if you have a single metric, a single performance score, you’re likely to get something wrong. Let’s say you want to measure the bias of facial recognition systems, but the measure is actually a blend of how a system performs for different social or racial groups. What happens if a system is reasonably good overall but weirdly bad at recognizing one particular group?
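Clark's point about blended scores can be made concrete with a little arithmetic. The sketch below is illustrative only: the group labels, sample counts and accuracy figures are invented for the example and do not come from any real facial recognition evaluation, and `accuracy_by_group` is a hypothetical helper, not part of any standard toolkit. It shows how a single overall score can look healthy while one group fares much worse.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return (overall_accuracy, {group: accuracy}) from (group, correct) pairs."""
    totals = defaultdict(int)  # number of samples seen per group
    hits = defaultdict(int)    # number of correct results per group
    for group, correct in records:
        totals[group] += 1
        hits[group] += int(correct)
    overall = sum(hits.values()) / sum(totals.values())
    per_group = {g: hits[g] / totals[g] for g in totals}
    return overall, per_group


# Hypothetical data (assumption, not real measurements):
# group "A": 900 samples, 891 correct; group "B": 100 samples, 60 correct.
records = (
    [("A", True)] * 891 + [("A", False)] * 9
    + [("B", True)] * 60 + [("B", False)] * 40
)

overall, per_group = accuracy_by_group(records)
print(f"overall accuracy: {overall:.3f}")
for group, acc in sorted(per_group.items()):
    print(f"group {group}: {acc:.3f}")
```

With these made-up numbers the blended score is about 95 percent, yet the smaller group is recognized correctly only 60 percent of the time. Reporting results per group, alongside the overall score, is one way to keep that gap visible.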
How good are we at measuring the social and economic impact of AI?
Perrault: It’s a challenge. In spite of all the technological advances in AI, for example, productivity growth has been lagging – even in the West. Part of the answer is that not all the uses of AI generate economic consequences. I can ask a question to my phone and get an answer, but how much economic impact does that have? You can ask many more questions during the day than you could before, but are you more productive than when you had to go look up the information yourself? And how much is the AI contributing? Google says it’s an AI company, but no one really knows how much of their revenue comes from AI.
Mishra: To put this into a global perspective, we need to think about distributional consequences and inequality. We need to study these trends in terms of the impact on developing countries. We don’t have much clarity about which nations, which domains and which organizations are deploying AI. Who has access to which data? Who has access to the computing power? There’s a big paucity of data about developing countries.
Stanford HAI's mission is to advance AI research, education, policy and practice to improve the human condition.