The World Wide Web (WWW) and the WWW browser have permeated our lives and revolutionized how we get information and entertainment, how we socialize, and how we conduct business.
Using novel tools that make it easy and inexpensive to develop voice-based agents, researchers at Stanford are now proposing the creation of the World Wide Voice Web (WWvW), a new version of the World Wide Web that people will be able to navigate entirely by using voice.
About 90 million Americans already use smart speakers to stream music and news, as well as to carry out tasks like ordering groceries, scheduling appointments, and controlling their lights. But in the United States, two companies essentially control these gateways to the voice web: Amazon, which pioneered Alexa, and Google, which developed Google Assistant. In effect, the two services are walled gardens. This duopoly creates large power imbalances, allowing the platform owners to favor their own products over those of rival companies. They control which content is made available and what fees to charge for acting as intermediaries between companies and their customers. On top of that, their proprietary smart speakers raise privacy concerns, since they can listen in on conversations whenever they are plugged in.
The Stanford team, led by computer science Professor Monica Lam at the Stanford Open Virtual Assistant Laboratory (OVAL), has developed Genie, an open-source, privacy-preserving virtual assistant, along with cost-effective voice agent development tools that offer an alternative to the proprietary platforms. The scholars also hosted a workshop on Nov. 10 to discuss their work and propose the design of the World Wide Voice Web.
What Is the WWvW?
Just like the World Wide Web, the new WWvW is decentralized. Organizations publish information about their voice agents on their websites, where any virtual assistant can access it. In WWvW, Lam says, the voice agents are like web pages, providing information about their services and applications, and the virtual assistant is the browser. These voice agents can also be made available as chatbots or call-center agents, making them accessible on the computer or over the phone as well.
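To make the browser analogy concrete, one can imagine an assistant discovering an organization's voice agent from metadata published on its website, much as a browser fetches a page. The sketch below is purely illustrative: the manifest format, its field names, and the `discover_agent` helper are invented for this example and are not part of the actual WWvW design.

```python
import json

# Hypothetical example: an organization publishes a small JSON
# "voice agent manifest" on its website, analogous to a web page.
# A virtual assistant (the "browser") reads it to learn what the
# agent can do and where to reach it.
EXAMPLE_MANIFEST = """
{
  "name": "Example Pizza Co.",
  "endpoint": "https://example.com/voice-agent",
  "skills": ["order_pizza", "track_order", "store_hours"]
}
"""

def discover_agent(manifest_text: str) -> dict:
    """Parse a published manifest and index the agent's skills."""
    manifest = json.loads(manifest_text)
    return {
        "name": manifest["name"],
        "endpoint": manifest["endpoint"],
        "skills": set(manifest["skills"]),
    }

def can_handle(agent: dict, skill: str) -> bool:
    """Check whether a discovered agent offers a given skill."""
    return skill in agent["skills"]

agent = discover_agent(EXAMPLE_MANIFEST)
```

Because any assistant can parse such published metadata, no single gatekeeper decides which agents a user can reach, which is the decentralization property the WWvW proposal emphasizes.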
“WWvW has the potential to reach even more people than WWW, including those who are not technically savvy, those who don’t read and write well, or may not even speak a written language,” Lam says. For example, Stanford computer science Assistant Professor Chris Piech, along with graduate students Moussa Doumbouya and Lisa Einstein, is working to develop voice technology for three African languages that could help people who cannot read gain access to valuable resources, including agricultural information and medical care. “Unlike the commercial voice web spearheaded by Amazon and Google, which is only available in select markets and languages, the decentralized WWvW empowers society to provide voice information and services in every language and for every use, including education and other humanitarian causes which do not have big monetary returns,” Lam says.
Why haven’t these tools been created before? The Stanford team’s answer: voice technology is simply very hard to build. Amazon and Google have invested tremendous amounts of money and resources in the AI natural language processing technologies behind their respective assistants, and they employ thousands of people to annotate training data. “The technology development process has been expensive and extremely labor-intensive, creating a huge barrier to entry for anyone trying to offer commercial-grade smart voice assistants,” Lam says.
Over the past six years, Lam has worked with Stanford PhD student Giovanni Campagna, computer science Professor James Landay, and Christopher Manning, professor of computer science and of linguistics, at OVAL to develop a new voice agent development methodology that is two orders of magnitude more sample-efficient than current solutions. The open-source Genie Pre-trained Agent Generator they created offers dramatic reductions in costs and resources in the development of voice agents in different languages.
Interoperability is key to ensuring that devices can interact with each other seamlessly, Lam notes. At the core of the Genie technology is ThingTalk, a distributed programming language the team created for virtual assistants. It enables interoperability among multiple virtual assistants, web services, and IoT devices. Stanford is offering the first course on ThingTalk, Conversational Virtual Assistants Using Deep Learning, this fall.
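The interoperability idea can be illustrated with a small sketch: different vendors expose different native device APIs, and an adapter layer maps them onto one shared interface so a single assistant command works across all of them. This is a simplified illustration in Python, not ThingTalk itself; every class and method name here is invented for the example.

```python
from abc import ABC, abstractmethod

# Illustrative sketch of the interoperability idea: different native
# device APIs are adapted to one shared interface, so a single
# assistant command ("turn on the light") works across vendors.
# All names below are hypothetical.

class LightAdapter(ABC):
    @abstractmethod
    def turn_on(self) -> str: ...

class VendorALight:
    def set_power(self, state: str) -> str:   # vendor A's native API
        return f"vendor-A power={state}"

class VendorBLight:
    def enable(self) -> str:                  # vendor B's native API
        return "vendor-B enabled"

class VendorAAdapter(LightAdapter):
    def __init__(self, device: VendorALight):
        self.device = device
    def turn_on(self) -> str:
        return self.device.set_power("on")

class VendorBAdapter(LightAdapter):
    def __init__(self, device: VendorBLight):
        self.device = device
    def turn_on(self) -> str:
        return self.device.enable()

def assistant_turn_on(lights: list[LightAdapter]) -> list[str]:
    """One command, any vendor: the assistant only sees the shared interface."""
    return [light.turn_on() for light in lights]

results = assistant_turn_on([VendorAAdapter(VendorALight()),
                             VendorBAdapter(VendorBLight())])
```

A shared abstraction like this is what lets any assistant drive any device without vendor-specific integration work, which is the role the article attributes to ThingTalk.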
As of today, Genie has pre-trained agents for the most popular voice skills such as playing music, podcasts, news, restaurant recommendations, reminders, and timers, as well as support for over 700 IoT devices. These agents are openly available and can be applied to other similar services.
World Wide Voice Web Conference
The OVAL team presented these concepts at a workshop focused on the World Wide Voice Web on Nov. 10.
The conference included speakers from academia and industry with expertise in machine learning, natural language processing, computer-human interaction, and IoT devices, and panelists discussed building a voice ecosystem, pretrained agents, and the social value of a voice web. The Stanford team also conducted a live demonstration of Genie.
“We want other people to join us in building the World Wide Voice Web,” says Lam, who is also a faculty member of the Stanford Institute for Human-Centered Artificial Intelligence. “The original World Wide Web grew slowly at the beginning, but once it caught on there was no stopping it. We hope to see the same with the World Wide Voice Web.”
Genie is an ongoing research project funded by the National Science Foundation, the Alfred P. Sloan Foundation, the Verdant Foundation, and Stanford HAI.