
Is It My Turn Yet? Teaching a Voice Assistant When to Speak

Predicting initiation points based on voice intonation instead of silence detection could lead to the next generation of voice assistant technology.

A comic illustration of an annoyed person talking to a smart assistant.


Consider this hypothetical human-human interaction and this actual human-Alexa one:

Box A:
Human1: So, what historical person would you love to meet?
Human2: That’s a great question. Hmm…historical person…maybe, uh, Martin Luther King Junior.
Human1: Great choice! I’d love to meet him as well, and Jane Austen, too.

Box B:
Alexa: So, what historical person would you love to meet?
Human: That’s a great question. Hmm…historical person…maybe, uh, Martin Luther King J--
Alexa: Here’s a random thought. Taking notes in history class…

Where did Alexa go wrong? In the human-human interaction, Human1 gave Human2 a moment to think, and when they paused, Human1 interpreted the silence as a need for more thinking time. Alexa, on the other hand, interpreted the moment of silence the way all current dialog agents do: You’re done talking. It’s my turn.

Thinking pauses, along with other types of silences, are crucial signals that humans use with each other to know when to keep talking and when to stop. Other cues include a change in vocal pattern (e.g., pitch or intonation), or a “yeah” or “uh huh,” which are referred to as linguistic backchannels.

In the absence of a strong understanding of these turn-taking signals, voice agents – like Siri, Alexa, and Google Home – stumble along, resulting in stilted, unnatural conversations and periods of empty silence while the human waits for the voice assistant to chime in. 

How do you design dialog agents that display more human-like behavior? With a goal of creating a more natural conversational flow, Siyan Li, second-year master’s student, and Ashwin Paranjape, recent PhD graduate, collaborated with Christopher Manning, Stanford HAI associate director and Stanford professor of linguistics (School of Humanities and Sciences) and of computer science (Engineering), on a novel approach to this problem. In place of the classification approach traditionally used with dialog agents, they adopted a continuous approach that also incorporates prosodic features from voice inputs. In doing so, they were able to create models that behave more like humans taking turns in real-life conversation.

As Li describes, “Our models continuously ask: In how many seconds can I speak, as opposed to, Can I speak in the next 300 milliseconds?” This continuous approach ultimately predicts more natural points at which voice assistants can initiate speech, allowing for more human-like conversation. 
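The contrast between the two framings can be sketched in a few lines. This is an illustrative toy, not the paper's actual interface: the function names, the stream of predicted lead times, and the latency budget are all hypothetical.

```python
# Hypothetical sketch: binary classification over a fixed window vs.
# a continuous prediction of "in how many seconds can I speak?"

def classify_can_speak(predicted_secs, window=0.3):
    """Traditional framing: a yes/no answer for one fixed window,
    e.g., 'can I speak in the next 300 ms?'"""
    return predicted_secs <= window

def choose_initiation(predictions, latency_budget=0.2):
    """Continuous framing: scan a stream of 'I can speak in t seconds'
    predictions and return the first step where the predicted lead time
    fits within the agent's own response latency, so it can start
    preparing a reply before the user has gone silent."""
    for step, t in enumerate(predictions):
        if t <= latency_budget:
            return step
    return None

# Predictions shrink as the user's turn winds down:
print(choose_initiation([2.0, 1.4, 0.9, 0.5, 0.15, 0.05]))  # 4
```

Because the continuous model emits a fresh estimate at every step, the agent can act on a shrinking lead time rather than waiting for a single window's yes/no verdict.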

Current Systems Detect Silence

As they embarked on this research, Li, Paranjape, and Manning first assessed the chatbot landscape. “Most chatbots now are text-to-text systems that have speech components tacked on to them,” says Paranjape. In other words, speech recognition systems for voice assistants first convert a user’s speech into text, which then gets processed by the dialog agent that retrieves or generates a text response. This text response is then converted to speech, which is the output we hear when Alexa responds to our request. 

Though there have been recent improvements in the technology, many chatbots still do their processing on text, which means that the nuances of a verbal conversation are lost. The linguistic backchannels that humans use to signal turn-taking in conversation fade away. “It’s basically like when you’re texting someone,” says Paranjape. Moreover, current dialog agents use silence detection to determine when it’s their turn to speak, often at a threshold of 700 milliseconds to 1,000 milliseconds. Humans are much quicker than that, usually responding within 200 milliseconds.
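The silence-based endpointing described above can be sketched as a simple frame counter: the agent takes the turn once it has seen enough consecutive low-energy frames. This is a minimal illustration, not any vendor's actual implementation; the function name, the energy floor, and the frame size are assumptions.

```python
# Hypothetical sketch of silence-based endpointing with a 700 ms threshold.
SILENCE_THRESHOLD_MS = 700  # typical thresholds run from 700 to 1,000 ms

def detect_end_of_turn(frames, frame_ms=20, energy_floor=0.01):
    """Return the time (ms into the audio) at which the agent would take
    the turn, i.e., after SILENCE_THRESHOLD_MS of consecutive low-energy
    frames. Return None if the user is still holding the turn."""
    silence_needed = SILENCE_THRESHOLD_MS // frame_ms  # silent frames required
    silent_frames = 0
    for i, energy in enumerate(frames):
        if energy < energy_floor:
            silent_frames += 1
            if silent_frames >= silence_needed:
                return (i + 1) * frame_ms
        else:
            silent_frames = 0  # any speech resets the silence run
    return None

# 1 s of speech followed by 1 s of silence, in 20 ms frames:
print(detect_end_of_turn([0.5] * 50 + [0.001] * 50))  # 1700
```

Note what this scheme cannot do: a thinking pause of 700 ms looks identical to the end of a turn, which is exactly the failure in the Alexa exchange above.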

But Paranjape underscores that the problem isn’t just about having a more human conversation. “It’s also a user-interface issue,” Paranjape explains. “People are comfortable talking to other humans, and they try to bring these characteristics into their conversations with voice agents. But when these characteristics aren’t supported, it becomes an interaction issue that leads to confusion.” 

Audio: Stanford HAI · Alexa Conversation 2

Li and Paranjape’s research could improve a voice assistant’s ability to interpret changes in vocal intonation, like the confusion expressed here (which Alexa glosses over). 

A Model That’s Always Attentive

By “reformulating our model to continuously analyze voice input,” says Li, they redefined the current problem space in an entirely new way. “Our model is more flexible and more similar to what humans do in real life.” 

Li and Paranjape began by considering the two phases of speech input: 1) when a user is speaking, and 2) when a user is silent. “Sometimes being silent for a while is an easy way to predict the end of a turn,” Paranjape says. “But what we’re trying to do here is to predict before they go silent, based on intonation changes or if they’re mid-sentence but not done. Why do this? Because it can signal the dialog system to prepare a response in advance and reduce the gap between turns.”

For this research, Li and Paranjape used a combination of GPT-2 for word features and wav2vec for prosody, with a Gaussian Mixture Model (GMM) sitting on top of the other models. Using a GMM enabled them to predict multiple future points at which the digital agent could initiate speech. This combination of models proved to be the most performant, easily besting current silence-based models. The result is a machine learning model that continuously makes predictions and is always attentive to see if it’s the agent’s turn. As the end of the turn approaches, Li and Paranjape concluded, the model predicts a shortening lead time to agent initiation.
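The advantage of a mixture output can be seen with made-up numbers. In the sketch below, the GMM parameters are invented for illustration; in the actual system they would be predicted from GPT-2 word features and wav2vec prosodic features, and the way candidates are ranked here is an assumption.

```python
import numpy as np

# Hypothetical GMM over "seconds until the agent may speak":
weights = np.array([0.6, 0.3, 0.1])   # mixture weights (sum to 1)
means   = np.array([0.4, 1.5, 3.0])   # candidate initiation times, in seconds
stds    = np.array([0.1, 0.3, 0.5])   # uncertainty around each candidate

def mixture_pdf(t):
    """Density of the predicted time-to-initiation at time t."""
    comps = (weights
             * np.exp(-0.5 * ((t - means) / stds) ** 2)
             / (stds * np.sqrt(2 * np.pi)))
    return comps.sum()

# Unlike a single point estimate, the mixture offers several candidate
# initiation points, ranked by the probability mass assigned to each:
candidates = means[np.argsort(weights)[::-1]]
print(candidates)  # [0.4 1.5 3. ]
```

A single regression output would have to average these modes into one blurry estimate; the mixture lets the agent hold onto "probably very soon, but possibly after another clause" as distinct possibilities.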

Voice Assistant 2.0

Li and Paranjape believe their research will help designers build more engaging experiences because of the model’s ability to get continuous feedback from the user. Since the model considers intonation and other prosodic features from the user’s speech – as opposed to text alone – designers can gain insight into a user’s response to a particular utterance from the agent. These signals give designers the freedom to adjust levers like response length or content type, as the feedback system will alert them if a user is engaged or getting bored. 

Li notes that utterance duration was the main lever they considered in this research, not content type, so a future phase of research could “look at social guidelines we abide by, and encode those into the training process, like increasing politeness or empathy. We could design those as reward signals.” Paranjape believes further research could investigate ambiguity and better handle clarification questions with little burden on the user. 

Both agree that this research represents an exciting phase two of the way people build voice assistants. Paranjape says that future voice assistants will “not be a straight text-to-speech and automatic speech recognition system with pause detection. The hope is that the next phase is more seamless and will take the nuances of voice into account, rather than just converting to text.”

Stanford HAI's mission is to advance AI research, education, policy, and practice to improve the human condition.