Google Assistant was undoubtedly one of the standout highlights from Google I/O 2018. Maybe you’ve seen this remarkable demonstration of Google Assistant scheduling a hair appointment in real time:

Embedded content: https://www.youtube.com/watch?v=bd1mEm2Fy08

This was the world’s introduction to Duplex, Google’s new system built around a recurrent neural network (RNN). This post will give you a better understanding of the core approach behind Duplex and how it conducts natural conversations to accomplish certain types of tasks.

Defining a natural conversation

So what do we mean exactly when we talk about a “natural” conversation? Is there any way to define requirements for that?

A natural conversation can be described with the following characteristics:

1. The speaker exhibits goal-directed, cooperative, rational behavior. The speaker’s utterances are relevant to the topic, and the speaker is able to successfully carry out a transaction. A crucial part of this is determining which utterances will guide the situation toward success, as well as recognizing the successful outcome of the conversation.

2. The speaker uses the appropriate tone. The speaker demonstrates an ability to match the inflection of speech to the nature of the content. This might include a rising intonation for questions, slow oscillations in pitch, and pauses before lists.

3. The speaker can understand and control the conversational flow and use the right timing. The speaker should know when to give the hearer more time to process, and be able to identify and coordinate speaker “turns,” interruptions, and elaborations. This also includes the ability both to identify and to use strategic pauses and signals that the speaker needs more time (like “umm”).

How does Google Duplex model natural conversations?

First of all, note that Google Duplex is not able to carry out open-ended casual conversation. Rather, it was trained to autonomously handle three types of tasks: scheduling a hair salon appointment, making a restaurant reservation, and asking about a store’s business hours. Once the user asks Google Assistant to do one of these tasks, the Assistant makes the phone call entirely in the background, without any user involvement whatsoever. It then shows the user a notification that the task has been successfully completed.

Use cases sorted by latency

A big requirement for a live conversation is for it to adhere to each speaker’s expectations for timing. Taking too long to respond can jam up the interaction, leading to needless (and confusing) attempts to get back on the same page. To make matters more complicated, these expectations are partially formed by the complexity of the current conversational step; when someone says “hi” to you, they expect a more prompt reply than if they ask you a math question.

To accomplish this, Duplex classifies each user utterance as carrying either a high or a low latency expectation. A prompt reply is usually expected when that stage of the conversation is relatively simple. In cases where swift replies are expected, Duplex uses a deep neural network (DNN); in more complex cases, it uses a recurrent neural network (RNN), which is more expensive but better at modeling language. RNNs have gained a lot of attention in the past few years, especially when applied to natural language processing. Below, we shed some light on how DNNs and RNNs work and how they fit into the core of Google Duplex.
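To make the routing idea concrete, here is a minimal sketch of a latency-aware dispatcher. Everything in it (the function names, the threshold, the toy replies) is invented for illustration; Google has not published Duplex’s actual components.

```python
# A minimal sketch of latency-aware routing between a cheap model and a heavier
# sequence model. All of this is illustrative: Duplex's real components,
# thresholds, and replies are not public.

LATENCY_THRESHOLD = 0.5  # score above which the caller expects a prompt reply


def latency_expectation(utterance: str) -> float:
    """Toy stand-in for the classifier that scores reply-speed expectations."""
    quick_phrases = ("hi", "hello", "ok", "thanks", "bye")
    return 0.9 if utterance.lower().strip(" !.?") in quick_phrases else 0.2


def fast_dnn_reply(utterance: str) -> str:
    """Cheap, one-shot response model used for simple turns."""
    return "Hi, I'm calling to book a haircut for a client."


def rnn_reply(history: list[str], utterance: str) -> str:
    """More expensive sequence model that conditions on the whole history."""
    return f"(after considering {len(history)} earlier turns) Does 4 pm work?"


def respond(history: list[str], utterance: str) -> str:
    if latency_expectation(utterance) > LATENCY_THRESHOLD:
        return fast_dnn_reply(utterance)   # keep simple turns snappy
    return rnn_reply(history, utterance)   # spend the compute on hard turns


print(respond([], "Hi!"))
print(respond(["Hi!", "How can I help?"], "Can you do Tuesday afternoon?"))
```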

DNNs versus RNNs

DNNs are the kind of neural network with which many people are most familiar. They involve an input layer, one or more hidden layers (matrices of weights trained against data), and an output layer that produces what can be interpreted as a prediction or a classification.

Figure 1. Deep neural network (DNN)
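As a rough illustration of that input-to-hidden-to-output structure, here is a tiny feed-forward pass in NumPy. The layer sizes, random weights, and three-way classification are arbitrary and have nothing to do with Duplex specifically.

```python
import numpy as np

# Tiny feed-forward (DNN) pass: input layer -> hidden layer -> output layer.
# Sizes and random weights are arbitrary; this only shows the shape of the
# computation, not anything Google has described.
rng = np.random.default_rng(0)
x = rng.normal(size=(1, 8))                      # one observation, 8 features

W1, b1 = rng.normal(size=(8, 16)), np.zeros(16)  # input  -> hidden weights
W2, b2 = rng.normal(size=(16, 3)), np.zeros(3)   # hidden -> output weights

h = np.tanh(x @ W1 + b1)                         # hidden activations
logits = h @ W2 + b2                             # raw scores for 3 classes
probs = np.exp(logits) / np.exp(logits).sum()    # softmax: a classification
print(probs.round(3))
```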

DNNs are good at one-shot prediction: if a single observation is all it takes to produce suitable output, a DNN is a good fit for the task. Often, however, data comes in sequences (e.g., time series). In both written and spoken language, words arrive in an ordered sequence. It is for this reason that RNNs have seen so much popularity in natural language research in the past decade. Because remembering context is essential for conducting a longer, human-like conversation, RNNs became one of the obvious, go-to choices for the job.

While computationally more expensive than DNNs, RNNs have the added benefit of being able to model how a sequence develops. So while DNNs are often used to map a sequence of features (an observation) to a single output (a prediction), an RNN is suited to mapping sequences to sequences. RNNs can handle multiple types of sequential input, and even fixed-size data can often be processed sequentially. This has led to their popularity for tasks such as automated video annotation and natural language processing, domains whose data are best viewed as time series.

The major difference in an RNN is that the hidden state now has an edge leading to itself. In other words, it ingests not only the current input but also its own past hidden state. This is what allows it to learn sequential patterns.

Figures 2 and 3 below show the RNN architecture in both “rolled up” and “unrolled” forms. Different people find different modes of presentation helpful here. RNNs can be viewed as DNNs stacked up through time: hidden states carry connections between time steps.

Figure 2. “Rolled up” RNN

Figure 3. “Unrolled” RNN
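In code, the self-loop in Figures 2 and 3 corresponds to reusing the hidden state from the previous step. Below is a minimal “vanilla” RNN cell unrolled over a short sequence; again, the shapes and random weights are purely illustrative.

```python
import numpy as np

# Minimal "vanilla" RNN unrolled over a short sequence. The recurrent edge in
# Figures 2 and 3 corresponds to reusing h from the previous step. Sizes and
# weights are arbitrary illustrations.
rng = np.random.default_rng(1)
seq = rng.normal(size=(5, 8))        # 5 time steps, 8 features each

Wx = rng.normal(size=(8, 16))        # input  -> hidden
Wh = rng.normal(size=(16, 16))       # hidden -> hidden (the self-loop)
b = np.zeros(16)

h = np.zeros(16)                     # initial hidden state
for x_t in seq:
    h = np.tanh(x_t @ Wx + h @ Wh + b)   # ingest current input AND past state

print(h[:4].round(3))                # final state summarizes the whole sequence
```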

Closed domains and the vanishing gradient problem

Google freely admits that Duplex won’t have a philosophical conversation, call your grandma, or conduct phone interviews on your behalf. This isn’t a downside; it’s an intentional choice. Picking what they call “closed domains” as Duplex’s intended zone of operation has allowed them to develop a highly tailored system capable of meeting both the timing and accuracy constraints of natural conversation.

Closed domains are loosely defined as any setting with a limited number of conceivable interactions. Imagine a use case pinned down to a specific business and task, such as booking a hair appointment. The idea is that any closed domain has a small (and well-worn) set of conversational paths and options. When booking a hair appointment, complex divergences are rare; people typically stay focused on the task at hand. Furthermore, in a business setting, the humans interacting with Duplex have a built-in incentive to conclude the transaction as quickly and successfully as possible.

This has a number of advantages, but a major one is that it helps Duplex avoid the “vanishing gradient problem,” an issue for many DNNs and RNNs alike. The issue is fairly straightforward to describe in words: when many hidden layers are stacked, as in a multi-layer DNN or across time steps in an RNN, the network begins to “forget” the past. As the error gradient is backpropagated through layer after layer (or step after step), it is repeatedly multiplied by small factors and shrinks toward zero, so the network fails to capture relationships between words that stand far apart in a conversation. For example, the phrase “4 is fine” or “OK for 4” can mean different things depending on the context: it could mean “4 pm” or “4 people.” Mathematically speaking, this happens due to the underlying mechanics of backpropagation (and there are many remedies out there, notably the ReLU activation function). Figure 4 below illustrates the vanishing gradient problem:

Figure 4. Vanishing gradients in an RNN

Given a closed domain, however, the number of times the model has to look back into the past is constrained. This allows Duplex to work smoothly and on time, without recourse to a more computationally expensive model (such as an LSTM). Vanishing gradients aren’t much of an issue if you don’t need to remember much.
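To see the effect numerically, recall that backpropagating through an unrolled RNN multiplies the gradient by a per-step factor (roughly, the activation function’s derivative times the recurrent weight). When those factors are below one, the product shrinks exponentially, which the toy loop below demonstrates with made-up numbers.

```python
import numpy as np

# Toy illustration of vanishing gradients: the gradient reaching step 0 is a
# product of per-step factors (here, a recurrent weight times a tanh
# derivative). With factors below 1, early context decays exponentially.
rng = np.random.default_rng(2)
w = 0.9                                   # a representative recurrent weight
pre_activations = rng.uniform(-1, 1, 50)  # made-up values at each time step

grad = 1.0
for t, a in enumerate(pre_activations, start=1):
    grad *= w * (1.0 - np.tanh(a) ** 2)   # chain rule through one time step
    if t in (5, 20, 50):
        print(f"gradient after {t:2d} steps: {grad:.2e}")
# The signal distinguishing "4 people" from "4 pm" many turns ago is, in
# effect, forgotten long before the gradient reaches it.
```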

The use of speech act theory

Conversations can be either long or short depending on our perspective. Viewed symbol by symbol, they can appear long; viewed turn by turn, where speakers trade utterances, they might appear short. In other words, if we can “chunk” a conversation into larger units (words, terms, phrases, sentences, turns, and so on), we can mitigate the vanishing gradient problem by viewing the problem at the appropriate level. A big advance here with Duplex is using “intents” in this way. Intents are a speech-act perspective on utterances: they are objects associated with a speaker’s utterance that tie that utterance to its intended effect. If I ask you to close the door, the intent of my utterance (“close the door”) is a request of some sort.
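One way to picture the benefit of intents is that they compress the conversation history: instead of remembering every word of every turn, the model can condition on a short sequence of intent labels. The labels and the rule-based mapping below are invented for illustration; a production system would learn this classification rather than use keyword rules.

```python
# Hypothetical illustration of "chunking" a conversation into intents.
# The labels and the rule-based mapping are invented; a real system would
# use a learned classifier rather than keyword rules.

def classify_intent(utterance: str) -> str:
    text = utterance.lower()
    if "fine" in text or "that works" in text:
        return "CONFIRM"
    if "appointment" in text or "book" in text:
        return "REQUEST_BOOKING"
    if "what time" in text:
        return "ASK_AVAILABILITY"
    if "4" in text or "four" in text:
        return "PROPOSE_TIME"
    return "OTHER"


turns = [
    "Hi, I'd like to book a hair appointment for Tuesday.",
    "Sure, what time were you thinking?",
    "Do you have anything around 4?",
    "4 is fine, we can do 4 pm.",
]

# Dozens of words collapse into a four-symbol history for the model to track.
print([classify_intent(t) for t in turns])
# ['REQUEST_BOOKING', 'ASK_AVAILABILITY', 'PROPOSE_TIME', 'CONFIRM']
```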

In simple tasks such as booking a hair appointment, there are only so many loops of conversation we might go through. There just aren’t the long divergences that occur in “open” domains (where the topic might meander and conversations nest for dozens of turns). This has two effects on Duplex’s RNN: (1) it mitigates the vanishing gradient problem, as mentioned, and (2) it increases the sample size for particular conversational paths in Duplex’s training data. When a domain is closed, conversations are pigeonholed: the same sorts of conversations occur over and over, building up a stronger dataset for harder-to-reach features such as natural timing, industry or trade slang, and so on.

The future of Duplex

It’s still unclear which direction Google will choose for Duplex’s further development. Our best guess would be perfecting its current performance, making sure Duplex identifies itself as an AI to its interlocutor, and potentially adding new domains. Businesses could use a similar technology to handle phone calls and answer frequently asked questions. Polishing the interactions between Duplex and another AI assistant representing a business, or developing a “physical” assistant system that can handle a certain set of conversational tasks, would be interesting experiments as well.

Regardless, Duplex in its current state represents a significant milestone for AI, machine learning, and applied speech act theory. It is sometimes overlooked that neural networks are not a new idea; RNNs in particular already existed in the 1990s. In Duplex’s case, the real breakthrough is the integration of RNNs with concepts from speech act theory: it is the capacity for thinking in terms of conversational intents that ultimately allows the RNN to punch above its weight.

Last but not least, Duplex can make a variety of services friendlier and more accessible to elderly and disabled customers. It can make life easier especially for people who have hearing or speech impairments. Businesses can surely find value in a technology like Duplex that lowers the barrier to interacting with customers of diverse backgrounds.

References:
https://ai.googleblog.com/2018/05/duplex-ai-system-for-natural-conversation.html
http://karpathy.github.io/2015/05/21/rnn-effectiveness/