Speech Recognition and Detecting User Intent for Voice Ordering

Voice as an interface to a computing system is rapidly becoming an everyday experience. Amazon has sold over 5 million Echo devices, and one estimate projects that 24 million Amazon Echo and Google Home devices will be sold in 2017. Of course, there’s also Siri, and Google Assistant was featured prominently at the recent Google I/O.

With all of this activity, many companies are wondering how they can participate in the new voice user interface market. One place we see a good fit for many brands is adding the ability to place orders by voice in mobile apps. Voice is a compelling interface on a mobile device for users who are on the go or who are ordering an item with many different options. It may also simply fit users’ preferred search behavior as they get more comfortable with voice-based tools.

As we’ve explored voice ordering interfaces for our clients, we’ve learned some lessons and identified some options that we want to share.

We have a few problems to solve to take a user’s voice input and match it to items available for purchase. First, we must turn the voice input into text. Next, we must map that text to user intent. Finally, we must match candidate pieces of text to items on the menu.

Let’s tackle those items in order:

Speech-to-Text

Apple introduced a Speech Recognition API (SFSpeechRecognizer) in iOS 10. It allows a mobile app to record a user’s voice and send the audio to Apple’s speech recognition engine. Apple runs its speech recognition algorithms on the input and periodically returns the recognized text to the app. This gives us high-quality speech-to-text, and from there we can start trying to derive user intent.

User Intent – Part 1

Our voice ordering prototype assumes that the user is ordering items from a menu. It is not a general chatbot (we will briefly discuss technologies that allow general chatbot-like functionality later).

We use Apple’s linguistic tagger (NSLinguisticTagger) to tag each word of the recognized text with its part of speech. We use these tags to decide whether each part of the utterance belongs to the current item and its options or begins a new item. The output is a list of phrases, each of which might be an item on the menu. Finally, we must decide which menu item each phrase refers to.
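As a rough illustration of this splitting step, here is a minimal Python sketch. It is not the prototype’s actual logic: the prototype runs on iOS and gets its tags from NSLinguisticTagger, whereas here the tags are assigned by hand and the splitting rule (a determiner or number starts a new item; only nouns and adjectives are kept) is an assumption made for the example. The function name split_into_candidates is hypothetical.

```python
# Hand-tagged tokens stand in for the tagger's output (an assumption for
# this sketch; real tags would come from the part-of-speech tagger).
KEEP = {"Noun", "Adjective"}            # words that may name an item or option
STARTS_ITEM = {"Determiner", "Number"}  # assumed signal that a new item begins


def split_into_candidates(tagged):
    """Group (word, tag) pairs into phrases that might each be a menu item."""
    candidates = []
    current = []
    for word, tag in tagged:
        # A determiner or number after collected words closes the current phrase.
        if tag in STARTS_ITEM and current:
            candidates.append(" ".join(current))
            current = []
        if tag in KEEP:
            current.append(word.lower())
    if current:
        candidates.append(" ".join(current))
    return candidates


utterance = [
    ("I", "Pronoun"), ("want", "Verb"),
    ("a", "Determiner"), ("burger", "Noun"),
    ("with", "Preposition"), ("cheese", "Noun"),
    ("and", "Conjunction"), ("bacon", "Noun"),
    ("and", "Conjunction"),
    ("a", "Determiner"), ("chocolate", "Adjective"), ("milkshake", "Noun"),
]
print(split_into_candidates(utterance))
```

With this rule, “and bacon” stays attached to the burger because no determiner precedes it, while “and a chocolate milkshake” begins a new phrase. A real implementation would also need to keep quantities, which this sketch discards.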

User Intent – Part 2

Speech-to-text is not perfect, and a user might order the same item in many different ways. A simple exact match won’t work, so we use fuzzy matching to find the item the user intends to order.

To do this, we compute a score for each menu item based on how many words in the phrase match the item and its options. To determine whether a word matches part of an item or one of its options, we calculate the Levenshtein distance (the number of single-character insertions, deletions, and substitutions needed to turn one string into the other) between that word and all possible items on the menu. This allows the user to say “burger” and have it match “hamburger,” for instance.
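The scoring idea can be sketched in a few lines of Python. The Levenshtein function below is the standard dynamic-programming algorithm; everything else is an assumption made for illustration, not the prototype’s actual code: the sample MENU, the 0.4 length-normalized threshold, the weights (2 for an item match, 1 per option match), and the helper names is_close, score_item, and best_match.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Hypothetical menu: item name -> available options.
MENU = {
    "hamburger": ["cheese", "bacon", "pickles"],
    "milkshake": ["chocolate", "vanilla"],
}


def is_close(spoken, target, max_ratio=0.4):
    """Assumed rule: match when distance is at most 40% of the longer word."""
    longest = max(len(spoken), len(target), 1)
    return levenshtein(spoken.lower(), target.lower()) / longest <= max_ratio


def score_item(phrase, item, options):
    """Assumed weights: 2 points for the item name, 1 per matched option."""
    words = phrase.split()
    score = 2 if any(is_close(w, item) for w in words) else 0
    score += sum(1 for opt in options if any(is_close(w, opt) for w in words))
    return score


def best_match(phrase, menu=MENU):
    """Return the highest-scoring menu item, or None if nothing matched."""
    score, item = max((score_item(phrase, name, opts), name)
                      for name, opts in menu.items())
    return item if score > 0 else None


print(levenshtein("burger", "hamburger"))            # 3
print(best_match("a burger with cheese and bacon"))  # hamburger
```

Normalizing the distance by word length is one way to keep short words such as “a” or “and” from matching by accident; a fixed cutoff would need more care.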

The Result

After all this, we end up with a list of items the user appears to be ordering, and we present those items to the user for confirmation. This prototype performs very well in casual usage and testing. However, there is room for improvement. It is not a general chatbot that a user could ask for help, nutritional information, or directions to the nearest location.

Staying within this order-focused flow, we could improve the menu data to reduce false positives. We could provide a mapping of the different ways users typically refer to the same item and apply it to improve matching. For instance, instead of relying on the Levenshtein distance to determine that “hamburger” and “burger” are the same, we could list “burger” as an alternative name for “hamburger.”
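One way such a mapping might look is a simple lookup table applied to the phrase before any matching happens. The sketch below is hypothetical: the ALIASES entries and the normalize function are made up for illustration and do not come from the prototype or from real menu data.

```python
# Hypothetical alias table: what a user might say -> the menu's own wording.
ALIASES = {
    "burger": "hamburger",
    "fries": "french fries",
    "shake": "milkshake",
}


def normalize(phrase, aliases=ALIASES):
    """Replace each known alias with its canonical menu name."""
    return " ".join(aliases.get(word, word) for word in phrase.lower().split())


print(normalize("A burger and fries"))  # a hamburger and french fries
```

After normalization, many phrases could be matched exactly, leaving fuzzy matching as a fallback for speech recognition errors and wordings the table does not cover.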

To expand this to a general chatbot, we would need to tap into some more advanced technology. Amazon has recently moved Amazon Lex from beta to general availability. Amazon Lex is built on the same technology that powers Alexa, the voice assistant used in the Amazon Echo. Google and Apple will likely open similar services to compete with Amazon Lex in the near future. With Lex, we can build an Echo-like chatbot inside our mobile app, but that is a topic for a different blog post.

The world of voice-enabled technology is going to take off over the next few years. The technology is ready (and increasingly capable) and users are starting to adopt this new interaction pattern. As with all new tech, the key to success will be figuring out whether this new technology actually makes sense for your business.