Industry Hears the Multi-Modal Call at First Voice Live From CES

I’ve been attending (and speaking at) the first-ever Voice Live From CES conference this week, and I’ve come away energized about how much of an impact voice will have this year and in the next couple of years. Being with this group of innovators and pioneers felt a lot like attending the early iOS developer conferences a decade ago, when the potential of mobile apps was becoming clear. Voice is poised to arrive in the same big way.

The most important development at the event wasn’t all the voice-based products at Google’s huge booth, or its battle with Amazon. What really mattered is that the industry as a whole is heading toward multi-modal voice interaction. The hub of voice interaction has moved from smart speakers to mobile devices, so we’re no longer limited to hearing the information we need from a speaker; we can also see it on a screen.

This is something I’ve been saying for some time, and I repeated it in my keynote speech at CES: The real breakthrough for voice is when we can speak to machines and get a visual response. Now Google has also said that screen integration is the number one voice trend it sees for 2020, and it was a theme you heard from just about everyone else here.

The Microphone Icon Will be Everywhere

In fact, multi-modal voice interaction is already real. Think about dictating long text messages rather than thumb-typing them; that’s a simple example of voice input with visual output. Have you used Waze for navigation? Then you’ve likely tapped the microphone icon to tell it where you want to go, and then gotten visual directions. Using the microphone button on Apple TV to search is a lot faster than typing out the name of a movie or show you want to find. Those microphone icons will be appearing everywhere in the next year.

There are two main reasons the industry has to go in this direction, both crucial to the user experience: simplicity (saving time) and trust (ensuring privacy and security).

Saving Time, Making Choices

For simplicity, take something as commonplace as ordering a pizza online. It’s much faster to tell an app that you want a large pepperoni pizza with extra cheese and olives but no onions than to swipe and tap through all those options on your mobile screen. We’ve already built a system like this for one of our customers.

Or consider an example Google itself showed, where a user asks a smart speaker what movies are playing that night. Once the system has determined where the individual is located and when “night” begins, it can grab information about titles and times from local theaters. But the last thing you want to do then is have the speaker recite a litany of options one at a time (think about how enjoyable it is to use older automated phone systems that read you a long menu of options). Wouldn’t you rather see your choices displayed in a well-designed visual layout and then be able to use your voice to buy tickets?

Visual Answers Build Trust

Simplicity is obvious; trust is less obvious but just as important.

Banking apps turn out to be the third most commonly used category on mobile devices (after social media and weather), and something like half of all interactions are people checking their balances. While you might not have any problem with someone else in the room hearing you ask for your balance, chances are you would have a problem with them hearing how much you have in the bank, or what you owe on your credit card. You won’t trust a system that doesn’t deliver that response visually, on your mobile screen. This is why we see voice-first banking as the future for that industry.

The same is true for healthcare, another industry where voice has a significant role to play. Some hospitals and clinics are putting smart speakers in waiting areas for inpatients or the emergency department, offering educational information about different tests or procedures. It’s a way not only to answer basic questions but also to alleviate some anxiety. But as my fellow panelist Sandhya Pruthi from the Mayo Clinic noted, when you design those systems you have to be very careful about how they respond. You also can’t have systems that read a patient’s diagnosis or other information aloud; privacy (and privacy laws) demand that the information be pushed to the patient’s mobile device screen.

Hopping Across Ecosystems

Where we want to go is where voice is entirely integrated across what today are competing ecosystems – Google, Amazon, Apple – and where most of the voice assistant functionality is embedded in the app. It will be the app’s job to determine what you want based on everything from context to tone of voice – and, over time, know who is speaking – and then respond appropriately.

Those aren’t trivial problems to solve. In particular, there are a lot of difficult issues related to authentication when hopping the boundaries between ecosystems. You need to make the handoff secure but also seamless, without forcing the user to intervene. Difficult as these problems are, solving them is vital to the success of almost any app in the future, so that is where companies should be investing in voice app development.

I’m confident after being at CES that we’re at the tipping point for voice, and that we’ll look back in a few years and wonder how we managed to get through the day without it being an integral part of the user experience.
