Part 1: The Voice Technology Landscape
How many of us have spoken to a machine that is programmed to understand our commands? Whether through a smartphone, a TV or an automated banking phone menu, most of us have tried it at some point. Now, of those interactions, how many were unsuccessful? It's likely that the proportion of poor experiences is fairly high. However, with rapid advances in the AI that powers these systems, this is all about to change. As a UX designer, I wanted to research the sector with the goal of understanding where it has come from, where it is going, and how best to design for voice as its popularity grows. This part covers all of these areas; in Part 2 you can read about my prototyping exercise.
The first piece of functioning voice recognition technology appeared at Bell Laboratories in 1952. The automatic digit recogniser, named 'Audrey', could recognise the digits 0 to 9 when they were spoken by the same person. This nascent technology gradually improved over the decades, with the first commercially successful speech-to-text software, Dragon NaturallySpeaking, appearing in the 90s (this technology would eventually become the foundation for Siri).
As a side note, I had no idea how much important technology Bell Labs produced until researching this piece. They are credited with (amongst many other things) the development of radio astronomy, the transistor, the laser, the Unix operating system and the programming languages C and C++. Work done there has earned nine Nobel Prizes! Fun fact: the decibel, the unit of loudness, was also invented there and is named after the company's founder, Alexander Graham Bell.
Today
Voice technology is relatively expensive in terms of processing power, with a single Siri query using more than 100 times the computational resources of a traditional web search. However, thanks to continuing advances in processing power, bandwidth and neural networks, speech recognition has come a long way since the 50s, and as the technology improves it will take up an increasingly central position in our lives.
Amazon's Echo Dot smart speaker
At the time of writing this post (October 2019), some of the most visible incarnations of voice technology are the 'smart speakers' and their associated AI assistants: Amazon's Alexa with the Echo speakers, and Google Assistant with the Home and Hub speakers. There is also a vast and growing list of other devices these assistants can be integrated with, including plug sockets, light bulbs, thermostats and even carbon monoxide detectors. Apple's Siri and Microsoft's Cortana are the other big players in the AI assistant space, but they do not yet have flagship smart speaker products associated with them and are used more often with smartphones and computers.
So what are we using the tech for at the moment? By far the most common tasks we employ a VUI for are consuming digital content such as news or music, requesting a service such as pizza delivery, or controlling a utility, like asking Alexa to set a timer or switch on the lights in our homes.
Where is the technology going?
There are a lot of ideas out there theorising on where VUI usage is heading. These commonly take the form of predictions such as "50% of all searches will be voice-activated by 2021". They tend to be quite conservative and relate to common current tasks such as browser searches, because the rapid proliferation of the technology makes longer-term predictions difficult. Part of the reason for the rapid change is that there is a positive feedback loop at work: as the tech improves, there is greater and greater uptake from users. This generates more and more data for the neural networks to train on, which in turn improves the technology.
Other predictions for voice include it rendering the remote control obsolete (who enjoys typing in letters one by one with a directional pad on a remote anyway?) and the increasing presence of 'multimodal' systems that have multiple inputs and outputs. A device like Amazon's Echo Show offers different outputs in the form of sound and screen, so the designs of the future will need to take this combination into account and allow the different outputs to reinforce one another in the user experience.
Before I move on, I wanted to remind you of the many examples of 'strong' AI found in fiction, from HAL 9000, through Knight Rider's KITT, to Samantha in the Spike Jonze film Her. As a species we are still some way from achieving this level of AI (and opinion is strongly divided on whether it is even a good thing to develop), so when designing VUIs today we have to settle for creating 'smart fakes' that appear to function in this sophisticated manner but can actually only deal with a limited number of user responses. What those responses might be, therefore, needs to be carefully considered beforehand.
Designing for voice
VUI design involves many considerations, and understandably a lot of them differ from visual UI design. There is, however, some crossover, and the trusty 10 Heuristics for User Interface Design by Jakob Nielsen is a good place to begin. Nielsen pioneered these principles in the 90s, and several of them are relevant to VUI design and prompt useful questions. For example, 'visibility of system status' highlights a common problem: discoverability. If there is no visual indicator of the system's status, how do we go about conveying this information to the user? Furthermore, if we step back and think of discoverability more generally, how do we convey what actions are possible for the user at any given moment?
Heuristic #7 (flexibility and efficiency of use) is another principle from Nielsen's work that aids VUI design. It's clear that the system needs to cater for the beginner, but it also needs to adapt as the user becomes more familiar with it, allowing them to achieve their goals in the most frictionless way possible. These are just a couple of examples from my research, but it is a useful exercise to revisit the 10 Heuristics and take your time considering which are relevant to the problem you are trying to solve.
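To make heuristic #7 concrete, here is a minimal sketch of how a VUI might shorten its prompts as a user gains experience. This is illustrative Python with invented names and thresholds, not code from any real voice platform:

```python
# Hypothetical sketch: adapt prompt verbosity to user familiarity.
# The threshold and wording are illustrative only.

def timer_prompt(times_used: int) -> str:
    """Return a verbose prompt for beginners, a terse one for regulars."""
    if times_used < 3:
        # Beginner: explain what the system can do and how to answer.
        return ("I can set a timer for you. "
                "How many minutes would you like? For example, 'ten minutes'.")
    # Experienced user: minimal friction, straight to the point.
    return "For how long?"

print(timer_prompt(0))   # verbose onboarding prompt
print(timer_prompt(10))  # terse prompt for a familiar user
```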
Another field we can look to for useful wisdom when designing for voice is linguistics. The Cooperative Principle was developed by Paul Grice in the 70s to describe how people achieve effective conversational communication. More fully, the Cooperative Principle (taken from Wikipedia):
“…describes how people achieve effective conversational communication in common social situations—that is, how listeners and speakers act cooperatively and mutually accept one another to be understood in a particular way.”
The Cooperative Principle is divided into four maxims, or general rules. You can read about them in full here, but for the purposes of designing for voice there are two which are particularly relevant (taken from Wikipedia):
“Maxim of quantity
1. Make your contribution [to an interaction] as informative as is required (for the current purposes of the exchange).
2. Do not make your contribution more informative than is required.
Maxim of manner
1. Avoid obscurity of expression.
2. Avoid ambiguity.
3. Be brief.
4. Be orderly.”
The maxims can be thought of as the presumptions we make when hearing or uttering a phrase, together with the implications inherent in it. For example, if you asked your friend "I need to buy something, is there a shop nearby?" and they simply answered "yes" and nothing else, they would be failing to follow the maxim of quantity, because a desire to know the location of the shop is implicated in the question. Furthermore, a response stating the location of the shop that doesn't mention that the shop happens to be closed at the moment fails to follow the maxim of relevance (another of the four maxims), as such a response carries the implicature that the shop is currently open.
Thinking about the maxims of quantity and manner will help you design a VUI that provides just enough information to the user at a particular moment during their journey, as well as conveying this information in a clear and concise way.
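To illustrate (a toy sketch, not production dialogue logic; the shop details are invented), a response generator that respects the maxims would fold the implicated information, the shop's location and opening status, into one brief, unambiguous reply:

```python
# Toy illustration of the maxims of quantity and manner.
# The shop details are invented for the example.

def shop_reply(name: str, distance_m: int, is_open: bool) -> str:
    """Answer 'is there a shop nearby?' with just enough information."""
    status = "open now" if is_open else "closed at the moment"
    # Quantity: include the location AND opening status that the
    # question implicates. Manner: one brief, unambiguous sentence.
    return f"Yes, {name} is {distance_m} metres away and it's {status}."

print(shop_reply("the corner shop", 200, is_open=False))
# -> Yes, the corner shop is 200 metres away and it's closed at the moment.
```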
There are a few more considerations I will cover here before moving on to some practical ways of getting started with the actual design. Firstly, it is important to leverage as much of the user's context as possible. This includes things like the different locations the user might be in when interacting with the interface, previous habits around content consumption, or even biometric information (if available from something like a Fitbit). The more we can piece together about the user before an interaction, the better we can tailor the experience to their needs. Furthermore, changes in context could be an opportunity to design better and more fulfilling experiences - could voice be used to create a seamless transition from one context to another?
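As a sketch of what leveraging context might look like in practice (the context fields and rules here are assumptions made for illustration, not a real assistant API):

```python
# Illustrative only: tailor a VUI's opening line using known context.
from dataclasses import dataclass
from typing import Optional

@dataclass
class UserContext:
    location: Optional[str] = None      # e.g. "kitchen", "car"
    last_played: Optional[str] = None   # previous content habit
    heart_rate: Optional[int] = None    # biometrics from a wearable, if shared

def opening_line(ctx: UserContext) -> str:
    if ctx.location == "car" and ctx.last_played:
        # Hands and eyes are busy: skip the questions, resume directly.
        return f"Resuming {ctx.last_played}."
    if ctx.last_played:
        return f"Welcome back. Pick up {ctx.last_played} where you left off?"
    return "Hello! What would you like to do?"

print(opening_line(UserContext(location="car", last_played="your podcast")))
```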
Secondly, it is important to consider how the VUI conveys the personality or the brand of the product. Should it be warm and informal, like the voice and tone of Netflix, or should it take a more clinical, serious tone, such as you would expect from a health app? Matching the VUI's personality to the brand will reinforce it.
There are many other areas to ponder but there isn’t time to cover all of them here, so let’s move on from theoretical considerations to how we would actually approach designing for voice.
The first thing to consider is whether or not voice can really add value to whatever you are thinking of applying it to. Does voice streamline a task that usually takes many steps on a smartphone or computer? For example, to set a timer with an iPhone you need to find the phone, pick it up, press the home button, swipe up for the control centre, and so on. Asking whatever smart speaker is in the vicinity to start a timer for 10 minutes is much more direct. Also, is it useful for the task to be achieved hands-free? It's not a stretch to imagine the previous example being employed while cooking, so keeping hands free for food preparation is a big plus. This is also a jumping-off point for considering how to design for those who are unable to use their hands due to a disability. Accessibility is a huge and fascinating topic that I won't go into here, but there is surely a wealth of exciting opportunities to improve people's quality of life.
Next we need a script, or at least an idea of how the conversation with the user will progress and the responses we think the VUI will need. This can be created by simply talking through variations of how users might ask for the task to be achieved, and the different responses the VUI could give at any one juncture of the conversation.
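One lightweight way to capture such a script (purely illustrative; this isn't a standard format) is a branching structure that maps the user intents we expect to hear to the VUI's next line:

```python
# A toy conversation script for a pizza-ordering flow. Keys are states,
# "say" is the VUI's line, and "branches" maps expected user intents to
# the next state. All names and wording are invented for the example.

script = {
    "start": {
        "say": "Hi! What would you like to order?",
        "branches": {"order_pizza": "size", "unknown": "fallback"},
    },
    "size": {
        "say": "Great. Small, medium or large?",
        "branches": {"gives_size": "confirm", "unknown": "fallback"},
    },
    "confirm": {
        "say": "Got it. Shall I place the order?",
        "branches": {"yes": "done", "no": "start"},
    },
    "fallback": {
        "say": "Sorry, I didn't catch that. You can order a pizza.",
        "branches": {"order_pizza": "size", "unknown": "fallback"},
    },
    "done": {"say": "Order placed. Enjoy!", "branches": {}},
}

# Walk one happy path through the script:
for state in ("start", "size", "confirm", "done"):
    print(script[state]["say"])
```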
Once this has been drawn up, we need a quick and dirty way to prototype and test the script, ironing out what we want it to say before committing to any expensive development. Wizard of Oz testing is one such method. The title refers to the scene in the 1939 film in which Dorothy believes she is talking to the 'great and powerful Oz' - until Toto the dog reveals the ordinary human being sitting behind the curtain pulling levers. The idea is that the test participant believes they are speaking to an actual VUI, but in reality a human is triggering the responses the VUI makes, so it is akin to a hi-tech ventriloquist's dummy! Designer Ben Sauer is an experienced practitioner in the field and has developed a simple application for running Wizard of Oz tests, which you can find here. His app allows you to load a text file containing your script and assign each line to a different letter of the keyboard. The result is that when you press a letter it sends that phrase to the computer's text-to-speech software, allowing you to pretend to be the VUI!
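The core mechanic is simple enough to sketch yourself. Below is a minimal stand-in for that workflow (to be clear, this is not Ben Sauer's app) using the pyttsx3 text-to-speech library; the key-to-phrase mapping is invented, and it reads a letter plus Enter rather than raw keypresses to keep it cross-platform:

```python
# Minimal Wizard of Oz rig: type a letter (then Enter) and the computer
# speaks the mapped phrase, letting the facilitator "be" the VUI.
# Requires: pip install pyttsx3. The phrases are illustrative.

import pyttsx3

PHRASES = {
    "a": "Hi! What would you like to order?",
    "b": "Great. Small, medium or large?",
    "c": "Got it. Shall I place the order?",
    "d": "Sorry, I didn't catch that.",
}

def main() -> None:
    engine = pyttsx3.init()
    print("Mapped keys:")
    for key, phrase in PHRASES.items():
        print(f"  {key} -> {phrase}")
    while True:
        key = input("key (q to quit): ").strip().lower()
        if key == "q":
            break
        phrase = PHRASES.get(key)
        if phrase:
            engine.say(phrase)
            engine.runAndWait()  # block until the phrase has been spoken

if __name__ == "__main__":
    main()
```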
The benefit of this method is that it allows you to rapidly test out a script, gathering all the weird and wonderful responses that humans give. You can then use your insights to refine the script, which will make it more robust and prepared to handle a wider range of responses.
After undertaking this research I presented it to the Product Design team at Ostmodern and undertook a whiteboard exercise to gather ideas for a proof of concept/prototyping exercise. You can read more about that in Part 2.