Get Started with Alexa
After completing this unit, you’ll be able to:
- Explain why voice matters.
- Name the core components that make Alexa work.
- Describe the capabilities of Alexa and Alexa-enabled devices.
To set the stage for this module, let’s start by talking about the power of voice. At Amazon, we believe voice represents the next major disruption in computing.
Yet if you look at how we typically interact with computers and technology, we use only our hands and eyes.
Voice interfaces are the next progression in a series of ever-adapting user interfaces that we use every day. In the early days of computing, there was the venerable punch card, a limited character interface. The next step up was text-only, command-line interfaces. Following the introduction of the mouse came a progression of graphical user interfaces (GUIs) with increasingly advanced layouts in modern operating systems. In the 90s, the explosion of the Internet and web pages made web design the new frontier. Enter the smartphone in the early 2000s with a new touch-based interface. And now, with advancements in neural networks, natural language processing, and speech recognition, we have voice user interfaces (VUIs).
VUIs have also evolved over time. The days of “Press 1 for the front desk, press 2 for housekeeping, press 3 for reservations” are slowly giving way to a more conversational flow that feels more natural and lets users be more succinct and accurate in their requests. This evolution is referred to as the conversational user interface.
Let’s examine a common interaction with Alexa. If you don’t know who Alexa is:
Alexa is the brain behind the Amazon Echo family of devices and other Alexa-enabled devices. Using Alexa is as simple as asking a question—just ask, and Alexa responds instantly. Alexa lives in the cloud and is always getting smarter.
Getting back to that conversation, it can look something like this:
A typical user: “Alexa, do I need an umbrella today?”
Alexa: “It might rain in Seattle today. There’s a 55% chance. You can expect about 0.14 inches.”
A simple question, but many different things needed to happen to get that response. And yes, it does rain frequently in Seattle.
This diagram shows the high-level, end-to-end flow of what happens when Alexa hears and responds to a question.
Let’s dive into the details.
When you say the wake word (in this case, Alexa), the light ring around the Echo begins to glow blue to indicate that Alexa is now listening and streaming that audio to the cloud. The captured audio is called an utterance. Note: You can also change the wake word to one of a few other options: Echo, Computer, or Amazon.
Once the utterance has been received in the cloud, a series of speech models are applied to it using automatic speech recognition (ASR) and natural language understanding (NLU) to figure out what you wanted and where to route the request. In the previous example, Alexa figured out that this was an intent to check the weather. Intents are registered by a skill that can handle them, and the skill provides a number of sample utterances to help Alexa map incoming requests to the right place.
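To make the intent-and-sample-utterance idea concrete, here is a sketch of an excerpt from a skill’s interaction model, expressed as a Python dictionary. The `CheckWeatherIntent` name, the invocation name, and the sample utterances are all illustrative; they are not part of any real published skill.

```python
# Hypothetical excerpt of a skill's interaction model. The skill
# developer declares each intent along with sample utterances, and
# Alexa's NLU uses the samples to route spoken requests to the intent.
interaction_model = {
    "interactionModel": {
        "languageModel": {
            "invocationName": "weather helper",
            "intents": [
                {
                    "name": "CheckWeatherIntent",
                    "slots": [],
                    "samples": [
                        "do i need an umbrella today",
                        "is it going to rain",
                        "what is the weather like",
                    ],
                },
            ],
        }
    }
}

intent_names = [
    intent["name"]
    for intent in interaction_model["interactionModel"]["languageModel"]["intents"]
]
```

A spoken request such as “do I need an umbrella today” would then be resolved to `CheckWeatherIntent` and forwarded to the skill that registered it.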
Skills are built using the Alexa Skills Kit, a collection of self-service APIs, tools, documentation, and code samples that makes it fast and easy for anyone to build for voice. In this case, let’s assume there is an AWS Lambda function that calls a weather service when it receives an incoming intent from Alexa.
Skills can be built using many different options, such as AWS Lambda, Heroku, or a custom web service that communicates over HTTPS. As long as the skill handles the incoming Alexa request securely, it doesn’t matter where it is hosted or what language it is written in.
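A minimal sketch of what such a Lambda handler might look like, assuming the standard Alexa request/response JSON envelope. The intent name matches the hypothetical `CheckWeatherIntent` above, and the hardcoded forecast stands in for a real weather-service call:

```python
def lambda_handler(event, context):
    """Hypothetical AWS Lambda entry point for a weather skill.

    `event` is the JSON request Alexa sends to the skill. The response
    envelope follows the standard Alexa response format; the intent
    name and canned forecast text are illustrative only.
    """
    intent_name = event["request"]["intent"]["name"]
    if intent_name == "CheckWeatherIntent":
        # A real skill would call out to a weather service here.
        speech = ("It might rain in Seattle today. There's a 55% chance. "
                  "You can expect about 0.14 inches.")
    else:
        speech = "Sorry, I'm not sure how to help with that."

    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech},
            "shouldEndSession": True,
        },
    }


# Simulate the request Alexa would send for our sample question.
event = {"request": {"type": "IntentRequest",
                     "intent": {"name": "CheckWeatherIntent"}}}
result = lambda_handler(event, None)
```

The `outputSpeech` block is what the Alexa service turns back into spoken audio on the device.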
The skill is then responsible for returning a response to Alexa. The response can contain text formatted to be spoken a certain way by Alexa, or it can even contain your own prerecorded audio files. Fun fact: if you ever wanted Alexa to say “Bazinga” as part of your response, you can do it using what are called speechcons.
The response can be more than just a voice response. The skill can also indicate that a card should be returned to the user. Cards can contain additional context via text plus an image that help supplement the voice response. The card is then accessible via the Amazon Alexa App, which is available on Fire OS, Android, iOS, and desktop web browsers. With the introduction of the Echo Show, there are even more advanced cards called display templates that can be returned to the user. Display templates provide more flexibility by supporting full-width images, text overlays, lists of images and text, and more.
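In the response JSON, a card is just another field alongside `outputSpeech`: a "Simple" card carries a title and text, while a "Standard" card can also include an image for the Alexa app to render. A sketch, with the helper name and weather text being illustrative:

```python
def with_card(response, title, text, image_url=None):
    """Attach a card to an Alexa response dict (illustrative helper).

    A "Simple" card is title + content; a "Standard" card replaces
    content with "text" and may carry small/large image URLs.
    """
    if image_url is None:
        card = {"type": "Simple", "title": title, "content": text}
    else:
        card = {
            "type": "Standard",
            "title": title,
            "text": text,
            "image": {"smallImageUrl": image_url,
                      "largeImageUrl": image_url},
        }
    response["response"]["card"] = card
    return response


response = {
    "version": "1.0",
    "response": {
        "outputSpeech": {"type": "PlainText",
                         "text": "It might rain in Seattle today."},
        "shouldEndSession": True,
    },
}
response = with_card(response, "Today's Weather",
                     "55% chance of rain, about 0.14 inches.")
```

The voice response is spoken immediately, while the card shows up in the Alexa app for the user to review later.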
Once Alexa receives the response from the service, it dispatches the resulting text for the speech output to the Echo and transmits the card content to the user’s devices. The Echo then uses a text-to-speech engine to speak the response in Alexa’s voice.
So what else does Alexa do? What else works with Alexa? We mentioned the Echo earlier, but the voice-enabled device market is growing, and there are many different options out there.
Alexa can make your life easier and more fun by:
- Providing hands-free voice control for music and entertainment—“Alexa, play me some funky music.”
- Keeping an eye on the clock whether you’re cooking in the kitchen or snoozing in the bedroom—“Alexa, set a timer for 20 minutes.”
- Using your voice to manage shopping and to-do lists—“Alexa, add milk to my shopping list.”
- Helping you stay connected to the news that matters most to you—“Alexa, play my flash briefing.”
- Controlling smart-home devices such as lights, switches, thermostats, and more—“Alexa, set the bedroom to 72 degrees.”
- And many more.
Alexa is available on a growing number of devices. In addition to Echo devices, Alexa can also listen to you on Amazon devices like the Amazon Tap and Fire TV, and through the Amazon shopping app on your smartphone. Using the Alexa Voice Service, hardware makers can also give Alexa the ability to converse with their users on any device that has a microphone and a speaker.
Now that we’ve given you an overview of what Alexa is and how it all ties together, let’s take a step back in the next section and think about how to design for voice interactions, and how they differ from other interactions you typically have with software.