Speech Recognition

Do you know how Alexa or other personal assistants work?

Well, don’t you worry! You’ll learn that in this lesson.

Topic Covered in the Lesson

At the end of the lesson, you will be able to:

Understand how speech recognition works.
Use AI blocks in PictoBlox to convert speech into text.
Make your own virtual assistant in PictoBlox that can recognize your command and play the requested song.

Key Learning Outcomes

Speech Recognition
How does Alexa work?
Speech Recognition in PictoBlox
Text to Speech in PictoBlox

Let’s begin!

How Do Humans Learn a Language?

From the time we are born, we hear words and sounds around us. Even before we can speak, we start responding to some of the words we hear, such as Mama, Dada, Yes, and No.

Our brain tries to find patterns to differentiate various sounds and words and categorize them. It may seem as though humans are pre-programmed to listen and understand, but that is not the case. We are trained to develop this ability.

Speech recognition technology has been developed along the same lines: computers are trained in much the same way.

Speech Recognition

Speech recognition is the ability of a machine to identify words and phrases in spoken language and convert them to a machine-readable format.

Speech recognition is very complex and a lot of mathematical equations are involved. Let’s break it down into simple steps:

First, the machine records the audio.
Then, it breaks down the audio to extract consonants and vowels (the building blocks of words). After this process, we get a list of consonants and vowels.
Using the word database of the language, the machine tries to identify words from the list and then forms sentences, thus converting the speech into text.
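
The last step above can be pictured with a small Python sketch. It is only a toy model: the sound symbols, the three-word database, and the function name sounds_to_text are made up for illustration, and real speech recognizers rely on statistical models rather than exact matching.

```python
# Toy word database: each word maps to the list of sounds that make it up.
# The words and sound symbols are invented for this illustration.
WORD_DATABASE = {
    "play": ["p", "l", "ay"],
    "the": ["dh", "ah"],
    "song": ["s", "ao", "ng"],
}


def sounds_to_text(sounds):
    """Walk through the list of sounds and match them against the database."""
    words = []
    position = 0
    while position < len(sounds):
        for word, pronunciation in WORD_DATABASE.items():
            end = position + len(pronunciation)
            if sounds[position:end] == pronunciation:
                words.append(word)
                position = end
                break
        else:
            # No known word starts with this sound, so skip it.
            position += 1
    return " ".join(words)


print(sounds_to_text(["p", "l", "ay", "dh", "ah", "s", "ao", "ng"]))
```

With these sample sounds, the function joins the matched words into the sentence "play the song".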

How Alexa Works

Alexa, Amazon’s virtual assistant AI technology, uses natural language processing, a procedure of converting speech into sounds, words, and ideas.

Here’s how it works:

Alexa first records your speech. Then, this recording is sent to Amazon’s servers to be analyzed more efficiently.
Amazon breaks down the recording into individual sounds. It then consults a database containing various words’ pronunciations to find which words most closely correspond to the combination of individual sounds.
It then identifies keywords to make sense of the task and carry out the corresponding function. For example, if Alexa notices words like “weather” or “temperature”, it will open the weather app.
Amazon’s servers send the information back to your device. If Alexa needs to say anything back to you, it will go through the same process described above, but in reverse order.
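
The keyword step can be sketched in Python as follows. This is a simplified illustration, not Amazon's actual method: the skill names, the keyword lists, and the function pick_skill are all hypothetical.

```python
# Hypothetical keyword lists; a real assistant uses far richer language models.
SKILL_KEYWORDS = {
    "weather": ["weather", "temperature"],
    "music": ["play", "song"],
}


def pick_skill(command_text):
    """Return the first skill whose keywords appear in the command."""
    text = command_text.lower()
    for skill, keywords in SKILL_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return skill
    return "unknown"


print(pick_skill("What is the temperature outside?"))
print(pick_skill("Tell me a joke"))
```

The first command contains “temperature”, so it is routed to the weather skill. The second contains none of the keywords, so it is reported as unknown.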

Speech-to-Text Blocks in PictoBlox

The Artificial Intelligence extension in PictoBlox has blocks dedicated to speech recognition. Let’s first add the extension to our project:

Create a new project in PictoBlox.
Select evive as your board from the Board tab in the menu bar.
Next, click on the Add Extension button and add the Artificial Intelligence extension.

Speech Recognition Block

To execute speech recognition, we have the recognize speech for () s in () block.

When the block is executed, the recognition window will open and you will get a specified time during which PictoBlox will record whatever you say. Once recorded, the speech will be converted to the text of the language you spoke in and saved locally.

Speech Result Block

To get the result, we have the speech recognition result block. It reports the last text detected from the speech.

Here is a simple example of how to use the speech recognition blocks:

Activity: Make Your Own Alexa

In this project, we will make our own personal assistant like Alexa.

We will be making a script that will recognize our voice command and analyze it to play the Mario theme song or the Spider-Man theme song. If the command is not recognized, it will say that it didn’t understand the command.

Let’s start.

Setting Up the Stage

Download the songs from here:

Once you open the link:

Follow the steps to set up the Stage:

Switch to the Sounds tab and select Upload Sound from the bottom left corner.
Select the two sounds downloaded, and open the sounds.
Delete the Grunt sound from the library.
Switch to the Code tab.

In this project, we will also use the Text to Speech extension to respond to the user. To add it, click the Add Extension button and add the Text to Speech extension.

Speech Recognition

Add a when flag clicked block into the scripting area.
Snap a recognize speech for () s in () block below the when flag clicked block. Change the time to 4 seconds.
Now, snap an if () else block below the recognize speech for () s in () block.
In the condition of the if () else block, add a () contains ()? block from the Operators palette. In the first argument, add a speech recognition result block, and in the second, write “mario”. So, if the decoded text contains the word Mario, the blocks under the if arm will be executed.
Add a speak () block from the Text to Speech palette under the if arm and write the message “Playing Mario Song!”.
Next, snap a play sound () until done block below the speak () block and select Mario. This is how the script looks:
Duplicate the if () else block and snap it under the else arm.
Change “mario” to “spiderman” in the condition of the if arm.
Change the message in the speak block to “Playing Spiderman Song!”.
Change the sound to Spiderman.
Finally, under the else arm of the duplicated block, add a speak () block and write “Sorry, I am unable to understand the command”.
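
For readers who also know text-based programming, the decision logic of the finished script can be sketched in Python. This is not PictoBlox code: the function name respond_to_command is hypothetical, and it only returns the message to speak and the sound to play instead of actually recording or playing anything.

```python
def respond_to_command(recognized_text):
    """Return (message to speak, sound to play) for a recognized command."""
    # Lowercase the text so "Mario" and "mario" are treated the same
    # (an assumption made for this sketch).
    text = recognized_text.lower()
    if "mario" in text:
        return ("Playing Mario Song!", "Mario")
    else:
        # This nested if/else mirrors the duplicated block under the else arm.
        if "spiderman" in text:
            return ("Playing Spiderman Song!", "Spiderman")
        else:
            return ("Sorry, I am unable to understand the command", None)


message, sound = respond_to_command("Please play the Mario song")
print(message, sound)
```

Trying the function with different sentences is a quick way to check which arm of the script each command would reach.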