OK Google – Hey Siri – Sunnie

Tram Ho

Introduction

Hello everyone, today I will introduce the trigger word (also called wake word) problem.

So what is a trigger word? A trigger word is a spoken phrase that activates a voice-controlled device, such as “Ok Google” or “Hey Google” for Google’s virtual assistant, or “Alexa” for Amazon’s Alexa virtual assistant. Here is a demo of the trigger word model I built myself:

OK, let’s find out what a trigger word is!

Why use a trigger word:

  • A trigger word works like a name: it tells the device that the user is addressing it, not another person or device
  • It avoids continuously streaming voice samples to the server, which limits the number of requests
  • It partially protects user privacy

Requirements of a trigger word module:

  • Lightweight, able to run on low-end hardware
  • The trigger phrase should be catchy and easy to say
  • Distinctive, so that it rarely occurs in everyday speech
  • High accuracy across male and female voices, different ages, and different regional accents

Based on the requirements above, we treat this as a binary classification problem with two classes: positive (the trigger word was spoken) and negative (anything else).

Choosing a method

Option 1: Use the “Snowboy hotword detection” API, see https://snowboy.kitt.ai/

Pros: the API comes with ready-made tools, so my job is only to upload data and download the model.

Cons: to download a “good” model, at least 500 users must contribute data for an English keyword, and at least 2000 users for a keyword in another language.

I could still download a test model built from only 3 personal samples, but a model trained on 3 samples is not very accurate, so I did not choose this option.

But if you want to try it out, you can follow these steps:

Step 1: Go to https://snowboy.kitt.ai/

Step 2: Create a login account

Step 3: After logging in, you will see an interface like this:

The page lists trigger word models that other users have created. Models with enough samples can be downloaded and used; for models that do not have enough samples yet, you can contribute data. To create your own model, click the Create Hotword button.

In the Hotword Name field, enter the model name. The name should match the keyword itself so that other users can find it and contribute data. In the Language field, select the language of the keyword; if your language is not in the list, select Other. Then write a short description so that other users understand the model and can find it more easily.

Step 4: Record sample data. Each user contributes 3 samples; you need to record all 3 to test and download the model.

Option 2: Follow the method Andrew Ng teaches in his deep learning lectures.

With this method, each frame of an input sample is classified as positive or negative.
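
One way such per-frame targets could be built is sketched below. It assumes the clip has already been cut into `n_frames` output frames and that the frame index where each keyword ends is known. The function name, the `positive_span` parameter, and its default of 50 frames are illustrative assumptions, not values taken from this article.

```python
def frame_labels(n_frames, keyword_end_frames, positive_span=50):
    """Build a 0/1 target per frame for one training clip.

    The frames right after each keyword ends are marked positive (1);
    every other frame stays negative (0).
    """
    labels = [0] * n_frames
    for end in keyword_end_frames:
        # Clip the positive run at the end of the sample.
        for i in range(end + 1, min(end + 1 + positive_span, n_frames)):
            labels[i] = 1
    return labels


# A 10-frame clip whose keyword ends at frame 3, with a 2-frame positive span.
print(frame_labels(10, [3], positive_span=2))
```

The frames are marked just after the keyword ends because that is the earliest point at which the whole keyword has been heard.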

The structure of the deep learning model in this approach is as follows:

This model has more than 500,000 parameters, with a 1D convolution layer at the input followed by two GRU layers. In my opinion, this architecture is still too large to run on low-end hardware.

Option 3: Build a model with a simple architecture

Data preparation

To perform this step, you need to collect 3 types of data:

  • Positive: audio of the activation keyword
  • Negative: audio of speech that does not contain the activation keyword
  • Background: ambient sound of the environment where the device will be placed

Data collection is time-consuming and tedious. In practice, I used my phone to “beg” data from my friends. Try to collect as much positive data as possible, from many people speaking with different intonations. For negative data, I asked people to read any paragraph that does not contain the activation keyword. If the negative data comes from the same people who recorded the positive data, we get a very good dataset. Negative data can also be collected from audiobooks, news broadcasts, and similar sources. Background data is easier to collect: if the device will sit in a fixed location, record the background there.


Step 1: Reduce white noise:

White noise here is the noise produced by the recording device itself; how much there is depends on the microphone quality and its sensitivity setting.

To reduce white noise, I recorded an audio clip in a quiet environment to model the noise. This sample acts as a threshold mask for removing noise from later recordings. Note that when you change the recording device or the microphone sensitivity, the noise sample should be recorded again. For more on the noise reduction algorithm, see https://github.com/timsainb/noisereduce

Code to reduce noise is as follows:
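
A minimal, dependency-free sketch of the threshold-mask idea, assuming mono audio held as a list of floats. It is not the noisereduce implementation: it gates non-overlapping frames with a naive DFT, which keeps the example short but is slower and cruder than a windowed, overlapping transform. The names `reduce_noise`, `frame_len`, and `n_std` are illustrative assumptions.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform, adequate for short frames."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of dft(); keeps the real part of each reconstructed sample."""
    n = len(spectrum)
    return [
        sum(c * cmath.exp(2j * math.pi * k * t / n) for k, c in enumerate(spectrum)).real / n
        for t in range(n)
    ]


def split_frames(samples, frame_len):
    """Non-overlapping frames; a tail shorter than one frame is left out."""
    return [
        samples[i:i + frame_len]
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def noise_thresholds(noise, frame_len, n_std):
    """Per-frequency threshold: mean + n_std * std of the noise magnitudes."""
    magnitudes = [[abs(c) for c in dft(f)] for f in split_frames(noise, frame_len)]
    if not magnitudes:
        raise ValueError("noise clip is shorter than one frame")
    thresholds = []
    for k in range(frame_len):
        column = [frame[k] for frame in magnitudes]
        mean = sum(column) / len(column)
        std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
        thresholds.append(mean + n_std * std)
    return thresholds


def reduce_noise(samples, noise, frame_len=32, n_std=1.5):
    """Zero every frequency bin that does not rise above the noise threshold."""
    thresholds = noise_thresholds(noise, frame_len, n_std)
    cleaned = []
    for frame in split_frames(samples, frame_len):
        gated = [c if abs(c) > t else 0j for c, t in zip(dft(frame), thresholds)]
        cleaned.extend(idft(gated))
    cleaned.extend(samples[len(cleaned):])  # pass a short tail through unchanged
    return cleaned
```

The noise clip recorded in the quiet room plays the role of `noise`, and every later recording is passed in as `samples`.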

Step 2: After noise reduction, split each data file into samples with a fixed length of 1.5 seconds
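
A minimal sketch of the splitting step, assuming each recording is already loaded as a sequence of samples with a known sample rate. The function name and the choice to drop a trailing remainder shorter than 1.5 seconds are assumptions.

```python
def split_into_segments(samples, sample_rate, segment_seconds=1.5):
    """Cut a recording into consecutive fixed-length segments.

    A trailing remainder shorter than one segment is dropped.
    """
    segment_len = int(sample_rate * segment_seconds)
    if segment_len <= 0:
        raise ValueError("segment length must be at least one sample")
    return [
        samples[start:start + segment_len]
        for start in range(0, len(samples) - segment_len + 1, segment_len)
    ]


# At 16 kHz, each 1.5-second segment holds 24000 samples.
recording = [0.0] * 50000
print(len(split_into_segments(recording, sample_rate=16000)))
```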

Step 3: Generate more data with existing data

Because there is very little data for the positive label, it is necessary to generate more. I use the following methods:

  • Shifting Time: shift the signal to the left or right
  • Changing Pitch: raise or lower the pitch
  • Changing Speed: speed the signal up or slow it down

Code implementation:
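
A minimal sketch of two of these augmentations, assuming each sample is a list of floats. The function names, the shift range, and the speed rates are illustrative assumptions. Changing pitch without changing duration needs a time-stretching step and is usually done with an audio library, so it is left out of this sketch.

```python
import random


def shift_time(samples, offset):
    """Shift right (offset > 0) or left (offset < 0), padding with silence."""
    n = len(samples)
    offset = max(-n, min(n, offset))
    if offset >= 0:
        return [0.0] * offset + list(samples[:n - offset])
    return list(samples[-offset:]) + [0.0] * (-offset)


def change_speed(samples, rate):
    """Resample by linear interpolation; rate > 1 plays faster (shorter clip).

    This naive resampling also shifts the pitch, like speeding up a tape.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    n = len(samples)
    if n == 0:
        return []
    out = []
    for i in range(max(1, int(n / rate))):
        pos = i * rate
        left = min(int(pos), n - 1)
        right = min(left + 1, n - 1)
        frac = pos - left
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out


def fit_length(samples, length):
    """Crop or zero-pad so every augmented clip keeps the same length."""
    samples = list(samples[:length])
    return samples + [0.0] * (length - len(samples))


def augment(samples, max_shift, rates=(0.9, 1.1), seed=None):
    """Return one randomly shifted copy plus one copy per speed rate."""
    rng = random.Random(seed)
    n = len(samples)
    variants = [shift_time(samples, rng.randint(-max_shift, max_shift))]
    variants += [fit_length(change_speed(samples, r), n) for r in rates]
    return variants
```

Every variant is brought back to the original length so that the augmented clips still match the fixed 1.5-second input.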

Feature extraction

There are many ways to extract features from audio data, and the right choice depends on the problem. For this problem, I chose to extract mel spectrogram features.

Code to extract mel spectrogram data:
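
A dependency-free sketch of how a log-mel spectrogram is computed, assuming mono audio as a list of floats. In practice an audio library would normally do this far faster; the naive DFT here is only to keep the example self-contained. The HTK-style mel formula, the unnormalized triangular filters, and all parameter values (`n_fft`, `hop`, `n_mels`) are assumptions, not the settings used for the model in this article.

```python
import cmath
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale (HTK formula)."""
    n_bins = n_fft // 2 + 1
    max_mel = hz_to_mel(sample_rate / 2.0)
    # n_mels + 2 edge frequencies, expressed as fractional FFT bin positions.
    edges = [
        mel_to_hz(max_mel * i / (n_mels + 1)) * n_fft / sample_rate
        for i in range(n_mels + 2)
    ]
    bank = []
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        bank.append([
            max(0.0, min((k - lo) / (mid - lo), (hi - k) / (hi - mid)))
            for k in range(n_bins)
        ])
    return bank


def power_spectrum(frame):
    """Power of the non-negative frequency bins of a Hann-windowed frame."""
    n = len(frame)
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * t / n) for t in range(n)]
    return [
        abs(sum(frame[t] * window[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))) ** 2
        for k in range(n // 2 + 1)
    ]


def mel_spectrogram(samples, sample_rate, n_fft=64, hop=32, n_mels=8):
    """Return a list of frames, each holding n_mels log-energies in dB."""
    bank = mel_filterbank(n_mels, n_fft, sample_rate)
    frames = []
    for start in range(0, len(samples) - n_fft + 1, hop):
        power = power_spectrum(samples[start:start + n_fft])
        mel = [sum(w * p for w, p in zip(row, power)) for row in bank]
        frames.append([10.0 * math.log10(max(e, 1e-10)) for e in mel])
    return frames
```

The result is a 2D array (frames by mel bands) that can be fed to the model like a small image.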

Model initialization and training

The model structure I created is quite simple, to keep the model light and prediction fast. I may update this section in the future with a better model.

Model initialization code:


Next, I split the dataset into 2 parts: 85% for training and 15% for validation:
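
A minimal sketch of an 85/15 split, assuming the features and labels are parallel lists. The function name and the fixed seed are assumptions; the shuffle keeps each feature paired with its label.

```python
import random


def train_val_split(features, labels, val_fraction=0.15, seed=42):
    """Shuffle the samples, then hold out val_fraction of them for validation."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_val = int(round(len(indices) * val_fraction))
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    train = ([features[i] for i in train_idx], [labels[i] for i in train_idx])
    val = ([features[i] for i in val_idx], [labels[i] for i in val_idx])
    return train, val


# 100 dummy samples give 85 for training and 15 for validation.
(train_x, train_y), (val_x, val_y) = train_val_split(list(range(100)), [0, 1] * 50)
print(len(train_x), len(val_x))
```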

Training the model:


Let’s test the model in real time to check the results. I use pyaudio to stream audio from the microphone, with one thread recording data and another thread running detection, so that no audio is missed:

Audio stream initialization code:
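
A sketch of the two-thread layout, with the microphone replaced by a plain callable so that it runs without audio hardware. With pyaudio, `read_chunk` would wrap the input stream's `read` call. The names `record`, `consume`, and `run_stream` are illustrative assumptions, and a real recorder would loop until stopped rather than for a fixed number of chunks.

```python
import queue
import threading


def record(read_chunk, n_chunks, chunks):
    """Recorder thread: read fixed-size chunks and hand them to the detector."""
    for _ in range(n_chunks):
        chunks.put(read_chunk())
    chunks.put(None)  # sentinel: no more audio


def consume(chunks, handle_chunk):
    """Detector thread: process chunks in arrival order until the sentinel."""
    while True:
        chunk = chunks.get()
        if chunk is None:
            break
        handle_chunk(chunk)


def run_stream(read_chunk, n_chunks, handle_chunk):
    """Run the recorder and detector side by side and wait for both to finish."""
    chunks = queue.Queue()  # unbounded, so the recorder never blocks or drops audio
    recorder = threading.Thread(target=record, args=(read_chunk, n_chunks, chunks))
    detector = threading.Thread(target=consume, args=(chunks, handle_chunk))
    recorder.start()
    detector.start()
    recorder.join()
    detector.join()


# Demo with a stand-in microphone that yields three chunks.
fake_microphone = iter([[0.0] * 4, [0.5] * 4, [1.0] * 4])
received = []
run_stream(lambda: next(fake_microphone), 3, received.append)
print(len(received))
```

The queue decouples the two threads: if detection is briefly slower than recording, chunks wait in the queue instead of being lost.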

Code to detect the trigger word:
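
A sketch of the detection loop, assuming audio arrives in equal-sized chunks and that `predict` is a stand-in for feature extraction plus the trained model, returning a positive-class score between 0 and 1. The window should cover the same 1.5 seconds used for the training samples. The threshold and the cooldown (added so that one utterance does not fire several times) are assumptions.

```python
from collections import deque


def detect_triggers(chunks, window_chunks, predict, threshold=0.5, cooldown_chunks=0):
    """Slide a window over incoming chunks; return the chunk indices that fire."""
    window = deque(maxlen=window_chunks)  # always holds the latest chunks
    fired = []
    wait = 0
    for index, chunk in enumerate(chunks):
        window.append(chunk)
        if wait > 0:
            wait -= 1  # still cooling down after the previous detection
            continue
        if len(window) < window_chunks:
            continue  # not enough audio buffered yet
        samples = [s for part in window for s in part]
        if predict(samples) >= threshold:
            fired.append(index)
            wait = cooldown_chunks
    return fired


# Stand-in model: "detects" the keyword whenever a loud sample is in the window.
def fake_predict(samples):
    return 1.0 if max(samples) > 0.5 else 0.0


stream = [[0.0], [0.0], [1.0], [0.0], [0.0], [0.0]]
print(detect_triggers(stream, window_chunks=2, predict=fake_predict, cooldown_chunks=1))
```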

Directions for improvement:

  • Add more data
  • Try other feature extraction methods
  • Improve the model structure
  • Use model pruning to make the model lighter and faster


Above, I introduced the method I used to build my trigger word model. If you have any comments, please leave them below. Thank you for reading!

Source : Viblo