Introduction
Hello everyone, today I will introduce you to the trigger word (also called wake word) problem.
So what is a trigger word? A trigger word is a voice signal that activates a device, like “Ok Google” / “Hey Google” for Google’s virtual assistant, or “Alexa” for Amazon’s Alexa. Here is a demo I made of my own trigger word model, please take a look:
OK, let’s find out more about trigger words!
Why use a trigger word:
- The trigger word acts as a name: it lets the device know the user is addressing it, and not another person or device
- It limits the continuous streaming of voice samples to the server
- It partially protects the user’s privacy
Requirements for a trigger word module:
- Lightweight, able to run on low-end hardware
- A wake phrase that is catchy and easy to say
- Distinctive, rarely encountered in everyday speech
- High accuracy across male and female voices, different ages, and different regional accents
Based on the above requirements, we can frame this as a two-class classification problem: positive and negative.
Choosing a method
Option 1: Use the API of “Snowboy hotword detection”; see https://snowboy.kitt.ai/
Pros: this API comes with ready-made tools; my only job is to feed in data and get the model back.
Cons: to download a “good” model, we need at least 500 users contributing data for English, and at least 2,000 users contributing data for other languages.
I could still download a test model trained on only my own 3 samples, but a model built from just 3 samples is not very effective, so I did not choose this option.
But if you want to try it out, you can follow these steps:
Step 1: Go to https://snowboy.kitt.ai/
Step 2: Create a login account
Step 3: After logging in, you will see an interface like this:
As you can see, below is a list of trigger word models that other users have created. Models with enough samples can be downloaded and used; for models that do not yet have enough samples, you can contribute data. To create your own model, click the Create Hotword button.
In the Hotword Name field, enter the model name; the name should match the keyword used to call the device, so that other users know what to say when contributing data. In the Language field, select the language used to create the model; if your language is not in the list, select Other. Finally, write a short description so other users can understand the model and find it easily.
Step 4: Record sample data. Each user contributes 3 data samples; you should record all 3 samples so you can test and download the model.
Option 2: Follow the method Andrew Ng teaches in his deep learning lectures.
With this method, for each frame of an input sample, we classify whether that frame should be labeled positive or negative.
The structure of the deep learning model in this approach is as follows:
This model has more than 500,000 parameters, using a 1D convolution layer at the input followed by two GRU layers. In my opinion, this architecture is still quite large to run on low-end hardware.
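To make the per-frame idea concrete, here is a small numpy sketch of how the training target looks in this approach: the few frames right after the trigger word ends are labeled 1, everything else 0. The frame counts and timings here are illustrative values of my own, not taken from the lecture.

```python
import numpy as np

def make_frame_labels(n_frames, trigger_end_frame, n_positive=5):
    """Label the n_positive frames right after the trigger word ends as 1,
    everything else as 0 (a per-frame binary classification target)."""
    y = np.zeros(n_frames, dtype=np.float32)
    end = min(trigger_end_frame + n_positive, n_frames)
    y[trigger_end_frame:end] = 1.0
    return y

# Example: a 100-frame clip where the trigger word ends at frame 40
labels = make_frame_labels(100, 40)
print(labels[38:47])  # frames 40..44 are positive
```

The model then outputs one probability per frame, rather than one label for the whole clip as in my Option 3 below.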
Option 3: Build a model with a simple architecture
Data preparation
To perform this step, you need to collect three types of data:
- Positive: audio of the activation keyword
- Negative: audio of speech that does not contain the activation keyword
- Background: the ambient sound of the environment where the device will be placed
The data collection stage is time-consuming and boring. In practice, I used my phone to “beg” data from my friends. Try to collect as much positive data as possible, asking everyone to speak with different intonations. For negative data, I asked people to read any paragraph that does not contain the activation keyword; collecting negative data from the same people who provided positive data gives a very good dataset. We can also collect negative data from audiobooks, news recordings, and so on. Background data is easier to collect: if your device will sit in a fixed space, record the background there.
Preprocessing
Step 1: Reduce white noise
White noise is noise produced by the recording device itself; how much of it you get depends on the microphone quality and its sensitivity setting.
To reduce white noise, I recorded an audio clip in a quiet environment to serve as the noise profile. This sample acts like a threshold mask, used to eliminate noise in future audio. Note that when you change the recording device, or set the microphone sensitivity differently, it should be recorded again. For details of the noise reduction algorithm, see https://github.com/timsainb/noisereduce
Code to reduce noise is as follows:
```python
import noisereduce as nr
import librosa

# Load the recording to clean and the noise-only profile clip
y, sr = librosa.load("positive.wav")
noise, sr = librosa.load("audio/train_sunnie/test/noise.wav")
reduced_noise = nr.reduce_noise(audio_clip=y, noise_clip=noise, verbose=False)
librosa.output.write_wav("positive_reducenoise.wav", reduced_noise, sr, norm=False)
```
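To illustrate the “threshold mask” idea behind this kind of noise reduction, here is a heavily simplified numpy sketch of spectral gating: estimate a per-frequency noise level from the noise-only clip, then zero out frequency bins of the signal that fall below that level. This is my own toy version of the principle, not what the noisereduce library actually implements (it adds smoothing, overlap, and masking refinements); the frame size and margin are arbitrary illustrative values.

```python
import numpy as np

def spectral_gate(signal, noise_sample, frame=512, margin=1.5):
    """Zero out frequency bins whose magnitude is below a noise threshold,
    frame by frame (no overlap, rectangular window, for illustration only)."""
    # Per-frequency noise threshold from the noise-only recording
    n_noise_frames = len(noise_sample) // frame
    noise_spec = np.abs(np.fft.rfft(
        noise_sample[:n_noise_frames * frame].reshape(-1, frame), axis=1))
    thresh = margin * noise_spec.mean(axis=0)

    out = np.zeros_like(signal)
    for i in range(len(signal) // frame):
        seg = signal[i * frame:(i + 1) * frame]
        spec = np.fft.rfft(seg)
        spec[np.abs(spec) < thresh] = 0.0   # gate: drop bins under the mask
        out[i * frame:(i + 1) * frame] = np.fft.irfft(spec, n=frame)
    return out
```

Since only bins are zeroed, the output can never have more energy than the input; loud components like speech pass through while low-level hiss is removed.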
Step 2: Split each denoised recording into clips; I set the clip length to 1.5 seconds.
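This splitting step can be done with a few lines of numpy; here is a sketch that cuts a loaded signal into consecutive fixed-length 1.5-second clips at 16 kHz (the function name and the handling of the leftover tail are my own choices).

```python
import numpy as np

def split_clips(y, sr=16000, clip_seconds=1.5):
    """Cut a long signal into consecutive fixed-length clips,
    dropping the leftover tail shorter than one clip."""
    clip_len = int(sr * clip_seconds)
    n_clips = len(y) // clip_len
    return [y[i * clip_len:(i + 1) * clip_len] for i in range(n_clips)]

# Example with a synthetic 5-second signal at 16 kHz
y = np.zeros(5 * 16000)
clips = split_clips(y)
print(len(clips), len(clips[0]))  # 3 clips of 24000 samples each
```

Each clip can then be written out with librosa and labeled positive or negative.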
Step 3: Generate more data with existing data
Because the data for the positive label is very limited, we need to generate more data for it. I use the following augmentation methods:
- Time shifting: move the signal to the left or right
- Pitch changing: change the pitch
- Speed changing: change the speed
Code implementation:
```python
import os

import numpy as np
import librosa
from tqdm import tqdm_notebook


def shifting(data):
    # Roll the signal left or right by 2-5% of its length
    shift_max = int(len(data) * 0.05)
    shift_min = int(len(data) * 0.02)
    shift = np.random.randint(shift_min, shift_max)
    if np.random.randint(0, 2, 1)[0]:
        shift = -shift
    augmented_data = np.roll(data, shift)
    return augmented_data

def change_pitch(data, sampling_rate, pitch_factor):
    return librosa.effects.pitch_shift(data, sampling_rate, pitch_factor)

def change_speed(data, speed_factor):
    return librosa.effects.time_stretch(data, speed_factor)

path_voice = "positive/"
path_save = "positive_aug/"

# Generate 5 augmented variants of every positive sample
for i in range(5):
    for file_name in tqdm_notebook(os.listdir(path_voice)):
        try:
            y, sr = librosa.load(path_voice + file_name, sr=16000)
            y = shifting(y)
            rate = np.random.uniform(-3, 3, 1)[0]
            y = change_pitch(y, sr, rate)
            rate = np.random.uniform(0.8, 1.5, 1)[0]
            y = change_speed(y, rate)
            librosa.output.write_wav(
                path_save + str(np.random.randint(1000000)) + file_name, y, sr)
        except Exception as e:
            print(e)

print("Num sample: ", len(os.listdir(path_save)))
```
Feature extraction
There are many ways to extract features from audio data, depending on the problem. For this problem, I chose to extract mel spectrogram features.
Code to extract mel spectrogram features:
```python
import os
import time

import numpy as np
import librosa
from tqdm import tqdm_notebook


def extract_feature(path_file):
    y, sr = librosa.load(path_file, sr=16000)
    sr = 16000
    # Pad or trim every clip to exactly 1.5 seconds
    num_step = int(sr * 1.5)
    if len(y) > num_step:
        y = y[0:num_step]
        pad_width = 0
    else:
        pad_width = num_step - len(y)
        y = np.pad(y, (0, int(pad_width)), mode='constant')
    # Power spectrogram -> mel spectrogram -> dB scale
    D = np.abs(librosa.core.stft(y=y, n_fft=2048)) ** 2
    S = librosa.feature.melspectrogram(S=D, sr=sr, n_fft=256)
    S_dB = librosa.power_to_db(S, ref=np.max)
    return S_dB.T

path_positive = "positive_aug/"
path_negative = "negative_aug/"

start = time.time()
sample_data = extract_feature("test_1.wav")
num_sample = len(os.listdir(path_negative)) + len(os.listdir(path_positive))
x = np.zeros((num_sample, sample_data.shape[0], sample_data.shape[1]))
y = np.zeros((num_sample, 2))

i = 0
for file_name in tqdm_notebook(os.listdir(path_positive)):
    feature = extract_feature(path_positive + file_name)
    x[i, :, :] = feature
    y[i, :] = np.array([1, 0])   # one-hot: positive
    i += 1

for file_name in tqdm_notebook(os.listdir(path_negative)):
    feature = extract_feature(path_negative + file_name)
    x[i, :, :] = feature
    y[i, :] = np.array([0, 1])   # one-hot: negative
    i += 1

print("Num sample: ", num_sample)
print("Time: ", time.time() - start)
np.save("Xtrain", np.asarray(x))
np.save("Ytrain", np.asarray(y))
```
Model initialization and training
The model structure I created is quite simple, to keep the model light and prediction fast. I may update this section in the future with a better model.
Model initialization code:
```python
from keras.models import Sequential
from keras.layers import Dense, GRU
from keras.layers import Activation
from keras.models import load_model
from keras import optimizers
import keras

num_hidden = 2
model = Sequential()
model.add(Dense(128, input_shape=(x.shape[1], x.shape[2])))
model.add(Activation("relu"))
for _ in range(num_hidden):
    model.add(Dense(128))
    model.add(Activation("relu"))
model.add(GRU(128))
model.add(Dense(2, activation="softmax"))

adam = optimizers.Adam(lr=0.000125)
model.compile(optimizer=adam, loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
```
Result:
```
Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_17 (Dense)             (None, 47, 128)           16512
_________________________________________________________________
activation_13 (Activation)   (None, 47, 128)           0
_________________________________________________________________
dense_18 (Dense)             (None, 47, 128)           16512
_________________________________________________________________
activation_14 (Activation)   (None, 47, 128)           0
_________________________________________________________________
dense_19 (Dense)             (None, 47, 128)           16512
_________________________________________________________________
activation_15 (Activation)   (None, 47, 128)           0
_________________________________________________________________
gru_5 (GRU)                  (None, 128)               98688
_________________________________________________________________
dense_20 (Dense)             (None, 2)                 258
=================================================================
Total params: 148,482
Trainable params: 148,482
Non-trainable params: 0
_________________________________________________________________
```
Next, I split the data set into 2 parts: 85% for training and 15% for validation:
```python
from sklearn.model_selection import train_test_split

x_train, x_valid, y_train, y_valid = train_test_split(
    x, y, test_size=0.15, random_state=42)
```
Training the model:
```python
model.fit(x_train, y_train, validation_data=(x_valid, y_valid),
          batch_size=32, epochs=12, shuffle=True)
model.save('model.h5')
```
Result:
```
Epoch 10/12
4764/4764 [==============================] - 17s 4ms/step - loss: 0.1363 - accuracy: 0.9536 - val_loss: 0.1304 - val_accuracy: 0.9548
Epoch 11/12
4764/4764 [==============================] - 17s 4ms/step - loss: 0.1274 - accuracy: 0.9582 - val_loss: 0.1301 - val_accuracy: 0.9548
Epoch 12/12
4764/4764 [==============================] - 17s 4ms/step - loss: 0.1112 - accuracy: 0.9633 - val_loss: 0.1225 - val_accuracy: 0.9655
```
Let’s test the model running in real time to check the results. I use pyaudio to stream audio data collected from the microphone: one thread records data while another runs detection, so we don’t miss any audio.
Audio stream initialization code:
```python
def get_audio_input_stream(callback):
    # fs (sample rate) and chunk_samples are globals defined elsewhere
    stream = pyaudio.PyAudio().open(
        format=pyaudio.paInt16,
        channels=1,
        rate=fs,
        input=True,
        frames_per_buffer=chunk_samples,
        input_device_index=0,
        stream_callback=callback)
    return stream

def callback(in_data, frame_count, time_info, status):
    global run, timeout, data, silence_threshold
    data0 = np.frombuffer(in_data, dtype='int16')
    data = np.append(data, data0)
    # Keep only the most recent feed_samples samples in the buffer
    if len(data) > feed_samples:
        data = data[-feed_samples:]
    q.put(data)
    return (in_data, pyaudio.paContinue)
```
Trigger word detection code:
```python
while True:
    data = q.get()
    save_audio(data, "temp/temp.wav")
    feature = extract_feature("temp/temp.wav")
    x_t = np.zeros((1, feature.shape[0], feature.shape[1]))
    x_t[0, :, :] = feature
    r = model.predict(x_t)
    if r[0][0] > 0.5:
        print("trigger word detected!")
        print(r[0][0])
    time.sleep(0.001)
```
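The two threads above communicate through a queue. Here is a minimal self-contained sketch of that wiring with Python’s standard queue and threading modules; in the real code the pyaudio callback plays the producer role, so a dummy producer stands in for it here, and the detector just collects chunks instead of calling model.predict.

```python
import queue
import threading
import time

q = queue.Queue()
detections = []

def recorder(n_chunks):
    """Producer: stands in for the pyaudio callback pushing audio buffers."""
    for i in range(n_chunks):
        q.put(f"chunk-{i}")        # the real code puts numpy audio buffers
        time.sleep(0.01)
    q.put(None)                     # sentinel: no more audio

def detector():
    """Consumer: pulls chunks and runs detection without blocking recording."""
    while True:
        chunk = q.get()
        if chunk is None:
            break
        detections.append(chunk)    # real code: extract_feature + model.predict

t1 = threading.Thread(target=recorder, args=(5,))
t2 = threading.Thread(target=detector)
t1.start(); t2.start()
t1.join(); t2.join()
print(len(detections))  # 5
```

Because q.get() blocks until data arrives, the detector sleeps while the recorder fills the buffer, and no chunk is dropped even if detection is slow.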
Directions for improvement:
- Add more data
- Try other feature extraction methods
- Improve model structure
- Use model pruning to make the model lighter and faster
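To sketch what pruning means: a share of the weights with the smallest magnitudes is set to zero, so the network can be stored sparsely and evaluated faster. Here is a toy numpy illustration of magnitude-based pruning on a single weight matrix; the function and the example matrix are my own, and in practice frameworks such as TensorFlow Model Optimization apply this gradually during training.

```python
import numpy as np

def prune_by_magnitude(w, sparsity=0.5):
    """Zero out the fraction `sparsity` of weights with smallest |value|."""
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy()
    # Threshold = magnitude of the k-th smallest weight
    thresh = np.sort(np.abs(w), axis=None)[k - 1]
    pruned = w.copy()
    pruned[np.abs(pruned) <= thresh] = 0.0
    return pruned

w = np.array([[0.9, -0.05, 0.4],
              [-0.01, 0.7, 0.02]])
print(prune_by_magnitude(w, sparsity=0.5))
```

Half of the six weights (the three smallest in magnitude) become zero while the large weights that carry most of the signal are kept.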
Conclusion
Above, I have introduced my method for building a trigger word model. If you have any comments, please leave them below. Thank you for reading!