Text to speech with Tacotron 2

Tram Ho

1. Text-to-Speech at a Glance

Text to Speech (TTS), or speech synthesis, is the set of methods for converting text into speech – like the voice of Google Translate. The topic has been studied and applied since the 1960s. In the early days, speech synthesis systems were usually built in the following way:

  • Record each syllable and save it in a database as a set of audio clips
  • Preprocess the text -> extract words and phonemes -> look up the audio segments containing the corresponding sounds in the database
  • Concatenate the audio segments, post-process, and produce a complete voice clip

Photo source: DolphinAttack: Inaudible Voice Commands

With this approach, the generated voice is of poor quality: it sounds “mechanical”, flat, and lacks human intonation. Furthermore, systems of this type depend heavily on the preparation of recorded data.

Recently, deep learning has achieved a lot in many fields, including speech synthesis. With the introduction of a series of models and algorithms such as Deep Voice, Nuance TTS, WaveNet, Tacotron, Tacotron 2, … synthesized voices now have very good quality, intonation, and emotion, to the point that listeners find it hard to tell whether the voice is real or not.

In today’s article, I will write about Tacotron 2 – an improved, simplified version of Tacotron. Tacotron and Tacotron 2 were released publicly by Google for the community and are SOTA in the field of speech synthesis.

2. Tacotron 2 architecture

2.1 Mel spectrogram

Before getting into the details of the Tacotron / Tacotron 2 architecture, you need to read a bit about the mel spectrogram. In the field of speech and audio processing, instead of directly processing data as a time-domain waveform (a very long sequence), we convert the sound to frequency-domain representations such as the spectrogram, mel spectrogram, mel cepstrum, MFCC, … For a better understanding, please read part 2: Speech p2 – Feature extraction MFCC. Below is a spectrogram illustration of an audio clip:

Photo source: medium.com

Basically, the process of converting sound from the time domain to the frequency domain follows these steps (a small code sketch follows the list):

  • Split the waveform into a set of short segments (~25 ms)
  • On each segment, apply a DFT / FFT to compute N magnitude (intensity) values corresponding to N frequencies. (The transform also yields N phase values, but this part is usually not used.) These two steps together are called the STFT – Short-Time Fourier Transform.
  • We thus obtain data in the form of a spectrogram with 2 dimensions: the horizontal axis is time, the vertical axis is frequency, and the value at each point – shown by color – is the intensity of the corresponding frequency.
  • Since human hearing is non-linear, a mel filter bank is applied to convert the spectrogram into a mel spectrogram.
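To make these steps concrete, here is a minimal sketch using the librosa library. The file name speech.wav and the parameter values (n_fft, hop_length, 80 mel bands) are illustrative choices on my part, not the exact settings of the Tacotron 2 paper.

```python
# Minimal sketch of the waveform -> mel-spectrogram conversion with librosa.
import librosa
import numpy as np

y, sr = librosa.load("speech.wav", sr=22050)         # time-domain waveform (hypothetical file)

# STFT: split into short overlapping frames and FFT each frame
stft = librosa.stft(y, n_fft=1024, hop_length=256)
magnitude = np.abs(stft)                              # keep magnitude, drop phase

# Apply a mel filter bank to get the perceptually scaled mel spectrogram
mel_filters = librosa.filters.mel(sr=sr, n_fft=1024, n_mels=80)
mel_spec = mel_filters @ magnitude                    # shape: (80 mel bands, n_frames)
log_mel = np.log(np.clip(mel_spec, a_min=1e-5, a_max=None))  # log compression, commonly used as model input
```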

Basically, audio data in the form of a spectrogram, MFCC, … is quite similar to an image. So, just as with image data, in many deep learning problems we use convolution layers to extract features from it.

2.2 Architecture

Basically, Tacotron and Tacotron 2 are quite similar; both divide the architecture into two separate parts:

  • Part 1: Spectrogram Prediction Network – converts the text string into a mel spectrogram in the frequency domain
  • Part 2: Vocoder – converts the mel spectrogram (frequency domain) into a waveform (time domain)

If you wonder why it has to be split into 2 parts like that, the answer is simple: at a 16 kHz sampling rate, 1 second of sound in the time domain is a sequence of 16,000 numbers. If the model had to infer 1 s of audio directly, an LSTM / GRU would need 16,000 steps. Meanwhile, before Tacotron came out, people had already found ways to convert a spectrogram into a waveform with good quality.
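To see why this matters, here is a tiny back-of-the-envelope comparison. The 16 kHz sample rate comes from the text above; the 12.5 ms hop size is an illustrative value, not a figure from the article.

```python
# How many steps does one second of audio take as raw samples vs. spectrogram frames?
sample_rate = 16_000           # samples per second of raw waveform
hop_seconds = 0.0125           # one spectrogram frame every 12.5 ms (assumed)

waveform_steps = sample_rate * 1            # 16,000 values for 1 s of audio
spectrogram_frames = int(1 / hop_seconds)   # only 80 frames for the same second

print(waveform_steps, spectrogram_frames)   # 16000 vs. 80
```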

2.2.1 Spectrogram Prediction Network

The Spectrogram Prediction Network’s architecture is quite simple, following the Encoder–Decoder pattern, with the Encoder and Decoder connected by Location-Sensitive Attention.

Encoder (a rough sketch follows the list):

  • The input text is encoded (one-hot) at the character level
  • A Character Embedding layer implemented as a simple lookup table
  • 3 convolution layers
  • A bi-directional LSTM (2-way LSTM)
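Below is a rough PyTorch sketch of that encoder stack. The 512-dimensional sizes, kernel size 5, and batch-norm placement follow common open-source Tacotron 2 implementations; treat them as assumptions rather than the exact paper configuration.

```python
# Rough sketch of the encoder: character embedding, 3 convolutions, bidirectional LSTM.
import torch
import torch.nn as nn

class EncoderSketch(nn.Module):
    def __init__(self, n_chars=100, emb_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(n_chars, emb_dim)        # lookup-table embedding
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(emb_dim, emb_dim, kernel_size=5, padding=2),
                nn.BatchNorm1d(emb_dim),
                nn.ReLU(),
            )
            for _ in range(3)                                   # 3 convolution blocks
        ])
        # Bi-directional LSTM; each direction outputs emb_dim // 2 features
        self.lstm = nn.LSTM(emb_dim, emb_dim // 2, batch_first=True, bidirectional=True)

    def forward(self, char_ids):                 # char_ids: (batch, text_len) of character indices
        x = self.embedding(char_ids)             # (batch, text_len, emb_dim)
        x = x.transpose(1, 2)                    # Conv1d expects (batch, channels, time)
        for conv in self.convs:
            x = conv(x)
        x = x.transpose(1, 2)                    # back to (batch, text_len, emb_dim)
        outputs, _ = self.lstm(x)                # (batch, text_len, emb_dim)
        return outputs                           # encoder memory consumed by the attention
```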

Decoder:

  • Pre-Net – essentially two fully connected layers, used to filter information from the previous step (a minimal sketch follows this list)
  • 2 LSTM layers – take information from the Encoder via Attention and combine it with information from the previous step passed through the Pre-Net
  • Linear Projection – a linear layer used to predict the mel spectrogram
  • Post-Net – 5 convolution layers added with the purpose of filtering noise from the mel spectrogram
Thus, the outputs of the Linear Projection and the Post-Net are summed to produce the final predicted mel spectrogram.

The loss function used in this paper is MSE
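To show how the two outputs and the loss fit together, here is a sketch with dummy tensors. decoder_mel stands in for the Linear Projection output, and the 5-layer postnet below is a bare-bones placeholder (no batch norm or tanh) for the real Post-Net.

```python
# Sketch of combining the two decoder outputs and computing the MSE loss, on dummy tensors.
import torch
import torch.nn as nn

batch, n_mels, n_frames = 8, 80, 400
decoder_mel = torch.randn(batch, n_mels, n_frames)      # output of the Linear Projection
target_mel = torch.randn(batch, n_mels, n_frames)       # ground-truth mel spectrogram

postnet = nn.Sequential(                                 # stand-in for the 5-layer Post-Net
    *[nn.Conv1d(n_mels, n_mels, kernel_size=5, padding=2) for _ in range(5)]
)

postnet_mel = decoder_mel + postnet(decoder_mel)         # residual correction of the prediction

mse = nn.MSELoss()
loss = mse(decoder_mel, target_mel) + mse(postnet_mel, target_mel)
```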

2.2.2 Vocoder – WaveNet

The Vocoder – roughly, a “voice synthesizer” – is used to convert data from the mel-spectrogram format (frequency domain) into a waveform (time domain) that humans can listen to.

As I said, after the STFT (Short-Time Fourier Transform), the sound waveform is split into two types of information: magnitude and phase (intensity and phase). A spectrogram only contains the magnitude information, while both magnitude and phase are needed to restore the original sound. Thus, the spectrogram returned by the model is not enough to reconstruct the waveform.
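A quick way to see the magnitude/phase split, reusing the hypothetical speech.wav file from the earlier sketch:

```python
# Illustration of the magnitude/phase split produced by an STFT.
import librosa
import numpy as np

y, sr = librosa.load("speech.wav")           # hypothetical audio file
stft = librosa.stft(y)                       # complex-valued STFT

magnitude = np.abs(stft)                     # what a spectrogram keeps
phase = np.angle(stft)                       # what a spectrogram discards

# Both parts are needed for an exact inverse: complex STFT = magnitude * e^(i * phase)
reconstructed = librosa.istft(magnitude * np.exp(1j * phase))
```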

Previously, Tacotron used the Griffin-Lim algorithm to estimate the phase from the spectrogram, and then restored the sound from the spectrogram (magnitude) and the estimated phase. However, the resulting sound quality is not perfect: the sound is not clear and sometimes contains a lot of noise.
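You can hear a Griffin-Lim reconstruction yourself with librosa, whose mel_to_audio helper inverts the mel filter bank and then runs Griffin-Lim to estimate the phase. This reuses mel_spec and sr from the earlier sketch and assumes the soundfile package is installed.

```python
# Griffin-Lim reconstruction from the mel spectrogram computed earlier.
import librosa
import soundfile as sf

waveform = librosa.feature.inverse.mel_to_audio(
    mel_spec,              # magnitude-based mel spectrogram from the earlier sketch
    sr=sr,
    n_fft=1024,
    hop_length=256,
    power=1.0,             # mel_spec was built from magnitudes, not power
    n_iter=60,             # Griffin-Lim iterations
)
sf.write("griffin_lim_output.wav", waveform, sr)   # typically sounds muffled / noisy
```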

Photo source: deepmind.com

In Tacotron 2, the team took advantage of WaveNet – an audio generation model published a few years earlier (also by Google, of course). WaveNet is based on dilated convolutions. Looking at the illustration, you can see that each data point is generated based on past data points, and with dilated convolutions the receptive field is much larger than that of ordinary convolutions. To avoid going on too long and off-topic, if you want to read about WaveNet, read deepmind.com/wavenet .
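Here is a tiny PyTorch sketch of the idea (not WaveNet itself, just the dilation trick): stacking kernel-size-2 convolutions with dilations 1, 2, 4, 8 gives each output a receptive field of 16 past samples, versus 5 for the same number of ordinary convolutions.

```python
# Dilated 1-D convolutions: receptive field grows exponentially with depth.
import torch
import torch.nn as nn

layers = []
dilations = [1, 2, 4, 8]
for d in dilations:
    layers.append(nn.ConstantPad1d((d, 0), 0.0))            # left padding keeps it causal
    layers.append(nn.Conv1d(1, 1, kernel_size=2, dilation=d))
stack = nn.Sequential(*layers)

x = torch.randn(1, 1, 1000)        # (batch, channels, time): 1000 waveform samples
y = stack(x)                       # same length, each output sees 16 past samples
print(y.shape)                     # torch.Size([1, 1, 1000])
```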

The original WaveNet was an unconditional network, meaning it did not receive any input but just randomly generated meaningless audio. However, in Tacotron 2, the authors used a modified version of WaveNet which generates time-domain waveform samples conditioned on the mel spectrogram. In other words, this WaveNet is designed to receive a mel spectrogram as input and output the corresponding waveform.

Later, when implementing the algorithm, many people replaced WaveNet with similar models. The NVIDIA version, for example, uses WaveGlow – a model that is rated as both better and faster than WaveNet.

3. Some useful information

In this article, I do not walk through implementing the algorithm (because it actually takes a lot of effort). However, you can read Pham Van Toan’s article deep-learning-generating-audio-truyen-ma-khong-lo, which guides you through applying NVIDIA’s open-source code.

There are several Tacotron 2 implementations, among which the NVIDIA/tacotron2 and Rayhane-mamah/Tacotron-2 versions are highly appreciated and widely used.
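If you just want to try the full pipeline, NVIDIA also publishes pretrained Tacotron 2 and WaveGlow checkpoints through torch.hub. The entry-point names and infer signatures below are taken from NVIDIA’s hub page at the time of writing and may have changed, so treat them as assumptions and check the NVIDIA/DeepLearningExamples repository before running.

```python
# Hedged end-to-end example using NVIDIA's torch.hub entry points (names may change).
# Assumes a CUDA-capable GPU is available.
import torch

hub_repo = "NVIDIA/DeepLearningExamples:torchhub"
tacotron2 = torch.hub.load(hub_repo, "nvidia_tacotron2").to("cuda").eval()
waveglow = torch.hub.load(hub_repo, "nvidia_waveglow").to("cuda").eval()
utils = torch.hub.load(hub_repo, "nvidia_tts_utils")

text = "Hello, this sentence is synthesized by Tacotron 2."
sequences, lengths = utils.prepare_input_sequence([text])   # text -> padded character ids

with torch.no_grad():
    mel, _, _ = tacotron2.infer(sequences, lengths)   # Spectrogram Prediction Network
    audio = waveglow.infer(mel)                       # Vocoder: mel -> waveform
```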

After Tacotron 2, there have been quite a few improvements and modifications applying newer techniques. At the present time, Tacotron 2 may no longer be the best algorithm, but Tacotron and Tacotron 2 are still the main foundations of the current SOTA algorithms.

4. References

https://arxiv.org/pdf/1712.05884.pdf

https://deepmind.com/blog/article/wavenet-generative-model-raw-audio

https://medium.com/spoontech/tacotron2-voice-synthesis-model-explanation-experiments-21851442a63c

Thank you for reading, see you in the next articles!


Source: Viblo