Foundational knowledge for speech processing

Tram Ho


I intend to write a series of at least 2 parts to provide background knowledge in speech processing. In this part, I will focus on the theory of sound and on phonetics. It may be boring, but if you want to dig deeper and grow in the field of speech processing, you need solid background knowledge rather than just skimming a few small algorithms and papers.

1. The principle of voice formation

First, a stream of air is pushed up from the lungs and puts pressure on the vocal folds in the larynx. Under that pressure, the vocal folds open to let the air through, the pressure drops, and the vocal folds close again automatically. The closure makes the pressure rise again and the process repeats. This open/close cycle repeats continuously, producing a sound wave with a base frequency of about 125 Hz for men and 210 Hz for women. That is why female voices tend to be higher than male voices. This frequency is called the fundamental frequency, F0.
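
To make F0 a bit more concrete, here is a minimal sketch that estimates the fundamental frequency of a synthetic "voiced" signal from its autocorrelation peak. The function name `estimate_f0`, the search range of 50 to 400 Hz, and the synthetic test signals are my own assumptions for illustration; real pitch trackers are considerably more robust than this.

```python
import numpy as np

def estimate_f0(signal, sample_rate, f_min=50.0, f_max=400.0):
    """Estimate the fundamental frequency (Hz) from the autocorrelation peak."""
    signal = signal - np.mean(signal)
    # Autocorrelation, keeping non-negative lags only (index 0 is lag 0).
    autocorr = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    # Only search lags that correspond to a plausible speech pitch (assumed range).
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)
    best_lag = lag_min + int(np.argmax(autocorr[lag_min:lag_max + 1]))
    return sample_rate / best_lag

def synthetic_voice(f0, sample_rate, duration=0.05):
    """A fundamental plus two weaker harmonics; a stand-in for real speech."""
    t = np.arange(int(duration * sample_rate)) / sample_rate
    return (np.sin(2 * np.pi * f0 * t)
            + 0.5 * np.sin(2 * np.pi * 2 * f0 * t)
            + 0.25 * np.sin(2 * np.pi * 3 * f0 * t))

sample_rate = 16000
male_f0 = estimate_f0(synthetic_voice(125, sample_rate), sample_rate)
female_f0 = estimate_f0(synthetic_voice(210, sample_rate), sample_rate)
print(f"male-like: {male_f0:.1f} Hz, female-like: {female_f0:.1f} Hz")
```

The estimate is limited by the sample period: the lag is a whole number of samples, so the result is only approximately equal to the true F0.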

Thus, the vocal folds create the fundamental frequency of the sound wave. However, producing speech also requires other organs such as the palate, oral cavity, tongue, teeth, lips, nose … These organs act as a “resonator”, like the body of a guitar, but one that can change shape flexibly. This resonator amplifies some frequencies and attenuates others to shape the sound. Its ability to change shape is what lets us produce the different sounds that make up speech.

The image above describes the mechanism in detail: Source + Filter → Output sound. In the spectrum of the output sound, we see 3 peaks; these peaks are called F1, F2, F3 … and are also known as formants. The value, position, and change over time of these peaks are characteristic of each phoneme. In traditional speech recognition, people try to extract this formant information, separating it from F0, and then use it for recognition.
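
The source + filter idea can be sketched in a few lines: a pulse train at F0 plays the role of the vocal folds, and a cascade of simple two-pole resonators plays the role of the vocal tract. Everything specific here is an assumption for illustration: the `resonator` helper, the formant centers (700, 1200, 2600 Hz) and their bandwidths are rough values I picked, not measurements.

```python
import numpy as np

def resonator(x, center_hz, bandwidth_hz, sample_rate):
    """Two-pole resonator: boosts frequencies near center_hz."""
    r = np.exp(-np.pi * bandwidth_hz / sample_rate)   # pole radius
    theta = 2 * np.pi * center_hz / sample_rate       # pole angle
    a1, a2 = 2 * r * np.cos(theta), -r * r
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = x[n]
        if n >= 1:
            y[n] += a1 * y[n - 1]
        if n >= 2:
            y[n] += a2 * y[n - 2]
    return y

sample_rate = 16000
f0 = 125

# Source: one pulse per open/close cycle of the vocal folds (1 second).
source = np.zeros(sample_rate)
source[::sample_rate // f0] = 1.0

# Filter: three resonances standing in for F1, F2, F3 (illustrative values).
formants = [(700, 100), (1200, 100), (2600, 150)]
output = source
for center_hz, bandwidth_hz in formants:
    output = resonator(output, center_hz, bandwidth_hz, sample_rate)

spectrum = np.abs(np.fft.rfft(output))
freqs = np.fft.rfftfreq(len(output), d=1 / sample_rate)

def magnitude_at(hz):
    """Spectrum magnitude at the bin closest to hz."""
    return spectrum[np.argmin(np.abs(freqs - hz))]

# A harmonic near F1 should be much stronger than one between F2 and F3.
print(magnitude_at(750), magnitude_at(2000))
```

The source spectrum has equally strong harmonics at every multiple of F0; after filtering, the harmonics near the resonances dominate, which is what shows up as formant peaks.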

Voiced and voiceless sounds (optional)

There are two types of sounds produced in the above process: voiced and voiceless. To make this easier to picture, place your hand on your throat and pronounce /b/: you will feel your throat vibrate, which means it is a voiced sound. In contrast, when pronouncing /th/, we do not feel this vibration, which means it is a voiceless sound.

2. Syllables, phonemes, and phones

2.1 Syllables

In English and many other languages, a word can be made up of several syllables. For example, the word “want” has 1 syllable, “wanna” has 2 syllables, “computer” has 3 syllables … In Vietnamese, by contrast, almost every syllable carries meaning, so we can treat each syllable as one word. A syllable is usually built around a vowel, with or without accompanying consonants.

Vowel: a sound produced without obstruction in the vocal tract and without interruption. Vietnamese has vowels such as a, o, e, i, u … To get a feel for it, try holding the sound /a/ for a long time: we can produce /a/ continuously, without any break.

Consonant: unlike vowels, consonants are produced with a complete or partial closure of the vocal tract, which breaks up the stream of vowels and creates clear pauses. Try reading a sentence with all of the consonants left out: you will sound like a child who has just come back from the dentist.

2.2 Phonemes and phones

Phoneme: in many languages, the same letter or letter sequence can be pronounced differently in different words. English is written with the 26 letters of the Latin alphabet but has about 44 phonemes. For example, the letter sequence “ough” in the following sentence has up to 6 different pronunciations.

Though I coughed roughly and hiccoughed throughout the lecture, I still thought I could plough through the rest of it.

In Text to Speech problems, we need to convert the written form into a sequence of phonemes. Vietnamese script is more phonetic, with a high consistency between how a word is written and how it is read. That may be one of our advantages when working with Vietnamese.
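
For a language like English, the simplest way to go from text to phonemes is a pronunciation dictionary. Below is a minimal sketch of that lookup. The lexicon is a tiny hand-written one using ARPAbet-style symbols without stress marks; `TOY_LEXICON` and `to_phonemes` are hypothetical names, and a real system would use a full dictionary plus rules or a model for unknown words.

```python
# Hand-written toy lexicon for illustration (ARPAbet-style, no stress marks).
TOY_LEXICON = {
    "though": ["DH", "OW"],
    "cough": ["K", "AO", "F"],
    "rough": ["R", "AH", "F"],
    "hiccough": ["HH", "IH", "K", "AH", "P"],
    "through": ["TH", "R", "UW"],
    "plough": ["P", "L", "AW"],
}

def to_phonemes(text):
    """Convert a string of known words into one flat phoneme sequence."""
    phonemes = []
    for word in text.lower().split():
        word = word.strip(".,!?")
        if word not in TOY_LEXICON:
            raise KeyError(f"no pronunciation for {word!r} in the toy lexicon")
        phonemes.extend(TOY_LEXICON[word])
    return phonemes

for word in TOY_LEXICON:
    print(f"{word:>9} -> {' '.join(to_phonemes(word))}")
```

All six words share the spelling “ough”, yet none of them share the phonemes that spelling maps to, which is why a plain letter-to-sound table is not enough for English.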

Phone: a phone is the realization of a phoneme. The same phoneme sounds different from person to person; for example, for the same word “three”, a male voice differs from a female voice, and person A's voice differs from person B's. See the picture below. It shows the phrase “she just had a baby” split into phonemes in the bottom row and realized as phones (the sound waveform).

In the field of speech recognition, we have the TIMIT dataset, a set of transcribed and time-aligned recordings of 630 American speakers. The dataset was collected and annotated by phonetics experts: each sound was listened to and marked with a clear start and end position.
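
As far as I know, TIMIT stores its phone-level annotation as plain text with one phone per line: start sample, end sample, and phone label, for audio sampled at 16 kHz. The sketch below parses that layout into times in seconds. The alignment values are made up for illustration and are not copied from the dataset; `parse_alignment` is a hypothetical helper.

```python
SAMPLE_RATE = 16000  # assumed sampling rate of the audio, in Hz

# Made-up alignment for illustration only; NOT real TIMIT data.
# "h#" is the label used for silence at the start/end of an utterance.
EXAMPLE_ALIGNMENT = """\
0 2000 h#
2000 3600 sh
3600 5200 iy
"""

def parse_alignment(text, sample_rate=SAMPLE_RATE):
    """Return a list of (phone, start_seconds, end_seconds) tuples."""
    segments = []
    for line in text.splitlines():
        if not line.strip():
            continue
        start, end, phone = line.split()
        segments.append((phone, int(start) / sample_rate, int(end) / sample_rate))
    return segments

segments = parse_alignment(EXAMPLE_ALIGNMENT)
for phone, start_s, end_s in segments:
    print(f"{phone:>3}  {start_s:.3f}s -> {end_s:.3f}s")
```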

3. How the ear works

In speech recognition, understanding how humans “listen” is more important than understanding how they “speak”.

The sounds and voices we hear every day are a mix of many waves with different frequencies. These frequencies usually range from 20 Hz to 20000 Hz. However, the human ear (like that of animals) works nonlinearly: a 20000 Hz sound is not perceived as 1000 times higher than a 20 Hz sound. In general, the human ear is very sensitive to low-frequency sounds and less sensitive to high frequencies.

When sound reaches the ear and hits the eardrum, the eardrum vibrates and passes the vibration through three small bones (malleus, incus, stapes) to the cochlea. The cochlea is a hollow, spiral-shaped organ, like a snail shell. It is filled with fluid that helps transmit sound, and along its length are sound-sensing hair cells. These hair cells vibrate as waves pass and send signals to the brain. The cells in the first section are stiffer and vibrate at higher frequencies. The deeper inside, the less stiff the cells become, and the lower the frequencies they respond to. Because of the structure of the cochlea, and because cells that respond to low frequencies make up the majority, the perception of the human (and animal) ear is nonlinear: sensitive at low frequencies and less sensitive at high frequencies.

In speech processing, we need a mechanism to map the audio signal captured by a sensor onto the sensitivity of the human ear. This mapping is done by the Mel filterbank. We will talk about the Mel filterbank in the next part, once we have enough background knowledge.
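
As a small preview of that nonlinearity, here is a sketch of the mel scale, using one widely used conversion formula, mel = 2595 · log10(1 + f / 700). This is only the frequency warping, not the filterbank itself, and the function names are my own.

```python
import numpy as np

def hz_to_mel(hz):
    """Convert frequency in Hz to mels (one common formula among several)."""
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

# The same 1000 Hz step is a big perceptual step at low frequencies
# and a small one at high frequencies.
low_step = hz_to_mel(2000) - hz_to_mel(1000)
high_step = hz_to_mel(20000) - hz_to_mel(19000)
print(f"1000 -> 2000 Hz:   {low_step:.0f} mel")
print(f"19000 -> 20000 Hz: {high_step:.0f} mel")
```

With this formula, 1000 Hz maps to roughly 1000 mel, and equal steps in mel correspond to wider and wider steps in Hz as the frequency grows.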

4. Fourier Transform

An indispensable piece of knowledge when working with audio signals is digital signal processing, and at its core is the Fourier transform.

Sound is a very long sequence of signal values, but it does not carry that much information. As I said earlier, sound is composed of waves of different frequencies, so let's think in the opposite direction: why not try to decompose a short piece of sound into waves with specific frequencies and amplitudes? The picture above illustrates this quite clearly. In the figure, a sound segment in the time domain is composed of two periodic waves. Because these two waves are periodic, instead of storing their values over time, we only need to store the frequency, amplitude, and phase of each wave. This gives us a more compact representation of that audio.
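
Here is a minimal sketch of that idea with NumPy: a signal built from two sine waves is stored as 1000 samples in the time domain, but its spectrum has only two significant peaks. The frequencies (50 Hz and 120 Hz) and amplitudes (1.0 and 0.5) are arbitrary values chosen for the example.

```python
import numpy as np

sample_rate = 1000                          # samples per second
t = np.arange(sample_rate) / sample_rate    # 1 second of time stamps

# Time domain: the sum of two periodic waves (1000 numbers).
signal = 1.0 * np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

# Frequency domain: real-input FFT and the frequency of each bin.
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)

# Scale bin magnitudes to sine amplitudes (valid away from DC and Nyquist).
amplitudes = 2 * np.abs(spectrum) / len(signal)

# Keep only the two strongest components.
strongest = np.argsort(amplitudes)[-2:]
components = sorted((float(freqs[i]), float(amplitudes[i])) for i in strongest)
for freq, amp in components:
    print(f"{freq:.0f} Hz with amplitude {amp:.2f}")
```

The two recovered (frequency, amplitude) pairs, together with the phases in `spectrum`, describe the whole signal.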

Thus, with the Fourier transform, we have converted information from the time domain to the frequency domain. In the other direction, the inverse Fourier transform converts information from the frequency domain back to the time domain. The Fourier transform is widely applied in signal processing (audio, images, communications) … If you have time, you should read more about it.

Fourier Transform formula for continuous functions:
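
One common way to write it (the sign and scaling conventions differ between textbooks), for a signal $x(t)$ and a frequency $f$ in Hz, together with its inverse:

$$X(f) = \int_{-\infty}^{\infty} x(t)\, e^{-j 2\pi f t}\, dt, \qquad x(t) = \int_{-\infty}^{\infty} X(f)\, e^{j 2\pi f t}\, df$$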

Discrete Fourier transform (DFT) formula:
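
In the convention used by most numerical libraries (again, other scalings exist), for a sequence of $N$ samples $x[0], \dots, x[N-1]$, together with its inverse:

$$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi k n / N}, \quad k = 0, \dots, N-1, \qquad x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k]\, e^{j 2\pi k n / N}$$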

The Fourier transform is invertible: once information has been transformed from the time domain to the frequency domain, we can apply the inverse Fourier transform to restore it from the frequency domain back to the time domain. Below is an illustration of a square wave decomposed into sine waves. We can see that the higher the value of N, the more accurate the approximation.
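
The square-wave case can be checked numerically. In the sketch below I assume N means the number of sine terms kept in the Fourier series of a square wave, (4/π) · Σ sin(2π(2k−1)t) / (2k−1); the helper name and the choice of N values are mine.

```python
import numpy as np

def square_wave_partial_sum(t, n_terms):
    """Sum of the first n_terms odd harmonics of a 1 Hz square wave."""
    odd = 2 * np.arange(1, n_terms + 1) - 1           # 1, 3, 5, ...
    harmonics = np.sin(2 * np.pi * np.outer(t, odd)) / odd
    return (4 / np.pi) * harmonics.sum(axis=1)

t = np.linspace(0, 1, 1000, endpoint=False)
target = np.sign(np.sin(2 * np.pi * t))               # ideal square wave

errors = []
for n_terms in (1, 5, 50):
    approx = square_wave_partial_sum(t, n_terms)
    rms_error = float(np.sqrt(np.mean((approx - target) ** 2)))
    errors.append(rms_error)
    print(f"N = {n_terms:>2}: RMS error = {rms_error:.3f}")
```

The RMS error shrinks as N grows, although the overshoot right at the jumps (the Gibbs phenomenon) never fully disappears.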

In practice, instead of the direct DFT algorithm, people use the FFT (Fast Fourier Transform), an efficient algorithm that computes the same result much faster. At this point, we have a basic understanding that the original audio signal is transformed into the frequency domain with the Fourier transform and then used for further computation. I will explain this more clearly in the next part.
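
To see that the FFT is just a faster way to get the DFT, here is a small sketch that computes the DFT directly from its definition (about N² multiplications) and compares it with `numpy.fft.fft`. The `naive_dft` function and the random test signal are only for illustration.

```python
import numpy as np

def naive_dft(x):
    """Direct DFT from the definition: an N x N matrix times the signal."""
    x = np.asarray(x, dtype=complex)
    n = len(x)
    k = np.arange(n)
    basis = np.exp(-2j * np.pi * np.outer(k, k) / n)  # e^{-j 2 pi k n / N}
    return basis @ x

rng = np.random.default_rng(0)
x = rng.standard_normal(256)

same_result = np.allclose(naive_dft(x), np.fft.fft(x))
round_trip = np.allclose(np.fft.ifft(np.fft.fft(x)).real, x)
print("naive DFT matches FFT:", same_result)
print("inverse FFT restores the signal:", round_trip)
```

Both give the same numbers; the difference is cost, roughly N² operations for the direct version versus on the order of N log N for the FFT.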

This part is rather dry theory, so it may not be that interesting. In the next part I will dig into feature extraction and Mel Frequency Cepstral Coefficients (MFCC). Thank you everyone for reading.

Part 2: Feature Extraction – MFCC for voice processing




Source : Viblo