Guess English stress with machine learning – Why not?

Tram Ho

Hi guys, have you ever “gone crazy” over word stress while learning English? I have. Even though I learned some stress rules at school, English, sadly, like any other language, always has exceptions. Personally, I don’t like things that cannot be deduced by logic, so although I love English dearly, I always treated stress placement as one of those impossible bonus exam questions and never really tried to master it.

The other day, while sitting down to tutor my younger brother, I suddenly wondered: are the rules of stress in English something a machine learning model can learn? Let’s find out together through this article!

Since I am learning TensorFlow, in this article I will use the TensorFlow libraries to try it out.

1. Dataset

In this article, I use The CMU Pronouncing Dictionary dataset from Carnegie Mellon University. This is a pronunciation dictionary for North American English that covers over 134,000 words and their pronunciations in the ARPAbet phoneme set, and it is commonly used for speech recognition and synthesis.

Below are a few examples of the CMU Pronouncing Dictionary’s transcriptions.
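For illustration, here are a few real entries in the dictionary’s format: the word, then its ARPAbet phones, with a digit attached to each vowel (0 = unstressed, 1 = primary stress, 2 = secondary stress):

```
HELLO  HH AH0 L OW1
WORLD  W ER1 L D
PYTHON  P AY1 TH AA0 N
```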

So from a word’s transcription, we can tell the number of syllables in the word (by counting the vowels) and where the stress falls (the vowel marked with a 1 carries the primary stress).
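To make this rule concrete, here is a small pure-Python sketch (the helper name is my own) that counts syllables and locates the primary stress in a list of ARPAbet phones:

```python
def syllables_and_stress(phones):
    """Count syllables and locate primary stress in an ARPAbet transcription.

    phones: a list of ARPAbet phones, e.g. ['HH', 'AH0', 'L', 'OW1'].
    Vowels carry a trailing digit; '1' marks the primary stress.
    """
    vowels = [p for p in phones if p[-1].isdigit()]
    stress = next((i + 1 for i, p in enumerate(vowels) if p.endswith("1")), None)
    return len(vowels), stress

print(syllables_and_stress(["HH", "AH0", "L", "OW1"]))  # (2, 2): two syllables, stress on the 2nd
```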

For more details, please refer to the link: http://www.speech.cs.cmu.edu/cgi-bin/cmudict

So the input of my model will be any English word, and the output is the syllable of that word that carries the stress. Because I could not find any other dataset, the model will only learn from the spelling of words, rather than from whole words the way we learned the stress rules at school.

2. Prepare the dataset

In this article, I will use tf.data, a TensorFlow API used to build an input pipeline for a machine learning model. tf.data helps process large amounts of data, read data from different formats, and perform complex transformations. The API is built around an abstraction called tf.data.Dataset, which represents a sequence of elements, where each element usually corresponds to one training example, i.e. a pair of tensors: input and label. Even more conveniently, since TensorFlow 2.0 we can pass a tf.data.Dataset directly to model.fit in Keras.

Arguments of the fit method in tf.keras.Model

My input is read from a text file; alternatively, a dataset can be created from a Python list (tf.data.Dataset.from_tensor_slices), from records in TFRecord format (tf.data.TFRecordDataset), or from a list of files (tf.data.Dataset.list_files).
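As a sketch, the different constructors look like this (the dictionary file path is hypothetical):

```python
import tensorflow as tf

# From an in-memory Python list
ds = tf.data.Dataset.from_tensor_slices(["HELLO  HH AH0 L OW1", "WORLD  W ER1 L D"])

# From a text file, one element per line (path is hypothetical):
# ds = tf.data.TextLineDataset("cmudict-0.7b.txt")

# From a glob of files:
# ds = tf.data.Dataset.list_files("data/*.txt")

first = next(iter(ds))
print(first.numpy())  # b'HELLO  HH AH0 L OW1'
```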

Each element in our dataset is a string tensor containing one dictionary line.

To be able to pass this dataset to model.fit, we need to convert each element of the dataset into a tuple of the form (input, target).

Fortunately, the tf.data API provides plenty of methods to help with this.

For the sake of simplicity, I will first filter out words that contain special characters such as apostrophes, dashes, and so on. We can use dataset.filter(predicate), where predicate is a function that maps each element of the dataset to True or False. For example, we can keep only the elements less than 3, passing either a lambda or a named predicate function.
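A minimal sketch of both forms, on toy data rather than the dictionary itself:

```python
import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4])

# With a lambda
filtered = ds.filter(lambda x: x < 3)
print(list(filtered.as_numpy_iterator()))  # [1, 2]

# Or with a named predicate function
def less_than_three(x):
    return x < 3

filtered2 = ds.filter(less_than_three)
print(list(filtered2.as_numpy_iterator()))  # [1, 2]
```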

However, because my filter_fn function uses expressions and functions that are not TensorFlow operations, I cannot pass it in directly as above. Instead, I have to go through tf.py_function to “wrap” the Python function into a TensorFlow operation so that it can run inside the TensorFlow graph.
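A sketch of this wrapping, assuming a plain-Python predicate named filter_fn that rejects words containing non-alphabetic characters (the exact predicate body is my guess):

```python
import tensorflow as tf

def filter_fn(word):
    # Plain Python string methods, not TensorFlow ops,
    # so this function must be wrapped with tf.py_function.
    w = word.numpy().decode("utf-8")
    return w.isalpha()

ds = tf.data.Dataset.from_tensor_slices(["HELLO", "IT'S", "CO-OP"])
ds = ds.filter(lambda w: tf.py_function(filter_fn, [w], tf.bool))
print([w.decode() for w in ds.as_numpy_iterator()])  # ['HELLO']
```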

Next, I will split the initial string read from the file into a tuple of an English word and its corresponding stress position, using the map method. Similar to filter, this method applies a map_fn function to each element of the dataset and returns a new dataset consisting of the transformed elements, in the same order. map_fn can change both the value and the structure of an element in the dataset.
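A sketch of such a map, again going through tf.py_function because the splitting uses plain Python string methods (the function name is my own):

```python
import tensorflow as tf

def split_line(line):
    parts = line.numpy().decode("utf-8").split()
    word, phones = parts[0], parts[1:]
    vowels = [p for p in phones if p[-1].isdigit()]
    # 1-based position of the vowel carrying primary stress
    stress = next(i + 1 for i, p in enumerate(vowels) if p.endswith("1"))
    return word, stress

ds = tf.data.Dataset.from_tensor_slices(["HELLO  HH AH0 L OW1"])
ds = ds.map(lambda line: tf.py_function(split_line, [line], [tf.string, tf.int32]))

word, stress = next(iter(ds))
print(word.numpy(), stress.numpy())  # b'HELLO' 2
```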

As a result, each transformed element is now a (word, stress) pair.

Next, in order to encode the input and build the output layer of the model, I need to find out how long the longest word in the dictionary is and how many distinct stress positions there are. So I have to go through every element of the dataset once, even though this procedure is quite time-consuming. While I’m at it, I will also count the number of elements in the dataset, since tf.data.TextLineDataset returns a dataset of unknown cardinality: only after iterating over the whole dataset once do we know how many elements it contains.
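The single pass can be sketched like this (a toy dataset stands in for the real dictionary):

```python
import tensorflow as tf

# Toy (word, stress) pairs standing in for the mapped dictionary
ds = tf.data.Dataset.from_tensor_slices((["HELLO", "WORLD", "PYTHON"], [2, 1, 1]))

max_len, stresses, count = 0, set(), 0
for word, stress in ds.as_numpy_iterator():
    max_len = max(max_len, len(word))   # longest word seen so far
    stresses.add(int(stress))           # distinct stress positions
    count += 1                          # dataset size

print(max_len, sorted(stresses), count)  # 6 [1, 2] 3
```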

Result:

Then I one-hot encode the inputs and labels. The longest word in my dictionary has 34 characters, and the stresses fall at positions 1 to 8. Thus, an input word is encoded into a vector of 34 × 26 = 884 elements, with each letter corresponding to a one-hot vector of 26 elements; if a word has fewer than 34 letters, the remaining positions are 0. Similarly, each label representing a word’s stress becomes a one-hot vector of 8 elements.
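A pure-NumPy sketch of this encoding, using the constants from the text (the helper names are mine):

```python
import numpy as np

MAX_LEN, ALPHABET, NUM_STRESSES = 34, 26, 8

def encode_word(word):
    """One-hot encode a word into a flat vector of MAX_LEN * ALPHABET = 884 elements."""
    vec = np.zeros(MAX_LEN * ALPHABET, dtype=np.float32)
    for i, ch in enumerate(word.upper()):
        vec[i * ALPHABET + (ord(ch) - ord("A"))] = 1.0
    return vec

def encode_stress(position):
    """One-hot encode a 1-based stress position into an 8-element vector."""
    label = np.zeros(NUM_STRESSES, dtype=np.float32)
    label[position - 1] = 1.0
    return label

x = encode_word("HELLO")
y = encode_stress(2)
print(x.shape, int(x.sum()), y.tolist())
# (884,) 5 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```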

Following the guide on getting better performance with the tf.data API on the TensorFlow home page ( link ), I add a step to cache the dataset after the last map step. When a dataset is cached, the transformations before the cache (opening the file, reading the data) only have to be performed during the first epoch; subsequent epochs reuse the cached data.

As mentioned above, a dataset read from tf.data.TextLineDataset has unknown shape and rank, so to feed it into the model I need to set its shape explicitly. At the same time, I batch the data and add a prefetching step. Simply put, prefetching overlaps two processes: training and loading data. For example, while the model is executing training step n, the input pipeline is already reading the data for step n + 1. This reduces training time and makes better use of the GPU (while the GPU trains, the CPU loads data, instead of the GPU having to wait for the CPU to finish after every step).
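A sketch of this final stage of the pipeline, with random stand-in data in the shapes described above (the batch size of 32 is my assumption):

```python
import numpy as np
import tensorflow as tf

# Stand-in encoded data: 100 examples of (884-dim input, 8-dim label)
x = np.random.rand(100, 884).astype("float32")
y = np.eye(8, dtype="float32")[np.random.randint(0, 8, size=100)]
ds = tf.data.Dataset.from_tensor_slices((x, y))

def set_shapes(inp, label):
    # After py_function-based maps the static shapes are unknown; fix them here.
    inp.set_shape([884])
    label.set_shape([8])
    return inp, label

ds = (ds.map(set_shapes)
        .cache()                       # replay from memory after the first epoch
        .batch(32)                     # batch size is my assumption
        .prefetch(tf.data.AUTOTUNE))   # overlap training and data loading

xb, yb = next(iter(ds))
print(xb.shape, yb.shape)  # (32, 884) (32, 8)
```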

You can read more details about prefetch here.

3. Split train / val / test

Because the dataset I use is a dictionary, sorted alphabetically and containing many related words that differ only by word form (for example, PRODUCE, PRODUCT, PRODUCTS, PRODUCTION, etc.), I will not shuffle first. Instead, I split into train / validation / test and then shuffle only the training set, to prevent the model from simply memorizing words it has already seen.
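Sketching the split with take and skip (the 80/10/10 proportions are my assumption; N stands in for the element count found earlier):

```python
import tensorflow as tf

N = 100                      # stand-in for the counted dataset size
ds = tf.data.Dataset.range(N)

n_train = int(0.8 * N)
n_val = int(0.1 * N)

train_ds = ds.take(n_train)                 # first 80%
val_ds = ds.skip(n_train).take(n_val)       # next 10%
test_ds = ds.skip(n_train + n_val)          # remaining 10%

# Shuffle only the training set, after splitting
train_ds = train_ds.shuffle(buffer_size=n_train)

print(len(list(train_ds)), len(list(val_ds)), len(list(test_ds)))  # 80 10 10
```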

The tf.data API supports shuffling. Note that for a perfect shuffle, buffer_size needs to be greater than or equal to the size of the dataset. You can also add the argument reshuffle_each_iteration=True to reshuffle at every epoch. (This is a new feature since TF 2.0; before that, if you wanted the shuffle order to change, you had to use the repeat step.)
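A minimal sketch of this shuffle on a toy dataset:

```python
import tensorflow as tf

ds = tf.data.Dataset.range(10)
# buffer_size >= dataset size gives a uniform shuffle;
# reshuffle_each_iteration=True (TF >= 2.0) draws a new order every epoch
shuffled = ds.shuffle(buffer_size=10, reshuffle_each_iteration=True)

# Same elements, new order each pass
print(sorted(int(x) for x in shuffled.as_numpy_iterator()))  # [0, 1, ..., 9]
```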

4. Model

In this article, I will use a simple neural network with 2 hidden layers.

Input is a one-hot vector of 884 elements representing English words.

The output is an 8-element vector giving the probability that the stress falls on each position from 1 to 8.

Since my input vector is a sparse vector, I add a 1D convolutional layer and flatten its output before connecting it to the dense layers, as follows:
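A sketch of such an architecture (the filter count, kernel size, and hidden-layer widths are my assumptions, not the article’s exact values):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(884,)),           # 34 letter slots x 26-way one-hot
    tf.keras.layers.Reshape((34, 26)),             # one row per letter position
    tf.keras.layers.Conv1D(64, 3, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(8, activation="softmax"),  # stress positions 1..8
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
print(model.output_shape)  # (None, 8)
```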

In addition, I also use some callbacks from the Keras Callbacks API during training, to save time and avoid overfitting:
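For example (the specific callbacks and their settings here are my guesses at typical choices, not necessarily the article’s):

```python
import tensorflow as tf

callbacks = [
    # Stop when validation loss stops improving, keeping the best weights
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                     restore_best_weights=True),
    # Lower the learning rate when validation loss plateaus
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss",
                                         factor=0.5, patience=2),
]

# Hypothetical usage, assuming train_ds / val_ds from the pipeline above:
# model.fit(train_ds, validation_data=val_ds, epochs=50, callbacks=callbacks)
```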

Let’s run it

After 32 epochs, my model reaches an accuracy of 92.7% on the training set and 91.8% on the validation set. The highest accuracy achieved on the training set is 94.5%.

Evaluate on the test set:

Results: 86.9% on the test set

I will sample a few results to inspect them visually:

Some “answers”:

Conclusion

So in this article, I used TensorFlow’s tf.data API to build an input pipeline and fed it into a machine learning model built with Keras to predict the stress of an English word. The result is nearly 87% on the test set. That number is not high, but it is enough for me to conclude that machine learning can “learn” part of the English stress rules based on the arrangement of letters within words. If the part of speech were included as well, I think the accuracy would be higher, but that part will have to wait for a future opportunity.

I have only recently started with machine learning in general and TensorFlow in particular, so if there are any mistakes, I hope to hear from you.


Source : Viblo