Create Language Model to automatically generate Vietnamese text

Monday, 03/06/2019

Tram Ho

Google search reads your thoughts

This is no longer a novelty, but it remains a great feature for improving the experience of Google users.
When you start typing in the search box on the Google Search page, you see suggestions for the next few words, or even for the rest of the query. Google Search generates these suggestions automatically with an algorithm, without human intervention (they are not the opinions or questions of any individual or of Google). One component of such a system is the subject of this article: the Language Model.

Target

As the title says, today I will show you how to create a Language Model using a simple RNN, a Deep Learning network. We will then use it to generate text and evaluate how well it works.

What is a Language Model

If you have been studying Natural Language Processing, this term is probably familiar. The purpose of the model is to provide a probability distribution over text, which tells us whether a sentence is "reasonable" in the given language (here, Vietnamese), or whether a word added after a sentence fits the context of the preceding words.

A simple example is the search box described above: given the input "weather", the probability that the next word is "internal", "static" or "giang" will be higher than the probability of many other words.
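To make the idea of a next-word probability concrete, here is a minimal sketch that estimates it by counting word pairs. The tiny corpus and the function name next_word_probs are made up for illustration, and counting pairs is a much simpler technique than the RNN built in this article.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus; real training data would be far larger.
corpus = [
    "hà nội is the capital",
    "hà nội is crowded",
    "hà giang is in the north",
]

# Count how often each word follows each other word.
pair_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        pair_counts[prev][nxt] += 1

def next_word_probs(prev):
    """Return P(next word | prev) estimated from the pair counts."""
    counts = pair_counts[prev]
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

probs = next_word_probs("hà")
print(probs)
```

In this toy corpus "nội" follows "hà" more often than "giang" does, so it receives the higher probability. A neural Language Model plays the same role, but conditions on a much longer context than one word.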

Recently, Language Models have come to play an important role in many natural language processing applications and problems, and they are a research topic attracting the attention of programmers and researchers around the world. However, since each country uses a language with its own grammar, culture and vocabulary, the Language Model for each language has to solve problems specific to it. In the next section, we will build a Language Model for Vietnamese!

Implementation steps

Like most Machine Learning problems, this one takes 2 main steps:

Data pre-processing
Model building and training

1. Processing data

The first thing to do is find and collect data. For this problem, I found a data set of more than 40,000 articles and news items in 8 different fields. You can download it here:

https://github.com/hoanganhpham1006/English_Language_Model/blob/master/Train_Full.zip

However, we do not need all of it. I will use only about a quarter of the data.

Reading one data file gives the following result:

Close all roads that move poultry to HCMC n January 15, veterinary forces coordinate with police and traffic inspectors in Ho Chi Minh City to open control points on every suburban street. Performing the electricity work of the Ministry of Agriculture and Rural Development a day earlier, the functional forces blocked any chickens from outside the province from entering the city.In Binh Chanh District Animal Quarantine Station, the gateway In the west of the city, the deputy head of the station, Pham Ngoc Lanh, said: "From the time of the epidemic up to now, the brothers have changed here.

At first glance we can see several problems:

Mixed uppercase and lowercase letters: this is normal for us, but computers distinguish between uppercase and lowercase, which makes processing more complex even though the meaning of a word is basically the same in either case.
Punctuation and extra characters: periods, commas and odd characters appear many times in the text. They also make processing harder and slower.
Abbreviations and proper names: for example, HCMC.

These issues are common to most data we collect when working on natural language processing. In the previous post, on "A text summary program", I already covered this at some length, and I will go over it again in today's lesson.

We will apply several processing steps to solve the problems above:

Join the syllables of each multi-syllable Vietnamese word (word segmentation) before splitting the text, so that words keep their meaning
Convert everything to lower case
Remove all punctuation and extra characters

import string
from pyvi import ViTokenizer

def clean_document(doc):
    doc = ViTokenizer.tokenize(doc)  # Segment words with pyvi; multi-syllable words are joined by "_"
    doc = doc.lower()  # Lower-case everything
    tokens = doc.split()  # Split into words
    # Remove all punctuation except "_", which marks the joined words
    table = str.maketrans('', '', string.punctuation.replace("_", ""))
    tokens = [w.translate(table) for w in tokens]
    tokens = [word for word in tokens if word]  # Drop tokens left empty
    return tokens

The result of this step is a list of words:

 ['close the door', 'all', 'recline', 'line', 'move', 'gia_cầm', 'enter', 'tp', 'hcm', 'date', '151', 'power_lượng', ...]

Doing this for every text we want to use gives one word list per text. However, this is not yet the form we need to feed into our model.

Model of RNN
Our goal is to feed in a passage (a number of words or sentences) and predict the next word. The RNN model we build will therefore take 50 words as input (the number is up to you) and output 1 word. The training data will consist of many 51-word sequences: the first 50 words are the input and the last word is the label.

INPUT_LENGTH = 50
sequences = []
for f in file_list:  # file_list holds the names of the text files
    with open(f, encoding='utf-16') as f1:
        doc = f1.read()
    tokens = clean_document(doc)

    # Slide a window of INPUT_LENGTH + 1 words over the document
    for i in range(INPUT_LENGTH + 1, len(tokens)):
        seq = tokens[i - INPUT_LENGTH - 1:i]
        line = ' '.join(seq)
        sequences.append(line)

In the code above, file_list is a list of file names, each file containing one text. For each text, I run the preprocessing, then join every 51 consecutive words (the first 50 as data, the last 1 as the label) to create the training sequences. (Words 1 to 51 form one sequence, words 2 to 52 the next, and so on.)
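The sliding window is easier to see on a small scale. The sketch below uses a made-up five-token list and a window of 3 input words plus 1 label; make_windows is a hypothetical helper, not part of the article's code.

```python
def make_windows(tokens, input_length):
    """Return every run of input_length + 1 consecutive tokens."""
    size = input_length + 1
    return [tokens[i:i + size] for i in range(len(tokens) - size + 1)]

tokens = ["a", "b", "c", "d", "e"]
windows = make_windows(tokens, input_length=3)

for window in windows:
    # The last token of each window is the label, the rest is the input.
    print("input:", window[:-1], "label:", window[-1])
```

Each window overlaps the previous one by all but one word, so a document of N tokens yields roughly N training sequences.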

The final preprocessing step is to "digitize" all the words in the sequences we have just built. All current Deep Learning models work by optimizing operations on numbers, and the RNN behind our Language Model is no exception.

There are many ways to do this. In the last article I introduced the word2vec method, which represents each word by a vector. In this article I want to introduce a simpler method: we build a lookup table in which each distinct word is denoted by a unique integer.
Keras helps us build this table with the keras.preprocessing.text.Tokenizer class.

import keras

tokenizer = keras.preprocessing.text.Tokenizer(filters='!"#$%&()*+,-./:;<=>?@[]^`{|}~')
tokenizer.fit_on_texts(sequences)

The filters parameter lists the characters to be dropped. Because we are dealing with Vietnamese, and the multi-syllable words were joined with the "_" character in the previous step (for example 'gia_cầm'), I removed "_" from the filters parameter.

The fit_on_texts function builds the word-to-number table we need.

You can view the resulting table through the word_index attribute:

 tokenizer.word_index

 {'port_ao': 30919, 'would': 3224, '114': 17777, 'ceiling_bason_minh': 16017, 'abstract_a': 23416, 'left-hand-side': 3944, 'nguyễn_thị_xu_feature': 27864, 'split_en': 12470, 'chocolate': 5705, 'seems_the_sexy': 34437, 'natalie_zhu': 30224, 'narain': 32563, ...}

Once the table is available, we convert every word in every sequence to its corresponding number:

sequences_digit = tokenizer.texts_to_sequences(sequences)

After this step, we have sets of numbers ready for the next part.
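For readers who want to see what such a table looks like without installing Keras, here is a rough pure-Python sketch of the same idea. The helpers build_word_index and encode are hypothetical, and the alphabetical tie-break is an assumption of this sketch; it only mimics the general behavior of giving more frequent words smaller integers, starting from 1.

```python
from collections import Counter

def build_word_index(texts):
    """Map each distinct word to an integer, most frequent word first.

    Index 0 is left unused so that it can serve as padding later.
    Ties are broken alphabetically (an assumption of this sketch).
    """
    counts = Counter(word for text in texts for word in text.split())
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {word: i for i, (word, _) in enumerate(ordered, start=1)}

def encode(texts, word_index):
    """Replace each known word with its integer; unknown words are skipped."""
    return [[word_index[w] for w in text.split() if w in word_index]
            for text in texts]

texts = ["gia_cầm vào tp", "tp hcm"]
word_index = build_word_index(texts)
encoded = encode(texts, word_index)
print(word_index)
print(encoded)
```

The important property is that the same table is used for training and for later decoding, which is why the article saves the tokenizer alongside the model.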

2. Building and training the Language Model

Before feeding the data into training, we need to define and standardize the inputs and outputs:

import numpy as np

vocab_size = len(tokenizer.word_index) + 1  # +1 because index 0 is reserved

# Separate into input and output
sequences_digit = np.array(sequences_digit)
X, y = sequences_digit[:, :-1], sequences_digit[:, -1]
y = keras.utils.to_categorical(y, num_classes=vocab_size)
seq_length = X.shape[1]

We split each sequence into 2 parts as planned: the first 50 words (the first word through the second-to-last) are the training data, and the last word is the label.
Next, the labels are converted to one-hot vectors with the to_categorical function of keras.utils.
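As a small illustration of the split and the one-hot encoding, the sketch below works on one made-up sequence of 4 integers with a toy vocabulary of 8 entries. The one_hot helper is hypothetical and only shows what to_categorical produces for a single label.

```python
def one_hot(index, num_classes):
    """Return a list of zeros with a single 1 at position index."""
    vec = [0] * num_classes
    vec[index] = 1
    return vec

sequence = [4, 2, 7, 1]  # toy encoded sequence: 3 input words + 1 label
x, label = sequence[:-1], sequence[-1]
y = one_hot(label, num_classes=8)

print(x)  # input part
print(y)  # one-hot label
```

Each label becomes a vector with vocab_size entries, so memory use grows with the number of sequences times the vocabulary size. If that becomes a problem, one option is to keep the integer labels and compile the model with the sparse_categorical_crossentropy loss instead, which is a swap from what the article does.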

Our RNN model for the Language Model has just 2 LSTM layers, as follows:

vocab_size = len(tokenizer.word_index) + 1

from keras.models import Sequential
from keras.layers import Embedding, BatchNormalization, LSTM, Dense, Dropout

model = Sequential()
model.add(Embedding(vocab_size, 50, input_length=50))
model.add(BatchNormalization())
model.add(LSTM(512, return_sequences=True))
model.add(LSTM(512))
model.add(Dense(100, activation='relu'))
model.add(Dropout(0.2))
model.add(BatchNormalization())
model.add(Dense(vocab_size, activation='softmax'))
model.summary()

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

To reduce training time, the LSTM layers here use only 512 units each. After the 2 LSTM layers, I use 1 Dense layer with 100 outputs, and finally 1 Dense layer with as many outputs as there are entries in the dictionary (vocab_size).

Everything is ready. Now it is time to start training the model (and waiting...):
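To get a feel for how the layer sizes translate into trainable weights, the following sketch applies the standard parameter-count formulas for LSTM, Dense and Embedding layers. The vocabulary size of 35,000 is a hypothetical round number chosen for illustration, not a value reported in the article, and the BatchNormalization layers are left out for brevity.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * (input_dim + units + 1) * units

def dense_params(input_dim, units):
    # one weight per input plus one bias, for every unit
    return (input_dim + 1) * units

vocab_size = 35_000  # hypothetical, for illustration only

embedding = vocab_size * 50             # one 50-dimensional vector per word
first_lstm = lstm_params(50, 512)       # input is the 50-dim embedding
second_lstm = lstm_params(512, 512)     # input is the first LSTM's output
hidden_dense = dense_params(512, 100)
output_dense = dense_params(100, vocab_size)

for name, count in [("embedding", embedding), ("first_lstm", first_lstm),
                    ("second_lstm", second_lstm), ("hidden_dense", hidden_dense),
                    ("output_dense", output_dense)]:
    print(f"{name}: {count:,}")
```

Under these assumptions the output layer and the second LSTM layer hold most of the weights, which is why both the LSTM width and the vocabulary size have a large effect on training time.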

model.fit(X, y, batch_size=512, epochs=100)

Now we wait for the results. I am using a GPU on Google Colab, and one epoch of training takes more than 15 minutes (I am using about 1 million sequences for this problem). I plan to train for 100 epochs.

Epoch 1/100
1953/1953 [==============================] - 1234s 624ms/step - loss: 5.0340 - acc: 0.1684
Epoch 2/100
1953/1953 [==============================] - 1233s 624ms/step - loss: 4.9822 - acc: 0.1726
...
Epoch 99/100
1953/1953 [==============================] - 1310s 671ms/step - loss: 2.1418 - acc: 0.5137
Epoch 100/100
1150/1953 [================>.............] - ETA: 8:58 - loss: 2.1458 - acc: 0.5125
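A quick back-of-the-envelope calculation helps when planning a run like this. The sketch below assumes a round 1,000,000 sequences (the article only says "about 1 million") and takes the 1234 seconds per epoch from the first line of the log.

```python
import math

def steps_per_epoch(n_samples, batch_size):
    """Number of batches needed to see every sample once."""
    return math.ceil(n_samples / batch_size)

steps = steps_per_epoch(1_000_000, 512)  # hypothetical round sample count
seconds_per_epoch = 1234                 # taken from the log
total_hours = seconds_per_epoch * 100 / 3600

print(steps)
print(round(total_hours, 1))
```

With exactly 1,000,000 sequences the formula gives 1954 steps, one more than the 1953 in the log, which fits a data set slightly under 1 million sequences. At this speed, 100 epochs take well over a day of GPU time.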

After 100 epochs, we have a model with 51.25% accuracy. I deliberately stopped training here because this is the level of accuracy I wanted. Too high an accuracy makes the Language Model tend to "memorize" the training text and lose the "creativity" it needs, which underpins the model's ability to understand context. If you want, you can continue training.

We save this model for later use. Along with it, we need to save the tokenizer (the word-to-number lookup table) to a pkl file so that decoding works correctly later. I also save sequences_digit so that we do not spend time converting words to numbers again.

import pickle

model.save('51_acc_language_model.h5')

with open('tokenizer.pkl', 'wb') as f:
    pickle.dump(tokenizer, f)

with open('sequences_digit.pkl', 'wb') as f:
    pickle.dump(sequences_digit, f)

That completes the training, and we now have a Language Model for Vietnamese. With everything kept as minimal as possible, hopefully you have all managed it successfully.

Testing text generation with the Language Model

If you run this test in a new file or a new environment, first reload the model and the tokenizer.

import pickle
from keras.models import load_model

with open('tokenizer.pkl', 'rb') as f:
    tokenizer = pickle.load(f)

# Same file name as used when saving
with open('sequences_digit.pkl', 'rb') as f:
    sequences_digit = pickle.load(f)

model = load_model('51_acc_language_model.h5')

For our model to generate text, we first need to supply and process the input. The input can be any text; it has to be cleaned in the same way as the training data and then encoded into numbers using the same table we used when training the model.

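import numpy as np
from keras.preprocessing.sequence import pad_sequences

def preprocess_input(doc):
    tokens = clean_document(doc)
    # texts_to_sequences expects a list of texts, so join the words back into one text
    tokens = tokenizer.texts_to_sequences([' '.join(tokens)])
    # Keep the last 50 tokens, or pad with zeros at the front up to 50
    tokens = pad_sequences(tokens, maxlen=50, truncating='pre')
    return np.reshape(tokens, (1, 50))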

The code above uses Keras's pad_sequences function to make sure our input is always a sequence of 50 elements. If the input has fewer than 50 elements, zeros are added at the front until there are 50; if it has more, only the last 50 are kept.
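The padding and truncation rule can be written out in a few lines of plain Python. This is a sketch of the behavior for a single sequence with pad_sequences' default front padding and truncating='pre'; pad_or_truncate is a hypothetical helper, not a Keras function.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Pad with value at the front, or keep only the last maxlen items."""
    if len(seq) >= maxlen:
        return seq[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

print(pad_or_truncate([5, 6], maxlen=4))           # short input gets zeros in front
print(pad_or_truncate([1, 2, 3, 4, 5], maxlen=3))  # long input keeps its last items
```

Keeping the end of a long input makes sense here because the words closest to the prediction point carry the most relevant context.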

To "predict" the next word, the one with the highest probability given our input sequence, we only need a single call to the model's predict_classes method.

model.predict_classes(tokens)
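Under the hood, picking the predicted class means taking the index of the largest probability in the model's output. The sketch below shows that step on a made-up probability vector; argmax is a small hypothetical helper written in plain Python.

```python
def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

# Made-up softmax output over a 4-word vocabulary
probs = [0.05, 0.10, 0.60, 0.25]
next_index = argmax(probs)
print(next_index)
```

This matters in practice because predict_classes was removed from later TensorFlow/Keras releases. On those versions, the usual replacement is np.argmax(model.predict(tokens), axis=-1), which performs the same selection on the real model output.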

The result is a number corresponding to the word with the highest probability of following our input sequence. I then append this word to the input sequence and drop the oldest word, so that the Language Model can predict the next word, and repeat until we have the number of words we want.

def generate_text(text_input, n_words):
    tokens = preprocess_input(text_input)
    for _ in range(n_words):
        next_digit = model.predict_classes(tokens)
        tokens = np.append(tokens, next_digit)  # Add the predicted word
        tokens = np.delete(tokens, 0)  # Drop the oldest word
        tokens = np.reshape(tokens, (1, 50))

    # Mapping to text
    tokens = np.reshape(tokens, (50,))
    out_word = []
    for token in tokens:
        for word, index in tokenizer.word_index.items():
            if index == token:
                out_word.append(word)
                break
    return ' '.join(out_word)

Once a sequence has been generated, the last thing to do is decode the output numbers into the corresponding Vietnamese words and join them into sentences.
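The decoding step can also be done with a reversed lookup table, which avoids scanning the whole word index for every token. The sketch below uses a made-up three-word table; decode is a hypothetical helper, and the idea of reversing the dictionary is a swap from the nested loop used in the article.

```python
# Toy word-to-number table standing in for tokenizer.word_index
word_index = {"tp": 1, "gia_cầm": 2, "hcm": 3}

# Reverse it once: number -> word
index_word = {index: word for word, index in word_index.items()}

def decode(indices, index_word):
    """Join the words for the given indices; padding (0) and unknown indices are skipped."""
    return " ".join(index_word[i] for i in indices if i in index_word)

text = decode([0, 0, 2, 1, 3], index_word)
print(text)
```

Building the reversed table once makes each lookup a single dictionary access, which helps when the vocabulary has tens of thousands of entries.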

Here are some of the results I have tested from giving my Language Model a text:

Input

 street in vietnam

Output

 creating many features of other artists who are different from many works is considered to be a place where many columns of income are rated.

Input

 At the dialogue, many businesses said that they have to receive many inspection and examination teams from all levels every year, which is unintentional and troublesome.

Output

 greater than the year before the end of the year, the investors have just announced 2 usd while the country has been able to export the current interest rate to continue the process

The results look plausible in the context of the input. In practice, text generation is usually supported by more information, but here a single input paragraph is enough. Try feeding in your own input and generating text with your model!

Summary

In this post, I have tried to introduce one way to build a Language Model for Vietnamese, a topic that is receiving a lot of attention and lies at the core of natural language processing systems. With a simple network structure (RNN), I hope you will have no trouble building this model yourself.

In summary, these are the points to take away from the article:

How to pre-process text data for Language Model training
How to build and train a Language Model
How the Language Model works, and how to use the newly trained model to generate text

Source : viblo.asia
