NLP: Khmer Romanization using Seq2Seq

Tram Ho


In our previous article, we implemented an application to convert Khmer words into Roman script by writing the logic from scratch, following a given paper, because we didn't have enough data to apply deep learning to this problem. However, we noticed that Google Translate also romanizes Khmer words. Therefore, we can take the Khmer word list from our previous article and collect the romanization of each word. We can then use this data to train a model that converts Khmer words to Roman script.

Plan of attack

There are many machine learning algorithms that we could use to solve our problem. Since our problem is to build a model that translates Khmer words into Roman script, one particular architecture stands out: Seq2Seq. A Seq2Seq model takes a sequence as input (words, letters, time series, etc.) and outputs another sequence. This kind of model has achieved a lot of success in tasks like machine translation, text summarization, and image captioning. Google Translate started using such a model in production in late 2016. We also used this model in our earlier article about building a chatbot.


For this experiment, we are using Keras to develop our Seq2Seq model. Luckily, Keras has a tutorial about building a model that translates English to French. We will modify that code to translate Khmer words to Roman script instead. If any part of my code is unclear, you can check the original code for more explanation here.

First, we import the packages we need:

Then we load the data into memory using pandas:
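A minimal sketch of this loading step. It assumes the data is a tab-separated file with one Khmer word and its romanization per line; that layout, the column names `khmer` and `roman`, and the two sample pairs are illustrative assumptions, not the dataset used in this article. An in-memory string stands in for the file so the snippet runs on its own.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the real file: "Khmer<TAB>Roman" per line.
sample = "ខ្មែរ\tkhmer\nកម្ពុជា\tkampuchea\n"

# With a real file, pass its path instead of the StringIO object.
df = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    header=None,
    names=["khmer", "roman"],  # column names chosen for this sketch
    dtype=str,
)

# Drop rows where either side is missing, so every word has a target.
df = df.dropna().reset_index(drop=True)
print(df)
```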

Once the data is loaded, we need to clean it and split it into sets of unique individual characters:
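The cleaning and character-splitting step might look like the sketch below. The sample pairs are illustrative, and the tab and newline start/end markers follow the convention of the Keras English-to-French tutorial; whether this article used the same markers is an assumption.

```python
# Illustrative (Khmer, Roman) pairs; the real list comes from the loaded data.
pairs = [("ខ្មែរ", "khmer"), ("កម្ពុជា", "kampuchea")]

START, END = "\t", "\n"  # assumed start/end-of-sequence markers

input_texts, target_texts = [], []
input_chars, target_chars = set(), set()

for khmer, roman in pairs:
    khmer = khmer.strip()
    # Wrap the target so the decoder knows where a word starts and ends.
    roman = START + roman.strip().lower() + END
    input_texts.append(khmer)
    target_texts.append(roman)
    # Iterating a str yields Unicode code points, so a Khmer consonant
    # cluster is split into its consonants and the subscript sign.
    input_chars.update(khmer)
    target_chars.update(roman)

input_chars = sorted(input_chars)
target_chars = sorted(target_chars)

# The longest sample on each side fixes the sequence lengths used later.
max_encoder_len = max(len(t) for t in input_texts)
max_decoder_len = max(len(t) for t in target_texts)

print(len(input_chars), len(target_chars), max_encoder_len, max_decoder_len)
```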

Next, we initialize the arrays for the input and output sequences, sized according to the maximum length of the input and output samples.

Then, we one-hot encode our input and output data before passing it to the model:
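As a sketch of the array setup and encoding, the snippet below builds three one-hot tensors in the style of the Keras tutorial: encoder input, decoder input, and a decoder target that is the decoder input shifted one step ahead (teacher forcing). The sample texts are illustrative, and leaving padded positions as all-zero vectors is a simplification made here; the Keras tutorial pads with a space character instead.

```python
import numpy as np

# Illustrative samples; targets already carry "\t" (start) and "\n" (end).
input_texts = ["ខ្មែរ", "កម្ពុជា"]
target_texts = ["\tkhmer\n", "\tkampuchea\n"]

# Map every character to an integer index.
input_index = {ch: i for i, ch in enumerate(sorted(set("".join(input_texts))))}
target_index = {ch: i for i, ch in enumerate(sorted(set("".join(target_texts))))}

max_enc = max(map(len, input_texts))
max_dec = max(map(len, target_texts))

# Shape: (samples, time steps, vocabulary size), initially all zeros.
encoder_input = np.zeros((len(input_texts), max_enc, len(input_index)), dtype="float32")
decoder_input = np.zeros((len(target_texts), max_dec, len(target_index)), dtype="float32")
decoder_target = np.zeros_like(decoder_input)

for i, (src, tgt) in enumerate(zip(input_texts, target_texts)):
    for t, ch in enumerate(src):
        encoder_input[i, t, input_index[ch]] = 1.0
    for t, ch in enumerate(tgt):
        decoder_input[i, t, target_index[ch]] = 1.0
        if t > 0:
            # The target at step t-1 is the character fed in at step t,
            # so the start marker never appears in the target.
            decoder_target[i, t - 1, target_index[ch]] = 1.0

print(encoder_input.shape, decoder_input.shape, decoder_target.shape)
```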

Using Keras, we can build a seq2seq model with ease:
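The sketch below follows the structure of the Keras character-level seq2seq tutorial: an LSTM encoder whose final states initialize an LSTM decoder, followed by a softmax layer over the target characters. The vocabulary sizes and `latent_dim` are placeholder values assumed for illustration, and the `tensorflow.keras` import path is also an assumption about the installed version.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Placeholder sizes; in practice they come from the character sets.
num_encoder_tokens = 70   # assumed number of distinct Khmer characters
num_decoder_tokens = 30   # assumed number of distinct Roman characters
latent_dim = 256          # assumed size of the LSTM state

# Encoder: read the Khmer characters and keep only the final states.
encoder_inputs = keras.Input(shape=(None, num_encoder_tokens))
_, state_h, state_c = layers.LSTM(latent_dim, return_state=True)(encoder_inputs)
encoder_states = [state_h, state_c]

# Decoder: predict the next Roman character at every step,
# starting from the encoder states.
decoder_inputs = keras.Input(shape=(None, num_decoder_tokens))
decoder_lstm = layers.LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_outputs = layers.Dense(num_decoder_tokens, activation="softmax")(decoder_outputs)

model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```

The model takes two inputs during training because the decoder is fed the correct previous character at each step rather than its own prediction.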

Then, we can start training our model:

And don’t forget to save the trained model if you don’t want to retrain it:


Once training is complete, we can test our model and check the result:
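At test time the decoder no longer sees the correct previous character, so the output is generated one character at a time. The sketch below shows that greedy decoding loop in a model-agnostic form. `greedy_decode`, `step_fn`, and `toy_step` are hypothetical names introduced here; with a trained Keras model, `step_fn` would wrap a call to the decoder's `predict`, and the initial state would come from the encoder.

```python
def greedy_decode(step_fn, initial_state, start="\t", end="\n", max_len=30):
    """Generate characters one at a time until `end` or `max_len` is reached."""
    out, prev, state = [], start, initial_state
    for _ in range(max_len):
        # step_fn returns a dict of character -> probability and the new state.
        probs, state = step_fn(prev, state)
        prev = max(probs, key=probs.get)  # pick the most likely character
        if prev == end:
            break
        out.append(prev)
    return "".join(out)


def toy_step(prev_char, state):
    """Toy stand-in for a trained decoder: spells out a fixed word."""
    word = "khmer\n"
    return {word[state]: 1.0}, state + 1


print(greedy_decode(toy_step, 0))  # prints "khmer"
```

Greedy decoding keeps only the single best character at each step; beam search is a common alternative when the outputs look poor.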

Let’s run it.

Based on the result, it seems our model is overfitted. So, it's your turn to improve this model and make it more awesome.


What’s next?

In this article, we learned how to prepare our text data, and we created a model that is trained on the processed data to translate Khmer words into Roman script. We used an architecture called seq2seq (encoder-decoder), which is suitable for sequence problems like ours, where the input sequence is a Khmer word and the output sequence is its romanized form, and the two can differ in length. However, our model does not produce good predictions yet, so it's your turn to improve it to compete with Google.


Source : Viblo