Learn about and build chatbot systems in practice

Tram Ho

Trends and effectiveness of using Chatbot in recent years

At the time of writing, "chatbot" has become such a familiar term that everyone who hears it has their own picture of what it means. According to HubSpot statistics from early 2018, goods sold through chatbots accounted for more than 47% of goods sold to users worldwide, and that figure has certainly grown since. Besides being always available to talk to and advise users 24/7, chatbots also save us a great deal of staff time that can be invested in other activities.

The benefits of these systems are fairly obvious. Chatbot systems have become more flexible and more polished over time, most visibly the ones built by tech giants: Google with Google Assistant, Apple with Siri, Microsoft with Cortana, and so on. In this article, I will go through how today's chatbot systems are classified, and then we will build a simple chatbot together that looks up information and knowledge on the internet.


The goal of a chatbot system is to sustain a conversation with the user in a way that mimics "unstructured" conversation between people. The usefulness and intelligence of a chatbot is evaluated with the Turing test. In practice, to keep the problem tractable, a chatbot system focuses on a few selected conversation topics and areas of knowledge (its domain) and is evaluated by the Turing test within that limit.

Classification of Chatbot systems

A chatbot is a system built on natural language processing techniques, so, like most other language processing problems, chatbot systems can be divided into two types.

  1. Chatbot systems based on rules and habits in the user's language (rule-based chatbots).
  2. Chatbot systems built on a given conversation database (corpus-based chatbots). This corpus can be collected from a large volume of user conversations, and the system uses information retrieval methods or machine learning methods to produce answers based on the context of the conversation with the user.

Rule-based chatbots and ELIZA system

Put simply, a rule-based chatbot answers users based entirely on our habits of language use, without needing to memorize any information beforehand. Everyone has their own way of expressing themselves and choosing words in everyday speech, yet human language habits tend to repeat quite a lot. Programmers exploit these habits to build chatbots, with program complexity depending on how far the creator wants to go. For example, if a question mentions "weather", "rain or sun", or "temperature", the user most likely wants to ask about the weather.
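To make the weather example concrete, here is a minimal sketch of keyword-based intent detection. The table INTENT_KEYWORDS and the function detect_intent are hypothetical names chosen for illustration; a real rule-based system would carry many more intents and phrasings.

```python
import re

# Hypothetical keyword table: each intent maps to phrases that hint at it.
INTENT_KEYWORDS = {
    "weather": ["weather", "rain or sun", "temperature"],
}


def detect_intent(sentence):
    """Return the first intent whose keyword occurs in the sentence, else None."""
    text = sentence.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        for keyword in keywords:
            # Word boundaries keep a keyword from matching inside a longer word.
            if re.search(r"\b" + re.escape(keyword) + r"\b", text):
                return intent
    return None


print(detect_intent("What is the temperature today?"))
```

A sentence with none of the listed phrases returns None, which is the point where a chatbot would fall back to a generic reply.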

The most successful and best-known chatbot of this type is ELIZA. Created in 1966, ELIZA is considered an important milestone in the history of chatbots in general and artificial intelligence in particular. ELIZA simulates a psychiatrist by listening attentively and asking careful questions about the patient's story, so that patients gradually say things they cannot share with anyone else. Let's dig into this typical rule-based system to understand how it works.

Let's look at an example conversation (in English) between ELIZA, written in capital letters, and an ordinary person:

The conversation flows very naturally (I tried to translate it into Vietnamese, but the result probably isn't very good). So how does ELIZA achieve this? Let's look at some of ELIZA's "pattern matching" rules.

In sentence patterns, the numbers 1, 2 will replace verbs. An example of applying the above pattern to sentences in practice is as follows

To determine which pattern a user's sentence falls into, ELIZA starts from the keywords that appear in the sentence. This means the system has to predefine its keywords and rank their priority sensibly, for the case where keywords from several different patterns appear in the same sentence. For each pattern, there are 3-5 response types from which the system picks at random.

Also according to the author, if no keyword is identified, ELIZA responds with something like "PLEASE GO ON", "THAT'S VERY INTERESTING", or "I SEE". This prompts the user to continue and bring something new to the conversation.
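The mechanism described above (keyword patterns ranked by priority, a random choice among response templates, and a fallback when nothing matches) can be sketched roughly as follows. The three rules in RULES and their priorities are illustrative assumptions, not ELIZA's original script; only the fallback sentences come from the text above.

```python
import random
import re

# Hypothetical rules: (priority, pattern, response templates).
# A higher priority wins when keywords from several patterns appear at once.
RULES = [
    (3, re.compile(r"\bI am (?:sad|depressed)\b", re.I),
     ["I AM SORRY TO HEAR THAT"]),
    (2, re.compile(r"\byou (.+) me\b", re.I),
     ["WHAT MAKES YOU THINK I {0} YOU", "WHY DO YOU THINK I {0} YOU"]),
    (1, re.compile(r"\balways\b", re.I),
     ["CAN YOU THINK OF A SPECIFIC EXAMPLE"]),
]

# Replies used when no keyword is found, as listed in the article.
FALLBACKS = ["PLEASE GO ON", "THAT'S VERY INTERESTING", "I SEE"]


def respond(sentence):
    """Answer with the highest-priority matching rule, or a fallback."""
    for _, pattern, templates in sorted(RULES, key=lambda rule: -rule[0]):
        match = pattern.search(sentence)
        if match:
            # Captured words are slotted back into the chosen template.
            captured = [group.upper() for group in match.groups()]
            return random.choice(templates).format(*captured)
    return random.choice(FALLBACKS)


print(respond("You hate me"))
print(respond("The sky is blue"))
```

Sorting by priority before matching is what resolves a sentence such as "I am sad, you always ignore me", which contains keywords from all three rules.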

Corpus-based chatbots

Instead of relying on hand-built rules as rule-based chatbots do, a corpus-based chatbot relies on stores of real-world conversation data to find a suitable response for the situation at hand. This dialogue data can be collected directly from chat platforms or taken from lines of movie dialogue.

Corpus-based chatbots today usually work in one of two ways: information retrieval, or deep learning models framed as sequence-to-sequence problems (similar to machine translation).

Information Retrieval based chatbots (IR-based)

The best-known system following this model is Simsimi, an application that once drew a great deal of interest from the online community in Vietnam. An IR-based system works by first searching the conversation database for the statement most similar to the user's current sentence. From there, there are two ways to give an answer:

  1. Use that statement

$\text{response} = \text{most\_similar}(q)$

  2. Use the answer to that statement

$\text{response} = \text{process}(\text{most\_similar}(q))$

In practice, each problem has its own way of measuring sentence similarity, but the most common approach is to use TF-IDF to convert sentences into real-valued vectors and then compute their similarity based on the cosine measure.
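As a rough illustration of this retrieval step, the sketch below builds TF-IDF vectors with the standard library only and picks the stored sentence with the highest cosine similarity. Whitespace tokenization and the smoothed IDF formula are assumptions made for the sketch (one common variant among several); most_similar mirrors the name used in the formulas above.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokens are good enough for the sketch.
    return text.lower().split()


def build_idf(documents):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    return {t: math.log((1 + n) / (1 + df)) + 1 for t, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Sparse TF-IDF vector; terms unseen in the corpus are ignored."""
    counts = Counter(tokenize(text))
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query, corpus):
    """Return the corpus sentence closest to the query and its score."""
    idf = build_idf(corpus)
    q = tfidf_vector(query, idf)
    scores = [cosine(q, tfidf_vector(doc, idf)) for doc in corpus]
    best = max(range(len(corpus)), key=scores.__getitem__)
    return corpus[best], scores[best]


corpus = ["how is the weather today", "what time is it", "do you like football"]
print(most_similar("what is the weather like", corpus))
```

A library such as scikit-learn provides the same building blocks ready-made; the hand-written version is only meant to show what is being computed.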

At first glance, answering with the response to the most similar conversation retrieved from the database seems the more reasonable choice, but in practice it is not. Adding an indirect processing step (working on a sentence that is merely similar to the user's statement) can introduce a lot of noise.

In the next section, I will introduce an IR-based chatbot system that has been slightly modified to fit the task of answering knowledge questions.

Sequence to sequence chatbots

Sequence-to-sequence problems are currently being tackled very effectively by deep learning networks. Given one input sentence, a deep learning model trained on our dataset generates the answer. This problem is almost the same as machine translation, except that for chatbots the source language and the target language are the same.

Some keywords to look up in this direction: Recurrent Neural Network, LSTM, GRU, BC, and Transformer.

Chatbot for a specific task (Frame Based Agents)

Besides the two types of chatbot I have just introduced, rule-based and corpus-based, I also want to talk about another type. Most of the chatbots people come into contact with (apart from the assistants from the big tech companies) fall into this category: frame-based agents. This is an agent chatbot: it simulates the work of an agent, collecting the customer's request and sending it on once all the required information has been gathered.

A concrete example is a hotel chatbot that helps users book a room in advance. A booking conversation between an employee and a customer usually goes as follows:

Above is an example conversation showing how this chatbot works. As I said, the chatbot helps customers book rooms by collecting information from them until it has enough. Some of the basic information it needs is easy to see:

  • Date of checkin
  • Checkin time
  • Date of checkout
  • User name.

With this type of chatbot, the execution flow is quite clear and not as "vague" as in the types above. When a piece of information is missing, the chatbot asks for it; the challenge is extracting that information correctly. For example, when asked for a date, many people will also include a time in their answer, such as "I book from 4-5h 15-9-2019". How do we extract the date and the time correctly? Everyone has their own way of phrasing this information, so handling every case takes time.
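The slot-filling flow and the date/time problem can be sketched as below. The slot names in REQUIRED_SLOTS, the two regular expressions, and the day-month-year reading of "15-9-2019" are all assumptions for illustration; they handle only the one phrasing quoted above, which is exactly why real systems need many more cases.

```python
import re

# Hypothetical slot names for the hotel-booking frame.
REQUIRED_SLOTS = ["checkin_date", "checkin_time", "checkout_date", "user_name"]

# Assumed formats: dates as day-month-year, times as "4h" or a range "4-5h".
DATE_RE = re.compile(r"\b(\d{1,2})-(\d{1,2})-(\d{4})\b")
TIME_RE = re.compile(r"\b(\d{1,2})(?:-(\d{1,2}))?h\b")


def parse_date_and_time(answer):
    """Pull a date and an hour (or hour range) out of one free-text answer."""
    found = {}
    date = DATE_RE.search(answer)
    if date:
        day, month, year = map(int, date.groups())
        found["date"] = f"{year:04d}-{month:02d}-{day:02d}"
        # Cut the date out so its digits cannot be mistaken for a time.
        answer = answer[:date.start()] + answer[date.end():]
    time = TIME_RE.search(answer)
    if time:
        found["time"] = time.group(0)
    return found


def next_missing_slot(frame):
    """Return the first slot the bot still has to ask about, or None."""
    for slot in REQUIRED_SLOTS:
        if slot not in frame:
            return slot
    return None


print(parse_date_and_time("I book from 4-5h 15-9-2019"))
print(next_missing_slot({"checkin_date": "2019-09-15"}))
```

The sketch does not validate that the day and month are real calendar values; a fuller version would do so before filling the slot.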

Try making a Question – Answering Chatbot

As I said above, at the end of this article, we will create a simple Chatbot system, with the task of answering knowledge questions. The processing flow will include the following stages

  • Question processing : Language processing in questions (Including word splitting, keyword splitting, question format identification)
  • Document Retrieval : Collect related documents based on Question processing output, processing and cleaning information
  • Answer Extraction : Get the most appropriate answer based on the questions and collected documents

For your convenience, I have pushed the code to github at the following link: https://github.com/hoanganhpham1006/SimpleQuestionAnswering

The libraries we will need to use for this function include:

  • Underthesea: Strong library / Open Source for processing Vietnamese language
  • Google API Search: To collect resources on the Internet
  • Framler: A very useful library of author huyhoang17, I will explain in detail about its effects in the following section.

First we will import the necessary libraries

For Vietnamese language processing problems, word segmentation is important, but it is much harder than in English. In English, we only need to split words on spaces; Vietnamese is not so simple. Compound words span several syllables, so splitting on spaces destroys their meaning and leads to poor results. Fortunately, underthesea takes care of this difficulty for us :)

In the code above, I declared a word-splitting function and a keyword-extraction function. I kept keyword extraction simple in this article: it removes special characters, punctuation, and low-meaning words listed in stopwords. This keyword-extraction function is used on both the question and the retrieved documents.
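The keyword-extraction idea can be sketched as follows. This is not the article's original function: the STOPWORDS set is a tiny hypothetical English list, and the default tokenizer is plain whitespace splitting so the sketch runs without extra packages. For Vietnamese, the tokenize argument would be replaced by a real word segmenter such as the one underthesea provides.

```python
import string

# Hypothetical stopword list; a real one would be loaded from a file.
STOPWORDS = {"is", "the", "of", "a", "an"}


def extract_keywords(sentence, tokenize=str.split):
    """Lowercase, split into words, then drop punctuation and stopwords."""
    keywords = []
    for token in tokenize(sentence.lower()):
        # Strip punctuation stuck to either end of the token.
        token = token.strip(string.punctuation)
        if token and token not in STOPWORDS:
            keywords.append(token)
    return keywords


print(extract_keywords("Who is the coach of the Vietnam team?"))
```

Passing the tokenizer in as a parameter keeps the filtering logic identical for the question and for every retrieved document.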

We will now declare and process the question

With the above question, the keywords we get are:

['coach', 'team_transfer', 'soccer_ball', 'national_gold', 'vietnam_nam', 'ai']

Next comes the Document Retrieval step. First, I search for the question on Google (as we often do when we don't know something) and record the result links. To do this, we need Google-API-Search.

We pass in the search query and the page number of the results we want; here I only take the results on the first page.

Before moving on, I would like to recommend an open-source library that is very useful for anyone working on natural language processing who has trouble finding and gathering data: Framler (a "divine" crawler), a tool that makes it easy to download the content and metadata of domestic and foreign news articles.

This tool makes it easy to get content like:

  • The title of the article
  • Name of the author of the article
  • Uploaded date
  • The content of the article (pre-processed, very clean and tidy)
  • The image links cited in that article

All it needs is a single URL, instead of us having to handle each site ourselves. Thanks to this tool, the code for my demo is significantly shorter. Details on how to use it are in the README of the package on GitHub.

Now we will deal with the two main parts, Document Retrieval and Answer Extraction

For each search result, I retrieve the content of the document and process it sentence by sentence. For each sentence, we split it into words and extract the keywords. We only keep a sentence if at least one of the question's keywords appears in it.

Finally, to extract the answer: since this is a "who" question, I use the Named Entity Recognition of underthesea to pick out the words that refer to people (PERSON). The result for the question above is:
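A minimal sketch of this sentence-filtering step is shown below. The helper names split_sentences and relevant_sentences, the punctuation-based sentence splitter, and the min_hits parameter are assumptions for illustration; the article's own code relies on underthesea for word splitting.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [part.strip() for part in parts if part.strip()]


def relevant_sentences(document, question_keywords, min_hits=1):
    """Keep sentences that contain at least min_hits question keywords."""
    wanted = set(question_keywords)
    kept = []
    for sentence in split_sentences(document):
        tokens = set(re.findall(r"\w+", sentence.lower()))
        if len(wanted & tokens) >= min_hits:
            kept.append(sentence)
    return kept


doc = ("Park Hang Seo is the coach of the team. "
       "The weather is nice today. "
       "The coach arrived in 2017.")
print(relevant_sentences(doc, ["coach", "team"]))
```

Raising min_hits is one simple way to trade recall for precision when the retrieved pages are noisy.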

[('Park Hang', 12), ('Coach Park Hang', 7), ('Park Hang Seo', 7), ('UTC', 5), ('Park', 5)]
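A ranked list of (name, count) pairs like this one can be produced by counting how often each PERSON entity occurs across the kept sentences. The sketch below assumes the entities have already been tagged and arrive as (text, label) pairs; that input format and the function name rank_candidates are assumptions, not the output format of underthesea.

```python
from collections import Counter


def rank_candidates(tagged_entities, label="PERSON", top_k=5):
    """Count entities carrying the given label and return the most frequent."""
    counts = Counter(text for text, tag in tagged_entities if tag == label)
    return counts.most_common(top_k)


# Hypothetical tagger output gathered from several sentences.
tagged = [
    ("Park Hang Seo", "PERSON"),
    ("Hanoi", "LOCATION"),
    ("Park Hang Seo", "PERSON"),
    ("Park", "PERSON"),
]
print(rank_candidates(tagged))
```

The top-ranked entry is then taken as the answer, which is why partial names such as "Park Hang" and "Park" compete with the full name in the result above.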


Through this article, I hope I have given you more insight into a very useful tool, the chatbot. In building a chatbot for the Question Answering problem, I kept things as minimal as possible to make it easy to understand. To reach high accuracy, however, each step needs further optimization (ranking the collected documents, filtering out noisy documents, recognizing entities more accurately, and so on).


Source : Viblo