Determining question intent in a question and answer system

Tram Ho

Post goals

Question analysis is the first phase in the general architecture of a Q&A system; its task is to extract the information needed as input for the later phases (document retrieval, answer extraction, …). Question analysis therefore plays a very important role and directly influences the operation of the whole system: if the question is analyzed poorly, the answer cannot be found.

Today I will present methods for classifying a questioner's intent in a question and answer system, based on a data set of questions from students of the University of Civil Engineering. While building the system I tested many different methods; within the scope of this blog post I will share the best method I used, and I will share the other methods in future posts.

What is Intent Determination?

For question and answer systems, intent classification means determining what the questioner intends when interacting with the system through a question or query. For example, for the question "May I ask for the address of facility A?", the user's intent is to ask about an ‘ADDRESS’; for the question "What time does the store open?", the user's intent is to ask about a ‘TIME’. Determining the intent helps the system understand the question correctly and give the answer the user wants.

Methods of determining intent

There are many methods for determining the intent of a question.

Shallow approach

The classical, or shallow, approach is based on the frequency and importance of words associated with known intents. For example, questions asking about time contain words like "what time", "what day", "what month", "what year", while questions asking about places contain words like "where", "address", etc. Many Q&A methods use keyword-based techniques to locate sentences and paragraphs that may contain the answer in the selected documents, then keep the sentences and paragraphs containing strings of the same type as the desired answer type (e.g., names of people, places, quantities, …).

Having determined the most common words for each intent, we can predict which intent a user's question most likely belongs to based on the probability of those words occurring in the question. However, such dictionary-based determination is incomplete and inaccurate: natural language is ambiguous, so for some questions the intent cannot be determined from such words alone.

Deep approach

In cases where the shallow approach cannot find the answer, syntactic, semantic, and contextual processing is needed to extract or generate the answer. Commonly used techniques include named-entity recognition, relation extraction, and word-sense disambiguation. Such systems often use knowledge resources like WordNet or an ontology to enrich their reasoning through definitions and semantic relationships. Q&A systems based on statistical language models are also gaining popularity.

In this article, I will follow the deep approach.

Data used

To build a model that determines question intent, I use a data set of "question – intent" pairs collected from students of the University of Civil Engineering. I frame the problem as building a classification model whose classes are the questioner's intents.

The questions were divided into 10 intent groups: ['DIEM', 'HOC_BONG', 'DKMH', 'HOC_PHI', 'KHAC', 'LICH_HOC', 'TAI_KHOAN', 'THU_TUC_SV', 'TN', 'TOEIC'], where:

  • ‘DIEM’ includes questions about grades
  • ‘HOC_BONG’ includes questions about scholarships
  • ‘DKMH’ includes questions about course registration
  • ‘HOC_PHI’ includes questions about tuition fees
  • ‘LICH_HOC’ includes questions about class schedules
  • ‘TAI_KHOAN’ includes questions about student accounts
  • ‘THU_TUC_SV’ includes questions about student administrative procedures
  • ‘TN’ includes questions about graduation
  • ‘TOEIC’ includes questions about the TOEIC requirement
  • ‘KHAC’ includes questions that do not belong to any of the other 9 groups

Building the model

Resources

The data and the pre-trained files for the representation models can be downloaded here.

Install the necessary packages

In this article, I use the pyvi library for some pre-processing of the text. To install it, run the following command:
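For example (pyvi is the only package named here; Gensim, TensorFlow, and scikit-learn are also used later in this post, so I install them in the same step):

```
pip install pyvi gensim tensorflow scikit-learn
```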

Import the required libraries
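A set of imports that should cover everything used in the rest of this post:

```python
import re

import numpy as np
import pandas as pd
import tensorflow as tf
from gensim.models import FastText
from pyvi import ViTokenizer
```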

Declare preprocessor functions

To remove stop words, I use a stop-word list; replace stopwords.csv with the path to the file downloaded above.
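A minimal sketch of the pre-processing, assuming stopwords.csv contains one stop word per line (the exact file format and cleaning steps may differ in your copy):

```python
# Load the stop-word list (assuming one stop word per line)
stopwords = set(pd.read_csv('stopwords.csv', header=None)[0].astype(str))

def preprocess(text):
    """Lowercase, strip punctuation, segment Vietnamese words, drop stop words."""
    text = text.lower()
    text = re.sub(r'[^\w\s]', ' ', text)   # remove punctuation
    text = ViTokenizer.tokenize(text)      # pyvi joins compound words with '_'
    return [w for w in text.split() if w not in stopwords]
```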

Train the FastText model on the problem data

In this article, I use FastText from Gensim's library to encode words into vectors (word embeddings). The training data is the file xaa, which consists of a portion of Wikipedia articles; the documents have been pre-processed with techniques such as word segmentation, stop-word removal, and normalization. FastText is considered better than word2vec at representing new (out-of-vocabulary) words, so I use it for this problem.

In addition, I add the sentences from the intent classification data set to the embedding training data. This supplements words from the problem's data domain that do not appear in the Wikipedia corpus, and makes the FastText model more expressive.

Train the FastText model as follows:
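A sketch of the training step with the Gensim 4.x API; the hyperparameters are illustrative, not necessarily the original values:

```python
# Corpus: the pre-processed Wikipedia file (xaa) plus the questions
# from the intent data set ('questions' is loaded in "Read the data" below)
corpus = []
with open('xaa', encoding='utf-8') as f:
    for line in f:
        corpus.append(line.split())
corpus += [preprocess(q) for q in questions]

ft_model = FastText(sentences=corpus, vector_size=100, window=5,
                    min_count=1, epochs=10)
```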

After training is complete, save the model for later use with the following code:
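For example (the file name is just an example):

```python
ft_model.save('fasttext_model.bin')
```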

To reload the model, we do the following:
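Assuming the path used above:

```python
ft_model = FastText.load('fasttext_model.bin')
```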

As a test, print the vector size of one word:
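For instance (the word is an arbitrary example; pyvi writes compound words with underscores):

```python
vec = ft_model.wv['sinh_viên']
print(vec.shape)   # (100,) with the vector_size chosen above
```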

Word representation and sentence representation

After training the FastText model, we encode each sentence into a matrix by encoding each of its words as a vector and stacking these word vectors, using the length of the longest sentence as the common size (to make sure every sentence is fully represented). Sentences shorter than the longest one are padded with zeros, bringing all sentences to the same size without affecting their meaning. For the padding I use the function tf.keras.preprocessing.sequence.pad_sequences.

The code does the following:
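A minimal sketch under the assumptions above (empty sentences after pre-processing are not handled here):

```python
def encode_sentences(tokenized_sentences, ft_model):
    """Encode each word of each sentence with FastText, then zero-pad
    all sentences to the length of the longest one."""
    encoded = [np.array([ft_model.wv[w] for w in s]) for s in tokenized_sentences]
    return tf.keras.preprocessing.sequence.pad_sequences(
        encoded, dtype='float32', padding='post')
```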

Read the data
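A plausible loading step; the file and column names here are hypothetical, so adjust them to the downloaded data set:

```python
df = pd.read_csv('data.csv')                     # hypothetical file name
questions = df['question'].astype(str).tolist()  # hypothetical column names
intents = df['intent'].tolist()

# Map intent labels to integer ids and build the input tensors
labels = sorted(set(intents))                    # the 10 intent groups
label_to_id = {lab: i for i, lab in enumerate(labels)}
y = np.array([label_to_id[lab] for lab in intents])
X = encode_sentences([preprocess(q) for q in questions], ft_model)
```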

Data division

Because the number of samples per class is not equal, to evaluate the model correctly we need to split off the validation data so that every class contributes the same number of samples.

The figure below shows the number of questions in each class:

[Figure: Number of questions per class]

To draw an equal number of questions from each class into the validation set, we do the following:
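One way to do this, with an illustrative per-class sample count:

```python
rng = np.random.default_rng(42)
n_val_per_class = 20                 # illustrative value
val_idx = []
for c in np.unique(y):
    idx = np.where(y == c)[0]        # all samples of class c
    val_idx.extend(rng.choice(idx, size=n_val_per_class, replace=False))
val_idx = np.array(val_idx)
train_idx = np.setdiff1d(np.arange(len(y)), val_idx)
```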

Proceed to divide training and test data:
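Using the index sets built above:

```python
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[val_idx], y[val_idx]
print(X_train.shape, X_test.shape)
```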

Model definition

In this article, I use an LSTM to perform the classification. Building an LSTM network with Keras is quite simple:
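A minimal sketch; the layer sizes are illustrative, not necessarily the original architecture:

```python
from tensorflow.keras import layers, models

max_len, emb_dim = X.shape[1], X.shape[2]

model = models.Sequential([
    layers.Input(shape=(max_len, emb_dim)),          # padded word-vector sequences
    layers.LSTM(128),
    layers.Dropout(0.5),
    layers.Dense(len(labels), activation='softmax'), # 10 intent classes
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```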

To see the detailed number of model parameters:
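For example:

```python
model.summary()
```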

Model training

To train the model, we fit it on the training data and validate on the test data like this:
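A sketch with illustrative epoch and batch-size values:

```python
history = model.fit(X_train, y_train,
                    validation_data=(X_test, y_test),
                    epochs=50, batch_size=32, verbose=1)
```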

The parameter verbose=1 tells Keras to print the evaluation results after each epoch.

The results after running some epochs are as follows:

Model evaluation

Import the required libraries for evaluation:
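For example, from scikit-learn:

```python
from sklearn.metrics import accuracy_score, f1_score
```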

We will evaluate the model with metrics such as F1 score and accuracy, as follows:
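First we need the model's predicted class ids on the test set; a minimal sketch:

```python
y_pred = np.argmax(model.predict(X_test), axis=1)
```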

You can read more about these classification metrics here.

  • f1-score
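A plausible call (the 'macro' averaging is my choice, not necessarily the original's):

```python
print(f1_score(y_test, y_pred, average='macro'))
```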

The output will be:

  • Accuracy
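And similarly:

```python
print(accuracy_score(y_test, y_pred))
```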

Result:

Thus, the model reaches 82.7% on the test set. The result is not very high, but it is acceptable.

Fine-tune model

To improve the model's accuracy, readers can experiment with a better word representation model, or with other architectures such as BERT, GRU, or RNN. We can also try changing the hyperparameters and compare the results to find the best model.

Summary

In this article, I have presented a technique for determining the intent of a question using deep learning. Any questions or suggestions can be discussed below this article.

Link to the Google Colab notebook for this article

Original article

Thank you for reading the article.


Source: Viblo