(NLP Series) Named Entity Recognition – NER (part 1)

Tram Ho

Hi everyone. In this NLP series I will now write about some specific tasks that I have worked on. The first task is named entity recognition. I chose it because, in my experience working with NLP, it is a fairly basic task that is used a lot. Let's get started.

Named Entity Recognition (NER)

1. What is named entity recognition?

Named Entity Recognition (NER) is a basic task in the field of Natural Language Processing. Its main role is to identify phrases in a text and classify them into predefined groups such as person names, organizations, locations, times, product types, brands, and so on. The results of the NER task can then be used in more complex problems such as chatbots, question answering, and search.

An example of NER:

2. Methods and datasets to practice with

Entity recognition is a long-standing problem, so many solutions exist.

Rule-based approach

A rule-based NER system works with a set of rules that are either predefined or generated automatically. Each token in the text is represented as a set of features. The input text is compared against the rule set, and when a rule matches, the corresponding extraction is performed. Each rule consists of a pattern and an action. The pattern is usually a regular expression defined over the tokens' feature sets; when the pattern matches, the action is triggered. You can write your own rules or use an existing library. One well-known framework is Facebook's Duckling ( Link )
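The pattern + action idea can be sketched in a few lines of Python. This is only a minimal illustration and not how Duckling works internally: the two rules (a date pattern and an e-mail pattern), the labels DATE and EMAIL, and the function name extract_entities are all assumptions made for this example, and the patterns run on raw text rather than on token feature sets.

```python
import re

# Hypothetical rule set: each rule is a (pattern, action) pair.
# The action receives the regex match and returns (label, value).
RULES = [
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
     lambda m: ("DATE", m.group())),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
     lambda m: ("EMAIL", m.group())),
]


def extract_entities(text):
    """Apply every rule to the text and collect the triggered actions.

    Returns a list of (start, end, label, value) tuples sorted by position.
    """
    entities = []
    for pattern, action in RULES:
        for match in pattern.finditer(text):
            label, value = action(match)
            entities.append((match.start(), match.end(), label, value))
    return sorted(entities)


if __name__ == "__main__":
    for entity in extract_entities("Meet me on 12/05/2020 or mail bob@example.com"):
        print(entity)
```

Adding a new entity type only means appending another (pattern, action) pair, which is the main appeal of this approach; the drawback is that every pattern has to be written and maintained by hand.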

Statistical learning approach

Here NER is cast as a sequence labeling problem, defined as follows: we are given a sequence of observations x = (x_1, x_2, …, x_n), where each x_i is usually represented as a vector. We want to assign a label y_i to each x_i, using x_i and its context (for example, the preceding observations). The labels usually follow the BIO notation. For each entity type T there are two labels, B-T and I-T: B-T marks the beginning of an entity of type T, and I-T marks a token inside an entity of type T. In addition, the label O marks tokens outside any named entity. For example, the tokens of "Barack Obama visited Hanoi" would be labeled B-PER, I-PER, O, B-LOC.

Commonly used methods:
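To make the BIO notation concrete, here is a small Python sketch that converts entity spans into BIO labels and back. The function names spans_to_bio and bio_to_spans, the (start, end, type) span format with an exclusive end index, and the sample sentence are assumptions made for this illustration.

```python
def spans_to_bio(tokens, spans):
    """Turn entity spans into one BIO label per token.

    spans: list of (start, end, type) token indices, end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"          # remaining tokens of the entity
    return tags


def bio_to_spans(tags):
    """Recover (start, end, type) spans from a BIO label sequence.

    An I- label that does not continue an open entity of the same
    type is ignored in this sketch.
    """
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and tag[2:] == etype
        if start is not None and not continues:
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


if __name__ == "__main__":
    tokens = ["Barack", "Obama", "visited", "Hanoi"]
    tags = spans_to_bio(tokens, [(0, 2, "PER"), (3, 4, "LOC")])
    print(list(zip(tokens, tags)))
    print(bio_to_spans(tags))
```

A sequence labeling model only has to predict one label per token; the decoding step then turns those labels back into entities.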

  • Hidden Markov Model
  • Maximum Entropy
  • Conditional Random Fields – CRFs

Machine Learning / Deep Learning approach

With the development of Machine Learning / Deep Learning, new NER methods have emerged. You can go to PapersWithCode.com to see the leading datasets and methods.

The top NER methods on the CoNLL 2003 dataset ( Link ):

Evaluation results of underthesea on VLSP 2016 ( Link ):

Commonly used datasets

The most commonly used dataset for model evaluation is CoNLL 2003 (English); for Vietnamese, you can use the VLSP 2016 dataset.
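As a rough idea of what working with such a dataset looks like, the sketch below reads CoNLL-style text, where each line holds one token followed by its annotations (the entity label is the last column), blank lines separate sentences, and -DOCSTART- lines mark document boundaries. The function name read_conll and the sample lines are assumptions for this illustration; the sample is written with BIO labels, while the original CoNLL 2003 files may use a slightly different IOB variant.

```python
def read_conll(lines):
    """Parse CoNLL-style lines into sentences of (token, ner_label) pairs."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        # Blank lines and document markers end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        # First column is the token, last column is the entity label.
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    sample = [
        "-DOCSTART- -X- -X- O",
        "",
        "EU NNP B-NP B-ORG",
        "rejects VBZ B-VP O",
        "",
        "Peter NNP B-NP B-PER",
        "Blackburn NNP I-NP I-PER",
    ]
    for sentence in read_conll(sample):
        print(sentence)
```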

3. Testing

Because this article is already fairly long, the next article will introduce and test the two methods that currently give the best results, Flair and BERT, on the CoNLL 2003 (English) and VLSP 2016 (Vietnamese) datasets.

Hopefully this article gives you an overview of the NER problem and the commonly used methods and datasets.

If you have any questions, leave a comment and I will answer.

Source : Viblo