The problem of extracting information from text – Part I

Tram Ho

For anyone who analyzes text, the hardest part is often not finding the right documents but finding the right information inside them: understanding how the entities in a paragraph relate to each other, which events it describes, or which valuable details its keywords point to. Automatically extracting that information from text and presenting it in a structured form therefore brings many benefits and greatly reduces the time we spend skimming documents. That is the topic of this series. In this part, I explain the basic theory and how to pre-process text data; the following parts will be published as soon as possible …

Information Extraction

Information Extraction (IE) is the area of natural language processing concerned with extracting structured information from text. It is used in many NLP problems, such as knowledge graphs, question answering systems and text summarization. Relation Extraction (RE) is itself a subfield of IE.

The goal of IE is to extract meaningful information from unstructured text and present it in a structured form.

With IE we can extract information such as a person’s name or a place, or identify the relationships among entities, and store this information in a structured format such as a database.
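To make "structured form" concrete, here is a minimal sketch that stores one extracted relation in an in-memory SQLite database. The fact is written by hand for illustration, and the table and column names (`relation`, `subject`, `predicate`, `object`) are assumptions of this sketch, not part of any standard schema.

```python
import sqlite3

# Hand-written fact for illustration; an IE system would derive it from a
# sentence such as "I congratulate Mr. Amara Essy on his election as
# President of the General Assembly."
facts = [
    ("Amara Essy", "PERSON", "elected_as", "President of the General Assembly"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE relation "
    "(subject TEXT, subject_type TEXT, predicate TEXT, object TEXT)"
)
conn.executemany("INSERT INTO relation VALUES (?, ?, ?, ?)", facts)

# Once stored, the information can be queried like any other table
rows = conn.execute(
    "SELECT subject, object FROM relation WHERE predicate = 'elected_as'"
).fetchall()
```

Once the text has been reduced to rows like this, questions such as "who was elected to what" become simple queries instead of a reading task.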

What are parts of speech (POS)?

A sentence or paragraph contains verbs, nouns, adverbs, pronouns and so on. POS tagging labels each word in a sentence with its word class.

Example: We are the same

POS: We/PRP are/VBP the/DT same/JJ

This classification helps the machine work out the meaning of a sentence, because an English word can play more than one role. For example, “growing” can be a noun (“the growing of crops”) or an adjective (“a growing economy”). Without the tags we could misread the structure of the sentence, and our interpretation of it would no longer be correct. POS tagging lets the machine tell the noun from the adjective based on the context of the sentence.

We will use the spaCy library to work through some examples:
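A minimal sketch of POS tagging with spaCy, assuming that spaCy and its small English model `en_core_web_sm` are installed. The helper names `format_tags` and `tag_sentence` are chosen for this illustration only.

```python
def format_tags(pairs):
    """Render (word, tag) pairs in the word/TAG style used in this article."""
    return " ".join(f"{word}/{tag}" for word, tag in pairs)


def tag_sentence(text, model="en_core_web_sm"):
    """Return (word, fine-grained POS tag) pairs for a sentence."""
    import spacy  # imported here so format_tags stays usable without spaCy

    nlp = spacy.load(model)  # assumption: this model has been downloaded
    doc = nlp(text)
    # token.tag_ is the fine-grained tag; token.pos_ holds the coarse one
    return [(token.text, token.tag_) for token in doc]
```

Calling `format_tags(tag_sentence("We are the same"))` returns one word/TAG pair per token. The exact tags can vary with the model version.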

With a few simple statements we can analyze the composition of a sentence. But how does this analysis help us extract information? Once the words are tagged, we can filter out the nouns, and nouns often carry much of the important information in a sentence.
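Filtering by tag can be sketched as follows. The tagged sentence is written by hand for illustration (in practice the pairs would come from a tagger), and `NOUN_TAGS` lists the Penn Treebank-style noun tags.

```python
# Hand-tagged example; a tagger such as spaCy would normally produce these pairs
tagged = [
    ("The", "DT"), ("delegates", "NNS"), ("praised", "VBD"),
    ("the", "DT"), ("Assembly", "NNP"),
]

# Singular, plural, proper singular and proper plural nouns
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def extract_nouns(pairs):
    """Keep only the words whose tag marks them as a noun."""
    return [word for word, tag in pairs if tag in NOUN_TAGS]


nouns = extract_nouns(tagged)
```

Here `nouns` keeps "delegates" and "Assembly" and drops the determiners and the verb.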

We can easily extract words based on their POS tags. But sometimes POS tags alone are not enough. See the sentence below:

If I want to extract the subject and the object of a sentence, I cannot do it from the POS tags alone; I need to see how the words relate to each other. These relations are called dependencies.

We use a dependency graph to show the relationships between the words of a sentence.

Each word is a node in the dependency graph, and the relationships between words are drawn as curved arrows.

The arrows carry a lot of meaning:

  • The arrowhead points to the word that depends on the word at the origin of the arrow. The dependent word is a child node of the word that points to it.
  • The word with no incoming arrow is the root node of the sentence.

We will try to extract the subject and the object of the above sentence.
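A minimal sketch of subject and object extraction from dependency labels. It assumes the label names used by spaCy's English models (`nsubj`, `nsubjpass`, `dobj`); the example parse of "We welcomed South Africa" is written by hand, and the helper names are illustrative.

```python
SUBJECT_DEPS = {"nsubj", "nsubjpass"}  # active and passive subjects
OBJECT_DEPS = {"dobj"}                 # direct objects


def subject_and_object(triples):
    """triples holds (word, dependency label, head word) for each token."""
    subjects = [word for word, dep, _ in triples if dep in SUBJECT_DEPS]
    objects = [word for word, dep, _ in triples if dep in OBJECT_DEPS]
    return subjects, objects


def parse(text, model="en_core_web_sm"):
    """Produce the triples with spaCy (assumes the model is installed)."""
    import spacy  # imported here so the helper above works without spaCy

    nlp = spacy.load(model)
    return [(tok.text, tok.dep_, tok.head.text) for tok in nlp(text)]


# Hand-written parse for illustration; the root token is its own head
example = [
    ("We", "nsubj", "welcomed"),
    ("welcomed", "ROOT", "welcomed"),
    ("South", "compound", "Africa"),
    ("Africa", "dobj", "welcomed"),
]
subjects, objects = subject_and_object(example)
```

Only the head word of each phrase is returned here ("Africa", not "South Africa"). Recovering the full phrase would mean also collecting the children of that word.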

Working with the United Nations General Debate Corpus dataset

Now that we have covered the basics, let us put them into practice on a real dataset, the United Nations General Debate Corpus. The dataset contains the speeches given by representatives of all member states at the General Debate of the annual session of the United Nations General Assembly from 1970 to 2018. In this section we will work with only a small subset of it.

First, load the text files using the glob library, then display them with pandas.

The next step is to pre-process the text data. Here I take an excerpt from one speech as an example.
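A minimal sketch of the loading step, assuming the subset is a folder of UTF-8 `.txt` files. The folder layout, the file pattern and the column names `file` and `text` are assumptions of this sketch, not the article's actual setup.

```python
import glob
import os


def read_speeches(folder, pattern="*.txt"):
    """Read every matching file into a list of {'file', 'text'} records."""
    records = []
    for path in sorted(glob.glob(os.path.join(folder, pattern))):
        with open(path, encoding="utf-8") as handle:
            records.append({"file": os.path.basename(path), "text": handle.read()})
    return records


def to_dataframe(records):
    """Turn the records into a pandas DataFrame for inspection."""
    import pandas as pd  # imported here so read_speeches works without pandas

    return pd.DataFrame(records)
```

With the files in place, `to_dataframe(read_speeches("path/to/subset")).head()` shows the first few speeches, one row per file.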

I congratulate Mr.\nAmara Essy on his election as President of the General\nAssembly at the forty-ninth session. We are particularly\ngratified that an eminent son of Africa is leading the\nAssembly’s deliberations this year.\nWe offer our thanks to his predecessor, Ambassador\nInsanally, who presided over a year of considerable\nactivity in the General Assembly with great aplomb and\nfinesse. The Secretary-General, Mr. Boutros\nBoutros-Ghali, will be completing three years in office.\nWe wish him well as he continues to lead the United\nNations.\nWe have already welcomed the new South Africa to\nthe United Nations. South Africa today is a reminder of\nthe triumph of the principle of equality of man – a\ntriumph in which the United Nations played a major role.\nThe world community must commit itself to ensure that\nthis principle is implemented for all time to come. All\nefforts should be made for the development of South\nAfrica.\nForty-nine years ago a world tired of war declared\n …

I congratulate Mr. Amara Essy on his election as President of the GeneralAssembly at the forty-ninth session. We are particularlygratified that an eminent son of Africa is leading theAssembly’s deliberations this year.We offer our thanks to his predecessor, AmbassadorInsanally, who presided over a year of considerableactivity in the General Assembly with great aplomb andfinesse. The Secretary-General, Mr. BoutrosBoutros-Ghali, will be completing three years in office.We wish him well as he continues to lead the UnitedNations.We have already welcomed the new South Africa tothe United Nations. South Africa today is a reminder ofthe triumph of the principle of equality of man – atriumph in which the United Nations played a major role.The world community must commit itself to ensuring thatthis principle is implemented for all time to come. Allefforts should be made for the development of SouthAfrica.Forty-nine years ago a world tired of war declaredthat at this foundry of the United Nations …

Write the function:
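A minimal sketch of such a cleaning function, using only the `re` module. The rules are inferred from the sample text in this article (line breaks, salutations such as "Mr.", hyphens and dashes) and are an approximation, not the exact function used. Unlike the sample output, this sketch replaces each line break with a space so that neighbouring words do not fuse.

```python
import re


def clean(text):
    """Remove line breaks, salutation periods, hyphens and dashes."""
    # Join hard-wrapped lines; the space keeps neighbouring words apart
    text = re.sub(r"\s*\n\s*", " ", text)
    # "Mr." -> "Mr" so the period is not mistaken for a sentence end
    text = re.sub(r"\b(Mr|Mrs|Ms|Dr)\.", r"\1", text)
    # Remove hyphens inside words: "forty-ninth" -> "fortyninth"
    text = re.sub(r"(?<=\w)-(?=\w)", "", text)
    # Remove free-standing hyphens, en dashes and em dashes
    text = re.sub(r"\s[\u2013\u2014-]\s", " ", text)
    # Collapse any repeated whitespace left behind
    return re.sub(r"\s{2,}", " ", text).strip()
```

For instance, `clean("Mr.\nAmara Essy at the forty-ninth session.")` gives "Mr Amara Essy at the fortyninth session.". Dropping the period after "Mr" matters for the next step, where periods are used to find sentence boundaries.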

Result:

I congratulate MrAmara Essy on his election as President of the GeneralAssembly at the fortyninth session. We are particularlygratified that an eminent son of Africa is leading theAssembly’s deliberations this year.We offer our thanks to his predecessor, AmbassadorInsanally, who presided over a year of considerableactivity in the General Assembly with great aplomb andfinesse. The SecretaryGeneral, Mr BoutrosBoutrosGhali, will be completing three years in office.We wish him well as he continues to lead the UnitedNations.We have already welcomed the new South Africa tothe United Nations. South Africa today is a reminder ofthe triumph of the principle of equality of man atriumph in which the United Nations played a major role.The world community must commit itself to ensure thatthis principle is implemented for all time to come. Allefforts should be made for the development of SouthAfrica.Fortynine years ago a world tired of war declaredthat at this foundry of the United Nations it wou …

After pre-processing, we split each speech into individual sentences. The following function splits the text at sentence-ending periods, so that each sentence becomes a separate item.
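A minimal sketch of sentence splitting with a regular expression. This is a simple stand-in chosen for illustration; spaCy's `doc.sents` is a more robust alternative. The function name `split_sentences` is hypothetical.

```python
import re


def split_sentences(text):
    """Split after '.', '!' or '?' when whitespace and a capital letter follow."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    # Drop empty pieces and surrounding whitespace
    return [part.strip() for part in parts if part.strip()]
```

The rule only works when sentences are separated by whitespace, which is one reason to replace line breaks with a space during cleaning. It also splits after any abbreviation that still ends in a period.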

And we have the following result:

In this article we covered the basic theory and practice of information extraction. In the next article I will go deeper into NLP terminology: named entity recognition (NER), finding patterns in the speeches, rules for noun-verb-noun phrases, rules for adjective-noun structures …

Once again, thank you for reading.

Source : Viblo