Some sentence splitting tools for Japanese text – Japanese sentence boundary disambiguation

Tram Ho

1. The sentence splitting problem in Japanese NLP

Besides text preprocessing steps such as tokenization, part-of-speech tagging, stemming & lemmatization, etc., sentence splitting is also an important step, especially for NLP tasks that treat sentences as their processing unit (information retrieval, semantic search, etc.).

Sentence splitting, also known as sentence boundary disambiguation, sentence segmentation, or sentencizing, is a common task in natural language processing: identifying where each sentence begins and ends. Under normal circumstances, a sentence ends with a punctuation mark: a period, question mark, or exclamation point. However, periods can also appear in abbreviations, decimal numbers, ellipses, email addresses, etc. According to Wikipedia, about 47% of the periods in the Wall Street Journal corpus mark abbreviations. Similarly, question marks and exclamation points can appear in emoticons, slang, etc.
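To see the ambiguity concretely, here is a minimal stdlib sketch of a naive splitter that cuts at every period/question/exclamation mark followed by whitespace (the sentence is made up for illustration):

```python
import re

text = "Mr. Tanaka bought 3.5 kg of rice. He paid $10.99! Was it worth it?"

# Naive approach: split after every period/question/exclamation mark
# that is followed by whitespace.
naive = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
print(naive)
# "Mr." does not end a sentence, yet the dot after it still triggers a
# split because it is followed by whitespace; "3.5" survives only
# because its dot has no whitespace after it.
```

This is exactly why real sentencizers need abbreviation lists, statistical models, or both.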

With languages like Japanese and Chinese, sentence boundaries are even more “vague”:

  • In Japanese, periods, exclamation points, and question marks delimit sentences in most cases. Each mark can be written with different characters, in full-width, half-width, and other forms.


  • Some variants of Japanese sentence boundaries include emotion markers (e.g. “(笑)”: laughing, “(涙)”: crying) and emoticons (e.g. “(*^∇^*)”), etc. – often found in informal writing on the Internet.
  • Sentence boundaries can also be represented by line breaks without punctuation.


  • On the other hand, we also cannot simply treat every line break as a sentence boundary, because in Japanese text a sentence may be broken across lines; writers often do this to make long sentences easier to read. According to a 2003 study on the structural analysis of Japanese patent documents, 48.5% of the first claims in 59,968 patent documents contained line breaks inside sentences.


  • Another case where sentence splitting becomes harder is text copied or converted from PDF files, tables, or OCR tools. Such text breaks lines with no regard for meaning and can end up quite disjointed.

In these situations, a good sentence-splitting tool goes a long way toward improving data quality and saving time and effort.
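The Japanese-specific difficulties above can be demonstrated with two naive stdlib rules (the example sentence is made up; the first line ends mid-sentence, and the last sentence uses a half-width "!"):

```python
import re

text = "長い文は読みやすくするために、\n途中で改行されることがあります。\nすごい!"

# Naive rule 1: every line break is a sentence boundary.
by_newline = text.split("\n")
print(by_newline)  # wrongly splits the first sentence in two

# Naive rule 2: split only after the full-width period 「。」.
by_maru = [s for s in re.split(r"(?<=。)", text) if s]
print(by_maru)     # misses the half-width "!" boundary entirely
```

Neither rule is right on its own: the first over-splits on cosmetic line breaks, the second under-splits on punctuation variants.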

2. Some sentence splitting tools for Japanese text

Sentence splitting is an important part of text preprocessing that feeds the subsequent steps, so having an effective sentencizer up your sleeve is well worth it, especially for a language with as many exceptions as Japanese.

In this article, I will introduce 3 popular tools and try to use them to split sentences for the following example:

Besides sentences split according to the usual rules, I have added some special cases, such as emoticons and line breaks inside unfinished sentences, for easier comparison.

2.1. ja_sentence_segmenter (Rule-based)

Link: https://github.com/wwwcojp/ja_sentence_segmenter

This is a sentence splitter based on common punctuation rules (e.g. a period, exclamation point, or question mark at the end of a sentence, handling of single quotes, etc.).

It works fine on administrative documents or books in a standard format, but for the special cases mentioned in part 1 – everyday text such as chat messages or content from the Internet – it is not very useful: it barely handles line breaks inside unfinished sentences and does not consider the semantics of the sentence. However, being rule-based, it is light and fast, so for ordinary documents it is well worth considering.
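The gist of this rule-based approach can be sketched with the standard library alone (an illustrative sketch of the idea, not ja_sentence_segmenter's actual API):

```python
import re
import unicodedata

def rule_based_segment(text: str) -> list[str]:
    """Minimal rule-based sentencizer in the spirit of
    ja_sentence_segmenter: normalize width variants, join lines,
    then split after sentence-final punctuation."""
    # NFKC normalization folds half-width marks (｡ ！ ？) into one
    # canonical form, so one split pattern covers the variants.
    text = unicodedata.normalize("NFKC", text)
    # Rule-based tools typically cannot tell a mid-sentence line break
    # from a real boundary, so here we simply drop line breaks.
    text = text.replace("\n", "")
    parts = re.split(r"(?<=[。！？!?])", text)
    return [p.strip() for p in parts if p.strip()]

print(rule_based_segment("今日は晴れです。明日は雨でしょうか？たぶん!"))
# → ['今日は晴れです。', '明日は雨でしょうか?', 'たぶん!']
```

Note that NFKC rewrites the full-width "？" to a half-width "?", which is exactly the kind of normalization choice a real tool makes configurable.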

Result:

2.2. Spacy Dependency parser

Link: https://spacy.io/usage/language-features#sbd

Unlike other libraries, spaCy uses the dependency parse to determine sentence boundaries: it parses the text to recover the dependency relations between the elements of a sentence. This approach is usually very accurate, but for loosely formatted documents you can add a custom component to make the pipeline more effective.

Result:

It does okay: spaCy recognizes that the part after (* ˆ∇ˆ *) is a separate sentence. But it mistakes the period (.) in the company name ABC for a sentence separator. As mentioned above, we can add a rule-based component to the pipeline to increase accuracy.
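The idea of such a rule-based correction component can be sketched as follows. This is a character-level toy, not spaCy's actual token-based API (a real component would set `is_sent_start` on tokens), and the proposed boundary indices are made up for illustration:

```python
def refine_boundaries(text: str, boundaries: list[int]) -> list[int]:
    """Post-process sentence boundaries proposed by a statistical
    model, the way a custom rule-based pipeline component would.
    A boundary right after a dot that sits between two ASCII word
    characters (e.g. the dot in "ABC.COM") is vetoed: that dot is
    part of a name, not a full stop."""
    kept = []
    for b in boundaries:
        if (2 <= b < len(text) and text[b - 1] == "."
                and text[b - 2].isascii() and text[b - 2].isalnum()
                and text[b].isascii() and text[b].isalnum()):
            continue  # dot inside a name/URL: not a sentence boundary
        kept.append(b)
    return kept

text = "私はABC.COMの社員です。よろしく。"
proposed = [6, 15, 20]  # hypothetical boundaries from a parser
print(refine_boundaries(text, proposed))
# → [15, 20]  (the boundary after "ABC." is vetoed)
```

The rule only vetoes boundaries; everything the model got right passes through untouched, which is why such components compose well with a statistical parser.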

In addition, spaCy also provides other tools, such as a tokenizer, POS tagger, etc., to build a complete pipeline.

2.3. Bunkai

Link: https://github.com/megagonlabs/bunkai

This tool consists of 2 main components:

  • Bunkai: a set of annotators that detect sentence boundaries by rules and handle exceptions.
  • SBD model: described in the paper Sentence Boundary Detection on Line Breaks in Japanese, the model is fine-tuned from Tohoku University's BERT Japanese model and focuses on deciding whether a line break is a sentence boundary. (For more information: as far as I know, some benchmarks show that the two best-performing pre-trained Japanese language models are NICT BERT and Tohoku University's BERT Japanese.)
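The division of labor between the two components can be sketched as follows. A crude heuristic stands in for the BERT-based SBD model here; this is a stdlib illustration of the architecture, not Bunkai's actual API:

```python
import re

def looks_like_continuation(line_tail: str) -> bool:
    # Stand-in for the BERT-based SBD model: the real model classifies
    # each line break with a fine-tuned transformer; this crude
    # heuristic just checks whether the line ends without
    # sentence-final punctuation (e.g. it ends with ため、).
    return not line_tail.endswith(("。", "！", "？", "!", "?"))

def segment(text: str) -> list[str]:
    sentences: list[str] = []
    carry = ""  # text carried over a line break judged "not a boundary"
    for line in text.split("\n"):
        line = carry + line.strip()
        carry = ""
        if not line:
            continue
        # Component 1: rule-based split on sentence-final punctuation.
        parts = [p for p in re.split(r"(?<=[。！？!?])", line) if p]
        # Component 2: the "model" decides whether the line break after
        # this line is a real boundary; if not, the tail is carried over
        # and merged with the next line.
        if looks_like_continuation(parts[-1]):
            carry = parts.pop()
        sentences.extend(parts)
    if carry:
        sentences.append(carry)
    return sentences

print(segment("雨が降っていたため、\n試合は中止になりました。\n残念です。"))
# → ['雨が降っていたため、試合は中止になりました。', '残念です。']
```

The point of the architecture is that the expensive classifier is only consulted at line breaks, while cheap punctuation rules do the rest.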

Result:

It can be seen that the SBD model performs quite well, especially on line breaks inside unfinished sentences (it correctly identifies that the break after ため、 is not a sentence boundary).

This tool is especially useful for processing documents converted from PDFs or tables.

3. Conclusion

In this article, I have pointed out the difficulties of the sentence segmentation step when processing Japanese documents and introduced 3 commonly used tools. Each has its own strengths and weaknesses, so you can weigh them and pick the right one for each case.



Source : Viblo