Unicode

Tram Ho

Characters, Code Points, Graphemes

All of the Unicode regex engines mentioned in this article treat a single Unicode code point as one character. In a regular expression, . matches any single character; with Unicode, that means any single code point. à can be encoded as two code points: U+0061 (a) followed by U+0300 (grave accent). In that case . matches only the a, without the accent. ^.$ will not match à, because the string contains two code points; the regex ^..$ is needed instead.

Wow!

The Unicode code point U+0300 (grave accent) is a combining mark, that is, a mark that combines with the character before it. Any code point that is not a combining mark may be followed by combining marks. A sequence such as U+0061 U+0300 is called a grapheme.

Unfortunately, à can also be encoded as the single Unicode code point U+00E0. The reason for this duality is that older character sets encode à as a single character.
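The difference between the two encodings of à can be checked with a short script. This is a minimal sketch in Python, chosen here only for illustration; its standard `re` module treats one code point as one character, like the engines described above. The variable names are made up for this example.

```python
import re

decomposed = "a\u0300"    # a + combining grave accent: two code points
precomposed = "\u00e0"    # à as a single code point

# Both strings look the same on screen but differ in length.
print(len(decomposed), len(precomposed))            # 2 1

# . matches exactly one code point, so ^.$ fails on the decomposed form.
print(re.fullmatch(r".", decomposed))               # None
print(re.fullmatch(r"..", decomposed) is not None)  # True
print(re.fullmatch(r".", precomposed) is not None)  # True

# An unanchored . matches only the bare a, without the accent.
first = re.match(r".", decomposed).group()
print(first == "a")                                 # True
```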

Regex for Code Point

To match a specific Unicode code point, use \uFFFF, where FFFF is the hexadecimal number of the code point you want. You must always specify 4 hexadecimal digits. For example, \u00E0 matches à, but only if à is encoded as the single code point U+00E0.
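As a small illustration, Python's `re` module accepts the \uFFFF notation with exactly four hexadecimal digits. The sample strings below are invented for this sketch.

```python
import re

# \u00E0 in the pattern stands for the single code point U+00E0.
pattern = re.compile(r"\u00E0")

precomposed_text = "voil\u00e0"     # à stored as U+00E0
decomposed_text = "voila\u0300"     # à stored as U+0061 U+0300

found_precomposed = pattern.search(precomposed_text) is not None
found_decomposed = pattern.search(decomposed_text) is not None

print(found_precomposed)   # True
print(found_decomposed)    # False: the text contains no U+00E0 code point
```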

Some flavors, such as Perl, PCRE, Boost and std::regex, use \x{FFFF} instead of \uFFFF. You can omit leading zeros inside the braces. Because \x on its own is not a valid regex token, \x{1234} can never be misread as "match \x 1234 times". It always matches the Unicode code point U+1234. \x{1234}{1111} matches the code point U+1234 repeated 1111 times.

Unicode Categories

The Unicode standard defines several categories, and each Unicode character belongs to one of them. You can match a character in the "letter" category with \p{L}, and a character that does not belong to the "letter" category with \P{L}.

As mentioned, one character is equivalent to one Unicode code point. \p{L} matches a single code point in the category "letter". If the input is à encoded as U+00E0, it matches à. If the input is à encoded as U+0061 U+0300 (a + ̀), it matches only the a, without the accent. The reason is that the code points U+0061 (a) and U+00E0 (à) both belong to the category "letter", while U+0300 (̀) belongs to the category "mark".

The following notes concern à in its decomposed form (a + ̀, encoded as U+0061 U+0300), not the precomposed à typed from the keyboard.

In a Ruby regex, \A and \z anchor the match to the start and end of the string, the role that ^ and $ play in the regexes of most other languages.

\p{M}: matches a character from the "mark" category of Unicode.

\P{M}: matches a character that does not belong to the "mark" category of Unicode.
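The categories behind \p{L} and \p{M} can be inspected directly. Python's standard `re` module does not support \p{...}, so the sketch below uses the `unicodedata` module as a stand-in; `is_letter`, `is_mark` and `is_single_grapheme` are hypothetical helpers written for this example, not part of any regex engine.

```python
import unicodedata

def is_letter(ch: str) -> bool:
    # Stand-in for \p{L}: general category starts with "L" (Lu, Ll, ...).
    return unicodedata.category(ch).startswith("L")

def is_mark(ch: str) -> bool:
    # Stand-in for \p{M}: general category starts with "M" (Mn, Mc, Me).
    return unicodedata.category(ch).startswith("M")

def is_single_grapheme(text: str) -> bool:
    # Emulates \A\P{M}\p{M}*\z: one non-mark code point, then only marks.
    return (
        len(text) >= 1
        and not is_mark(text[0])
        and all(is_mark(ch) for ch in text[1:])
    )

for ch in ("a", "\u00e0", "\u0300"):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch))
# U+0061 Ll
# U+00E0 Ll
# U+0300 Mn

print(is_single_grapheme("a\u0300"))   # True: a followed by one mark
print(is_single_grapheme("\u00e0"))    # True: one precomposed letter
print(is_single_grapheme("ab"))        # False: second code point is a letter
```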

Some flavors, such as PCRE, PHP and .NET, are case sensitive for the text between the braces of a \p token. \p{Zs} matches space characters, while \p{zs} throws an error. The other regex engines accept both uppercase and lowercase. You should make a habit of using the capitalization shown in the list below, so that your regex works on all regex engines.

In addition to the standard notation \p{L}, Java, Perl, PCRE, JGsoft and XRegExp 3 allow the shorthand \pL. The shorthand only works with single-letter Unicode properties. For example, \pLl is not equivalent to \p{Ll}; it is equivalent to \p{L}l, which matches "Al", "àl", or any other Unicode letter followed by a literal "l".

Perl, XRegExp and JGsoft also support the long form \p{Letter}. You can see the full list of Unicode properties below. In these names you can omit the underscores, or write hyphens or spaces in their place.

Unicode Scripts

Some code points are assigned to characters, and others are not assigned to any character at all. The Unicode standard divides the assigned code points into scripts. Each script is a group of code points used by a specific writing system. Some scripts, such as Thai, correspond to a single language (here, Thai). Other scripts, like Latin, span many languages.

Some languages are written in several scripts. For example, there is no Japanese Unicode script; instead, Unicode provides scripts such as Hiragana, Katakana, Han and Latin, which are used together in Japanese documents.

Common is a special script. It contains the characters that are shared by many scripts, including all kinds of punctuation, spaces and miscellaneous symbols.

All assigned code points (those matched by \P{Cn}) belong to a Unicode script. Unassigned code points (those matched by \p{Cn}) do not belong to any Unicode script.

Perl, PCRE, PHP, Ruby 1.9, Delphi and XRegExp can match the following scripts.

Unicode Blocks

The Unicode standard also divides the code point range into blocks. Each block holds the characters of a specific script, like "Tibetan", or of a specific group, like "Braille Patterns". Most blocks contain unassigned code points, reserved for future expansion of Unicode.

Note that Unicode blocks are not the same as scripts. The essential difference is that a block is a single contiguous range of code points (as listed below), while a script can contain characters from all over the Unicode range. Blocks can include unassigned code points (code points matched by \p{Cn}). Scripts contain only code points that have been assigned to characters. In general, if you are not sure which one to use, choose the script.

For example, the Currency block does not contain $ and ¥. They are found in the Basic_Latin and Latin-1_Supplement blocks instead, even though both are currency symbols and neither is a Latin letter. This is for historical reasons: the ASCII standard contains the $ sign and the ISO-8859 standard contains the ¥ sign. You should not blindly use any of the blocks listed below based on their names. Instead, look at the range of characters they actually match. A tool like RegexBuddy can help with this. \p{Sc} or \p{Currency_Symbol} is a better choice than \p{InCurrency_Symbols} when you are looking for currency symbols.
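The mismatch between block names and contents is easy to verify. The sketch below hard-codes three block ranges from the Unicode standard in a hypothetical `BLOCKS` table (Python's standard library has no block lookup) and compares them with the category reported by `unicodedata`.

```python
import unicodedata

# Code point ranges of three Unicode blocks (inclusive).
BLOCKS = {
    "Basic_Latin": (0x0000, 0x007F),
    "Latin-1_Supplement": (0x0080, 0x00FF),
    "Currency_Symbols": (0x20A0, 0x20CF),
}

def block_of(ch: str) -> str:
    # Hypothetical helper: name of the block containing ch, if it is in the table.
    for name, (low, high) in BLOCKS.items():
        if low <= ord(ch) <= high:
            return name
    return "(not in table)"

for ch in ("$", "\u00a5", "\u20ac"):   # dollar, yen, euro
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), block_of(ch))
# U+0024 Sc Basic_Latin
# U+00A5 Sc Latin-1_Supplement
# U+20AC Sc Currency_Symbols
```

All three characters share the category Sc (the one \p{Sc} matches), but only the euro sign lies in the Currency_Symbols block.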

Not all regex engines use the same syntax to match Unicode blocks. Java, Ruby 2.0 and XRegExp use the \p{InBlock} syntax as listed above. .NET and XML use \p{IsBlock} instead. Perl and the JGsoft flavor support both notations. You should use In if your regex engine supports it: In can only be used for Unicode blocks, while Is can also be used for Unicode properties and scripts.

For .NET and XML, you must omit the underscores but keep the hyphens in the block name. For example, use \p{IsLatinExtended-A} instead of \p{InLatin_Extended-A}. In Java, you must omit the hyphens. .NET and XML are also case sensitive, while Perl, Ruby and the JGsoft flavor are not. Java 4 is case sensitive, but Java 5 and later are case sensitive only for the Is prefix.

The block names are the same in all regex engines; they are defined in the Unicode standard. PCRE and PHP do not support Unicode blocks, but they do support scripts.

Do you need to worry about different encodings?

While you should keep in mind the pitfalls created by the different ways of encoding accented characters, you don't always have to worry about them. If you are sure that your input string and your regex use the same encoding style, there is nothing to worry about. Converting text to one consistent style is called Unicode normalization. All programming languages with native Unicode support, such as Java, C# and VB.NET, have library routines for normalizing strings. If you normalize both the subject and the regex before attempting the match, there won't be any inconsistencies.
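Normalizing both sides before matching might look like the sketch below. It uses Python's `unicodedata.normalize` with the NFC form, which composes a + ̀ into à; the helper name `nfc` and the sample strings are assumptions made for this example. Normalizing the regex as a plain string is only safe for simple patterns like this one.

```python
import re
import unicodedata

def nfc(text: str) -> str:
    # Compose characters where possible (a + U+0300 becomes U+00E0).
    return unicodedata.normalize("NFC", text)

subject = "voila\u0300"   # à stored as U+0061 U+0300
pattern = "\u00e0"        # à stored as U+00E0

# Without normalization the two encodings of à do not match.
raw_match = re.search(pattern, subject)

# After normalizing both sides to NFC they do.
normalized_match = re.search(nfc(pattern), nfc(subject))

print(raw_match)                       # None
print(normalized_match is not None)    # True
```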

If you are using Java, you can pass the CANON_EQ flag as the second parameter to Pattern.compile(). This tells the Java regex engine to treat canonically equivalent characters, which look the same, as identical. For example, a regex containing à (U+00E0) will match à (U+0061 U+0300), and vice versa. None of the other regex engines discussed here supports this.

If you type à on the keyboard, most word processors insert the code point U+00E0 into the file. So if you are working with text that you typed yourself, any regex that you type yourself will match it in the same way.


Source : Viblo