Characters, Code Points, Graphemes
All of the Unicode regex engines mentioned in this article treat a single Unicode code point as a character. In a regex, . matches any character; in Unicode terms, that means any single code point. The character à can be encoded as two code points: U+0061 (a) followed by U+0300 (combining grave accent). In that encoding, . matches only the a, without the accent, and ^.$ does not match à at all: since à consists of two code points, you need ^..$ instead.
"à".length => 2
Wow!
The Unicode code point U+0300 (combining grave accent) is a combining mark: a mark meant to be combined with another character. Any code point that is not itself a combining mark can be followed by any number of combining marks. A sequence such as U+0061 U+0300 is called a grapheme.
Unfortunately, à can also be encoded as the single code point U+00E0. The reason for this duality is that older character sets encoded à as a single character.
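To see the difference between code points and graphemes in practice, here is a short Ruby sketch (Ruby is the language of the examples in this article; the \X grapheme-cluster escape used at the end is supported by Ruby's regex engine and several other flavors, but not all):

```ruby
single = "\u00E0"   # à encoded as one code point, U+00E0
double = "a\u0300"  # à encoded as two code points, U+0061 U+0300

single.length            # 1 code point
double.length            # 2 code points
double.scan(/./).length  # 2: the dot matches one code point at a time
double.scan(/\X/).length # 1: \X matches a whole grapheme
```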
Matching a Specific Code Point
To match a specific Unicode code point, use \uFFFF, where FFFF is the hexadecimal number of the code point you want. You must always write exactly four hexadecimal digits. For example, \u00E0 matches à, but only when à is encoded as the single code point U+00E0.
Some flavors, such as Perl, PCRE, Boost and std::regex, use \x{FFFF} instead of \uFFFF. In the braced form you may omit leading zeros. Because \x on its own is not a valid regex token, \x{1234} cannot be misread as \x repeated 1234 times; it always matches the Unicode code point U+1234. \x{1234}{1111} matches code point U+1234 exactly 1111 times.
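As a small sketch, Ruby accepts the \uFFFF form inside regex literals (Ruby does not use the \x{FFFF} form; that one belongs to Perl, PCRE, Boost and std::regex):

```ruby
precomposed = "\u00E0"       # à as the single code point U+00E0
decomposed  = "\u0061\u0300" # a followed by the combining grave accent

/\u00E0/.match?(precomposed) # true: the code point U+00E0 is present
/\u00E0/.match?(decomposed)  # false: U+00E0 never appears in this string
```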
Unicode Categories
The Unicode standard divides code points into a number of categories, and each Unicode character belongs to exactly one of them. You can match any character in the "Letter" category with \p{L}, and any character outside it with \P{L}.
As mentioned, one character equals one Unicode code point, so \p{L} matches a single code point in the "Letter" category. If the input is à encoded as U+00E0, it matches the whole à. If the input is à encoded as U+0061 U+0300, it matches only the a. The reason is that both U+0061 (a) and U+00E0 (à) belong to the "Letter" category, while U+0300 belongs to the "Mark" category.
Here are a few tests with à, here meaning a plus the combining grave accent, encoded as U+0061 U+0300, not the à typed directly on a keyboard:

"à".match(/\A\P{M}\p{M}\z/) => #<MatchData "à">
In Ruby regexes, \A and \z are equivalent to ^ and $ in most other flavors. \p{M} matches characters in the Unicode "Mark" category; \P{M} matches characters outside it.

"à".match(/\A\P{M}/) => #<MatchData "a">
"à".match(/\p{M}\z/) => #<MatchData "̀"> (this "̀" string is not empty)
Some flavors, such as PCRE, PHP and .NET, are case sensitive about the name after \p: \p{Zs} matches a space separator, while \p{zs} throws an error. The remaining flavors accept both uppercase and lowercase. You should stick to the capitalization listed below so your regex works across all flavors.
Besides the standard \p{L} syntax, Java, Perl, PCRE, the JGsoft flavor and XRegExp 3 also allow the shorthand \pL. This shorthand works only with single-letter Unicode property names. For example, \pLl is not equivalent to \p{Ll}; it is equivalent to \p{L}l, which matches "Al", "àl", or any letter followed by a literal l.
Perl, XRegExp and the JGsoft flavor also support the long form \p{Letter}. The full list of Unicode properties is below. In the long names you may omit underscores (_), hyphens (-) and spaces.
* \p{L} or \p{Letter}: any letter from any language.
  * \p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
  * \p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
  * \p{Lt} or \p{Titlecase_Letter}: a letter appearing at the start of a word when only the first letter of the word is capitalized.
  * \p{L&} or \p{Cased_Letter}: a letter that exists in both uppercase and lowercase variants.
  * \p{Lm} or \p{Modifier_Letter}: a special character used like a letter.
  * \p{Lo} or \p{Other_Letter}: a letter or ideograph with no lowercase or uppercase variants.
* \p{M} or \p{Mark}: a character intended to be combined with another character (grave accent, acute accent, circumflex, ...).
  * \p{Mn} or \p{Non_Spacing_Mark}: a combining character that does not take up extra space.
  * \p{Mc} or \p{Spacing_Combining_Mark}: a combining character that does take up extra space.
  * \p{Me} or \p{Enclosing_Mark}: a character that encloses the character it combines with.
* \p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
  * \p{Zs} or \p{Space_Separator}: an invisible whitespace character that takes up space.
  * \p{Zl} or \p{Line_Separator}: the line separator U+2028.
  * \p{Zp} or \p{Paragraph_Separator}: the paragraph separator U+2029.
* \p{S} or \p{Symbol}: math symbols, currency signs, dingbats, box-drawing characters...
  * \p{Sm} or \p{Math_Symbol}: a math symbol.
  * \p{Sc} or \p{Currency_Symbol}: a currency sign.
  * \p{Sk} or \p{Modifier_Symbol}: a combining character (mark) that is a full character of its own.
  * \p{So} or \p{Other_Symbol}: various symbols outside math symbols, currency signs and modifier symbols.
* \p{N} or \p{Number}: any kind of numeric character.
  * \p{Nd} or \p{Decimal_Digit_Number}: a digit zero through nine in any script except ideographic scripts.
  * \p{Nl} or \p{Letter_Number}: a number that looks like a letter, e.g. a Roman numeral.
  * \p{No} or \p{Other_Number}: a superscript or subscript digit, or a number that is not a digit zero through nine (excluding ideographic numbers).
* \p{P} or \p{Punctuation}: any kind of punctuation.
  * \p{Pd} or \p{Dash_Punctuation}: any kind of hyphen or dash.
  * \p{Ps} or \p{Open_Punctuation}: any kind of opening bracket.
  * \p{Pe} or \p{Close_Punctuation}: any kind of closing bracket.
  * \p{Pi} or \p{Initial_Punctuation}: any kind of opening quote.
  * \p{Pf} or \p{Final_Punctuation}: any kind of closing quote.
  * \p{Pc} or \p{Connector_Punctuation}: punctuation used to connect words, such as the underscore.
  * \p{Po} or \p{Other_Punctuation}: any other punctuation.
* \p{C} or \p{Other}: invisible characters and unused code points.
  * \p{Cc} or \p{Control}: an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F.
  * \p{Cf} or \p{Format}
  * \p{Co} or \p{Private_Use}
  * \p{Cs} or \p{Surrogate}
  * \p{Cn} or \p{Unassigned}
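A minimal Ruby sketch of the category properties above (the sample strings are arbitrary):

```ruby
decomposed = "a\u0300"  # à as U+0061 U+0300

decomposed.scan(/\p{L}/) # ["a"]: only the base letter is in the Letter category
decomposed.scan(/\p{M}/) # ["\u0300"]: the accent is in the Mark category

"x1_-".scan(/\p{N}/)     # ["1"]: Number
"x1_-".scan(/\p{P}/)     # ["_", "-"]: Punctuation (Pc and Pd respectively)
```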
Unicode Scripts
Some code points are assigned to characters; others are not assigned to any character at all. The Unicode standard divides the assigned code points into a number of scripts. Each script is the group of code points used by a specific writing system. Some scripts, such as Thai, correspond to a single language. Other scripts, such as Latin, span many languages.
Some languages are written in several scripts. For example, there is no "Japanese" Unicode script; instead, Unicode provides the Hiragana, Katakana, Han and Latin scripts for use in Japanese documents.
Common is a special script: it contains the kinds of characters shared by many other scripts, including all sorts of punctuation, whitespace and miscellaneous symbols.
Every assigned code point (matching \P{Cn}) belongs to a Unicode script. Unassigned code points (matching \p{Cn}) do not belong to any script.
Perl, PCRE, PHP, Ruby 1.9, Delphi and XRegExp can match the following scripts.
\p{Common}, \p{Arabic}, \p{Armenian}, \p{Bengali}, \p{Bopomofo}, \p{Braille}, \p{Buhid}, \p{Canadian_Aboriginal}, \p{Cherokee}, \p{Cyrillic}, \p{Devanagari}, \p{Ethiopic}, \p{Georgian}, \p{Greek}, \p{Gujarati}, \p{Gurmukhi}, \p{Han}, \p{Hangul}, \p{Hanunoo}, \p{Hebrew}, \p{Hiragana}, \p{Inherited}, \p{Kannada}, \p{Katakana}, \p{Khmer}, \p{Lao}, \p{Latin}, \p{Limbu}, \p{Malayalam}, \p{Mongolian}, \p{Myanmar}, \p{Ogham}, \p{Oriya}, \p{Runic}, \p{Sinhala}, \p{Syriac}, \p{Tagalog}, \p{Tagbanwa}, \p{TaiLe}, \p{Tamil}, \p{Telugu}, \p{Thaana}, \p{Thai}, \p{Tibetan}, \p{Yi}
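In Ruby, one of the flavors listed, script properties can be sketched like this (the sample strings are arbitrary):

```ruby
mixed = "abc 西瓜 かな"  # Latin, Han and Hiragana characters

mixed.scan(/\p{Han}+/)            # ["西瓜"]: Han ideographs only
mixed.scan(/\p{Hiragana}+/)       # ["かな"]
mixed.match?(/\A\p{Latin}+\z/)    # false: the whole string is not Latin
"résumé".match?(/\A\p{Latin}+\z/) # true: é belongs to the Latin script
```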
Unicode Blocks
The Unicode standard also divides code points into blocks, i.e., contiguous ranges. Each block holds either the characters of a specific script, such as "Tibetan", or a specific group of characters, such as "Braille Patterns". Most blocks also contain unassigned code points, reserved for future expansion of the Unicode standard.
Note that Unicode blocks are not the same thing as scripts. A block is a single range of code points (as listed below), while a script draws its characters from across the entire Unicode range. Blocks may include unassigned code points (code points matching \p{Cn}); a script contains only code points actually assigned to characters. In general, if you are unsure which to use, use a script.
For example, the Currency_Symbols block does not contain $ and ¥; the Basic_Latin and Latin-1_Supplement blocks do, even though both characters are currency symbols rather than Latin letters. The reason is historical: the ASCII standard contains the $ sign, and the ISO-8859 standard contains the ¥ sign. You should not blindly pick a block from the list below based on its name; look at the actual range of characters it matches instead. A tool like RegexBuddy can help you. \p{Sc} or \p{Currency_Symbol} is a better choice than \p{InCurrency_Symbols} when you want to match currency symbols.
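The currency advice can be sketched in Ruby: the \p{Sc} category finds the dollar and yen signs even though neither lives in the Currency_Symbols block:

```ruby
text = "Costs $100 or ¥200 or €300"

# The category catches all three signs; the Currency_Symbols block
# (U+20A0 to U+20CF) would only contain the euro sign.
text.scan(/\p{Sc}/)  # ["$", "¥", "€"]
```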
\p{InBasic_Latin}: U+0000–U+007F
\p{InLatin-1_Supplement}: U+0080–U+00FF
\p{InLatin_Extended-A}: U+0100–U+017F
\p{InLatin_Extended-B}: U+0180–U+024F
\p{InIPA_Extensions}: U+0250–U+02AF
\p{InSpacing_Modifier_Letters}: U+02B0–U+02FF
\p{InCombining_Diacritical_Marks}: U+0300–U+036F
\p{InGreek_and_Coptic}: U+0370–U+03FF
\p{InCyrillic}: U+0400–U+04FF
\p{InCyrillic_Supplementary}: U+0500–U+052F
\p{InArmenian}: U+0530–U+058F
\p{InHebrew}: U+0590–U+05FF
\p{InArabic}: U+0600–U+06FF
\p{InSyriac}: U+0700–U+074F
\p{InThaana}: U+0780–U+07BF
\p{InDevanagari}: U+0900–U+097F
\p{InBengali}: U+0980–U+09FF
\p{InGurmukhi}: U+0A00–U+0A7F
\p{InGujarati}: U+0A80–U+0AFF
\p{InOriya}: U+0B00–U+0B7F
\p{InTamil}: U+0B80–U+0BFF
\p{InTelugu}: U+0C00–U+0C7F
\p{InKannada}: U+0C80–U+0CFF
\p{InMalayalam}: U+0D00–U+0D7F
\p{InSinhala}: U+0D80–U+0DFF
\p{InThai}: U+0E00–U+0E7F
\p{InLao}: U+0E80–U+0EFF
\p{InTibetan}: U+0F00–U+0FFF
\p{InMyanmar}: U+1000–U+109F
\p{InGeorgian}: U+10A0–U+10FF
\p{InHangul_Jamo}: U+1100–U+11FF
\p{InEthiopic}: U+1200–U+137F
\p{InCherokee}: U+13A0–U+13FF
\p{InUnified_Canadian_Aboriginal_Syllabics}: U+1400–U+167F
\p{InOgham}: U+1680–U+169F
\p{InRunic}: U+16A0–U+16FF
\p{InTagalog}: U+1700–U+171F
\p{InHanunoo}: U+1720–U+173F
\p{InBuhid}: U+1740–U+175F
\p{InTagbanwa}: U+1760–U+177F
\p{InKhmer}: U+1780–U+17FF
\p{InMongolian}: U+1800–U+18AF
\p{InLimbu}: U+1900–U+194F
\p{InTai_Le}: U+1950–U+197F
\p{InKhmer_Symbols}: U+19E0–U+19FF
\p{InPhonetic_Extensions}: U+1D00–U+1D7F
\p{InLatin_Extended_Additional}: U+1E00–U+1EFF
\p{InGreek_Extended}: U+1F00–U+1FFF
\p{InGeneral_Punctuation}: U+2000–U+206F
\p{InSuperscripts_and_Subscripts}: U+2070–U+209F
\p{InCurrency_Symbols}: U+20A0–U+20CF
\p{InCombining_Diacritical_Marks_for_Symbols}: U+20D0–U+20FF
\p{InLetterlike_Symbols}: U+2100–U+214F
\p{InNumber_Forms}: U+2150–U+218F
\p{InArrows}: U+2190–U+21FF
\p{InMathematical_Operators}: U+2200–U+22FF
\p{InMiscellaneous_Technical}: U+2300–U+23FF
\p{InControl_Pictures}: U+2400–U+243F
\p{InOptical_Character_Recognition}: U+2440–U+245F
\p{InEnclosed_Alphanumerics}: U+2460–U+24FF
\p{InBox_Drawing}: U+2500–U+257F
\p{InBlock_Elements}: U+2580–U+259F
\p{InGeometric_Shapes}: U+25A0–U+25FF
\p{InMiscellaneous_Symbols}: U+2600–U+26FF
\p{InDingbats}: U+2700–U+27BF
\p{InMiscellaneous_Mathematical_Symbols-A}: U+27C0–U+27EF
\p{InSupplemental_Arrows-A}: U+27F0–U+27FF
\p{InBraille_Patterns}: U+2800–U+28FF
\p{InSupplemental_Arrows-B}: U+2900–U+297F
\p{InMiscellaneous_Mathematical_Symbols-B}: U+2980–U+29FF
\p{InSupplemental_Mathematical_Operators}: U+2A00–U+2AFF
\p{InMiscellaneous_Symbols_and_Arrows}: U+2B00–U+2BFF
\p{InCJK_Radicals_Supplement}: U+2E80–U+2EFF
\p{InKangxi_Radicals}: U+2F00–U+2FDF
\p{InIdeographic_Description_Characters}: U+2FF0–U+2FFF
\p{InCJK_Symbols_and_Punctuation}: U+3000–U+303F
\p{InHiragana}: U+3040–U+309F
\p{InKatakana}: U+30A0–U+30FF
\p{InBopomofo}: U+3100–U+312F
\p{InHangul_Compatibility_Jamo}: U+3130–U+318F
\p{InKanbun}: U+3190–U+319F
\p{InBopomofo_Extended}: U+31A0–U+31BF
\p{InKatakana_Phonetic_Extensions}: U+31F0–U+31FF
\p{InEnclosed_CJK_Letters_and_Months}: U+3200–U+32FF
\p{InCJK_Compatibility}: U+3300–U+33FF
\p{InCJK_Unified_Ideographs_Extension_A}: U+3400–U+4DBF
\p{InYijing_Hexagram_Symbols}: U+4DC0–U+4DFF
\p{InCJK_Unified_Ideographs}: U+4E00–U+9FFF
\p{InYi_Syllables}: U+A000–U+A48F
\p{InYi_Radicals}: U+A490–U+A4CF
\p{InHangul_Syllables}: U+AC00–U+D7AF
\p{InHigh_Surrogates}: U+D800–U+DB7F
\p{InHigh_Private_Use_Surrogates}: U+DB80–U+DBFF
\p{InLow_Surrogates}: U+DC00–U+DFFF
\p{InPrivate_Use_Area}: U+E000–U+F8FF
\p{InCJK_Compatibility_Ideographs}: U+F900–U+FAFF
\p{InAlphabetic_Presentation_Forms}: U+FB00–U+FB4F
\p{InArabic_Presentation_Forms-A}: U+FB50–U+FDFF
\p{InVariation_Selectors}: U+FE00–U+FE0F
\p{InCombining_Half_Marks}: U+FE20–U+FE2F
\p{InCJK_Compatibility_Forms}: U+FE30–U+FE4F
\p{InSmall_Form_Variants}: U+FE50–U+FE6F
\p{InArabic_Presentation_Forms-B}: U+FE70–U+FEFF
\p{InHalfwidth_and_Fullwidth_Forms}: U+FF00–U+FFEF
\p{InSpecials}: U+FFF0–U+FFFF
Not all regex flavors use the same syntax for Unicode blocks. Java, Ruby 2.0 and XRegExp use the \p{InBlock} syntax listed above, while .NET and XML use \p{IsBlock}. Perl and the JGsoft flavor support both. You should use the In form if your flavor supports it: In can only refer to Unicode blocks, while Is can also refer to Unicode properties and scripts.
For .NET and XML, you must omit the underscores but keep the hyphens in the block name. For example, use \p{IsLatinExtended-A} instead of \p{InLatin_Extended-A}. In Java, you must omit the hyphens. .NET and XML are also case sensitive, while Perl, Ruby and the JGsoft flavor are not. Java 4 is case sensitive, but Java 5 and later are case sensitive only about the Is prefix.
The block names accepted by regex flavors are the names defined in the Unicode standard. PCRE and PHP do not support Unicode blocks, only scripts.
Do You Need to Worry About the Different Encodings?
While you should keep in mind the pitfalls created by the different ways of encoding accented characters, you do not always have to worry about them. If you are sure that both the subject string and the regex use the same form, there is nothing to worry about. Converting everything to a single form is called Unicode normalization. All programming languages with native Unicode support, such as Java, C# and VB.NET, provide library routines for normalizing strings. If you normalize both the subject and the regex before attempting the match, no inconsistencies remain.
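In Ruby, for instance, String#unicode_normalize (available since Ruby 2.2) performs this normalization; a minimal sketch:

```ruby
precomposed = "\u00E0"  # à as one code point
decomposed  = "a\u0300" # à as a + combining grave accent

precomposed == decomposed  # false: the code point sequences differ

# Normalizing both strings to NFC (composed form) makes them comparable:
precomposed.unicode_normalize(:nfc) == decomposed.unicode_normalize(:nfc)  # true

# Normalize the subject before matching a regex written with U+00E0:
decomposed.unicode_normalize(:nfc).match?(/\A\u00E0\z/)  # true
```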
If you use Java, you can pass the CANON_EQ flag as the second parameter to Pattern.compile(). This tells Java's regex engine to treat canonically equivalent characters as identical: a regex that matches à (U+00E0) will then also match à (U+0061 U+0300), and vice versa. No regex flavor other than Java's supports this.
If you type à on your keyboard, most word processors store the single code point U+00E0 in the file. So if you are working with text you typed yourself, any regex you also typed will match it consistently.