Characters, Code Points, Graphemes
All of the Unicode regex engines mentioned in this article treat a single Unicode code point as a character. In a regex, . matches any character; in Unicode terms, that means any single code point. The character à can be encoded as two code points: U+0061 (a) followed by U+0300 (combining grave accent). In that encoding, . matches only the a, without the accent, and ^.$ does not match à at all: since à consists of two code points, you need ^..$ instead.
"à".length => 2
Wow!
The Unicode code point U+0300 (combining grave accent) is a combining mark: a mark meant to be combined with another character. Any code point that is not itself a combining mark can be followed by any number of combining marks. A sequence such as U+0061 U+0300 is called a grapheme.
Unfortunately, à can also be encoded as the single code point U+00E0. The reason for this duality is that older character sets encoded à as a single character.
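To see the difference between code points and graphemes in practice, here is a short Ruby sketch (Ruby is the language of the examples in this article; the \X grapheme-cluster escape used at the end is supported by Ruby's regex engine and several other flavors, but not all):

```ruby
single = "\u00E0"   # à encoded as one code point, U+00E0
double = "a\u0300"  # à encoded as two code points, U+0061 U+0300

single.length            # 1 code point
double.length            # 2 code points
double.scan(/./).length  # 2: the dot matches one code point at a time
double.scan(/\X/).length # 1: \X matches a whole grapheme
```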
Matching a Specific Code Point
To match a specific Unicode code point, use \uFFFF, where FFFF is the hexadecimal number of the code point you want. You must always write exactly four hexadecimal digits. For example, \u00E0 matches à, but only when à is encoded as the single code point U+00E0.
Some flavors, such as Perl, PCRE, Boost and std::regex, use \x{FFFF} instead of \uFFFF. In the braced form you may omit leading zeros. Because \x on its own is not a valid regex token, \x{1234} cannot be misread as \x repeated 1234 times; it always matches the Unicode code point U+1234. \x{1234}{1111} matches code point U+1234 exactly 1111 times.
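As a small sketch, Ruby accepts the \uFFFF form inside regex literals (Ruby does not use the \x{FFFF} form; that one belongs to Perl, PCRE, Boost and std::regex):

```ruby
precomposed = "\u00E0"       # à as the single code point U+00E0
decomposed  = "\u0061\u0300" # a followed by the combining grave accent

/\u00E0/.match?(precomposed) # true: the code point U+00E0 is present
/\u00E0/.match?(decomposed)  # false: U+00E0 never appears in this string
```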
Unicode Categories
The Unicode standard divides code points into a number of categories, and each Unicode character belongs to exactly one of them. You can match any character in the "Letter" category with \p{L}, and any character outside it with \P{L}.
As mentioned, one character equals one Unicode code point, so \p{L} matches a single code point in the "Letter" category. If the input is à encoded as U+00E0, it matches the whole à. If the input is à encoded as U+0061 U+0300, it matches only the a. The reason is that both U+0061 (a) and U+00E0 (à) belong to the "Letter" category, while U+0300 belongs to the "Mark" category.
Here are a few tests with à, here meaning a plus the combining grave accent, encoded as U+0061 U+0300, not the à typed directly on a keyboard:

"à".match(/\A\P{M}\p{M}\z/) => #<MatchData "à">
In Ruby regexes, \A and \z are equivalent to ^ and $ in most other flavors. \p{M} matches characters in the Unicode "Mark" category; \P{M} matches characters outside it.

"à".match(/\A\P{M}/) => #<MatchData "a">
"à".match(/\p{M}\z/) => #<MatchData "̀"> (this "̀" string is not empty)
Some flavors, such as PCRE, PHP and .NET, are case sensitive about the name after \p: \p{Zs} matches a space separator, while \p{zs} throws an error. The remaining flavors accept both uppercase and lowercase. You should stick to the capitalization listed below so your regex works across all flavors.
Besides the standard \p{L} syntax, Java, Perl, PCRE, the JGsoft flavor and XRegExp 3 also allow the shorthand \pL. This shorthand works only with single-letter Unicode property names. For example, \pLl is not equivalent to \p{Ll}; it is equivalent to \p{L}l, which matches "Al", "àl", or any letter followed by a literal l.
Perl, XRegExp and the JGsoft flavor also support the long form \p{Letter}. The full list of Unicode properties is below. In the long names you may omit underscores (_), hyphens (-) and spaces.
* \p{L} or \p{Letter}: any letter from any language.
  * \p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
  * \p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
  * \p{Lt} or \p{Titlecase_Letter}: a letter appearing at the start of a word when only the first letter of the word is capitalized.
  * \p{L&} or \p{Cased_Letter}: a letter that exists in both uppercase and lowercase variants.
  * \p{Lm} or \p{Modifier_Letter}: a special character used like a letter.
  * \p{Lo} or \p{Other_Letter}: a letter or ideograph with no lowercase or uppercase variants.
* \p{M} or \p{Mark}: a character intended to be combined with another character (grave accent, acute accent, circumflex, ...).
  * \p{Mn} or \p{Non_Spacing_Mark}: a combining character that does not take up extra space.
  * \p{Mc} or \p{Spacing_Combining_Mark}: a combining character that does take up extra space.
  * \p{Me} or \p{Enclosing_Mark}: a character that encloses the character it combines with.
* \p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
  * \p{Zs} or \p{Space_Separator}: an invisible whitespace character that takes up space.
  * \p{Zl} or \p{Line_Separator}: the line separator U+2028.
  * \p{Zp} or \p{Paragraph_Separator}: the paragraph separator U+2029.
* \p{S} or \p{Symbol}: math symbols, currency signs, dingbats, box-drawing characters...
  * \p{Sm} or \p{Math_Symbol}: a math symbol.
  * \p{Sc} or \p{Currency_Symbol}: a currency sign.
  * \p{Sk} or \p{Modifier_Symbol}: a combining character (mark) that is a full character of its own.
  * \p{So} or \p{Other_Symbol}: various symbols outside math symbols, currency signs and modifier symbols.
* \p{N} or \p{Number}: any kind of numeric character.
  * \p{Nd} or \p{Decimal_Digit_Number}: a digit zero through nine in any script except ideographic scripts.
  * \p{Nl} or \p{Letter_Number}: a number that looks like a letter, e.g. a Roman numeral.
  * \p{No} or \p{Other_Number}: a superscript or subscript digit, or a number that is not a digit zero through nine (excluding ideographic numbers).
* \p{P} or \p{Punctuation}: any kind of punctuation.
  * \p{Pd} or \p{Dash_Punctuation}: any kind of hyphen or dash.
  * \p{Ps} or \p{Open_Punctuation}: any kind of opening bracket.
  * \p{Pe} or \p{Close_Punctuation}: any kind of closing bracket.
  * \p{Pi} or \p{Initial_Punctuation}: any kind of opening quote.
  * \p{Pf} or \p{Final_Punctuation}: any kind of closing quote.
  * \p{Pc} or \p{Connector_Punctuation}: punctuation used to connect words, such as the underscore.
  * \p{Po} or \p{Other_Punctuation}: any other punctuation.
* \p{C} or \p{Other}: invisible characters and unused code points.
  * \p{Cc} or \p{Control}: an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F.
  * \p{Cf} or \p{Format}
  * \p{Co} or \p{Private_Use}
  * \p{Cs} or \p{Surrogate}
  * \p{Cn} or \p{Unassigned}
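A minimal Ruby sketch of the category properties above (the sample strings are arbitrary):

```ruby
decomposed = "a\u0300"  # à as U+0061 U+0300

decomposed.scan(/\p{L}/) # ["a"]: only the base letter is in the Letter category
decomposed.scan(/\p{M}/) # ["\u0300"]: the accent is in the Mark category

"x1_-".scan(/\p{N}/)     # ["1"]: Number
"x1_-".scan(/\p{P}/)     # ["_", "-"]: Punctuation (Pc and Pd respectively)
```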
Unicode Scripts
Some code points are assigned to characters; others are not assigned to any character at all. The Unicode standard divides the assigned code points into a number of scripts. Each script is the group of code points used by a specific writing system. Some scripts, such as Thai, correspond to a single language. Other scripts, such as Latin, span many languages.
Some languages are written in several scripts. For example, there is no "Japanese" Unicode script; instead, Unicode provides the Hiragana, Katakana, Han and Latin scripts for use in Japanese documents.
Common is a special script: it contains the kinds of characters shared by many other scripts, including all sorts of punctuation, whitespace and miscellaneous symbols.
Every assigned code point (matching \P{Cn}) belongs to a Unicode script. Unassigned code points (matching \p{Cn}) do not belong to any script.
Perl, PCRE, PHP, Ruby 1.9, Delphi and XRegExp can match the following scripts.
\p{Common}, \p{Arabic}, \p{Armenian}, \p{Bengali}, \p{Bopomofo}, \p{Braille}, \p{Buhid}, \p{Canadian_Aboriginal}, \p{Cherokee}, \p{Cyrillic}, \p{Devanagari}, \p{Ethiopic}, \p{Georgian}, \p{Greek}, \p{Gujarati}, \p{Gurmukhi}, \p{Han}, \p{Hangul}, \p{Hanunoo}, \p{Hebrew}, \p{Hiragana}, \p{Inherited}, \p{Kannada}, \p{Katakana}, \p{Khmer}, \p{Lao}, \p{Latin}, \p{Limbu}, \p{Malayalam}, \p{Mongolian}, \p{Myanmar}, \p{Ogham}, \p{Oriya}, \p{Runic}, \p{Sinhala}, \p{Syriac}, \p{Tagalog}, \p{Tagbanwa}, \p{TaiLe}, \p{Tamil}, \p{Telugu}, \p{Thaana}, \p{Thai}, \p{Tibetan}, \p{Yi}
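In Ruby, one of the flavors listed, script properties can be sketched like this (the sample strings are arbitrary):

```ruby
mixed = "abc 西瓜 かな"  # Latin, Han and Hiragana characters

mixed.scan(/\p{Han}+/)            # ["西瓜"]: Han ideographs only
mixed.scan(/\p{Hiragana}+/)       # ["かな"]
mixed.match?(/\A\p{Latin}+\z/)    # false: the whole string is not Latin
"résumé".match?(/\A\p{Latin}+\z/) # true: é belongs to the Latin script
```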
Unicode Blocks
The Unicode standard also divides code points into blocks, i.e., contiguous ranges. Each block holds either the characters of a specific script, such as "Tibetan", or a specific group of characters, such as "Braille Patterns". Most blocks also contain unassigned code points, reserved for future expansion of the Unicode standard.
Note that Unicode blocks are not the same thing as scripts. A block is a single range of code points (as listed below), while a script draws its characters from across the entire Unicode range. Blocks may include unassigned code points (code points matching \p{Cn}); a script contains only code points actually assigned to characters. In general, if you are unsure which to use, use a script.
For example, the Currency_Symbols block does not contain $ and ¥; the Basic_Latin and Latin-1_Supplement blocks do, even though both characters are currency symbols rather than Latin letters. The reason is historical: the ASCII standard contains the $ sign, and the ISO-8859 standard contains the ¥ sign. You should not blindly pick a block from the list below based on its name; look at the actual range of characters it matches instead. A tool like RegexBuddy can help you. \p{Sc} or \p{Currency_Symbol} is a better choice than \p{InCurrency_Symbols} when you want to match currency symbols.
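The currency advice can be sketched in Ruby: the \p{Sc} category finds the dollar and yen signs even though neither lives in the Currency_Symbols block:

```ruby
text = "Costs $100 or ¥200 or €300"

# The category catches all three signs; the Currency_Symbols block
# (U+20A0 to U+20CF) would only contain the euro sign.
text.scan(/\p{Sc}/)  # ["$", "¥", "€"]
```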
\p{InBasic_Latin}: U+0000–U+007F
\p{InLatin-1_Supplement}: U+0080–U+00FF
\p{InLatin_Extended-A}: U+0100–U+017F
\p{InLatin_Extended-B}: U+0180–U+024F
\p{InIPA_Extensions}: U+0250–U+02AF
\p{InSpacing_Modifier_Letters}: U+02B0–U+02FF
\p{InCombining_Diacritical_Marks}: U+0300–U+036F
\p{InGreek_and_Coptic}: U+0370–U+03FF
\p{InCyrillic}: U+0400–U+04FF
\p{InCyrillic_Supplementary}: U+0500–U+052F
\p{InArmenian}: U+0530–U+058F
\p{InHebrew}: U+0590–U+05FF
\p{InArabic}: U+0600–U+06FF
\p{InSyriac}: U+0700–U+074F
\p{InThaana}: U+0780–U+07BF
\p{InDevanagari}: U+0900–U+097F
\p{InBengali}: U+0980–U+09FF
\p{InGurmukhi}: U+0A00–U+0A7F
\p{InGujarati}: U+0A80–U+0AFF
\p{InOriya}: U+0B00–U+0B7F
\p{InTamil}: U+0B80–U+0BFF
\p{InTelugu}: U+0C00–U+0C7F
\p{InKannada}: U+0C80–U+0CFF
\p{InMalayalam}: U+0D00–U+0D7F
\p{InSinhala}: U+0D80–U+0DFF
\p{InThai}: U+0E00–U+0E7F
\p{InLao}: U+0E80–U+0EFF
\p{InTibetan}: U+0F00–U+0FFF
\p{InMyanmar}: U+1000–U+109F
\p{InGeorgian}: U+10A0–U+10FF
\p{InHangul_Jamo}: U+1100–U+11FF
\p{InEthiopic}: U+1200–U+137F
\p{InCherokee}: U+13A0–U+13FF
\p{InUnified_Canadian_Aboriginal_Syllabics}: U+1400–U+167F
\p{InOgham}: U+1680–U+169F
\p{InRunic}: U+16A0–U+16FF
\p{InTagalog}: U+1700–U+171F
\p{InHanunoo}: U+1720–U+173F
\p{InBuhid}: U+1740–U+175F
\p{InTagbanwa}: U+1760–U+177F
\p{InKhmer}: U+1780–U+17FF
\p{InMongolian}: U+1800–U+18AF
\p{InLimbu}: U+1900–U+194F
\p{InTai_Le}: U+1950–U+197F
\p{InKhmer_Symbols}: U+19E0–U+19FF
\p{InPhonetic_Extensions}: U+1D00–U+1D7F
\p{InLatin_Extended_Additional}: U+1E00–U+1EFF
\p{InGreek_Extended}: U+1F00–U+1FFF
\p{InGeneral_Punctuation}: U+2000–U+206F
\p{InSuperscripts_and_Subscripts}: U+2070–U+209F
\p{InCurrency_Symbols}: U+20A0–U+20CF
\p{InCombining_Diacritical_Marks_for_Symbols}: U+20D0–U+20FF
\p{InLetterlike_Symbols}: U+2100–U+214F
\p{InNumber_Forms}: U+2150–U+218F
\p{InArrows}: U+2190–U+21FF
\p{InMathematical_Operators}: U+2200–U+22FF
\p{InMiscellaneous_Technical}: U+2300–U+23FF
\p{InControl_Pictures}: U+2400–U+243F
\p{InOptical_Character_Recognition}: U+2440–U+245F
\p{InEnclosed_Alphanumerics}: U+2460–U+24FF
\p{InBox_Drawing}: U+2500–U+257F
\p{InBlock_Elements}: U+2580–U+259F
\p{InGeometric_Shapes}: U+25A0–U+25FF
\p{InMiscellaneous_Symbols}: U+2600–U+26FF
\p{InDingbats}: U+2700–U+27BF
\p{InMiscellaneous_Mathematical_Symbols-A}: U+27C0–U+27EF
\p{InSupplemental_Arrows-A}: U+27F0–U+27FF
\p{InBraille_Patterns}: U+2800–U+28FF
\p{InSupplemental_Arrows-B}: U+2900–U+297F
\p{InMiscellaneous_Mathematical_Symbols-B}: U+2980–U+29FF
\p{InSupplemental_Mathematical_Operators}: U+2A00–U+2AFF
\p{InMiscellaneous_Symbols_and_Arrows}: U+2B00–U+2BFF
\p{InCJK_Radicals_Supplement}: U+2E80–U+2EFF
\p{InKangxi_Radicals}: U+2F00–U+2FDF
\p{InIdeographic_Description_Characters}: U+2FF0–U+2FFF
\p{InCJK_Symbols_and_Punctuation}: U+3000–U+303F
\p{InHiragana}: U+3040–U+309F
\p{InKatakana}: U+30A0–U+30FF
\p{InBopomofo}: U+3100–U+312F
\p{InHangul_Compatibility_Jamo}: U+3130–U+318F
\p{InKanbun}: U+3190–U+319F
\p{InBopomofo_Extended}: U+31A0–U+31BF
\p{InKatakana_Phonetic_Extensions}: U+31F0–U+31FF
\p{InEnclosed_CJK_Letters_and_Months}: U+3200–U+32FF
\p{InCJK_Compatibility}: U+3300–U+33FF
\p{InCJK_Unified_Ideographs_Extension_A}: U+3400–U+4DBF
\p{InYijing_Hexagram_Symbols}: U+4DC0–U+4DFF
\p{InCJK_Unified_Ideographs}: U+4E00–U+9FFF
\p{InYi_Syllables}: U+A000–U+A48F
\p{InYi_Radicals}: U+A490–U+A4CF
\p{InHangul_Syllables}: U+AC00–U+D7AF
\p{InHigh_Surrogates}: U+D800–U+DB7F
\p{InHigh_Private_Use_Surrogates}: U+DB80–U+DBFF
\p{InLow_Surrogates}: U+DC00–U+DFFF
\p{InPrivate_Use_Area}: U+E000–U+F8FF
\p{InCJK_Compatibility_Ideographs}: U+F900–U+FAFF
\p{InAlphabetic_Presentation_Forms}: U+FB00–U+FB4F
\p{InArabic_Presentation_Forms-A}: U+FB50–U+FDFF
\p{InVariation_Selectors}: U+FE00–U+FE0F
\p{InCombining_Half_Marks}: U+FE20–U+FE2F
\p{InCJK_Compatibility_Forms}: U+FE30–U+FE4F
\p{InSmall_Form_Variants}: U+FE50–U+FE6F
\p{InArabic_Presentation_Forms-B}: U+FE70–U+FEFF
\p{InHalfwidth_and_Fullwidth_Forms}: U+FF00–U+FFEF
\p{InSpecials}: U+FFF0–U+FFFF
Not all regex flavors use the same syntax for Unicode blocks. Java, Ruby 2.0 and XRegExp use the \p{InBlock} syntax listed above, while .NET and XML use \p{IsBlock}. Perl and the JGsoft flavor support both. You should use the In form if your flavor supports it: In can only refer to Unicode blocks, while Is can also refer to Unicode properties and scripts.
For .NET and XML, you must omit the underscores but keep the hyphens in the block name. For example, use \p{IsLatinExtended-A} instead of \p{InLatin_Extended-A}. In Java, you must omit the hyphens. .NET and XML are also case sensitive, while Perl, Ruby and the JGsoft flavor are not. Java 4 is case sensitive, but Java 5 and later are case sensitive only about the Is prefix.
The block names accepted by regex flavors are the names defined in the Unicode standard. PCRE and PHP do not support Unicode blocks, only scripts.
Do You Need to Worry About the Different Encodings?
While you should keep in mind the pitfalls created by the different ways of encoding accented characters, you do not always have to worry about them. If you are sure that both the subject string and the regex use the same form, there is nothing to worry about. Converting everything to a single form is called Unicode normalization. All programming languages with native Unicode support, such as Java, C# and VB.NET, provide library routines for normalizing strings. If you normalize both the subject and the regex before attempting the match, no inconsistencies remain.
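In Ruby, for instance, String#unicode_normalize (available since Ruby 2.2) performs this normalization; a minimal sketch:

```ruby
precomposed = "\u00E0"  # à as one code point
decomposed  = "a\u0300" # à as a + combining grave accent

precomposed == decomposed  # false: the code point sequences differ

# Normalizing both strings to NFC (composed form) makes them comparable:
precomposed.unicode_normalize(:nfc) == decomposed.unicode_normalize(:nfc)  # true

# Normalize the subject before matching a regex written with U+00E0:
decomposed.unicode_normalize(:nfc).match?(/\A\u00E0\z/)  # true
```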
If you use Java, you can pass the CANON_EQ flag as the second parameter to Pattern.compile(). This tells Java's regex engine to treat canonically equivalent characters as identical: a regex that matches à (U+00E0) will then also match à (U+0061 U+0300), and vice versa. No regex flavor other than Java's supports this.
If you type à on your keyboard, most word processors store the single code point U+00E0 in the file. So if you are working with text you typed yourself, any regex you also typed will match it consistently.