Introduce
Today while surfing the web, I came across a link about you can not parse HTML with regex (note: a lot of joking answers in there). In short, the theoretical reason is that HTML is Context-Free Language , a more general language standard (also called the parent set of) Regular Expression . However, the reality is a bit different:
- First, regex is in fact not the same as regex by definition of linguistics in computer science;
- You can also use regex for simple HTML, escaping special characters (eg
>
>
). An example regex taken from the link above is:
1 2 | <([^s]+)(s[^>]*?)?(?<!/)> |
But this is not the main content of this article (^^;) That link leads to a short regex allegedly from the Perl power user Abigail , created from the 9x years circulated through newsletter channels. This regex is recognizable if a number is prime:
1 2 | /^1?$|^(11+?)1+$/ |
Interesting, right?
Functional test
In this link, the author has tried regex with Ruby to check to make sure whether this is a good regex, or is Perl’s magic. I will write a similar code but in the Python language:
1 2 3 4 5 6 | <span class="token keyword">import</span> re is_prime <span class="token operator">=</span> <span class="token keyword">lambda</span> num <span class="token punctuation">:</span> re <span class="token punctuation">.</span> search <span class="token punctuation">(</span> r <span class="token string">'^1?$|^(11+?)1+$'</span> <span class="token punctuation">,</span> <span class="token string">'1'</span> <span class="token operator">*</span> num <span class="token punctuation">)</span> <span class="token keyword">is</span> <span class="token boolean">None</span> <span class="token keyword">print</span> <span class="token punctuation">(</span> <span class="token number">100</span> <span class="token punctuation">,</span> is_prime <span class="token punctuation">(</span> <span class="token number">100</span> <span class="token punctuation">)</span> <span class="token punctuation">)</span> <span class="token keyword">print</span> <span class="token punctuation">(</span> <span class="token number">101</span> <span class="token punctuation">,</span> is_prime <span class="token punctuation">(</span> <span class="token number">101</span> <span class="token punctuation">)</span> <span class="token punctuation">)</span> |
Note that the trailing /
starting and ending regex pattern is not included in Python, and the input of this pattern is encode using a unary system. And as expected, the result was as expected:
1 2 3 | 100 False 101 True |
So how does this regex work?
Explain
It can be seen that this regex has 2 parts, divided by |
: means match 1 of 2 is fine. We will go find out one by one:
^1?$
: In this pattern,^
requires the regex to start at the beginning of the string, and$
requires the capture of the end of the string, which means that the pattern must capture the entire string to be valid. The rest is1?
, meaning that catch 0 or 1 of the characters1
, which corresponds to the values 0 and 1, and it’s true that those two values are not prime numbers. If the input is not these two values, we go to the second part of the regex.^(11+?)1+$
: Similar to the above regex, this pattern requires capturing the entire input string. Let’s break down this regex into 2 more parts:(11+?)
: This string will catch 1 number1
, then catch at least1
more1
by the+
operator. However, the operator?
requires the previous element to catch the lazy string (instead of taking as default) [1] . More specifically, the default1+
will catch as many 1s as possible; however1+?
will start by capturing 1 number 1. If it doesn’t work, it will backtrack and catch 2 numbers 1. If still not successful, it backtrack and move to 3 numbers1
, and so on. So,11+?
will start from 2 number1
or more, and so increase the number of digits 1 if caught without success.1+
:1
points to capture group 1, which means what was captured in parentheses. And similarly,1+
ask the group to repeat 1 or more times.
Putting the 2 parts together, we get how it works:
(11+?)
Will start capturing 2 numbers1
, and1+
will start a multiple-of-2 numbers1
. So,(11+?)1+
will catch 4, 6, 8, 10, … the numbers1
, and^(11+?)1+$
will require the string to be captured to occupy the entire first string. into the.- If it fails,
(11+?)
Will catch 3 numbers1
, and^(11+?)1+$
will catch strings with the number1
having length 6, 9, … - Just like that, until
(11+?)
too long for the input string, the regex will end, and the return will not catch, meaning that the input is a prime number.
With the above explanation, surely you also find the operating principle of this regex quite understandable? The capture group will try each value from 2 onwards, and repeat the capture group to try the whole sequence to see if it is a non-trivial multiple of the above value. If a match is reached, the regex will return a match, and if the capture group is too long for the input string and it still cannot find the divisor, then that number is a prime number.
[1] It should be noted that ?
Here is different from the optional operator: 1?
It means to catch 0 or 1 of 1, but 1+?
then as I explained above: notice that the operator ?
It works here on the +
operator, not in the first 1
.
Comment
Compare with Eratosthenes sieve
You can see that this “algorithm” is quite similar to the Eratosthenes sieve , but the only difference is that this algorithm does not remember the numerical values that were excluded to skip checking those conventions anymore. When using this regex with large strings, they will take a lot of time:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | <span class="token keyword">from</span> timeit <span class="token keyword">import</span> timeit <span class="token keyword">def</span> <span class="token function">time</span> <span class="token punctuation">(</span> num <span class="token punctuation">:</span> <span class="token builtin">int</span> <span class="token punctuation">,</span> count <span class="token punctuation">:</span> <span class="token builtin">int</span> <span class="token punctuation">)</span> <span class="token operator">-</span> <span class="token operator">></span> <span class="token boolean">None</span> <span class="token punctuation">:</span> val <span class="token operator">=</span> timeit <span class="token punctuation">(</span> <span class="token keyword">lambda</span> <span class="token punctuation">:</span> is_prime <span class="token punctuation">(</span> num <span class="token punctuation">)</span> <span class="token punctuation">,</span> number <span class="token operator">=</span> count <span class="token punctuation">)</span> <span class="token operator">/</span> count <span class="token keyword">print</span> <span class="token punctuation">(</span> f <span class="token string">'is_prime({num}) takes {val:.06f} secs'</span> <span class="token punctuation">)</span> time <span class="token punctuation">(</span> <span class="token number">100</span> <span class="token punctuation">,</span> <span class="token number">10000</span> <span class="token punctuation">)</span> time <span class="token punctuation">(</span> <span class="token number">101</span> <span class="token punctuation">,</span> <span class="token number">10000</span> <span class="token punctuation">)</span> time <span class="token punctuation">(</span> <span class="token number">1000</span> <span class="token punctuation">,</span> <span class="token number">1000</span> <span class="token punctuation">)</span> time <span class="token punctuation">(</span> <span class="token number">1001</span> <span class="token punctuation">,</span> <span class="token number">1000</span> <span class="token punctuation">)</span> time <span class="token punctuation">(</span> <span class="token number">10000</span> <span class="token punctuation">,</span> <span class="token number">100</span> <span class="token punctuation">)</span> time <span class="token punctuation">(</span> <span class="token number">10001</span> <span class="token punctuation">,</span> <span class="token number">100</span> <span class="token punctuation">)</span> time <span class="token punctuation">(</span> <span class="token number">100000</span> <span class="token punctuation">,</span> <span class="token number">10</span> <span class="token punctuation">)</span> time <span class="token punctuation">(</span> <span class="token number">100001</span> <span class="token punctuation">,</span> <span class="token number">10</span> <span class="token punctuation">)</span> time <span class="token punctuation">(</span> <span class="token number">1000000</span> <span class="token punctuation">,</span> <span class="token number">1</span> <span class="token punctuation">)</span> time <span class="token punctuation">(</span> <span class="token number">1000001</span> <span class="token punctuation">,</span> <span class="token number">1</span> <span class="token punctuation">)</span> |
I have to reduce the number of times to run the code gradually because it takes too long. The result is:
1 2 3 4 5 6 7 8 9 10 11 | is_prime(100) takes 0.000016 secs is_prime(101) takes 0.000135 secs is_prime(1000) takes 0.000064 secs is_prime(1001) takes 0.000312 secs is_prime(10000) takes 0.000843 secs is_prime(10001) takes 0.011829 secs is_prime(100000) takes 0.005481 secs is_prime(100001) takes 0.051016 secs is_prime(1000000) takes 0.211841 secs is_prime(1000001) takes 1.620660 secs |
So, this regex statement is just to show the Turing-complete of regex apparatuses, don’t use this when you need to check elementality (^^;)
So what is the complexity of this algorithm?
In fact, each attempt to make a divisor, the regex will try to capture the entire string, and the divisor to be tested is also the string length. However, the string length is 2 n 2 ^ n 2 n , because the algorithm complexity is calculated in binary; So the complexity of this regex is ( 2 n ) 2 (2 ^ n) ^ 2 ( 2 n ) 2 , on both exponential. In addition, notice that a sequence from binary (the storage of numbers in a computer) to a single binary is lost. O ( 2 n ) O (2 ^ n) O ( 2 n ) of memory (by the length of the string), not to mention time, so you have more and more reasons why you should not use this regex to check elementality (^^;)
But this pattern is not Regular Expression! I tricked anyone but I studied Context-Free Grammar!
If the above sentence sounds familiar, because in the introduction above, I introduced Context-Free Grammar, basically a more general set of Regular Expressions. And, because that regex pattern contains recursion, it is not a regular expression as defined in computer science . The proof is extremely simple, using Pumping lemma takes 3 lines . However, the current regular regex systems support backreference of capture groups (not to mention Perl monsters!). So strictly theoretically, this is actually not a regular expression, but some laws have been relaxed, and have become a subset of context-free grammar bigger than the regular expression. However, if you are willing to temporarily turn a blind eye to books and define regex according to current implementations.
There are no more posts.
Hope this article makes you feel as interesting as the first time I read about this