Check if a prime number is not equal to regex?

Tram Ho

Introduce

Today while surfing the web, I came across a link about you can not parse HTML with regex (note: a lot of joking answers in there). In short, the theoretical reason is that HTML is Context-Free Language , a more general language standard (also called the parent set of) Regular Expression . However, the reality is a bit different:

  • First, regex is in fact not the same as regex by definition of linguistics in computer science;
  • You can also use regex for simple HTML, escaping special characters (eg > > ). An example regex taken from the link above is:

But this is not the main content of this article (^^;) That link leads to a short regex allegedly from the Perl power user Abigail , created from the 9x years circulated through newsletter channels. This regex is recognizable if a number is prime:

Interesting, right?

Functional test

In this link, the author has tried regex with Ruby to check to make sure whether this is a good regex, or is Perl’s magic. I will write a similar code but in the Python language:

Note that the trailing / starting and ending regex pattern is not included in Python, and the input of this pattern is encode using a unary system. And as expected, the result was as expected:

So how does this regex work?

Explain

It can be seen that this regex has 2 parts, divided by | : means match 1 of 2 is fine. We will go find out one by one:

  • ^1?$ : In this pattern, ^ requires the regex to start at the beginning of the string, and $ requires the capture of the end of the string, which means that the pattern must capture the entire string to be valid. The rest is 1? , meaning that catch 0 or 1 of the characters 1 , which corresponds to the values ​​0 and 1, and it’s true that those two values ​​are not prime numbers. If the input is not these two values, we go to the second part of the regex.
  • ^(11+?)1+$ : Similar to the above regex, this pattern requires capturing the entire input string. Let’s break down this regex into 2 more parts:
    • (11+?) : This string will catch 1 number 1 , then catch at least 1 more 1 by the + operator. However, the operator ? requires the previous element to catch the lazy string (instead of taking as default) [1] . More specifically, the default 1+ will catch as many 1s as possible; however 1+? will start by capturing 1 number 1. If it doesn’t work, it will backtrack and catch 2 numbers 1. If still not successful, it backtrack and move to 3 numbers 1 , and so on. So, 11+? will start from 2 number 1 or more, and so increase the number of digits 1 if caught without success.
    • 1+ : 1 points to capture group 1, which means what was captured in parentheses. And similarly, 1+ ask the group to repeat 1 or more times.

    Putting the 2 parts together, we get how it works:

    • (11+?) Will start capturing 2 numbers 1 , and 1+ will start a multiple-of-2 numbers 1 . So, (11+?)1+ will catch 4, 6, 8, 10, … the numbers 1 , and ^(11+?)1+$ will require the string to be captured to occupy the entire first string. into the.
    • If it fails, (11+?) Will catch 3 numbers 1 , and ^(11+?)1+$ will catch strings with the number 1 having length 6, 9, …
    • Just like that, until (11+?) too long for the input string, the regex will end, and the return will not catch, meaning that the input is a prime number.

With the above explanation, surely you also find the operating principle of this regex quite understandable? The capture group will try each value from 2 onwards, and repeat the capture group to try the whole sequence to see if it is a non-trivial multiple of the above value. If a match is reached, the regex will return a match, and if the capture group is too long for the input string and it still cannot find the divisor, then that number is a prime number.

[1] It should be noted that ? Here is different from the optional operator: 1? It means to catch 0 or 1 of 1, but 1+? then as I explained above: notice that the operator ? It works here on the + operator, not in the first 1 .

Comment

Compare with Eratosthenes sieve

You can see that this “algorithm” is quite similar to the Eratosthenes sieve , but the only difference is that this algorithm does not remember the numerical values ​​that were excluded to skip checking those conventions anymore. When using this regex with large strings, they will take a lot of time:

I have to reduce the number of times to run the code gradually because it takes too long. The result is:

So, this regex statement is just to show the Turing-complete of regex apparatuses, don’t use this when you need to check elementality (^^;)

So what is the complexity of this algorithm?

In fact, each attempt to make a divisor, the regex will try to capture the entire string, and the divisor to be tested is also the string length. However, the string length is 2 n 2 ^ n 2 n , because the algorithm complexity is calculated in binary; So the complexity of this regex is ( 2 n ) 2 (2 ^ n) ^ 2 ( 2 n ) 2 , on both exponential. In addition, notice that a sequence from binary (the storage of numbers in a computer) to a single binary is lost. O ( 2 n ) O (2 ^ n) O ( 2 n ) of memory (by the length of the string), not to mention time, so you have more and more reasons why you should not use this regex to check elementality (^^;)

But this pattern is not Regular Expression! I tricked anyone but I studied Context-Free Grammar!

If the above sentence sounds familiar, because in the introduction above, I introduced Context-Free Grammar, basically a more general set of Regular Expressions. And, because that regex pattern contains recursion, it is not a regular expression as defined in computer science . The proof is extremely simple, using Pumping lemma takes 3 lines . However, the current regular regex systems support backreference of capture groups (not to mention Perl monsters!). So strictly theoretically, this is actually not a regular expression, but some laws have been relaxed, and have become a subset of context-free grammar bigger than the regular expression. However, if you are willing to temporarily turn a blind eye to books and define regex according to current implementations.

There are no more posts.

Hope this article makes you feel as interesting as the first time I read about this

Share the news now

Source : Viblo