Check if a prime number is not equal to regex?

Saturday, 16/05/2020

Tram Ho

Introduce

Today while surfing the web, I came across a link about you can not parse HTML with regex (note: a lot of joking answers in there). In short, the theoretical reason is that HTML is Context-Free Language , a more general language standard (also called the parent set of) Regular Expression . However, the reality is a bit different:

First, regex is in fact not the same as regex by definition of linguistics in computer science;
You can also use regex for simple HTML, escaping special characters (eg > > ). An example regex taken from the link above is:

&lt;([^s]+)(s[^&gt;]*?)?(?&lt;!/)&gt;

1 2	<([^s]+)(s[^>]*?)?(?<!/)>

But this is not the main content of this article (^^;) That link leads to a short regex allegedly from the Perl power user Abigail , created from the 9x years circulated through newsletter channels. This regex is recognizable if a number is prime:

/^1?$|^(11+?)1+$/

1 2	/^1?$\|^(11+?)1+$/

Interesting, right?

Functional test

In this link, the author has tried regex with Ruby to check to make sure whether this is a good regex, or is Perl’s magic. I will write a similar code but in the Python language:

<span class="token keyword">import</span> re
is_prime <span class="token operator">=</span> <span class="token keyword">lambda</span> num <span class="token punctuation">:</span> re <span class="token punctuation">.</span> search <span class="token punctuation">(</span> r <span class="token string">'^1?$|^(11+?)1+$'</span> <span class="token punctuation">,</span> <span class="token string">'1'</span> <span class="token operator">*</span> num <span class="token punctuation">)</span> <span class="token keyword">is</span> <span class="token boolean">None</span>

<span class="token keyword">print</span> <span class="token punctuation">(</span> <span class="token number">100</span> <span class="token punctuation">,</span> is_prime <span class="token punctuation">(</span> <span class="token number">100</span> <span class="token punctuation">)</span> <span class="token punctuation">)</span>
<span class="token keyword">print</span> <span class="token punctuation">(</span> <span class="token number">101</span> <span class="token punctuation">,</span> is_prime <span class="token punctuation">(</span> <span class="token number">101</span> <span class="token punctuation">)</span> <span class="token punctuation">)</span>

import re

is_prime = lambda num : re . search ( r '^1?$|^(11+?)1+$' , '1' * num ) is None

print ( 100 , is_prime ( 100 ) )

print ( 101 , is_prime ( 101 ) )

Note that the trailing / starting and ending regex pattern is not included in Python, and the input of this pattern is encode using a unary system. And as expected, the result was as expected:

100 False
101 True

100 False

101 True

So how does this regex work?

Explain

It can be seen that this regex has 2 parts, divided by | : means match 1 of 2 is fine. We will go find out one by one:

^1?$ : In this pattern, ^ requires the regex to start at the beginning of the string, and $ requires the capture of the end of the string, which means that the pattern must capture the entire string to be valid. The rest is 1? , meaning that catch 0 or 1 of the characters 1 , which corresponds to the values 0 and 1, and it’s true that those two values are not prime numbers. If the input is not these two values, we go to the second part of the regex.
^(11+?)1+$ : Similar to the above regex, this pattern requires capturing the entire input string. Let’s break down this regex into 2 more parts:
- (11+?) : This string will catch 1 number 1 , then catch at least 1 more 1 by the + operator. However, the operator ? requires the previous element to catch the lazy string (instead of taking as default) ^[1] . More specifically, the default 1+ will catch as many 1s as possible; however 1+? will start by capturing 1 number 1. If it doesn’t work, it will backtrack and catch 2 numbers 1. If still not successful, it backtrack and move to 3 numbers 1 , and so on. So, 11+? will start from 2 number 1 or more, and so increase the number of digits 1 if caught without success.
- 1+ : 1 points to capture group 1, which means what was captured in parentheses. And similarly, 1+ ask the group to repeat 1 or more times.
Putting the 2 parts together, we get how it works:
- (11+?) Will start capturing 2 numbers 1 , and 1+ will start a multiple-of-2 numbers 1 . So, (11+?)1+ will catch 4, 6, 8, 10, … the numbers 1 , and ^(11+?)1+$ will require the string to be captured to occupy the entire first string. into the.
- If it fails, (11+?) Will catch 3 numbers 1 , and ^(11+?)1+$ will catch strings with the number 1 having length 6, 9, …
- Just like that, until (11+?) too long for the input string, the regex will end, and the return will not catch, meaning that the input is a prime number.

With the above explanation, surely you also find the operating principle of this regex quite understandable? The capture group will try each value from 2 onwards, and repeat the capture group to try the whole sequence to see if it is a non-trivial multiple of the above value. If a match is reached, the regex will return a match, and if the capture group is too long for the input string and it still cannot find the divisor, then that number is a prime number.

^[1] It should be noted that ? Here is different from the optional operator: 1? It means to catch 0 or 1 of 1, but 1+? then as I explained above: notice that the operator ? It works here on the + operator, not in the first 1 .

Comment

Compare with Eratosthenes sieve

You can see that this “algorithm” is quite similar to the Eratosthenes sieve , but the only difference is that this algorithm does not remember the numerical values that were excluded to skip checking those conventions anymore. When using this regex with large strings, they will take a lot of time:

<span class="token keyword">from</span> timeit <span class="token keyword">import</span> timeit
<span class="token keyword">def</span> <span class="token function">time</span> <span class="token punctuation">(</span> num <span class="token punctuation">:</span> <span class="token builtin">int</span> <span class="token punctuation">,</span> count <span class="token punctuation">:</span> <span class="token builtin">int</span> <span class="token punctuation">)</span> <span class="token operator">-</span> <span class="token operator">&gt;</span> <span class="token boolean">None</span> <span class="token punctuation">:</span>
    val <span class="token operator">=</span> timeit <span class="token punctuation">(</span> <span class="token keyword">lambda</span> <span class="token punctuation">:</span> is_prime <span class="token punctuation">(</span> num <span class="token punctuation">)</span> <span class="token punctuation">,</span> number <span class="token operator">=</span> count <span class="token punctuation">)</span> <span class="token operator">/</span> count
    <span class="token keyword">print</span> <span class="token punctuation">(</span> f <span class="token string">'is_prime({num}) takes {val:.06f} secs'</span> <span class="token punctuation">)</span>
time <span class="token punctuation">(</span> <span class="token number">100</span> <span class="token punctuation">,</span> <span class="token number">10000</span> <span class="token punctuation">)</span>
time <span class="token punctuation">(</span> <span class="token number">101</span> <span class="token punctuation">,</span> <span class="token number">10000</span> <span class="token punctuation">)</span>
time <span class="token punctuation">(</span> <span class="token number">1000</span> <span class="token punctuation">,</span> <span class="token number">1000</span> <span class="token punctuation">)</span>
time <span class="token punctuation">(</span> <span class="token number">1001</span> <span class="token punctuation">,</span> <span class="token number">1000</span> <span class="token punctuation">)</span>
time <span class="token punctuation">(</span> <span class="token number">10000</span> <span class="token punctuation">,</span> <span class="token number">100</span> <span class="token punctuation">)</span>
time <span class="token punctuation">(</span> <span class="token number">10001</span> <span class="token punctuation">,</span> <span class="token number">100</span> <span class="token punctuation">)</span>
time <span class="token punctuation">(</span> <span class="token number">100000</span> <span class="token punctuation">,</span> <span class="token number">10</span> <span class="token punctuation">)</span>
time <span class="token punctuation">(</span> <span class="token number">100001</span> <span class="token punctuation">,</span> <span class="token number">10</span> <span class="token punctuation">)</span>
time <span class="token punctuation">(</span> <span class="token number">1000000</span> <span class="token punctuation">,</span> <span class="token number">1</span> <span class="token punctuation">)</span>
time <span class="token punctuation">(</span> <span class="token number">1000001</span> <span class="token punctuation">,</span> <span class="token number">1</span> <span class="token punctuation">)</span>

from timeit import timeit

def time ( num : int , count : int ) - > None :

val = timeit ( lambda : is_prime ( num ) , number = count ) / count

print ( f 'is_prime({num}) takes {val:.06f} secs' )

time ( 100 , 10000 )

time ( 101 , 10000 )

time ( 1000 , 1000 )

time ( 1001 , 1000 )

time ( 10000 , 100 )

time ( 10001 , 100 )

time ( 100000 , 10 )

time ( 100001 , 10 )

time ( 1000000 , 1 )

time ( 1000001 , 1 )

I have to reduce the number of times to run the code gradually because it takes too long. The result is:

is_prime(100) takes 0.000016 secs
is_prime(101) takes 0.000135 secs
is_prime(1000) takes 0.000064 secs
is_prime(1001) takes 0.000312 secs
is_prime(10000) takes 0.000843 secs
is_prime(10001) takes 0.011829 secs
is_prime(100000) takes 0.005481 secs
is_prime(100001) takes 0.051016 secs
is_prime(1000000) takes 0.211841 secs
is_prime(1000001) takes 1.620660 secs

is_prime(100) takes 0.000016 secs

is_prime(101) takes 0.000135 secs

is_prime(1000) takes 0.000064 secs

is_prime(1001) takes 0.000312 secs

is_prime(10000) takes 0.000843 secs

is_prime(10001) takes 0.011829 secs

is_prime(100000) takes 0.005481 secs

is_prime(100001) takes 0.051016 secs

is_prime(1000000) takes 0.211841 secs

is_prime(1000001) takes 1.620660 secs

So, this regex statement is just to show the Turing-complete of regex apparatuses, don’t use this when you need to check elementality (^^;)

So what is the complexity of this algorithm?

In fact, each attempt to make a divisor, the regex will try to capture the entire string, and the divisor to be tested is also the string length. However, the string length is $2^{n}$ , because the algorithm complexity is calculated in binary; So the complexity of this regex is $(2^{n})^{2}, on both exponential. In addition, notice that a sequence from binary (the storage of numbers in a computer) to a single binary is lost.$ $O (2^{n}) of memory (by the length of the string), not to mention time, so you have more and more reasons why you should not use this regex to check elementality (^^;)$

But this pattern is not Regular Expression! I tricked anyone but I studied Context-Free Grammar!

If the above sentence sounds familiar, because in the introduction above, I introduced Context-Free Grammar, basically a more general set of Regular Expressions. And, because that regex pattern contains recursion, it is not a regular expression as defined in computer science . The proof is extremely simple, using Pumping lemma takes 3 lines . However, the current regular regex systems support backreference of capture groups (not to mention Perl monsters!). So strictly theoretically, this is actually not a regular expression, but some laws have been relaxed, and have become a subset of context-free grammar bigger than the regular expression. However, if you are willing to temporarily turn a blind eye to books and define regex according to current implementations.

There are no more posts.

Hope this article makes you feel as interesting as the first time I read about this

Share the news now

Source : Viblo

Check if a prime number is not equal to regex?

Introduce

Functional test

Explain

Comment

Compare with Eratosthenes sieve

So what is the complexity of this algorithm?

But this pattern is not Regular Expression! I tricked anyone but I studied Context-Free Grammar!

There are no more posts.

TikTok becomes the second largest social platform in South Africa

The fastest depreciating after 9 months of launch, iPhone 14 Pro Max continues to break the bottom in Vietnam

Beginner's guide to R: Introduction

10 essential SublimeText plugins for JavaScript developers