## The Prime That Wasn’t

I’ve seen a lot of problems solved with regular expressions, but last Friday, thanks to Chris and Sean, I found out that one can use them to determine whether a given integer is a prime number. The original articles showed the following regular expression:

**/^1?$|^(11+?)\1+$/**

You don’t apply this to the integer itself. Instead, you create a string of repeating `1`

s, where the numbers of `1`

s is defined by the number you’re testing, and apply the regex to that. If it fails to match, the number is a prime. The regex relies on backtracking and thus would not work with a DFA-based engine, such as PHP’s (now deprecated) ereg* functions. But, it works just fine with the preg_* ones, or so I thought (more on that in a bit).

So, how does it work? It may seem like a brain buster, but it’s actually pretty simple. Ignore the portion before the alternation operator (`|`

), because it’s just a short-circuit case for checking whether the input string is empty or has a single `1`

. (If the string is empty, it means the number being tested is 0. If the string is a single `1`

, it means the number being tested is 1. Neither of these are prime.) The portion after the pipe is where the real magic happens.

The `(11+?)`

sub-pattern matches strings `11`

, `111`

, etc. The `\1+`

matches whatever the sub-pattern matched, one or more times. So on the first try, the engine will match `11`

and then attempt to match the same thing one or more times in the rest of the string. If it succeeds, the number is not a prime. Why? Because it just proved that the length of the string is divisible by 2 (the length of `11`

), therefore the original number is divisible by 2. If the overall match fails, the engine backtracks to the beginning and tries to match `111`

two or more times and so on successively. If the first sub-pattern gets long enough (`n/2`

basically), and the overall match still fails, the number is a prime. Beautiful, isn’t it?

Coincidentally, Sean recently created a code evaluation plugin for the Phergie-based IRC bot that runs in the channel we hang out in. The plugin is a simple proxy to ideone.com, but is helpful for doing quick code tests. We had fun experimenting with this regex pattern implemented in a PHP function that returns the next largest prime number after a given one. The trouble started when Sean fed it 99999999, and the code spit out 100000001. This didn’t seem right, and Wolfram Alpha confirmed; the answer we got was not a prime. (It factors into 17 * 5882353.)

A few similar tests also gave us numbers that were not prime. But, where was the problem? The PHP code was too simple to have a bug, many answers were correct, and the regex algorithm itself was sound. It was time for brute force. I quickly wrote some code to feed increasing odd integers into into regex pattern and check its answers against the normal sieve algorithm to see where it starts failing. The first number to fail was 22201; the regex said it was prime, but it’s actually a perfect square (149^{2}). Past that, the failure rate increased.

It then dawned on me that the culprit might be the backtracking itself, specifically how it was implemented in PCRE, the library at the heart of PHP’s regular expressions. As I mention in my Regex Clinic talk, unbounded backtracking can dramatically slow down the matching and should be well-understood before attempting to write complex patterns. To manage the danger of runaway patterns, PCRE implemented the `pcre.backtrack_limit`

setting a few years ago. In our case, backtracking is used to break up the string of `1`

s into successively larger chunks, and, with very long strings, the engine may run into this backtracking limit, which is set to 100000 by default. My guess was that with the 22201-character-long string, the default was not enough. Once the limit is reached, the match fails, and the number is declared prime.

I bumped the limit up to 200000, and voilà, 22201 was no longer declared prime. To fix the 100000001 match, the limit had to be raised dramatically, to around 250000000! And, the program took 14 seconds to deliver the verdict on my new i5 MacBook Pro. Needless to say, don’t use this prime determination algorithm for any sort of real-life stuff. Instead, just appreciate it for its elegance. I certainly did, and was glad that my bit of investigative work showed that abstract, pure, good ideas can still fail in the messy world of software and hardware.

Awesome post man

I think, the methods of primality testing methods would go newer distance with this. Your PCRE slides are also epic tut for beginner to advanced.

Thanks for sharing, it is really cool stuff! Thanks for mentioning “pcre.backtrack_limit”!

So you discovered a very short reg exp. But do you know the correct one for email addresses? It shouldn’t be that long…

Here it is: http://ex-parrot.com/~pdw/Mail-RFC822-Address.html

I think this is a more useful email pattern:

http://code.iamcal.com/php/rfc822/

You should add preg_last_error() to your code to check for this error.

Ariel, yes, that is a good idea, because you’re bound to run into the limit sooner or later with larger numbers.

Lol, awesome :d Thanks !

beautiful ….

And that post is worth at least twice what I paid for it! Excellent.

@Richard, 0 × 2 = 0

“The \1+ matches whatever the sub-pattern matched, one or more times.”

Hmm, correct me if I’m wrong or maybe this is a problem with egrep but:

echo “h” | egrep ‘(h)\1+’

yields nothing.

egrep ‘(h)\1+’ seems to require 2 or more h’s….

@zitstif: (h) matches one h, \1+ matches one ore more instances of what was matched in the first subpattern (i.e. at least one more h).

Ive never-ending seen “+?” been used in a regex before. What does it mean? Is it equivalent to “*”?

Argh! That should have been “never seen”. Damn auto corrections…

14 seconds seems pretty slow, i ran 6 prime/non prime number tests in 14 milliseconds using C++ on a similar spec machine …is the overhead of PHP calling a C library that high ?

The +? is a lazy +, it forces the engine to match only once before trying permutations. Its an optimisation to reduce the overhead of backtracking…

See here for examples:

http://www.regular-expressions.info/reference.html

Glad to see my regular expression has some use.

Thanks Emerson

Emerson, no, it’s not the overhead, it’s just how the algorithm works.

Abigail, thanks for a great regex!

[...] The Prime That Wasn’t » Andrei Zmievski. Share and Enjoy: [...]

The regex is approximately O(n^2), as it has to visit at least half of the string for each length m of \1 ranging from 1 to n (the remaining part, which is at most half of the string, is the remainder n % m). For each m it will backtrack O(n/m) times, so the backtracking work will be O(n+n/2+n/3+…) which is O(n log n).

If you use /^1?$|^(11+?)\1++$/ it will use only O(n) backtracking, but the cost will remain the same.

[...] fails to test primes, needs to be tested for prime prime test test. By foebea, on August 5th, 2010 http://zmievski.org/2010/08/the-prime-that-wasnt 10 words It’s the matter of a successful balance. » [...]

Ok this is really cool.

Im impressed!

Well done.

art

Great post!

Reddit thread:

http://www.reddit.com/r/programming/comments/cxbuh/that_regular_expression_to_check_for_prime/

*AHEM*abigail*AHEM*

Please give attribution. Also, she has refined this a lot if my jaded memory serves.

[...] The Prime That Wasn’t » Andrei Zmievski Priemgetallen vinden met regular expressions. Ayup. (tags: regex math) [...]

[...] [...]

[...] The Prime That Wasn’t (zmievski.org) [...]

Nice! Thx!

So, how does it work? It may seem like a brain buster, but it

Nice post. Few people appreciate what backtracking really means, and its limitations, when they write regexes for real world use.

But I don’t agree with your last sentence. Neither backtracking nor the regex based prime test is abstract, pure, and certainly not a good idea. Perhaps its abstract or pure for some definition of those terms. But never a good idea. Its a bad idea on paper and as you demonstrated a bad idea in practice.

The prime checker is a nifty little one-liner that did bring a smile to my face when I first came across it. But it is little more than a toy, and its weakness is apparent to everyone as soon as he sees it is building a huge string and backtracking on it.

You could mention that Abigail created this regex. The first mention I can find is in comp.lang.perl.misc back in Nov 23 1997.

Hi Andrei,

Congartulations by the post. I liked too much.

Here is a Python solution based: http://pastebin.com/PGT94gdF

Cheers

Hails!

Awesome job, really appreciated!

Ruby is awesome too: https://gist.github.com/906108

This makes programming fun !!!!

Thanks for making it public

[...] [...]

hello – is it just me !! can any one explain why when i type in the firefox browser “zmievski.org” i get a different site yet whe i type it in google its ok? could this be a bug in my system or is any one else having same probs ?

sadensy

I just like the valuable information you supply in your articles. I’ll bookmark your weblog and take a look at once more here regularly. I am somewhat certain I will be told many new stuff proper here! Good luck for the next!