Category > PHP

The Prime That Wasn’t

03 August 2010 » In PHP, Tech » 27 Comments

I’ve seen a lot of problems solved with regular expressions, but last Friday, thanks to Chris and Sean, I found out that one can use them to determine whether a given integer is a prime number. The original articles showed the following regular expression:

/^1?$|^(11+?)\1+$/

You don’t apply this to the integer itself. Instead, you create a string of repeating 1s, where the numbers of 1s is defined by the number you’re testing, and apply the regex to that. If it fails to match, the number is a prime. The regex relies on backtracking and thus would not work with a DFA-based engine, such as PHP’s (now deprecated) ereg* functions. But, it works just fine with the preg_* ones, or so I thought (more on that in a bit).

So, how does it work? It may seem like a brain buster, but it’s actually pretty simple. Ignore the portion before the alternation operator (|), because it’s just a short-circuit case for checking whether the input string is empty or has a single 1. (If the string is empty, it means the number being tested is 0. If the string is a single 1, it means the number being tested is 1. Neither of these are prime.) The portion after the pipe is where the real magic happens.

The (11+?) sub-pattern matches strings 11, 111, etc. The \1+ matches whatever the sub-pattern matched, one or more times. So on the first try, the engine will match 11 and then attempt to match the same thing one or more times in the rest of the string. If it succeeds, the number is not a prime. Why? Because it just proved that the length of the string is divisible by 2 (the length of 11), therefore the original number is divisible by 2. If the overall match fails, the engine backtracks to the beginning and tries to match 111 two or more times and so on successively. If the first sub-pattern gets long enough (n/2 basically), and the overall match still fails, the number is a prime. Beautiful, isn’t it?

Coincidentally, Sean recently created a code evaluation plugin for the Phergie-based IRC bot that runs in the channel we hang out in. The plugin is a simple proxy to ideone.com, but is helpful for doing quick code tests. We had fun experimenting with this regex pattern implemented in a PHP function that returns the next largest prime number after a given one. The trouble started when Sean fed it 99999999, and the code spit out 100000001. This didn’t seem right, and Wolfram Alpha confirmed; the answer we got was not a prime. (It factors into 17 * 5882353.)

A few similar tests also gave us numbers that were not prime. But, where was the problem? The PHP code was too simple to have a bug, many answers were correct, and the regex algorithm itself was sound. It was time for brute force. I quickly wrote some code to feed increasing odd integers into into regex pattern and check its answers against the normal sieve algorithm to see where it starts failing. The first number to fail was 22201; the regex said it was prime, but it’s actually a perfect square (1492). Past that, the failure rate increased.

It then dawned on me that the culprit might be the backtracking itself, specifically how it was implemented in PCRE, the library at the heart of PHP’s regular expressions. As I mention in my Regex Clinic talk, unbounded backtracking can dramatically slow down the matching and should be well-understood before attempting to write complex patterns. To manage the danger of runaway patterns, PCRE implemented the pcre.backtrack_limit setting a few years ago. In our case, backtracking is used to break up the string of 1s into successively larger chunks, and, with very long strings, the engine may run into this backtracking limit, which is set to 100000 by default. My guess was that with the 22201-character-long string, the default was not enough. Once the limit is reached, the match fails, and the number is declared prime.

I bumped the limit up to 200000, and voilà, 22201 was no longer declared prime. To fix the 100000001 match, the limit had to be raised dramatically, to around 250000000! And, the program took 14 seconds to deliver the verdict on my new i5 MacBook Pro. Needless to say, don’t use this prime determination algorithm for any sort of real-life stuff. Instead, just appreciate it for its elegance. I certainly did, and was glad that my bit of investigative work showed that abstract, pure, good ideas can still fail in the messy world of software and hardware.

Time Machine Forensics

01 July 2010 » In Other, PHP » 2 Comments

About a year ago, I finally wisened up to the obvious: 1) losing data sucks, and 2) doing daily (or frequent) backups manually are a pain in the ass, especially for portables. 2) is why most people forgo backups in the first place, resulting in 1). I needed something that I could set up once and pretty much forget about. Thankfully, there was just the thing: Time Capsule  —  a single device that encapsulates both Airport Extreme base station and a huge hard drive, and can be used for automatic, transparent, wireless backups from any of your Mac devices via the Time Machine.

I purchased the 1 TB version, configured it, and have been using it ever since. It has already saved my butt a couple of times, when I accidentally deleted (rm -rf) a directory and had that familiar sinking feeling, but then remembered that the Time Machine was there ready to lend a hand.

However, I noticed recently that the hourly backups were taking longer and growing in size, sometimes up to 1GB or more. It didn’t seem like there was that much data changing on an hourly basis, so I set out to investigate. A quick look around revealed that the Time Machine itself will not reveal the backup details, but someone wrote a script called timedog that displays the files that the Time Machine backed up during its most recent run (or any of your choosing). The output of the script is something like this (abbreviated):

# cd /Volumes/Backup\ of\ foo/Backups.backupdb/foo
# timedog -d 5 -l
==> Comparing TM backup 2010-06-29-160846 to 2010-06-29-101853
     399B->     399B    [1] /Mac HD/Library/Application Support/CrashReporter/
[..skip..]
   52.6KB->   53.0KB        /Mac HD/Users/andrei/.dbshell
     863B->     866B        /Mac HD/Users/andrei/.lesshst
   11.0KB->   15.1KB        /Mac HD/Users/andrei/.viminfo
   10.4KB->   30.7KB        /Mac HD/Users/andrei/.zsh_hist
    6.7MB->    6.7MB        /Mac HD/Users/andrei/.dropbox/dropbox.db
    3.9MB->    3.9MB        /Mac HD/Users/andrei/.dropbox/migrate.db
   25.2MB->   50.3MB   [10] /Mac HD/Users/andrei/.dropbox/cache/
   21.0KB->   21.0KB    [1] /Mac HD/Users/andrei/Desktop/
  120.0MB->  120.4MB    [1] /Mac HD/Users/andrei/Documents/Linkinus 2 Logs/
  142.8MB->  146.2MB  [156] /Mac HD/Users/andrei/Library/Application Support/
[..skip..]
  608.0MB->  608.0MB    [5] /Mac HD/private/var/data/mongodb/
[..skip..]
==> Total Backup: 967 changed files/directories, 1.88GB

Looking at this, a couple of offenders are immediately obvious. Linkinus (an IRC client) keeps all the conversation logs in one single file, and since I’m frequently on IRC, that file grows and gets backed up every hour. The data files for MongoDB that I use for development are also fairly large and change often. Finally, there is something in Library/Application Support, but it wasn’t shown because of the depth limit I set. After increasing the depth to 7, I discovered that it was the Google Chrome cache and history.

Conversation logs and Chrome stuff are not important enough for me to back up hourly, and MongoDB data I can copy periodically to an off-site server. By excluding these 3 items from the backup process via Time Machine Preferences I was able to reduce the size of the hourlies to 50 MB or less.

This brings up an important point though: while Time Machine is great for doing automated backups over your local network, you should have a separate copy of the data off-site, for redundancy. How and when to do the off-site backups varies according to everyone’s needs, but I would suggest something like CrashPlan, which does unlimited online backups for about $5/month. Once again, it’s automatic and hands-off, which is how you want it.

Tools I found useful while investigating this Time Machine issue:

Vim for Programmers on Slideshare

02 June 2010 » In PHP, Talks » 6 Comments

A few years ago, I was considering what proposal to submit to the Vancouver PHP Conference. The usual slate of “how to do this and that in PHP” was becoming a bit tired, so I decided to submit a talk about an essential skill that PHP (and other language) developers might need: using the Vim editor.

By that time I knew that I was firmly in the Vim camp (as opposed to Emacs or IDEs). Of course, writing a 45 minute talk about Vim is like trying to explain Mulholland Drive during an elevator ride, but I rose to the challenge and put together the first version of the slide deck. When I later received the feedback about the talk, I realized that it was the most highly rated one of the conference, above even Rasmus’s perennial PHP keynote. Clearly, I was onto something.

Since then I’ve expanded and adjusted the talk to fit the 45-60 minute slot, but I still usually run out of time due to the wealth of material. So I published the slides on Slideshare and created a Github repo for my time-tested Vim settings and plugins, so feel free to fork it and submit pull requests. And in general, go forth and Vim.

Regex Clinic on Slideshare

03 May 2010 » In PHP, Talks » 5 Comments

Back in 2004, I submitted a proposal to the only PHP conference ever to have taken place aboard a cruise ship. Yes, the infamous PHP Cruise (“Do we all know each other?”). The talk was about regular expressions and their usage in PHP. I wrote the original version in PowerPoint, but when I boarded the ship and saw Keynote 1.0 on someone’s Powerbook, I was hooked and had to port all the slides to Keynote right then and there.

Well, the talk – and the cruise – went well, and since then Andrei’s Regex Clinic has grown and shrunk in size depending on which conference I was giving it at and how much time was allotted. I considered retiring the talk several times, because surely people should know how to use regular expressions by now, but whenever I give it there is a roomful of people wanting to know what (?>=foo) does. And as much as I try, I always run out of time trying to cover everything from the basics to more advanced usage.

I feel that the slides are pretty polished by now and it’s better if people can read them at a comfortable pace, so the whole tutorial-sized presentation is now available on Slideshare. I omitted the section that covers PHP’s regular expression API, because it already has great online documentation. Otherwise, go forth and read about look-ahead assertions and recursive matching.

Please Start From The Beginning

11 January 2010 » In Development, Me, PHP » 3 Comments

Please Start From The Beginning is a video series that explores the career paths and experiences of web industry professionals. I was honored to be interviewed for the series along with such interesting people as Eric Meyer, Joe Stump, and Elliot Jay Stocks. If you to hear about how the heck I started in Web development, this is a video for you.

PHP Advent 2009: GeoIP Wrangling

21 December 2009 » In Analog, Development, PHP » 5 Comments

There have been a few kind words and questions about the GeoIP-based blurb that greets the visitors to our Analog holding page, so I wanted to blog about how it was made and what obstacles I ran into. However, while writing I realized that it would be the perfect candidate for this year’s edition of PHP Advent instead of the article I originally planned to write about Git and Github. So head on over and read about GeoIP Wrangling and feel free to leave your comments here.

Say Hello to Analog

18 December 2009 » In Analog, Me, PHP, Work » 8 Comments

Wow.

That pretty much sums up my feelings about the response to the announcement of Analog, a web design and development co-operative that I started with a few of my friends. I am stunned and humbled by the many kind words of congratulations, praise, and encouragement that we received about our launch via Twitter, Facebook, and personal communication. Thank you. Many have been wondering what I’ve been up to since leaving Digg in early September, and organizing and setting up Analog has been a big part of it.

The first time I discussed the idea for such a company was when Chris Shiflett and I went to Iceland in June. During that time of renewal, reset, and inspiration, we talked about our desire to work on interesting projects with a great team of peers. People like Jon and Jon, who Chris had worked with on a few occasions. From the start, we wanted to be a bona fide co-operative: an organization owned and operated by a group of individuals for their mutual benefit and adhering to the principles of equality and equitability.

Co-operatives, especially tech ones, aren’t very common — though they are much more prevalent across the pond in the UK — so it took us a while to work out the legal and tax issues of having an equitable company comprised of people from multiple countries. This time was also spent refining our brand identity and brand promise, both of which we believe strongly in. We also needed a short statement to explain who we are and what we do, and in the end it became this (the last part of which harkens back to our original motivation):

Analog is a company of friends who make web sites. It’s a co-operative where imagination, design, and engineering thrive; good people doing good work.

The part about “making web sites” may sound simplistic, but we believe in taking back simple, honest phrases like web site and web developer. They are precise and descriptive despite having been shunned or dismissed by people in favor of things like web application, front-end/back-end engineer, and other seemingly sexier nomenclature meant to sound more important. It’s time to call things what they are. Say it with me, “I am proud to be a web developer.”

Analog origins

Of course, the most difficult thing to decide on was the name. We had a lot more latitude within the .coop top-level domain (TLD), but even then, we must have gone through a hundred or more names looking for one that would somehow reflect our philosophy while being memorable. The flash of inspiration struck while imbibing the potent Bee Sting cider at the inimitable Duke of York in Bristol. The name had an instant appeal, and I imagine we all thought, “Yes, Analog is it.” Jon Tan even left a mark on the table, since he couldn’t wait to see what Analog would look like in type.

Analog appealed to us because of its association with handmade things, craftsmanship, and a “warmer” feeling in general. Somehow it felt good to think that we were going to do digital things the analog way, where a personal touch of each of us would be evident in our work and communication.

The team at Analog is one of the best that I’ve had the honor to be a part of. Alan Colville is an accomplished UX designer and customer experience and usability researcher. He has helped a number of clients in the past, including Vodafone, Virgin Media, BlackBerry, and Visa. Chris Shiflett has extensive background in web development, specializing in web security, and has worked on projects for Ning, National Geographic, Digg, and many other clients during his time as principal of Brain Bulb and OmniTI. Jon Gibbins is an ace developer and web accessibility expert who most recently lent his skills to OmniTI as well. And Jon Tan is simply the best designer and typography maven that I have a pleasure to know, with an extensive body of published work. Between us, we have many years of experience and a bountiful font of creative knowledge.

The type of work that we want to do is twofold. Firstly, we want to take on client projects that are built on an inventive concept, where we have as much creative freedom as possible. By this I mean that, as a group, we want to be part of the initial discussions and brainstorming, so that we can inject our own ideas into the process. We want the projects to utilize both our design and development expertise, involving aspects of the programmable web in a way that supports and enriches the original concept. Secondly, we want to incubate some of the ideas we’ve been knocking around into products that can be spun-off later, if necessary. We’re especially interested in possibilities presented by geo-location, geo-tagging, and other geo-things. We also want to share what we learn and produce as a team. The first thing we’re releasing is the JS grid overlay used on the Analog site; look for it shortly.

We’re on Twitter as @analogcoop. Get in touch if you have a cool project in mind and want to work with us to make it a reality, or use our nifty contact form at analog.coop.

Reblog this post [with Zemanta]

Starting Codeworks 2009

28 September 2009 » In PHP, Talks » 7 Comments

It seems that a short time ago, at the php|tek in Chicago, the Code Works conference was just a glimmer in the eye of Marco Tabini and his associates. And today I am in Atlanta, starting off the east coast leg of the conference along with a few of my friends and colleagues. The first part made stops in San Francisco (where I served as a social director since it’s my home town), Los Angeles and Dallas and it sounds like everyone had a great time.

After Atlanta we go onto Miami, which I am only familiar with through Dexter; Washington D.C., where I hope to take in a couple of sights, since I’ve never been; and New York, a city that I always love to visit.

My talks in this conference will cover VIM, regular expressions, and distributed systems with PHP, including memcache, mogilefs, and Gearman. There are many good talks on the schedule, so consider signing up and joining speakers and other attendees for what is bound to be an excellent event.

Voting opens for my SXSW panel

19 August 2009 » In PHP, Talks, Tech » 1 Comment

I have not yet been to South by Southwest festivals. And every time I mention this to friends and colleagues, they make big eyes and say, “You have to go!”. Well, I finally cracked and submitted a panel proposal for the 2010 SXSW Interactive portion. The panel is titled “Travelog with Maps: When 1,000 Photos Aren’t Enough” and will present and discuss ways to preserve the memories and tell the stories of your trips using latest geolocation tools, social data, and a bit of code. Chris Shiflett will be my co-panelist. We decided to do this panel after trying to share our Iceland trip in real-time and discovering what is and isn’t there to support it.

If you like this idea, please vote for this panel. Thank you!

Goodbye, Digg

10 August 2009 » In Me, PHP, Work » 24 Comments

I will be leaving Digg at the end of this month.

Reset point

The launching point for this decision was my visit in June to Iceland, where I had many opportunities in sublime surrounds to reflect on my life and aspirations. Standing on top of the world  at Sjónarsker, I realized that though my career has spanned some great companies, the next step in my professional life would have to involve building something of my own. Later on, watching a never-ending sunset from the hill at Stykkishólmur, I understood that this something has to happen sooner than later, and serendipitously, an opportunity to do just that has arisen recently. My friend Chris Shiflett, who visited Iceland with me, has similar aspirations and likes to say “good work, good people”, and that is definitely what I intend to do.

Thank you, Joe Stump, for recruiting me. My 8 months at Digg flew by quickly, but the friendships I made there will last for a lifetime, I hope. I have been privileged to work with and next to some of  the best and brightest people that I have met in my life. There is a great road ahead for Diggers as the company advances to stay on the cutting edge of the social news industry. I am sad to be leaving when some very cool developments are afoot, but excited for my own road ahead. By the way, Wine Wednesdays were the best.

I am planning to do a bit of traveling at the end of August and then start September afresh. There is plenty to be done to bring these ideas to life and I cannot wait to share the news when the time is right. Oh, and I’m still going to be involved in PHP, perhaps even more than usual.

Reset. Restart. Renew.

And, seriously, visit Iceland if you have a chance. It is another world.

Page 1 of 712345...Last »