Archive > April 2009

I Used DMCA

22 April 2009 » In Opinion, PHP, Talks » 13 Comments

Yes, it’s true.
A recent post on Twitter from @atourino pointed to my VIM for (PHP) Programmers slidedeck on scribd.com. The slidedeck has been really popular, gathering close to 50,000 views, 2,500 downloads, a few dozen favorites ratings, and a “Hot” award. Good deal, eh? Except that I didn’t upload this slidedeck—someone else did.
Scribd’s about page describes it as the place “where you publish, discover and discuss original writings and documents”. I’ve used it in the past to find all kinds of documents and there’s a lot to like about the site, but the keyword here is “original”. I really don’t mind sharing the slides—heck, I tell everyone at conferences to download them from my site—but on my Talks page I specifically ask people to obtain permission before re-publishing the slides elsewhere. It’s not a difficult thing to do. So far I’ve resisted putting a copyright notice on every slide, because I was hoping the common sense would apply, but apparently not for everyone.
I contacted Scribd’s customer support to see how I could take the ownership of the document in question. They replied that I would have to submit a DMCA copyright infringement notice and ask for the document to be taken down. I understand that this is their policy, but I think this is going overboard, especially for a case like mine. I really wanted to handle this in a polite manner and in such a way that people’s links to the document wouldn’t break instead of doing the dickish move of demanding it be removed completely. At the same time, I feel that the person who uploaded my slides without permission was wrong. Thus, I had no choice but to send the DMCA notice along with a request for the document to be re-assigned to me.
I would encourage everyone to be more careful in handling publicly available content. Please check for any restrictions on usage and publishing, and if in doubt—ask. This will help avoid resorting to heavy-handed stuff like DMCA notices.

Speaking at Dutch PHP Conference 2009

09 April 2009 » In PHP, Talks » 6 Comments

A couple of months ago I saw a teaser post from iBuildings about the new Dutch PHP Conference in Amsterdam this year. I really like Amsterdam, so I started racking my brain to see what kind of proposal I could submit. Not a day had passed when I got an email from Cal Evans asking if I would like to give the opening keynote at the conference. Success! Of course I agreed and suggested an additional talk about distributed processing with PHP, titled “All the Little Pieces”.
The line-up for the conference looks great: Xdebug’s Derick Rethans, php|architect’s Marco Tabini, Zend Framework architect Matthew Weier O’Phinney, security guru Stefan Esser, “RESTful” Ben Ramsey, PHP core developer Scott MacVicar and many others. The early bird pricing is available until April 30, so I would encourage you to take advantage of it and come see what is bound to be a great event.

Bloom Filters Quickie

03 April 2009 » In Development, PHP » 26 Comments

It’s been a couple of month since the release of pecl/memcached and I was getting the urge to write something else. At the same time, I was reading up on Bloom filters, but couldn’t find a PHP extension that implemented them. Thus, pecl/bloomy was born. Now, you may be wondering, what the heck is a Bloom filter why in the blooming sky would I want to use it? Well, read on.
A Bloom filter is a probabilistic data structure that can be used to answer a simple question, is the given element a member of a set? Now, this question can be answered via other means, such as hash table or binary search trees. But the thing about Bloom filters is that they are incredibly space-efficient when the number of potential elements in the set is large. The way that they achieve this is by allowing false positives with a certain error rate. Basically, a Bloom filter will give you either “no” or “maybe” as the answer and it’s up to you to determine the false positive error rate that you can live with. The smaller the rate the larger the size of the filter, of course, but it takes only 9.6 bits per element for 1% rate and every time you add 4.8 bits, the error rate becomes ten times smaller. Compare this to other structures that require storing at least the data elements themselves in the majority of implementations. However, the more elements you add to the set, the larger the probability of false positives becomes, so it is important to estimate the size of your data set properly.
Another nice thing about Bloom filters is that the time to store and look up the element is constant and does not depend on how many elements are in the set. Now that’s a pretty nice property to have, isn’t it?
So, what are they good for? Well, Google BigTable uses Bloom filters to reduce the disk look-ups for non-existent data; Cassandra also uses them to save IO; Digg might use them to implement checks for green tags on Digg buttons, i.e. have my friends Dugg this, etc. The possibilities are many.
The API of the extension is pretty simple. Create BloomFilter object and specify how many elements you expect to have in the set and what false positive rate you can tolerate. The extension will determine the optimal filter size and the number of hash functions to use. Then all do you is add and check elements.

$b = new BloomFilter(100000, 0.001);
$b->add('foo');
$b->add('bar');
$answer1 = $b->has('foo');
$answer2 = $b->has('zoo');

Here we say that we expect to store 100,000 elements with a false positive rate of 0.1%. Then we add a couple of elements and check a couple too. Now, $answer1 will be “yes”, but it may be wrong 0.1% of the time, i.e. if you check 1000 elements, you may expect to get “yes” for 1 of them when in fact it doesn’t exist in the set. The $answer2 will always be “no” because “zoo” was never added to the filter.
What’s the performance like? From a simple benchmark with 0.1% false positive rate, the time to insert 100,000 items was 0.12 seconds, and time to check 100,000 items was 0.11 seconds (this is on my Macbook Pro). The space used by filter? 179721 bytes.
I encourage you to download the extension, play around with Bloom filters and see what uses can come up with.
UPDATE: As Ryan pointed out in the comments, I misspoke when I said that $answer2 would always be “no” for “zoo”. It might be “yes”, but $answer2 would never be “no” for something that is in the filter, like “foo”.