It’s been a couple of month since the release of pecl/memcached and I was getting the urge to write something else. At the same time, I was reading up on Bloom filters, but couldn’t find a PHP extension that implemented them. Thus, pecl/bloomy was born. Now, you may be wondering, what the heck is a Bloom filter why in the blooming sky would I want to use it? Well, read on.
A Bloom filter is a probabilistic data structure that can be used to answer a simple question, is the given element a member of a set? Now, this question can be answered via other means, such as hash table or binary search trees. But the thing about Bloom filters is that they are incredibly space-efficient when the number of potential elements in the set is large. The way that they achieve this is by allowing false positives with a certain error rate. Basically, a Bloom filter will give you either “no” or “maybe” as the answer and it’s up to you to determine the false positive error rate that you can live with. The smaller the rate the larger the size of the filter, of course, but it takes only 9.6 bits per element for 1% rate and every time you add 4.8 bits, the error rate becomes ten times smaller. Compare this to other structures that require storing at least the data elements themselves in the majority of implementations. However, the more elements you add to the set, the larger the probability of false positives becomes, so it is important to estimate the size of your data set properly.
Another nice thing about Bloom filters is that the time to store and look up the element is constant and does not depend on how many elements are in the set. Now that’s a pretty nice property to have, isn’t it?
So, what are they good for? Well, Google BigTable uses Bloom filters to reduce the disk look-ups for non-existent data; Cassandra also uses them to save IO; Digg might use them to implement checks for green tags on Digg buttons, i.e. have my friends Dugg this, etc. The possibilities are many.
The API of the extension is pretty simple. Create BloomFilter object and specify how many elements you expect to have in the set and what false positive rate you can tolerate. The extension will determine the optimal filter size and the number of hash functions to use. Then all do you is add and check elements.
$b = new BloomFilter(100000, 0.001);
$answer1 = $b->has('foo');
$answer2 = $b->has('zoo');
Here we say that we expect to store 100,000 elements with a false positive rate of 0.1%. Then we add a couple of elements and check a couple too. Now,
$answer1 will be “yes”, but it may be wrong 0.1% of the time, i.e. if you check 1000 elements, you may expect to get “yes” for 1 of them when in fact it doesn’t exist in the set. The
$answer2 will always be “no” because “zoo” was never added to the filter.
What’s the performance like? From a simple benchmark with 0.1% false positive rate, the time to insert 100,000 items was 0.12 seconds, and time to check 100,000 items was 0.11 seconds (this is on my Macbook Pro). The space used by filter? 179721 bytes.
I encourage you to download the extension, play around with Bloom filters and see what uses can come up with.
UPDATE: As Ryan pointed out in the comments, I misspoke when I said that $answer2 would always be “no” for “zoo”. It might be “yes”, but $answer2 would never be “no” for something that is in the filter, like “foo”.