Bloom Filters Quickie
It’s been a couple of month since the release of pecl/memcached and I was getting the urge to write something else. At the same time, I was reading up on Bloom filters, but couldn’t find a PHP extension that implemented them. Thus, pecl/bloomy was born. Now, you may be wondering, what the heck is a Bloom filter why in the blooming sky would I want to use it? Well, read on.
A Bloom filter is a probabilistic data structure that can be used to answer a simple question, is the given element a member of a set? Now, this question can be answered via other means, such as hash table or binary search trees. But the thing about Bloom filters is that they are incredibly space-efficient when the number of potential elements in the set is large. The way that they achieve this is by allowing false positives with a certain error rate. Basically, a Bloom filter will give you either “no” or “maybe” as the answer and it’s up to you to determine the false positive error rate that you can live with. The smaller the rate the larger the size of the filter, of course, but it takes only 9.6 bits per element for 1% rate and every time you add 4.8 bits, the error rate becomes ten times smaller. Compare this to other structures that require storing at least the data elements themselves in the majority of implementations. However, the more elements you add to the set, the larger the probability of false positives becomes, so it is important to estimate the size of your data set properly.
Another nice thing about Bloom filters is that the time to store and look up the element is constant and does not depend on how many elements are in the set. Now that’s a pretty nice property to have, isn’t it?
So, what are they good for? Well, Google BigTable uses Bloom filters to reduce the disk look-ups for non-existent data; Cassandra also uses them to save IO; Digg might use them to implement checks for green tags on Digg buttons, i.e. have my friends Dugg this, etc. The possibilities are many.
The API of the extension is pretty simple. Create BloomFilter object and specify how many elements you expect to have in the set and what false positive rate you can tolerate. The extension will determine the optimal filter size and the number of hash functions to use. Then all do you is add and check elements.
$b = new BloomFilter(100000, 0.001);
$b->add('foo');
$b->add('bar');
$answer1 = $b->has('foo');
$answer2 = $b->has('zoo');
Here we say that we expect to store 100,000 elements with a false positive rate of 0.1%. Then we add a couple of elements and check a couple too. Now, $answer1 will be “yes”, but it may be wrong 0.1% of the time, i.e. if you check 1000 elements, you may expect to get “yes” for 1 of them when in fact it doesn’t exist in the set. The $answer2 will always be “no” because “zoo” was never added to the filter.
What’s the performance like? From a simple benchmark with 0.1% false positive rate, the time to insert 100,000 items was 0.12 seconds, and time to check 100,000 items was 0.11 seconds (this is on my Macbook Pro). The space used by filter? 179721 bytes.
I encourage you to download the extension, play around with Bloom filters and see what uses can come up with.
UPDATE: As Ryan pointed out in the comments, I misspoke when I said that $answer2 would always be “no” for “zoo”. It might be “yes”, but $answer2 would never be “no” for something that is in the filter, like “foo”.

03/04/2009 at 11:53 am Permalink
Does this work for custom objects or only for primitive data types?
03/04/2009 at 12:17 pm Permalink
Extremely interesting, thank you Andrei!
03/04/2009 at 12:49 pm Permalink
Bloom filters are cool.
Toying with an them together with a content addressed storage, so each store can publish a bloom filter of its contents rather trivially, just by splitting the content hash into 32bit words, and using those as the keys. Used pecl/bitset for the bit-ops.
Would nice to have memcached (or in something like REDIS would be more appropriate) support so could query/set a bloom filter without having to copy across network to the local process, when all really need is a single boolean response.
Squid also uses bloom filters, to work out if a url is cached, and share that information between squid instances. Uses md5 on the url and then splices that into 4 values IIRC.
03/04/2009 at 1:29 pm Permalink
Would it be possible to store $b somewhere to avoid having to create it across requests, thus making it kind of persistant?
03/04/2009 at 3:30 pm Permalink
This is excellent, but like Pablo mentioned, persistence of the data structure is important in many situations. For example, did I already receive a request with this combination of parameters?
03/04/2009 at 3:33 pm Permalink
@Matt: right now it supports only strings, so essentially any data type that can be represented in the string form. Is that too much of a limitation?
@Pablo: BloomFilter class can serialize itself to a string, so you can simply do serialize($filter) and store it in APC or memcache or elsewhere.
04/04/2009 at 1:05 am Permalink
One of your statements is incorrect. You state:
“The $answer2 will always be “no” because “zoo” was never added to the filter.”
This is incorrect. It is possible that “zoo” could return a ‘yes’ even if it hadn’t been added yet. This is because its possible it hashes to the same values as a previously added string. In your example, if ‘zoo’ hashed to the same values as ‘foo’ or ‘bar’, then this would happen.
However, A Bloom Filter will never return a ‘no’ when the element IS in the filter. A BF can definitely return a yes both when an element is in there, or not in there. This has everything to do with the false positive probability.
04/04/2009 at 1:21 am Permalink
@Ryan, sorry, you’re correct. Thanks for your note.
14/04/2009 at 2:53 am Permalink
thank you for this awsome work
03/04/2009 at 11:41 am Permalink
New blog post: Bloom Filters Quickie http://gravitonic.com/2009/04/bloom-filters-quickie
This comment was originally posted on Twitter
03/04/2009 at 12:39 pm Permalink
[php: Planet PHP] Bloom Filters Quickie – Andrei Zmievski http://gravitonic.com/2009/04/bloom-filters-quickie
This comment was originally posted on Twitter
03/04/2009 at 12:42 pm Permalink
Bloom Filters Quickie – Andrei Zmievski: It’s been a couple of month since the release of pecl/memcached and I w.. http://tinyurl.com/cw3mcw
This comment was originally posted on Twitter
03/04/2009 at 1:15 pm Permalink
reader: Bloom Filters Quickie: It’s been a couple of month since the release of pecl/memcached and I was .. http://tinyurl.com/cw3mcw
This comment was originally posted on Twitter
03/04/2009 at 2:25 pm Permalink
Planet PHP – Bloom Filters Quickie: It’s been a couple of month since the release of pecl/memcached .. http://tinyurl.com/cw3mcw
This comment was originally posted on Twitter
03/04/2009 at 3:38 pm Permalink
[php: Planet PHP] Bloom Filters Quickie – Andrei Zmievski http://gravitonic.com/2009/04/bloom-filters-quickie
This comment was originally posted on Twitter
03/04/2009 at 4:31 pm Permalink
Bloom Filters Quickie – Andrei Zmievski http://tinyurl.com/cw3mcw
This comment was originally posted on Twitter
04/04/2009 at 9:06 am Permalink
Bloom Filters in PHP: http://gravitonic.com/2009/04/bloom-filters-quickie
This comment was originally posted on Twitter
04/04/2009 at 12:48 pm Permalink
[php: Planet PHP] Bloom Filters Quickie – Andrei Zmievski http://gravitonic.com/2009/04/bloom-filters-quickie
This comment was originally posted on Twitter
08/04/2009 at 4:17 am Permalink
bloom filters extension for php http://tinyurl.com/cw3mcw ~ very interesting. wish i could use it on something
This comment was originally posted on Twitter