Bloom Filters Quickie

» 03 April 2009 » In Development, PHP »

It’s been a couple of month since the release of pecl/memcached and I was getting the urge to write something else. At the same time, I was reading up on Bloom filters, but couldn’t find a PHP extension that implemented them. Thus, pecl/bloomy was born. Now, you may be wondering, what the heck is a Bloom filter why in the blooming sky would I want to use it? Well, read on.

A Bloom filter is a probabilistic data structure that can be used to answer a simple question, is the given element a member of a set? Now, this question can be answered via other means, such as hash table or binary search trees. But the thing about Bloom filters is that they are incredibly space-efficient when the number of potential elements in the set is large. The way that they achieve this is by allowing false positives with a certain error rate. Basically, a Bloom filter will give you either “no” or “maybe” as the answer and it’s up to you to determine the false positive error rate that you can live with. The smaller the rate the larger the size of the filter, of course, but it takes only 9.6 bits per element for 1% rate and every time you add 4.8 bits, the error rate becomes ten times smaller. Compare this to other structures that require storing at least the data elements themselves in the majority of implementations. However, the more elements you add to the set, the larger the probability of false positives becomes, so it is important to estimate the size of your data set properly.

Another nice thing about Bloom filters is that the time to store and look up the element is constant and does not depend on how many elements are in the set. Now that’s a pretty nice property to have, isn’t it?

So, what are they good for? Well, Google BigTable uses Bloom filters to reduce the disk look-ups for non-existent data; Cassandra also uses them to save IO; Digg might use them to implement checks for green tags on Digg buttons, i.e. have my friends Dugg this, etc. The possibilities are many.

The API of the extension is pretty simple. Create BloomFilter object and specify how many elements you expect to have in the set and what false positive rate you can tolerate. The extension will determine the optimal filter size and the number of hash functions to use. Then all do you is add and check elements.

$b = new BloomFilter(100000, 0.001);
$b->add('foo');
$b->add('bar');
$answer1 = $b->has('foo');
$answer2 = $b->has('zoo');

Here we say that we expect to store 100,000 elements with a false positive rate of 0.1%. Then we add a couple of elements and check a couple too. Now, $answer1 will be “yes”, but it may be wrong 0.1% of the time, i.e. if you check 1000 elements, you may expect to get “yes” for 1 of them when in fact it doesn’t exist in the set. The $answer2 will always be “no” because “zoo” was never added to the filter.

What’s the performance like? From a simple benchmark with 0.1% false positive rate, the time to insert 100,000 items was 0.12 seconds, and time to check 100,000 items was 0.11 seconds (this is on my Macbook Pro). The space used by filter? 179721 bytes.

I encourage you to download the extension, play around with Bloom filters and see what uses can come up with.

UPDATE: As Ryan pointed out in the comments, I misspoke when I said that $answer2 would always be “no” for “zoo”. It might be “yes”, but $answer2 would never be “no” for something that is in the filter, like “foo”.

Trackback URL

11 Tweets

  1. andrei
    Matt
    03/04/2009 at 11:53 am Permalink

    Does this work for custom objects or only for primitive data types?

  2. andrei
    Bakyt
    03/04/2009 at 12:17 pm Permalink

    Extremely interesting, thank you Andrei!

  3. andrei
    Ren
    03/04/2009 at 12:49 pm Permalink

    Bloom filters are cool.
    Toying with an them together with a content addressed storage, so each store can publish a bloom filter of its contents rather trivially, just by splitting the content hash into 32bit words, and using those as the keys. Used pecl/bitset for the bit-ops.

    Would nice to have memcached (or in something like REDIS would be more appropriate) support so could query/set a bloom filter without having to copy across network to the local process, when all really need is a single boolean response.

    Squid also uses bloom filters, to work out if a url is cached, and share that information between squid instances. Uses md5 on the url and then splices that into 4 values IIRC.

  4. andrei
    Pablo
    03/04/2009 at 1:29 pm Permalink

    Would it be possible to store $b somewhere to avoid having to create it across requests, thus making it kind of persistant?

  5. andrei
    thejo
    03/04/2009 at 3:30 pm Permalink

    This is excellent, but like Pablo mentioned, persistence of the data structure is important in many situations. For example, did I already receive a request with this combination of parameters?

  6. andrei
    Andrei
    03/04/2009 at 3:33 pm Permalink

    @Matt: right now it supports only strings, so essentially any data type that can be represented in the string form. Is that too much of a limitation?

    @Pablo: BloomFilter class can serialize itself to a string, so you can simply do serialize($filter) and store it in APC or memcache or elsewhere.

  7. andrei
    Ryan
    04/04/2009 at 1:05 am Permalink

    One of your statements is incorrect. You state:

    “The $answer2 will always be “no” because “zoo” was never added to the filter.”

    This is incorrect. It is possible that “zoo” could return a ‘yes’ even if it hadn’t been added yet. This is because its possible it hashes to the same values as a previously added string. In your example, if ‘zoo’ hashed to the same values as ‘foo’ or ‘bar’, then this would happen.

    However, A Bloom Filter will never return a ‘no’ when the element IS in the filter. A BF can definitely return a yes both when an element is in there, or not in there. This has everything to do with the false positive probability.

  8. andrei
    Andrei
    04/04/2009 at 1:21 am Permalink

    @Ryan, sorry, you’re correct. Thanks for your note.

  9. andrei
    Ti4iT
    14/04/2009 at 2:53 am Permalink

    thank you for this awsome work

  10. andrei
    Felix Contrera
    02/10/2011 at 11:43 pm Permalink

    It’s perfect time to make some plans for the future and it is time to be happy. I’ve read this post and if I could I wish to suggest you few interesting things or tips. Perhaps you could write next articles referring to this article. I want to read even more things about it!

  11. andrei
    Mạnh
    10/05/2014 at 12:02 pm Permalink

    I’m learning about bloom filters “Multi-dimensional Range Query for Data Management using Bloom Filters”
    you have documentation about this part?
    I speak English is not good! i’m Sorry

  12. andrei
    Iris
    12/10/2014 at 5:26 pm Permalink

    Love the website– very user friendly and great deals to see!

  13. andrei
    Danny
    14/10/2014 at 8:39 am Permalink

    That is very fascinating, You’re an excessively professional blogger. I have joined your feed and look forward to searching for more of your excellent post. Also, I have shared your site in my social networks|

  14. That’s a wise answer to a tricky question

  15. andrei
    a
    03/04/2009 at 11:41 am Permalink

    New blog post: Bloom Filters Quickie http://gravitonic.com/2009/04/bloom-filters-quickie

    This comment was originally posted on Twitter

  16. andrei
    devfunnel
    03/04/2009 at 12:39 pm Permalink

    [php: Planet PHP] Bloom Filters Quickie – Andrei Zmievski http://gravitonic.com/2009/04/bloom-filters-quickie

    This comment was originally posted on Twitter

  17. andrei
    planetphp
    03/04/2009 at 12:42 pm Permalink

    Bloom Filters Quickie – Andrei Zmievski: It’s been a couple of month since the release of pecl/memcached and I w.. http://tinyurl.com/cw3mcw

    This comment was originally posted on Twitter

  18. andrei
    andrewwatson
    03/04/2009 at 1:15 pm Permalink

    reader: Bloom Filters Quickie: It’s been a couple of month since the release of pecl/memcached and I was .. http://tinyurl.com/cw3mcw

    This comment was originally posted on Twitter

  19. andrei
    phpbr
    03/04/2009 at 2:25 pm Permalink

    Planet PHP – Bloom Filters Quickie: It’s been a couple of month since the release of pecl/memcached .. http://tinyurl.com/cw3mcw

    This comment was originally posted on Twitter

  20. andrei
    devfunnel
    03/04/2009 at 3:38 pm Permalink

    [php: Planet PHP] Bloom Filters Quickie – Andrei Zmievski http://gravitonic.com/2009/04/bloom-filters-quickie

    This comment was originally posted on Twitter

  21. andrei
    dlbrown06
    03/04/2009 at 4:31 pm Permalink

    Bloom Filters Quickie – Andrei Zmievski http://tinyurl.com/cw3mcw

    This comment was originally posted on Twitter

  22. andrei
    darrellbrogdon
    04/04/2009 at 9:06 am Permalink

    Bloom Filters in PHP: http://gravitonic.com/2009/04/bloom-filters-quickie

    This comment was originally posted on Twitter

  23. andrei
    devfunnel
    04/04/2009 at 12:48 pm Permalink

    [php: Planet PHP] Bloom Filters Quickie – Andrei Zmievski http://gravitonic.com/2009/04/bloom-filters-quickie

    This comment was originally posted on Twitter

  24. andrei
    phpslacker
    08/04/2009 at 4:17 am Permalink

    bloom filters extension for php http://tinyurl.com/cw3mcw ~ very interesting. wish i could use it on something

    This comment was originally posted on Twitter

Trackbacks

  1. [...] Nachdem Andrei Zmievski am WE eine kleine Bloomfilter Erweiterung in PECL online gestellt hat, habe ich mir ein mögliches…

  2. [...] Bloom Filters Quickie [...]

  3. [...] Zmievski has written a new post about a new extension he’s worked up (out of curiosity for the technology)…

  4. [...] Zmievski has written a new post about a new extension he’s worked up (out of curiosity for the technology)…

  5. [...] http://www.igvita.com/2010/01/06/flow-analysis-time-based-bloom-filters/ http://zmievski.org/2009/04/bloom-filters-quickie http://www.perl.com/pub/2004/04/08/bloom_filters.html [...]

Additional comments powered by BackType