thanks to @yenzie for the picture :P
Bloom filters are statistical data structures. The most common use of them is to consider them as buckets. In one bucket, you add elements. Once you’ve added a bunch of elements, it’s ready to be used.
You use it by presenting it yet an other element, and it’ll be able to say almost always if the element is already in the bucket or not.
More precisely, when asking the question “is this element in the filter ?”, if it answers no, then you are sure that it’s not in there. If it answers yes, then there is a high probability that it’s there.
So basically, you never have false negatives, but you can get a few false positives. The good thing is that depending on the space you allocate to the filter, and the number of elements it contains, you know what will be the probability of having false positives.
The huge benefit is that a bloom filter is very small, compared to a hash table.
At work, I replaced a heavy Redis instance ( using 60g of RAM) that was used primarily as a huge hash table, by a couple of bloom filters ( using 2g ). For that I used bloomd, from Armon Dadgar. It’s light, fast, has enough features, and the code looks sane.
All I needed was a Perl client to connect to it.
So I wrote Bloomd::Client. It is a light client that connects to bloomd using a regular INET socket, and speaks the simple ASCII protocol (very similar to Redis’ one) that bloomd implements.
When you use bloomd it usually means that you are in a high availibility environment, where you can’t get stuck waiting on a socket, just because something went wrong. So Bloomd::Client implements non-blocking timeouts on the socket. It’ll die if bloomd didn’t answer fast enough or if something broke. That allows you to incorporate the bloomd connection in a retry strategy to try again later, or fallback to another server…
To implement such a strategy, I recommend using Action::Retry. There is a blog post about it here :)
dams.