Let's begin with a little bit of theory. The idea behind the filter is to allocate a bit vector of length

*m*, initially all set to 0, and then choose

*k*independent hash functions,

*h*, each with range [

_{1}, h_{2}, ..., h_{k}*1 m*]. When an element

*a*is added to the set then the bits at positions

*h(a)*in the bit vector are set to 1. Given a query element

_{1}, h(a)_{2}, ..., h(a)_{k}*q*we can test whether it is in the set using the bits at positions

*h(q)*in the vector. If any of these bits is 0 we report that

_{1}, h(q)_{2}, ..., h(q)_{k}*q*is not in the set otherwise we report that

*q*is. The thing we have to care about is that in the first case there remains some probability that

*q*is not in the set which could lead us to a false positive response.

The following class is a naive implementation of a Bloom Filter (pay attention: this implementation is not supposed to be suitable for production. It is made just to show how a Bloom Filter works and to study its behavior):

class Bloom: """ Bloom Filter """ def __init__(self,m,k,hash_fun): """ m, size of the vector k, number of hash fnctions to compute hash_fun, hash function to use """ self.m = m # initialize the vector # (attention a real implementation # should use an actual bit-array) self.vector = [0]*m self.k = k self.hash_fun = hash_fun self.data = {} # data structure to store the data self.false_positive = 0 def insert(self,key,value): """ insert the pair (key,value) in the database """ self.data[key] = value for i in range(self.k): self.vector[self.hash_fun(key+str(i)) % self.m] = 1 def contains(self,key): """ check if key is cointained in the database using the filter mechanism """ for i in range(self.k): if self.vector[self.hash_fun(key+str(i)) % self.m] == 0: return False # the key doesn't exist return True # the key can be in the data set def get(self,key): """ return the value associated with key """ if self.contains(key): try: return self.data[key] # actual lookup except KeyError: self.false_positive += 1The usage of this filter is pretty easy, we have to initialize the data structure with a hash function, a value for

*k*and the size of the bit vector then we can start adding items as in this example:

import hashlib def hash_f(x): h = hashlib.sha256(x) # we'll use sha256 just for this example return int(h.hexdigest(),base=16) b = Bloom(100,10,hash_f) b.insert('this is a key','this is a value') print b.get('this is a key')Now, the problem is to choose the parameters of the filter in order to minimize the number of false positive results. We have that after inserting

*n*elements into a table of size

*m*, the probability that a particular bit is still 0 is exactly

Hence, afer

*n*insertions, the probability that a certain bit is 1 is

So, for fixed parameters

*m*and

*n*, the optimal value

**k**that minimizes this probability is

With this in mind we can test our filter. The first thing we need is a function which tests the Bloom Filter for fixed values of

*m*,

*n*and

*k*countinig the percentage of false positive:

import random def rand_data(n, chars): """ generate random strings using the characters in chars """ return ''.join(random.choice(chars) for i in range(n)) def bloomTest(m,n,k): """ return the percentage of false positive """ bloom = Bloom(m,k,hash_f) # generating a random data rand_keys = [rand_data(10,'abcde') for i in range(n)] # pushing the items into the data structure for rk in rand_keys: bloom.insert(rk,'data') # adding other elements to the dataset rand_keys = rand_keys + [rand_data(10,'fghil') for i in range(n)] # performing a query for each element of the dataset for rk in rand_keys: bloom.get(rk) return float(bloom.false_positive)/n*100.0If we fix

*m*= 10000 and

*n*= 1000, according to the equations above, we have that the value of

*k*which minimizes the false positive number is around 6.9314. We can confirm that experimentally with the following test:

# testing the filter m = 10000 n = 1000 k = range(1,64) perc = [bloomTest(m,n,kk) for kk in k] # k is varying # plotting the result of the test from pylab import plot,show,xlabel,ylabel plot(k,perc,'--ob',alpha=.7) ylabel('false positive %') xlabel('k') show()The result of the test should be as follows

Looking at the graph we can confirm that for k around 7 we have the lowest false positive percentage.

In your theory description should it not read

ReplyDelete"Given a query element q .... h(q)1, h(q)2, ..., h(q) in the vector"

rather than h(a)1, h(a)2, ..., h(a)

Thank you Simon.

Delete