Skip to main content

An efficient and scalable bloom filter module built in pure python.

Project description

Bloom Filter

\bloom fil-ter\ Noun

  1. Space-efficient probabilistic data structure that is used to test whether an element is a member of a set.
  2. A fun side project!

About bloomy

Bloomy is a bloom filter module designed to be both lightweight and scalable. Bloomy does not use any external libraries, and scales internally to match scenarios. Bloomy also designed to reduce the rate of false positives. This is done my applying md5 hashing and bit rotation to generate k unique 8 byte hash values, and then using golden ratio compression for even distribution.

Advantages of bloom filters

  • Uses bit arrays to store the presence of items. This significantly reduces the in memory storage.
  • Implements binary and bit masking operations. This reduces the runtime of add and contains operations to practically O(1).
  • Raw types can be used.

Operations

___init__(m,[array of additional hash functions])

		 params: (optional) size of bit array, (optional) additional hash functions 
		 return: bloomFilter object 

Create a bit array bits with m bits cleared to 0, and saves k number of hash functions. M has a default of 1000 and the filter comes with five unique default hash functions.

add(value)

		 params: object to add to the bloomFilter set
		 return: None

Takes in a value and hashes it k times to obtain a list of indexes. For each index obtained by a given hash function, set the value in the bit array bits to 1.

__contains__(value)

		 params: object to search for in the set 
		 return: True/False indicating presence in the set

Takes in a value and hashes it k times to obtain a list of unique indexes. Create a bit mask with the initial value 0. For each index obtained by a hash function, set that index in the mask 1. Perform the following binary operation bits AND mask, and if the value performed by that operation equals the mask, we can conclude that the value probably exists in the bloom filter.

addHashFunction(fn)

		 params: additional hash function to add
		 return: None

Takes in a function as a parameter, and adds this function to the existing list of hash functions used by the bloom filter. The function fn has to take the form fn(value) where value is the item to hash, and the function must return an integer between 0 and the upper index bound. All other following parameters must have default values.

The function also has to be able to provide consistent values for the given input. There is no randomization involved, rather this is a pure key transformation function. Any compression function may be used, as long as it limits the key space within the upper bound of the bit array.

fingerprint(value)

		 params: value to inspect fingerprint
		 return: array (len = k) of hashed values 

Takes in a value and returns an array of the hashed values created by the list of hash functions. Used to debug and inspect the behavior of the hashing mechanism.

mask([ints])

		 params: integer values to use to create bit mask
		 return: integer bit mask

Creates a bit masked integer with 1 in the indexes specified by the parameter integer array.


Example (k = 3, m = 12)

0. constructor 

bit array: 
	--- --- --- --- --- --- --- --- --- --- --- --- --- 
 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ...
	--- --- --- --- --- --- --- --- --- --- --- --- --- 

1. add("Hello world") => 00, 03, 04

bit array: 
	--- --- --- --- --- --- --- --- --- --- --- --- --- 
 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ...
	--- --- --- --- --- --- --- --- --- --- --- --- --- 

2. contains("Hello World") => 00,03,04 

bit array: 
	--- --- --- --- --- --- --- --- --- --- --- --- --- 
 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ...
	--- --- --- --- --- --- --- --- --- --- --- --- --- 
	 ||          ||  ||
	 ||==========||==||=================> 1,1,1 = True 

3. contains("java is the best") => 02,03,07 

bit array: 
	--- --- --- --- --- --- --- --- --- --- --- --- --- 
 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ...
	--- --- --- --- --- --- --- --- --- --- --- --- --- 
					 ||  ||              ||
					 ||==||==============||=====> 0,1,0 = False

Thank you!!

Thank you for checking this out! Please feel free to reach out via email sdcroche@ncsu.edu, or twitter @shmam_

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

bloomy-0.0.2.tar.gz (4.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

bloomy-0.0.2-py3-none-any.whl (5.4 kB view details)

Uploaded Python 3

File details

Details for the file bloomy-0.0.2.tar.gz.

File metadata

  • Download URL: bloomy-0.0.2.tar.gz
  • Upload date:
  • Size: 4.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.18.4 setuptools/40.6.3 requests-toolbelt/0.8.0 tqdm/4.28.1 CPython/3.6.5

File hashes

Hashes for bloomy-0.0.2.tar.gz
Algorithm Hash digest
SHA256 61528e4ec428f81c05408dea6a3ebdc79a51d55f58da671f0ed31322eb064786
MD5 37aeb1733741cdf520a92ad476ea3991
BLAKE2b-256 8a5fcb0f3ee594d5b0e8e504d08416407e44a4fae2d289d75b1224f627d5c2aa

See more details on using hashes here.

File details

Details for the file bloomy-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: bloomy-0.0.2-py3-none-any.whl
  • Upload date:
  • Size: 5.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.18.4 setuptools/40.6.3 requests-toolbelt/0.8.0 tqdm/4.28.1 CPython/3.6.5

File hashes

Hashes for bloomy-0.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 2276d470d5d81e7d69f6a1df3a995b7b1d846ee2c4a55de1ebdbe1cc5159869c
MD5 a55f0aeb08e28ccf8cf49d668236c1a6
BLAKE2b-256 e0ae70edd7a50953f420757bd3554885c4374a7ba824dd4cc684a9dc6ffc65c4

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page