Skip to main content

Highly optimized bloom filter in Python with bindings in Go.

Project description

Bloom filters

PyPI version CircleCI

Source project & considerations

This implementation of the Bloom filter is ported from the https://github.com/bits-and-blooms/bloom (Test Go Report Card Go Reference) project. The reason for porting this project from Go is due to the fact that there are no high-performance implementations of the Bloom filter in Python. All Bloom filter packages we found on PyPi are just too slow for our real-time inference pipeline.

The Godoc documentation of the ported library can be found here: https://pkg.go.dev/github.com/bits-and-blooms/bloom/v3

Benchmark

The following PyPi libraries have been benchmarked on a filter configuration of 1 mil elements and an error rate of 1%. The following shows the time needed to evaluate all elements. These are the results of running the benchmark on an Apple M1 chip. The clear winner is this library.

           shaped-bloom-filter  bloom-filter2  fastbloom-rs  easy-bloom-filter
count               100.000000     100.000000    100.000000         100.000000
mean (ms)            62.859311    2508.295782    464.162710        2201.442912
std (ms)              2.300990      25.811489      6.834798          19.279795
min (ms)             61.051130    2488.384008    457.671881        2177.664995
25% (ms)             61.701119    2496.009886    459.175646        2190.251410
50% (ms)             62.229514    2502.063990    462.151527        2196.185470
75% (ms)             63.203335    2506.640196    466.094494        2205.390155
max (ms)             76.441765    2666.959047    485.892057        2307.396173
speedup               1.000000      39.903329      7.384152          35.021747

Description

A Bloom filter is a concise/compressed representation of a set, where the main requirement is to make membership queries; i.e., whether an item is a member of a set. A Bloom filter will always correctly report the presence of an element in the set when the element is indeed present. A Bloom filter can use much less storage than the original set, but it allows for some 'false positives': it may sometimes report that an element is in the set whereas it is not.

When you construct, you need to know how many elements you have (the desired capacity), and what is the desired false positive rate you are willing to tolerate. A common false-positive rate is 1%. The lower the false-positive rate, the more memory you are going to require. Similarly, the higher the capacity, the more memory you will use. You may construct the Bloom filter capable of receiving 1 million elements with a false-positive rate of 1% in the following manner.

from shaped_bloom_filter import BloomFilter
filter = BloomFilter(1000000, 0.01)

Operations like the following can be done:

# single member addition
assert filter.is_member(10) == False
filter.add(10)
assert filter.is_member(10) == True

# multiple member addition
assert filter.are_members([1, 5, 9]) == [0, 0, 0]
filter.add_batch([5, 9])
assert filter.are_members([1, 5, 9]) == [0, 1, 1]

# serialization
serialized = filter.serialize()
new_filter = BloomFilter(restore_from_serialized=serialized)
assert new_filter.are_members([1, 5, 9]) == [0, 1, 1]

For composing your own keys using Python-serializable objects, you can use the following method:

from shaped_bloom_filter import BloomFilterExtended
filter = BloomFilterExtended(1000000, 0.01)

assert filter.is_one_member("Emma is writing a letter.".encode("utf-8")) == False
filter.add_one_member("Emma is writing a letter.".encode("utf-8"))
assert filter.is_one_member("Emma is writing a letter.".encode("utf-8")) == True

You should call the BloomFilter/BloomFilterExtended constructors conservatively: if you specify a number of elements that it is too small, the false-positive bound might be exceeded. A Bloom filter is not a dynamic data structure: you must know ahead of time what your desired capacity is.

Installation

pip install shaped-bloom-filter

Note: On MacOS/Windows machines, you also need Go as the Python package will be built from source.

Reference

module filter.py


class BloomFilter

function __init__
__init__(
    max_elements: Optional[int] = None,
    error_rate: Optional[float] = None,
    restore_from_serialized: bytes = None
)

function add
add(var: int)

Add a single 32-bit integer key to the filter.


function add_batch
add_batch(var: List[int])

Add a list of 32-bit integer keys to the filter.


function are_members
are_members(var: List[int])  List[bool]

Check if a given list of 32-bit integer keys have been set. A boolean list is returned.


function is_member
is_member(var: int)  bool

Check if a given 32-bit integer key has been set.


function serialize
serialize()  bytes

Serialize the filter for storing purposes. To restore it, pass the returned bytes into the constructor's restore_from_serialized parameter.


function load
load(serialized_bloom_filter: bytes)

Restore a serialized bloom filter into the existing context. Computationally faster than creating a new filter from scratch.


function __contains__
__contains__(var: int)  bool

Check if a given 32-bit integer key has been set.


class BloomFilterExtended(BloomFilter)

function add_one_member
add_one_member(var: Union[List[int], int, bytes])

Add a single key to the filter. Examples of keys: serialized Python objects, strings, 64-digit integers, etc.


function is_one_member
is_one_member(var: Union[List[int], int, bytes])  bool

Check if a single key has been set. Examples of keys: serialized Python objects, strings, 64-digit integers, etc.


This file was automatically generated via lazydocs.

Contributing

If you wish to contribute to this project, please branch and issue a pull request against master ("GitHub Flow")

This project includes a Makefile that allows you to test and build the project with simple commands. To see all available options:

make help

Running all tests

Before committing the code, please check if it passes all tests using (note: this will install some dependencies):

make python-environment
make python-build
make python-install
make python-tests

Design

A Bloom filter has two parameters: m, the number of bits used in storage, and k, the number of hashing functions on elements of the set. (The actual hashing functions are important, too, but this is not a parameter for this implementation). A Bloom filter is backed by a BitSet; a key is represented in the filter by setting the bits at each value of the hashing functions (modulo m). Set membership is done by testing whether the bits at each value of the hashing functions (again, modulo m) are set. If so, the item is in the set. If the item is actually in the set, a Bloom filter will never fail (the true positive rate is 1.0); but it is susceptible to false positives. The art is to choose k and m correctly.

In this implementation, the hashing functions used is murmurhash, a non-cryptographic hashing function.

Given the particular hashing scheme, it's best to be empirical about this. Note that estimating the FP rate will clear the Bloom filter.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

shaped-bloom-filter-3.3.0.1.tar.gz (23.3 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

shaped_bloom_filter-3.3.0.1-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_28_x86_64.whl (643.4 kB view details)

Uploaded CPython 3.11manylinux: glibc 2.28+ x86-64manylinux: glibc 2.5+ x86-64

shaped_bloom_filter-3.3.0.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_28_x86_64.whl (643.4 kB view details)

Uploaded CPython 3.10manylinux: glibc 2.28+ x86-64manylinux: glibc 2.5+ x86-64

shaped_bloom_filter-3.3.0.1-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_28_x86_64.whl (643.4 kB view details)

Uploaded CPython 3.9manylinux: glibc 2.28+ x86-64manylinux: glibc 2.5+ x86-64

shaped_bloom_filter-3.3.0.1-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_28_x86_64.whl (643.4 kB view details)

Uploaded CPython 3.8manylinux: glibc 2.28+ x86-64manylinux: glibc 2.5+ x86-64

shaped_bloom_filter-3.3.0.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_28_x86_64.whl (643.4 kB view details)

Uploaded CPython 3.7mmanylinux: glibc 2.28+ x86-64manylinux: glibc 2.5+ x86-64

shaped_bloom_filter-3.3.0.1-cp36-cp36m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_28_x86_64.whl (643.4 kB view details)

Uploaded CPython 3.6mmanylinux: glibc 2.28+ x86-64manylinux: glibc 2.5+ x86-64

File details

Details for the file shaped-bloom-filter-3.3.0.1.tar.gz.

File metadata

  • Download URL: shaped-bloom-filter-3.3.0.1.tar.gz
  • Upload date:
  • Size: 23.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.7.14

File hashes

Hashes for shaped-bloom-filter-3.3.0.1.tar.gz
Algorithm Hash digest
SHA256 d96ae318e56172a736e56381b93fe0ee93be63ba7272ce0caebe4ab13842825b
MD5 b590c6d218f10ab15596a5fa78b5c045
BLAKE2b-256 51593e3d7c1cb95f0a57907e003c4c5482904f361d12ae53a4541e87929c8cfa

See more details on using hashes here.

File details

Details for the file shaped_bloom_filter-3.3.0.1-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for shaped_bloom_filter-3.3.0.1-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 847d922a3548bb405be443421b986c0b562fd8c89599b49dc10cb987c8fe5a44
MD5 2d6ff0d0d269e67224ce75e9dd8e3f49
BLAKE2b-256 a6e439060834820db8754cfcc686da020929fae2e113449ac7fb193445092edf

See more details on using hashes here.

File details

Details for the file shaped_bloom_filter-3.3.0.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for shaped_bloom_filter-3.3.0.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 df5a0bf7d17e002ecb3538f87cf1112f7ac040c4ada7505f99cd2001c8a67355
MD5 9e17f23f99b22ace071b4f26c219fa14
BLAKE2b-256 d720ad7f82208c67c0fcb27690d53289cb7c1d6ca962f4a7ef75af58436b90d4

See more details on using hashes here.

File details

Details for the file shaped_bloom_filter-3.3.0.1-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for shaped_bloom_filter-3.3.0.1-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 17ec739f585df9240801e91b74221284f2dc42941fe02180e333bc352e07f073
MD5 77b9230c20b832edbb418ff87644089f
BLAKE2b-256 0aabf53cb2399db30f7efb11d0ce47d948350be330b7f06da6e65919d228a4de

See more details on using hashes here.

File details

Details for the file shaped_bloom_filter-3.3.0.1-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for shaped_bloom_filter-3.3.0.1-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 c241517af7bb89e500fb302a7b82ff6fbd97356306a63375fa8ac0c30574e706
MD5 ce12b1ba7c91cac33bbb7312d7be6dc8
BLAKE2b-256 39273e14866c11475156dd07b582783fd12f1861d2a5d12ed1404c6a334befee

See more details on using hashes here.

File details

Details for the file shaped_bloom_filter-3.3.0.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for shaped_bloom_filter-3.3.0.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 2e13308a9457500185c0bd5d76930321056d833f43642b90ea3476fc9e45dae4
MD5 5945a4a2783f2a1b9d82f5608cde2063
BLAKE2b-256 63b1e60bcc1e759efe410d4abe30ea95bb4f0d0317ddaee55dc510db9072bf4a

See more details on using hashes here.

File details

Details for the file shaped_bloom_filter-3.3.0.1-cp36-cp36m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for shaped_bloom_filter-3.3.0.1-cp36-cp36m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 62de6028ce0ebb069ddc73a3debba82ac26df4db3dabf280bb57999c15d9f980
MD5 0c9e2bcec2b45e3f567014841790dad2
BLAKE2b-256 409821c98f5d7a895ce14f459338f4fbe104255a1db9498023787b93fcac21e6

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page