Skip to main content

Highly optimized Bloom filter that mimics the Python set API, written in Rust

Project description

rBloom

PyPI license DOI build

A fast, simple and lightweight Bloom filter library for Python, implemented in Rust. It's designed to be as pythonic as possible, mimicking the built-in set type where it can, and works with any hashable object. While it's a new library (this project was started in 2023), it's currently the fastest option for Python by a long shot (see the section Benchmarks). Releases are published on PyPI.

Quickstart

This library defines only one class, which can be used as follows:

>>> from rbloom import Bloom
>>> bf = Bloom(200, 0.01)  # 200 items max, false positive rate of 1%
>>> bf.add("hello")
>>> "hello" in bf
True
>>> "world" in bf
False
>>> bf.update(["hello", "world"])  # "hello" and "world" now in bf
>>> other_bf = Bloom(200, 0.01)

### add some items to other_bf

>>> third_bf = bf | other_bf    # third_bf now contains all items in
                                # bf and other_bf
>>> third_bf = bf.copy()
... third_bf.update(other_bf)   # same as above
>>> bf.issubset(third_bf)    # bf <= third_bf also works
True

For the full API, see the section Documentation.

Installation

On almost all platforms, simply run:

pip install rbloom

If you're on an uncommon platform, this may cause pip to build the library from source, which requires the Rust toolchain. You can also build rbloom by cloning this repository and running maturin:

maturin build --release

This will create a wheel in the target/wheels/ directory, which can subsequently also be passed to pip.

Why rBloom?

Why should you use this library instead of one of the other Bloom filter libraries on PyPI?

  • Simple: Almost all important methods work exactly like their counterparts in the built-in set type.
  • Fast: rbloom is implemented in Rust, which makes it blazingly fast. See section Benchmarks for more information.
  • Lightweight: rbloom has no dependencies of its own.
  • Maintainable: This library is very concise, and it's written in idiomatic Rust. Even if I were to stop maintaining rbloom (which I don't intend to), it would be trivially easy for you to fork it and keep it working for you.

I started rbloom because I was looking for a simple Bloom filter dependency for a project, and the only sufficiently fast option (pybloomfiltermmap3) was segfaulting on recent Python versions. rbloom ended up being twice as fast and has grown to encompass a more complete API (e.g. with set comparisons like issubset). Do note that it doesn't use mmapped files, however. This shouldn't be an issue in most cases, as the random access heavy nature of a Bloom filter negates the benefits of mmap after very few operations, but it is something to keep in mind for edge cases.

Benchmarks

The following simple benchmark was implemented in the respective API of each library (see the comparison benchmarks):

bf = Bloom(10_000_000, 0.01)

for i in range(10_000_000):
    bf.add(i + 0.5)  # floats because ints are hashed as themselves

for i in range(10_000_000):
    assert i + 0.5 in bf

This resulted in the following average runtimes on an M1 Pro (confirmed to be proportional to runtimes on an Intel machine):

Library Time Notes
rBloom 2.52s works out-of-the-box
pybloomfiltermmap3 4.78s unreliable [1]
pybloom3 46.76s works out-of-the-box
Flor 76.94s doesn't work on arbitrary objects [2]
bloom-filter2 165.54s doesn't work on arbitrary objects [2]

[1] The official package failed to install on Python 3.11 and kept segfaulting on 3.10 (Linux, January 2023). It seems to be fine for now (October 2024). [2] I was forced to convert to a byte representation, which is bad default behavior as it presents the problems mentioned below in the section "Cryptographic security".

Also note that rbloom is compiled against a stable ABI for portability, and that you can get a small but measurable speedup by removing the "abi3-py37" flag from Cargo.toml and building it yourself.

Documentation

This library defines only one class, the signature of which should be thought of as follows. Note that only the first few methods differ from the built-in set type:

class Bloom:

    # expected_items:  max number of items to be added to the filter
    # false_positive_rate:  max false positive rate of the filter
    # hash_func:  optional argument, see section "Cryptographic security"
    def __init__(self, expected_items: int, false_positive_rate: float,
                 hash_func=__builtins__.hash)

    @property
    def size_in_bits(self) -> int      # number of buckets in the filter

    @property
    def hash_func(self) -> Callable[[Any], int]   # retrieve the hash_func
                                                  # given to __init__

    @property
    def approx_items(self) -> float    # estimated number of items in
                                       # the filter

    # see section "Persistence" for more information on these four methods
    @classmethod
    def load(cls, filepath: str, hash_func) -> Bloom
    def save(self, filepath: str)
    @classmethod
    def load_bytes(cls, data: bytes, hash_func) -> Bloom
    def save_bytes(self) -> bytes

    #####################################################################
    #                    ALL SUBSEQUENT METHODS ARE                     #
    #              EQUIVALENT TO THE CORRESPONDING METHODS              #
    #                     OF THE BUILT-IN SET TYPE                      #
    #####################################################################

    def add(self, obj)                            # add obj to self
    def __contains__(self, obj) -> bool           # check if obj in self
    def __bool__(self) -> bool                    # False if empty
    def __repr__(self) -> str                     # basic info

    def __or__(self, other: Bloom) -> Bloom       # self | other
    def __ior__(self, other: Bloom)               # self |= other
    def __and__(self, other: Bloom) -> Bloom      # self & other
    def __iand__(self, other: Bloom)              # self &= other

    # these extend the functionality of __or__, __ior__, __and__, __iand__
    def union(self, *others: Union[Iterable, Bloom]) -> Bloom        # __or__
    def update(self, *others: Union[Iterable, Bloom])                # __ior__
    def intersection(self, *others: Union[Iterable, Bloom]) -> Bloom # __and__
    def intersection_update(self, *others: Union[Iterable, Bloom])   # __iand__

    # these implement <, >, <=, >=, ==, !=
    def __lt__, __gt__, __le__, __ge__, __eq__, __ne__(self,
                                                       other: Bloom) -> bool
    def issubset(self, other: Bloom) -> bool      # self <= other
    def issuperset(self, other: Bloom) -> bool    # self >= other

    def clear(self)                               # remove all items
    def copy(self) -> Bloom                       # duplicate self

To prevent death and destruction, the bitwise set operations only work on filters where all parameters are equal (including the hash functions being the exact same object). Because this is a Bloom filter, the __contains__ and approx_items methods are probabilistic, as are all the methods that compare two filters (such as __le__ and issubset).

Cryptographic security

Python's built-in hash function is designed to be fast, not maximally collision-resistant, so if your program depends on the false positive rate being perfectly correct, you may want to supply your own hash function. This is especially the case when working with very large filters (more than a few tens of millions of items) or when false positives are very costly and could be exploited by an adversary. Just make sure that your hash function returns an integer between -2^127 and 2^127 - 1. Feel free to use the following example in your own code:

from rbloom import Bloom
from hashlib import sha256
from pickle import dumps

def hash_func(obj):
    h = sha256(dumps(obj)).digest()
    # use sys.byteorder instead of "big" for a small speedup when
    # reproducibility across machines isn't a concern
    return int.from_bytes(h[:16], "big", signed=True)

bf = Bloom(100_000_000, 0.01, hash_func)

When you throw away Python's built-in hash function and start hashing serialized representations of objects, however, you open up a breach into the scary realm of the unpythonic:

  • Numbers like 1, 1.0, 1 + 0j and True will suddenly no longer be equal.
  • Instances of classes with custom hashing logic (e.g. to stop caches inside instances from affecting their hashes) will suddenly display undefined behavior.
  • Objects that can't be serialized simply won't be hashable at all.

Making you supply your own hash function in this case is a deliberate design decision intended to show you what you're doing and prevent you from shooting yourself in the foot.

Also note that using a custom hash will incur a performance penalty over using the built-in hash.

Persistence

The save and load methods, along with their byte-oriented counterparts save_bytes and load_bytes, allow you to save and load filters to and from disk/Python bytes objects. However, as the built-in hash function's salt changes between invocations of Python, they only work on filters with custom hash functions. Note that it is your responsibility to ensure that the hash function you supply to the loading functions is the same as the one originally used by the filter you're loading!

bf = Bloom(10_000, 0.01, some_hash_func)
bf.add("hello")
bf.add("world")

# saving to a file
bf.save("bf.bloom")

# loading from a file
loaded_bf = Bloom.load("bf.bloom", some_hash_func)
assert loaded_bf == bf

# saving to bytes
bf_bytes = bf.save_bytes()

# loading from bytes
loaded_bf_from_bytes = Bloom.load_bytes(bf_bytes, some_hash_func)
assert loaded_bf_from_bytes == bf

The size of the file is bf.size_in_bits / 8 + 8 bytes.


Statement of attribution: Bloom filters were originally proposed in (Bloom, 1970). Furthermore, this implementation makes use of a constant recommended by (L'Ecuyer, 1999) for redistributing the entropy of a single hash over multiple integers using a linear congruential generator.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

rbloom-1.5.2.tar.gz (21.2 kB view details)

Uploaded Source

Built Distributions

rbloom-1.5.2-cp37-abi3-win_amd64.whl (174.8 kB view details)

Uploaded CPython 3.7+ Windows x86-64

rbloom-1.5.2-cp37-abi3-win32.whl (167.1 kB view details)

Uploaded CPython 3.7+ Windows x86

rbloom-1.5.2-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (300.1 kB view details)

Uploaded CPython 3.7+ manylinux: glibc 2.17+ x86-64

rbloom-1.5.2-cp37-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl (377.8 kB view details)

Uploaded CPython 3.7+ manylinux: glibc 2.17+ s390x

rbloom-1.5.2-cp37-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl (342.7 kB view details)

Uploaded CPython 3.7+ manylinux: glibc 2.17+ ppc64le

rbloom-1.5.2-cp37-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl (315.5 kB view details)

Uploaded CPython 3.7+ manylinux: glibc 2.17+ ARMv7l

rbloom-1.5.2-cp37-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (310.6 kB view details)

Uploaded CPython 3.7+ manylinux: glibc 2.17+ ARM64

rbloom-1.5.2-cp37-abi3-manylinux_2_5_i686.manylinux1_i686.whl (318.8 kB view details)

Uploaded CPython 3.7+ manylinux: glibc 2.5+ i686

rbloom-1.5.2-cp37-abi3-macosx_11_0_arm64.whl (266.4 kB view details)

Uploaded CPython 3.7+ macOS 11.0+ ARM64

rbloom-1.5.2-cp37-abi3-macosx_10_12_x86_64.whl (272.1 kB view details)

Uploaded CPython 3.7+ macOS 10.12+ x86-64

File details

Details for the file rbloom-1.5.2.tar.gz.

File metadata

  • Download URL: rbloom-1.5.2.tar.gz
  • Upload date:
  • Size: 21.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.7.4

File hashes

Hashes for rbloom-1.5.2.tar.gz
Algorithm Hash digest
SHA256 e83544300ca741abdf7214f8f880c05afca4b536a9d77be824c1c68625685803
MD5 4bfaac917da1cdabdd7956c2c3c81587
BLAKE2b-256 d8bbecdaf82414e7ae2d7b7db2568cd06ded008f8bd0fbaa788ddc5f2eeb4be0

See more details on using hashes here.

File details

Details for the file rbloom-1.5.2-cp37-abi3-win_amd64.whl.

File metadata

  • Download URL: rbloom-1.5.2-cp37-abi3-win_amd64.whl
  • Upload date:
  • Size: 174.8 kB
  • Tags: CPython 3.7+, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.7.4

File hashes

Hashes for rbloom-1.5.2-cp37-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 7297f2040d4e8882baea29cb0c554b9c008b5db616790620278b9aaaf95cf439
MD5 316b9656a81747cfde428867ba805048
BLAKE2b-256 b816d8637d6a7840af37cf1aae2d1cd64e48840ce7e11e049ad0ee477b139856

See more details on using hashes here.

File details

Details for the file rbloom-1.5.2-cp37-abi3-win32.whl.

File metadata

  • Download URL: rbloom-1.5.2-cp37-abi3-win32.whl
  • Upload date:
  • Size: 167.1 kB
  • Tags: CPython 3.7+, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.7.4

File hashes

Hashes for rbloom-1.5.2-cp37-abi3-win32.whl
Algorithm Hash digest
SHA256 2c59bced1f3d71ad266dbeb5e2b140ed9128f64c083ceddb2a0c94244cc1c923
MD5 3f6789f0dd5356c19c80dfa17830bd90
BLAKE2b-256 ace39e39760cb8e6b1181f3732fbc456f8a844f9c000a6d16c14bcf91ae5d6ee

See more details on using hashes here.

File details

Details for the file rbloom-1.5.2-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for rbloom-1.5.2-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 fe497954ffe769ebfbc6c3df091a878a5eb650e21c955aa606fa516be7c2c6d7
MD5 874eb9fb061139f3b95a257849a01096
BLAKE2b-256 01892c7c55a419fcf037964d2280b57c87a9bd5f1d262f7c666d16fd07c1a768

See more details on using hashes here.

File details

Details for the file rbloom-1.5.2-cp37-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl.

File metadata

File hashes

Hashes for rbloom-1.5.2-cp37-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl
Algorithm Hash digest
SHA256 170c34b7a1b26c1d581a6fd537e5af30c9fc18f2ce9cf51b822c390d26b176c6
MD5 5964721a3628ccb23917c0845a5e2f2e
BLAKE2b-256 65c634abe304c657a3dc4a302b16b47e731213a07e63e1131551bd8db90be7a5

See more details on using hashes here.

File details

Details for the file rbloom-1.5.2-cp37-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl.

File metadata

File hashes

Hashes for rbloom-1.5.2-cp37-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl
Algorithm Hash digest
SHA256 be472c8854d4e13f4862822c3c6bcae5190bfa0b59142aba26195d8a171e0d3f
MD5 5a99d5f3d00d63c14658a97d23dce260
BLAKE2b-256 4671ecf52efad2bd571e45e26ebbd9268679bf970c0d283d28fa6e4c52cc4edb

See more details on using hashes here.

File details

Details for the file rbloom-1.5.2-cp37-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl.

File metadata

File hashes

Hashes for rbloom-1.5.2-cp37-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl
Algorithm Hash digest
SHA256 16cce9f19f27465c8c1c529d4cf7e14556a11f431f63e6b7a23921375ffd18af
MD5 525eb4f4b748c73a6fb42ccd1bac7076
BLAKE2b-256 4783f803ef934c356d99daa87c4dd7a84c75ff0636307c8661c34fddb878763e

See more details on using hashes here.

File details

Details for the file rbloom-1.5.2-cp37-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for rbloom-1.5.2-cp37-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 7c6ea4862ff750337f67e0cbb705c6261a1b408d7c0721914a5d92b9c11d43df
MD5 44ea58c1316109203787b0e118eddc56
BLAKE2b-256 369dcc0ee1cb3ef73cb81f32a741d4aa6a771a8e86b460cc98e66b55120a805e

See more details on using hashes here.

File details

Details for the file rbloom-1.5.2-cp37-abi3-manylinux_2_5_i686.manylinux1_i686.whl.

File metadata

File hashes

Hashes for rbloom-1.5.2-cp37-abi3-manylinux_2_5_i686.manylinux1_i686.whl
Algorithm Hash digest
SHA256 05a05b6b44486795cb2ee209563a4d2127f7ec07195caba0038503cd174bc7d1
MD5 f8e59ee1c85563ab660232405c46fc69
BLAKE2b-256 a4064d4c6852c4aedcd309be7c73369270719836329cce2568077df991133563

See more details on using hashes here.

File details

Details for the file rbloom-1.5.2-cp37-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for rbloom-1.5.2-cp37-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 9ae886d880c7ae9ac71cf6704a95043e1a1c826d96257cedad8b157f777581b8
MD5 1d61d3241ac086507594ed9d336dc6d4
BLAKE2b-256 309d49dbb1e0f516467cec135472fad56cff7fd6410bc4641e8ea6bb0ba4cda5

See more details on using hashes here.

File details

Details for the file rbloom-1.5.2-cp37-abi3-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for rbloom-1.5.2-cp37-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 d2b6879c2bade72565d5f96b4e1f64fc04ddd679574a2994a1bd70377fe674ac
MD5 882414c1072f5c4065029f354e5e63ca
BLAKE2b-256 89773f672a00d24e56915df5308a7d0ff9d637a732b844d280410df4d5bb1d61

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page