Skip to main content

Highly optimized Bloom filter that mimics the Python set API, written in Rust

Project description

rBloom

PyPI license DOI build

A fast, simple and lightweight Bloom filter library for Python, implemented in Rust. It's designed to be as pythonic as possible, mimicking the built-in set type where it can, and works with any hashable object. While it's a new library (this project was started in 2023), it's currently the fastest option for Python by a long shot (see the section Benchmarks). Releases are published on PyPI.

Quickstart

This library defines only one class, which can be used as follows:

>>> from rbloom import Bloom
>>> bf = Bloom(200, 0.01)  # 200 items max, false positive rate of 1%
>>> bf.add("hello")
>>> "hello" in bf
True
>>> "world" in bf
False
>>> bf.update(["hello", "world"])  # "hello" and "world" now in bf
>>> other_bf = Bloom(200, 0.01)

### add some items to other_bf

>>> third_bf = bf | other_bf    # third_bf now contains all items in
                                # bf and other_bf
>>> third_bf = bf.copy()
... third_bf.update(other_bf)   # same as above
>>> bf.issubset(third_bf)    # bf <= third_bf also works
True

For the full API, see the section Documentation.

Installation

On almost all platforms, simply run:

pip install rbloom

If you're on an uncommon platform, this may cause pip to build the library from source, which requires the Rust toolchain. You can also build rbloom by cloning this repository and running maturin:

maturin build --release

This will create a wheel in the target/wheels/ directory, which can subsequently also be passed to pip.

Why rBloom?

Why should you use this library instead of one of the other Bloom filter libraries on PyPI?

  • Simple: Almost all important methods work exactly like their counterparts in the built-in set type.
  • Fast: rbloom is implemented in Rust, which makes it blazingly fast. See section Benchmarks for more information.
  • Lightweight: rbloom has no dependencies of its own.
  • Maintainable: This library is very concise, and it's written in idiomatic Rust. Even if I were to stop maintaining rbloom (which I don't intend to), it would be trivially easy for you to fork it and keep it working for you.

I started rbloom because I was looking for a simple Bloom filter dependency for a project, and the only sufficiently fast option (pybloomfiltermmap3) was segfaulting on recent Python versions. rbloom ended up being twice as fast and has grown to encompass a more complete API (e.g. with set comparisons like issubset). Do note that it doesn't use mmapped files, however. This shouldn't be an issue in most cases, as the random access heavy nature of a Bloom filter negates the benefits of mmap after very few operations, but it is something to keep in mind for edge cases.

Benchmarks

The following simple benchmark was implemented in the respective API of each library (see the comparison benchmarks):

bf = Bloom(10_000_000, 0.01)

for i in range(10_000_000):
    bf.add(i + 0.5)  # floats because ints are hashed as themselves

for i in range(10_000_000):
    assert i + 0.5 in bf

This resulted in the following average runtimes on an M1 Pro (confirmed to be proportional to runtimes on an Intel machine):

Library Time Notes
rBloom 2.52s works out-of-the-box
pybloomfiltermmap3 4.78s unreliable [1]
pybloom3 46.76s works out-of-the-box
Flor 76.94s doesn't work on arbitrary objects [2]
bloom-filter2 165.54s doesn't work on arbitrary objects [2]

[1] The official package failed to install on Python 3.11 and kept segfaulting on 3.10 (Linux, January 2023). It seems to be fine for now (October 2024). [2] I was forced to convert to a byte representation, which is bad default behavior as it presents the problems mentioned below in the section "Cryptographic security".

Also note that rbloom is compiled against a stable ABI for portability, and that you can get a small but measurable speedup by removing the "abi3-py37" flag from Cargo.toml and building it yourself.

Documentation

This library defines only one class, the signature of which should be thought of as follows. Note that only the first few methods differ from the built-in set type:

class Bloom:

    # expected_items:  max number of items to be added to the filter
    # false_positive_rate:  max false positive rate of the filter
    # hash_func:  optional argument, see section "Cryptographic security"
    def __init__(self, expected_items: int, false_positive_rate: float,
                 hash_func=__builtins__.hash)

    @property
    def size_in_bits(self) -> int      # number of buckets in the filter

    @property
    def hash_func(self) -> Callable[[Any], int]   # retrieve the hash_func
                                                  # given to __init__

    @property
    def approx_items(self) -> float    # estimated number of items in
                                       # the filter

    # see section "Persistence" for more information on these four methods
    @classmethod
    def load(cls, filepath: str, hash_func) -> Bloom
    def save(self, filepath: str)
    @classmethod
    def load_bytes(cls, data: bytes, hash_func) -> Bloom
    def save_bytes(self) -> bytes

    #####################################################################
    #                    ALL SUBSEQUENT METHODS ARE                     #
    #              EQUIVALENT TO THE CORRESPONDING METHODS              #
    #                     OF THE BUILT-IN SET TYPE                      #
    #####################################################################

    def add(self, obj)                            # add obj to self
    def __contains__(self, obj) -> bool           # check if obj in self
    def __bool__(self) -> bool                    # False if empty
    def __repr__(self) -> str                     # basic info

    def __or__(self, other: Bloom) -> Bloom       # self | other
    def __ior__(self, other: Bloom)               # self |= other
    def __and__(self, other: Bloom) -> Bloom      # self & other
    def __iand__(self, other: Bloom)              # self &= other

    # these extend the functionality of __or__, __ior__, __and__, __iand__
    def union(self, *others: Union[Iterable, Bloom]) -> Bloom        # __or__
    def update(self, *others: Union[Iterable, Bloom])                # __ior__
    def intersection(self, *others: Union[Iterable, Bloom]) -> Bloom # __and__
    def intersection_update(self, *others: Union[Iterable, Bloom])   # __iand__

    # these implement <, >, <=, >=, ==, !=
    def __lt__, __gt__, __le__, __ge__, __eq__, __ne__(self,
                                                       other: Bloom) -> bool
    def issubset(self, other: Bloom) -> bool      # self <= other
    def issuperset(self, other: Bloom) -> bool    # self >= other

    def clear(self)                               # remove all items
    def copy(self) -> Bloom                       # duplicate self

To prevent death and destruction, the bitwise set operations only work on filters where all parameters are equal (including the hash functions being the exact same object). Because this is a Bloom filter, the __contains__ and approx_items methods are probabilistic, as are all the methods that compare two filters (such as __le__ and issubset).

Cryptographic security

Python's built-in hash function is designed to be fast, not maximally collision-resistant, so if your program depends on the false positive rate being perfectly correct, you may want to supply your own hash function. This is especially the case when working with very large filters (more than a few tens of millions of items) or when false positives are very costly and could be exploited by an adversary. Just make sure that your hash function returns an integer between -2^127 and 2^127 - 1. Feel free to use the following example in your own code:

from rbloom import Bloom
from hashlib import sha256
from pickle import dumps

def hash_func(obj):
    h = sha256(dumps(obj)).digest()
    # use sys.byteorder instead of "big" for a small speedup when
    # reproducibility across machines isn't a concern
    return int.from_bytes(h[:16], "big", signed=True)

bf = Bloom(100_000_000, 0.01, hash_func)

When you throw away Python's built-in hash function and start hashing serialized representations of objects, however, you open up a breach into the scary realm of the unpythonic:

  • Numbers like 1, 1.0, 1 + 0j and True will suddenly no longer be equal.
  • Instances of classes with custom hashing logic (e.g. to stop caches inside instances from affecting their hashes) will suddenly display undefined behavior.
  • Objects that can't be serialized simply won't be hashable at all.

Making you supply your own hash function in this case is a deliberate design decision intended to show you what you're doing and prevent you from shooting yourself in the foot.

Also note that using a custom hash will incur a performance penalty over using the built-in hash.

Persistence

The save and load methods, along with their byte-oriented counterparts save_bytes and load_bytes, allow you to save and load filters to and from disk/Python bytes objects. However, as the built-in hash function's salt changes between invocations of Python, they only work on filters with custom hash functions. Note that it is your responsibility to ensure that the hash function you supply to the loading functions is the same as the one originally used by the filter you're loading!

bf = Bloom(10_000, 0.01, some_hash_func)
bf.add("hello")
bf.add("world")

# saving to a file
bf.save("bf.bloom")

# loading from a file
loaded_bf = Bloom.load("bf.bloom", some_hash_func)
assert loaded_bf == bf

# saving to bytes
bf_bytes = bf.save_bytes()

# loading from bytes
loaded_bf_from_bytes = Bloom.load_bytes(bf_bytes, some_hash_func)
assert loaded_bf_from_bytes == bf

The size of the file is bf.size_in_bits / 8 + 8 bytes.


Statement of attribution: Bloom filters were originally proposed in (Bloom, 1970). Furthermore, this implementation makes use of a constant recommended by (L'Ecuyer, 1999) for redistributing the entropy of a single hash over multiple integers using a linear congruential generator.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

rbloom-1.5.4.tar.gz (20.6 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

rbloom-1.5.4-cp37-abi3-win_amd64.whl (173.3 kB view details)

Uploaded CPython 3.7+Windows x86-64

rbloom-1.5.4-cp37-abi3-win32.whl (167.7 kB view details)

Uploaded CPython 3.7+Windows x86

rbloom-1.5.4-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (309.0 kB view details)

Uploaded CPython 3.7+manylinux: glibc 2.17+ x86-64

rbloom-1.5.4-cp37-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl (345.7 kB view details)

Uploaded CPython 3.7+manylinux: glibc 2.17+ s390x

rbloom-1.5.4-cp37-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl (433.2 kB view details)

Uploaded CPython 3.7+manylinux: glibc 2.17+ ppc64le

rbloom-1.5.4-cp37-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl (316.6 kB view details)

Uploaded CPython 3.7+manylinux: glibc 2.17+ ARMv7l

rbloom-1.5.4-cp37-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (306.8 kB view details)

Uploaded CPython 3.7+manylinux: glibc 2.17+ ARM64

rbloom-1.5.4-cp37-abi3-manylinux_2_5_i686.manylinux1_i686.whl (333.9 kB view details)

Uploaded CPython 3.7+manylinux: glibc 2.5+ i686

rbloom-1.5.4-cp37-abi3-macosx_11_0_arm64.whl (276.0 kB view details)

Uploaded CPython 3.7+macOS 11.0+ ARM64

rbloom-1.5.4-cp37-abi3-macosx_10_12_x86_64.whl (284.5 kB view details)

Uploaded CPython 3.7+macOS 10.12+ x86-64

File details

Details for the file rbloom-1.5.4.tar.gz.

File metadata

  • Download URL: rbloom-1.5.4.tar.gz
  • Upload date:
  • Size: 20.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.9.4

File hashes

Hashes for rbloom-1.5.4.tar.gz
Algorithm Hash digest
SHA256 ea6804f837c14d8ff041b07df8666798ac1ddcab709e139d06b0ea69d627f827
MD5 f0e2c197a4b288230b5efdd3a8bb2312
BLAKE2b-256 07774ad89e84269a9282860bf04ba80ea487c749dd0e013465dd9697136a9335

See more details on using hashes here.

File details

Details for the file rbloom-1.5.4-cp37-abi3-win_amd64.whl.

File metadata

  • Download URL: rbloom-1.5.4-cp37-abi3-win_amd64.whl
  • Upload date:
  • Size: 173.3 kB
  • Tags: CPython 3.7+, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.9.4

File hashes

Hashes for rbloom-1.5.4-cp37-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 48576b9d5bddcb8b4e89f61164e4d06a69bdb60d14c365624262db2fb6986f97
MD5 9763b0a9042930eba64ba8e664262132
BLAKE2b-256 b26770f3d4afed87894cd7e37a9c490ab7f324d7ccaba0819a46e1a405dc71d5

See more details on using hashes here.

File details

Details for the file rbloom-1.5.4-cp37-abi3-win32.whl.

File metadata

  • Download URL: rbloom-1.5.4-cp37-abi3-win32.whl
  • Upload date:
  • Size: 167.7 kB
  • Tags: CPython 3.7+, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.9.4

File hashes

Hashes for rbloom-1.5.4-cp37-abi3-win32.whl
Algorithm Hash digest
SHA256 d4d166ffc034af4af88de9af9a7a061dce2a769edf1f29b9ff0aeabd59995d53
MD5 a5368d4eee4b6837055f44638e16fdeb
BLAKE2b-256 d8e3d70f41404f7ff6529a3b7360ad0a40586270466a494e9dfd3e6503e8aefe

See more details on using hashes here.

File details

Details for the file rbloom-1.5.4-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for rbloom-1.5.4-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 91f25b4832986bf11f2255f963e6fa26eb98f1d71a84d8ff519b6649cd70f871
MD5 fa49b5550d130d80ab94aed9d4313a88
BLAKE2b-256 2ce6b4077d6a6a6a0fc186b8f509be22bb5aebfd7677d251dbdfb4d9bd193cf9

See more details on using hashes here.

File details

Details for the file rbloom-1.5.4-cp37-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl.

File metadata

File hashes

Hashes for rbloom-1.5.4-cp37-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl
Algorithm Hash digest
SHA256 f143744229a1ab046251fb0dab5940b5f71aa40a6e3f1e0e9acc86d5cbf3e56f
MD5 abf202db342346f7d9c4f2f02c7428f0
BLAKE2b-256 ec13224f06fba32815acc4e999383cf3c03fdd314212430f7966cb3bdc4a8c18

See more details on using hashes here.

File details

Details for the file rbloom-1.5.4-cp37-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl.

File metadata

File hashes

Hashes for rbloom-1.5.4-cp37-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl
Algorithm Hash digest
SHA256 6f92d140c4d79e59244804397f84fa359ee5b9a614739194a7addaae6f0d64e6
MD5 44dd8d8bbff83f118149053844a412e1
BLAKE2b-256 7b9215cc7214097304dfa68bc995eaff47dabee98b461a909fde9a5b7fdb8987

See more details on using hashes here.

File details

Details for the file rbloom-1.5.4-cp37-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl.

File metadata

File hashes

Hashes for rbloom-1.5.4-cp37-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl
Algorithm Hash digest
SHA256 74e573b59e7e36eaa05fd76a9023e6ec8aa0b0cd756c5a36b608e83fe4e33ce6
MD5 b1f1d9e83fb79e68cd4349694eba9f93
BLAKE2b-256 cb9437fd4adda878aad2a3f7933d2db25e386a4ec6adaa743586f4ef16d75aca

See more details on using hashes here.

File details

Details for the file rbloom-1.5.4-cp37-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for rbloom-1.5.4-cp37-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 374e3b4c2c01c9a269442e01c25564cc45bec237c5b9c67586784380298984ff
MD5 5e9074e6bce273a67f460768f4818e23
BLAKE2b-256 85b07401185e38d047c41de646bece1824608feafedf2b2736711fe9475449bc

See more details on using hashes here.

File details

Details for the file rbloom-1.5.4-cp37-abi3-manylinux_2_5_i686.manylinux1_i686.whl.

File metadata

File hashes

Hashes for rbloom-1.5.4-cp37-abi3-manylinux_2_5_i686.manylinux1_i686.whl
Algorithm Hash digest
SHA256 a8dd11ba9d3d86e9b60db099fa9c76790db02b935599390cd99d6a9c2980e2fe
MD5 2d6473dcd0ac1bb6b6633435d587e709
BLAKE2b-256 36a6ec87a8152e75e41deefa7047034ca178693a853010a9c7f797530e7acdfc

See more details on using hashes here.

File details

Details for the file rbloom-1.5.4-cp37-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for rbloom-1.5.4-cp37-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 ac7b4e30fb9333ee83b325c3bbfe870ea93e6260442ada6679c42d9030e4d007
MD5 e9150b86e96f6868ec6fbbec5a16d029
BLAKE2b-256 ee062b7e83e7d33951e8ddeb9283ecde3e5d6ac136f54e55b241746787c8b39e

See more details on using hashes here.

File details

Details for the file rbloom-1.5.4-cp37-abi3-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for rbloom-1.5.4-cp37-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 9c3235ada8ce33303212bf66d96ca328021e404c9c3094b9e42725163de8bd77
MD5 e7d4de81e506fc4867d740e88c3048cc
BLAKE2b-256 2fc8012f1c8aa635d551f2364e764b332c76f581dd61a6145900afa85f5ba338

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page