Skip to main content

No project description provided

Project description

rBloom

PyPI build license

A fast, simple and lightweight Bloom filter library for Python, fully implemented in Rust. It's designed to be as pythonic as possible, mimicking the built-in set type where it can. While it's a new kid on the block (this project was started in 2023), it's also currently the fastest kid on the block by a long shot (see the section Benchmarks). Releases are published on PyPI.

Quickstart

This library defines only one class, which can be used as follows:

>>> from rbloom import Bloom
>>> bf = Bloom(200, 0.01)  # 200 items max, false positive rate of 1%
>>> bf.add("hello")
>>> "hello" in bf
True
>>> "world" in bf
False
>>> bf.update(["hello", "world"])  # "hello" and "world" now in bf
>>> other_bf = Bloom(200, 0.01)

### add some items to other_bf

>>> third_bf = bf | other_bf  # third_bf now contains all items in
                              # bf and other_bf
>>> third_bf = bf.copy()
... third_bf.update(other_bf)  # same as above

Unlike BF's in real life, you can have as many of these as you want, and they all work well together!

For the full API, see the section Documentation.

Installation

On almost all platforms, simply run:

pip install rbloom

If you're on an uncommon platform, this may cause pip to build the library from source, which requires the Rust toolchain. You can also build rbloom by cloning this repository and running maturin:

maturin build --release

This will create a wheel in the target/wheels/ directory, which can subsequently also be passed to pip.

Why rBloom?

Why should you use this library instead of one of the other Bloom filter libraries on PyPI?

  • Simple: Almost all important methods work exactly like their counterparts in the built-in set type.
  • Fast: rbloom is implemented in Rust, which makes it blazingly fast. See section Benchmarks for more information.
  • Lightweight: rbloom has no dependencies of its own.
  • Maintainable: The entire library fits comfortably in less than 400 lines of code, and it's written in idiomatic Rust. Even if I were to stop maintaining rbloom (which I don't intend to), it would be trivially easy for you to fork it and keep it working for you.

I started rbloom because I was looking for a simple Bloom filter dependency for a project, but the pure Python implementations were too slow. The only maintained fast alternative I could find, pybloomfiltermmap3 (which is written in C and is a great library), failed to work on recent versions of Python (see below), so I felt very uncomfortable using it as a dependency. I also felt like the thousands of lines of code in that library were a bit hard to handle should it stop being maintained (which is what happened to the original pybloomfiltermmap). However, please note that pybloomfiltermmap3 implements persistent filters, while rbloom currently does not, so if that's something you require, you should definitely give that library a try.

Benchmarks

I implemented the following simple benchmark in the respective API of each library:

bf = Bloom(10_000_000, 0.01)

for i in range(10_000_000):
    bf.add(i + 0.5)  # floats because ints are hashed as themselves

for i in range(10_000_000):
    assert i + 0.5 in bf

This resulted in the following runtimes:

Library Time Notes
rBloom 5.956s works out-of-the-box
pybloomfiltermmap3 11.280s surprisingly hard to get working [1]
pybloom3 75.871s works out-of-the-box
Flor 128.837s doesn't work on arbitrary objects [2]
bloom-filter2 325.044s doesn't work on arbitrary objects [2]

[1] It refused to install on Python 3.11 and kept segfaulting on 3.10, so I installed 3.7 on my machine for this benchmark.
[2] I tested both converting to bytes and pickling, and chose the faster time.

The benchmark was run on a 2019 Dell XPS 15 7590 with an Intel Core i5-9300H. It was run 5 times for each library, and the average time was used.

Also note that rbloom is compiled against a stable ABI for portability, and that you can get a small but measurable speedup by removing the "abi3-py37" flag from Cargo.toml and building it yourself.

Documentation

This library defines only one class, the signature of which should be thought of as:

class Bloom:

    # expected_items: max number of items to be added to the filter
    # false_positive_rate: max false positive rate of the filter
    # hash_func: optional argument, see section "Cryptographic security"
    def __init__(self, expected_items: int, false_positive_rate: float,
                 hash_func=__builtins__.hash)

    @property
    def size_in_bits(self) -> int  # number of buckets in the filter

    @property
    def hash_func(self) -> Callable[[Any], int]  # retrieve the hash_func
                                                 # given to __init__

    @property
    def approx_items(self) -> float  # estimated number of items in
                                     # the filter

    #                all subsequent methods are
    # -------- equivalent to the corresponding methods ---------
    #                 of the built-in set type

    def add(self, object)

    def __contains__(self, object) -> bool    # object in self

    def __or__(self, other: Bloom) -> Bloom   # self | other

    def __ior__(self, other: Bloom)           # self |= other

    def __and__(self, other: Bloom) -> Bloom  # self & other

    def __iand__(self, other: Bloom)          # self &= other

    def update(self, *others: Union[Iterable, Bloom])

    def intersection_update(self, *others: Union[Iterable, Bloom])

    def clear(self)                           # remove all items

    def copy(self) -> Bloom

To prevent death and destruction, the bitwise set operations only work on filters where all parameters are equal (including the hash functions being the exact same object). Because this is a Bloom filter, the __contains__ and approx_items methods are probabilistic.

Cryptographic security

Python's built-in hash function isn't maximally collision-resistant, so if your program depends on the false positive rate being perfectly correct, you may want to supply your own hash function. This is especially the case when working with very large filters (more than a few tens of millions of items) or when false positives are very costly and could be exploited by an adversary. Just make sure that your hash function returns an integer between -2^127 and 2^127 - 1. Feel free to use the following example in your own code:

from rbloom import Bloom
from hashlib import sha256
from pickle import dumps

def hash_func(obj):
    h = sha256(dumps(obj)).digest()
    return int.from_bytes(h[:16], "big") - 2**127

bf = Bloom(100_000_000, 0.01, hash_func=hash_func)

When you throw away Python's built-in hash function and start hashing serialized representations of objects, however, you open up a breach into the scary realm of the unpythonic:

  • Numbers like 2, 2.0 and 2 + 0j will suddenly no longer be equal.
  • Instances of classes with custom hashing logic (e.g. to stop caches inside instances from affecting their hashes) will suddenly display undefined behavior.
  • Objects that can't be serialized simply won't be hashable at all.

Making you supply your own hash function in this case is a deliberate design decision, because it shows you what you're doing and prevents you from shooting yourself in the foot.


This implementation of a Bloom filter doesn't use multiple hash functions, but instead works by redistributing the entropy of a single hash over multiple integers by using the single hash as the seed of a simple linear congruential generator (LCG). Those integers are then used as indexes for the bit array that makes up the filter. The constant used for the LCG is one proposed by (L'Ecuyer, 1999).

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

rbloom-1.0.4.tar.gz (11.5 kB view details)

Uploaded Source

Built Distributions

rbloom-1.0.4-cp37-abi3-win_amd64.whl (138.4 kB view details)

Uploaded CPython 3.7+Windows x86-64

rbloom-1.0.4-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB view details)

Uploaded CPython 3.7+manylinux: glibc 2.17+ x86-64

rbloom-1.0.4-cp37-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.1 MB view details)

Uploaded CPython 3.7+manylinux: glibc 2.17+ ARM64

rbloom-1.0.4-cp37-abi3-manylinux_2_5_i686.manylinux1_i686.whl (1.1 MB view details)

Uploaded CPython 3.7+manylinux: glibc 2.5+ i686

rbloom-1.0.4-cp37-abi3-macosx_10_9_x86_64.macosx_11_0_arm64.macosx_10_9_universal2.whl (508.7 kB view details)

Uploaded CPython 3.7+macOS 10.9+ universal2 (ARM64, x86-64)macOS 10.9+ x86-64macOS 11.0+ ARM64

File details

Details for the file rbloom-1.0.4.tar.gz.

File metadata

  • Download URL: rbloom-1.0.4.tar.gz
  • Upload date:
  • Size: 11.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/0.14.10

File hashes

Hashes for rbloom-1.0.4.tar.gz
Algorithm Hash digest
SHA256 2b5dbca1d58b3fa2c2e0c06832d8465943b0ca87b261217d9b0b50baade3022d
MD5 27f5d31b643c755f1380aede60c2ebdb
BLAKE2b-256 95b5c0ceaf0de50148be8c7c8c30dce0beb88453ff7204db61606219ebc71275

See more details on using hashes here.

File details

Details for the file rbloom-1.0.4-cp37-abi3-win_amd64.whl.

File metadata

  • Download URL: rbloom-1.0.4-cp37-abi3-win_amd64.whl
  • Upload date:
  • Size: 138.4 kB
  • Tags: CPython 3.7+, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/0.14.10

File hashes

Hashes for rbloom-1.0.4-cp37-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 a20154f643b91962196650cd1be05555905c2e21732f36500ad4ddf3e2f0d817
MD5 a564b26402d0d8a7ce57cdd6a1ac3034
BLAKE2b-256 c690f6db1c5363270a3d51d6a88114c79fd73613c2085b6814108f23c702f0b3

See more details on using hashes here.

File details

Details for the file rbloom-1.0.4-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for rbloom-1.0.4-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 19af7e881bc06405f8d0d9155f2f1a3ce41c82d7b7394c0d194ed5be76f2937d
MD5 4d71cbbc2b3639a13af30bdbab3f873f
BLAKE2b-256 9d7c968c69fbc4d8eabb39084a982a5f8f3f22f4d3f8f1275c8ff981e28b3463

See more details on using hashes here.

File details

Details for the file rbloom-1.0.4-cp37-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for rbloom-1.0.4-cp37-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 81c38ec6c94f7d8021d0355d5a6c66a65e0f89aee0f945f368b74e0776f53e57
MD5 6bec41468fb6a8bfc32c0e93347ba80e
BLAKE2b-256 0e8a4320744dfc4c334477586296e74e6772877af0fef0ca965c0b1b5ad2d29f

See more details on using hashes here.

File details

Details for the file rbloom-1.0.4-cp37-abi3-manylinux_2_5_i686.manylinux1_i686.whl.

File metadata

File hashes

Hashes for rbloom-1.0.4-cp37-abi3-manylinux_2_5_i686.manylinux1_i686.whl
Algorithm Hash digest
SHA256 aebe196b02cec9c4b0c181dd14a931559a41485f9b9310025e50d9e19b2c0613
MD5 f33bc16db1d4e8d7f204659627b0341d
BLAKE2b-256 596b5c85a0ce452413f190d3bc3b97e5fff0da8b974e93c65c18631bcefa394f

See more details on using hashes here.

File details

Details for the file rbloom-1.0.4-cp37-abi3-macosx_10_9_x86_64.macosx_11_0_arm64.macosx_10_9_universal2.whl.

File metadata

File hashes

Hashes for rbloom-1.0.4-cp37-abi3-macosx_10_9_x86_64.macosx_11_0_arm64.macosx_10_9_universal2.whl
Algorithm Hash digest
SHA256 61477c3c64a64646c81cef55013fa5e9969be7d5e16d7dbe1bd92d15fa36bdae
MD5 2376ff7539b96efdaea80a43478eac38
BLAKE2b-256 7f5eb877c132097cf4f0b8d9bd1dc1dc476c9c01dc56f37d86a05e3af4639ddf

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page