Skip to main content

Some fast bloom filter implemented by Rust for Python and Rust! 10x faster than pybloom!

Project description

fastbloom

OSCS Status docs.rs Test Rust Test Python Benchmark Crates Latest Release PyPI Latest Release Sonatype Nexus (Snapshots)

A fast bloom filter | counting bloom filter implemented by Rust for Rust and Python!

Language: 简体中文

setup

Python

requirements

Python >= 3.7

install

Install the latest fastbloom version with:

pip install fastbloom-rs

Rust

fastbloom-rs = "{latest}"

Java

maven

<dependency>
    <groupId>io.github.yankun1992</groupId>
    <artifactId>fastbloom</artifactId>
    <version>{latest-version}</version>
</dependency>

Examples

BloomFilter

A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not.

Reference: Bloom, B. H. (1970). Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7), 422-426. Full text article

Python

basic usage

from fastbloom_rs import BloomFilter

bloom = BloomFilter(100_000_000, 0.01)

bloom.add_str('hello')
bloom.add_bytes(b'world')
bloom.add_int(9527)

assert bloom.contains('hello')
assert bloom.contains(b'world')
assert bloom.contains(9527)

assert not bloom.contains('hello world')

build bloom filter from bytes or list

from fastbloom_rs import BloomFilter

bloom = BloomFilter(100_000_000, 0.01)
bloom.add_str('hello')
assert bloom.contains('hello')

bloom2 = BloomFilter.from_bytes(bloom.get_bytes(), bloom.hashes())
assert bloom2.contains('hello')

bloom3 = BloomFilter.from_int_array(bloom.get_int_array(), bloom.hashes())
assert bloom3.contains('hello')

there are some bulk api for python to reduce ffi cost between python and rust

bloom = BloomFilter(100_000_000, 0.01)
inserts = [1, 2, 3, 4, 5, 6, 7, 9, 18, 68, 90, 100]
checks = [1, 2, 3, 4, 5, 6, 7, 9, 18, 68, 90, 100, 190, 290, 390]
results = [True, True, True, True, True, True, True, True, True, True, True, True, False, False, False]

bloom.add_int_batch(inserts)
contains = bloom.contains_int_batch(checks)
assert contains == results

bloom.add_str_batch(list(map(lambda x: str(x), inserts)))
assert bloom.contains_str_batch(list(map(lambda x: str(x), checks))) == results

bloom.add_bytes_batch(list(map(lambda x: bytes(x), inserts)))
assert bloom.contains_bytes_batch(list(map(lambda x: bytes(x), checks))) == results

more examples at py_tests.

Rust

use fastbloom_rs::{BloomFilter, FilterBuilder};

let mut bloom = FilterBuilder::new(100_000_000, 0.01).build_bloom_filter();

bloom.add(b"helloworld");
assert_eq!(bloom.contains(b"helloworld"), true);
assert_eq!(bloom.contains(b"helloworld!"), false);

more examples at docs.rs

CountingBloomFilter

A Counting Bloom filter works in a similar manner as a regular Bloom filter; however, it is able to keep track of insertions and deletions. In a counting Bloom filter, each entry in the Bloom filter is a small counter associated with a basic Bloom filter bit.

Reference: F. Bonomi, M. Mitzenmacher, R. Panigrahy, S. Singh, and G. Varghese, “An Improved Construction for Counting Bloom Filters,” in 14th Annual European Symposium on Algorithms, LNCS 4168, 2006

Python

from fastbloom_rs import CountingBloomFilter

cbf = CountingBloomFilter(1000_000, 0.01)
cbf.add('hello')
cbf.add('hello')
assert 'hello' in cbf
cbf.remove('hello')
assert 'hello' in cbf  # because 'hello' added twice. 
# If add same element larger than 15 times, then remove 15 times the filter will not contain the element.
cbf.remove('hello')
assert 'hello' not in cbf

A CountingBloomFilter has a four bits counter to save hash index, so when insert an element repeatedly, the counter will spill over quickly. So, you can set enable_repeat_insert to False to check whether the element has added. if it has added, it will not add again. enable_repeat_insert default set to True.

from fastbloom_rs import CountingBloomFilter

cbf = CountingBloomFilter(1000_000, 0.01, False)
cbf.add('hello')
cbf.add('hello')  # because enable_repeat_insert=False, this addition will not take effect. 
assert 'hello' in cbf
cbf.remove('hello')
assert 'hello' not in cbf 

more examples at py_tests.

Rust

use fastbloom_rs::{CountingBloomFilter, FilterBuilder};

let mut builder = FilterBuilder::new(100_000, 0.01);
let mut cbf = builder.build_counting_bloom_filter();
cbf.add(b"helloworld");
assert_eq!(bloom.contains(b"helloworld"), true);

benchmark

computer info

CPU Memory OS
AMD Ryzen 7 5800U with Radeon Graphics 16G Windows 10

add one str to bloom filter

Benchmark insert one str to bloom filter:

bloom_add_test          time:   [41.168 ns 41.199 ns 41.233 ns]
                        change: [-0.4891% -0.0259% +0.3417%] (p = 0.91 > 0.05)
                        No change in performance detected.
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) high mild
  12 (12.00%) high severe

add one million to bloom filter

Benchmark loop insert (1..1_000_000).map(|n| { n.to_string() }) to bloom filter:

bloom_add_all_test      time:   [236.24 ms 236.86 ms 237.55 ms]
                        change: [-3.4346% -2.9050% -2.3524%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

check one contains in bloom filter

bloom_contains_test     time:   [42.065 ns 42.102 ns 42.156 ns]
                        change: [-0.7830% -0.5901% -0.4029%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 15 outliers among 100 measurements (15.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  9 (9.00%) high severe

check one not contains in bloom filter

bloom_not_contains_test time:   [22.695 ns 22.727 ns 22.773 ns]
                        change: [-3.1948% -2.9695% -2.7268%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe

add one str to counting bloom filter

counting_bloom_add_test time:   [60.822 ns 60.861 ns 60.912 ns]
                        change: [+0.2427% +0.3772% +0.5579%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low severe
  4 (4.00%) low mild
  1 (1.00%) high mild
  4 (4.00%) high severe

add one million to counting bloom filter

Benchmark loop insert (1..1_000_000).map(|n| { n.to_string() }) to counting bloom filter:

counting_bloom_add_million_test
                        time:   [272.48 ms 272.58 ms 272.68 ms]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

fastbloom_rs-0.5.9-cp37-abi3-win_amd64.whl (177.5 kB view hashes)

Uploaded CPython 3.7+ Windows x86-64

fastbloom_rs-0.5.9-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (282.1 kB view hashes)

Uploaded CPython 3.7+ manylinux: glibc 2.17+ x86-64

fastbloom_rs-0.5.9-cp37-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl (379.7 kB view hashes)

Uploaded CPython 3.7+ manylinux: glibc 2.17+ ppc64le

fastbloom_rs-0.5.9-cp37-abi3-manylinux_2_17_ppc64.manylinux2014_ppc64.whl (399.4 kB view hashes)

Uploaded CPython 3.7+ manylinux: glibc 2.17+ ppc64

fastbloom_rs-0.5.9-cp37-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl (269.1 kB view hashes)

Uploaded CPython 3.7+ manylinux: glibc 2.17+ ARMv7l

fastbloom_rs-0.5.9-cp37-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (270.3 kB view hashes)

Uploaded CPython 3.7+ manylinux: glibc 2.17+ ARM64

fastbloom_rs-0.5.9-cp37-abi3-macosx_11_0_arm64.whl (238.2 kB view hashes)

Uploaded CPython 3.7+ macOS 11.0+ ARM64

fastbloom_rs-0.5.9-cp37-abi3-macosx_10_7_x86_64.whl (243.0 kB view hashes)

Uploaded CPython 3.7+ macOS 10.7+ x86-64

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page