Skip to main content

Scalable Cuckoo filter

Project description

Cuckoo filters

A Cuckoo filter is a data structure for probabilistic set-membership queries with a low false positive probability (FPP). As an improvement over the classic Bloom filter, items can be added or removed into Cuckoo filters at will. A Cuckoo filter also utilizes space more efficiently.

Cuckoo filters were originally described in: Fan, B., Andersen, D. G., Kaminsky, M., & Mitzenmacher, M. D. (2014, December). Cuckoo filter: Practically better than bloom. In Proceedings of the 10th ACM International on Conference on emerging Networking Experiments and Technologies (pp. 75-88). ACM.

This package implements the basic Cuckoo filter as well as several derivations built on top of it.

Installation

The package can be installed from PyPI

pip install scalable-cuckoo-filter

Classic Cuckoo filter

import math

from random import randrange
from cuckoo.filter import CuckooFilter

capacity = 1000000
error_rate = 0.000001
# Create a classic Cuckoo filter with a fixed capacity of 1000000
# buckets
cuckoo = CuckooFilter(capacity=capacity, error_rate=error_rate)

bucket_size = 6
# Setting the bucket size is optional, the bigger the bucket,
# the more number of items a filter can hold, and the longer
# the fingerprint needs to be to stay at the same error rate
cuckoo = CuckooFilter(capacity=capacity, error_rate=error_rate, bucket_size=bucket_size)

# The fingerprint length is computed using the following formula:
fingerprint_size = int(math.ceil(math.log(1.0 / error_rate, 2) + math.log(2 * bucket_size, 2)))

for _ in range(1, 100000):
    item = str(randrange(1, 1000000000))
    cuckoo.insert(item)

    if cuckoo.contains(item):
        print '{} has been added'.format(item)

    cuckoo.delete(item)

    if not cuckoo.contains(item):
        print '{} has been removed'.format(item)

Bitarray Cuckoo filter

A classic Cuckoo filter is implemented using a Python list of Bucket objects. Each Bucket, again, stores a fixed list of fingerprints. The implementation is easy and straightforward but unnecessary wasteful for applications that needs to keep hundreds of millions of items.

The bitarray Cuckoo filter is built to deal with such situation. The whole filter is compressed into a bitarray to minimize memory usage. For example, a bitarray Cuckoo filter with capacity of 100.000.000, bucket size of 4, and error rate of 0.000001 will requires:

  • 23-bit fingerprint, computed using the above formula.
  • 9.200.000.000 bits = 1.08 GB = capacity * bucket size * fingerprint.

And it can theoretically store up to 400.000.000 items at full capacity.

import math

from random import randrange
from cuckoo.filter import BCuckooFilter

capacity = 1000000
error_rate = 0.000001
# Create a bit array Cuckoo filter with a fixed capacity of 1000000
# buckets
cuckoo = BCuckooFilter(capacity=capacity, error_rate=error_rate)

bucket_size = 6
# Setting the bucket size is optional, the bigger the bucket,
# the more number of items a filter can hold, and the longer
# the fingerprint needs to be to stay at the same error rate
cuckoo = BCuckooFilter(capacity=capacity, error_rate=error_rate, bucket_size=bucket_size)

# The fingerprint length is computed using the following formula:
fingerprint_size = int(math.ceil(math.log(1.0 / error_rate, 2) + math.log(2 * bucket_size, 2)))

for _ in range(1, 100000):
    item = str(randrange(1, 1000000000))
    cuckoo.insert(item)

    if cuckoo.contains(item):
        print '{} has been added'.format(item)

    cuckoo.delete(item)

    if not cuckoo.contains(item):
        print '{} has been removed'.format(item)

Scalable Cuckoo filter

Having a fix capacity is a problem when using Cuckoo filter in practice. Allocating too much capacity and it goes to waste while allocating too little and it degrades the filter performance and causes FP. Therefore, the scalable Cuckoo filter is introduced as an attempt to solve the fix capacity issue.

Inspired by Scalable Bloom filter, Scalable Cuckoo filter utilizes multiple filters to scale the capacity dynamically. When an existing filter approaches its capacity, a new one, double in size, will be created. A scalable Cuckoo filter will handle all usual operations seamlessly and transparently.

Internally, scalable Cuckoo filter uses bitarray Cuckoo filter for efficiency although it can be changed easily.

import math

from random import randrange
from cuckoo.filter import ScalableCuckooFilter

initial_capacity = 1000000
error_rate = 0.000001
# Create a scalable Cuckoo filter with an initial capacity of 1000000
# buckets
cuckoo = ScalableCuckooFilter(initial_capacity=initial_capacity, error_rate=error_rate)

bucket_size = 6
# Setting the bucket size is optional, the bigger the bucket,
# the more number of items a filter can hold, and the longer
# the fingerprint needs to be to stay at the same error rate
cuckoo = ScalableCuckooFilter(initial_capacity=initial_capacity, error_rate=error_rate, bucket_size=bucket_size)

# The fingerprint length is computed using the following formula:
fingerprint_size = int(math.ceil(math.log(1.0 / error_rate, 2) + math.log(2 * bucket_size, 2)))

for _ in range(1, 100000):
    item = str(randrange(1, 1000000000))
    cuckoo.insert(item)

    if cuckoo.contains(item):
        print '{} has been added'.format(item)

    cuckoo.delete(item)

    if not cuckoo.contains(item):
        print '{} has been removed'.format(item)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scalable-cuckoo-filter-1.1.tar.gz (9.0 kB view details)

Uploaded Source

Built Distribution

scalable_cuckoo_filter-1.1-py2.py3-none-any.whl (10.5 kB view details)

Uploaded Python 2 Python 3

File details

Details for the file scalable-cuckoo-filter-1.1.tar.gz.

File metadata

  • Download URL: scalable-cuckoo-filter-1.1.tar.gz
  • Upload date:
  • Size: 9.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.19.1 setuptools/39.0.1 requests-toolbelt/0.8.0 tqdm/4.25.0 CPython/3.6.5

File hashes

Hashes for scalable-cuckoo-filter-1.1.tar.gz
Algorithm Hash digest
SHA256 73053812b16a41c7c3c4312b3dc1fb9eb8ae71905493decb062ec538e2a13054
MD5 282ea13c99b0ef525caa9ba4fecbea8a
BLAKE2b-256 08f7bcbcf03f662d37746b48ea15c2bcd0cf8c99207a1bb16f3c67d0d25b256e

See more details on using hashes here.

File details

Details for the file scalable_cuckoo_filter-1.1-py2.py3-none-any.whl.

File metadata

  • Download URL: scalable_cuckoo_filter-1.1-py2.py3-none-any.whl
  • Upload date:
  • Size: 10.5 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.19.1 setuptools/39.0.1 requests-toolbelt/0.8.0 tqdm/4.25.0 CPython/3.6.5

File hashes

Hashes for scalable_cuckoo_filter-1.1-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 c05e6e2eb1c74736e8fa7cf5c8049df67bd9b1f81d961a14fb231d8553636dd5
MD5 a5168fb13c5a27b38638d324ce18060f
BLAKE2b-256 e40ba65d3820bae71fe3724272e30d0ca0f25853ab8458aefaf9a867267d6f33

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page