
Probabilistic data structures in python



pyprobables is a pure-python library for probabilistic data structures. The goal is to provide the developer with a pure-python implementation of common probabilistic data-structures to use in their work.

For better raw performance, it is recommended to supply an alternative hashing algorithm that has been compiled in C. This could mean using the provided md5 and sha512 algorithms, or installing a third-party package and writing your own hashing strategy. Some options include the murmur hash mmh3 or those from the pyhash library. Each data object in pyprobables makes it easy to pass in a custom hashing function.

Read more in the sections on supplying a pre-defined, alternative hashing strategy and on defining a hashing function using the provided decorators.

Installation

Pip Installation:

$ pip install pyprobables

To install from source, clone the repository on GitHub, then run the following from the repository folder:

$ python setup.py install

pyprobables supports python 3.6 - 3.11+

For python 2.7 support, install release 0.3.2

$ pip install pyprobables==0.3.2

API Documentation

The documentation is hosted on readthedocs.io

You can build the documentation locally by running:

$ pip install sphinx
$ cd docs/
$ make html

Automated Tests

To run the automated tests, run the following command from the downloaded folder:

$ python setup.py test

Quickstart

Import pyprobables and set up a Bloom Filter

from probables import BloomFilter
blm = BloomFilter(est_elements=1000, false_positive_rate=0.05)
blm.add('google.com')
blm.check('facebook.com')  # should return False
blm.check('google.com')  # should return True
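
The est_elements and false_positive_rate arguments are the usual inputs to the textbook Bloom filter sizing formulas. The sketch below computes the resulting bit count and hash count for the values used above. The function name bloom_dimensions is hypothetical, and pyprobables may round or size its filter differently.

```python
import math

def bloom_dimensions(est_elements, false_positive_rate):
    """Return (number_of_bits, number_of_hashes) from the textbook formulas."""
    # m = -n * ln(p) / (ln 2)^2
    bits = math.ceil(
        -est_elements * math.log(false_positive_rate) / (math.log(2) ** 2)
    )
    # k = (m / n) * ln 2, rounded to a whole number of hash functions
    hashes = max(1, round(bits / est_elements * math.log(2)))
    return bits, hashes

bits, hashes = bloom_dimensions(1000, 0.05)
print(bits, hashes)
```

Lowering false_positive_rate increases both values, which is the trade-off between memory and accuracy that the constructor exposes.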

Import pyprobables and set up a Count-Min Sketch

from probables import CountMinSketch
cms = CountMinSketch(width=1000, depth=5)
cms.add('google.com')  # should return 1
cms.add('facebook.com', 25)  # insert 25 at once; should return 25
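
To make the roles of width and depth concrete, here is a minimal, standard-library-only sketch of the count-min technique. TinyCountMin and its methods are hypothetical names for illustration; this is not pyprobables' implementation, and its hashing scheme (SHA-256 salted with the row index) is an assumption of the sketch.

```python
import hashlib

class TinyCountMin:
    """Hypothetical, minimal count-min sketch for illustration only."""

    def __init__(self, width, depth):
        self.width = width  # counters per row
        self.depth = depth  # number of independent rows
        self.rows = [[0] * width for _ in range(depth)]

    def _columns(self, key):
        # One column per row, derived by salting the key with the row index.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, count=1):
        for row, col in self._columns(key):
            self.rows[row][col] += count
        return self.check(key)

    def check(self, key):
        # Collisions can only inflate a counter, so the minimum is the best estimate.
        return min(self.rows[row][col] for row, col in self._columns(key))

sketch = TinyCountMin(width=1000, depth=5)
sketch.add("google.com")
sketch.add("facebook.com", 25)
```

Estimates from this kind of structure never undercount. A larger width reduces collisions within a row, and a larger depth makes it less likely that every row collides.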

Import pyprobables and set up a Cuckoo Filter

from probables import CuckooFilter
cko = CuckooFilter(capacity=100, max_swaps=10)
cko.add('google.com')
cko.check('facebook.com')  # should return False
cko.check('google.com')  # should return True
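
A cuckoo filter conventionally stores a short fingerprint of each key in one of two candidate buckets, and a swap limit such as max_swaps bounds how many evictions an insert may attempt. The sketch below shows only the index arithmetic of the general technique (partial-key cuckoo hashing). All names are hypothetical, the power-of-two bucket count is a choice made for the sketch, and none of this is taken from pyprobables' internals.

```python
import hashlib

NUM_BUCKETS = 128  # power of two, so XOR-ing two indices stays in range

def _hash64(data: bytes) -> int:
    """64-bit integer derived from SHA-256 (an assumption of this sketch)."""
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def fingerprint(key: str) -> int:
    # 8-bit fingerprint in the range 1..255 (0 is reserved for "empty").
    return _hash64(b"fp:" + key.encode("utf-8")) % 255 + 1

def first_index(key: str) -> int:
    return _hash64(key.encode("utf-8")) % NUM_BUCKETS

def alternate_index(index: int, fp: int) -> int:
    # The other candidate bucket depends only on the current bucket and the
    # fingerprint, so an evicted entry can be relocated without the key.
    return (index ^ _hash64(fp.to_bytes(1, "big"))) % NUM_BUCKETS

fp = fingerprint("google.com")
i1 = first_index("google.com")
i2 = alternate_index(i1, fp)
```

Applying alternate_index to i2 leads back to i1, which is what allows an entry to be bounced between its two buckets during swaps.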

Import pyprobables and set up a Quotient Filter

from probables import QuotientFilter
qf = QuotientFilter(quotient=24)
qf.add('google.com')
qf.check('facebook.com')  # should return False
qf.check('google.com')  # should return True
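
In a quotient filter, the quotient is conventionally the number of leading hash bits used to select a slot, and the remaining bits are stored as the remainder. The sketch below illustrates that split. It assumes a 32-bit hash taken from SHA-256, and split_hash is a hypothetical helper; the text above does not state how pyprobables derives its hash.

```python
import hashlib

HASH_BITS = 32  # assumed hash width for this sketch

def split_hash(key, quotient_bits):
    """Split a 32-bit hash of key into (quotient, remainder)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    hashed = int.from_bytes(digest[:4], "big")  # keep 32 bits
    remainder_bits = HASH_BITS - quotient_bits
    quotient = hashed >> remainder_bits  # selects the slot
    remainder = hashed & ((1 << remainder_bits) - 1)  # stored in the slot
    return quotient, remainder

quotient, remainder = split_hash("google.com", quotient_bits=24)
```

With quotient_bits=24 there are 2**24 possible slots and 8 bits left for the remainder, so a larger quotient means more slots and a smaller stored value per slot.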

Supplying a pre-defined, alternative hashing strategy

from probables import BloomFilter
from probables.hashes import default_sha256
blm = BloomFilter(est_elements=1000, false_positive_rate=0.05,
                  hash_function=default_sha256)
blm.add('google.com')
blm.check('facebook.com')  # should return False
blm.check('google.com')  # should return True

Defining a hashing function using the provided decorators

import mmh3  # murmur hash 3 implementation (pip install mmh3)
from probables.hashes import hash_with_depth_bytes
from probables import BloomFilter

@hash_with_depth_bytes
def my_hash(key, depth):
    return mmh3.hash_bytes(key, seed=depth)

blm = BloomFilter(est_elements=1000, false_positive_rate=0.05, hash_function=my_hash)

import hashlib
from probables.hashes import hash_with_depth_int
from probables.constants import UINT64_T_MAX
from probables import BloomFilter

@hash_with_depth_int
def my_hash(key, seed=0, encoding="utf-8"):
    max64mod = UINT64_T_MAX + 1
    val = int(hashlib.sha512(key.encode(encoding)).hexdigest(), 16)
    val += seed  # not a good example, but uses the seed value
    return val % max64mod

blm = BloomFilter(est_elements=1000, false_positive_rate=0.05, hash_function=my_hash)
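
The decorated functions above take a key plus a seed or depth and return a single hash, while the data structure asks for several hashes per key. The standard-library sketch below shows one way a decorator could bridge that gap. with_depth and sha512_64 are hypothetical names, and the provided decorators may derive the later values differently, for example by chaining the previous result.

```python
import hashlib
from functools import wraps

def with_depth(func):
    """Hypothetical stand-in for a depth decorator: call func once per seed."""
    @wraps(func)
    def hashing_func(key, depth=1):
        # Seeds 0..depth-1 give `depth` different hashes of the same key.
        return [func(key, seed) for seed in range(depth)]
    return hashing_func

@with_depth
def sha512_64(key, seed=0):
    # Mix the seed into the input and keep the first 64 bits of the digest.
    data = f"{seed}:{key}".encode("utf-8")
    return int.from_bytes(hashlib.sha512(data).digest()[:8], "big")

values = sha512_64("google.com", depth=3)
```

Each returned integer would then be reduced modulo the size of the underlying bit array or counter row to pick a position.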

See the API documentation for other data structures available and the quickstart page for more examples!

Changelog

Please see the changelog for a list of all changes.

Backward Compatible Changes

If you are using previously exported probabilistic data structures (v0.4.1 or below) and used the default hashing strategy, use the following code to mimic the original default hashing algorithm.

from probables import BloomFilter
from probables.constants import UINT64_T_MAX
from probables.hashes import hash_with_depth_int

@hash_with_depth_int
def old_fnv1a(key, depth=1):
    # delegate to the original default hash defined below
    return tmp_fnv_1a(key)

def tmp_fnv_1a(key):
    max64mod = UINT64_T_MAX + 1
    hval = 14695981039346656073
    fnv_64_prime = 1099511628211
    tmp = map(ord, key)
    for t_str in tmp:
        hval ^= t_str
        hval *= fnv_64_prime
        hval %= max64mod  # keep the value within 64 bits
    return hval

# load the previously exported filter with the matching hash function
blm = BloomFilter(filepath="old-file-path.blm", hash_function=old_fnv1a)

