
Python library for the HyperLogLog algorithm

Project description

python-hll


A Python implementation of HyperLogLog whose goal is to be storage compatible with java-hll, js-hll and postgresql-hll.

NOTE: This is a fairly literal port of java-hll to Python. Internally, bytes are represented as Java-style signed bytes (-128 to 127) rather than Python-style unsigned bytes (0 to 255). This implementation is also quite slow: for example, HLLSerializationTest takes 12 seconds to run in Java, while the equivalent test_hll_serialization takes 1.5 hours to run in Python (about 400x slower).
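The difference between the two byte conventions mentioned in the note can be illustrated with plain Python. This is only a sketch: the helper names to_java_byte and to_python_byte are hypothetical and are not part of python_hll, which may do the conversion differently internally.

```python
def to_java_byte(value):
    """Map an unsigned byte (0..255) to the Java-style signed range (-128..127)."""
    if not 0 <= value <= 255:
        raise ValueError("expected a value in 0..255, got %r" % (value,))
    return value - 256 if value > 127 else value


def to_python_byte(value):
    """Map a Java-style signed byte (-128..127) back to the range 0..255."""
    if not -128 <= value <= 127:
        raise ValueError("expected a value in -128..127, got %r" % (value,))
    return value & 0xFF


# bytearray yields ints on both Python 2 and Python 3.
signed = [to_java_byte(b) for b in bytearray(b"\x12\x8d\x7f")]
unsigned = [to_python_byte(b) for b in signed]
```

Values up to 127 are unchanged by the conversion; only bytes with the high bit set (128 to 255) map to negative numbers.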

Overview

See java-hll for an overview of what HLLs are and how they work.

Usage

Hashing and adding a value to a new HLL:

from python_hll.hll import HLL
import mmh3
value_to_hash = 'foo'
hashed_value = mmh3.hash(value_to_hash)

hll = HLL(13, 5) # log2m=13, regwidth=5
hll.add_raw(hashed_value)

Retrieving the cardinality of an HLL:

cardinality = hll.cardinality()

Unioning two HLLs together (and retrieving the resulting cardinality):

hll1 = HLL(13, 5) # log2m=13, regwidth=5
hll2 = HLL(13, 5) # log2m=13, regwidth=5

# ... (add values to both sets) ...

hll1.union(hll2) # modifies hll1 to contain the union
cardinality_union = hll1.cardinality()
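Conceptually, the union of two dense HyperLogLog sketches with the same log2m keeps the larger value of each pair of registers. The toy function below illustrates that idea only; union_registers is a hypothetical name, the register values are made up, and this is not how python_hll stores or merges its data (java-hll also has explicit and sparse representations that merge differently).

```python
def union_registers(left, right):
    """Return the register-wise maximum of two equally sized register arrays."""
    if len(left) != len(right):
        raise ValueError("register arrays must have the same length (same log2m)")
    return [max(a, b) for a, b in zip(left, right)]


# Made-up register contents for two tiny sketches (4 registers each).
regs_a = [0, 3, 1, 0]
regs_b = [2, 1, 1, 4]
merged = union_registers(regs_a, regs_b)
```

Unlike hll1.union(hll2), which modifies hll1 in place, this sketch returns a new list and leaves both inputs untouched.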

Reading an HLL from its hex representation in storage specification v1.0.0 format (for example, as retrieved from a PostgreSQL database):

from python_hll.hll import HLL
from python_hll.util import NumberUtil
hex_input = '\\x128D7FFFFFFFFFF6A5C420'
hex_string = hex_input[2:]  # drop the leading '\x' prefix
hll = HLL.from_bytes(NumberUtil.from_hex(hex_string, 0, len(hex_string)))
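To see what the first bytes of such a hex string encode, the header can be decoded with the standard library alone. The sketch below assumes the header layout as I read it in storage specification v1.0.0 (first byte: schema version in the high nibble and type ordinal in the low nibble; second byte: regwidth - 1 in the top 3 bits and log2m in the low 5 bits); check the specification before relying on it. parse_hll_header is a hypothetical helper, not part of python_hll.

```python
import binascii


def parse_hll_header(pg_hex):
    """Decode the first two header bytes of a '\\x'-prefixed HLL hex string."""
    if not pg_hex.startswith("\\x"):
        raise ValueError("expected a string starting with a backslash and 'x'")
    # Only the first two bytes (four hex digits) are read here.
    raw = bytearray(binascii.unhexlify(pg_hex[2:6]))
    if len(raw) < 2:
        raise ValueError("need at least two header bytes")
    version_byte, params_byte = raw[0], raw[1]
    return {
        "schema_version": version_byte >> 4,   # assumed: high nibble
        "type_ordinal": version_byte & 0x0F,   # assumed: low nibble
        "regwidth": (params_byte >> 5) + 1,    # assumed: top 3 bits store regwidth - 1
        "log2m": params_byte & 0x1F,           # assumed: low 5 bits
    }


header = parse_hll_header("\\x128D7FFFFFFFFFF6A5C420")
```

Under these assumptions the example string decodes to regwidth 5 and log2m 13, which matches the HLL(13, 5) parameters used in the examples above.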

Writing an HLL to its hex representation in storage specification v1.0.0 format (for example, to be inserted into a PostgreSQL database):

hll_bytes = hll.to_bytes()  # renamed to avoid shadowing the built-in bytes type
output = "\\x" + NumberUtil.to_hex(hll_bytes, 0, len(hll_bytes))

Also see the API documentation.

Development

See Contributing for how to get started building, testing, and deploying the code.

History

0.0.0 (2019-06-14)

  • Submitted to AdRoll HackWeek.

0.1.0 (2019-09-12)

  • First release on PyPI.

0.1.1 (2019-09-12)

  • Add missing install_requires: numpy

0.1.2 (2019-12-12)

0.1.3 (2021-01-22)
