Python library for the HyperLogLog algorithm
python-hll
A Python implementation of HyperLogLog whose goal is to be storage compatible with java-hll, js-hll and postgresql-hll.
NOTE: This is a fairly literal port of java-hll to Python. Internally, bytes are represented as Java-style signed bytes (-128 to 127) rather than Python-style unsigned bytes (0 to 255). This implementation is also quite slow: for example, HLLSerializationTest takes 12 seconds to run in Java, while test_hll_serialization takes 1.5 hours to run in Python (about 400x slower).
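To make the byte convention in the note concrete, here is a minimal sketch of converting between the two ranges. The helper names `to_java_byte` and `to_python_byte` are hypothetical and are not part of python-hll; the library may perform this conversion differently internally.

```python
def to_java_byte(value):
    """Map a Python-style byte (0..255) to a Java-style byte (-128..127)."""
    if not 0 <= value <= 255:
        raise ValueError("expected a value in 0..255")
    # Values above 127 wrap around to the negative range, as in Java.
    return value - 256 if value > 127 else value


def to_python_byte(value):
    """Map a Java-style byte (-128..127) back to a Python-style byte (0..255)."""
    if not -128 <= value <= 127:
        raise ValueError("expected a value in -128..127")
    # Masking with 0xFF reinterprets the two's-complement value as unsigned.
    return value & 0xFF


# Hypothetical sample bytes, used only to illustrate the round trip.
java_style = [to_java_byte(b) for b in bytearray(b"\x12\x8d\x7f")]
python_style = [to_python_byte(b) for b in java_style]
```

Only values above 127 change under this mapping, so ASCII-range bytes look identical in both conventions.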
Runs on: Python 2.7 and 3
Free software: MIT license
Documentation: https://python-hll.readthedocs.io
Overview
See java-hll for an overview of what HLLs are and how they work.
Usage
Hashing and adding a value to a new HLL:
    from python_hll.hll import HLL
    import mmh3

    value_to_hash = 'foo'
    hashed_value = mmh3.hash(value_to_hash)

    hll = HLL(13, 5)  # log2m=13, regwidth=5
    hll.add_raw(hashed_value)
Retrieving the cardinality of an HLL:
cardinality = hll.cardinality()
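The value returned by `cardinality()` is an estimate, and its accuracy depends on the `log2m` and `regwidth` parameters passed to the constructor. As a rough guide, the sketch below computes the register count, the size of a fully populated register array, and the textbook HyperLogLog standard error of about 1.04 / sqrt(m). The function `hll_parameters` is a hypothetical helper, not part of python-hll, and the actual error of this implementation may differ, particularly at very small or very large cardinalities.

```python
import math


def hll_parameters(log2m, regwidth):
    """Summarize what a (log2m, regwidth) pair implies for a HyperLogLog."""
    m = 1 << log2m  # number of registers
    return {
        "registers": m,
        # Bits for all registers, rounded up to whole bytes (headers excluded).
        "dense_register_bytes": (m * regwidth + 7) // 8,
        # Textbook HyperLogLog standard error; an approximation only.
        "relative_error": 1.04 / math.sqrt(m),
    }


params = hll_parameters(13, 5)  # the parameters used in the examples above
```

With `log2m=13` this gives 8192 registers and a relative error of roughly 1.1%.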
Unioning two HLLs together (and retrieving the resulting cardinality):
    hll1 = HLL(13, 5)  # log2m=13, regwidth=5
    hll2 = HLL(13, 5)  # log2m=13, regwidth=5

    # ... (add values to both sets) ...

    hll1.union(hll2)  # modifies hll1 to contain the union
    cardinalityUnion = hll1.cardinality()
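Conceptually, the union of two HyperLogLogs with identical parameters keeps, for each register, the larger of the two values. The toy sketch below illustrates that idea on plain lists. It is not how python-hll stores its data (the library also has explicit and sparse representations), and `union_registers` is a hypothetical name used only for illustration.

```python
def union_registers(registers_a, registers_b):
    """Return the register-wise maximum of two equally sized register lists."""
    if len(registers_a) != len(registers_b):
        raise ValueError("both HLLs must use the same number of registers")
    return [max(a, b) for a, b in zip(registers_a, registers_b)]


# Two tiny, made-up register arrays (a real HLL with log2m=13 has 8192).
regs1 = [0, 3, 1, 0]
regs2 = [2, 1, 1, 4]
merged = union_registers(regs1, regs2)
```

Because the maximum is commutative and idempotent, unioning in any order, or unioning an HLL with itself, yields the same registers.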
Reading an HLL from a hex representation of storage specification, v1.0.0 (for example, retrieved from a PostgreSQL database):
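    from python_hll.hll import HLL
    from python_hll.util import NumberUtil

    # Named hex_input rather than input to avoid shadowing the builtin.
    hex_input = '\\x128D7FFFFFFFFFF6A5C420'
    hex_string = hex_input[2:]  # strip the leading "\x"
    hll = HLL.from_bytes(NumberUtil.from_hex(hex_string, 0, len(hex_string)))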
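For readers curious about what the hex string contains, the following sketch decodes the three header bytes using only the standard library. It reflects my reading of the v1.0.0 storage specification (version and type in the first byte, `regwidth - 1` and `log2m` in the second, the sparse flag and explicit cutoff in the third) and should be checked against the specification before being relied on. `decode_hll_header` and `HLL_TYPES` are hypothetical names, not part of python-hll.

```python
# Type ordinals as I understand them from the storage specification.
HLL_TYPES = {0: "UNDEFINED", 1: "EMPTY", 2: "EXPLICIT", 3: "SPARSE", 4: "FULL"}


def decode_hll_header(hex_string):
    """Decode the 3-byte header of a serialized HLL given as hex digits."""
    raw = bytes.fromhex(hex_string)
    if len(raw) < 3:
        raise ValueError("a serialized HLL needs at least 3 header bytes")
    version_byte, parameter_byte, cutoff_byte = raw[0], raw[1], raw[2]
    return {
        "schema_version": version_byte >> 4,           # top nibble
        "type": HLL_TYPES.get(version_byte & 0x0F, "UNKNOWN"),  # bottom nibble
        "regwidth": (parameter_byte >> 5) + 1,         # top 3 bits hold regwidth - 1
        "log2m": parameter_byte & 0x1F,                # low 5 bits
        "sparse_enabled": bool(cutoff_byte & 0x40),    # second-highest bit
        "explicit_cutoff_code": cutoff_byte & 0x3F,    # low 6 bits
    }


# The example string from above, without the leading "\x".
header = decode_hll_header("128D7FFFFFFFFFF6A5C420")
```

For this input the decoded `log2m` and `regwidth` come out as 13 and 5, matching the `HLL(13, 5)` parameters used in the earlier examples.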
Writing an HLL to its hex representation of storage specification, v1.0.0 (for example, to be inserted into a PostgreSQL database):
    # Named hll_bytes rather than bytes to avoid shadowing the builtin.
    hll_bytes = hll.to_bytes()
    output = "\\x" + NumberUtil.to_hex(hll_bytes, 0, len(hll_bytes))
Also see the API documentation.
Development
See Contributing for how to get started building, testing, and deploying the code.
History
0.0.0 (2019-06-14)
Submitted to AdRoll HackWeek.
0.1.0 (2019-09-12)
First release on PyPI.
0.1.1 (2019-09-12)
Add missing install_requires: numpy
0.1.2 (2019-12-12)
Fix alpha_m_squared for m=32: https://github.com/AdRoll/python-hll/pull/2
0.1.3 (2021-01-22)
Fix AttributeError: 'HLL' object has no attribute '_sparse_probabilistic_storage': https://github.com/AdRoll/python-hll/pull/4
Hashes for python_hll-0.1.3-py2.py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 65c3bb7cefbab542221b02e03559599721e25fef841da12cc252ac97bbdd5dd1
MD5 | 67c88ea9fbfb5fead8c9d5cddd1b89ed
BLAKE2b-256 | 2d9c6c1f1c59ecb0107dfc4d083d9b0a36f756d9a30e3e9084d2d44841ef1fa7