Skip to main content

Very fast, constant memory-footprint cardinality approximation, including intersection

Project description

Hyperminhash for Python

Docs PyPI

Very fast, constant memory-footprint cardinality approximation, including intersection and union operation.

The class Sketch can be used to count unique elements that were encountered during the instance's lifetime. Unlike e.g. when using a set, the memory consumed by the Sketch-instance does not grow as elements are added; each Sketch-instance consumes approximately 32kb of memory, independent of the number of elements.

Adding new elements to a Sketch is preferably done using .add_bytes(), akin to .digest() for hashlib-objects. It is also possible to use .add() if a bytes-like object can't be provided; the object's hash() is then used to provide a unique identifier for that object.

# Construct an empty Sketch
sk = pyhyperminhash.Sketch()
assert not bool(sk)  # The Sketch is empty
sk.add_bytes(b'Foo')  # Add elements
sk.add_bytes(b'Bar')
sk.add(42)
assert bool(sk)  # The Sketch is not empty
assert float(sk) == sk.cardinality()  # Approximately 3.0
assert int(sk) == len(sk)  # Approximately 3

sk2 = pyhyperminhash.Sketch()
sk2.add_bytes(b'Foobar')
sk2 &= sk1  # sk2 now holds all elements that were in `sk` or `sk2`

sk3 = pyhyperminhash.Sketch()
sk3.add(42)
sk3.add_bytes(b'Foo')
assert sk3.intersection(sk1) > 0.0  # Approximately 1.0, due to `Foo`

buf = sk.save()  # Serialize the Sketch into a bytes-buffer
new_sk = pyhyperminhash.Sketch.load(buf)  # Deserialize a Sketch
assert len(new_sk) == len(sk)

# Construct a Sketch from any iterator directly.
# This is faster than adding elements one by one.
sk = pyhyperminhash.Sketch.from_iter(iter(range(100)))

with open('complaints.txt', 'rb') as f:
    sk = pyhyperminhash.Sketch.from_iter_bytes(f)

Although convenient, using .add() over .add_bytes() has two downsides:

  • It may be slower, as the object's own hash() may have to be computed
  • The Python interpreter probably randomizes hash-values at interpreter-startup, in order to prevent certain DOS-attacks. The randomness introduced here means that the count-approximation provided by this package may not be stable from run to run.

See the documentation for the underlying implementation for additional information.

Performance examples

Reading a file of 55 million lines, 725.940 unique elements, 100 characters per line on average.

Method Wall-clock time Memory consumed Count
no work 9.5 seconds nil nil
Sketch.from_iter_bytes() 10.13 seconds ~32 kilobytes 734,628
set() 14.99 seconds ~164 megabytes 725,940

FAQ

  • Can I extract an element that was previously added from a Sketch?
    • No.
  • Can I check if an element has previously been added to a Sketch?
    • No.
  • Wow, why use this, then?
    • It's very fast at counting unique things. And it does not use much memory.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pyhyperminhash-0.1.3.tar.gz (11.8 kB view hashes)

Uploaded Source

Built Distributions

pyhyperminhash-0.1.3-cp38-abi3-win_amd64.whl (152.4 kB view hashes)

Uploaded CPython 3.8+ Windows x86-64

pyhyperminhash-0.1.3-cp38-abi3-win32.whl (149.8 kB view hashes)

Uploaded CPython 3.8+ Windows x86

pyhyperminhash-0.1.3-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (239.6 kB view hashes)

Uploaded CPython 3.8+ manylinux: glibc 2.17+ x86-64

pyhyperminhash-0.1.3-cp38-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl (321.6 kB view hashes)

Uploaded CPython 3.8+ manylinux: glibc 2.17+ s390x

pyhyperminhash-0.1.3-cp38-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl (259.2 kB view hashes)

Uploaded CPython 3.8+ manylinux: glibc 2.17+ ppc64le

pyhyperminhash-0.1.3-cp38-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl (238.0 kB view hashes)

Uploaded CPython 3.8+ manylinux: glibc 2.17+ ARMv7l

pyhyperminhash-0.1.3-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (226.6 kB view hashes)

Uploaded CPython 3.8+ manylinux: glibc 2.17+ ARM64

pyhyperminhash-0.1.3-cp38-abi3-manylinux_2_5_i686.manylinux1_i686.whl (254.6 kB view hashes)

Uploaded CPython 3.8+ manylinux: glibc 2.5+ i686

pyhyperminhash-0.1.3-cp38-abi3-macosx_11_0_arm64.whl (207.1 kB view hashes)

Uploaded CPython 3.8+ macOS 11.0+ ARM64

pyhyperminhash-0.1.3-cp38-abi3-macosx_10_12_x86_64.whl (219.1 kB view hashes)

Uploaded CPython 3.8+ macOS 10.12+ x86-64

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page