Very fast, constant memory-footprint cardinality approximation, including intersection
Project description
Hyperminhash for Python
Very fast, constant memory-footprint cardinality approximation, including intersection and union operation.
The class Sketch
can be used to count unique elements that were encountered during
the instance's lifetime. Unlike e.g. when using a set
, the memory consumed by the
Sketch
-instance does not grow as elements are added; each Sketch
-instance
consumes approximately 32kb of memory, independent of the number of elements.
Adding new elements to a Sketch
is preferably done using .add_bytes()
, akin
to .digest()
for hashlib
-objects. It is also possible to use .add()
if a bytes-like
object can't be provided; the object's hash()
is then used to provide a unique
identifier for that object.
# Construct an empty Sketch
sk = pyhyperminhash.Sketch()
assert not bool(sk) # The Sketch is empty
sk.add_bytes(b'Foo') # Add elements
sk.add_bytes(b'Bar')
sk.add(42)
assert bool(sk) # The Sketch is not empty
assert float(sk) == sk.cardinality() # Approximately 3.0
assert int(sk) == len(sk) # Approximately 3
sk2 = pyhyperminhash.Sketch()
sk2.add_bytes(b'Foobar')
sk2 &= sk1 # sk2 now holds all elements that were in `sk` or `sk2`
sk3 = pyhyperminhash.Sketch()
sk3.add(42)
sk3.add_bytes(b'Foo')
assert sk3.intersection(sk1) > 0.0 # Approximately 1.0, due to `Foo`
buf = sk.save() # Serialize the Sketch into a bytes-buffer
new_sk = pyhyperminhash.Sketch.load(buf) # Deserialize a Sketch
assert len(new_sk) == len(sk)
# Construct a Sketch from any iterator directly.
# This is faster than adding elements one by one.
sk = pyhyperminhash.Sketch.from_iter(iter(range(100)))
with open('complaints.txt', 'rb') as f:
sk = pyhyperminhash.Sketch.from_iter_bytes(f)
Although convenient, using .add()
over .add_bytes()
has two downsides:
- It may be slower, as the object's own
hash()
may have to be computed - The Python interpreter probably randomizes hash-values at interpreter-startup, in order to prevent certain DOS-attacks. The randomness introduced here means that the count-approximation provided by this package may not be stable from run to run.
See the documentation for the underlying implementation for additional information.
Performance examples
Reading a file of 55 million lines, 725.940 unique elements, 100 characters per line on average.
Method | Wall-clock time | Memory consumed | Count |
---|---|---|---|
no work | 9.5 seconds | nil | nil |
Sketch.from_iter_bytes() |
10.13 seconds | ~32 kilobytes | 734,628 |
set() |
14.99 seconds | ~164 megabytes | 725,940 |
FAQ
- Can I extract an element that was previously added from a
Sketch
?- No.
- Can I check if an element has previously been added to a
Sketch
?- No.
- Wow, why use this, then?
- It's very fast at counting unique things. And it does not use much memory.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distributions
Hashes for pyhyperminhash-0.1.2-cp38-abi3-win_amd64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 7d053ee74674be5b6e0de779c80c37c52142fdc66b24e4bb118b7727dfa5a23e |
|
MD5 | a4c96b9809f93704b517d5b9a8157ad5 |
|
BLAKE2b-256 | 26aeb8e93e8924647539bdd70a21cab14792b0606161dd793317da7c6833ae79 |
Hashes for pyhyperminhash-0.1.2-cp38-abi3-win32.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 962227883f858eecae91e049cf40fee8007f9c5b5a9442bfaa03c93b44e026cd |
|
MD5 | 9d27ecf333f13b74da5ad4db1ad365a3 |
|
BLAKE2b-256 | 47e218e53c713ed761da4f22fd2051d060a213c0ba0cdbc5e68ccbb23c8967fe |
Hashes for pyhyperminhash-0.1.2-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 9f1b93d462fd0ec62dbb51514032171534a2282dee8419603091dc2eef10677c |
|
MD5 | eb3db1b36fc015f9aefc86d37f54b6cc |
|
BLAKE2b-256 | 7020b7cfff12e526cc6a6f7d7b8deacdc16481e6ae7f6f90e28f61183933008f |
Hashes for pyhyperminhash-0.1.2-cp38-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | e791358ca6d8c428a7b976cae066ee2859f9df89c6c3b120d8d2c1cdda9869d6 |
|
MD5 | 710942bf9c85254e81eb5147600640f5 |
|
BLAKE2b-256 | 0dff107cec75eadee2f485f6c84585ec9048f490a5d3bb76f48d5a05e6504621 |
Hashes for pyhyperminhash-0.1.2-cp38-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 4fc655a73474129d1eff8dd40436c1ecf2bb71e3bcb2a396ea8c29b1053470da |
|
MD5 | 83c26073ea9469fd79a0c2eb024f418a |
|
BLAKE2b-256 | 1ffd8b3574ccf950d022502c28a84052fcbe954b8a9888e03cac8f0085e91ab7 |
Hashes for pyhyperminhash-0.1.2-cp38-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 40a577570a0b7ffe42b17bbd41a0bc638a91440aaca2aa58a1189d839879ca01 |
|
MD5 | 15bbe0cc9cd426028cfbe52627229afd |
|
BLAKE2b-256 | d31f4c7758563514a7d0f27fcd4aaf55d3b0a86b7434353863c3ccd909b11edd |
Hashes for pyhyperminhash-0.1.2-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 44afb1a3437dddcb54bea22e16ba8737f01860c7df848dfef85bc7f28541cc8e |
|
MD5 | ac192c7ba10e29e2beb0c1b7915fb6ef |
|
BLAKE2b-256 | 75867a50db08b2b186143719d8d2f771171891c57d3218c3fddefeb5546eddd1 |
Hashes for pyhyperminhash-0.1.2-cp38-abi3-manylinux_2_5_i686.manylinux1_i686.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 35bb69dbe4aa1898c223f628291b7b60090727d20d289406d1c602a4ee0d00a7 |
|
MD5 | 4dbe7796096c68563a5c2ac310ad7d2b |
|
BLAKE2b-256 | 174beaf9f4d7cd13e26cf688ffb2f0903d0bbf941df7e0e3d3ac72dbaa9ecb91 |
Hashes for pyhyperminhash-0.1.2-cp38-abi3-macosx_11_0_arm64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | b5964c81a02b2105e2e353017f50167823727ccca50317fa0ed52161bd19b3c1 |
|
MD5 | a660df66fadc8da0b1206b32ee719947 |
|
BLAKE2b-256 | add8180da89e0b1fcc3ee239e89d46cb6db89cde3b6ec485ef555d5fbeedf825 |
Hashes for pyhyperminhash-0.1.2-cp38-abi3-macosx_10_12_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 0a0d3b8e29f56ddd54f77a5e52e141087df4264141789517fd3a83e37a20d863 |
|
MD5 | 8fbc69a2d44894bdb176bd7cab407cca |
|
BLAKE2b-256 | 69172ef096acf7d501cd0f48e9637e5aea4a383185c331a2e0738ec226d10a7c |