Very fast, constant memory-footprint cardinality approximation, including intersection
Project description
Hyperminhash for Python
Very fast, constant memory-footprint cardinality approximation, including intersection and union operation.
The class Sketch
can be used to count unique elements that were encountered during
the instance's lifetime. Unlike e.g. when using a set
, the memory consumed by the
Sketch
-instance does not grow as elements are added; each Sketch
-instance
consumes approximately 32kb of memory, independent of the number of elements.
Adding new elements to a Sketch
is preferably done using .add_bytes()
, akin
to .digest()
for hashlib
-objects. It is also possible to use .add()
if a bytes-like
object can't be provided; the object's hash()
is then used to provide a unique
identifier for that object.
# Construct an empty Sketch
sk = pyhyperminhash.Sketch()
assert not bool(sk) # The Sketch is empty
sk.add_bytes(b'Foo') # Add elements
sk.add_bytes(b'Bar')
sk.add(42)
assert bool(sk) # The Sketch is not empty
assert float(sk) == sk.cardinality() # Approximately 3.0
assert int(sk) == len(sk) # Approximately 3
sk2 = pyhyperminhash.Sketch()
sk2.add_bytes(b'Foobar')
sk2 &= sk1 # sk2 now holds all elements that were in `sk` or `sk2`
sk3 = pyhyperminhash.Sketch()
sk3.add(42)
sk3.add_bytes(b'Foo')
assert sk3.intersection(sk1) > 0.0 # Approximately 1.0, due to `Foo`
buf = sk.save() # Serialize the Sketch into a bytes-buffer
new_sk = pyhyperminhash.Sketch.load(buf) # Deserialize a Sketch
assert len(new_sk) == len(sk)
# Construct a Sketch from any iterator directly.
# This is faster than adding elements one by one.
sk = pyhyperminhash.Sketch.from_iter(iter(range(100)))
with open('complaints.txt', 'rb') as f:
sk = pyhyperminhash.Sketch.from_iter_bytes(f)
Although convenient, using .add()
over .add_bytes()
has two downsides:
- It may be slower, as the object's own
hash()
may have to be computed - The Python interpreter probably randomizes hash-values at interpreter-startup, in order to prevent certain DOS-attacks. The randomness introduced here means that the count-approximation provided by this package may not be stable from run to run.
See the documentation for the underlying implementation for additional information.
Performance examples
Reading a file of 55 million lines, 725.940 unique elements, 100 characters per line on average.
Method | Wall-clock time | Memory consumed | Count |
---|---|---|---|
no work | 9.5 seconds | nil | nil |
Sketch.from_iter_bytes() |
10.13 seconds | ~32 kilobytes | 734,628 |
set() |
14.99 seconds | ~164 megabytes | 725,940 |
FAQ
- Can I extract an element that was previously added from a
Sketch
?- No.
- Can I check if an element has previously been added to a
Sketch
?- No.
- Wow, why use this, then?
- It's very fast at counting unique things. And it does not use much memory.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distributions
Hashes for pyhyperminhash-0.1.3-cp38-abi3-win_amd64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | b21427f14e7d1b234b85f8145eb36aa7fd2b0a46c41b4e97150b7724776958fc |
|
MD5 | 3059769300dbd9e86cd843e22860497f |
|
BLAKE2b-256 | 36fe4397b077631e4533aa568a207a8cd09fc6ce04c97d56b8256b5032dc3c86 |
Hashes for pyhyperminhash-0.1.3-cp38-abi3-win32.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 795fff712c625bd01ecad77a49662421b38c2fc944e6ea1906c06cf35b4b0253 |
|
MD5 | 8bd306e0f4d726f2ecb97a79d21c9282 |
|
BLAKE2b-256 | ae5ab008cd69ae01bef9cb59266c6e380c311992da00bd90cb0533a50ba1812c |
Hashes for pyhyperminhash-0.1.3-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 860c479e11dc2902cf87a423128e9261798a546c0eff4536f45caffbfe45ad07 |
|
MD5 | c6ffa5e1c52badab71b27bd28786c1a3 |
|
BLAKE2b-256 | a52c9c10a97c4daa62034d326c6e77b5e3575f67f29667c3ea93aafb8de8d48f |
Hashes for pyhyperminhash-0.1.3-cp38-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 0dfe154e00a48e976c522e2710807e1e53b779a544ea9edc9668eff5d683d41c |
|
MD5 | 6ab622a188a883f0fb23b81b986b9cd6 |
|
BLAKE2b-256 | 023f36ceeb45b6bbccf4edfef52f17f01fc0c80ff125f49c3fc5a7afe5727c46 |
Hashes for pyhyperminhash-0.1.3-cp38-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | f13d7474ec5d9fa58c926873465ee844a08c9d4b98fb226d64c4105bafbdac01 |
|
MD5 | 5ca33d5a45165f7495c9dd1881092374 |
|
BLAKE2b-256 | 046b47494e68c8c9f24a71468b1d52db3417710453f74ccb4ee9111fc108f4af |
Hashes for pyhyperminhash-0.1.3-cp38-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 8c333627dd76b7c1eb0b25502611aa0ba6605411d9e3a0fbb10b3c30ab5d01f9 |
|
MD5 | a7f176363999d538f1421e3f47364a6c |
|
BLAKE2b-256 | dbd87bd1367a7052f9dd24d1892d19fac542ab9decc6dc0e114e73778946797d |
Hashes for pyhyperminhash-0.1.3-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 642fcc442330bb347fa92f8ef08642c940c23f003ed9ae117578c2efb7c6c895 |
|
MD5 | 8b85de68de51a3f5866a63dac23fe38c |
|
BLAKE2b-256 | 5ec1af1bf0db0e38afa6841ba492341cfc571816bc23117c546c108679bad519 |
Hashes for pyhyperminhash-0.1.3-cp38-abi3-manylinux_2_5_i686.manylinux1_i686.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 52f3abba225767e72499be7cd8f3b78846bf1c8b015ae752d3f3acbe57804b5c |
|
MD5 | 9ff05503b13bff3cddddd985aeaf7df5 |
|
BLAKE2b-256 | 3f8b0aaf778e728a1755a29116f4211d890b8fe8829ba0ac74ac14e6112f59ca |
Hashes for pyhyperminhash-0.1.3-cp38-abi3-macosx_11_0_arm64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | e2358b0c00fb5e300ca772bd648aa32a19fa942a6e69f08df5eb29498910d9eb |
|
MD5 | d5145e8172b70211d3c152dc514e4625 |
|
BLAKE2b-256 | c9d57747d8764b9bc0087b315f971fbf058fa2fc0b2945936f051bf621a594bb |
Hashes for pyhyperminhash-0.1.3-cp38-abi3-macosx_10_12_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | cc045343ca0afa209c21894c1c7445c72540d9524b24edaca394f531cecb2874 |
|
MD5 | 7afc68790601ba7db71bbd82dd2d7f51 |
|
BLAKE2b-256 | 115166a898e94bd33a5f3927c2bd4d591ad0b957d06f75de76770cf820f94c38 |