The Python multiset hashing library
Project description
pymsh
pymsh is a Python library that provides multiset hash (MSH) functions. These functions let you hash collections of items without worrying about the order of those items.
What is a Multiset Hash?
A multiset is like a set, except it can contain multiple instances of the same item. For example:
- A set might be
[apple, banana, cherry](with no duplicates). - A multiset could be
[apple, apple, apple, banana, banana, cherry](duplicates matter).
Multiset hashing (MSH) produces a hash value that reflects both the types of items you have and the quantities of each item, but not their order. If we hash the following using an MSH scheme, then the same hash values will be produced: hash(apple, banana, banana, apple, apple, cherry) will equal hash(apple, apple, apple, banana, banana, cherry)
We can see that the order does not matter as long as the elements and their quantities are the same.
Why Is This Useful?
If you have a collection of elements where order does not matter (e.g., tags on a file, items in a shopping cart), a normal hash function, such as SHA256 or MD5, will give different results depending on how you list the items. A multiset hash ensures the same final hash regardless of item ordering.
Furthermore, some MSHs in this library can be updated one item at a time. This is especially handy if you handle large or streaming data and want to maintain a running hash without reprocessing everything.
Installation
pip install pymsh
Basic Usage
For most general use cases, we recommend using the additive multiset hash (accessible via the shortcut class Hasher).
- Prepare a multiset: You can use our helper
list_to_multisetif you have a Python list containing repeated elements. - Hash it: Pass the resulting dictionary (
element -> count) into a hasher.
Note: pymsh expects your inputs to be passed as bytes.
Example:
from pymsh import list_to_multiset, Hasher
# Suppose you have this list with repeated elements
fruit_list = [b"apple", b"apple", b"banana", b"cherry", b"banana", b"apple"]
# 1) Convert the list to a multiset:
multiset = list_to_multiset(fruit_list)
# => {b"apple": 3, b"banana": 2, b"cherry": 1}
# 2) Hash your multiset (Hasher is an alias for MSetAddHash)
msh = Hasher().hash(multiset)
print("Multiset hash:", msh)
That’s it! You’ll get a tuple representing the multiset, independent of how you ordered "apple, banana, cherry."
If you are using Hasher (MSetAddHash). The first element of the tuple is the hash and the second is a nonce. If you want to reproduce the hash, like on another device, then you will need to know the nonce and the key.
Advanced Usage
pymsh implements multiple MSH constructions, each with its own tradeoffs in security, performance, and whether it requires a secret key. Below is a quick overview; skip to Incremental vs. One-shot Hashing if you don’t need these details right now.
MSetXORHash (Keyed, Set-collision Resistant)
- What it does: A keyed hash using XOR operations internally.
- Best for: Cases where you only need to detect changes in the set of items (ignores the exact count of each item, though).
- Supports incremental hashing?: Yes.
- Uses a secret key: Yes.
- It is NOT multiset collision-resistant; if some of your elements repeat, then the same hash values may be produced for different orderings.
MSetAddHash (Keyed, Multiset-collision Resistant)
- What it does: Uses an additive approach under a secret key to ensure that different multisets produce distinct hashes.
- Best for: Most general-purpose scenarios. This is the same as the default
Hasherclass. - Supports incremental hashing?: Yes.
- Uses a secret key: Yes.
MSetMuHash (Keyless, Multiset-collision Resistant)
- What it does: Uses multiplication in a finite field with a large prime modulus.
- Best for: Keyless scenarios. Good when you want collision resistance without managing keys.
- Supports incremental hashing?: No.
- Uses a secret key: No.
MSetVAddHash (Keyless, Multiset-collision Resistant)
- What it does: Uses vector addition space.
- Best for: Keyless scenarios with incremental updates; yields a larger hash compared to MuHash, but often simpler to handle incrementally.
- Supports incremental hashing?: Yes.
- Requires a Key: No.
Examples
import secrets
from pymsh import (
MSetXORHash,
MSetAddHash,
MSetMuHash,
MSetVAddHash
)
# Sample secret key for keyed hashes
key = secrets.token_bytes(32)
# A sample multiset: item -> count
multiset = {
b"apple": 3,
b"banana": 2,
b"cherry": 1
}
# 1) XOR Hash (Keyed, set-collision resistant)
xor_hasher = MSetXORHash(key)
xor_result = xor_hasher.hash(multiset)
print("XOR Hash (one-shot):", xor_result)
# 2) Additive Hash (Keyed, multiset-collision resistant)
add_hasher = MSetAddHash(key)
one_shot = add_hasher.hash(multiset)
print("Additive Hash (one-shot):", one_shot)
# Incremental usage of Additive Hash
add_hasher.update(b"apple", 3)
add_hasher.update(b"banana", 2)
add_hasher.update(b"cherry", 1)
incremental_hash = add_hasher.digest()
print("Additive Hash (incremental):", incremental_hash)
assert one_shot == incremental_hash # They should match
# 3) MuHash (Keyless)
mu_hasher = MSetMuHash()
print("MuHash:", mu_hasher.hash(multiset))
# 4) Vector Add Hash (Keyless)
vadd_hasher = MSetVAddHash()
print("VAdd Hash:", vadd_hasher.hash(multiset))
Incremental vs. One-shot Hashing
One‐shot: Pass a full dictionary (e.g., {item: count}) at once using .hash(multiset).
Incremental: Create an instance, then call .update(item, count) for each new element as needed, and finally call .digest() to get the final hash.
Which Should I Pick?
For most general-purpose tasks, use MSetAddHash (the default Hasher).
If you prefer keyless usage without the incremental feature, then consider MSetMuHash.
However, if you need incremental and keyless, try MSetVAddHash. Here's a comparison table:
| Hash Type | Security | Key Required | Incremental | Notes |
|---|---|---|---|---|
MSetXORHash |
Set-collision resistance | Yes | Yes | Fast set verification |
MSetAddHash |
Multiset-collision resistance | Yes | Yes | General purpose |
MSetMuHash |
Multiset-collision | No | No | Keyless; short outputs |
MSetVAddHash |
Multiset-collision | No | Yes | Efficient, but longer hashes |
References
- D. Clarke, S. Devadas, M. van Dijk, B. Gassend, and G.E. Suh. “Incremental Multiset Hash Functions and Their Application to Memory Integrity Checking,” ASIACRYPT 2003.
Contribute
Feel free to open an issue or pull request if you have questions or suggestions!
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file pymsh-1.2.2.tar.gz.
File metadata
- Download URL: pymsh-1.2.2.tar.gz
- Upload date:
- Size: 12.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
d2193f041ac0eabbe96ef4b6ca5a39da9ae69264e87ac7aeaf85e9f7393ea3ec
|
|
| MD5 |
40ec2e79f36fcee97ae90ea97c2f5508
|
|
| BLAKE2b-256 |
8e5586156419873f45aefacfde7166d0b2997f39b632447e7ac353861528e312
|
File details
Details for the file pymsh-1.2.2-py3-none-any.whl.
File metadata
- Download URL: pymsh-1.2.2-py3-none-any.whl
- Upload date:
- Size: 9.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
44a67a58f330f5dc73b0c4e10bd336b7252ec089a907d83b21b96fcddf9b1693
|
|
| MD5 |
0043617fd156569d0a2af2f7f5ee0dd0
|
|
| BLAKE2b-256 |
c0242ff57b0aa28492603647d0f36b9dfcc8905f596df84d94515ce031ed00be
|