Stable, keyed hashing for Python objects and columnar data. Think `stablehash`, but with SipHash-like keyed PRF semantics so hashes are deterministic for a given key and resistant to adversarial inputs.
Project description
keyedstablehash
Deterministic, cryptographically secure hashing for complex Python objects and columnar data.
| Testing | |
| Package | |
| Meta |
keyedstablehash solves the problem of generating reproducible, secure hashes for arbitrary Python structures (dicts,
lists, primitives) across different processes and machines. Think of it as stablehash meets hashlib, powered by the
SipHash-2-4 algorithm to prevent hash-flooding attacks.
Why use keyedstablehash?
Standard Python hash() is randomized per process for security. hashlib (md5/sha) is stable but requires manual
byte-encoding of objects. keyedstablehash gives you the best of both worlds:
- 🔒 Secure & Keyed: Uses SipHash-2-4 (a keyed pseudorandom function). By keeping your key secret, you prevent adversarial inputs (HashDoS attacks) and ensure hashes cannot be predicted externally.
- Reproducible: Guaranteed deterministic output for a given key and input, regardless of Python version or architecture.
- 🧠 Smart Canonicalization: Automatically handles nested dictionaries, sets (order-independent), mixed types, and
NumPy scalars.
{a: 1, b: 2}hashes the same as{b: 2, a: 1}. - 🐼 Big Data Ready: First-class support for Pandas, Polars, and PyArrow. Hash millions of rows efficiently without writing fragile loops.
- 🛠 Type-Safe: Fully typed with
py.typedfor a seamless IDE experience.
Installation
Install the core library:
pip install keyedstablehash
Optional High-Performance Extras: For vectorization support with your favorite dataframe library:
pip install "keyedstablehash[dataframes]" # Support for Pandas
pip install "keyedstablehash[arrow]" # Support for PyArrow
pip install "keyedstablehash[polars]" # Support for Polars
Quick Start
1. Hashing Python Objects
Generate stable hashes for complex, nested structures.
from keyedstablehash import stable_keyed_hash
# Your secret key (must be 16 bytes)
secret_key = b"\x01" * 16
# A complex, messy object
data = {
"id": 101,
"tags": {"python", "data", "secure"}, # Sets are auto-sorted
"meta": {"created_at": 167888, "active": True}
}
h = stable_keyed_hash(data, key=secret_key)
print(f"Hex: {h.hexdigest()}")
# -> Hex: 4a1b... (Deterministic across runs)
print(f"Int: {h.intdigest()}")
# -> Int: 8392... (uint64)
2. Streaming API
Mirrors the standard hashlib interface for data streams.
from keyedstablehash import siphash24
secret_key = b"\x01" * 16
s = siphash24(key=secret_key)
s.update(b"chunk_one")
s.update(b"chunk_two")
print(s.hexdigest())
3. Dataframe Vectorization (The Power Feature)
Hash entire columns in Pandas, Polars, or Arrow. This is essential for data de-duplication, shuffling, or anonymization pipelines.
import pandas as pd
import pyarrow as pa
from keyedstablehash import hash_pandas_series, hash_arrow_array
secret_key = b"\x01" * 16
# --- Pandas ---
df = pd.DataFrame({"user_id": ["u1", "u2", "u1"]})
df["hash"] = hash_pandas_series(df["user_id"], key=secret_key)
# Result: A Series of uint64 hashes
print(df["hash"])
# --- PyArrow ---
arr = pa.array(["alpha", "beta", "gamma"])
hashes = hash_arrow_array(arr, key=secret_key)
# Result: A pyarrow.Array(uint64)
print(hashes)
Canonicalization Rules
To ensure stability, keyedstablehash strictly defines how types are converted to bytes before hashing.
| Type | Handling Strategy |
|---|---|
| None / Bool | Tagged with unique type markers. |
| Numbers | int (arbitrary precision) and float (IEEE-754) are length-prefixed and tagged. |
| Strings | Encoded as UTF-8, length-prefixed. |
| Sequences | list and tuple are order-sensitive. |
| Sets | set and frozenset are order-independent. Elements are hashed individually, sorted by their encoded bytes, and then hashed. |
| Mappings | dict is order-independent. Key-value pairs are canonically encoded, and items are sorted by the encoded key before hashing. |
| Numpy | Scalars are coerced to native Python equivalents. |
| Others | Falls back to __dict__ if available; otherwise raises TypeError. |
API Reference
Core Functions
-
stable_keyed_hash(obj, key: bytes, algo="siphash24") -> KeyedStableHash -
One-shot hashing of an object.
-
Returns an object with
.digest(),.hexdigest(), and.intdigest(). -
siphash24(key: bytes) -> SipHash24 -
Stateful hasher.
-
Methods:
.update(data),.digest(),.hexdigest(),.intdigest(),.copy().
Vectorized Helpers
hash_pandas_series(series, key, ...)pandas.Series[uint64]hash_arrow_array(array, key, ...)pyarrow.Array[uint64]hash_polars_series(series, key, ...)polars.Series
Roadmap
Note: Current implementation is pure Python. While optimized, it involves Python loop overhead for complex structures.
- C/Rust Backend: Replace the inner loop with a compiled extension (Rust or C) for significant speedups.
- Contract Tests: Add cross-version compatibility contracts to ensure hash stability across future library updates.
- Vectorized Kernels: Move columnar hashing entirely to C/Rust to avoid per-row Python overhead.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distributions
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file keyedstablehash-0.0.4-py3-none-any.whl.
File metadata
- Download URL: keyedstablehash-0.0.4-py3-none-any.whl
- Upload date:
- Size: 10.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
22954f720995c557bedca1791b471339905a5ba8886c436c5a1838f632b99303
|
|
| MD5 |
99d7f4ce64aeda7d2b2cb3dcddb72a44
|
|
| BLAKE2b-256 |
59ec464766c3fbe6041815e2c1a77a710353d0d9b2e2ebe9ba0ee43e5baa1d4e
|