
Stable, keyed hashing for Python objects and columnar data (SipHash-like PRF semantics).

keyedstablehash

Deterministic, keyed hashing (SipHash-2-4) for complex Python objects and columnar data.


keyedstablehash solves the problem of generating reproducible, secure hashes for arbitrary Python structures (dicts, lists, primitives) across different processes and machines. Think of it as stablehash meets hashlib, powered by the SipHash-2-4 algorithm to prevent hash-flooding attacks.

Why use keyedstablehash?

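Python's built-in hash() is randomized per process for str and bytes (unless PYTHONHASHSEED is fixed), so its values cannot be stored or compared across runs. hashlib (MD5/SHA) is stable but requires you to encode objects to bytes by hand. keyedstablehash gives you the best of both worlds: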

  • 🔒 Secure & Keyed: Uses SipHash-2-4 (a keyed pseudorandom function). By keeping your key secret, you prevent adversarial inputs (HashDoS attacks) and ensure hashes cannot be predicted externally.
  • Reproducible: The same key and input are designed to produce the same output across processes, machines, Python versions, and architectures.
  • 🧠 Smart Canonicalization: Automatically handles nested dictionaries, sets (order-independent), mixed types, and NumPy scalars. {"a": 1, "b": 2} hashes the same as {"b": 2, "a": 1}.
  • 🐼 Big Data Ready: First-class support for Pandas, Polars, and PyArrow. Hash whole columns without writing fragile loops (the current implementation is pure Python; see Roadmap).
  • 🛠 Type-Safe: Fully typed with py.typed for a seamless IDE experience.
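To make the "manual byte-encoding" problem concrete, here is a minimal sketch that uses only the standard library. The helper name naive_digest is hypothetical and is not part of keyedstablehash. It shows that a hand-rolled JSON encoding handles dict ordering but fails on a set, and that it is not keyed.

```python
import hashlib
import json


def naive_digest(obj) -> str:
    """Hash an object by first serializing it to canonical JSON (hypothetical helper)."""
    # sort_keys makes dict ordering irrelevant; compact separators keep the bytes stable.
    encoded = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


# Dict insertion order does not matter once keys are sorted.
same = naive_digest({"a": 1, "b": 2}) == naive_digest({"b": 2, "a": 1})

# Sets are not JSON-serializable, so the manual approach breaks on them.
try:
    naive_digest({"tags": {"python", "data"}})
    set_supported = True
except TypeError:
    set_supported = False

print(same, set_supported)
```

Every new type (sets, tuples used as keys, NumPy scalars) needs its own encoding rule in an approach like this, which is the gap the canonicalization rules below are meant to fill.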

Installation

Install the core library:

pip install keyedstablehash

Optional High-Performance Extras: For vectorization support with your favorite dataframe library:

pip install "keyedstablehash[dataframes]"   # Support for Pandas
pip install "keyedstablehash[arrow]"        # Support for PyArrow
pip install "keyedstablehash[polars]"       # Support for Polars

Quick Start

1. Hashing Python Objects

Generate stable hashes for complex, nested structures.

from keyedstablehash import stable_keyed_hash

# Your secret key (must be exactly 16 bytes); use a random, private value in production
secret_key = b"\x01" * 16

# A complex, messy object
data = {
    "id": 101,
    "tags": {"python", "data", "secure"},  # Sets hash the same regardless of iteration order
    "meta": {"created_at": 167888, "active": True}
}

h = stable_keyed_hash(data, key=secret_key)

print(f"Hex: {h.hexdigest()}") 
# -> Hex: 4a1b... (Deterministic across runs)
print(f"Int: {h.intdigest()}") 
# -> Int: 8392... (uint64)
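The example above uses a fixed demo key. For real use you would typically generate a random 16-byte key once and share it between every process that must produce matching hashes. The sketch below uses only the standard library; the function name new_hash_key and the hex-string storage format are assumptions for illustration, not part of the library.

```python
import secrets


def new_hash_key() -> bytes:
    """Generate a random 16-byte (128-bit) key, the size SipHash-2-4 expects."""
    return secrets.token_bytes(16)


key = new_hash_key()

# Serialize the key for a secret store or environment variable (assumed format: hex).
key_hex = key.hex()

# Any other process restores exactly the same key from the stored string.
restored = bytes.fromhex(key_hex)

print(len(restored), restored == key)
```

Hashes are only comparable between runs that use the same key, so rotating the key invalidates any previously stored hashes.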

2. Streaming API

Mirrors the standard hashlib interface for data streams.

from keyedstablehash import siphash24

# Same 16-byte demo key as in the previous example
secret_key = b"\x01" * 16

s = siphash24(key=secret_key)
s.update(b"chunk_one")
s.update(b"chunk_two")

print(s.hexdigest())
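For readers curious about what a SipHash-2-4 hasher computes, the following is a standalone sketch of the published algorithm (2 compression rounds per 8-byte block, 4 finalization rounds). It is not the library's source code, and the name siphash24_reference is hypothetical. It is a one-shot function rather than a streaming object.

```python
import struct

MASK64 = 0xFFFFFFFFFFFFFFFF


def _rotl(x: int, b: int) -> int:
    """Rotate a 64-bit value left by b bits."""
    return ((x << b) | (x >> (64 - b))) & MASK64


def _sipround(v0: int, v1: int, v2: int, v3: int):
    """One SipRound: the add-rotate-xor mixing step of SipHash."""
    v0 = (v0 + v1) & MASK64
    v1 = _rotl(v1, 13) ^ v0
    v0 = _rotl(v0, 32)
    v2 = (v2 + v3) & MASK64
    v3 = _rotl(v3, 16) ^ v2
    v0 = (v0 + v3) & MASK64
    v3 = _rotl(v3, 21) ^ v0
    v2 = (v2 + v1) & MASK64
    v1 = _rotl(v1, 17) ^ v2
    v2 = _rotl(v2, 32)
    return v0, v1, v2, v3


def siphash24_reference(key: bytes, data: bytes) -> int:
    """Return the 64-bit SipHash-2-4 value of data under a 16-byte key."""
    if len(key) != 16:
        raise ValueError("key must be exactly 16 bytes")
    k0, k1 = struct.unpack("<QQ", key)
    v0 = k0 ^ 0x736F6D6570736575
    v1 = k1 ^ 0x646F72616E646F6D
    v2 = k0 ^ 0x6C7967656E657261
    v3 = k1 ^ 0x7465646279746573

    # Compression: absorb each full little-endian 8-byte block with 2 rounds.
    full = len(data) - len(data) % 8
    for offset in range(0, full, 8):
        (m,) = struct.unpack_from("<Q", data, offset)
        v3 ^= m
        v0, v1, v2, v3 = _sipround(v0, v1, v2, v3)
        v0, v1, v2, v3 = _sipround(v0, v1, v2, v3)
        v0 ^= m

    # Final block: leftover bytes plus the message length (mod 256) in the top byte.
    last = ((len(data) & 0xFF) << 56) | int.from_bytes(data[full:], "little")
    v3 ^= last
    v0, v1, v2, v3 = _sipround(v0, v1, v2, v3)
    v0, v1, v2, v3 = _sipround(v0, v1, v2, v3)
    v0 ^= last

    # Finalization: 4 rounds, then fold the state into one 64-bit word.
    v2 ^= 0xFF
    for _ in range(4):
        v0, v1, v2, v3 = _sipround(v0, v1, v2, v3)
    return v0 ^ v1 ^ v2 ^ v3


demo = siphash24_reference(b"\x01" * 16, b"chunk_one" + b"chunk_two")
print(f"{demo:016x}")
```

A streaming hasher would buffer partial blocks between update() calls and apply the same compression step; whether feeding chunks separately yields the same value as hashing their concatenation in this library is not stated in the source, so the sketch does not claim it.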

3. Dataframe Vectorization (The Power Feature)

Hash entire columns in Pandas, Polars, or Arrow. This is essential for data de-duplication, shuffling, or anonymization pipelines.

import pandas as pd
import pyarrow as pa
from keyedstablehash import hash_pandas_series, hash_arrow_array

# Same 16-byte demo key as in the earlier examples
secret_key = b"\x01" * 16

# --- Pandas ---
df = pd.DataFrame({"user_id": ["u1", "u2", "u1"]})
df["hash"] = hash_pandas_series(df["user_id"], key=secret_key)
# Result: A Series of uint64 hashes

# --- PyArrow ---
arr = pa.array(["alpha", "beta", "gamma"])
hashes = hash_arrow_array(arr, key=secret_key)
# Result: A pyarrow.Array(uint64)
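The de-duplication and pseudonymization use cases mentioned above can be sketched without any dataframe library. The example below uses keyed BLAKE2b from the standard library as a stand-in for a keyed 64-bit hash; it is not SipHash and will not match keyedstablehash output. The helper name keyed_u64 and the bucket count are assumptions for illustration.

```python
import hashlib


def keyed_u64(value: str, key: bytes) -> int:
    """Stand-in keyed hash: BLAKE2b with a key, truncated to 8 bytes (uint64)."""
    digest = hashlib.blake2b(value.encode("utf-8"), key=key, digest_size=8).digest()
    return int.from_bytes(digest, "little")


key = b"\x01" * 16
user_ids = ["u1", "u2", "u1"]

# Pseudonymize: replace each identifier with its keyed hash.
hashed = [keyed_u64(u, key) for u in user_ids]

# De-duplicate on the hash while preserving first-seen order.
seen = set()
unique_ids = []
for user_id, h in zip(user_ids, hashed):
    if h not in seen:
        seen.add(h)
        unique_ids.append(user_id)

# Stable sharding: the same key and value always land in the same bucket.
buckets = [h % 4 for h in hashed]

print(unique_ids, buckets[0] == buckets[2])
```

With a 64-bit output, accidental collisions become likely only at very large row counts (on the order of billions of distinct values), which is worth keeping in mind when de-duplicating on the hash alone.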

Canonicalization Rules

To ensure stability, keyedstablehash strictly defines how types are converted to bytes before hashing.

Type          Handling Strategy
None / Bool   Tagged with unique type markers.
Numbers       int (arbitrary precision) and float (IEEE-754) are length-prefixed and tagged.
Strings       Encoded as UTF-8, length-prefixed.
Sequences     list and tuple are order-sensitive.
Sets          set and frozenset are order-independent. Elements are hashed individually, sorted by their encoded bytes, and then hashed.
Mappings      dict is order-independent. Key-value pairs are canonically encoded, and items are sorted by the encoded key before hashing.
NumPy         Scalars are coerced to native Python equivalents.
Others        Falls back to __dict__ if available; otherwise raises TypeError.
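The rules above can be illustrated with a small encoder. This is a sketch under stated assumptions, not the library's actual byte format: the tag bytes, the 8-byte little-endian length prefix, the shared tag for list and tuple, and the function name canonical_bytes are all hypothetical. NumPy coercion is omitted.

```python
import struct


def _frame(tag: bytes, payload: bytes) -> bytes:
    """Prefix a payload with a type tag and its length (assumed layout)."""
    return tag + struct.pack("<Q", len(payload)) + payload


def canonical_bytes(obj) -> bytes:
    """Encode an object to bytes so that equal structures give equal bytes."""
    if obj is None:
        return b"N"
    if isinstance(obj, bool):  # must be checked before int: bool subclasses int
        return b"T" if obj else b"F"
    if isinstance(obj, int):
        size = (obj.bit_length() + 8) // 8  # leaves room for the sign bit
        return _frame(b"i", obj.to_bytes(size, "little", signed=True))
    if isinstance(obj, float):
        return _frame(b"f", struct.pack("<d", obj))  # IEEE-754 double
    if isinstance(obj, str):
        return _frame(b"s", obj.encode("utf-8"))
    if isinstance(obj, (list, tuple)):
        # Order-sensitive: elements are concatenated in their given order.
        return _frame(b"l", b"".join(canonical_bytes(x) for x in obj))
    if isinstance(obj, (set, frozenset)):
        # Order-independent: sort the element encodings before joining.
        return _frame(b"S", b"".join(sorted(canonical_bytes(x) for x in obj)))
    if isinstance(obj, dict):
        # Order-independent: sort items by the encoded key.
        items = sorted((canonical_bytes(k), canonical_bytes(v)) for k, v in obj.items())
        return _frame(b"d", b"".join(k + v for k, v in items))
    if hasattr(obj, "__dict__"):
        return _frame(b"o", canonical_bytes(vars(obj)))
    raise TypeError(f"cannot canonicalize {type(obj).__name__}")


print(canonical_bytes({"a": 1, "b": 2}) == canonical_bytes({"b": 2, "a": 1}))
```

The resulting bytes would then be passed to the keyed hash. Tagging and length-prefixing keep values such as True and 1, or "ab" followed by "c" versus "a" followed by "bc", from producing identical byte strings.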

API Reference

Core Functions

  • stable_keyed_hash(obj, key: bytes, algo="siphash24") -> KeyedStableHash
      • One-shot hashing of an object.
      • Returns an object with .digest(), .hexdigest(), and .intdigest().

  • siphash24(key: bytes) -> SipHash24
      • Stateful hasher.
      • Methods: .update(data), .digest(), .hexdigest(), .intdigest(), .copy().

Vectorized Helpers

  • hash_pandas_series(series, key, ...) -> pandas.Series[uint64]
  • hash_arrow_array(array, key, ...) -> pyarrow.Array[uint64]
  • hash_polars_series(series, key, ...) -> polars.Series

Roadmap

Note: Current implementation is pure Python. While optimized, it involves Python loop overhead for complex structures.

  1. C/Rust Backend: Replace the inner loop with a compiled extension (Rust or C) for significant speedups.
  2. Contract Tests: Add cross-version compatibility contracts to ensure hash stability across future library updates.
  3. Vectorized Kernels: Move columnar hashing entirely to C/Rust to avoid per-row Python overhead.
