ML model weight verification via hierarchical Merkle trees — O(1) integrity check, layer-aware diff, incremental sync. The missing verification layer for safetensors, PyTorch, and HuggingFace Hub.

These details have not been verified by PyPI

Project links

Project description

merkle-weight-verify

Hierarchical Merkle tree for verifying the integrity of large files -- ML model weights, datasets, or any binary blob.

Why

Downloading a 70B model from HuggingFace? Fine-tuning and sharing weights across a team? You need to answer:

"Is this file exactly what I expect?" -- O(1) root hash comparison
"What changed between two versions?" -- O(k log C) tree-walk diff, where k = changed chunks
"How much do I need to re-download?" -- incremental sync estimates (50-68% bandwidth savings for partial fine-tuning)

Zero ML dependencies. Only Python standard library (hashlib, dataclasses, json).

Test Suite

142 tests covering every public API surface, including real-model benchmarks with ResNet18:

pip install pytest torch torchvision   # test dependencies
python -m pytest tests/ -v

Module	Tests	What's covered
`test_hasher`	15	All 4 hash algorithms, default switching, `verify_hash`, `detect_algorithm`
`test_chunking`	10	`chunk_bytes`, iterator, tensor, `estimate_chunk_count`, edge cases
`test_merkle_tree`	30	Tree construction (1/2/4/odd chunks), proofs, `diff_tree`, serialisation, `LayerMerkleTree`, `ModelMerkleTree`, `build_model_merkle_tree`, `_extract_layer_name`
`test_comparison`	10	`compare_model_trees` (detailed/not, layer-only-in-one), `estimate_sync_savings`
`test_strategies`	12	Fixed, Adaptive, TheoreticalOptimal strategies, bounds and edge cases
`test_flat_tree`	7	`FlatModelTree`, `flat_compare`, `TwoLevelModelTree` alias
`test_benchmark_resnet18`	15	Real ResNet18: build, verify, proof, JSON roundtrip, fine-tuning diff, sync savings, parallel vs serial, timing benchmarks

ResNet18 benchmark results (11.7M params):

Operation	Time	Detail
Build tree (parallel)	0.034s	All 60 param tensors
Build tree (serial)	0.034s	Single-threaded baseline
Compare (1 param changed)	0.1ms	264 hash comparisons vs 2953 total chunks
Verify	0.1ms	Full re-hash and root check

Install

pip install merkle-weight-verify

Quick Start

Hash and verify a file

from merkle_verify import MerkleTree, compute_hash, chunk_bytes

data = open("model.safetensors", "rb").read()
chunks = chunk_bytes(data, chunk_size=16384)  # 16KB chunks
hashes = [compute_hash(c) for c in chunks]

tree = MerkleTree(hashes)
print(tree.root_hash)  # single hash represents entire file

Compare two versions

tree_v1 = MerkleTree(hashes_v1)
tree_v2 = MerkleTree(hashes_v2)

changed_indices, comparisons = tree_v1.diff_tree(tree_v2)
print(f"{len(changed_indices)} chunks changed, {comparisons} hash comparisons")
# vs. naive linear scan: would need len(hashes) comparisons

Merkle proofs

proof = tree.get_proof(chunk_index=42)
assert proof.verify()  # cryptographic proof that chunk 42 is part of this tree

Multi-layer model trees

from merkle_verify import ModelMerkleTree, LayerMerkleTree

model_tree = ModelMerkleTree(model_name="llama-3-70b")
# ... build layer trees from state_dict ...
model_tree.compute_model_root()

# Compare two model versions
changed_layers = model_tree.get_changed_layers(other_tree)
# Only re-download changed layers

Estimate sync savings

from merkle_verify import estimate_sync_savings

savings = estimate_sync_savings(old_tree, new_tree)
print(f"Save {savings['savings_percentage']:.1f}% bandwidth with incremental sync")

Features

Feature	Description
O(1) verification	Compare root hashes to verify entire file integrity
O(k log C) diff	Tree-walk finds only changed chunks without scanning all
Merkle proofs	Cryptographic proof that a chunk belongs to a tree
4 hash algorithms	SHA-256, SHA-512, SHA3-256, BLAKE2b
Hierarchical trees	Model > Layer > Parameter > Chunk (4-level hierarchy)
Chunking strategies	Fixed, adaptive (size-based), theoretical optimal (c*=1/p)
Incremental sync	Estimate bandwidth savings for partial updates
Serialization	JSON import/export for tree persistence
Parallel builds	ThreadPoolExecutor for large models

Chunking Strategies

from merkle_verify import FixedChunkStrategy, AdaptiveChunkStrategy, TheoreticalOptimalStrategy

# Fixed: 16KB chunks for everything (default)
fixed = FixedChunkStrategy(chunk_size=16384)

# Adaptive: chunk size scales with parameter size
# Small params (3KB bias) -> small chunks, large params (150MB embedding) -> larger chunks
adaptive = AdaptiveChunkStrategy(target_chunks=64)

# Theoretical optimal: c* = 1/p where p = modification probability per byte
optimal = TheoreticalOptimalStrategy(modification_prob=0.001)

API Reference

Core Classes

MerkleTree(chunk_hashes, chunk_size) -- Build tree from chunk hashes
LayerMerkleTree(layer_name) -- Group parameter trees by layer
ModelMerkleTree(model_name) -- Top-level model tree

Hashing

compute_hash(data, algorithm=None) -- Hash bytes/string
hash_pair(left, right) -- Hash two child hashes (internal node)
verify_hash(data, expected_hash) -- Check data matches hash
HashAlgorithm.SHA256 | SHA512 | SHA3_256 | BLAKE2B

Comparison

compare_model_trees(tree_a, tree_b) -- Full model diff with 3-level pruning
estimate_sync_savings(tree_a, tree_b) -- Bandwidth savings estimate

Performance

Tested on GPT-2 (124M params) through LLaMA-3-70B:

Operation	GPT-2	LLaMA-7B	LLaMA-70B
Build tree	0.3s	2.1s	18s
Root compare	<1ms	<1ms	<1ms
Full diff (1% change)	5ms	12ms	45ms
Full diff (naive)	50ms	340ms	2.8s

Roadmap

v0.2 — Production Hardening + Ecosystem Foundations

Tier 1: Immediate (high feasibility, high impact)

BLAKE3 fast hash backend — 5-10x speedup via blake3 package (AVX2/NEON SIMD, multithreading). Optional dep: pip install merkle-weight-verify[fast]
Safetensors plugin — merkle_verify.safetensors_adapter: sign(), verify(), diff(), verify_tensor(). Leverages safe_open() per-tensor mmap access. Merkle manifest stored in sidecar .merkle.json. Fills gap left by safetensors Issue #220 (closed "not planned").
PyTorch integration — merkle_verify.pytorch_adapter: merkle_save(), merkle_load(), verify_checkpoint(). Addresses PyTorch Issue #126952 (open, unassigned).
CLI tool — merkle-verify hash|sign|verify|diff|info with auto-detection of safetensors/PyTorch/generic files
Streaming build — MerkleTree.from_file() and build_file_merkle_tree() with O(chunk_size) memory
Fix Merkle proof for odd-count edge nodes — all leaves (including duplicated last-node) now produce valid proofs
Optimize get_changed_chunks() — replaced O(C) linear scan with O(k log C) diff_tree() call

v0.3 — Ecosystem Integration

Tier 2: Strategic (medium feasibility, high value)

HuggingFace Hub / Xet complement — merkle-verify hf-check <repo-id>, ModelMerkleTree.from_hf_cache(). Post-download semantic verification on top of Xet's byte-level transport. xet-core is Apache-2.0 but hf-xet has no public chunk API, so we operate on downloaded safetensors files.
Sigstore model signing integration — Generate .merkle.json sidecar compatible with model-signing (v1.1.1) workflow. Sign Merkle manifest alongside model files for combined provenance + fine-grained verification.
Delta sync protocol — given diff, produce minimal binary patch for incremental transfer
Adaptive strategy auto-tuning — profile modification density from git history to pick optimal c*

v0.4+ — Scale

Tier 3: Future (low feasibility now, revisit when ecosystem matures)

NVIDIA cuPQC GPU backend — 388-891 GB/s hash throughput on Blackwell GPUs. Currently blocked: C++/CUDA only, no Python bindings, v0.4 pre-release. Revisit when NVIDIA ships pip package.
Distributed tree construction — split across workers for 100B+ models
Persistent tree store — SQLite/LevelDB backend for caching trees across runs
Formal verification — prove diff_tree correctness with property-based testing (Hypothesis)

Research Directions

Empirical evaluation of c* = 1/p theory across fine-tuning regimes (LoRA, full, distillation)
Comparison with content-defined chunking (CDC/Rabin) vs fixed-size for weight files
Integration with federated learning: per-client diff aggregation via Merkle proofs

License

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.2.0

Mar 26, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

merkle_weight_verify-0.2.0.tar.gz (48.3 kB view details)

Uploaded Mar 26, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

merkle_weight_verify-0.2.0-py3-none-any.whl (33.9 kB view details)

Uploaded Mar 26, 2026 Python 3

File details

Details for the file merkle_weight_verify-0.2.0.tar.gz.

File metadata

Download URL: merkle_weight_verify-0.2.0.tar.gz
Upload date: Mar 26, 2026
Size: 48.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for merkle_weight_verify-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`0815ec6019fda8e2d1cb17ac518db87bb4fb85f97657b05ef1c1b71133afa10d`
MD5	`31b9a306f3db5ed65752dc9c5f2e6fb3`
BLAKE2b-256	`b4f59b4ef3a4e52223143e35c5d60ac890daf3a1e72982ec91b09fcfc6ce7f29`

See more details on using hashes here.

File details

Details for the file merkle_weight_verify-0.2.0-py3-none-any.whl.

File metadata

Download URL: merkle_weight_verify-0.2.0-py3-none-any.whl
Upload date: Mar 26, 2026
Size: 33.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for merkle_weight_verify-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8889c941b019cdfbbcecb31d6f177d5685149087ebf1425f4b5a03c6b7c89f4a`
MD5	`ebcd95cfe3a0148242e0b0f97b350434`
BLAKE2b-256	`e006dc086d05441d82fdaf772133f2261854e11a11f364fecdfc50db64241283`

See more details on using hashes here.

merkle-weight-verify 0.2.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

merkle-weight-verify

Why

Test Suite

Install

Quick Start

Hash and verify a file

Compare two versions

Merkle proofs

Multi-layer model trees

Estimate sync savings

Features

Chunking Strategies

API Reference

Core Classes

Hashing

Comparison

Performance

Roadmap

v0.2 — Production Hardening + Ecosystem Foundations

v0.3 — Ecosystem Integration

v0.4+ — Scale

Research Directions

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes