ML model weight verification via hierarchical Merkle trees — O(1) integrity check, layer-aware diff, incremental sync. The missing verification layer for safetensors, PyTorch, and HuggingFace Hub.
merkle-weight-verify
Hierarchical Merkle tree for verifying the integrity of large files -- ML model weights, datasets, or any binary blob.
Why
Downloading a 70B model from HuggingFace? Fine-tuning and sharing weights across a team? You need to answer:
- "Is this file exactly what I expect?" -- O(1) root hash comparison
- "What changed between two versions?" -- O(k log C) tree-walk diff, where k = changed chunks
- "How much do I need to re-download?" -- incremental sync estimates (50-68% bandwidth savings for partial fine-tuning)
Zero ML dependencies. Only Python standard library (hashlib, dataclasses, json).
Test Suite
142 tests covering every public API surface, including real-model benchmarks with ResNet18:
pip install pytest torch torchvision # test dependencies
python -m pytest tests/ -v
| Module | Tests | What's covered |
|---|---|---|
| test_hasher | 15 | All 4 hash algorithms, default switching, verify_hash, detect_algorithm |
| test_chunking | 10 | chunk_bytes, iterator, tensor, estimate_chunk_count, edge cases |
| test_merkle_tree | 30 | Tree construction (1/2/4/odd chunks), proofs, diff_tree, serialisation, LayerMerkleTree, ModelMerkleTree, build_model_merkle_tree, _extract_layer_name |
| test_comparison | 10 | compare_model_trees (detailed/not, layer-only-in-one), estimate_sync_savings |
| test_strategies | 12 | Fixed, Adaptive, TheoreticalOptimal strategies, bounds and edge cases |
| test_flat_tree | 7 | FlatModelTree, flat_compare, TwoLevelModelTree alias |
| test_benchmark_resnet18 | 15 | Real ResNet18: build, verify, proof, JSON roundtrip, fine-tuning diff, sync savings, parallel vs serial, timing benchmarks |
ResNet18 benchmark results (11.7M params):
| Operation | Time | Detail |
|---|---|---|
| Build tree (parallel) | 0.034s | All 60 param tensors |
| Build tree (serial) | 0.034s | Single-threaded baseline |
| Compare (1 param changed) | 0.1ms | 264 hash comparisons vs 2953 total chunks |
| Verify | 0.1ms | Full re-hash and root check |
Install
pip install merkle-weight-verify
Quick Start
Hash and verify a file
from pathlib import Path
from merkle_verify import MerkleTree, compute_hash, chunk_bytes
# Read the whole file into memory; read_bytes() also closes the file handle
data = Path("model.safetensors").read_bytes()
chunks = chunk_bytes(data, chunk_size=16384) # 16 KB chunks
hashes = [compute_hash(c) for c in chunks] # one leaf hash per chunk
tree = MerkleTree(hashes)
print(tree.root_hash) # single hash represents the entire file
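To make the root hash idea concrete, here is a minimal standard-library sketch of how chunk hashes can be folded into a single root. It is not the package's implementation: the names sketch_chunk_bytes and sketch_merkle_root are hypothetical, and the choices of SHA-256, hex concatenation for internal nodes, and duplicating the last node on odd levels are assumptions made for illustration.

```python
import hashlib


def sketch_chunk_bytes(data: bytes, chunk_size: int = 16384) -> list[bytes]:
    """Split data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def sketch_merkle_root(leaf_hashes: list[str]) -> str:
    """Fold hex leaf hashes pairwise until a single root remains."""
    if not leaf_hashes:
        raise ValueError("need at least one leaf hash")
    level = list(leaf_hashes)
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Assumption: pad odd levels by duplicating the last node.
            level.append(level[-1])
        level = [
            hashlib.sha256((level[i] + level[i + 1]).encode("ascii")).hexdigest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


data = bytes(range(256)) * 200  # 51,200 bytes, split into 4 chunks
leaves = [hashlib.sha256(c).hexdigest() for c in sketch_chunk_bytes(data)]
root = sketch_merkle_root(leaves)
print(root)
```

Once two parties hold a root each, checking equality is a single string comparison; the cost of hashing the file is paid when the tree is built.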
Compare two versions
tree_v1 = MerkleTree(hashes_v1)
tree_v2 = MerkleTree(hashes_v2)
changed_indices, comparisons = tree_v1.diff_tree(tree_v2)
print(f"{len(changed_indices)} chunks changed, {comparisons} hash comparisons")
# vs. naive linear scan: would need len(hashes) comparisons
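The tree-walk diff can be sketched with the standard library alone. The version below is an illustration under stated assumptions, not the package's diff_tree: build_levels and diff_levels are hypothetical names, both trees are assumed to have the same number of leaves, and odd levels are padded by duplicating the last node. The walk descends only into subtrees whose hashes differ, which is where the O(k log C) comparison count comes from.

```python
import hashlib


def _h(text: str) -> str:
    return hashlib.sha256(text.encode("ascii")).hexdigest()


def build_levels(leaves: list[str]) -> list[list[str]]:
    """Return all tree levels: levels[0] holds the leaves, levels[-1] the root."""
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        if len(cur) % 2 == 1:
            cur = cur + [cur[-1]]  # assumption: duplicate the last node
        levels.append([_h(cur[i] + cur[i + 1]) for i in range(0, len(cur), 2)])
    return levels


def diff_levels(a: list[list[str]], b: list[list[str]]) -> tuple[list[int], int]:
    """Walk both trees from the root, skipping subtrees whose hashes match."""
    if len(a[0]) != len(b[0]):
        raise ValueError("this sketch only handles trees with equal leaf counts")
    changed, comparisons = [], 0
    stack = [(len(a) - 1, 0)]  # (level, index), starting at the root
    while stack:
        depth, idx = stack.pop()
        comparisons += 1
        if a[depth][idx] == b[depth][idx]:
            continue  # identical subtree: prune
        if depth == 0:
            changed.append(idx)
            continue
        for child in (2 * idx, 2 * idx + 1):
            if child < len(a[depth - 1]):
                stack.append((depth - 1, child))
    return sorted(changed), comparisons


leaves_v1 = [_h(f"chunk-{i}") for i in range(1024)]
leaves_v2 = list(leaves_v1)
leaves_v2[700] = _h("chunk-700-modified")
changed, comparisons = diff_levels(build_levels(leaves_v1), build_levels(leaves_v2))
print(changed, comparisons)
```

With 1024 leaves and one modified chunk, the walk visits the root plus two nodes on each of the ten levels below it, far fewer than the 1024 comparisons of a linear scan.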
Merkle proofs
proof = tree.get_proof(chunk_index=42)
assert proof.verify() # cryptographic proof that chunk 42 is part of this tree
Multi-layer model trees
from merkle_verify import ModelMerkleTree, LayerMerkleTree
model_tree = ModelMerkleTree(model_name="llama-3-70b")
# ... build layer trees from state_dict ...
model_tree.compute_model_root()
# Compare two model versions
changed_layers = model_tree.get_changed_layers(other_tree)
# Only re-download changed layers
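The hierarchy (chunks roll up into parameters, parameters into layers, layers into the model) can be illustrated with a small standard-library sketch. Everything here is an assumption for illustration: the function names are hypothetical, the "layer" of a parameter is taken to be its state_dict key up to the last dot, and parameter tensors are stood in for by raw bytes. The package's own grouping and hashing rules may differ.

```python
import hashlib
from collections import defaultdict


def _sha(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def _root(hashes: list[str]) -> str:
    """Fold hex hashes pairwise into one root (last node duplicated on odd levels)."""
    level = list(hashes)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])
        level = [_sha((level[i] + level[i + 1]).encode("ascii")) for i in range(0, len(level), 2)]
    return level[0]


def sketch_model_roots(state: dict[str, bytes], chunk_size: int = 16384) -> tuple[str, dict[str, str]]:
    """Return (model_root, {layer_name: layer_root}) for a name -> bytes mapping."""
    layers: dict[str, dict[str, str]] = defaultdict(dict)
    for name, blob in state.items():
        layer = name.rsplit(".", 1)[0]  # hypothetical rule: "fc1.weight" -> "fc1"
        chunks = [blob[i:i + chunk_size] for i in range(0, len(blob), chunk_size)] or [b""]
        layers[layer][name] = _root([_sha(c) for c in chunks])  # parameter root
    layer_roots = {
        layer: _root([params[n] for n in sorted(params)]) for layer, params in layers.items()
    }
    model_root = _root([layer_roots[layer] for layer in sorted(layer_roots)])
    return model_root, layer_roots


def sketch_changed_layers(a: dict[str, str], b: dict[str, str]) -> list[str]:
    """Layers whose roots differ, including layers present on only one side."""
    return sorted(layer for layer in a.keys() | b.keys() if a.get(layer) != b.get(layer))


v1 = {"fc1.weight": bytes(40000), "fc1.bias": bytes(64), "fc2.weight": bytes(20000)}
v2 = {**v1, "fc2.weight": bytes(19999) + b"\x01"}  # simulate fine-tuning one layer
root_v1, layers_v1 = sketch_model_roots(v1)
root_v2, layers_v2 = sketch_model_roots(v2)
print(sketch_changed_layers(layers_v1, layers_v2))
```

Comparing layer roots first means unchanged layers are skipped without looking at any of their chunks.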
Estimate sync savings
from merkle_verify import estimate_sync_savings
savings = estimate_sync_savings(old_tree, new_tree)
print(f"Save {savings['savings_percentage']:.1f}% bandwidth with incremental sync")
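The arithmetic behind a savings estimate is simple enough to sketch. The function below is a hypothetical stand-in, not estimate_sync_savings itself: it assumes fixed-size chunks, takes two plain lists of chunk hashes, and counts a chunk as needing transfer when it is new or its hash differs. Only the savings_percentage key is taken from the example above; the other keys are invented for illustration.

```python
def sketch_sync_savings(old_hashes: list[str], new_hashes: list[str], chunk_size: int = 16384) -> dict:
    """Estimate how much of the new version must be transferred."""
    total = len(new_hashes)
    changed = sum(
        1
        for i, h in enumerate(new_hashes)
        if i >= len(old_hashes) or old_hashes[i] != h
    )
    return {
        "total_chunks": total,
        "changed_chunks": changed,
        "bytes_to_transfer": changed * chunk_size,  # upper bound: last chunk may be shorter
        "savings_percentage": 100.0 * (total - changed) / total if total else 0.0,
    }


old = [f"hash-{i}" for i in range(100)]
new = [f"new-{i}" if i < 32 else h for i, h in enumerate(old)]  # 32 of 100 chunks modified
savings = sketch_sync_savings(old, new)
print(f"Save {savings['savings_percentage']:.1f}% bandwidth with incremental sync")
```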
Features
| Feature | Description |
|---|---|
| O(1) verification | Compare root hashes to verify entire file integrity |
| O(k log C) diff | Tree-walk finds only changed chunks without scanning all |
| Merkle proofs | Cryptographic proof that a chunk belongs to a tree |
| 4 hash algorithms | SHA-256, SHA-512, SHA3-256, BLAKE2b |
| Hierarchical trees | Model > Layer > Parameter > Chunk (4-level hierarchy) |
| Chunking strategies | Fixed, adaptive (size-based), theoretical optimal (c*=1/p) |
| Incremental sync | Estimate bandwidth savings for partial updates |
| Serialization | JSON import/export for tree persistence |
| Parallel builds | ThreadPoolExecutor for large models |
Chunking Strategies
from merkle_verify import FixedChunkStrategy, AdaptiveChunkStrategy, TheoreticalOptimalStrategy
# Fixed: 16KB chunks for everything (default)
fixed = FixedChunkStrategy(chunk_size=16384)
# Adaptive: chunk size scales with parameter size
# Small params (3KB bias) -> small chunks, large params (150MB embedding) -> larger chunks
adaptive = AdaptiveChunkStrategy(target_chunks=64)
# Theoretical optimal: c* = 1/p where p = modification probability per byte
optimal = TheoreticalOptimalStrategy(modification_prob=0.001)
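The sizing rules behind the adaptive and theoretical-optimal strategies can be written out as plain functions. This is a sketch under assumptions: the function names and the min_chunk/max_chunk bounds are hypothetical, the adaptive rule is taken to be parameter size divided by target_chunks, and the optimal rule is the c* = 1/p formula stated above. The package's strategy classes may clamp or round differently.

```python
import math


def sketch_adaptive_chunk_size(
    param_bytes: int,
    target_chunks: int = 64,
    min_chunk: int = 1024,              # hypothetical lower bound
    max_chunk: int = 4 * 1024 * 1024,   # hypothetical upper bound
) -> int:
    """Aim for roughly target_chunks chunks per parameter, within bounds."""
    raw = math.ceil(param_bytes / target_chunks)
    return max(min_chunk, min(max_chunk, raw))


def sketch_optimal_chunk_size(
    modification_prob: float,
    min_chunk: int = 64,                # hypothetical lower bound
    max_chunk: int = 4 * 1024 * 1024,   # hypothetical upper bound
) -> int:
    """c* = 1/p, where p is the per-byte modification probability."""
    if not 0.0 < modification_prob <= 1.0:
        raise ValueError("modification_prob must be in (0, 1]")
    return max(min_chunk, min(max_chunk, round(1.0 / modification_prob)))


print(sketch_adaptive_chunk_size(3 * 1024))           # small bias tensor
print(sketch_adaptive_chunk_size(150 * 1024 * 1024))  # large embedding
print(sketch_optimal_chunk_size(0.001))
```

Under these assumed bounds, a 3 KB bias is clamped up to the minimum chunk size, while a 150 MB embedding gets chunks of about 2.3 MB, which keeps its tree at 64 leaves.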
API Reference
Core Classes
- MerkleTree(chunk_hashes, chunk_size) -- Build tree from chunk hashes
- LayerMerkleTree(layer_name) -- Group parameter trees by layer
- ModelMerkleTree(model_name) -- Top-level model tree
Hashing
- compute_hash(data, algorithm=None) -- Hash bytes/string
- hash_pair(left, right) -- Hash two child hashes (internal node)
- verify_hash(data, expected_hash) -- Check data matches hash
- HashAlgorithm.SHA256 | SHA512 | SHA3_256 | BLAKE2B
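All four listed algorithms are available in Python's hashlib, so the hashing helpers can be approximated in a few lines. The sketch below is not the package's code: the sketch_ function names are hypothetical, algorithms are selected by hashlib name rather than through the HashAlgorithm enum, and hashing a pair by concatenating the two hex digests is an assumption.

```python
import hashlib
import hmac

SUPPORTED = ("sha256", "sha512", "sha3_256", "blake2b")


def sketch_compute_hash(data, algorithm: str = "sha256") -> str:
    """Hash bytes or a string with one of the supported hashlib algorithms."""
    if algorithm not in SUPPORTED:
        raise ValueError(f"unsupported algorithm: {algorithm}")
    if isinstance(data, str):
        data = data.encode("utf-8")
    return hashlib.new(algorithm, data).hexdigest()


def sketch_hash_pair(left: str, right: str, algorithm: str = "sha256") -> str:
    """Internal node hash: assumed here to be the hash of the two child digests joined."""
    return sketch_compute_hash(left + right, algorithm)


def sketch_verify_hash(data, expected_hash: str, algorithm: str = "sha256") -> bool:
    """Constant-time comparison of the recomputed digest with the expected one."""
    return hmac.compare_digest(sketch_compute_hash(data, algorithm), expected_hash.lower())


for name in SUPPORTED:
    print(name, len(sketch_compute_hash(b"weights", name)))
```

SHA-256 and SHA3-256 both produce 64 hex characters, and SHA-512 and BLAKE2b (at its default digest size) both produce 128, so digest length alone cannot distinguish every algorithm.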
Comparison
- compare_model_trees(tree_a, tree_b) -- Full model diff with 3-level pruning
- estimate_sync_savings(tree_a, tree_b) -- Bandwidth savings estimate
Performance
Tested on GPT-2 (124M params) through LLaMA-3-70B:
| Operation | GPT-2 | LLaMA-7B | LLaMA-70B |
|---|---|---|---|
| Build tree | 0.3s | 2.1s | 18s |
| Root compare | <1ms | <1ms | <1ms |
| Full diff (1% change) | 5ms | 12ms | 45ms |
| Full diff (naive) | 50ms | 340ms | 2.8s |
Roadmap
v0.2 — Production Hardening + Ecosystem Foundations
Tier 1: Immediate (high feasibility, high impact)
- BLAKE3 fast hash backend: 5-10x speedup via the blake3 package (AVX2/NEON SIMD, multithreading). Optional dependency: pip install merkle-weight-verify[fast]
- Safetensors plugin: merkle_verify.safetensors_adapter with sign(), verify(), diff(), verify_tensor(). Leverages safe_open() per-tensor mmap access. Merkle manifest stored in a sidecar .merkle.json. Fills the gap left by safetensors Issue #220 (closed "not planned").
- PyTorch integration: merkle_verify.pytorch_adapter with merkle_save(), merkle_load(), verify_checkpoint(). Addresses PyTorch Issue #126952 (open, unassigned).
- CLI tool: merkle-verify hash|sign|verify|diff|info with auto-detection of safetensors/PyTorch/generic files
- Streaming build: MerkleTree.from_file() and build_file_merkle_tree() with O(chunk_size) memory
- Fix Merkle proof for odd-count edge nodes: all leaves (including the duplicated last node) now produce valid proofs
- Optimize get_changed_chunks(): replaced the O(C) linear scan with an O(k log C) diff_tree() call
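The streaming build item can be illustrated with a standard-library sketch that never holds more than one chunk of file data in memory. It is not MerkleTree.from_file(): the function names are hypothetical, SHA-256 and duplicate-last-node padding are assumptions, and this version still keeps the list of leaf hashes (one short hex string per chunk), so only the file data is bounded by chunk_size.

```python
import hashlib
import io
from typing import BinaryIO, Iterator


def sketch_stream_leaf_hashes(fileobj: BinaryIO, chunk_size: int = 16384) -> Iterator[str]:
    """Yield one leaf hash per chunk, reading chunk_size bytes at a time."""
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        yield hashlib.sha256(chunk).hexdigest()


def sketch_fold(leaf_hashes: list[str]) -> str:
    """Fold hex hashes pairwise into a root (last node duplicated on odd levels)."""
    if not leaf_hashes:
        raise ValueError("empty input")
    level = list(leaf_hashes)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])
        level = [
            hashlib.sha256((level[i] + level[i + 1]).encode("ascii")).hexdigest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


def sketch_root_from_stream(fileobj: BinaryIO, chunk_size: int = 16384) -> str:
    """Compute a Merkle root without loading the whole file into memory."""
    return sketch_fold(list(sketch_stream_leaf_hashes(fileobj, chunk_size)))


payload = b"x" * 50000
# io.BytesIO stands in for a file opened with open(path, "rb")
stream_root = sketch_root_from_stream(io.BytesIO(payload))
print(stream_root)
```

For a real file, the same function would be called inside a with open(path, "rb") block so the handle is closed once hashing finishes.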
v0.3 — Ecosystem Integration
Tier 2: Strategic (medium feasibility, high value)
- HuggingFace Hub / Xet complement: merkle-verify hf-check <repo-id>, ModelMerkleTree.from_hf_cache(). Post-download semantic verification on top of Xet's byte-level transport. xet-core is Apache-2.0 but hf-xet has no public chunk API, so we operate on downloaded safetensors files.
- Sigstore model signing integration: generate a .merkle.json sidecar compatible with the model-signing (v1.1.1) workflow. Sign the Merkle manifest alongside model files for combined provenance + fine-grained verification.
- Delta sync protocol: given a diff, produce a minimal binary patch for incremental transfer
- Adaptive strategy auto-tuning: profile modification density from git history to pick the optimal c*
v0.4+ — Scale
Tier 3: Future (low feasibility now, revisit when ecosystem matures)
- NVIDIA cuPQC GPU backend — 388-891 GB/s hash throughput on Blackwell GPUs. Currently blocked: C++/CUDA only, no Python bindings, v0.4 pre-release. Revisit when NVIDIA ships pip package.
- Distributed tree construction — split across workers for 100B+ models
- Persistent tree store — SQLite/LevelDB backend for caching trees across runs
- Formal verification: establish diff_tree correctness, starting with property-based testing (Hypothesis)
Research Directions
- Empirical evaluation of the c* = 1/p theory across fine-tuning regimes (LoRA, full, distillation)
- Comparison of content-defined chunking (CDC/Rabin) against fixed-size chunking for weight files
- Integration with federated learning: per-client diff aggregation via Merkle proofs
License
Apache-2.0 — Copyright 2026 Geoffrey Wang