
ML model weight verification via hierarchical Merkle trees — O(1) integrity check, layer-aware diff, incremental sync. The missing verification layer for safetensors, PyTorch, and HuggingFace Hub.


merkle-weight-verify

Hierarchical Merkle tree for verifying the integrity of large files -- ML model weights, datasets, or any binary blob.

Why

Downloading a 70B model from HuggingFace? Fine-tuning and sharing weights across a team? You need to answer:

  • "Is this file exactly what I expect?" -- O(1) root hash comparison
  • "What changed between two versions?" -- O(k log C) tree-walk diff, where k = number of changed chunks and C = total number of chunks
  • "How much do I need to re-download?" -- incremental sync estimates (50-68% bandwidth savings for partial fine-tuning)
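
The root-hash comparison in the first bullet can be illustrated with the standard library alone. This is a minimal sketch, not the package's implementation: the helper names (`sha256_hex`, `chunk_hashes`, `merkle_root`) are hypothetical, and pairing an odd node with a copy of itself is an assumption about how the tree is folded.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def chunk_hashes(data: bytes, chunk_size: int = 16384) -> list[str]:
    """Hash each fixed-size chunk of the input."""
    return [sha256_hex(data[i:i + chunk_size]) for i in range(0, len(data), chunk_size)]


def merkle_root(leaf_hashes: list[str]) -> str:
    """Fold leaf hashes pairwise until a single root hash remains."""
    if not leaf_hashes:
        raise ValueError("need at least one leaf hash")
    level = list(leaf_hashes)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumption: odd levels duplicate the last node
        level = [
            sha256_hex((level[i] + level[i + 1]).encode())
            for i in range(0, len(level), 2)
        ]
    return level[0]


original = bytes(100_000)
tampered = bytearray(original)
tampered[70_000] ^= 0x01  # flip a single bit

root_a = merkle_root(chunk_hashes(original))
root_b = merkle_root(chunk_hashes(bytes(tampered)))

print(root_a == merkle_root(chunk_hashes(original)))  # identical bytes give the same root
print(root_a == root_b)                               # one flipped bit changes the root
```

Once both roots exist, the check is a single string comparison regardless of file size; the cost of building the tree is paid separately.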

Zero ML dependencies. Only Python standard library (hashlib, dataclasses, json).

Test Suite

142 tests covering every public API surface, including real-model benchmarks with ResNet18:

pip install pytest torch torchvision   # test dependencies
python -m pytest tests/ -v
| Module | Tests | What's covered |
|---|---|---|
| test_hasher | 15 | All 4 hash algorithms, default switching, verify_hash, detect_algorithm |
| test_chunking | 10 | chunk_bytes, iterator, tensor, estimate_chunk_count, edge cases |
| test_merkle_tree | 30 | Tree construction (1/2/4/odd chunks), proofs, diff_tree, serialisation, LayerMerkleTree, ModelMerkleTree, build_model_merkle_tree, _extract_layer_name |
| test_comparison | 10 | compare_model_trees (detailed/not, layer-only-in-one), estimate_sync_savings |
| test_strategies | 12 | Fixed, Adaptive, TheoreticalOptimal strategies, bounds and edge cases |
| test_flat_tree | 7 | FlatModelTree, flat_compare, TwoLevelModelTree alias |
| test_benchmark_resnet18 | 15 | Real ResNet18: build, verify, proof, JSON roundtrip, fine-tuning diff, sync savings, parallel vs serial, timing benchmarks |

ResNet18 benchmark results (11.7M params):

| Operation | Time | Detail |
|---|---|---|
| Build tree (parallel) | 0.034s | All 60 param tensors |
| Build tree (serial) | 0.034s | Single-threaded baseline |
| Compare (1 param changed) | 0.1ms | 264 hash comparisons vs 2953 total chunks |
| Verify | 0.1ms | Full re-hash and root check |

Install

pip install merkle-weight-verify

Quick Start

Hash and verify a file

from merkle_verify import MerkleTree, compute_hash, chunk_bytes

# Read the whole file into memory; the context manager closes the handle
with open("model.safetensors", "rb") as f:
    data = f.read()
chunks = chunk_bytes(data, chunk_size=16384)  # 16KB chunks
hashes = [compute_hash(c) for c in chunks]

tree = MerkleTree(hashes)
print(tree.root_hash)  # single hash represents entire file

Compare two versions

# hashes_v1 and hashes_v2 are chunk-hash lists built as in the previous example
tree_v1 = MerkleTree(hashes_v1)
tree_v2 = MerkleTree(hashes_v2)

changed_indices, comparisons = tree_v1.diff_tree(tree_v2)
print(f"{len(changed_indices)} chunks changed, {comparisons} hash comparisons")
# vs. naive linear scan: would need len(hashes) comparisons
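
To make the comparison count concrete, here is a self-contained sketch of a tree-walk diff. It does not use the package; `build_levels` and `diff_levels` are hypothetical names, the sketch assumes both trees have the same number of leaves, and it assumes odd levels are padded by duplicating the last node.

```python
import hashlib


def _h(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()


def build_levels(leaves: list[str]) -> list[list[str]]:
    """Return every tree level, leaves first and root last."""
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        if len(cur) % 2 == 1:
            cur = cur + [cur[-1]]  # assumption: pad odd levels with the last node
        levels.append([_h(cur[i] + cur[i + 1]) for i in range(0, len(cur), 2)])
    return levels


def diff_levels(a: list[list[str]], b: list[list[str]]) -> tuple[list[int], int]:
    """Walk both trees from the root, descending only where hashes differ."""
    if len(a[0]) != len(b[0]):
        raise ValueError("this sketch assumes equal leaf counts")
    changed, comparisons = [], 0
    stack = [(len(a) - 1, 0)]  # (level, index), starting at the root
    while stack:
        level, idx = stack.pop()
        comparisons += 1
        if a[level][idx] == b[level][idx]:
            continue  # identical subtree: prune it
        if level == 0:
            changed.append(idx)
            continue
        for child in (2 * idx, 2 * idx + 1):
            if child < len(a[level - 1]):  # skip the padded duplicate
                stack.append((level - 1, child))
    return sorted(changed), comparisons


leaves_v1 = [_h(f"chunk-{i}") for i in range(1024)]
leaves_v2 = list(leaves_v1)
leaves_v2[500] = _h("modified chunk")

changed, comparisons = diff_levels(build_levels(leaves_v1), build_levels(leaves_v2))
print(changed, comparisons)  # one changed leaf, 21 comparisons instead of 1024
```

With 1024 leaves and one change, the walk compares the root plus two nodes on each of the ten levels below it, which is where the O(k log C) bound comes from.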

Merkle proofs

proof = tree.get_proof(chunk_index=42)
assert proof.verify()  # cryptographic proof that chunk 42 is part of this tree
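
For readers unfamiliar with Merkle proofs, the sketch below shows what such a proof contains and how it is checked. It is an illustration under assumptions, not the package's `get_proof` implementation: `make_proof` and `verify_proof` are hypothetical helpers, and a last node without a sibling is assumed to be paired with itself.

```python
import hashlib


def _h(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()


def build_levels(leaves: list[str]) -> list[list[str]]:
    """Return every tree level, leaves first and root last."""
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        if len(cur) % 2 == 1:
            cur = cur + [cur[-1]]  # assumption: pad odd levels with the last node
        levels.append([_h(cur[i] + cur[i + 1]) for i in range(0, len(cur), 2)])
    return levels


def make_proof(levels: list[list[str]], index: int) -> list[tuple[str, bool]]:
    """Collect (sibling_hash, sibling_is_right) pairs from the leaf up to the root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        if sibling >= len(level):
            sibling = index  # last node of an odd level is paired with itself
        proof.append((level[sibling], index % 2 == 0))
        index //= 2
    return proof


def verify_proof(leaf_hash: str, proof: list[tuple[str, bool]], root: str) -> bool:
    """Recompute the path to the root using only the leaf and its siblings."""
    current = leaf_hash
    for sibling, sibling_is_right in proof:
        current = _h(current + sibling) if sibling_is_right else _h(sibling + current)
    return current == root


leaves = [_h(f"chunk-{i}") for i in range(5)]  # odd count exercises the padding path
levels = build_levels(leaves)
root = levels[-1][0]

all_valid = all(verify_proof(leaves[i], make_proof(levels, i), root) for i in range(5))
forged = verify_proof(_h("not in the tree"), make_proof(levels, 2), root)
print(all_valid, forged)
```

The verifier needs only the leaf hash, one sibling hash per level, and the trusted root, so a proof for one chunk stays small even for very large files.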

Multi-layer model trees

from merkle_verify import ModelMerkleTree, LayerMerkleTree

model_tree = ModelMerkleTree(model_name="llama-3-70b")
# ... build layer trees from state_dict ...
model_tree.compute_model_root()

# Compare two model versions (other_tree is a second ModelMerkleTree built the same way)
changed_layers = model_tree.get_changed_layers(other_tree)
# Only re-download changed layers

Estimate sync savings

from merkle_verify import estimate_sync_savings

savings = estimate_sync_savings(old_tree, new_tree)
print(f"Save {savings['savings_percentage']:.1f}% bandwidth with incremental sync")
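
The arithmetic behind such an estimate can be sketched as follows. This is a simplified model, not the package's `estimate_sync_savings`: the function name and the `full_bytes` / `incremental_bytes` keys are hypothetical, every chunk is assumed to be exactly `chunk_size` bytes, and proof or manifest overhead is ignored. The chunk counts in the example are illustrative, not measured.

```python
def estimate_savings(total_chunks: int, changed_chunks: int, chunk_size: int = 16384) -> dict:
    """Compare a full re-download against fetching only the changed chunks."""
    if total_chunks <= 0 or not 0 <= changed_chunks <= total_chunks:
        raise ValueError("expected total_chunks > 0 and 0 <= changed_chunks <= total_chunks")
    full_bytes = total_chunks * chunk_size
    incremental_bytes = changed_chunks * chunk_size
    return {
        "full_bytes": full_bytes,
        "incremental_bytes": incremental_bytes,
        "savings_percentage": 100.0 * (1 - incremental_bytes / full_bytes),
    }


# Illustrative numbers only: 400 of 1000 chunks differ between versions.
result = estimate_savings(total_chunks=1000, changed_chunks=400)
print(f"{result['savings_percentage']:.1f}% bandwidth saved")
```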

Features

| Feature | Description |
|---|---|
| O(1) verification | Compare root hashes to verify entire file integrity |
| O(k log C) diff | Tree-walk finds only changed chunks without scanning all |
| Merkle proofs | Cryptographic proof that a chunk belongs to a tree |
| 4 hash algorithms | SHA-256, SHA-512, SHA3-256, BLAKE2b |
| Hierarchical trees | Model > Layer > Parameter > Chunk (4-level hierarchy) |
| Chunking strategies | Fixed, adaptive (size-based), theoretical optimal (c*=1/p) |
| Incremental sync | Estimate bandwidth savings for partial updates |
| Serialization | JSON import/export for tree persistence |
| Parallel builds | ThreadPoolExecutor for large models |
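
As a rough illustration of the parallel-build feature, the sketch below hashes several byte buffers with a thread pool. It is not the package's code: `hash_buffer` and the buffer names are made up for the example. Threads can help here because CPython's hashlib releases the GIL while hashing buffers larger than about 2 KB.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def hash_buffer(raw: bytes) -> str:
    return hashlib.sha256(raw).hexdigest()


# Stand-ins for serialized parameter tensors (hypothetical names and sizes).
buffers = {
    "conv1.weight": bytes(1_000_000),
    "fc.weight": bytes([1]) * 2_000_000,
    "fc.bias": bytes([2]) * 4_000,
}

serial = {name: hash_buffer(raw) for name, raw in buffers.items()}

with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map preserves input order, so zipping with the keys is safe
    parallel = dict(zip(buffers, pool.map(hash_buffer, buffers.values())))

print(parallel == serial)  # both paths produce identical digests
```

Whether the parallel path is faster depends on tensor sizes and core count; for small models the two can take about the same time.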

Chunking Strategies

from merkle_verify import FixedChunkStrategy, AdaptiveChunkStrategy, TheoreticalOptimalStrategy

# Fixed: 16KB chunks for everything (default)
fixed = FixedChunkStrategy(chunk_size=16384)

# Adaptive: chunk size scales with parameter size
# Small params (3KB bias) -> small chunks, large params (150MB embedding) -> larger chunks
adaptive = AdaptiveChunkStrategy(target_chunks=64)

# Theoretical optimal: c* = 1/p where p = modification probability per byte
optimal = TheoreticalOptimalStrategy(modification_prob=0.001)
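
The chunk sizes these strategies lead to can be worked through numerically. The sketch below is an assumption about how such rules might be computed, not the package's logic: the function names, the `min_chunk` / `max_chunk` clamps, and the ceiling division are all hypothetical. Only the relation c* = 1/p and the `target_chunks=64` value come from the examples above.

```python
import math


def adaptive_chunk_size(
    param_bytes: int,
    target_chunks: int = 64,
    min_chunk: int = 1024,             # hypothetical lower bound
    max_chunk: int = 4 * 1024 * 1024,  # hypothetical upper bound
) -> int:
    """Aim for roughly target_chunks chunks per parameter, within bounds."""
    raw = math.ceil(param_bytes / target_chunks)
    return max(min_chunk, min(max_chunk, raw))


def theoretical_optimal_chunk_size(modification_prob: float) -> int:
    """c* = 1/p, where p is the per-byte modification probability."""
    if not 0 < modification_prob <= 1:
        raise ValueError("modification_prob must be in (0, 1]")
    return max(1, round(1 / modification_prob))


print(adaptive_chunk_size(3 * 1024))           # 3KB bias: clamped up to the minimum
print(adaptive_chunk_size(150 * 1024 * 1024))  # 150MB embedding: about 2.3 MiB per chunk
print(theoretical_optimal_chunk_size(0.001))   # p = 0.001 gives 1000-byte chunks
```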

API Reference

Core Classes

  • MerkleTree(chunk_hashes, chunk_size) -- Build tree from chunk hashes
  • LayerMerkleTree(layer_name) -- Group parameter trees by layer
  • ModelMerkleTree(model_name) -- Top-level model tree

Hashing

  • compute_hash(data, algorithm=None) -- Hash bytes/string
  • hash_pair(left, right) -- Hash two child hashes (internal node)
  • verify_hash(data, expected_hash) -- Check data matches hash
  • HashAlgorithm.SHA256 | SHA512 | SHA3_256 | BLAKE2B
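
All four listed algorithms are available in Python's `hashlib`, so the hashing layer can be sketched without extra dependencies. The helper names below (`digest_hex`, `combine`, `matches`) and the string keys are hypothetical and do not reflect the package's signatures.

```python
import hashlib
import hmac

ALGORITHMS = {
    "sha256": hashlib.sha256,
    "sha512": hashlib.sha512,
    "sha3_256": hashlib.sha3_256,
    "blake2b": hashlib.blake2b,
}


def digest_hex(data: bytes, algorithm: str = "sha256") -> str:
    """Hash bytes with the named algorithm and return a hex digest."""
    return ALGORITHMS[algorithm](data).hexdigest()


def combine(left: str, right: str, algorithm: str = "sha256") -> str:
    """Hash two child digests into a parent digest (an internal tree node)."""
    return digest_hex((left + right).encode(), algorithm)


def matches(data: bytes, expected_hex: str, algorithm: str = "sha256") -> bool:
    """Constant-time comparison of the computed digest against the expected one."""
    return hmac.compare_digest(digest_hex(data, algorithm), expected_hex)


for name in ALGORITHMS:
    print(name, len(digest_hex(b"weights", name)))  # hex digest length per algorithm
```

SHA-256 and SHA3-256 both produce 64 hex characters, so digest length alone cannot distinguish them; a stored manifest would need to record the algorithm explicitly.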

Comparison

  • compare_model_trees(tree_a, tree_b) -- Full model diff with 3-level pruning
  • estimate_sync_savings(tree_a, tree_b) -- Bandwidth savings estimate
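
The idea of pruning at several levels can be sketched with plain dictionaries. This is an illustration only: the nested-dict layout, `summarize`, and `changed_params` are hypothetical, and each root is computed as a single hash over its children rather than as a full Merkle tree.

```python
import copy
import hashlib


def _root(child_hashes) -> str:
    """Hash the concatenation of fixed-length child digests."""
    return hashlib.sha256("".join(child_hashes).encode()).hexdigest()


def summarize(model: dict):
    """model maps layer -> param -> list of chunk hashes."""
    param_roots = {
        layer: {p: _root(chunks) for p, chunks in params.items()}
        for layer, params in model.items()
    }
    layer_roots = {
        layer: _root(roots[p] for p in sorted(roots))
        for layer, roots in param_roots.items()
    }
    model_root = _root(layer_roots[name] for name in sorted(layer_roots))
    return model_root, layer_roots, param_roots


def changed_params(old: dict, new: dict) -> list[tuple[str, str]]:
    old_root, old_layers, old_params = summarize(old)
    new_root, new_layers, new_params = summarize(new)
    if old_root == new_root:
        return []  # level 1: identical models, stop immediately
    changed = []
    for layer in sorted(set(old_layers) | set(new_layers)):
        if old_layers.get(layer) == new_layers.get(layer):
            continue  # level 2: skip layers whose roots match
        a, b = old_params.get(layer, {}), new_params.get(layer, {})
        for p in sorted(set(a) | set(b)):
            if a.get(p) != b.get(p):
                changed.append((layer, p))  # level 3: parameter roots differ
    return changed


def fake(s: str) -> str:
    return hashlib.sha256(s.encode()).hexdigest()


base = {
    "conv1": {"weight": [fake("a1"), fake("a2")], "bias": [fake("b1")]},
    "fc": {"weight": [fake("c1"), fake("c2"), fake("c3")], "bias": [fake("d1")]},
}
tuned = copy.deepcopy(base)
tuned["fc"]["weight"][1] = fake("c2-finetuned")  # only the classifier head changes

print(changed_params(base, tuned))
```

A chunk-level walk inside each changed parameter would then narrow the result further, which is where a tree-walk diff would apply.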

Performance

Tested on GPT-2 (124M params) through LLaMA-3-70B:

| Operation | GPT-2 | LLaMA-7B | LLaMA-70B |
|---|---|---|---|
| Build tree | 0.3s | 2.1s | 18s |
| Root compare | <1ms | <1ms | <1ms |
| Full diff (1% change) | 5ms | 12ms | 45ms |
| Full diff (naive) | 50ms | 340ms | 2.8s |

Roadmap

v0.2 — Production Hardening + Ecosystem Foundations

Tier 1: Immediate (high feasibility, high impact)

  • BLAKE3 fast hash backend — 5-10x speedup via blake3 package (AVX2/NEON SIMD, multithreading). Optional dep: pip install merkle-weight-verify[fast]
  • Safetensors plugin -- merkle_verify.safetensors_adapter: sign(), verify(), diff(), verify_tensor(). Leverages safe_open() per-tensor mmap access. Merkle manifest stored in sidecar .merkle.json. Fills gap left by safetensors Issue #220 (closed "not planned").
  • PyTorch integration -- merkle_verify.pytorch_adapter: merkle_save(), merkle_load(), verify_checkpoint(). Addresses PyTorch Issue #126952 (open, unassigned).
  • CLI tool -- merkle-verify hash|sign|verify|diff|info with auto-detection of safetensors/PyTorch/generic files
  • Streaming build -- MerkleTree.from_file() and build_file_merkle_tree() with O(chunk_size) memory
  • Fix Merkle proof for odd-count edge nodes — all leaves (including duplicated last-node) now produce valid proofs
  • Optimize get_changed_chunks() — replaced O(C) linear scan with O(k log C) diff_tree() call
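
The streaming-build item above relies on reading one chunk at a time. A minimal sketch of that pattern, using only the standard library, is shown here; `iter_chunk_hashes` is a hypothetical helper and not the package's `from_file()` implementation.

```python
import hashlib
import os
import tempfile
from typing import Iterator


def iter_chunk_hashes(path: str, chunk_size: int = 16384) -> Iterator[str]:
    """Yield one SHA-256 hex digest per chunk, holding a single chunk in memory."""
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield hashlib.sha256(chunk).hexdigest()


# Demo on a temporary 40,000-byte file so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(40_000))
    demo_path = tmp.name

try:
    streamed = list(iter_chunk_hashes(demo_path))
finally:
    os.remove(demo_path)

print(len(streamed))  # number of chunk hashes produced
```

The file contents never sit in memory all at once; only the list of digests grows with file size, and that list is small relative to the data it summarizes.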

v0.3 — Ecosystem Integration

Tier 2: Strategic (medium feasibility, high value)

  • HuggingFace Hub / Xet complement -- merkle-verify hf-check <repo-id>, ModelMerkleTree.from_hf_cache(). Post-download semantic verification on top of Xet's byte-level transport. xet-core is Apache-2.0 but hf-xet has no public chunk API, so we operate on downloaded safetensors files.
  • Sigstore model signing integration — Generate .merkle.json sidecar compatible with model-signing (v1.1.1) workflow. Sign Merkle manifest alongside model files for combined provenance + fine-grained verification.
  • Delta sync protocol — given diff, produce minimal binary patch for incremental transfer
  • Adaptive strategy auto-tuning — profile modification density from git history to pick optimal c*

v0.4+ — Scale

Tier 3: Future (low feasibility now, revisit when ecosystem matures)

  • NVIDIA cuPQC GPU backend — 388-891 GB/s hash throughput on Blackwell GPUs. Currently blocked: C++/CUDA only, no Python bindings, v0.4 pre-release. Revisit when NVIDIA ships pip package.
  • Distributed tree construction — split across workers for 100B+ models
  • Persistent tree store — SQLite/LevelDB backend for caching trees across runs
  • Formal verification — prove diff_tree correctness with property-based testing (Hypothesis)

Research Directions

  • Empirical evaluation of c* = 1/p theory across fine-tuning regimes (LoRA, full, distillation)
  • Comparison with content-defined chunking (CDC/Rabin) vs fixed-size for weight files
  • Integration with federated learning: per-client diff aggregation via Merkle proofs

License

Apache-2.0 — Copyright 2026 Geoffrey Wang
