

UHC — Unified Hash-Compression Engine

A Python framework for compressed-domain hashing over LZ77 streams.

UHC computes the polynomial hash of uncompressed data by operating directly on compressed token streams (DEFLATE, LZ4), without ever materializing the decompressed bytes. This means you can verify the integrity of 1 TB of archived data while reading only its compressed form — no decompression needed.

Why?

Traditional integrity checking requires: read compressed → decompress → hash the raw bytes. For a 1 TB file compressed to 100 GB, that's 1.1 TB of data movement.

UHC's compressed-domain hashing (CDH) reads only the 100 GB compressed stream and computes the hash algebraically from the LZ77 tokens. The result is mathematically identical to hashing the decompressed data (Theorem 12), with speedup proportional to the compression ratio (Theorem 17).

Scenario                      Decompress-then-hash   UHC
1 TB file, 10:1 compression   1.1 TB I/O             100 GB I/O
DEFLATE, CR > 23              Baseline               Faster (CPU)
Zstandard, CR > 44            Baseline               Faster (CPU)

Installation

pip install uhc

For format support (LZ4, Zstandard, BLAKE3):

pip install uhc[formats]

Quick Start

Hash raw data

from uhc.engine.pipeline import uhc_hash

h = uhc_hash(b"hello world")

Hash compressed data without decompressing

import zlib
from uhc.engine.pipeline import uhc_hash_compressed, Format

# Compress with raw DEFLATE (no zlib header)
data = b"the quick brown fox " * 1000
c = zlib.compressobj(6, zlib.DEFLATED, -15)
compressed = c.compress(data) + c.flush()

# Hash the decompressed content — without decompressing
h = uhc_hash_compressed(compressed, Format.DEFLATE)

# Verify: same as hashing the raw data directly
assert h == uhc_hash(data)

Cross-format verification

import zlib
import lz4.block
from uhc.engine.pipeline import uhc_verify, Format

data = b"test data " * 100

# Compress the same data with two different formats
c = zlib.compressobj(6, zlib.DEFLATED, -15)
deflate_bytes = c.compress(data) + c.flush()
lz4_bytes = lz4.block.compress(data, store_size=False)

# Verify they contain the same content — without decompressing either
assert uhc_verify(deflate_bytes, lz4_bytes,
                  fmt_a=Format.DEFLATE, fmt_b=Format.LZ4_BLOCK)

Multi-hash for stronger collision resistance

from uhc.engine.pipeline import uhc_hash_multi

# k=2 independent hashes — collision probability drops to (N/p)^2
h_k = uhc_hash_multi(b"important data", bases=[131, 257])
# Returns tuple: (hash_with_base_131, hash_with_base_257)
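Conceptually, a k-tuple multi-hash is just the same polynomial hash evaluated under k independent bases, so two inputs collide only if they collide under every base simultaneously. A from-scratch sketch of the idea (not the package's implementation; the 2^61−1 modulus here is illustrative, the package reportedly uses 2^127−1):

```python
P = 2**61 - 1   # illustrative Mersenne prime modulus

def poly_hash(data, base):
    # Horner evaluation of sum(data[i] * base^(n-1-i)) mod P
    h = 0
    for byte in data:
        h = (h * base + byte) % P
    return h

def multi_hash(data, bases=(131, 257)):
    # a collision requires colliding under every base at once,
    # which is what drives the (N/p)^k bound
    return tuple(poly_hash(data, b) for b in bases)
```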

Content-defined chunking for deduplication

from uhc.chunking.cdc import cdc_chunk
from uhc.core.polynomial_hash import PolynomialHash

h = PolynomialHash(base=131)
with open("large_file.bin", "rb") as f:
    data = f.read()

# Split into variable-size chunks at content-determined boundaries
chunks = cdc_chunk(data, min_size=2048, avg_size=8192, max_size=65536)

# Hash each chunk — identical content produces identical hashes
chunk_hashes = {h.hash(c): c for c in chunks}

# Reconstruct whole-file hash from chunks via Theorem 1
running = 0
for chunk in chunks:
    running = h.hash_concat(running, len(chunk), h.hash(chunk))
assert running == h.hash(data)
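The identity behind that reconstruction is easy to verify from scratch: for a polynomial hash, H(A ++ B) = H(A)·b^|B| + H(B) mod p. A standalone sketch, independent of the package (modulus and base here are illustrative):

```python
P, B = 2**61 - 1, 131   # illustrative Mersenne modulus and base

def poly_hash(data):
    # Horner evaluation of sum(data[i] * B^(n-1-i)) mod P
    h = 0
    for byte in data:
        h = (h * B + byte) % P
    return h

def hash_concat(h_left, right_len, h_right):
    # H(A ++ B) = H(A) * B^|B| + H(B)  (mod P)
    return (h_left * pow(B, right_len, P) + h_right) % P

left, right = b"hello ", b"world"
whole = hash_concat(poly_hash(left), len(right), poly_hash(right))
assert whole == poly_hash(left + right)
```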

Lower-level: direct token extraction and CDH

from uhc.core.deflate import deflate_extract_tokens
from uhc.core.compressed_verifier import compressed_domain_hash, CDHMethod

# Extract LZ77 tokens from a DEFLATE stream
tokens = deflate_extract_tokens(compressed)

# Compute hash via compressed-domain hashing (three strategies available)
h_rope = compressed_domain_hash(tokens, method=CDHMethod.ROPE)
h_sliding = compressed_domain_hash(tokens, method=CDHMethod.SLIDING_ROPE,
                                    d_max=32768, m_max=258)
h_prefix = compressed_domain_hash(tokens, method=CDHMethod.PREFIX_ARRAY)

# All three produce identical results (Theorem 12)
assert h_rope == h_sliding == h_prefix

Architecture

uhc/
├── core/
│   ├── polynomial_hash.py     # Mersenne arithmetic, PolynomialHash, Φ()
│   ├── lz77.py                # Literal/Reference tokens, encode/decode
│   ├── rope.py                # Hash rope: Leaf, Internal, RepeatNode, BB[2/7]
│   ├── compressed_verifier.py # CDH: prefix_array, rope, sliding_rope strategies
│   ├── multihash.py           # k-tuple multi-hash (Theorem 21)
│   ├── deflate.py             # DEFLATE token extractor (Lemma 9)
│   └── lz4_parser.py          # LZ4 block + frame parser (Lemma 10)
├── chunking/
│   └── cdc.py                 # Gear-based FastCDC chunking
└── engine/
    └── pipeline.py            # Unified public API
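The RepeatNode in the rope exploits the fact that the hash of a string s repeated r times is a geometric series in q = B^|s|. A from-scratch sketch of that identity (not the package's code; assumes q ≢ 1 mod P, which holds for these small parameters):

```python
P, B = 2**61 - 1, 131   # illustrative Mersenne modulus and base

def poly_hash(data):
    h = 0
    for byte in data:
        h = (h * B + byte) % P
    return h

def repeat_hash(h_s, s_len, r):
    # H(s repeated r times) = h_s * (q^r - 1) / (q - 1) mod P, q = B^|s|
    q = pow(B, s_len, P)
    geo = (pow(q, r, P) - 1) * pow(q - 1, -1, P) % P   # geometric series mod P
    return h_s * geo % P
```

This lets a run of repeats hash in O(log r) arithmetic operations instead of materializing the expansion.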

CDH Strategies

Strategy       Space        Best for
PREFIX_ARRAY   O(N)         Small files, debugging
ROPE           O(n·log N)   General use, unbounded streams
SLIDING_ROPE   O(W)         Large files, bounded memory (W = d_max + m_max)
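To give a flavor of the prefix-array idea, here is a from-scratch sketch (independent of the package's internals, with an illustrative 2^61−1 modulus): it maintains pre[i], the polynomial hash of the first i decompressed bytes, and extends it through reference tokens purely algebraically — each copied byte's contribution is recovered from two adjacent prefix hashes, so no decompressed buffer is ever built. Overlapping references (length > distance) work automatically because each source position exists in the array before it is read.

```python
P, B = 2**61 - 1, 131   # illustrative Mersenne modulus and base

def cdh_prefix_array(tokens):
    """Hash the decompressed text implied by LZ77 tokens without decoding it.

    tokens: sequence of ('lit', byte_value) or ('ref', distance, length).
    """
    pre = [0]   # pre[i] = polynomial hash of the first i decompressed bytes
    for tok in tokens:
        if tok[0] == 'lit':
            pre.append((pre[-1] * B + tok[1]) % P)
        else:
            _, dist, length = tok
            for _ in range(length):
                s = len(pre) - 1 - dist                   # copy-source position
                byte_h = (pre[s + 1] - pre[s] * B) % P    # T[s] mod P, algebraically
                pre.append((pre[-1] * B + byte_h) % P)
    return pre[-1]

def poly_hash(data):
    h = 0
    for byte in data:
        h = (h * B + byte) % P
    return h

# 'abc' followed by an overlapping back-reference expands to 'abcabcabc'
tokens = [('lit', 97), ('lit', 98), ('lit', 99), ('ref', 3, 6)]
assert cdh_prefix_array(tokens) == poly_hash(b"abcabcabc")
```

This byte-at-a-time variant is O(N) in time and space, matching the table's characterization of PREFIX_ARRAY; ROPE and SLIDING_ROPE reportedly trade this for logarithmic or window-bounded memory.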

Mathematical Foundation

The complete mathematical framework is in theory/compressed_domain_hashing_framework.md:

  • 10 definitions, 14 lemmas, 24 theorems, 5 corollaries — all with complete proofs
  • Theorem 12 (Main Correctness): CDH(τ) = H(T) for all valid LZ77 encodings
  • Theorem 17 (Speedup): CDH is faster than decompress-then-hash when compression ratio exceeds the format constant c_F
  • Theorem 20 (Collision Bound): Pr[collision] ≤ (N/p)^k with k independent bases
  • Corollary 5: p = 2^127−1, k = 2 gives collision probability below 2^(−174) for 1 TB files
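The Corollary 5 figure follows from quick arithmetic: with N = 2^40 bytes (1 TB), p = 2^127 − 1 and k = 2, the bound is (N/p)^k ≈ (2^−87)^2 = 2^−174. A one-liner to confirm:

```python
from math import log2

p = 2**127 - 1
N = 2**40        # 1 TB of decompressed bytes
k = 2
bound_bits = log2((N / p) ** k)   # log2 of the collision bound, about -174
```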

Development

git clone https://github.com/jemsbhai/uhc.git
cd uhc/code
pip install -e ".[dev,formats]"
python -m pytest tests/ -v

Test Coverage

504 tests covering:

  • Algebraic foundations (polynomial hash, Φ, Mersenne arithmetic)
  • LZ77 encode/decode with overlapping references
  • Hash rope operations (concat, split, repeat, substr_hash, balance)
  • Three-way CDH cross-validation (prefix_array, rope, sliding_rope)
  • DEFLATE and LZ4 token extraction against real compressor output
  • CDC chunking properties (determinism, boundary stability, size constraints)
  • End-to-end pipeline integration with cross-format verification
  • Randomized and seeded property-based tests

Roadmap

Done

  • Algebraic core (polynomial hash, geometric accumulator)
  • Hash rope data structure with RepeatNode
  • Compressed-domain hash verifier (three strategies)
  • k-tuple multi-hash
  • DEFLATE token extractor
  • LZ4 token extractor (block + frame)
  • Content-defined chunking (FastCDC)
  • Unified pipeline API

Next

  • Zstandard token extractor (Lemma 11)
  • BLAKE3 Merkle wrapper for adversarial integrity
  • MPHF index for chunk lookup
  • Performance benchmarks
  • C++ acceleration for inner loops

License

MIT
