Unified Hash-Compression Engine: compressed-domain hashing over LZ77 streams
UHC computes the polynomial hash of uncompressed data by operating directly on compressed token streams (DEFLATE, LZ4), without ever materializing the decompressed bytes. This means you can verify the integrity of a 1 TB compressed archive while reading only the compressed data — no decompression needed.
Why?
Traditional integrity checking requires: read compressed → decompress → hash the raw bytes. For a 1 TB file compressed to 100 GB, that's 1.1 TB of data movement.
UHC's compressed-domain hashing (CDH) reads only the 100 GB compressed stream and computes the hash algebraically from the LZ77 tokens. The result is mathematically identical to hashing the decompressed data (Theorem 12), with speedup proportional to the compression ratio (Theorem 17).
| Scenario | Decompress-then-hash | UHC |
|---|---|---|
| 1 TB file, 10:1 compression | 1.1 TB I/O | 100 GB I/O |
| DEFLATE, CR > 23 | Baseline | Faster (CPU) |
| Zstandard, CR > 44 | Baseline | Faster (CPU) |
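The algebra that makes this possible is the polynomial-hash concatenation identity: the hash of a concatenation is computable from the pieces' hashes and lengths alone. A minimal sketch in plain Python (not the uhc API, and using a smaller Mersenne modulus than the framework's 2**127 - 1):

```python
# Minimal sketch (not the uhc API) of why hashing composes: the polynomial
# hash of L || R is computable from H(L), H(R), and |R|, so a whole-stream
# hash can be assembled from LZ77 tokens without touching the raw bytes.
P = (1 << 61) - 1    # a Mersenne prime modulus (uhc's framework uses 2**127 - 1)
B = 131              # polynomial base

def poly_hash(data: bytes) -> int:
    h = 0
    for byte in data:
        h = (h * B + byte) % P
    return h

def hash_concat(h_left: int, right_len: int, h_right: int) -> int:
    # H(L || R) = H(L) * B**|R| + H(R)   (mod P)
    return (h_left * pow(B, right_len, P) + h_right) % P

left, right = b"hello ", b"world"
assert hash_concat(poly_hash(left), len(right), poly_hash(right)) == poly_hash(left + right)
```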
Installation
pip install uhc
For format support (LZ4, Zstandard, BLAKE3):
pip install uhc[formats]
Quick Start
Hash raw data
from uhc.engine.pipeline import uhc_hash
h = uhc_hash(b"hello world")
Hash compressed data without decompressing
import zlib
from uhc.engine.pipeline import uhc_hash_compressed, Format
# Compress with raw DEFLATE (no zlib header)
data = b"the quick brown fox " * 1000
c = zlib.compressobj(6, zlib.DEFLATED, -15)
compressed = c.compress(data) + c.flush()
# Hash the decompressed content — without decompressing
h = uhc_hash_compressed(compressed, Format.DEFLATE)
# Verify: same as hashing the raw data directly
assert h == uhc_hash(data)
Cross-format verification
import zlib
import lz4.block
from uhc.engine.pipeline import uhc_verify, Format
data = b"test data " * 100
# Compress the same data with two different formats
c = zlib.compressobj(6, zlib.DEFLATED, -15)
deflate_bytes = c.compress(data) + c.flush()
lz4_bytes = lz4.block.compress(data, store_size=False)
# Verify they contain the same content — without decompressing either
assert uhc_verify(deflate_bytes, lz4_bytes,
                  fmt_a=Format.DEFLATE, fmt_b=Format.LZ4_BLOCK)
Multi-hash for stronger collision resistance
from uhc.engine.pipeline import uhc_hash_multi
# k=2 independent hashes — collision probability drops to (N/p)^2
h_k = uhc_hash_multi(b"important data", bases=[131, 257])
# Returns tuple: (hash_with_base_131, hash_with_base_257)
Content-defined chunking for deduplication
from uhc.chunking.cdc import cdc_chunk
from uhc.core.polynomial_hash import PolynomialHash
h = PolynomialHash(base=131)
with open("large_file.bin", "rb") as f:
    data = f.read()
# Split into variable-size chunks at content-determined boundaries
chunks = cdc_chunk(data, min_size=2048, avg_size=8192, max_size=65536)
# Hash each chunk — identical content produces identical hashes
chunk_hashes = {h.hash(c): c for c in chunks}
# Reconstruct whole-file hash from chunks via Theorem 1
running = 0
for chunk in chunks:
running = h.hash_concat(running, len(chunk), h.hash(chunk))
assert running == h.hash(data)
Lower-level: direct token extraction and CDH
import zlib
from uhc.core.deflate import deflate_extract_tokens
from uhc.core.compressed_verifier import compressed_domain_hash, CDHMethod
# Produce a raw DEFLATE stream, then extract its LZ77 tokens
c = zlib.compressobj(6, zlib.DEFLATED, -15)
compressed = c.compress(b"the quick brown fox " * 1000) + c.flush()
tokens = deflate_extract_tokens(compressed)
# Compute hash via compressed-domain hashing (three strategies available)
h_rope = compressed_domain_hash(tokens, method=CDHMethod.ROPE)
h_sliding = compressed_domain_hash(tokens, method=CDHMethod.SLIDING_ROPE,
                                   d_max=32768, m_max=258)
h_prefix = compressed_domain_hash(tokens, method=CDHMethod.PREFIX_ARRAY)
# All three produce identical results (Theorem 12)
assert h_rope == h_sliding == h_prefix
Architecture
uhc/
├── core/
│ ├── polynomial_hash.py # Mersenne arithmetic, PolynomialHash, Φ()
│ ├── lz77.py # Literal/Reference tokens, encode/decode
│ ├── rope.py # Hash rope: Leaf, Internal, RepeatNode, BB[2/7]
│ ├── compressed_verifier.py # CDH: prefix_array, rope, sliding_rope strategies
│ ├── multihash.py # k-tuple multi-hash (Theorem 21)
│ ├── deflate.py # DEFLATE token extractor (Lemma 9)
│ └── lz4_parser.py # LZ4 block + frame parser (Lemma 10)
├── chunking/
│ └── cdc.py # Gear-based FastCDC chunking
└── engine/
└── pipeline.py # Unified public API
CDH Strategies
| Strategy | Space | Best for |
|---|---|---|
| PREFIX_ARRAY | O(N) | Small files, debugging |
| ROPE | O(n·log N) | General use, unbounded streams |
| SLIDING_ROPE | O(W) | Large files, bounded memory (W = d_max + m_max) |
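To illustrate the idea behind the simplest strategy, here is a toy sketch of a prefix-array CDH (not the uhc API; the token shapes are invented for illustration). It replays LZ77 tokens while keeping prefix hashes P[i] = H(T[0:i]); every byte a back-reference copies is recovered algebraically from two adjacent prefix entries, so the decompressed text is never materialized:

```python
# Toy PREFIX_ARRAY sketch (not the uhc API): O(N) space, one prefix-hash
# entry per decoded byte, no decompressed buffer.
P_MOD = (1 << 61) - 1   # a Mersenne prime (uhc's framework uses 2**127 - 1)
B = 131                 # polynomial base

def cdh_prefix_array(tokens):
    """tokens: ('lit', byte) or ('ref', distance, length). Returns H(T)."""
    prefix = [0]                        # prefix[i] = hash of first i bytes
    for tok in tokens:
        if tok[0] == "lit":
            prefix.append((prefix[-1] * B + tok[1]) % P_MOD)
        else:
            _, dist, length = tok
            for _ in range(length):
                src = len(prefix) - 1 - dist   # source byte index; recomputed
                                               # each step, so overlapping
                                               # references (length > dist) work
                # recover T[src] algebraically: prefix[src+1] = prefix[src]*B + T[src]
                byte = (prefix[src + 1] - prefix[src] * B) % P_MOD
                prefix.append((prefix[-1] * B + byte) % P_MOD)
    return prefix[-1]

# "abcabcabc" as three literals plus one overlapping back-reference
tokens = [("lit", ord("a")), ("lit", ord("b")), ("lit", ord("c")), ("ref", 3, 6)]
h = 0
for b in b"abcabcabc":                  # direct hash of the decoded text
    h = (h * B + b) % P_MOD
assert cdh_prefix_array(tokens) == h    # identical, without decoding
```

The rope-based strategies trade this per-byte array for logarithmic or window-bounded state, which is what makes them suitable for unbounded streams.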
Mathematical Foundation
The complete mathematical framework is in theory/compressed_domain_hashing_framework.md:
- 10 definitions, 14 lemmas, 24 theorems, 5 corollaries — all with complete proofs
- Theorem 12 (Main Correctness): CDH(τ) = H(T) for all valid LZ77 encodings
- Theorem 17 (Speedup): CDH is faster than decompress-then-hash when compression ratio exceeds the format constant c_F
- Theorem 20 (Collision Bound): Pr[collision] ≤ (N/p)^k with k independent bases
- Corollary 5: p = 2^127−1, k = 2 gives collision probability below 2^(−174) for 1 TB files
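The Corollary 5 figure can be sanity-checked with plain integer arithmetic (this snippet is an illustration, not part of the uhc API):

```python
# Checking the Corollary 5 numbers: with p = 2**127 - 1, k = 2, and
# N = 2**40 (1 TiB of data), the Theorem 20 bound (N/p)^k is about 2**-174.
from math import log2

p = 2**127 - 1
N = 2**40            # 1 TiB of decompressed data
k = 2                # two independent hash bases
bound_log2 = k * (log2(N) - log2(p))   # log2 of (N/p)**k
assert -175 < bound_log2 < -173        # i.e. roughly 2**-174
```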
Development
git clone https://github.com/jemsbhai/uhc.git
cd uhc/code
pip install -e ".[dev,formats]"
python -m pytest tests/ -v
Test Coverage
504 tests covering:
- Algebraic foundations (polynomial hash, Φ, Mersenne arithmetic)
- LZ77 encode/decode with overlapping references
- Hash rope operations (concat, split, repeat, substr_hash, balance)
- Three-way CDH cross-validation (prefix_array, rope, sliding_rope)
- DEFLATE and LZ4 token extraction against real compressor output
- CDC chunking properties (determinism, boundary stability, size constraints)
- End-to-end pipeline integration with cross-format verification
- Randomized and seeded property-based tests
Roadmap
Done
- Algebraic core (polynomial hash, geometric accumulator)
- Hash rope data structure with RepeatNode
- Compressed-domain hash verifier (three strategies)
- k-tuple multi-hash
- DEFLATE token extractor
- LZ4 token extractor (block + frame)
- Content-defined chunking (FastCDC)
- Unified pipeline API
Next
- Zstandard token extractor (Lemma 11)
- BLAKE3 Merkle wrapper for adversarial integrity
- MPHF index for chunk lookup
- Performance benchmarks
- C++ acceleration for inner loops
License
MIT