
Per-file-type trained lossless compressor — dictionary + LZ + arithmetic coding, with specialist image/audio/video/numeric/columnar codecs


Per-File-Type Trained Lossless Compressor

A lossless compressor built on one idea: learn the common patterns of a file type, then encode files as short references to those patterns plus short codes for frequent bytes.

Install

Standalone binary (no Python) — Linux / Windows / macOS. The pertype CLI is a single self-contained, statically-linked executable. Grab it from the Releases page (a per-OS build is attached to each tag), or build it from source with Rust:

# prebuilt: download + extract the archive for your platform, e.g. Linux x86_64:
tar xzf pertype-x86_64-unknown-linux-musl.tar.gz
./pertype compress myfile.json -o myfile.cmp   # then: ./pertype decompress myfile.cmp -o out.json
# or build from source (needs the Rust toolchain) — installs `pertype` into ~/.cargo/bin:
git clone https://github.com/cbeech/pertype && cargo install --path pertype/rust --bin pertype

The standalone binary covers train / compress / decompress + auto-routing (text / byte / CSV / columnar / telemetry). For the image / audio / video / scientific codecs, use the Python package below.

Quickstart

pip install .                 # core (zero-dependency); add extras for specialist codecs:
pip install ".[all]"          # image / audio / video / scientific support (numpy, pillow, …)

# Compress anything — auto-detects the type and routes to the best codec, verified byte-exact.
# The output is self-describing, so decompress needs no flags to route it:
pertype compress data.csv              # -> data.csv.cmp   (picks csv/columnar/…)
pertype decompress data.csv.cmp        # -> data.csv

# Have a trained model for this file type? Add --model: the trained codec is tried too and
# the smaller result wins (on json that's ~8x vs ~2x for generic compression):
pertype train json corpus/json/train -o json.model     # learn once, reuse across files
pertype compress page.json --model json.model          # -> page.json.cmp  [trained-model]
pertype decompress page.json.cmp --model json.model    # -> page.json

# Not sure what a file is? Ask (like `file`, but it names the ideal codec):
pertype identify mystery.bin

python -m pertype … works identically without installing. The text/byte core has no dependencies; the image/audio/video/science codecs pull theirs via the matching extra. Native (C) acceleration builds itself on first use and falls back to pure Python if gcc isn't available. A standalone Rust build (byte-identical, no Python) lives in rust/.

Two intuitions, realized honestly:

  • "256 patterns make up a file" → a trained dictionary of common multi-byte chunks, with a literal-byte fallback so any file rebuilds byte-for-byte.
  • "compress 8 bits to 4 bits" → arithmetic entropy coding: frequent patterns/bytes cost a fraction of a bit and rare ones cost more, so the average drops well below 8 bits/byte without losing anything. (A Huffman coder is also in the tree as a tested building block, but the pipeline uses arithmetic coding, which spends fractional bits and tracks the true entropy more closely.)
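To make the "fraction of a bit" claim concrete, the sketch below computes the ideal cost of a symbol with probability p, which is −log2(p) bits, and the order-0 entropy of a byte string, which is the average cost an ideal arithmetic coder approaches. The function names are illustrative and are not part of pertype.

```python
import math
from collections import Counter


def symbol_cost_bits(p: float) -> float:
    """Ideal code length, in bits, of a symbol with probability p."""
    return -math.log2(p)


def order0_bits_per_byte(data: bytes) -> float:
    """Order-0 Shannon entropy of data, in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    return sum(c / total * symbol_cost_bits(c / total)
               for c in Counter(data).values())


# A skewed source: one very frequent byte, one occasional byte, one rare byte.
sample = b"a" * 90 + b"b" * 9 + b"c"
print(f"frequent symbol: {symbol_cost_bits(0.90):.2f} bits")
print(f"rare symbol:     {symbol_cost_bits(0.01):.2f} bits")
print(f"average:         {order0_bits_per_byte(sample):.2f} bits/byte (raw is 8)")
```

A Huffman code must spend at least one whole bit per symbol, so on a source this skewed it cannot get below 1 bit/byte, while an arithmetic coder can.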

On top of the cross-file dictionary, the codec also uses LZ77 back-references. When LZ is enabled, training prepends a learned blob to each file's history, so matches can reach into arbitrary substrings of trained content — the way zstd uses a dictionary — as well as in-file repetition. Two blob builders are available: naive (whole training files concatenated, preserving long contiguous runs) and coverage (zstd-COVER-style: pack the most frequently-referenced content, deduplicated, most-useful nearest the data). Training tries dict-only plus both builders at several sizes and keeps whichever is cheapest on a held-out validation slice (so the blob can't overfit the choice). Different types land on different strategies — see results below.
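The effect of prepending trained content to a file's history can be tried with the standard library alone. The sketch below uses zlib's preset-dictionary support as a stand-in for the learned blob; it is not pertype's codec, and the blob and message are made-up examples.

```python
import zlib


def compress_with_blob(data: bytes, blob: bytes) -> bytes:
    """Deflate `data` with `blob` preloaded into the LZ77 history window."""
    compressor = zlib.compressobj(level=9, zdict=blob)
    return compressor.compress(data) + compressor.flush()


def decompress_with_blob(payload: bytes, blob: bytes) -> bytes:
    """Inverse of compress_with_blob; the same blob must be supplied."""
    decompressor = zlib.decompressobj(zdict=blob)
    return decompressor.decompress(payload) + decompressor.flush()


# Hypothetical "trained" content: the skeleton shared by many API responses.
blob = b'{"id": 0, "status": "ok", "user": {"name": "", "email": ""}}'
msg = b'{"id": 42, "status": "ok", "user": {"name": "ada", "email": "ada@example.org"}}'

plain = zlib.compress(msg, 9)
primed = compress_with_blob(msg, blob)
assert decompress_with_blob(primed, blob) == msg
print(len(msg), len(plain), len(primed))  # raw size, no blob, with blob
```

The blob is not stored in the output, which is why it has to be shipped separately and supplied again at decompression time.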

The twist that beats general-purpose tools: the model is trained per file type and shipped separately, rather than being rebuilt and embedded in every compressed file the way gzip stores its code tables in each stream. The model's cost is paid once and amortized across many files, so the honest win scenario is many smallish files of a known type (API responses, log lines, HTML pages).

Beyond text, the same predict-then-entropy-code idea — a per-type reversible transform, a context-adaptive residual coder, a dedicated adaptive-filter audio codec, and a motion-compensated video codec — extends across domains.

Results at a glance (every number on real data, every result round-trip verified):

| domain | data | our result vs the standard codec |
|---|---|---|
| text | JSON / logs / HTML / XML / code (held-out) | beats plain gzip/zstd by 29–62%; beats zstd --train (best dict) on logs +7%, html +6%, XML +6%; ~6–7% behind on json & Python source (cross-file-repetitive: zstd's COVER+FSE niche) |
| public text (enwik8) | Wikipedia, held-out | 3.06×: beats gzip 2.60×, zstd 2.70×, xz 2.76×, bzip2 2.83× (all standard tools); ~6% behind zstd --train 3.25× (the same trained-dict holdout, on a named benchmark) |
| IoT/MQTT telemetry | Intel Lab sensor, per-message JSON (held-out) | trained model 3.55×: beats zstd --train 2.09× by +41%; generic per-message gzip/zstd/xz are ≤1.05× (overhead sinks them on ~100 B messages). The many-small-messages Mode-B win, validated measure-first (scripts/iot_benchmark.py) |
| lossless image (Kodak) | 24 standard photos | beats PNG on 24/24 (+29%, 2.51× vs 1.79×), matches WebP-LL (2.51×) and within −4% of JPEG-XL (the named lossless-image benchmark) |
| Silesia (routed) | the modern general corpus | per-type routing: mr MR-volume +21% / x-ray +18% vs xz; held-out text (1 MB train) beats every standard tool on dickens/webster/reymont/samba/nci (trails only zstd --train); loses on xml (repetitive markup: LZ/BWT niche) and sao floats; binaries not our design |
| raw image | Canon CR2 Bayer / RGB photo | dedicated MED/GAP/CALIC codec: Bayer 2.22× (beats Canon's own lossless +41%), RGB photo 2.64× (beats PNG +13%) |
| medical image | real DICOM CT/MR (16-bit) | beats all: 4.79× vs PNG-16 3.33×, xz 2.78× (+44% over PNG); dense continuous-tone is the predictor's domain |
| astronomy (FITS) | NASA int16 / float32 | int16 beats all: 5.54× vs xz 5.01×, PNG 3.94×; float32 near the entropy floor (~1.2× for everyone) |
| terrain (DEM) | SRTM int16 elevation | beats all: 4.56× vs PNG-16 2.81×, xz 2.64×, zstd 2.21× (1.62× over the best); smooth height fields are the predictor's domain |
| hyperspectral | AVIRIS cube (200 bands) | inter-band delta (3D volume codec): 2.41× vs xz 1.83×, zstd 1.65×, +14% over per-band |
| LiDAR point cloud | LAS (airborne, 110K pts) | columnar codec (pertype/columnar.py: de-interleave fields + per-column raw/delta/Δ²): 4.88× vs xz 2.88×, zstd 2.54×, beats general codecs (LAZ specialist ~5–15×) |
| tabular CSV | UCI power (2M-row numeric) | columnar transpose (pertype/csvcolumnar.py; per column: fixed-decimal→scaled-int Δ, low-cardinality text→value dictionary, else deflate): 16.5× vs xz 11.3×, zstd 10.1×, gzip 7.0× (+32% over the best general tool) |
| sparse / volumes | masks, CT/MR/FITS stacks | an RLE coder wins on sparse/label data (auto-selected); 3D inter-slice delta adds +31% on correlated volumes |
| audio | 16-bit PCM music | beats FLAC +7.4% (9/10), and beats xz +59% (1.96× vs 1.24×) |
| biosignal | ECG (PhysioNet) | beats xz +7% (3.06× vs 2.94×) |
| seismic | broadband waveforms (IRIS) | beats xz 2–3× (6.6–7.4× vs 2.3–3.7×) |
| sensor numeric | UCI power (int columns) | 6.27×: beats gzip; xz wins (repetition-heavy) |
| float64 | UCI power / synth (held-out) | beats xz/zstd on all (4.90× / 5.32× / 1.30×) via Gorilla XOR-delta |
| video | CIF clips + real movies (full YUV) | beats FFV1: animation +16–55% (peak on stop-motion), live action +3–12%; loses on high-motion (intra-bound). Motion compensation is the lever |
| genome (DNA) | E. coli FASTA | boundary: a near-uniform 4-symbol source (~1.95 bits/base); 2-bit packing (4.05×) is the floor and prediction adds nothing. xz 3.72×, ours no edge; honestly not our niche |
| protein (AA) | E. coli FASTA | boundary: a ~20-symbol near-i.i.d. source (~4.15 bits/residue); order-0 entropy coding beats the LZ tools (no repetition) but prediction adds nothing. Completes the DNA→protein→text alphabet story |
| climate grid (HDF5) | NCEP reanalysis float32 | beats all: 4.51× vs xz 3.20×, zstd 2.70× (+29%). pertype/floatcodec.py maps the few distinct values (0.18%) to a dictionary and delta-codes the smooth index field. Closes the lossless-float boundary where prediction/XOR fail |

The unifying result, and the dividing line: predict per type, then entropy-code. Where a signal is smooth or structured (audio, ECG, raw images, video, slowly varying sensors), an adaptive predictor plus context-adaptive arithmetic coding beats the general-purpose tools and, in several cases, the domain specialists (FLAC, FFV1, PNG). Where data is repetition-dominated (long constant runs, exact repeats), LZ-family coders (xz, zstd --train) win, and we don't pretend otherwise. Floating-point data, once a boundary, is now handled: a Gorilla XOR-delta transform plus our trained LZ + ctxcoder beats xz/zstd on real float64 measurement data.
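The XOR-delta idea can be sketched in a few lines: each float64 is reinterpreted as a 64-bit integer and XORed with the previous value's bits, so repeated or slowly changing readings leave residuals that are zero or mostly zero. This shows only the XOR step under illustrative function names; Gorilla's bit-packing and the downstream LZ + ctxcoder stages are omitted.

```python
import struct


def float_bits(value: float) -> int:
    """Reinterpret a float64 as its 64-bit unsigned integer pattern."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]


def xor_delta_encode(values):
    """XOR each value's bit pattern with the previous one (the first with 0)."""
    residuals, prev = [], 0
    for value in values:
        bits = float_bits(value)
        residuals.append(bits ^ prev)
        prev = bits
    return residuals


def xor_delta_decode(residuals):
    """Inverse transform: rebuild the exact float64 values."""
    values, prev = [], 0
    for residual in residuals:
        bits = residual ^ prev
        values.append(struct.unpack("<d", struct.pack("<Q", bits))[0])
        prev = bits
    return values


readings = [240.1, 240.1, 240.3, 240.2, 240.2]   # made-up sensor values
residuals = xor_delta_encode(readings)
assert xor_delta_decode(residuals) == readings
for residual in residuals:
    print(f"{residual:016x}")   # repeated readings give an all-zero residual
```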

Everything runs through a native C hot path (via ctypes) that is bit-identical to the pure-Python reference and falls back to it when the native build is unavailable. Compressing a 0.8 MB text file went from 111 s to 0.78 s (~140×), so the whole family (text, audio, video, numeric) is fast enough to use, not just a ratio demo.

How it works

train(corpus)                         compress(file, model)
  select transform      ─┐              apply transform (decorrelate)
  mine patterns + blob   ├─ model        └─ cost-optimal parse (DP over tokens)
  price modes on val set ┘                   └─ arithmetic-code the token stream
  pick cheapest                                  └─ container = header + bitstream

A reversible transform runs first (and is inverted last), chosen per file type by the validation gate: generic byte-stream ops — delta (predict from the byte N back) and split (deinterleave into N byte-planes) — that decorrelate numeric/image data so the coder has far less to encode. Text selects identity.
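The two byte-stream ops described above can be sketched as follows. This is a minimal pure-Python version with illustrative function names, not the code in pertype/transform.py; the 16-bit ramp is a made-up input.

```python
import struct


def delta_encode(data: bytes, n: int) -> bytes:
    """Replace each byte by its difference (mod 256) from the byte n back."""
    return bytes((b - (data[i - n] if i >= n else 0)) & 0xFF
                 for i, b in enumerate(data))


def delta_decode(data: bytes, n: int) -> bytes:
    """Inverse of delta_encode: a running sum with stride n."""
    out = bytearray(len(data))
    for i, d in enumerate(data):
        out[i] = (d + (out[i - n] if i >= n else 0)) & 0xFF
    return bytes(out)


def split_encode(data: bytes, n: int) -> bytes:
    """Deinterleave into n byte-planes: offsets 0 mod n, then 1 mod n, ..."""
    return b"".join(data[k::n] for k in range(n))


def split_decode(data: bytes, n: int) -> bytes:
    """Inverse of split_encode: re-interleave the planes."""
    out = bytearray(len(data))
    pos = 0
    for k in range(n):
        plane_len = len(range(k, len(data), n))
        out[k::n] = data[pos:pos + plane_len]
        pos += plane_len
    return bytes(out)


# A slowly rising little-endian 16-bit signal.
raw = struct.pack("<8H", *[1000 + 3 * i for i in range(8)])
transformed = split_encode(delta_encode(raw, 2), 2)
assert delta_decode(split_decode(transformed, 2), 2) == raw
print(raw.hex())
print(transformed.hex())   # after the first sample: small low bytes, zero high bytes
```

After delta(2) the low-byte plane holds small steps and the high-byte plane is almost all zeros, which is the kind of stream an entropy coder prices cheaply.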

For LZ types the parser is cost-optimal: a dynamic program finds the minimum-cost path through the token graph, pricing every candidate (literal, dict ref, each LZ match) by its actual arithmetic-coded bit cost. Dict-only types keep the cheap greedy longest-match parse. The match-finder's search depth is adaptive (tokenizer.adaptive_max_chain): deep on small files — where the optimal parse is ~1% denser and the absolute cost is tiny — tapering to the fixed default on large/match-rich inputs. It never goes below the default, so it's a Pareto improvement (never worse on ratio) with bounded compress cost.
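A toy version of the cost-optimal parse is sketched below. It is a simplification under loud assumptions: the prices are made-up flat constants rather than real arithmetic-coded bit costs, the match finder is brute force, there are no dictionary references, and none of the names come from pertype/tokenizer.py.

```python
LITERAL_BITS = 6.0   # hypothetical flat price of one literal
MATCH_BITS = 20.0    # hypothetical flat price of one (length, distance) match
MIN_MATCH = 3


def find_matches(data: bytes, i: int):
    """Longest earlier match at position i, offered at every usable length."""
    best_len, best_dist = 0, 0
    for j in range(i):
        k = 0
        while i + k < len(data) and data[j + k] == data[i + k]:
            k += 1
        if k > best_len:
            best_len, best_dist = k, i - j
    return [(length, best_dist) for length in range(MIN_MATCH, best_len + 1)]


def optimal_parse(data: bytes):
    """Backward DP: best[i] is the cheapest cost of coding data[i:]."""
    n = len(data)
    best = [0.0] * (n + 1)
    choice = [None] * n
    for i in range(n - 1, -1, -1):
        best[i], choice[i] = LITERAL_BITS + best[i + 1], ("lit", data[i])
        for length, dist in find_matches(data, i):
            cost = MATCH_BITS + best[i + length]
            if cost < best[i]:
                best[i], choice[i] = cost, ("match", length, dist)
    tokens, i = [], 0
    while i < n:                      # walk the chosen path forward
        tokens.append(choice[i])
        i += 1 if choice[i][0] == "lit" else choice[i][1]
    return tokens, best[0]


def decode(tokens) -> bytes:
    out = bytearray()
    for token in tokens:
        if token[0] == "lit":
            out.append(token[1])
        else:
            for _ in range(token[1]):
                out.append(out[-token[2]])
    return bytes(out)


for text in (b"abcXabc", b"abcdXabcd"):
    tokens, bits = optimal_parse(text)
    assert decode(tokens) == text
    print(text, bits, tokens)
```

With these prices a 3-byte match (20 bits) costs more than three literals (18 bits), so the DP declines it where a greedy longest-match parse would take it; the 4-byte match is accepted.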

Tokens are literals, dictionary references, or (length, distance) LZ matches. Match lengths and distances are bucketed into slots (one coded symbol + a few raw "extra" bits each), with a separate frequency model for distances. LZ matches use lazy parsing (one-byte lookahead: defer a match if the next position offers a longer one); dictionary matches commit greedily since they're the cheapest token. Recently-used match distances are cached as repeat offsets, so a match reusing one codes a tiny index instead of a full distance.
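Slot bucketing and the repeat-offset cache can be illustrated together. In this sketch a distance is split into a power-of-two slot plus raw extra bits, and recently used distances are kept in a move-to-front list so that a reuse is coded as a small index. The octave-based slot layout and the class and function names are assumptions made for illustration; the actual layout in pertype may differ.

```python
def distance_to_slot(distance: int):
    """Return (slot, number of extra bits, extra value) for distance >= 1."""
    slot = distance.bit_length() - 1          # one slot per octave
    return slot, slot, distance - (1 << slot)


def slot_to_distance(slot: int, extra: int) -> int:
    return (1 << slot) + extra


class RepeatOffsets:
    """Move-to-front cache of recent match distances (encoder and decoder)."""

    def __init__(self, depth: int = 16):
        self.depth = depth
        self.recent = []

    def _remember(self, distance: int) -> None:
        self.recent.insert(0, distance)
        del self.recent[self.depth:]

    def encode(self, distance: int):
        if distance in self.recent:
            index = self.recent.index(distance)
            self.recent.pop(index)
            token = ("rep", index)            # tiny index instead of a distance
        else:
            token = ("dist",) + distance_to_slot(distance)
        self._remember(distance)
        return token

    def decode(self, token) -> int:
        if token[0] == "rep":
            distance = self.recent.pop(token[1])
        else:
            distance = slot_to_distance(token[1], token[3])
        self._remember(distance)
        return distance


distances = [44, 1000, 44, 44, 7, 1000]       # made-up match distances
encoder, decoder = RepeatOffsets(), RepeatOffsets()
tokens = [encoder.encode(d) for d in distances]
assert [decoder.decode(t) for t in tokens] == distances
print(tokens)
```

Only the slot number goes through the frequency model; the extra bits are written raw because they are close to uniform within a slot.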

Decompression reverses it and verifies a CRC32, so losslessness is checked on every file.
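A minimal sketch of that check, assuming a hypothetical fixed-size header (magic, original length, CRC32) and zlib as a stand-in for the real bitstream. The actual container in pertype/codec.py uses its own header layout, so treat the field order and the DEMO tag as placeholders.

```python
import struct
import zlib

MAGIC = b"DEMO"                       # hypothetical 4-byte tag
HEADER = struct.Struct("<4sII")       # magic, original length, CRC32 of original


def pack_container(original: bytes) -> bytes:
    """Header plus payload; zlib stands in for the real coder."""
    payload = zlib.compress(original, 9)
    return HEADER.pack(MAGIC, len(original), zlib.crc32(original)) + payload


def unpack_container(container: bytes) -> bytes:
    """Decode, then refuse to return data that is not byte-exact."""
    magic, length, crc = HEADER.unpack_from(container)
    if magic != MAGIC:
        raise ValueError("not a DEMO container")
    restored = zlib.decompress(container[HEADER.size:])
    if len(restored) != length or zlib.crc32(restored) != crc:
        raise ValueError("checksum mismatch: output is not byte-exact")
    return restored


data = b"lossless means every byte comes back. " * 20
container = pack_container(data)
assert unpack_container(container) == data
print(len(data), len(container))
```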

Modules

| file | responsibility |
|---|---|
| pertype/bitio.py | MSB-first bit reader/writer |
| pertype/arithmetic.py | integer arithmetic coder (Witten–Neal–Cleary) |
| pertype/freqmodel.py | static frequency model driving the coder |
| pertype/huffman.py | canonical Huffman (package-merge), a tested building block |
| pertype/transform.py | reversible per-type decorrelating transforms (delta/split) |
| pertype/dictionary.py | pattern miner + longest-match lookup |
| pertype/tokenizer.py | reversible file ↔ token stream (dict + LZ) |
| pertype/model.py | train / save / load a per-type model |
| pertype/codec.py | compress / decompress + container + checksum |
| pertype/audiocodec.py | standalone lossless audio codec that beats FLAC (numpy) |
| pertype/ctxcoder.py | context-adaptive arithmetic residual coder (beats xz on ECG) |
| pertype/videocodec.py | lossless video codec: motion-compensated inter-frame (numpy) |
| pertype/predictors.py | shared 2D intra predictors: MED / Paeth / GAP / CALIC (image + video) |
| pertype/imagecodec.py | lossless raw/photo/medical image + volume codec: MED/CALIC/RLE per-plane, 3D inter-slice delta (numpy) |
| pertype/detect.py | file-like type detection → recommends the ideal codec (magic + content) |
| pertype/auto.py | detect → route to any specialist (image/float/csv/columnar/video/audio) → verify byte-exact → keep smallest; self-describing .az blob |
| pertype/columnar.py | columnar codec for fixed-width binary records (de-interleave fields + per-column delta) |
| pertype/csvcolumnar.py | columnar codec for delimited-text tables (transpose + per-column numeric/text coding) |
| pertype/floatcodec.py | lossless low-cardinality float codec (value dictionary + delta-coded indices) |
| pertype/y4m.py | byte-exact YUV4MPEG2 (.y4m) container parse/serialize (shared by CLI + auto) |
| pertype/native.py + _native/audio.c | C hot loops (ctypes), auto-built, with Python fallback |
| pertype/benchmark.py | comparison vs gzip / zstd / zstd-trained-dict |
| pertype/cli.py | train / compress / decompress / benchmark / video-{encode,decode} / image-{encode,decode} / identify / auto-{compress,decompress} / columnar-{encode,decode} / csv-{encode,decode} |
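The "route → verify byte-exact → keep smallest" loop attributed to pertype/auto.py can be sketched with standard-library codecs standing in for the specialists. The one-byte tags and the container layout here are hypothetical and are not the .az format.

```python
import bz2
import lzma
import zlib

# tag -> (encode, decode); stand-ins for the specialist codecs
CODECS = {
    b"Z": (lambda d: zlib.compress(d, 9), zlib.decompress),
    b"B": (bz2.compress, bz2.decompress),
    b"X": (lzma.compress, lzma.decompress),
    b"R": (lambda d: d, lambda d: d),          # raw fallback, never expands much
}


def auto_compress(data: bytes) -> bytes:
    """Try every candidate, keep the smallest one that round-trips exactly."""
    best = None
    for tag, (encode, decode) in CODECS.items():
        payload = encode(data)
        if decode(payload) != data:            # verify before trusting a codec
            continue
        if best is None or len(payload) < len(best) - 1:
            best = tag + payload               # the tag makes the output self-describing
    return best


def auto_decompress(container: bytes) -> bytes:
    """The leading tag says which decoder to use, so no flags are needed."""
    tag, payload = container[:1], container[1:]
    return CODECS[tag][1](payload)


for sample in (b"abc", b"hello, world " * 500):
    packed = auto_compress(sample)
    assert auto_decompress(packed) == sample
    print(len(sample), len(packed), packed[:1])
```

The raw fallback bounds the worst case at one byte of overhead, which matters for inputs that no candidate can shrink.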

Usage

# Generate sample corpora (disjoint train/test) for json, logs, html
python3 scripts/make_corpus.py                 # synthetic, reproducible
python3 scripts/collect_corpus.py              # real files from this machine -> corpus_real/

# Train a model for one type
python3 -m pertype train json corpus/json/train -o json.model

# Compress / decompress a single file
python3 -m pertype compress some.json -m json.model -o some.json.cz
python3 -m pertype decompress some.json.cz -m json.model -o roundtrip.json

# Benchmark against gzip and zstd on the held-out test set
python3 -m pertype benchmark json                      # synthetic corpus
python3 -m pertype benchmark json --root corpus_real   # real-world corpus

# Lossless video: encode/decode a .y4m (4:2:0/4:2:2/4:4:4/mono, byte-exact)
python3 -m pertype video-encode clip.y4m -o clip.vid
python3 -m pertype video-decode clip.vid -o roundtrip.y4m

# Identify a file's type + the codec that suits it (like `file`)
python3 -m pertype identify image.fits data.npy api.json

# Auto: detect → route to the best codec → verify byte-exact → keep smallest (.az)
python3 -m pertype auto-compress image.fits -o image.az
python3 -m pertype auto-decompress image.az -o roundtrip.fits

# Columnar: compress a fixed-width binary record stream (LiDAR point data, etc.)
python3 -m pertype columnar-encode points.bin --schema 4,4,4,2 -o points.col
python3 -m pertype columnar-decode points.col -o roundtrip.bin

# CSV: compress a delimited-text table column-major (auto delimiter / line-ending)
python3 -m pertype csv-encode data.csv -o data.csvc
python3 -m pertype csv-decode data.csvc -o roundtrip.csv
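# The --schema 4,4,4,2 argument above lists the byte width of each field in a record.
# What de-interleaving by such a schema means can be sketched in plain Python. This is an
# illustrative stand-in, not the code in pertype/columnar.py, and the point layout is made up.

```python
import struct


def columnar_split(records: bytes, schema):
    """Split fixed-width records into one byte string per field."""
    rec_size = sum(schema)
    if len(records) % rec_size:
        raise ValueError("input is not a whole number of records")
    count = len(records) // rec_size
    columns, offset = [], 0
    for width in schema:
        columns.append(b"".join(
            records[r * rec_size + offset: r * rec_size + offset + width]
            for r in range(count)))
        offset += width
    return columns


def columnar_join(columns, schema):
    """Inverse of columnar_split: re-interleave the fields record by record."""
    count = len(columns[0]) // schema[0]
    out = bytearray()
    for r in range(count):
        for column, width in zip(columns, schema):
            out += column[r * width:(r + 1) * width]
    return bytes(out)


SCHEMA = (4, 4, 4, 2)   # x, y, z as int32 plus a uint16 intensity (hypothetical)
points = [(1000 + 5 * i, 2000 - 3 * i, 150 + i % 4, 40 + i % 7) for i in range(6)]
records = b"".join(struct.pack("<iiiH", *p) for p in points)

columns = columnar_split(records, SCHEMA)
assert columnar_join(columns, SCHEMA) == records
print([len(c) for c in columns])   # bytes per column
```

# Once each field sits in its own column, a per-column delta sees like next to like
# (x after x, y after y) instead of unrelated neighbouring fields.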

Cross-domain benchmark scripts (each compares ours vs the domain's standard codec). They read their data directories from environment variables (no machine-specific paths are baked in) — point them at your own local copies:

export SCI_DATA=/path/to/sci_data    # ecg, enwik8, power CSV, seismic, video/*.y4m, …
export CR2_DIR=/path/to/raws         # Canon CR2 / photo benchmarks
export MUSIC_DIR=/path/to/music      # audio benchmarks
export MOVIES_DIR=/path/to/movies    # lossless-video-vs-FFV1 benchmark

| script | domain | competitors | needs |
|---|---|---|---|
| scripts/image_benchmark.py | icons / graphics | gzip, zstd, PNG | Pillow |
| scripts/image_med_benchmark.py | 2D MED/Paeth prediction | PNG, zstd, xz | Pillow, numpy |
| scripts/cr2_med_benchmark.py | Bayer MED on raw photos | PNG-16, zstd, xz | rawpy, numpy |
| scripts/imagecodec_benchmark.py | shipped image codec (Bayer+RGB+gray) | PNG, zstd, xz, Canon | rawpy, Pillow |
| scripts/scientific_image_benchmark.py | real DICOM / FITS (medical+astronomy) | PNG-16, zstd, xz | pydicom, Pillow |
| scripts/dem_benchmark.py | SRTM terrain elevation (int16) | PNG-16, zstd, xz | Pillow, numpy |
| scripts/hyperspectral_benchmark.py | AVIRIS cube (inter-band delta) | zstd, xz | scipy, numpy |
| scripts/genome_benchmark.py | DNA FASTA (boundary) | zstd, xz, bzip2, 2-bit | numpy |
| scripts/protein_benchmark.py | protein FASTA (boundary) | zstd, xz, bzip2, order-k | numpy |
| scripts/lidar_benchmark.py | LiDAR LAS point cloud (col+delta) | zstd, xz (LAZ ref) | numpy |
| scripts/csv_benchmark.py | delimited-text tables (columnar transpose) | gzip, xz, zstd | numpy |
| scripts/weather_benchmark.py | climate float32 grids (HDF5) → floatcodec | gzip, xz, zstd | h5py, numpy |
| scripts/enwik_benchmark.py | enwik8 Wikipedia (amortized held-out) | gzip, bzip2, xz, zstd, zstd --train | (stdlib + the codec) |
| scripts/kodak_benchmark.py | Kodak 24 lossless image set | PNG, JPEG-XL, WebP-LL | Pillow, imagecodecs |
| scripts/silesia_benchmark.py | Silesia corpus, routed per-type | gzip, bzip2, xz, zstd, zstd --train | pydicom, numpy |
| scripts/cr2_benchmark.py | Canon raw crops | gzip, zstd, PNG-16 | rawpy, numpy |
| scripts/full_raw_benchmark.py | full raw frame | gzip, zstd, PNG-16 | rawpy, numpy |
| scripts/cr2_multiframe.py | raw, many frames | JPEG XL | rawpy, numpy, imagecodecs |
| scripts/audio_benchmark.py | audio (generic codec) | FLAC | soundfile, numpy |
| scripts/audio_codec_benchmark.py | audio (dedicated codec) | FLAC | soundfile, numpy |
| scripts/ecg_ctx_coder.py | biosignal (ECG) | xz | numpy |
| scripts/scidata_ctx_benchmark.py | sensor numeric (int) | gzip, xz | numpy |
| scripts/float_benchmark.py | floating-point (transform proxy) | gzip, zstd, xz | numpy |
| scripts/float_codec_benchmark.py | floating-point (full codec) | zstd, xz | numpy |
| scripts/video_ffv1_benchmark.py | video (full YUV) | FFV1, JPEG XL | imagecodecs, imageio-ffmpeg, numpy |
| scripts/movie_lossless_benchmark.py | real movie frames (+ block-mode mix) | FFV1, JPEG XL | imagecodecs, imageio-ffmpeg, numpy |

License

Dual-licensed. Free and open-source under the GNU Affero General Public License v3.0 or later (LICENSE) — use it, modify it, redistribute it; if you convey it or run it as a network service, you must offer your users the corresponding source under the AGPL.

For proprietary / closed-source / hosted-SaaS use without the AGPL's copyleft and network-source obligations, a commercial license is available — see COMMERCIAL.md. Contributions are accepted under a lightweight CLA so the project can keep offering both (CLA.md). Third-party dependency notices (and the GPL/LGPL caveats for the optional media extras) are in THIRD-PARTY-NOTICES.md.

Dependencies

  • Core text/byte compressor and tests: zero external dependencies (Python 3 stdlib only — codec.py, model.py, tokenizer.py, ctxcoder.py, etc.).
  • audiocodec.py / videocodec.py (the media codecs) and the ctxcoder native path: need numpy. The native hot path also needs gcc (built on import; falls back to pure Python if absent — see below).
  • CLI: video-encode / video-decode need numpy; benchmark uses the gzip and zstd command-line tools. The text train / compress / decompress commands stay zero-dependency.
  • Cross-domain benchmark scripts need the libraries in the table above — install with: pip install pillow rawpy numpy imagecodecs soundfile imageio-ffmpeg (imagecodecs bundles libjxl for JPEG XL; soundfile bundles libsndfile for FLAC; imageio-ffmpeg bundles a static ffmpeg for the FFV1 video baseline). These are only for the optional benchmarks, never the codec itself.

Running the tests

python3 -m pytest -q                 # full suite (106 tests)
python3 -m pytest tests/test_auto.py # one module

Use a Python 3 interpreter that has numpy — the media/image/array test modules (test_auto, test_imagecodec, test_predictors, test_videocodec) import it at module load, so without numpy those modules fail to collect. On some machines bare python is Python 2; prefer python3 (e.g. /usr/bin/python3). pytest is the only test-time dependency beyond numpy: pip install --user pytest. To run just the stdlib-only text/byte core, deselect the numpy modules: python3 -m pytest -q --ignore=tests/test_auto.py --ignore=tests/test_imagecodec.py --ignore=tests/test_predictors.py --ignore=tests/test_videocodec.py.

Native acceleration (the optimised port)

Pure Python validated the ratios; for speed, the hot loops are ported to C (pertype/_native/audio.c), compiled to a shared library by gcc on first import and called via ctypes (no Python.h needed) — see pertype/native.py. Each native function is bit-identical and byte-interchangeable with its pure-Python reference (verified in tests), so output is unchanged and a file compressed on one path decompresses on the other. If gcc/numpy is absent, everything falls back to pure Python (native.HAVE_NATIVE == False), and the text/byte core stays zero-dependency (native is imported lazily).
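The load-or-fall-back pattern described above can be sketched as follows. The library path and the exported symbol name are hypothetical placeholders, not the ones pertype/native.py uses; on a machine without that file the sketch simply takes the pure-Python branch.

```python
import ctypes


def load_native(path: str):
    """Return a configured ctypes handle, or None if the library is unusable."""
    try:
        lib = ctypes.CDLL(path)
        func = lib.delta_encode                 # hypothetical exported symbol
    except (OSError, AttributeError):
        return None
    func.argtypes = [ctypes.c_char_p, ctypes.c_char_p,
                     ctypes.c_size_t, ctypes.c_size_t]
    func.restype = None
    return lib


def delta_encode_py(data: bytes, n: int) -> bytes:
    """Pure-Python reference: difference (mod 256) from the byte n back."""
    return bytes((b - (data[i - n] if i >= n else 0)) & 0xFF
                 for i, b in enumerate(data))


_LIB = load_native("./_native_demo.so")         # hypothetical path, normally absent
HAVE_NATIVE = _LIB is not None


def delta_encode(data: bytes, n: int) -> bytes:
    """Use the C routine when it loaded, otherwise the reference above."""
    if HAVE_NATIVE:
        out = ctypes.create_string_buffer(len(data))
        _LIB.delta_encode(data, out, len(data), n)
        return out.raw
    return delta_encode_py(data, n)


print(HAVE_NATIVE, delta_encode(bytes([10, 12, 14, 16]), 1).hex())
```

Because both branches must produce the same bytes, a test can compare the native result against delta_encode_py on random input whenever HAVE_NATIVE is true.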

Ported so far, with measured speedups:

| primitive | speedup | effect |
|---|---|---|
| audio LMS filter (256-tap) | ~25× | the audio codec's dominant cost |
| audio fixed-2 predictor + adaptive Rice | | removes the remaining Python loops |
| byte-stream delta transform | ~133× | raw/numeric path (42 MB frame delta: seconds → ms) |
| context-adaptive arithmetic coder (ctxcoder) | ~45–60× | the coder that beats xz on ECG: a record went 12.6 s → 0.28 s to encode |
| text/LZ codec arithmetic loop (codec.py) | enc ~27× / dec ~46× | the per-symbol token coder (3 freq models + repeat-offset cache + slot bits), byte-identical |
| LZ match-finder (lz_forward) | ~15× (whole optimal parse) | the 3-byte hash-chain search + _match_len, 61% of the parse; integer-exact candidates → identical tokens. compress of 0.8 MB text: 111 s → 7.6 s |
| video MED reconstruction (med_fill) | ~2.6× decode (motion clips; more on intra-heavy) | the causal per-pixel intra-reconstruction loop in videocodec.decode, byte-identical |
| greedy match-finder + dict matcher (lz_best, dict_match_all) | compress 7.6 s → 2.9 s; train 103 s → 67 s | the per-position search for the greedy/lazy parse (training) and the trained-dictionary longest-match; integer-exact → identical tokens |
| cost-optimal backward DP (lz_dp) | compress 2.9 s → 0.78 s | the parse's DP, on a match-cost lookup table; double arithmetic bit-identical → identical tokens. End-to-end compress of 0.8 MB: 111 s → 0.78 s (~140×). |

The arithmetic coder is pure integer math, so the C port reproduces the Witten–Neal–Cleary state machine and MSB-first bit I/O exactly — its output is byte-identical to the Python coder (verified both directions on random and real data). The same WNC machine now also drives the text/LZ codec (codec.py): its whole per-symbol token loop — three frequency models, the repeat-offset cache, and the length/distance slot bits — is in C, so the entropy stage encodes ~27× / decodes ~46× faster, byte-identical. Net: the FLAC-beating audio codec now does ~12 s of audio in ~0.4 s each way (was minutes), and the context coder is fast enough to use in anger. The entire LZ parse is now native too — the match-finder (lz_forward/lz_best), the trained-dictionary matcher (dict_match_all), and the cost-optimal backward DP (lz_dp) — every stage integer- or bit-identical to the Python reference, so the produced tokens are the same. End-to-end compress of a 0.8 MB text file went from 111 s to 0.78 s (~140×), and the whole compress/decompress hot path now runs in C. The only remaining pure-Python cost is training-side (pattern mining + blob building), not compression.

Tests

python3 -m tests.run            # all tests (no dependencies)
python3 -m tests.run codec      # one module

The codec tests include property-style round-trips over random bytes, empty input, bytes never seen in training, and a numeric/transform round-trip — proving the lossless guarantee.

Results

Ratio = raw ÷ compressed (higher is better). Two corpora: synthetic (scripts/make_corpus.py, reproducible) and real-world files collected from this machine (scripts/collect_corpus.py). The two tell different stories — read both.
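When two ratios are compared as a percentage, there are two common conventions and they give different numbers: the relative gain in ratio, and the relative saving in output size. The helper names below are illustrative; the sample figures are the Kodak ratios quoted earlier (2.51× vs 1.79×).

```python
def ratio(raw_bytes: int, compressed_bytes: int) -> float:
    """Compression ratio as defined above: raw divided by compressed."""
    return raw_bytes / compressed_bytes


def ratio_gain(ours: float, theirs: float) -> float:
    """How much higher our ratio is, relative to theirs."""
    return ours / theirs - 1.0


def size_saving(ours: float, theirs: float) -> float:
    """How much smaller our output is, relative to theirs, for the same input."""
    return 1.0 - theirs / ours


print(f"ratio gain:  {ratio_gain(2.51, 1.79):.1%}")    # 40.2%
print(f"size saving: {size_saving(2.51, 1.79):.1%}")   # 28.7%
```

It is worth checking which of the two a given "+N%" figure uses before comparing rows.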

Real-world corpora (real files, held-out) — the honest test

zstd --train is given its best dictionary size here (the benchmark trains dictionaries at 110 / 256 / 512 KB and reports zstd's cheapest), the symmetric counterpart to our own per-type blob-size validation — so the column below is zstd at its strongest, not a fixed default.

| type | gzip -9 | zstd -19 | zstd --train (best dict) | ours |
|---|---|---|---|---|
| json | 5.70x | 6.18x | 9.95x (256 KB) | 9.39x |
| logs | 7.40x | 7.76x | 14.06x (110 KB) | 15.12x |
| html | 3.86x | 3.98x | 7.08x (110 KB) | 7.55x |
| code (Python) | 3.67x | 3.75x | 6.26x (512 KB) | 5.82x |
| xml | 3.43x | 3.46x | 7.80x (256 KB) | 8.29x |

On real, heterogeneous files we beat plain gzip / zstd -19 by 29–62%, and — after scaling the trained blob to the 512 KB LZ match window — we beat zstd --train on logs (+7%), html (+6%) and XML (+6%) even when zstd picks its best dictionary (verbose, tag/line-repetitive markup is the blob's sweet spot). The blob is prepended to each file's history and shipped once (amortised, like zstd's dictionary), so a larger one just means more cross-file content to match; the validation gate picks the size per type. On logs/html zstd's larger dictionaries are actually worse (110 KB is its best), so we beat its best.

json and code are where zstd still wins. Both are cross-file-repetitive text where zstd's COVER dictionary + FSE shine: json 49.7 KB (256 KB dict) vs our 52.7 KB (a 6% gap), Python source 79.4 KB vs our 85.5 KB (~7%); both still beat plain gzip/zstd by 55–62%.

A controlled experiment pins down why (for json), and it is not the dictionary: feeding zstd's own 256 KB COVER dictionary into our codec gives 54.1 KB, still behind zstd using the identical dictionary, so the gap is our codec's coding efficiency. Two fixes have since closed part of it: a compact varint container header (26 → ~12 B/file) and a deeper repeat-offset cache (depth 3 → 16, catching ~27% of json's ~30% recurring distances). Together they took json from 54.5 KB to 52.7 KB and narrowed the gap to zstd from 4.8 KB to 3.0 KB (−38%).

What remains is fundamental, and a per-token breakdown pins it to one cause: the parser, not the entropy coder. Our literals are already near-optimal (order-0 arithmetic; order-1 context doesn't help on the residual unique strings/numbers), and the distance "extra" bits are measurably near-incompressible (a per-slot context model over them recovers only ~178 B of 11.2 KB; they are genuinely uniform within each octave), so an "FSE offset coder" would buy almost nothing. zstd's edge is its repeat-offset-aware optimal parser: json is fragmented (avg match ~44 B, so ~9.7 K offsets must be coded), and zstd restructures the token sequence to turn more of those matches into near-free repeat-offset hits. Ours prices every match as a full distance, so it can't.

Deepening the hash-chain search (the parse is search-limited) recovers up to ~1 KB on this larger json, shrinking the gap to zstd --train; we now do this adaptively per file (see above) for a measured ~0.5–1.4% on held-out text at bounded cost. Closing the last ~2 KB needs a rep-aware cost-optimal parser: a substantial rewrite of the DP, with no guaranteed win.

The shipped model is large (real html ~1.5 MB), so it only amortizes over many files.

Synthetic corpora — where we win (but it's partly overfit)

| type | gzip -9 | zstd -19 | zstd -19 +dict | ours |
|---|---|---|---|---|
| json | 1.98x | 2.02x | 5.46x | 6.50x |
| logs | 3.80x | 3.99x | 5.95x | 6.27x |
| html | 2.72x | 2.70x | 10.70x | 11.41x |

On the synthetic corpus we beat zstd +dict on all three types — but the synthetic files are highly homogeneous, which flatters our approach. The real-world numbers above are the truer measure; the gap between the two tables is itself the lesson: validate on real data.

Takeaways:

  • We beat standard zstd -19 everywhere (real data: +29–62%), and on the synthetic corpus we beat even zstd --train. The pipeline compounds: trained dictionary, contiguous LZ blob, cost-optimal parse, repeat offsets, arithmetic coding.
  • Against zstd --train on real, heterogeneous data the result is split: we win on logs, html and xml (+6–7%) and reach about 93–94% of its ratio on json and Python source. Our synthetic wins were partly overfit; real files corrected the picture. On json, zstd's remaining edge is its repeat-offset-aware optimal parser rather than its dictionary or its FSE coder.
  • The blob builder and size are chosen per type on a validation slice (naive vs COVER-style coverage, 32–128 KB), so a strategy only helps where it helps and never regresses a type.

Honest costs:

  • Model size grows with the blob and dictionary (real html ~1.1 MB). It ships once and amortizes across many files, but on heterogeneous data that amortizes less well — and it is much larger than zstd's 110 KB dictionary.
  • Training is slow and cost-optimal parsing doesn't scale to large files in pure Python (real html — ~16 KB/file — took many minutes). Compression and decompression of small files are fine; large-file throughput needs work.

Image domain — a cross-domain stress test

Images map out exactly where the approach has value. Each image is decoded to raw pixel bytes and every method compresses identical data; PNG is the lossless-image baseline. Tools: scripts/image_benchmark.py (PIL), scripts/cr2_benchmark.py and scripts/full_raw_benchmark.py (rawpy/LibRaw).

data gzip zstd -19 zstd +dict PNG ours rank
tiny icons (16–96 px, homogeneous) 3.43x 3.60x 4.82x 2.37x 5.39x 1st
flat UI graphics (256 px) 25.90x 30.90x 30.54x 25.70x 30.70x tied top
Canon CR2 raw Bayer (photographic) 1.46x 1.56x 1.52x 1.39x (PNG-16) 2.22x 1st
demosaiced RGB photo (8-bit) 1.73x 1.88x (xz) 2.33x 2.64x 1st
16-bit grayscale (DICOM/FITS-like) 1.27x 1.37x (xz) 1.24x (PNG-16) 1.45x 1st

Both image rows are the dedicated image codec (pertype/imagecodec.py): 2D prediction → adaptive arithmetic coding, with no LZ and no trained model (sensor/photo noise has no exact repeats for LZ to find; prediction + adaptive arithmetic is what helps). It has three modes, each measured on real Canon data and round-trip verified. Every plane picks the best of three coders (1-byte selector): MED and GAP (CALIC's gradient-adjusted predictor) feed the order-2 ctxcoder, while CALIC is a full integrated codec: GAP + per-context bias correction (a running mean prediction error per gradient/texture context, removing GAP's systematic bias) + energy-conditional entropy coding (the magnitude-bucket model is selected by the local gradient energy rather than scan-order history). CALIC wins most planes:

  • Bayer raw — deinterleave RGGB into 4 same-colour sub-planes. 10 full-frame raws (423 MB): 2.22× vs xz 1.81×, Canon's own lossless .CR2 1.57×, PNG-16 1.33× (beats the camera's encoder by +41%).
  • RGB photo — a reversible green-subtract colour transform (G, R−G, B−G) decorrelates the channels, then predict per plane. 8 full-frame demosaiced photos (507 MB): 2.64× vs PNG 2.33×, xz 1.88× (beats PNG by +13%, xz +40%).
  • gray — a single predicted plane, with a per-plane choice of MED / CALIC / RLE and a data-driven threshold scale (so 8-bit, 12/16-bit, and small deltas all track). On real DICOM CT/MR it reaches 4.79× (PNG-16 3.33×, xz 2.78×; +44% over PNG) and on real FITS int16 astronomy 5.54× (PNG 3.94×, xz 5.01×) — both beat everything. The RLE coder is the LZ-style pre-pass: it auto-wins on sparse / label / mask planes (large constant regions, e.g. 127× on a 99.5%-zero image) that a pure predictor can't beat, while CALIC keeps the dense continuous-tone planes. Signed int16 (CT/FITS often go negative) is handled correctly. (scripts/scientific_image_benchmark.py.)
  • volume — a stack of slices (encode_volume): slice 0 direct, each later slice as its inter-slice delta from the previous one. Adjacent CT/MR/FITS slices are highly redundant, so this adds +31% over coding each slice independently.
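The MED predictor named in the modes above is the median edge detector from JPEG-LS (LOCO-I). A minimal sketch of prediction, residual computation and exact reconstruction on a small plane follows; treating out-of-image neighbours as 0 is an assumption of this sketch, and pertype/predictors.py may handle borders differently.

```python
def med_predict(a: int, b: int, c: int) -> int:
    """MED: a = left, b = above, c = above-left."""
    if c >= max(a, b):
        return min(a, b)
    if c <= min(a, b):
        return max(a, b)
    return a + b - c


def _neighbours(plane, y, x):
    """Causal neighbours; pixels outside the image count as 0 (assumption)."""
    a = plane[y][x - 1] if x > 0 else 0
    b = plane[y - 1][x] if y > 0 else 0
    c = plane[y - 1][x - 1] if x > 0 and y > 0 else 0
    return a, b, c


def med_residuals(plane):
    """Prediction error per pixel, in raster order."""
    return [[plane[y][x] - med_predict(*_neighbours(plane, y, x))
             for x in range(len(plane[0]))]
            for y in range(len(plane))]


def med_reconstruct(residuals):
    """Rebuild the plane; each pixel only needs already-decoded neighbours."""
    height, width = len(residuals), len(residuals[0])
    plane = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            plane[y][x] = residuals[y][x] + med_predict(*_neighbours(plane, y, x))
    return plane


# A smooth made-up gradient with one vertical edge.
plane = [[10, 11, 12, 90],
         [11, 12, 13, 91],
         [12, 13, 14, 92]]
residuals = med_residuals(plane)
assert med_reconstruct(residuals) == plane
for row in residuals:
    print(row)
```

Away from the first pixel and the edge, the residuals are small numbers, which is what the downstream context coder then prices cheaply.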

The MED/GAP paths use a native reconstruction (~2 s enc / ~3 s dec per 21-MP frame); CALIC's predict+bias+code loop is sequential (native, ~3 s dec). Exposed on the CLI as image-encode / image-decode (.npy 2D/3D, or .CR2.rimg).

(The raw row is crop-level, ranked among the columns shown; the full-frame comparison against JPEG XL — the real state-of-the-art — is in the bullet below. Canon's own full-frame lossless ≈ 1.6–1.75x. Raw sensor noise is near-incompressible: these ratios are close to the information-theoretic floor.)

The result is consistent with the text findings: we win where redundancy exists — and the transform stage now exposes redundancy we previously couldn't.

  • Icons — we beat everything, including zstd --train and PNG. Tiny files drown PNG in per-file overhead, and PNG compresses each image independently, so it cannot use the shared palette/style across an icon theme; our cross-image trained dictionary can. A genuine niche (sprite atlases, icon themes, map tiles).

  • Flat graphics — we tie zstd and beat PNG, thanks to large LZ-able regions.

  • 2D MED/Paeth prediction — a loss on graphics, a clear win on photographic raw. A shared intra predictor (pertype/predictors.py, MED + Paeth) plus two measure-first benchmarks (scripts/image_med_benchmark.py, cr2_med_benchmark.py) show the data decides, exactly along the predict-vs-LZ line:

    • Graphics (icons): MED hurts. MED→full-codec beats PNG (5.94× vs 4.98×), but our generic codec with no prediction is 6.18× — prediction breaks the exact cross-image repetition the dictionary exploits, so LZ alone wins.
    • Photographic raw (real Canon CR2 Bayer, held-out): MED wins decisively. Deinterleaving the RGGB mosaic into same-colour sub-planes, MED + ctxcoder (pure prediction, no LZ, no trained model) hits 1.99× vs our generic codec 1.76×, xz 1.68×, PNG-16 1.28× — and routing the MED residuals through the LZ codec instead drops to 1.74×, because sensor noise has no exact repeats for LZ to find. Continuous-tone data is where spatial prediction was always meant to win, and on the real raws it does (+13% over our prior best, no model to ship). So the predictor earned a dedicated raw-image path — now built (pertype/imagecodec.py, MED/GAP/CALIC, no LZ; see the raw table above, 2.22× beating Canon's own lossless); on graphics the existing LZ+dictionary codec stays the right tool. (The video intra path uses the same MED via the shared predictors.py, where post-motion-compensation residuals suit it.)
  • Photographic raw — from dead-last to parity with JPEG XL. Raw was our worst case (1.51x, last) until the transform stage: we measured the entropy (10.27 bits/pixel order-0, 6.87 after prediction) and added a reversible per-type transform (here delta(4) then byte-plane split(2)) that decorrelates the 16-bit mosaic before coding. zstd/gzip/PNG can't infer that structure from opaque bytes; our per-type gate discovers it from the data.

    A full 8-frame sweep of real Canon raw vs JPEG XL lossless (cjxl -d 0, the state-of-the-art) — scripts/cr2_multiframe.py:

    |                    | Canon | JPEG XL | ours  | ours+model |
    |--------------------|-------|---------|-------|------------|
    | mean over 8 frames | 1.60x | 1.89x   | 1.90x | 1.86x      |

    We match JPEG XL (1.90x vs 1.89x mean), trading the lead frame-to-frame — ours wins the more-compressible frames, JXL the noisier ones (its learned predictor extracts more from near-pure noise). Counting our shipped ~0.5 MB model, JXL is marginally ahead (1.89x vs 1.86x; it wins 5/8 frames). Both decisively beat Canon's own codec. Caveat: JXL is 1-pass and ~40 s; ours is 2-pass, self-trained per frame, and minutes in pure Python — JXL is far more practical. The result is statistical parity, not a win — but reaching it with a from-scratch byte coder + one auto-discovered transform, no hardcoded image knowledge, is the point.
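For readers unfamiliar with the MED predictor named in the bullets above, here is a minimal sketch of the JPEG-LS / LOCO-I median edge detector with a causal residual round trip. The helper names and the rule that out-of-image neighbours count as 0 are assumptions of this example, not the contents of pertype/predictors.py.

```python
def med(left, above, above_left):
    """JPEG-LS / LOCO-I median edge detector."""
    if above_left >= max(left, above):
        return min(left, above)
    if above_left <= min(left, above):
        return max(left, above)
    return left + above - above_left


def _neighbours(img, x, y):
    # Pixels outside the image count as 0 (an assumption of this sketch).
    left = img[y][x - 1] if x > 0 else 0
    above = img[y - 1][x] if y > 0 else 0
    above_left = img[y - 1][x - 1] if x > 0 and y > 0 else 0
    return left, above, above_left


def med_residuals(img):
    """Prediction residual for every pixel, in raster order."""
    return [[img[y][x] - med(*_neighbours(img, x, y)) for x in range(len(img[0]))]
            for y in range(len(img))]


def med_reconstruct(res):
    """Rebuild the image causally: each pixel only needs already-decoded neighbours."""
    img = [[0] * len(res[0]) for _ in res]
    for y in range(len(res)):
        for x in range(len(res[0])):
            img[y][x] = res[y][x] + med(*_neighbours(img, x, y))
    return img


ramp = [[x for x in range(6)] for _ in range(4)]
assert med_reconstruct(med_residuals(ramp)) == ramp
```

Because the predictor only looks at pixels that precede the current one in raster order, the decoder can recompute the same prediction and add the residual back exactly.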

Audio domain — building a codec that beats FLAC

Lossless audio (16-bit PCM, real music) decoded via libsndfile; FLAC is the purpose-built baseline.

First, the generic codec + transform falls short. The per-type transform auto-selects delta(4)+split(2) and beats gzip/zstd (which are near-helpless on PCM), but FLAC wins decisively — 1.16x vs 1.59x. The reason: a stride-delta is only a 1st-order predictor, and audio rewards adaptive high-order prediction. A simple transform can't reach FLAC.

So we built a dedicated audio codec (pertype/audiocodec.py, scripts/audio_codec_benchmark.py) — Monkey's-Audio-style, all integer and exactly reversible: mid/side → fixed order-2 predictor → cascade of integer sign-sign LMS adaptive filters (16 + 256 + 512 tap) → adaptive Rice. The filters learn online from the reconstructed signal (nothing shipped), and adaptive Rice tracks the per-sample magnitude, beating FLAC's per-partition Rice. Over 10 real tracks (bit-exact verified each):

|      | gzip -9 | zstd -19 | FLAC  | ours  |
|------|---------|----------|-------|-------|
| mean | 1.10x   | 1.12x    | 1.80x | 1.92x |

Ours beats FLAC on 9/10 tracks, mean +7.4% (up to +22%). And against xz directly on the PCM (where music gives LZ nothing to grab): ours 1.96× vs xz 1.24× — +59% on 8/8 tracks. This is the flip side of the power result: high-entropy smooth signals are exactly where prediction crushes a general LZ coder. The third (512-tap) LMS stage added +1.5 points of margin over the prior two-stage cascade (measured on 12 tracks, better on 11/12). Caveats: vs libsndfile's FLAC (the flac -8 CLI may be ~1–3% stronger); measured on 3 s chunks where our adaptive filters only partly converge (full tracks likely favour us more); and pure-Python, so slow — a ratio result, not a fast codec.
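To make the "learns online from the reconstructed signal, nothing shipped" idea concrete, here is a minimal single-stage sketch of an integer sign-sign LMS predictor with a lossless residual round trip. The class name, tap count, shift and unit step size are illustrative assumptions; the actual codec described above cascades three such stages after mid/side and a fixed order-2 predictor.

```python
def _sign(v):
    return (v > 0) - (v < 0)


class SignSignLMS:
    """Integer sign-sign LMS predictor; encoder and decoder run it in lockstep."""

    def __init__(self, taps=16, shift=8):
        self.weights = [0] * taps
        self.history = [0] * taps      # most recent sample first
        self.shift = shift

    def predict(self):
        acc = sum(w * h for w, h in zip(self.weights, self.history))
        return acc >> self.shift       # floor shift, identical on both sides

    def update(self, error, sample):
        s = _sign(error)
        if s:
            # Sign-sign rule: nudge each weight by sign(error) * sign(input).
            self.weights = [w + s * _sign(h)
                            for w, h in zip(self.weights, self.history)]
        self.history = [sample] + self.history[:-1]


def lms_encode(samples, taps=16):
    f, out = SignSignLMS(taps), []
    for x in samples:
        e = x - f.predict()            # residual that would go to the entropy coder
        out.append(e)
        f.update(e, x)
    return out


def lms_decode(residuals, taps=16):
    f, out = SignSignLMS(taps), []
    for e in residuals:
        x = e + f.predict()            # same prediction, so x is recovered exactly
        out.append(x)
        f.update(e, x)
    return out
```

The decoder updates its weights from the same residual and reconstructed sample the encoder used, so both filters stay bit-identical without any side information.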

A second entropy back-end is now selectable (encode(..., coder="ctx")): context-adaptive arithmetic coding (pertype/ctxcoder.py). It does not help here (the LMS cascade already whitens the residual, so Rice's per-sample adaptation wins — 1.84x vs ctx 1.82x over 12 tracks), but it wins decisively on weakly-predicted signals — see the next section.

This is the sharpest version of the unifying lesson. A cheap generic transform closes the gap to a specialist only by as much as the specialist exceeds simple decorrelation — enough for Bayer (→ JPEG-XL parity), not for audio. But a domain-specific adaptive predictor, when the structure demands it, can beat the specialist outright. The architecture tells you which you need: try the cheap transform first; reach for a real predictor only where it doesn't suffice.

Scientific numeric time-series — a reality check

Tested on two real public datasets in exact lossless representations, every result round-trip verified, against gzip/zstd/xz (scripts/scidata_*, scripts/ecg_*). The headline: the same delta + ctxcoder is the right tool across both types tested — it beats xz on ECG and closes most of the gap on repetitive sensor data (and the one apparent "loss" turned out to be a wrong coder choice, not a real limitation).

Repetitive sensor data: the coder mattered, not LZ. UCI household power (2.05 M rows × 7 sensor columns, exact int32 milli-units): 51 % of deltas are exactly zero — long constant runs (appliances off, coarse quantisation). The first pass used the memoryless adaptive Rice coder (delta+Rice = 2.78×) and concluded "this needs LZ, which our fast path lacks". That was wrong about the remedy. Running the order-2 ctxcoder (built for ECG, but never tried here) on the same delta gives 6.27× (beats gzip's 6.15×) — the order-2 context makes a run of zeros cheap (after a zero, the conditioned bucket→0 probability is high), so the 95 %-zero column Sub_1 goes from ~2.5× (Rice) to 41.7× (ctx). It's still short of xz's run-length LZ on those columns (Sub_1 111×; see "Can we beat xz" above), but the headline correction stands: the original "needs LZ" verdict was wrong as a remedy — the right coder more than doubled the ratio. See scripts/scidata_ctx_benchmark.py.

| household power | gzip  | xz -9 | delta+Rice (old) | delta+ctx |
|-----------------|-------|-------|------------------|-----------|
| ratio           | 6.15x | 8.56x | 2.78x            | 6.27x     |

So we now beat gzip and close most of the gap to xz (was 3× behind); xz's stronger LZ + range-coder context still edges us on the very runniest data. (Our own LZ codec is not the answer here: on the full file it gets only 5.82× — worse than delta+ctx and gzip — and its cost-optimal parse is pathologically slow on run-heavy data, ~30 min vs ~4 s, since the all-zeros 3-byte key builds enormous hash chains. The order-2 context coder captures the runs better and faster.) The lesson: the same delta + ctxcoder is the right tool for both repetitive sensor data and smooth biosignals — the earlier "loss" was a wrong coder choice, not a missing LZ stage.

Can we beat xz on this data? Tried, and no — and the diagnosis is precise. Per column, delta+ctx ties xz on the dense columns (G_active 4.9×=4.9×) but loses on the run-heavy ones (Sub_1 41.7× vs xz 111×, G_intensity 6.5× vs 10.9×): xz codes a run of N identical values as one range-coded LZ match, where our coder pays per symbol. A better predictor (2nd-difference, fixed-order-2) and an explicit zero-run-length stage both fail to close it (RLE+ctx 6.29× vs xz 8.55×) — ctx's raw mantissa bits and per-symbol overhead can't match integrated LZ + range coding. Beating xz here would mean reimplementing LZMA; the honest boundary is that xz wins on LZ-friendly repetitive data, we win where prediction beats LZ (below, and audio).
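The effect of conditioning on previous symbols can be estimated without writing a full coder. The sketch below computes the ideal code length of an adaptive order-k model, which an arithmetic coder would approach. The function name and the Laplace-style count initialisation are assumptions for illustration; this is not the ctxcoder implementation.

```python
import math
from collections import defaultdict


def adaptive_cost_bits(symbols, order, alphabet_size):
    """Ideal code length (bits) of an adaptive order-`order` context model.

    Every context keeps per-symbol counts that start at 1 (an assumption of
    this sketch) and are updated after each symbol, as a decoder could mirror.
    """
    counts = defaultdict(lambda: [1] * alphabet_size)
    context = (0,) * order
    bits = 0.0
    for s in symbols:
        table = counts[context]
        bits -= math.log2(table[s] / sum(table))
        table[s] += 1
        if order:
            context = context[1:] + (s,)
    return bits


# Runs of zeros alternating with runs of ones: balanced overall, very predictable locally.
bursty = ([0] * 50 + [1] * 50) * 10
print(adaptive_cost_bits(bursty, 0, 2))   # no context: roughly one bit per symbol
print(adaptive_cost_bits(bursty, 2, 2))   # two-symbol context: far fewer bits
```

A coder that spends at least one bit per symbol cannot go below the sequence length, whereas the context model charges only a fraction of a bit once a run is established.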

Smooth biosignals: a better entropy coder beats xz. PhysioNet Apnea-ECG (8 records, 21 M samples, int16). The diagnosis came from entropy bounds: our memoryless adaptive Rice (6.37 b/s) sat far above the residual's order-0 entropy (5.46 b/s), while the order-1 context entropy — each residual's magnitude conditioned on the previous one — is 5.03 b/s, below xz's 5.39. So the fix was not LZ but a context-adaptive entropy coder (pertype/ctxcoder.py): delta → zigzag → magnitude bucket coded by an adaptive arithmetic model selected by the previous bucket, then raw mantissa bits.

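| Apnea-ECG | gzip  | zstd -19 | xz -9 | ours delta+Rice | ours delta+ctx |
|-----------|-------|----------|-------|-----------------|----------------|
| ratio     | 2.16x | 2.63x    | 2.99x | 2.45x           | 3.16x          |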

We beat xz -9 overall by +7.6% — round-trip verified. The context coder uses an order-2 context (each residual's magnitude bucket conditioned on the previous two buckets); that was chosen by measuring the residual's conditional entropy (order-2 ≈ 4.97 b/s vs order-1's 5.14 and xz's 5.39), and it lifted the ratio from 3.06x. Order-3 and mantissa-bit modelling were measured too and gave too little to justify (sparser contexts / ~0.7%).
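The symbol pipeline described above (delta → zigzag → magnitude bucket selected by the previous two buckets, then raw mantissa bits) can be sketched as follows. The bucket definition (bit length of the zigzagged value) and all function names are assumptions for this example; the arithmetic-coding stage itself is left out.

```python
def zigzag(v):
    """Map signed to unsigned: 0, -1, 1, -2, 2, ... -> 0, 1, 2, 3, 4, ..."""
    return (v << 1) if v >= 0 else ((-v) << 1) - 1


def unzigzag(u):
    return (u >> 1) if u % 2 == 0 else -((u + 1) >> 1)


def split_bucket(u):
    """Return (bucket, mantissa, mantissa_bits) for an unsigned value."""
    bucket = u.bit_length()              # 0 -> 0, 1 -> 1, 2..3 -> 2, 4..7 -> 3, ...
    if bucket <= 1:
        return bucket, 0, 0
    return bucket, u - (1 << (bucket - 1)), bucket - 1


def join_bucket(bucket, mantissa):
    return bucket if bucket <= 1 else (1 << (bucket - 1)) + mantissa


def to_symbols(samples):
    """delta -> zigzag -> (context, bucket, mantissa, mantissa_bits) per sample."""
    out, prev, ctx = [], 0, (0, 0)
    for x in samples:
        bucket, mant, nbits = split_bucket(zigzag(x - prev))
        out.append((ctx, bucket, mant, nbits))   # ctx would select the adaptive model
        prev, ctx = x, (ctx[1], bucket)
    return out


def from_symbols(symbols):
    out, prev = [], 0
    for _ctx, bucket, mant, _nbits in symbols:
        prev += unzigzag(join_bucket(bucket, mant))
        out.append(prev)
    return out


ecg_like = [0, 3, -5, -5, 120, -32768, 32767]
assert from_symbols(to_symbols(ecg_like)) == ecg_like
```

Only the small bucket alphabet is modelled adaptively; the mantissa bits are written verbatim, which keeps the number of contexts manageable for an order-2 model.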

The predictor and the entropy coder interact (the unifying finding). The same context coder narrowed the FLAC win on music (1.82x vs Rice's 1.84x), because the LMS cascade already removes the magnitude-context it exploits, leaving a near-memoryless residual where Rice wins. In short: strong adaptive predictor + Rice ≈ weak predictor + context coder. Both ship as selectable back-ends, chosen per type — Rice for audio, ctx for weakly-predicted signals. The honest boundary: we win where prediction beats LZ (audio, ECG); strong LZ (xz/LZMA) still wins on repetitive/periodic data until our own LZ path is ported to native.

Seismic: prediction crushes LZ (scripts/seismic_benchmark.py). Real broadband seismic waveforms (integer ADC counts from IRIS — the 2010 Chile M8.8 at station ANMO, plus a quiet microseism window; round-trip verified): high-rate, smooth, strongly autocorrelated, with no exact-repeat structure, so LZ coders are nearly helpless while an adaptive predictor + context coder thrives.

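| segment                     | gzip  | zstd  | xz    | ours  |
|-----------------------------|-------|-------|-------|-------|
| quake + aftershocks (416 K) | 1.57× | 1.84× | 2.29× | 6.60× |
| quiet / microseisms (432 K) | 2.42× | 3.29× | 3.73× | 7.36× |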

We beat xz by +97% to +188% (2–3×), the largest xz margin of any dataset here. The winning configuration is the audio codec's fixed-2 + 16/256-tap LMS cascade feeding ctxcoder: seismic is a smooth continuous waveform like music, so those adaptive filters generalise directly (where on ECG they overshot the sharp QRS spikes and a plain delta won). This is the sharpest point on the prediction-friendly map: smooth, high-rate signals are exactly where prediction + context-adaptive entropy beats general LZ coders outright.

Floating-point: handled, and we beat the general codecs (scripts/float_codec_benchmark.py). IEEE-754 floats don't subtract meaningfully in byte space, so integer delta is useless on them — but a Gorilla-style XOR-delta (XOR each value's bytes with the previous value's) leaves slowly-changing floats as mostly-zero bytes, which the LZ + ctxcoder stages then crush. It's now in the transform repertoire (the xor op, with stride-8/4 + byte-plane-split specs) and the proxy-selection gate picks it automatically where it wins. Measured end-to-end in our full codec on held-out float64, chunked into trained files (not just a transform proxy), vs the best general coder per set:

| float64 set            | zstd -19 | xz -9 | ours  | transform picked |
|------------------------|----------|-------|-------|------------------|
| power Voltage (smooth) | 3.55×    | 4.60× | 4.90× | identity         |
| power G_active (jumpy) | 4.00×    | 5.16× | 5.32× | identity         |
| synthetic random-walk  | 1.11×    | 1.28× | 1.30× | xor8 + split8    |
| synthetic 2-freq sine  | 1.05×    | 1.15× | 1.30× | split8           |
| synthetic ramp + noise | 1.60×    | 1.60× | 1.80× | xor8 + split8    |

We beat xz and zstd on all of them. Two float predictors are in the repertoire: the cheap Gorilla XOR-delta (xor) and a full FCM/DFCM value predictor (fcm) — FPC-style: an FCM table predicts the next value from a hash of recent values, a DFCM table predicts the next difference, and per value we XOR with whichever leaves more leading-zero bytes (a 1-byte selector + byte-plane-split residuals the LZ + ctxcoder then crush). The proxy-selection gate picks whichever wins, per type. FCM/DFCM is auto-selected and dominant where value-structure is strong — on a pure linear ramp it crushes the data ~75× over raw bytes (DFCM nails the constant difference), and it wins on a clean single-frequency sine. On the noisier/larger-magnitude real columns and the chunked benchmark above, the gate prefers the simpler transforms (per-file 4096-value chunks limit how much the predictor learns, and bit-level diff prediction weakens across varying float exponents) — and it never regresses, since the gate keeps the best. Honest caveat: smooth float64 is near the entropy floor (~1.3× for anyone — irrational-value mantissas are high-entropy), so the headline is float64 is a handled type we win on, with a real value predictor that shines on structured series (sensor ramps, periodic simulation output). The "detect fixed-precision → scaled int" shortcut still isn't lossless (4.216 has no exact float64).
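A minimal sketch of the Gorilla-style XOR-delta on float64, assuming a stride of 8 bytes (one value) and little-endian packing. The function names are made up for this example and the byte-plane split and entropy stages are omitted.

```python
import struct


def xor_delta(values):
    """XOR each float64's bytes with the previous value's bytes (stride 8)."""
    raw = struct.pack(f"<{len(values)}d", *values)
    out = bytearray(raw[:8])                 # first value is stored as-is
    for i in range(8, len(raw)):
        out.append(raw[i] ^ raw[i - 8])
    return bytes(out)


def xor_undelta(data):
    """Inverse transform: rebuild the bytes front to back, then unpack."""
    out = bytearray(data[:8])
    for i in range(8, len(data)):
        out.append(data[i] ^ out[i - 8])
    return list(struct.unpack(f"<{len(out) // 8}d", bytes(out)))


series = [230.5, 230.5, 230.75, 231.0]
coded = xor_delta(series)
assert xor_undelta(coded) == series          # bit-exact round trip
```

Repeated or slowly changing values share their sign, exponent and leading mantissa bytes, so those positions XOR to zero and become easy targets for the later LZ and context-coding stages.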

Lossless video — the temporal-delta hypothesis

The video pipeline below was developed as an ablation across a series of exploratory scripts, now consolidated into the tested pertype/videocodec.py and retired to git history; scripts/video_ffv1_benchmark.py reproduces the headline FFV1 comparison.

Most lossless video codecs (FFV1, Ut Video) are intra-only: each frame is compressed independently, ignoring temporal redundancy. Hypothesis: a cheap temporal frame-delta (delta with stride = one frame) beats intra-only coding on static/slow content and loses on high motion (where motion compensation is needed). Tested on standard .y4m sequences (luma plane), parsed with numpy — no decoder needed. With no ffmpeg/FFV1 available, per-frame JPEG-XL lossless is the intra baseline; since JXL-lossless is stronger than FFV1's intra, that's a conservative stand-in. The temporal delta is isolated by running the same intra codec on the frame residuals; our native context coder (ctxcoder) also codes the residual stream. 60 frames each, round-trip verified.

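| clip (motion)          | intra-JXL | temporal (best) | verdict         |
|------------------------|-----------|-----------------|-----------------|
| akiyo (static head)    | 2.10 MB   | 1.01 MB (ctx)   | temporal +52%   |
| foreman (pan / medium) | 2.86 MB   | 3.30 MB         | intra wins −16% |
| stefan (high motion)   | 3.25 MB   | 3.82 MB         | intra wins −18% |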

The hypothesis holds exactly: frame-delta is a large win on static content and a loss under motion — because a raw frame-delta can't track moving pixels, so the residual loses the spatial structure the intra codec exploits. That is precisely the boundary where motion compensation is required. One nice secondary result: on the static clip our ctxcoder on the temporal residual (1.01 MB) beats JXL-on-residual (1.23 MB) — the right entropy back-end for a near-zero residual stream.

Motion compensation closes the gap. A raw frame-delta forces a zero motion vector, so it loses where content moves. Block MC — per 16×16 block, search the previous frame in a ±8 window for the min-SAD displacement, then code (motion vector + residual) with ctxcoder — converts the losses into wins/ties (60 frames, round-trip verified):

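| clip                 | intra-JXL | frame-delta    | motion-comp    |
|----------------------|-----------|----------------|----------------|
| akiyo (static)       | 2.10 MB   | 1.01 MB (+52%) | 0.95 MB (+55%) |
| foreman (medium)     | 2.86 MB   | 3.30 MB (−16%) | 2.78 MB (+3%)  |
| stefan (high motion) | 3.25 MB   | 3.82 MB (−18%) | 3.28 MB (−1%)  |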

MC turns foreman's 16% loss into a 3% win and stefan's 18% loss into a tie — beating intra-only JXL (itself stronger than FFV1's intra) on 2 of 3 clips. A wider ±16 search barely moved the numbers, so the residual cost dominates: the remaining stefan gap is occlusion / newly-revealed content that block matching can't predict. This is the same block-search idea as our LZ match-finder, applied across frames.
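The block search described above (per block, scan a window in the previous frame for the minimum-SAD displacement) can be sketched in a few lines. This is an exhaustive search on plain nested lists; the function names are hypothetical, and restricting candidates to blocks that lie fully inside the frame is an assumption of this example.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between a block and its displaced reference."""
    return sum(abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
               for y in range(n) for x in range(n))


def best_motion_vector(cur, ref, bx, by, block=16, search=8):
    """Return (sad, dx, dy) for the best displacement within +/-search pixels."""
    h, w = len(ref), len(ref[0])
    best = (sad(cur, ref, bx, by, 0, 0, block), 0, 0)   # zero MV = plain frame delta
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            inside = (0 <= by + dy and by + dy + block <= h and
                      0 <= bx + dx and bx + dx + block <= w)
            if not inside:
                continue                                 # skip out-of-frame candidates
            cost = sad(cur, ref, bx, by, dx, dy, block)
            if cost < best[0]:
                best = (cost, dx, dy)
    return best


def residual_block(cur, ref, bx, by, dx, dy, n):
    """What the entropy coder would see alongside the motion vector."""
    return [[cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x]
             for x in range(n)] for y in range(n)]
```

The decoder never repeats the search: it reads the motion vector, fetches the same reference block and adds the residual, which is why a stronger search changes the file size but not the round trip.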

Per-block intra/inter mode selection removes that last loss. Each 16×16 block picks the cheaper of INTER (the MC residual) or INTRA (a causal MED / LOCO-I predictor — the JPEG-LS median of left, above and the gradient — within the current frame), so occlusion / newly-revealed blocks with no good past match fall back to intra. The mode bit, motion vectors (inter blocks only) and residual are all ctxcoder-coded. Reconstruction replays the intra pixels causally (intra slots start as a sentinel, so the round-trip genuinely exercises the causal chain) — verified bit-exact.

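| clip                 | intra-JXL | MC       | MC + mode (MED) | intra blocks |
|----------------------|-----------|----------|-----------------|--------------|
| akiyo (static)       | 2.10 MB   | 0.946 MB | 0.945 MB (+55%) | 2%           |
| foreman (medium)     | 2.86 MB   | 2.780 MB | 2.716 MB (+5%)  | 27%          |
| stefan (high motion) | 3.25 MB   | 3.284 MB | 3.172 MB (+2%)  | 41%          |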

Now every clip beats intra-only JXL, including high-motion stefan: MED codes the occlusion blocks well enough that 27–41% of blocks on the motion clips choose intra, all reusing the project's own primitives (the block search mirrors the LZ match-finder; the residual coder is ctxcoder).

Half-pixel motion vectors add the last gain. After the integer search, each block is refined over the 9 half-pel positions around its best integer MV (bilinear interpolation of the previous frame), keeping the lower-SAD one; MVs are then coded in half-pel units. Real motion is rarely integer-aligned, so this shrinks the inter residual on the moving clips:

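| clip            | mode int-MV | mode half-pel | vs intra-JXL |
|-----------------|-------------|---------------|--------------|
| akiyo (static)  | 0.945 MB    | 0.934 MB      | +56%         |
| foreman (pan)   | 2.716 MB    | 2.600 MB      | +9%          |
| stefan (motion) | 3.172 MB    | 3.054 MB      | +6%          |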

Half-pel adds +1–4% on top of mode selection, and by improving inter prediction it lets fewer blocks fall back to intra (foreman 27%→16%). The complete arc — temporal-delta → motion compensation → per-block mode selection → MED intra → half-pel MVs — takes stefan from −18% to +6% and foreman from −16% to +9%, beating intra-only JXL (itself stronger than FFV1's intra) on every clip.

A per-block SKIP mode handles exact-static content. In a lossless codec a block can be skipped — no residual, just a mode flag — only when it is bit-identical to its prediction; the co-located previous block (MV 0) catches static backgrounds. On akiyo's static studio set 56% of blocks skip, for +2.7% (→ +57% vs intra-only). On the real-camera clips, sensor noise means no block is exactly static, so skip is never chosen and costs nothing (foreman/stefan unchanged at +9% / +6%). It's a targeted win for screen content / surveillance / animation, harmless elsewhere.

Quarter-pixel motion vectors refine once more: the sub-pel predictor generalises to a single bilinear sampler in quarter-pel units (integer / half / quarter all special cases), and the search refines integer → half → quarter. On top of half-pel it adds +1.5–2% — akiyo +57%→+58%, foreman +9%→+10%, stefan +6%→+7% vs intra-only JXL — diminishing returns after the half-pel step, as expected. The finished inter-frame coder (MC + quarter-pel + per-block SKIP/INTER/INTRA with MED intra, all ctxcoder-coded, every frame bit-exact) takes stefan from −18% to +7% and foreman from −16% to +10%.
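A single bilinear sampler in quarter-pel units, of which integer and half-pel positions are special cases, might look like the sketch below. The integer rounding rule, the edge clamping and the function name are assumptions for illustration; what matters for a lossless codec is only that encoder and decoder use the same integer arithmetic.

```python
def sample_qpel(ref, x4, y4):
    """Bilinear sample of `ref` at (x4 / 4, y4 / 4), in pure integer arithmetic.

    Assumes 0 <= x4 / 4 <= width - 1 and 0 <= y4 / 4 <= height - 1.
    """
    h, w = len(ref), len(ref[0])
    x0, fx = x4 >> 2, x4 & 3          # integer part and quarter-pel fraction
    y0, fy = y4 >> 2, y4 & 3
    x1 = min(x0 + 1, w - 1)           # clamp the right/bottom neighbour at the edge
    y1 = min(y0 + 1, h - 1)
    top = ref[y0][x0] * (4 - fx) + ref[y0][x1] * fx
    bottom = ref[y1][x0] * (4 - fx) + ref[y1][x1] * fx
    return (top * (4 - fy) + bottom * fy + 8) >> 4   # weights sum to 16; +8 rounds


frame = [[10, 20],
         [30, 40]]
assert sample_qpel(frame, 0, 0) == 10   # integer position returns the pixel itself
assert sample_qpel(frame, 2, 0) == 15   # half-pel between 10 and 20
```

With this sampler a motion vector is just a pair of integers in quarter-pel units, so the integer → half → quarter refinement can reuse the same SAD loop at each step.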

Colour planes. Everything above is luma; the clips are 4:2:0, so U/V are quarter-resolution chroma. Running the full pipeline independently on each plane (60 frames, round-trip verified), the full-YUV totals vs per-plane intra-only JXL:

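| clip            | Y    | U    | V    | total | vs raw YUV          |
|-----------------|------|------|------|-------|---------------------|
| akiyo (static)  | +58% | +49% | +49% | +56%  | 7.15× (intra 3.17×) |
| foreman (pan)   | +10% | +10% | +0%  | +9%   | 2.74× (intra 2.49×) |
| stefan (motion) | +7%  | −4%  | −2%  | +5%   | 2.22× (intra 2.11×) |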

The total beats intra-only on every clip. On static content chroma compresses as well as luma (+49%), but on the motion clips chroma is a wash or slight loss (stefan U/V −2–4%): chroma is smooth and low-energy where intra-JXL is already strong.

Deriving chroma MVs from luma — tested, doesn't help here. The textbook codec design (one mode + one luma MV per block; chroma inherits the mode and a MV scaled by the 4:2:0 subsampling, coding no chroma MV/mode) was the obvious fix for that chroma softness. It instead slightly regressed vs the independent per-plane coder (akiyo −2.7%, foreman −0.2%, stefan −0.5%; 60 frames, round-trip verified). Two reasons: joint coding gives up per-plane SKIP — a chroma block is often exactly static while its luma block moves, which the independent coder skips but the joint coder can't — and it forces a shared mode; meanwhile ctxcoder already codes the small chroma MVs and mode flags so cheaply that the removed "overhead" is negligible. So the chroma softness was never MV/mode cost — chroma is simply smooth content intra-JXL handles well. We keep the independent per-plane coder; the shared-MV design only pays when MV/mode coding is expensive, which it isn't here — a reminder that codec choices are entropy-coder-dependent.

This whole pipeline is now a real codec, not just benchmark scripts: pertype/videocodec.py is a first-class encode / decode (and encode_yuv / decode_yuv) that emits a VID1 container and reconstructs frames from it byte-exact — quarter-pel MC + per-block SKIP/INTER/INTRA (MED), residuals and MVs via ctxcoder, frame 0 all-intra, depends only on numpy + ctxcoder. It's covered by round-trip tests (all block modes, single-frame, fully-static, YUV) and verified on real clips (akiyo 6.58×, foreman 2.30× vs raw luma, 20 frames, bit-exact), and exposed on the CLI (video-encode / video-decode on .y4m).

Real FFV1 baseline. With a static ffmpeg (from the imageio-ffmpeg wheel) we can now compare against FFV1 — the standard intra-only lossless video codec — instead of the JXL stand-in (scripts/video_ffv1_benchmark.py, full YUV, 60 frames, round-trip verified). FFV1 is intra-only, so our motion compensation wins across the board:

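| clip            | raw YUV | FFV1    | ours    | ours vs FFV1 |
|-----------------|---------|---------|---------|--------------|
| akiyo (static)  | 9.12 MB | 2.78 MB | 1.31 MB | +53%         |
| foreman (pan)   | 9.12 MB | 3.69 MB | 3.38 MB | +8%          |
| stefan (motion) | 9.12 MB | 4.52 MB | 4.16 MB | +8%          |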

JXL-intra came out within ~3% of FFV1 throughout, confirming it was a fair stand-in. We beat the real specialist by exploiting the temporal redundancy it ignores.

On real movies — and where the line falls. Beyond the CIF test clips, we ran the codec on decoded frames from a real movie library (scripts/movie_lossless_benchmark.py — decodes a clip to raw 4:2:0 locally with the bundled ffmpeg, then compares ours vs FFV1 / intra-JXL, round-trip verified). The honest framing first: these movies are lossy H.264/MPEG-2, already ~40–200× smaller than raw — a lossless codec can't "beat" the file itself; the fair question is lossless-vs-lossless on the decoded frames. There, the result splits cleanly by content motion, and the codec's own block-mode mix says exactly why:

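| clip (1080p unless noted)   | content     | ours  | FFV1 | ours vs FFV1 | Y blocks (skip/inter/intra) |
|-----------------------------|-------------|-------|------|--------------|-----------------------------|
| Early Man                   | claymation  | 13.3× | 6.0× | +55%         | 39 / 52 / 9                 |
| Girl Who Leapt Through Time | anime       | 7.4×  | 5.0× | +32%         |                             |
| Shrek Forever After         | CGI         | 7.4×  | 6.2× | +16%         |                             |
| Force Awakens               | live action | 10.4× | 9.6× | +7%          |                             |
| Snatch (576p)               | live action | 3.9×  | 3.4× | +12%         |                             |
| Snow White (576p)           | cel         | 4.2×  | 4.1× | +3%          |                             |
| The Gentlemen               | high-motion | 6.5×  | 7.7× | −18%         | 6 / 5 / 89                  |
| Sherlock Holmes (576p)      | high-motion | 5.6×  | 6.0× | −6%          |                             |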

Animation is the niche. Held cels, static backgrounds and slow pans make ~90% of blocks skip-or-inter (Early Man 39% skip + 52% inter) — exactly the temporal redundancy intra-only FFV1 throws away — so our edge grows with how static the content is, peaking at +55% on stop-motion. High-motion live action is the opposite: 89% of blocks fall back to intra (The Gentlemen), where our plain-MED intra path is weaker than FFV1's context-modelled intra, so we lose.

Why a stronger motion search didn't change that. The obvious fix for the high-motion losses was a wider motion search, so we replaced the fixed ±8 integer search with a hierarchical coarse-to-fine search (a ÷2 pyramid level extends the effective range to ~±19 px, then a per-block full-res refinement) — a genuinely stronger, more robust search, no round-trip change (the decoder reconstructs from whatever MVs the encoder picks). It moved high-motion by <1%. The block-mode mix explains it: only 5% of high-motion blocks even use inter prediction — motion search was never the bottleneck. The intra path looked like the real lever — but a measure-first test (scripts/video_intra_benchmark.py) ruled out a predictor upgrade: swapping plain MED for CALIC-class intra gains only ~0.4% where intra dominates (high-motion residuals are near-random), a mode-weighted ≤1.13% overall — below the +3% bar. The high-motion gap to FFV1 is its context-modelled entropy path, the same large rewrite as the json gap (see Status & roadmap). The hierarchical search is kept because it's strictly better and helps fast-but-coherent camera pans, which the CIF clips don't exercise.

Status & roadmap

Validated end-to-end on real data across four domains (every result round-trip verified), fully ported to Rust, and packaged as an installable tool:

  • text / byte — trained per-type dictionary + LZ (with a validation-gated blob) + repeat offsets + cost-optimal parse (adaptive search depth) + arithmetic coding;
  • audio — adaptive sign-sign LMS cascade → adaptive Rice / context coder (beats FLAC, and xz by +59%);
  • video — quarter-pel motion compensation + per-block SKIP/INTER/INTRA (MED intra) + context-adaptive residuals (beats FFV1); a real encode/decode (pertype/videocodec.py) exposed on the CLI;
  • numeric / biosignal — per-type transform + the context-adaptive ctxcoder (beats xz on ECG; 6.27× on repetitive sensor data, beating gzip).

The whole compress/decompress hot path is native (C via ctypes, bit-identical with a pure-Python fallback) — ~140× on text — so the family is fast enough to use.

Shipped as a product (see the Quickstart and License sections above): pip install . gives a pertype command with a unified, self-describing compress/decompress; a complete Rust crate is byte-identical to the Python/C reference for both compress and train (with a standalone pertype binary, cross-compatible with the Python tool); dual-licensed AGPL-3.0-or-later + commercial.

The honest open frontier (full list in TODO.md):

  • Beat zstd --train on json too: we beat it on logs (+7%) and html (+6%); json is the holdout, now 6% behind (52.7 vs 49.7 KB) after a varint header and a depth-16 repeat-offset cache closed 38% of the original gap. The remaining ~3 KB is not the dictionary (proven: zstd's own 256 KB dict in our codec is no better), not the literals (order-0 arithmetic is near-optimal), and not offset entropy coding (the distance extra bits are provably ~incompressible; a per-slot model recovers ~178 B of 11.2 KB). The one remaining candidate is zstd's repeat-offset-aware optimal parser, which turns more of json's many short matches (~9.7 K, avg 44 B) into near-free rep-hits. But a ceiling test shows even that lever is small: only 2.5% of matches have an equal-length alternative at a cached distance (json's matches hit too many distinct blob positions), worth ~186 B. So no single lever closes the ~3 KB gap; it is the diffuse sum of zstd's mature, integrated parser+coder, won't-fix short of reimplementing its sequence coder wholesale. (Deepening the hash-chain search, since the parse is search-limited, recovers up to ~1 KB on this larger json; we now do this adaptively per file, deep on small files and tapering on large, for a measured Pareto +0.5–1.4% on held-out text at bounded cost. The rep-aware parser is the only larger lever, still won't-fix.)
  • Stronger video intra — measured below bar, ruled out. On real movies we beat FFV1 on all animation (peak +55% on stop-motion) and general live action, but lose on high-motion. The obvious fix looked like swapping the plain-MED intra path for a CALIC-class predictor + energy-conditioned coding. Measured first (scripts/video_intra_benchmark.py): CALIC beats MED+ctxcoder by +4.6% on low-motion / +4.2% on medium but only +0.4% on high-motion (the target), and the mode-weighted realized gain is ≤1.13% across clips — below the +3% bar. The lever helps where intra is rare (smooth frames are inter-coded) and does nothing where intra dominates (high-motion residuals are near-random). The high-motion gap to FFV1 is an entropy-model problem, not a predictor one — the same large rewrite as the json gap.
  • More transforms — a 2D MED/Paeth intra predictor (shared image + video). Float is now handled by Gorilla XOR-delta and an FCM/DFCM value predictor (both beat xz/zstd on float64; FCM/DFCM dominates structured series). A native C port of the FCM predictor would remove its pure-Python training-time cost.
  • Distribution — complete. rust/ is a feature-complete safe-Rust port: every codec (arithmetic coder, ctxcoder, CALIC, columnar, float, CSV, image, audio, video, the trained text codec) plus model training, byte-identical to the Python/C reference (the two zlib-using codecs cross-decodable both directions), with rayon block parallelism. It ships a standalone pertype binary (no Python), cross-compatible with the Python tool. Verified in tests/test_rust_port.py; speed in scripts/rust_vs_python*benchmark.py (decode 1–10×, training 11–115×). See rust/README.md. v0.1.0 is released — per-OS binaries (Linux musl / Windows / macOS x86+arm) are attached to the GitHub Release. Registry installs (PyPI / crates.io) are the remaining step.

The throughline: predict per type, then entropy-code. It beats the general-purpose tools, and the domain specialists, exactly where prediction beats LZ — and it says so honestly where LZ wins instead.

