High-speed video mosaic on CUDA: NVDEC/NVENC + torch-lap Hungarian (sibling of mosaic-temporal)
mosaic-temporal-gpu
The high-speed sibling of mosaic-temporal. NVDEC/NVENC + torch-lap Hungarian + on-GPU torch kernels (Triton port queued for v0.2).
⚠️ Status: 0.1.0 release candidate. Public API (run_pipeline), kernels, solver, NVDEC/NVENC bridge, config schema, and CPU-host tests are in place. The remaining work toward 0.1.0 final is the parity-gate CI on a CUDA runner and the bench-spike sign-off on Kaggle T4; see Roadmap. The Quickstart below is the supported API; the 3-stream CUDA-overlap optimization that motivated this repo lands in 0.2 without changing the signature.
Positioning
This is the high-speed build of the video mosaic pipeline. The portable
sibling mosaic-temporal keeps a CPU fallback at every step for users without
a GPU; this repo drops every fallback so the hot path can be NVDEC →
on-GPU torch kernels (Triton from v0.2) → torch-lap → NVENC end-to-end.
The cost is a hard requirement: an NVIDIA GPU with CUDA ≥ 12.0. The benefit
is higher throughput on long clips.
| Feature | mosaic-temporal | mosaic-temporal-gpu (high-speed) |
|---|---|---|
| Hungarian assignment | scipy CPU (default) | torch-linear-assignment (only) |
| Cost matrix | numpy CPU loop | torch.cdist on CUDA (Triton in v0.2) |
| Oklab grid mean | numpy | torch view+reduce on CUDA (Triton v0.2) |
| Video I/O | cv2 PNG round-trip | PyAV NVDEC → ndarray → NVENC |
| RAFT optical flow | CPU torch (slow) | not in v0.1.0 — queued for v0.3 |
| Bit-exact CPU output | yes (bit-exact-cpu) | no (parity gated at SSIM ≥ 0.98) |
| Runtime requirement | none | NVIDIA GPU with CUDA ≥ 12.0 |
If you need the CPU fallback, the bit-exact reference, or Windows/macOS support, use mosaic-temporal. If you have a CUDA GPU and want speed, you're in the right place.
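The "view + reduce" entry in the table refers to averaging pixels per grid cell. As a rough illustration, here is a pure-Python sketch of that reduction for a single-channel image stored as nested lists. The function name `grid_mean` is hypothetical and not part of this package. On a contiguous tensor of shape (H, W, C), a comparable reshape would plausibly be `img.view(gh, ch, gw, cw, C).mean(dim=(1, 3))`, though the package's actual kernel may differ.

```python
from typing import List


def grid_mean(image: List[List[float]], grid_h: int, grid_w: int) -> List[List[float]]:
    """Average a single-channel image over a grid_h x grid_w grid of equal cells."""
    height, width = len(image), len(image[0])
    if height % grid_h or width % grid_w:
        raise ValueError("image size must be divisible by the grid size")
    cell_h, cell_w = height // grid_h, width // grid_w
    means = []
    for gy in range(grid_h):
        row = []
        for gx in range(grid_w):
            # Sum every pixel that falls inside cell (gy, gx).
            total = 0.0
            for y in range(gy * cell_h, (gy + 1) * cell_h):
                for x in range(gx * cell_w, (gx + 1) * cell_w):
                    total += image[y][x]
            row.append(total / (cell_h * cell_w))
        means.append(row)
    return means


image = [
    [1.0, 3.0, 10.0, 20.0],
    [5.0, 7.0, 30.0, 40.0],
]
print(grid_mean(image, grid_h=1, grid_w=2))  # [[4.0, 25.0]]
```

The nested loops make the per-cell reduction explicit; the tensor form does the same work as one reshape plus one mean over the two intra-cell axes.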
Install (once 0.1.0 ships to PyPI)
mosaic-temporal-gpu requires a CUDA build of PyTorch. Install torch first
from the official CUDA wheel index, then install this package:
# 1. CUDA 12.1 wheels (adjust cu121 to your CUDA version)
pip install --index-url https://download.pytorch.org/whl/cu121 torch torchvision
# 2. Pure compute kernels only (no video I/O — no PyAV)
pip install mosaic-temporal-gpu
# 2'. With NVDEC/NVENC video I/O (needs a cuvid-enabled FFmpeg + PyAV).
# The PyPI `av` wheel is software-only — see benchmarks/README.md for
# the FFmpeg+PyAV self-build recipe. The `[nvdec]` extra declares the
# `av>=12` dependency; it does NOT build FFmpeg for you.
pip install "mosaic-temporal-gpu[nvdec]"
If you skip step 1, pip will resolve torch to the CPU build from PyPI
and every CUDA-only call will fail at runtime — there is no CPU fallback on
purpose. NVIDIA driver ≥ R535 and CUDA ≥ 12.0 are prerequisites. Until 0.1.0
ships to PyPI, install from source:
git clone https://github.com/hinanohart/mosaic-temporal-gpu
cd mosaic-temporal-gpu
pip install --index-url https://download.pytorch.org/whl/cu121 torch torchvision
pip install -e ".[dev]"
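Because a wrong torch build only fails at runtime, it can help to check the environment before running anything. The snippet below is a minimal sketch; the helper name `cuda_preflight` is hypothetical and not part of this package. It relies only on `torch.version.cuda` (which is `None` on CPU-only builds) and `torch.cuda.is_available()`.

```python
def cuda_preflight() -> dict:
    """Report whether the installed torch is a CUDA build with a visible GPU."""
    report = {"torch_installed": False, "cuda_build": None, "gpu_visible": False}
    try:
        import torch
    except ImportError:
        # torch is missing entirely; nothing else to check.
        return report
    report["torch_installed"] = True
    report["cuda_build"] = torch.version.cuda          # None on CPU-only builds
    report["gpu_visible"] = bool(torch.cuda.is_available())
    return report


print(cuda_preflight())
```

If `cuda_build` is `None` or `gpu_visible` is `False`, reinstalling torch from the CUDA wheel index shown above is the likely fix.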
Quickstart
from pathlib import Path
from mosaic_temporal_gpu import run_pipeline
stats = run_pipeline(
    input_video=Path("input.mp4"),
    output_video=Path("output.mp4"),
    tile_dir=Path("tiles/"),  # keyword-only
    fps=30,                   # NVENC output frame rate (input fps
                              # auto-detection lands in 0.2)
    cq=19,                    # h264_nvenc constant-quality (lower = better)
)
print(stats)
# {"frames": 720, "width": 1920, "height": 1080,
#  "fps": 30, "active_codec": "h264_cuvid"}
Pass a D1Config to override the default vivid_b preset:
from mosaic_temporal_gpu import D1Config, run_pipeline
run_pipeline(..., config=D1Config.from_preset("vivid_b"))
For 0.1.0 we ship the vivid_b preset only (saturation_boost=2.10,
mkl_hybrid, neighbor_swap_rounds=5). Additional presets and a CLI
front-end are deferred to 0.2 to keep the launch surface narrow.
The active_codec field in the return value is how you confirm NVDEC
engaged on the decode side ("h264_cuvid" / "hevc_cuvid"). A silent
fallback to software decoding is not possible: if the hardware decoder
does not engage, the reader raises before any frame is processed; see
the R8 assertion in io/nvdec.py.
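Callers who want an explicit check in their own code can inspect the returned dictionary. The helper below is a sketch, not part of the package: `assert_hardware_decode` is a hypothetical name, and it assumes that every NVDEC decoder name ends in `_cuvid`, as the two names quoted above do.

```python
def assert_hardware_decode(stats: dict) -> str:
    """Return the active codec, or raise if it is not an NVDEC (*_cuvid) decoder."""
    codec = stats.get("active_codec", "")
    if not codec.endswith("_cuvid"):
        raise RuntimeError(f"expected an NVDEC (*_cuvid) decoder, got {codec!r}")
    return codec


# Same shape as the stats dictionary shown in the Quickstart.
stats = {"frames": 720, "width": 1920, "height": 1080,
         "fps": 30, "active_codec": "h264_cuvid"}
print(assert_hardware_decode(stats))  # h264_cuvid
```

Since the reader already raises on a failed hardware decode, this check is redundant by design; it mainly documents the expectation at the call site.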
What works today (component-level)
import torch
from mosaic_temporal_gpu import D1Config
from mosaic_temporal_gpu.kernels.cost_matrix import compute_cost_matrix_gpu
from mosaic_temporal_gpu.solvers.torch_lap import TorchLapSolver
cfg = D1Config.from_preset("vivid_b") # ✅ schema + preset
cost = compute_cost_matrix_gpu(cells, tiles) # ✅ GPU cost matrix (CUDA req'd)
assignment = TorchLapSolver().solve(cost) # ✅ GPU Hungarian
NvdecReader / NvencWriter are likewise importable and tested on CPU host
for their error paths; full round-trip needs CUDA.
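To make the roles of the cost matrix and the Hungarian solver concrete, here is a tiny CPU-only sketch in plain Python. It assumes the cost is the Euclidean distance between cell and tile colour vectors (the default for `torch.cdist`); the function names are hypothetical, and the brute-force search stands in for the Hungarian algorithm, which finds the same minimum-cost assignment in polynomial time.

```python
import math
from itertools import permutations


def cost_matrix(cells, tiles):
    """Pairwise Euclidean distance between cell and tile colour vectors."""
    return [[math.dist(c, t) for t in tiles] for c in cells]


def brute_force_assignment(cost):
    """Tile index per cell minimising total cost (square matrix, O(n!) search)."""
    n = len(cost)
    best = min(
        permutations(range(n)),
        key=lambda perm: sum(cost[i][perm[i]] for i in range(n)),
    )
    return list(best)


# Three cells and three tiles, described by illustrative colour triples.
cells = [(0.9, 0.0, 0.0), (0.1, 0.0, 0.0), (0.5, 0.0, 0.0)]
tiles = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0)]

cost = cost_matrix(cells, tiles)
print(brute_force_assignment(cost))  # [2, 0, 1]
```

Each tile is used exactly once, which is what distinguishes assignment from a per-cell nearest-neighbour lookup. The factorial search is only viable for a handful of cells, which is why a dedicated solver is used on real grids.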
Parity guarantee (planned, not yet wired)
The release contract is: for each frame of a fixed 24-frame synthetic clip,
SSIM(mosaic_temporal_gpu candidate, mosaicraft CPU reference) ≥ 0.98.
The test exists (tests/test_parity_vs_mosaicraft.py, @pytest.mark.parity),
but GitHub's free runners have no CUDA, so the parity job is not in CI
today — it runs locally on a CUDA host with pytest -m parity. A scheduled
GPU runner (Modal / RunPod) is queued for 0.1.0 final. Output is not bit-exact:
floating-point addition is non-associative, and GPU reductions sum in a
different order than the CPU reference. The SSIM gate is the operative contract.
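The shape of the per-frame gate can be sketched in plain Python. This is a simplification: it computes a single-window (global) SSIM over flattened pixel values in [0, 1], whereas the real parity test presumably uses a windowed SSIM from a library. The function names are hypothetical, and the constants K1 = 0.01 and K2 = 0.03 are the usual SSIM defaults.

```python
from statistics import fmean


def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM over two equal-length flat pixel sequences."""
    if len(x) != len(y) or not x:
        raise ValueError("inputs must be non-empty and the same length")
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = fmean(x), fmean(y)
    var_x = fmean([(a - mu_x) ** 2 for a in x])
    var_y = fmean([(b - mu_y) ** 2 for b in y])
    cov = fmean([(a - mu_x) * (b - mu_y) for a, b in zip(x, y)])
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )


def passes_parity(candidate_frames, reference_frames, threshold=0.98):
    """True only if every frame pair meets the SSIM threshold."""
    if len(candidate_frames) != len(reference_frames):
        raise ValueError("frame counts differ")
    return all(
        global_ssim(c, r) >= threshold
        for c, r in zip(candidate_frames, reference_frames)
    )


frame = [0.0, 0.25, 0.5, 0.75, 1.0, 0.5, 0.25, 0.0]
inverted = [1.0 - p for p in frame]
print(passes_parity([frame], [frame]))     # True
print(passes_parity([frame], [inverted]))  # False
```

Gating on every frame rather than on the mean keeps a single badly diverging frame from hiding behind an otherwise good clip.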
Repository layout
src/mosaic_temporal_gpu/
    __init__.py         # version, public API (run_pipeline, D1Config, exceptions)
    _version.py         # single source of truth
    config.py           # D1Config schema (mirror of mosaic-temporal's GPU-valid subset)
    kernels/
        cost_matrix.py  # GPU cost matrix (torch.cdist on CUDA; Triton port = v0.2)
        oklab_grid.py   # GPU Oklab grid mean (torch view+reduce; Triton port = v0.2)
    solvers/
        torch_lap.py    # torch-linear-assignment wrapper
    io/
        nvdec.py        # PyAV NVDEC reader
        nvenc.py        # PyAV NVENC writer
    pipeline.py         # end-to-end run_pipeline (single CUDA stream;
                        # 3-stream overlap is v0.2)
tests/
    test_parity_vs_mosaicraft.py  # SSIM ≥ 0.98 gate (xfail until CUDA CI)
    test_pipeline_smoke.py        # run_pipeline public-API contract
    test_kernel_shapes.py
    test_solver_torch_lap.py
    test_io_bridges.py
    test_config_schema.py
    test_version_smoke.py
Roadmap
- 0.1.0: run_pipeline() shipped (single-stream NVDEC → mosaic → NVENC); parity gate green on a CUDA runner (Modal / RunPod queued); bench-spike sign-off on Kaggle T4.
- 0.2: 3-stream CUDA overlap (decode | compute | encode); DLPack zero-copy on both ends of the video bridge; Triton kernels for cost matrix and Oklab grid (replace torch.cdist / torch.view+mean once we benchmark a real win); CLI front-end; additional presets.
- 0.3: RAFT optical flow on GPU for temporal coherence; flow_warp module.
- 1.0: Stable parity gate across two driver/CUDA upgrades; one breaking-change cycle behind us.
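The decode | compute | encode overlap planned for 0.2 follows the classic three-stage pipeline pattern. The sketch below illustrates that pattern with CPU threads and bounded queues from the standard library. It is an analogy only: it does not model CUDA streams, every name in it is hypothetical, and error handling is omitted for brevity.

```python
import queue
import threading

_DONE = object()  # sentinel marking end of stream


def _stage(fn, inbox, outbox):
    """Apply fn to each item from inbox and forward it, until the sentinel arrives."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)
            return
        outbox.put(fn(item))


def run_overlapped(frames, decode, compute, encode, depth=2):
    """Run three stages concurrently; bounded queues cap the frames in flight."""
    q0, q1, q2, q3 = (queue.Queue(maxsize=depth) for _ in range(4))

    def feed():
        for frame in frames:
            q0.put(frame)
        q0.put(_DONE)

    threads = [
        threading.Thread(target=feed),
        threading.Thread(target=_stage, args=(decode, q0, q1)),
        threading.Thread(target=_stage, args=(compute, q1, q2)),
        threading.Thread(target=_stage, args=(encode, q2, q3)),
    ]
    for t in threads:
        t.start()

    # Drain on the calling thread so a full output queue never stalls the stages.
    results = []
    while (item := q3.get()) is not _DONE:
        results.append(item)
    for t in threads:
        t.join()
    return results


# Toy stages standing in for decode, mosaic compute, and encode.
print(run_overlapped(range(5), lambda b: b * 2, lambda f: f + 1, str))
# ['1', '3', '5', '7', '9']
```

Each stage is a single thread reading from a FIFO queue, so frame order is preserved. While frame n is in the encode stage, frame n+1 can be in compute and frame n+2 in decode.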
Relation to siblings
- mosaicraft (image mosaic, pure numpy/cv2/scipy): used here as the CPU reference for the parity gate and for the Oklab / MKL OT / Laplacian primitives.
- mosaic-temporal (video mosaic, CPU/GPU dual path): the portable sibling. Same D1Config surface, so config files port between the two.
License
MIT. See LICENSE.