High-speed video mosaic on CUDA: NVDEC/NVENC + torch-lap Hungarian (sibling of mosaic-temporal)

Project description

mosaic-temporal-gpu

The high-speed sibling of mosaic-temporal. NVDEC/NVENC + torch-lap Hungarian + on-GPU torch kernels (Triton port queued for v0.2).

⚠️ Status: 0.1.0 release candidate. Public API (run_pipeline), kernels, solver, NVDEC/NVENC bridge, config schema, and CPU-host tests are in place. The remaining work toward 0.1.0 final is the parity-gate CI on a CUDA runner and the bench-spike sign-off on Kaggle T4 — see Roadmap. The Quickstart below is the supported API; the 3-stream CUDA-overlap optimization that motivated this repo lands in 0.2 without changing the signature.

Positioning

This is the high-speed build of the video mosaic pipeline. The portable sibling mosaic-temporal keeps a CPU fallback at every step for users without a GPU; this repo drops every fallback so the hot path can run NVDEC → GPU kernels (torch in 0.1, Triton queued for v0.2) → torch-lap → NVENC with no CPU fallback branches. The cost is a hard requirement: an NVIDIA GPU with CUDA ≥ 12.0. The benefit is higher throughput on long clips.

| Feature | mosaic-temporal | mosaic-temporal-gpu (high-speed) |
| --- | --- | --- |
| Hungarian assignment | scipy CPU (default) | torch-linear-assignment (only) |
| Cost matrix | numpy CPU loop | torch.cdist on CUDA (Triton in v0.2) |
| Oklab grid mean | numpy | torch view+reduce on CUDA (Triton in v0.2) |
| Video I/O | cv2 PNG round-trip | PyAV NVDEC → ndarray → NVENC |
| RAFT optical flow | CPU torch (slow) | not in v0.1.0; queued for v0.3 |
| Bit-exact CPU output | yes (bit-exact-cpu) | no; parity gated at SSIM ≥ 0.98 |
| Runtime requirement | none | NVIDIA GPU with CUDA ≥ 12.0 |

If you need the CPU fallback, the bit-exact reference, or Windows/macOS support, use mosaic-temporal. If you have a CUDA GPU and want speed, you're in the right place.

Install (once 0.1.0 ships to PyPI)

mosaic-temporal-gpu requires a CUDA build of PyTorch. Install torch first from the official CUDA wheel index, then install this package:

# 1. CUDA 12.1 wheels (adjust cu121 to your CUDA version)
pip install --index-url https://download.pytorch.org/whl/cu121 torch torchvision

# 2. Pure compute kernels only (no video I/O — no PyAV)
pip install mosaic-temporal-gpu

# 2'. With NVDEC/NVENC video I/O (needs a cuvid-enabled FFmpeg + PyAV).
#     The PyPI `av` wheel is software-only — see benchmarks/README.md for
#     the FFmpeg+PyAV self-build recipe. The `[nvdec]` extra declares the
#     `av>=12` dependency; it does NOT build FFmpeg for you.
pip install "mosaic-temporal-gpu[nvdec]"

If you skip step 1, pip will resolve torch to the CPU build from PyPI and every CUDA-only call will fail at runtime — there is no CPU fallback on purpose. NVIDIA driver ≥ R535 and CUDA ≥ 12.0 are prerequisites. Until 0.1.0 ships to PyPI, install from source:

git clone https://github.com/hinanohart/mosaic-temporal-gpu
cd mosaic-temporal-gpu
pip install --index-url https://download.pytorch.org/whl/cu121 torch torchvision
pip install -e ".[dev]"
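
After either install path, a short preflight can catch the two failure modes described above (a CPU-only torch wheel, and a PyAV/FFmpeg build without the NVIDIA codecs) before any video is touched. This is a minimal sketch, not part of the package: the function names are hypothetical, and the codec list is an assumption taken from the names this README mentions. It relies only on `torch.version.cuda`, `torch.cuda.is_available()` and PyAV's `av.codecs_available`.

```python
import importlib


def diagnose_torch(cuda_build, cuda_available, min_cuda=(12, 0)):
    """Return a list of problems (empty list means the torch install looks usable).

    cuda_build     -- value of torch.version.cuda ("12.1", or None on CPU-only builds)
    cuda_available -- value of torch.cuda.is_available()
    """
    if cuda_build is None:
        return ["torch is a CPU-only build; reinstall from the CUDA wheel index"]
    problems = []
    major_minor = tuple(int(part) for part in cuda_build.split(".")[:2])
    if major_minor < min_cuda:
        wanted = ".".join(str(n) for n in min_cuda)
        problems.append(f"torch was built for CUDA {cuda_build}, need >= {wanted}")
    if not cuda_available:
        problems.append("torch has CUDA support but no usable GPU/driver was found")
    return problems


def missing_hw_codecs(available, wanted=("h264_cuvid", "hevc_cuvid", "h264_nvenc")):
    """Return the wanted FFmpeg codec names that are absent from `available`."""
    return sorted(set(wanted) - set(available))


def preflight():
    """Collect problems from the live environment; imports are lazy on purpose."""
    problems = []
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        problems.append("torch is not installed")
    else:
        problems += diagnose_torch(torch.version.cuda, torch.cuda.is_available())
    try:
        av = importlib.import_module("av")
    except ImportError:
        problems.append("PyAV is not installed (only needed for video I/O)")
    else:
        missing = missing_hw_codecs(av.codecs_available)
        if missing:
            problems.append("FFmpeg build lacks: " + ", ".join(missing))
    return problems
```

Calling `preflight()` and printing the returned list is enough for a quick check. Note that a codec name being present only means FFmpeg was compiled with it; it does not prove a GPU session can be opened.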

Quickstart

from pathlib import Path
from mosaic_temporal_gpu import run_pipeline

stats = run_pipeline(
    input_video=Path("input.mp4"),
    output_video=Path("output.mp4"),
    tile_dir=Path("tiles/"),       # keyword-only
    fps=30,                        # NVENC output frame rate (input fps
                                   # auto-detection lands in 0.2)
    cq=19,                         # h264_nvenc constant-quality (lower = better)
)
print(stats)
# {"frames": 720, "width": 1920, "height": 1080,
#  "fps": 30, "active_codec": "h264_cuvid"}

Pass a D1Config to override the default vivid_b preset:

from mosaic_temporal_gpu import D1Config, run_pipeline
run_pipeline(..., config=D1Config.from_preset("vivid_b"))

For 0.1.0 we ship the vivid_b preset only (saturation_boost=2.10, mkl_hybrid, neighbor_swap_rounds=5). Additional presets and a CLI front-end are deferred to 0.2 to keep the launch surface narrow.

The active_codec field in the return value is how you confirm NVDEC engaged on the decode side ("h264_cuvid" / "hevc_cuvid"). A silent software fallback is not possible: if the hardware decoder cannot be opened, the reader raises before any frame is processed (see the R8 assertion in io/nvdec.py).
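
If you want an explicit check in your own scripts or logs, a small helper over the returned dict is enough. This is a hedged sketch: `hardware_decode_engaged` is a hypothetical name, and the only assumption about `run_pipeline` is the stats shape shown in the Quickstart.

```python
def hardware_decode_engaged(stats):
    """True when the reported decoder is one of FFmpeg's *_cuvid (NVDEC) decoders."""
    codec = stats.get("active_codec", "")
    return isinstance(codec, str) and codec.endswith("_cuvid")


# Stats dict copied from the Quickstart output above.
stats = {"frames": 720, "width": 1920, "height": 1080,
         "fps": 30, "active_codec": "h264_cuvid"}

if not hardware_decode_engaged(stats):
    raise RuntimeError(f"unexpected decoder: {stats.get('active_codec')!r}")
```

Since the reader already raises on fallback, this is a belt-and-braces check, mostly useful when stats are collected from many runs.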

What works today (component-level)

import torch
from mosaic_temporal_gpu import D1Config
from mosaic_temporal_gpu.kernels.cost_matrix import compute_cost_matrix_gpu
from mosaic_temporal_gpu.solvers.torch_lap import TorchLapSolver

cfg = D1Config.from_preset("vivid_b")          # ✅ schema + preset
# cells and tiles are assumed to be CUDA tensors prepared by the caller (not shown)
cost = compute_cost_matrix_gpu(cells, tiles)   # ✅ GPU cost matrix (CUDA req'd)
assignment = TorchLapSolver().solve(cost)      # ✅ GPU Hungarian

NvdecReader / NvencWriter are likewise importable and tested on CPU host for their error paths; full round-trip needs CUDA.
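
To make concrete what the cost-matrix and solver steps compute, here is a tiny CPU-only illustration in pure Python. It is not the package's code: distances are plain Euclidean (what `torch.cdist` computes by default), the real kernels may add terms this sketch does not know about, and a brute-force search over permutations stands in for the Hungarian solver. The three-element colour vectors are made-up values.

```python
import itertools
import math


def cost_matrix(cells, tiles):
    """cost[i][j] = Euclidean distance between cell i and tile j."""
    return [[math.dist(c, t) for t in tiles] for c in cells]


def best_assignment(cost):
    """Exhaustively find the one-to-one cell->tile matching with minimal total cost.

    O(n!) and only viable for toy sizes; the Hungarian algorithm finds the
    same optimum in polynomial time. Assumes a square cost matrix.
    """
    n = len(cost)
    best_perm, best_total = None, math.inf
    for perm in itertools.permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    return best_perm, best_total


cells = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
tiles = [(0.0, 0.9, 0.0), (0.1, 0.0, 0.0), (0.9, 0.0, 0.0)]

assignment, total = best_assignment(cost_matrix(cells, tiles))
print(assignment)  # (1, 2, 0): cell 0 -> tile 1, cell 1 -> tile 2, cell 2 -> tile 0
```

Each tile is used exactly once, which is what distinguishes assignment from a per-cell nearest-tile lookup.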

Parity guarantee (planned, not yet wired)

The release contract: for each frame of a fixed 24-frame synthetic clip, SSIM(mosaic_temporal_gpu candidate, mosaicraft CPU reference) ≥ 0.98. The test exists (tests/test_parity_vs_mosaicraft.py, marked @pytest.mark.parity), but GitHub's free runners have no CUDA, so the parity job is not in CI today; run it locally on a CUDA host with pytest -m parity. A scheduled GPU runner (Modal / RunPod) is queued for 0.1.0 final. Output is not bit-exact: floating-point addition is non-associative and GPU reductions do not accumulate in the same order as the CPU reference, so the SSIM gate is the operative contract.
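
Two points in that contract are easy to show in a few lines: why bit-exactness is not a reasonable target, and why the gate applies to every frame rather than to the clip average. The sketch below is illustrative only; `failing_frames` is a hypothetical helper, the scores are made-up numbers, and computing SSIM itself is left to whatever implementation the test uses.

```python
def failing_frames(ssim_scores, threshold=0.98):
    """Return the indices of frames whose SSIM falls below the threshold."""
    return [i for i, score in enumerate(ssim_scores) if score < threshold]


# Floating-point addition is not associative, so summing the same values in a
# different order (as a parallel GPU reduction does) can change the result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False

# Made-up per-frame scores: the mean passes, but frame 2 breaks the contract.
scores = [0.991, 0.987, 0.979, 0.993]
mean_score = sum(scores) / len(scores)
print(mean_score >= 0.98)      # True
print(failing_frames(scores))  # [2]
```

A per-frame gate is stricter than a mean-based one: a single bad frame fails the run even when the rest of the clip is well above the threshold.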

Repository layout

src/mosaic_temporal_gpu/
  __init__.py            # version, public API (run_pipeline, D1Config, exceptions)
  _version.py            # single source of truth
  config.py              # D1Config schema (mirror of mosaic-temporal's GPU-valid subset)
  kernels/
    cost_matrix.py       # GPU cost matrix (torch.cdist on CUDA; Triton port = v0.2)
    oklab_grid.py        # GPU Oklab grid mean (torch view+reduce; Triton port = v0.2)
  solvers/
    torch_lap.py         # torch-linear-assignment wrapper
  io/
    nvdec.py             # PyAV NVDEC reader
    nvenc.py             # PyAV NVENC writer
  pipeline.py            # end-to-end run_pipeline (single CUDA stream;
                         # 3-stream overlap is v0.2)
tests/
  test_parity_vs_mosaicraft.py   # SSIM ≥ 0.98 gate (xfail until CUDA CI)
  test_pipeline_smoke.py         # run_pipeline public-API contract
  test_kernel_shapes.py
  test_solver_torch_lap.py
  test_io_bridges.py
  test_config_schema.py
  test_version_smoke.py

Roadmap

  • 0.1.0: run_pipeline() shipped (single-stream NVDEC → mosaic → NVENC); parity gate green on a CUDA runner (Modal / RunPod queued); bench-spike sign-off on Kaggle T4.
  • 0.2: 3-stream CUDA overlap (decode | compute | encode); DLPack zero-copy on both ends of the video bridge; Triton kernels for cost matrix and Oklab grid (replace torch.cdist / torch.view+mean once we benchmark a real win); CLI front-end; additional presets.
  • 0.3: RAFT optical flow on GPU for temporal coherence; flow_warp module.
  • 1.0: Stable parity gate across two driver/CUDA upgrades; one breaking-change cycle behind us.
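
The decode | compute | encode overlap planned for 0.2 follows the usual three-stage pipeline pattern: while one frame is being encoded, the next is being computed and the one after that decoded. The sketch below shows that pattern with plain Python threads and bounded queues as a CPU analogy. It is not how the package will implement it (that would use CUDA streams), and every name in it is hypothetical.

```python
import queue
import threading

_DONE = object()  # sentinel that flows through every queue to signal end-of-stream


def _worker(fn, inbox, outbox):
    """Apply fn to each item from inbox and forward results, in order, to outbox."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)
            return
        outbox.put(fn(item))


def run_overlapped(packets, decode, compute, encode, depth=2):
    """Run decode -> compute -> encode concurrently, one thread per stage.

    `depth` bounds each queue so a fast stage cannot run arbitrarily far ahead.
    No error handling: an exception inside a stage would stall the pipeline,
    so real code must propagate failures.
    """
    stages = [decode, compute, encode]
    queues = [queue.Queue(maxsize=depth) for _ in range(len(stages) + 1)]

    def feed():
        for packet in packets:
            queues[0].put(packet)
        queues[0].put(_DONE)

    threads = [threading.Thread(target=feed, daemon=True)]
    for fn, inbox, outbox in zip(stages, queues, queues[1:]):
        threads.append(
            threading.Thread(target=_worker, args=(fn, inbox, outbox), daemon=True)
        )
    for thread in threads:
        thread.start()

    results = []
    while True:  # the caller's thread drains the final queue
        item = queues[-1].get()
        if item is _DONE:
            break
        results.append(item)
    for thread in threads:
        thread.join()
    return results


# Stand-in stage functions; real stages would decode, mosaic and encode frames.
frames_out = run_overlapped(range(5), lambda p: p + 1, lambda f: f * 2,
                            lambda m: f"enc:{m}")
print(frames_out)  # ['enc:2', 'enc:4', 'enc:6', 'enc:8', 'enc:10']
```

Output order is preserved because each stage is a single FIFO worker. In CPython the threads only overlap in wall-clock time when the stage functions release the GIL (I/O, or native GPU calls), which is one reason the real design relies on CUDA streams instead.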

Relation to siblings

  • mosaicraft (image mosaic, pure numpy/cv2/scipy) — used here as the CPU reference for the parity gate and for the Oklab / MKL OT / Laplacian primitives.
  • mosaic-temporal (video mosaic, CPU/GPU dual path) — the portable sibling. Same D1Config surface, so config files port between the two.

License

MIT. See LICENSE.
