
Lightspeed video decoding directly into tensors!


NeLux

NeLux is a high-performance Python library for video processing, built on FFmpeg with optional NVIDIA hardware acceleration (NVDEC/NVENC). It decodes video directly into ML-ready PyTorch tensors; in the benchmarks below, its default configuration outpaces both torchcodec and the ffmpeg CLI on the tested machine.

Originally created by Trentonom0r3


Installation

pip install nelux

Supported platforms:

Platform | Backends | Notes
Windows x64 | CPU + CUDA (NVDEC/NVENC) | Requires FFmpeg DLLs on PATH (or register their directory with os.add_dll_directory).
Linux x86_64 (manylinux_2_28+) | CPU + CUDA (NVDEC/NVENC) | Install FFmpeg via apt install ffmpeg libavcodec62 libavformat62 libavutil60 libswscale9 libavfilter11 libavdevice62.
macOS arm64 (Apple Silicon, ≥ 12.0) | CPU / MPS (via PyTorch) | Install FFmpeg via brew install ffmpeg. No CUDA on macOS.

PyTorch must be installed and imported before nelux, because the package uses torch's C++ runtime. For CUDA builds, install the matching CUDA torch wheel:

# Linux CUDA
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu130

# macOS / Linux CPU
pip install torch torchvision
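
On Windows, the FFmpeg DLLs have to be discoverable before nelux is imported. The following is a minimal sketch of one way to do that with os.add_dll_directory; the helper name register_ffmpeg_dlls and the example path in the note are hypothetical, not part of nelux.

```python
import os
import sys


def register_ffmpeg_dlls(ffmpeg_bin_dir):
    """Make FFmpeg DLLs discoverable on Windows; do nothing elsewhere.

    Returns the handle from os.add_dll_directory, or None when no
    registration is needed. `ffmpeg_bin_dir` must be an absolute path.
    """
    if sys.platform != "win32":
        # Linux and macOS resolve FFmpeg through the system loader.
        return None
    if not os.path.isdir(ffmpeg_bin_dir):
        raise FileNotFoundError(f"FFmpeg directory not found: {ffmpeg_bin_dir}")
    return os.add_dll_directory(ffmpeg_bin_dir)
```

Call it first (for example with a placeholder path such as C:\ffmpeg\bin), then import torch, then import nelux, so that both the DLL lookup and the import-order requirement are satisfied.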

Quick Start

Basic Usage

import torch  # must be imported before nelux
from nelux import VideoReader

# Placeholder so the snippet runs end to end; substitute your own model
model = torch.nn.Identity()

# Open video with hardware acceleration (CPU path also supported)
reader = VideoReader("input.mp4", decode_accelerator="nvdec")

# Iterate frames: HWC uint8 by default (matches torchcodec convention)
for frame in reader:
    print(frame.shape)   # torch.Size([1080, 1920, 3]), HWC
    print(frame.dtype)   # torch.uint8 for 8-bit sources; torch.int16 for >8-bit
                         # (override with force_8bit=True to always return uint8)

    # Permute to BCHW and cast to float when feeding an ML model
    # (dividing by 255 assumes 8-bit frames)
    bchw = frame.permute(2, 0, 1).unsqueeze(0).to(torch.float32) / 255.0
    output = model(bchw)
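
Because >8-bit sources come back as int16, dividing by 255 is only right for 8-bit frames. The sketch below shows one way to pick the divisor from the bit depth. Both helper names are hypothetical, and it assumes that >8-bit frames hold values in their native range (for example 0..1023 for 10-bit), which is worth verifying for your source.

```python
def max_code_value(bit_depth):
    """Largest sample value representable with `bit_depth` unsigned bits."""
    if bit_depth < 1:
        raise ValueError("bit_depth must be positive")
    return (1 << bit_depth) - 1


def to_model_input(frame, bit_depth=8):
    """Convert one HWC integer frame to a normalized 1xCxHxW float tensor.

    `frame` is expected to be a torch.Tensor as yielded by VideoReader.
    Assumption: samples lie in 0..max_code_value(bit_depth).
    """
    scale = float(max_code_value(bit_depth))
    # Only tensor methods are used, so no torch import is needed here.
    return frame.permute(2, 0, 1).unsqueeze(0).float() / scale
```

Inside the loop this would be called as to_model_input(frame, reader.bit_depth); with force_8bit=True the depth to pass is always 8.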

Batch Frame Reading

import torch
from nelux import VideoReader

vr = VideoReader("video.mp4")

# Get specific frames — returned tensor is [B, H, W, 3] HWC uint8
batch = vr.get_batch([0, 10, 20])           # [3, H, W, 3]
batch = vr.get_batch_range(0, 100, 10)      # [10, H, W, 3]

# Pythonic slice / list notation (delegates to get_batch under the hood)
batch = vr[0:100:10]                        # [10, H, W, 3]
batch = vr[[-3, -2, -1]]                    # Last 3 frames (negative indexing OK)
single = vr[42]                             # Single frame [H, W, 3]

# Properties
print(len(vr))                              # Total frame count
print(vr.shape)                             # (frames, H, W, 3)
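
To illustrate how slice and negative-index notation can be reduced to a plain index list before delegating to get_batch, here is a minimal sketch. resolve_indices is a hypothetical helper and not necessarily how nelux implements __getitem__.

```python
def resolve_indices(key, total_frames):
    """Turn an int, slice, or iterable of ints into absolute frame indices."""
    if isinstance(key, slice):
        # slice.indices clamps start/stop/step to the valid range.
        return list(range(*key.indices(total_frames)))
    if isinstance(key, int):
        key = [key]
    resolved = []
    for i in key:
        j = i + total_frames if i < 0 else i  # negative counts from the end
        if not 0 <= j < total_frames:
            raise IndexError(f"frame index {i} out of range for {total_frames} frames")
        resolved.append(j)
    return resolved
```

For a 300-frame clip, the slice 0:100:10 resolves to ten indices and [-3, -2, -1] resolves to the last three frames.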

Video Encoding

import torch
from nelux import VideoReader

reader = VideoReader("input.mp4")

# `create_encoder` pre-configures dimensions / fps / pixel format from the source.
with reader.create_encoder("output.mp4") as enc:
    for frame in reader:
        enc.encode_frame(frame)            # frame is [H, W, 3] uint8

print("Done!")
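
The same loop can be wrapped in a small helper when frames need processing before they are written. This sketch uses only create_encoder and encode_frame as shown above; the transcode function and its transform hook are hypothetical.

```python
def transcode(reader, output_path, transform=None):
    """Re-encode every frame from `reader`, optionally transforming each one.

    `transform` receives an [H, W, 3] uint8 frame and should return a frame
    of the same shape and dtype. Returns the number of frames written.
    """
    written = 0
    with reader.create_encoder(output_path) as enc:
        for frame in reader:
            enc.encode_frame(frame if transform is None else transform(frame))
            written += 1
    return written
```

A vertical flip, for instance, could be passed as transform=lambda f: f.flip(0), which keeps shape and dtype unchanged.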

Features

Core Features

  • Hardware Acceleration: NVDEC (decode) and NVENC (encode) on NVIDIA GPUs
  • Native HWC uint8 Output: frames decoded directly into a torch.Tensor of shape [H, W, 3] (or [H, W, 3] int16 for >8-bit sources; force_8bit=True clamps to uint8 always). No implicit float conversion — you cast/normalize on your side based on your model's expected input
  • CPU Path Matches ffmpeg Byte-for-Byte: pure libswscale convert pipeline, default SWS_BILINEAR flags; output is bit-identical to ffmpeg -vf format=rgb24 on every common YUV/RGB format (see CHANGELOG v0.11.0)
  • Batch Decoding: get_batch([...]) / vr[start:stop:step] returns [B, H, W, 3] with seek minimization, deduplication, and a dedicated random-access decoder
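
Seek minimization and deduplication can be pictured as a small planning step: decode each unique frame once in ascending order, then gather results back into the requested order. The sketch below shows that idea in general terms; plan_batch is a hypothetical name and this is not nelux's actual implementation.

```python
def plan_batch(indices):
    """Return (decode_order, gather) for a requested list of frame indices.

    decode_order: unique indices sorted ascending, so the decoder only moves
                  forward and each frame is decoded once.
    gather:       for each requested index, its position in decode_order.
    """
    decode_order = sorted(set(indices))
    position = {idx: pos for pos, idx in enumerate(decode_order)}
    gather = [position[idx] for idx in indices]
    return decode_order, gather
```

A request for [20, 0, 20, 10] would decode only three frames and then index the decoded list with the gather positions to rebuild the four-frame batch.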

Performance Knobs

  • prefetch=True: background producer thread (off by default — queue handoff costs ~2.5× more than the parallelism saves at typical decode speeds)
  • convert_workers=N: explicit control over the CPU convert-pool size. None (default) uses min(hw_concurrency, 16) for throughput-max; 0 matches torchcodec's polite single-threaded convert footprint; positive N pins to that count. See CHANGELOG v0.11.0 for measured tradeoffs
  • NVDEC fused convert: CUDA kernels for NV12 / P010 → RGB run in-line on the GPU; output stays on cuda:0 as a torch tensor — no CPU round-trip when decode_accelerator="nvdec"
  • Decoder-side resize=(W, H): CPU path scales in libswscale; NVDEC uses cuvid's built-in resize=WxH — single pass, no post-decode F.interpolate/cv2.resize needed
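
The convert_workers rule described above can be written out as a few lines of Python. This is a sketch of the documented mapping only; the function name is hypothetical, and rejecting negative values is an assumption.

```python
import os


def resolve_convert_workers(convert_workers=None, hw_concurrency=None):
    """Map the convert_workers argument to an effective worker count.

    Returns 0 for the single-threaded ("polite") path, otherwise the pool size.
    """
    if hw_concurrency is None:
        hw_concurrency = os.cpu_count() or 1
    if convert_workers is None:
        # Default: throughput-oriented, capped at 16 workers.
        return min(hw_concurrency, 16)
    if convert_workers < 0:
        # Assumption: negative values are treated as invalid.
        raise ValueError("convert_workers must be None or >= 0")
    return convert_workers
```

On a machine with 24 logical cores, the default resolves to 16 workers, while an explicit 0 or 4 is passed through unchanged.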

Supported Codecs & Formats

CPU path supports anything libavcodec can decode (h264, hevc, vp8/9, av1, mpeg2/4, prores, …). NVDEC support depends on your GPU generation.

Feature | CPU path | NVDEC path
Codecs | any libavcodec decoder | H.264, H.265/HEVC, VP9, AV1 (GPU-dependent)
Pixel formats | all common YUV/RGB (yuv420p[10le]/yuv422p/yuv444p[10le]/nv12/nv21/rgb24/bgr24/gbrp/yuvj*) | NV12, P010, P016, YUV444 (8/10/12/16-bit)
Containers | anything libavformat can demux | same

Benchmarks

H.264 decode → RGB tensor throughput, measured on Intel i9-13900K (24 logical cores) + RTX 3090, Windows 11, FFmpeg 8.x, PyTorch 2.11+cu130, nelux 0.11.0. Each row is the median of 5 fresh subprocess runs, 600 frames per run (300 at 4K). Output is HWC uint8 for every decoder (apples-to-apples).
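
The "median of fresh subprocess runs" methodology can be sketched as follows. This is not the project's benchmark script: median_fps and its child_code argument are hypothetical, and the timing here wraps the whole subprocess (interpreter start-up included), whereas a real harness would more likely time the decode loop inside the child.

```python
import statistics
import subprocess
import sys
import time


def median_fps(child_code, frames, runs=5):
    """Run `child_code` in `runs` fresh Python subprocesses; return median fps."""
    results = []
    for _ in range(runs):
        start = time.perf_counter()
        # A new interpreter per run avoids warm caches carried between runs.
        subprocess.run([sys.executable, "-c", child_code], check=True)
        elapsed = time.perf_counter() - start
        results.append(frames / elapsed)
    return statistics.median(results)
```

Using the median rather than the mean keeps a single slow run (for example one hit by background activity) from skewing the reported figure.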

Headline: nelux default vs torchcodec vs ffmpeg (CPU)

Resolution | Decoder | fps | CPU% avg | RSS MB
720p | nelux (default) | 3422 | 874 | 2350
720p | torchcodec | 2924 | 344 | 2395
720p | ffmpeg-rgb24 (subprocess) | 2273 | - | -
1080p | nelux (default) | 2642 | 1426 | 4480
1080p | torchcodec | 1589 | 502 | 4502
1080p | ffmpeg-rgb24 (subprocess) | 1102 | - | -
4K | nelux (default) | 607 | 1656 | 9205
4K | torchcodec | 367 | 487 | 9098
4K | ffmpeg-rgb24 (subprocess) | 254 | - | -

nelux fans out the libswscale convert across cores, which yields +17–66% fps over torchcodec in the table above (smallest gain at 720p, largest at 1080p). The trade: roughly 2.5–3.4× the CPU usage. RSS is essentially identical.
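
The relative gains can be recomputed directly from the headline table. The snippet below is a small worked calculation over the nelux (default) and torchcodec fps values copied from that table; the names are illustrative.

```python
# (nelux default fps, torchcodec fps) copied from the headline table
HEADLINE_FPS = {"720p": (3422, 2924), "1080p": (2642, 1589), "4K": (607, 367)}


def percent_gain(candidate, baseline):
    """Relative improvement of `candidate` over `baseline`, in percent."""
    return (candidate / baseline - 1.0) * 100.0


gains = {res: round(percent_gain(a, b)) for res, (a, b) in HEADLINE_FPS.items()}
print(gains)
```

The same function applied to the CPU% columns gives the cost side of the trade.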

Polite mode (convert_workers=0) vs torchcodec

Disabling the convert worker pool mirrors torchcodec's single-threaded convert architecture. In the table below, fps lands within ~3% of torchcodec and CPU within ~8%, while nelux's RSS is roughly 7–11% lower:

Resolution | Decoder | fps | CPU% | RSS MB
720p | nelux (convert_workers=0) | 3167 | 366 | 598
720p | torchcodec | 3090 | 343 | 673
1080p | nelux (convert_workers=0) | 1755 | 435 | 659
1080p | torchcodec | 1728 | 432 | 732
4K | nelux (convert_workers=0) | 394 | 440 | 1022
4K | torchcodec | 401 | 477 | 1095

So the fps win in the headline table comes entirely from the convert worker pool: strip it and nelux ≈ torchcodec on every dimension. Pick the trade you want via convert_workers=N.

NVDEC (GPU decode) vs ffmpeg-nvdec

Resolution | Decoder | fps | CPU% | GPU mem MB
720p | nelux (decode_accelerator="nvdec") | 1651 | 45 | 2886
720p | ffmpeg-nvdec (subprocess) | 1253 | - | 2902
1080p | nelux | 667 | 40 | 2911
1080p | ffmpeg-nvdec | 592 | - | 2967
4K | nelux | 175 | 24 | 3052
4K | ffmpeg-nvdec | 162 | - | 3259

nelux NVDEC beats raw ffmpeg-nvdec by 8–32% on fps at lower CPU (NV12→RGB runs as a fused CUDA kernel; output stays on the GPU as a torch.Tensor, no host round-trip).

Quality (vs ffmpeg -vf format=rgb24 reference, 30-frame compare)

Across 14 (pix_fmt × colorspace) combos: 12 / 14 PSNR = ∞, SSIM = 1.000 — byte-identical to ffmpeg. The two exceptions are yuv420p10le (PSNR 47.9–48.3 dB / VMAF 99.85+) where 10→8-bit downconvert rounds differently from ffmpeg's direct 10-bit YUV→RGB path; perceptually identical. See tests/output/pixfmt_matrix/REPORT.md for the full table.
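
For readers unfamiliar with the metric, PSNR for 8-bit samples is 10·log10(255² / MSE), which is why byte-identical output reports PSNR = ∞. A minimal pure-Python sketch follows; it is not the project's test code, and the function name is illustrative.

```python
import math


def psnr(reference, candidate, peak=255):
    """PSNR in dB between two equal-length sequences of 8-bit samples.

    Returns math.inf when the inputs are identical (zero mean squared error).
    """
    if len(reference) != len(candidate):
        raise ValueError("inputs must have the same length")
    if len(reference) == 0:
        raise ValueError("inputs must not be empty")
    mse = sum((a - b) ** 2 for a, b in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak * peak / mse)
```

Both arguments can be bytes objects holding the flattened RGB samples of the two frames being compared.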

Caveats

  • ffmpeg-rgb24 CPU% is omitted: ffmpeg runs as a subprocess, the psutil sampler ticks only every 100 ms, and ffmpeg startup is short, so the few samples it gets are not representative. The fps figures are still valid because they are based on wall-clock time.
  • Single hardware data point — your numbers will differ. Reproduce with python tests/comprehensive_bench.py --tag mybox (full table) or python tests/bench_thread_modes.py (decoder-architecture comparison).
  • Default prefetch=False matches typical use. With prefetch=True nelux can squeeze another ~3–5% fps on big clips but burns more RAM (background producer queue).
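
The background producer queue behind prefetch can be pictured with a generic read-ahead wrapper. This is a sketch of the general pattern using Python's standard library, not nelux's internal implementation; prefetched and max_buffered are hypothetical names.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def prefetched(source, max_buffered=8):
    """Yield items from `source` while a background thread reads ahead."""
    buffer = queue.Queue(maxsize=max_buffered)

    def producer():
        for item in source:
            buffer.put(item)  # blocks while the queue is full
        buffer.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is _DONE:
            return
        yield item
```

Every item crosses the queue once, which is the handoff cost mentioned above, and up to max_buffered items sit in memory at a time. This simple version does not propagate producer exceptions and leaves the producer blocked if the consumer stops early.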

API Reference

VideoReader

VideoReader(
    input_path: str,
    num_threads: int = 0,                          # 0 = ffmpeg auto-detect
    force_8bit: bool = False,                      # cast >8-bit YUV down to uint8
    backend: Literal["pytorch", "numpy"] = "pytorch",
    decode_accelerator: Literal["cpu", "nvdec"] = "cpu",
    cuda_device_index: int = 0,                    # NVDEC GPU index
    resize: tuple[int, int] | None = None,         # decoder-side scale to (W, H)
    prefetch: bool = False,                        # background producer thread
    convert_workers: int | None = None,            # None = min(hw, 16); 0 = polite
)

Properties:

  • width, height, fps, min_fps, max_fps, duration, total_frames
  • pixel_format, bit_depth, aspect_ratio, codec, has_audio
  • properties (full VideoProperties struct)
  • shape → (frame_count, H, W, 3) (Python-side BatchMixin)
  • frame_count → cached get_frame_count() (Python-side BatchMixin)

Methods:

  • read_frame() / __next__() / iteration → next [H, W, 3] frame
  • frame_at(timestamp: float | index: int) → random-access frame via secondary decoder (doesn't disturb iteration)
  • __getitem__(int | float | slice | list | range) → single frame OR [B, H, W, 3] batch
  • decode_batch(indices: list[int]) → C++ batch path; called by get_batch after validation
  • get_batch(indices) / get_batch_range(start, end, step) → batch decode with seek minimization
  • set_range(start, end) / reset() → bound iteration
  • reconfigure(...) → reuse this VideoReader for a different file (10-50× faster than re-constructing)
  • create_encoder(output_path) → VideoEncoder pre-configured to this source's dims/fps/format
  • start_prefetch() / stop_prefetch() / prefetch_buffered / is_prefetching → runtime prefetch control
  • supported_codecs() → list of codecs the linked libavcodec can decode
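
As a usage sketch for the batch methods listed above, the helper below pulls evenly spaced frames from a clip, for example to build a thumbnail strip. It relies only on len(reader) and get_batch(indices); the helper itself is hypothetical.

```python
def sample_evenly(reader, count):
    """Fetch `count` frames spread evenly across the whole clip."""
    total = len(reader)
    if count <= 0 or total == 0:
        raise ValueError("need a positive count and a non-empty video")
    count = min(count, total)
    # Centre each sample inside its own equal-width segment of the clip.
    indices = [(2 * i + 1) * total // (2 * count) for i in range(count)]
    return reader.get_batch(indices)
```

With a real VideoReader the result would be a [count, H, W, 3] batch; the indices are already ascending, so the decoder never has to seek backwards.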


Requirements

  • Python: 3.13+ (see pyproject.toml requires-python)
  • PyTorch: 2.11+ (import torch must precede import nelux; the matching CUDA wheel provides the CUDA runtime nelux's NVDEC path needs)
  • CUDA: 13.x (for NVDEC/NVENC builds). CPU-only builds drop this requirement.
  • OS: Windows 10/11, Linux (manylinux_2_28+ / Ubuntu 22.04+), macOS 12+ (Apple Silicon, CPU only)

Building from Source

Build system is scikit-build-core + CMake + Ninja + vcpkg. There is no setup.py.

git clone https://github.com/NevermindNilas/NeLux.git
cd NeLux

# Editable install — invokes scikit-build-core, which configures CMake + Ninja
# and runs vcpkg under the hood. Set NELUX_ENABLE_CUDA=ON to build NVDEC/NVENC.
NELUX_ENABLE_CUDA=ON pip install -e .

# Or build a wheel
NELUX_ENABLE_CUDA=ON pip wheel . -w dist/

On Windows the build needs MSVC 18 (or compatible), and FFmpeg headers/libs under external/ffmpeg/ (see tools/download_ffmpeg.ps1).

See BUILD.md for detailed build instructions.


License

This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0). See the LICENSE file for details.


Acknowledgments

  • FFmpeg: The backbone of video processing in NeLux
  • PyTorch: For tensor operations and CUDA integration
  • Contributors: Thanks to everyone who has contributed to NeLux!
