Lightspeed video decoding directly into tensors!

NeLux

NeLux is a high-performance Python library for video processing, built on FFmpeg with optional NVIDIA hardware acceleration (NVDEC/NVENC). It decodes video directly into ML-ready PyTorch tensors and, in the benchmarks below, outpaces both torchcodec and a raw ffmpeg subprocess on the tested machine.

Originally created by Trentonom0r3.

Installation

pip install nelux

Supported platforms:

Platform	Backends	Notes
Windows x64	CPU + CUDA (NVDEC/NVENC)	Requires FFmpeg DLLs on `PATH` (or pass their directory to `os.add_dll_directory`).
Linux x86_64 (manylinux_2_28+)	CPU + CUDA (NVDEC/NVENC)	Install FFmpeg via `apt install ffmpeg libavcodec62 libavformat62 libavutil60 libswscale9 libavfilter11 libavdevice62`.
macOS arm64 (Apple Silicon, ≥ 12.0)	CPU / MPS (via PyTorch)	Install FFmpeg via `brew install ffmpeg`. No CUDA on macOS.
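
On Windows, the `os.add_dll_directory` route mentioned in the table could look roughly like the sketch below. The helper name `register_ffmpeg_dlls` and the example directory are illustrative assumptions, not part of nelux; `os.add_dll_directory` itself is a standard-library call that exists only on Windows.

```python
import os
import sys
from pathlib import Path


def register_ffmpeg_dlls(ffmpeg_bin_dir):
    """Add a directory of FFmpeg DLLs to the Windows DLL search path.

    Returns the handle from os.add_dll_directory, or None when there is
    nothing to do (non-Windows platform or missing directory).
    """
    directory = Path(ffmpeg_bin_dir)
    if sys.platform != "win32" or not directory.is_dir():
        return None
    # Keep the returned handle alive for as long as the DLLs are needed.
    return os.add_dll_directory(str(directory.resolve()))


# Hypothetical location; point this at wherever your FFmpeg DLLs live.
handle = register_ffmpeg_dlls(r"C:\ffmpeg\bin")
```

A helper like this would presumably need to run before `import nelux`, so the DLLs are already resolvable when the extension module loads.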

PyTorch must be imported before nelux, because the package uses torch's C++ runtime. For CUDA builds, install the matching CUDA torch wheel:

# Linux CUDA
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu130

# macOS / Linux CPU
pip install torch torchvision

Quick Start

Basic Usage

import torch  # must be imported before nelux
from nelux import VideoReader

model = torch.nn.Identity()  # placeholder; substitute your own network

# Open video with hardware acceleration (CPU path also supported)
reader = VideoReader("input.mp4", decode_accelerator="nvdec")

# Iterate frames: HWC uint8 by default (matches torchcodec convention)
for frame in reader:
    print(frame.shape)   # torch.Size([1080, 1920, 3]), i.e. HWC
    print(frame.dtype)   # torch.uint8 for 8-bit sources; torch.int16 for >8-bit
                         # (override with force_8bit=True to always return uint8)

    # Permute to BCHW and cast to float when feeding an ML model
    # (the / 255.0 scaling assumes 8-bit frames)
    bchw = frame.permute(2, 0, 1).unsqueeze(0).to(torch.float32) / 255.0
    output = model(bchw)

Batch Frame Reading

import torch
from nelux import VideoReader

vr = VideoReader("video.mp4")

# Get specific frames — returned tensor is [B, H, W, 3] HWC uint8
batch = vr.get_batch([0, 10, 20])           # [3, H, W, 3]
batch = vr.get_batch_range(0, 100, 10)      # [10, H, W, 3]

# Pythonic slice / list notation (delegates to get_batch under the hood)
batch = vr[0:100:10]                        # [10, H, W, 3]
batch = vr[[-3, -2, -1]]                    # Last 3 frames (negative indexing OK)
single = vr[42]                             # Single frame [H, W, 3]

# Properties
print(len(vr))                              # Total frame count
print(vr.shape)                             # (frames, H, W, 3)
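
To make the slice, list, and negative-index forms above concrete, here is a minimal pure-Python sketch of how such keys can be resolved to absolute frame indices and then deduplicated and sorted before decoding. The function names `normalize_indices` and `decode_order` are hypothetical and this is not nelux's actual implementation; float (timestamp) keys are left out.

```python
def normalize_indices(key, total_frames):
    """Turn an int, slice, or list of ints into absolute frame indices."""
    if isinstance(key, slice):
        return list(range(*key.indices(total_frames)))
    if isinstance(key, int):
        key = [key]
    resolved = []
    for index in key:
        # Negative indices count back from the end, as with Python lists.
        absolute = index + total_frames if index < 0 else index
        if not 0 <= absolute < total_frames:
            raise IndexError(
                f"frame index {index} out of range for {total_frames} frames"
            )
        resolved.append(absolute)
    return resolved


def decode_order(indices):
    """Unique indices in ascending order, so each frame is decoded once."""
    return sorted(set(indices))


print(normalize_indices(slice(0, 100, 10), 300))  # ten indices, 0 through 90
print(normalize_indices([-3, -2, -1], 300))       # [297, 298, 299]
print(decode_order([20, 0, 20, 10]))              # [0, 10, 20]
```

Decoding in ascending order keeps seeks moving forward through the file; the caller can map the decoded frames back to the originally requested order afterwards.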

Video Encoding

import torch
from nelux import VideoReader

reader = VideoReader("input.mp4")

# `create_encoder` pre-configures dimensions / fps / pixel format from the source.
with reader.create_encoder("output.mp4") as enc:
    for frame in reader:
        enc.encode_frame(frame)            # frame is [H, W, 3] uint8

print("Done!")

Features

Core Features

Hardware Acceleration: NVDEC (decode) and NVENC (encode) on NVIDIA GPUs
Native HWC uint8 Output: frames decoded directly into a torch.Tensor of shape [H, W, 3], uint8 for 8-bit sources and int16 for >8-bit sources (force_8bit=True always returns uint8). No implicit float conversion: you cast and normalize on your side based on your model's expected input
CPU Path Matches ffmpeg Byte-for-Byte: pure libswscale convert pipeline, default SWS_BILINEAR flags; output is bit-identical to ffmpeg -vf format=rgb24 on 12 of the 14 tested pix_fmt × colorspace combinations, with yuv420p10le as the exception (see the Quality notes under Benchmarks and CHANGELOG v0.11.0)
Batch Decoding: get_batch([...]) / vr[start:stop:step] returns [B, H, W, 3] with seek minimization, deduplication, and a dedicated random-access decoder

Performance Knobs

prefetch=True: background producer thread (off by default — queue handoff costs ~2.5× more than the parallelism saves at typical decode speeds)
convert_workers=N: explicit control over the CPU convert-pool size. None (default) uses min(hw_concurrency, 16) for throughput-max; 0 matches torchcodec's polite single-threaded convert footprint; positive N pins to that count. See CHANGELOG v0.11.0 for measured tradeoffs
NVDEC fused convert: CUDA kernels for NV12 / P010 → RGB run in-line on the GPU; output stays on cuda:0 as a torch tensor — no CPU round-trip when decode_accelerator="nvdec"
Decoder-side resize=(W, H): CPU path scales in libswscale; NVDEC uses cuvid's built-in resize=WxH — single pass, no post-decode F.interpolate/cv2.resize needed
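
The background producer thread behind `prefetch=True` follows a common pattern: one thread fills a bounded queue while the consumer drains it. The sketch below illustrates that general pattern with the standard library only; the name `prefetched` and the buffer size are assumptions, and nelux's real producer lives in native code.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def prefetched(frames, max_buffered=8):
    """Yield items from `frames` while a background thread reads ahead."""
    buffer = queue.Queue(maxsize=max_buffered)

    def producer():
        for frame in frames:
            buffer.put(frame)  # blocks when the buffer is full
        buffer.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()    # this handoff is the per-frame overhead
        if item is _DONE:
            return
        yield item


print(list(prefetched(range(5))))  # [0, 1, 2, 3, 4]
```

Every frame pays for one queue handoff, which is why read-ahead only helps when the consumer is slow enough for the producer to get ahead of it.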

Supported Codecs & Formats

CPU path supports anything libavcodec can decode (h264, hevc, vp8/9, av1, mpeg2/4, prores, …). NVDEC support depends on your GPU generation.

Feature	CPU path	NVDEC path
Codecs	any libavcodec decoder	H.264, H.265/HEVC, VP9, AV1 (GPU-dependent)
Pixel formats	all common YUV/RGB (yuv420p[10le]/yuv422p/yuv444p[10le]/nv12/nv21/rgb24/bgr24/gbrp/yuvj*)	NV12, P010, P016, YUV444 (8/10/12/16-bit)
Containers	anything libavformat can demux	same
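
For readers unfamiliar with what a YUV → RGB convert does per pixel (the step applied to NV12 / P010 and the planar YUV formats in the table), here is a small reference sketch. It assumes 8-bit limited-range BT.709 input; the source does not say which matrix or range nelux applies to a given file, and the function name is hypothetical.

```python
KR, KB = 0.2126, 0.0722  # BT.709 luma coefficients (assumed colorspace)
KG = 1.0 - KR - KB


def yuv_to_rgb_bt709_limited(y, u, v):
    """Convert one 8-bit limited-range YUV sample to 8-bit RGB."""
    # Expand limited range (16..235 luma, 16..240 chroma) to full scale.
    luma = (y - 16) * 255.0 / 219.0
    cb = (u - 128) * 255.0 / 224.0
    cr = (v - 128) * 255.0 / 224.0
    r = luma + 2.0 * (1.0 - KR) * cr
    b = luma + 2.0 * (1.0 - KB) * cb
    g = luma - (2.0 * KB * (1.0 - KB) / KG) * cb - (2.0 * KR * (1.0 - KR) / KG) * cr
    # Round to the nearest integer and clamp into the 8-bit range.
    return tuple(max(0, min(255, round(c))) for c in (r, g, b))


print(yuv_to_rgb_bt709_limited(16, 128, 128))   # (0, 0, 0)
print(yuv_to_rgb_bt709_limited(235, 128, 128))  # (255, 255, 255)
```

In 4:2:0 layouts such as NV12, one U/V pair is shared by a 2×2 block of luma samples, so a real converter also has to upsample chroma before applying this matrix.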

Benchmarks

H.264 decode → RGB tensor throughput, measured on Intel i9-13900K (24 logical cores) + RTX 3090, Windows 11, FFmpeg 8.x, PyTorch 2.11+cu130, nelux 0.11.0. Each row is the median of 5 fresh subprocess runs, 600 frames per run (300 at 4K). Output is HWC uint8 for every decoder (apples-to-apples).
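
The "median of several runs" methodology can be sketched as below. This is a simplified in-process illustration with a fake workload, not the project's benchmark script (which uses fresh subprocesses per run); `measure_fps`, `median_fps`, and `fake_decode` are hypothetical names.

```python
import statistics
import time


def measure_fps(decode, frames):
    """Time one run of `decode(frames)` and return frames per second."""
    start = time.perf_counter()
    decoded = decode(frames)
    elapsed = time.perf_counter() - start
    return decoded / elapsed


def median_fps(decode, frames=600, runs=5):
    """Median throughput over several runs, which damps outliers."""
    return statistics.median(measure_fps(decode, frames) for _ in range(runs))


def fake_decode(frames):
    """Stand-in workload: pretend each frame costs about 0.1 ms."""
    time.sleep(frames * 1e-4)
    return frames


print(f"{median_fps(fake_decode, frames=100, runs=5):.0f} fps")
```

Using the median rather than the mean keeps a single slow run (cold cache, background activity) from skewing the reported figure.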

Headline: nelux default vs torchcodec vs ffmpeg (CPU)

Resolution	Decoder	fps	CPU% avg	RSS MB
720p	nelux (default)	3422	874	2350
	torchcodec	2924	344	2395
	ffmpeg-rgb24 (subprocess)	2273	—	—
1080p	nelux (default)	2642	1426	4480
	torchcodec	1589	502	4502
	ffmpeg-rgb24 (subprocess)	1102	—	—
4K	nelux (default)	607	1656	9205
	torchcodec	367	487	9098
	ffmpeg-rgb24 (subprocess)	254	—	—

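nelux fans out the libswscale convert across cores, which yields +14–67% fps over torchcodec at every resolution. The trade-off is CPU usage: roughly 2.5–3.4× that of torchcodec (874 vs 344 at 720p, 1656 vs 487 at 4K). RSS is essentially identical.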
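
As a worked example of how such figures follow from the table, the snippet below recomputes the 1080p comparison from the rows above (2642 vs 1589 fps, 1426 vs 502 CPU%). The helper name is illustrative.

```python
def relative_change_percent(new, baseline):
    """Percentage by which `new` exceeds `baseline`."""
    return (new / baseline - 1.0) * 100.0


# 1080p rows from the headline table: nelux (default) vs torchcodec.
fps_gain = relative_change_percent(2642, 1589)
cpu_ratio = 1426 / 502

print(f"{fps_gain:.1f}% fps, {cpu_ratio:.2f}x CPU")  # 66.3% fps, 2.84x CPU
```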

Polite mode (`convert_workers=0`) vs torchcodec

Disabling the convert worker pool matches torchcodec's single-threaded convert architecture exactly. In the table below, fps lands within about 2.5% of torchcodec, CPU within about 8%, and RSS is 7–11% lower:

Resolution	Decoder	fps	CPU%	RSS MB
720p	nelux (`convert_workers=0`)	3167	366	598
	torchcodec	3090	343	673
1080p	nelux (`convert_workers=0`)	1755	435	659
	torchcodec	1728	432	732
4K	nelux (`convert_workers=0`)	394	440	1022
	torchcodec	401	477	1095

So the "+14–67% fps" win above is entirely the convert worker pool — strip it and nelux ≈ torchcodec on every dimension. Pick the trade you want via convert_workers=N.

NVDEC (GPU decode) vs ffmpeg-nvdec

Resolution	Decoder	fps	CPU%	GPU mem MB
720p	nelux (`decode_accelerator="nvdec"`)	1651	45	2886
	ffmpeg-nvdec (subprocess)	1253	—	2902
1080p	nelux	667	40	2911
	ffmpeg-nvdec	592	—	2967
4K	nelux	175	24	3052
	ffmpeg-nvdec	162	—	3259

nelux NVDEC beats raw ffmpeg-nvdec by 8–32% on fps at lower CPU (NV12→RGB runs as a fused CUDA kernel; output stays on the GPU as a torch.Tensor, no host round-trip).

Quality (vs `ffmpeg -vf format=rgb24` reference, 30-frame compare)

Across 14 (pix_fmt × colorspace) combos: 12 / 14 give PSNR = ∞ and SSIM = 1.000, i.e. byte-identical to ffmpeg. The two exceptions are both yuv420p10le (PSNR 47.9–48.3 dB / VMAF 99.85+), where the 10→8-bit downconvert rounds differently from ffmpeg's direct 10-bit YUV→RGB path; the results are perceptually identical. See tests/output/pixfmt_matrix/REPORT.md for the full table.
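
For context on why byte-identical output shows up as PSNR = ∞: PSNR is 10·log10(peak² / MSE), and the mean squared error is zero when every byte matches. A minimal sketch over raw 8-bit buffers follows; it is not the project's comparison script.

```python
import math


def psnr(reference, candidate, peak=255):
    """PSNR in dB between two equal-length sequences of 8-bit samples."""
    if len(reference) != len(candidate) or not reference:
        raise ValueError("buffers must be non-empty and the same length")
    squared_error = sum((a - b) ** 2 for a, b in zip(reference, candidate))
    if squared_error == 0:
        return math.inf  # identical buffers: no error at all
    mse = squared_error / len(reference)
    return 10.0 * math.log10(peak * peak / mse)


frame = bytes([0, 128, 255])
print(psnr(frame, frame))                         # inf
print(psnr(bytes([10, 10]), bytes([11, 9])))      # off by one everywhere
```

An error of one code value on every sample already gives about 48 dB, which is the same ballpark as the yuv420p10le figures quoted above.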

Caveats

ffmpeg-rgb24 CPU% omitted: it runs as a subprocess, the psutil sampler ticks every 100 ms, and the ffmpeg process is short-lived, so the few samples it gets are not representative. fps is still valid (measured from wall-clock time).
Single hardware data point — your numbers will differ. Reproduce with python tests/comprehensive_bench.py --tag mybox (full table) or python tests/bench_thread_modes.py (decoder-architecture comparison).
Default prefetch=False matches typical use. With prefetch=True nelux can squeeze another ~3–5% fps on big clips but burns more RAM (background producer queue).
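
The first caveat is easy to quantify. Assuming the sampler only sees the decode itself (process startup is ignored here), the number of 100 ms ticks that fit into each ffmpeg-rgb24 run follows from the frame counts and fps in the headline table:

```python
SAMPLE_INTERVAL_S = 0.1  # psutil sampler tick from the caveat above


def cpu_samples(frames, fps, interval=SAMPLE_INTERVAL_S):
    """Whole sampler ticks that fit into a run of `frames` at `fps`."""
    return int((frames / fps) / interval)


print(cpu_samples(600, 2273))  # 720p:  2
print(cpu_samples(600, 1102))  # 1080p: 5
print(cpu_samples(300, 254))   # 4K:    11
```

Two to eleven samples per run is too few for a stable average, which is consistent with the decision to omit the column.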

API Reference

VideoReader

VideoReader(
    input_path: str,
    num_threads: int = 0,                          # 0 = ffmpeg auto-detect
    force_8bit: bool = False,                      # cast >8-bit YUV down to uint8
    backend: Literal["pytorch", "numpy"] = "pytorch",
    decode_accelerator: Literal["cpu", "nvdec"] = "cpu",
    cuda_device_index: int = 0,                    # NVDEC GPU index
    resize: tuple[int, int] | None = None,         # decoder-side scale to (W, H)
    prefetch: bool = False,                        # background producer thread
    convert_workers: int | None = None,            # None = min(hw, 16); 0 = polite
)
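
The `convert_workers` rule in the signature comment (None means min(hw, 16), 0 means polite, positive N pins the count) can be written out as a small sketch. The function name is hypothetical, `os.cpu_count()` stands in for the native hardware-concurrency query, and rejecting negative values is an assumption rather than documented behavior.

```python
import os


def resolve_convert_workers(convert_workers=None, cap=16):
    """Map the convert_workers argument to an effective pool size."""
    if convert_workers is None:
        # Default: use the available cores, but never more than `cap`.
        return min(os.cpu_count() or 1, cap)
    if convert_workers < 0:
        raise ValueError("convert_workers must be None or >= 0")
    # 0 means no pool (convert on the decode thread); N pins the pool size.
    return convert_workers


print(resolve_convert_workers())   # machine-dependent, at most 16
print(resolve_convert_workers(0))  # 0
print(resolve_convert_workers(4))  # 4
```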

Properties:

width, height, fps, min_fps, max_fps, duration, total_frames
pixel_format, bit_depth, aspect_ratio, codec, has_audio
properties (full VideoProperties struct)
shape → (frame_count, H, W, 3) (Python-side BatchMixin)
frame_count → cached get_frame_count() (Python-side BatchMixin)

Methods:

read_frame() / __next__() / iteration → next [H, W, 3] frame
frame_at(timestamp: float | index: int) → random-access frame via secondary decoder (doesn't disturb iteration)
__getitem__(int | float | slice | list | range) → single frame OR [B, H, W, 3] batch
decode_batch(indices: list[int]) → C++ batch path; called by get_batch after validation
get_batch(indices) / get_batch_range(start, end, step) → batch decode with seek minimization
set_range(start, end) / reset() → bound iteration
reconfigure(...) → reuse this VideoReader for a different file (10-50× faster than re-constructing)
create_encoder(output_path) → VideoEncoder pre-configured to this source's dims/fps/format
start_prefetch() / stop_prefetch() / prefetch_buffered / is_prefetching → runtime prefetch control
supported_codecs() → list of codecs the linked libavcodec can decode

Documentation

Full Usage Guide - Complete API reference
Changelog - Version history
Benchmarks - Performance comparisons

Requirements

Python: 3.13+ (see pyproject.toml requires-python)
PyTorch: 2.11+ (import torch must precede import nelux; the matching CUDA wheel provides the CUDA runtime nelux's NVDEC path needs)
CUDA: 13.x (for NVDEC/NVENC builds). CPU-only builds drop this requirement.
OS: Windows 10/11, Linux (manylinux_2_28+ / Ubuntu 22.04+), macOS 12+ (Apple Silicon, CPU only)

Building from Source

Build system is scikit-build-core + CMake + Ninja + vcpkg. There is no setup.py.

git clone https://github.com/NevermindNilas/NeLux.git
cd NeLux

# Editable install — invokes scikit-build-core, which configures CMake + Ninja
# and runs vcpkg under the hood. Set NELUX_ENABLE_CUDA=ON to build NVDEC/NVENC.
NELUX_ENABLE_CUDA=ON pip install -e .

# Or build a wheel
NELUX_ENABLE_CUDA=ON pip wheel . -w dist/

On Windows the build needs MSVC 18 (or compatible), and FFmpeg headers/libs under external/ffmpeg/ (see tools/download_ffmpeg.ps1).

See BUILD.md for detailed build instructions.

License

This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0). See the LICENSE file for details.

Acknowledgments

FFmpeg: The backbone of video processing in NeLux
PyTorch: For tensor operations and CUDA integration
Contributors: Thanks to everyone who has contributed to NeLux!

Release history

This version

0.11.0

May 16, 2026

0.10.1

May 10, 2026

0.10.0

May 3, 2026

0.9.2

Apr 22, 2026

0.9.1

Apr 22, 2026

0.9.0

Apr 18, 2026

0.8.10

Apr 4, 2026

0.8.9

Mar 10, 2026

0.8.8

Feb 15, 2026

0.8.7

Feb 1, 2026

0.8.6

Jan 28, 2026

0.8.5

Jan 21, 2026

0.8.4

Jan 18, 2026

0.8.3

Jan 16, 2026

0.8.2

Dec 13, 2025

0.8.1

Dec 8, 2025

0.8.0

Dec 4, 2025

0.7.9

Dec 1, 2025

0.7.8

Nov 28, 2025

0.7.7

Nov 28, 2025

0.7.6

Nov 27, 2025

0.7.5

Nov 27, 2025

Download files

Download the file for your platform.

Source Distributions

No source distribution files are available for this release.

Built Distributions

nelux-0.11.0-cp314-cp314-win_amd64.whl (1.2 MB)

Uploaded May 16, 2026; CPython 3.14; Windows x86-64

nelux-0.11.0-cp314-cp314-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (1.2 MB)

Uploaded May 16, 2026; CPython 3.14; manylinux: glibc 2.27+ x86-64; manylinux: glibc 2.28+ x86-64

nelux-0.11.0-cp314-cp314-macosx_12_0_arm64.whl (804.5 kB)

Uploaded May 16, 2026; CPython 3.14; macOS 12.0+ ARM64

nelux-0.11.0-cp313-cp313-win_amd64.whl (1.2 MB)

Uploaded May 16, 2026; CPython 3.13; Windows x86-64

nelux-0.11.0-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (1.2 MB)

Uploaded May 16, 2026; CPython 3.13; manylinux: glibc 2.27+ x86-64; manylinux: glibc 2.28+ x86-64

nelux-0.11.0-cp313-cp313-macosx_12_0_arm64.whl (804.0 kB)

Uploaded May 16, 2026; CPython 3.13; macOS 12.0+ ARM64

File details

Details for the file nelux-0.11.0-cp314-cp314-win_amd64.whl.

File metadata

Download URL: nelux-0.11.0-cp314-cp314-win_amd64.whl
Upload date: May 16, 2026
Size: 1.2 MB
Tags: CPython 3.14, Windows x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for nelux-0.11.0-cp314-cp314-win_amd64.whl
Algorithm	Hash digest
SHA256	`a9834f04cb685ec843ffebcc4bf140243fbd6a5444bb7b53e9656993dec3aec5`
MD5	`7ff440b8483e13df2f6f9b9f4f88c738`
BLAKE2b-256	`6164367730890371de2f4795a0773be6e825ef17aa338ed712e1fc4312d5a07b`

File details

Details for the file nelux-0.11.0-cp314-cp314-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.

File metadata

Download URL: nelux-0.11.0-cp314-cp314-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
Upload date: May 16, 2026
Size: 1.2 MB
Tags: CPython 3.14, manylinux: glibc 2.27+ x86-64, manylinux: glibc 2.28+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for nelux-0.11.0-cp314-cp314-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
Algorithm	Hash digest
SHA256	`48a705011e4d5316af9781bc9151d7479de775b8b5933c7dbc4247eecbe3006b`
MD5	`de858255aed0947f808ec91fd5a99447`
BLAKE2b-256	`97047a59d8e2c04854a5513e2102259df4c32c0bdbdc97064aa2785562a1ee1c`

File details

Details for the file nelux-0.11.0-cp314-cp314-macosx_12_0_arm64.whl.

File metadata

Download URL: nelux-0.11.0-cp314-cp314-macosx_12_0_arm64.whl
Upload date: May 16, 2026
Size: 804.5 kB
Tags: CPython 3.14, macOS 12.0+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for nelux-0.11.0-cp314-cp314-macosx_12_0_arm64.whl
Algorithm	Hash digest
SHA256	`462bbc78030ae3d5edf53527979a8ca31fa6743a6952a60407f813861f29c0c5`
MD5	`f1a6fbe35717484948a6c0e2540238a2`
BLAKE2b-256	`6e23e3552d9ffe4b2828b0cfe96b664551e1c4d966673ebd007d55d96330f953`

File details

Details for the file nelux-0.11.0-cp313-cp313-win_amd64.whl.

File metadata

Download URL: nelux-0.11.0-cp313-cp313-win_amd64.whl
Upload date: May 16, 2026
Size: 1.2 MB
Tags: CPython 3.13, Windows x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for nelux-0.11.0-cp313-cp313-win_amd64.whl
Algorithm	Hash digest
SHA256	`25accb6bc93f0b32c73cd27810e483a7adf5156f9f35de7d548857e65140437d`
MD5	`9196a49af9eaf6630a01404c090e30a9`
BLAKE2b-256	`f158989ec328aca4a0f346eb74282d22eddce8782307e25ed4b542c3e62a7d61`

File details

Details for the file nelux-0.11.0-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.

File metadata

Download URL: nelux-0.11.0-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
Upload date: May 16, 2026
Size: 1.2 MB
Tags: CPython 3.13, manylinux: glibc 2.27+ x86-64, manylinux: glibc 2.28+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for nelux-0.11.0-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
Algorithm	Hash digest
SHA256	`e6ed6029bb4f85d988b2669d4de66b6b73255ee854d2bd39ba8629aba0378940`
MD5	`60af20e24b545a9e7c125dd4802928ff`
BLAKE2b-256	`54cbf52be1fae4bd52cf11b0d67f13036e828b850ab35a267cad1342ae0e4c3e`

File details

Details for the file nelux-0.11.0-cp313-cp313-macosx_12_0_arm64.whl.

File metadata

Download URL: nelux-0.11.0-cp313-cp313-macosx_12_0_arm64.whl
Upload date: May 16, 2026
Size: 804.0 kB
Tags: CPython 3.13, macOS 12.0+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for nelux-0.11.0-cp313-cp313-macosx_12_0_arm64.whl
Algorithm	Hash digest
SHA256	`dd7fa0d4ecf00412df56afefb55c839c5a43b72aa1710b4f7ba452e8d2ac1706`
MD5	`388a6632d2650d64a526db8b90ee06de`
BLAKE2b-256	`1dc10b0299121c03fa4ebd0e81e6b77bcae70756b98e11f4594a0b1080546460`
