Lightspeed video decoding directly into tensors!
NeLux
NeLux is a high-performance Python library for video processing, built on FFmpeg with optional hardware acceleration (NVDEC/NVENC). It decodes video directly into ML-ready PyTorch tensors; the Benchmarks section below compares its throughput against torchcodec and ffmpeg.
Originally created by Trentonom0r3
Installation
pip install nelux
Supported platforms:
| Platform | Backends | Notes |
|---|---|---|
| Windows x64 | CPU + CUDA (NVDEC/NVENC) | Requires FFmpeg DLLs on PATH (or pass their directory to os.add_dll_directory). |
| Linux x86_64 (manylinux_2_28+) | CPU + CUDA (NVDEC/NVENC) | Install FFmpeg via apt install ffmpeg libavcodec62 libavformat62 libavutil60 libswscale9 libavfilter11 libavdevice62. |
| macOS arm64 (Apple Silicon, ≥ 12.0) | CPU / MPS (via PyTorch) | Install FFmpeg via brew install ffmpeg. No CUDA on macOS. |
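On Windows, the FFmpeg DLLs have to be discoverable before nelux is imported. The following is a minimal sketch of the os.add_dll_directory route. The helper name `register_ffmpeg_dlls` and the install path are hypothetical and not part of nelux; adjust the path to wherever your FFmpeg DLLs actually live.

```python
import os
import sys
from pathlib import Path


def register_ffmpeg_dlls(ffmpeg_bin: str) -> bool:
    """Add an FFmpeg bin directory to the Windows DLL search path.

    Returns True if the directory was registered, False otherwise
    (non-Windows platform, or the directory does not exist).
    """
    if sys.platform != "win32":
        return False  # os.add_dll_directory only exists on Windows
    path = Path(ffmpeg_bin)
    if not path.is_dir():
        return False
    os.add_dll_directory(str(path.resolve()))  # requires an absolute path
    return True


# Hypothetical location, used here purely for illustration.
registered = register_ffmpeg_dlls(r"C:\ffmpeg\bin")
# Afterwards: import torch, then import nelux, as described below.
```

The call has to happen before the first `import nelux`, since the DLLs are resolved when the extension module loads.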
PyTorch must be importable before nelux — the package uses torch's C++ runtime. For CUDA builds, install the matching CUDA torch wheel:
# Linux CUDA
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu130
# macOS / Linux CPU
pip install torch torchvision
Quick Start
Basic Usage
import torch # must be imported before nelux
from nelux import VideoReader
# Open video with hardware acceleration (CPU path also supported)
reader = VideoReader("input.mp4", decode_accelerator="nvdec")
# Iterate frames: HWC uint8 by default (matches torchcodec convention)
for frame in reader:
    print(frame.shape)  # torch.Size([1080, 1920, 3]), i.e. HWC
    print(frame.dtype)  # torch.uint8 for 8-bit sources; torch.int16 for >8-bit
                        # (override with force_8bit=True to always return uint8)
    # Permute to BCHW + cast to float when feeding to an ML model.
    # Dividing by 255.0 assumes uint8 frames; `model` is your own torch.nn.Module.
    chw = frame.permute(2, 0, 1).unsqueeze(0).to(torch.float32) / 255.0
    output = model(chw)
Batch Frame Reading
import torch
from nelux import VideoReader
vr = VideoReader("video.mp4")
# Get specific frames — returned tensor is [B, H, W, 3] HWC uint8
batch = vr.get_batch([0, 10, 20]) # [3, H, W, 3]
batch = vr.get_batch_range(0, 100, 10) # [10, H, W, 3]
# Pythonic slice / list notation (delegates to get_batch under the hood)
batch = vr[0:100:10] # [10, H, W, 3]
batch = vr[[-3, -2, -1]] # Last 3 frames (negative indexing OK)
single = vr[42] # Single frame [H, W, 3]
# Properties
print(len(vr)) # Total frame count
print(vr.shape) # (frames, H, W, 3)
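The slice and list forms above are described as delegating to get_batch. As a rough sketch of how such notation can be resolved into explicit frame indices with Python built-ins only (`resolve_indices` is a hypothetical helper for illustration, not nelux's actual implementation):

```python
def resolve_indices(key, frame_count: int) -> list[int]:
    """Turn a slice, range, or list of ints into absolute frame indices."""
    if isinstance(key, slice):
        # slice.indices() clamps start/stop/step to the sequence length
        return list(range(*key.indices(frame_count)))
    resolved = []
    for i in key:
        idx = i + frame_count if i < 0 else i  # negative counts from the end
        if not 0 <= idx < frame_count:
            raise IndexError(f"frame index {i} out of range for {frame_count} frames")
        resolved.append(idx)
    return resolved


every_tenth = resolve_indices(slice(0, 100, 10), 300)  # 10 indices: 0, 10, ..., 90
last_three = resolve_indices([-3, -2, -1], 300)        # [297, 298, 299]
```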
Video Encoding
import torch
from nelux import VideoReader
reader = VideoReader("input.mp4")
# `create_encoder` pre-configures dimensions / fps / pixel format from the source.
with reader.create_encoder("output.mp4") as enc:
    for frame in reader:
        enc.encode_frame(frame)  # frame is [H, W, 3] uint8
print("Done!")
Features
Core Features
- Hardware Acceleration: NVDEC (decode) and NVENC (encode) on NVIDIA GPUs
- Native HWC uint8 Output: frames are decoded directly into a torch.Tensor of shape [H, W, 3] (or [H, W, 3] int16 for >8-bit sources; force_8bit=True always returns uint8). No implicit float conversion: you cast/normalize on your side based on your model's expected input
- CPU Path Matches ffmpeg Byte-for-Byte: pure libswscale convert pipeline, default SWS_BILINEAR flags; output is bit-identical to ffmpeg -vf format=rgb24 on every common YUV/RGB format (see CHANGELOG v0.11.0)
- Batch Decoding: get_batch([...]) / vr[start:stop:step] returns [B, H, W, 3] with seek minimization, deduplication, and a dedicated random-access decoder
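To make "seek minimization" and "deduplication" concrete, here is a sketch of one way a batch request could be planned: decode each unique index once, in ascending order, then scatter the results back into request order. This only illustrates the idea; `plan_batch`, `gather_batch`, and `decode_one` are hypothetical names, and nelux's C++ batch path may work differently.

```python
from typing import Callable, Sequence


def plan_batch(indices: Sequence[int]) -> tuple[list[int], list[int]]:
    """Return (unique indices ascending, slot of each requested index in that list)."""
    unique_sorted = sorted(set(indices))
    slot = {idx: pos for pos, idx in enumerate(unique_sorted)}
    return unique_sorted, [slot[i] for i in indices]


def gather_batch(indices: Sequence[int], decode_one: Callable[[int], object]) -> list:
    unique_sorted, gather = plan_batch(indices)
    decoded = [decode_one(i) for i in unique_sorted]  # one forward-only pass
    return [decoded[g] for g in gather]               # restore request order


# Stand-in decoder that records which frames were actually decoded.
calls = []

def fake_decode(i: int) -> str:
    calls.append(i)
    return f"frame{i}"

batch = gather_batch([20, 0, 20, 10], fake_decode)
```

Frame 20 is requested twice but decoded once, and the decoder only ever moves forward through the file.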
Performance Knobs
- prefetch=True: background producer thread (off by default; the queue handoff costs ~2.5× more than the parallelism saves at typical decode speeds)
- convert_workers=N: explicit control over the CPU convert-pool size. None (default) uses min(hw_concurrency, 16) for maximum throughput; 0 matches torchcodec's polite single-threaded convert footprint; a positive N pins the pool to that count. See CHANGELOG v0.11.0 for measured tradeoffs
- NVDEC fused convert: CUDA kernels for NV12 / P010 → RGB run in-line on the GPU; output stays on cuda:0 as a torch tensor, with no CPU round-trip when decode_accelerator="nvdec"
- Decoder-side resize=(W, H): the CPU path scales in libswscale; NVDEC uses cuvid's built-in resize=WxH. Single pass, no post-decode F.interpolate / cv2.resize needed
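The three convert_workers modes described above can be summarized in a few lines. This is a sketch of the documented mapping only; the function name is hypothetical, `os.cpu_count()` stands in for hw_concurrency, and rejecting negative values is an assumption rather than documented nelux behavior.

```python
from __future__ import annotations

import os


def resolve_convert_workers(convert_workers: int | None) -> int:
    """Map the documented convert_workers values to a worker count.

    None       -> min(hardware concurrency, 16)
    0          -> no pool (convert on the decode thread, "polite" mode)
    positive N -> exactly N workers
    """
    if convert_workers is None:
        return min(os.cpu_count() or 1, 16)
    if convert_workers < 0:
        # Assumption for this sketch: negative values are treated as invalid.
        raise ValueError("convert_workers must be None or >= 0")
    return convert_workers


default_workers = resolve_convert_workers(None)
polite_workers = resolve_convert_workers(0)
```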
Supported Codecs & Formats
CPU path supports anything libavcodec can decode (h264, hevc, vp8/9, av1, mpeg2/4, prores, …). NVDEC support depends on your GPU generation.
| Feature | CPU path | NVDEC path |
|---|---|---|
| Codecs | any libavcodec decoder | H.264, H.265/HEVC, VP9, AV1 (GPU-dependent) |
| Pixel formats | all common YUV/RGB (yuv420p[10le]/yuv422p/yuv444p[10le]/nv12/nv21/rgb24/bgr24/gbrp/yuvj*) | NV12, P010, P016, YUV444 (8/10/12/16-bit) |
| Containers | anything libavformat can demux | same |
Benchmarks
H.264 decode → RGB tensor throughput, measured on Intel i9-13900K (24 logical cores) + RTX 3090, Windows 11, FFmpeg 8.x, PyTorch 2.11+cu130, nelux 0.11.0. Each row is the median of 5 fresh subprocess runs, 600 frames per run (300 at 4K). Output is HWC uint8 for every decoder (apples-to-apples).
Headline: nelux default vs torchcodec vs ffmpeg (CPU)
| Resolution | Decoder | fps | CPU% avg | RSS MB |
|---|---|---|---|---|
| 720p | nelux (default) | 3422 | 874 | 2350 |
| 720p | torchcodec | 2924 | 344 | 2395 |
| 720p | ffmpeg-rgb24 (subprocess) | 2273 | n/a | n/a |
| 1080p | nelux (default) | 2642 | 1426 | 4480 |
| 1080p | torchcodec | 1589 | 502 | 4502 |
| 1080p | ffmpeg-rgb24 (subprocess) | 1102 | n/a | n/a |
| 4K | nelux (default) | 607 | 1656 | 9205 |
| 4K | torchcodec | 367 | 487 | 9098 |
| 4K | ffmpeg-rgb24 (subprocess) | 254 | n/a | n/a |
nelux fans out the libswscale convert across cores, which yields +14–67% fps over torchcodec at every resolution. The trade-off: ~2.5–3× the CPU usage. RSS is essentially identical.
Polite mode (convert_workers=0) vs torchcodec
Disabling the convert worker pool matches torchcodec's single-threaded convert architecture. In the table below, fps lands within ~3% of torchcodec, CPU within ~8%, and RSS is 7–11% lower for nelux:
| Resolution | Decoder | fps | CPU% | RSS MB |
|---|---|---|---|---|
| 720p | nelux (convert_workers=0) | 3167 | 366 | 598 |
| 720p | torchcodec | 3090 | 343 | 673 |
| 1080p | nelux (convert_workers=0) | 1755 | 435 | 659 |
| 1080p | torchcodec | 1728 | 432 | 732 |
| 4K | nelux (convert_workers=0) | 394 | 440 | 1022 |
| 4K | torchcodec | 401 | 477 | 1095 |
So the "+14–67% fps" win above is entirely the convert worker pool — strip it and nelux ≈ torchcodec on every dimension. Pick the trade you want via convert_workers=N.
NVDEC (GPU decode) vs ffmpeg-nvdec
| Resolution | Decoder | fps | CPU% | GPU mem MB |
|---|---|---|---|---|
| 720p | nelux (decode_accelerator="nvdec") | 1651 | 45 | 2886 |
| 720p | ffmpeg-nvdec (subprocess) | 1253 | n/a | 2902 |
| 1080p | nelux | 667 | 40 | 2911 |
| 1080p | ffmpeg-nvdec | 592 | n/a | 2967 |
| 4K | nelux | 175 | 24 | 3052 |
| 4K | ffmpeg-nvdec | 162 | n/a | 3259 |
nelux NVDEC beats raw ffmpeg-nvdec by 8–32% on fps at lower CPU (NV12→RGB runs as a fused CUDA kernel; output stays on the GPU as a torch.Tensor, no host round-trip).
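The 8–32% range can be re-derived from the fps columns of the NVDEC table. A short sketch of the arithmetic, using the rounded medians exactly as listed (the variable and function names are just for illustration):

```python
# (nelux fps, ffmpeg-nvdec fps), copied from the NVDEC table
nvdec_fps = {
    "720p": (1651, 1253),
    "1080p": (667, 592),
    "4K": (175, 162),
}


def speedup_percent(ours: float, baseline: float) -> float:
    """Relative fps gain of `ours` over `baseline`, in percent."""
    return (ours / baseline - 1.0) * 100.0


gains = {res: round(speedup_percent(a, b)) for res, (a, b) in nvdec_fps.items()}
print(gains)  # {'720p': 32, '1080p': 13, '4K': 8}
```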
Quality (vs ffmpeg -vf format=rgb24 reference, 30-frame compare)
Across 14 (pix_fmt × colorspace) combos: 12 / 14 PSNR = ∞, SSIM = 1.000 — byte-identical to ffmpeg. The two exceptions are yuv420p10le (PSNR 47.9–48.3 dB / VMAF 99.85+) where 10→8-bit downconvert rounds differently from ffmpeg's direct 10-bit YUV→RGB path; perceptually identical. See tests/output/pixfmt_matrix/REPORT.md for the full table.
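For readers unfamiliar with the metric: for 8-bit data, PSNR is 10·log10(255² / MSE), and it is infinite when two buffers are byte-identical. A minimal standard-library sketch of that formula follows; this is not the project's test code, and `psnr_8bit` is a hypothetical helper operating on raw rgb24 bytes.

```python
import math


def psnr_8bit(reference: bytes, candidate: bytes) -> float:
    """PSNR in dB between two equally sized 8-bit buffers (e.g. rgb24 frames)."""
    if len(reference) != len(candidate) or not reference:
        raise ValueError("buffers must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf  # byte-identical output
    return 10.0 * math.log10(255.0 ** 2 / mse)


identical = psnr_8bit(b"\x10\x20\x30", b"\x10\x20\x30")    # inf
off_by_one = psnr_8bit(bytes([10, 20, 30]), bytes([11, 21, 31]))
```

For a sense of scale, an error of exactly one code value in every sample gives an MSE of 1, which works out to about 48.1 dB.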
Caveats
- ffmpeg-rgb24 CPU% omitted: it runs as a subprocess; the psutil sampler ticks every 100 ms and ffmpeg startup is short, so the few samples it gets are not representative. fps is valid (wall-clock time).
- Single hardware data point: your numbers will differ. Reproduce with python tests/comprehensive_bench.py --tag mybox (full table) or python tests/bench_thread_modes.py (decoder-architecture comparison).
- Default prefetch=False matches typical use. With prefetch=True nelux can squeeze out another ~3–5% fps on big clips but uses more RAM (background producer queue).
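The trade-off behind prefetch is easier to see with a generic producer/consumer sketch: a background thread fills a bounded queue, which buys overlap at the cost of buffered frames in RAM and a queue handoff per item. This is a plain standard-library illustration of the pattern, not nelux's implementation; `prefetched` and `max_buffered` are hypothetical names.

```python
import queue
import threading
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")
_DONE = object()  # sentinel marking the end of the stream


def prefetched(source: Iterable[T], max_buffered: int = 8) -> Iterator[T]:
    """Yield items from `source`, produced ahead of time on a background thread."""
    buf: queue.Queue = queue.Queue(maxsize=max_buffered)  # bounds the extra RAM

    def producer() -> None:
        for item in source:
            buf.put(item)  # blocks while the buffer is full
        buf.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()  # the per-item queue handoff
        if item is _DONE:
            return
        yield item


frames = list(prefetched(range(5), max_buffered=2))
```

A production version would also forward exceptions from the producer thread; this sketch omits that for brevity.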
API Reference
VideoReader
VideoReader(
    input_path: str,
    num_threads: int = 0,                           # 0 = ffmpeg auto-detect
    force_8bit: bool = False,                       # cast >8-bit YUV down to uint8
    backend: Literal["pytorch", "numpy"] = "pytorch",
    decode_accelerator: Literal["cpu", "nvdec"] = "cpu",
    cuda_device_index: int = 0,                     # NVDEC GPU index
    resize: tuple[int, int] | None = None,          # decoder-side scale to (W, H)
    prefetch: bool = False,                         # background producer thread
    convert_workers: int | None = None,             # None = min(hw, 16); 0 = polite
)
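Note that resize takes (W, H), whereas decoded frames are shaped [H, W, 3]. Below is a small sketch of a helper for computing an aspect-preserving target size. The helper name and the rounding to even dimensions are assumptions made for illustration (even sizes are a common requirement for 4:2:0 video), not something nelux is documented to require.

```python
def fit_height(src_w: int, src_h: int, target_h: int) -> tuple[int, int]:
    """Return (W, H) scaled to roughly target_h, preserving aspect ratio.

    Both dimensions are rounded to the nearest even number (minimum 2).
    """
    def to_even(x: float) -> int:
        return max(2, int(round(x / 2)) * 2)

    scale = target_h / src_h
    return to_even(src_w * scale), to_even(target_h)


size = fit_height(1920, 1080, 360)  # (640, 360), in (W, H) order
# Intended use, per the signature above: VideoReader("input.mp4", resize=size)
```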
Properties:
- width, height, fps, min_fps, max_fps, duration, total_frames
- pixel_format, bit_depth, aspect_ratio, codec, has_audio
- properties (full VideoProperties struct)
- shape → (frame_count, H, W, 3) (Python-side BatchMixin)
- frame_count → cached get_frame_count() (Python-side BatchMixin)
Methods:
- read_frame() / __next__() / iteration → next [H, W, 3] frame
- frame_at(timestamp: float | index: int) → random-access frame via secondary decoder (doesn't disturb iteration)
- __getitem__(int | float | slice | list | range) → single frame OR [B, H, W, 3] batch
- decode_batch(indices: list[int]) → C++ batch path; called by get_batch after validation
- get_batch(indices) / get_batch_range(start, end, step) → batch decode with seek minimization
- set_range(start, end) / reset() → bound iteration
- reconfigure(...) → reuse this VideoReader for a different file (10-50× faster than re-constructing)
- create_encoder(output_path) → VideoEncoder pre-configured to this source's dims/fps/format
- start_prefetch() / stop_prefetch() / prefetch_buffered / is_prefetching → runtime prefetch control
- supported_codecs() → list of codecs the linked libavcodec can decode
Documentation
- Full Usage Guide - Complete API reference
- Changelog - Version history
- Benchmarks - Performance comparisons
Requirements
- Python: 3.13+ (see pyproject.toml requires-python)
- PyTorch: 2.11+ (import torch must precede import nelux; the matching CUDA wheel provides the CUDA runtime nelux's NVDEC path needs)
- CUDA: 13.x (for NVDEC/NVENC builds). CPU-only builds drop this requirement.
- OS: Windows 10/11, Linux (manylinux_2_28+ / Ubuntu 22.04+), macOS 12+ (Apple Silicon, CPU only)
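A quick way to check the Python and PyTorch requirements before installing is a small preflight script. This is a sketch using only the standard library; `preflight` is a hypothetical helper, and it only checks that torch is installed, not its version or CUDA build.

```python
from __future__ import annotations

import importlib.util
import sys


def preflight(min_python: tuple[int, int] = (3, 13)) -> list[str]:
    """Return a list of human-readable problems; an empty list means the basics look fine."""
    problems = []
    if sys.version_info[:2] < min_python:
        found = f"{sys.version_info.major}.{sys.version_info.minor}"
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, found {found}"
        )
    if importlib.util.find_spec("torch") is None:
        problems.append("PyTorch is not installed (import torch must precede import nelux)")
    return problems


for problem in preflight():
    print("requirement not met:", problem)
```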
Building from Source
Build system is scikit-build-core + CMake + Ninja + vcpkg. There is no setup.py.
git clone https://github.com/NevermindNilas/NeLux.git
cd NeLux
# Editable install — invokes scikit-build-core, which configures CMake + Ninja
# and runs vcpkg under the hood. Set NELUX_ENABLE_CUDA=ON to build NVDEC/NVENC.
NELUX_ENABLE_CUDA=ON pip install -e .
# Or build a wheel
NELUX_ENABLE_CUDA=ON pip wheel . -w dist/
On Windows the build needs MSVC 18 (or compatible), and FFmpeg headers/libs under external/ffmpeg/ (see tools/download_ffmpeg.ps1).
See BUILD.md for detailed build instructions.
License
This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0). See the LICENSE file for details.
Acknowledgments
Download files
Built Distributions
File details
Details for the file nelux-0.11.0-cp314-cp314-win_amd64.whl.
File metadata
- Download URL: nelux-0.11.0-cp314-cp314-win_amd64.whl
- Upload date:
- Size: 1.2 MB
- Tags: CPython 3.14, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a9834f04cb685ec843ffebcc4bf140243fbd6a5444bb7b53e9656993dec3aec5 |
| MD5 | 7ff440b8483e13df2f6f9b9f4f88c738 |
| BLAKE2b-256 | 6164367730890371de2f4795a0773be6e825ef17aa338ed712e1fc4312d5a07b |
File details
Details for the file nelux-0.11.0-cp314-cp314-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.
File metadata
- Download URL: nelux-0.11.0-cp314-cp314-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
- Upload date:
- Size: 1.2 MB
- Tags: CPython 3.14, manylinux: glibc 2.27+ x86-64, manylinux: glibc 2.28+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 48a705011e4d5316af9781bc9151d7479de775b8b5933c7dbc4247eecbe3006b |
| MD5 | de858255aed0947f808ec91fd5a99447 |
| BLAKE2b-256 | 97047a59d8e2c04854a5513e2102259df4c32c0bdbdc97064aa2785562a1ee1c |
File details
Details for the file nelux-0.11.0-cp314-cp314-macosx_12_0_arm64.whl.
File metadata
- Download URL: nelux-0.11.0-cp314-cp314-macosx_12_0_arm64.whl
- Upload date:
- Size: 804.5 kB
- Tags: CPython 3.14, macOS 12.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 462bbc78030ae3d5edf53527979a8ca31fa6743a6952a60407f813861f29c0c5 |
| MD5 | f1a6fbe35717484948a6c0e2540238a2 |
| BLAKE2b-256 | 6e23e3552d9ffe4b2828b0cfe96b664551e1c4d966673ebd007d55d96330f953 |
File details
Details for the file nelux-0.11.0-cp313-cp313-win_amd64.whl.
File metadata
- Download URL: nelux-0.11.0-cp313-cp313-win_amd64.whl
- Upload date:
- Size: 1.2 MB
- Tags: CPython 3.13, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 25accb6bc93f0b32c73cd27810e483a7adf5156f9f35de7d548857e65140437d |
| MD5 | 9196a49af9eaf6630a01404c090e30a9 |
| BLAKE2b-256 | f158989ec328aca4a0f346eb74282d22eddce8782307e25ed4b542c3e62a7d61 |
File details
Details for the file nelux-0.11.0-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.
File metadata
- Download URL: nelux-0.11.0-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
- Upload date:
- Size: 1.2 MB
- Tags: CPython 3.13, manylinux: glibc 2.27+ x86-64, manylinux: glibc 2.28+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e6ed6029bb4f85d988b2669d4de66b6b73255ee854d2bd39ba8629aba0378940 |
| MD5 | 60af20e24b545a9e7c125dd4802928ff |
| BLAKE2b-256 | 54cbf52be1fae4bd52cf11b0d67f13036e828b850ab35a267cad1342ae0e4c3e |
File details
Details for the file nelux-0.11.0-cp313-cp313-macosx_12_0_arm64.whl.
File metadata
- Download URL: nelux-0.11.0-cp313-cp313-macosx_12_0_arm64.whl
- Upload date:
- Size: 804.0 kB
- Tags: CPython 3.13, macOS 12.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | dd7fa0d4ecf00412df56afefb55c839c5a43b72aa1710b4f7ba452e8d2ac1706 |
| MD5 | 388a6632d2650d64a526db8b90ee06de |
| BLAKE2b-256 | 1dc10b0299121c03fa4ebd0e81e6b77bcae70756b98e11f4594a0b1080546460 |
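The SHA256 digests listed for each wheel can be used to check a manually downloaded file. A minimal sketch using hashlib; the helper names are hypothetical, and the wheel path in the trailing comment is a placeholder for wherever you saved the download.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def sha256_of(path: str | Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA256 of a file, read in chunks so large wheels are not loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_wheel(path: str | Path, expected_sha256: str) -> bool:
    """Compare a file's SHA256 against the digest published for it."""
    return sha256_of(path) == expected_sha256.strip().lower()


# Example call, with the digest copied from the matching table above:
# verify_wheel("nelux-0.11.0-cp313-cp313-win_amd64.whl",
#              "25accb6bc93f0b32c73cd27810e483a7adf5156f9f35de7d548857e65140437d")
```

pip can also enforce digests itself via `--require-hashes` with a requirements file, which may be more convenient for automated installs.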