Scalable Data Preprocessing Tool for Training Large Language Models

These details have not been verified by PyPI

Project description

NVIDIA NeMo Curator

GPU-accelerated data curation for training better AI models, faster. Scale from laptop to multi-node clusters with modular pipelines for text, images, video, and audio.

Part of the NVIDIA NeMo software suite for managing the AI agent lifecycle.

What You Can Do

Modality	Key Capabilities	Get Started
Text	Deduplication • Classification • Quality Filtering • Language Detection	Text Guide
Image	Aesthetic Filtering • NSFW Detection • Embedding Generation • Deduplication	Image Guide
Video	Scene Detection • Clip Extraction • Motion Filtering • Deduplication	Video Guide
Audio	ASR Transcription • Quality Assessment • WER Filtering	Audio Guide

Quick Start

# Install the extra for your modality (text with CUDA 12 shown here)
uv pip install "nemo-curator[text_cuda12]"

# Run the quickstart example
python tutorials/quickstart.py

Full setup: Installation Guide • Docker • Tutorials

Features by Modality

Text Curation

Process and curate high-quality text datasets for large language model (LLM) training with multilingual support.

Category	Features	Documentation
Data Sources	Common Crawl • Wikipedia • ArXiv • Custom datasets	Load Data
Quality Filtering	30+ heuristic filters • fastText classification • GPU-accelerated classifiers for domain, quality, safety, and content type	Quality Assessment
Deduplication	Exact • Fuzzy (MinHash LSH) • Semantic (GPU-accelerated)	Deduplication
Processing	Text cleaning • Language identification	Content Processing
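To make the fuzzy deduplication entry concrete, here is a minimal CPU-only sketch of MinHash with LSH banding, using only the Python standard library. It illustrates the general technique and is not NeMo Curator's implementation or API. The shingle size, number of hash functions, band count, and every function name below are assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text: str, k: int = 5) -> set[str]:
    """Return the set of character k-grams of a lowercased, whitespace-normalized string."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(items: set[str], num_hashes: int = 64) -> list[int]:
    """For each salted hash function, keep the minimum hash value over all shingles."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(), "little"
            )
            for s in items
        ))
    return signature


def candidate_pairs(signatures: dict[str, list[int]], bands: int = 16) -> set[tuple[str, str]]:
    """Split each signature into bands; documents sharing any band bucket become candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    pairs: set[tuple[str, str]] = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            buckets[tuple(sig[b * rows:(b + 1) * rows])].append(doc_id)
        for ids in buckets.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


# Hypothetical toy corpus: "a" and "b" differ only in the final character.
docs = {
    "a": "The quick brown fox jumps over the lazy dog near the river bank.",
    "b": "The quick brown fox jumps over the lazy dog near the river bank!",
    "c": "GPU-accelerated pipelines scale data curation across many nodes.",
}
signatures = {doc_id: minhash_signature(shingles(text)) for doc_id, text in docs.items()}
pairs = candidate_pairs(signatures)
print(pairs)
```

Banding means only documents that collide in at least one band are compared, which avoids scoring every pair in the corpus. A real pipeline would typically confirm each candidate pair against a similarity threshold before dropping anything.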

Image Curation

Curate large-scale image datasets for vision language models (VLMs) and generative AI training.

Category	Features	Documentation
Data Loading	WebDataset format • Large-scale image-text pairs	Load Data
Embeddings	CLIP embeddings for semantic analysis	Embeddings
Filtering	Aesthetic quality scoring • NSFW detection	Filters
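As a rough illustration of how score-based filtering combines the two signals in the table, the sketch below keeps an image only if it clears an aesthetic threshold and stays under an NSFW threshold. The record fields, score ranges, and threshold values are hypothetical and do not come from NeMo Curator. In practice the scores would be produced by model-based classifiers.

```python
from dataclasses import dataclass


@dataclass
class ImageRecord:
    image_id: str
    aesthetic_score: float   # assumed scale: 0.0 (poor) to 1.0 (good)
    nsfw_probability: float  # assumed scale: 0.0 (safe) to 1.0 (unsafe)


def keep(record: ImageRecord, min_aesthetic: float = 0.5, max_nsfw: float = 0.5) -> bool:
    """Keep an image only if it passes both the aesthetic and the NSFW check."""
    return record.aesthetic_score >= min_aesthetic and record.nsfw_probability <= max_nsfw


# Hypothetical scores for three images.
records = [
    ImageRecord("img_0", aesthetic_score=0.80, nsfw_probability=0.01),
    ImageRecord("img_1", aesthetic_score=0.30, nsfw_probability=0.02),  # fails aesthetic check
    ImageRecord("img_2", aesthetic_score=0.90, nsfw_probability=0.97),  # fails NSFW check
]
kept = [r.image_id for r in records if keep(r)]
print(kept)
```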

Video Curation

Process large-scale video corpora with distributed, GPU-accelerated pipelines for world foundation models (WFMs).

Category	Features	Documentation
Data Loading	Local paths • S3-compatible storage • HTTP(S) URLs	Load Data
Clipping	Fixed-stride splitting • Scene-change detection (TransNetV2)	Clipping
Processing	GPU H.264 encoding • Frame extraction • Motion filtering • Aesthetic filtering	Processing
Embeddings	Cosmos-Embed1 for clip-level embeddings	Embeddings
Deduplication	K-means clustering • Pairwise similarity for near-duplicates	Deduplication
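The pairwise-similarity step of embedding-based deduplication can be sketched as follows. This is a simplified, pure-Python illustration under stated assumptions, not the library's implementation: it treats all clips as belonging to a single cluster (a K-means step would normally group embeddings first so that comparisons stay within each cluster), and the 0.95 cosine threshold and function names are hypothetical.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def split_near_duplicates(
    embeddings: dict[str, list[float]], threshold: float = 0.95
) -> tuple[list[str], list[str]]:
    """Keep the first clip seen; drop any later clip too similar to an already kept clip."""
    kept: list[str] = []
    dropped: list[str] = []
    for clip_id, emb in embeddings.items():
        if any(cosine_similarity(emb, embeddings[k]) >= threshold for k in kept):
            dropped.append(clip_id)
        else:
            kept.append(clip_id)
    return kept, dropped


# Hypothetical 3-dimensional embeddings; real clip embeddings have far more dimensions.
clips = {
    "clip_0": [1.0, 0.0, 0.0],
    "clip_1": [0.99, 0.05, 0.0],  # nearly parallel to clip_0
    "clip_2": [0.0, 1.0, 0.0],    # orthogonal to clip_0
}
kept, dropped = split_near_duplicates(clips)
print(kept, dropped)
```

Restricting comparisons to clips in the same cluster is what keeps the pairwise step tractable on large corpora, since the number of pairs grows quadratically with the group size.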

Audio Curation

Prepare high-quality speech datasets for automatic speech recognition (ASR) and multimodal AI training.

Category	Features	Documentation
Data Loading	Local files • Custom manifests • Public datasets (FLEURS)	Load Data
ASR Processing	NeMo Framework pretrained models • Automatic transcription	ASR Inference
Quality Assessment	Word Error Rate (WER) calculation • Duration analysis • Quality-based filtering	Quality Assessment
Integration	Text curation workflow integration for multimodal pipelines	Text Integration
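Word Error Rate is conventionally defined as (substitutions + deletions + insertions) divided by the number of reference words, which is a word-level edit distance normalized by reference length. The sketch below computes it with plain Python and then applies a quality filter. The function name, the sample transcripts, and the 0.25 threshold are assumptions for illustration, not NeMo Curator APIs or defaults.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the processed reference prefix and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,                          # deletion
                current[j - 1] + 1,                       # insertion
                previous[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        previous = current
    return previous[-1] / len(ref)


# Hypothetical (reference, ASR hypothesis) pairs.
samples = [
    ("the cat sat on the mat", "the cat sat on mat"),  # one deletion out of six words
    ("hello world", "yellow word"),                    # both words substituted
]
MAX_WER = 0.25  # assumed filtering threshold
kept = [pair for pair in samples if word_error_rate(*pair) <= MAX_WER]
print(kept)
```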

Why NeMo Curator?

Performance at Scale

NeMo Curator leverages NVIDIA RAPIDS™ libraries such as cuDF, cuML, and cuGraph along with Ray to scale workloads across multi-node, multi-GPU environments.

Proven Results:

16× faster fuzzy deduplication on 8 TB RedPajama v2 (1.78 trillion tokens)
40% lower total cost of ownership (TCO) compared to CPU-based alternatives
Near-linear scaling from one to four H100 80 GB nodes (2.05 hrs → 0.50 hrs)
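The scaling figure above can be checked with simple arithmetic: speedup is the single-node time divided by the four-node time, and scaling efficiency is that speedup divided by the increase in node count. The helper name below is hypothetical, and the only inputs are the two runtimes quoted in the list.

```python
def scaling_efficiency(base_hours: float, scaled_hours: float, node_ratio: float) -> tuple[float, float]:
    """Return (speedup, efficiency), where efficiency = speedup / node_ratio."""
    speedup = base_hours / scaled_hours
    return speedup, speedup / node_ratio


# Runtimes quoted above: 2.05 hrs on one node, 0.50 hrs on four nodes.
speedup, efficiency = scaling_efficiency(2.05, 0.50, node_ratio=4)
print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.1%}")
```

An efficiency close to 100% is what "near-linear" means here: four times the nodes yields roughly four times the throughput. The quoted runtimes are rounded, so the computed value should be read as approximate.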

Real-World Recipe: The Nemotron-CC curation pipeline uses NeMo Curator end-to-end to reproduce the Nemotron-CC datasets. It covers Common Crawl extraction, language identification, exact/fuzzy/substring deduplication, ensemble quality classification, and LLM-based synthetic data generation (SDG). The SDG stage is also available as an in-repo tutorial.

[Figure: Performance benchmarks showing 16× speed improvement, 40% cost savings, and near-linear scaling]

Quality Improvements

Data curation modules measurably improve model performance. In ablation studies using a 357M-parameter GPT model trained on curated Common Crawl data, zero-shot downstream task performance improved progressively through the text cleaning, deduplication, and quality filtering stages.

[Figure: Model accuracy improvements across curation pipeline stages]

Learn More

Resource	Links
Documentation	Main Docs • API Reference • Concepts
Tutorials	Text • Image • Video • Audio
Recipes	Nemotron-CC: end-to-end web data curation • SDG tutorial (in-repo)
Deployment	Installation • Infrastructure
Community	GitHub Discussions • Issues

Contribute

We welcome community contributions! Please refer to CONTRIBUTING.md for guidelines.

Project details

Release history

Version	Release date
1.2.0 (this version)	May 14, 2026
1.1.0	Feb 23, 2026
1.1.0rc0.dev0 (pre-release)	Feb 23, 2026
1.0.0	Oct 1, 2025
0.9.0	Jul 28, 2025
0.8.0	May 9, 2025
0.8.0rc3.dev0 (pre-release)	Apr 15, 2025
0.8.0rc2.dev0 (pre-release)	Apr 7, 2025
0.7.1	Mar 31, 2025
0.7.0	Mar 12, 2025
0.7.0rc2.dev0 (pre-release)	Feb 25, 2025
0.7.0rc1.dev1 (pre-release)	Feb 19, 2025
0.7.0rc0.dev1 (pre-release)	Feb 3, 2025
0.6.0	Jan 7, 2025
0.6.0rc2.dev1 (pre-release)	Jan 3, 2025
0.6.0rc1.dev1 (pre-release)	Dec 20, 2024
0.6.0rc0.dev1 (pre-release)	Dec 11, 2024
0.5.1	Dec 3, 2024
0.5.0	Oct 30, 2024
0.4.1	Oct 3, 2024
0.4.0	Aug 14, 2024
0.3.0	May 24, 2024

Download files

Source Distribution

nemo_curator-1.2.0.tar.gz (430.1 kB)

Uploaded May 14, 2026 Source

Built Distribution

nemo_curator-1.2.0-py3-none-any.whl (690.3 kB)

Uploaded May 14, 2026 Python 3

File details

Details for the file nemo_curator-1.2.0.tar.gz.

File metadata

Download URL: nemo_curator-1.2.0.tar.gz
Upload date: May 14, 2026
Size: 430.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.12.3

File hashes

Hashes for nemo_curator-1.2.0.tar.gz
Algorithm	Hash digest
SHA256	`efa0efeccb6b1b7d8870ad34b9a2857d97378e65752570caac7e4a573db0455e`
MD5	`0bb8e6250157aa12c64f768cd73ae0b8`
BLAKE2b-256	`33da08ecae6d9e68c2c85ca88fb14b61a9d154997dc9155c2651d6dab6db9200`
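One way to use the digests listed above is to hash the downloaded file locally and compare the result with the published SHA256 value. The sketch below uses only `hashlib` from the standard library. The helper names are hypothetical, and the local file path in the usage comment assumes the archive was saved to the current directory.

```python
import hashlib

# SHA256 digest for nemo_curator-1.2.0.tar.gz, copied from the table above.
EXPECTED_SHA256 = "efa0efeccb6b1b7d8870ad34b9a2857d97378e65752570caac7e4a573db0455e"


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected_hex: str) -> bool:
    """Compare the computed digest with the published one, ignoring letter case."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Usage (assumes the file was downloaded to the current directory):
# matches_expected("nemo_curator-1.2.0.tar.gz", EXPECTED_SHA256)
```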

File details

Details for the file nemo_curator-1.2.0-py3-none-any.whl.

File metadata

Download URL: nemo_curator-1.2.0-py3-none-any.whl
Upload date: May 14, 2026
Size: 690.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.12.3

File hashes

Hashes for nemo_curator-1.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`031b3b43b0118e477aea191f00d683e0b6b71cc787ddaf2605269f632d6a57a1`
MD5	`e668ae9970b0ecbc7a395dabbf692c71`
BLAKE2b-256	`d10a5b7f301b964e57a383e42c7d60ca9e7b2dcbb7acdd83630d8353c3d3e4ee`
