Scalable Data Preprocessing Tool for Training Large Language Models

NVIDIA NeMo Curator

GPU-accelerated data curation for training better AI models, faster. Scale from laptop to multi-node clusters with modular pipelines for text, images, video, and audio.

Part of the NVIDIA NeMo software suite for managing the AI agent lifecycle.

What You Can Do

| Modality | Key Capabilities | Get Started |
| --- | --- | --- |
| Text | Deduplication • Classification • Quality Filtering • Language Detection | Text Guide |
| Image | Aesthetic Filtering • NSFW Detection • Embedding Generation • Deduplication | Image Guide |
| Video | Scene Detection • Clip Extraction • Motion Filtering • Deduplication | Video Guide |
| Audio | ASR Transcription • Quality Assessment • WER Filtering | Audio Guide |

Quick Start

# Install for your modality
uv pip install "nemo-curator[text_cuda12]"

# Run the quickstart example
python tutorials/quickstart.py

Full setup: Installation Guide • Docker • Tutorials


Features by Modality

Text Curation

Process and curate high-quality text datasets for large language model (LLM) training with multilingual support.

| Category | Features | Documentation |
| --- | --- | --- |
| Data Sources | Common Crawl • Wikipedia • ArXiv • Custom datasets | Load Data |
| Quality Filtering | 30+ heuristic filters • fastText classification • GPU-accelerated classifiers for domain, quality, safety, and content type | Quality Assessment |
| Deduplication | Exact • Fuzzy (MinHash LSH) • Semantic (GPU-accelerated) | Deduplication |
| Processing | Text cleaning • Language identification | Content Processing |
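
To make the "Fuzzy (MinHash LSH)" entry concrete, here is a minimal, CPU-only sketch of the general technique in plain Python. It is not NeMo Curator's implementation or API. Every function name, the shingle size, the number of hash functions, and the band count are assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=5):
    """Return the set of character k-grams of a lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def _hash(seed, token):
    """Seeded 64-bit hash; each seed stands in for one random permutation."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_perm=64):
    """Keep the minimum hash value per seed; similar sets share many minima."""
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(signatures, bands=16):
    """LSH banding: documents that agree on every row of any band become candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


docs = {
    "a": "The quick brown fox jumps over the lazy dog near the river bank.",
    "b": "The quick brown fox jumps over the lazy dog near the river bank!",
    "c": "GPU-accelerated pipelines scale data curation across many nodes.",
}
signatures = {doc_id: minhash_signature(shingles(text)) for doc_id, text in docs.items()}
pairs = candidate_pairs(signatures)
```

Candidate pairs would normally be confirmed with `estimated_jaccard` (or an exact comparison) before one document of each pair is dropped. Banding avoids comparing every document against every other one, which is what makes the approach viable on large corpora.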

Image Curation

Curate large-scale image datasets for vision language models (VLMs) and generative AI training.

| Category | Features | Documentation |
| --- | --- | --- |
| Data Loading | WebDataset format • Large-scale image-text pairs | Load Data |
| Embeddings | CLIP embeddings for semantic analysis | Embeddings |
| Filtering | Aesthetic quality scoring • NSFW detection | Filters |

Video Curation

Process large-scale video corpora with distributed, GPU-accelerated pipelines for world foundation models (WFMs).

| Category | Features | Documentation |
| --- | --- | --- |
| Data Loading | Local paths • S3-compatible storage • HTTP(S) URLs | Load Data |
| Clipping | Fixed-stride splitting • Scene-change detection (TransNetV2) | Clipping |
| Processing | GPU H.264 encoding • Frame extraction • Motion filtering • Aesthetic filtering | Processing |
| Embeddings | Cosmos-Embed1 for clip-level embeddings | Embeddings |
| Deduplication | K-means clustering • Pairwise similarity for near-duplicates | Deduplication |
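
The pairwise-similarity step of deduplication can be sketched in a few lines of plain Python. This is an illustration of the idea only, not NeMo Curator's API. It assumes the clip embeddings have already been grouped into one cluster (for example by k-means), and the function names and the 0.95 threshold are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def near_duplicates(cluster, threshold=0.95):
    """Greedy pass over one cluster: keep a clip unless it is too close to a kept one.

    cluster maps clip_id -> embedding. Returns (kept_ids, dropped_ids).
    """
    kept, dropped = [], []
    for clip_id, emb in cluster.items():
        if any(cosine(emb, cluster[k]) >= threshold for k in kept):
            dropped.append(clip_id)
        else:
            kept.append(clip_id)
    return kept, dropped


# Toy 2-D embeddings; real clip embeddings have many more dimensions.
cluster = {"c1": [1.0, 0.0], "c2": [0.99, 0.05], "c3": [0.0, 1.0]}
kept, dropped = near_duplicates(cluster)
```

Restricting comparisons to clips within the same cluster keeps the quadratic pairwise step small, which is the usual reason for clustering first.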

Audio Curation

Prepare high-quality speech datasets for automatic speech recognition (ASR) and multimodal AI training.

| Category | Features | Documentation |
| --- | --- | --- |
| Data Loading | Local files • Custom manifests • Public datasets (FLEURS) | Load Data |
| ASR Processing | NeMo Framework pretrained models • Automatic transcription | ASR Inference |
| Quality Assessment | Word Error Rate (WER) calculation • Duration analysis • Quality-based filtering | Quality Assessment |
| Integration | Text curation workflow integration for multimodal pipelines | Text Integration |
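
As a reference point for the WER-based filtering listed above, the standard definition of word error rate is the word-level edit distance between a reference transcript and the ASR hypothesis, divided by the number of reference words. The sketch below implements that definition in plain Python. The function names and the 0.3 threshold are assumptions, and this is not the library's own code.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def filter_by_wer(samples, max_wer=0.3):
    """Keep (reference, hypothesis) pairs whose WER is at most max_wer."""
    return [s for s in samples if word_error_rate(*s) <= max_wer]


samples = [
    ("the cat sat on the mat", "the cat sat on mat"),
    ("hello world", "goodbye moon"),
]
clean_samples = filter_by_wer(samples)
```

Note that WER can exceed 1.0 when the hypothesis contains many inserted words, so a threshold is usually expressed as an upper bound rather than a percentage of "correct" words.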

Why NeMo Curator?

Performance at Scale

NeMo Curator leverages NVIDIA RAPIDS™ libraries such as cuDF, cuML, and cuGraph along with Ray to scale workloads across multi-node, multi-GPU environments.

Proven Results:

  • 16× faster fuzzy deduplication on 8 TB RedPajama v2 (1.78 trillion tokens)
  • 40% lower total cost of ownership (TCO) compared to CPU-based alternatives
  • Near-linear scaling from one to four H100 80 GB nodes (2.05 hrs → 0.50 hrs)
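
The scaling figure in the last bullet can be checked with simple arithmetic: speedup is the ratio of wall-clock times, and scaling efficiency is that speedup divided by the ideal (the ratio of node counts). The helper name below is hypothetical; the numbers are the ones quoted above.

```python
def scaling_summary(base_nodes, base_hours, scaled_nodes, scaled_hours):
    """Return (speedup, efficiency) for a run scaled from base_nodes to scaled_nodes."""
    speedup = base_hours / scaled_hours
    ideal = scaled_nodes / base_nodes
    return speedup, speedup / ideal


# One node at 2.05 hrs versus four nodes at 0.50 hrs.
speedup, efficiency = scaling_summary(1, 2.05, 4, 0.50)
print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.1%}")
```

With these inputs the speedup is about 4.1× on 4× the nodes, so the reported times are consistent with the near-linear claim.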

[Figure: Performance benchmarks showing 16× speed improvement, 40% cost savings, and near-linear scaling]

Quality Improvements

Data curation modules measurably improve model performance. In ablation studies using a 357M-parameter GPT model trained on curated Common Crawl data:

[Figure: Model accuracy improvements across curation pipeline stages]

Results: Progressive improvements in zero-shot downstream task performance through text cleaning, deduplication, and quality filtering stages.
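
The three stages named above (text cleaning, deduplication, quality filtering) compose naturally as a chain of steps, each consuming the previous step's output. The following is a toy sketch of that structure using Python generators. The stage functions are simplified stand-ins invented for this example, not the modules used in the ablation study or NeMo Curator's pipeline API.

```python
import hashlib


def clean(docs):
    """Text cleaning stand-in: collapse runs of whitespace."""
    for doc in docs:
        yield " ".join(doc.split())


def exact_dedup(docs):
    """Exact deduplication: drop documents whose content hash was already seen."""
    seen = set()
    for doc in docs:
        key = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield doc


def quality_filter(docs, min_words=5):
    """Heuristic filter stand-in: keep documents with at least min_words words."""
    for doc in docs:
        if len(doc.split()) >= min_words:
            yield doc


def run_pipeline(docs, stages):
    """Feed the output of each stage into the next and collect the result."""
    for stage in stages:
        docs = stage(docs)
    return list(docs)


raw = [
    "Hello   world, this is  a test.",
    "Hello world, this is a test.",
    "too short",
]
curated = run_pipeline(raw, [clean, exact_dedup, quality_filter])
```

Stage order matters in this sketch: cleaning first lets the exact-match step catch documents that differ only in whitespace.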


Learn More

| Resource | Links |
| --- | --- |
| Documentation | Main Docs • API Reference • Concepts |
| Tutorials | Text • Image • Video • Audio |
| Deployment | Installation • Infrastructure |
| Community | GitHub Discussions • Issues |

Contribute

We welcome community contributions! Please refer to CONTRIBUTING.md for guidelines.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

nemo_curator-1.1.0.tar.gz (285.2 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

nemo_curator-1.1.0-py3-none-any.whl (464.0 kB)

Uploaded Python 3

File details

Details for the file nemo_curator-1.1.0.tar.gz.

File metadata

  • Download URL: nemo_curator-1.1.0.tar.gz
  • Upload date:
  • Size: 285.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.12.3

File hashes

Hashes for nemo_curator-1.1.0.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 9a16f6bef83239d23738d8c03ba0ed911c6cefceb552b1176bc1360932c03536 |
| MD5 | 3132fb1be8e7fd47a0e382d722c205e1 |
| BLAKE2b-256 | 6ebd1058ab36b1d2aa8dbe730b0bae8821197088c293f9e6eb8db0438e1a3a2c |

See more details on using hashes here.
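
One way to use the published digests is to hash a downloaded file locally and compare the result with the SHA256 value listed for it. The sketch below uses only Python's standard library. The helper names are hypothetical, and the local path is an assumption about where the file was saved.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file in chunks so large archives are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare the local digest with the published one (case-insensitive hex)."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.strip().lower())
```

For example, `matches_published_hash("nemo_curator-1.1.0.tar.gz", "9a16f6bef83239d23738d8c03ba0ed911c6cefceb552b1176bc1360932c03536")` should return `True` for an intact copy of the source distribution saved in the current directory.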

File details

Details for the file nemo_curator-1.1.0-py3-none-any.whl.

File metadata

  • Download URL: nemo_curator-1.1.0-py3-none-any.whl
  • Upload date:
  • Size: 464.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.12.3

File hashes

Hashes for nemo_curator-1.1.0-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 7d38e0b98960fa1d49377583e77c324abb407aeeef67bcf8a66ce8dff310f775 |
| MD5 | 5c0bfd8b4ee1df8ac77d144d24f890e8 |
| BLAKE2b-256 | 3834bdd64f8cc117152b9eb095848b6d9811c4c7348f9e920fb44301196a2026 |
