Scalable Data Preprocessing Tool for Training Large Language Models


NeMo Curator

🚀 The GPU-Accelerated Open Source Framework for Efficient Large Language Model Data Curation 🚀

NeMo Curator is a Python library designed for fast and scalable dataset preparation and curation for large language model (LLM) use cases such as foundation model pretraining, domain-adaptive pretraining (DAPT), supervised fine-tuning (SFT), and parameter-efficient fine-tuning (PEFT). It accelerates data curation by leveraging GPUs with Dask and RAPIDS, resulting in significant time savings. The library provides a customizable and modular interface that simplifies pipeline expansion and accelerates model convergence through the preparation of high-quality tokens.

At the core of NeMo Curator is the DocumentDataset, which serves as the main dataset class. It acts as a straightforward wrapper around a Dask DataFrame. The Python library offers easy-to-use methods for expanding the functionality of your curation pipeline while eliminating scalability concerns.

Key Features

NeMo Curator provides a collection of scalable data-mining modules. Some of the key features include:

These modules offer flexibility and permit reordering, with only a few exceptions. In addition, the NeMo Framework Launcher provides pre-built pipelines that can serve as a foundation for your customization use cases.

Resources

Get Started

This section explains how to install NeMo Curator and use the Python library, Python modules, and CLI scripts. It also includes a list of tutorials to help you get started right away. Finally, this section explains how to use the NeMo Framework Launcher as an alternative method for interfacing with NeMo Curator.

Install NeMo Curator

Requirements

Before installing NeMo Curator, ensure that the following requirements are met:

  • Python 3.10
  • Ubuntu 22.04/20.04
  • NVIDIA GPU (optional)

You can install NeMo Curator from PyPI, from source, or get it through the NeMo Framework container.

From PyPI

To install the CPU-only modules:

pip install nemo-curator

To install the CPU and CUDA-accelerated modules:

pip install --extra-index-url https://pypi.nvidia.com nemo-curator[cuda12x]

From Source

  1. Clone the NeMo Curator repository from GitHub.

    git clone https://github.com/NVIDIA/NeMo-Curator.git
    cd NeMo-Curator
    
  2. Install the modules that you need.

    To install the CPU-only modules:

    pip install .
    

    To install the CPU and CUDA-accelerated modules:

    pip install --extra-index-url https://pypi.nvidia.com ".[cuda12x]"
    

From the NeMo Framework Container

The latest release of NeMo Curator comes preinstalled in the NeMo Framework Container. If you want the latest commit inside the container, uninstall the existing version using:

pip uninstall nemo-curator

Then follow the instructions above for installing from source.
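
After installing by any of the methods above, a quick sanity check can confirm what ended up in the environment. The following is a minimal sketch, not part of NeMo Curator: the helper name check_environment is hypothetical, and treating the presence of the RAPIDS cudf package as a sign that the CUDA-accelerated modules were installed is an assumption.

```python
import importlib.util
import sys


def check_environment():
    """Report whether the stated requirements appear to be met (hypothetical helper)."""
    return {
        # The requirements list names Python 3.10.
        "python_3_10": sys.version_info[:2] == (3, 10),
        # True if the nemo_curator package can be found on the import path.
        "nemo_curator_installed": importlib.util.find_spec("nemo_curator") is not None,
        # Assumption: cudf (RAPIDS) being importable suggests the CUDA extras are present.
        "cudf_installed": importlib.util.find_spec("cudf") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'yes' if ok else 'no'}")
```

The check only looks up the packages without importing them, so it stays fast and does not require a GPU to run.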

Use NeMo Curator

Python API Quick Example

The following snippet demonstrates how to create a small data curation pipeline that downloads and curates a small subset of the Common Crawl dataset.

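# Imports for the classes and functions used below
from nemo_curator import Modify, ScoreFilter, Sequential, TaskDecontamination
from nemo_curator.download import download_common_crawl
from nemo_curator.filters import FastTextQualityFilter, WordCountFilter
from nemo_curator.modifiers import UnicodeReformatter
from nemo_curator.tasks import Squad, TriviaQA, Winogrande

# Download your dataset (url_limit=10 keeps the download to a small subset)
dataset = download_common_crawl("/datasets/common_crawl/", "2021-04", "2021-10", url_limit=10)
# Build your pipeline
curation_pipeline = Sequential([
  # Fix unicode
  Modify(UnicodeReformatter()),
  # Discard short records
  ScoreFilter(WordCountFilter(min_words=80)),
  # Discard low-quality records
  ScoreFilter(FastTextQualityFilter(model_path="model.bin")),
  # Discard records that overlap with evaluation tasks to prevent test set leakage.
  TaskDecontamination([Winogrande(), Squad(), TriviaQA()])
])
# Execute the pipeline on your dataset
curated_dataset = curation_pipeline(dataset)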
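
To make the composition pattern in the snippet above concrete, here is a toy, standard-library-only sketch of the same idea: steps that each take a collection of records and return a new one, chained in order. It is not how NeMo Curator is implemented. The names modify, score_filter, and sequential are hypothetical, records are plain dictionaries rather than a DocumentDataset, NFKC normalization merely stands in for a Unicode fix, and the word threshold is lowered to 3 so the example stays small.

```python
import unicodedata


def modify(text_fn):
    """Build a step that rewrites the text of every record."""
    def step(records):
        return [{**r, "text": text_fn(r["text"])} for r in records]
    return step


def score_filter(score_fn, keep_fn):
    """Build a step that scores each record and keeps those that pass."""
    def step(records):
        return [r for r in records if keep_fn(score_fn(r["text"]))]
    return step


def sequential(steps):
    """Chain steps so the output of one becomes the input of the next."""
    def pipeline(records):
        for step in steps:
            records = step(records)
        return records
    return pipeline


toy_pipeline = sequential([
    # Stand-in for a Unicode fix: NFKC folds the "fi" ligature into "fi".
    modify(lambda text: unicodedata.normalize("NFKC", text)),
    # Stand-in for a word-count filter: keep records with at least 3 words.
    score_filter(lambda text: len(text.split()), lambda n: n >= 3),
])

docs = [
    {"text": "\ufb01ne tuning needs clean data"},
    {"text": "too short"},
]
result = toy_pipeline(docs)
print(result)
```

Because every step has the same shape (records in, records out), steps can be added, removed, or reordered without changing the rest of the pipeline.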

Explore NeMo Curator Tutorials

To get started with NeMo Curator, you can follow the available tutorials. These tutorials include:

  • tinystories, which focuses on data curation for training LLMs from scratch.
  • peft-curation, which focuses on data curation for LLM parameter-efficient fine-tuning (PEFT) use cases.
  • distributed_data_classification, which focuses on using the quality and domain classifiers to help with data annotation.
  • single_node_tutorial, which demonstrates an end-to-end data curation pipeline for curating Wikipedia data in Thai.

Access Python Modules

The NeMo Curator section of the NeMo Framework User Guide provides in-depth information about how the Python modules work. The examples directory in the GitHub repository provides scripts that showcase these modules.

Use CLI Scripts

NeMo Curator also offers CLI scripts for you to use. The scripts in nemo_curator/scripts map closely to the supplied Python modules. Refer to the NeMo Framework User Guide for more information about the Python modules and scripts.

Use NeMo Framework Launcher

As an alternative method for interfacing with NeMo Curator, you can use the NeMo Framework Launcher. The launcher enables you to easily configure the parameters and cluster. It can also automatically generate the SLURM batch scripts that wrap around the CLI scripts required to run your pipeline.

In addition, other methods are available to run NeMo Curator on SLURM. For example, refer to the example scripts in examples/slurm for information on how to run NeMo Curator on SLURM without the NeMo Framework Launcher.

Module Ablation and Compute Performance

The modules within NeMo Curator were primarily designed to curate high-quality documents from Common Crawl snapshots in a scalable manner. To evaluate the quality of the curated Common Crawl documents, we conducted a series of ablation experiments. In these experiments, we trained a 357M-parameter GPT-style model using datasets generated at various stages of our data curation pipeline, which was implemented in NeMo Curator.

The following figure shows that the use of different data curation modules implemented in NeMo Curator led to improved model zero-shot downstream task performance.

[Figure: zero-shot downstream task performance of models trained on data from successive curation stages]

In terms of scalability and compute performance, fuzzy deduplication built on RAPIDS and Dask enabled us to deduplicate the 1.1-trillion-token Red Pajama dataset in 1.8 hours using 64 NVIDIA A100 Tensor Core GPUs.
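
As a rough back-of-the-envelope reading of those figures, the short calculation below derives total GPU-hours and average throughput. It assumes all 64 GPUs were busy for the full 1.8 hours, which the text does not state, so the per-GPU number is only an average.

```python
# Figures quoted in the text above.
tokens = 1.1e12   # tokens in the dataset
hours = 1.8       # wall-clock time for fuzzy deduplication
gpus = 64         # number of A100 GPUs

# Assumption: every GPU was in use for the whole run.
gpu_hours = hours * gpus
tokens_per_hour = tokens / hours
tokens_per_gpu_hour = tokens / gpu_hours

print(f"GPU-hours: {gpu_hours:.1f}")
print(f"Tokens per wall-clock hour: {tokens_per_hour:.2e}")
print(f"Tokens per GPU-hour: {tokens_per_gpu_hour:.2e}")
```

Under that assumption the run costs about 115 GPU-hours, or roughly 9.5 billion tokens per GPU-hour.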

Additionally, the following table shows the time required and the resulting output size for each CPU-based processing step applied to the Common Crawl snapshot from November/December of 2020, using 30 CPU nodes (with hardware similar to the c5.24xlarge Amazon AWS C5 instance).

Dataset: Common Crawl 2020-50

Processing step                 Time      Output size
Download and text extraction    36 hrs    2.8 TB
Text cleaning                   1 hr      2.8 TB
Quality filtering               0.2 hr    0.52 TB
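
The table's figures can be combined into a few summary numbers. The sketch below is simple arithmetic on the values shown. It assumes the steps ran one after another and that all 30 nodes were used for every step, neither of which the table states explicitly.

```python
# (step name, hours, output size in TB), copied from the table above.
stages = [
    ("Download and text extraction", 36.0, 2.8),
    ("Text cleaning", 1.0, 2.8),
    ("Quality filtering", 0.2, 0.52),
]
cpu_nodes = 30

# Assumption: steps run sequentially, so wall-clock times add up.
total_hours = sum(hours for _, hours, _ in stages)
# Assumption: all 30 nodes are in use for every step.
node_hours = total_hours * cpu_nodes

# Fraction of the extracted data that survives quality filtering.
size_before_filtering = stages[1][2]
size_after_filtering = stages[2][2]
retained_fraction = size_after_filtering / size_before_filtering

print(f"Total time: {total_hours:.1f} hrs ({node_hours:.0f} node-hours)")
print(f"Retained after quality filtering: {retained_fraction:.1%}")
```

Download and text extraction dominates the runtime, while quality filtering is the only step that shrinks the data, keeping a little under a fifth of it.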

Contribute to NeMo Curator

We welcome community contributions! Please refer to CONTRIBUTING.md for the process.
