Scalable Data Preprocessing Tool for Training Large Language Models

NeMo Curator

NeMo Curator is a Python library that consists of a collection of scalable data-mining modules for curating natural language processing (NLP) data for training large language models (LLMs). The modules within NeMo Curator enable NLP researchers to mine high-quality text at scale from massive uncurated web corpora. For a demonstration of how each of the modules in NeMo Curator improves downstream performance, check out the module ablation.

NeMo Curator is built on Dask and RAPIDS to scale data curation and provide GPU acceleration. The Python interface provides easy methods to expand the functionality of your curation pipeline without worrying about how it will scale. More information can be found in the usage section. There are many ways to integrate NeMo Curator in your pipeline. Check out the installation instructions for how to get started using it.

Features

We currently support the following data-curation modules. For more details on each module, visit its documentation page in the NeMo framework user guide.

These modules are designed to be flexible and allow for reordering with few exceptions. The NeMo Framework Launcher includes prebuilt pipelines for you to start with and modify as needed.

Learn More

Installation

NeMo Curator currently requires Python 3.10. The GPU-accelerated modules additionally require CUDA 12 or above.
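As a quick pre-install check, the sketch below tests whether the running interpreter matches the stated Python requirement. The helper name `is_supported_python` is hypothetical and not part of NeMo Curator, and the check assumes that "Python 3.10" means any 3.10.x release. It does not verify the CUDA requirement.

```python
import sys

# Assumption: "Python 3.10" is read as any 3.10.x interpreter.
REQUIRED_PYTHON = (3, 10)


def is_supported_python(version_info=None):
    """Return True when the major.minor version equals REQUIRED_PYTHON.

    Hypothetical helper for illustration; not part of NeMo Curator.
    """
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == REQUIRED_PYTHON


if __name__ == "__main__":
    status = "supported" if is_supported_python() else "not supported"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {status}")
```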

PyPI

NeMo Curator can be installed from PyPI as follows:

For CPU only modules:

pip install nemo-curator

For CPU + CUDA accelerated modules:

pip install --extra-index-url https://pypi.nvidia.com nemo-curator[cuda12x]

NeMo Framework Container

The latest release of NeMo Curator is preinstalled in the NeMo Framework Container. The NeMo Framework Container provides an end-to-end platform for developing custom generative AI models anywhere. If you want the latest commit inside the container, uninstall the existing version using:

pip uninstall nemo-curator

Then follow the instructions for installing from source below.

Source

If you want to install the latest commit, clone the repository and install with either of the following commands:

For CPU only modules:

pip install .

For CPU + CUDA accelerated modules:

pip install --extra-index-url https://pypi.nvidia.com ".[cuda12x]"

Usage

Python Library

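# Import paths below follow the NeMo Curator package layout; adjust them if your installed version differs
from nemo_curator import Modify, ScoreFilter, Sequential, TaskDecontamination
from nemo_curator.download import download_common_crawl
from nemo_curator.filters import FastTextQualityFilter, WordCountFilter
from nemo_curator.modifiers import UnicodeReformatter
from nemo_curator.tasks import Squad, TriviaQA, Winogrande

# Download your dataset
dataset = download_common_crawl("/datasets/common_crawl/", "2021-04", "2021-10", url_limit=10)
# Build your pipeline
curation_pipeline = Sequential([
  Modify(UnicodeReformatter()),
  ScoreFilter(WordCountFilter(min_words=80)),
  ScoreFilter(FastTextQualityFilter(model_path="model.bin")),
  TaskDecontamination([Winogrande(), Squad(), TriviaQA()])
])
# Curate your dataset
curated_dataset = curation_pipeline(dataset)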

NeMo Curator provides a collection of robust Python modules that can be chained together to construct your entire data curation pipeline. These modules can be run on your local machine or in a distributed compute environment like SLURM with no modifications. NeMo Curator provides simple base classes that you can inherit from to create your own filters, document modifiers, and other extensions without needing to worry about how they scale. The examples directory contains scripts showcasing each of these modules. The data curation section of the NeMo framework user guide provides in-depth documentation on how each of the modules works. If you need more information to modify NeMo Curator for your use case, the implementation section provides a good starting point.
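To make the chaining idea concrete, here is a minimal pure-Python sketch of the pattern: a filter object that scores a document and decides whether to keep it, a wrapper that applies it to a dataset, and a sequential container that runs modules in order. All `Toy*` names are hypothetical stand-ins written for this illustration. They operate on plain lists of strings and are not NeMo Curator's implementation, which runs on distributed dataframes.

```python
from dataclasses import dataclass
from typing import Callable, List

Dataset = List[str]  # toy dataset: one string per document


@dataclass
class ToyWordCountFilter:
    """Hypothetical filter: the score is the number of whitespace-separated words."""
    min_words: int

    def score_document(self, text: str) -> int:
        return len(text.split())

    def keep_document(self, score: int) -> bool:
        return score >= self.min_words


class ToyScoreFilter:
    """Scores every document with the wrapped filter and keeps those that pass."""

    def __init__(self, doc_filter):
        self.doc_filter = doc_filter

    def __call__(self, dataset: Dataset) -> Dataset:
        f = self.doc_filter
        return [d for d in dataset if f.keep_document(f.score_document(d))]


class ToyModify:
    """Applies a text-to-text function to every document."""

    def __init__(self, modifier: Callable[[str], str]):
        self.modifier = modifier

    def __call__(self, dataset: Dataset) -> Dataset:
        return [self.modifier(d) for d in dataset]


class ToySequential:
    """Runs each module in order, feeding the output of one into the next."""

    def __init__(self, modules):
        self.modules = list(modules)

    def __call__(self, dataset: Dataset) -> Dataset:
        for module in self.modules:
            dataset = module(dataset)
        return dataset


pipeline = ToySequential([
    ToyModify(str.strip),                             # clean up whitespace first
    ToyScoreFilter(ToyWordCountFilter(min_words=3)),  # then drop short documents
])
docs = ["  too short ", " this one has enough words  "]
curated = pipeline(docs)
print(curated)
```

Because every module takes a dataset and returns a dataset, a new filter only needs the two scoring methods to slot into an existing pipeline.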

Scripts

We also provide CLI scripts in case those are more convenient. The scripts under nemo_curator/scripts map closely to the Python modules. Visit the documentation for each Python module for more information about the scripts associated with it.

NeMo Framework Launcher

The NeMo Framework Launcher is another way to interface with NeMo Curator. The launcher allows for easy parameter and cluster configuration and automatically generates the SLURM batch scripts that wrap around the CLI scripts required to run your pipeline. Note: This is not the only way to run NeMo Curator on SLURM. There are example scripts in examples/slurm for running NeMo Curator on SLURM without the launcher.

Module Ablation and Compute Performance

The modules within NeMo Curator were in large part designed to curate high-quality documents from Common Crawl snapshots and to be able to do so in a scalable manner. In order to assess the quality of the Common Crawl documents curated by the modules in NeMo Curator, we performed a series of ablation experiments in which we trained a 357M-parameter GPT-style model on the datasets resulting from the different stages of our data curation pipeline implemented in NeMo Curator. The figure below demonstrates that the different data curation modules implemented within NeMo Curator lead to improved model zero-shot downstream task performance.

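[Figure: zero-shot downstream task performance of the 357M-parameter model at each stage of the data curation pipeline]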

In terms of scalability and compute performance, the RAPIDS + Dask fuzzy deduplication module deduplicates the 1.1-trillion-token RedPajama dataset in 1.8 hours using 64 A100 GPUs.
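For readers unfamiliar with fuzzy deduplication, the sketch below shows one common approach, MinHash signatures over word shingles, in plain Python. This section does not describe the algorithm the GPU module uses, so treat MinHash here as an illustrative assumption. The function names, the shingle size `k=3`, and `num_hashes=64` are all hypothetical choices for this example.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) of a document."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(shingle, seed):
    """64-bit hash of a shingle; each seed acts as a different hash function."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_hashes=64, k=3):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    doc_shingles = shingles(text, k)
    return [min(_seeded_hash(s, seed) for s in doc_shingles)
            for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


original = ("the quick brown fox jumps over the lazy dog near the quiet "
            "river bank at dawn every single day")
near_duplicate = original.replace("single day", "single morning")
unrelated = "stock markets closed higher after central banks held interest rates steady"

sig_original = minhash_signature(original)
print(estimated_jaccard(sig_original, minhash_signature(near_duplicate)))
print(estimated_jaccard(sig_original, minhash_signature(unrelated)))
```

Documents whose estimated similarity exceeds a chosen threshold would be treated as near-duplicates. At scale, signatures are usually bucketed (for example with locality-sensitive hashing) so that not every pair has to be compared.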

Additionally, the table below shows the time required and the resulting data size for each processing step when running the CPU-based modules on the Common Crawl snapshot from November/December of 2020 using 30 CPU nodes (with hardware similar to the c5.24xlarge Amazon AWS C5 instance):

Dataset: Common Crawl 2020-50

Step                            Time      Output Size
Download and text extraction    36 hrs    2.8 TB
Text cleaning                   1 hr      2.8 TB
Quality filtering               0.2 hr    0.52 TB
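The figures in the table can be combined into two summary numbers with a few lines of arithmetic. The sketch below copies the values from the table and assumes the three steps run one after another, which the table itself does not state.

```python
# Figures copied from the table: (step, hours, output size in TB).
steps = [
    ("Download and text extraction", 36.0, 2.8),
    ("Text cleaning", 1.0, 2.8),
    ("Quality filtering", 0.2, 0.52),
]

# Assumption: the steps run sequentially, so their times add up.
total_hours = sum(hours for _, hours, _ in steps)

# Quality filtering takes the cleaned text as input and writes a smaller output.
size_before = steps[1][2]
size_after = steps[2][2]
reduction = 1 - size_after / size_before

print(f"Total time: {total_hours:.1f} hrs")
print(f"Quality filtering removes {reduction:.1%} of the cleaned text")
```

Under that assumption, the download and extraction step accounts for nearly all of the wall-clock time, while quality filtering accounts for all of the size reduction.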

Implementation

As mentioned above, the modules within NeMo Curator enable users to scale data-mining and NLP processing tasks to many nodes within a compute cluster. The modules accomplish this using Dask with cuDF (for the GPU-accelerated modules). At the core of NeMo Curator, DocumentDataset (the main dataset class) is a simple wrapper around a Dask dataframe. Dask allows NeMo Curator to scale to arbitrary cluster sizes, and it supports a variety of distributed computing platforms. It supports reading and writing to different file formats, and it can balance these operations among nodes in the cluster. Importantly, Dask also supports the RAPIDS cuDF library for GPU-accelerated exact and fuzzy deduplication.
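The "thin wrapper around a partitioned dataframe" design can be sketched without Dask. In the toy code below, `ToyPartitionedFrame` is a hypothetical stand-in for a Dask dataframe (a list of partitions, each a list of document strings), and `ToyDocumentDataset` only holds that frame in a `df` attribute. Neither class is NeMo Curator's code, and real Dask evaluates lazily across workers, which this sketch does not attempt.

```python
from typing import Callable, List

Partition = List[str]


class ToyPartitionedFrame:
    """Hypothetical stand-in for a partitioned dataframe."""

    def __init__(self, partitions: List[Partition]):
        self.partitions = [list(p) for p in partitions]

    def map_partitions(self, fn: Callable[[Partition], Partition]) -> "ToyPartitionedFrame":
        # Each partition is processed independently, which is what lets a
        # real scheduler hand different partitions to different workers.
        return ToyPartitionedFrame([fn(p) for p in self.partitions])

    def compute(self) -> List[str]:
        # Gather all partitions into a single in-memory list.
        return [doc for p in self.partitions for doc in p]


class ToyDocumentDataset:
    """Thin wrapper: it stores the frame and adds no processing logic of its own."""

    def __init__(self, frame: ToyPartitionedFrame):
        self.df = frame


def lowercase_partition(partition: Partition) -> Partition:
    return [doc.lower() for doc in partition]


dataset = ToyDocumentDataset(ToyPartitionedFrame([["Doc A", "Doc B"], ["Doc C"]]))
lowered = ToyDocumentDataset(dataset.df.map_partitions(lowercase_partition))
print(lowered.df.compute())
```

Keeping the wrapper thin means a module only needs to describe what happens to one partition. How partitions are distributed is left to the underlying dataframe library.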
