Data-Juicer: A One-Stop Data Processing System for Large Language Models

Data-Juicer is a one-stop data processing system to make data higher-quality, juicier, and more digestible for LLMs. This project is being actively updated and maintained, and we will periodically enhance and add more features and data recipes. We welcome you to join us in promoting LLM data development and research!


Features

  • Systematic & Reusable: Empowering users with a systematic library of 20+ reusable config recipes, 50+ core OPs, and feature-rich dedicated toolkits, designed to function independently of specific LLM datasets and processing pipelines.

  • Data-in-the-loop: Allowing detailed data analyses with an automated report generation feature for a deeper understanding of your dataset. Coupled with multi-dimension automatic evaluation capabilities, it supports a timely feedback loop at multiple stages in the LLM development process.

  • Comprehensive Data Processing Recipes: Offering tens of pre-built data processing recipes for pre-training, post-tuning, English (en), Chinese (zh), and other scenarios, validated on reference LLaMA models.

  • Enhanced Efficiency: Providing a speedy data processing pipeline requiring less memory and CPU usage, optimized for maximum productivity.

  • Flexible & Extensible: Accommodating most types of data formats (e.g., jsonl, parquet, csv, ...) and allowing flexible combinations of OPs. Feel free to implement your own OPs for customizable data processing.

  • User-Friendly Experience: Designed for simplicity, with comprehensive documentation, easy start guides and demo configs, and intuitive configuration: simply add OPs to or remove OPs from existing configs.

Prerequisites

  • Python==3.8 (recommended)
  • gcc >= 5 (at least C++14 support)

Installation

From Source

  • Run the following commands to install the latest data_juicer version in editable mode:
cd <path_to_data_juicer>
pip install -v -e .[all]
  • Or install optional dependencies:
cd <path_to_data_juicer>
pip install -v -e .  # install minimal dependencies
pip install -v -e .[tools] # install a subset of tools dependencies

The dependency options are listed below:

Tag       Description
.         Install minimal dependencies for basic Data-Juicer.
.[all]    Install all optional dependencies (all of the following).
.[dev]    Install dependencies for developing the package as contributors.
.[tools]  Install dependencies for dedicated tools, such as quality classifiers.

Using pip

  • Run the following command to install the latest data_juicer using pip:
pip install py-data-juicer
  • Note: only the basic APIs in data_juicer and two basic tools (data processing and analysis) are available in this way. If you want customizable and complete functions, we recommend you install data_juicer from source.

Using Docker

  • Run the following command to build the Docker image, which includes the latest data-juicer, from the provided Dockerfile:
docker build -t data-juicer:<version_tag> .

Installation check

import data_juicer as dj
print(dj.__version__)
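
If importing the package is not desirable (for example, in a setup script), the installed version can also be read from package metadata. The following is a minimal sketch using only the standard library. The helper name installed_version is hypothetical and not part of Data-Juicer, and it assumes the distribution is registered under the name py-data-juicer, as in the pip command in this document.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = "py-data-juicer") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    The default distribution name is an assumption taken from the
    `pip install py-data-juicer` command; this helper is illustrative only.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("py-data-juicer is not installed in this environment")
    else:
        print(f"py-data-juicer {version} is installed")
```

importlib.metadata is available from Python 3.8 onwards, which matches the recommended Python version.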

Quick Start

Data Processing

  • Run the process_data.py tool or the dj-process command-line tool with your config as the argument to process your dataset.
# only for installation from source
python tools/process_data.py --config configs/demo/process.yaml

# use command line tool
dj-process --config configs/demo/process.yaml
  • Note: Some operators rely on third-party models or resources that are not stored locally. The first run of such operators may be slow because these resources must first be downloaded into a cache directory. The default cache directory is ~/.cache/data_juicer. To change it, set the shell environment variable DATA_JUICER_CACHE_HOME to another directory; DATA_JUICER_MODELS_CACHE and DATA_JUICER_ASSETS_CACHE can be changed in the same way:
# cache home
export DATA_JUICER_CACHE_HOME="/path/to/another/directory"
# cache models
export DATA_JUICER_MODELS_CACHE="/path/to/another/directory/models"
# cache assets
export DATA_JUICER_ASSETS_CACHE="/path/to/another/directory/assets"
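
To check which directories a given environment would resolve to, the lookup can be sketched in Python as below. This is not Data-Juicer's actual implementation: the function name resolve_cache_dirs is hypothetical, and the fallback of the models and assets caches to models/ and assets/ subdirectories of the cache home is an assumption inferred from the example paths above.

```python
import os
from pathlib import Path
from typing import Dict, Mapping, Optional

DEFAULT_CACHE_HOME = "~/.cache/data_juicer"  # default stated in the note above


def resolve_cache_dirs(env: Optional[Mapping[str, str]] = None) -> Dict[str, Path]:
    """Resolve the cache directories from environment variables.

    Assumption (not confirmed by the source): when the models/assets
    variables are unset, they fall back to subdirectories of the cache home.
    """
    env = os.environ if env is None else env
    home = Path(env.get("DATA_JUICER_CACHE_HOME", DEFAULT_CACHE_HOME)).expanduser()
    models = Path(env.get("DATA_JUICER_MODELS_CACHE", str(home / "models"))).expanduser()
    assets = Path(env.get("DATA_JUICER_ASSETS_CACHE", str(home / "assets"))).expanduser()
    return {"home": home, "models": models, "assets": assets}


if __name__ == "__main__":
    for name, path in resolve_cache_dirs().items():
        print(f"{name}: {path}")
```

Passing a plain dict instead of os.environ makes it easy to preview the effect of an export before changing the shell configuration.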

Data Analysis

  • Run the analyze_data.py tool or the dj-analyze command-line tool with your config as the argument to analyse your dataset.
# only for installation from source
python tools/analyze_data.py --config configs/demo/analyser.yaml

# use command line tool
dj-analyze --config configs/demo/analyser.yaml
  • Note: The Analyser only computes stats for Filter ops, so any extra Mapper or Deduplicator ops are ignored during analysis.
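
To preview which ops in a config would contribute stats, the rule above can be sketched as follows. This is only an illustration under two assumptions: that the operator list is a list of single-key mappings (operator name to arguments), and that Filter ops can be recognised by a _filter suffix in their names, as in language_id_score_filter. The function name and the two example_* op names are hypothetical placeholders.

```python
from typing import Any, Dict, List, Optional, Tuple

# One entry of the operator list: {op_name: {arg: value, ...}} (assumed layout).
OpEntry = Dict[str, Optional[Dict[str, Any]]]


def split_ops_for_analysis(process: List[OpEntry]) -> Tuple[List[str], List[str]]:
    """Split op names into (analysed, ignored) for the analysis step.

    Assumption: Filter ops are identified by a "_filter" name suffix.
    """
    analysed: List[str] = []
    ignored: List[str] = []
    for entry in process:
        for op_name in entry:
            if op_name.endswith("_filter"):
                analysed.append(op_name)
            else:
                ignored.append(op_name)
    return analysed, ignored


if __name__ == "__main__":
    process = [
        {"example_cleanup_mapper": None},            # hypothetical Mapper op
        {"language_id_score_filter": {"lang": "en"}},
        {"example_deduplicator": {}},                # hypothetical Deduplicator op
    ]
    analysed, ignored = split_ops_for_analysis(process)
    print("analysed:", analysed)
    print("ignored:", ignored)
```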

Data Visualization

  • Run the app.py tool to visualize your dataset in your browser.
  • Note: This is only available when Data-Juicer is installed from source.
streamlit run app.py

Build Up Config Files

  • A config file specifies some global arguments and an operator list for data processing. You need to set:
    • Global arguments: input/output dataset paths, number of workers, etc.
    • Operator list: the operators, with their arguments, used to process the dataset.
  • You can build your own config files in either of two ways:
    • ➖: Start from our example config file config_all.yaml, which includes all ops with their default arguments. Just remove the ops you won't use and adjust the arguments of the rest.
    • ➕: Build your own config file from scratch. You can refer to our example config file config_all.yaml, the op documents, and the advanced Build-Up Guide for developers.
    • Besides the yaml files, you can also specify one or more parameters on the command line, which override the values in the yaml files.
python xxx.py --config configs/demo/process.yaml --language_id_score_filter.lang=en
  • The basic config format and definition are shown below.

    [Figure: Basic config example of format and definition]
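
The command-line override described above can be pictured with a small sketch. It is not Data-Juicer's actual argument parser: the function apply_override is hypothetical, and the config layout (the keys dataset_path, export_path, np, and a process list of single-key mappings) as well as the min_score argument are assumptions made for illustration. Only the op name language_id_score_filter and its lang argument come from the command shown above.

```python
import copy
from typing import Any, Dict


def apply_override(config: Dict[str, Any], override: str) -> Dict[str, Any]:
    """Apply one '--<op_name>.<arg>=<value>' override to a copy of the config.

    The value is kept as a string; a real parser would also convert types.
    """
    key, eq, value = override.lstrip("-").partition("=")
    op_name, dot, arg = key.partition(".")
    if not (eq and dot and op_name and arg):
        raise ValueError(f"expected '<op_name>.<arg>=<value>', got {override!r}")

    updated = copy.deepcopy(config)  # leave the values loaded from yaml untouched
    for entry in updated["process"]:
        if op_name in entry:
            args = entry[op_name] or {}  # an op listed without arguments may be None
            args[arg] = value
            entry[op_name] = args
            return updated
    raise KeyError(f"operator {op_name!r} is not in the process list")


if __name__ == "__main__":
    # Hypothetical config, as it might look after loading a yaml file.
    config = {
        "dataset_path": "/path/to/input.jsonl",
        "export_path": "/path/to/output.jsonl",
        "np": 4,
        "process": [{"language_id_score_filter": {"lang": "zh", "min_score": 0.8}}],
    }
    new_config = apply_override(config, "--language_id_score_filter.lang=en")
    print(new_config["process"])
```

The override works on a deep copy so that the values read from the yaml file stay available for comparison.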

Preprocess Raw Data (Optional)

  • Our formatters currently support some common input dataset formats:
    • Multiple samples in one file: jsonl/json, parquet, csv/tsv, etc.
    • A single sample in one file: txt, code, docx, pdf, etc.
  • However, data from different sources is complicated and diverse. For example:
    • Raw arXiv data downloaded from S3 consists of thousands of tar files, each containing even more gzip files. The tex files of interest are embedded in these gzip files, so they are hard to obtain directly.
    • Some crawled data includes different kinds of files (pdf, html, docx, etc.), and extra information such as tables and charts is hard to extract.
  • It's impossible to handle all kinds of data in Data-Juicer; issues and PRs that add support for new data types are welcome!
  • Thus, we provide some common preprocessing tools in tools/preprocess for you to preprocess these data.
    • You are welcome to contribute new preprocessing tools for the community.
    • We highly recommend preprocessing complicated data into jsonl or parquet files.
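
As an illustration of the recommendation above, a directory of plain-text files can be converted into a single jsonl file with a few lines of standard-library Python. This sketch is not one of the tools in tools/preprocess. The function name, the "text" field name, and the "meta" field are assumptions made for this example; adjust them to whatever your config expects.

```python
import json
from pathlib import Path
from typing import Union


def txt_dir_to_jsonl(src_dir: Union[str, Path], dst_file: Union[str, Path],
                     text_key: str = "text") -> int:
    """Write every *.txt file under src_dir as one JSON line in dst_file.

    Returns the number of samples written. The field names are assumptions.
    """
    count = 0
    with open(dst_file, "w", encoding="utf-8") as out:
        # Sort paths so that the output order is reproducible across runs.
        for path in sorted(Path(src_dir).rglob("*.txt")):
            text = path.read_text(encoding="utf-8", errors="replace")
            record = {text_key: text, "meta": {"source": str(path)}}
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Each output line is a standalone JSON object, so the resulting file falls under the "multiple samples in one file" formats listed above.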

For Docker Users

  • If you build or pull the docker image of data-juicer, you can run the commands or tools mentioned above using this docker image.
  • Run directly:
# Run the data processing directly. Options used:
#   --rm                                   remove the container after processing
#   --name dj                              name of the container
#   -v <host_data_path>:<image_data_path>  mount the data or config directory into the container
#   -v ~/.cache/:/root/.cache/             mount the cache directory to reuse caches and models (recommended)
#   data-juicer:<version_tag>              image to run
# The last line is the data processing command, as in the sections above.
docker run --rm \
  --name dj \
  -v <host_data_path>:<image_data_path> \
  -v ~/.cache/:/root/.cache/ \
  data-juicer:<version_tag> \
  dj-process --config /path/to/config.yaml
  • Or enter into the running container and run commands in editable mode:
# start the container in the background
docker run -dit \
  --rm \
  --name dj \
  -v <host_data_path>:<image_data_path> \
  -v ~/.cache/:/root/.cache/ \
  data-juicer:latest /bin/bash

# enter into this container and then you can use data-juicer in editable mode
docker exec -it <container_id> bash

License

Data-Juicer is released under Apache License 2.0.

Contributing

We warmly welcome contributions of new features, bug fixes, and discussions. Please refer to the How-to Guide for Developers.

References

If you find our work useful for your research or development, please kindly cite the following paper.

@misc{chen2023datajuicer,
title={Data-Juicer: A One-Stop Data Processing System for Large Language Models},
author={Daoyuan Chen and Yilun Huang and Zhijian Ma and Hesen Chen and Xuchen Pan and Ce Ge and Dawei Gao and Yuexiang Xie and Zhaoyang Liu and Jinyang Gao and Yaliang Li and Bolin Ding and Jingren Zhou},
year={2023},
eprint={2309.02033},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
