Skip to main content

Data Processing for and with Foundation Models.

Project description

[中文主页] | [DJ-Cookbook] | [OperatorZoo] | [API] | [Awesome LLM Data]

Data Processing for and with Foundation Models

Data-Juicer

pypi version Docker version Docker on OSS

DataModality Usage PyPI Downloads

OpZoo Paper Paper

Data-Juicer is a one-stop system to process text and multimodal data for and with foundation models (typically LLMs). We provide a playground with a managed JupyterLab. Try Data-Juicer straight away in your browser! If you find Data-Juicer useful for your research or development, please kindly support us by starting it (then be instantly notified of our new releases) and citing our works.

Platform for AI of Alibaba Cloud (PAI) has deeply integrated Data-Juicer into its data processing products. PAI is an AI Native large model and AIGC engineering platform that provides dataset management, computing power management, model tool chain, model development, model training, model deployment, and AI asset management. For documentation on data processing, please refer to: Quickly submit a DataJuicer job.

Data-Juicer is being actively updated and maintained. We will periodically enhance and add more features, data recipes and datasets. We welcome you to join us, in promoting data-model co-development along with research and applications of foundation models!

News

History News: >
  • [2025-01-03] We support post-tuning scenarios better, via 20+ related new OPs, and via unified dataset format compatible to LLaMA-Factory and ModelScope-Swift.
  • [2024-12-17] We propose HumanVBench, which comprises 16 human-centric tasks with synthetic data, benchmarking 22 video-MLLMs' capabilities from views of inner emotion and outer manifestations. See more details in our paper, and try to evaluate your models with it.
  • [2024-11-22] We release DJ v1.0.0, in which we refactored Data-Juicer's Operator, Dataset, Sandbox and many other modules for better usability, such as supporting fault-tolerant, FastAPI and adaptive resource management.
  • [2024-08-25] We give a tutorial about data processing for multimodal LLMs in KDD'2024.
  • [2024-08-09] We propose Img-Diff, which enhances the performance of multimodal large language models through contrastive data synthesis, achieving a score that is 12 points higher than GPT-4V on the MMVP benchmark. See more details in our paper, and download the dataset from huggingface and modelscope.
  • [2024-07-24] "Tianchi Better Synth Data Synthesis Competition for Multimodal Large Models" — Our 4th data-centric LLM competition has kicked off! Please visit the competition's official website for more information.
  • [2024-07-17] We utilized the Data-Juicer Sandbox Laboratory Suite to systematically optimize data and models through a co-development workflow between data and models, achieving a new top spot on the VBench text-to-video leaderboard. The related achievements have been compiled and published in a paper, and the model has been released on the ModelScope and HuggingFace platforms.
  • [2024-07-12] Our awesome list of MLLM-Data has evolved into a systemic survey from model-data co-development perspective. Welcome to explore and contribute!
  • [2024-06-01] ModelScope-Sora "Data Directors" creative sprint—Our third data-centric LLM competition has kicked off! Please visit the competition's official website for more information.
  • [2024-03-07] We release Data-Juicer v0.2.0 now! In this new version, we support more features for multimodal data (including video now), and introduce DJ-SORA to provide open large-scale, high-quality datasets for SORA-like models.
  • [2024-02-20] We have actively maintained an awesome list of LLM-Data, welcome to visit and contribute!
  • [2024-02-05] Our paper has been accepted by SIGMOD'24 industrial track!
  • [2024-01-10] Discover new horizons in "Data Mixture"—Our second data-centric LLM competition has kicked off! Please visit the competition's official website for more information.
  • [2024-01-05] We release Data-Juicer v0.1.3 now! In this new version, we support more Python versions (3.8-3.10), and support multimodal dataset converting/processing (Including texts, images, and audios. More modalities will be supported in the future). Besides, our paper is also updated to v3.
  • [2023-10-13] Our first data-centric LLM competition begins! Please visit the competition's official websites, FT-Data Ranker (1B Track, 7B Track), for more information.

Demos

DataJuicer-Agent: Quick start your data processing journey!

https://github.com/user-attachments/assets/6eb726b7-6054-4b0c-905e-506b2b9c7927

DataJuicer-Sandbox: Better data-model co-dev at a lower cost!

https://github.com/user-attachments/assets/a45f0eee-0f0e-4ffe-9a42-d9a55370089d

Why Data-Juicer?

  • Systematic & Reusable: Empowering users with a systematic library of 100+ core OPs, and 50+ reusable config recipes and dedicated toolkits, designed to function independently of specific multimodal LLM datasets and processing pipelines. Supporting data analysis, cleaning, and synthesis in pre-training, post-tuning, en, zh, and more scenarios.

  • User-Friendly & Extensible: Designed for simplicity and flexibility, with easy-start guides, and DJ-Cookbook containing fruitful demo usages. Feel free to implement your own OPs for customizable data processing.

    Data-Juicer now uses AI to automatically rewrite and optimize operator docstrings, generating detailed operator documentation to help users quickly understand the functionality and usage of each operator.
    For details about the implementation of this documentation enhancement workflow, please visit the op_doc_enhance_workflow.

  • Efficient & Robust: Providing performance-optimized parallel data processing (Aliyun-PAI\Ray\CUDA\OP Fusion), faster with less resource usage, verified in large-scale production environments.

  • Effect-Proven & Sandbox: Supporting data-model co-development, enabling rapid iteration through the sandbox laboratory, and providing features such as feedback loops and visualization, so that you can better understand and improve your data and models. Many effect-proven datasets and models have been derived from DJ, in scenarios such as pre-training, text-to-video and image-to-text generation. Data-in-the-loop

Documentation

For detailed documentation, please see here.

License

Data-Juicer is released under Apache License 2.0.

Contribution and Acknowledgements

Data-Juicer has benefited greatly from and continues to welcome contributions at all levels: new operators (from simple functions to advanced algorithms based on existing papers), data-recipes & processing scenarios, feature requests, efficiency enhancements, bug fixes, better documentation and usage feedback. Please refer to our Developer Guide to get started. Spreading the word in the community and giving the repository a star ⭐ are also invaluable forms of support!

Our sincere gratitude goes to all our code contributors who are the cornerstone of this project. We strive to keep the list below updated and look forward to including more names (alphabetical order); please reach out if we have missed any acknowledgements.

We look forward to your feedback and collaboration, including partnership inquiries or proposals for new sub-projects related to Data-Juicer. Feel free to contact via issues, PRs, Slack, DingTalk, Discord, and e-mails.

Welcome to join our community on

Discord DingTalk

References

If you find Data-Juicer useful for your research or development, please kindly cite the following works, 1.0paper, 2.0paper.

@inproceedings{djv1,
  title={Data-Juicer: A One-Stop Data Processing System for Large Language Models},
  author={Daoyuan Chen and Yilun Huang and Zhijian Ma and Hesen Chen and Xuchen Pan and Ce Ge and Dawei Gao and Yuexiang Xie and Zhaoyang Liu and Jinyang Gao and Yaliang Li and Bolin Ding and Jingren Zhou},
  booktitle={International Conference on Management of Data},
  year={2024}
}

@article{djv2,
  title={Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for and with Foundation Models},
  author={Chen, Daoyuan and Huang, Yilun and Pan, Xuchen and Jiang, Nana and Wang, Haibin and Zhang, Yilei and Ge, Ce and Chen, Yushuo and Zhang, Wenhao and Ma, Zhijian and Huang, Jun and Lin, Wei and Li, Yaliang and Ding, Bolin and Zhou, Jingren},
  journal={Advances in Neural Information Processing Systems},
  year={2025}
}
More data-related papers from the Data-Juicer Team: >

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

py_data_juicer-1.4.6.tar.gz (524.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

py_data_juicer-1.4.6-py3-none-any.whl (2.0 MB view details)

Uploaded Python 3

File details

Details for the file py_data_juicer-1.4.6.tar.gz.

File metadata

  • Download URL: py_data_juicer-1.4.6.tar.gz
  • Upload date:
  • Size: 524.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.25

File hashes

Hashes for py_data_juicer-1.4.6.tar.gz
Algorithm Hash digest
SHA256 503613d4f2012153857c600069046e0f6a775dccd03c46e7372cfa260bee2267
MD5 943e70cfcf92fb5d31d0de237792ab9e
BLAKE2b-256 820ccf83028d8129a8ebd4a70ce750d4316cc6da81477102387a948909879d3f

See more details on using hashes here.

File details

Details for the file py_data_juicer-1.4.6-py3-none-any.whl.

File metadata

  • Download URL: py_data_juicer-1.4.6-py3-none-any.whl
  • Upload date:
  • Size: 2.0 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.25

File hashes

Hashes for py_data_juicer-1.4.6-py3-none-any.whl
Algorithm Hash digest
SHA256 dfc1df4ead6446fd8e8d2a1ac3a1ba51d2cfaba18bfd3609a014c3b2a743a74d
MD5 a817bf8155a5788a6eb17edf5ea0d213
BLAKE2b-256 a3867cebbf5da1fe563b483d0c6e4d05b6b59854e400bed7c488f1a53e291211

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page