Data Processing for and with Foundation Models.

Data-Juicer: The Data Operating System for the Foundation Model Era

Multimodal | Cloud-Native | AI-Ready | Large-Scale

Data-Juicer (DJ) transforms raw data chaos into AI-ready intelligence. It treats data processing as composable infrastructure—providing modular building blocks to clean, synthesize, and analyze data across the entire AI lifecycle, unlocking latent value in every byte.

Whether you're deduplicating web-scale pre-training corpora, curating agent interaction traces, or preparing domain-specific RAG indices, DJ scales seamlessly from your laptop to thousand-node clusters—no glue code required.

Alibaba Cloud PAI has deeply integrated Data-Juicer into its data processing products. See "Quickly submit a DataJuicer job".

🚀 Quick Start

Install & run:

uv pip install py-data-juicer
dj-process --config demos/process_simple/process.yaml

Or compose in Python:

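from data_juicer.core.data import NestedDataset
from data_juicer.ops.filter import TextLengthFilter
from data_juicer.ops.mapper import WhitespaceNormalizationMapper

# Build a small in-memory dataset with a single "text" field.
ds = NestedDataset.from_dict({
    "text": ["Short", "This passes the filter.", "Text   with   spaces"]
})
# Operators run in order: drop texts shorter than 10 characters, then normalize whitespace.
res_ds = ds.process([
    TextLengthFilter(min_len=10),
    WhitespaceNormalizationMapper()
])

# Print each sample that survived the pipeline.
for s in res_ds:
    print(s)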
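
To make the data flow concrete, here is a dependency-free sketch of what the two-step chain above does conceptually: a length filter drops short samples, then a mapper rewrites the survivors. The helper names (`keep_if_long_enough`, `normalize_spaces`, `run_chain`) are hypothetical, and collapsing runs of whitespace is an assumption made for illustration. This is not Data-Juicer's implementation, and `WhitespaceNormalizationMapper` may behave differently.

```python
from typing import Callable, Dict, List

Sample = Dict[str, str]


def keep_if_long_enough(min_len: int) -> Callable[[Sample], bool]:
    """Build a filter predicate: keep samples with at least min_len characters."""
    return lambda sample: len(sample["text"]) >= min_len


def normalize_spaces(sample: Sample) -> Sample:
    """Mapper: collapse whitespace runs into single spaces (illustrative assumption)."""
    return {**sample, "text": " ".join(sample["text"].split())}


def run_chain(
    samples: List[Sample],
    predicate: Callable[[Sample], bool],
    mapper: Callable[[Sample], Sample],
) -> List[Sample]:
    """Filter first, then map only the samples that passed the filter."""
    return [mapper(s) for s in samples if predicate(s)]


samples = [{"text": t} for t in ["Short", "This passes the filter.", "Text   with   spaces"]]
result = run_chain(samples, keep_if_long_enough(10), normalize_spaces)
for s in result:
    print(s)
```

Filters decide whether a sample stays; mappers change the sample itself. Keeping the two roles separate is what lets operators be chained in any order.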

✨ Why Data-Juicer?

1. Modular & Extensible Architecture

200+ operators spanning text, image, audio, video, and multimodal data
Recipe-first: Reproducible YAML pipelines you can version, share, and fork like code
Composable: Drop in a single operator, chain complex workflows, or orchestrate full pipelines
Hot-reload: Iterate on operators without pipeline restarts
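
The recipe-first idea can be sketched in a few lines: a recipe is plain data (here a Python list shaped like what a YAML file might parse into), and a registry maps operator names to callables. Everything below (`OPERATORS`, `register`, `run_recipe`, and both operator names) is hypothetical and only illustrates the pattern; Data-Juicer's real recipe schema and operator registry are described in its own documentation.

```python
from typing import Any, Callable, Dict, List

Sample = Dict[str, Any]
Op = Callable[[List[Sample]], List[Sample]]

# Hypothetical registry: operator name -> factory that builds a batch-level callable.
OPERATORS: Dict[str, Callable[..., Op]] = {}


def register(name: str):
    """Decorator that records an operator factory under a recipe-visible name."""
    def decorator(factory):
        OPERATORS[name] = factory
        return factory
    return decorator


@register("min_length_filter")
def min_length_filter(min_len: int = 1) -> Op:
    return lambda samples: [s for s in samples if len(s["text"]) >= min_len]


@register("lowercase_mapper")
def lowercase_mapper() -> Op:
    return lambda samples: [{**s, "text": s["text"].lower()} for s in samples]


def run_recipe(recipe: List[Dict[str, Any]], samples: List[Sample]) -> List[Sample]:
    """Instantiate each step from the registry and apply the steps in order."""
    for step in recipe:
        (name, params), = step.items()  # each step is a one-entry mapping
        samples = OPERATORS[name](**(params or {}))(samples)
    return samples


# The recipe is data, so it can be versioned, shared, and diffed like code.
recipe = [
    {"min_length_filter": {"min_len": 5}},
    {"lowercase_mapper": None},
]
out = run_recipe(recipe, [{"text": "Hi"}, {"text": "Hello World"}])
print(out)
```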

2. Full-Spectrum Data Intelligence

Foundation Models: Pre-training, fine-tuning, RL, and evaluation-grade curation
Agent Systems: Clean tool traces, structure context, de-identify data, and gate on quality
RAG & Analytics: Extraction, normalization, semantic chunking, deduplication, and data profiling

3. Production-Ready Performance

Scale: Process 70B samples in 2h on 50 Ray nodes (6400 cores)
Efficiency: Deduplicate 5TB in 2.8h using 1280 cores
Optimization: Automatic OP fusion (2-10x speedup), adaptive parallelism, CUDA acceleration, robustness
Observability: Built-in tracing for debugging, auditing, and iterative improvement
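
As a rough intuition for why operator fusion can help, the sketch below merges consecutive filters into a single predicate, so the data is traversed once instead of once per filter. This is a simplified illustration under assumed names (`run_unfused`, `fuse_filters`, `run_fused`), not Data-Juicer's fusion algorithm, and it makes no claim about the speedup figures quoted above.

```python
from typing import Callable, List

Predicate = Callable[[str], bool]


def run_unfused(texts: List[str], predicates: List[Predicate]) -> List[str]:
    """Apply each filter as its own pass, building an intermediate list each time."""
    for pred in predicates:
        texts = [t for t in texts if pred(t)]
    return texts


def fuse_filters(predicates: List[Predicate]) -> Predicate:
    """Combine predicates into one; all() stops at the first failing check."""
    return lambda text: all(pred(text) for pred in predicates)


def run_fused(texts: List[str], predicates: List[Predicate]) -> List[str]:
    """Single pass over the data using the fused predicate."""
    fused = fuse_filters(predicates)
    return [t for t in texts if fused(t)]


predicates = [
    lambda t: len(t) >= 5,      # long enough
    lambda t: t.isascii(),      # ASCII only
    lambda t: not t.isupper(),  # not all caps
]
texts = ["tiny", "a normal sentence", "SHOUTING TEXT", "naïve café text"]

# Fusion must not change the result, only how many passes are made.
assert run_fused(texts, predicates) == run_unfused(texts, predicates)
print(run_fused(texts, predicates))
```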

⭐ If Data-Juicer saved you time or improved your data work, please consider starring the repo. It helps more people discover the project and keeps you notified of new releases and features.

📰 News

[2026-06-26] Release v1.5.3: VLA Ops Enhancements; Ray Repartition Pipeline; Scalability & Robustness

🤖 VLA Ops Enhancements — Expanded embodied-AI processing with 10+ new/renamed VLA operators (camera calibration via DeepCalib/DroidCalib/MoGe, atomic action segmentation, hand action computation & motion smoothing, clip reassembly, trajectory overlay, LeRobot export) and a complete VLA pipeline demo.
🔄 Ray Repartition Pipeline — New ray_repartition_pipeline for dataset-level block repartitioning in Ray mode.
⚡ Scalable Ray Data Reads — Wired override_num_blocks through the full call chain for controlling block parallelism on PB-scale datasets.
🧪 Test Coverage Expansion — Added 409 new test cases across 18 test files.
🐳 Stability & Robustness Fixes — JSONStreamDatasource schema unification, OP env version resolution, FUSE-safe rmtree for PartitionedRayExecutor, deprecated model name updates, and num_proc handling fixes.

[2026-05-29] Release v1.5.2: Semantic LLM OPs, Cross-doc Line Dedup & Leaner Dependencies

🧹 New Deduplicator — Added DocumentLineDeduplicator for cross-document line-level dedup, removing boilerplate lines (templates, copyright notices, navigation bars) by global document frequency.
🤖 Agent Data Quality Toolkit — Shipped interaction-quality OPs & recipe, a bad-case HTML report, and more robust JSONL / HuggingFace meta loading.
📦 Leaner & Faster Install — Slimmed the default dependency set (Ray, audio, spaCy, av, etc. moved to on-demand extras) to speed up installation.
🐳 Stability & Robustness Fixes — Library-safe error handling (raise over exit(1)), Ray init/temp-dir fixes, valid API params (drop invalid max_new_tokens), PyArrow 20+ batch JSON reading, local-path aesthetics model support, and more performance/bug fixes.
🧠 Semantic LLM Operators — Introduced llm_extract_mapper, llm_condition_filter, and llm_structured_ops with unified llm_* naming and configurable inference strategies (join/agg/top-k planned).
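
The cross-document line dedup idea from the v1.5.2 notes above can be sketched as a two-pass procedure: count in how many documents each line appears, then drop lines whose document frequency exceeds a threshold. The function name `drop_boilerplate_lines` and the `max_doc_ratio` parameter are hypothetical; the actual parameters and behavior of `DocumentLineDeduplicator` may differ.

```python
from collections import Counter
from typing import List


def drop_boilerplate_lines(docs: List[str], max_doc_ratio: float = 0.5) -> List[str]:
    """Remove lines that occur in more than max_doc_ratio of all documents."""
    # Pass 1: document frequency, counting each distinct line once per document.
    doc_freq: Counter = Counter()
    for doc in docs:
        doc_freq.update({line.strip() for line in doc.splitlines() if line.strip()})

    limit = max_doc_ratio * len(docs)

    # Pass 2: keep non-empty lines that are not too widespread across documents.
    cleaned = []
    for doc in docs:
        kept = [
            line for line in doc.splitlines()
            if line.strip() and doc_freq[line.strip()] <= limit
        ]
        cleaned.append("\n".join(kept))
    return cleaned


docs = [
    "Home | About | Contact\nRay scales Python workloads.\n(c) Example Corp",
    "Home | About | Contact\nParquet is a columnar format.\n(c) Example Corp",
    "Home | About | Contact\nUnique body text here.",
]
cleaned_docs = drop_boilerplate_lines(docs)
print(cleaned_docs)
```

Counting per document rather than per occurrence matters: a line repeated many times inside one document is not boilerplate in this sense, while a navigation bar that appears once in every page is.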

[2026-03-17] Release v1.5.1: LaTeX OPs; Compressed Format Support; Operator Robustness Fixes

📄 Two new LaTeX-focused mapper OPs shipped, extending data-juicer's document processing capabilities to handle .tex archives and figure contexts.
🗜️ Compressed dataset format support: json[l].gz files can now be loaded directly, and Ray datasets gain proper support for reading compressed JSON files.
📚 New documentation added covering cache, export, and tracing workflows to help users better understand and debug data processing pipelines.
🤖 Major refactor and upgrade of data-juicer-agents completed: The project architecture and CLI/session capabilities were comprehensively redesigned for better maintainability and extensibility. See data-juicer-agents for more details.
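
For readers unfamiliar with the compressed format mentioned above, a `.jsonl.gz` file is gzip-compressed text with one JSON object per line. The standard-library sketch below writes and reads such a file. It shows the file format only and is not how Data-Juicer loads datasets internally; the helper names `write_jsonl_gz` and `read_jsonl_gz` are hypothetical.

```python
import gzip
import json
import os
import tempfile
from typing import Dict, Iterator, List


def write_jsonl_gz(path: str, rows: List[Dict]) -> None:
    """Write one JSON object per line into a gzip-compressed text file."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl_gz(path: str) -> Iterator[Dict]:
    """Stream records back without decompressing the whole file into memory."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                yield json.loads(line)


rows = [{"text": "first sample"}, {"text": "second sample"}]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.jsonl.gz")
    write_jsonl_gz(path, rows)
    loaded = list(read_jsonl_gz(path))

print(loaded)
```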

[2026-02-12] Release v1.5.0: Partitioned Ray Executor, OP-level Env Management, and More Embodied-AI OPs

🚀 Enhanced Distributed Execution Framework -- Introduced partitioned Ray executor and OP-level isolated environments to improve fault tolerance, scalability, and dependency conflict resolution.
🤖 Expanded Embodied AI Video Processing -- Added specialized operators for camera calibration, video undistortion, hand reconstruction, and pose estimation to strengthen multi-view video handling.
💪🏻 System Performance & Developer Experience Optimizations -- Enabled batch inference, memory/log reduction, core logic refactoring, and updated documentation/templates.
🐳 Critical Bug Fixes & Stability Improvements -- Resolved duplicate tracking, parameter conflicts, homepage rendering issues, and outdated docs for higher reliability.

[2026-02-02] Release v1.4.6: Copilot, Video Bytes I/O & Ray Tracing

🤖 Q&A Copilot: Now live on our Doc Site | DingTalk | Discord. Feel free to ask anything related to the Data-Juicer ecosystem! See 🤖 Data-Juicer Agents | 📃 Deploy-ready code | 🎬 More demos for more details.
🎬 Video Bytes I/O — Direct bytes processing for video pipelines
🫆 Ray Mode Tracer — Track changed samples in distributed processing
🐳 Enhancements & fixes — refreshed Docker image, small perf boosts, GitHub Insights traffic workflow, Ray compatibility updates, and bug/doc fixes.
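
The idea behind tracking changed samples, as in the Ray Mode Tracer item above, can be illustrated with a small wrapper that applies a mapper and records before/after pairs only where the output differs. `trace_mapper` and its record layout are assumptions made for illustration, not the Data-Juicer tracer API, and the sketch runs in a single process rather than on Ray.

```python
from typing import Callable, Dict, List, Tuple


def trace_mapper(
    mapper: Callable[[str], str], texts: List[str]
) -> Tuple[List[str], List[Dict[str, object]]]:
    """Apply mapper to every text and log only the samples it actually changed."""
    outputs: List[str] = []
    changes: List[Dict[str, object]] = []
    for index, original in enumerate(texts):
        processed = mapper(original)
        outputs.append(processed)
        if processed != original:
            changes.append({"index": index, "before": original, "after": processed})
    return outputs, changes


outputs, changes = trace_mapper(str.strip, ["  padded  ", "clean", "trailing "])
for change in changes:
    print(change)
```

Logging only the differences keeps the trace small when an operator touches a minority of samples, which is the common case for cleaning steps.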

[2026-01-15] Release v1.4.5: 20+ New OPs, Ray vLLM Pipelines & Sphinx Docs Upgrade

Embodied-AI OPs: added/enhanced mappers for video captioning (VLM), video object segmentation (YOLOE+SAM2), video depth estimation (viz + point cloud), human pose (MMPose), image tagging (VLM), single-image 3D body mesh recovery (SAM 3D Body), plus S3 upload/download.
New Pipeline OP: compose multiple OPs into one pipeline; introduced Ray + vLLM pipelines for LLM/VLM inference.
Docs upgrade: moved to a unified Sphinx-based documentation build/deploy workflow with isolated theme/architecture repo.
Enhancements & fixes: dependency updates, improved Ray deduplication and S3 loading, OpenAI Responses API support, tracer consistency, Docker base updated to CUDA 12.6.3 + Ubuntu 24.04 + Py3.11, and multiple bug fixes.

[2025-12-01] Release v1.4.4: NeurIPS’25 Spotlight, 6 New Video/MM OPs & S3 I/O

NeurIPS'25 Spotlight for Data-Juicer 2.0
Repo split: sandbox/recipes/agents moved to standalone repos
S3 I/O added to loader/exporter
6 new video & multimodal OPs (character detection, VGGT, whole-body pose, hand reconstruction) + docs/Ray/video I/O improvements and bug fixes

View all releases and the news archive

🔌 Users & Ecosystems

The list below focuses on developer-facing integrations and usage, in alphabetical order.
Missing your project / name? Feel free to open a PR or reach out.

Data-Juicer plugs into your existing stack and evolves with community contributions:

Extensions

data-juicer-agents — DJ Copilot and agentic workflows
data-juicer-hub — Community recipes and best practices
data-juicer-sandbox — Data-model co-development with feedback loops

Frameworks & Platforms

AgentScope · Apache Arrow · Apache HDFS · Apache Hudi · Apache Iceberg · Apache Paimon · Alibaba PAI · Delta Lake · DiffSynth-Studio · EasyAnimate · Eval-Scope · Huawei Ascend · Hugging Face · LanceDB · LLaMA-Factory · ModelScope · ModelScope Swift · NVIDIA NeMo · Ray · RM-Gallery · Trinity-RFT · Volcano Engine

Industry

Alibaba Group, Ant Group, BYD Auto, ByteDance, DTSTACK, JD.com, NVIDIA, OPPO, Xiaohongshu, Xiaomi, Ximalaya, and more.

Academia

CAS, Nanjing University, Peking University, RUC, Tsinghua University, UCAS, Zhejiang University, and more.

Contributing & Community

We believe in building together. Whether you're fixing a typo, crafting a new operator, or sharing a breakthrough recipe, every contribution shapes the future of data processing.

We welcome contributions at all levels:

Good First Issues — Add operators, improve docs, report issues, or fix bugs
Developer Guide — Optimize engines, add features, or enhance core infrastructure
DJ-Hub — Share knowledge: recipes, papers, and best practices
Connect: Slack · DingTalk · Discord

Data-Juicer is made possible by the users and community:

Initiated by: Alibaba Tongyi Lab
Co-developed with: Alibaba Cloud PAI, Anyscale (Ray team), Sun Yat-sen University, NVIDIA (NeMo team), and contributors worldwide
Inspired by: Apache Arrow, Ray, Hugging Face Datasets, BLOOM, RedPajama-Data, ...

Documentation

For detailed documentation, please see the Doc Site.

Quick Links:

operator zoo — Browse 200+ operators with examples
Agent interaction quality & bad-case — In-repo recipe, JSONL pipeline, HTML report (demos/agent/; operators such as agent_bad_case_signal_mapper are also listed in docs/Operators.md)
data-juicer-hub — Community-driven recipes and best practices
developer guide — Build your own code and contribute to DJ
data-juicer-cookbook — resource archive
awesome_llm_data — “Awesome List” for data-model co-development

📄 License & Attribution

Data-Juicer is released under the Apache License 2.0.
Attribution is appreciated: please use our badge, or add a text note such as "This project uses Data-Juicer: https://github.com/datajuicer".

📖 Citation

If you find Data-Juicer useful in your work, please cite:

@inproceedings{djv1,
  title={Data-Juicer: A One-Stop Data Processing System for Large Language Models},
  author={Chen, Daoyuan and Huang, Yilun and Ma, Zhijian and Chen, Hesen and Pan, Xuchen and Ge, Ce and Gao, Dawei and Xie, Yuexiang and Liu, Zhaoyang and Gao, Jinyang and Li, Yaliang and Ding, Bolin and Zhou, Jingren},
  booktitle={SIGMOD},
  year={2024}
}

@article{djv2,
  title={Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for and with Foundation Models},
  author={Chen, Daoyuan and Huang, Yilun and Pan, Xuchen and Jiang, Nana and Wang, Haibin and Zhang, Yilei and Ge, Ce and Chen, Yushuo and Zhang, Wenhao and Ma, Zhijian and Huang, Jun and Lin, Wei and Li, Yaliang and Ding, Bolin and Zhou, Jingren},
  journal={NeurIPS},
  year={2025}
}

More Publications

(ICML'25 Spotlight) Data-Juicer Sandbox: A Feedback-Driven Suite for Multimodal Data-Model Co-development
(CVPR'25) ImgDiff: Contrastive Data Synthesis for Vision Large Language Models
(TPAMI'25) The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective
(NeurIPS'25) Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data
(NeurIPS'25) MindGYM: What Matters in Question Synthesis for Thinking-Centric Fine-Tuning?
(CVPR'26) HumanVBench: Exploring Human-Centric Video Understanding Capabilities of MLLMs with Synthetic Benchmark Data
(ICML'26) DetailMaster: Can Your Text-to-Image Model Handle Long Prompts?
(Data Scaling) BiMix: A Bivariate Data Mixing Law for Language Model Pretraining

Release history

Version	Release date
1.5.3	Jun 26, 2026 (this version)
1.5.2	May 29, 2026
1.5.1	Mar 17, 2026
1.5.0	Feb 26, 2026
1.4.6	Feb 2, 2026
1.4.5	Jan 13, 2026
1.4.4	Dec 1, 2025
1.4.3	Sep 11, 2025
1.4.2	Aug 18, 2025
1.4.1	Jul 16, 2025
1.4.0	Jun 16, 2025
1.3.3	May 9, 2025
1.3.2	Apr 25, 2025
1.3.1	Apr 11, 2025
1.3.0	Mar 28, 2025
1.2.2	Mar 14, 2025
1.2.1	Feb 28, 2025
1.2.0	Feb 14, 2025
1.1.0.post1	Feb 7, 2025
1.1.0	Jan 17, 2025
1.0.3	Jan 3, 2025
1.0.2	Dec 20, 2024
1.0.1	Dec 6, 2024
1.0.0	Nov 22, 2024
0.2.0	Mar 8, 2024
0.1.3	Jan 5, 2024
0.1.2	Sep 28, 2023
0.1.1	Sep 15, 2023
0.1.0	Sep 15, 2023

Download files

Source Distribution

py_data_juicer-1.5.3.tar.gz (530.2 kB), uploaded Jun 26, 2026, Source

Built Distribution

py_data_juicer-1.5.3-py3-none-any.whl (2.3 MB), uploaded Jun 26, 2026, Python 3

File details

Details for the file py_data_juicer-1.5.3.tar.gz.

File metadata

Download URL: py_data_juicer-1.5.3.tar.gz
Upload date: Jun 26, 2026
Size: 530.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.25

File hashes

Hashes for py_data_juicer-1.5.3.tar.gz
Algorithm	Hash digest
SHA256	`e85b715323dc7a24103fd334cd2f701e1c9948b6f2884f5d4976503cca2d97bd`
MD5	`11c4da77955b4382aa845b24f637aa84`
BLAKE2b-256	`7a7681c0330a57f487083f0fdc1bbef5a46a56bb19076921ed4b081fc511d69f`

File details

Details for the file py_data_juicer-1.5.3-py3-none-any.whl.

File metadata

Download URL: py_data_juicer-1.5.3-py3-none-any.whl
Upload date: Jun 26, 2026
Size: 2.3 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.25

File hashes

Hashes for py_data_juicer-1.5.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`22478ed66194dcdc3895af89632a5ac896f387bff9fd2bf74007ca6cdbec875c`
MD5	`77623a924f9dbdf9e046962a175b508f`
BLAKE2b-256	`c6e4cc742e1f3c187b7bd2c59c0f8f1a80a0bb2678ba395751240b25b44e0888`
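
The digests listed above can be checked after download with Python's standard `hashlib`. The sketch below is a generic check: `sha256_of` and `matches` are hypothetical helpers, and the demo hashes a small temporary file rather than the real archive. To verify a download, pass the path of the downloaded file and the SHA256 value listed for it to `matches`.

```python
import hashlib
import os
import tempfile


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: str, expected_hex: str) -> bool:
    """Compare the file's SHA256 with a published hex digest, ignoring case and spaces."""
    return sha256_of(path) == expected_hex.strip().lower()


# Demo on a throwaway file; replace with a real download path and published digest.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.bin")
    with open(demo_path, "wb") as f:
        f.write(b"hello")
    demo_digest = sha256_of(demo_path)
    ok = matches(demo_path, hashlib.sha256(b"hello").hexdigest())

print(demo_digest, ok)
```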
