Modern Data Centric AI system for Large Language Models

DataFlow

🎉 If you like our project, please star ⭐ the repository on GitHub to follow the latest updates.

Beginner-friendly learning resources (continuously updated): [🎬 Video Tutorials] [📚 Written Tutorials]

📰 1. News

[2025-12-19] 🎉 Our DataFlow technical report is now available!
We welcome you to read and cite our work if you find it helpful.
👉 Read the full report on arXiv: https://arxiv.org/abs/2512.16676
[2025-11-20] Introducing New Data Agents for DataFlow! 🤖 You can try them out now and follow the tutorial on Bilibili for a quick start.
[2025-06-28] 🎉 We’re excited to announce that DataFlow, our Data-centric AI system, is now released! Stay tuned for future updates.

🔍 2. Overview

DataFlow is a data preparation and training system that parses, generates, processes, and evaluates high-quality data from noisy sources (PDF, plain text, low-quality QA). The resulting data improves the performance of large language models (LLMs) in specific domains, either through targeted training (pre-training, supervised fine-tuning, RL training) or through RAG over a cleaned knowledge base. DataFlow has been empirically validated to improve the performance of domain-oriented LLMs in fields such as healthcare, finance, and law.

Specifically, we build diverse operators using rule-based methods, deep learning models, LLMs, and LLM APIs. These operators are combined into distinct pipelines that together form the DataFlow system. We also develop an intelligent DataFlow-agent that can dynamically assemble new pipelines by recombining existing operators on demand.

🛠️ 3. Operators Functionality

🔧 3.1 How Operators Work

DataFlow adopts a modular operator design, building flexible data processing pipelines by combining different types of operators. As the basic unit of data processing, an operator takes structured input (such as JSON, JSONL, or CSV) and outputs processed, higher-quality data. For a detailed guide on using operators, please refer to the Operator Documentation.
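
As a rough illustration of the operator idea described above, the sketch below reads JSONL records, applies one rule-based filter, and writes the surviving records back out. The class name `MinLengthFilter`, its `run` method, the helper functions, and the `text` field are hypothetical and chosen only for illustration; they are not DataFlow's actual operator API, which is covered in the Operator Documentation.

```python
import json
from pathlib import Path


class MinLengthFilter:
    """Hypothetical rule-based operator: keep records whose text is long enough."""

    def __init__(self, field: str = "text", min_chars: int = 20):
        self.field = field
        self.min_chars = min_chars

    def run(self, records: list[dict]) -> list[dict]:
        # Keep only records whose target field has at least `min_chars` characters.
        # Records that lack the field are treated as empty and dropped.
        return [
            r for r in records
            if len(str(r.get(self.field, ""))) >= self.min_chars
        ]


def read_jsonl(path: Path) -> list[dict]:
    # Parse one JSON object per non-empty line.
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def write_jsonl(path: Path, records: list[dict]) -> None:
    # Write one JSON object per line, keeping non-ASCII text readable.
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

In this sketch, `write_jsonl(out_path, MinLengthFilter().run(read_jsonl(in_path)))` would apply the filter end to end, where `in_path` and `out_path` are placeholder paths.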

📊 3.2 Operator Classification System

In the DataFlow framework, operators are divided into three core categories based on their functional characteristics:

Operator Type	Quantity	Main Function
Generic Operators	80+	Covers general functions for text evaluation, processing, and synthesis
Domain-Specific Operators	40+	Specialized processing for specific domains (e.g., medical, financial, legal)
Evaluation Operators	20+	Comprehensively evaluates data quality across 6 dimensions

🛠️ 4. Pipelines Functionality

🔧 4.1 Ready-to-Use Pipelines

The pipelines currently available in DataFlow are:

📝 Text Pipeline: Mines question-answer pairs from large-scale plain-text data (mostly crawled from the Internet) for use in SFT and RL training.
- [HuggingFace🤗 demo input & output for Text Pipeline]
🧠 Reasoning Pipeline: Enhances existing question-answer pairs with (1) extended chain-of-thought, (2) category classification, and (3) difficulty estimation.
- [HuggingFace🤗 demo input & output for Reasoning Pipeline]
🗃️ Text2SQL Pipeline: Translates natural language questions into SQL queries, supplemented with explanations, chain-of-thought reasoning, and contextual schema information.
- [HuggingFace🤗 demo input & output for Text2SQL Pipeline]
📚 Knowledge Base Cleaning Pipeline: Extracts and structures knowledge from unorganized sources such as tables, PDFs, and Word documents into usable entries for downstream RAG or QA pair generation.
🤖 Agentic RAG Pipeline: Identifies and extracts QA pairs that require external knowledge to answer from existing QA datasets or knowledge bases, for use in downstream training on Agentic RAG tasks.

⚙️ 4.2 Flexible Operator Pipelines

In this framework, operators are grouped into categories that include Fundamental Operators, Generic Operators, Domain-Specific Operators, and Evaluation Operators, which together support data processing and evaluation. Please refer to the documentation for details.
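
To make the idea of composing operators into a pipeline concrete, here is a minimal sketch in plain Python. It assumes each operator is simply a function from a list of records to a list of records; the function names (`strip_whitespace`, `drop_empty`, `add_length_score`, `run_pipeline`) and the `text` field are hypothetical and do not correspond to DataFlow's real operators or pipeline classes.

```python
from typing import Callable

Record = dict
Operator = Callable[[list[Record]], list[Record]]


def strip_whitespace(records: list[Record]) -> list[Record]:
    # Processing step: normalize the text field without mutating the input.
    return [{**r, "text": r["text"].strip()} for r in records]


def drop_empty(records: list[Record]) -> list[Record]:
    # Filtering step: discard records whose text is empty.
    return [r for r in records if r["text"]]


def add_length_score(records: list[Record]) -> list[Record]:
    # Evaluation step: attach a simple per-record metric.
    return [{**r, "length": len(r["text"])} for r in records]


def run_pipeline(records: list[Record], operators: list[Operator]) -> list[Record]:
    # Apply operators in order; each one consumes the previous one's output.
    for operator in operators:
        records = operator(records)
    return records


if __name__ == "__main__":
    raw = [{"text": "  What is RAG?  "}, {"text": "   "}]
    print(run_pipeline(raw, [strip_whitespace, drop_empty, add_length_score]))
```

Keeping every step to the same records-in, records-out shape is what lets steps be reordered or swapped freely in a sketch like this.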

🤖 4.3 Agent Guided Pipelines

DataFlow Agent: An intelligent assistant that performs data analysis, writes custom operators, and automatically orchestrates them into pipelines based on specific task objectives.
- [HuggingFace🤗 demo input & output for DataFlow Agent]

⚡ 5. Quick Start

🛠️ 5.1 Environment Setup and Installation

Please use the following commands for environment setup and installation👇

conda create -n dataflow python=3.10 
conda activate dataflow

pip install open-dataflow

If you want to use your own GPU for local inference, please use:

pip install "open-dataflow[vllm]"

DataFlow requires Python 3.10 or later.

After installation, you can use the following command to check if dataflow has been installed correctly:

dataflow -v

If installed correctly, you should see output similar to the following (the version numbers will match your installed release):

open-dataflow codebase version: 1.0.0
        Checking for updates...
        Local version:  1.0.0
        PyPI newest version:  1.0.0
You are using the latest version: 1.0.0.
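
The same check can be done from Python. The sketch below uses only the standard library (`importlib.metadata` and `sys`) to report the installed `open-dataflow` version and to confirm that the interpreter is 3.10 or later. The helper names `installed_version` and `python_is_supported` are made up for this example and are not part of DataFlow.

```python
import sys
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = "open-dataflow") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def python_is_supported(minimum: tuple = (3, 10)) -> bool:
    # Compare only (major, minor) against the documented minimum.
    return sys.version_info[:2] >= minimum


if __name__ == "__main__":
    print("Python supported:", python_is_supported())
    print("open-dataflow version:", installed_version() or "not installed")
```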

🐳 5.1.1 Docker Installation (Alternative)

We also provide a Dockerfile for easy deployment and a pre-built Docker image for immediate use.

Option 1: Use Pre-built Docker Image

You can directly pull and use our pre-built Docker image:

# Pull the pre-built image
docker pull molyheci/dataflow:cu124

# Run the container with GPU support
docker run --gpus all -it molyheci/dataflow:cu124

# Inside the container, verify installation
dataflow -v

Option 2: Build from Dockerfile

Alternatively, you can build the Docker image from the provided Dockerfile:

# Clone the repository (HTTPS)
git clone https://github.com/OpenDCAI/DataFlow.git
# Or use SSH
# git clone git@github.com:OpenDCAI/DataFlow.git

cd DataFlow

# Build the Docker image
docker build -t dataflow:custom .

# Run the container
docker run --gpus all -it dataflow:custom

# Inside the container, verify installation
dataflow -v

Note: The Docker image includes CUDA 12.4.1 support and comes with vLLM pre-installed for GPU acceleration. Make sure the NVIDIA Container Toolkit is installed on the host to use GPU features.

📖 5.2 Reference Project Documentation

For detailed usage instructions and a getting-started guide, please visit our Documentation.

🧪 6. Experimental Results

For detailed experimental settings, please see our DataFlow Technical Report.

6.1 Text Pipeline

6.1.1 Pre-training data filter pipeline

From the SlimPajama-627B corpus, we extract a 100B-token subset and apply multiple DataFlow text-pretraining filters. We then train a Qwen2.5-0.5B model from scratch for 30B tokens using the Megatron-DeepSpeed framework. The results are as follows:

Methods	ARC-C	ARC-E	MMLU	HellaSwag	WinoGrande	Gaokao-MathQA	Avg
Random-30B	25.26	43.94	27.03	37.02	50.99	27.35	35.26
Qurating-30B	25.00	43.14	27.50	37.03	50.67	26.78	35.02
FineWeb-Edu-30B	26.45	45.41	27.41	38.06	50.43	25.64	35.57
DataFlow-30B	25.51	45.58	27.42	37.58	50.67	27.35	35.69
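
The Avg column is consistent with an unweighted mean of the six benchmark scores in each row. The sketch below recomputes it from the numbers in the table above. The 0.01 tolerance is an assumption that allows for rounding in the published per-benchmark figures, and the names `ROWS` and `max_deviation` are illustrative only.

```python
from statistics import fmean

# Scores copied from the table above, in column order:
# ARC-C, ARC-E, MMLU, HellaSwag, WinoGrande, Gaokao-MathQA, then the reported Avg.
ROWS = {
    "Random-30B": ([25.26, 43.94, 27.03, 37.02, 50.99, 27.35], 35.26),
    "Qurating-30B": ([25.00, 43.14, 27.50, 37.03, 50.67, 26.78], 35.02),
    "FineWeb-Edu-30B": ([26.45, 45.41, 27.41, 38.06, 50.43, 25.64], 35.57),
    "DataFlow-30B": ([25.51, 45.58, 27.42, 37.58, 50.67, 27.35], 35.69),
}


def max_deviation(rows: dict) -> float:
    # Largest absolute gap between the recomputed mean and the reported Avg.
    return max(abs(fmean(scores) - reported) for scores, reported in rows.values())


if __name__ == "__main__":
    for name, (scores, reported) in ROWS.items():
        print(f"{name}: recomputed={fmean(scores):.3f} reported={reported}")
    print("all within tolerance:", max_deviation(ROWS) < 0.01)
```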

6.1.2 SFT data filter and synthesis pipeline

To study small-scale SFT data quality, we fine-tune the Qwen2.5-7B base model using LLaMA-Factory on the WizardLM and Alpaca datasets.
For each dataset, we compare a randomly sampled set of 5K instances against a set of 5K instances filtered by DataFlow's SFT pipeline. Additionally, we synthesize a 15K-instance dataset, DataFlow-SFT-15K, using DataFlow's Condor Generator and Condor Refiner pipeline, followed by DataFlow's SFT filtering pipeline (excluding the Instagram filter). Benchmarks include comprehensive Math, Code, and Knowledge evaluation suites.

Math Benchmarks

Methods	math	gsm8k	aime24	minerva	olympiad	Avg
Alpaca (random)	54.9	77.2	13.3	14.0	27.0	37.3
Alpaca (filtered)	60.3	80.0	13.3	14.7	30.7	39.8
WizardLM (random)	61.1	84.2	6.7	18.0	29.3	39.9
WizardLM (filtered)	69.7	88.8	10.0	19.9	35.4	44.8
DataFlow-SFT-15K (random)	72.6	89.6	13.3	37.9	32.9	49.3
DataFlow-SFT-15K (filtered)	73.3	90.2	13.3	36.0	35.9	49.7

Code Benchmarks

Methods	HumanEval	MBPP	Avg
Alpaca (random)	71.3	75.9	73.6
Alpaca (filtered)	73.8	75.7	74.8
WizardLM (random)	75.6	82.0	78.8
WizardLM (filtered)	77.4	80.4	78.9
DataFlow-SFT-15K (random)	79.9	75.9	77.9
DataFlow-SFT-15K (filtered)	82.9	74.9	78.9

Knowledge Benchmarks

Methods	MMLU	C-EVAL	Avg
Alpaca (random)	71.8	80.0	75.9
Alpaca (filtered)	71.8	80.0	75.9
WizardLM (random)	71.8	79.2	75.5
WizardLM (filtered)	71.9	79.6	75.8
DataFlow-SFT-15K (random)	72.1	80.0	76.1
DataFlow-SFT-15K (filtered)	72.2	80.4	76.3

6.1.3 Conversation Synthesis Pipeline

We synthesize DataFlow-Chat-15K using DataFlow's conversation-generation pipeline and fine-tune Qwen2.5-7B-Base on it. Baselines include ShareGPT-15K, UltraChat-15K, and their full (non-truncated) versions. We evaluate on domain-specific tasks (TopDial, Light) and general benchmarks (MMLU, AlpacaEval, Arena-Hard).

Conversation Benchmarks

Model	TopDial	Light	Avg
Qwen2.5-7B	7.71	7.79	7.75
+ ShareGPT-15K	7.75	6.72	7.24
+ UltraChat-15K	7.72	6.83	7.28
+ DataFlow-Chat-15K	7.98	8.10	8.04

General Benchmarks

Model	MMLU	AlpacaEval	Arena-Hard	Avg
Qwen2.5-7B	71.45	7.05	0.60	26.36
+ ShareGPT-15K	73.09	3.70	1.30	26.03
+ UltraChat-15K	72.97	3.97	0.80	25.91
+ DataFlow-Chat-15K	73.41	10.11	1.10	28.21

6.2 Reasoning Pipeline

We adopt the NuminaMath dataset as a high-quality seed dataset. We compare three training sources: (1) a random 10K subset from Open-R1, (2) a random 10K subset from Synthetic-1, and (3) our 10K synthesized DataFlow-Reasoning-10K dataset constructed using DataFlow.

Setting	Model	gsm8k	math	amc23	olympiad	gaokao24_mix	minerva	AIME24@32	AIME25@32	Avg
Baseline	Qwen2.5-32B-Instruct	95.8	73.5	70.0	38.5	42.9	26.5	16.8	11.6	46.95
1 Epoch	+ SYNTHETIC-1-10k	92.9	71.8	52.5	38.4	23.1	24.3	35.6	34.0	46.6
1 Epoch	+ Open-R1-10k	91.5	72.3	65.0	38.4	20.9	24.6	43.0	33.5	48.7
1 Epoch	+ DataFlow-Reasoning-10K	93.9	72.3	72.5	38.7	38.5	26.5	35.9	34.5	51.6
2 Epochs	+ SYNTHETIC-1-10k	94.5	78.4	75.0	45.0	24.2	28.3	48.4	37.9	54.0
2 Epochs	+ Open-R1-10k	93.9	77.2	80.0	44.1	20.9	25.4	51.0	40.7	54.2
2 Epochs	+ DataFlow-Reasoning-10K	94.4	76.6	75.0	45.2	42.9	25.7	45.4	40.0	55.7

6.3 Code Pipeline

We randomly sample 20k instances from the Ling-Coder-SFT corpus and process them through the DataFlow Code Pipeline. This yields three curated code instruction datasets of different scales, DataFlow-Code-1K, DataFlow-Code-5K, and DataFlow-Code-10K, each designed to provide high-quality, pipeline-refined supervision signals for code generation tasks.

We compare our synthesized datasets against Code-Alpaca-1k and Self-OSS-Instruct-SC2-Exec-Filter-1k.

Trained on Qwen2.5-7B-Instruct

Training Data	BigCodeBench	LiveCodeBench (v6)	CruxEval (Input)	CruxEval (Output)	HumanEval+	Avg
Qwen2.5-7B-Instruct	35.3	23.4	44.8	43.9	72.6	44.0
+ Code Alpaca-1K	33.3	18.7	45.6	46.4	66.5	42.1
+ Self-OSS	31.9	21.4	46.9	45.9	70.1	43.2
+ DataFlow-Code-1K	35.5	25.7	48.0	45.1	72.6	45.4
+ DataFlow-Code-5K	36.2	26.4	48.6	45.0	73.2	45.9
+ DataFlow-Code-10K	36.8	26.0	48.8	45.4	73.8	46.2

Trained on Qwen2.5-14B-Instruct

Training Data	BigCodeBench	LiveCodeBench (v6)	CruxEval (Input)	CruxEval (Output)	HumanEval+	Avg
Qwen2.5-14B-Instruct	37.5	33.4	48.0	48.5	74.4	48.4
+ Code Alpaca-1K	37.0	28.2	50.2	49.6	71.3	47.3
+ Self-OSS	36.9	22.3	52.6	50.1	68.3	46.0
+ DataFlow-Code-1K	41.4	33.7	51.0	50.9	77.3	50.9
+ DataFlow-Code-5K	41.1	33.2	52.5	50.6	76.2	50.7
+ DataFlow-Code-10K	41.9	33.2	52.9	51.0	76.2	51.0

📄 7. Publications

Our team has published the following papers that form core components of the DataFlow system:

Paper Title	DataFlow Component	Venue	Year
MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification	Multimodal reasoning verification framework for data processing and evaluation	ACL	2025
Efficient Pretraining Data Selection for Language Models via Multi-Actor Collaboration	Multi-actor collaborative data selection mechanism for enhanced data filtering and processing	ACL	2025

Contributing Institutions: PKU, HKUST, CAS, Shanghai AI Lab, Baichuan, Ant Group

🏆 8. Awards & Achievements

We are honored to have received first-place awards in two major international AI competitions, recognizing the excellence and robustness of DataFlow and its reasoning capabilities:

Competition	Track	Award	Organizer	Date
ICML 2025 Challenges on Automated Math Reasoning and Extensions	Track 2: Physics Reasoning with Diagrams and Expressions	🥇 First Place Winner	ICML AI for Math Workshop & AWS Codabench	July 18, 2025
2025 Language and Intelligence Challenge (LIC)	Track 2: Beijing Academy of Artificial Intelligence	🥇 First Prize	Beijing Academy of Artificial Intelligence (BAAI) & Baidu	August 10, 2025

ICML 2025 Automated Math Reasoning Challenge: First Place Winner

BAAI Language & Intelligence Challenge 2025: First Prize

💐 9. Acknowledgements

We sincerely thank MinerU for their outstanding work, whose powerful PDF/document text extraction capabilities provided essential support for our data loading process.
We also thank LLaMA-Factory for offering an efficient and user-friendly framework for large model fine-tuning, which greatly facilitated rapid iteration in our training and experimentation workflows.
Our gratitude extends to all contributors in the open-source community—their efforts collectively drive the development of DataFlow.

🤝 10. Community & Support

Join the DataFlow open-source community to ask questions, share ideas, and collaborate with other developers!

• 📮 GitHub Issues: Report bugs or suggest features

• 🔧 GitHub Pull Requests: Contribute code improvements

• 💬 Join our community groups to connect with us and other contributors!

📜 11. Citation

If you use DataFlow in your research, please cite our technical report:

@misc{liang2025dataflowllmdrivenframeworkunified,
      title={DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI}, 
      author={Hao Liang and Xiaochen Ma and Zhou Liu and Zhen Hao Wong and Zhengyang Zhao and Zimo Meng and Runming He and Chengyu Shen and Qifeng Cai and Zhaoyang Han and Meiyi Qiang and Yalin Feng and Tianyi Bai and Zewei Pan and Ziyi Guo and Yizhen Jiang and Jingwen Deng and Qijie You and Peichao Lai and Tianyu Guo and Chi Hsu Tsai and Hengyi Feng and Rui Hu and Wenkai Yu and Junbo Niu and Bohan Zeng and Ruichuan An and Lu Ma and Jihao Huang and Yaowei Zheng and Conghui He and Linpeng Tang and Bin Cui and Weinan E and Wentao Zhang},
      year={2025},
      eprint={2512.16676},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2512.16676}, 
}

📊 12. Statistics

Connect with the PKU-DCAI Research Team on Xiaohongshu: 26133106768

Release history

1.0.10

Mar 26, 2026

1.0.9

Feb 27, 2026

This version

1.0.8

Dec 19, 2025

1.0.7

Nov 20, 2025

1.0.6

Oct 15, 2025

1.0.5

Jul 23, 2025

1.0.4

Jul 15, 2025

1.0.3

Jul 10, 2025

1.0.2

Jul 3, 2025

1.0.1

Jul 3, 2025

1.0.0

Jun 30, 2025

0.0.3 yanked

Jun 29, 2025

Reason this release was yanked:

deprecated

0.0.2 yanked

Jun 25, 2025

Reason this release was yanked:

deprecated

0.0.1 yanked

Jun 11, 2025

Reason this release was yanked:

deprecated

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

open_dataflow-1.0.8.tar.gz (2.4 MB)

Uploaded Dec 19, 2025 Source

Built Distribution

open_dataflow-1.0.8-py3-none-any.whl (2.7 MB)

Uploaded Dec 19, 2025 Python 3

File details

Details for the file open_dataflow-1.0.8.tar.gz.

File metadata

Download URL: open_dataflow-1.0.8.tar.gz
Upload date: Dec 19, 2025
Size: 2.4 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for open_dataflow-1.0.8.tar.gz
Algorithm	Hash digest
SHA256	`b1e711110bbb28a4bfb55877ea12041acf53badbdb1b0c2ee076304f99fd9a2c`
MD5	`b7bbbf828c473a76f47d5f0bcf2b8563`
BLAKE2b-256	`74fb7c72bedb54260bb7bf7fad2052241eb97e37ac421fe9e639f9beab9092be`

See more details on using hashes here.
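
One way to use the digests above is to hash a downloaded file locally and compare the result with the published SHA256 value. The sketch below does this with Python's standard `hashlib`. The function names and the local file path in the usage note are assumptions for illustration; only the expected digest is taken from the table above.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for open_dataflow-1.0.8.tar.gz.
EXPECTED_SHA256 = "b1e711110bbb28a4bfb55877ea12041acf53badbdb1b0c2ee076304f99fd9a2c"


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        # iter() with a sentinel keeps reading until read() returns b"" at EOF.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected: str = EXPECTED_SHA256) -> bool:
    # Hex digests are case-insensitive, so normalize before comparing.
    return sha256_of_file(path) == expected.lower()
```

For example, `matches_expected("open_dataflow-1.0.8.tar.gz")` should return `True` for an intact copy of the source distribution saved under that name in the current directory.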

Provenance

The following attestation bundles were made for open_dataflow-1.0.8.tar.gz:

Publisher: python-publish.yml on OpenDCAI/DataFlow

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: open_dataflow-1.0.8.tar.gz
- Subject digest: b1e711110bbb28a4bfb55877ea12041acf53badbdb1b0c2ee076304f99fd9a2c
- Sigstore transparency entry: 772087955
- Sigstore integration time: Dec 19, 2025
Source repository:
- Permalink: OpenDCAI/DataFlow@df4f68991be4af4b25cfa2ad76ae76bcf68a48fb
- Branch / Tag: refs/tags/v1.0.8
- Owner: https://github.com/OpenDCAI
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@df4f68991be4af4b25cfa2ad76ae76bcf68a48fb
- Trigger Event: release

File details

Details for the file open_dataflow-1.0.8-py3-none-any.whl.

File metadata

Download URL: open_dataflow-1.0.8-py3-none-any.whl
Upload date: Dec 19, 2025
Size: 2.7 MB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for open_dataflow-1.0.8-py3-none-any.whl
Algorithm	Hash digest
SHA256	`98ad989583ad15b3cd285c45292b30f070c75d1e66c8a59281affb4e27967118`
MD5	`2859003f21add4baef666f508574f3e8`
BLAKE2b-256	`151eaf8867936375a8608c9d11df2ea5c5627822fb80cb53192f25590c1dc805`

See more details on using hashes here.

Provenance

The following attestation bundles were made for open_dataflow-1.0.8-py3-none-any.whl:

Publisher: python-publish.yml on OpenDCAI/DataFlow

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: open_dataflow-1.0.8-py3-none-any.whl
- Subject digest: 98ad989583ad15b3cd285c45292b30f070c75d1e66c8a59281affb4e27967118
- Sigstore transparency entry: 772087965
- Sigstore integration time: Dec 19, 2025
Source repository:
- Permalink: OpenDCAI/DataFlow@df4f68991be4af4b25cfa2ad76ae76bcf68a48fb
- Branch / Tag: refs/tags/v1.0.8
- Owner: https://github.com/OpenDCAI
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@df4f68991be4af4b25cfa2ad76ae76bcf68a48fb
- Trigger Event: release
