Modern Data Centric AI system for Large Language Models

DataFlow

https://github.com/user-attachments/assets/05e047a5-99bb-4043-bc71-2b5ccdab2126

📰 1. News

🎉 [2025-06-28] We’re excited to announce that DataFlow, our Data-centric AI system, is now released! Stay tuned for future updates.

🔍 2. Overview

DataFlow is a data preparation and training system designed to parse, generate, process, and evaluate high-quality data from noisy sources (PDF, plain text, low-quality QA). It improves the performance of large language models (LLMs) in specific domains through targeted training (pre-training, supervised fine-tuning, RL training) or through RAG with knowledge base cleaning. DataFlow has been empirically validated to improve the performance of domain-oriented LLMs in fields such as healthcare, finance, and law.

Specifically, we construct diverse operators that leverage rule-based methods, deep learning models, LLMs, and LLM APIs. These operators are systematically integrated into distinct pipelines, which together form the DataFlow system. Additionally, we develop an intelligent DataFlow-agent capable of dynamically assembling new pipelines by recombining existing operators on demand.
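The operator-and-pipeline idea described above can be illustrated with a minimal sketch. This is not DataFlow's actual API: the names `Pipeline`, `drop_short`, and `normalize_whitespace` are hypothetical and exist only to show how small operators might be chained into a pipeline and recombined.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical types for illustration; DataFlow's real operator interface may differ.
Record = dict[str, str]
Operator = Callable[[list[Record]], list[Record]]


def normalize_whitespace(records: list[Record]) -> list[Record]:
    """Refining operator: collapse runs of whitespace in each record's text."""
    return [{**r, "text": " ".join(r["text"].split())} for r in records]


def drop_short(min_chars: int) -> Operator:
    """Rule-based filter: keep records whose text has at least min_chars characters."""
    def run(records: list[Record]) -> list[Record]:
        return [r for r in records if len(r["text"]) >= min_chars]
    return run


@dataclass
class Pipeline:
    """Applies operators in order, feeding each operator's output to the next."""
    operators: list[Operator]

    def run(self, records: list[Record]) -> list[Record]:
        for op in self.operators:
            records = op(records)
        return records


raw = [
    {"text": "  too short "},
    {"text": "A   longer   noisy   passage about SQL joins."},
]
pipeline = Pipeline([normalize_whitespace, drop_short(min_chars=20)])
cleaned = pipeline.run(raw)
print(cleaned)
```

Because each operator shares the same input and output shape in this sketch, a new pipeline is just a different list of operators, which is the property an agent would rely on when recombining them.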

🛠️ 3. Pipelines Functionality

🔧 3.1 Ready-to-Use Pipelines

The current pipelines in DataFlow are as follows:

📝 Text Pipeline: Mines question-answer pairs from large-scale plain-text data (mostly crawled from the Internet) for use in SFT and RL training.
- [HuggingFace🤗 demo input & output for Text Pipeline]
🧠 Reasoning Pipeline: Enhances existing question-answer pairs with (1) extended chain-of-thought, (2) category classification, and (3) difficulty estimation.
- [HuggingFace🤗 demo input & output for Reasoning Pipeline]
🗃️ Text2SQL Pipeline: Translates natural language questions into SQL queries, supplemented with explanations, chain-of-thought reasoning, and contextual schema information.
- [HuggingFace🤗 demo input & output for Text2SQL Pipeline]
📚 Knowledge Base Cleaning Pipeline: Extracts and structures knowledge from unorganized sources such as tables, PDFs, and Word documents into usable entries for downstream RAG or QA pair generation.
🤖 Agentic RAG Pipeline: Identifies and extracts QA pairs that require external knowledge to answer from existing QA datasets or knowledge bases, for use in downstream training of Agentic RAG tasks.
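To make the Text2SQL output format more concrete, here is a small sketch of one plausible quality check on a generated sample: confirming that the SQL compiles against its schema. This is an assumption for illustration, not a description of how the DataFlow Text2SQL Pipeline works. The `sample` record, its field names, and the `sql_is_valid` helper are hypothetical; only the standard-library `sqlite3` module is used.

```python
import sqlite3


def sql_is_valid(schema_sql: str, query: str) -> bool:
    """Return True if `query` compiles against an empty in-memory SQLite
    database built from `schema_sql`. No rows are needed for this check."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_sql)
        # EXPLAIN forces SQLite to parse and plan the statement,
        # so unknown tables or columns raise an error here.
        conn.execute("EXPLAIN " + query)
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


# Hypothetical sample with made-up field names and contents.
sample = {
    "question": "How many users signed up in 2024?",
    "schema": "CREATE TABLE users (id INTEGER PRIMARY KEY, signup_year INTEGER);",
    "sql": "SELECT COUNT(*) FROM users WHERE signup_year = 2024",
}

ok = sql_is_valid(sample["schema"], sample["sql"])
bad = sql_is_valid(sample["schema"], "SELECT COUNT(*) FROM accounts")
print(ok, bad)
```

A check like this only catches queries that fail to compile; it says nothing about whether the query actually answers the question.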

⚙️ 3.2 Flexible Operator Pipelines

In this framework, operators are grouped into categories such as Fundamental Operators, Generic Operators, Domain-Specific Operators, and Evaluation Operators, which together support data processing and evaluation. Please refer to the documentation for details.

🤖 3.3 Agent-Guided Pipelines

DataFlow Agent: Arranges existing operators and automatically constructs new pipelines based on task requirements.
- [HuggingFace🤗 demo input & output for DataFlow Agent]

⚡ 4. Quick Start

For environment setup and installation, please use the following commands👇

conda create -n dataflow python=3.10 
conda activate dataflow

pip install open-dataflow

To run inference locally on your own GPU, install the package with the vllm extra:

pip install "open-dataflow[vllm]"

DataFlow supports Python >= 3.10.

Use the following command to check that the installation succeeded:

dataflow -v

You should see output like the following:

open-dataflow codebase version: 1.0.0
        Checking for updates...
        Local version:  1.0.0
        PyPI newest version:  1.0.0
You are using the latest version: 1.0.0.
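The same check can be done from Python, which may be convenient inside a setup script. The sketch below uses only the standard library (`sys` and `importlib.metadata`); the helper name `installed_version` is an illustrative choice and not part of DataFlow.

```python
import sys
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# The README states that Python >= 3.10 is required.
python_ok = sys.version_info >= (3, 10)
found = installed_version("open-dataflow")

print(f"Python >= 3.10: {python_ok}")
if found is None:
    print("open-dataflow is not installed in this environment")
else:
    print(f"open-dataflow version: {found}")
```

Unlike `dataflow -v`, this sketch does not query PyPI for the newest release; it only reports what is installed locally.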

For the Quick Start and user guide, please visit our documentation.

🧪 5. Experimental Results

For detailed experimental settings, please visit our documentation.

📝 5.1 Text Pipeline

5.1.1 Pre-training data filter pipeline

The pre-training data processing pipeline was applied to randomly sampled data from the RedPajama dataset, resulting in a final data retention rate of 13.65%. The figure shows the analysis results from QuratingScorer. The filtered pre-training data scores significantly higher than the original data on all four scoring dimensions: writing style, requirement for expert knowledge, factual content, and educational value. This demonstrates the effectiveness of DataFlow's pre-training data processing.
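For readers unfamiliar with the term, the retention rate is simply the share of input records that survive filtering. The sketch below shows the arithmetic. The record counts are illustrative placeholders chosen to reproduce 13.65%; they are not the actual sample sizes used in the experiment.

```python
def retention_rate(kept: int, total: int) -> float:
    """Percentage of input records that remain after filtering."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * kept / total


# Placeholder counts for illustration only, not the experiment's real numbers.
rate = retention_rate(kept=1365, total=10000)
print(f"{rate:.2f}%")
```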

5.1.2 SFT data filter pipeline

We filtered 3k records from the Alpaca dataset and compared them with 3k randomly selected records from the same dataset by training Qwen2.5-7B on each subset. Results are:
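A random baseline of this kind is straightforward to reproduce if the sampling is seeded. The following is a minimal sketch under stated assumptions: the pool is a stand-in list of dictionaries rather than the real Alpaca data, and the `random_subset` helper and the seed value are hypothetical choices.

```python
import random


def random_subset(records: list, k: int, seed: int = 0) -> list:
    """Draw k records without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)  # local generator, does not touch global state
    return rng.sample(records, k)


# Stand-in pool; in practice this would be the loaded dataset.
pool = [{"id": i} for i in range(10_000)]
baseline = random_subset(pool, k=3_000, seed=42)
print(len(baseline))
```

Fixing the seed means the baseline subset can be regenerated exactly, so differences between the filtered and random runs are not confounded by a changing baseline.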

🧠 5.2 Reasoning Pipeline

We verified our Reasoning Pipeline by running SFT on Qwen2.5-32B-Instruct with data synthesized by the pipeline. We generated 1k and 5k SFT data pairs. Results are:

🗃️ 5.3 Text2SQL Pipeline

We fine-tuned the Qwen2.5-Coder-14B model on the Bird dataset using both Supervised Fine-tuning (SFT) and Reinforcement Learning (RL), with data constructed via the DataFlow-Text2SQL Pipeline. Results are:

🤝 6. Community & Support

Join the DataFlow open-source community to ask questions, share ideas, and collaborate with other developers!

• 📮 GitHub Issues: Report bugs or suggest features

• 🔧 GitHub Pull Requests: Contribute code improvements

• 💬 Join our community groups to connect with us and other contributors!

📜 7. Citation

If you use DataFlow in your research, please cite it as follows.

@misc{dataflow2025,
  author       = {DataFlow Develop Team},
  title        = {DataFlow: A Unified Framework for Data-Centric AI},
  year         = {2025},
  howpublished = {\url{https://github.com/OpenDCAI/DataFlow}},
  note         = {Accessed: 2025-07-08}
}

📊 8. Statistics

Developed and maintained by the PKU-DCAI Research Team ❤️

Connect with us on Xiaohongshu: 26133106768

Release history

1.0.10

Mar 26, 2026

1.0.9

Feb 27, 2026

1.0.8

Dec 19, 2025

1.0.7

Nov 20, 2025

1.0.6

Oct 15, 2025

1.0.5

Jul 23, 2025

1.0.4

Jul 15, 2025

This version

1.0.3

Jul 10, 2025

1.0.2

Jul 3, 2025

1.0.1

Jul 3, 2025

1.0.0

Jun 30, 2025

0.0.3 yanked

Jun 29, 2025

Reason this release was yanked:

deprecated

0.0.2 yanked

Jun 25, 2025

Reason this release was yanked:

deprecated

0.0.1 yanked

Jun 11, 2025

Reason this release was yanked:

deprecated

Download files

Source Distribution

open_dataflow-1.0.3.tar.gz (1.4 MB view details)

Uploaded Jul 10, 2025 Source

Built Distribution

open_dataflow-1.0.3-py3-none-any.whl (1.5 MB view details)

Uploaded Jul 10, 2025 Python 3

File details

Details for the file open_dataflow-1.0.3.tar.gz.

File metadata

Download URL: open_dataflow-1.0.3.tar.gz
Upload date: Jul 10, 2025
Size: 1.4 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for open_dataflow-1.0.3.tar.gz
Algorithm	Hash digest
SHA256	`36905e4d785a1eb917c3a7c55fa5c50d3111dae90c5318f1d1d2cdfa7b1c1509`
MD5	`4ec7dd1a5f9b73cfff230f6fa1137a20`
BLAKE2b-256	`468b91f2914727570caa74b2064a0015b3c5cbe4aa3ae9bf32ddd4aec884f053`
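A downloaded archive can be checked against the SHA256 digest listed above before installing it. The sketch below uses only the standard library. The helper names `sha256_of` and `matches_expected` are illustrative, and the script assumes the archive has been saved in the current working directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()


# Digest copied from the table above for open_dataflow-1.0.3.tar.gz.
EXPECTED = "36905e4d785a1eb917c3a7c55fa5c50d3111dae90c5318f1d1d2cdfa7b1c1509"
archive = Path("open_dataflow-1.0.3.tar.gz")  # assumed download location

if archive.exists():
    print("hash OK" if matches_expected(archive, EXPECTED) else "hash MISMATCH")
else:
    print(f"{archive} not found; download it first")
```

Reading in chunks keeps memory use flat regardless of file size, which matters little for a 1.4 MB archive but makes the helper reusable for larger files.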

Provenance

The following attestation bundles were made for open_dataflow-1.0.3.tar.gz:

Publisher: python-publish.yml on OpenDCAI/DataFlow

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: open_dataflow-1.0.3.tar.gz
- Subject digest: 36905e4d785a1eb917c3a7c55fa5c50d3111dae90c5318f1d1d2cdfa7b1c1509
- Sigstore transparency entry: 269812197
- Sigstore integration time: Jul 10, 2025
Source repository:
- Permalink: OpenDCAI/DataFlow@8951fe335fc7bc5786f97fe0355a0fba4503d6b2
- Branch / Tag: refs/tags/v1.0.3
- Owner: https://github.com/OpenDCAI
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@8951fe335fc7bc5786f97fe0355a0fba4503d6b2
- Trigger Event: release

File details

Details for the file open_dataflow-1.0.3-py3-none-any.whl.

File metadata

Download URL: open_dataflow-1.0.3-py3-none-any.whl
Upload date: Jul 10, 2025
Size: 1.5 MB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for open_dataflow-1.0.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e9a766ecfddc9954d09c44b81459c8b352a3d06cf9853030eadcf1795f2f3013`
MD5	`ba690e997a4d55cf33caee3d1ad5cdd6`
BLAKE2b-256	`6f90fee34acb309b062430113231a3bbc2e46142252f48d179645a54924d2ed3`

Provenance

The following attestation bundles were made for open_dataflow-1.0.3-py3-none-any.whl:

Publisher: python-publish.yml on OpenDCAI/DataFlow

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: open_dataflow-1.0.3-py3-none-any.whl
- Subject digest: e9a766ecfddc9954d09c44b81459c8b352a3d06cf9853030eadcf1795f2f3013
- Sigstore transparency entry: 269812204
- Sigstore integration time: Jul 10, 2025
Source repository:
- Permalink: OpenDCAI/DataFlow@8951fe335fc7bc5786f97fe0355a0fba4503d6b2
- Branch / Tag: refs/tags/v1.0.3
- Owner: https://github.com/OpenDCAI
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@8951fe335fc7bc5786f97fe0355a0fba4503d6b2
- Trigger Event: release
