Placeholder for open-dataflow.
Reason this release was yanked: deprecated
Project description
DataFlow
1 News
- [2025-06-15] 🎉 We release the dataflow-agentbot.
- [2025-06-10] 🎉 We release the documentation of dataflow!
- [2025-06-01] 🎉 Our new data-centric generation and evaluation system is now open-sourced — stay tuned for future updates!
2 Overview
DataFlow is a data preparation system designed to process, generate, and evaluate high-quality data from noisy sources (PDFs, plain text, low-quality QA pairs), thereby improving the performance of large language models (LLMs) in specific domains, either through targeted training (pre-training, supervised fine-tuning, RL training) or through RAG backed by knowledge-base cleaning. DataFlow has been empirically validated to improve domain-oriented LLM performance in fields such as healthcare, finance, and law.
Specifically, we construct diverse operators leveraging rule-based methods, deep learning models, LLMs, and LLM APIs. These operators are systematically integrated into distinct pipelines, which collectively form the comprehensive DataFlow system. Additionally, we develop an intelligent DataFlow Agent capable of dynamically assembling new pipelines by recombining existing operators on demand.
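To make the operator-and-pipeline composition concrete, below is a minimal, self-contained Python sketch of the pattern described above. It is illustrative only: the names (Operator, Pipeline, run) are assumptions for this sketch, not the actual DataFlow API; please refer to the documentation for the real interfaces.

# Illustrative sketch of the operator -> pipeline pattern described above.
# NOTE: the names below (Operator, Pipeline, run) are assumptions for this sketch,
# not the real DataFlow API; see the documentation for the actual interfaces.
from typing import Callable, Dict, List, Optional

class Operator:
    """One processing step: takes a record, returns a (possibly modified) record, or None to drop it."""
    def __init__(self, name: str, fn: Callable[[Dict], Optional[Dict]]):
        self.name = name
        self.fn = fn

    def run(self, record: Dict) -> Optional[Dict]:
        return self.fn(record)

class Pipeline:
    """A fixed sequence of operators applied to every record."""
    def __init__(self, operators: List[Operator]):
        self.operators = operators

    def run(self, records: List[Dict]) -> List[Dict]:
        kept = []
        for record in records:
            for op in self.operators:
                record = op.run(record)
                if record is None:          # an operator filtered the record out
                    break
            else:
                kept.append(record)
        return kept

# Example: a rule-based length filter followed by a toy rewrite step.
length_filter = Operator("length_filter", lambda r: r if len(r["text"]) > 20 else None)
rewrite = Operator("uppercase_title", lambda r: {**r, "title": r["title"].upper()})
pipeline = Pipeline([length_filter, rewrite])
print(pipeline.run([
    {"title": "doc1", "text": "too short"},
    {"title": "doc2", "text": "a sufficiently long piece of text"},
]))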
3 Pipelines Functionality
3.1 Ready-to-Use PipeLines
Current pipelines in DataFlow are as follows:
- Text Pipeline: Mines question-answer pairs from large-scale plain-text data (mostly crawled from the Internet) for use in SFT and RL training (see the record sketch after this list).
- Reasoning Pipeline: Enhances existing question–answer pairs with (1) extended chain-of-thought, (2) category classification, and (3) difficulty estimation.
- Text2SQL Pipeline: Translates natural language questions into SQL queries, supplemented with explanations, chain-of-thought reasoning, and contextual schema information.
- Knowledge Base Cleaning Pipeline: Extracts and structures knowledge from unorganized sources such as tables, PDFs, and Word documents into usable entries for downstream RAG or QA-pair generation.
- Agentic RAG Pipeline: Identifies and extracts QA pairs, from existing QA datasets or knowledge bases, that require external knowledge to answer, for use in downstream training on agentic RAG tasks.
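As a concrete illustration of what the Text Pipeline's output might look like, here is a hypothetical mined QA record ready for SFT; the field names are assumptions for this sketch, not a schema guaranteed by DataFlow.

# Hypothetical example of a QA record mined from plain text by the Text Pipeline.
# The field names are assumptions for illustration, not a guaranteed DataFlow schema.
import json

qa_record = {
    "source": "web_crawl/doc_000123.txt",   # where the raw text came from
    "question": "What organ is primarily affected by hepatitis B?",
    "answer": "Hepatitis B primarily affects the liver.",
    "quality_score": 0.87,                  # score assigned by an evaluation operator
}

# Records like this can be serialized (e.g., to JSONL) for SFT or RL training.
print(json.dumps(qa_record, ensure_ascii=False))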
3.2 Flexible Operator PipeLines
In this framework, operators are categorized into Fundamental Operators, Generic Operators, Domain-Specific Operators, and Evaluation Operators, among others, supporting data processing and evaluation functionalities. Please refer to the documentation for details.
3.3 Agent Guided Pipelines
- DataFlow Agent: Arranges existing operators and automatically constructs new pipelines based on task requirements, as sketched below.
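Conceptually, such on-demand assembly can be pictured as selecting operators from a registry and chaining them. The sketch below reuses the hypothetical names from the earlier sketch; it is not the real DataFlow Agent implementation.

# Illustrative sketch of agent-style pipeline assembly: pick registered operators by name
# and chain them into a new pipeline. All names are hypothetical, not the real DataFlow API.
OPERATOR_REGISTRY = {
    "pdf_to_text":   lambda r: {**r, "text": r.get("text", "")},   # stand-in for a PDF parser
    "length_filter": lambda r: r if len(r.get("text", "")) > 20 else None,
    "qa_generation": lambda r: {**r, "qa": [("Q?", "A.")]},        # stand-in for an LLM operator
}

def assemble_pipeline(task_plan):
    """Turn an ordered list of operator names (e.g., chosen by an agent) into a callable pipeline."""
    steps = [OPERATOR_REGISTRY[name] for name in task_plan]
    def run(records):
        kept = []
        for record in records:
            for step in steps:
                record = step(record)
                if record is None:          # a step filtered the record out
                    break
            else:
                kept.append(record)
        return kept
    return run

pipeline = assemble_pipeline(["pdf_to_text", "length_filter", "qa_generation"])
print(pipeline([{"text": "a sufficiently long extracted paragraph from a PDF"}]))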
4 Quick Start
For environment setup and installation, please use the following commands 👇
conda create -n dataflow python=3.10
conda activate dataflow
git clone https://github.com/Open-DataFlow/DataFlow
cd DataFlow
pip install -e .
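After installation, a quick sanity check such as the one below can confirm the editable install is importable. It assumes the package's import name is dataflow; that name is an assumption here, so adjust it if the project uses a different one.

# Minimal sanity check after `pip install -e .`.
# Assumes the import name is `dataflow` (an assumption; adjust if the package uses another name).
import dataflow
print("dataflow imported from:", dataflow.__file__)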
For a Quick-Start guide and tutorials, please visit our Documentation.
5 Experimental Results
For detailed experiment settings, please refer to our documentation.
5.1 Text PipeLine
5.1.1 Pre-training data filter pipeline
The pre-training data processing pipeline was applied to randomly sampled data from the RedPajama dataset, resulting in a final data retention rate of 13.65%. The analysis results using QuratingScorer are shown in the figure. As can be seen, the filtered pre-training data significantly outperforms the original data across four scoring dimensions: writing style, requirement for expert knowledge, factual content, and educational value. This demonstrates the effectiveness of the DataFlow pre-training data processing pipeline.
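For intuition, the score-threshold filtering behind such a retention rate can be sketched as follows; the records, scores, threshold, and field names are placeholders for illustration, not the pipeline's actual configuration.

# Illustrative score-threshold filtering of the kind a pre-training filter pipeline performs.
# The records, scores, threshold, and field names are placeholders, not DataFlow's configuration.
records = [
    {"text": "high-quality encyclopedia paragraph ...", "quality_score": 0.91},
    {"text": "boilerplate navigation menu ...",         "quality_score": 0.12},
    {"text": "partially garbled OCR output ...",        "quality_score": 0.35},
]

THRESHOLD = 0.8  # hypothetical cut-off
kept = [r for r in records if r["quality_score"] >= THRESHOLD]

retention_rate = len(kept) / len(records)
print(f"retention rate: {retention_rate:.2%}")  # 33.33% for this toy sample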
5.1.2 SFT data filter pipeline
We filtered 3k records from the Alpaca dataset and compared them against 3k randomly selected Alpaca records by fine-tuning Qwen2.5-7B on each subset. Results are:
5.2 Reasoning Pipeline
We verify our Reasoning Pipeline by performing SFT on Qwen2.5-32B-Instruct with data synthesized by the pipeline. We generated 1k and 5k SFT data pairs. Results are:
5.3 Text2SQL PipeLine
We fine-tuned the Qwen2.5-Coder-7B model on the Bird dataset using both Supervised Fine-tuning (SFT) and Reinforcement Learning (RL), with data constructed via the DataFlow-Text2SQL Pipeline. Results are:
File details
Details for the file open_dataflow-0.0.2.tar.gz.
File metadata
- Download URL: open_dataflow-0.0.2.tar.gz
- Upload date:
- Size: 14.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e84e1c6470ce378f7327113550c9e80fd0c49085cd42519e71b157f590088e54 |
| MD5 | 2c11fa96082bf7db7ac9a97e26a1a43a |
| BLAKE2b-256 | ec0ff64ac7cb992409ca189d204e31781247e09ca9dc5001e590f19a9d3aa07f |
Provenance
The following attestation bundles were made for open_dataflow-0.0.2.tar.gz:
- Publisher: python-publish.yml on Open-DataFlow/DataFlow-Preview
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: open_dataflow-0.0.2.tar.gz
- Subject digest: e84e1c6470ce378f7327113550c9e80fd0c49085cd42519e71b157f590088e54
- Sigstore transparency entry: 250230642
- Sigstore integration time:
- Permalink: Open-DataFlow/DataFlow-Preview@fa0e10d862d132785ed9b1687b57e668131ab288
- Branch / Tag: refs/tags/v0.0.2
- Owner: https://github.com/Open-DataFlow
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@fa0e10d862d132785ed9b1687b57e668131ab288
- Trigger Event: release
File details
Details for the file open_dataflow-0.0.2-py3-none-any.whl.
File metadata
- Download URL: open_dataflow-0.0.2-py3-none-any.whl
- Upload date:
- Size: 12.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ed07fd6ba546c01d4a0502fd5eec143cdc070c598cde35bb76798acf7618f21c |
| MD5 | 828143b390e5a57579400bd4ac75be61 |
| BLAKE2b-256 | 725ffb35a9d139189c3e0aeab9e27f478caa5996ab70f1e66b1a374d79ce7fbc |
Provenance
The following attestation bundles were made for open_dataflow-0.0.2-py3-none-any.whl:
- Publisher: python-publish.yml on Open-DataFlow/DataFlow-Preview
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: open_dataflow-0.0.2-py3-none-any.whl
- Subject digest: ed07fd6ba546c01d4a0502fd5eec143cdc070c598cde35bb76798acf7618f21c
- Sigstore transparency entry: 250230648
- Sigstore integration time:
- Permalink: Open-DataFlow/DataFlow-Preview@fa0e10d862d132785ed9b1687b57e668131ab288
- Branch / Tag: refs/tags/v0.0.2
- Owner: https://github.com/Open-DataFlow
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@fa0e10d862d132785ed9b1687b57e668131ab288
- Trigger Event: release