Automatic data preparation for LLMs via multi-level self-evolving pipelines
DataEvolver
Turn noisy raw data + a handful of seed examples into training-ready, seed-aligned datasets — with executable DAGs, trial feedback, and iterative refinement built in.
Paper · Demo · Install · Quick Start · Usage · Results · Community
Give us a ⭐ if DataEvolver helps your data prep workflow — it helps others discover the project.
TL;DR
| You provide | DataEvolver does | You get |
|---|---|---|
| Raw data | Understands target profile from seeds | Structured understanding artifact |
| Seed examples | Orchestrates & validates operator DAGs | Executable pipeline plan |
| Optional task description | Instantiates, trials, judges, evolves | High-quality prepared data |
One sentence: DataEvolver is a self-evolving data-prep system that jointly optimizes executability and seed alignment instead of stopping at one-shot pipeline synthesis.
Table of Contents
- Why DataEvolver
- Highlights
- How It Works
- Results
- Demo
- Installation
- Quick Start
- Usage
- Project Structure
- Configuration
- FAQ
- Community
- Citation
🔥 Why DataEvolver
Training data quality remains a bottleneck in LLM post-training. Raw corpora are often noisy, structurally inconsistent, or misaligned with the supervision style you actually want.
Most existing approaches fall into two camps:
| Approach | Strength | Limitation |
|---|---|---|
| Predefined recipes | Stable engineering | Hard to adapt to new tasks |
| One-shot pipeline synthesis | Flexible | Often fragile in execution & quality |
DataEvolver targets a harder, more practical question:
Can we automatically build a high-quality data preparation pipeline from raw data and only a small set of seed examples?
That requires optimizing two goals at once:
- Executability — the pipeline must actually run end-to-end
- Quality alignment — outputs must match the profile implied by seeds
DataEvolver achieves this through multi-level self-evolution: operator-level DAG repair plus pipeline-level experience feedback across rounds.
✨ Highlights
- Seed-guided understanding — infer schema, style, and quality constraints from seeds + sampled raw data
- Operator-level self-evolving — build, validate, and repair DAGs; synthesize operators when the registry is insufficient
- Pipeline-level self-evolving — trial runs, pilot judging, experience summarization, and next-round refinement
- Three aligned interfaces — Web UI, CLI, and HTTP API share the same workflow semantics
- Observable by design — stage artifacts, orchestration retries, token ledger, and round history are all inspectable
- Open & extensible — modular subsystems, editable operator registry, and scriptable automation
🧠 How It Works
flowchart LR
A[Raw Data + Seeds] --> B[Understanding]
B --> C[Orchestration]
C --> D[Operator Evolution]
D --> E[Instantiation]
E --> F[Trial Run]
F --> G[Quality Check]
G --> H[Experience]
H -->|not aligned| B
G -->|ready| I[Full Run]
Core workflow loop
understanding → orchestration → operator_evolution → instantiation → trial_run → quality_check → experience
When quality criteria are met, DataEvolver runs the refined pipeline on the full dataset.
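The loop above can be read as a small round-based state machine. The following is a minimal sketch of that control flow only, not DataEvolver's implementation: `run_rounds`, the `run_stage` callback, and the `{"ready": ...}` / `{"notes": ...}` result shapes are hypothetical names introduced for illustration. Only the stage names come from the workflow above.

```python
from typing import Callable, Dict, List

# Stage names copied from the workflow loop above.
STAGES: List[str] = [
    "understanding",
    "orchestration",
    "operator_evolution",
    "instantiation",
    "trial_run",
    "quality_check",
    "experience",
]


def run_rounds(
    run_stage: Callable[[str, int, List[str]], Dict],
    max_rounds: int = 3,
) -> Dict:
    """Repeat the stage loop until quality_check reports ready.

    Hypothetical contract: run_stage(stage, round_index, experience) returns a
    dict; quality_check returns {"ready": bool}, experience returns {"notes": [...]}.
    """
    experience: List[str] = []
    for round_index in range(1, max_rounds + 1):
        for stage in STAGES:
            result = run_stage(stage, round_index, experience)
            if stage == "quality_check" and result.get("ready"):
                # Quality criteria met: leave the loop and hand off to the full run.
                return {"status": "full_run", "rounds": round_index, "experience": experience}
            if stage == "experience":
                # Carry trial feedback into the next round.
                experience = experience + list(result.get("notes", []))
    return {"status": "not_aligned", "rounds": max_rounds, "experience": experience}


def fake_stage(stage: str, round_index: int, experience: List[str]) -> Dict:
    """Stand-in stage runner: quality passes from the second round onward."""
    if stage == "quality_check":
        return {"ready": round_index >= 2}
    if stage == "experience":
        return {"notes": [f"round {round_index}: tighten output schema"]}
    return {}


outcome = run_rounds(fake_stage)
print(outcome["status"], outcome["rounds"])  # full_run 2
```

In this sketch the experience notes are the only state passed between rounds, which mirrors the "not aligned" edge in the flowchart feeding back into understanding.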
Three self-evolving layers
- Understanding — learn the target data profile from seeds and raw samples
- Operator evolution — fix DAG structure, dependencies, and missing capabilities
- Pipeline evolution — convert trial-vs-seed gaps into reusable experience for the next round
📊 Results
Overall downstream performance
Across 7 benchmarks from 4 task categories (instruction following, multiple-choice QA, math reasoning, text-to-SQL), DataEvolver improves training data quality and downstream performance — about 12% relative gain on average vs. weaker preparation settings.
Comparison against strong baselines
DataEvolver outperforms vanilla SFT on raw data and strong data-preparation baselines. In several settings, fewer but better-prepared samples match or exceed larger, weakly prepared alternatives.
Ablation: both evolution loops matter
- Without operator-level evolution → pipelines are less executable and coherent
- Without pipeline-level evolution → outputs are less seed-aligned
Efficiency
DataEvolver improves training-readiness and seed alignment while reducing preparation overhead — about 40% lower amortized token cost on average in our experiments.
Case study
See how an initial logical plan evolves into a refined executable pipeline, and how trial feedback becomes constraints for later rounds.
🎬 Demo
Recommended (small download for a clean clone):
Download DataEvolver_Demo_small.mov
The Web UI shows the evolution canvas — DAG orchestration tabs, instantiation cards, sample evaluation, and experience reflow across rounds.
⚡ Quick Start
Full cross-platform guide: docs/INSTALL.md
Prerequisites
| Component | Version |
|---|---|
| Python | 3.10+ |
| Node.js | 18+ LTS (Web UI) |
| LLM API | OpenAI-compatible endpoint + key |
1. Clone & install (pick your OS)
git clone https://github.com/Akanezora0/DataEvolver.git
cd DataEvolver
| Platform | One-command setup |
|---|---|
| Linux / macOS / Git Bash | bash setup_env.sh |
| Windows PowerShell | powershell -ExecutionPolicy Bypass -File .\setup_env.ps1 |
| Windows CMD | setup_env.bat |
| Any OS | python scripts/setup_env.py |
This creates .venv, installs Python + npm dependencies, and copies config/*.example.json → config/*.json when missing.
Install from PyPI (CLI / API only)
pip install dataevolver
mkdir my_project && cd my_project
dataevolver init # creates config/ + data/ from bundled templates
# edit config/api_config.json & config/api_keys.json
dataevolver --help
dataevolver-server --reload # API on :8000 (Web UI still needs git clone + npm)
See docs/PUBLISHING.md for maintainers (Test PyPI / PyPI upload).
Optional flags for the setup scripts in step 1: --skip-frontend (API/CLI only) · --frontend-only (npm only).
2. Configure LLM
config/api_config.json # provider, base URL, model
config/api_keys.json # API key (gitignored — do not commit)
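Before starting the services, it can help to confirm that both files exist and parse. The sketch below is a hypothetical helper, not part of DataEvolver. It deliberately does not assume any key names inside the files, since those are not documented here, and only checks for presence and valid JSON.

```python
import json
from pathlib import Path
from typing import Dict

# File names come from the step above; their internal keys are not assumed.
CONFIG_FILES = ["api_config.json", "api_keys.json"]


def check_config(config_dir: str = "config") -> Dict[str, str]:
    """Return a status per config file: 'ok', 'missing', or 'invalid JSON'."""
    status: Dict[str, str] = {}
    for name in CONFIG_FILES:
        path = Path(config_dir) / name
        if not path.is_file():
            status[name] = "missing"
            continue
        try:
            json.loads(path.read_text(encoding="utf-8"))
            status[name] = "ok"
        except json.JSONDecodeError:
            status[name] = "invalid JSON"
    return status


if __name__ == "__main__":
    for name, state in check_config().items():
        print(f"{name}: {state}")
```

Run it from the project root so that the relative `config` directory resolves correctly.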
3. Start services (two terminals)
| Service | Cross-platform | Classic |
|---|---|---|
| Backend :8000 | python scripts/dev.py backend | python run_server.py --reload (after activating .venv) |
| Frontend :5173 | python scripts/dev.py frontend | cd frontend && npm run dev |
Activate virtualenv if needed:
# Linux / macOS
source .venv/bin/activate
# Windows PowerShell
.\.venv\Scripts\Activate.ps1
# Windows Git Bash
source .venv/Scripts/activate
4. Open the app
| Service | URL |
|---|---|
| Web UI | http://127.0.0.1:5173 |
| HTTP API | http://127.0.0.1:8000 |
| OpenAPI docs | http://127.0.0.1:8000/docs |
5. First pipeline (CLI)
dataevolver session-start my_pipeline \
--raw tmp/samples/finance_raw.jsonl \
--seed tmp/samples/finance_seed.jsonl \
--description tmp/samples/finance_description.txt
dataevolver workflow advance-all my_pipeline --max-steps 32
dataevolver workflow state my_pipeline
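For repeatable runs, the same three commands can be driven from a script. This is a minimal sketch: the helper names `first_pipeline_commands` and `run_all` are hypothetical, and only the CLI arguments are taken from the commands above. It defaults to a dry run that prints each command without executing it.

```python
import shlex
import subprocess
from typing import List


def first_pipeline_commands(
    pipeline_id: str,
    raw: str,
    seed: str,
    description: str,
    max_steps: int = 32,
) -> List[List[str]]:
    """Build the three CLI invocations shown above as argument lists."""
    return [
        ["dataevolver", "session-start", pipeline_id,
         "--raw", raw, "--seed", seed, "--description", description],
        ["dataevolver", "workflow", "advance-all", pipeline_id,
         "--max-steps", str(max_steps)],
        ["dataevolver", "workflow", "state", pipeline_id],
    ]


def run_all(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print("$", shlex.join(argv))
        if not dry_run:
            # check=True stops at the first failing step.
            subprocess.run(argv, check=True)


commands = first_pipeline_commands(
    "my_pipeline",
    raw="tmp/samples/finance_raw.jsonl",
    seed="tmp/samples/finance_seed.jsonl",
    description="tmp/samples/finance_description.txt",
)
run_all(commands)  # dry run: prints the commands only
```

Passing `dry_run=False` executes the commands in order and requires the `dataevolver` CLI to be on `PATH`.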
🛠️ Usage
DataEvolver exposes the same workflow through three interfaces.
Web UI (recommended for exploration)
- Create or select a pipeline session
- Upload raw data, seed data, and optional task description
- Advance step-by-step or run continuously
- Inspect DAG tabs, instantiation code, trial scores, and experience
- Trigger full run only after quality gates pass
CLI (recommended for reproducibility)
dataevolver --help
dataevolver state my_pipeline
dataevolver advance my_pipeline
dataevolver workflow advance-all my_pipeline --max-steps 32
Stage commands
| Stage | Command |
|---|---|
| Understanding | dataevolver understand my_pipeline |
| Orchestration | dataevolver orchestrate my_pipeline |
| Instantiation | dataevolver instantiate my_pipeline |
| Trial run | dataevolver trial my_pipeline |
| Quality check | dataevolver quality-check my_pipeline |
| Experience | dataevolver experience my_pipeline |
| Full run | dataevolver run my_pipeline |
Debugging & automation
dataevolver rerun my_pipeline orchestration
dataevolver tokens my_pipeline
dataevolver state --json my_pipeline
dataevolver advance --json my_pipeline
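The --json flag makes the CLI easier to call from scripts. The sketch below assumes that `dataevolver state --json` prints a single JSON document to stdout. That output shape is an assumption, and no field names are assumed. `parse_cli_json` and `workflow_state` are hypothetical helper names.

```python
import json
import subprocess
from typing import Any


def parse_cli_json(stdout: str) -> Any:
    """Parse CLI output as JSON, raising a readable error if it is not JSON."""
    try:
        return json.loads(stdout)
    except json.JSONDecodeError as exc:
        raise ValueError(f"expected JSON from the CLI, got: {stdout[:80]!r}") from exc


def workflow_state(pipeline_id: str) -> Any:
    """Run `dataevolver state --json <pipeline_id>` and return the parsed result."""
    completed = subprocess.run(
        ["dataevolver", "state", "--json", pipeline_id],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return parse_cli_json(completed.stdout)
```

Calling `workflow_state("my_pipeline")` requires an installed `dataevolver` CLI and an existing session.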
Operator pool (manual add)
Add custom operators to the task memory layer (data/operator_registry_user/<pipeline_id>.json). Manually added operators follow the same assimilation path as auto-evolved ones and remain eligible for later promotion to domain or general memory.
# List pool for a pipeline
dataevolver operators list -p my_pipeline
dataevolver op list -p my_pipeline --source task # short alias: op
# Add one operator
dataevolver op add my_task.clean_answer -p my_pipeline \
-d "Strip boilerplate and keep direct answers" \
-c semantic --requires-llm
# Interactive wizard
dataevolver op add -p my_pipeline -i
# Import from JSON (see examples/operator_template.json)
dataevolver op add -p my_pipeline --from-file examples/operator_template.json
# Clone spec from an existing operator
dataevolver op add my_task.custom_filter -p my_pipeline --copy-from remove_field -d "My variant"
# Remove from task memory (cannot delete base operators)
dataevolver op remove my_task.clean_answer -p my_pipeline
After adding operators, re-run orchestration so the DAG can pick them up:
dataevolver workflow orchestrate my_pipeline
HTTP API (recommended for integration)
| Endpoint | Purpose |
|---|---|
| POST /api/sessions/start | Create session & register manifest |
| GET /api/workflow/{pipeline_id}/state | Read workflow state |
| POST /api/workflow/{pipeline_id}/advance | Advance one step |
| POST /api/workflow/{pipeline_id}/rerun | Rerun from a stage |
| POST /api/pipeline/{pipeline_id}/run-full | Full dataset execution |
| GET /api/operators/?pipeline_id= | List merged operator pool |
| POST /api/operators/add | Manually add operator(s) |
| POST /api/operators/remove | Remove from task/domain/general memory |
Interactive schema: http://127.0.0.1:8000/docs
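A thin client for the workflow endpoints can be written with the standard library alone. This is a sketch under stated assumptions: the paths come from the table above, but the request bodies and response shapes are not documented here. The empty JSON body for advance and the JSON response parsing are both assumptions, so check them against the interactive schema. `workflow_url` and `call` are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Optional

BASE_URL = "http://127.0.0.1:8000"  # backend address from the Quick Start


def workflow_url(pipeline_id: str, action: str, base_url: str = BASE_URL) -> str:
    """Build /api/workflow/{pipeline_id}/{action} for state, advance, or rerun."""
    quoted = urllib.parse.quote(pipeline_id, safe="")
    return f"{base_url}/api/workflow/{quoted}/{action}"


def call(method: str, url: str, payload: Optional[dict] = None, timeout: float = 30.0) -> Any:
    """Send a request and decode the response body as JSON (assumed format)."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    request = urllib.request.Request(
        url,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage, with the backend running on :8000:
#   state = call("GET", workflow_url("my_pipeline", "state"))
#   call("POST", workflow_url("my_pipeline", "advance"), payload={})  # body is an assumption
```

The standard library keeps the example dependency-free. Any HTTP client works the same way against these paths.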
🧩 Project Structure
DataEvolver/
├── core/ # config, paths, LLM client, logging, token ledger
├── subsystems/ # understanding, orchestration, instantiation, trial, workflow, …
├── web/ # FastAPI app & routers
├── frontend/ # React + Vite evolution canvas UI
├── cli/ # Typer CLI (`dataevolver`)
├── config/ # runtime configs & templates
├── data/ # artifacts, workflow state, uploads (runtime)
├── assets/ # paper figures, demo media
├── examples/ # sample configs (e.g. operator_template.json)
├── scripts/
│ ├── setup_env.py # cross-platform installer (core)
│ └── dev.py # dev server helpers
├── setup_env.sh # Linux / macOS / Git Bash → setup_env.py
├── setup_env.ps1 # Windows PowerShell → setup_env.py
├── setup_env.bat # Windows CMD → setup_env.py
└── docs/INSTALL.md # full deployment guide
⚙️ Configuration
| File | Purpose |
|---|---|
| config/api_config.json | LLM provider, model, endpoints |
| config/api_keys.json | API credentials (keep out of git) |
| config/operator_registry*.json | Built-in & custom operators |
| data/workflow_runs/{id}/state.json | Per-pipeline workflow progress |
Tips
- Use --force / rerun when you want to regenerate a stage instead of reusing cached artifacts
- Delete data/generated_pipelines/{id}.json to force re-instantiation
- Token usage is tracked per workflow step via dataevolver tokens
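The re-instantiation tip can be scripted so the cached file is removed only when it exists. This is a small sketch: `generated_pipeline_path` and `clear_generated_pipeline` are hypothetical helpers, and only the data/generated_pipelines/{id}.json path comes from the tip above.

```python
from pathlib import Path


def generated_pipeline_path(pipeline_id: str, data_dir: str = "data") -> Path:
    """Path of the cached instantiation artifact named in the tip above."""
    return Path(data_dir) / "generated_pipelines" / f"{pipeline_id}.json"


def clear_generated_pipeline(pipeline_id: str, data_dir: str = "data") -> bool:
    """Delete the cached file if present; return True when something was removed."""
    path = generated_pipeline_path(pipeline_id, data_dir)
    if path.is_file():
        path.unlink()
        return True
    return False
```

After `clear_generated_pipeline("my_pipeline")` returns True, the next instantiation step should regenerate its output rather than reuse the cached artifact.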
❓ FAQ
Why does instantiation finish instantly?
If artifacts already exist, instantiation may reuse previous outputs and report the step as skipped. Built-in operators also use template delegation, so only requires_llm operators trigger LLM codegen. Check the UI banner or the message from dataevolver state to see whether a step was reused or generated by the LLM.
Why does experience also finish quickly?
Experience summarization is rule-based aggregation over quality check, trial, and pilot results — it is designed for deterministic reflow, not LLM step-by-step rewriting.
Why do I see multiple orchestration tabs?
Each tab is a distinct orchestration attempt — typically a failed validation followed by a repaired DAG. Archives live under data/artifact_history/{pipeline_id}/.
Which data formats are supported today?
The current release focuses on text data preparation for LLM training: instruction tuning, QA-style supervision, math reasoning traces, and text-to-SQL. The architecture is extensible to broader modalities in future releases.
🤝 Community
We welcome issues, ideas, and contributions!
| Channel | Link |
|---|---|
| Bug reports & feature requests | GitHub Issues |
| Questions & show-and-tell | GitHub Discussions |
| Demo video | Release download |
Contributing (lightweight)
- Fork the repo and create a feature branch
- Keep changes focused; match existing module boundaries (subsystems/, web/, frontend/, cli/)
- Run backend smoke tests / npm run build in frontend/ when touching UI
- Open a PR with: what changed, why, and how to verify
Good first contribution areas
- New operators in the registry
- Additional evaluation metrics or dataset adapters
- UI polish on the evolution canvas
- Docs, examples, and reproducible benchmark scripts