Automatic data preparation for LLMs via multi-level self-evolving pipelines

DataEvolver

Turn noisy raw data + a handful of seed examples into training-ready, seed-aligned datasets — with executable DAGs, trial feedback, and iterative refinement built in.

Paper · Demo · Install · Quick Start · Usage · Results · Community

Give us a ⭐ if DataEvolver helps your data prep workflow; it helps others discover the project.

TL;DR

You provide	DataEvolver does	You get
Raw data	Understands target profile from seeds	Structured understanding artifact
Seed examples	Orchestrates & validates operator DAGs	Executable pipeline plan
Optional task description	Instantiates, trials, judges, evolves	High-quality prepared data

One sentence: DataEvolver is a self-evolving data-prep system that jointly optimizes executability and seed alignment instead of stopping at one-shot pipeline synthesis.

Table of Contents

Why DataEvolver
Highlights
How It Works
Results
Demo
Installation
Quick Start
Usage
Project Structure
Configuration
FAQ
Community
Citation

🔥 Why DataEvolver

Training data quality remains a bottleneck in LLM post-training. Raw corpora are often noisy, structurally inconsistent, or misaligned with the supervision style you actually want.

Most existing approaches fall into two camps:

Approach	Strength	Limitation
Predefined recipes	Stable engineering	Hard to adapt to new tasks
One-shot pipeline synthesis	Flexible	Often fragile in execution & quality

DataEvolver targets a harder, more practical question:

Can we automatically build a high-quality data preparation pipeline from raw data and only a small set of seed examples?

That requires optimizing two goals at once:

Executability — the pipeline must actually run end-to-end
Quality alignment — outputs must match the profile implied by seeds

DataEvolver achieves this through multi-level self-evolution: operator-level DAG repair plus pipeline-level experience feedback across rounds.

✨ Highlights

Seed-guided understanding — infer schema, style, and quality constraints from seeds + sampled raw data
Operator-level self-evolving — build, validate, and repair DAGs; synthesize operators when the registry is insufficient
Pipeline-level self-evolving — trial runs, pilot judging, experience summarization, and next-round refinement
Three aligned interfaces — Web UI, CLI, and HTTP API share the same workflow semantics
Observable by design — stage artifacts, orchestration retries, token ledger, and round history are all inspectable
Open & extensible — modular subsystems, editable operator registry, and scriptable automation

🧠 How It Works

Figure: DataEvolver framework

flowchart LR
  A[Raw Data + Seeds] --> B[Understanding]
  B --> C[Orchestration]
  C --> D[Operator Evolution]
  D --> E[Instantiation]
  E --> F[Trial Run]
  F --> G[Quality Check]
  G --> H[Experience]
  H -->|not aligned| B
  G -->|ready| I[Full Run]

Core workflow loop

understanding → orchestration → operator_evolution → instantiation → trial_run → quality_check → experience

When quality criteria are met, DataEvolver runs the refined pipeline on the full dataset.
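The loop above can be read as a small state machine. The sketch below is illustrative only: the stage names come from the workflow listing, while `evolve`, `run_stage`, `is_aligned`, and `max_rounds` are hypothetical names introduced here and are not DataEvolver's API. It also assumes that a passing quality check goes straight to the full run, as the `ready` edge in the flowchart suggests.

```python
from typing import Callable

# Stage names as listed in the core workflow loop.
STAGES = [
    "understanding", "orchestration", "operator_evolution",
    "instantiation", "trial_run", "quality_check", "experience",
]


def evolve(run_stage: Callable[[str, int], None],
           is_aligned: Callable[[int], bool],
           max_rounds: int = 3) -> list[str]:
    """Run rounds of the stage loop until the quality gate passes.

    run_stage and is_aligned are hypothetical callbacks supplied by the caller.
    Returns the ordered trace of executed steps as "round:stage" strings.
    """
    trace: list[str] = []
    for round_no in range(1, max_rounds + 1):
        for stage in STAGES:
            run_stage(stage, round_no)
            trace.append(f"{round_no}:{stage}")
            # Assumed reading of the flowchart: a passing check leads to the full run.
            if stage == "quality_check" and is_aligned(round_no):
                trace.append(f"{round_no}:full_run")
                return trace
    return trace  # round budget exhausted without passing the gate
```

If the gate never passes, the trace simply ends after the last `experience` step, which mirrors the "not aligned" edge returning to understanding for another round.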

Three self-evolving layers

Understanding — learn the target data profile from seeds and raw samples
Operator evolution — fix DAG structure, dependencies, and missing capabilities
Pipeline evolution — convert trial-vs-seed gaps into reusable experience for the next round
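To make the operator-evolution layer more concrete, here is a minimal sketch of the kind of check a DAG validator might perform before repair: unknown operators (missing capabilities) and dependency cycles (broken structure). It uses Python's standard `graphlib`; the function name `validate_dag` and the dictionary layout are assumptions for illustration and do not reflect DataEvolver's internal representation.

```python
from graphlib import CycleError, TopologicalSorter


def validate_dag(dag: dict[str, set[str]], registry: set[str]) -> list[str]:
    """Return a list of problems; an empty list means an execution order exists.

    dag maps each operator node to the nodes it depends on.
    registry is the set of operator names known to be available.
    """
    problems: list[str] = []
    nodes = set(dag) | {dep for deps in dag.values() for dep in deps}
    # Missing capability: a node refers to an operator that is not registered.
    for name in sorted(nodes - registry):
        problems.append(f"unknown operator: {name}")
    # Structural check: a cycle means no valid execution order exists.
    try:
        tuple(TopologicalSorter(dag).static_order())
    except CycleError as exc:
        problems.append(f"cycle: {' -> '.join(exc.args[1])}")
    return problems
```

A repair step could then act on each reported problem, for example by synthesizing an operator for an unknown name or dropping an edge that closes a cycle.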

📊 Results

Overall downstream performance

Figure: Main experiment results

Across 7 benchmarks from 4 task categories (instruction following, multiple-choice QA, math reasoning, text-to-SQL), DataEvolver improves training data quality and downstream performance — about 12% relative gain on average vs. weaker preparation settings.

Comparison against strong baselines

Figure: Comparison results

DataEvolver outperforms vanilla SFT on raw data and strong data-preparation baselines. In several settings, fewer but better-prepared samples match or exceed larger, weakly prepared alternatives.

Ablation: both evolution loops matter

Figure: Ablation study

Without operator-level evolution → pipelines are less executable and coherent
Without pipeline-level evolution → outputs are less seed-aligned

Efficiency

DataEvolver improves training-readiness and seed alignment while reducing preparation overhead — about 40% lower amortized token cost on average in our experiments.

Case study

Figure: Case study of pipeline evolution

See how an initial logical plan evolves into a refined executable pipeline, and how trial feedback becomes constraints for later rounds.

🎬 Demo

Recommended (small download for a clean clone):

Download DataEvolver_Demo_small.mov

The Web UI shows the evolution canvas — DAG orchestration tabs, instantiation cards, sample evaluation, and experience reflow across rounds.

⚡ Quick Start

Full cross-platform guide: docs/INSTALL.md

Prerequisites

Component	Version
Python	3.10+
Node.js	18+ LTS (Web UI)
LLM API	OpenAI-compatible endpoint + key

1. Clone & install (pick your OS)

git clone https://github.com/Akanezora0/DataEvolver.git
cd DataEvolver

Platform	One-command setup
Linux / macOS / Git Bash	`bash setup_env.sh`
Windows PowerShell	`powershell -ExecutionPolicy Bypass -File .\setup_env.ps1`
Windows CMD	`setup_env.bat`
Any OS	`python scripts/setup_env.py`

This creates .venv, installs Python + npm dependencies, and copies config/*.example.json → config/*.json when missing.
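The copy-when-missing behaviour described above can be sketched as follows. This is not the code in `scripts/setup_env.py`; `copy_missing_configs` is a hypothetical helper that only illustrates why re-running setup should leave an edited `config/*.json` untouched.

```python
from pathlib import Path
import shutil


def copy_missing_configs(config_dir: Path) -> list[Path]:
    """Copy each *.example.json to its *.json counterpart unless that file exists."""
    created: list[Path] = []
    for example in sorted(config_dir.glob("*.example.json")):
        # "api_config.example.json" -> "api_config.json"
        target = example.with_name(example.name.replace(".example.json", ".json"))
        if not target.exists():
            shutil.copyfile(example, target)
            created.append(target)
    return created
```

Calling `copy_missing_configs(Path("config"))` a second time returns an empty list, because every target already exists.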

Install from PyPI (CLI / API only)

pip install dataevolver
mkdir my_project && cd my_project
dataevolver init          # creates config/ + data/ from bundled templates
# edit config/api_config.json & config/api_keys.json
dataevolver --help
dataevolver-server --reload   # API on :8000 (Web UI still needs git clone + npm)

See docs/PUBLISHING.md for maintainers (Test PyPI / PyPI upload).

Optional flags for the setup scripts above: --skip-frontend (API/CLI only) · --frontend-only (npm only).

2. Configure LLM

config/api_config.json   # provider, base URL, model
config/api_keys.json     # API key (gitignored — do not commit)
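Before starting the services, it can help to confirm that both files exist and parse as JSON. The sketch below makes no assumptions about the keys inside them, since their schema is not shown here; `check_llm_config` is a hypothetical helper, not part of the package.

```python
import json
from pathlib import Path

REQUIRED = ("api_config.json", "api_keys.json")  # file names from the step above


def check_llm_config(config_dir: Path) -> dict[str, str]:
    """Map each required config file to "ok", "missing", or "invalid JSON"."""
    status: dict[str, str] = {}
    for name in REQUIRED:
        path = config_dir / name
        if not path.is_file():
            status[name] = "missing"
            continue
        try:
            json.loads(path.read_text(encoding="utf-8"))
            status[name] = "ok"
        except json.JSONDecodeError:
            status[name] = "invalid JSON"
    return status
```

For example, `check_llm_config(Path("config"))` returns a status per file that can be printed before launching the backend.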

3. Start services (two terminals)

Service	Cross-platform	Classic
Backend `:8000`	`python scripts/dev.py backend`	`python run_server.py --reload` (after activating `.venv`)
Frontend `:5173`	`python scripts/dev.py frontend`	`cd frontend && npm run dev`

Activate virtualenv if needed:

# Linux / macOS
source .venv/bin/activate

# Windows PowerShell
.\.venv\Scripts\Activate.ps1

# Windows Git Bash
source .venv/Scripts/activate

4. Open the app

Service	URL
Web UI	http://127.0.0.1:5173
HTTP API	http://127.0.0.1:8000
OpenAPI docs	http://127.0.0.1:8000/docs

5. First pipeline (CLI)

dataevolver session-start my_pipeline \
  --raw tmp/samples/finance_raw.jsonl \
  --seed tmp/samples/finance_seed.jsonl \
  --description tmp/samples/finance_description.txt

dataevolver workflow advance-all my_pipeline --max-steps 32
dataevolver workflow state my_pipeline
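For scripted runs, the same three commands can be driven from Python. The sketch below only assembles the argument lists shown above and hands them to `subprocess.run`; the helper names `first_pipeline_commands` and `run_all` are hypothetical, and no CLI flags beyond those in the quick start are assumed.

```python
import subprocess


def first_pipeline_commands(pipeline_id: str, raw: str, seed: str,
                            description: str, max_steps: int = 32) -> list[list[str]]:
    """Build the three quick-start CLI invocations as argv lists."""
    return [
        ["dataevolver", "session-start", pipeline_id,
         "--raw", raw, "--seed", seed, "--description", description],
        ["dataevolver", "workflow", "advance-all", pipeline_id,
         "--max-steps", str(max_steps)],
        ["dataevolver", "workflow", "state", pipeline_id],
    ]


def run_all(commands: list[list[str]]) -> None:
    """Run each command in order, raising on the first non-zero exit code."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Passing argv lists instead of a shell string avoids quoting problems when file paths contain spaces.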

🛠️ Usage

DataEvolver exposes the same workflow through three interfaces.

Web UI (recommended for exploration)

Create or select a pipeline session
Upload raw data, seed data, and optional task description
Advance step-by-step or run continuously
Inspect DAG tabs, instantiation code, trial scores, and experience
Trigger full run only after quality gates pass

CLI (recommended for reproducibility)

dataevolver --help
dataevolver state my_pipeline
dataevolver advance my_pipeline
dataevolver workflow advance-all my_pipeline --max-steps 32

Stage commands

Stage	Command
Understanding	`dataevolver understand my_pipeline`
Orchestration	`dataevolver orchestrate my_pipeline`
Instantiation	`dataevolver instantiate my_pipeline`
Trial run	`dataevolver trial my_pipeline`
Quality check	`dataevolver quality-check my_pipeline`
Experience	`dataevolver experience my_pipeline`
Full run	`dataevolver run my_pipeline`

Debugging & automation

dataevolver rerun my_pipeline orchestration
dataevolver tokens my_pipeline
dataevolver state --json my_pipeline
dataevolver advance --json my_pipeline
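The `--json` variants are convenient for automation. As a hedged sketch, the wrapper below runs a CLI command and decodes its standard output; it assumes the command prints a single JSON document on stdout, and it returns the decoded value unchanged because the output schema is not documented in this section. `cli_json` and its `runner` parameter are hypothetical names.

```python
import json
import subprocess
from typing import Any, Callable


def cli_json(args: list[str],
             runner: Callable[..., Any] = subprocess.run) -> Any:
    """Run `dataevolver <args>` and decode its stdout as JSON.

    runner is injectable so the wrapper can be exercised without the CLI installed.
    """
    proc = runner(["dataevolver", *args], capture_output=True, text=True, check=True)
    return json.loads(proc.stdout)
```

A call such as `cli_json(["state", "--json", "my_pipeline"])` would then give a Python object that a script can inspect or log.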

Operator pool (manual add)

Add custom operators to the task memory layer (data/operator_registry_user/<pipeline_id>.json). Manually added operators follow the same assimilation path as auto-evolved ones, so they are eligible for domain/general promotion later.

# List pool for a pipeline
dataevolver operators list -p my_pipeline
dataevolver op list -p my_pipeline --source task   # short alias: op

# Add one operator
dataevolver op add my_task.clean_answer -p my_pipeline \
  -d "Strip boilerplate and keep direct answers" \
  -c semantic --requires-llm

# Interactive wizard
dataevolver op add -p my_pipeline -i

# Import from JSON (see examples/operator_template.json)
dataevolver op add -p my_pipeline --from-file examples/operator_template.json

# Clone spec from an existing operator
dataevolver op add my_task.custom_filter -p my_pipeline --copy-from remove_field -d "My variant"

# Remove from task memory (cannot delete base operators)
dataevolver op remove my_task.clean_answer -p my_pipeline

After adding operators, re-run orchestration so the DAG can pick them up:

dataevolver workflow orchestrate my_pipeline
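One way to picture the merged operator pool is as a stack of layers looked up in order. The sketch below uses `collections.ChainMap` and assumes a precedence of task, then domain, then general, then base; that ordering is an assumption made for illustration and is not confirmed by the text above. `merged_pool` and `can_remove` are hypothetical helpers.

```python
from collections import ChainMap


def merged_pool(task: dict, domain: dict, general: dict, base: dict) -> ChainMap:
    """Merge operator layers; earlier layers shadow later ones on name clashes."""
    return ChainMap(task, domain, general, base)


def can_remove(name: str, base: dict) -> bool:
    """Base operators are protected from removal, as the CLI comment notes."""
    return name not in base
```

Under this model, removing an operator from task memory simply exposes whatever the lower layers define under the same name, if anything.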

HTTP API (recommended for integration)

Endpoint	Purpose
`POST /api/sessions/start`	Create session & register manifest
`GET /api/workflow/{pipeline_id}/state`	Read workflow state
`POST /api/workflow/{pipeline_id}/advance`	Advance one step
`POST /api/workflow/{pipeline_id}/rerun`	Rerun from a stage
`POST /api/pipeline/{pipeline_id}/run-full`	Full dataset execution
`GET /api/operators/?pipeline_id=`	List merged operator pool
`POST /api/operators/add`	Manually add operator(s)
`POST /api/operators/remove`	Remove from task/domain/general memory

Interactive schema: http://127.0.0.1:8000/docs
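A minimal client for two of the endpoints above can be written with the standard library alone. This is a sketch under stated assumptions: the paths come from the endpoint table, but the request body for `advance` is not documented here, so an empty JSON object is assumed. The helper names are hypothetical; consult the interactive schema for the actual request and response models.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # default backend address from the quick start


def state_request(pipeline_id: str) -> urllib.request.Request:
    """Build GET /api/workflow/{pipeline_id}/state."""
    return urllib.request.Request(
        f"{BASE_URL}/api/workflow/{pipeline_id}/state", method="GET")


def advance_request(pipeline_id: str) -> urllib.request.Request:
    """Build POST /api/workflow/{pipeline_id}/advance.

    Assumption: the endpoint accepts an empty JSON object as its body.
    """
    return urllib.request.Request(
        f"{BASE_URL}/api/workflow/{pipeline_id}/advance",
        data=json.dumps({}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0):
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the backend running, `send(state_request("my_pipeline"))` would return the decoded state payload.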

🧩 Project Structure

DataEvolver/
├── core/           # config, paths, LLM client, logging, token ledger
├── subsystems/     # understanding, orchestration, instantiation, trial, workflow, …
├── web/            # FastAPI app & routers
├── frontend/       # React + Vite evolution canvas UI
├── cli/            # Typer CLI (`dataevolver`)
├── config/         # runtime configs & templates
├── data/           # artifacts, workflow state, uploads (runtime)
├── assets/         # paper figures, demo media
├── examples/       # sample configs (e.g. operator_template.json)
├── scripts/
│   ├── setup_env.py   # cross-platform installer (core)
│   └── dev.py         # dev server helpers
├── setup_env.sh       # Linux / macOS / Git Bash → setup_env.py
├── setup_env.ps1      # Windows PowerShell → setup_env.py
├── setup_env.bat      # Windows CMD → setup_env.py
└── docs/INSTALL.md    # full deployment guide

⚙️ Configuration

File	Purpose
`config/api_config.json`	LLM provider, model, endpoints
`config/api_keys.json`	API credentials (keep out of git)
`config/operator_registry*.json`	Built-in & custom operators
`data/workflow_runs/{id}/state.json`	Per-pipeline workflow progress

Tips

Use --force or the rerun command (for example, dataevolver rerun my_pipeline orchestration) to regenerate a stage instead of reusing cached artifacts
Delete data/generated_pipelines/{id}.json to force re-instantiation
Token usage is tracked per workflow step via dataevolver tokens
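The second tip can be scripted. The sketch below removes the cached pipeline file so that the next instantiation starts fresh; `force_reinstantiation` is a hypothetical helper, and the path layout is taken from the tip above with `data/` assumed to be relative to the project root.

```python
from pathlib import Path


def force_reinstantiation(pipeline_id: str, data_dir: Path = Path("data")) -> bool:
    """Delete the cached generated pipeline; return True if a file was removed."""
    cached = data_dir / "generated_pipelines" / f"{pipeline_id}.json"
    existed = cached.is_file()
    cached.unlink(missing_ok=True)  # no error if the cache is already gone
    return existed
```

Returning a boolean makes it easy to log whether a cache was actually present before re-running the stage.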

❓ FAQ

Why does instantiation finish instantly?

If artifacts already exist, instantiation may reuse previous outputs (skipped). Built-in operators also use template delegation — only requires_llm operators trigger LLM codegen. Check the UI banner or dataevolver state message for reuse vs. LLM details.
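The rules in this answer amount to a small decision table. As an illustration only, the hypothetical function below classifies how a single operator would be handled; the return labels are invented for this sketch and are not status strings emitted by DataEvolver.

```python
def instantiation_mode(artifact_exists: bool, requires_llm: bool) -> str:
    """Classify how one operator would be instantiated under the rules above."""
    if artifact_exists:
        return "reused"        # cached output, no new work
    if requires_llm:
        return "llm_codegen"   # only these operators call the LLM
    return "template"          # built-in operators delegate to templates
```

A pipeline made up only of reused or template operators involves no LLM calls, which is why the stage can appear to finish instantly.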

Why does experience also finish quickly?

Experience summarization is rule-based aggregation over quality check, trial, and pilot results. It is designed for deterministic reflow rather than step-by-step LLM rewriting.
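To show what rule-based aggregation can look like, here is a minimal sketch that turns failed checks into an ordered list of constraints. The record shape (`passed`, `issue`) and the output wording are hypothetical; DataEvolver's actual experience format is not described in this section.

```python
from collections import Counter


def summarize_experience(trial_results: list[dict]) -> list[str]:
    """Turn failed checks into ordered constraints for the next round.

    Each result is a hypothetical record like {"passed": bool, "issue": str}.
    """
    issues = Counter(r["issue"] for r in trial_results if not r["passed"])
    # Sort by frequency, then name, so the same input always gives the same output.
    ranked = sorted(issues.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"avoid: {issue} (seen {count}x)" for issue, count in ranked]
```

Because the function only counts and sorts, it needs no model calls and returns immediately, which matches the quick completion described above.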

Why do I see multiple orchestration tabs?

Each tab is a distinct orchestration attempt — typically a failed validation followed by a repaired DAG. Archives live under data/artifact_history/{pipeline_id}/.

Which data formats are supported today?

The current release focuses on text data preparation for LLM training: instruction tuning, QA-style supervision, math reasoning traces, and text-to-SQL. The architecture is extensible to broader modalities in future releases.

🤝 Community

We welcome issues, ideas, and contributions!

Channel	Link
Bug reports & feature requests	GitHub Issues
Questions & show-and-tell	GitHub Discussions (enable if not yet active)
Demo video	Release download

Contributing (lightweight)

Fork the repo and create a feature branch
Keep changes focused; match existing module boundaries (subsystems/, web/, frontend/, cli/)
Run the backend smoke tests, and run npm run build in frontend/ when touching the UI
Open a PR with: what changed, why, and how to verify

Good first contribution areas

New operators in the registry
Additional evaluation metrics or dataset adapters
UI polish on the evolution canvas
Docs, examples, and reproducible benchmark scripts

Release history

0.1.1 (May 28, 2026)
0.1.0 (May 28, 2026, this version)


File details

Details for the file dataevolver-0.1.0.tar.gz.

File metadata

Download URL: dataevolver-0.1.0.tar.gz
Upload date: May 28, 2026
Size: 164.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for dataevolver-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`5aa25f5bbe8dd94624b1fba498ad75ab0a8a847b210d6b758cc3dba467b1e66c`
MD5	`1fc5f18f64feb02ffc5ff38aa8e38f34`
BLAKE2b-256	`b8f939056b73317c95ce3870b400fc64f35897b3d79143a776e296058501fe9e`

File details

Details for the file dataevolver-0.1.0-py3-none-any.whl.

File metadata

Download URL: dataevolver-0.1.0-py3-none-any.whl
Upload date: May 28, 2026
Size: 195.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for dataevolver-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5b0eef310afb7adeb1d4c448f7820fc7ff52d7e382601ae42f95622fec466d42`
MD5	`da5f2c9f7df54fc54a57e36ea4ef5ac5`
BLAKE2b-256	`4581be7024c5ea1ae1a5f62767945c567a6c20e32cbeabfa891f6a165636b916`
