Automatic data preparation for LLMs via multi-level self-evolving pipelines

DataEvolver

Turn noisy raw data + a handful of seed examples into training-ready, seed-aligned datasets — with executable DAGs, trial feedback, and iterative refinement built in.


Paper · Demo · Install · Quick Start · Usage · Results · Community


DataEvolver overview

Give us a ⭐ if DataEvolver helps your data prep workflow — it helps others discover the project.


TL;DR

| You provide | DataEvolver does | You get |
| --- | --- | --- |
| Raw data | Understands target profile from seeds | Structured understanding artifact |
| Seed examples | Orchestrates & validates operator DAGs | Executable pipeline plan |
| Optional task description | Instantiates, trials, judges, evolves | High-quality prepared data |

One sentence: DataEvolver is a self-evolving data-prep system that jointly optimizes executability and seed alignment, not just one-shot pipeline synthesis.


🔥 Why DataEvolver

Training data quality remains a bottleneck in LLM post-training. Raw corpora are often noisy, structurally inconsistent, or misaligned with the supervision style you actually want.

Most existing approaches fall into two camps:

| Approach | Strength | Limitation |
| --- | --- | --- |
| Predefined recipes | Stable engineering | Hard to adapt to new tasks |
| One-shot pipeline synthesis | Flexible | Often fragile in execution & quality |

DataEvolver targets a harder, more practical question:

Can we automatically build a high-quality data preparation pipeline from raw data and only a small set of seed examples?

That requires optimizing two goals at once:

  • Executability — the pipeline must actually run end-to-end
  • Quality alignment — outputs must match the profile implied by seeds

DataEvolver achieves this through multi-level self-evolving: operator-level DAG repair + pipeline-level experience feedback across rounds.


✨ Highlights

  • Seed-guided understanding — infer schema, style, and quality constraints from seeds + sampled raw data
  • Operator-level self-evolving — build, validate, and repair DAGs; synthesize operators when the registry is insufficient
  • Pipeline-level self-evolving — trial runs, pilot judging, experience summarization, and next-round refinement
  • Three aligned interfaces — Web UI, CLI, and HTTP API share the same workflow semantics
  • Observable by design — stage artifacts, orchestration retries, token ledger, and round history are all inspectable
  • Open & extensible — modular subsystems, editable operator registry, and scriptable automation
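
To make the "build, validate, and repair DAGs" highlight concrete, here is a minimal sketch of what an executability check on an operator DAG could look like. It is not DataEvolver's validator: the `validate_dag` helper, the DAG encoding (operator name mapped to its dependencies), and the operator names are assumptions made for illustration.

```python
from graphlib import CycleError, TopologicalSorter


def validate_dag(dag: dict[str, list[str]], registry: set[str]) -> tuple[list[str], list[str]]:
    """Return (execution_order, problems) for a candidate operator DAG.

    `dag` maps each operator to the operators it depends on (hypothetical
    encoding); `registry` is the set of operator names known to exist.
    """
    problems: list[str] = []

    # Every node, including ones that only appear as dependencies,
    # must be backed by a registered operator.
    nodes = set(dag) | {dep for deps in dag.values() for dep in deps}
    for name in sorted(nodes - registry):
        problems.append(f"unknown operator: {name}")

    # A runnable plan needs a topological order, so cycles are errors.
    try:
        order = list(TopologicalSorter(dag).static_order())
    except CycleError as exc:
        problems.append(f"cycle detected: {exc.args[1]}")
        order = []

    return order, problems


# Illustrative operator names, not taken from the built-in registry.
plan = {"dedupe": ["normalize"], "normalize": []}
print(validate_dag(plan, registry={"dedupe", "normalize"}))
```

A non-empty `problems` list is the kind of signal a repair step could act on, for example by rewiring a dependency or synthesizing a missing operator.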

🧠 How It Works

DataEvolver framework

flowchart LR
  A[Raw Data + Seeds] --> B[Understanding]
  B --> C[Orchestration]
  C --> D[Operator Evolution]
  D --> E[Instantiation]
  E --> F[Trial Run]
  F --> G[Quality Check]
  G --> H[Experience]
  H -->|not aligned| B
  G -->|ready| I[Full Run]

Core workflow loop

understanding → orchestration → operator_evolution → instantiation → trial_run → quality_check → experience

When quality criteria are met, DataEvolver runs the refined pipeline on the full dataset.

Three self-evolving layers

  1. Understanding — learn the target data profile from seeds and raw samples
  2. Operator evolution — fix DAG structure, dependencies, and missing capabilities
  3. Pipeline evolution — convert trial-vs-seed gaps into reusable experience for the next round
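
The control flow of the loop above can be sketched in a few lines of Python. The stage names come from the workflow line in this README; everything else (the `evolve` function, the `run_stage` callback, the `context` dictionary and its keys) is hypothetical and only illustrates how a quality gate can end the loop while failed rounds feed experience into the next one.

```python
from typing import Callable

# Stage names as listed in the core workflow loop.
STAGES = [
    "understanding", "orchestration", "operator_evolution", "instantiation",
    "trial_run", "quality_check", "experience",
]


def evolve(run_stage: Callable[[str, dict], dict], max_rounds: int = 3) -> dict:
    """Repeat the stage loop until the quality gate passes or rounds run out."""
    context = {"round": 0, "experience": [], "ready": False}
    for round_no in range(1, max_rounds + 1):
        context["round"] = round_no
        for stage in STAGES:
            context = run_stage(stage, context)
            if stage == "quality_check" and context["ready"]:
                return context  # gate passed: a full run would start here
    return context


def demo_stage(stage: str, context: dict) -> dict:
    """Stand-in for real stage logic: the gate passes from round 2 onward."""
    context = dict(context)
    if stage == "quality_check":
        context["ready"] = context["round"] >= 2
    elif stage == "experience":
        note = f"gap noted in round {context['round']}"
        context["experience"] = context["experience"] + [note]
    return context


result = evolve(demo_stage)
print(result["round"], result["ready"], result["experience"])
```

Passing the stage logic in as a callback keeps the loop itself trivial to test, which is one reasonable way to separate orchestration from stage implementations.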

📊 Results

Overall downstream performance

Main Experiment Results

Across 7 benchmarks from 4 task categories (instruction following, multiple-choice QA, math reasoning, text-to-SQL), DataEvolver improves training data quality and downstream performance — about 12% relative gain on average vs. weaker preparation settings.

Comparison against strong baselines

Comparison Results

DataEvolver outperforms vanilla SFT on raw data and strong data-preparation baselines. In several settings, fewer but better-prepared samples match or exceed larger, weakly prepared alternatives.

Ablation: both evolution loops matter

Ablation Study

  • Without operator-level evolution → pipelines are less executable and coherent
  • Without pipeline-level evolution → outputs are less seed-aligned

Efficiency

DataEvolver improves training-readiness and seed alignment while reducing preparation overhead — about 40% lower amortized token cost on average in our experiments.

Case study

Case study: pipeline evolution

See how an initial logical plan evolves into a refined executable pipeline, and how trial feedback becomes constraints for later rounds.


🎬 Demo

Recommended (small download for a clean clone):

Download DataEvolver_Demo_small.mov

The Web UI shows the evolution canvas — DAG orchestration tabs, instantiation cards, sample evaluation, and experience reflow across rounds.


⚡ Quick Start

Full cross-platform guide: docs/INSTALL.md

Prerequisites

| Component | Version |
| --- | --- |
| Python | 3.10+ |
| Node.js | 18+ LTS (Web UI) |
| LLM API | OpenAI-compatible endpoint + key |

1. Clone & install (pick your OS)

git clone https://github.com/Akanezora0/DataEvolver.git
cd DataEvolver

| Platform | One-command setup |
| --- | --- |
| Linux / macOS / Git Bash | bash setup_env.sh |
| Windows PowerShell | powershell -ExecutionPolicy Bypass -File .\setup_env.ps1 |
| Windows CMD | setup_env.bat |
| Any OS | python scripts/setup_env.py |

This creates .venv, installs Python + npm dependencies, and copies config/*.example.json → config/*.json when missing.

Install from PyPI (CLI / API only)

pip install dataevolver
mkdir my_project && cd my_project
dataevolver init          # creates config/ + data/ from bundled templates
# edit config/api_config.json & config/api_keys.json
dataevolver --help
dataevolver-server --reload   # API on :8000 (Web UI still needs git clone + npm)

Maintainers: see docs/PUBLISHING.md for Test PyPI / PyPI upload.

Optional flags for the setup scripts: --skip-frontend (API/CLI only) · --frontend-only (npm only).

2. Configure LLM

config/api_config.json   # provider, base URL, model
config/api_keys.json     # API key (gitignored — do not commit)
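
Before starting the services, it can help to confirm that both files exist and parse as JSON. The sketch below does only that; `check_config` is a hypothetical helper, and it deliberately makes no assumption about which keys the files contain.

```python
import json
from pathlib import Path

# File names taken from the configuration step above.
CONFIG_FILES = ("api_config.json", "api_keys.json")


def check_config(config_dir: str = "config") -> dict[str, str]:
    """Report 'ok', 'missing', or 'invalid JSON' for each config file."""
    status: dict[str, str] = {}
    for name in CONFIG_FILES:
        path = Path(config_dir) / name
        if not path.is_file():
            status[name] = "missing"
            continue
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            status[name] = f"invalid JSON (line {exc.lineno})"
        else:
            status[name] = "ok"
    return status


for name, state in check_config().items():
    print(f"{name}: {state}")
```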

3. Start services (two terminals)

| Service | Cross-platform | Classic |
| --- | --- | --- |
| Backend :8000 | python scripts/dev.py backend | python run_server.py --reload (after activating .venv) |
| Frontend :5173 | python scripts/dev.py frontend | cd frontend && npm run dev |

Activate virtualenv if needed:

# Linux / macOS
source .venv/bin/activate

# Windows PowerShell
.\.venv\Scripts\Activate.ps1

# Windows Git Bash
source .venv/Scripts/activate

4. Open the app

| Service | URL |
| --- | --- |
| Web UI | http://127.0.0.1:5173 |
| HTTP API | http://127.0.0.1:8000 |
| OpenAPI docs | http://127.0.0.1:8000/docs |

5. First pipeline (CLI)

dataevolver session-start my_pipeline \
  --raw tmp/samples/finance_raw.jsonl \
  --seed tmp/samples/finance_seed.jsonl \
  --description tmp/samples/finance_description.txt

dataevolver workflow advance-all my_pipeline --max-steps 32
dataevolver workflow state my_pipeline

🛠️ Usage

DataEvolver exposes the same workflow through three interfaces.

Web UI (recommended for exploration)

  1. Create or select a pipeline session
  2. Upload raw data, seed data, and optional task description
  3. Advance step-by-step or run continuously
  4. Inspect DAG tabs, instantiation code, trial scores, and experience
  5. Trigger full run only after quality gates pass

CLI (recommended for reproducibility)

dataevolver --help
dataevolver state my_pipeline
dataevolver advance my_pipeline
dataevolver workflow advance-all my_pipeline --max-steps 32

Stage commands

| Stage | Command |
| --- | --- |
| Understanding | dataevolver understand my_pipeline |
| Orchestration | dataevolver orchestrate my_pipeline |
| Instantiation | dataevolver instantiate my_pipeline |
| Trial run | dataevolver trial my_pipeline |
| Quality check | dataevolver quality-check my_pipeline |
| Experience | dataevolver experience my_pipeline |
| Full run | dataevolver run my_pipeline |

Debugging & automation

dataevolver rerun my_pipeline orchestration
dataevolver tokens my_pipeline
dataevolver state --json my_pipeline
dataevolver advance --json my_pipeline

Operator pool (manual add)

Add custom operators to the task memory layer (data/operator_registry_user/<pipeline_id>.json). Manually added operators follow the same assimilation path as auto-evolved ones and are eligible for domain/general promotion later.

# List pool for a pipeline
dataevolver operators list -p my_pipeline
dataevolver op list -p my_pipeline --source task   # short alias: op

# Add one operator
dataevolver op add my_task.clean_answer -p my_pipeline \
  -d "Strip boilerplate and keep direct answers" \
  -c semantic --requires-llm

# Interactive wizard
dataevolver op add -p my_pipeline -i

# Import from JSON (see examples/operator_template.json)
dataevolver op add -p my_pipeline --from-file examples/operator_template.json

# Clone spec from an existing operator
dataevolver op add my_task.custom_filter -p my_pipeline --copy-from remove_field -d "My variant"

# Remove from task memory (cannot delete base operators)
dataevolver op remove my_task.clean_answer -p my_pipeline

After adding operators, re-run orchestration so the DAG can pick them up:

dataevolver workflow orchestrate my_pipeline

HTTP API (recommended for integration)

| Endpoint | Purpose |
| --- | --- |
| POST /api/sessions/start | Create session & register manifest |
| GET /api/workflow/{pipeline_id}/state | Read workflow state |
| POST /api/workflow/{pipeline_id}/advance | Advance one step |
| POST /api/workflow/{pipeline_id}/rerun | Rerun from a stage |
| POST /api/pipeline/{pipeline_id}/run-full | Full dataset execution |
| GET /api/operators/?pipeline_id= | List merged operator pool |
| POST /api/operators/add | Manually add operator(s) |
| POST /api/operators/remove | Remove from task/domain/general memory |

Interactive schema: http://127.0.0.1:8000/docs


🧩 Project Structure

DataEvolver/
├── core/           # config, paths, LLM client, logging, token ledger
├── subsystems/     # understanding, orchestration, instantiation, trial, workflow, …
├── web/            # FastAPI app & routers
├── frontend/       # React + Vite evolution canvas UI
├── cli/            # Typer CLI (`dataevolver`)
├── config/         # runtime configs & templates
├── data/           # artifacts, workflow state, uploads (runtime)
├── assets/         # paper figures, demo media
├── examples/       # sample configs (e.g. operator_template.json)
├── scripts/
│   ├── setup_env.py   # cross-platform installer (core)
│   └── dev.py         # dev server helpers
├── setup_env.sh       # Linux / macOS / Git Bash → setup_env.py
├── setup_env.ps1      # Windows PowerShell → setup_env.py
├── setup_env.bat      # Windows CMD → setup_env.py
└── docs/INSTALL.md    # full deployment guide

⚙️ Configuration

| File | Purpose |
| --- | --- |
| config/api_config.json | LLM provider, model, endpoints |
| config/api_keys.json | API credentials (keep out of git) |
| config/operator_registry*.json | Built-in & custom operators |
| data/workflow_runs/{id}/state.json | Per-pipeline workflow progress |

Tips

  • Use --force / rerun when you want to regenerate a stage instead of reusing cached artifacts
  • Delete data/generated_pipelines/{id}.json to force re-instantiation
  • Token usage is tracked per workflow step via dataevolver tokens
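
The second tip can be scripted. This sketch removes the cached pipeline file for one pipeline so that the next instantiation regenerates it; the path layout is taken from the tip above, while `force_reinstantiation` and its `data_dir` parameter are hypothetical.

```python
from pathlib import Path


def force_reinstantiation(pipeline_id: str, data_dir: str = "data") -> bool:
    """Delete data/generated_pipelines/{id}.json; return True if it existed."""
    cached = Path(data_dir) / "generated_pipelines" / f"{pipeline_id}.json"
    if cached.is_file():
        cached.unlink()
        return True
    return False
```

Calling `force_reinstantiation("my_pipeline")` from the project root returns `False` when nothing was cached, so it is safe to run more than once.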

❓ FAQ

Why does instantiation finish instantly?

If artifacts already exist, instantiation may reuse the previous outputs and report the stage as skipped. Built-in operators also use template delegation: only requires_llm operators trigger LLM codegen. Check the UI banner or the dataevolver state message to see whether a stage was reused or went through the LLM.

Why does experience also finish quickly?

Experience summarization is rule-based aggregation over quality check, trial, and pilot results — it is designed for deterministic reflow, not LLM step-by-step rewriting.

Why do I see multiple orchestration tabs?

Each tab is a distinct orchestration attempt — typically a failed validation followed by a repaired DAG. Archives live under data/artifact_history/{pipeline_id}/.

Which data formats are supported today?

The current release focuses on text data preparation for LLM training: instruction tuning, QA-style supervision, math reasoning traces, and text-to-SQL. The architecture is extensible to broader modalities in future releases.


🤝 Community

We welcome issues, ideas, and contributions!

| Channel | Link |
| --- | --- |
| Bug reports & feature requests | GitHub Issues |
| Questions & show-and-tell | GitHub Discussions (enable if not yet active) |
| Demo video | Release download |

Contributing (lightweight)

  1. Fork the repo and create a feature branch
  2. Keep changes focused; match existing module boundaries (subsystems/, web/, frontend/, cli/)
  3. Run the backend smoke tests, and run npm run build in frontend/ when touching the UI
  4. Open a PR with: what changed, why, and how to verify

Good first contribution areas

  • New operators in the registry
  • Additional evaluation metrics or dataset adapters
  • UI polish on the evolution canvas
  • Docs, examples, and reproducible benchmark scripts
