Agentic synthetic-data generation framework inspired by Meta FAIR's Autodata / Agentic Self-Instruct.
autosynth
Generate synthetic datasets with an LLM loop that proposes, audits, solves, and judges its own work. Inspired by Meta FAIR's Autodata / Agentic Self-Instruct paper, but rewritten to be domain-agnostic: every domain-specific piece lives in a small Python plugin, and the runtime is the same regardless of whether you're generating math word problems, support-ticket triage data, or QA pairs from your own docs.
The headline trick: for each candidate datapoint, run a weak solver and a strong solver, score both against an LLM-generated rubric, and only keep the example if the strong solver clearly beats the weak one on a quality-passing example. Failed rounds are reflected on and fed back into the next attempt.
Status: alpha (0.1.0). The API is still moving. Pin a commit if you're depending on it.
Install
uv venv
uv pip install -e . # core
uv pip install -e ".[dev]" # + pytest, ruff
uv pip install -e ".[hf]" # + Hugging Face export
Python 3.10+. Either activate the venv (source .venv/bin/activate) or prefix commands with uv run.
Quick start (no API keys)
uv run autosynth run --config configs/mock_demo.yaml
uv run autosynth status outputs/mock-demo
uv run autosynth export --run outputs/mock-demo --format jsonl
The mock demo uses an in-process scripted "provider" and finishes in about a second. It writes outputs/mock-demo/run.db plus a frozen config snapshot. The export step is opt-in — the SQLite database is the source of truth.
Real providers
LLM calls go through LiteLLM, so any provider it supports should work. Set the relevant key and reference the model in YAML:
export OPENAI_API_KEY=...
export ANTHROPIC_API_KEY=...
challenger: { provider_model: anthropic/claude-haiku-4-5, temperature: 0.8 }
weak_solver: { provider_model: openai/gpt-4o-mini }
strong_solver: { provider_model: openai/gpt-4o }
judge: { provider_model: anthropic/claude-haiku-4-5, temperature: 0.0 }
You can mix providers across roles. The cheaper-vs-frontier split between the two solvers is the whole point — that's what produces the weak/strong gap that drives acceptance.
${VAR} and ${VAR:default} substitution works in any string field, so api_base: ${OLLAMA_HOST:http://localhost:11434} does what you'd expect.
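For readers wondering how the default can itself contain colons (as in the URL above), here is a minimal sketch of this style of substitution. It is illustrative only, not autosynth's actual implementation: the function name `substitute` and the choice to raise on an unset variable with no default are assumptions.

```python
import os
import re

# Matches ${VAR} and ${VAR:default}. Only the first colon separates the name
# from the default, so defaults such as URLs may contain further colons.
_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::([^}]*))?\}")


def substitute(text, env=None):
    """Expand ${VAR} / ${VAR:default} in a string (hypothetical helper)."""
    env = os.environ if env is None else env

    def _replace(match):
        name, default = match.group(1), match.group(2)
        if name in env:
            return env[name]
        if default is not None:
            return default
        # Assumption: an unset variable without a default is treated as an error.
        raise KeyError(f"environment variable {name} is not set and has no default")

    return _PATTERN.sub(_replace, text)
```

With `OLLAMA_HOST` unset, `substitute("${OLLAMA_HOST:http://localhost:11434}", {})` yields the default URL; with it set, the environment value wins.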
See configs/example_qa.yaml and configs/example_math.yaml for full real-provider configs.
How it works
For each source item, autosynth runs the same five-step loop until the candidate is accepted or loop.max_rounds is exhausted:
- Challenger proposes a candidate (input, reference_output, rubric).
- Quality audits the candidate for obvious problems.
- Weak and strong solvers each take N attempts at the input.
- Judge scores every attempt against the rubric.
- Evaluator decides accept / reject. If reject, reflector writes feedback for the next round.
The acceptance defaults come from §3 of the paper:
- weak average ≤ 0.65, weak max ≤ 0.75
- strong average in [0.60, 0.95)
- strong − weak gap ≥ 0.20
- quality must have passed
All of these are overridable in acceptance: in your config.
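The default thresholds above can be read as a single predicate. The sketch below is one way to write it, not the project's evaluator: the `Acceptance` field names and the `accept` function are hypothetical, and it assumes the gap is measured between the strong and weak averages.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Acceptance:
    # Field names are hypothetical; the values are the defaults listed above.
    weak_avg_max: float = 0.65
    weak_max_max: float = 0.75
    strong_avg_min: float = 0.60
    strong_avg_max: float = 0.95  # exclusive upper bound
    min_gap: float = 0.20


def accept(weak, strong, quality_passed, cfg=Acceptance()):
    """Return (accepted, reasons) for one candidate's per-attempt scores."""
    weak_avg, strong_avg = mean(weak), mean(strong)
    reasons = []
    if not quality_passed:
        reasons.append("quality audit failed")
    if weak_avg > cfg.weak_avg_max:
        reasons.append("weak average too high")
    if max(weak) > cfg.weak_max_max:
        reasons.append("weak max too high")
    if not (cfg.strong_avg_min <= strong_avg < cfg.strong_avg_max):
        reasons.append("strong average outside [min, max)")
    # Assumption: the gap compares the two averages.
    if strong_avg - weak_avg < cfg.min_gap:
        reasons.append("strong/weak gap too small")
    return (not reasons, reasons)
```

Collecting every failed condition, rather than returning on the first, gives a reflection step more to work with. Scores that sit exactly on a threshold are subject to floating-point rounding, so a real implementation may want a small tolerance.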
Architecture
The runtime is an event-sourced pipeline over a SQLite database. A pure step() function advances item state; the dispatcher fulfills LLM requests and writes responses back; the store is the durable record.
pipeline.step() pure state machine: (state, responses) -> (state, requests)
dispatcher reads ready items, calls step(), fulfills requests
├─ fulfill_local threadpool over HTTP
└─ fulfill_batch provider batch APIs (see "Batch" below)
store SQLite + WAL, one run.db per run
llm provider routing, rate-limit, retry, cost accounting
Item states: PENDING → NEED_CANDIDATE → NEED_QUALITY → NEED_SCORES with NEED_REFLECTION on the reject branch and ACCEPTED / REJECTED as terminals. NEED_SCORES fans out N × weak + N × strong solver requests in parallel; each judge fires the moment its solver lands. Concurrency is bounded by cfg.dispatcher.concurrency.
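To make the shape of a pure `(state, responses) -> (state, requests)` function concrete, here is a heavily simplified sketch of the transitions named above. It is not the real `pipeline.step()`: the `Item` fields, the request names, the single `"scores"` request standing in for the solver/judge fan-out, and the choice to send a failed quality audit to reflection are all assumptions.

```python
from dataclasses import dataclass, replace
from enum import Enum, auto


class State(Enum):
    PENDING = auto()
    NEED_CANDIDATE = auto()
    NEED_QUALITY = auto()
    NEED_SCORES = auto()
    NEED_REFLECTION = auto()
    ACCEPTED = auto()
    REJECTED = auto()


@dataclass(frozen=True)
class Item:
    # Hypothetical minimal item record; the real one carries much more.
    state: State = State.PENDING
    round_no: int = 0


def _reject_round(item, max_rounds):
    # Out of rounds: terminal REJECTED. Otherwise request reflector feedback.
    if item.round_no + 1 >= max_rounds:
        return replace(item, state=State.REJECTED), []
    return replace(item, state=State.NEED_REFLECTION), ["reflector"]


def step(item, responses, max_rounds=3):
    """Pure transition: no I/O, and the same inputs always give the same outputs."""
    s = item.state
    if s is State.PENDING:
        return replace(item, state=State.NEED_CANDIDATE), ["challenger"]
    if s is State.NEED_CANDIDATE and "challenger" in responses:
        return replace(item, state=State.NEED_QUALITY), ["quality"]
    if s is State.NEED_QUALITY and "quality" in responses:
        if responses["quality"]["passed"]:
            return replace(item, state=State.NEED_SCORES), ["scores"]
        return _reject_round(item, max_rounds)
    if s is State.NEED_SCORES and "scores" in responses:
        if responses["scores"]["accepted"]:
            return replace(item, state=State.ACCEPTED), []
        return _reject_round(item, max_rounds)
    if s is State.NEED_REFLECTION and "reflector" in responses:
        nxt = replace(item, state=State.NEED_CANDIDATE, round_no=item.round_no + 1)
        return nxt, ["challenger"]
    # Terminal state, or the awaited response has not arrived yet: no-op.
    return item, []
```

Because the function touches nothing outside its arguments, replaying it against whatever state and responses are stored on disk gives the same result as the original call.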
The fact that step() is pure is the only reason resume works. Kill the process at any point — including mid-batch — and autosynth resume picks up exactly where it left off. In-flight local requests revert to pending; in-flight batch requests stay tagged and get polled.
CLI
autosynth run --config CONFIG.yaml [--run-id ID] [--resume RUN_ID] [-v]
autosynth resume RUN_DIR
autosynth status RUN_DIR
autosynth inspect-run RUN_DIR [--stuck]
autosynth export --run RUN_DIR --format jsonl|hf [--out PATH]
autosynth metaopt --config CONFIG.yaml
autosynth init-domain NAME --out my_domain.py
status is the one-liner; inspect-run is the detailed per-item table. --stuck filters to items that haven't reached a terminal state, which is what you want when something looks wrong.
Run outputs
Everything for a run lives under outputs/<run_id>/:
- run.db: SQLite. Tables: runs, items, rounds, requests, responses, solver_scores, accepted. Queryable with the sqlite3 CLI and safe to share.
- config.snapshot.yaml: the exact config used. Resume reads this if you don't pass --config.
- accepted.jsonl / hf_export/: produced on autosynth export, not written automatically.
Each accepted record contains input, reference_output, rubric, domain, source_id, metadata, the weak/strong/gap scores, per-attempt solver scores, and the acceptance rationale.
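Since run.db is ordinary SQLite, Python's standard `sqlite3` module can inspect it as well as the CLI can. The sketch below counts rows per table without assuming any column names, which this README does not document. The helper names `open_readonly` and `table_counts` are made up for illustration.

```python
import sqlite3


def open_readonly(path):
    """Open a database read-only so an inspection script cannot modify the run."""
    return sqlite3.connect(f"file:{path}?mode=ro", uri=True)


def table_counts(conn):
    """Return {table_name: row_count} for every user table in the database."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
        )
    ]
    counts = {}
    for name in names:
        # Names come from sqlite_master itself; quote them as identifiers.
        quoted = '"' + name.replace('"', '""') + '"'
        counts[name] = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
    return counts


# Usage, assuming the mock demo has been run:
#   conn = open_readonly("outputs/mock-demo/run.db")
#   print(table_counts(conn))
#   conn.close()
```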
Writing a domain
A domain plugin is one class subclassing DomainAdapter with six methods. Scaffold one with:
uv run autosynth init-domain customer_support --out my_domain.py
Fill in load_grounding, generation_prompt, validate_candidate, solver_prompt, quality_prompt, and judge_prompt, then point your config at it:
domain:
  path: ./my_domain.py:CustomerSupport
  params:
    source_csv: ./tickets.csv
The two bundled domains (src/autosynth/domains/qa_from_documents.py, src/autosynth/domains/math_word_problems.py) are short and worth reading before you write your own.
Meta-optimization
autosynth metaopt --config CONFIG.yaml runs the paper's secondary loop: evolve the orchestrator's prompts over generations. The unit of evolution is a HarnessSpec — a structured bag of rule strings that get injected into each agent's system prompt, plus a couple of numeric knobs.
The loop, roughly:
- Score the seed harness on training and validation source items.
- Each iteration: Boltzmann-sample a parent from the population (T=0.1 over training scores), summarize that parent's most recent rejection reasons, ask the mutator LLM for a structured diff, apply it, dedupe, and re-evaluate.
- Accept the mutation only if child.val > parent.val (the paper's gate).
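The parent selection and the acceptance gate in the loop above can be sketched in a few lines. This is a generic Boltzmann (softmax) sampler, not autosynth's code: the population is represented as plain dicts with hypothetical `"train"` and `"val"` keys.

```python
import math
import random


def boltzmann_weights(scores, temperature=0.1):
    """Softmax over score / temperature, shifted by the max for numerical stability."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_parent(population, rng, temperature=0.1):
    """Pick a parent index with probability proportional to exp(train_score / T)."""
    weights = boltzmann_weights([h["train"] for h in population], temperature)
    return rng.choices(range(len(population)), weights=weights, k=1)[0]


def accept_mutation(parent, child):
    """Gate: keep the child only on a strict validation improvement."""
    return child["val"] > parent["val"]
```

At T=0.1 the sampler is strongly greedy: a harness scoring 0.5 on training is picked over one scoring 0.0 more than 99% of the time, while equal scores are sampled uniformly. A seeded `random.Random` keeps a run reproducible.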
Mutations operate on the harness, not on Python source. That preserves the main lever the paper exercises (prompt-text edits) without the sandboxing headache of a code-editing agent. Swap in your own mutator if you want richer edits.
Try it without keys:
uv run autosynth metaopt --config configs/metaopt_mock.yaml
The mock scenario seeds at 0% accept. On iteration 1 the mutator proposes a source-specificity rule that lifts both train and val to 100%; that mutation is accepted, and subsequent iterations are deduplicated. Population, lineage, and per-iteration decisions are written under outputs/metaopt/<run_id>/iterations/.
To run for real, add metaopt: { enabled: true, max_iterations: 50, ... } to your existing config and point metaopt.mutator at a strong reasoning model. Meta-opt reuses your existing domain, acceptance, loop, and agent settings.
Batch mode
The dispatcher can submit requests through provider batch APIs (OpenAI /v1/batches, Anthropic message batches) for the 50% cost discount. The BatchProvider protocol and a MockBatchProvider are in the box. Real provider implementations are not — wiring those up is the next piece of work. If you only have a few thousand requests, fulfill_local is fine.
Safety and quality notes
- Every accepted datapoint carries an acceptance_rationale and a serialized EvalReport. There is no silent acceptance path.
- The built-in PII filter (safety.enabled: true) is a conservative heuristic, not a real DLP. For anything regulated, plug your own module in via safety.filter.
- Solvers are never told they're the weak or strong solver; the differential comes from the model/temperature choice. The paper flags adversarial prompting here as a gaming vector, so don't.
- There is no diversity / near-duplicate check on accepted examples yet. If you need that, extend store.insert_accepted with MinHash or embedding-based dedupe.
- LLM-as-judge bias is what it is. The rubric weight cap (≤ 7) and the positive-only rule from the paper help, but don't pretend they eliminate it.
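For the near-duplicate gap noted above, a starting point could look like the sketch below. It computes exact Jaccard similarity over word shingles, which is the quantity MinHash approximates at scale; it is not MinHash itself and does a linear scan per check. The class name, the threshold, and how it would hook into store.insert_accepted are assumptions.

```python
import re


def shingles(text, k=3):
    """Set of k-word shingles, lowercased and with punctuation stripped."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class NearDuplicateFilter:
    """Hypothetical helper: remembers accepted inputs and flags close repeats."""

    def __init__(self, threshold=0.8, k=3):
        self.threshold = threshold
        self.k = k
        self._seen = []

    def is_duplicate(self, text):
        candidate = shingles(text, self.k)
        return any(jaccard(candidate, old) >= self.threshold for old in self._seen)

    def add(self, text):
        self._seen.append(shingles(text, self.k))
```

A caller would check `is_duplicate(record_input)` before inserting and call `add` afterwards. For more than a few thousand accepted examples, the linear scan is the part to replace with MinHash/LSH or an embedding index.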
Tests
uv run pytest
The full suite (~130 tests) runs against the in-process mock provider — no keys, no network. The interesting bits to look at if you're touching the core:
- test_pure_pipeline.py: exhaustive state-transition coverage of step(), including the partial-completion no-op invariant.
- test_store.py: claim_pending atomicity under threads, resume normalization.
- test_dispatcher.py: end-to-end accept, 100-request concurrent fulfill, budget abort, kill/resume.
License
MIT. See LICENSE.