Data synthesis toolkit - batch generate high-quality training data from seed examples using LLMs

DataSynth

LLM-Powered Synthetic Dataset Generation
with Quality-Diversity Optimization

Seed-to-scale synthetic data engine with auto-detected templates, concurrent generation, Schema validation, and precise cost estimation

Abstract · Problem Statement · Formal Framework · Architecture · Key Innovations · Quick Start · MCP Server · Ecosystem · References

Abstract

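High-quality training data is a key bottleneck for LLM performance. Manual annotation is expensive ($0.1–$10 per sample), slow (about 100 samples per day), and inconsistent (annotators interpret the task differently). Naive batch LLM calls, on the other hand, come with no quality guarantees: duplicate samples, Schema violations, and skewed distributions go undetected.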

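DataSynth proposes a seed-driven synthetic generation framework. Starting from a small seed set (about 50 samples), it uses auto-detected templates to pick the best-matching prompt strategy, then combines concurrent batch generation, Schema validation, and cross-batch deduplication to produce high-quality training data at $0.001–$0.01 per sample. The system implements the full pipeline "seed → template → generate → validate → deduplicate → statistics", and supports incremental resume and post-generation hooks that trigger quality checks automatically.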

DataSynth implements a seed-driven synthetic data generation framework. The system auto-detects data types (instruction-response / preference pairs / multi-turn dialogue), selects specialized prompt templates, generates data via concurrent LLM calls (Anthropic / OpenAI), validates against Schema constraints (type / range / enum / length), deduplicates across batches, and provides precise cost estimation based on per-model pricing. Supports incremental resume, retry with temperature escalation, and post-generation hooks.

Problem Statement

Synthetic data production faces three fundamental challenges:

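Fundamental problem	Formal definition	Limitations of existing approaches	DataSynth's approach
Cost-Scale Dilemma	Manual annotation cost $c_h \gg c_{llm}$, but LLM generation lacks quality guarantees	Naive batch calls with no validation: high volume, low quality	Schema validation + deduplication + retry with temperature escalation; cost drops to $0.001–$0.01 per sample
Template Blindness	Instruction-response, preference pairs, and multi-turn dialogue each need a different generation strategy	One generic prompt for every data type, with low quality	Auto-detects the data type and selects a specialized prompt template
Generation Fragmentation	An interrupted large-batch run must restart from scratch, wasting existing results	No incremental resume; API calls and cost are spent twice	Incremental resume (`--resume`) + concurrent batches + post-generation hooks for automatic quality checks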

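DataSynth is not a general-purpose LLM calling tool. It is a production line for LLM training data: an end-to-end pipeline from seed data to large-scale synthetic data, with verifiable quality, predictable cost, and a recoverable workflow.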

Formal Framework

Generation Model

Synthetic data generation is formalized as a mapping:

$$G: (\mathcal{S}, \mathcal{T}, \theta) \to D'$$

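where $\mathcal{S} = \{s_1, \ldots, s_k\}$ is the seed dataset ($k \approx 50$), $\mathcal{T}$ is the template function (selected automatically from the data type), $\theta = (\text{model}, \text{temperature}, \text{max\_tokens})$ are the generation parameters, and $D'$ is the synthetic dataset.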
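To make the mapping $G$ more concrete, the sketch below shows one way a handful of seeds and a template could be combined into a single prompt. It is illustrative only: `TEMPLATE`, `build_prompt`, and the number of few-shot examples `k` are assumptions made for this example, not DataSynth's actual templates or API.

```python
import json
import random

# Hypothetical prompt template; DataSynth's real templates are not shown in this document.
TEMPLATE = (
    "Here are {k} example records:\n{examples}\n\n"
    "Write {n} new records in the same JSON format, one per line."
)


def build_prompt(seeds, n, k=3, rng=None):
    """Sample up to k seeds and render them into a few-shot prompt asking for n records."""
    rng = rng or random.Random()
    shots = rng.sample(seeds, min(k, len(seeds)))
    examples = "\n".join(json.dumps(s, ensure_ascii=False) for s in shots)
    return TEMPLATE.format(k=len(shots), examples=examples, n=n)


if __name__ == "__main__":
    seeds = [{"instruction": f"q{i}", "response": f"a{i}"} for i in range(5)]
    print(build_prompt(seeds, n=10, rng=random.Random(0)))
```

Sampling a different subset of seeds for each batch is one simple way to vary the prompt between calls; whether DataSynth does this is not stated here.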

Quality-Diversity Trade-off

Synthetic data must satisfy quality and diversity at the same time:

$$\max_\theta \; \mathbb{E}_{d \sim D'}[Q(d)] \quad \text{s.t.} \quad H(D') \geq H_{\min}$$

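where $Q(d)$ is the sample quality (Schema compliance) and $H(D')$ is the dataset entropy (a measure of diversity).

Schema validation enforces quality: type checks plus constraint checks (range / enum / length), with non-compliant samples filtered out automatically.

Temperature escalation promotes diversity: on each retry, $\theta_{\text{temp}} \leftarrow \theta_{\text{temp}} + 0.05$, which gradually increases generation diversity.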
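The text does not say how $H(D')$ is computed. As one possible reading, the sketch below measures the Shannon entropy (in bits) of a single field's value distribution. The function name `field_entropy` and the choice of a per-field entropy are assumptions for illustration.

```python
import math
from collections import Counter


def field_entropy(samples, field):
    """Shannon entropy (bits) of the values that `field` takes across samples."""
    counts = Counter(str(s.get(field)) for s in samples)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    data = [{"topic": "math"}, {"topic": "code"}, {"topic": "math"}, {"topic": "code"}]
    print(field_entropy(data, "topic"))  # two equally likely values give 1 bit
```

Under this reading, a constraint such as $H(D') \geq H_{\min}$ would amount to rejecting a dataset whose values collapse onto a few categories.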

Deduplication

Exact-match deduplication (against the seed set and across batches) keeps duplicate data from diluting diversity:

$$D'_{\text{final}} = \{d \in D' : d \notin \mathcal{S} \;\land\; \forall d' \in D'_{\text{prev}},\ d \neq d'\}$$
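The set definition above can be sketched in a few lines of Python. This example assumes that "exact match" means two records are equal after canonical JSON serialization; the helper names `canonical_key` and `deduplicate` are hypothetical. It also drops repeats inside the same batch, which the formula does not spell out.

```python
import json


def canonical_key(sample):
    """Serialize a record with sorted keys so field order does not affect equality."""
    return json.dumps(sample, sort_keys=True, ensure_ascii=False)


def deduplicate(batch, seeds, previous):
    """Keep samples that match neither a seed, an earlier batch, nor an earlier item of this batch."""
    seen = {canonical_key(s) for s in seeds} | {canonical_key(p) for p in previous}
    kept = []
    for sample in batch:
        key = canonical_key(sample)
        if key not in seen:
            seen.add(key)
            kept.append(sample)
    return kept
```

Exact matching of this kind only removes verbatim duplicates; paraphrased near-duplicates pass through.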

Cost Model

Cost estimation is based on each model's actual pricing:

$$\text{Cost}(D') = \sum_{d \in D'} (t_{\text{in}}(d) \cdot p_{\text{in}} + t_{\text{out}}(d) \cdot p_{\text{out}})$$

where $t_{\text{in}}, t_{\text{out}}$ are the input and output token counts, and $p_{\text{in}}, p_{\text{out}}$ are the per-token prices of the chosen model.

Architecture

graph LR
    Seed["Seed Data<br/>(~50 samples)"] --> Detect["Type Detector<br/>Auto-detect"]
    Detect --> Template["Template<br/>Specialized Prompt"]
    Template --> Gen["Generator<br/>Concurrent Batches"]
    Gen --> Val["Validator<br/>Schema Constraints"]
    Val --> Dedup["Deduplicator<br/>Seed + Cross-batch"]
    Dedup --> Stats["Statistics<br/>Distribution Report"]
    Stats --> Hook["Post Hook<br/>(Optional)"]

    style Gen fill:#0969da,color:#fff,stroke:#0969da
    style Val fill:#8b5cf6,color:#fff,stroke:#8b5cf6
    style Dedup fill:#2da44e,color:#fff,stroke:#2da44e
    style Seed fill:#1a1a2e,color:#e0e0e0,stroke:#444
    style Detect fill:#1a1a2e,color:#e0e0e0,stroke:#444
    style Template fill:#1a1a2e,color:#e0e0e0,stroke:#444
    style Stats fill:#1a1a2e,color:#e0e0e0,stroke:#444
    style Hook fill:#1a1a2e,color:#e0e0e0,stroke:#444

Key Innovations

1. Auto-Detected Data Type Templates

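The data type is detected automatically from the Schema field names, and a specialized prompt template is selected:

Field signature	Detected as	Specialized template
`instruction` + `response`	`instruction_response`	Instruction-response generation
`prompt` + `chosen` + `rejected`	`preference`	Preference comparison data (DPO/RLHF)
`conversation`	`multi_turn`	Multi-turn dialogue generation

The type can also be set manually: --data-type preference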
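The field-name rules in the table can be expressed as a small lookup. The sketch below is a minimal illustration, not DataSynth's detector: `detect_data_type` is a hypothetical name, and the `"unknown"` fallback and the order in which signatures are tried are assumptions.

```python
# Field signatures copied from the table above; everything else is illustrative.
SIGNATURES = [
    ("preference", {"prompt", "chosen", "rejected"}),
    ("instruction_response", {"instruction", "response"}),
    ("multi_turn", {"conversation"}),
]


def detect_data_type(field_names, default="unknown"):
    """Return the first data type whose required fields are all present."""
    fields = set(field_names)
    for data_type, required in SIGNATURES:
        if required <= fields:  # subset test: every required field exists
            return data_type
    return default
```

For example, `detect_data_type(["prompt", "chosen", "rejected"])` returns `"preference"`, while a Schema with none of the listed fields falls through to the default.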

2. Concurrent Generation with Incremental Resume

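Multiple batches call the LLM in parallel (with thread-safe deduplication), and an interrupted run continues from the existing output:

# Run 3 batches concurrently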
knowlyr-datasynth generate ./output/ -n 1000 --concurrency 3

# Resume after an interruption (existing data is skipped automatically)
knowlyr-datasynth generate ./output/ -n 1000 --resume
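The combination of concurrent batches, thread-safe deduplication, and resume can be sketched with the standard library. This is an assumption-laden illustration rather than DataSynth's implementation: it assumes a JSONL output file, and `load_existing`, `generate_resumable`, and the `make_batch` callback (standing in for one LLM call) are hypothetical names.

```python
import json
import threading
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def sample_key(sample):
    return json.dumps(sample, sort_keys=True, ensure_ascii=False)


def load_existing(path):
    """Read samples already written to a JSONL file; empty list if it does not exist."""
    p = Path(path)
    if not p.exists():
        return []
    with p.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def generate_resumable(path, target_count, make_batch, concurrency=3, max_rounds=50):
    """Append unique samples to `path` until it holds `target_count` records."""
    existing = load_existing(path)  # resume: earlier output counts toward the target
    seen = {sample_key(s) for s in existing}
    state = {"written": len(existing)}
    lock = threading.Lock()

    def run_batch(batch_index):
        batch = make_batch(batch_index)  # the slow call stays outside the lock
        with lock, open(path, "a", encoding="utf-8") as fh:
            for sample in batch:
                key = sample_key(sample)
                if state["written"] >= target_count or key in seen:
                    continue
                seen.add(key)
                fh.write(json.dumps(sample, ensure_ascii=False) + "\n")
                state["written"] += 1

    next_index = 0
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for _ in range(max_rounds):
            if state["written"] >= target_count:
                break
            # Run one round of `concurrency` batches and wait for all of them.
            list(pool.map(run_batch, range(next_index, next_index + concurrency)))
            next_index += concurrency
    return state["written"]
```

Holding the lock only while checking and writing keeps the slow generation calls parallel, while the shared `seen` set and the output file are updated by one thread at a time.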

Retry strategy: automatic retries with temperature escalation, balancing fault tolerance and diversity:

knowlyr-datasynth generate ... --max-retries 5 --retry-delay 3 --temperature 0.4
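A retry loop with temperature escalation might look like the following. The 0.05 step comes from the Formal Framework section, and the default values mirror the flags in the command above; the `call_llm` callback, the function name, and the cap at 1.0 are assumptions for illustration.

```python
import time


def generate_with_retry(call_llm, temperature=0.4, max_retries=5,
                        retry_delay=3.0, step=0.05, max_temperature=1.0):
    """Call `call_llm(temperature)`; on failure, wait and retry at a slightly higher temperature."""
    last_error = None
    for attempt in range(max_retries + 1):
        # Attempt 0 uses the base temperature; each retry adds one step (capped).
        current = min(temperature + attempt * step, max_temperature)
        try:
            return call_llm(current)
        except Exception as exc:  # a real client would catch narrower error types
            last_error = exc
            if attempt < max_retries:
                time.sleep(retry_delay)
    raise RuntimeError(f"generation failed after {max_retries} retries") from last_error
```

With the defaults shown, the attempts would run at 0.40, 0.45, 0.50, and so on, so a prompt that keeps producing invalid output is retried with progressively more varied sampling.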

3. Schema Validation and Deduplication

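Generated data is validated automatically, and non-compliant samples are filtered out:

Type checks: text / int / float / bool / list
Constraint checks: range (numeric range), enum (allowed values), min_length / max_length
Exact deduplication: against the seed set and across batches, to avoid duplicate data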
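The checks listed above can be sketched as a small validator. The Schema layout used here (a dict mapping each field name to a rule with `type`, `range`, `enum`, `min_length`, `max_length`) is an assumption for the example; DataSynth's actual Schema format is not shown in this document, and `validate_sample` is a hypothetical name.

```python
# One predicate per type name from the list above.
TYPE_CHECKS = {
    "text": lambda v: isinstance(v, str),
    "int": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "float": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "bool": lambda v: isinstance(v, bool),
    "list": lambda v: isinstance(v, list),
}


def validate_sample(sample, schema):
    """Return a list of error strings; an empty list means the sample is compliant."""
    errors = []
    for name, rule in schema.items():
        if name not in sample:
            errors.append(f"{name}: missing")
            continue
        value = sample[name]
        if not TYPE_CHECKS[rule["type"]](value):
            errors.append(f"{name}: expected {rule['type']}")
            continue  # constraint checks are meaningless on the wrong type
        if "range" in rule:
            low, high = rule["range"]
            if not (low <= value <= high):
                errors.append(f"{name}: {value} outside [{low}, {high}]")
        if "enum" in rule and value not in rule["enum"]:
            errors.append(f"{name}: {value!r} not in enum")
        if "min_length" in rule and len(value) < rule["min_length"]:
            errors.append(f"{name}: shorter than {rule['min_length']}")
        if "max_length" in rule and len(value) > rule["max_length"]:
            errors.append(f"{name}: longer than {rule['max_length']}")
    return errors
```

Filtering then reduces to keeping the samples for which `validate_sample` returns an empty list.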

4. Precise Cost Estimation

Cost is computed from each model's actual pricing; use --dry-run to estimate the cost before generating:

knowlyr-datasynth generate ./output/ -n 1000 --dry-run

Model pricing table

Model	Input ($/1K tokens)	Output ($/1K tokens)
Claude Opus	$0.015	$0.075
Claude Sonnet	$0.003	$0.015
Claude Haiku	$0.00025	$0.00125
GPT-4o	$0.0025	$0.01
GPT-4o Mini	$0.00015	$0.0006
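As a worked example of the cost model, the sketch below applies the prices from the table to a list of per-sample token counts. The prices are copied from the table and are quoted per 1K tokens, hence the division by 1000. The dictionary keys, the function name `estimate_cost`, and the token counts in the usage example are assumptions chosen for illustration.

```python
# (input, output) price in USD per 1K tokens, copied from the pricing table.
# The keys are labels for this example, not model identifiers accepted by the CLI.
PRICES_PER_1K = {
    "claude-opus": (0.015, 0.075),
    "claude-sonnet": (0.003, 0.015),
    "claude-haiku": (0.00025, 0.00125),
    "gpt-4o": (0.0025, 0.01),
    "gpt-4o-mini": (0.00015, 0.0006),
}


def estimate_cost(model, token_counts):
    """Sum t_in * p_in + t_out * p_out over (t_in, t_out) pairs, in USD."""
    p_in, p_out = PRICES_PER_1K[model]
    return sum(t_in * p_in + t_out * p_out for t_in, t_out in token_counts) / 1000


if __name__ == "__main__":
    # Hypothetical workload: 1000 samples, each using 500 input and 300 output tokens.
    workload = [(500, 300)] * 1000
    print(f"${estimate_cost('claude-sonnet', workload):.2f}")
```

Under these assumed token counts, Claude Sonnet comes to about $6.00 for 1000 samples, or $0.006 per sample, which falls inside the $0.001–$0.01 range quoted earlier.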

5. Post-Generation Hooks

A downstream command (such as a quality check) is triggered automatically once generation completes:

knowlyr-datasynth generate ./output/ -n 1000 \
  --post-hook "knowlyr-datacheck validate {analysis_dir}"

Supported variables: {analysis_dir} {output_path} {count}
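Variable substitution in a hook command can be sketched with `str.format`. How DataSynth actually expands and executes hooks is not described here, so `render_hook` and `run_hook` are hypothetical, as is the choice to quote paths and run the command without a shell.

```python
import shlex
import subprocess


def render_hook(template, analysis_dir, output_path, count):
    """Fill the three supported variables, quoting paths so spaces stay in one argument."""
    return template.format(
        analysis_dir=shlex.quote(str(analysis_dir)),
        output_path=shlex.quote(str(output_path)),
        count=count,
    )


def run_hook(command):
    """Run the rendered command without a shell and return its exit code."""
    return subprocess.run(shlex.split(command), check=False).returncode
```

For instance, rendering `"knowlyr-datacheck validate {analysis_dir}"` with `analysis_dir="./output"` yields `knowlyr-datacheck validate ./output`. A template that contains other literal braces would need them doubled for `str.format`.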

6. Distribution Statistics

--stats writes a field distribution report (synthetic.stats.json):

knowlyr-datasynth generate ./output/ -n 1000 --stats
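The layout of synthetic.stats.json is not documented here. As a rough idea of what a field distribution report could contain, the sketch below counts values per field and summarizes text lengths. The function name `field_stats`, the output keys, and the `max_distinct` cutoff are assumptions.

```python
from collections import Counter
from statistics import mean


def field_stats(samples, max_distinct=20):
    """Per-field summary: how often the field appears, text lengths, and value counts."""
    stats = {}
    for field in sorted({key for s in samples for key in s}):
        values = [s[field] for s in samples if field in s]
        entry = {"count": len(values)}
        if all(isinstance(v, str) for v in values):
            lengths = [len(v) for v in values]
            entry["length"] = {"min": min(lengths), "max": max(lengths), "mean": mean(lengths)}
        distinct = Counter(str(v) for v in values)
        if len(distinct) <= max_distinct:  # skip value counts for free-text fields
            entry["values"] = dict(distinct)
        stats[field] = entry
    return stats
```

The resulting dict is JSON-serializable, so it can be written out with `json.dump` for inspection.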

Quick Start

pip install knowlyr-datasynth

Optional dependencies

pip install knowlyr-datasynth[anthropic]  # Anthropic Claude
pip install knowlyr-datasynth[openai]     # OpenAI GPT
pip install knowlyr-datasynth[llm]        # Both providers
pip install knowlyr-datasynth[mcp]        # MCP server
pip install knowlyr-datasynth[all]        # All features

API Mode

export ANTHROPIC_API_KEY=your_key

# Generate from DataRecipe analysis output
knowlyr-datasynth generate ./analysis_output/my_dataset/ -n 100

# Concurrent generation with JSONL output
knowlyr-datasynth generate ./analysis_output/my_dataset/ -n 1000 --concurrency 3 --format jsonl

# Estimate the cost
knowlyr-datasynth generate ./analysis_output/my_dataset/ -n 1000 --dry-run

Interactive Mode (no API key required)

# Generate prompts and run them manually in Claude Code
knowlyr-datasynth prepare ./analysis_output/my_dataset/ -n 10

Python SDK

from datasynth import SynthEngine

engine = SynthEngine(model="claude-sonnet-4-20250514")
result = engine.generate(
    analysis_dir="./analysis_output/my_dataset/",
    target_count=100,
    concurrency=3,
)
print(f"Generated: {result.generated_count}")
print(f"Deduped: {result.dedup_count}")
print(f"Cost: ${result.cost_usd:.4f}")

Configuration file

knowlyr-datasynth init    # Generate a config template
knowlyr-datasynth generate ./output/ --config datasynth.config.json

{
  "target_count": 1000,
  "model": "claude-sonnet-4-20250514",
  "temperature": 0.8,
  "batch_size": 5,
  "concurrency": 3,
  "data_type": "auto"
}
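To show how these settings interact, the sketch below loads a config like the one above and works out how many LLM calls it implies. It assumes `batch_size` is the number of samples requested per call and that `concurrency` batches run per round; neither is stated in this document. `load_config` and `plan` are hypothetical helpers, not part of the DataSynth SDK.

```python
import json
import math

# Keys and value types taken from the example config above.
KNOWN_KEYS = {
    "target_count": int,
    "model": str,
    "temperature": (int, float),
    "batch_size": int,
    "concurrency": int,
    "data_type": str,
}


def load_config(text):
    """Parse a JSON config and reject unknown keys or wrongly typed values."""
    config = json.loads(text)
    for key, value in config.items():
        if key not in KNOWN_KEYS:
            raise ValueError(f"unknown config key: {key}")
        if isinstance(value, bool) or not isinstance(value, KNOWN_KEYS[key]):
            raise ValueError(f"bad type for {key}: {type(value).__name__}")
    return config


def plan(config):
    """Return (batches, rounds) needed if every batch yields batch_size valid samples."""
    batches = math.ceil(config["target_count"] / config["batch_size"])
    rounds = math.ceil(batches / config["concurrency"])
    return batches, rounds
```

With the example values (`target_count` 1000, `batch_size` 5, `concurrency` 3), this gives 200 batches in 67 rounds, before accounting for samples lost to validation or deduplication.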

MCP Server

{
  "mcpServers": {
    "knowlyr-datasynth": {
      "command": "uv",
      "args": ["--directory", "/path/to/data-synth", "run", "python", "-m", "datasynth.mcp_server"]
    }
  }
}

Nine MCP tools cover the full synthetic data workflow.

CLI Reference

Full command list

Command	Description
`knowlyr-datasynth generate <dir> -n <count>`	Generate synthetic data
`knowlyr-datasynth generate ... --concurrency 3`	Concurrent batches
`knowlyr-datasynth generate ... --resume`	Incremental resume
`knowlyr-datasynth generate ... --dry-run`	Cost estimation
`knowlyr-datasynth generate ... --stats`	Distribution statistics
`knowlyr-datasynth generate ... --data-type preference`	Set the data type manually
`knowlyr-datasynth generate ... --post-hook "cmd"`	Post-generation hook
`knowlyr-datasynth generate ... --config config.json`	Configuration file
`knowlyr-datasynth prepare <dir> -n <count>`	Generate prompts for interactive mode
`knowlyr-datasynth validate <data> <schema>`	Validate data
`knowlyr-datasynth init`	Generate a config template

Ecosystem

Architecture Diagram

graph LR
    Radar["Radar<br/>Discovery"] --> Recipe["Recipe<br/>Analysis"]
    Recipe --> Synth["Synth<br/>Generation"]
    Recipe --> Label["Label<br/>Annotation"]
    Synth --> Check["Check<br/>Quality"]
    Label --> Check
    Check --> Audit["Audit<br/>Model Audit"]
    Crew["Crew<br/>Deliberation Engine"]
    Agent["Agent<br/>RL Framework"]
    ID["ID<br/>Identity Runtime"]
    Crew -.->|capability definitions| ID
    ID -.->|identity + memory| Crew
    Crew -.->|trajectories + rewards| Agent
    Agent -.->|optimized policy| Crew

    style Synth fill:#0969da,color:#fff,stroke:#0969da
    style Crew fill:#2da44e,color:#fff,stroke:#2da44e
    style Agent fill:#8b5cf6,color:#fff,stroke:#8b5cf6
    style ID fill:#e5534b,color:#fff,stroke:#e5534b
    style Radar fill:#1a1a2e,color:#e0e0e0,stroke:#444
    style Recipe fill:#1a1a2e,color:#e0e0e0,stroke:#444
    style Label fill:#1a1a2e,color:#e0e0e0,stroke:#444
    style Check fill:#1a1a2e,color:#e0e0e0,stroke:#444
    style Audit fill:#1a1a2e,color:#e0e0e0,stroke:#444

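Layer	Project	Description	Repo
Discovery	AI Dataset Radar	Dataset competitive intelligence, trend analysis	GitHub
Analysis	DataRecipe	Reverse analysis, Schema extraction, cost estimation	GitHub
Production	DataSynth	LLM synthesis · smart templates · Schema validation · precise cost estimation	You are here
Production	DataLabel	Serverless annotation · LLM pre-labeling · IAA analysis	GitHub
Quality	DataCheck	Rule validation, duplicate detection, distribution analysis	GitHub
Audit	ModelAudit	Distillation detection, model fingerprinting	GitHub
Identity	knowlyr-id	Identity system + AI employee runtime	GitHub
Deliberation	Crew	Adversarial multi-agent deliberation · persistent memory evolution · MCP-native	GitHub
Agent Training	knowlyr-agent	Gymnasium-style RL framework · process reward model · SFT/DPO/GRPO	GitHub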

Development

git clone https://github.com/liuxiaotong/data-synth.git
cd data-synth
pip install -e ".[all,dev]"
pytest

CI: GitHub Actions, Python 3.10+. Pushing a tag automatically publishes to PyPI and creates a GitHub Release.

References

Self-Instruct. Wang, Y. et al., 2023. Self-Instruct: Aligning Language Models with Self-Generated Instructions. arXiv:2212.10560. (Self-instruction generation method.)
Alpaca. Taori, R. et al., 2023. Stanford Alpaca: An Instruction-following LLaMA Model. (Seed-driven synthetic instruction generation.)
WizardLM. Xu, C. et al., 2023. WizardLM: Empowering Large Language Models to Follow Complex Instructions. arXiv:2304.12244. (Instruction evolution method.)
UltraFeedback. Cui, G. et al., 2023. UltraFeedback: Boosting Language Models with High-quality Feedback. (Preference data synthesis.)
Constitutional AI. Bai, Y. et al., 2022. Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073. (Data quality driven by AI feedback.)

License

MIT

knowlyr: LLM-powered synthetic dataset generation with quality-diversity optimization


Release history

0.4.2 (this version): Feb 18, 2026
0.4.1: Feb 9, 2026
0.4.0: Feb 9, 2026
0.3.1: Feb 9, 2026
0.3.0: Feb 9, 2026
0.2.1: Feb 9, 2026
0.2.0: Feb 9, 2026
0.1.0: Feb 9, 2026


File details

Details for the file knowlyr_datasynth-0.4.2.tar.gz.

File metadata

Download URL: knowlyr_datasynth-0.4.2.tar.gz
Upload date: Feb 18, 2026
Size: 112.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.10.2 {"installer":{"name":"uv","version":"0.10.2","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for knowlyr_datasynth-0.4.2.tar.gz
Algorithm	Hash digest
SHA256	`a69aa4f1f4bba6356c68376086d867353f098e9a3a3b5aa72f7902548bbd3374`
MD5	`c04772f48c9e2ddc5aca5b390eb51019`
BLAKE2b-256	`5bbf13bc1ca2d3afa5609e5a48432b80a7adf3efc1552031c86f4e2a9e617ee2`


File details

Details for the file knowlyr_datasynth-0.4.2-py3-none-any.whl.

File metadata

Download URL: knowlyr_datasynth-0.4.2-py3-none-any.whl
Upload date: Feb 18, 2026
Size: 33.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.10.2 {"installer":{"name":"uv","version":"0.10.2","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for knowlyr_datasynth-0.4.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6b25647141d845fef722e155ae004cae61e5e3f1dad08804d5d8907498b07bb3`
MD5	`0cbe8791fb666d50d5bc7abb081bfde5`
BLAKE2b-256	`4c2d0a0446a5ba1b50b1412a4c850f87efcfbfcb77bd379de0547a80496720a3`

