SDG Hub: Synthetic Data Generation Toolkit
A modular, scalable, and efficient solution for creating synthetic data generation flows in a "low-code" manner.
SDG Hub is designed to simplify data creation for LLMs. Users chain computational units, called blocks, into flows for data generation and processing tasks, and can define even complex workflows using nothing but YAML configuration files.
📖 Full documentation available at: https://ai-innovation.team/sdg_hub
✨ Key Features
- Low-Code Flow Creation: Build sophisticated data generation pipelines using simple YAML configuration files without writing any code.
- Modular Block System: Compose workflows from reusable, self-contained blocks that handle LLM calls, data transformations, and filtering.
- LLM-Agnostic: Works with any language model through configurable prompt templates and generation parameters.
- Prompt Engineering Friendly: Tune LLM behavior by editing declarative YAML prompts.
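To make the block-and-flow idea concrete, here is a conceptual sketch in plain Python. It is not SDG Hub's API: the `Block` type, `run_blocks`, the two example blocks, and the `document` and `prompt` field names are all hypothetical, made up for illustration. It only shows the general pattern of chaining self-contained steps, where each step takes a list of records and returns a new one.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
# Hypothetical: a "block" here is just a function from records to records.
Block = Callable[[List[Record]], List[Record]]


def drop_short(records: List[Record]) -> List[Record]:
    """Filtering block: keep only documents with at least three words."""
    return [r for r in records if len(r["document"].split()) >= 3]


def add_prompt(records: List[Record]) -> List[Record]:
    """Transformation block: derive a prompt string from each document."""
    return [{**r, "prompt": f"Write a question about: {r['document']}"} for r in records]


def run_blocks(records: List[Record], blocks: List[Block]) -> List[Record]:
    """Apply each block in order, feeding its output to the next one."""
    for block in blocks:
        records = block(records)
    return records


data = [{"document": "Paris is the capital of France"}, {"document": "Too short"}]
result = run_blocks(data, [drop_short, add_prompt])
print(result)
```

In SDG Hub itself the equivalent chain is declared in a YAML flow file rather than in Python; the sketch only mirrors the composition pattern.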
🚀 Installation
Stable Release (Recommended)
pip install sdg-hub
Development Version
pip install git+https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub.git
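After either install command, one way to confirm that the package is visible to the current interpreter is to ask the standard library for its installed version. This is a small sketch; the helper name `installed_version` is made up for illustration, and it assumes the distribution is registered under the name `sdg-hub` used in the pip command above.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = "sdg-hub") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
print(f"sdg-hub version: {version}" if version else "sdg-hub is not installed")
```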
🏁 Quick Start
Prerequisites
Before getting started, make sure you have:
- Python 3.8 or higher
- An LLM inference endpoint that exposes an OpenAI-compatible API
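Before running a flow, it can help to check that the inference endpoint is reachable. The sketch below queries the model-listing route of the OpenAI API (`GET /v1/models`) using only the standard library. The helper names are hypothetical, the URL is the one used in the examples in this README, and the sketch assumes the server does not require an API key.

```python
import json
import urllib.request
from typing import List


def models_url(endpoint: str) -> str:
    """Build the model-listing URL for an OpenAI-style base URL ending in /v1."""
    return endpoint.rstrip("/") + "/models"


def list_models(endpoint: str, timeout: float = 5.0) -> List[str]:
    """Return the model IDs the server reports; raises on connection errors."""
    with urllib.request.urlopen(models_url(endpoint), timeout=timeout) as resp:
        payload = json.load(resp)
    return [m["id"] for m in payload.get("data", [])]


if __name__ == "__main__":
    try:
        print(list_models("http://0.0.0.0:8000/v1"))
    except (OSError, ValueError) as exc:  # connection failure or non-JSON reply
        print(f"Endpoint check failed: {exc}")
```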
Simple Example
Here's the simplest way to get started:
from sdg_hub.flow_runner import run_flow
# Run a basic knowledge generation flow
run_flow(
    ds_path="my_data.jsonl",
    save_path="output.jsonl",
    endpoint="http://0.0.0.0:8000/v1",
    flow_path="flows/generation/knowledge/synth_knowledge.yaml"
)
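Once the flow has finished, a quick look at the result can confirm that something was written. The sketch below assumes the output is a JSON Lines file (one JSON object per line), as the `.jsonl` extension suggests. The field names in the output depend on the flow and are not documented here, so the code only counts records and lists whatever keys it finds. `read_jsonl` is a hypothetical helper, not part of sdg_hub.

```python
import json
from pathlib import Path
from typing import Dict, List


def read_jsonl(path: str) -> List[Dict]:
    """Load a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    out = Path("output.jsonl")
    if out.exists():
        rows = read_jsonl(str(out))
        print(f"{len(rows)} records")
        if rows:
            print("fields:", sorted(rows[0]))
    else:
        print("output.jsonl not found; run the flow first")
```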
Advanced Configuration
You can invoke any built-in flow using run_flow:
from sdg_hub.flow_runner import run_flow
run_flow(
    ds_path="path/to/dataset.jsonl",
    save_path="path/to/output.jsonl",
    endpoint="http://0.0.0.0:8000/v1",
    flow_path="path/to/flow.yaml",
    checkpoint_dir="path/to/checkpoints",
    batch_size=8,
    num_workers=32,
    save_freq=2,
)
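When the same call is run repeatedly with different settings, a thin command-line wrapper can save some editing. The following is a sketch only: the flag names and the `parse_flow_args` and `main` helpers are hypothetical, and the defaults simply copy the values from the example above rather than any documented library defaults. The keyword names passed to `run_flow` are the ones shown in this README.

```python
import argparse
from typing import Dict, List, Optional


def parse_flow_args(argv: Optional[List[str]] = None) -> Dict[str, object]:
    """Parse command-line flags into keyword arguments for run_flow."""
    p = argparse.ArgumentParser(description="Run an SDG Hub flow")
    p.add_argument("--ds-path", required=True)
    p.add_argument("--save-path", required=True)
    p.add_argument("--flow-path", required=True)
    p.add_argument("--endpoint", default="http://0.0.0.0:8000/v1")
    p.add_argument("--checkpoint-dir", default=None)
    p.add_argument("--batch-size", type=int, default=8)
    p.add_argument("--num-workers", type=int, default=32)
    p.add_argument("--save-freq", type=int, default=2)
    kwargs = vars(p.parse_args(argv))
    # Leave out checkpoint_dir when it was not given on the command line.
    return {k: v for k, v in kwargs.items() if v is not None}


def main(argv: Optional[List[str]] = None) -> None:
    """Parse the flags and hand them to run_flow (requires sdg-hub installed)."""
    from sdg_hub.flow_runner import run_flow  # imported lazily on purpose

    run_flow(**parse_flow_args(argv))


# Show the keyword arguments that would be passed, without running a flow.
print(parse_flow_args(["--ds-path", "in.jsonl", "--save-path", "out.jsonl",
                       "--flow-path", "path/to/flow.yaml"]))
```

Calling `main()` from a script's entry point would then run the flow with whatever flags were supplied on the command line.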
📂 Available Built-in Flows
You can start with any of these YAML flows out of the box:
🔎 Knowledge Flows
| Flow | Description |
|---|---|
| synth_knowledge.yaml | Produces document-grounded questions and answers for factual memorization |
| synth_knowledge1.5.yaml | Improved version that builds intermediate representations for better recall |
🧠 Skills Flows
| Flow | Description |
|---|---|
| synth_skills.yaml | Freeform skills QA generation (e.g., "Create a new github issue to add type hints") |
| synth_grounded_skills.yaml | Domain-specific skill generation (e.g., "From the given conversation create a table for feature requests") |
| improve_responses.yaml | Uses planning and critique-based refinement to improve generated answers |
All of these flows can be found in the flows directory.
🤝 Contributing
We welcome contributions from the community! Whether it's bug reports, feature requests, documentation improvements, or code contributions, please check out our contribution guidelines.
📄 License
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
Built with ❤️ by the Red Hat AI Innovation Team