Synthetic Data Generation
sdg_hub: Synthetic Data Generation Toolkit
A modular Python framework for building synthetic data generation pipelines from composable blocks and flows. Transform datasets by composing building blocks: mix and match LLM-powered and traditional processing blocks to create sophisticated data generation workflows.
📖 Full documentation available at: https://ai-innovation.team/sdg_hub
✨ Key Features
🔧 Modular Composability - Mix and match blocks like Lego pieces. Build simple transformations or complex multi-stage pipelines with YAML-configured flows.
⚡ Async Performance - High-throughput LLM processing with built-in error handling.
🛡️ Built-in Validation - Pydantic-based type safety ensures your configurations and data are correct before execution.
🔍 Auto-Discovery - Automatic block and flow registration. No manual imports or complex setup.
📊 Rich Monitoring - Detailed logging with progress bars and execution summaries.
🧩 Easily Extensible - Create custom blocks with simple inheritance. Custom blocks get the same logging and monitoring as the built-in ones.
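To make the "Async Performance" point concrete, here is a minimal sketch of bounded-concurrency processing with per-item error handling, using only the standard library. It is not sdg_hub's implementation: `fake_llm_call`, `run_all`, and `max_concurrency` are hypothetical names chosen for illustration, and the "LLM" is a stand-in coroutine.

```python
import asyncio


async def fake_llm_call(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM request."""
    await asyncio.sleep(0)  # yield control, as a network call would
    if not prompt:
        raise ValueError("empty prompt")
    return prompt.upper()


async def run_all(prompts, max_concurrency=4):
    """Process prompts concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(prompt):
        async with semaphore:
            try:
                return await fake_llm_call(prompt)
            except ValueError as exc:
                # Record the failure instead of aborting the whole batch.
                return f"error: {exc}"

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(p) for p in prompts))


results = asyncio.run(run_all(["hello", "", "world"]))
print(results)
```

One failing prompt does not stop the others, which is the general idea behind batch-level error handling in this kind of pipeline.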
📦 Installation
Recommended: Install uv — see https://docs.astral.sh/uv/getting-started/installation/
# Production
uv pip install sdg-hub
# Development
git clone https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub.git
cd sdg_hub
uv pip install ".[dev]"
# or: uv sync --extra dev
Optional Dependencies
# For vLLM support
uv pip install "sdg-hub[vllm]"
# For examples
uv pip install "sdg-hub[examples]"
🚀 Quick Start
🧱 Core Concepts
Blocks are composable units that transform datasets - think of them as data processing Lego pieces. Each block performs a specific task: LLM chat, text parsing, evaluation, or transformation.
Flows orchestrate multiple blocks into complete pipelines defined in YAML. Chain blocks together to create complex data generation workflows with validation and parameter management.
# Simple concept: Blocks transform data, Flows chain blocks together
dataset → Block₁ → Block₂ → Block₃ → enriched_dataset
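The chaining idea above can be sketched in plain Python. This is only a conceptual illustration and does not use the sdg_hub API: `add_length`, `drop_short`, and `run_flow` are hypothetical helpers, and the "dataset" is a simple list of dictionaries.

```python
from typing import Callable

Row = dict
Block = Callable[[list[Row]], list[Row]]


def add_length(rows: list[Row]) -> list[Row]:
    """Block 1 (hypothetical): add a 'length' column to every row."""
    return [{**row, "length": len(row["document"])} for row in rows]


def drop_short(rows: list[Row]) -> list[Row]:
    """Block 2 (hypothetical): keep rows whose document has 10+ characters."""
    return [row for row in rows if row["length"] >= 10]


def run_flow(rows: list[Row], blocks: list[Block]) -> list[Row]:
    """A 'flow' here is just an ordered list of blocks applied in sequence."""
    for block in blocks:
        rows = block(rows)
    return rows


dataset = [{"document": "short"}, {"document": "a longer document"}]
enriched = run_flow(dataset, [add_length, drop_short])
print(enriched)
```

Each block takes a dataset and returns a new one, so blocks can be reordered or swapped without touching the others. In sdg_hub itself the chain is declared in YAML rather than as a Python list.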
Try it out!
Flow Discovery
from sdg_hub import FlowRegistry
# Auto-discover all available flows (no setup needed!)
FlowRegistry.discover_flows()
# List available flows
flows = FlowRegistry.list_flows()
print(f"Available flows: {flows}")
# Search for specific types
qa_flows = FlowRegistry.search_flows(tag="question-generation")
print(f"QA flows: {qa_flows}")
Using Flows
from sdg_hub import FlowRegistry, Flow
from datasets import Dataset
# Load the flow by name
flow_name = "Advanced Document Grounded Question-Answer Generation Flow for Knowledge Tuning"
flow_path = FlowRegistry.get_flow_path(flow_name)
flow = Flow.from_yaml(flow_path)
# Discover recommended models
default_model = flow.get_default_model()
recommendations = flow.get_model_recommendations()
# Configure model settings at runtime
# This assumes you have a hosted vLLM instance of meta-llama/Llama-3.3-70B-Instruct running at http://localhost:8000/v1
flow.set_model_config(
    model=f"hosted_vllm/{default_model}",
    api_base="http://localhost:8000/v1",
    api_key="your_key",
)
# Create your dataset with required columns
dataset = Dataset.from_dict({
    'document': ['Your document text here...'],
    'document_outline': ['1. Topic A; 2. Topic B; 3. Topic C'],
    'domain': ['Computer Science'],
    'icl_document': ['Example document for in-context learning...'],
    'icl_query_1': ['Example question 1?'],
    'icl_response_1': ['Example answer 1'],
    'icl_query_2': ['Example question 2?'],
    'icl_response_2': ['Example answer 2'],
    'icl_query_3': ['Example question 3?'],
    'icl_response_3': ['Example answer 3']
})
# Generate high-quality QA pairs
result = flow.generate(dataset)
# Access generated content
questions = result['question']
answers = result['response']
faithfulness_scores = result['faithfulness_judgment']
relevancy_scores = result['relevancy_score']
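Because the flow above expects a specific set of input columns, it can help to check the input before building the dataset. The following is a minimal sketch in plain Python: `REQUIRED_COLUMNS` simply mirrors the columns used in the example above, and `find_missing_columns` is a hypothetical helper, not part of sdg_hub.

```python
# Column names copied from the example dataset above.
REQUIRED_COLUMNS = [
    "document", "document_outline", "domain", "icl_document",
    "icl_query_1", "icl_response_1",
    "icl_query_2", "icl_response_2",
    "icl_query_3", "icl_response_3",
]


def find_missing_columns(data: dict, required: list) -> list:
    """Return required columns absent from a dict-of-lists input.

    Raises ValueError if the columns do not all have the same number of rows.
    """
    lengths = {len(values) for values in data.values()}
    if len(lengths) > 1:
        raise ValueError(f"columns have differing lengths: {sorted(lengths)}")
    return [name for name in required if name not in data]


raw = {
    "document": ["Your document text here..."],
    "domain": ["Computer Science"],
}
missing = find_missing_columns(raw, REQUIRED_COLUMNS)
print(f"Missing columns: {missing}")
```

If `missing` is empty, the same dictionary can be passed to `Dataset.from_dict` as shown earlier.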
Quick Testing with Dry Run
# Test the flow with a small sample first
dry_result = flow.dry_run(dataset, sample_size=1)
print(f"Dry run completed in {dry_result['execution_time_seconds']:.2f}s")
print(f"Output columns: {dry_result['final_dataset']['columns']}")
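One way to use the dry-run result is as a gate before the full run. The sketch below assumes the result is a dictionary with the two keys shown above (`execution_time_seconds` and `final_dataset['columns']`) and that `columns` is a list of names. `summarize_dry_run` is a hypothetical helper, and the example dictionary is made up for illustration rather than real output.

```python
def summarize_dry_run(dry_result: dict, expected_columns: list) -> dict:
    """Report which expected output columns a dry run did not produce."""
    produced = set(dry_result["final_dataset"]["columns"])
    missing = sorted(set(expected_columns) - produced)
    return {
        "seconds": dry_result["execution_time_seconds"],
        "missing": missing,
        "ok": not missing,
    }


# Made-up dry-run result, shaped like the keys used above.
example_result = {
    "execution_time_seconds": 1.5,
    "final_dataset": {"columns": ["question", "response"]},
}
summary = summarize_dry_run(
    example_result,
    ["question", "response", "faithfulness_judgment", "relevancy_score"],
)
print(summary)
```

If `summary["ok"]` is false, it is probably worth inspecting the flow configuration before spending time on `flow.generate(dataset)`.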
📄 License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
🤝 Contributing
We welcome contributions! Please see CONTRIBUTING.md for guidelines on how to contribute to this project.
Built with ❤️ by the Red Hat AI Innovation Team