Synthetic Data Generation

Maintainers

abhi1092 meyceoz shiver

`sdg_hub`: Synthetic Data Generation Toolkit

A modular Python framework for building synthetic data generation pipelines from composable blocks and flows. Mix and match LLM-powered and traditional processing blocks to build multi-stage data generation workflows.

📖 Full documentation available at: https://ai-innovation.team/sdg_hub

✨ Key Features

🔧 Modular Composability - Mix and match blocks like Lego pieces. Build simple transformations or complex multi-stage pipelines with YAML-configured flows.

⚡ Async Performance - High-throughput LLM processing with built-in error handling.

🛡️ Built-in Validation - Pydantic-based type safety ensures your configurations and data are correct before execution.

🔍 Auto-Discovery - Automatic block and flow registration. No manual imports or complex setup.

📊 Rich Monitoring - Detailed logging with progress bars and execution summaries.

📋 Dataset Schema Discovery - Instantly discover required data formats. Get empty datasets with correct schema for easy validation and data preparation.

🧩 Easily Extensible - Create custom blocks by subclassing an existing block. Logging and monitoring come built in.

📦 Installation

Recommended: install uv first (see https://docs.astral.sh/uv/getting-started/installation/).

# Production
uv pip install sdg-hub

# Development
git clone https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub.git
cd sdg_hub
uv pip install ".[dev]"  # quotes keep shells such as zsh from expanding the brackets
# or: uv sync --extra dev

Optional Dependencies

# For vLLM support
uv pip install "sdg-hub[vllm]"

# For examples
uv pip install "sdg-hub[examples]"

🚀 Quick Start

Core Concepts

Blocks are composable units that transform datasets - think of them as data processing Lego pieces. Each block performs a specific task: LLM chat, text parsing, evaluation, or transformation.

Flows orchestrate multiple blocks into complete pipelines defined in YAML. Chain blocks together to create complex data generation workflows with validation and parameter management.

# Simple concept: Blocks transform data, Flows chain blocks together
dataset → Block₁ → Block₂ → Block₃ → enriched_dataset
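
The block-and-flow idea can be illustrated without the library. The following is a minimal pure-Python sketch, not the sdg_hub API: the `Block` alias, the `run_flow` helper, and both example blocks are hypothetical names chosen for illustration, and rows are plain dictionaries rather than a real dataset object.

```python
from typing import Callable

# A "dataset" here is just a list of row dictionaries (an assumption for this sketch).
Row = dict[str, str]
Block = Callable[[list[Row]], list[Row]]


def add_word_count(rows: list[Row]) -> list[Row]:
    """Hypothetical block: add a word_count column derived from 'document'."""
    return [{**row, "word_count": str(len(row["document"].split()))} for row in rows]


def drop_short_documents(rows: list[Row]) -> list[Row]:
    """Hypothetical block: keep rows whose document has at least three words."""
    return [row for row in rows if int(row["word_count"]) >= 3]


def run_flow(rows: list[Row], blocks: list[Block]) -> list[Row]:
    """Apply each block in order, feeding one block's output to the next."""
    for block in blocks:
        rows = block(rows)
    return rows


dataset = [
    {"document": "Blocks transform data"},
    {"document": "Too short"},
]
enriched = run_flow(dataset, [add_word_count, drop_short_documents])
print(enriched)
```

Each block takes a dataset and returns a new one, so blocks can be reordered or swapped without touching the others. That property is what makes the chain composable.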

Try it out!

Flow Discovery

from sdg_hub import FlowRegistry, Flow

# Auto-discover all available flows (no setup needed!)
FlowRegistry.discover_flows()

# List available flows
flows = FlowRegistry.list_flows()
print(f"Available flows: {flows}")

# Search for specific types
qa_flows = FlowRegistry.search_flows(tag="question-generation")
print(f"QA flows: {qa_flows}")

Each flow has a unique, human-readable ID automatically generated from its name. These IDs provide a convenient shorthand for referencing flows:

# Every flow gets a deterministic ID:
# the same flow name always generates the same ID
flow_id = "small-rock-799"  # example ID

# Use ID to reference the flow
flow_path = FlowRegistry.get_flow_path(flow_id)
flow = Flow.from_yaml(flow_path)
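
To make "deterministic" concrete, here is one way a stable, human-readable ID could be derived from a name. This is a sketch under stated assumptions and not sdg_hub's actual algorithm: the word lists, the `readable_id` helper, and the use of SHA-256 are all hypothetical choices for illustration.

```python
import hashlib

# Hypothetical word lists; sdg_hub's actual vocabulary and algorithm may differ.
ADJECTIVES = ["small", "bright", "quiet", "green"]
NOUNS = ["rock", "river", "cloud", "field"]


def readable_id(flow_name: str) -> str:
    """Derive a stable adjective-noun-number ID from a flow name."""
    digest = hashlib.sha256(flow_name.encode("utf-8")).digest()
    adjective = ADJECTIVES[digest[0] % len(ADJECTIVES)]
    noun = NOUNS[digest[1] % len(NOUNS)]
    number = int.from_bytes(digest[2:4], "big") % 1000
    return f"{adjective}-{noun}-{number}"


first = readable_id("My QA Generation Flow")
second = readable_id("My QA Generation Flow")
print(first, first == second)
```

Because the ID depends only on a hash of the name, no registry state is needed to reproduce it. The trade-off is that renaming a flow changes its ID.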

Discovering and Configuring Models

# Discover recommended models
default_model = flow.get_default_model()
recommendations = flow.get_model_recommendations()

# Configure model settings at runtime
# This assumes you have a hosted vLLM instance of meta-llama/Llama-3.3-70B-Instruct running at http://localhost:8000/v1
flow.set_model_config(
    model=f"hosted_vllm/{default_model}",
    api_base="http://localhost:8000/v1",
    api_key="your_key",
)
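
Rather than hard-coding the endpoint and key, the same three settings can be read from the environment. The sketch below only builds a dictionary; the `build_model_config` helper and the `VLLM_API_BASE` and `VLLM_API_KEY` variable names are hypothetical, not something sdg_hub defines.

```python
import os


def build_model_config(default_model: str) -> dict[str, str]:
    """Assemble keyword arguments for a hosted vLLM endpoint from the environment."""
    return {
        "model": f"hosted_vllm/{default_model}",
        # VLLM_API_BASE and VLLM_API_KEY are hypothetical variable names.
        "api_base": os.environ.get("VLLM_API_BASE", "http://localhost:8000/v1"),
        "api_key": os.environ.get("VLLM_API_KEY", "your_key"),
    }


config = build_model_config("meta-llama/Llama-3.3-70B-Instruct")
print(config["model"])
```

The resulting dictionary carries the same keyword names used in the call above, so it could be passed along as `flow.set_model_config(**config)`.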

Discover dataset requirements and create your dataset

# First, discover what data the flow needs:
# get an empty dataset with the exact schema required
schema_dataset = flow.get_dataset_schema()
print(f"Required columns: {schema_dataset.column_names}")
print(f"Schema: {schema_dataset.features}")

# Option 1: Add data directly to the schema dataset
dataset = schema_dataset.add_item({
    'document': 'Your document text here...',
    'document_outline': '1. Topic A; 2. Topic B; 3. Topic C',
    'domain': 'Computer Science',
    'icl_document': 'Example document for in-context learning...',
    'icl_query_1': 'Example question 1?',
    'icl_response_1': 'Example answer 1',
    'icl_query_2': 'Example question 2?', 
    'icl_response_2': 'Example answer 2',
    'icl_query_3': 'Example question 3?',
    'icl_response_3': 'Example answer 3'
})

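# Option 2: Create your own dataset and validate the schema
from datasets import Dataset  # Hugging Face datasets

# my_data_dict is a placeholder: a dict mapping column names to lists of values
my_dataset = Dataset.from_dict(my_data_dict)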
if my_dataset.features == schema_dataset.features:
    print("✅ Schema matches - ready to generate!")
    dataset = my_dataset
else:
    print("❌ Schema mismatch - check your columns")

# Option 3: Get raw requirements for detailed inspection
requirements = flow.get_dataset_requirements()
if requirements:
    print(f"Required: {requirements.required_columns}")
    print(f"Optional: {requirements.optional_columns}")
    print(f"Min samples: {requirements.min_samples}")
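
A plain equality check on features only says whether the schemas match, not what is wrong. A column-level comparison can make a mismatch easier to fix. This is a minimal sketch using plain lists; the `compare_columns` helper is a hypothetical name, not part of sdg_hub.

```python
def compare_columns(actual: list[str], required: list[str]) -> dict[str, list[str]]:
    """Report which required columns are missing and which extra columns are present."""
    actual_set, required_set = set(actual), set(required)
    return {
        "missing": sorted(required_set - actual_set),
        "extra": sorted(actual_set - required_set),
    }


# A subset of the column names from the example above; my_columns is deliberately incomplete.
required_columns = ["document", "document_outline", "domain", "icl_document"]
my_columns = ["document", "domain", "notes"]

report = compare_columns(my_columns, required_columns)
print(report)
```

With real datasets, the two arguments would presumably be `my_dataset.column_names` and `schema_dataset.column_names`.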

Dry Run and Generate

# Quick Testing with Dry Run
dry_result = flow.dry_run(dataset, sample_size=1)
print(f"Dry run completed in {dry_result['execution_time_seconds']:.2f}s")
print(f"Output columns: {dry_result['final_dataset']['columns']}")

# Generate high-quality QA pairs
result = flow.generate(dataset)

# Access generated content
questions = result['question']
answers = result['response']
faithfulness_scores = result['faithfulness_judgment']
relevancy_scores = result['relevancy_score']
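
Once the columns are available, a common next step is to keep only the rows that pass a quality threshold. The sketch below assumes the columns are parallel lists and that `relevancy_score` holds numeric strings; both are assumptions, since the actual value format depends on the flow. The `keep_relevant` helper and the sample values are made up for illustration.

```python
def keep_relevant(columns: dict[str, list], min_relevancy: float) -> list[dict]:
    """Zip parallel columns into rows and keep rows at or above the threshold."""
    names = list(columns)
    rows = [dict(zip(names, values)) for values in zip(*columns.values())]
    return [row for row in rows if float(row["relevancy_score"]) >= min_relevancy]


# Made-up values standing in for real generation output.
sample_result = {
    "question": ["What is a block?", "What is a flow?"],
    "response": ["A unit that transforms a dataset.", "A chain of blocks."],
    "relevancy_score": ["2", "1"],
}

kept = keep_relevant(sample_result, min_relevancy=2.0)
print([row["question"] for row in kept])
```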

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines on how to contribute to this project.

Built with ❤️ by the Red Hat AI Innovation Team

Release history

- 0.9.3: May 9, 2026
- 0.9.2: Apr 27, 2026
- 0.9.1: Apr 11, 2026
- 0.9.0: Mar 25, 2026
- 0.8.8: Mar 13, 2026
- 0.8.7: Mar 5, 2026
- 0.8.6: Feb 23, 2026
- 0.8.5: Feb 20, 2026
- 0.8.4: Feb 19, 2026
- 0.8.3: Feb 17, 2026
- 0.8.2: Feb 13, 2026
- 0.8.1: Feb 12, 2026
- 0.8.0: Feb 4, 2026
- 0.7.3: Jan 16, 2026
- 0.7.2: Dec 18, 2025
- 0.7.1: Dec 2, 2025
- 0.7.0: Dec 1, 2025
- 0.6.1: Nov 21, 2025
- 0.6.0: Oct 18, 2025
- 0.5.1: Oct 17, 2025
- 0.5.0: Oct 10, 2025
- 0.4.2: Oct 7, 2025
- 0.4.1: Oct 3, 2025
- 0.4.0: Sep 30, 2025
- 0.3.1: Sep 23, 2025
- 0.3.0: Sep 18, 2025
- 0.2.2 (this version): Aug 29, 2025
- 0.2.1: Aug 15, 2025
- 0.2.0: Aug 8, 2025
- 0.1.4: Jul 11, 2025
- 0.1.3: Jul 6, 2025
- 0.1.2: Jun 27, 2025
- 0.1.1: Jun 21, 2025
- 0.1.0: Jun 14, 2025
- 0.1.0a4 (pre-release): May 6, 2025
- 0.1.0a3 (pre-release): Apr 18, 2025
- 0.1.0a2 (pre-release): Apr 15, 2025
- 0.1.0a1 (pre-release): Apr 14, 2025

Download files

Source Distribution

sdg_hub-0.2.2.tar.gz (4.7 MB), uploaded Aug 29, 2025, Source

Built Distribution

sdg_hub-0.2.2-py3-none-any.whl (128.2 kB), uploaded Aug 29, 2025, Python 3

File details

Details for the file sdg_hub-0.2.2.tar.gz.

File metadata

Download URL: sdg_hub-0.2.2.tar.gz
Upload date: Aug 29, 2025
Size: 4.7 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for sdg_hub-0.2.2.tar.gz
Algorithm	Hash digest
SHA256	`de332e3dfebb668a9ecf6d47509f361294a28bc8c48c95df620e3f167cbdf6ca`
MD5	`e908537f04c3fb665bd36c0e23fba0cb`
BLAKE2b-256	`285fd40f222e154de49364d4f969dd93e259060e36a25b058742898e4e8c6f66`
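
A downloaded archive can be checked against the SHA256 digest listed above using only the standard library. This is a minimal sketch: the helper names and the assumed download location (`sdg_hub-0.2.2.tar.gz` in the current directory) are illustrative choices.

```python
import hashlib
from pathlib import Path

# SHA256 digest published above for sdg_hub-0.2.2.tar.gz.
EXPECTED_SHA256 = "de332e3dfebb668a9ecf6d47509f361294a28bc8c48c95df620e3f167cbdf6ca"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives are not loaded into memory at once."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches_published_hash(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """Return True when the file's SHA256 digest equals the expected value."""
    return sha256_of(path) == expected.lower()


archive = Path("sdg_hub-0.2.2.tar.gz")  # assumed download location
if archive.exists():
    print("hash OK" if matches_published_hash(archive) else "hash MISMATCH")
else:
    print(f"{archive} not found; download it first")
```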

Provenance

The following attestation bundles were made for sdg_hub-0.2.2.tar.gz:

Publisher: pypi.yaml on Red-Hat-AI-Innovation-Team/sdg_hub

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: sdg_hub-0.2.2.tar.gz
- Subject digest: de332e3dfebb668a9ecf6d47509f361294a28bc8c48c95df620e3f167cbdf6ca
- Sigstore transparency entry: 450227895
- Sigstore integration time: Aug 29, 2025
Source repository:
- Permalink: Red-Hat-AI-Innovation-Team/sdg_hub@aaaa1ea331be529ff7634de0b622d3c351b937c9
- Branch / Tag: refs/tags/v0.2.2
- Owner: https://github.com/Red-Hat-AI-Innovation-Team
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi.yaml@aaaa1ea331be529ff7634de0b622d3c351b937c9
- Trigger Event: release

File details

Details for the file sdg_hub-0.2.2-py3-none-any.whl.

File metadata

Download URL: sdg_hub-0.2.2-py3-none-any.whl
Upload date: Aug 29, 2025
Size: 128.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for sdg_hub-0.2.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2e0f1053097edb703c739b69dbaf863ce69493337d3b03e9613755c0695b8c0a`
MD5	`0842f60591b5cfd71698eb173a60b8cc`
BLAKE2b-256	`6233961325e095952e36f75ed63080c7fe6783e78f582d2aeac1d71a44490849`

Provenance

The following attestation bundles were made for sdg_hub-0.2.2-py3-none-any.whl:

Publisher: pypi.yaml on Red-Hat-AI-Innovation-Team/sdg_hub

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: sdg_hub-0.2.2-py3-none-any.whl
- Subject digest: 2e0f1053097edb703c739b69dbaf863ce69493337d3b03e9613755c0695b8c0a
- Sigstore transparency entry: 450227909
- Sigstore integration time: Aug 29, 2025
Source repository:
- Permalink: Red-Hat-AI-Innovation-Team/sdg_hub@aaaa1ea331be529ff7634de0b622d3c351b937c9
- Branch / Tag: refs/tags/v0.2.2
- Owner: https://github.com/Red-Hat-AI-Innovation-Team
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi.yaml@aaaa1ea331be529ff7634de0b622d3c351b937c9
- Trigger Event: release
