
Synthetic Data Generation


SDG Hub: Synthetic Data Generation Toolkit


A modular, scalable, and efficient solution for creating synthetic data generation flows in a "low-code" manner.

Documentation | Examples | Video Tutorial

SDG Hub simplifies data creation for LLMs. Users chain computational units (blocks) into flows that generate and process data, and even complex workflows are defined in YAML configuration files.

📖 Full documentation available at: https://ai-innovation.team/sdg_hub


✨ Key Features

  • Low-Code Flow Creation: Build sophisticated data generation pipelines using simple YAML configuration files without writing any code.

  • Modular Block System: Compose workflows from reusable, self-contained blocks that handle LLM calls, data transformations, and filtering.

  • LLM-Agnostic: Works with any language model through configurable prompt templates and generation parameters.

  • Prompt Engineering Friendly: Tune LLM behavior by editing declarative YAML prompts.
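
The block idea above can be pictured with a small conceptual sketch. This is not SDG Hub's API: the function names (add_prompt, stub_llm, drop_short, run_blocks) and the "document" column are hypothetical, and the LLM step is a stub. It only illustrates how self-contained blocks for transformation, generation, and filtering might be chained over a list of records.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]
Block = Callable[[List[Row]], List[Row]]


def drop_short(rows: List[Row]) -> List[Row]:
    """Filtering block: keep rows whose document has at least three words."""
    return [row for row in rows if len(row["document"].split()) >= 3]


def add_prompt(rows: List[Row]) -> List[Row]:
    """Transformation block: add a prompt column built from each document."""
    return [
        {**row, "prompt": f"Write a question about: {row['document']}"}
        for row in rows
    ]


def stub_llm(rows: List[Row]) -> List[Row]:
    """Stand-in for an LLM block; a real block would call a model endpoint."""
    return [{**row, "question": f"[generated from] {row['prompt']}"} for row in rows]


def run_blocks(rows: List[Row], blocks: List[Block]) -> List[Row]:
    """Apply each block in order, feeding its output to the next block."""
    for block in blocks:
        rows = block(rows)
    return rows
```

In SDG Hub itself the chain is declared in a YAML flow file rather than in Python; see the full documentation for the actual flow schema.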

🚀 Installation

Stable Release (Recommended)

pip install sdg-hub

Development Version

pip install git+https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub.git

🏁 Quick Start

Prerequisites

Before getting started, make sure you have:

  • Python 3.8 or higher
  • An LLM inference endpoint exposed through an OpenAI-compatible API
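
To confirm the second prerequisite before running a flow, one option is to query the endpoint's model listing. The sketch below assumes the server implements the OpenAI-style GET /models route under its base URL and needs no API key; the helper names models_url and list_models are made up for this example. A server that requires authentication would also need an Authorization header.

```python
import json
import urllib.request


def models_url(endpoint: str) -> str:
    """Return the model-listing URL for an OpenAI-compatible base URL."""
    return endpoint.rstrip("/") + "/models"


def list_models(endpoint: str, timeout: float = 5.0) -> list:
    """Query the endpoint and return the model IDs it reports.

    Raises urllib.error.URLError if the endpoint is unreachable.
    """
    with urllib.request.urlopen(models_url(endpoint), timeout=timeout) as resp:
        payload = json.load(resp)
    # OpenAI-style responses list the models under the "data" key.
    return [model["id"] for model in payload.get("data", [])]
```

Calling list_models("http://0.0.0.0:8000/v1") against a running server should return the IDs of the models it serves.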

Simple Example

Here's the simplest way to get started:

from sdg_hub.flow_runner import run_flow

# Run a basic knowledge generation flow
run_flow(
    ds_path="my_data.jsonl",
    save_path="output.jsonl",
    endpoint="http://0.0.0.0:8000/v1",
    flow_path="flows/generation/knowledge/synth_knowledge.yaml"
)
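
The example above reads and writes JSONL files (one JSON object per line). A minimal sketch for preparing an input file and inspecting the output follows. The helper names are hypothetical, and so is any column name you pass in: the columns a given flow expects are defined by that flow, so check its documentation before building a dataset.

```python
import json
from pathlib import Path


def write_jsonl(path, records):
    """Write an iterable of dicts to a file, one JSON object per line."""
    with Path(path).open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

For instance, write_jsonl("my_data.jsonl", rows) could create the input, and read_jsonl("output.jsonl") could load the generated records once run_flow has finished.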

Advanced Configuration

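You can invoke any built-in flow using run_flow, which also accepts optional arguments for checkpointing (checkpoint_dir, save_freq), batching (batch_size), and parallelism (num_workers):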

from sdg_hub.flow_runner import run_flow

run_flow(
    ds_path="path/to/dataset.jsonl",
    save_path="path/to/output.jsonl",
    endpoint="http://0.0.0.0:8000/v1",
    flow_path="path/to/flow.yaml",
    checkpoint_dir="path/to/checkpoints",
    batch_size=8,
    num_workers=32,
    save_freq=2,
)
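
If you run flows from a shell, the call above can be wrapped in a small script. The sketch below is illustrative only: it assumes run_flow accepts exactly the keyword arguments shown above, the defaults simply mirror that example, and build_parser, flow_kwargs, and main are hypothetical names. Check the documentation for an official command-line interface before relying on it.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose options mirror the run_flow keyword arguments."""
    parser = argparse.ArgumentParser(description="Run an SDG Hub flow.")
    parser.add_argument("--ds-path", required=True)
    parser.add_argument("--save-path", required=True)
    parser.add_argument("--flow-path", required=True)
    parser.add_argument("--endpoint", default="http://0.0.0.0:8000/v1")
    parser.add_argument("--checkpoint-dir", default=None)
    parser.add_argument("--batch-size", type=int, default=8)
    parser.add_argument("--num-workers", type=int, default=32)
    parser.add_argument("--save-freq", type=int, default=2)
    return parser


def flow_kwargs(args: argparse.Namespace) -> dict:
    """Turn parsed arguments into keyword arguments, dropping unset options."""
    return {key: value for key, value in vars(args).items() if value is not None}


def main(argv=None) -> None:
    args = build_parser().parse_args(argv)
    # Imported lazily so the parser can be exercised without sdg_hub installed.
    from sdg_hub.flow_runner import run_flow

    run_flow(**flow_kwargs(args))
```

Calling main() from a script's entry point would then let you pass --ds-path, --save-path, and --flow-path on the command line.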

📂 Available Built-in Flows

You can start with any of these YAML flows out of the box:

🔎 Knowledge Flows

Flow | Description
synth_knowledge.yaml | Produces document-grounded questions and answers for factual memorization
synth_knowledge1.5.yaml | Improved version that builds intermediate representations for better recall

🧠 Skills Flows

Flow | Description
synth_skills.yaml | Freeform skills QA generation (e.g., "Create a new GitHub issue to add type hints")
synth_grounded_skills.yaml | Domain-specific skill generation (e.g., "From the given conversation, create a table for feature requests")
improve_responses.yaml | Uses planning and critique-based refinement to improve generated answers

All of these flows can be found in the flows directory.

📺 Video Tutorial

For a comprehensive walkthrough of sdg_hub:

SDG Hub Tutorial

🤝 Contributing

We welcome contributions from the community! Whether it's bug reports, feature requests, documentation improvements, or code contributions, please check out our contribution guidelines.

📄 License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.


Built with ❤️ by the Red Hat AI Innovation Team
