Synthetic Data Generation

SDG Hub: Synthetic Data Generation Toolkit

A modular, scalable, and efficient solution for creating synthetic data generation flows in a "low-code" manner.

Documentation | Examples | Video Tutorial

SDG Hub simplifies data creation for LLMs. Users chain computational units (blocks) into flows that generate and process data, and complex workflows can be defined using nothing but YAML configuration files.

📖 Full documentation available at: https://ai-innovation.team/sdg_hub


✨ Key Features

  • Low-Code Flow Creation: Build sophisticated data generation pipelines using simple YAML configuration files without writing any code.

  • Modular Block System: Compose workflows from reusable, self-contained blocks that handle LLM calls, data transformations, and filtering.

  • LLM-Agnostic: Works with any language model served behind an OpenAI-compatible endpoint, through configurable prompt templates and generation parameters.

  • Prompt Engineering Friendly: Tune LLM behavior by editing declarative YAML prompts.

🚀 Installation

Stable Release (Recommended)

pip install sdg-hub

Development Version

pip install git+https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub.git

🏁 Quick Start

Prerequisites

Before getting started, make sure you have:

  • Python 3.8 or higher
  • An LLM inference endpoint exposed through an OpenAI-compatible API
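Both prerequisites can be checked before running a flow. The sketch below is illustrative only: `python_ok` and `models_url` are hypothetical helpers, not part of sdg_hub, and it assumes the server follows the OpenAI API convention of listing models at `<base URL>/models`.

```python
import sys
from urllib.parse import urlsplit

MIN_PYTHON = (3, 8)  # minimum version stated in the prerequisites


def python_ok(version_info=None):
    """Return True if the interpreter meets the stated minimum version."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= MIN_PYTHON


def models_url(endpoint):
    """Build the model-listing URL for an OpenAI-compatible base URL.

    ASSUMPTION: the server exposes GET <base URL>/models, as the OpenAI API does.
    """
    parts = urlsplit(endpoint)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"not an HTTP(S) endpoint: {endpoint!r}")
    return endpoint.rstrip("/") + "/models"
```

A GET request to `models_url("http://0.0.0.0:8000/v1")`, for example with curl, should return a JSON list of served models if the endpoint is up.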

Simple Example

Here's the simplest way to get started:

from sdg_hub.flow_runner import run_flow

# Run a basic knowledge generation flow
run_flow(
    ds_path="my_data.jsonl",
    save_path="output.jsonl",
    endpoint="http://0.0.0.0:8000/v1",
    flow_path="flows/generation/knowledge/synth_knowledge.yaml"
)
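The input passed as `ds_path` is a JSONL file, that is, one JSON object per line. A minimal sketch for producing and inspecting such files with the standard library follows. The helpers are hypothetical, and the `document` field name is only an assumption for illustration; the columns a given flow actually expects are defined by that flow, so check its YAML before preparing data.

```python
import json
from pathlib import Path

# HYPOTHETICAL record layout: the "document" key is an assumption, not a
# documented sdg_hub column name.
SAMPLE_RECORDS = [
    {"document": "Photosynthesis converts light into chemical energy."},
    {"document": "Mitochondria produce most of a cell's ATP."},
]


def write_jsonl(records, path):
    """Write an iterable of dicts to `path`, one JSON object per line."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Calling `write_jsonl(SAMPLE_RECORDS, "my_data.jsonl")` would create the input file used in the example above, and `read_jsonl("output.jsonl")` can be used to inspect the result after the flow finishes.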

Advanced Configuration

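You can invoke any built-in flow with run_flow. In addition to the dataset, output, endpoint, and flow paths, the call below also sets checkpoint_dir, batch_size, num_workers, and save_freq: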

from sdg_hub.flow_runner import run_flow

run_flow(
    ds_path="path/to/dataset.jsonl",
    save_path="path/to/output.jsonl",
    endpoint="http://0.0.0.0:8000/v1",
    flow_path="path/to/flow.yaml",
    checkpoint_dir="path/to/checkpoints",
    batch_size=8,
    num_workers=32,
    save_freq=2,
)
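Because `run_flow` takes plain keyword arguments, running one dataset through several flows can be planned as a list of argument dictionaries. The sketch below is one possible way to do that, not an sdg_hub feature: `plan_runs` and `run_all` are hypothetical helpers, and only the four arguments from the simple example are used.

```python
from pathlib import Path


def plan_runs(ds_path, flow_paths, out_dir, endpoint):
    """Return one dict of run_flow keyword arguments per flow file."""
    out_dir = Path(out_dir)
    plans = []
    for flow_path in flow_paths:
        stem = Path(flow_path).stem  # file name without the .yaml suffix
        plans.append(
            {
                "ds_path": str(ds_path),
                "save_path": str(out_dir / f"{stem}.jsonl"),
                "endpoint": endpoint,
                "flow_path": str(flow_path),
            }
        )
    return plans


def run_all(plans):
    """Execute each planned run; needs sdg-hub installed and a live endpoint."""
    # Imported lazily so that planning works without the package installed.
    from sdg_hub.flow_runner import run_flow

    for kwargs in plans:
        run_flow(**kwargs)
```

Keeping planning separate from execution makes it possible to print and review the planned output paths before any LLM calls are made.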

📂 Available Built-in Flows

You can start with any of these YAML flows out of the box:

🔎 Knowledge Flows

| Flow | Description |
|------|-------------|
| synth_knowledge.yaml | Produces document-grounded questions and answers for factual memorization |
| synth_knowledge1.5.yaml | Improved version that builds intermediate representations for better recall |

🧠 Skills Flows

| Flow | Description |
|------|-------------|
| synth_skills.yaml | Freeform skills QA generation (e.g., "Create a new GitHub issue to add type hints") |
| synth_grounded_skills.yaml | Domain-specific skill generation (e.g., "From the given conversation create a table for feature requests") |
| improve_responses.yaml | Uses planning and critique-based refinement to improve generated answers |

All of these flows can be found in the flows directory of the repository.
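To see which flow files are available in a local copy, the directory can be scanned for YAML files. This is a small sketch using only the standard library; `list_flows` is a hypothetical helper, and it assumes you pass it the path of a flows directory from a checkout or installed copy. Note that it lists every `.yaml` file it finds, whether or not that file is a runnable flow.

```python
from pathlib import Path


def list_flows(flows_root):
    """Return all .yaml files under flows_root, as sorted relative POSIX paths."""
    root = Path(flows_root)
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*.yaml"))
```

Each returned path can be joined with the root directory and passed to `run_flow` as `flow_path`.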

📺 Video Tutorial

For a comprehensive walkthrough of sdg_hub:

SDG Hub Tutorial

🤝 Contributing

We welcome contributions from the community! Whether it's bug reports, feature requests, documentation improvements, or code contributions, please check out our contribution guidelines.

📄 License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.


Built with ❤️ by the Red Hat AI Innovation Team
