Synthetic Data Generation

SDG Hub: Synthetic Data Generation Toolkit

A modular, scalable, and efficient solution for creating synthetic data generation flows in a "low-code" manner.

SDG Hub simplifies data creation for LLMs. Users chain computational units (blocks) into flows that generate and process data, and complex workflows are defined entirely in YAML configuration files.

📖 Full documentation available at: https://ai-innovation.team/sdg_hub


✨ Key Features

  • Low-Code Flow Creation: Build sophisticated data generation pipelines using simple YAML configuration files without writing any code.

  • Modular Block System: Compose workflows from reusable, self-contained blocks that handle LLM calls, data transformations, and filtering.

  • LLM-Agnostic: Works with any language model served behind an OpenAI-compatible endpoint, through configurable prompt templates and generation parameters.

  • Prompt Engineering Friendly: Tune LLM behavior by editing declarative YAML prompts.
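
To make the block-and-flow idea concrete, here is a minimal, library-independent sketch. It does not use the sdg_hub API: the names `Block`, `run_blocks`, `add_prompt`, and `drop_short` are hypothetical, and a real flow would declare the equivalent steps in YAML rather than in Python.

```python
from typing import Callable, Dict, List

# A "sample" is a plain dict; a "block" maps a list of samples to a new list.
# These names are illustrative only and are not part of sdg_hub.
Sample = Dict[str, str]
Block = Callable[[List[Sample]], List[Sample]]


def add_prompt(samples: List[Sample]) -> List[Sample]:
    """Transformation block: derive a prompt field from each document."""
    return [
        {**s, "prompt": f"Write a question about: {s['document']}"}
        for s in samples
    ]


def drop_short(samples: List[Sample]) -> List[Sample]:
    """Filtering block: keep samples whose document has at least 3 words."""
    return [s for s in samples if len(s["document"].split()) >= 3]


def run_blocks(blocks: List[Block], samples: List[Sample]) -> List[Sample]:
    """Apply each block in order, feeding its output to the next block."""
    for block in blocks:
        samples = block(samples)
    return samples


data = [{"document": "Water boils at 100 C"}, {"document": "Too short"}]
result = run_blocks([drop_short, add_prompt], data)
```

Because every block shares the same input and output shape, blocks can be reordered or reused across flows without changing their code.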

🚀 Installation

Stable Release (Recommended)

pip install sdg-hub

Development Version

pip install git+https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub.git

🏁 Quick Start

Prerequisites

Before getting started, make sure you have:

  • Python 3.8 or higher
  • An LLM inference endpoint that exposes an OpenAI-compatible API
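
Both prerequisites can be checked before running a flow. The sketch below uses only the standard library and assumes the endpoint implements the OpenAI model-listing route (`GET <base>/models`). The helper names are hypothetical and not part of sdg_hub, and an endpoint that requires an API key will be reported as unreachable by this simple probe.

```python
import sys
import urllib.error
import urllib.request


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True when the interpreter meets the Python 3.8+ requirement."""
    return tuple(version_info[:2]) >= (3, 8)


def models_url(endpoint: str) -> str:
    """Build the model-listing URL for an OpenAI-compatible base URL."""
    return endpoint.rstrip("/") + "/models"


def endpoint_is_reachable(endpoint: str, timeout: float = 5.0) -> bool:
    """Return True if GET <endpoint>/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(models_url(endpoint), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Covers refused connections, timeouts, HTTP errors, and malformed URLs.
        return False
```

For example, `endpoint_is_reachable("http://0.0.0.0:8000/v1")` probes the endpoint address used in the examples in this README.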

Simple Example

Here's the simplest way to get started:

from sdg_hub.flow_runner import run_flow

# Run a basic knowledge generation flow
run_flow(
    ds_path="my_data.jsonl",
    save_path="output.jsonl",
    endpoint="http://0.0.0.0:8000/v1",
    flow_path="flows/generation/knowledge/synth_knowledge.yaml"
)
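
The `ds_path` argument points at a JSONL file, that is, one JSON object per line. The sketch below shows one way to produce such a file with the standard library. The `document` field name is a placeholder assumption, not the schema sdg_hub requires, so check the flow's YAML for the columns it actually expects. The helpers `write_jsonl` and `read_jsonl` are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Iterable, List


def write_jsonl(records: Iterable[Dict[str, str]], path: str) -> int:
    """Write one JSON object per line; return the number of records written."""
    count = 0
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_jsonl(path: str) -> List[Dict[str, str]]:
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with Path(path).open("r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Placeholder records; the "document" key is an assumed field name.
records = [
    {"document": "The Nile flows north through Egypt."},
    {"document": "Copper is a good conductor of electricity."},
]

# Written to a temporary directory so the example has no side effects.
# In practice, write to the path you will pass to run_flow as ds_path.
with tempfile.TemporaryDirectory() as tmp:
    ds_path = str(Path(tmp) / "my_data.jsonl")
    written = write_jsonl(records, ds_path)
    loaded = read_jsonl(ds_path)
```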

Advanced Configuration

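You can invoke any built-in flow using run_flow. Beyond the four arguments shown above, it accepts optional settings such as a checkpoint directory, batch size, worker count, and save frequency: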

from sdg_hub.flow_runner import run_flow

run_flow(
    ds_path="path/to/dataset.jsonl",
    save_path="path/to/output.jsonl",
    endpoint="http://0.0.0.0:8000/v1",
    flow_path="path/to/flow.yaml",
    checkpoint_dir="path/to/checkpoints",
    batch_size=8,
    num_workers=32,
    save_freq=2,
)
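
When the same settings are reused across several flows, it can help to assemble the arguments in one place. The sketch below is a hypothetical helper, not part of sdg_hub. It uses only the parameter names that appear in the call above, and drops options set to `None` so that `run_flow` falls back to its own defaults for them (the simple example, which passes only four arguments, suggests those options are optional).

```python
from typing import Any, Dict


def build_run_flow_kwargs(
    ds_path: str,
    save_path: str,
    endpoint: str,
    flow_path: str,
    **options: Any,
) -> Dict[str, Any]:
    """Collect run_flow keyword arguments, omitting options left as None."""
    kwargs: Dict[str, Any] = {
        "ds_path": ds_path,
        "save_path": save_path,
        "endpoint": endpoint,
        "flow_path": flow_path,
    }
    # Only forward options the caller actually set.
    kwargs.update({k: v for k, v in options.items() if v is not None})
    return kwargs


kwargs = build_run_flow_kwargs(
    ds_path="path/to/dataset.jsonl",
    save_path="path/to/output.jsonl",
    endpoint="http://0.0.0.0:8000/v1",
    flow_path="path/to/flow.yaml",
    checkpoint_dir="path/to/checkpoints",
    batch_size=8,
    num_workers=None,  # left unset, so it is not forwarded
    save_freq=2,
)

# With sdg_hub installed, the call would then be:
#   from sdg_hub.flow_runner import run_flow
#   run_flow(**kwargs)
```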

📂 Available Built-in Flows

You can start with any of these YAML flows out of the box:

🔎 Knowledge Flows

Flow | Description
synth_knowledge.yaml | Produces document-grounded questions and answers for factual memorization
synth_knowledge1.5.yaml | Improved version that builds intermediate representations for better recall

🧠 Skills Flows

Flow | Description
synth_skills.yaml | Freeform skills QA generation (e.g., "Create a new GitHub issue to add type hints")
synth_grounded_skills.yaml | Domain-specific skill generation (e.g., "From the given conversation create a table for feature requests")
improve_responses.yaml | Uses planning and critique-based refinement to improve generated answers

All of these flows can be found in the flows directory.

📺 Video Tutorial

For a comprehensive walkthrough of sdg_hub:

SDG Hub Tutorial

🤝 Contributing

We welcome contributions from the community! Whether it's bug reports, feature requests, documentation improvements, or code contributions, please check out our contribution guidelines.

📄 License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.


Built with ❤️ by the Red Hat AI Innovation Team
