Synthetic Data Generation

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

abhi1092 meyceoz shiver

These details have not been verified by PyPI

Project links

homepage

Project description

sdg_hub: Synthetic Data Generation Toolkit for LLMs

Build Release License

sdg_hub is a modular, scalable, and efficient solution for creating synthetic data generation workflows in a "no-code" manner. At its core, this framework is designed to simplify data creation for LLMs, allowing users to chain computational units and build powerful pipelines for generating data and processing tasks.

Installation

Latest release from PyPI

pip install sdg-hub

Latest main branch

pip install git+https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub.git

Core Design Principles

The framework is built around the following principles:

Modular Design: Highly composable blocks form the building units of the framework, allowing users to build workflows effortlessly.
No-Code Workflow Creation: Specify workflows using simple YAML configuration files.
Scalability and Performance: Optimized for handling large-scale workflows with millions of records.

Framework Architecture

overview

Blocks: The Fundamental Unit

At the heart of the framework is the Block. Each block is a self-contained computational unit that performs specific tasks, such as:

Making LLM calls
Performing data transformations
Applying filters

Blocks are designed to be:

Modular: Reusable across multiple pipelines.
Composable: Easily chained together to create workflows.

These blocks are implemented in the src/sdg_hub/blocks directory.

Prompts

Prompts are at the core of how LLMs are instructed within SDG Hub. Each LLMBlock is associated with a prompt configuration file written in YAML, allowing users to define the exact behavior of the language model — including system instructions, generation principles, and output formatting.

Prompt YAML Structure

A typical prompt YAML file looks like this:

system: You are a helpful assistant that can summarize text.
introduction: Give me a short summary of the text.
principles:
  - Do not add any new information.
  - Do not miss any key points from the provided text.
examples:
  - input: Red Hat announced the acquisition of Neural Magic...
    output: Red Hat acquired Neural Magic to enhance its AI optimization capabilities.
generation: Here is the document to summarize: {{document}}

Key Fields

system: A high-level instruction that sets the persona or behavior of the model.
introduction: Optional introduction to set context for the user.
principles: A list of guiding constraints or rules the model should follow during generation.
examples: Few-shot examples (optional) to guide output format or tone.
generation: The actual template used to generate the model input. This supports variable injection using {{variable_name}}.

YAML-Based Workflow: The Flow

The YAML configuration file, known as the Flow, is central to defining data generation workflows in the SDG Framework. A Flow describes how blocks and pipelines are orchestrated to process and generate data efficiently. By leveraging YAML, users can create highly customizable and modular workflows without writing any code.

Key Features of a Flow

Modular Design:
- Flows are composed of blocks, which can be chained together into pipelines.
- Each block performs a specific task, such as generating, filtering, or transforming data.
Reusability:
- Blocks and configurations defined in a Flow can be reused across different workflows.
- YAML makes it easy to tweak or extend workflows without significant changes.
Ease of Configuration:
- Users can specify block types, configurations, and data processing details in a simple and intuitive manner.

Hello World Example

Let’s say you have a document and want to generate a concise summary using an LLM. Here’s how simple that is in sdg_hub:

- block_type: LLMBlock
  block_config:
    block_name: gen_summary
    config_path: prompts/summarization.yaml
    model_id: meta-llama/Llama-3.3-70B-Instruct
    output_cols:
      - summary
  gen_kwargs:
    max_tokens: 512

Want to go further? Add another block to extract keywords from the summary:

- block_type: LLMBlock
  block_config:
    block_name: gen_keywords
    config_path: prompts/keywords.yaml
    model_id: meta-llama/Llama-3.3-70B-Instruct
    output_cols:
      - keywords
  gen_kwargs:
    max_tokens: 64

Just like that, you’ve built a multi-step LLM workflow using nothing but YAML.

Available Blocks

The SDG Framework provides a rich set of blocks for different data processing needs. Here's a comprehensive overview of the available blocks and when to use them:

Base Block Class

The framework is built around the abstract Block class, which serves as the foundation for all other blocks:

Purpose: Provides core functionality and interface for all blocks
Key Features:
- Template validation for input data
- Configuration loading from YAML files
- Standardized block initialization
- Common interface for all blocks
Core Methods:
- _validate: Validates input data against templates
- _load_config: Loads configuration from YAML files
- generate: Abstract method for block execution

All blocks inherit from this base class, ensuring consistent behavior and interface across the framework.

LLM Blocks

LLMBlock
- Purpose: Generate text using language models
- Use Cases:
  - Generating questions, responses, or any text content
  - Single-prompt generation with structured outputs
- Features:
  - Supports batched processing
  - Configurable output parsing
  - Template-based prompt generation
ConditionalLLMBlock
- Purpose: Generate text based on conditional logic
- Use Cases:
  - Different prompt templates based on input conditions
  - Multi-path text generation workflows
- Features:
  - Multiple config paths for different conditions
  - Dynamic prompt selection
LLMLogProbBlock
- Purpose: Generate text with log probabilities
- Use Cases:
  - Analyzing model confidence
  - Quality scoring of generations
- Features:
  - Returns top-k log probabilities
  - JSON-formatted output
LLMMessagesBlock
- Purpose: Chat-based text generation
- Use Cases:
  - Multi-turn conversations
  - Chat-based interactions
- Features:
  - Supports message history
  - Chat completion API

Filtering and Processing Blocks

FilterByValueBlock
- Purpose: Filter datasets based on column values
- Use Cases:
  - Removing unwanted samples
  - Data cleaning
  - Quality filtering
- Features:
  - Multiple filter operations
  - Type conversion support
  - Parallel processing
IterBlock
- Purpose: Iterative processing of data
- Use Cases:
  - Multiple generation attempts
  - Iterative refinement
- Features:
  - Configurable number of iterations
  - Nested block execution

Utility Blocks

SamplePopulatorBlock
- Purpose: Populate samples with configuration data
- Use Cases:
  - Adding metadata
  - Configuration injection
SelectorBlock
- Purpose: Select data based on mapping
- Use Cases:
  - Conditional data selection
  - Data routing
CombineColumnsBlock
- Purpose: Merge multiple columns
- Use Cases:
  - Text concatenation
  - Feature combination
FlattenColumnsBlock
- Purpose: Convert wide to long format
- Use Cases:
  - Data reshaping
  - Variable-value pairs
DuplicateColumns
- Purpose: Create column copies
- Use Cases:
  - Data preservation
  - Multiple processing paths
RenameColumns
- Purpose: Rename dataset columns
- Use Cases:
  - Standardizing column names
  - Data reorganization
SetToMajorityValue
- Purpose: Replace values with majority
- Use Cases:
  - Data normalization
  - Outlier handling

Dataflow and Storage

Data Representation: Dataflow between blocks and pipelines is handled using Hugging Face Datasets, which are based on Arrow tables. This provides:
- Native parallelization capabilities (e.g., maps, filters).
- Support for efficient data transformations.
Data Checkpoints: Intermediate caches of generated data. Checkpoints allow users to:
- Resume workflows from the last successful state if interrupted.
- Improve reliability for long-running workflows.

Examples

For sample use cases and implementation examples, please refer to the examples directory. This directory contains various examples demonstrating different workflows and use cases of the SDG Framework.

Project details

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

abhi1092 meyceoz shiver

These details have not been verified by PyPI

Project links

homepage

Release history Release notifications | RSS feed

0.9.3

May 9, 2026

0.9.2

Apr 27, 2026

0.9.1

Apr 11, 2026

0.9.0

Mar 25, 2026

0.8.8

Mar 13, 2026

0.8.7

Mar 5, 2026

0.8.6

Feb 23, 2026

0.8.5

Feb 20, 2026

0.8.4

Feb 19, 2026

0.8.3

Feb 17, 2026

0.8.2

Feb 13, 2026

0.8.1

Feb 12, 2026

0.8.0

Feb 4, 2026

0.7.3

Jan 16, 2026

0.7.2

Dec 18, 2025

0.7.1

Dec 2, 2025

0.7.0

Dec 1, 2025

0.6.1

Nov 21, 2025

0.6.0

Oct 18, 2025

0.5.1

Oct 17, 2025

0.5.0

Oct 10, 2025

0.4.2

Oct 7, 2025

0.4.1

Oct 3, 2025

0.4.0

Sep 30, 2025

0.3.1

Sep 23, 2025

0.3.0

Sep 18, 2025

0.2.2

Aug 29, 2025

0.2.1

Aug 15, 2025

0.2.0

Aug 8, 2025

0.1.4

Jul 11, 2025

0.1.3

Jul 6, 2025

0.1.2

Jun 27, 2025

0.1.1

Jun 21, 2025

0.1.0

Jun 14, 2025

This version

0.1.0a4 pre-release

May 6, 2025

0.1.0a3 pre-release

Apr 18, 2025

0.1.0a2 pre-release

Apr 15, 2025

0.1.0a1 pre-release

Apr 14, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sdg_hub-0.1.0a4.tar.gz (4.9 MB view details)

Uploaded May 6, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

sdg_hub-0.1.0a4-py3-none-any.whl (100.8 kB view details)

Uploaded May 6, 2025 Python 3

File details

Details for the file sdg_hub-0.1.0a4.tar.gz.

File metadata

Download URL: sdg_hub-0.1.0a4.tar.gz
Upload date: May 6, 2025
Size: 4.9 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for sdg_hub-0.1.0a4.tar.gz
Algorithm	Hash digest
SHA256	`67cd80ed679312bb52354bd2577a6af3caf130aac7e51fdd60a22a229daed2d2`
MD5	`886e24b80ab29c3c9391be6741283e1a`
BLAKE2b-256	`616eb9166f4c8bd46714813c4ee548d3f80d2384e8dce9e5c9cd3275708f7961`

See more details on using hashes here.

Provenance

The following attestation bundles were made for sdg_hub-0.1.0a4.tar.gz:

Publisher: pypi.yaml on Red-Hat-AI-Innovation-Team/sdg_hub

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: sdg_hub-0.1.0a4.tar.gz
- Subject digest: 67cd80ed679312bb52354bd2577a6af3caf130aac7e51fdd60a22a229daed2d2
- Sigstore transparency entry: 207544893
- Sigstore integration time: May 6, 2025
Source repository:
- Permalink: Red-Hat-AI-Innovation-Team/sdg_hub@4bec81926a99337b07ea11698c02fc42bf8e64a8
- Branch / Tag: refs/tags/v0.1.0a4
- Owner: https://github.com/Red-Hat-AI-Innovation-Team
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi.yaml@4bec81926a99337b07ea11698c02fc42bf8e64a8
- Trigger Event: release

File details

Details for the file sdg_hub-0.1.0a4-py3-none-any.whl.

File metadata

Download URL: sdg_hub-0.1.0a4-py3-none-any.whl
Upload date: May 6, 2025
Size: 100.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for sdg_hub-0.1.0a4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`15211ffb0402dc455c2f703e355ce96942e71f025c9f33bacbababe966b5bf93`
MD5	`2ea8c2eb90c55a23f8f32c94731feefd`
BLAKE2b-256	`e4f432107aea3ae9a35a30ea4efbfca3b6eb46b1d09bdb1c6f842fcff81afd68`

See more details on using hashes here.

Provenance

The following attestation bundles were made for sdg_hub-0.1.0a4-py3-none-any.whl:

Publisher: pypi.yaml on Red-Hat-AI-Innovation-Team/sdg_hub

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: sdg_hub-0.1.0a4-py3-none-any.whl
- Subject digest: 15211ffb0402dc455c2f703e355ce96942e71f025c9f33bacbababe966b5bf93
- Sigstore transparency entry: 207544899
- Sigstore integration time: May 6, 2025
Source repository:
- Permalink: Red-Hat-AI-Innovation-Team/sdg_hub@4bec81926a99337b07ea11698c02fc42bf8e64a8
- Branch / Tag: refs/tags/v0.1.0a4
- Owner: https://github.com/Red-Hat-AI-Innovation-Team
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi.yaml@4bec81926a99337b07ea11698c02fc42bf8e64a8
- Trigger Event: release

sdg-hub 0.1.0a4

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

sdg_hub: Synthetic Data Generation Toolkit for LLMs

Installation

Core Design Principles

Framework Architecture

Blocks: The Fundamental Unit

Prompts

Prompt YAML Structure

Key Fields

YAML-Based Workflow: The Flow

Key Features of a Flow

Hello World Example

Available Blocks

Base Block Class

LLM Blocks

Filtering and Processing Blocks

Utility Blocks

Dataflow and Storage

Examples

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance