Synthetic Data Generation

Maintainers

cpacheco ktdreyer nathan-weinberg russellb

Synthetic Data Generation (SDG)

The SDG Framework is a modular, scalable, and efficient solution for creating synthetic data generation workflows in a “no-code” manner. At its core, this framework is designed to simplify data creation for LLMs, allowing users to chain computational units and build powerful pipelines for generating data and processing tasks.

Core Design Principles

The framework is built around the following principles:

Modular Design: Highly composable blocks form the building units of the framework, allowing users to build workflows effortlessly.
No-Code Workflow Creation: Specify workflows using simple YAML configuration files.
Scalability and Performance: Optimized for handling large-scale workflows with millions of records.

Framework Architecture

Blocks: The Fundamental Unit

At the heart of the framework is the Block. Each block is a self-contained computational unit that performs specific tasks, such as:

Making LLM calls
Performing data transformations
Applying filters

Blocks are designed to be:

Modular: Reusable across multiple pipelines.
Composable: Easily chained together to create workflows.

These blocks are implemented in the src/instructlab/sdg/blocks directory.
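
As a rough illustration of what "self-contained computational unit" can mean, the sketch below models a block as an object with a single method that takes records and returns records. The `Block`, `UppercaseBlock`, and `MinLengthFilterBlock` names, the `generate` method, and the use of plain lists of dicts are all assumptions made for this example; they are not the classes shipped in the library.

```python
from abc import ABC, abstractmethod
from typing import Any

Row = dict[str, Any]


class Block(ABC):
    """Hypothetical stand-in for a block: one self-contained processing step."""

    def __init__(self, name: str) -> None:
        self.name = name

    @abstractmethod
    def generate(self, rows: list[Row]) -> list[Row]:
        """Take a list of records and return a new list of records."""


class UppercaseBlock(Block):
    """A data transformation: upper-case one column."""

    def __init__(self, name: str, column: str) -> None:
        super().__init__(name)
        self.column = column

    def generate(self, rows: list[Row]) -> list[Row]:
        return [{**row, self.column: str(row[self.column]).upper()} for row in rows]


class MinLengthFilterBlock(Block):
    """A filter: keep rows whose column has at least min_length characters."""

    def __init__(self, name: str, column: str, min_length: int) -> None:
        super().__init__(name)
        self.column = column
        self.min_length = min_length

    def generate(self, rows: list[Row]) -> list[Row]:
        return [row for row in rows if len(str(row[self.column])) >= self.min_length]


rows = [{"question": "why?"}, {"question": "what is synthetic data?"}]
shout = UppercaseBlock("shout", column="question")
keep_long = MinLengthFilterBlock("keep_long", column="question", min_length=10)

# Because every block shares the same interface, the output of one feeds the next.
result = keep_long.generate(shout.generate(rows))
print(result)
```

Since both blocks expose the same method, either one can be reused in another workflow or reordered without changes to the other.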

Pipelines: Higher-Level Abstraction

Blocks can be chained together to form a Pipeline. Pipelines enable:

Linear or recursive chaining of blocks.
Execution of complex workflows by chaining multiple pipelines together.
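
The two kinds of chaining listed above can be sketched in a few lines. This is a simplified model under stated assumptions: the `Pipeline` class here is hypothetical, steps are plain functions over lists of dicts, and none of the names reflect the library's actual API.

```python
from typing import Any, Callable

Row = dict[str, Any]
Step = Callable[[list[Row]], list[Row]]


class Pipeline:
    """Hypothetical pipeline: runs steps in order, feeding each the previous output."""

    def __init__(self, steps: list[Step]) -> None:
        self.steps = steps

    def generate(self, rows: list[Row]) -> list[Row]:
        for step in self.steps:
            rows = step(rows)
        return rows


def add_length(rows: list[Row]) -> list[Row]:
    # Transformation step: annotate each row with the length of its text.
    return [{**row, "length": len(row["text"])} for row in rows]


def drop_short(rows: list[Row]) -> list[Row]:
    # Filter step: keep rows with more than three characters of text.
    return [row for row in rows if row["length"] > 3]


def drop_length(rows: list[Row]) -> list[Row]:
    # Cleanup step: remove the helper column added by add_length.
    return [{k: v for k, v in row.items() if k != "length"} for row in rows]


filtering = Pipeline([add_length, drop_short])
cleanup = Pipeline([drop_length])

# A pipeline's generate method has the same shape as a step, so pipelines nest.
combined = Pipeline([filtering.generate, cleanup.generate])
print(combined.generate([{"text": "hi"}, {"text": "hello"}]))
```

Treating a whole pipeline as just another step is what lets several pipelines be chained into a larger workflow.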

SDG ships with four default pipelines: simple, full, eval, and llama. Each pipeline has its own hardware requirements.

Simple Pipeline

The simple pipeline is designed to be used with quantized Merlinite as the teacher model. It produces basic data generation results on low-end consumer-grade hardware, such as laptops and desktops with small or no discrete GPUs.

Full Pipeline

The full pipeline is designed to be used with Mixtral-8x7B-Instruct-v0.1 as the teacher model, but has also been successfully tested with smaller models such as Mistral-7B-Instruct-v0.2 and even some quantized versions of the two teacher models. This is the preferred data generation pipeline on higher-end consumer-grade hardware and all enterprise hardware.

Eval Pipeline

The eval pipeline is used to generate MMLU benchmark data that can be used to later evaluate a trained model on your knowledge dataset. It does not generate data for use during model training.

Llama Pipeline

The Llama pipeline is designed for use with Llama-3.3-70B-Instruct as the teacher model. Currently, Llama support focuses on knowledge pipelines, which aim to produce high-quality, context-aware educational content on higher-end consumer hardware and enterprise systems.

Note: Support for Llama-based skills pipelines is still under development and will be rolled out in future releases.

YAML-Based Workflow: The Pipeline Configuration

The Pipeline YAML configuration file is central to defining data generation workflows in the SDG Framework. This configuration file describes how blocks and pipelines are orchestrated to process and generate data efficiently. By leveraging YAML, users can create highly customizable and modular workflows without writing any code.

Pipeline configuration must adhere to our JSON schema to be considered valid.
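
To give a feel for what schema validation guards against, here is a minimal structural check written with the standard library only. The rules in `check_pipeline_config` are illustrative assumptions and are not the project's JSON schema; the function name is hypothetical, and the input is a plain dict of the kind a YAML parser would return.

```python
from typing import Any


def check_pipeline_config(config: dict[str, Any]) -> list[str]:
    """Return a list of problems found; an empty list means the shape looks right.

    These rules are illustrative assumptions, not the project's JSON schema.
    """
    problems: list[str] = []
    if not isinstance(config.get("version"), str):
        problems.append("'version' must be a string")
    blocks = config.get("blocks")
    if not isinstance(blocks, list) or not blocks:
        problems.append("'blocks' must be a non-empty list")
        return problems
    for index, block in enumerate(blocks):
        if not isinstance(block, dict):
            problems.append(f"blocks[{index}] must be a mapping")
            continue
        for key in ("name", "type"):
            if key not in block:
                problems.append(f"blocks[{index}] is missing '{key}'")
    return problems


good = {"version": "1.0", "blocks": [{"name": "gen", "type": "LLMBlock"}]}
bad = {"version": 1.0, "blocks": [{"name": "gen"}]}

print(check_pipeline_config(good))
print(check_pipeline_config(bad))
```

Note that an unquoted `version: 1.0` in YAML parses as a number, which is one reason a configuration might quote the version as a string.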

Key Features of Pipeline Configuration

Modular Design:
- Pipelines are composed of blocks, which can be chained together.
- Each block performs a specific task, such as generating, filtering, or transforming data.
Reusability:
- Blocks and their configurations can be reused across different workflows.
- YAML makes it easy to tweak or extend workflows without significant changes.
Ease of Configuration:
- Users can specify block types, configurations, and data processing details in a simple and intuitive manner.

Sample Pipeline Configuration

Here is an example of a Pipeline configuration:

version: "1.0"
blocks:
  - name: gen_questions
    type: LLMBlock
    config:
      config_path: configs/skills/freeform_questions.yaml
      output_cols:
        - question
      batch_kwargs:
        num_samples: 30
    drop_duplicates:
      - question
  - name: filter_questions
    type: FilterByValueBlock
    config:
      filter_column: score
      filter_value: 1.0
      operation: eq
      convert_dtype: float
    drop_columns:
      - evaluation
      - score
      - num_samples
  - name: gen_responses
    type: LLMBlock
    config:
      config_path: configs/skills/freeform_responses.yaml
      output_cols:
        - response
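
One plausible reading of the `filter_questions` entry, together with the `drop_duplicates` and `drop_columns` keys, is sketched below in plain Python. It assumes that `operation: eq` names a function in Python's `operator` module and that `convert_dtype: float` converts the column before comparing; the helper functions are hypothetical and operate on lists of dicts rather than on the library's real data structures.

```python
import operator
from typing import Any

Row = dict[str, Any]

# Assumed mapping from the YAML `convert_dtype` value to a Python converter.
DTYPES = {"float": float, "int": int}


def filter_by_value(rows, filter_column, filter_value, operation, convert_dtype=None):
    """Keep rows where operation(converted column value, filter_value) is true."""
    compare = getattr(operator, operation)  # e.g. "eq" -> operator.eq
    convert = DTYPES.get(convert_dtype, lambda value: value)
    return [row for row in rows if compare(convert(row[filter_column]), filter_value)]


def drop_columns(rows, columns):
    """Remove the named columns from every row."""
    return [{k: v for k, v in row.items() if k not in columns} for row in rows]


def drop_duplicates(rows, columns):
    """Keep the first row for each distinct combination of the named columns."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[column] for column in columns)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [
    {"question": "What is SDG?", "score": "1.0", "evaluation": "good"},
    {"question": "What is SDG?", "score": "1.0", "evaluation": "good"},
    {"question": "Huh?", "score": "0.0", "evaluation": "unclear"},
]

deduped = drop_duplicates(rows, ["question"])
kept = filter_by_value(deduped, "score", 1.0, "eq", convert_dtype="float")
final = drop_columns(kept, ["evaluation", "score"])
print(final)
```

The string score `"1.0"` only matches the numeric `filter_value` because of the conversion step, which is the role `convert_dtype` appears to play in the configuration.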

Data Flow and Storage

Data Representation: Data flow between blocks and pipelines is handled using Hugging Face Datasets, which are based on Arrow tables. This provides:
- Native parallelization capabilities (e.g., maps, filters).
- Support for efficient data transformations.
Data Checkpoints: Intermediate caches of generated data. Checkpoints allow users to:
- Resume workflows from the last successful state if interrupted.
- Improve reliability for long-running workflows.
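
The checkpoint idea can be sketched with the standard library alone. The example below caches each step's output as a JSON file and skips any step whose file already exists; it is a simplified stand-in, not the framework's checkpoint implementation, and `run_with_checkpoints` and the file naming scheme are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path


def run_with_checkpoints(rows, steps, checkpoint_dir):
    """Run (name, function) steps in order, caching each step's output as JSON."""
    checkpoint_dir = Path(checkpoint_dir)
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    executed = []
    for index, (name, step) in enumerate(steps):
        path = checkpoint_dir / f"{index:02d}_{name}.json"
        if path.exists():
            # A previous run finished this step: reuse its cached output.
            rows = json.loads(path.read_text())
        else:
            rows = step(rows)
            path.write_text(json.dumps(rows))
            executed.append(name)
    return rows, executed


def double(rows):
    return [{"n": row["n"] * 2} for row in rows]


def keep_large(rows):
    return [row for row in rows if row["n"] > 2]


steps = [("double", double), ("keep_large", keep_large)]
with tempfile.TemporaryDirectory() as tmp:
    first_rows, first_executed = run_with_checkpoints([{"n": 1}, {"n": 2}], steps, tmp)
    # The second run finds both checkpoint files and recomputes nothing.
    second_rows, second_executed = run_with_checkpoints([{"n": 1}, {"n": 2}], steps, tmp)

print(first_rows, first_executed)
print(second_rows, second_executed)
```

This sketch trusts any checkpoint file it finds; a production version would also need to confirm that the cached data corresponds to the same input and configuration.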

Installing the SDG library

Clone the library and navigate to the repo:

git clone https://github.com/instructlab/sdg
cd sdg

Install the library:

pip install .
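
After installing, one way to confirm that the package is visible to the current interpreter is to query its metadata with the standard library. The sketch assumes the distribution is registered under the name `instructlab-sdg`; the `installed_version` helper is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution: str):
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints a version string if the install succeeded, otherwise None.
print(installed_version("instructlab-sdg"))
```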

Using the library

You can import SDG into your Python files with the following items:

from instructlab.sdg.generate_data import generate_data
from instructlab.sdg.utils import GenerateException

Repository structure

|-- src/instructlab/ (1)
|-- docs/ (2)
|-- scripts/ (3)
|-- tests/ (4)

(1) Contains the SDG code that interacts with InstructLab.
(2) Contains documentation on various SDG methodologies.
(3) Contains utility scripts that are not part of any supported API.
(4) Contains all the tests for the SDG repository.

Release history

Version	Release date	Notes
0.8.3	Jun 16, 2025	this version
0.8.2	Apr 23, 2025
0.8.1	Apr 21, 2025
0.8.0	Apr 14, 2025
0.7.3	Mar 31, 2025
0.7.2	Mar 13, 2025
0.7.1	Feb 17, 2025
0.7.0	Jan 22, 2025
0.6.3	Jan 10, 2025
0.6.2	Dec 10, 2024
0.6.1	Nov 27, 2024
0.6.0	Nov 15, 2024
0.5.0	Nov 12, 2024
0.5.0a2	Nov 8, 2024	pre-release
0.5.0a1	Nov 1, 2024	pre-release
0.4.2	Oct 15, 2024
0.4.1	Oct 15, 2024
0.4.0	Oct 10, 2024
0.3.3	Nov 13, 2024
0.3.2	Oct 18, 2024
0.3.1	Oct 7, 2024
0.3.0	Aug 22, 2024
0.2.7	Aug 22, 2024
0.2.6	Aug 16, 2024
0.2.5	Aug 9, 2024
0.2.4	Jul 30, 2024
0.2.3	Jul 29, 2024
0.2.2	Jul 28, 2024
0.2.1	Jul 26, 2024
0.2.0	Jul 23, 2024
0.1.3	Jul 22, 2024
0.1.2	Jul 11, 2024
0.1.1	Jul 11, 2024
0.1.0	Jul 8, 2024
0.0.4.1	Jun 30, 2024
0.0.4	Jun 25, 2024
0.0.3	Jun 25, 2024
0.0.2	Jun 19, 2024
0.0.1	Jun 17, 2024
0.0.1rc4	Jun 17, 2024	pre-release
0.0.1rc3	Jun 13, 2024	pre-release
0.0.1rc2	Jun 13, 2024	pre-release
0.0.1rc1	Jun 12, 2024	pre-release

Download files

Download the file for your platform.

Source Distribution

instructlab_sdg-0.8.3.tar.gz (3.4 MB)

Uploaded Jun 16, 2025 Source

Built Distribution

instructlab_sdg-0.8.3-py3-none-any.whl (107.7 kB)

Uploaded Jun 16, 2025 Python 3

File details

Details for the file instructlab_sdg-0.8.3.tar.gz.

File metadata

Download URL: instructlab_sdg-0.8.3.tar.gz
Upload date: Jun 16, 2025
Size: 3.4 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for instructlab_sdg-0.8.3.tar.gz
Algorithm	Hash digest
SHA256	`6c5d038f87f55892450775e3b58e6b9425fad952d1b8244252e094173cdabf0b`
MD5	`4117b79cc0db79fced13378cada11c19`
BLAKE2b-256	`0ed76d4523d42935dde1f2db2453615a9a367751816a4ce993f24c5ccaeb8f2b`
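
A downloaded file can be checked against the published SHA256 digest with Python's `hashlib`. The sketch below assumes the archive was saved in the current directory under its original file name; the `sha256_of` and `matches` helpers are names chosen for this example.

```python
import hashlib
import hmac
from pathlib import Path

# SHA256 digest published above for the source distribution.
EXPECTED_SHA256 = "6c5d038f87f55892450775e3b58e6b9425fad952d1b8244252e094173cdabf0b"


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Compare the file's digest with the expected hex string."""
    return hmac.compare_digest(sha256_of(path), expected_hex.lower())


archive = Path("instructlab_sdg-0.8.3.tar.gz")  # assumed download location
if archive.exists():
    print("digest matches" if matches(archive, EXPECTED_SHA256) else "digest MISMATCH")
else:
    print(f"{archive} not found; download it first")
```

SHA256 is the digest to rely on here; MD5 is listed for completeness but is not suitable for detecting deliberate tampering.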

Provenance

The following attestation bundles were made for instructlab_sdg-0.8.3.tar.gz:

Publisher: pypi.yaml on instructlab/sdg

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: instructlab_sdg-0.8.3.tar.gz
- Subject digest: 6c5d038f87f55892450775e3b58e6b9425fad952d1b8244252e094173cdabf0b
- Sigstore transparency entry: 239753622
- Sigstore integration time: Jun 16, 2025
Source repository:
- Permalink: instructlab/sdg@2268a6f7771091f9415e0b8380cb9ed9bde07ee3
- Branch / Tag: refs/tags/v0.8.3
- Owner: https://github.com/instructlab
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi.yaml@2268a6f7771091f9415e0b8380cb9ed9bde07ee3
- Trigger Event: release

File details

Details for the file instructlab_sdg-0.8.3-py3-none-any.whl.

File metadata

Download URL: instructlab_sdg-0.8.3-py3-none-any.whl
Upload date: Jun 16, 2025
Size: 107.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for instructlab_sdg-0.8.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a0a1df2c8b47a9929e27750ccbe1c908dd5755ac9fbefce2027080df2e13b146`
MD5	`12fae79c8cd90eab724719e048702bc6`
BLAKE2b-256	`1f3512be45620e704d4c66cc7387b842dd92176a6b4049ffc39ec971c997e8a6`

Provenance

The following attestation bundles were made for instructlab_sdg-0.8.3-py3-none-any.whl:

Publisher: pypi.yaml on instructlab/sdg

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: instructlab_sdg-0.8.3-py3-none-any.whl
- Subject digest: a0a1df2c8b47a9929e27750ccbe1c908dd5755ac9fbefce2027080df2e13b146
- Sigstore transparency entry: 239753625
- Sigstore integration time: Jun 16, 2025
Source repository:
- Permalink: instructlab/sdg@2268a6f7771091f9415e0b8380cb9ed9bde07ee3
- Branch / Tag: refs/tags/v0.8.3
- Owner: https://github.com/instructlab
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi.yaml@2268a6f7771091f9415e0b8380cb9ed9bde07ee3
- Trigger Event: release

instructlab-sdg 0.8.3
