Synthetic Data Generation

Synthetic Data Generation for LLMs

The SDG Framework is a modular, scalable solution for building synthetic data generation workflows in a "no-code" manner. It is designed to simplify data creation for LLMs: users chain computational units into pipelines that generate and process data.

Core Design Principles

The framework is built around the following principles:

  1. Modular Design: Highly composable blocks form the building units of the framework, allowing users to build workflows effortlessly.
  2. No-Code Workflow Creation: Specify workflows using simple YAML configuration files.
  3. Scalability and Performance: Optimized for handling large-scale workflows with millions of records.

Framework Architecture

Blocks: The Fundamental Unit

At the heart of the framework is the Block. Each block is a self-contained computational unit that performs specific tasks, such as:

  • Making LLM calls
  • Performing data transformations
  • Applying filters

Blocks are designed to be:

  • Modular: Reusable across multiple pipelines.
  • Composable: Easily chained together to create workflows.

These blocks are implemented in the src/sdg_hub/blocks directory.
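
As a rough illustration of the idea, a block can be pictured as a named object with one method that maps a list of records to a new list of records. This is a minimal sketch, not the classes found in src/sdg_hub/blocks: the names `Block`, `UppercaseBlock`, `MinLengthFilterBlock`, and the `generate` method signature are assumptions made for this example.

```python
from abc import ABC, abstractmethod

Record = dict  # one row of data, e.g. {"question": "..."}


class Block(ABC):
    """Hypothetical base class: a named, self-contained processing step."""

    def __init__(self, block_name: str) -> None:
        self.block_name = block_name

    @abstractmethod
    def generate(self, samples: list[Record]) -> list[Record]:
        """Return a new list of records without mutating the input."""


class UppercaseBlock(Block):
    """Transformation block (illustrative): upper-cases one column."""

    def __init__(self, block_name: str, column: str) -> None:
        super().__init__(block_name)
        self.column = column

    def generate(self, samples: list[Record]) -> list[Record]:
        return [{**s, self.column: s[self.column].upper()} for s in samples]


class MinLengthFilterBlock(Block):
    """Filter block (illustrative): keeps records with a long enough column."""

    def __init__(self, block_name: str, column: str, min_len: int) -> None:
        super().__init__(block_name)
        self.column = column
        self.min_len = min_len

    def generate(self, samples: list[Record]) -> list[Record]:
        return [s for s in samples if len(s[self.column]) >= self.min_len]


rows = [{"question": "why?"}, {"question": "what is an arrow table?"}]
filtered = MinLengthFilterBlock("drop_short", "question", 10).generate(rows)
result = UppercaseBlock("shout", "question").generate(filtered)
print(result)
```

Because every block in this sketch shares the same input and output shape, the same block can be reused in any position of any workflow, which is the property the "Modular" and "Composable" bullets describe.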

Pipelines: Higher-Level Abstraction

Blocks can be chained together to form a Pipeline. Pipelines enable:

  • Linear or recursive chaining of blocks.
  • Execution of complex workflows by chaining multiple pipelines together.
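
The two points above can be sketched as follows. The `Pipeline` class here is hypothetical and only models linear chaining; steps are plain functions standing in for blocks. Since a pipeline's `generate` method has the same shape as a single step, whole pipelines can be chained in the same way.

```python
from typing import Callable

Record = dict
Step = Callable[[list[Record]], list[Record]]


class Pipeline:
    """Hypothetical pipeline: applies its steps one after another."""

    def __init__(self, steps: list[Step]) -> None:
        self.steps = steps

    def generate(self, samples: list[Record]) -> list[Record]:
        for step in self.steps:
            samples = step(samples)  # output of one step feeds the next
        return samples


# Three stand-in "blocks", written as plain functions for brevity.
def add_length(samples: list[Record]) -> list[Record]:
    return [{**s, "length": len(s["text"])} for s in samples]


def keep_long(samples: list[Record]) -> list[Record]:
    return [s for s in samples if s["length"] > 3]


def drop_length(samples: list[Record]) -> list[Record]:
    return [{k: v for k, v in s.items() if k != "length"} for s in samples]


measure = Pipeline([add_length])
clean = Pipeline([keep_long, drop_length])

# Chain two pipelines by treating each pipeline's generate method as a step.
workflow = Pipeline([measure.generate, clean.generate])
output = workflow.generate([{"text": "hi"}, {"text": "hello"}])
print(output)
```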

SDG Workflow: Full Workflow Automation

Pipelines are further orchestrated into SDG Workflows, enabling end-to-end processing. Calling sdg_hub.generate triggers one or more pipelines, each of which passes the data through all of its configured blocks.


YAML-Based Workflow: The Flow

The YAML configuration file, known as the Flow, is central to defining data generation workflows in the SDG Framework. A Flow describes how blocks and pipelines are orchestrated to process and generate data efficiently. By leveraging YAML, users can create highly customizable and modular workflows without writing any code.

Key Features of a Flow

  1. Modular Design:

    • Flows are composed of blocks, which can be chained together into pipelines.
    • Each block performs a specific task, such as generating, filtering, or transforming data.
  2. Reusability:

    • Blocks and configurations defined in a Flow can be reused across different workflows.
    • YAML makes it easy to tweak or extend workflows without significant changes.
  3. Ease of Configuration:

    • Users can specify block types, configurations, and data processing details in a simple and intuitive manner.

Sample Flow

Here is an example of a Flow configuration:

- block_type: LLMBlock
  block_config:
    block_name: gen_questions
    config_path: configs/skills/freeform_questions.yaml
    model_id: mistralai/Mixtral-8x7B-Instruct-v0.1
    output_cols:
      - question
    batch_kwargs:
      num_samples: 30
  drop_duplicates:
    - question
- block_type: FilterByValueBlock
  block_config:
    block_name: filter_questions
    filter_column: score
    filter_value: 1.0
    operation: operator.eq
    convert_dtype: float
    batch_kwargs:
      num_procs: 8
  drop_columns:
    - evaluation
    - score
    - num_samples
- block_type: LLMBlock
  block_config:
    block_name: gen_responses
    config_path: configs/skills/freeform_responses.yaml
    model_id: mistralai/Mixtral-8x7B-Instruct-v0.1
    output_cols:
      - response
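
The FilterByValueBlock entry above names its comparison as the string `operator.eq` and its target type as `float`. One plausible way such strings could be resolved into a row filter is sketched below. The helper `build_filter` and the `DTYPES` lookup table are assumptions for illustration, not sdg_hub's implementation, and the config is written as the Python dict a YAML parser would produce.

```python
import operator

# Assumed whitelist of dtype names that a config may use.
DTYPES = {"float": float, "int": int, "str": str}


def build_filter(block_config: dict):
    """Turn filter settings into a function that says whether to keep a row."""
    module_name, _, func_name = block_config["operation"].partition(".")
    if module_name != "operator":
        raise ValueError(f"unsupported operation: {block_config['operation']}")
    compare = getattr(operator, func_name)  # "eq" -> operator.eq
    convert = DTYPES[block_config["convert_dtype"]]
    column = block_config["filter_column"]
    target = block_config["filter_value"]

    def keep(record: dict) -> bool:
        # Convert first, so the string "1.0" compares equal to the float 1.0.
        return compare(convert(record[column]), target)

    return keep


config = {
    "block_name": "filter_questions",
    "filter_column": "score",
    "filter_value": 1.0,
    "operation": "operator.eq",
    "convert_dtype": "float",
}

rows = [
    {"question": "a", "score": "1.0"},
    {"question": "b", "score": "0"},
    {"question": "c", "score": "1"},
]
keep = build_filter(config)
kept = [r["question"] for r in rows if keep(r)]
print(kept)
```

The dtype conversion matters because values produced by an LLM typically arrive as text; without it, a score of "1" would not match a filter value of 1.0.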

Dataflow and Storage

  • Data Representation: Dataflow between blocks and pipelines is handled using Hugging Face Datasets, which are based on Arrow tables. This provides:

    • Native parallelization capabilities (e.g., maps, filters).
    • Support for efficient data transformations.
  • Data Checkpoints: Intermediate caches of generated data. Checkpoints allow users to:

    • Resume workflows from the last successful state if interrupted.
    • Improve reliability for long-running workflows.
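
The resume behaviour can be pictured as: process the data in batches, write each finished batch to disk, and on a restart reuse any batch whose file already exists. The sketch below uses plain lists and JSON files from the standard library rather than Hugging Face Datasets; the function name `run_with_checkpoints`, the file naming, and the batch size are all assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path


def run_with_checkpoints(samples, process, checkpoint_dir, batch_size=2):
    """Process samples in batches, caching each finished batch on disk."""
    checkpoint_dir = Path(checkpoint_dir)
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    results = []
    for start in range(0, len(samples), batch_size):
        path = checkpoint_dir / f"batch_{start:06d}.json"
        if path.exists():
            # This batch finished in an earlier run: reuse it.
            batch_out = json.loads(path.read_text())
        else:
            batch_out = [process(s) for s in samples[start:start + batch_size]]
            # Write under a temporary name, then rename, so an interrupted
            # run never leaves a half-written checkpoint behind.
            tmp_path = path.with_suffix(".tmp")
            tmp_path.write_text(json.dumps(batch_out))
            tmp_path.replace(path)
        results.extend(batch_out)
    return results


calls = []  # records which samples were actually processed


def slow_step(sample):
    calls.append(sample["id"])
    return {**sample, "done": True}


data = [{"id": i} for i in range(5)]
with tempfile.TemporaryDirectory() as work_dir:
    first = run_with_checkpoints(data, slow_step, work_dir)
    calls_after_first = len(calls)
    # A second run finds every batch on disk and does no new work.
    second = run_with_checkpoints(data, slow_step, work_dir)

print(calls_after_first, len(calls))
```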

Examples

For sample use cases and implementation examples, please refer to the examples directory. This directory contains various examples demonstrating different workflows and use cases of the SDG Framework.
