---
title: Synthetic Data Generator
short_description: Build datasets using natural language
emoji: 🧬
colorFrom: yellow
colorTo: pink
sdk: gradio
sdk_version: 5.8.0
app_file: app.py
pinned: true
license: apache-2.0
hf_oauth: true
#header: mini
hf_oauth_scopes:
  - read-repos
  - write-repos
  - manage-repos
  - inference-api
---



Synthetic Data Generator

Build datasets using natural language

Introduction

Synthetic Data Generator is a tool that allows you to create high-quality datasets for training and fine-tuning language models. It leverages distilabel and LLMs to generate synthetic data tailored to your specific needs. The announcement blog walks through a practical example of how to use it, and you can also watch the video to see it in action.

Supported Tasks:

  • Text Classification
  • Chat Data for Supervised Fine-Tuning
  • Retrieval Augmented Generation

This tool simplifies the process of creating custom datasets, enabling you to:

  • Describe the characteristics of your desired application
  • Iterate on sample datasets
  • Produce full-scale datasets
  • Push your datasets to the Hugging Face Hub and/or Argilla

By using the Synthetic Data Generator, you can rapidly prototype and create datasets, accelerating your AI development process.

Installation

You can simply install the package with:

pip install synthetic-dataset-generator

Quickstart

from synthetic_dataset_generator import launch

launch()
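
If you want to configure the app from Python instead of your shell, one option is to set the environment variables described below before importing the package. A minimal sketch (the values are placeholders, and setting variables via os.environ is just one way to do it):

import os

# Set configuration before importing, in case values are read at import time.
os.environ["HF_TOKEN"] = "hf_..."   # your Hugging Face token
os.environ["MAX_NUM_ROWS"] = "100"  # keep sample runs small

from synthetic_dataset_generator import launch

launch()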

Environment Variables

  • HF_TOKEN: Your Hugging Face token to push your datasets to the Hugging Face Hub and generate free completions from Hugging Face Inference Endpoints. You can find some configuration examples in the examples folder.

You can set the following environment variables to customize the generation process (a shell example follows the list).

  • MAX_NUM_TOKENS: The maximum number of tokens to generate, defaults to 2048.
  • MAX_NUM_ROWS: The maximum number of rows to generate, defaults to 1000.
  • DEFAULT_BATCH_SIZE: The default batch size to use for generating the dataset, defaults to 5.
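
For example, to allow longer completions and larger datasets generated in bigger batches (a shell sketch; the values are illustrative):

export MAX_NUM_TOKENS=4096
export MAX_NUM_ROWS=5000
export DEFAULT_BATCH_SIZE=10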

Optionally, you can use different API providers and models (see the example after the list).

  • MODEL: The model to use for generating the dataset, e.g. meta-llama/Meta-Llama-3.1-8B-Instruct, gpt-4o, llama3.1.
  • API_KEY: The API key to use for the generation API, e.g. hf_..., sk-.... If not provided, it will default to the HF_TOKEN environment variable.
  • OPENAI_BASE_URL: The base URL for any OpenAI-compatible API, e.g. https://api.openai.com/v1/.
  • OLLAMA_BASE_URL: The base URL for any Ollama-compatible API, e.g. http://127.0.0.1:11434/.
  • HUGGINGFACE_BASE_URL: The base URL for any Hugging Face-compatible API, e.g. a TGI server or Dedicated Inference Endpoints. If you want to use serverless inference, set only MODEL.
  • VLLM_BASE_URL: The base URL for any vLLM-compatible API, e.g. http://localhost:8000/.
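
For example, to generate against a local Ollama server (a sketch, assuming Ollama is running on the default port and already serves the llama3.1 model):

export MODEL=llama3.1
export OLLAMA_BASE_URL=http://127.0.0.1:11434/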

To use a specific model exclusively for generating completions, set the corresponding environment variables by appending _COMPLETION to the ones mentioned earlier. For example, you can use MODEL_COMPLETION and OPENAI_BASE_URL_COMPLETION.
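
For instance, a setup that generates completions with an OpenAI model while keeping the defaults elsewhere might look like this (a sketch; the key is a placeholder, and API_KEY_COMPLETION is assumed to follow the same _COMPLETION suffix rule):

export MODEL_COMPLETION=gpt-4o
export OPENAI_BASE_URL_COMPLETION=https://api.openai.com/v1/
export API_KEY_COMPLETION=sk-...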

SFT and Chat Data generation are not supported with OpenAI endpoints. Additionally, you need to configure them per model family, based on their prompt templates, using the right TOKENIZER_ID and MAGPIE_PRE_QUERY_TEMPLATE environment variables (see the example after the list).

  • TOKENIZER_ID: The tokenizer ID to use for the magpie pipeline, e.g. meta-llama/Meta-Llama-3.1-8B-Instruct.
  • MAGPIE_PRE_QUERY_TEMPLATE: Enforce setting the pre-query template for Magpie, which is only supported with Hugging Face Inference Endpoints. llama3 and qwen2 are supported out of the box and will use "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n" and "<|im_start|>user\n", respectively. For other models, you can pass a custom pre-query template string.
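
For example, a Llama 3.1 configuration for chat data generation might look like this (a sketch using the out-of-the-box llama3 template mentioned above):

export MODEL=meta-llama/Meta-Llama-3.1-8B-Instruct
export TOKENIZER_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
export MAGPIE_PRE_QUERY_TEMPLATE=llama3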

Optionally, you can also push your datasets to Argilla for further curation by setting the following environment variables (see the example after the list):

  • ARGILLA_API_KEY: Your Argilla API key to push your datasets to Argilla.
  • ARGILLA_API_URL: Your Argilla API URL to push your datasets to Argilla.
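
For example (placeholder values; point them at your own Argilla deployment):

export ARGILLA_API_URL=https://your-argilla-instance.example.com
export ARGILLA_API_KEY=your-argilla-api-key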

To save the generated datasets to a local directory instead of pushing them to the Hugging Face Hub, set the following environment variable:

  • SAVE_LOCAL_DIR: The local directory to save the generated datasets to.
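
For example (the directory name is a placeholder):

export SAVE_LOCAL_DIR=./synthetic-datasets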

You can use our environment template as a starting point:

cp .env.local.template .env

Argilla integration

Argilla is an open source tool for data curation. It allows you to annotate and review datasets, and push curated datasets to the Hugging Face Hub. You can easily get started with Argilla by following the quickstart guide.


Custom synthetic data generation?

Each pipeline is based on distilabel, so you can easily change the LLM or the pipeline steps.

Check out the distilabel library for more information.
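
As a rough illustration, a minimal distilabel pipeline with a swappable LLM could look like the sketch below. This is not the generator's internal pipeline, just a starting point based on the distilabel 1.x API; the model, prompt, and repo id are placeholders:

from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="custom-generation") as pipeline:
    # Seed the pipeline with a few instructions.
    loader = LoadDataFromDicts(
        data=[{"instruction": "Write a short product description for a mechanical keyboard."}]
    )
    # Swap the LLM here to change the provider or model.
    generation = TextGeneration(
        llm=InferenceEndpointsLLM(model_id="meta-llama/Meta-Llama-3.1-8B-Instruct")
    )
    loader >> generation

if __name__ == "__main__":
    distiset = pipeline.run()
    distiset.push_to_hub("your-username/your-dataset")  # placeholder repo id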

Development

Install the dependencies:

# Create a virtual environment
python -m venv .venv
source .venv/bin/activate

# Install the dependencies
pip install -e . # pdm install

Run the app:

python app.py

🐳 Docker Setup

Quick setup with all services (App + Ollama + Argilla):

# Copy environment template
cp docker/.env.docker.template .env # Add your HF_TOKEN in .env

# Build all services (this may take a few minutes)
docker compose -f docker-compose.yml -f docker/ollama/compose.yml -f docker/argilla/compose.yml build

# Start all services
docker compose -f docker-compose.yml -f docker/ollama/compose.yml -f docker/argilla/compose.yml up -d

For more detailed Docker configurations and setups, check docker/README.md.

