
Build datasets using natural language



title: Synthetic Data Generator
short_description: Build datasets using natural language
emoji: 🧬
colorFrom: yellow
colorTo: pink
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: true
license: apache-2.0
hf_oauth: true
#header: mini
hf_oauth_scopes:

  • read-repos
  • write-repos
  • manage-repos
  • inference-api


Synthetic Data Generator

Build datasets using natural language


Introduction

Synthetic Data Generator is a tool for creating high-quality datasets for training and fine-tuning language models. It uses distilabel and LLMs to generate synthetic data tailored to your needs.

Supported Tasks:

  • Text Classification
  • Supervised Fine-Tuning
  • Judging and rationale evaluation

This tool simplifies the process of creating custom datasets, enabling you to:

  • Describe the characteristics of your desired application
  • Iterate on sample datasets
  • Produce full-scale datasets
  • Push your datasets to the Hugging Face Hub and/or Argilla

By using the Synthetic Data Generator, you can rapidly prototype and create datasets, accelerating your AI development process.

Installation

You can simply install the package with:

pip install synthetic-dataset-generator

Quickstart

from synthetic_dataset_generator.app import demo

demo.launch()

Environment Variables

  • HF_TOKEN: Your Hugging Face token to push your datasets to the Hugging Face Hub and generate free completions from Hugging Face Inference Endpoints.

Optionally, you can set the following environment variables to customize the generation process.

  • BASE_URL: The base URL for any OpenAI-compatible API, e.g. https://api-inference.huggingface.co/v1/.
  • MODEL: The model to use for generating the dataset, e.g. meta-llama/Meta-Llama-3.1-8B-Instruct.
  • API_KEY: The API key to use for the corresponding API, e.g. hf_....

Optionally, you can also push your datasets to Argilla for further curation by setting the following environment variables:

  • ARGILLA_API_KEY: Your Argilla API key to push your datasets to Argilla.
  • ARGILLA_API_URL: Your Argilla API URL to push your datasets to Argilla.
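
The variables above can be checked before launching the app. The following is a minimal sketch, not part of the package: the helper name `summarize_settings` is hypothetical, HF_TOKEN is treated as required (the README implies this but does not state it), and Argilla is assumed to be usable only when both of its variables are set.

```python
import os

# Variable names are taken from the list above; the grouping is an assumption.
REQUIRED = ("HF_TOKEN",)
GENERATION = ("BASE_URL", "MODEL", "API_KEY")
ARGILLA = ("ARGILLA_API_KEY", "ARGILLA_API_URL")


def summarize_settings(env):
    """Report which documented variables are set, without exposing their values."""
    missing_required = [name for name in REQUIRED if not env.get(name)]
    return {
        "missing_required": missing_required,
        # True/False per optional generation variable.
        "custom_generation": {name: bool(env.get(name)) for name in GENERATION},
        # Assumption: pushing to Argilla needs both the key and the URL.
        "argilla_enabled": all(env.get(name) for name in ARGILLA),
    }


if __name__ == "__main__":
    print(summarize_settings(os.environ))
```

Running this in the same shell that will start the app shows which optional features are configured. Only booleans and variable names are printed, so tokens do not end up in logs.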

Argilla integration

Argilla is an open-source tool for data curation. It allows you to annotate and review datasets, and push curated datasets to the Hugging Face Hub. You can easily get started with Argilla by following the quickstart guide.


Custom synthetic data generation?

Each pipeline is based on distilabel, so you can easily change the LLM or the pipeline steps.

Check out the distilabel library for more information.
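
To give a sense of what a custom pipeline can look like, here is a minimal distilabel sketch. It is not the pipeline this package ships. The helper names `build_seed_rows` and `build_pipeline`, the pipeline name, and the prompt wording are hypothetical, and the import location of `InferenceEndpointsLLM` differs between distilabel releases, so both are tried.

```python
def build_seed_rows(topics):
    """Turn topic strings into rows with the `instruction` column that TextGeneration reads."""
    return [{"instruction": f"Write a short question about {t}."} for t in topics]


def build_pipeline(rows, model_id="meta-llama/Meta-Llama-3.1-8B-Instruct"):
    """Build a two-step pipeline: load the rows, then generate one completion per row."""
    # Imported lazily so this module can be imported without distilabel installed.
    from distilabel.pipeline import Pipeline
    from distilabel.steps import LoadDataFromDicts
    from distilabel.steps.tasks import TextGeneration

    try:
        from distilabel.models import InferenceEndpointsLLM  # newer releases
    except ImportError:
        from distilabel.llms import InferenceEndpointsLLM  # older releases

    with Pipeline(name="custom-synthetic-data") as pipeline:
        load = LoadDataFromDicts(data=rows)
        # Swap this LLM, or add further steps, to customize the generation.
        generate = TextGeneration(llm=InferenceEndpointsLLM(model_id=model_id))
        load >> generate
    return pipeline
```

Calling `build_pipeline(build_seed_rows(["astronomy"])).run()` under an `if __name__ == "__main__":` guard, with HF_TOKEN set, should return a distiset that can be pushed to the Hub. Treat the exact arguments as something to confirm against the distilabel version you have installed.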

Development

Install the dependencies:

python -m venv .venv
source .venv/bin/activate
pip install -e .

Run the app:

python app.py
