Build datasets using natural language
title: Synthetic Data Generator
short_description: Build datasets using natural language
emoji: 🧬
colorFrom: yellow
colorTo: pink
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: true
license: apache-2.0
hf_oauth: true
#header: mini
hf_oauth_scopes:
- read-repos
- write-repos
- manage-repos
- inference-api
Synthetic Data Generator
Build datasets using natural language
Introduction
Synthetic Data Generator is a tool that allows you to create high-quality datasets for training and fine-tuning language models. It leverages the power of distilabel and LLMs to generate synthetic data tailored to your specific needs.
Supported Tasks:
- Text Classification
- Supervised Fine-Tuning
- Judging and rationale evaluation
This tool simplifies the process of creating custom datasets, enabling you to:
- Describe the characteristics of your desired application
- Iterate on sample datasets
- Produce full-scale datasets
- Push your datasets to the Hugging Face Hub and/or Argilla
By using the Synthetic Data Generator, you can rapidly prototype and create datasets, accelerating your AI development process.
Installation
You can simply install the package with:
pip install synthetic-dataset-generator
Quickstart
from synthetic_dataset_generator.app import demo
demo.launch()
Environment Variables
HF_TOKEN: Your Hugging Face token to push your datasets to the Hugging Face Hub and generate free completions from Hugging Face Inference Endpoints.
Optionally, you can set the following environment variables to customize the generation process.
- BASE_URL: The base URL for any OpenAI compatible API, e.g. https://api-inference.huggingface.co/v1/.
- MODEL: The model to use for generating the dataset, e.g. meta-llama/Meta-Llama-3.1-8B-Instruct.
- API_KEY: The API key to use for the corresponding API, e.g. hf_....
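One way to set these variables from Python is to write them into the environment before the app is imported. The sketch below is a minimal illustration, assuming the package reads `BASE_URL`, `MODEL`, and `API_KEY` from the process environment at startup. The helper name `configure_generation` is hypothetical and not part of the package.

```python
import os
from typing import MutableMapping


def configure_generation(
    base_url: str,
    model: str,
    api_key: str,
    env: MutableMapping[str, str] = os.environ,
) -> None:
    """Write the three optional generation variables into `env`.

    Hypothetical helper for illustration; not part of the package.
    """
    if not base_url.startswith(("http://", "https://")):
        raise ValueError(f"BASE_URL should be an http(s) URL, got {base_url!r}")
    if not model or not api_key:
        raise ValueError("MODEL and API_KEY must be non-empty")
    env["BASE_URL"] = base_url
    env["MODEL"] = model
    env["API_KEY"] = api_key


# Example values taken from the descriptions above. Reading the key from
# HF_TOKEN is an assumption about where you keep it.
# configure_generation(
#     base_url="https://api-inference.huggingface.co/v1/",
#     model="meta-llama/Meta-Llama-3.1-8B-Instruct",
#     api_key=os.environ["HF_TOKEN"],
# )
# from synthetic_dataset_generator.app import demo
# demo.launch()
```

Setting the same variables in your shell before running the app should have the same effect; the `env` parameter exists only so the helper can be tried against a plain dictionary.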
Optionally, you can also push your datasets to Argilla for further curation by setting the following environment variables:
- ARGILLA_API_KEY: Your Argilla API key to push your datasets to Argilla.
- ARGILLA_API_URL: Your Argilla API URL to push your datasets to Argilla.
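Before launching, it can help to check which of these variables are actually set. The following is a rough pre-flight sketch; the function name `enabled_features` and the grouping into `hub`, `custom_api`, and `argilla` are assumptions made for illustration, and the package itself may handle partial configuration differently.

```python
import os
from typing import Dict, Mapping


def enabled_features(env: Mapping[str, str] = os.environ) -> Dict[str, bool]:
    """Report which optional integrations have all of their variables set.

    Hypothetical helper for illustration; not part of the package.
    """

    def all_set(*names: str) -> bool:
        # A variable counts as set only if it is present and not blank.
        return all(env.get(name, "").strip() for name in names)

    return {
        "hub": all_set("HF_TOKEN"),
        "custom_api": all_set("BASE_URL", "MODEL", "API_KEY"),
        "argilla": all_set("ARGILLA_API_KEY", "ARGILLA_API_URL"),
    }
```

Calling `enabled_features()` with no arguments inspects the current process environment; passing a dictionary lets you check a planned configuration without changing anything.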
Argilla integration
Argilla is an open-source tool for data curation. It allows you to annotate and review datasets, and push curated datasets to the Hugging Face Hub. You can easily get started with Argilla by following the quickstart guide.
Custom synthetic data generation?
Each pipeline is based on distilabel, so you can easily change the LLM or the pipeline steps.
Check out the distilabel library for more information.
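As a rough idea of what a custom pipeline can look like, here is a minimal two-step distilabel sketch that loads one prompt and generates a completion. It is not the generator's own pipeline: the pipeline name, the prompt, and the function `build_pipeline` are hypothetical, and the import path for `InferenceEndpointsLLM` is assumed to differ between distilabel releases, so both are tried.

```python
def build_pipeline(model_id: str = "meta-llama/Meta-Llama-3.1-8B-Instruct"):
    """Build a two-step distilabel pipeline: load prompts, then generate.

    Illustrative sketch only; not the pipeline used by the app.
    """
    # Imported lazily so this function can be defined without distilabel installed.
    from distilabel.pipeline import Pipeline
    from distilabel.steps import LoadDataFromDicts
    from distilabel.steps.tasks import TextGeneration

    try:
        from distilabel.models import InferenceEndpointsLLM  # newer releases
    except ImportError:
        from distilabel.llms import InferenceEndpointsLLM  # older 1.x releases

    with Pipeline(name="custom-sdg-sketch") as pipeline:
        # TextGeneration reads its prompt from the "instruction" column.
        load = LoadDataFromDicts(
            data=[{"instruction": "Write a one-sentence product review."}]
        )
        # Swap this LLM for another distilabel LLM class to change the backend.
        generate = TextGeneration(llm=InferenceEndpointsLLM(model_id=model_id))
        load >> generate
    return pipeline
```

Running it would look like `distiset = build_pipeline().run(use_cache=False)`, which needs distilabel installed and a valid Hugging Face token in the environment.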
Development
Install the dependencies:
python -m venv .venv
source .venv/bin/activate
pip install -e .
Run the app:
python app.py