A Python package for synthetic text dataset generation
Project description
Generate text datasets for LLMs in minutes, not weeks.
Intended use cases
- Get initial evaluation text data instead of starting your LLM project blind.
- Increase diversity and coverage of an existing dataset by generating more data.
- Quickly experiment with and test LLM-based application PoCs.
- Make your own datasets to fine-tune and evaluate language models for your application.
🌟 Star this repo if you find this useful!
Supported Dataset Types
- ✅ Text Classification Dataset
- ✅ Raw Text Generation Dataset
- ✅ Instruction Dataset (Ultrachat-like)
- ✅ Multiple Choice Question (MCQ) Dataset
- ✅ Preference Dataset
- ⏳ more to come...
Supported LLM Providers
Currently we support the following LLM providers:
- ✔︎ OpenAI
- ✔︎ Anthropic
- ✔︎ Google Gemini
- ✔︎ Ollama (local LLM server)
- ⏳ more to come...
Installation
pip install datafast
Quick Start
1. Environment Setup
Create a secrets.env file containing your API keys.
An HF token is needed only if you want to push the dataset to your Hugging Face Hub account.
The other keys depend on which LLM providers you use.
GEMINI_API_KEY=XXXX
OPENAI_API_KEY=sk-XXXX
ANTHROPIC_API_KEY=sk-ant-XXXXX
HF_TOKEN=hf_XXXXX
2. Import Dependencies
from datafast.datasets import ClassificationDataset
from datafast.schema.config import ClassificationDatasetConfig, PromptExpansionConfig
from datafast.llms import OpenAIProvider, AnthropicProvider, GeminiProvider
from dotenv import load_dotenv
# Load environment variables
load_dotenv("secrets.env") # <--- your API keys
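After loading the file, it can help to confirm that the keys you need are actually set before any generation starts. The following is a minimal sketch using only the standard library; `missing_keys` is a hypothetical helper for illustration, not part of datafast, and the list of required keys depends on the providers you plan to use.

```python
import os


def missing_keys(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    # Default to the process environment, which load_dotenv() populates.
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Example: keys for an OpenAI-only run that also pushes to the Hub.
absent = missing_keys(["OPENAI_API_KEY", "HF_TOKEN"])
if absent:
    print("Missing keys:", ", ".join(absent))
```

Failing early on a missing key avoids discovering the problem partway through a generation run.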
3. Configure Dataset
# Configure the dataset for text classification
config = ClassificationDatasetConfig(
    classes=[
        {"name": "positive", "description": "Text expressing positive emotions or approval"},
        {"name": "negative", "description": "Text expressing negative emotions or criticism"}
    ],
    num_samples_per_prompt=5,
    output_file="outdoor_activities_sentiments.jsonl",
    languages={
        "en": "English",
        "fr": "French"
    },
    prompts=[
        (
            "Generate {num_samples} reviews in {language_name} which are diverse "
            "and representative of a '{label_name}' sentiment class. "
            "{label_description}. The reviews should be {{style}} and in the "
            "context of {{context}}."
        )
    ],
    expansion=PromptExpansionConfig(
        placeholders={
            "context": ["hike review", "speedboat tour review", "outdoor climbing experience"],
            "style": ["brief", "detailed"]
        },
        combinatorial=True
    )
)
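To make the effect of `combinatorial=True` concrete, the sketch below expands the double-brace placeholders over every combination of values. It illustrates the idea only and is not datafast's actual implementation; `expand_prompt` is a hypothetical helper, and the plain string replacement is an assumption about how the `{{...}}` placeholders are filled.

```python
from itertools import product


def expand_prompt(template, placeholders):
    """Return one prompt per combination of placeholder values."""
    keys = list(placeholders)
    expanded = []
    # product() yields every combination, one value per placeholder key.
    for values in product(*(placeholders[key] for key in keys)):
        prompt = template
        for key, value in zip(keys, values):
            prompt = prompt.replace("{{" + key + "}}", value)
        expanded.append(prompt)
    return expanded


template = "The reviews should be {{style}} and in the context of {{context}}."
placeholders = {
    "context": ["hike review", "speedboat tour review", "outdoor climbing experience"],
    "style": ["brief", "detailed"],
}
prompts = expand_prompt(template, placeholders)
print(len(prompts))  # 3 contexts x 2 styles = 6 prompts
```

With three contexts and two styles, a single prompt template yields six distinct prompts.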
4. Setup LLM Providers
# Create LLM providers
providers = [
    OpenAIProvider(model_id="gpt-4.1-mini-2025-04-14"),
    AnthropicProvider(model_id="claude-3-5-haiku-latest"),
    GeminiProvider(model_id="gemini-2.0-flash")
]
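Before generating, a rough estimate of the dataset size and the number of LLM requests can be useful for budgeting. This is a back-of-the-envelope sketch, assuming (not confirmed by the library) that each provider is called once per class, language, and expanded prompt, and that each call returns `num_samples_per_prompt` rows. `estimate_rows` is a hypothetical helper, not part of datafast.

```python
def estimate_rows(num_classes, num_languages, num_prompts,
                  num_expansions, num_providers, samples_per_prompt):
    """Return (llm_calls, rows) under the one-call-per-combination assumption."""
    calls = (num_classes * num_languages * num_prompts
             * num_expansions * num_providers)
    return calls, calls * samples_per_prompt


# Values from the quick-start configuration: 2 classes, 2 languages,
# 1 prompt template, 3 contexts x 2 styles = 6 expansions, 3 providers.
calls, rows = estimate_rows(
    num_classes=2,
    num_languages=2,
    num_prompts=1,
    num_expansions=6,
    num_providers=3,
    samples_per_prompt=5,
)
print(calls, rows)  # 72 calls, 360 rows
```

The real count may differ if a provider returns fewer samples than requested or if a request fails.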
5. Generate and Push Dataset
# Generate the dataset and save it locally
dataset = ClassificationDataset(config)
dataset.generate(providers)
# Optional: Push to Hugging Face Hub
dataset.push_to_hub(
    repo_id="YOUR_USERNAME/YOUR_DATASET_NAME",
    train_size=0.6
)
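Once generation finishes, the local file named in `output_file` can be inspected with the standard library alone. The sketch below assumes only that the file is in JSON Lines format (one JSON object per line), as the `.jsonl` extension suggests; it makes no assumption about the field names datafast writes. `load_jsonl` is a hypothetical helper.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    rows = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows


output_path = Path("outdoor_activities_sentiments.jsonl")
if output_path.exists():
    rows = load_jsonl(output_path)
    print(f"{len(rows)} rows generated")
    if rows:
        # Show which fields the first record contains.
        print(sorted(rows[0].keys()))
```

Checking the row count and the fields of the first record is a quick sanity check before pushing to the Hub.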
Next Steps
Check out our guides for different dataset types:
- How to Generate a Text Classification Dataset
- How to Create a Raw Text Dataset
- How to Create a Preference Dataset
- How to Create a Multiple Choice Question (MCQ) Dataset
- How to Create an Instruction (Ultrachat) Dataset
- Star and watch this GitHub repo to get updates 🌟
Key Features
- Easy-to-use and simple interface 🚀
- Multilingual dataset generation 🌍
- Multiple LLMs used to boost dataset diversity 🤖
- Flexible prompt: use our default prompts or provide your own custom prompts 📝
- Prompt expansion: Combinatorial variation of prompts to maximize diversity 🔄
- Hugging Face Integration: Push generated datasets to the Hub 🤗
[!WARNING] This library is in its early stages of development and might change significantly.
Roadmap:
- RAG datasets
- Personas
- Seeds
- More types of instruction datasets (not just Ultrachat)
- More LLM providers
- Deduplication, filtering
- Dataset card generation
Creator
Made with ❤️ by Patrick Fleith.
This is volunteer work; star this repo to show your support! 🙏
Project Details
- Status: Work in Progress (APIs may change)
- License: Apache 2.0
File details
Details for the file datafast-0.0.20.tar.gz.
File metadata
- Download URL: datafast-0.0.20.tar.gz
- Upload date:
- Size: 53.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 024df92e17bd5cd1260d82fe1aa1b43b783eb988f55b8c6e3e19e7cd5691d448 |
| MD5 | 263cbfa59bb3914ce9c1d039c2b06756 |
| BLAKE2b-256 | 8277bb051b051ec62c04563f8f2d33dc5e1feeb053813d91029e6eb48b35f57d |
File details
Details for the file datafast-0.0.20-py3-none-any.whl.
File metadata
- Download URL: datafast-0.0.20-py3-none-any.whl
- Upload date:
- Size: 56.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a419c9e2f998af9028105d7c4e9d8d25c715128957fd495abd896646790bd43f |
| MD5 | 1da26460c34d4aed1dcb13ebf82b4eb6 |
| BLAKE2b-256 | 8d38131018e483b67ea3e3f7f98a7c2ca85d72ecc5d3948234d8f2cf770f4f87 |