A Python package for synthetic text dataset generation

Generate text datasets for LLMs in minutes, not weeks.

Intended use cases

  • Get initial evaluation text data instead of starting your LLM project blind.
  • Increase diversity and coverage of an existing dataset by generating more data.
  • Quickly experiment with and test LLM-based application PoCs.
  • Make your own datasets to fine-tune and evaluate language models for your application.

🌟 Star this repo if you find this useful!

Supported Dataset Types

  • ✅ Text Classification Dataset
  • ✅ Raw Text Generation Dataset
  • ✅ Instruction Dataset (Ultrachat-like)
  • ✅ Multiple Choice Question (MCQ) Dataset
  • ✅ Preference Dataset
  • ⏳ more to come...

Supported LLM Providers

Currently we support the following LLM providers:

  • ✔︎ OpenAI
  • ✔︎ Anthropic
  • ✔︎ Google Gemini
  • ✔︎ Ollama (local LLM server)
  • ✔︎ Mistral AI
  • ⏳ more to come...

Installation

pip install datafast

Quick Start

1. Environment Setup

Create a .env file containing your API keys. HF_TOKEN is needed only if you want to push the dataset to your Hugging Face Hub account. Which of the other keys you need depends on the LLM providers you use.

GEMINI_API_KEY=XXXX
OPENAI_API_KEY=sk-XXXX
ANTHROPIC_API_KEY=sk-ant-XXXXX
MISTRAL_API_KEY=XXXX
HF_TOKEN=hf_XXXXX
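
A missing key usually surfaces only when the first provider call fails, so it can help to check the environment up front. The snippet below is a minimal sketch using only the standard library; the helper name missing_keys and the REQUIRED_KEYS list are hypothetical and not part of datafast. Run it after load_dotenv() so that values from .env are visible.

```python
import os

# Hypothetical list: keep only the keys for the providers you actually use.
REQUIRED_KEYS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GEMINI_API_KEY"]


def missing_keys(names, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_keys(REQUIRED_KEYS)
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required keys are set.")
```

Passing a plain dict as environ makes the helper easy to test without touching the real environment.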

2. Import Dependencies

from datafast.datasets import ClassificationDataset
from datafast.schema.config import ClassificationDatasetConfig, PromptExpansionConfig
from datafast.llms import OpenAIProvider, AnthropicProvider, GeminiProvider
from dotenv import load_dotenv

# Load environment variables
load_dotenv() # <--- your API keys

3. Configure Dataset

# Configure the dataset for text classification
config = ClassificationDatasetConfig(
    classes=[
        {"name": "positive", "description": "Text expressing positive emotions or approval"},
        {"name": "negative", "description": "Text expressing negative emotions or criticism"}
    ],
    num_samples_per_prompt=5,
    output_file="outdoor_activities_sentiments.jsonl",
    languages={
        "en": "English", 
        "fr": "French"
    },
    prompts=[
        (
            "Generate {num_samples} reviews in {language_name} which are diverse "
            "and representative of a '{label_name}' sentiment class. "
            "{label_description}. The reviews should be {{style}} and in the "
            "context of {{context}}."
        )
    ],
    expansion=PromptExpansionConfig(
        placeholders={
            "context": ["hike review", "speedboat tour review", "outdoor climbing experience"],
            "style": ["brief", "detailed"]
        },
        combinatorial=True
    )
)
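
To get a feel for what combinatorial expansion produces, here is a minimal standalone sketch that does not use datafast at all. It assumes that double-braced placeholders such as {{style}} are filled from the expansion config, and that single-braced fields such as {label_name} are filled per class and language. The function names and the way the two passes are ordered are illustrative assumptions, not the library's implementation.

```python
from itertools import product

TEMPLATE = (
    "Generate {num_samples} reviews in {language_name} which are diverse "
    "and representative of a '{label_name}' sentiment class. "
    "{label_description}. The reviews should be {{style}} and in the "
    "context of {{context}}."
)

PLACEHOLDERS = {
    "context": ["hike review", "speedboat tour review", "outdoor climbing experience"],
    "style": ["brief", "detailed"],
}

CLASSES = {
    "positive": "Text expressing positive emotions or approval",
    "negative": "Text expressing negative emotions or criticism",
}

LANGUAGES = {"en": "English", "fr": "French"}


def expand_placeholders(template, placeholders):
    """Return one template per combination of placeholder values."""
    keys = list(placeholders)
    expanded = []
    for combo in product(*(placeholders[key] for key in keys)):
        text = template
        for key, value in zip(keys, combo):
            # Replace the double-braced marker, e.g. "{{style}}" becomes "brief".
            text = text.replace("{{" + key + "}}", value)
        expanded.append(text)
    return expanded


# 3 contexts x 2 styles = 6 expanded templates.
expanded = expand_placeholders(TEMPLATE, PLACEHOLDERS)

# Fill the remaining single-braced fields for every class and language.
prompts = [
    template.format(
        num_samples=5,
        language_name=language_name,
        label_name=label_name,
        label_description=label_description,
    )
    for template in expanded
    for label_name, label_description in CLASSES.items()
    for language_name in LANGUAGES.values()
]

print(len(expanded), len(prompts))
print(prompts[0])
```

Under these assumptions, one prompt template with 3 contexts and 2 styles yields 6 variants, and 2 classes in 2 languages bring that to 24 fully filled prompts.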

4. Setup LLM Providers

# Create LLM providers
providers = [
    OpenAIProvider(model_id="gpt-5-mini-2025-08-07"),
    AnthropicProvider(model_id="claude-haiku-4-5-20251001"),
    GeminiProvider(model_id="gemini-2.0-flash")
]

5. Generate and Push Dataset

# Generate the dataset and save it locally
dataset = ClassificationDataset(config)
dataset.generate(providers)

# Optional: Push to Hugging Face Hub
dataset.push_to_hub(
    repo_id="YOUR_USERNAME/YOUR_DATASET_NAME",
    train_size=0.6
)
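
As the parameter name suggests, train_size=0.6 presumably sends 60% of the generated rows to the training split and the rest to the test split. The sketch below illustrates that idea with the standard library only. It is an assumption about the semantics, not the logic push_to_hub actually runs, and split_rows is a hypothetical helper.

```python
import random


def split_rows(rows, train_size=0.6, seed=42):
    """Shuffle rows reproducibly and split them into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_size)
    return shuffled[:cut], shuffled[cut:]


# Stand-in rows; a real dataset would hold generated text and labels.
rows = [{"id": i} for i in range(10)]
train, test = split_rows(rows, train_size=0.6)
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible across runs, which matters when the test split is used for evaluation.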

Next Steps

Check out our guides for different dataset types:

Key Features

  • Easy-to-use and simple interface 🚀
  • Multilingual dataset generation 🌍
  • Multiple LLMs used to boost dataset diversity 🤖
  • Flexible prompts: use our default prompts or provide your own custom prompts 📝
  • Prompt expansion: Combinatorial variation of prompts to maximize diversity 🔄
  • Hugging Face Integration: Push generated datasets to the Hub 🤗

WARNING: This library is in its early stages of development and might change significantly.

Contributing

Contributions are welcome! If you are new to the project, pick an issue labelled "good first issue".

How to proceed?

  1. Pick an issue
  2. Comment on the issue to let others know you are working on it
  3. Fork the repository
  4. Clone your fork locally
  5. Create a new branch and give it a name like feature/my-awesome-feature
  6. Make your changes
  7. If you feel like it, write a few tests for your changes
  8. To run the current tests, run pytest in the root directory. You can ignore UserWarning: Pydantic serializer warnings. Note that for the LLM tests to run successfully you'll need:
     • an OpenAI API key
     • an Anthropic API key
     • a Gemini API key
     • a Mistral API key
     • an Ollama server running (use ollama serve from the command line)
  9. Commit your changes, push to your fork, and create a pull request from your fork's branch to the datafast main branch.
  10. Explain your pull request clearly and concisely. I'll review it as soon as possible.

Roadmap:

  • RAG datasets
  • Personas
  • Seeds
  • More types of instruction datasets (not just Ultrachat-like)
  • More LLM providers
  • Deduplication, filtering
  • Dataset cards generation

Creator

Made with ❤️ by Patrick Fleith.


This is volunteer work. Star this repo to show your support! 🙏

Project Details

  • Status: Work in Progress (APIs may change)
  • License: Apache 2.0
