
Data4AI 🚀

Production-ready AI-powered dataset generation for instruction tuning and model fine-tuning

License: MIT · Python 3.9+

Generate high-quality synthetic datasets using state-of-the-art language models through OpenRouter API. Perfect for creating training data for LLM fine-tuning.

✨ Key Features

  • 🤖 100+ AI Models - Access to GPT-4, Claude, Llama, and more via OpenRouter
  • 📊 Multiple Formats - Support for ChatML (default) and Alpaca schemas
  • 🔮 DSPy Integration - Dynamic prompt optimization for better quality
  • 📄 Document Support - Generate datasets from PDFs, Word docs, Markdown, and text files
  • 🎯 Quality Features - Optional Bloom's taxonomy, provenance tracking, and quality verification
  • 🤖 Smart Generation - Both prompt-based and document-based dataset creation
  • ☁️ HuggingFace Hub - Direct dataset publishing
  • Production Ready - Rate limiting, checkpointing, deduplication

🚀 Quick Start

Installation

pip install data4ai               # Standard installation
pip install "data4ai[all]"        # With all optional extras (quoted for zsh)

Set Up Environment Variables

Data4AI requires environment variables to be set in your terminal:

Option 1: Quick Setup (Current Session)

# Get your API key from https://openrouter.ai/keys
export OPENROUTER_API_KEY="your_key_here"

# Optional: Set a specific model (default: openai/gpt-4o-mini)
export OPENROUTER_MODEL="anthropic/claude-3-5-sonnet"  # Or another model

# Optional: Set default dataset schema (default: chatml)
export DEFAULT_SCHEMA="chatml"  # Options: chatml, alpaca

# Optional: For publishing to HuggingFace
export HF_TOKEN="your_huggingface_token"

Option 2: Using .env File

# Create a .env file in your project directory
echo 'OPENROUTER_API_KEY=your_key_here' > .env
# The tool will automatically load from .env

Option 3: Permanent Setup

# Add to your shell config (~/.bashrc, ~/.zshrc, or ~/.profile)
echo 'export OPENROUTER_API_KEY="your_key_here"' >> ~/.bashrc
source ~/.bashrc

Check Your Setup

# Verify environment variables are set
echo "OPENROUTER_API_KEY: ${OPENROUTER_API_KEY:0:10}..." # Shows first 10 chars
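The same check works from Python when you are scripting around the tool. This is a standalone sketch, not part of the data4ai API; the helper name and 10-character preview are our own choices:

```python
import os

def check_env(name: str, preview: int = 10) -> str:
    """Return a masked preview of an environment variable, or a warning if unset."""
    value = os.environ.get(name)
    if not value:
        return f"{name} is not set"
    return f"{name}: {value[:preview]}..."

if __name__ == "__main__":
    for var in ("OPENROUTER_API_KEY", "OPENROUTER_MODEL", "HF_TOKEN"):
        print(check_env(var))
```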

Generate Your First Dataset

# Generate from description
data4ai prompt \
  --repo my-dataset \
  --description "Create 10 Python programming questions with answers" \
  --count 10

# View results
cat my-dataset/data.jsonl
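Each line of data.jsonl is one JSON record, so the output can be inspected with a few lines of Python. A minimal stdlib-only reader (the helper name is ours, not part of data4ai):

```python
import json
from pathlib import Path

def load_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    records = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if line.strip():
            records.append(json.loads(line))
    return records

# Example: count examples and peek at the first one
# rows = load_jsonl("my-dataset/data.jsonl")
# print(len(rows), rows[0])
```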

📚 Common Use Cases

1. Generate from Natural Language

data4ai prompt \
  --repo customer-support \
  --description "Create customer support Q&A for a SaaS product" \
  --count 100

2. Generate from Documents

# From single PDF document
data4ai doc research-paper.pdf \
  --repo paper-qa \
  --type qa \
  --count 100

# From entire folder of documents
data4ai doc /path/to/docs/folder \
  --repo multi-doc-dataset \
  --type qa \
  --count 500 \
  --recursive

# Process only specific file types in folder
data4ai doc /path/to/docs \
  --repo pdf-only-dataset \
  --file-types pdf \
  --count 200

# From Word document with summaries
data4ai doc manual.docx \
  --repo manual-summaries \
  --type summary \
  --count 50

# From Markdown with advanced extraction
data4ai doc README.md \
  --repo docs-dataset \
  --type instruction \
  --advanced

# Generate with optional quality features:
#   --taxonomy balanced   use Bloom's taxonomy for diverse questions
#   --provenance          include source references
#   --verify              verify quality (2x API calls)
#   --long-context        merge chunks for better coherence
data4ai doc document.pdf \
  --repo high-quality-dataset \
  --count 200 \
  --taxonomy balanced \
  --provenance \
  --verify \
  --long-context

3. High-Quality Generation

# Basic generation (simple and fast)
data4ai doc document.pdf --repo basic-dataset --count 100

# With cognitive diversity using Bloom's Taxonomy
data4ai doc document.pdf \
  --repo taxonomy-dataset \
  --count 100 \
  --taxonomy balanced  # Creates questions at all cognitive levels

# With source tracking for verifiable datasets
data4ai doc research-papers/ \
  --repo cited-dataset \
  --count 500 \
  --provenance  # Includes character offsets for each answer

# Full quality mode for production datasets:
#   --chunk-tokens 250    token-based chunking
#   --taxonomy balanced   cognitive diversity
#   --provenance          source tracking
#   --verify              quality verification
#   --long-context        optimized context usage
data4ai doc documents/ \
  --repo production-dataset \
  --count 1000 \
  --chunk-tokens 250 \
  --taxonomy balanced \
  --provenance \
  --verify \
  --long-context

4. Publish to HuggingFace

# Generate and publish
data4ai prompt \
  --repo my-public-dataset \
  --description "Educational content about machine learning" \
  --count 200 \
  --huggingface

📚 Available Commands

data4ai prompt

Generate a dataset from a natural-language description using AI.

data4ai prompt --repo <name> --description <text> [options]

data4ai doc

Generate a dataset from one or more documents (PDF, DOCX, MD, and TXT).

data4ai doc <file_or_folder> --repo <name> [options]

data4ai push

Upload an existing dataset to the HuggingFace Hub.

data4ai push --repo <name> [options]

🐍 Python API

from data4ai import generate_from_description, generate_from_document

# Generate from description (uses ChatML by default)
result = generate_from_description(
    description="Create Python interview questions",
    repo="python-interviews",
    count=50,
    schema="chatml"  # Optional, ChatML is default
)

# Generate from document with quality features
result = generate_from_document(
    document_path="research-paper.pdf",
    repo="paper-qa",
    extraction_type="qa",
    count=100,
    taxonomy="balanced",      # Optional: Bloom's taxonomy
    include_provenance=True,   # Optional: Source tracking
    verify_quality=True        # Optional: Quality verification
)

print(f"Generated {result['row_count']} examples")

📋 Supported Schemas

ChatML (Default - OpenAI format)

{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is machine learning?"},
    {"role": "assistant", "content": "Machine learning is..."}
  ]
}

Alpaca (Instruction tuning)

{
  "instruction": "What is machine learning?",
  "input": "Explain in simple terms",
  "output": "Machine learning is..."
}
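The two schemas carry the same information, so converting between them is mechanical. A sketch of Alpaca-to-ChatML conversion (the function and default system prompt are our own illustration, not part of data4ai):

```python
def alpaca_to_chatml(record, system_prompt="You are a helpful assistant."):
    """Convert one Alpaca-style record into the equivalent ChatML structure."""
    user_content = record["instruction"]
    if record.get("input"):
        # Alpaca's optional "input" becomes extra context in the user turn
        user_content += "\n\n" + record["input"]
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": record["output"]},
        ]
    }
```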

🎯 Quality Features (Optional)

All quality features are optional; enable them when you need higher-quality datasets:

| Feature              | Flag                  | Description                            | Performance impact   |
|----------------------|-----------------------|----------------------------------------|----------------------|
| Token Chunking       | `--chunk-tokens N`    | Use token count instead of characters  | Minimal              |
| Bloom's Taxonomy     | `--taxonomy balanced` | Create cognitively diverse questions   | None                 |
| Provenance           | `--provenance`        | Include source references              | Minimal              |
| Quality Verification | `--verify`            | Verify and improve examples            | 2x API calls         |
| Long Context         | `--long-context`      | Merge chunks for coherence             | May reduce API calls |

When to Use Quality Features

  • Quick Prototyping: No features needed - fast and simple
  • Production Datasets: Use --taxonomy and --verify
  • Academic/Research: Use all features for maximum quality
  • Citation Required: Always use --provenance

⚙️ Configuration

Create .env file:

OPENROUTER_API_KEY=your_key_here
OPENROUTER_MODEL=openai/gpt-4o-mini  # Optional (this is the default)
DEFAULT_SCHEMA=chatml                # Optional (this is the default)
HF_TOKEN=your_huggingface_token      # For publishing
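For reference, a .env file of this shape can be parsed in a few lines of stdlib Python. This is a simplified illustration of what dotenv-style loaders do, not data4ai's actual loader:

```python
import os
from pathlib import Path

def load_env_file(path=".env"):
    """Parse KEY=VALUE lines from a .env file; '#' starts a comment."""
    env = {}
    p = Path(path)
    if not p.exists():
        return env
    for raw in p.read_text().splitlines():
        line = raw.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip("'\"")
    return env

# A setdefault-style merge keeps variables already exported in the shell authoritative:
# for k, v in load_env_file().items():
#     os.environ.setdefault(k, v)
```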

📖 Documentation

🛠️ Development

# Clone repository
git clone https://github.com/zysec/data4ai.git
cd data4ai

# Install for development
pip install -e ".[dev]"

# Run tests
pytest

# Check code quality
ruff check .
black --check .

🤝 Contributing

Contributions welcome! Please check our Contributing Guide.

📄 License

MIT License - see LICENSE file.

🔗 Links


Made with ❤️ by ZySec AI

Download Files (data4ai 0.2.3)

Source Distribution

  • data4ai-0.2.3.tar.gz (99.4 kB, uploaded via twine/6.1.0, CPython/3.11.12)
    SHA256: 7087c41201d45f23e6c86289c9f77235acfed4b139df86e369b5b3847bf2f88d
    MD5: 4a099d7a43f07fb1856f0e383939f48d
    BLAKE2b-256: 1edd0d55fde6b5cdaf2bd543adc9f85e660ae6f3969ad4a3d4fa23ad7a50d07a

Built Distribution

  • data4ai-0.2.3-py3-none-any.whl (76.5 kB, uploaded via twine/6.1.0, CPython/3.11.12)
    SHA256: 812ec5a702be69f6bd8aa6209698a2a7e64cdb9d4ef2b9977c05e5739afb0c55
    MD5: e0edd607abbb79db83cccce971d43911
    BLAKE2b-256: 9c69ef23c069e1b5758815d073fdbd80b610da1dbc33d9a09a94341961ca4db2
