
Data4AI 🚀

AI-powered dataset generation for instruction tuning and model fine-tuning

License: MIT | Python 3.9+

Generate high-quality synthetic datasets using state-of-the-art language models through OpenRouter API. Perfect for creating training data for LLM fine-tuning.

✨ Key Features

  • 🤖 100+ AI Models - Access to GPT-4, Claude, Llama, and more via OpenRouter
  • 📊 Multiple Formats - Support for ChatML (default), Alpaca, Dolly, ShareGPT schemas
  • 🔮 DSPy Integration - Dynamic prompt optimization for better quality
  • 📄 Document Support - Generate datasets from PDFs, Word docs, Markdown, and text files
  • 🎯 Quality Features - Optional Bloom's taxonomy, provenance tracking, and quality verification
  • 🤖 Smart Generation - Both prompt-based and document-based dataset creation
  • ☁️ HuggingFace Hub - Direct dataset publishing
  • Production Ready - Rate limiting, checkpointing, deduplication
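Deduplication in this context typically means dropping repeated examples before they reach your training set. A minimal sketch of the idea, hashing a normalized form of each record (illustrative only, not Data4AI's actual implementation):

```python
import hashlib
import json

def dedupe(examples):
    """Drop examples whose normalized JSON content has already been seen."""
    seen = set()
    unique = []
    for ex in examples:
        # Normalize: stable key order, lowercased, so trivial variants collide
        normalized = json.dumps(ex, sort_keys=True).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(ex)
    return unique

rows = [
    {"instruction": "What is ML?", "output": "A field of AI."},
    {"instruction": "What is ML?", "output": "A field of AI."},  # exact duplicate
    {"instruction": "Define AI.", "output": "Machine intelligence."},
]
print(len(dedupe(rows)))  # 2
```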

🚀 Quick Start

Installation

pip install data4ai              # Standard install
pip install "data4ai[all]"       # With all optional extras

Set Up Environment Variables

Data4AI requires environment variables to be set in your terminal:

Option 1: Quick Setup (Current Session)

# Get your API key from https://openrouter.ai/keys
export OPENROUTER_API_KEY="your_key_here"

# Optional: Set a specific model (default: openai/gpt-4o-mini)
export OPENROUTER_MODEL="anthropic/claude-3-5-sonnet"  # Or another model

# Optional: Set default dataset schema (default: chatml)
export DEFAULT_SCHEMA="chatml"  # Options: chatml, alpaca, dolly, sharegpt

# Optional: For publishing to HuggingFace
export HF_TOKEN="your_huggingface_token"

Option 2: Interactive Setup

# Use our setup helper
source setup_env.sh

Option 3: Permanent Setup

# Add to your shell config (~/.bashrc, ~/.zshrc, or ~/.profile)
echo 'export OPENROUTER_API_KEY="your_key_here"' >> ~/.bashrc
source ~/.bashrc

Check Your Setup

# Verify environment variables are set
data4ai env --check

Generate Your First Dataset

# Generate from description
data4ai prompt \
  --repo my-dataset \
  --description "Create 10 Python programming questions with answers" \
  --count 10

# View results
cat my-dataset/data.jsonl
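The output is standard JSON Lines (one JSON object per line), so it is easy to inspect from Python as well. A small loader sketch, using a throwaway file in place of the `my-dataset/data.jsonl` shown above:

```python
import json
import tempfile
from pathlib import Path

def load_jsonl(path):
    """Read one JSON object per line, skipping blank lines."""
    rows = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if line.strip():
            rows.append(json.loads(line))
    return rows

# Demo with a temp file standing in for my-dataset/data.jsonl
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    f.write('{"messages": [{"role": "user", "content": "What is ML?"}]}\n')
    path = f.name

rows = load_jsonl(path)
print(rows[0]["messages"][0]["role"])  # user
```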

📚 Common Use Cases

1. Generate from Natural Language

data4ai prompt \
  --repo customer-support \
  --description "Create customer support Q&A for a SaaS product" \
  --count 100

2. Generate from Documents

# From single PDF document
data4ai doc-to-dataset research-paper.pdf \
  --repo paper-qa \
  --type qa \
  --count 100

# From entire folder of documents
data4ai doc-to-dataset /path/to/docs/folder \
  --repo multi-doc-dataset \
  --type qa \
  --count 500 \
  --recursive

# Process only specific file types in folder
data4ai doc-to-dataset /path/to/docs \
  --repo pdf-only-dataset \
  --file-types pdf \
  --count 200

# From Word document with summaries
data4ai doc-to-dataset manual.docx \
  --repo manual-summaries \
  --type summary \
  --count 50

# From Markdown with advanced extraction
data4ai doc-to-dataset README.md \
  --repo docs-dataset \
  --type instruction \
  --advanced

# Convert PDFs to Markdown for better processing
data4ai pdf-to-markdown /path/to/pdfs --recursive

# Generate with optional quality features:
#   --taxonomy balanced   Use Bloom's taxonomy for diverse questions
#   --provenance          Include source references
#   --verify              Verify quality (2x API calls)
#   --long-context        Merge chunks for better coherence
data4ai doc-to-dataset document.pdf \
  --repo high-quality-dataset \
  --count 200 \
  --taxonomy balanced \
  --provenance \
  --verify \
  --long-context

3. Advanced DSPy Plan→Generate Pipeline (New!)

Use the new budget-based generation for superior quality:

# Smart generation with token budget
data4ai doc-plan-generate document.pdf \
  --repo smart-dataset \
  --token-budget 10000 \
  --taxonomy balanced \
  --difficulty balanced

# Preview the plan first
data4ai doc-plan-generate research-paper.pdf \
  --repo research-qa \
  --token-budget 5000 \
  --dry-run

# With custom constraints
data4ai doc-plan-generate documents/ \
  --repo advanced-dataset \
  --token-budget 20000 \
  --min-examples 50 \
  --max-examples 200 \
  --taxonomy advanced    # Focus on higher-order thinking

This new pipeline:

  • 🧠 Analyzes the entire document first
  • 📊 Creates an intelligent generation plan
  • 💰 Uses token budget instead of fixed counts
  • 🎯 Dynamically allocates examples to important sections
  • 🔬 Ensures Bloom's taxonomy coverage
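The budget idea can be illustrated with a toy allocator that splits a token budget across document sections in proportion to an importance score. The section names, weights, and per-example token cost below are made up for the example; the real planner is more sophisticated:

```python
def allocate_examples(sections, token_budget, tokens_per_example=100):
    """Distribute a fixed example budget across sections by importance weight."""
    total_examples = token_budget // tokens_per_example
    total_weight = sum(weight for _, weight in sections)
    plan = {}
    for name, weight in sections:
        # More important sections get proportionally more examples
        plan[name] = round(total_examples * weight / total_weight)
    return plan

sections = [("introduction", 1.0), ("methods", 3.0), ("results", 2.0)]
plan = allocate_examples(sections, token_budget=6000)
print(plan)  # {'introduction': 10, 'methods': 30, 'results': 20}
```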

4. Traditional High-Quality Generation

# Basic generation (simple and fast)
data4ai doc-to-dataset document.pdf --repo basic-dataset --count 100

# With cognitive diversity using Bloom's Taxonomy
data4ai doc-to-dataset document.pdf \
  --repo taxonomy-dataset \
  --count 100 \
  --taxonomy balanced  # Creates questions at all cognitive levels

# With source tracking for verifiable datasets
data4ai doc-to-dataset research-papers/ \
  --repo cited-dataset \
  --count 500 \
  --provenance  # Includes character offsets for each answer

# Full quality mode for production datasets:
#   --chunk-tokens 250   Token-based chunking
#   --taxonomy balanced  Cognitive diversity
#   --provenance         Source tracking
#   --verify             Quality verification
#   --long-context       Optimized context usage
data4ai doc-to-dataset documents/ \
  --repo production-dataset \
  --count 1000 \
  --chunk-tokens 250 \
  --taxonomy balanced \
  --provenance \
  --verify \
  --long-context
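Token-based chunking keeps chunk sizes consistent across documents regardless of formatting. A rough sketch of the idea, using whitespace-separated words as a stand-in for a real tokenizer:

```python
def chunk_by_tokens(text, max_tokens=250):
    """Split text into chunks of at most max_tokens whitespace-separated tokens."""
    tokens = text.split()
    return [
        " ".join(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]

doc = "word " * 600  # a 600-token toy document
chunks = chunk_by_tokens(doc, max_tokens=250)
print([len(c.split()) for c in chunks])  # [250, 250, 100]
```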

5. Publish to HuggingFace

# Generate and publish
data4ai prompt \
  --repo my-public-dataset \
  --description "Educational content about machine learning" \
  --count 200 \
  --huggingface

🐍 Python API

from data4ai import generate_from_description, generate_from_document

# Generate from description (uses ChatML by default)
result = generate_from_description(
    description="Create Python interview questions",
    repo="python-interviews",
    count=50,
    schema="chatml"  # Optional, ChatML is default
)

# Generate from document with quality features
result = generate_from_document(
    document_path="research-paper.pdf",
    repo="paper-qa",
    extraction_type="qa",
    count=100,
    taxonomy="balanced",      # Optional: Bloom's taxonomy
    include_provenance=True,   # Optional: Source tracking
    verify_quality=True        # Optional: Quality verification
)

print(f"Generated {result['row_count']} examples")

📋 Supported Schemas

ChatML (Default - OpenAI format)

{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is machine learning?"},
    {"role": "assistant", "content": "Machine learning is..."}
  ]
}

Alpaca (Instruction tuning)

{
  "instruction": "What is machine learning?",
  "input": "Explain in simple terms",
  "output": "Machine learning is..."
}

Dolly (Context-based)

{
  "instruction": "Summarize this text",
  "context": "Long text here...",
  "response": "Summary..."
}

ShareGPT (Conversations)

{
  "conversations": [
    {"from": "human", "value": "Hello"},
    {"from": "gpt", "value": "Hi there!"}
  ]
}
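These schemas carry the same information, so converting between them is mechanical. For example, an Alpaca record can be folded into ChatML messages; this is a sketch, not part of the data4ai API:

```python
def alpaca_to_chatml(record, system="You are a helpful assistant."):
    """Fold an Alpaca instruction/input/output record into a ChatML message list."""
    user = record["instruction"]
    if record.get("input"):
        # Alpaca's optional input becomes part of the user turn
        user += "\n\n" + record["input"]
    return {"messages": [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
        {"role": "assistant", "content": record["output"]},
    ]}

row = {
    "instruction": "What is machine learning?",
    "input": "Explain in simple terms",
    "output": "Machine learning is...",
}
out = alpaca_to_chatml(row)
print(out["messages"][2]["content"])  # Machine learning is...
```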

🎯 Quality Features (Optional)

All quality features are optional; enable them when you need higher-quality datasets:

  • Token Chunking (--chunk-tokens N): chunk by token count instead of characters. Impact: minimal.
  • Bloom's Taxonomy (--taxonomy balanced): create cognitively diverse questions. Impact: none.
  • Provenance (--provenance): include source references. Impact: minimal.
  • Quality Verification (--verify): verify and improve examples. Impact: 2x API calls.
  • Long Context (--long-context): merge chunks for coherence. Impact: may reduce API calls.

When to Use Quality Features

  • Quick Prototyping: No features needed - fast and simple
  • Production Datasets: Use --taxonomy and --verify
  • Academic/Research: Use all features for maximum quality
  • Citation Required: Always use --provenance
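Provenance here means recording where in the source each answer came from. One simple way to compute character offsets, sketched under the assumption that the answer quotes the source verbatim (Data4AI's actual tracking may differ):

```python
def find_provenance(source_text, answer):
    """Return (start, end) character offsets of answer in source_text, or None."""
    start = source_text.find(answer)
    if start == -1:
        return None  # answer is paraphrased, not a verbatim quote
    return (start, start + len(answer))

source = "Machine learning is a subfield of AI focused on learning from data."
span = find_provenance(source, "a subfield of AI")
print(span)  # (20, 36)
```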

⚙️ Configuration

Create .env file:

OPENROUTER_API_KEY=your_key_here
OPENROUTER_MODEL=openai/gpt-4o-mini  # Optional (this is the default)
DEFAULT_SCHEMA=chatml                # Optional (this is the default)
HF_TOKEN=your_huggingface_token      # For publishing

Or use CLI:

data4ai config --save
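If you prefer loading the .env file from Python without extra dependencies, a minimal parser is a few lines (the python-dotenv package does this more robustly; the demo file below is a stand-in for your real .env):

```python
import os
import tempfile

def load_env(path=".env"):
    """Set environment variables from simple KEY=VALUE lines, ignoring comments."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.split("#", 1)[0].strip()  # drop inline comments
            if "=" in line:
                key, _, value = line.partition("=")
                os.environ[key.strip()] = value.strip().strip('"')

# Demo with a throwaway .env file
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as f:
    f.write("OPENROUTER_MODEL=openai/gpt-4o-mini  # Optional (this is the default)\n")
    path = f.name

load_env(path)
print(os.environ["OPENROUTER_MODEL"])  # openai/gpt-4o-mini
```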

🛠️ Development

# Clone repository
git clone https://github.com/zysec/data4ai.git
cd data4ai

# Install for development
pip install -e ".[dev]"

# Run tests
pytest

# Check code quality
ruff check .
black --check .

🤝 Contributing

Contributions welcome! Please check our Contributing Guide.

📄 License

MIT License - see LICENSE file.

Made with ❤️ by ZySec AI
