
Data4AI 🚀

AI-powered dataset generation for instruction tuning and model fine-tuning

PyPI version · License: MIT · Python 3.9+

Generate high-quality synthetic datasets using state-of-the-art language models through the OpenRouter API. Perfect for creating training data for LLM fine-tuning.

✨ Key Features

  • 🤖 100+ AI Models - Access to GPT-4, Claude, Llama, and more via OpenRouter
  • 📊 Multiple Formats - Support for Alpaca, Dolly, ShareGPT schemas
  • 🔮 DSPy Integration - Dynamic prompt optimization for better quality
  • 💾 Excel/CSV Support - Start from templates or existing data
  • ☁️ HuggingFace Hub - Direct dataset publishing
  • 🏭 Production Ready - Rate limiting, checkpointing, deduplication

🚀 Quick Start

Installation

pip install data4ai              # Core features
pip install data4ai[excel]       # With Excel support
pip install data4ai[all]         # All features
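
To confirm the install from Python, the standard library can report the installed version:

# Uses only the standard library (Python 3.8+)
from importlib.metadata import version

print(version("data4ai"))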

Set Up Environment Variables

Data4AI requires environment variables to be set in your shell:

Option 1: Quick Setup (Current Session)

# Get your API key from https://openrouter.ai/keys
export OPENROUTER_API_KEY="your_key_here"

# Optional: Set a specific model (default: openai/gpt-4o-mini)
export OPENROUTER_MODEL="anthropic/claude-3-5-sonnet"  # Or another model

# Optional: For publishing to HuggingFace
export HF_TOKEN="your_huggingface_token"

Option 2: Interactive Setup

# Use our setup helper
source setup_env.sh

Option 3: Permanent Setup

# Add to your shell config (~/.bashrc, ~/.zshrc, or ~/.profile)
echo 'export OPENROUTER_API_KEY="your_key_here"' >> ~/.bashrc
source ~/.bashrc

Check Your Setup

# Verify environment variables are set
data4ai env --check

Generate Your First Dataset

# Generate from description
data4ai prompt \
  --repo my-dataset \
  --description "Create 10 Python programming questions with answers" \
  --count 10

# View results
cat my-dataset/data.jsonl
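
The output is JSON Lines: one self-contained JSON record per line, in the chosen schema (Alpaca by default; see Supported Schemas below). A quick sanity check in Python:

import json

# Each line of data.jsonl is one standalone JSON record
with open("my-dataset/data.jsonl") as f:
    rows = [json.loads(line) for line in f]

print(len(rows), "examples")
print(rows[0])  # e.g. {"instruction": ..., "input": ..., "output": ...}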

📚 Common Use Cases

1. Generate from Natural Language

data4ai prompt \
  --repo customer-support \
  --description "Create customer support Q&A for a SaaS product" \
  --count 100

2. Complete Partial Data from Excel

# Create template
data4ai create-sample template.xlsx

# Fill some examples in Excel, leave others blank
# Then generate completions
data4ai run template.xlsx --repo my-dataset --max-rows 100
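
To inspect the template before filling it in, pandas can read it directly (the data4ai[excel] extra pulls in Excel support). The column names here are an assumption based on the default Alpaca schema; check the headers in the generated file:

import pandas as pd

# Inspect the template's headers and any rows filled in so far
df = pd.read_excel("template.xlsx")
print(df.columns.tolist())  # expect instruction/input/output-style columns
print(df.head())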

3. Publish to HuggingFace

# Generate and publish
data4ai prompt \
  --repo my-public-dataset \
  --description "Educational content about machine learning" \
  --count 200 \
  --huggingface
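
Once published, the dataset can be loaded back with the HuggingFace datasets library. The repo id below is an assumption: replace your-username with your HuggingFace account name:

from datasets import load_dataset

# Pull the published dataset from the HuggingFace Hub
ds = load_dataset("your-username/my-public-dataset", split="train")
print(ds[0])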

🐍 Python API

from data4ai import generate_from_description

result = generate_from_description(
    description="Create Python interview questions",
    repo="python-interviews",
    count=50
)

print(f"Generated {result.row_count} examples")

📋 Supported Schemas

Alpaca (Default - Instruction tuning)

{
  "instruction": "What is machine learning?",
  "input": "Explain in simple terms",
  "output": "Machine learning is..."
}
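
For fine-tuning, Alpaca records are usually rendered into a single prompt string. The helper below is a sketch using the widely known Stanford Alpaca template; it is not a data4ai API, so adapt it to your training stack:

def format_alpaca(ex: dict) -> str:
    """Render an Alpaca record into one prompt string."""
    if ex.get("input"):
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{ex['instruction']}\n\n"
            f"### Input:\n{ex['input']}\n\n"
            f"### Response:\n{ex['output']}"
        )
    return (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{ex['instruction']}\n\n"
        f"### Response:\n{ex['output']}"
    )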

Dolly (Context-based)

{
  "instruction": "Summarize this text",
  "context": "Long text here...",
  "response": "Summary..."
}

ShareGPT (Conversations)

{
  "conversations": [
    {"from": "human", "value": "Hello"},
    {"from": "gpt", "value": "Hi there!"}
  ]
}
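
Converting between schemas is mechanical. A hypothetical helper (not part of data4ai) that maps an Alpaca record onto the ShareGPT layout above:

def alpaca_to_sharegpt(ex: dict) -> dict:
    # Merge the instruction and optional input into a single human turn
    human = ex["instruction"]
    if ex.get("input"):
        human += "\n\n" + ex["input"]
    return {
        "conversations": [
            {"from": "human", "value": human},
            {"from": "gpt", "value": ex["output"]},
        ]
    }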

⚙️ Configuration

Create a .env file:

OPENROUTER_API_KEY=your_key_here
OPENROUTER_MODEL=openai/gpt-4o-mini   # Optional (this is the default)
HF_TOKEN=your_huggingface_token       # For publishing

Or use the CLI:

data4ai config --save
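
If you call the Python API from your own scripts, the same .env file can be loaded with python-dotenv (an assumption about your setup; install it separately with pip install python-dotenv):

import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
assert os.getenv("OPENROUTER_API_KEY"), "OPENROUTER_API_KEY is not set"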

📖 Documentation

🛠️ Development

# Clone repository
git clone https://github.com/zysec/data4ai.git
cd data4ai

# Install for development
pip install -e ".[dev]"

# Run tests
pytest

# Check code quality
ruff check .
black --check .

🤝 Contributing

Contributions welcome! Please check our Contributing Guide.

📄 License

MIT License - see LICENSE file.

🔗 Links

  • GitHub: https://github.com/zysec/data4ai


Made with ❤️ by ZySec AI

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

data4ai-0.1.3.tar.gz (82.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

data4ai-0.1.3-py3-none-any.whl (61.6 kB view details)

Uploaded Python 3

File details

Details for the file data4ai-0.1.3.tar.gz.

File metadata

  • Download URL: data4ai-0.1.3.tar.gz
  • Upload date:
  • Size: 82.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.12

File hashes

Hashes for data4ai-0.1.3.tar.gz
Algorithm Hash digest
SHA256 3212ce191f4c6c45470f4df27b1d0a0b1099feafb456307d73e09168b6b5bc32
MD5 0a21f20b05f06b1257412d4f6a5300b4
BLAKE2b-256 8170435c8a9684969eca336f36f42193666df9689b48502b7871256522025b99

See more details on using hashes here.

File details

Details for the file data4ai-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: data4ai-0.1.3-py3-none-any.whl
  • Upload date:
  • Size: 61.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.12

File hashes

Hashes for data4ai-0.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 31e9e5ed77c151454c79c2ad1913c6fbe332f30d79b96ce0c1b5f3d2ec1a48bc
MD5 229e14ad595c4a595ff4402ed2fbf290
BLAKE2b-256 ffaaa8b419ca947203a6333c36da396b5b7a76c77077b9d6841513d9fa79f882

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page