Production-ready AI-powered dataset generation for instruction tuning and model fine-tuning
Data4AI 🚀
AI-powered dataset generation for instruction tuning and model fine-tuning
Generate high-quality synthetic datasets using state-of-the-art language models through the OpenRouter API. Well suited to creating training data for LLM fine-tuning.
✨ Key Features
- 🤖 100+ AI Models - Access to GPT-4, Claude, Llama, and more via OpenRouter
- 📊 Multiple Formats - Support for the Alpaca, Dolly, and ShareGPT schemas
- 🔮 DSPy Integration - Dynamic prompt optimization for better quality
- 💾 Excel/CSV Support - Start from templates or existing data
- ☁️ HuggingFace Hub - Direct dataset publishing
- ⚡ Production Ready - Rate limiting, checkpointing, deduplication
🚀 Quick Start
Installation
pip install data4ai            # Core features
pip install "data4ai[excel]"   # With Excel support (quotes keep shells such as zsh from expanding the brackets)
pip install "data4ai[all]"     # All features
Get API Key
Get your free API key from OpenRouter
export OPENROUTER_API_KEY="your_key_here"
Generate Your First Dataset
# Generate from description
data4ai prompt \
  --repo my-dataset \
  --description "Create 10 Python programming questions with answers" \
  --count 10
# View results
cat my-dataset/data.jsonl
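To inspect the output from Python instead of `cat`, a minimal sketch along these lines may help. It assumes the file is JSON Lines (one JSON object per line) at the path used above; `load_jsonl` is a hypothetical helper written for this example and is not part of Data4AI.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    rows = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows


if __name__ == "__main__":
    # Path taken from the Quick Start command above.
    dataset_path = Path("my-dataset/data.jsonl")
    if dataset_path.exists():
        rows = load_jsonl(dataset_path)
        first_keys = sorted(rows[0]) if rows else []
        print(f"{len(rows)} rows; keys of first row: {first_keys}")
```

Checking the keys of the first row is a quick way to confirm which schema the run produced.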
📚 Common Use Cases
1. Generate from Natural Language
data4ai prompt \
  --repo customer-support \
  --description "Create customer support Q&A for a SaaS product" \
  --count 100
2. Complete Partial Data from Excel
# Create template
data4ai create-sample template.xlsx
# Fill some examples in Excel, leave others blank
# Then generate completions
data4ai run template.xlsx --repo my-dataset --max-rows 100
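Before running a completion job, it can be useful to see how many rows are still partial. The sketch below is one way to check a CSV export of the template; it assumes Alpaca-style column names (`instruction`, `input`, `output`), which may differ from the template Data4AI actually generates, and `split_rows` is a hypothetical helper rather than a library function.

```python
import csv
import io


def split_rows(csv_text, required=("instruction", "output")):
    """Return (complete, partial) rows; a row is partial if any required cell is blank."""
    complete, partial = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if all((row.get(col) or "").strip() for col in required):
            complete.append(row)
        else:
            partial.append(row)
    return complete, partial


if __name__ == "__main__":
    # Illustrative data only: the second row has no output yet.
    sample = (
        "instruction,input,output\n"
        "What is Python?,,A programming language\n"
        "Explain list comprehensions,,\n"
    )
    complete, partial = split_rows(sample)
    print(f"{len(complete)} complete, {len(partial)} to be generated")
```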
3. Publish to HuggingFace
# Generate and publish
data4ai prompt \
  --repo my-public-dataset \
  --description "Educational content about machine learning" \
  --count 200 \
  --huggingface
🐍 Python API
from data4ai import generate_from_description

result = generate_from_description(
    description="Create Python interview questions",
    repo="python-interviews",
    count=50
)

print(f"Generated {result.row_count} examples")
📋 Supported Schemas
Alpaca (Default - Instruction tuning)
{
  "instruction": "What is machine learning?",
  "input": "Explain in simple terms",
  "output": "Machine learning is..."
}
Dolly (Context-based)
{
  "instruction": "Summarize this text",
  "context": "Long text here...",
  "response": "Summary..."
}
ShareGPT (Conversations)
{
  "conversations": [
    {"from": "human", "value": "Hello"},
    {"from": "gpt", "value": "Hi there!"}
  ]
}
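The three schemas differ only in their top-level keys, so a record's schema can be recognized from its keys, and an Alpaca record can be reshaped into a ShareGPT conversation. The following is a rough sketch under that assumption; `detect_schema` and `alpaca_to_sharegpt` are hypothetical helpers, not Data4AI functions, and joining `instruction` and `input` with a blank line is a choice made for this example.

```python
SCHEMA_KEYS = {
    "alpaca": {"instruction", "input", "output"},
    "dolly": {"instruction", "context", "response"},
    "sharegpt": {"conversations"},
}


def detect_schema(record):
    """Return the schema name whose keys exactly match the record, else None."""
    for name, keys in SCHEMA_KEYS.items():
        if set(record) == keys:
            return name
    return None


def alpaca_to_sharegpt(record):
    """Fold instruction and optional input into a human turn, output into a gpt turn."""
    prompt = record["instruction"]
    if record.get("input"):
        prompt = f"{prompt}\n\n{record['input']}"
    return {
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": record["output"]},
        ]
    }


if __name__ == "__main__":
    alpaca = {
        "instruction": "What is machine learning?",
        "input": "Explain in simple terms",
        "output": "Machine learning is...",
    }
    print(detect_schema(alpaca))
    print(alpaca_to_sharegpt(alpaca))
```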
⚙️ Configuration
Create a .env file:
OPENROUTER_API_KEY=your_key_here
OPENROUTER_MODEL=meta-llama/llama-3-8b-instruct # Optional
HF_TOKEN=your_huggingface_token # For publishing
Or use the CLI:
data4ai config --save
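In your own scripts, the same three variables can be read and checked before a long generation run starts. This is only a sketch: it reads variables already present in the process environment (it does not parse the .env file), and `Settings` and `load_settings` are hypothetical names, not part of the Data4AI API.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class Settings:
    api_key: str
    model: Optional[str]      # None means "use whatever default the tool picks"
    hf_token: Optional[str]   # only needed for publishing


def load_settings(env=None):
    """Build Settings from a mapping (defaults to os.environ); fail fast without an API key."""
    env = os.environ if env is None else env
    api_key = (env.get("OPENROUTER_API_KEY") or "").strip()
    if not api_key:
        raise RuntimeError("OPENROUTER_API_KEY is not set")
    return Settings(
        api_key=api_key,
        model=env.get("OPENROUTER_MODEL") or None,
        hf_token=env.get("HF_TOKEN") or None,
    )


if __name__ == "__main__":
    try:
        settings = load_settings()
        print(f"model override: {settings.model}, can publish: {settings.hf_token is not None}")
    except RuntimeError as exc:
        print(f"configuration error: {exc}")
```

Passing a plain dict as `env` makes the check easy to exercise in tests without touching the real environment.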
📖 Documentation
- Detailed Usage Guide - Complete CLI reference
- Examples - Code examples and recipes
- API Documentation - Python API reference
- Publishing Guide - PyPI publishing instructions
- All Documentation - Complete documentation index
🛠️ Development
# Clone repository
git clone https://github.com/zysec/data4ai.git
cd data4ai
# Install for development
pip install -e ".[dev]"
# Run tests
pytest
# Check code quality
ruff check .
black --check .
🤝 Contributing
Contributions welcome! Please check our Contributing Guide.
📄 License
MIT License - see LICENSE file.
Made with ❤️ by ZySec AI