# Data4AI 🤖

**Generate high-quality AI training datasets from simple descriptions or documents**

Data4AI makes it easy to create instruction-tuning datasets for training and fine-tuning language models. Whether you're building domain-specific models or need quality training data, Data4AI has you covered.
## ✨ Features
- 🎯 Simple Commands - Generate datasets from descriptions or documents
- 📚 Multiple Formats - Support for ChatML, Alpaca, and custom schemas
- 🔄 Smart Processing - Automatic chunking, deduplication, and quality validation
- 🏷️ Cognitive Taxonomy - Built-in Bloom's taxonomy for balanced learning
- ☁️ Direct Upload - Push datasets directly to HuggingFace Hub
- 🌐 100+ Models - Access to GPT, Claude, Llama, and more via OpenRouter
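As an illustration of the content-based deduplication listed above, duplicates can be dropped by hashing each example's normalized JSON and keeping only the first occurrence (a generic sketch, not Data4AI's internal implementation):

```python
import hashlib
import json

def dedup_by_content(examples):
    """Keep only the first example for each distinct content hash."""
    seen = set()
    unique = []
    for ex in examples:
        # Sort keys and lowercase so trivially different rows collapse together
        key = hashlib.sha256(
            json.dumps(ex, sort_keys=True).lower().encode()
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique

rows = [{"q": "What is Python?"}, {"q": "What is Python?"}, {"q": "What is a list?"}]
print(len(dedup_by_content(rows)))  # 2
```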
## 🚀 Quick Start

### Install

```bash
pip install data4ai
```

### Get API Key

Get your free API key from OpenRouter:

```bash
export OPENROUTER_API_KEY="your_key_here"
```

### Generate Your First Dataset

From a description:

```bash
data4ai prompt \
  --repo my-dataset \
  --description "Python programming questions for beginners" \
  --count 100
```

From documents:

```bash
data4ai doc document.pdf \
  --repo doc-dataset \
  --count 100
```

From YouTube videos:

```bash
data4ai youtube @3Blue1Brown \
  --repo math-videos \
  --count 100
```

Upload to HuggingFace:

```bash
data4ai push --repo my-dataset
```

That's it! Your dataset is ready at `outputs/datasets/my-dataset/data.jsonl` 🎉
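The output is JSON Lines (one JSON record per line), so it can be read back with a few lines of standard-library Python (`load_jsonl` is a hypothetical helper for illustration, not part of the data4ai API):

```python
import json
from pathlib import Path

def load_jsonl(path):
    """Read a JSON Lines dataset into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# dataset = load_jsonl("outputs/datasets/my-dataset/data.jsonl")
```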
## 📖 Documentation
- Examples - Real-world usage examples
- Commands - Complete CLI reference
- Features - Advanced features and options
- YouTube Integration - Extract datasets from YouTube videos
- Troubleshooting - Common issues and solutions
- Runnable Examples - Ready-to-run example scripts
## 🤝 Community

### Contributing
We welcome contributions! See our Contributing Guide for:
- Development setup
- Code style guidelines
- Testing requirements
- Pull request process
### Getting Help
- 🐛 Bug reports: GitHub Issues
- 💬 Questions: GitHub Discussions
- 📧 Contact: research@zysec.ai
## Project Structure

```
data4ai/
├── data4ai/         # Core library code
├── docs/            # User documentation
├── tests/           # Test suite
├── README.md        # You are here
├── CONTRIBUTING.md  # How to contribute
└── CHANGELOG.md     # Release history
```
## 🎯 Use Cases

### 🏥 Medical Training Data

```bash
data4ai prompt --repo medical-qa \
  --description "Medical diagnosis Q&A for common symptoms" \
  --count 500
```

### ⚖️ Legal Assistant Data

```bash
data4ai doc legal-docs/ --repo legal-assistant --count 1000
```

### 💻 Code Training Data

```bash
data4ai prompt --repo code-qa \
  --description "Python debugging and best practices" \
  --count 300
```

### 📺 Educational Video Content

```bash
# Programming tutorials
data4ai youtube --search "python tutorial,programming" --repo python-course --count 200

# Educational channels
data4ai youtube @3Blue1Brown --repo math-education --count 150

# Conference talks
data4ai youtube @pycon --repo conference-talks --count 100
```
## 🛠️ Advanced Usage

### Quality Control

```bash
data4ai doc document.pdf \
  --repo high-quality \
  --verify \
  --taxonomy advanced \
  --dedup-strategy content
```

### Batch Processing

```bash
data4ai doc documents/ \
  --repo batch-dataset \
  --count 1000 \
  --batch-size 20 \
  --recursive
```

### Custom Models

```bash
export OPENROUTER_MODEL="anthropic/claude-3-5-sonnet"
data4ai prompt --repo custom-model --description "..." --count 100
```
## 🏗️ Architecture
Data4AI is built with:
- Async Processing - Fast concurrent generation
- DSPy Integration - Advanced prompt optimization
- Quality Validation - Automatic content verification
- Atomic Writes - Safe file operations
- Schema Validation - Ensures data consistency
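The atomic-writes item above refers to the standard write-to-temp-then-rename pattern, which guarantees readers never see a half-written file. A generic sketch of the idea (not Data4AI's actual code):

```python
import os
import tempfile

def atomic_write(path, text):
    """Write text to path so readers never observe a partial file.

    Data goes to a temp file in the same directory first, then os.replace()
    swaps it into place in a single atomic step (on both POSIX and Windows).
    """
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(text)
        os.replace(tmp, path)  # atomic swap into the final location
    except BaseException:
        os.unlink(tmp)  # clean up the temp file on any failure
        raise
```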
## 📊 Sample Output

```json
{
  "messages": [
    {
      "role": "user",
      "content": "How do I handle exceptions in Python?"
    },
    {
      "role": "assistant",
      "content": "In Python, use try-except blocks to handle exceptions: ..."
    }
  ],
  "taxonomy_level": "understand"
}
```
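A minimal structural check for records of this shape could look like the following (illustrative only; `is_valid_chatml` is a hypothetical helper, not part of the data4ai API):

```python
def is_valid_chatml(record):
    """Return True if record has a non-empty list of role/content messages."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    return all(
        isinstance(m, dict)
        and m.get("role") in {"system", "user", "assistant"}
        and isinstance(m.get("content"), str)
        for m in messages
    )

sample = {
    "messages": [
        {"role": "user", "content": "How do I handle exceptions in Python?"},
        {"role": "assistant", "content": "Use try-except blocks: ..."},
    ],
    "taxonomy_level": "understand",
}
print(is_valid_chatml(sample))  # True
```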
## 🔧 Configuration

### Environment Variables

```bash
# Required
export OPENROUTER_API_KEY="your_key"

# Optional
export OPENROUTER_MODEL="openai/gpt-4o-mini"  # Default model
export HF_TOKEN="your_hf_token"               # For HuggingFace uploads
export OUTPUT_DIR="./outputs/datasets"        # Default output directory
```

### Config File

Create `.data4ai.yaml` in your project:

```yaml
default_model: "anthropic/claude-3-5-sonnet"
default_schema: "chatml"
default_count: 100
quality_check: true
```
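One way such a file could combine with built-in defaults is a simple dict merge where file values win. This is a sketch of the precedence idea only; the `DEFAULTS` values are assumptions based on the sections above, not Data4AI's documented behavior:

```python
DEFAULTS = {
    "default_model": "openai/gpt-4o-mini",
    "default_schema": "chatml",
    "default_count": 100,
    "quality_check": True,
}

def effective_config(file_values):
    """Values from the config file override built-in defaults."""
    return {**DEFAULTS, **file_values}

cfg = effective_config({"default_model": "anthropic/claude-3-5-sonnet"})
print(cfg["default_model"])  # anthropic/claude-3-5-sonnet
print(cfg["default_count"])  # 100
```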
## 🚀 Roadmap
- Custom Schema Support - Define your own data formats
- Local Model Support - Use local LLMs (Ollama, vLLM)
- Multi-language Datasets - Generate data in multiple languages
- Dataset Analytics - Advanced quality metrics and visualization
- API Service - RESTful API for dataset generation
## 📈 Performance
- Speed: Generate 100 examples in ~2 minutes
- Quality: Built-in validation and deduplication
- Scale: Tested with datasets up to 100K examples
- Memory: Efficient streaming for large documents
## ⭐ Show Your Support
If Data4AI helps you, please:
- ⭐ Star this repository
- 🐦 Share on social media
- 🤝 Contribute improvements
- 💝 Sponsor the project
## 📄 License
MIT License - see LICENSE file for details.
## 🏢 About ZySec AI

Data4AI is developed by ZySec AI. ZySec AI helps enterprises adopt AI where data sovereignty, privacy, and security are non-negotiable, moving them beyond fragmented, siloed systems to a single platform that spans data to agentic AI.
## Download files
### File details: data4ai-0.3.0.tar.gz

- Download URL: data4ai-0.3.0.tar.gz
- Size: 130.1 kB
- Tags: Source
- Uploaded via: twine/6.1.0 CPython/3.12.10
- Uploaded using Trusted Publishing? No

File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | `391bab837e64c39571e71c1d18c07150916462482539393211f8739ca4eaf3a3` |
| MD5 | `f24382f7bf89e02a92b460e0eebeaebc` |
| BLAKE2b-256 | `7760c89d30c6c2674412cf6e50d0ea697f7a2134d584189261c0d66a786379cc` |
### File details: data4ai-0.3.0-py3-none-any.whl

- Download URL: data4ai-0.3.0-py3-none-any.whl
- Size: 103.7 kB
- Tags: Python 3
- Uploaded via: twine/6.1.0 CPython/3.12.10
- Uploaded using Trusted Publishing? No

File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | `f92be143b0c7d22970484016274860573b25f157c1a5a4c58c1de2edee3e57db` |
| MD5 | `e9a96fa3b821ed13b44833a9ef355dfa` |
| BLAKE2b-256 | `150dd9b63e5182c466398642727cbea230b5c8a1d7d9451a4426eab28230d736` |