
Production-ready AI-powered dataset generation for instruction tuning and model fine-tuning


Data4AI 🤖


Generate high-quality AI training datasets from simple descriptions or documents

Data4AI makes it easy to create instruction-tuning datasets for training and fine-tuning language models. Whether you're building domain-specific models or need quality training data, Data4AI has you covered.

✨ Features

  • 🎯 Simple Commands - Generate datasets from descriptions or documents
  • 📚 Multiple Formats - Support for ChatML, Alpaca, and custom schemas
  • 🔄 Smart Processing - Automatic chunking, deduplication, and quality validation
  • 🏷️ Cognitive Taxonomy - Built-in Bloom's taxonomy for balanced learning
  • ☁️ Direct Upload - Push datasets directly to HuggingFace Hub
  • 🌐 100+ Models - Access to GPT, Claude, Llama, and more via OpenRouter
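The deduplication mentioned above can be pictured as a simple content-hash pass: normalize each example's text, hash it, and keep only first occurrences. This is an illustrative sketch of the general technique, not Data4AI's actual implementation (the function name is hypothetical):

```python
import hashlib

def dedup_examples(examples):
    """Drop examples whose normalized text has been seen before."""
    seen = set()
    unique = []
    for ex in examples:
        # Collapse whitespace and case so trivial variants hash identically
        key = hashlib.sha256(" ".join(ex.split()).lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique

examples = ["What is a list?", "what is  a list?", "What is a dict?"]
print(dedup_examples(examples))  # ['What is a list?', 'What is a dict?']
```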

🚀 Quick Start

Install

pip install data4ai

Get API Key

Get your free API key from OpenRouter:

export OPENROUTER_API_KEY="your_key_here"

Generate Your First Dataset

From a description:

data4ai prompt \
  --repo my-dataset \
  --description "Python programming questions for beginners" \
  --count 100

From documents:

data4ai doc document.pdf \
  --repo doc-dataset \
  --count 100

From YouTube videos:

data4ai youtube @3Blue1Brown \
  --repo math-videos \
  --count 100

Upload to HuggingFace:

data4ai push --repo my-dataset

That's it! Your dataset is ready at outputs/datasets/my-dataset/data.jsonl 🎉
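Because the output is plain JSONL (one JSON object per line), you can inspect it with a few lines of standard-library Python; the path below is the default output location shown above:

```python
import json
from pathlib import Path

def load_jsonl(path):
    """Read a JSONL dataset into a list of dicts, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                records.append(json.loads(line))
    return records

path = Path("outputs/datasets/my-dataset/data.jsonl")
if path.exists():
    records = load_jsonl(path)
    print(f"{len(records)} examples; first: {records[0]}")
```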

📖 Documentation

🤝 Community

Contributing

We welcome contributions! See our Contributing Guide for:

  • Development setup
  • Code style guidelines
  • Testing requirements
  • Pull request process

Getting Help

Project Structure

data4ai/
├── data4ai/           # Core library code
├── docs/             # User documentation  
├── tests/            # Test suite
├── README.md         # You are here
├── CONTRIBUTING.md   # How to contribute
└── CHANGELOG.md      # Release history

🎯 Use Cases

🏥 Medical Training Data

data4ai prompt --repo medical-qa \
  --description "Medical diagnosis Q&A for common symptoms" \
  --count 500

⚖️ Legal Assistant Data

data4ai doc legal-docs/ --repo legal-assistant --count 1000

💻 Code Training Data

data4ai prompt --repo code-qa \
  --description "Python debugging and best practices" \
  --count 300

📺 Educational Video Content

# Programming tutorials
data4ai youtube --search "python tutorial,programming" --repo python-course --count 200

# Educational channels  
data4ai youtube @3Blue1Brown --repo math-education --count 150

# Conference talks
data4ai youtube @pycon --repo conference-talks --count 100

🛠️ Advanced Usage

Quality Control

data4ai doc document.pdf \
  --repo high-quality \
  --verify \
  --taxonomy advanced \
  --dedup-strategy content

Batch Processing

data4ai doc documents/ \
  --repo batch-dataset \
  --count 1000 \
  --batch-size 20 \
  --recursive

Custom Models

export OPENROUTER_MODEL="anthropic/claude-3-5-sonnet"
data4ai prompt --repo custom-model --description "..." --count 100

🏗️ Architecture

Data4AI is built with:

  • Async Processing - Fast concurrent generation
  • DSPy Integration - Advanced prompt optimization
  • Quality Validation - Automatic content verification
  • Atomic Writes - Safe file operations
  • Schema Validation - Ensures data consistency
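The atomic-writes idea above is the standard write-to-temp-then-rename pattern: a reader never sees a half-written file because the replace is atomic. A generic sketch of the pattern (not Data4AI's code):

```python
import os
import tempfile

def atomic_write(path, data):
    """Write data to path so readers never observe a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Temp file must live in the same directory (same filesystem)
    # for os.replace() to be an atomic rename.
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise
```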

📊 Sample Output

{
  "messages": [
    {
      "role": "user", 
      "content": "How do I handle exceptions in Python?"
    },
    {
      "role": "assistant",
      "content": "In Python, use try-except blocks to handle exceptions: ..."
    }
  ],
  "taxonomy_level": "understand"
}
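A record like the one above can be sanity-checked with a minimal validator. This is a sketch of the kind of schema check described in the Architecture section, not the library's actual validation code:

```python
def is_valid_chatml(record):
    """Check that a record has a non-empty messages list of role/content dicts."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    allowed_roles = {"system", "user", "assistant"}
    return all(
        isinstance(m, dict)
        and m.get("role") in allowed_roles
        and isinstance(m.get("content"), str)
        and m["content"].strip() != ""
        for m in messages
    )

record = {
    "messages": [
        {"role": "user", "content": "How do I handle exceptions in Python?"},
        {"role": "assistant", "content": "Use try-except blocks: ..."},
    ],
    "taxonomy_level": "understand",
}
print(is_valid_chatml(record))  # True
```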

🔧 Configuration

Environment Variables

# Required
export OPENROUTER_API_KEY="your_key"

# Optional  
export OPENROUTER_MODEL="openai/gpt-4o-mini"  # Default model
export HF_TOKEN="your_hf_token"               # For HuggingFace uploads
export OUTPUT_DIR="./outputs/datasets"       # Default output directory

Config File

Create .data4ai.yaml in your project:

default_model: "anthropic/claude-3-5-sonnet"
default_schema: "chatml" 
default_count: 100
quality_check: true

🚀 Roadmap

  • Custom Schema Support - Define your own data formats
  • Local Model Support - Use local LLMs (Ollama, vLLM)
  • Multi-language Datasets - Generate data in multiple languages
  • Dataset Analytics - Advanced quality metrics and visualization
  • API Service - RESTful API for dataset generation

📈 Performance

  • Speed: Typically ~2 minutes for 100 examples (varies by model and provider rate limits)
  • Quality: Built-in validation and deduplication
  • Scale: Tested with datasets up to 100K examples
  • Memory: Efficient streaming for large documents
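The memory claim rests on streaming: processing a large document in overlapping chunks rather than loading it whole. A generic sketch of that pattern (not the library's actual chunker; parameter names are illustrative):

```python
def iter_chunks(path, chunk_chars=2000, overlap=200):
    """Yield overlapping text chunks from a file without loading it all at once."""
    buffer = ""
    with open(path, encoding="utf-8") as f:
        while True:
            piece = f.read(8192)  # read in fixed-size pieces
            if not piece:
                break
            buffer += piece
            while len(buffer) >= chunk_chars:
                yield buffer[:chunk_chars]
                # Keep a tail so context spans chunk boundaries
                buffer = buffer[chunk_chars - overlap:]
    if buffer:
        yield buffer
```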

⭐ Show Your Support

If Data4AI helps you, please:

  • ⭐ Star this repository
  • 🐦 Share on social media
  • 🤝 Contribute improvements
  • 💝 Sponsor the project

📄 License

MIT License - see LICENSE file for details.

🏢 About ZySec AI

Data4AI is developed by ZySec AI. ZySec AI helps enterprises adopt AI with confidence where data sovereignty, privacy, and security are non-negotiable, moving beyond fragmented, siloed systems to unified intelligence, from data to agentic AI, on a single platform.


Made with ❤️ by ZySec AI for the open source community



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

data4ai-0.3.0.tar.gz (130.1 kB)

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

data4ai-0.3.0-py3-none-any.whl (103.7 kB)

File details

Details for the file data4ai-0.3.0.tar.gz.

File metadata

  • Download URL: data4ai-0.3.0.tar.gz
  • Upload date:
  • Size: 130.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.10

File hashes

Hashes for data4ai-0.3.0.tar.gz:

  • SHA256: 391bab837e64c39571e71c1d18c07150916462482539393211f8739ca4eaf3a3
  • MD5: f24382f7bf89e02a92b460e0eebeaebc
  • BLAKE2b-256: 7760c89d30c6c2674412cf6e50d0ea697f7a2134d584189261c0d66a786379cc


File details

Details for the file data4ai-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: data4ai-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 103.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.10

File hashes

Hashes for data4ai-0.3.0-py3-none-any.whl:

  • SHA256: f92be143b0c7d22970484016274860573b25f157c1a5a4c58c1de2edee3e57db
  • MD5: e9a96fa3b821ed13b44833a9ef355dfa
  • BLAKE2b-256: 150dd9b63e5182c466398642727cbea230b5c8a1d7d9451a4426eab28230d736

