
Data4AI 🚀

AI-powered dataset generation for instruction tuning and model fine-tuning

Data4AI is a production-ready Python library and CLI tool that creates high-quality synthetic datasets using state-of-the-art language models through OpenRouter API. Generate, validate, and publish datasets in popular formats like Alpaca, Dolly, and ShareGPT.

License: MIT | Python 3.9+

✨ Features

Core Capabilities

  • 🤖 AI-Powered Generation: Access 100+ models via OpenRouter API
  • 🔮 DSPy Integration: Dynamic prompt generation using DSPy signatures for high-quality output
  • 📊 Multiple Input Formats: Excel and CSV file support with auto-detection
  • 💬 Natural Language Input: Generate datasets from text descriptions
  • 🔧 Schema Support: Alpaca, Dolly, ShareGPT, and custom formats
  • ☁️ HuggingFace Hub: Direct dataset publishing integration

Production Features

  • ⚡ Rate Limiting: Adaptive token bucket algorithm with automatic backoff
  • 💾 Atomic Operations: Data integrity with temp file + atomic rename pattern
  • 🔄 Checkpoint/Resume: Fault-tolerant generation with session recovery
  • 🎯 Deduplication: Multiple strategies (exact, fuzzy, content-based)
  • 📈 Progress Tracking: Real-time metrics, progress bars, and ETA
  • 🛡️ Error Handling: Comprehensive error recovery with user-friendly messages
  • 🚀 Performance: Parallel processing with asyncio and streaming I/O
  • 📦 Batch Processing: Configurable batch sizes with memory optimization
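The "temp file + atomic rename" pattern behind the Atomic Operations feature can be sketched in a few lines of plain Python. This is an illustration of the general technique, not Data4AI's internal implementation:

```python
import json
import os
import tempfile

def atomic_write_jsonl(path, rows):
    """Write rows as JSONL so readers never observe a half-written file."""
    dir_name = os.path.dirname(os.path.abspath(path))
    # Stage the data in a temp file on the same filesystem as the target...
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")
            f.flush()
            os.fsync(f.fileno())
        # ...then swap it into place in one atomic step.
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise

atomic_write_jsonl("data.jsonl", [{"instruction": "Say hi", "input": "", "output": "Hi!"}])
```

Because os.replace is atomic on both POSIX and Windows, a crash mid-write leaves either the old file or the new one, never a truncated mix.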

🚀 Quick Start

Method 1: DSPy Dynamic Prompt Generation (New!)

# Generate high-quality datasets using DSPy signatures
data4ai prompt \
  --repo dspy-example \
  --dataset alpaca \
  --description "Create programming questions about data structures" \
  --count 20 \
  --use-dspy  # Enable DSPy for dynamic prompts

# Compare with static prompts
data4ai prompt \
  --repo static-example \
  --dataset alpaca \
  --description "Create programming questions about data structures" \
  --count 20 \
  --no-use-dspy  # Use static prompts

Method 2: Excel Template Workflow (Recommended)

# 1. Create an Excel template
data4ai create-sample my_dataset.xlsx --dataset alpaca

# 2. Edit the Excel file (add some examples, leave blanks for AI to fill)
# Open my_dataset.xlsx in Excel/LibreOffice/Numbers

# 3. Generate the complete dataset
data4ai run my_dataset.xlsx --repo my-dataset --dataset alpaca --max-rows 1000

Method 3: Description-to-Dataset

# Generate dataset from a description
data4ai prompt \
  --repo code-review-assistant \
  --dataset alpaca \
  --description "Create code review examples that help developers improve their code quality" \
  --count 500

Method 4: Push to Hugging Face

# Generate and publish in one command
data4ai run my_dataset.xlsx --repo my-dataset --dataset alpaca --huggingface --private

📦 Installation

Prerequisites

  • Python 3.9 or newer
  • An OpenRouter API key (set via the OPENROUTER_API_KEY environment variable)

Install Data4AI

# Recommended: Install with pipx for CLI isolation
pipx install data4ai

# Install with pip (choose your features)
pip install data4ai              # Core features only
pip install data4ai[excel]       # With Excel support
pip install data4ai[hf]          # With HuggingFace publishing
pip install data4ai[all]         # All features

# For development
git clone https://github.com/zysec/data4ai.git
cd data4ai
pip install -e .

Verify Installation

data4ai --version
data4ai --help

🧪 Local Development & Testing

For developers who want to test and modify the code:

# Quick setup for local testing
cd data4ai
uv venv && source .venv/bin/activate
uv pip install -e ".[dev]"

# Configure
cp .env.example .env
# Edit .env with your OpenRouter API key

# Test the installation
data4ai --help
data4ai create-sample tests/samples/test.xlsx  # Works without API

# Run tests
pytest

# Run comprehensive tests
pytest tests/ -v

# With coverage report
pytest tests/ -v --cov=data4ai --cov-report=html

โš™๏ธ Configuration

Environment Variables

Create a .env file or set these environment variables:

# Required
export OPENROUTER_API_KEY="your_openrouter_key_here"

# Optional (with defaults)
export OPENROUTER_MODEL="meta-llama/llama-3-8b-instruct"  # Default model
export DATA4AI_DATASET="alpaca"                           # Default schema
export HF_TOKEN="your_huggingface_token"                  # For HF publishing
export HF_ORG="ZySecAI"                                   # HF organization
export DATA4AI_TEMPERATURE="0.7"                          # Default temperature
export DATA4AI_MAX_ROWS="1000"                            # Default max rows

Configuration File

Create ~/.data4ai/config.yaml for persistent settings:

openrouter:
  api_key: "your_key_here"
  model: "meta-llama/llama-3-8b-instruct"
  temperature: 0.7

huggingface:
  token: "your_hf_token"
  org: "ZySecAI"

defaults:
  dataset: "alpaca"
  max_rows: 1000
  seed: 42
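Settings from the three sources are typically layered, with environment variables overriding the config file and the config file overriding built-in defaults. A minimal sketch of that precedence logic (illustrative only; Data4AI's exact resolution rules are not documented here):

```python
import os

BUILTIN_DEFAULTS = {"model": "meta-llama/llama-3-8b-instruct",
                    "temperature": "0.7", "dataset": "alpaca"}

def resolve(name, env_var, file_config):
    """Env var beats config file beats built-in default."""
    if env_var in os.environ:
        return os.environ[env_var]
    if name in file_config:
        return file_config[name]
    return BUILTIN_DEFAULTS[name]

file_config = {"temperature": "0.5"}  # e.g. parsed from ~/.data4ai/config.yaml
os.environ["DATA4AI_TEMPERATURE"] = "0.9"
print(resolve("temperature", "DATA4AI_TEMPERATURE", file_config))  # prints 0.9
```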

📚 Usage Examples

🚀 Quick Start (Copy-Paste Ready)

# 1. Set your API key
export OPENROUTER_API_KEY="your_key_here"

# 2. Generate a simple dataset from description
data4ai prompt \
  --repo my-first-dataset \
  --dataset alpaca \
  --description "Create 10 questions and answers about Python programming" \
  --count 10

# 3. Check the results
ls my-first-dataset/
cat my-first-dataset/data.jsonl | head -3

📊 Excel Template Workflow

# 1. Create an Excel template
data4ai create-sample my_data.xlsx --dataset alpaca

# 2. Open and edit the Excel file (add a few examples)
# Open my_data.xlsx in Excel/LibreOffice/Numbers
# Fill in some rows, leave others blank for AI to complete

# 3. Generate the complete dataset
data4ai run my_data.xlsx \
  --repo my-excel-dataset \
  --max-rows 100 \
  --temperature 0.7

💼 Real-World Examples

Example 1: Customer Support Dataset

# Generate customer support Q&A
data4ai prompt \
  --repo customer-support-qa \
  --dataset alpaca \
  --description "Create customer support questions and answers for a SaaS product. Include common issues like login problems, billing questions, and feature requests." \
  --count 200 \
  --temperature 0.6

Example 2: Code Review Examples

# Generate code review dataset
data4ai prompt \
  --repo code-review-dataset \
  --dataset alpaca \
  --description "Create code review examples that help developers improve code quality. Include security issues, performance problems, and best practices." \
  --count 150 \
  --model "anthropic/claude-3-5-sonnet"

Example 3: Financial Education Dataset

# Generate financial education content
data4ai prompt \
  --repo financial-education \
  --dataset alpaca \
  --description "Create educational content about personal finance. Cover topics like budgeting, investing, saving, and debt management." \
  --count 300 \
  --temperature 0.8

Example 4: Multi-language Support

# Generate Spanish language dataset
data4ai prompt \
  --repo spanish-tech-qa \
  --dataset alpaca \
  --description "Crear preguntas y respuestas en español sobre tecnología, programación y desarrollo de software" \
  --count 100 \
  --model "meta-llama/llama-3-8b-instruct"

🔧 Advanced Examples

Example 5: Custom Schema with Dolly

# Generate dataset using Dolly schema
data4ai prompt \
  --repo legal-summarizer \
  --dataset dolly \
  --description "Summarize legal case briefs into concise bullet points for junior lawyers" \
  --count 100 \
  --model "anthropic/claude-3-5-sonnet"

Example 6: Chat-style Dataset

# Generate conversation dataset
data4ai prompt \
  --repo ai-chat-examples \
  --dataset sharegpt \
  --description "Create conversations between users and AI assistants about various topics" \
  --count 50

Example 7: Reproducible Generation

# Generate with specific seed for reproducibility
data4ai prompt \
  --repo reproducible-dataset \
  --dataset alpaca \
  --description "Create math word problems for middle school students" \
  --count 100 \
  --seed 42 \
  --temperature 0.5

Example 8: Preview Generation

# Test generation without saving
data4ai prompt \
  --repo test-preview \
  --dataset alpaca \
  --description "Create 5 cooking recipe instructions" \
  --count 5 \
  --dry-run

📈 Publishing to Hugging Face

# Generate and publish in one command
data4ai prompt \
  --repo my-public-dataset \
  --dataset alpaca \
  --description "Create 50 programming interview questions" \
  --count 50 \
  --huggingface

# Or publish existing dataset
data4ai push --repo my-public-dataset --private

🧪 Testing and Validation

# Validate your dataset
data4ai validate --repo my-dataset

# Check dataset statistics
data4ai stats --repo my-dataset

# List available models
data4ai list-models

# Check your configuration
data4ai config

# Show version
data4ai version

🔧 CLI Reference

📋 Main Commands

# Get help
data4ai --help
data4ai <command> --help

# Create Excel template
data4ai create-sample my_data.xlsx --dataset alpaca

# Generate from Excel file (with AI completion)
data4ai run my_data.xlsx --repo my-dataset

# Convert file to dataset (without AI)
data4ai file-to-dataset my_data.xlsx --repo my-dataset

# Generate from description
data4ai prompt --repo my-dataset --description "Your description here"

# Push to Hugging Face
data4ai push --repo my-dataset --private

โš™๏ธ Common Options

  • --repo <name>: Output directory and HF repo name (required), e.g. --repo my-dataset
  • --dataset <schema>: Dataset schema: alpaca, dolly, or sharegpt (default: alpaca), e.g. --dataset dolly
  • --model <model>: OpenRouter model to use (default: from env var), e.g. --model anthropic/claude-3-5-sonnet
  • --max-rows <N>: Maximum rows to generate (default: 1000), e.g. --max-rows 500
  • --count <N>: Number of rows in prompt mode (default: 500), e.g. --count 200
  • --temperature <F>: Sampling temperature, 0.0-2.0 (default: 0.7), e.g. --temperature 0.8
  • --seed <N>: Random seed for reproducibility (default: random), e.g. --seed 42
  • --use-dspy: Use DSPy for dynamic prompt generation (default: enabled)
  • --no-use-dspy: Disable DSPy and use static prompts
  • --huggingface: Push to Hugging Face after generation (default: off)
  • --private: Make the HF dataset private (default: off)
  • --verbose: Show detailed output (default: off)
  • --dry-run: Show what would be generated without saving (default: off)

🚀 Quick Command Examples

# Generate 10 examples quickly
data4ai prompt --repo test --description "Create 10 cooking recipes" --count 10

# Use a specific model
data4ai prompt --repo test --description "Math problems" --model "anthropic/claude-3-5-sonnet" --count 50

# Generate with high creativity
data4ai prompt --repo test --description "Creative stories" --temperature 0.9 --count 20

# Generate reproducible results
data4ai prompt --repo test --description "Programming questions" --seed 42 --count 100

# Preview without saving
data4ai prompt --repo test --description "Test prompt" --count 5 --dry-run

# Generate and publish to HF
data4ai prompt --repo public-dataset --description "Educational content" --count 200 --huggingface

# Use DSPy for dynamic prompts (default)
data4ai prompt --repo dspy-dataset --description "Programming questions" --count 50 --use-dspy

# Use static prompts (disable DSPy)
data4ai prompt --repo static-dataset --description "Programming questions" --count 50 --no-use-dspy

📊 Excel Workflow Examples

# Create template
data4ai create-sample my_data.xlsx --dataset alpaca

# Generate from Excel (fill partial rows)
data4ai run my_data.xlsx --repo my-dataset --max-rows 100

# Generate from Excel with custom settings
data4ai run my_data.xlsx --repo my-dataset --model "anthropic/claude-3-5-sonnet" --temperature 0.6 --max-rows 500

# Convert Excel to dataset without AI (for complete files)
data4ai file-to-dataset my_data.xlsx --repo my-dataset

๐Ÿ” Utility Commands

# Validate your dataset
data4ai validate --repo my-dataset

# Get dataset statistics
data4ai stats --repo my-dataset

# List available models
data4ai list-models

# Check your configuration
data4ai config

# Show version
data4ai version

# Convert file to dataset (without AI)
data4ai file-to-dataset my_data.xlsx --repo my-dataset

๐Ÿ Python API

🚀 Quick Start (Copy-Paste Ready)

import os
from data4ai import generate_from_description

# Set your API key
os.environ["OPENROUTER_API_KEY"] = "your_key_here"

# Generate a simple dataset
result = generate_from_description(
    description="Create 10 questions and answers about Python programming",
    repo="my-first-dataset",
    dataset="alpaca",
    count=10
)

print(f"✅ Generated {result.row_count} rows")
print(f"📁 Output: {result.jsonl_path}")

📊 Excel Template Workflow

from data4ai import create_sample_excel, generate_from_excel

# 1. Create Excel template
create_sample_excel("my_data.xlsx", dataset="alpaca")

# 2. Edit the Excel file manually (add some examples)
# Open my_data.xlsx in Excel/LibreOffice/Numbers

# 3. Generate complete dataset
result = generate_from_excel(
    excel_path="my_data.xlsx",
    repo="my-excel-dataset",
    dataset="alpaca",
    max_rows=100,
    temperature=0.7
)

print(f"✅ Generated {result.row_count} rows")

💼 Real-World Examples

Example 1: Customer Support Dataset

from data4ai import generate_from_description

result = generate_from_description(
    description="Create customer support questions and answers for a SaaS product. Include common issues like login problems, billing questions, and feature requests.",
    repo="customer-support-qa",
    dataset="alpaca",
    count=200,
    temperature=0.6
)

print(f"✅ Generated {result.row_count} customer support examples")

Example 2: Code Review Dataset

from data4ai import generate_from_description

result = generate_from_description(
    description="Create code review examples that help developers improve code quality. Include security issues, performance problems, and best practices.",
    repo="code-review-dataset",
    dataset="alpaca",
    count=150,
    model="anthropic/claude-3-5-sonnet"
)

print(f"✅ Generated {result.row_count} code review examples")

Example 3: Multi-language Dataset

from data4ai import generate_from_description

result = generate_from_description(
    description="Crear preguntas y respuestas en español sobre tecnología, programación y desarrollo de software",
    repo="spanish-tech-qa",
    dataset="alpaca",
    count=100,
    model="meta-llama/llama-3-8b-instruct"
)

print(f"✅ Generated {result.row_count} Spanish tech examples")

🔧 Advanced Python Usage

Example 4: Object-Oriented API

from data4ai import Data4AI

# Initialize with custom configuration
ai = Data4AI(
    openrouter_api_key="your_key_here",
    openrouter_model="anthropic/claude-3-5-sonnet",
    temperature=0.8
)

# Generate dataset
result = ai.generate_from_description(
    description="Create examples of Python code reviews",
    repo="python-reviews",
    dataset="alpaca",
    count=500,
    push_to_hf=True,
    private=True
)

# Access detailed metadata
print(f"📊 Schema: {result.schema}")
print(f"🤖 Model: {result.model}")
print(f"⚙️ Parameters: {result.params}")
print(f"📁 Output: {result.jsonl_path}")

Example 5: Batch Processing

from data4ai import generate_from_description

# Generate multiple datasets
datasets = [
    {
        "description": "Create cooking recipe instructions",
        "repo": "cooking-recipes",
        "count": 50
    },
    {
        "description": "Create math word problems",
        "repo": "math-problems", 
        "count": 100
    },
    {
        "description": "Create programming interview questions",
        "repo": "interview-qa",
        "count": 75
    }
]

for dataset in datasets:
    result = generate_from_description(
        description=dataset["description"],
        repo=dataset["repo"],
        dataset="alpaca",
        count=dataset["count"]
    )
    print(f"✅ Generated {result.row_count} rows for {dataset['repo']}")

Example 6: Custom Configuration

import os
from data4ai import generate_from_description

# Set multiple environment variables
os.environ.update({
    "OPENROUTER_API_KEY": "your_key_here",
    "OPENROUTER_MODEL": "meta-llama/llama-3-8b-instruct",
    "DATA4AI_TEMPERATURE": "0.7",
    "HF_TOKEN": "your_hf_token",
    "HF_ORG": "ZySecAI"
})

# Generate with custom parameters
result = generate_from_description(
    description="Create educational content about machine learning",
    repo="ml-education",
    dataset="alpaca",
    count=300,
    temperature=0.8,
    seed=42,  # For reproducibility
    push_to_hf=True,
    private=False
)

print(f"✅ Generated {result.row_count} ML education examples")
print(f"📁 Published to: https://huggingface.co/datasets/ZySecAI/ml-education")

🧪 Testing and Validation

from data4ai import validate_dataset, get_dataset_stats

# Validate your dataset
validation_result = validate_dataset("my-dataset")
print(f"✅ Validation: {validation_result.is_valid}")
print(f"📊 Quality score: {validation_result.quality_score}")

# Get statistics
stats = get_dataset_stats("my-dataset")
print(f"📈 Total rows: {stats.total_rows}")
print(f"📏 Avg instruction length: {stats.avg_instruction_length}")
print(f"📏 Avg output length: {stats.avg_output_length}")

📋 Supported Schemas

Alpaca Schema (Default)

{
  "instruction": "What is machine learning?",
  "input": "Explain in simple terms",
  "output": "Machine learning is a type of artificial intelligence..."
}

Dolly Schema

{
  "instruction": "Summarize this text",
  "context": "Long text to summarize...",
  "response": "Summary of the text..."
}

ShareGPT Schema (Chat)

{
  "conversations": [
    {"from": "human", "value": "Hello, how are you?"},
    {"from": "gpt", "value": "I'm doing well, thank you!"}
  ]
}
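The three schemas encode the same instruction/response pair in different shapes. As an illustration (this helper is hypothetical, not part of the data4ai API), an Alpaca row maps onto a ShareGPT conversation like this:

```python
def alpaca_to_sharegpt(row):
    """Fold an Alpaca row's instruction + optional input into one human turn."""
    human = row["instruction"]
    if row.get("input"):
        human += "\n\n" + row["input"]
    return {"conversations": [
        {"from": "human", "value": human},
        {"from": "gpt", "value": row["output"]},
    ]}

row = {"instruction": "What is machine learning?",
       "input": "Explain in simple terms",
       "output": "Machine learning is a type of artificial intelligence..."}
convo = alpaca_to_sharegpt(row)
print(convo["conversations"][1]["value"][:16])  # prints Machine learning
```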

Custom Schema

# Define custom schema
custom_schema = {
    "columns": ["question", "answer", "category"],
    "template": "Create {category} questions and answers"
}

# Use custom schema
data4ai run data.xlsx --repo custom-dataset --schema custom_schema

📦 Output Structure

my-dataset/
├── data.jsonl          # Main dataset file (unsloth compatible)
├── meta.json           # Generation metadata and parameters
├── sample.xlsx         # Original Excel template (if used)
├── validation.json     # Data quality metrics
└── README.md           # Auto-generated dataset documentation
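Because data.jsonl stores one JSON object per line, downstream tools can stream it without loading everything at once. A small self-contained sketch (it fabricates a one-row dataset so it can run standalone):

```python
import json
from pathlib import Path

out = Path("my-dataset")
out.mkdir(exist_ok=True)
# Stand-in for a generated dataset so the example runs on its own:
(out / "data.jsonl").write_text(
    json.dumps({"instruction": "Say hi", "input": "", "output": "Hi!"}) + "\n",
    encoding="utf-8",
)

# Stream the JSONL file line by line:
with (out / "data.jsonl").open(encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]

print(len(rows), rows[0]["output"])  # prints 1 Hi!
```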

Metadata Example

{
  "schema": "alpaca",
  "model": "meta-llama/llama-3-8b-instruct",
  "row_count": 1000,
  "generated_at": "2024-01-15T10:30:00Z",
  "parameters": {
    "temperature": 0.7,
    "max_rows": 1000,
    "seed": 42
  },
  "quality_metrics": {
    "avg_instruction_length": 45,
    "avg_output_length": 120,
    "completion_rate": 0.98
  }
}
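The quality metrics above are straightforward to recompute from the JSONL rows. A hedged sketch (how data4ai actually derives them is not documented here; "completion_rate" is read as the share of rows with a non-empty output):

```python
rows = [
    {"instruction": "What is Python?", "input": "", "output": "A programming language."},
    {"instruction": "Define recursion", "input": "", "output": ""},  # incomplete row
]

# Average field lengths in characters, plus the fraction of completed rows.
avg_instruction_length = sum(len(r["instruction"]) for r in rows) / len(rows)
avg_output_length = sum(len(r["output"]) for r in rows) / len(rows)
completion_rate = sum(1 for r in rows if r["output"].strip()) / len(rows)

print(avg_instruction_length, completion_rate)  # prints 15.5 0.5
```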

โ“ FAQ & Troubleshooting

Common Issues

Q: "OpenRouter API key not found"

# Set your API key
export OPENROUTER_API_KEY="your_key_here"
# Or use a .env file
echo "OPENROUTER_API_KEY=your_key_here" > .env

Q: "Model not available"

# Check available models
data4ai list-models
# Use a different model
data4ai run data.xlsx --model "anthropic/claude-3-5-sonnet"

Q: "Excel file not found"

# Create template first
data4ai create-sample my_data.xlsx --dataset alpaca
# Then edit and run
data4ai run my_data.xlsx --repo my-dataset

Q: "Hugging Face push failed"

# Set HF token
export HF_TOKEN="your_hf_token"
# Check token validity
data4ai hf test

Performance Tips

  • Start Small: Use --max-rows 100 to test quality before scaling
  • Use Specific Prompts: Detailed descriptions produce better results
  • Set Seeds: Use --seed 42 for reproducible results
  • Monitor Costs: Check OpenRouter usage dashboard
  • Batch Processing: Use multiple small runs instead of one large run

Quality Improvement

  • Template Examples: Provide 5-10 good examples in Excel
  • Clear Instructions: Be specific about desired output format
  • Temperature Tuning: Lower (0.3-0.5) for factual, higher (0.7-0.9) for creative
  • Model Selection: Use larger models for complex tasks

🔮 DSPy Integration

Data4AI now includes DSPy (Declarative Self-Improving Language Programs) integration for dynamic, high-quality prompt generation. DSPy uses signatures to optimize prompts automatically, resulting in better dataset quality.

Key Benefits

  • 🎯 Dynamic Prompts: Generate context-aware prompts instead of static templates
  • 🔄 Adaptive Learning: Improve prompts based on previous examples
  • 📊 Schema Awareness: Optimized prompts for different dataset schemas
  • 🛡️ Fallback Support: Automatic fallback to static prompts if DSPy fails
  • ⚡ Performance: Efficient prompt generation with caching

Usage Examples

Basic DSPy Generation

# Enable DSPy (default)
data4ai prompt \
  --repo dspy-dataset \
  --description "Create educational content about machine learning" \
  --count 10 \
  --use-dspy

# Disable DSPy (use static prompts)
data4ai prompt \
  --repo static-dataset \
  --description "Create educational content about machine learning" \
  --count 10 \
  --no-use-dspy

Python API with DSPy

from data4ai.integrations.dspy_prompts import create_prompt_generator
from data4ai.generator import DatasetGenerator

# Create DSPy prompt generator
prompt_generator = create_prompt_generator(
    model_name="meta-llama/llama-3-8b-instruct",
    use_dspy=True
)

# Generate dynamic prompt
prompt = prompt_generator.generate_schema_prompt(
    description="Create programming questions",
    schema_name="alpaca",
    count=5,
    use_dspy=True
)

# Use with dataset generator
generator = DatasetGenerator(model="meta-llama/llama-3-8b-instruct")
result = generator.generate_from_prompt_sync(
    description="Create programming questions",
    output_dir="outputs/dspy-example",
    schema_name="alpaca",
    count=10
)

Adaptive Prompting

# Generate adaptive prompts using previous examples
previous_examples = [
    {"instruction": "Write a function", "input": "", "output": "def func(): pass"},
    {"instruction": "Create a class", "input": "", "output": "class MyClass: pass"}
]

adaptive_prompt = prompt_generator.generate_adaptive_prompt(
    description="Create more programming examples",
    schema_name="alpaca",
    count=3,
    previous_examples=previous_examples
)

Configuration

DSPy is enabled by default. You can configure it in your .env file:

# Enable/disable DSPy
DATA4AI_USE_DSPY=true

# DSPy model (defaults to your main model)
DATA4AI_DSPY_MODEL=meta-llama/llama-3-8b-instruct

Advanced Features

  • Schema-Specific Optimization: Different prompt strategies for Alpaca, Dolly, ShareGPT
  • Few-Shot Learning: Use previous examples to improve future prompts
  • Error Recovery: Automatic fallback to static prompts if DSPy fails
  • Performance Monitoring: Track prompt generation performance and quality
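The Error Recovery behavior (fall back to a static prompt when the DSPy path fails) follows a common try/except fallback pattern. The sketch below is illustrative and uses a stub in place of the real DSPy call:

```python
def dspy_generate(description, schema_name, count):
    # Stub standing in for the DSPy-backed generator; here it always fails
    # so that the fallback path is exercised.
    raise RuntimeError("DSPy unavailable in this sketch")

def generate_prompt(description, schema_name, count, use_dspy=True):
    if use_dspy:
        try:
            return dspy_generate(description, schema_name, count)
        except Exception:
            pass  # fall through to the static template
    return f"Generate {count} {schema_name}-format rows for: {description}"

print(generate_prompt("Create programming questions", "alpaca", 5))
# prints Generate 5 alpaca-format rows for: Create programming questions
```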

🧰 Advanced Features

Batch Processing

# Process multiple Excel files
for file in datasets/*.xlsx; do
  data4ai run "$file" --repo "$(basename "$file" .xlsx)" --dataset alpaca
done

Data Validation

# Validate generated dataset
data4ai validate --repo my-dataset

# Check quality metrics
data4ai stats --repo my-dataset

Custom Templates

# Create custom Excel template
from data4ai import create_custom_template

template = {
    "columns": ["question", "answer", "difficulty", "topic"],
    "examples": [
        ["What is Python?", "Python is a programming language", "easy", "programming"],
        ["Explain recursion", "Recursion is when a function calls itself", "medium", "algorithms"]
    ]
}

create_custom_template("custom.xlsx", template)

Integration with Training Pipelines

# Direct integration with unsloth
from data4ai import generate_from_excel
from trl import SFTTrainer
from unsloth import FastLanguageModel

# Generate dataset
result = generate_from_excel("data.xlsx", repo="training-data")

# Load for training
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)

# Train with generated data
trainer = SFTTrainer(
    model=model,
    train_dataset=result.load_dataset(),
    # ... other training params
)

๐Ÿค Contributing

We welcome contributions! Here's how you can help:

Development Setup

git clone https://github.com/zysec/data4ai.git
cd data4ai
pip install -e ".[dev]"
pre-commit install

Areas for Contribution

  • New Schemas: Add support for more dataset formats
  • Quality Improvements: Better validation and error handling
  • Performance: Optimize generation speed and cost
  • Documentation: Improve examples and guides
  • Testing: Add more test cases and edge cases

Submitting Changes

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

📄 License

MIT License © ZySec AI


ZySec AI — Future Starts Here 🚀
