Large Scale Topic based Synthetic Data Generation

Generate High-Quality Synthetic Datasets for Fine-Tuning at Scale

DeepFabric is a powerful synthetic dataset generation framework that leverages LLMs to create high-quality, diverse training data at scale. Built for ML engineers, researchers, and AI developers, it streamlines the entire dataset creation pipeline from topic generation to model-ready formats.

No more unruly models that fail at tool calling, and no more reams of natural-language instructions written in the hope of coaxing out structured formats. DeepFabric helps you produce consistent, well-structured datasets that are ready for fine-tuning or evaluation.

Key Features

Core Capabilities

  • 🌳 Hierarchical Topic Generation: Tree and graph-based architectures for comprehensive domain coverage
  • 🔄 Multi-Format Export: Direct export to popular training formats (no conversion scripts needed)
  • 🎭 Conversation Templates: Support for various dialogue patterns and reasoning styles
  • 🛠️ Tool Calling Support: Generate function-calling and agent interaction datasets
  • 📊 Structured Output: Pydantic & Outlines enforced schemas for consistent, high-quality data
  • ☁️ Multi-Provider Support: Works with OpenAI, Anthropic, Google, Ollama, and more
  • 🤗 HuggingFace Integration: Direct dataset upload with auto-generated cards

📊 Supported Output Formats

| Format | Template | Use Case | Framework Compatibility |
|---|---|---|---|
| Alpaca | builtin://alpaca.py | Instruction-following | Stanford Alpaca, LLaMA |
| ChatML | builtin://chatml.py | Multi-turn conversations | Most chat models |
| Unsloth | builtin://unsloth.py | Optimized fine-tuning | Unsloth notebooks |
| GRPO | builtin://grpo.py | Mathematical reasoning | GRPO training |
| Im Format | builtin://im_format.py | Chat with delimiters | ChatML-compatible models |
| Tool Calling | builtin://tool_calling.py | Function calling | Agent training |
| Harmony | builtin://harmony.py | Reasoning with tags | gpt-oss |
| Custom | file://your_format.py | Your requirements | Any framework |
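
To make the difference between two of these targets concrete, the sketch below converts a `messages` sample into an Alpaca-style record and a ChatML-style string. The output shapes follow the publicly documented Alpaca and ChatML conventions; the helper names `to_alpaca` and `to_chatml` are hypothetical and are not DeepFabric's built-in formatters.

```python
# Illustrative only: these helpers are NOT DeepFabric's formatter code.
from typing import Dict, List

Message = Dict[str, str]


def to_alpaca(messages: List[Message]) -> Dict[str, str]:
    """Map one single-turn chat sample onto Alpaca's instruction/input/output keys."""
    user = next(m["content"] for m in messages if m["role"] == "user")
    assistant = next(m["content"] for m in messages if m["role"] == "assistant")
    return {"instruction": user, "input": "", "output": assistant}


def to_chatml(messages: List[Message]) -> str:
    """Render messages as ChatML text with <|im_start|> / <|im_end|> delimiters."""
    return "\n".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages
    )


sample = [
    {"role": "user", "content": "What is a photon?"},
    {"role": "assistant", "content": "A quantum of light."},
]
print(to_alpaca(sample))
print(to_chatml(sample))
```

In practice the built-in templates listed above should be preferred; a sketch like this is mainly useful for understanding what a custom `file://` formatter would need to emit.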

🧠 Conversation Templates

| Template Type | Description | Example Use Case |
|---|---|---|
| Single-Turn | Question → Answer | FAQ, classification |
| Multi-Turn | Extended dialogues | Chatbots, tutoring |
| Chain of Thought (CoT) | Step-by-step reasoning | Math, logic problems |
| Structured CoT | Explicit reasoning traces | Educational content |
| Hybrid CoT | Mixed reasoning styles | Complex problem-solving |
| Tool Calling | Function invocations | Agent interactions |
| System-Prompted | With system instructions | Role-playing, personas |

Something Missing?

If there's a format or feature you'd like to see, please open an issue.

Quickstart

1. Install DeepFabric

pip install deepfabric

2. Generate Your First Dataset

# Set your API key (or use Ollama for local generation)
export OPENAI_API_KEY="your-api-key"

# Generate a dataset with a single command
deepfabric generate \
  --mode tree \
  --provider openai \
  --model gpt-4o \
  --depth 3 \
  --degree 3 \
  --num-steps 9 \
  --batch-size 1 \
  --topic-prompt "The history of quantum physics" \
  --generation-system-prompt "You are an expert on academic history, with a specialism in the sciences" \
  --dataset-save-as dataset.jsonl

DeepFabric will automatically:

  • Generate a hierarchical topic tree (3 levels deep, 3 branches per level)
  • Create 9 diverse Q&A pairs across the generated topics
  • Save your dataset to dataset.jsonl
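
As a rough sizing check, the sketch below counts the topics in such a tree. It assumes the root prompt sits at level 0 and every node gets exactly `degree` children; that counting convention is an assumption, and both function names are hypothetical.

```python
from typing import List


def nodes_per_level(degree: int, depth: int) -> List[int]:
    """Topics at each level below the root, assuming every node has `degree` children."""
    return [degree ** level for level in range(1, depth + 1)]


def leaf_count(degree: int, depth: int) -> int:
    """Number of leaf topics, i.e. distinct root-to-leaf paths."""
    return degree ** depth


print(nodes_per_level(degree=3, depth=3))  # [3, 9, 27]
print(leaf_count(degree=3, depth=3))       # 27
```

Under that assumption, the nine samples requested by `--num-steps 9` with `--batch-size 1` would cover only a subset of the 27 leaf topics.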

3. Use Your Dataset

Your dataset is ready in the OpenAI standard instruct format (JSONL):

{
  "messages": [
    {
      "role": "user",
      "content": "Can you explain Albert Einstein's contribution to quantum theory?"
    },
    {
      "role": "assistant",
      "content": "Albert Einstein made significant contributions to quantum theory, particularly through his explanation of the photoelectric effect, for which he won the Nobel Prize in 1921. He proposed that light could be thought of as discrete packets of energy called quanta or photons, which could explain how electrons are emitted from metals when exposed to light. This idea was instrumental in the development of quantum mechanics. He later became famous for his skepticism about quantum mechanics' probabilistic interpretation, leading to his quote \"God does not play dice with the universe.\""
    }
  ]
}
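
Before training on the file, it can help to load it and check that every record has the shape shown above. The following is a minimal sketch using only the standard library; the function name `iter_samples` and the set of allowed roles are assumptions (tool-calling datasets, for example, may use additional roles).

```python
import io
import json
from typing import Dict, Iterable, Iterator, List

# Assumed role set for plain chat data; extend it if your format uses other roles.
ALLOWED_ROLES = {"system", "user", "assistant"}


def iter_samples(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield the `messages` list of each non-empty JSONL line, checking its shape."""
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        messages = record["messages"]
        for message in messages:
            if message["role"] not in ALLOWED_ROLES:
                raise ValueError(f"line {lineno}: unexpected role {message['role']!r}")
            if not isinstance(message["content"], str):
                raise ValueError(f"line {lineno}: content must be a string")
        yield messages


# In-memory stand-in for a real file, so the example runs without dataset.jsonl.
demo = io.StringIO(
    '{"messages": [{"role": "user", "content": "Hi"}, '
    '{"role": "assistant", "content": "Hello"}]}\n'
)
samples = list(iter_samples(demo))
print(len(samples), samples[0][0]["role"])
```

For a real run, pass an open file handle instead of `demo`, for example `open("dataset.jsonl", encoding="utf-8")`.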

4. Use Local Models

Generate larger datasets with different models:

# Depth 4 with degree 5 gives up to 5^4 = 625 leaf topics
deepfabric generate \
  --provider ollama \
  --model qwen3:32b \
  --depth 4 \
  --degree 5 \
  --num-steps 100 \
  --batch-size 5 \
  --topic-prompt "Machine Learning Fundamentals" \
  --generation-system-prompt "You are an expert on Machine Learning and its application in modern technologies" \
  --dataset-save-as dataset.jsonl

There are lots more examples to get you going.

🚀 Architecture Overview

Generation Pipeline

graph LR
    A[Topic Prompt] --> B[Topic Tree/Graph]
    B --> C[Data Generator]
    C --> D[Format Engine]
    D --> E[Export/Upload]

Topic Generation Modes

| Mode | Structure | Use Case | Max Topics |
|---|---|---|---|
| Tree | Hierarchical branching | Well-organized domains | degree^depth |
| Graph | DAG with cross-connections | Interconnected concepts | Flexible |
| Linear | Sequential topics | Simple lists | User-defined |
| Custom | User-provided structure | Specific requirements | Unlimited |
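
The difference between the Tree and Graph modes can be illustrated with a small adjacency-list sketch. The topic names and the `root_to_leaf_paths` helper are hypothetical, and this is not how DeepFabric stores its topics internally; it only shows how one cross-connection in a DAG adds an extra path through the same set of topics.

```python
from typing import Dict, List

Graph = Dict[str, List[str]]  # parent topic -> child topics


def root_to_leaf_paths(graph: Graph, root: str) -> List[List[str]]:
    """Enumerate every path from `root` to a childless node (graph must be acyclic)."""
    children = graph.get(root, [])
    if not children:
        return [[root]]
    paths = []
    for child in children:
        for tail in root_to_leaf_paths(graph, child):
            paths.append([root] + tail)
    return paths


tree: Graph = {
    "ML": ["Supervised", "Unsupervised"],
    "Supervised": ["Regression"],
    "Unsupervised": ["Clustering"],
}
# Same topics plus one cross-connection: "Regression" is also reachable
# from "Unsupervised", which turns the tree into a DAG.
dag: Graph = {**tree, "Unsupervised": ["Clustering", "Regression"]}

print(len(root_to_leaf_paths(tree, "ML")))  # 2
print(len(root_to_leaf_paths(dag, "ML")))   # 3
```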

Provider Support Matrix

| Provider | Models | Best For | Local/Cloud |
|---|---|---|---|
| OpenAI | GPT-4, GPT-4o, GPT-3.5 | High quality, complex tasks | Cloud |
| Anthropic | Claude 3.5 Sonnet, Haiku | Nuanced reasoning | Cloud |
| Google | Gemini 2.0, 1.5 | Cost-effective at scale | Cloud |
| Ollama | Llama, Mistral, Qwen, etc. | Privacy, unlimited generation | Local |
| Together | Open models | Fast inference | Cloud |
| Groq | Llama, Mixtral | Ultra-fast generation | Cloud |

⚙️ Configuration System

DeepFabric uses a flexible YAML-based configuration with extensive CLI overrides:

# Main system prompt - used as fallback throughout the pipeline
dataset_system_prompt: "You are a helpful AI assistant providing clear, educational responses."

# Topic Tree Configuration
# Generates a hierarchical topic structure using tree generation
topic_tree:
  topic_prompt: "Python programming fundamentals and best practices"

  # LLM Settings
  provider: "ollama"                    # Options: openai, anthropic, gemini, ollama
  model: "qwen3:0.6b"                    # Change to your preferred model
  temperature: 0.7                      # 0.0 = deterministic, 1.0 = creative

  # Tree Structure
  degree: 2                             # Number of subtopics per node (1-10)
  depth: 2                              # Depth of the tree (1-5)

  # Topic generation prompt (optional - uses dataset_system_prompt if not specified)
  topic_system_prompt: "You are a curriculum designer creating comprehensive programming learning paths. Focus on practical concepts that beginners need to master."

  # Output
  save_as: "python_topics_tree.jsonl"  # Where to save the generated topic tree

# Data Engine Configuration
# Generates the actual training examples
data_engine:
  instructions: "Create clear programming tutorials with working code examples and explanations"

  # LLM Settings (can override main provider/model)
  provider: "ollama"
  model: "qwen3:0.6b"
  temperature: 0.3                      # Lower temperature for more consistent code
  max_retries: 3                        # Number of retries for failed generations

  # Content generation prompt
  generation_system_prompt: "You are a Python programming instructor creating educational content. Provide working code examples, clear explanations, and practical applications."

# Dataset Assembly Configuration
# Controls how the final dataset is created and formatted
dataset:
  creation:
    num_steps: 4                        # Number of training examples to generate
    batch_size: 1                       # Process 1 example at a time
    sys_msg: true                       # Include system messages in output format

  # Output
  save_as: "python_programming_dataset.jsonl"

# Optional Hugging Face Hub configuration
huggingface:
  # Repository in format "username/dataset-name"
  repository: "your-username/your-dataset-name"
  # Token can also be provided via HF_TOKEN environment variable or --hf-token CLI option
  token: "your-hf-token"
  # Additional tags for the dataset (optional)
  # "deepfabric" and "synthetic" tags are added automatically
  tags:
    - "deepfabric-generated-dataset"
    - "geography"
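
The comments in the configuration above state ranges for `degree` (1-10) and `depth` (1-5). A pre-flight check along those lines might look like the sketch below. It assumes the YAML has already been parsed into a plain dictionary by a YAML parser of your choice; `check_topic_tree` is a hypothetical helper, not part of DeepFabric.

```python
from typing import Any, Dict, List

# Ranges copied from the comments in the YAML example; the checker is hypothetical.
LIMITS = {"degree": (1, 10), "depth": (1, 5)}


def check_topic_tree(config: Dict[str, Any]) -> List[str]:
    """Return human-readable problems found in the `topic_tree` section."""
    section = config.get("topic_tree", {})
    problems = []
    for key, (low, high) in LIMITS.items():
        value = section.get(key)
        if not isinstance(value, int) or not low <= value <= high:
            problems.append(
                f"topic_tree.{key} should be an integer in [{low}, {high}], got {value!r}"
            )
    return problems


config = {"topic_tree": {"degree": 2, "depth": 2}}  # values from the example
print(check_topic_tree(config))  # []
print(check_topic_tree({"topic_tree": {"degree": 12, "depth": 2}}))
```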

Run using the CLI:

deepfabric generate config.yaml

The CLI supports various options to override configuration values:

# --sys-msg controls system message inclusion (default: true)
deepfabric generate config.yaml \
  --save-tree output_tree.jsonl \
  --dataset-save-as output_dataset.jsonl \
  --model-name ollama/qwen3:8b \
  --temperature 0.8 \
  --degree 4 \
  --depth 3 \
  --num-steps 10 \
  --batch-size 2 \
  --sys-msg true \
  --hf-repo username/dataset-name \
  --hf-token your-token \
  --hf-tags tag1 --hf-tags tag2

📚 Advanced Features

Chain of Thought (CoT) Generation

| CoT Style | Template Pattern | Best For |
|---|---|---|
| Free-text | Natural language steps | Mathematical problems (GSM8K-style) |
| Structured | Explicit reasoning traces | Educational content, tutoring |
| Hybrid | Mixed reasoning | Complex multi-step problems |

# Example: Structured CoT configuration
data_engine:
  conversation_template: "cot_structured"
  cot_style: "mathematical"
  include_reasoning_tags: true

Batch Processing & Performance

| Parameter | Description | Performance Impact |
|---|---|---|
| batch_size | Parallel generation | ↑ Speed, ↑ Memory |
| max_retries | Retry failed generations | ↑ Quality, ↓ Speed |
| temperature | LLM creativity | ↑ Diversity, ↓ Consistency |
| num_workers | Parallel processing | ↑ Speed (with local models) |

Quality Control Features

  • Deduplication: Automatic removal of similar samples
  • Validation: Schema enforcement for all outputs
  • Retry Logic: Automatic retry with backoff for failures
  • Error Tracking: Detailed logs of generation issues
  • Progress Monitoring: Real-time generation statistics
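
To give a feel for what similarity-based deduplication involves, here is a minimal sketch built on the standard library's `difflib`. It is not DeepFabric's deduplication logic; the `dedupe` helper, the normalization step, and the 0.9 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from typing import List


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variations compare equal."""
    return " ".join(text.lower().split())


def dedupe(texts: List[str], threshold: float = 0.9) -> List[str]:
    """Keep a text only if it is less than `threshold` similar to every kept text."""
    kept: List[str] = []
    kept_normalized: List[str] = []
    for text in texts:
        candidate = _normalize(text)
        if all(
            SequenceMatcher(None, candidate, other).ratio() < threshold
            for other in kept_normalized
        ):
            kept.append(text)
            kept_normalized.append(candidate)
    return kept


questions = [
    "What is the photoelectric effect?",
    "what is the  photoelectric effect?",  # identical after normalization
    "Who proposed the uncertainty principle?",
]
print(dedupe(questions))
```

The pairwise comparison is quadratic in the number of samples, so a sketch like this suits spot checks on small batches rather than large datasets.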

📖 Documentation & Resources

| Resource | Description | Link |
|---|---|---|
| Documentation | Complete API reference & guides | docs.deepfabric.io |
| Examples | Ready-to-use configurations | examples/ |
| Discord | Community support | Join Discord |
| Issues | Bug reports & features | GitHub Issues |

Stay Updated

DeepFabric development is moving at a fast pace 🏃‍♂️. To follow the project and be notified of new releases, star the repo.

🤝 Contributing

We welcome contributions! Check out our good first issues to get started.

Development Setup

git clone https://github.com/lukehinds/deepfabric
cd deepfabric
uv sync --all-extras  # Install with dev dependencies
make test            # Run tests
make format          # Format code

📊 Community & Support

Who's Using DeepFabric?

If you're using DeepFabric in production or research, we'd love to hear from you! Share your experience in our Discord or open a discussion.

🏆 Use Cases

Industry Applications

| Use Case | Description | Example Config |
|---|---|---|
| Model Distillation | Teacher-student training | distillation.yaml |
| Evaluation Benchmarks | Model testing datasets | benchmark.yaml |
| Domain Adaptation | Specialized knowledge | domain.yaml |
| Agent Training | Tool-use & reasoning | agent.yaml |
| Instruction Tuning | Task-specific models | instruct.yaml |
| Math Reasoning | Step-by-step solutions | math.yaml |

🛡️ Privacy & Security

Data Protection

  • Local Processing: All data generation can run entirely offline with Ollama
  • No Training Data Storage: Generated content is never stored on our servers
  • API Key Security: Keys are never logged or transmitted to third parties

Analytics

  • Fully anonymized telemetry for performance optimization
  • No PII, prompts, or generated content captured
  • Opt-out: export ANONYMIZED_TELEMETRY=False

💡 Tips for Best Results

  1. Start Small: Test with depth=2, degree=3 before scaling up
  2. Mix Models: Use stronger models for topics, faster ones for generation
  3. Iterate: Generate small batches and refine prompts based on results
  4. Validate: Always review a sample before training
  5. Version Control: Save configurations for reproducibility
