DeepFabric

Large-Scale, Topic-Based Synthetic Data Generation for LLM Fine-Tuning & Training

Project description

Generate High-Quality Synthetic Datasets at Scale

DeepFabric is a powerful synthetic dataset generation framework that leverages LLMs to create high-quality, diverse training data at scale. Built for ML engineers, researchers, and AI developers, it streamlines the entire dataset creation pipeline from topic generation to model-ready formats.

No more unruly models that fail at tool calling, or that ignore reams of natural-language instructions meant to coax out structured output. DeepFabric ensures your datasets are consistent, well-structured, and ready for fine-tuning or evaluation.

Key Features

Core Capabilities

Hierarchical Topic Generation: Tree and graph-based architectures for comprehensive domain coverage
Multi-Format Export: Direct export to popular training formats (no conversion scripts needed)
Conversation Templates: Support for various dialogue patterns and reasoning styles
Tool Calling Support: Generate function-calling and agent interaction datasets
Structured Output: Pydantic & Outlines enforced schemas for consistent, high-quality data
Multi-Provider Support: Works with OpenAI, Anthropic, Google, Ollama, and more
HuggingFace Integration: Direct dataset upload with auto-generated cards

Supported Output Formats

Format	Template	Use Case	Framework Compatibility
TRL SFT Tools	`builtin://trl_sft_tools`	Tool calling fine-tuning	HuggingFace TRL SFTTrainer
Alpaca	`builtin://alpaca.py`	Instruction-following	Stanford Alpaca, LLaMA
ChatML	`builtin://chatml.py`	Multi-turn conversations	Most chat models
Conversations	`builtin://conversations.py`	Generic conversations format	Unsloth, Axolotl, HF TRL
GRPO	`builtin://grpo.py`	Mathematical reasoning	GRPO training
Tool Calling	`builtin://tool_calling.py`	Function calling	Agent training
Single Tool Call	`builtin://single_tool_call.py`	Individual tool calls	Single function execution
XLAM v2	`builtin://xlam_v2`	Multi-turn tool calling	Salesforce xLAM models
Harmony	`builtin://harmony.py`	Reasoning with tags	OpenAI gpt-oss
Custom	`file://your_format.py`	Your requirements	Any framework

Custom Format

You can create your own custom output format by implementing a simple Python class with a format method using the deepfabric library and BaseFormatter class. See the Custom Format Guide for details.
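As a rough illustration of what such a formatter does, the sketch below converts one sample in the `messages` layout into an Alpaca-style record. It deliberately does not import deepfabric: the class name `AlpacaLikeFormatter` and the signature of its `format` method are assumptions made for illustration, and the real `BaseFormatter` interface may differ, so check the Custom Format Guide before adapting it.

```python
from typing import Any


class AlpacaLikeFormatter:
    """Hypothetical stand-in for a custom formatter (not the deepfabric API)."""

    def format(self, sample: dict[str, Any]) -> dict[str, str]:
        """Map a {"messages": [...]} sample to instruction/input/output keys."""
        messages = sample.get("messages", [])
        # Use the first user turn as the instruction and the first
        # assistant turn as the output; any other turns are ignored here.
        user = next((m["content"] for m in messages if m["role"] == "user"), "")
        assistant = next(
            (m["content"] for m in messages if m["role"] == "assistant"), ""
        )
        return {"instruction": user, "input": "", "output": assistant}


sample = {
    "messages": [
        {"role": "user", "content": "What is a qubit?"},
        {"role": "assistant", "content": "A qubit is the basic unit of quantum information."},
    ]
}
record = AlpacaLikeFormatter().format(sample)
print(record["instruction"])
```

Keeping the conversion in a single method makes it easy to unit test a format on a handful of samples before generating a full dataset.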

Conversation Templates

Template Type	Description	Example Use Case
Single-Turn	Question → Answer	FAQ, classification
Multi-Turn	Extended dialogues	Chatbots, tutoring
Chain of Thought (CoT)	Step-by-step reasoning	Math, logic problems
Structured CoT	Explicit reasoning traces	Educational content
Hybrid CoT	Mixed reasoning styles	Complex problem-solving
Tool Calling	Function invocations	Agent interactions
System-Prompted	With system instructions	Role-playing, personas

Template Missing?

If there's a format or feature you'd like to see, please open an issue.

DeepFabric Pipeline

DeepFabric is designed to work within a modular MLOps pipeline, allowing you to customize each stage of the dataset generation process. The main components are:

Topic Generation: Create a structured topic tree or graph based on a high-level prompt.
Data Generation: Generate training examples for each topic using LLMs.
Format Engine: Convert raw outputs into your desired dataset format.

graph LR
    A[Topic Prompt] --> B[Topic Tree/Graph]
    B --> C[Data Generator]
    C --> D[Format Engine]
    D --> E[Export/Upload]

By decoupling these components, you can easily swap out models, prompts, and formats to suit your specific needs, along with version controlling your configurations for reproducibility.

Quickstart

1. Install DeepFabric

pip install deepfabric

2. Generate Your First Dataset

# Set your API key (or use Ollama for local generation)
export OPENAI_API_KEY="your-api-key"

# Generate a dataset with a single command
deepfabric generate \
  --mode tree \
  --provider openai \
  --model gpt-4o \
  --depth 3 \
  --degree 3 \
  --num-steps 27 \
  --batch-size 1 \
  --topic-prompt "The history of quantum physics" \
  --generation-system-prompt "You are an expert on academic history, with a specialism in the sciences" \
  --dataset-save-as dataset.jsonl

DeepFabric will automatically:

Generate a hierarchical topic tree (3 levels deep, 3 branches per level)
Create 27 diverse Q&A pairs across the generated topics
Save your dataset to dataset.jsonl

[!NOTE]
Want to generate faster? Increase the batch size. For example, with --batch-size 3 and --num-steps 9, DeepFabric generates 3 entries at a time while staying within OpenAI's rate limits (we use retries with backoff, jitter, etc.).
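The numbers in this quickstart can be checked with a little arithmetic. The sketch below assumes a full tree in which every node has `degree` children, and assumes that each step produces `batch_size` entries; the helper names are illustrative and are not part of DeepFabric.

```python
def leaf_topics(depth: int, degree: int) -> int:
    """Leaf count of a full tree with `degree` branches at each of `depth` levels."""
    return degree ** depth


def total_samples(num_steps: int, batch_size: int) -> int:
    """Entries produced when every step generates `batch_size` entries."""
    return num_steps * batch_size


print(leaf_topics(depth=3, degree=3))             # 27
print(total_samples(num_steps=27, batch_size=1))  # 27
print(total_samples(num_steps=9, batch_size=3))   # 27
```

Under these assumptions, 27 steps at batch size 1 and 9 steps at batch size 3 both cover the 27 leaf topics once.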

3. Use Your Dataset

Your dataset is ready in the OpenAI standard instruct format (JSONL):

{
  "messages": [
    {
      "role": "user",
      "content": "Can you explain Albert Einstein's contribution to quantum theory?"
    },
    {
      "role": "assistant",
      "content": "Albert Einstein made significant contributions to quantum theory, particularly through his explanation of the photoelectric effect, for which he won the Nobel Prize in 1921. He proposed that light could be thought of as discrete packets of energy called quanta or photons, which could explain how electrons are emitted from metals when exposed to light. This idea was instrumental in the development of quantum mechanics. He later became famous for his skepticism about quantum mechanics' probabilistic interpretation, leading to his quote \"God does not play dice with the universe.\""
    }
  ]
}
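Before training, it can help to load the file and confirm every record has the expected shape. The following is a minimal sketch, assuming each record occupies a single line (as JSONL requires) and that only `system`, `user`, and `assistant` roles appear; tool-calling formats may use other roles, so adjust `ALLOWED_ROLES` as needed. The function name `load_dataset` is ours, not a DeepFabric API.

```python
import json
import tempfile
from pathlib import Path

ALLOWED_ROLES = {"system", "user", "assistant"}  # assumption, see lead-in


def load_dataset(path: str) -> list[dict]:
    """Read one JSON object per line and check the messages layout."""
    records = []
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        for message in record["messages"]:
            if message.get("role") not in ALLOWED_ROLES:
                raise ValueError(f"line {line_no}: unexpected role {message.get('role')!r}")
            if not isinstance(message.get("content"), str):
                raise ValueError(f"line {line_no}: content must be a string")
        records.append(record)
    return records


# Demo on a throwaway file so the example runs without a generated dataset.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "dataset.jsonl"
    demo_record = {
        "messages": [
            {"role": "user", "content": "Who proposed light quanta?"},
            {"role": "assistant", "content": "Albert Einstein, in 1905."},
        ]
    }
    demo_path.write_text(json.dumps(demo_record) + "\n", encoding="utf-8")
    records = load_dataset(str(demo_path))

print(len(records))  # 1
```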

4. Use Local Models

Generate larger datasets with different models:

# With a depth of 4 and a degree of 5, the tree has up to 5^4 = 625 leaf topics
deepfabric generate \
  --provider ollama \
  --model qwen3:32b \
  --depth 4 \
  --degree 5 \
  --num-steps 100 \
  --batch-size 5 \
  --topic-prompt "Machine Learning Fundamentals" \
  --generation-system-prompt "You are an expert on Machine Learning and its application in modern technologies" \
  --dataset-save-as dataset.jsonl

There are lots more examples to get you going.

Topic Generation Modes

Mode	Structure	Use Case	Max Topics
Tree	Hierarchical branching	Well-organized domains	degree^depth
Graph	DAG with cross-connections	Interconnected concepts	Flexible
Linear	Sequential topics	Simple lists	User-defined
Custom	User-provided structure	Specific requirements	Unlimited

Provider Support Matrix

Provider	Models	Best For	Local/Cloud
OpenAI	GPT-4, GPT-4o, GPT-3.5	High quality, complex tasks	Cloud
Anthropic	Claude 3.5 Sonnet, Haiku	Nuanced reasoning	Cloud
Google	Gemini 2.0, 1.5	Cost-effective at scale	Cloud
Ollama	Llama, Mistral, Qwen, etc.	Privacy, unlimited generation	Local
Together	Open models	Fast inference	Cloud
Groq	Llama, Mixtral	Ultra-fast generation	Cloud

Configuration System

DeepFabric uses a flexible YAML-based configuration with extensive CLI overrides:

# Main system prompt - used as fallback throughout the pipeline
dataset_system_prompt: "You are a helpful AI assistant providing clear, educational responses."

# Topic Tree Configuration
# Generates a hierarchical topic structure using tree generation
topic_tree:
  topic_prompt: "Python programming fundamentals and best practices"

  # LLM Settings
  provider: "ollama"                    # Options: openai, anthropic, gemini, ollama
  model: "qwen3:0.6b"                    # Change to your preferred model
  temperature: 0.7                      # 0.0 = deterministic, 1.0 = creative

  # Tree Structure
  degree: 2                             # Number of subtopics per node (1-10)
  depth: 2                              # Depth of the tree (1-5)

  # Topic generation prompt (optional - uses dataset_system_prompt if not specified)
  topic_system_prompt: "You are a curriculum designer creating comprehensive programming learning paths. Focus on practical concepts that beginners need to master."

  # Output
  save_as: "python_topics_tree.jsonl"  # Where to save the generated topic tree

# Data Engine Configuration
# Generates the actual training examples
data_engine:
  instructions: "Create clear programming tutorials with working code examples and explanations"

  # LLM Settings (can override main provider/model)
  provider: "ollama"
  model: "qwen3:0.6b"
  temperature: 0.3                      # Lower temperature for more consistent code
  max_retries: 3                        # Number of retries for failed generations

  # Content generation prompt
  generation_system_prompt: "You are a Python programming instructor creating educational content. Provide working code examples, clear explanations, and practical applications."

# Dataset Assembly Configuration
# Controls how the final dataset is created and formatted
dataset:
  creation:
    num_steps: 4                        # Number of generation steps to run
    batch_size: 1                       # Process 1 example at a time
    sys_msg: true                       # Include system messages in output format

  # Output
  save_as: "python_programming_dataset.jsonl"

# Optional Hugging Face Hub configuration
huggingface:
  # Repository in format "username/dataset-name"
  repository: "your-username/your-dataset-name"
  # Token can also be provided via HF_TOKEN environment variable or --hf-token CLI option
  token: "your-hf-token"
  # Additional tags for the dataset (optional)
  # "deepfabric" and "synthetic" tags are added automatically
  tags:
    - "deepfabric-generated-dataset"
    - "geography"
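Two rules are stated only in the comments of the configuration above: the suggested ranges for `degree` (1-10) and `depth` (1-5), and the fallback from `topic_system_prompt` to `dataset_system_prompt`. The sketch below expresses both on a plain dictionary that mirrors the YAML. The function names are hypothetical, and whether DeepFabric enforces these ranges in exactly this way is an assumption.

```python
def check_tree_ranges(tree: dict) -> list[str]:
    """Return human-readable problems for out-of-range tree settings."""
    problems = []
    if not 1 <= tree.get("degree", 0) <= 10:
        problems.append("degree must be in 1-10")
    if not 1 <= tree.get("depth", 0) <= 5:
        problems.append("depth must be in 1-5")
    return problems


def topic_system_prompt(config: dict) -> str:
    """Use topic_tree.topic_system_prompt, else fall back to dataset_system_prompt."""
    tree = config.get("topic_tree", {})
    return tree.get("topic_system_prompt") or config["dataset_system_prompt"]


config = {
    "dataset_system_prompt": "You are a helpful AI assistant providing clear, educational responses.",
    "topic_tree": {"degree": 2, "depth": 2},
}
print(check_tree_ranges(config["topic_tree"]))  # []
print(topic_system_prompt(config))
```

Running a check like this before a long generation job catches a mistyped value without spending any API calls.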

Run using the CLI:

deepfabric generate config.yaml

The CLI supports various options to override configuration values:

# --sys-msg controls system message inclusion (default: true)
deepfabric generate config.yaml \
  --save-tree output_tree.jsonl \
  --dataset-save-as output_dataset.jsonl \
  --model-name ollama/qwen3:8b \
  --temperature 0.8 \
  --degree 4 \
  --depth 3 \
  --num-steps 10 \
  --batch-size 2 \
  --sys-msg true \
  --hf-repo username/dataset-name \
  --hf-token your-token \
  --hf-tags tag1 --hf-tags tag2
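Conceptually, an override means that a value given on the command line wins over the value in the YAML file, while options left unset keep their configured value. The following is a minimal sketch of that precedence on a flat dictionary. The function `apply_overrides` is hypothetical, and how DeepFabric actually maps flags onto the nested YAML sections is not shown here.

```python
def apply_overrides(config: dict, overrides: dict) -> dict:
    """Return a copy of `config` in which non-None override values win."""
    merged = dict(config)
    for key, value in overrides.items():
        if value is not None:  # None means "flag not given on the CLI"
            merged[key] = value
    return merged


file_values = {"degree": 2, "depth": 2, "temperature": 0.7}
cli_values = {"degree": 4, "depth": 3, "temperature": None}
print(apply_overrides(file_values, cli_values))
# {'degree': 4, 'depth': 3, 'temperature': 0.7}
```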

Advanced Features

Chain of Thought (CoT) Generation

CoT Style	Template Pattern	Best For
Free-text	Natural language steps	Mathematical problems (GSM8K-style)
Structured	Explicit reasoning traces	Educational content, tutoring
Hybrid	Mixed reasoning	Complex multi-step problems

# Example: Structured CoT configuration
data_engine:
  conversation_template: "cot_structured"
  cot_style: "mathematical"
  include_reasoning_tags: true

Quality Control Features

Deduplication: Automatic removal of similar samples
Validation: Schema enforcement for all outputs
Rate Limiting: Provider-aware retry with exponential backoff and jitter (docs)
Progress Monitoring: Real-time generation statistics
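To make the rate-limiting bullet concrete, here is a generic sketch of retrying with exponential backoff and "full jitter". It illustrates the general technique only and is not DeepFabric's implementation: the function names, the 1-second base, and the 30-second cap are assumptions.

```python
import random
import time


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


def call_with_retry(func, max_retries: int = 3, sleep=time.sleep):
    """Call func(); on failure, wait a jittered, growing delay and try again."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:  # a real client would catch only rate-limit errors
            if attempt == max_retries:
                raise  # retries exhausted, surface the last error
            sleep(backoff_delay(attempt))


calls = {"count": 0}


def flaky():
    """Fail twice, then succeed, to exercise the retry loop."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("rate limited")
    return "ok"


# Pass a no-op sleep so the demo finishes instantly.
print(call_with_retry(flaky, sleep=lambda seconds: None))  # ok
```

Randomizing the delay spreads retries from concurrent workers over time, so they do not all hit the provider again at the same instant.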

📖 Documentation & Resources

Resource	Description	Link
Documentation	Complete API reference & guides	docs
Examples	Ready-to-use configurations	examples/
Discord	Community support	Join Discord
Issues	Bug reports & features	GitHub Issues

Stay Updated

DeepFabric development is moving at a fast pace 🏃‍♂️. To follow the project and be notified of new releases, star the repo.

Contributing

We welcome contributions! Check out our good first issues to get started.

Development Setup

git clone https://github.com/lukehinds/deepfabric
cd deepfabric
uv sync --all-extras  # Install with dev dependencies
make test            # Run tests
make format          # Format code

Community & Support

Discord: Join our community for real-time help
Issues: Report bugs or request features
Discussions: Share your use cases and datasets

Who's Using DeepFabric?

If you're using DeepFabric in production or research, we'd love to hear from you! Share your experience in our Discord or open a discussion.

Use Cases

Industry Applications

Use Case	Description	Example Config
Model Distillation	Teacher-student training	distillation.yaml
Evaluation Benchmarks	Model testing datasets	benchmark.yaml
Domain Adaptation	Specialized knowledge	domain.yaml
Agent Training	Tool-use & reasoning	agent.yaml
Instruction Tuning	Task-specific models	instruct.yaml
Math Reasoning	Step-by-step solutions	math.yaml

Tips for Best Results

Start Small: Test with depth=2, degree=3 before scaling up
Mix Models: Use stronger models for topics, faster ones for generation
Iterate: Generate small batches and refine prompts based on results
Validate: Always review a sample before training
Version Control: Save configurations for reproducibility

Analytics

We use privacy-respecting analytics to help us improve application performance and stability. We never send personally identifiable information, and we do not capture prompts, generated content, API keys, file names, or similar data.

What We Collect

Anonymous User ID: A stable, one-way hash based on your machine characteristics (hostname + MAC address). This helps us count unique users without identifying you. Because the hash is one-way, it cannot be reversed to recover your actual machine details.
Usage Metrics: Model names, numeric parameters (temperature, depth, degree, batch_size), timing and success/failure rates
Developer Flag: If you set DEEPFABRIC_DEVELOPER=True, events are marked to help us filter developer testing from real usage
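For readers curious what a stable, one-way identifier of this kind looks like, here is a minimal sketch using only the Python standard library. The choice of SHA-256 and the way hostname and MAC address are combined are assumptions; the text above does not specify the exact scheme DeepFabric uses.

```python
import hashlib
import socket
import uuid


def anonymous_user_id() -> str:
    """Hash hostname + MAC address into a stable 64-character hex digest."""
    hostname = socket.gethostname()
    mac = f"{uuid.getnode():012x}"  # 48-bit hardware address as hex digits
    return hashlib.sha256(f"{hostname}{mac}".encode("utf-8")).hexdigest()


user_id = anonymous_user_id()
print(len(user_id))  # 64
```

Only the digest would leave the machine in such a scheme; the hostname and MAC address themselves are never transmitted.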

Privacy Guarantees

No usernames, emails, IP addresses, or personal information
The user ID is cryptographically hashed, cannot be reversed, and contains no personally identifiable information
No prompts, generated datasets, or sensitive data are collected
All data is used solely to improve the application's performance, stability, and feature usage

Control Your Participation

# Disable all analytics
export ANONYMIZED_TELEMETRY=False

# Mark yourself as a developer (for filtering)
export DEEPFABRIC_DEVELOPER=True

DeepFabric 2.14.1
