Enterprise-Grade Synthetic Data Generation
OmniGen 🚀
Generate synthetic data at scale using an enterprise-ready framework with fully customizable configuration, security, and ease of use
Built by Ultrasafe AI for production environments.
What is OmniGen?
OmniGen is an enterprise-grade framework for generating synthetic datasets at scale—from scratch or from base data. Generate trillions of tokens and billions of samples across multiple modalities:
🎯 Data Types Supported
- 💬 Conversational Data - Single-turn to multi-turn dialogues
- 🤖 Agentic Datasets - Tool use, function calling, multi-step reasoning
- 🎨 Multimodal Datasets - Text, images, audio, video combinations
- 🖼️ Images - Synthetic image generation and editing
- 🎵 Audio - Speech, music, sound effects
- 🎬 Video - Synthetic video sequences
🎓 Use Cases
- Fine-Tuning - Instruction following, task-specific models
- Supervised Fine-Tuning (SFT) - High-quality labeled datasets
- Offline Reinforcement Learning - Preference datasets with rewards
- Online Reinforcement Learning - Ground truth with reward checking scripts
- Pre-Training - Large-scale corpus generation
- Machine Learning - Training data for any ML task
🏗️ Why OmniGen?
- ✅ Enterprise-Ready - Built for production at scale
- ✅ Fully Customizable - Configure every aspect of generation
- ✅ Secure - Complete isolation, no data mixing
- ✅ Easy - Simple API, clear examples
- ✅ Modular - Independent pipelines for different data types
🚀 Currently Available Pipeline
conversation_extension - Extend Single-Turn to Multi-Turn Conversations
Turn your base questions into rich multi-turn dialogues. This is just the first pipeline—more coming soon!
Why conversation_extension?
✅ Simple - One command to generate thousands of conversations
✅ Scalable - Parallel processing for fast generation
✅ Flexible - Mix different AI providers (OpenAI, Anthropic, OpenRouter, Ultrasafe AI)
✅ Production Ready - Built for SaaS platforms with multi-tenant support
Quick Start
1. Install
pip install omnigen
2. Prepare Base Data
Create a file base_data.jsonl with your starting questions:
{"conversations": [{"role": "user", "content": "How do I learn Python?"}]}
{"conversations": [{"role": "user", "content": "What is machine learning?"}]}
{"conversations": [{"role": "user", "content": "Explain neural networks"}]}
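A base file like the one above can be written and sanity-checked with the standard library alone. The sketch below is an illustration and not part of OmniGen: the helper names `write_base_data` and `validate_base_data` are hypothetical, and the validation rule (each line is a JSON object whose `conversations` list starts with a `user` message) is an assumption inferred from the sample records.

```python
import json


def write_base_data(questions, path):
    """Write one single-turn conversation per line (JSONL)."""
    with open(path, "w", encoding="utf-8") as f:
        for question in questions:
            record = {"conversations": [{"role": "user", "content": question}]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def validate_base_data(path):
    """Return the number of valid records; raise ValueError on a bad line."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)  # JSONDecodeError is a ValueError
            turns = record.get("conversations") if isinstance(record, dict) else None
            # Assumed rule: a non-empty list whose first message is from the user.
            if (not isinstance(turns, list) or not turns
                    or not isinstance(turns[0], dict)
                    or turns[0].get("role") != "user"):
                raise ValueError(f"line {line_no}: expected a leading user message")
            count += 1
    return count
```

Calling `write_base_data(["How do I learn Python?"], "base_data.jsonl")` followed by `validate_base_data("base_data.jsonl")` catches malformed lines before any API calls are spent on them.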
3. Generate Conversations
from omnigen.pipelines.conversation_extension import (
    ConversationExtensionConfigBuilder,
    ConversationExtensionPipeline
)

# Configure the pipeline
config = (ConversationExtensionConfigBuilder()
    # User followup generator
    .add_provider(
        role='user_followup',
        name='ultrasafe',
        api_key='your-api-key',
        model='usf-mini'
    )
    # Assistant response generator
    .add_provider(
        role='assistant_response',
        name='ultrasafe',
        api_key='your-api-key',
        model='usf-mini'
    )
    # Generation settings
    .set_generation(
        num_conversations=100,
        turn_range=(3, 8)  # 3-8 turns per conversation
    )
    # Input data
    .set_data_source(
        source_type='file',
        file_path='base_data.jsonl'
    )
    # Output
    .set_storage(
        type='jsonl',
        output_file='output.jsonl'
    )
    .build()
)

# Run the pipeline
pipeline = ConversationExtensionPipeline(config)
pipeline.run()
4. Get Results
Your generated conversations will be in output.jsonl, one JSON object per line (a single record is shown here pretty-printed for readability):
{
  "id": 0,
  "conversations": [
    {"role": "user", "content": "How do I learn Python?"},
    {"role": "assistant", "content": "Great choice! Start with the basics..."},
    {"role": "user", "content": "What resources do you recommend?"},
    {"role": "assistant", "content": "I recommend these resources..."},
    {"role": "user", "content": "How long will it take?"},
    {"role": "assistant", "content": "With consistent practice..."}
  ],
  "num_turns": 3,
  "success": true
}
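Once a run finishes, a quick summary of the output helps confirm that generation behaved as expected. The sketch below is a hypothetical helper, not an OmniGen API. It assumes every record carries the `success` and `num_turns` fields shown in the sample above. In that sample, `num_turns` is 3 for six messages, which suggests a turn counts one user/assistant pair.

```python
import json


def summarize_output(lines):
    """Summarize generated records given an iterable of JSONL lines."""
    total = succeeded = turns = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        total += 1
        if record.get("success"):
            succeeded += 1
            turns += record.get("num_turns", 0)
    # Average turns over successful records only.
    avg_turns = turns / succeeded if succeeded else 0.0
    return {"total": total, "succeeded": succeeded, "avg_turns": avg_turns}
```

Because an open file iterates line by line, the helper can be used as `with open("output.jsonl", encoding="utf-8") as f: print(summarize_output(f))`.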
Supported AI Providers
| Provider | Model Examples |
|---|---|
| Ultrasafe AI | usf-mini, usf-max |
| OpenAI | gpt-4-turbo, gpt-3.5-turbo |
| Anthropic | claude-3-5-sonnet, claude-3-opus |
| OpenRouter | Various models |
Mix Different Providers
config = (ConversationExtensionConfigBuilder()
    # Each provider needs its own API key
    .add_provider('user_followup', 'openai', openai_api_key, 'gpt-4-turbo')
    .add_provider('assistant_response', 'anthropic', anthropic_api_key, 'claude-3-5-sonnet')
    # ... rest of config
    .build()
)
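When two providers are mixed, each one generally needs its own credential, and it is safer to keep those out of source code. The following sketch reads them from environment variables. The function `load_api_keys` and the variable names `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` are assumptions chosen for illustration; nothing here implies that OmniGen reads these variables itself.

```python
import os


def load_api_keys(env=os.environ):
    """Read one API key per provider from environment variables."""
    # Hypothetical mapping from provider name to environment variable.
    names = {"openai": "OPENAI_API_KEY", "anthropic": "ANTHROPIC_API_KEY"}
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {provider: env[var] for provider, var in names.items()}
```

The returned dictionary can then supply the key argument of each `add_provider` call, for example `keys["openai"]` for the `user_followup` role.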
Advanced Features
Multi-Tenant SaaS Support
Perfect for platforms serving multiple users concurrently:
# Each user gets isolated workspace
workspace_id = f"user_{user_id}_session_{session_id}"
config = (ConversationExtensionConfigBuilder(workspace_id=workspace_id)
    .add_provider('user_followup', 'ultrasafe', shared_api_key, 'usf-mini')
    .add_provider('assistant_response', 'ultrasafe', shared_api_key, 'usf-mini')
    .set_storage('jsonl', output_file='output.jsonl')  # Auto-isolated
    .build()
)
# Storage automatically goes to: workspaces/{workspace_id}/output.jsonl
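Since the workspace id ends up in a filesystem path, a SaaS backend may want to validate it before passing it to the builder. The sketch below shows one way to derive the `workspaces/{workspace_id}/output.jsonl` layout described above while rejecting ids that could escape the workspace root. It is an illustration under stated assumptions: `workspace_output_path` is a hypothetical helper, the allowed character set is a choice made here, and OmniGen's internal path handling may differ.

```python
import re
from pathlib import Path


def workspace_output_path(workspace_id, output_file, root="workspaces"):
    """Build root/<workspace_id>/<output_file>, rejecting unsafe input."""
    # Allow only letters, digits, underscore, and hyphen in the id.
    if not re.fullmatch(r"[A-Za-z0-9_-]+", workspace_id):
        raise ValueError(f"unsafe workspace id: {workspace_id!r}")
    # Keep only the final path component so the file cannot leave the workspace.
    name = Path(output_file).name
    if not name:
        raise ValueError(f"invalid output file: {output_file!r}")
    return Path(root) / workspace_id / name
```

An id built as `f"user_{user_id}_session_{session_id}"` passes this check as long as both parts are alphanumeric, while an id such as `../other_user` is rejected.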
Parallel Dataset Generation
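from concurrent.futures import ProcessPoolExecutor

from omnigen.pipelines.conversation_extension import (
    ConversationExtensionConfigBuilder,
    ConversationExtensionPipeline
)

api_key = 'your-api-key'

def process_dataset(input_file, output_file):
    # Build and run one independent pipeline per dataset
    config = (ConversationExtensionConfigBuilder()
        .add_provider('user_followup', 'ultrasafe', api_key, 'usf-mini')
        .add_provider('assistant_response', 'ultrasafe', api_key, 'usf-mini')
        .set_data_source('file', file_path=input_file)
        .set_storage('jsonl', output_file=output_file)
        .build()
    )
    ConversationExtensionPipeline(config).run()

# The guard is required on platforms that spawn worker processes (Windows, macOS)
if __name__ == '__main__':
    datasets = [('data1.jsonl', 'out1.jsonl'), ('data2.jsonl', 'out2.jsonl'), ('data3.jsonl', 'out3.jsonl')]
    # Process 3 datasets in parallel
    with ProcessPoolExecutor(max_workers=3) as executor:
        futures = [executor.submit(process_dataset, src, dst) for src, dst in datasets]
        for future in futures:
            future.result()  # Re-raise any exception from a worker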
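After a parallel run, the per-dataset outputs often need to be combined into one file. The sketch below concatenates several JSONL files with the standard library and renumbers the `id` field. It assumes each output file numbers its records independently (the sample record above has `"id": 0`), so ids could collide across files. `merge_jsonl` is a hypothetical helper, not part of OmniGen.

```python
import json


def merge_jsonl(input_paths, output_path):
    """Concatenate JSONL files, reassigning sequential ids; return record count."""
    next_id = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for path in input_paths:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    if not line.strip():
                        continue  # skip blank lines
                    record = json.loads(line)
                    record["id"] = next_id  # per-file ids may collide
                    out.write(json.dumps(record, ensure_ascii=False) + "\n")
                    next_id += 1
    return next_id
```

For the three outputs above, `merge_jsonl(['out1.jsonl', 'out2.jsonl', 'out3.jsonl'], 'merged.jsonl')` would produce a single file with ids running from 0 upward.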
Examples
See examples/conversation_extension/ for more examples:
- Simple usage with JSONL files
- Multi-dataset parallel processing
- Multi-tenant SaaS implementation
Documentation
License
MIT License - Ultrasafe AI © 2024
About Ultrasafe AI
Enterprise-grade AI tools with a focus on safety and performance.
- 🌐 Website: us.inc
- 📧 Email: support@us.inc
Made with ❤️ by Ultrasafe AI
File details
Details for the file omnigen_usf-0.0.1.post1.tar.gz.
File metadata
- Download URL: omnigen_usf-0.0.1.post1.tar.gz
- Upload date:
- Size: 37.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 184b01cd6e22049c5c17155551f78c951122cf06ebc31f42461a15312c4880f9 |
| MD5 | 9f14b7faa48e755ff6e813d3c24e1653 |
| BLAKE2b-256 | 7c0609b88cfc40e8d111d0806282e87022322aeac413774605bf538271e5a6fa |
File details
Details for the file omnigen_usf-0.0.1.post1-py3-none-any.whl.
File metadata
- Download URL: omnigen_usf-0.0.1.post1-py3-none-any.whl
- Upload date:
- Size: 37.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 763eede66e489b936a4554887e7899f1ac50568f02ebfcf5bb1009a29963d8a3 |
| MD5 | 1877e4d859fdfb96e25a684a0ffca65d |
| BLAKE2b-256 | 615e03ab7bdca1daf26ab5b2d57f8d4a1a9242486f2e2da6a33eec1c44d67525 |