Enterprise-Grade Synthetic Data Generation

OmniGen 🚀

Generate synthetic data at scale using an enterprise-ready framework with fully customizable configuration, built-in security, and ease of use.

Built by Ultrasafe AI for production environments.


What is OmniGen?

OmniGen is an enterprise-grade framework for generating synthetic datasets at scale, either from scratch or from base data. It is designed to generate trillions of tokens and billions of samples across multiple modalities:

🎯 Data Types Supported

  • 💬 Conversational Data - Single-turn to multi-turn dialogues
  • 🤖 Agentic Datasets - Tool use, function calling, multi-step reasoning
  • 🎨 Multimodal Datasets - Text, images, audio, video combinations
  • 🖼️ Images - Synthetic image generation and editing
  • 🎵 Audio - Speech, music, sound effects
  • 🎬 Video - Synthetic video sequences

🎓 Use Cases

  • Fine-Tuning - Instruction following, task-specific models
  • Supervised Fine-Tuning (SFT) - High-quality labeled datasets
  • Offline Reinforcement Learning - Preference datasets with rewards
  • Online Reinforcement Learning - Ground truth with reward checking scripts
  • Pre-Training - Large-scale corpus generation
  • Machine Learning - Training data for any ML task

🏗️ Why OmniGen?

  • Enterprise-Ready - Built for production at scale
  • Fully Customizable - Configure every aspect of generation
  • Secure - Complete isolation, no data mixing
  • Easy - Simple API, clear examples
  • Modular - Independent pipelines for different data types

🚀 Currently Available Pipeline

conversation_extension - Extend Single-Turn to Multi-Turn Conversations

Turn your base questions into rich multi-turn dialogues. This is the first available pipeline; more are planned.


Why OmniGen?

  • Simple - One command to generate thousands of conversations
  • Scalable - Parallel processing for fast generation
  • Flexible - Mix different AI providers (OpenAI, Anthropic, Ultrasafe AI)
  • Production Ready - Built for SaaS platforms with multi-tenant support


Quick Start

1. Install

pip install omnigen-usf

2. Prepare Base Data

Create a file base_data.jsonl with your starting questions:

{"conversations": [{"role": "user", "content": "How do I learn Python?"}]}
{"conversations": [{"role": "user", "content": "What is machine learning?"}]}
{"conversations": [{"role": "user", "content": "Explain neural networks"}]}

3. Generate Conversations

from omnigen.pipelines.conversation_extension import (
    ConversationExtensionConfigBuilder,
    ConversationExtensionPipeline
)

# Configure the pipeline
config = (ConversationExtensionConfigBuilder()
    # User followup generator
    .add_provider(
        role='user_followup',
        name='ultrasafe',
        api_key='your-api-key',
        model='usf-mini'
    )
    # Assistant response generator
    .add_provider(
        role='assistant_response',
        name='ultrasafe',
        api_key='your-api-key',
        model='usf-mini'
    )
    # Generation settings
    .set_generation(
        num_conversations=100,
        turn_range=(3, 8)  # 3-8 turns per conversation
    )
    # Input data
    .set_data_source(
        source_type='file',
        file_path='base_data.jsonl'
    )
    # Output
    .set_storage(
        type='jsonl',
        output_file='output.jsonl'
    )
    .build()
)

# Run the pipeline
pipeline = ConversationExtensionPipeline(config)
pipeline.run()

4. Get Results

Your generated conversations will be in output.jsonl. A sample record, pretty-printed here for readability:

{
  "id": 0,
  "conversations": [
    {"role": "user", "content": "How do I learn Python?"},
    {"role": "assistant", "content": "Great choice! Start with the basics..."},
    {"role": "user", "content": "What resources do you recommend?"},
    {"role": "assistant", "content": "I recommend these resources..."},
    {"role": "user", "content": "How long will it take?"},
    {"role": "assistant", "content": "With consistent practice..."}
  ],
  "num_turns": 3,
  "success": true
}
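
Before using the results for training, it can help to keep only the successful records and check how many turns were produced. The sketch below assumes the file holds one JSON object per line and relies only on the "success" and "num_turns" fields shown in the sample above. The function names are hypothetical and not part of OmniGen.

```python
import json


def load_successful(path="output.jsonl"):
    """Return the records whose "success" field is true."""
    kept = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                # Tolerate blank lines between records.
                continue
            record = json.loads(line)
            if record.get("success"):
                kept.append(record)
    return kept


def summarize(records):
    """Count conversations and total turns using the "num_turns" field."""
    total_turns = sum(record.get("num_turns", 0) for record in records)
    return {"conversations": len(records), "total_turns": total_turns}
```

After a run, `summarize(load_successful())` returns a small dictionary with the number of usable conversations and their combined turn count.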

Supported AI Providers

Provider        Model Examples
Ultrasafe AI    usf-mini, usf-max
OpenAI          gpt-4-turbo, gpt-3.5-turbo
Anthropic       claude-3-5-sonnet, claude-3-opus
OpenRouter      Various models

Mix Different Providers

config = (ConversationExtensionConfigBuilder()
    # Each provider needs its own API key
    .add_provider('user_followup', 'openai', openai_api_key, 'gpt-4-turbo')
    .add_provider('assistant_response', 'anthropic', anthropic_api_key, 'claude-3-5-sonnet')
    # ... rest of config
    .build()
)
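
When several providers are involved, keeping the keys out of source code becomes more important. One common approach is to read them from environment variables and fail early if one is missing. This is a sketch only: `require_env` is a hypothetical helper, and the variable names in the comments are assumptions rather than names OmniGen looks for.

```python
import os


def require_env(name):
    """Return the value of an environment variable, failing loudly if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"environment variable {name} is not set")
    return value


# Hypothetical variable names; use whatever your deployment defines.
# The returned strings are what you would pass as the API key to add_provider.
# openai_key = require_env("OPENAI_API_KEY")
# anthropic_key = require_env("ANTHROPIC_API_KEY")
```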

Advanced Features

Multi-Tenant SaaS Support

Perfect for platforms serving multiple users concurrently:

# Each user gets isolated workspace
workspace_id = f"user_{user_id}_session_{session_id}"

config = (ConversationExtensionConfigBuilder(workspace_id=workspace_id)
    .add_provider('user_followup', 'ultrasafe', shared_api_key, 'usf-mini')
    .add_provider('assistant_response', 'ultrasafe', shared_api_key, 'usf-mini')
    .set_storage('jsonl', output_file='output.jsonl')  # Auto-isolated
    .build()
)

# Storage automatically goes to: workspaces/{workspace_id}/output.jsonl
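
To make the storage layout concrete, the sketch below shows how a workspace ID could map to the path noted above. It illustrates the idea and is not OmniGen's actual implementation: the function name, the `root` parameter, and the ID validation rule are all assumptions. Because the ID becomes a directory name, validating it before use is a sensible precaution in a multi-tenant service.

```python
import re
from pathlib import Path

# Letters, digits, underscores, and hyphens only (an assumed rule, not OmniGen's).
_SAFE_ID = re.compile(r"[A-Za-z0-9_-]+")


def workspace_output_path(workspace_id, filename="output.jsonl", root="workspaces"):
    """Map a workspace ID to root/workspace_id/filename, rejecting unsafe IDs."""
    if not _SAFE_ID.fullmatch(workspace_id):
        # Reject separators and dots so an ID like "../other" cannot escape its directory.
        raise ValueError(f"unsafe workspace id: {workspace_id!r}")
    return Path(root) / workspace_id / filename


workspace_id = f"user_{42}_session_{7}"
print(workspace_output_path(workspace_id).as_posix())
# prints "workspaces/user_42_session_7/output.jsonl"
```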

Parallel Dataset Generation

from concurrent.futures import ProcessPoolExecutor

from omnigen.pipelines.conversation_extension import (
    ConversationExtensionConfigBuilder,
    ConversationExtensionPipeline
)

api_key = 'your-api-key'

def process_dataset(input_file, output_file):
    config = (ConversationExtensionConfigBuilder()
        .add_provider('user_followup', 'ultrasafe', api_key, 'usf-mini')
        .add_provider('assistant_response', 'ultrasafe', api_key, 'usf-mini')
        .set_data_source('file', file_path=input_file)
        .set_storage('jsonl', output_file=output_file)
        .build()
    )
    ConversationExtensionPipeline(config).run()

# (input file, output file) pairs, one per worker process
jobs = [
    ('data1.jsonl', 'out1.jsonl'),
    ('data2.jsonl', 'out2.jsonl'),
    ('data3.jsonl', 'out3.jsonl'),
]

# The __main__ guard is needed on platforms that spawn worker processes
if __name__ == '__main__':
    with ProcessPoolExecutor(max_workers=3) as executor:
        futures = [executor.submit(process_dataset, src, dst) for src, dst in jobs]
        for future in futures:
            future.result()  # re-raises any exception from the worker
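
The example above assumes the input is already split into several files. When starting from a single large file, something like the following could divide it into shards first, one per worker. This is a standard-library sketch under stated assumptions: `shard_jsonl`, the `shards` directory, and the round-robin strategy are illustrative choices, not OmniGen features.

```python
from pathlib import Path


def shard_jsonl(input_file, num_shards, out_dir="shards"):
    """Distribute the non-empty lines of input_file round-robin across num_shards files."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    stem = Path(input_file).stem
    shard_paths = [out / f"{stem}_{index}.jsonl" for index in range(num_shards)]
    handles = [path.open("w", encoding="utf-8") for path in shard_paths]
    try:
        with open(input_file, encoding="utf-8") as source:
            kept = 0
            for line in source:
                if not line.strip():
                    # Blank lines carry no record, so they are not assigned to a shard.
                    continue
                handles[kept % num_shards].write(line.rstrip("\n") + "\n")
                kept += 1
    finally:
        # Close every shard file even if reading fails part-way through.
        for handle in handles:
            handle.close()
    return [str(path) for path in shard_paths]
```

Each returned path can then be paired with an output file name and submitted as its own job. Round-robin assignment keeps shard sizes within one record of each other, so the workers finish at roughly the same time when records are of similar size.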

Examples

See examples/conversation_extension/ for more examples:

  • Simple usage with JSONL files
  • Multi-dataset parallel processing
  • Multi-tenant SaaS implementation

License

MIT License - Ultrasafe AI © 2024


About Ultrasafe AI

Enterprise-grade AI tools with a focus on safety and performance.


Made with ❤️ by Ultrasafe AI
