Indox Synthetic Data Generation

These details have not been verified by PyPI

Project links

Homepage

Project description

IndoxGen: Enterprise-Grade Synthetic Data Generation Framework

Overview

IndoxGen is a framework for generating high-fidelity synthetic data. It uses Large Language Models (LLMs), optionally combined with human feedback loops, to create structured datasets across a range of domains and use cases.

Key Features

Multiple Generation Pipelines:
- SyntheticDataGenerator: Standard LLM-powered generation pipeline for structured data with embedded quality control mechanisms.
- SyntheticDataGeneratorHF: Advanced pipeline integrating human feedback to improve generation.
- DataFromPrompt: Dynamic data generation based on natural language prompts, useful for rapid prototyping.
Customization & Control: Fine-grained control over data attributes, structure, and diversity.
Human-in-the-Loop: Integrates expert feedback so that generated data can be reviewed and regenerated iteratively.
AI-Driven Diversity: Diversity checks help keep generated datasets varied and representative for robust modeling.
Flexible I/O: Supports various data sources and export formats (Excel, CSV, etc.) for easy integration into existing workflows.
Advanced Learning Techniques: Incorporation of few-shot learning for rapid adaptation to new domains with minimal examples.
Scalability: Designed to handle both small-scale experiments and large-scale data generation tasks with multi-LLM support.

Installation

pip install indoxgen

Quick Start Guide

Basic Usage: SyntheticDataGenerator

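import os

from indoxGen.synthCore import SyntheticDataGenerator
from indoxGen.llms import OpenAi

# API keys are read from the environment here; the variable names are an assumption.
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
NVIDIA_API_KEY = os.environ["NVIDIA_API_KEY"]
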
columns = ["name", "age", "occupation"]
example_data = [
    {"name": "Alice Johnson", "age": 35, "occupation": "Manager"},
    {"name": "Bob Williams", "age": 42, "occupation": "Accountant"}
]

openai = OpenAi(api_key=OPENAI_API_KEY, model="gpt-4o-mini")
nemotron = OpenAi(api_key=NVIDIA_API_KEY, model="nvidia/nemotron-4-340b-instruct",
                  base_url="https://integrate.api.nvidia.com/v1")

generator = SyntheticDataGenerator(
    generator_llm=nemotron,
    judge_llm=openai,
    columns=columns,
    example_data=example_data,
    user_instruction="Generate diverse, realistic data including name, age, and occupation. Ensure variability in demographics and professions.",
    verbose=1
)

generated_data = generator.generate_data(num_samples=100)

Advanced Usage: SyntheticDataGeneratorHF with Human Feedback

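import os

from indoxGen.synthCore import SyntheticDataGeneratorHF
from indoxGen.llms import OpenAi

# API keys are read from the environment here; the variable names are an assumption.
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
NVIDIA_API_KEY = os.environ["NVIDIA_API_KEY"]

# `columns` and `example_data` are the same as in the basic example above.
# `feedback_range` is not defined in this snippet and must be set before the call below.
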
openai = OpenAi(api_key=OPENAI_API_KEY, model="gpt-4-0613")
nemotron = OpenAi(api_key=NVIDIA_API_KEY, model="nvidia/nemotron-4-340b-instruct",
                  base_url="https://integrate.api.nvidia.com/v1")

generator = SyntheticDataGeneratorHF(
    generator_llm=nemotron,
    judge_llm=openai,
    columns=columns,
    example_data=example_data,
    user_instruction="Generate diverse, realistic professional profiles with name, age, and occupation.",
    verbose=1,
    diversity_threshold=0.4,
    feedback_range=feedback_range
)

# Implement human feedback loop
generator.user_review_and_regenerate(
    regenerate_rows=[0],
    accepted_rows=[],
    regeneration_feedback='Diversify names and occupations further',
    min_score=0.7
)
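
The `accepted_rows` and `regenerate_rows` arguments above are lists of row indices. As a minimal sketch of how reviewer decisions might be collected into those two lists, the helper below uses only the standard library. `split_review` and the verdict labels `"accept"` and `"regenerate"` are hypothetical and not part of IndoxGen.

```python
def split_review(verdicts: dict) -> tuple:
    """Turn {row_index: "accept" | "regenerate"} into two sorted index lists."""
    # Reject labels this sketch does not understand instead of silently dropping rows.
    unknown = {i: v for i, v in verdicts.items() if v not in ("accept", "regenerate")}
    if unknown:
        raise ValueError(f"unrecognized verdicts: {unknown}")
    accepted = sorted(i for i, v in verdicts.items() if v == "accept")
    regenerate = sorted(i for i, v in verdicts.items() if v == "regenerate")
    return accepted, regenerate


# Hypothetical reviewer decisions for three generated rows.
verdicts = {0: "regenerate", 1: "accept", 2: "accept"}
accepted_rows, regenerate_rows = split_review(verdicts)
print(accepted_rows, regenerate_rows)
```

The two resulting lists have the same shape as the `accepted_rows` and `regenerate_rows` values passed in the example above.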

Prompt-Based Generation: DataFromPrompt

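import os

from indoxGen.synthCore import DataFromPrompt, DataGenerationPrompt
from indoxGen.llms import OpenAi

# The API key is read from the environment here; the variable name is an assumption.
NVIDIA_API_KEY = os.environ["NVIDIA_API_KEY"]
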
nemotron = OpenAi(api_key=NVIDIA_API_KEY, model="nvidia/nemotron-4-340b-instruct",
                  base_url="https://integrate.api.nvidia.com/v1")


user_prompt = "Generate a comprehensive dataset with 3 columns and 3 rows about exoplanets."
instruction = DataGenerationPrompt.get_instruction(user_prompt)

data_generator = DataFromPrompt(
    prompt_name="Exoplanet Dataset Generation",
    args={
        "llm": nemotron,
        "n": 1,
        "instruction": instruction,
    },
    outputs={"generations": "generate"},
)

generated_df = data_generator.run()
data_generator.save_to_excel("exoplanet_data.xlsx")

Advanced Techniques

Few-Shot Learning for Specialized Domains

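import os

from indoxGen.synthCore import FewShotPrompt
from indoxGen.llms import OpenAi

# The API key is read from the environment here; the variable name is an assumption.
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
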
openai = OpenAi(api_key=OPENAI_API_KEY, model="gpt-4o-mini")

examples = [
    {
        "input": "Generate a dataset with 3 columns and 2 rows about quantum computing.",
        "output": '[{"Qubit Type": "Superconducting", "Coherence Time": "100 μs", "Gate Fidelity": "0.9999"}, {"Qubit Type": "Trapped Ion", "Coherence Time": "10 ms", "Gate Fidelity": "0.99999"}]'
    },
    {
        "input": "Generate a dataset with 3 columns and 2 rows about nanotechnology.",
        "output": '[{"Material": "Graphene", "Thickness": "1 nm", "Conductivity": "1.0e6 S/m"}, {"Material": "Carbon Nanotube", "Thickness": "1-2 nm", "Conductivity": "1.0e7 S/m"}]'
    }
]

user_prompt = "Generate a dataset with 3 columns and 2 rows about advanced AI architectures."

data_generator = FewShotPrompt(
    prompt_name="Generate AI Architecture Dataset",
    args={
        "llm": openai,
        "n": 1,  
        "instruction": user_prompt,  
    },
    outputs={"generations": "generate"},
    examples=examples  
)

generated_df = data_generator.run()
data_generator.save_to_excel("ai_architectures.xlsx", generated_df)
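
Each `output` string in the few-shot examples above is a JSON array of row objects, so a malformed example can be caught before it reaches the LLM. The following is a minimal sketch using only the standard library. `parse_example_rows` is a hypothetical helper, not part of IndoxGen, and it assumes every example output should be a non-empty JSON list of objects that share the same keys.

```python
import json


def parse_example_rows(output: str) -> list:
    """Parse a few-shot `output` string and check that it forms a uniform table."""
    rows = json.loads(output)
    if not isinstance(rows, list) or not all(isinstance(r, dict) for r in rows):
        raise ValueError("expected a JSON list of objects")
    if not rows:
        raise ValueError("expected at least one row")
    # Every row must expose the same columns as the first one.
    expected = set(rows[0])
    for i, row in enumerate(rows):
        if set(row) != expected:
            raise ValueError(f"row {i} has columns {sorted(row)}, expected {sorted(expected)}")
    return rows


example_output = (
    '[{"Material": "Graphene", "Thickness": "1 nm", "Conductivity": "1.0e6 S/m"}, '
    '{"Material": "Carbon Nanotube", "Thickness": "1-2 nm", "Conductivity": "1.0e7 S/m"}]'
)
rows = parse_example_rows(example_output)
print(len(rows), "rows with columns", list(rows[0]))
```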

Attributed Prompts for Controlled Variation

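import os

from indoxGen.synthCore import DataFromAttributedPrompt
from indoxGen.llms import OpenAi

# The API key is read from the environment here; the variable name is an assumption.
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
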
openai = OpenAi(api_key=OPENAI_API_KEY, model="gpt-4o-mini")

args = {
    "instruction": "Generate a {complexity} machine learning algorithm description that is {application_area} focused.",
    "attributes": {
        "complexity": ["basic", "advanced", "cutting-edge"],
        "application_area": ["computer vision", "natural language processing", "reinforcement learning"]
    },
    "llm": openai
}

dataset = DataFromAttributedPrompt(
    prompt_name="ML Algorithm Generator",
    args=args,
    outputs={}
)

df = dataset.run()
print(df)
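
To see what an attributed prompt can expand to, the sketch below fills the instruction template once for every combination of attribute values using `itertools.product` and `str.format`. Whether `DataFromAttributedPrompt` enumerates all combinations or samples from them is not stated here, so treat the exhaustive expansion as an assumption. `expand_prompts` is a hypothetical helper, not part of IndoxGen.

```python
from itertools import product

instruction = (
    "Generate a {complexity} machine learning algorithm description "
    "that is {application_area} focused."
)
attributes = {
    "complexity": ["basic", "advanced", "cutting-edge"],
    "application_area": ["computer vision", "natural language processing", "reinforcement learning"],
}


def expand_prompts(template: str, attrs: dict) -> list:
    """Fill the template once for every combination of attribute values."""
    names = list(attrs)
    return [
        template.format(**dict(zip(names, combo)))
        for combo in product(*(attrs[name] for name in names))
    ]


prompts = expand_prompts(instruction, attributes)
print(len(prompts))   # 3 complexities x 3 application areas = 9
print(prompts[0])
```

With three values per attribute, the number of prompts grows multiplicatively, which is worth keeping in mind when each prompt triggers an LLM call.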

Configuration and Customization

Each generator class in IndoxGen is highly configurable to meet specific data generation requirements. Key parameters include:

generator_llm and judge_llm: Specify the LLMs used for generation and quality assessment
columns and example_data: Define the structure and provide examples for the generated data
user_instruction: Customize the generation process with specific guidelines
diversity_threshold: Control the level of variation in the generated data
verbose: Adjust the level of feedback during the generation process
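
To build intuition for what a parameter like `diversity_threshold` can control, here is a minimal standard-library sketch of one possible diversity check. It is not IndoxGen's algorithm: the string-similarity metric (`difflib.SequenceMatcher`), the helpers `row_similarity` and `is_diverse`, and the reading of the threshold as a minimum distance from already accepted rows are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def row_similarity(a: dict, b: dict) -> float:
    """Similarity in [0, 1] between two rows, compared as flattened strings."""
    text_a = " ".join(str(a[k]) for k in sorted(a))
    text_b = " ".join(str(b[k]) for k in sorted(b))
    return SequenceMatcher(None, text_a, text_b).ratio()


def is_diverse(candidate: dict, accepted: list, diversity_threshold: float) -> bool:
    """Accept a row only if its distance to every accepted row meets the threshold."""
    return all(
        1.0 - row_similarity(candidate, row) >= diversity_threshold
        for row in accepted
    )


accepted = [{"name": "Alice Johnson", "age": 35, "occupation": "Manager"}]
near_duplicate = {"name": "Alice Johnson", "age": 36, "occupation": "Manager"}
different = {"name": "Chen Wei", "age": 58, "occupation": "Marine biologist"}

print(is_diverse(near_duplicate, accepted, diversity_threshold=0.4))
print(is_diverse(different, accepted, diversity_threshold=0.4))
```

Under this reading, raising the threshold rejects more near-duplicates at the cost of more regeneration attempts.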

Refer to the API documentation for a comprehensive list of configuration options for each class.

Best Practices

Data Quality Assurance: Regularly validate generated data against predefined quality metrics.
Iterative Refinement: Utilize the human feedback loop to continuously improve generation quality.
Domain Expertise Integration: Collaborate with domain experts to fine-tune generation parameters and validate outputs.
Ethical Considerations: Ensure generated data adheres to privacy standards and ethical guidelines.
Performance Optimization: Monitor and optimize generation pipeline for large-scale tasks.
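
The data quality practice above can be sketched as a small validation pass over generated rows. This example uses only the standard library and does not depend on IndoxGen. `validate_rows`, the check names, and the age bounds are hypothetical choices for illustration, and the rows are hand-written rather than generated.

```python
def validate_rows(rows, columns, checks):
    """Return (valid_rows, problems), where problems holds (row_index, message) pairs."""
    valid, problems = [], []
    for i, row in enumerate(rows):
        # Structural check first: every expected column must be present.
        missing = [c for c in columns if c not in row]
        if missing:
            problems.append((i, f"missing columns: {missing}"))
            continue
        # Then run the named value checks on the complete row.
        failed = [name for name, check in checks.items() if not check(row)]
        if failed:
            problems.append((i, f"failed checks: {failed}"))
            continue
        valid.append(row)
    return valid, problems


columns = ["name", "age", "occupation"]
checks = {
    "age_is_plausible": lambda r: isinstance(r["age"], int) and 18 <= r["age"] <= 100,
    "name_not_empty": lambda r: bool(str(r["name"]).strip()),
}
rows = [
    {"name": "Alice Johnson", "age": 35, "occupation": "Manager"},
    {"name": "Bob Williams", "age": 420, "occupation": "Accountant"},
    {"name": "Carol Diaz", "occupation": "Nurse"},
]
valid, problems = validate_rows(rows, columns, checks)
print(len(valid), "valid;", problems)
```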

Roadmap

Implement basic synthetic data generation
Add LLM-based judge for quality control
Improve diversity checking mechanism
Integrate human feedback loop for continuous improvement
Develop a web-based UI for easier interaction
Support for more data types (images, time series, etc.)
Implement differential privacy techniques
Create plugin system for custom data generation rules
Develop comprehensive documentation and tutorials

Contributing

We welcome contributions! Please see our CONTRIBUTING.md file for details on how to get started.

License

IndoxGen is released under the MIT License. See LICENSE.md for more details.

Support and Documentation

For detailed API documentation, tutorials, and best practices, visit our official documentation.

For support, please open an issue on our GitHub repository or contact our support team at support@indoxgen.com.

IndoxGen - Empowering Data-Driven Innovation with Advanced Synthetic Data Generation

Project details

Release history

0.0.2 (Sep 25, 2024)
0.0.1 (Sep 25, 2024)
0.0.0 (Sep 25, 2024), this version

Download files

Source Distribution

indoxgen-0.0.0.tar.gz (20.3 kB)

Uploaded Sep 25, 2024 Source

Built Distribution

IndoxGen-0.0.0-py3-none-any.whl (16.6 kB)

Uploaded Sep 25, 2024 Python 3

Hashes for indoxgen-0.0.0.tar.gz
Algorithm	Hash digest
SHA256	`9a1737e7ef960869ffe101847aba980a921d3fb55797d7b178da1b2e4856b50c`
MD5	`4e9c643fc019e213fbdbc0b74756cc71`
BLAKE2b-256	`542bde30abb25e198e6214bd6d87003720b9b7bdd0bfb383f56693499860ce87`

Hashes for IndoxGen-0.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f836c2b71c0e12f75d953b7eb07c057be8e269a3eee1eee9680eb303537994cf`
MD5	`a410fd367f151f2cc10c2351a9617f51`
BLAKE2b-256	`7752a3b98326101857de87f6792d950477a505f4cb33f32bb8a9535e9d151d98`