Indox Synthetic Data Generation
IndoxGen: Enterprise-Grade Synthetic Data Generation Framework
Overview
IndoxGen is an enterprise-oriented framework for generating high-fidelity synthetic data. It combines Large Language Models (LLMs) with human feedback loops to give flexible, precise control over synthetic data creation across domains and use cases.
Key Features
- Multiple Generation Pipelines:
  - SyntheticDataGenerator: Standard LLM-powered generation pipeline for structured data with embedded quality control mechanisms.
  - SyntheticDataGeneratorHF: Advanced pipeline integrating human feedback to improve generation.
  - DataFromPrompt: Dynamic data generation based on natural language prompts, useful for rapid prototyping.
- Customization & Control: Fine-grained control over data attributes, structure, and diversity. Customize every aspect of the synthetic data generation process.
- Human-in-the-Loop: Seamlessly integrates expert feedback for continuous improvement of generated data, offering the highest quality assurance.
- AI-Driven Diversity: Algorithms ensure representative and varied datasets, providing data diversity for robust modeling.
- Flexible I/O: Supports various data sources and export formats (Excel, CSV, etc.) for easy integration into existing workflows.
- Advanced Learning Techniques: Incorporation of few-shot learning for rapid adaptation to new domains with minimal examples.
- Scalability: Designed to handle both small-scale experiments and large-scale data generation tasks with multi-LLM support.
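To make the idea of a diversity check concrete, here is a minimal sketch of one way to reject near-duplicate rows. It is not IndoxGen's algorithm, which this document does not describe; the helper names `row_similarity` and `is_diverse` are hypothetical, and the text-similarity measure from Python's `difflib` is swapped in purely for illustration.

```python
from difflib import SequenceMatcher


def row_similarity(a: dict, b: dict) -> float:
    """Return a similarity in [0, 1] between two rows, compared as text."""
    text_a = " ".join(str(a[key]) for key in sorted(a))
    text_b = " ".join(str(b[key]) for key in sorted(b))
    return SequenceMatcher(None, text_a, text_b).ratio()


def is_diverse(candidate: dict, accepted: list, threshold: float = 0.4) -> bool:
    """Accept a candidate only if it differs enough from every accepted row.

    `threshold` is the minimum dissimilarity (1 - similarity) required.
    This reading of a diversity threshold is an assumption for the sketch.
    """
    return all(
        1.0 - row_similarity(candidate, row) >= threshold for row in accepted
    )


accepted = [{"name": "Alice Johnson", "age": 35, "occupation": "Manager"}]
duplicate = {"name": "Alice Johnson", "age": 35, "occupation": "Manager"}
different = {"name": "Chen Wei", "age": 58, "occupation": "Marine biologist"}

print(is_diverse(duplicate, accepted))  # False: identical to an accepted row
print(is_diverse(different, accepted))  # True: shares little text with it
```

A character-level measure like this is cheap but crude; embedding-based similarity would capture semantic overlap better at higher cost.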
Installation
pip install indoxgen
Quick Start Guide
Basic Usage: SyntheticDataGenerator
import os

from indoxGen.synthCore import SyntheticDataGenerator
from indoxGen.llms import OpenAi

# Assumption: the API keys are exported as environment variables with these names
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
NVIDIA_API_KEY = os.environ["NVIDIA_API_KEY"]

# Column names and a few example rows that define the target structure
columns = ["name", "age", "occupation"]
example_data = [
    {"name": "Alice Johnson", "age": 35, "occupation": "Manager"},
    {"name": "Bob Williams", "age": 42, "occupation": "Accountant"},
]

# One model generates rows, the other judges their quality
openai = OpenAi(api_key=OPENAI_API_KEY, model="gpt-4o-mini")
nemotron = OpenAi(api_key=NVIDIA_API_KEY, model="nvidia/nemotron-4-340b-instruct",
                  base_url="https://integrate.api.nvidia.com/v1")

generator = SyntheticDataGenerator(
    generator_llm=nemotron,
    judge_llm=openai,
    columns=columns,
    example_data=example_data,
    user_instruction="Generate diverse, realistic data including name, age, and occupation. Ensure variability in demographics and professions.",
    verbose=1,
)

generated_data = generator.generate_data(num_samples=100)
Advanced Usage: SyntheticDataGeneratorHF with Human Feedback
from indoxGen.synthCore import SyntheticDataGeneratorHF
from indoxGen.llms import OpenAi

# OPENAI_API_KEY and NVIDIA_API_KEY must be defined before this point.
# columns and example_data are the ones defined in the basic usage example.
# feedback_range must also be defined by the caller; its value is not given here.
openai = OpenAi(api_key=OPENAI_API_KEY, model="gpt-4-0613")
nemotron = OpenAi(api_key=NVIDIA_API_KEY, model="nvidia/nemotron-4-340b-instruct",
                  base_url="https://integrate.api.nvidia.com/v1")

generator = SyntheticDataGeneratorHF(
    generator_llm=nemotron,
    judge_llm=openai,
    columns=columns,
    example_data=example_data,
    user_instruction="Generate diverse, realistic professional profiles with name, age, and occupation.",
    verbose=1,
    diversity_threshold=0.4,
    feedback_range=feedback_range,
)

# Implement human feedback loop: regenerate row 0 using the reviewer's feedback
generator.user_review_and_regenerate(
    regenerate_rows=[0],
    accepted_rows=[],
    regeneration_feedback='Diversify names and occupations further',
    min_score=0.7,
)
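The review call above takes two lists of row indices. As a rough sketch of how a reviewer's scores could be turned into those lists, the hypothetical helper `triage_rows` below splits rows at a cutoff. The score dictionary and the way the cutoff is applied are assumptions for illustration; this document does not say how IndoxGen itself interprets `min_score`.

```python
def triage_rows(review_scores: dict, min_score: float = 0.7):
    """Split reviewed row indices into accepted and to-regenerate lists.

    `review_scores` maps a row index to a reviewer score in [0, 1].
    Rows scoring at or above `min_score` are accepted; the rest are
    queued for regeneration.
    """
    accepted_rows = sorted(
        index for index, score in review_scores.items() if score >= min_score
    )
    regenerate_rows = sorted(
        index for index, score in review_scores.items() if score < min_score
    )
    return accepted_rows, regenerate_rows


# Hypothetical reviewer scores for four generated rows
review_scores = {0: 0.45, 1: 0.9, 2: 0.7, 3: 0.2}
accepted_rows, regenerate_rows = triage_rows(review_scores)

print(accepted_rows)    # [1, 2]
print(regenerate_rows)  # [0, 3]
```

The two resulting lists have the same shape as the `accepted_rows` and `regenerate_rows` arguments, so they could be passed along with a written `regeneration_feedback` note.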
Prompt-Based Generation: DataFromPrompt
from indoxGen.synthCore import DataFromPrompt, DataGenerationPrompt
from indoxGen.llms import OpenAi

# NVIDIA_API_KEY must be defined before this point
nemotron = OpenAi(api_key=NVIDIA_API_KEY, model="nvidia/nemotron-4-340b-instruct",
                  base_url="https://integrate.api.nvidia.com/v1")

# Turn a plain-language request into a generation instruction
user_prompt = "Generate a comprehensive dataset with 3 columns and 3 rows about exoplanets."
instruction = DataGenerationPrompt.get_instruction(user_prompt)

data_generator = DataFromPrompt(
    prompt_name="Exoplanet Dataset Generation",
    args={
        "llm": nemotron,
        "n": 1,
        "instruction": instruction,
    },
    outputs={"generations": "generate"},
)

generated_df = data_generator.run()
data_generator.save_to_excel("exoplanet_data.xlsx")
Advanced Techniques
Few-Shot Learning for Specialized Domains
from indoxGen.synthCore import FewShotPrompt
from indoxGen.llms import OpenAi

# OPENAI_API_KEY must be defined before this point
openai = OpenAi(api_key=OPENAI_API_KEY, model="gpt-4o-mini")

# Each example pairs a request with the JSON output expected for it
examples = [
    {
        "input": "Generate a dataset with 3 columns and 2 rows about quantum computing.",
        "output": '[{"Qubit Type": "Superconducting", "Coherence Time": "100 μs", "Gate Fidelity": "0.9999"}, {"Qubit Type": "Trapped Ion", "Coherence Time": "10 ms", "Gate Fidelity": "0.99999"}]'
    },
    {
        "input": "Generate a dataset with 3 columns and 2 rows about nanotechnology.",
        "output": '[{"Material": "Graphene", "Thickness": "1 nm", "Conductivity": "1.0e6 S/m"}, {"Material": "Carbon Nanotube", "Thickness": "1-2 nm", "Conductivity": "1.0e7 S/m"}]'
    }
]

user_prompt = "Generate a dataset with 3 columns and 2 rows about advanced AI architectures."

data_generator = FewShotPrompt(
    prompt_name="Generate AI Architecture Dataset",
    args={
        "llm": openai,
        "n": 1,
        "instruction": user_prompt,
    },
    outputs={"generations": "generate"},
    examples=examples
)

generated_df = data_generator.run()
data_generator.save_to_excel("ai_architectures.xlsx", generated_df)
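For readers new to few-shot prompting, the sketch below shows a generic way that input/output examples can be joined with a new instruction into a single prompt, and how a JSON reply can be parsed back into rows. The template and the helper names `build_few_shot_prompt` and `parse_rows` are assumptions for illustration; the template FewShotPrompt actually uses is not shown in this document.

```python
import json


def build_few_shot_prompt(examples, instruction):
    """Join worked examples and the new instruction into one prompt string."""
    parts = []
    for example in examples:
        # Fail early if an example output is not valid JSON
        json.loads(example["output"])
        parts.append(f"Input: {example['input']}\nOutput: {example['output']}")
    # The final block is left open for the model to complete
    parts.append(f"Input: {instruction}\nOutput:")
    return "\n\n".join(parts)


def parse_rows(model_output):
    """Parse a JSON array of objects into a list of row dictionaries."""
    rows = json.loads(model_output)
    if not isinstance(rows, list) or not all(isinstance(r, dict) for r in rows):
        raise ValueError("expected a JSON array of objects")
    return rows


examples = [
    {
        "input": "Generate a dataset with 2 columns and 1 row about nanotechnology.",
        "output": '[{"Material": "Graphene", "Thickness": "1 nm"}]',
    }
]
prompt = build_few_shot_prompt(
    examples,
    "Generate a dataset with 2 columns and 1 row about AI architectures.",
)
print(prompt)

# Parsing an example output stands in for parsing a real model reply
rows = parse_rows(examples[0]["output"])
print(rows[0]["Material"])  # Graphene
```

Validating example outputs up front is a cheap guard: a malformed example tends to teach the model to produce malformed output.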
Attributed Prompts for Controlled Variation
from indoxGen.synthCore import DataFromAttributedPrompt
from indoxGen.llms import OpenAi

# OPENAI_API_KEY must be defined before this point
openai = OpenAi(api_key=OPENAI_API_KEY, model="gpt-4o-mini")

# Placeholders in the instruction are filled from the attribute value lists
args = {
    "instruction": "Generate a {complexity} machine learning algorithm description that is {application_area} focused.",
    "attributes": {
        "complexity": ["basic", "advanced", "cutting-edge"],
        "application_area": ["computer vision", "natural language processing", "reinforcement learning"]
    },
    "llm": openai
}

dataset = DataFromAttributedPrompt(
    prompt_name="ML Algorithm Generator",
    args=args,
    outputs={}
)

df = dataset.run()
print(df)
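To see what an attributed instruction expands to, the sketch below enumerates every combination of attribute values and fills the template with `str.format`. The helper `expand_prompts` is hypothetical, and whether DataFromAttributedPrompt enumerates all combinations or samples from them is not stated in this document, so treat this as an illustration of the idea only.

```python
from itertools import product


def expand_prompts(instruction, attributes):
    """Fill the instruction template with every combination of attribute values."""
    names = list(attributes)
    value_lists = [attributes[name] for name in names]
    return [
        instruction.format(**dict(zip(names, combination)))
        for combination in product(*value_lists)
    ]


instruction = (
    "Generate a {complexity} machine learning algorithm description "
    "that is {application_area} focused."
)
attributes = {
    "complexity": ["basic", "advanced", "cutting-edge"],
    "application_area": [
        "computer vision",
        "natural language processing",
        "reinforcement learning",
    ],
}

prompts = expand_prompts(instruction, attributes)
print(len(prompts))  # 9 (3 complexity levels x 3 application areas)
print(prompts[0])
```

Because the number of prompts is the product of the list lengths, adding attributes grows the prompt set multiplicatively, which is worth keeping in mind for API cost.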
Configuration and Customization
Each generator class in IndoxGen is highly configurable to meet specific data generation requirements. Key parameters include:
- generator_llm and judge_llm: Specify the LLMs used for generation and quality assessment
- columns and example_data: Define the structure and provide examples for the generated data
- user_instruction: Customize the generation process with specific guidelines
- diversity_threshold: Control the level of variation in the generated data
- verbose: Adjust the level of feedback during the generation process
Refer to the API documentation for a comprehensive list of configuration options for each class.
Best Practices
- Data Quality Assurance: Regularly validate generated data against predefined quality metrics.
- Iterative Refinement: Utilize the human feedback loop to continuously improve generation quality.
- Domain Expertise Integration: Collaborate with domain experts to fine-tune generation parameters and validate outputs.
- Ethical Considerations: Ensure generated data adheres to privacy standards and ethical guidelines.
- Performance Optimization: Monitor and optimize the generation pipeline for large-scale tasks.
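As one way to act on the quality assurance practice above, the sketch below computes a few simple metrics over generated rows. It assumes the rows are available as a list of dictionaries; the helper `quality_report`, the chosen metrics, and the age bounds are all hypothetical and not part of IndoxGen.

```python
def quality_report(rows, columns, age_range=(18, 90)):
    """Count incomplete rows, exact duplicates, and out-of-range ages."""
    seen = set()
    report = {
        "total": len(rows),
        "incomplete": 0,
        "duplicates": 0,
        "age_out_of_range": 0,
    }
    for row in rows:
        # A row is incomplete if any expected column is missing or empty
        if any(row.get(column) in (None, "") for column in columns):
            report["incomplete"] += 1
        # Exact duplicates are detected on the tuple of column values
        key = tuple(row.get(column) for column in columns)
        if key in seen:
            report["duplicates"] += 1
        seen.add(key)
        # Range check applies only when a numeric age is present
        age = row.get("age")
        if isinstance(age, (int, float)) and not age_range[0] <= age <= age_range[1]:
            report["age_out_of_range"] += 1
    return report


columns = ["name", "age", "occupation"]
rows = [
    {"name": "Alice Johnson", "age": 35, "occupation": "Manager"},
    {"name": "Bob Williams", "age": 42, "occupation": "Accountant"},
    {"name": "Bob Williams", "age": 42, "occupation": "Accountant"},
    {"name": "Cara Diaz", "age": 130, "occupation": None},
]

report = quality_report(rows, columns)
print(report)
# {'total': 4, 'incomplete': 1, 'duplicates': 1, 'age_out_of_range': 1}
```

Tracking these counts across generation runs gives a simple baseline for spotting regressions before involving human reviewers.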
Roadmap
- Implement basic synthetic data generation
- Add LLM-based judge for quality control
- Improve diversity checking mechanism
- Integrate human feedback loop for continuous improvement
- Develop a web-based UI for easier interaction
- Support for more data types (images, time series, etc.)
- Implement differential privacy techniques
- Create plugin system for custom data generation rules
- Develop comprehensive documentation and tutorials
Contributing
We welcome contributions! Please see our CONTRIBUTING.md file for details on how to get started.
License
IndoxGen is released under the MIT License. See LICENSE.md for more details.
IndoxGen - Empowering Data-Driven Innovation with Advanced Synthetic Data Generation
Built Distribution
Hashes for indoxGen-0.0.11-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 64533645c90241dac73578ddffec154b2f88fd66149f72124f1ee6ef4ec75340
MD5 | 66413a212f0a8bf50e9fe17bb3a95842
BLAKE2b-256 | 0e55226734fff99684f0c1a698e80c064677630ed9153d2f28fc7b87365aa1ac