SynthGenAI - Package for generating Synthetic Datasets.
SynthGenAI - Package for Generating Synthetic Datasets using LLMs
SynthGenAI is a package for generating synthetic datasets. The idea is to have a tool that is simple to use and can generate datasets on different topics by utilizing LLMs from different API providers. The package is designed to be modular and can be easily extended with additional LLM API providers and new features.
[!IMPORTANT] The package is still in the early stages of development and some features may not be fully implemented or tested. If you find any issues or have any suggestions, feel free to open an issue or create a pull request.
Why SynthGenAI now?
Interest in synthetic data generation has surged recently, driven by the growing recognition of data as a critical asset in AI development. As Ilya Sutskever, one of the most important figures in AI, says: 'Data is the fossil fuel of AI.' The more quality data we have, the better our models can perform. However, access to data is often restricted due to privacy concerns, or it may be prohibitively expensive to collect. Additionally, the vast amount of high-quality data on the internet has already been extensively mined.
Synthetic data generation addresses these challenges by allowing us to create diverse and useful datasets using current pre-trained Large Language Models (LLMs). Beyond LLMs, synthetic data also holds immense potential for pre-training and post-training of Small Language Models (SLMs), which are gaining popularity due to their efficiency and suitability for specific, resource-constrained applications.
By leveraging synthetic data for both LLMs and SLMs, we can enhance performance across a wide range of use cases while balancing resource efficiency and model effectiveness. This approach enables us to harness the strengths of both synthetic and authentic datasets to achieve optimal outcomes.
Tools used for building SynthGenAI
The package is built using Python and the following libraries:
- uv, An extremely fast Python package and project manager, written in Rust.
- LiteLLM, A Python SDK for accessing LLMs from different API providers with standardized OpenAI Format.
- Langfuse, LLMOps platform for observability, traceability and monitoring of LLMs.
- Pydantic, Data validation and settings management using Python type annotations.
- Huggingface Hub & Datasets, A Python library for saving generated datasets on Hugging Face Hub.
Installation
To install the package, you can use the following command:
pip install synthgenai
or if you want to use uv package manager, you can use the following command:
uv add synthgenai
or you can install the package directly from the source code using the following commands:
git clone https://github.com/Shekswess/synthgenai.git
cd synthgenai
uv build
pip install ./dist/synthgenai-{version}-py3-none-any.whl
Requirements
To use the package, you need to have the following requirements installed:
- Python 3.10+
- uv for building the package directly from the source code
- Ollama running on your local machine if you want to use Ollama as an API provider (optional)
- Langfuse running on your local machine or in the cloud if you want to use Langfuse for traceability (optional)
- A Hugging Face Hub account and a generated access token if you want to save the generated datasets on Hugging Face Hub (optional)
- Gradio for using the SynthGenAI UI (optional)
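The Python and package requirements above can be checked from a script before generating anything. This is a minimal sketch using only the standard library; the helper names `python_is_supported` and `installed_version` are illustrative and are not part of SynthGenAI.

```python
import sys
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

MIN_PYTHON = (3, 10)  # taken from the requirements list above


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True when the interpreter meets the Python 3.10+ requirement."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def installed_version(package: str = "synthgenai") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Python OK:", python_is_supported())
    print("synthgenai version:", installed_version() or "not installed")
```

Running the script after `pip install synthgenai` should report a version string instead of "not installed".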
Quick Start
After installation, get started quickly by using the CLI:
# 1. See what environment variables you need
synthgenai env-setup
# 2. Set up your API keys (example for OpenAI)
export OPENAI_API_KEY="your-api-key-here"
# 3. List available dataset types
synthgenai list-types
# 4. Generate your first dataset
synthgenai generate instruction \
--model "openai/gpt-5" \
--topic "Python Programming" \
--domain "Software Development" \
--entries 100
# 5. See more examples
synthgenai examples
Available Commands
- synthgenai generate - Generate synthetic datasets
- synthgenai list-types - Show all available dataset types
- synthgenai examples - Display example commands
- synthgenai providers - List supported LLM providers
- synthgenai env-setup - Show environment setup guide
- synthgenai --help - Show help information
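When the CLI is driven from a script, the `generate` call from the Quick Start can be assembled as an argument list. The sketch below uses only the flags shown above (`--model`, `--topic`, `--domain`, `--entries`); the function `build_generate_command` is a hypothetical helper, not something the package provides.

```python
import shlex


def build_generate_command(dataset_type, model, topic, domain, entries):
    """Build the argument list for `synthgenai generate` from the Quick Start flags."""
    return [
        "synthgenai", "generate", dataset_type,
        "--model", model,
        "--topic", topic,
        "--domain", domain,
        "--entries", str(entries),
    ]


command = build_generate_command(
    "instruction", "openai/gpt-5", "Python Programming", "Software Development", 100
)

# Print a copy-pasteable shell line. To actually run it, pass the list to
# subprocess.run(command, check=True) once the package is installed and the
# provider API key is exported.
print(shlex.join(command))
```

Building a list instead of a single string keeps topics and domains that contain spaces intact without manual quoting.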
Usage
Supported API Providers
- Groq - more info about Groq models that can be used, can be found here
- Mistral AI - more info about Mistral AI models that can be used, can be found here
- Gemini - more info about Gemini models that can be used, can be found here
- Bedrock - more info about Bedrock models that can be used, can be found here
- Anthropic - more info about Anthropic models that can be used, can be found here
- OpenAI - more info about OpenAI models that can be used, can be found here
- Hugging Face - more info about Hugging Face models that can be used, can be found here
- Ollama - more info about Ollama models that can be used, can be found here
- vLLM - more info about vLLM models that can be used, can be found here
- SageMaker - more info about SageMaker models that can be used, can be found here
- Azure - more info about Azure and Azure AI models that can be used, can be found here & here
- Vertex AI - more info about Vertex AI models that can be used, can be found here
- DeepSeek - more info about DeepSeek models that can be used, can be found here
- xAI - more info about xAI models that can be used, can be found here
- OpenRouter - more info about OpenRouter models that can be used, can be found here
Environment Setup & Configuration
For detailed information about setting up environment variables for different API providers, observability tools, and dataset management, please refer to the Installation Guide.
Logging Configuration
You can control the logging verbosity using the SYNTHGENAI_DETAILED_MODE environment variable:
# For detailed logging (shows all debug information)
export SYNTHGENAI_DETAILED_MODE="false"
# For NO logging (default)
export SYNTHGENAI_DETAILED_MODE="true"
[!NOTE] By default, SYNTHGENAI_DETAILED_MODE is set to "true", which provides NO logging output. Set it to "false" to enable detailed debugging information during dataset generation.
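The same variable can be set from Python instead of the shell. This is a minimal sketch, assuming the package reads `SYNTHGENAI_DETAILED_MODE` from the process environment when it is imported or later; `set_detailed_mode` is a hypothetical helper, not part of SynthGenAI.

```python
import os

ENV_VAR = "SYNTHGENAI_DETAILED_MODE"


def set_detailed_mode(value: str) -> None:
    """Set the variable for this process; only the strings "true" and "false" are accepted."""
    if value not in ("true", "false"):
        raise ValueError(f'{ENV_VAR} must be "true" or "false", got {value!r}')
    os.environ[ENV_VAR] = value


# Set the variable first, then import the package
# (assumption: the value is read at import time or later).
set_detailed_mode("false")
```

Validating the value up front avoids silently passing something like "1" or "yes", which the package may not recognize.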
Observability & Saving Datasets
For observing the generated datasets, you can use Langfuse for traceability and monitoring of the LLMs.
For handling the datasets and saving them on Hugging Face Hub, you can use the Hugging Face Datasets library.
Currently there are six types of datasets that can be generated using SynthGenAI:
- Raw Datasets
- Instruction Datasets
- Preference Datasets
- Sentiment Analysis Datasets
- Summarization Datasets
- Text Classification Datasets
The datasets can be generated:
- Synchronously - each dataset entry is generated one by one
- Asynchronously - batch of dataset entries is generated at once
[!NOTE] Asynchronous generation is faster than synchronous generation, but some LLM providers limit the number of tokens that can be generated at once.
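The difference between the two modes can be illustrated with plain `asyncio`. This is a sketch of the general pattern, not the package's implementation: `generate_entry` is a hypothetical stand-in that sleeps instead of calling a provider, and `batch_size` is an assumed knob for staying within provider limits.

```python
import asyncio


async def generate_entry(keyword: str) -> dict:
    """Hypothetical stand-in for one LLM call; sleeps instead of calling a provider."""
    await asyncio.sleep(0.01)
    return {"keyword": keyword, "text": f"entry about {keyword}"}


async def generate_one_by_one(keywords: list) -> list:
    """One entry at a time: each call finishes before the next one starts."""
    return [await generate_entry(keyword) for keyword in keywords]


async def generate_in_batches(keywords: list, batch_size: int = 5) -> list:
    """Batch mode: the calls inside one batch run concurrently."""
    entries = []
    for start in range(0, len(keywords), batch_size):
        batch = keywords[start:start + batch_size]
        # gather() returns results in the same order as the awaitables passed in
        entries.extend(await asyncio.gather(*(generate_entry(k) for k in batch)))
    return entries


keywords = [f"topic-{i}" for i in range(10)]
one_by_one = asyncio.run(generate_one_by_one(keywords))
batched = asyncio.run(generate_in_batches(keywords, batch_size=5))
```

Both functions produce the same entries in the same order; the batched version finishes sooner because the waits overlap, and a smaller `batch_size` reduces the load sent to the provider at once.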
More Examples
More examples with different combinations of LLM API providers and dataset configurations can be found in the examples directory.
[!IMPORTANT] Generating the keywords or the dataset entries can sometimes fail because the LLM does not return a valid JSON object (the package handles this case). For that reason, it is recommended to use models that support JSON object output (structured output). A list of models that can generate JSON objects can be found here.
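How the package handles invalid JSON is not shown in this README. As an illustration of the kind of guard involved, here is a minimal sketch using only the standard library; `parse_json_object`, `generate_with_retries`, and the `call_llm` callable are hypothetical names, and the retry count is an assumption.

```python
import json
from typing import Callable, Optional


def parse_json_object(raw: str, required_keys: set) -> Optional[dict]:
    """Return the parsed object if `raw` is a JSON object with all required keys, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not required_keys.issubset(data):
        return None
    return data


def generate_with_retries(
    call_llm: Callable[[], str], required_keys: set, max_attempts: int = 3
) -> Optional[dict]:
    """Call the model up to `max_attempts` times until it returns a valid JSON object."""
    for _ in range(max_attempts):
        entry = parse_json_object(call_llm(), required_keys)
        if entry is not None:
            return entry
    return None


# Fake model replies: the first is not JSON, the second is a valid object.
replies = iter(["Sure! Here is your entry:", '{"keyword": "decorators", "text": "..."}'])
entry = generate_with_retries(lambda: next(replies), {"keyword", "text"})
```

Models with native structured output make the first attempt succeed far more often, which is why they are recommended above.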
Generated Datasets
Examples of generated synthetic datasets can be found on the SynthGenAI Datasets Collection on Hugging Face Hub.
Contributing
If you want to contribute to this project and make it better, your help is very welcome. Create a pull request with your changes and I will review it. If you have any questions, open an issue.
License
This project is licensed under the MIT License - see the LICENSE.md file for details.
Repo Structure
.
├── .github/ # GitHub configuration files and workflows
│   ├── workflows/ # GitHub Actions workflows
│   │   ├── build_n_publish.yaml # Build and publish workflow
│   │   ├── docs.yaml # Documentation deployment workflow
│   │   └── uv-ci.yaml # UV package manager CI workflow
│   └── depandabot.yml # Dependabot configuration for automatic dependency updates
├── docs # MkDocs documentation source files
│   ├── assets # Static assets for documentation
│   │   ├── favicon.png # Website favicon
│   │   ├── logo_header.png # Header logo image
│   │   └── logo.svg # SVG logo for the project
│   ├── configurations # Configuration documentation
│   │   ├── dataset_configuration.md # Dataset configuration guide
│   │   ├── dataset_generator_configuration.md # Dataset generator configuration guide
│   │   ├── index.md # Configuration section index
│   │   └── llm_configuration.md # LLM configuration guide
│   ├── contributing # Contribution guidelines
│   │   └── index.md # How to contribute to the project
│   ├── datasets # Dataset type documentation
│   │   ├── index.md # Dataset types overview
│   │   ├── instruction_datasets.md # Instruction datasets documentation
│   │   ├── preference_datasets.md # Preference datasets documentation
│   │   ├── raw_datasets.md # Raw datasets documentation
│   │   ├── sentiment_analysis_datasets.md # Sentiment analysis datasets documentation
│   │   ├── summarization_datasets.md # Summarization datasets documentation
│   │   └── text_classification_datasets.md # Text classification datasets documentation
│   ├── examples # Examples documentation
│   │   └── index.md # Code examples and usage patterns
│   ├── index.md # Main documentation homepage
│   ├── installation # Installation documentation
│   │   └── index.md # Installation guide and requirements
│   ├── llm_providers # LLM provider documentation
│   │   └── index.md # Supported LLM providers guide
│   ├── quick_start # Quick start guide
│   │   └── index.md # Getting started tutorial
│   └── stylesheets # Custom CSS styles for documentation
├── examples # Python example scripts demonstrating usage
│   ├── anthropic_instruction_dataset_example.py # Anthropic API instruction dataset example
│   ├── azure_ai_preference_dataset_example.py # Azure AI preference dataset example
│   ├── azure_summarization_dataset_example.py # Azure summarization dataset example
│   ├── bedrock_raw_dataset_example.py # AWS Bedrock raw dataset example
│   ├── deepseek_instruction_dataset_example.py # DeepSeek instruction dataset example
│   ├── gemini_langfuse_raw_dataset_example.py # Gemini with Langfuse raw dataset example
│   ├── groq_preference_dataset_example.py # Groq preference dataset example
│   ├── huggingface_instruction_dataset_example.py # Hugging Face instruction dataset example
│   ├── mistral_preference_dataset_example.py # Mistral AI preference dataset example
│   ├── ollama_preference_dataset_example.py # Ollama preference dataset example
│   ├── openai_raw_dataset_example.py # OpenAI raw dataset example
│   ├── openrouter_raw_dataset_example.py # OpenRouter raw dataset example
│   ├── sagemaker_summarization_dataset_example.py # AWS SageMaker summarization dataset example
│   ├── vertex_ai_text_classification_dataset_example.py # Google Vertex AI text classification example
│   ├── vllm_sentiment_analysis_dataset_example.py # vLLM sentiment analysis dataset example
│   └── xai_raw_dataset_example.py # xAI raw dataset example
├── synthgenai # Main package source code
│   ├── dataset # Dataset handling modules
│   │   ├── __init__.py # Dataset package initializer
│   │   ├── base_dataset.py # Base dataset class and common functionality
│   │   └── dataset.py # Main dataset implementation
│   ├── dataset_genetors # Dataset generation modules
│   │   ├── __init__.py # Dataset generators package initializer
│   │   ├── classification_dataset_generator.py # Text classification dataset generator
│   │   ├── dataset_generator.py # Base dataset generator class
│   │   ├── instruction_dataset_generator.py # Instruction-following dataset generator
│   │   ├── preference_dataset_generator.py # Preference dataset generator (RLHF)
│   │   ├── raw_dataset_generator.py # Raw text dataset generator
│   │   ├── sentiment_dataset_generator.py # Sentiment analysis dataset generator
│   │   └── summarization_dataset_generator.py # Text summarization dataset generator
│   ├── llm # LLM interaction modules
│   │   ├── __init__.py # LLM package initializer
│   │   ├── base_llm.py # Base LLM class and common functionality
│   │   └── llm.py # Main LLM implementation with LiteLLM integration
│   ├── prompts # Prompt templates for different dataset types
│   │   ├── description_system_prompt # System prompt for generating descriptions
│   │   ├── description_user_prompt # User prompt template for descriptions
│   │   ├── entry_classification_system_prompt # System prompt for classification entries
│   │   ├── entry_instruction_system_prompt # System prompt for instruction entries
│   │   ├── entry_preference_system_prompt # System prompt for preference entries
│   │   ├── entry_raw_system_prompt # System prompt for raw text entries
│   │   ├── entry_sentiment_system_prompt # System prompt for sentiment entries
│   │   ├── entry_summarization_system_prompt # System prompt for summarization entries
│   │   ├── entry_user_prompt # User prompt template for dataset entries
│   │   ├── keyword_system_prompt # System prompt for keyword generation
│   │   ├── keyword_user_prompt # User prompt template for keywords
│   │   ├── labels_system_prompt # System prompt for label generation
│   │   └── labels_user_prompt # User prompt template for labels
│   ├── schemas # Pydantic data models and validation schemas
│   │   ├── __init__.py # Schemas package initializer
│   │   ├── config.py # Configuration data models
│   │   ├── datasets.py # Dataset-related data models
│   │   ├── enums.py # Enumeration definitions
│   │   └── messages.py # Message and response data models
│   ├── utils # Utility functions and helpers
│   │   ├── file_utils.py # File I/O operations and utilities
│   │   ├── __init__.py # Utils package initializer
│   │   ├── json_utils.py # JSON processing utilities
│   │   ├── progress_utils.py # Progress tracking and display utilities
│   │   ├── prompt_utils.py # Prompt processing and formatting utilities
│   │   ├── text_utils.py # Text manipulation and processing utilities
│   │   └── yaml_utils.py # YAML processing utilities
│   ├── __init__.py # Main package initializer and version info
│   └── cli.py # Command-line interface implementation
├── tests # Test suite for the package
│   ├── __init__.py # Tests package initializer
│   ├── conftest.py # pytest configuration and fixtures
│   ├── test_dataset_generator.py # Tests for dataset generators
│   ├── test_dataset.py # Tests for dataset functionality
│   └── test_llm.py # Tests for LLM integration
├── .gitignore # Git ignore rules for excluded files
├── .pre-commit-config.yaml # Pre-commit hooks configuration
├── .python-version # Python version specification for pyenv
├── LICENCE.txt # MIT License file
├── mkdocs.yml # MkDocs documentation configuration
├── pyproject.toml # Python project metadata and dependencies (PEP 518)
├── README.md # Main project documentation and overview
└── uv.lock # UV lockfile for reproducible dependency resolution
File details
Details for the file synthgenai-2.0.1.tar.gz.
File metadata
- Download URL: synthgenai-2.0.1.tar.gz
- Upload date:
- Size: 425.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.17
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9437ebe77bec3e2cf3b37fd26a8c99c3f3f976a48e45d1c54aab741067eb94fb |
| MD5 | adef33c047fd537f209ed316c08e420e |
| BLAKE2b-256 | 49c2e5943ae8f294dbe87e55765097d1aa66ee4cd9511effbbcee3c231f6b3da |
File details
Details for the file synthgenai-2.0.1-py3-none-any.whl.
File metadata
- Download URL: synthgenai-2.0.1-py3-none-any.whl
- Upload date:
- Size: 48.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.17
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a90cc94141b1e61aebb746154e49e05b22c229a32d646906ef4187728f5fe1cd |
| MD5 | 5c7948f7abd9eb9097afdcc269e31f2e |
| BLAKE2b-256 | ebae6509e8c86853a66029647958c96247722ca58b547dc88371fd19e75fe813 |
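A downloaded file can be checked against the SHA256 digests listed above with the standard library. This is a minimal sketch; the helper names `sha256_of_file` and `matches_expected` are illustrative, and the file path is whatever location the archive or wheel was saved to.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_expected("synthgenai-2.0.1-py3-none-any.whl", "a90cc94141b1e61aebb746154e49e05b22c229a32d646906ef4187728f5fe1cd")` should return True for an intact copy of the wheel.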