SynthGenAI - Package for generating Synthetic Datasets.

SynthGenAI - Package for Generating Synthetic Datasets using LLMs

SynthGenAI is a package for generating synthetic datasets. The goal is a tool that is simple to use and can generate datasets on different topics by using LLMs from different API providers. The package is designed to be modular, so it can be easily extended with additional LLM API providers and new features.

[!IMPORTANT] The package is still in the early stages of development and some features may not be fully implemented or tested. If you find any issues or have any suggestions, feel free to open an issue or create a pull request.

Why SynthGenAI now? 🤔

Interest in synthetic data generation has surged recently, driven by the growing recognition of data as a critical asset in AI development. As Ilya Sutskever, one of the most important figures in AI, says: 'Data is the fossil fuel of AI.' The more quality data we have, the better our models can perform. However, access to data is often restricted due to privacy concerns, or it may be prohibitively expensive to collect. Additionally, the vast amount of high-quality data on the internet has already been extensively mined. Synthetic data generation addresses these challenges by allowing us to create diverse and useful datasets using current pre-trained Large Language Models (LLMs). Beyond LLMs, synthetic data also holds immense potential for pre-training and post-training of Small Language Models (SLMs), which are gaining popularity due to their efficiency and suitability for specific, resource-constrained applications. By leveraging synthetic data for both LLMs and SLMs, we can enhance performance across a wide range of use cases while balancing resource efficiency and model effectiveness. This approach enables us to harness the strengths of both synthetic and authentic datasets to achieve optimal outcomes.

Tools used for building SynthGenAI 🧰

The package is built using Python and the following libraries:

  • uv, an extremely fast Python package and project manager, written in Rust.
  • LiteLLM, a Python SDK for accessing LLMs from different API providers through a standardized OpenAI format.
  • Langfuse, an LLMOps platform for observability, traceability, and monitoring of LLMs.
  • Pydantic, data validation and settings management using Python type annotations.
  • Hugging Face Hub & Datasets, Python libraries for saving generated datasets on the Hugging Face Hub.

Installation 🛠️

To install the package, you can use the following command:

pip install synthgenai

or, if you want to use the uv package manager, you can use the following command:

uv add synthgenai

or you can install the package directly from the source code using the following commands:

git clone https://github.com/Shekswess/synthgenai.git
cd synthgenai
uv build
pip install ./dist/synthgenai-{version}-py3-none-any.whl

Requirements 📋

To use the package, you need to have the following requirements installed:

  • Python 3.10+
  • uv for building the package directly from the source code
  • Ollama running on your local machine if you want to use Ollama as an API provider (optional)
  • Langfuse running on your local machine or in the cloud if you want to use Langfuse for traceability (optional)
  • Hugging Face Hub account and an access token if you want to save the generated datasets on the Hugging Face Hub (optional)
  • Gradio for using the SynthGenAI UI (optional)

Quick Start 🚀

After installation, get started quickly by using the CLI:

# 1. See what environment variables you need
synthgenai env-setup

# 2. Set up your API keys (example for OpenAI)
export OPENAI_API_KEY="your-api-key-here"

# 3. List available dataset types
synthgenai list-types

# 4. Generate your first dataset
synthgenai generate instruction \
  --model "openai/gpt-5" \
  --topic "Python Programming" \
  --domain "Software Development" \
  --entries 100

# 5. See more examples
synthgenai examples
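
The generation step above can also be launched from a Python script. The following is a minimal sketch that only assembles the CLI command shown in step 4 and, optionally, runs it. The helper names `build_generate_command` and `run_generate` are hypothetical and not part of SynthGenAI, and the sketch assumes the `synthgenai` executable is on your `PATH`.

```python
import shutil
import subprocess


def build_generate_command(dataset_type, model, topic, domain, entries):
    """Assemble the `synthgenai generate` call from the quick start as an argument list."""
    return [
        "synthgenai", "generate", dataset_type,
        "--model", model,
        "--topic", topic,
        "--domain", domain,
        "--entries", str(entries),
    ]


def run_generate(command):
    """Run the command if the CLI is installed; raise a clear error otherwise."""
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(
            "synthgenai CLI not found; install it with `pip install synthgenai`"
        )
    # check=True raises CalledProcessError if the CLI exits with a non-zero status.
    return subprocess.run(command, check=True)


# Same values as the quick start example; nothing is executed here.
command = build_generate_command(
    "instruction", "openai/gpt-5", "Python Programming", "Software Development", 100
)
print(" ".join(command))
```

Passing the arguments as a list (rather than a single shell string) avoids quoting problems with topics or domains that contain spaces. Call `run_generate(command)` once the required API key is exported.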

Available Commands

  • synthgenai generate - Generate synthetic datasets
  • synthgenai list-types - Show all available dataset types
  • synthgenai examples - Display example commands
  • synthgenai providers - List supported LLM providers
  • synthgenai env-setup - Show environment setup guide
  • synthgenai --help - Show help information

Usage 👨‍💻

Supported API Providers 💪

  • Groq - more info about Groq models that can be used, can be found here
  • Mistral AI - more info about Mistral AI models that can be used, can be found here
  • Gemini - more info about Gemini models that can be used, can be found here
  • Bedrock - more info about Bedrock models that can be used, can be found here
  • Anthropic - more info about Anthropic models that can be used, can be found here
  • OpenAI - more info about OpenAI models that can be used, can be found here
  • Hugging Face - more info about Hugging Face models that can be used, can be found here
  • Ollama - more info about Ollama models that can be used, can be found here
  • vLLM - more info about vLLM models that can be used, can be found here
  • SageMaker - more info about SageMaker models that can be used, can be found here
  • Azure - more info about Azure and Azure AI models that can be used, can be found here & here
  • Vertex AI - more info about Vertex AI models that can be used, can be found here
  • DeepSeek - more info about DeepSeek models that can be used, can be found here
  • xAI - more info about xAI models that can be used, can be found here
  • OpenRouter - more info about OpenRouter models that can be used, can be found here

Environment Setup & Configuration 🔐

For detailed information about setting up environment variables for different API providers, observability tools, and dataset management, please refer to the Installation Guide.

Logging Configuration

You can control the logging verbosity using the SYNTHGENAI_DETAILED_MODE environment variable:

# For detailed logging (shows all debug information)
export SYNTHGENAI_DETAILED_MODE="false"

# For NO logging (default)
export SYNTHGENAI_DETAILED_MODE="true"

[!NOTE] By default, SYNTHGENAI_DETAILED_MODE is set to "true", which provides NO logging output. Set it to "false" to enable detailed debugging information during dataset generation.

Observability & Saving Datasets 📊

To observe dataset generation, you can use Langfuse for traceability and monitoring of the LLMs.

For handling the datasets and saving them on Hugging Face Hub, you can use the Hugging Face Datasets library.

Currently there are six types of datasets that can be generated using SynthGenAI:

  • Raw Datasets
  • Instruction Datasets
  • Preference Datasets
  • Sentiment Analysis Datasets
  • Summarization Datasets
  • Text Classification Datasets

The datasets can be generated:

  • Synchronously - each dataset entry is generated one by one
  • Asynchronously - a batch of dataset entries is generated at once

[!NOTE] Asynchronous generation is faster than synchronous generation, but some LLM providers limit the number of tokens that can be generated at once.
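
To make the difference between the two modes concrete, here is a small, self-contained sketch. It does not use SynthGenAI's own classes: `fake_llm_call` is a hypothetical stand-in that sleeps instead of calling a provider, and `batch_size` is an illustrative parameter that caps how many requests are in flight, which is one way to stay within provider limits.

```python
import asyncio
import time


async def fake_llm_call(keyword: str) -> dict:
    """Stand-in for a provider request; sleeps instead of calling a real API."""
    await asyncio.sleep(0.05)
    return {"keyword": keyword, "text": f"entry about {keyword}"}


async def generate_sequential(keywords):
    """One request at a time: total time grows with the number of entries."""
    return [await fake_llm_call(keyword) for keyword in keywords]


async def generate_batched(keywords, batch_size=4):
    """Send each batch concurrently; batch_size caps the requests in flight."""
    entries = []
    for start in range(0, len(keywords), batch_size):
        batch = keywords[start:start + batch_size]
        # gather preserves input order, so entries line up with keywords.
        entries.extend(await asyncio.gather(*(fake_llm_call(k) for k in batch)))
    return entries


def timed(coroutine):
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coroutine)
    return result, time.perf_counter() - start


keywords = [f"topic-{i}" for i in range(8)]
sequential_entries, sequential_seconds = timed(generate_sequential(keywords))
batched_entries, batched_seconds = timed(generate_batched(keywords))
print(f"sequential: {sequential_seconds:.2f}s, batched: {batched_seconds:.2f}s")
```

Both functions return the same entries in the same order; only the waiting time differs, because the batched version overlaps the waits within each batch.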

More Examples 📖

More examples with different combinations of LLM API providers and dataset configurations can be found in the examples directory.

[!IMPORTANT] Generation of the dataset keywords and the dataset entries can sometimes fail because the LLM does not return a valid JSON object (the package handles this case). It is therefore recommended to use models that are capable of generating JSON objects (structured output). A list of models that can generate JSON objects can be found here.
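
As an illustration of why JSON output can fail and how such failures are commonly tolerated, the sketch below extracts the first JSON object from a model response and retries when none is found. This is not SynthGenAI's internal implementation; `parse_json_object`, `generate_with_retry`, and the `call_model` callable are hypothetical names used only for this example.

```python
import json


def parse_json_object(raw_response: str):
    """Return the first JSON object found in an LLM response, or None if there is none."""
    decoder = json.JSONDecoder()
    # Models often wrap JSON in extra prose, so try every "{" as a possible start.
    for index, char in enumerate(raw_response):
        if char != "{":
            continue
        try:
            candidate, _ = decoder.raw_decode(raw_response, index)
        except json.JSONDecodeError:
            continue
        if isinstance(candidate, dict):
            return candidate
    return None


def generate_with_retry(call_model, max_attempts=3):
    """Call the model until it returns a parseable JSON object or attempts run out."""
    for _ in range(max_attempts):
        parsed = parse_json_object(call_model())
        if parsed is not None:
            return parsed
    return None


# Simulated responses: the first has no JSON, the second wraps it in prose.
responses = iter([
    "Sure! I can help with that.",
    'Here is the entry: {"keyword": "python", "text": "generators"} Hope it helps.',
])
entry = generate_with_retry(lambda: next(responses))
print(entry)
```

Models with native structured-output support make the retry path rare, which is why they are the recommended choice.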

Generated Datasets 📚

Examples of generated synthetic datasets can be found on the SynthGenAI Datasets Collection on Hugging Face Hub.

Contributing 🤝

If you want to contribute to this project and make it better, your help is very welcome. Create a pull request with your changes and I will review it. If you have any questions, open an issue.

License 📝

This project is licensed under the MIT License - see the LICENSE.md file for details.

Repo Structure 📂

.
├── .github/                                                      # GitHub configuration files and workflows
│   ├── workflows/                                                # GitHub Actions workflows
│   │   ├── build_n_publish.yaml                                  # Build and publish workflow
│   │   ├── docs.yaml                                             # Documentation deployment workflow
│   │   └── uv-ci.yaml                                            # UV package manager CI workflow
│   └── depandabot.yml                                            # Dependabot configuration for automatic dependency updates
├── docs                                                          # MkDocs documentation source files
│   ├── assets                                                    # Static assets for documentation
│   │   ├── favicon.png                                           # Website favicon
│   │   ├── logo_header.png                                       # Header logo image
│   │   └── logo.svg                                              # SVG logo for the project
│   ├── configurations                                            # Configuration documentation
│   │   ├── dataset_configuration.md                              # Dataset configuration guide
│   │   ├── dataset_generator_configuration.md                    # Dataset generator configuration guide
│   │   ├── index.md                                              # Configuration section index
│   │   └── llm_configuration.md                                  # LLM configuration guide
│   ├── contributing                                              # Contribution guidelines
│   │   └── index.md                                              # How to contribute to the project
│   ├── datasets                                                  # Dataset type documentation
│   │   ├── index.md                                              # Dataset types overview
│   │   ├── instruction_datasets.md                               # Instruction datasets documentation
│   │   ├── preference_datasets.md                                # Preference datasets documentation
│   │   ├── raw_datasets.md                                       # Raw datasets documentation
│   │   ├── sentiment_analysis_datasets.md                        # Sentiment analysis datasets documentation
│   │   ├── summarization_datasets.md                             # Summarization datasets documentation
│   │   └── text_classification_datasets.md                       # Text classification datasets documentation
│   ├── examples                                                  # Examples documentation
│   │   └── index.md                                              # Code examples and usage patterns
│   ├── index.md                                                  # Main documentation homepage
│   ├── installation                                              # Installation documentation
│   │   └── index.md                                              # Installation guide and requirements
│   ├── llm_providers                                             # LLM provider documentation
│   │   └── index.md                                              # Supported LLM providers guide
│   ├── quick_start                                               # Quick start guide
│   │   └── index.md                                              # Getting started tutorial
│   └── stylesheets                                               # Custom CSS styles for documentation
├── examples                                                      # Python example scripts demonstrating usage
│   ├── anthropic_instruction_dataset_example.py                  # Anthropic API instruction dataset example
│   ├── azure_ai_preference_dataset_example.py                    # Azure AI preference dataset example
│   ├── azure_summarization_dataset_example.py                    # Azure summarization dataset example
│   ├── bedrock_raw_dataset_example.py                            # AWS Bedrock raw dataset example
│   ├── deepseek_instruction_dataset_example.py                   # DeepSeek instruction dataset example
│   ├── gemini_langfuse_raw_dataset_example.py                    # Gemini with Langfuse raw dataset example
│   ├── groq_preference_dataset_example.py                        # Groq preference dataset example
│   ├── huggingface_instruction_dataset_example.py                # Hugging Face instruction dataset example
│   ├── mistral_preference_dataset_example.py                     # Mistral AI preference dataset example
│   ├── ollama_preference_dataset_example.py                      # Ollama preference dataset example
│   ├── openai_raw_dataset_example.py                             # OpenAI raw dataset example
│   ├── openrouter_raw_dataset_example.py                         # OpenRouter raw dataset example
│   ├── sagemaker_summarization_dataset_example.py                # AWS SageMaker summarization dataset example
│   ├── vertex_ai_text_classification_dataset_example.py          # Google Vertex AI text classification example
│   ├── vllm_sentiment_analysis_dataset_example.py                # vLLM sentiment analysis dataset example
│   └── xai_raw_dataset_example.py                                # xAI raw dataset example
├── synthgenai                                                    # Main package source code
│   ├── dataset                                                   # Dataset handling modules
│   │   ├── __init__.py                                           # Dataset package initializer
│   │   ├── base_dataset.py                                       # Base dataset class and common functionality
│   │   └── dataset.py                                            # Main dataset implementation
│   ├── dataset_genetors                                          # Dataset generation modules
│   │   ├── __init__.py                                           # Dataset generators package initializer
│   │   ├── classification_dataset_generator.py                   # Text classification dataset generator
│   │   ├── dataset_generator.py                                  # Base dataset generator class
│   │   ├── instruction_dataset_generator.py                      # Instruction-following dataset generator
│   │   ├── preference_dataset_generator.py                       # Preference dataset generator (RLHF)
│   │   ├── raw_dataset_generator.py                              # Raw text dataset generator
│   │   ├── sentiment_dataset_generator.py                        # Sentiment analysis dataset generator
│   │   └── summarization_dataset_generator.py                    # Text summarization dataset generator
│   ├── llm                                                       # LLM interaction modules
│   │   ├── __init__.py                                           # LLM package initializer
│   │   ├── base_llm.py                                           # Base LLM class and common functionality
│   │   └── llm.py                                                # Main LLM implementation with LiteLLM integration
│   ├── prompts                                                   # Prompt templates for different dataset types
│   │   ├── description_system_prompt                             # System prompt for generating descriptions
│   │   ├── description_user_prompt                               # User prompt template for descriptions
│   │   ├── entry_classification_system_prompt                    # System prompt for classification entries
│   │   ├── entry_instruction_system_prompt                       # System prompt for instruction entries
│   │   ├── entry_preference_system_prompt                        # System prompt for preference entries
│   │   ├── entry_raw_system_prompt                               # System prompt for raw text entries
│   │   ├── entry_sentiment_system_prompt                         # System prompt for sentiment entries
│   │   ├── entry_summarization_system_prompt                     # System prompt for summarization entries
│   │   ├── entry_user_prompt                                     # User prompt template for dataset entries
│   │   ├── keyword_system_prompt                                 # System prompt for keyword generation
│   │   ├── keyword_user_prompt                                   # User prompt template for keywords
│   │   ├── labels_system_prompt                                  # System prompt for label generation
│   │   └── labels_user_prompt                                    # User prompt template for labels
│   ├── schemas                                                   # Pydantic data models and validation schemas
│   │   ├── __init__.py                                           # Schemas package initializer
│   │   ├── config.py                                             # Configuration data models
│   │   ├── datasets.py                                           # Dataset-related data models
│   │   ├── enums.py                                              # Enumeration definitions
│   │   └── messages.py                                           # Message and response data models
│   ├── utils                                                     # Utility functions and helpers
│   │   ├── file_utils.py                                         # File I/O operations and utilities
│   │   ├── __init__.py                                           # Utils package initializer
│   │   ├── json_utils.py                                         # JSON processing utilities
│   │   ├── progress_utils.py                                     # Progress tracking and display utilities
│   │   ├── prompt_utils.py                                       # Prompt processing and formatting utilities
│   │   ├── text_utils.py                                         # Text manipulation and processing utilities
│   │   └── yaml_utils.py                                         # YAML processing utilities
│   ├── __init__.py                                               # Main package initializer and version info
│   └── cli.py                                                    # Command-line interface implementation
├── tests                                                         # Test suite for the package
│   ├── __init__.py                                               # Tests package initializer
│   ├── conftest.py                                               # pytest configuration and fixtures
│   ├── test_dataset_generator.py                                 # Tests for dataset generators
│   ├── test_dataset.py                                           # Tests for dataset functionality
│   └── test_llm.py                                               # Tests for LLM integration
├── .gitignore                                                    # Git ignore rules for excluded files
├── .pre-commit-config.yaml                                       # Pre-commit hooks configuration
├── .python-version                                               # Python version specification for pyenv
├── LICENCE.txt                                                   # MIT License file
├── mkdocs.yml                                                    # MkDocs documentation configuration
├── pyproject.toml                                                # Python project metadata and dependencies (PEP 518)
├── README.md                                                     # Main project documentation and overview
└── uv.lock                                                       # UV lockfile for reproducible dependency resolution
