SynthGenAI - Package for generating Synthetic Datasets.

SynthGenAI - Package for Generating Synthetic Datasets using LLMs

SynthGenAI is a package for generating synthetic datasets. The idea is to have a tool that is simple to use and can generate datasets on different topics by using LLMs from different API providers. The package is designed to be modular, so it can be easily extended with additional LLM API providers and new features.

[!IMPORTANT] The package is still in the early stages of development and some features may not be fully implemented or tested. If you find any issues or have any suggestions, feel free to open an issue or create a pull request.

Why SynthGenAI now? 🤔

Interest in synthetic data generation has surged recently, driven by the growing recognition of data as a critical asset in AI development. As Ilya Sutskever, one of the most important figures in AI, says: 'Data is the fossil fuel of AI.' The more quality data we have, the better our models can perform. However, access to data is often restricted due to privacy concerns, or it may be prohibitively expensive to collect. Additionally, the vast amount of high-quality data on the internet has already been extensively mined. Synthetic data generation addresses these challenges by allowing us to create diverse and useful datasets using current pre-trained Large Language Models (LLMs). Beyond LLMs, synthetic data also holds immense potential for training and fine-tuning Small Language Models (SLMs), which are gaining popularity due to their efficiency and suitability for specific, resource-constrained applications. By leveraging synthetic data for both LLMs and SLMs, we can enhance performance across a wide range of use cases while balancing resource efficiency and model effectiveness. This approach enables us to harness the strengths of both synthetic and authentic datasets to achieve optimal outcomes.

Tools used for building SynthGenAI 🧰

The package is built using Python and the following libraries:

  • uv, an extremely fast Python package and project manager, written in Rust.
  • litellm, a Python SDK for accessing LLMs from different API providers through a standardized OpenAI format.
  • langfuse, an LLMOps platform for observability, traceability and monitoring of LLMs.
  • pydantic, data validation and settings management using Python type annotations.
  • huggingface-hub & datasets, Python libraries for saving generated datasets on Hugging Face Hub.

Installation 🛠️

To install the package, you can use the following command:

pip install synthgenai

or you can install the package directly from the source code using the following commands:

git clone https://github.com/Shekswess/synthgenai.git
cd synthgenai
uv build
pip install ./dist/synthgenai-{version}-py3-none-any.whl
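
To confirm that the installation worked, a quick check such as the one below may help. This is a minimal sketch that uses only the Python standard library; `installed_version` is a hypothetical helper name and is not part of SynthGenAI.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("synthgenai")
if found:
    print(f"synthgenai version: {found}")
else:
    print("synthgenai is not installed in this environment")
```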

Requirements 📋

To use the package, you need to have the following requirements installed:

  • Python 3.10+
  • uv, for building the package directly from the source code
  • Ollama running on your local machine, if you want to use Ollama as an API provider (optional)
  • Langfuse running on your local machine or in the cloud, if you want to use Langfuse for traceability (optional)
  • A Hugging Face Hub account and an access token, if you want to save the generated datasets on Hugging Face Hub (optional)

Usage 👨‍💻

The available API providers for LLMs are:

  • Groq
  • Mistral AI
  • Gemini
  • Bedrock
  • Anthropic
  • OpenAI
  • Hugging Face
  • Ollama
  • vLLM
  • SageMaker
  • Azure
  • Vertex AI

For observing the generated datasets, you can use Langfuse for traceability and monitoring of the LLMs.

To use the LLMs from different API providers, to observe the generated datasets, and to save the generated datasets on Hugging Face Hub, you need to set the following environment variables:

# API keys for different LLM providers
GROQ_API_KEY=
MISTRAL_API_KEY=
GEMINI_API_KEY=
AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=
AWS_REGION=
AWS_PROFILE=
ANTHROPIC_API_KEY=
OPENAI_API_KEY=
AZURE_API_KEY=
AZURE_API_BASE=
AZURE_API_VERSION=
AZURE_AD_TOKEN=
AZURE_API_TYPE=
GOOGLE_APPLICATION_CREDENTIALS=
VERTEXAI_LOCATION=
VERTEXAI_PROJECT=
HUGGINGFACE_API_KEY=

# Langfuse API keys
LANGFUSE_PUBLIC_KEY=
LANGFUSE_SECRET_KEY=
LANGFUSE_HOST=

# Hugging Face token for uploading datasets to Hugging Face Hub
HF_TOKEN=
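
Only the variables for the provider and optional services you actually use need values. As a minimal sketch, the check below reports which variables from a given list are unset or empty before a generation run starts. `missing_env_vars` is a hypothetical helper, not part of SynthGenAI, and the list passed to it is just an example taken from the variables above.

```python
import os
from typing import Iterable, List, Mapping


def missing_env_vars(names: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are unset or contain only whitespace in `env`."""
    return [name for name in names if not env.get(name, "").strip()]


# Example: check the optional Langfuse and Hugging Face variables.
optional_vars = ["LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY", "LANGFUSE_HOST", "HF_TOKEN"]
for name in missing_env_vars(optional_vars):
    print(f"{name} is not set")
```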

Currently there are six types of datasets that can be generated using SynthGenAI:

  • Raw Datasets
  • Instruction Datasets
  • Preference Datasets
  • Sentiment Analysis Datasets
  • Summarization Datasets
  • Text Classification Datasets

The datasets can be generated:

  • Synchronously - dataset entries are generated one by one
  • Asynchronously - a batch of dataset entries is generated at once

[!NOTE] Asynchronous generation is faster than synchronous generation, but some LLM providers limit the number of tokens that can be generated at once.
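
The difference between the two modes can be illustrated with plain `asyncio`. This is only a sketch of the general idea, not SynthGenAI's implementation: `fake_generate_entry` stands in for a real LLM call, and `max_concurrency` is a hypothetical knob showing how a semaphore could keep the number of in-flight requests under a provider's limits.

```python
import asyncio
from typing import List


async def fake_generate_entry(keyword: str) -> dict:
    """Stand-in for one LLM call; sleeps briefly instead of calling a provider."""
    await asyncio.sleep(0.01)
    return {"keyword": keyword, "generated_entry": {"text": f"text about {keyword}"}}


async def generate_sequentially(keywords: List[str]) -> List[dict]:
    """One entry at a time: each call waits for the previous one to finish."""
    return [await fake_generate_entry(keyword) for keyword in keywords]


async def generate_concurrently(keywords: List[str], max_concurrency: int = 5) -> List[dict]:
    """Many entries in flight at once, capped by a semaphore."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(keyword: str) -> dict:
        async with semaphore:
            return await fake_generate_entry(keyword)

    # gather returns results in the same order as the awaitables passed in.
    return list(await asyncio.gather(*(bounded(keyword) for keyword in keywords)))


keywords = [f"keyword_{i}" for i in range(20)]
sequential = asyncio.run(generate_sequentially(keywords))
concurrent = asyncio.run(generate_concurrently(keywords))
```

Both calls produce the same entries in the same order; the concurrent version simply overlaps the waiting time of up to `max_concurrency` calls.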

Raw Datasets 🥩

To generate a raw dataset, you can use the following code:

# For asynchronous dataset generation
# import asyncio
import os

from synthgenai import (
    DatasetConfig,
    DatasetGeneratorConfig,
    LLMConfig,
    RawDatasetGenerator,
)

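# Setting the API key ("LLM_API_KEY" is a placeholder; use the variable
# for your provider from the list above, e.g. "GROQ_API_KEY")
os.environ["LLM_API_KEY"] = ""

# Optional for Langfuse traceability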
os.environ["LANGFUSE_SECRET_KEY"] = ""
os.environ["LANGFUSE_PUBLIC_KEY"] = ""
os.environ["LANGFUSE_HOST"] = ""

# Optional for Hugging Face Hub upload
os.environ["HF_TOKEN"] = ""

# Creating the LLMConfig
llm_config = LLMConfig(
    model="model_provider/model_name", # Check liteLLM docs for more info
    temperature=0.5,
    top_p=0.9,
    max_tokens=2048,
)

# Creating the DatasetConfig
dataset_config = DatasetConfig(
    topic="topic_name",
    domains=["domain1", "domain2"],
    language="English",
    additional_description="Additional description",
    num_entries=1000
)

# Creating the DatasetGeneratorConfig
dataset_generator_config = DatasetGeneratorConfig(
    llm_config=llm_config,
    dataset_config=dataset_config,
)

# Creating the RawDatasetGenerator
raw_dataset_generator = RawDatasetGenerator(dataset_generator_config)

# Generating the dataset
raw_dataset = raw_dataset_generator.generate_dataset()

# Generating the dataset asynchronously
# raw_dataset = asyncio.run(raw_dataset_generator.agenerate_dataset())

# Name of the Hugging Face repository where the dataset will be saved
hf_repo_name = "organization_or_user_name/dataset_name" # optional

# Saving the dataset locally and, optionally, to the Hugging Face repository
raw_dataset.save_dataset(
    hf_repo_name=hf_repo_name,
)

Example of generated entry for the raw dataset:

{
  "keyword": "keyword",
  "topic": "topic",
  "language": "language",
  "generated_entry": {
    "text": "generated text"
  }
}
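
If you load entries of this shape into plain Python dictionaries, the generated text is easy to pull out for further processing. The snippet below is a minimal sketch that assumes entries follow the structure shown above; `unique_texts` is a hypothetical helper, not part of SynthGenAI.

```python
from typing import Iterable, List


def unique_texts(entries: Iterable[dict]) -> List[str]:
    """Collect the generated text of each entry, dropping empty and duplicate texts."""
    seen = set()
    texts = []
    for entry in entries:
        text = entry["generated_entry"]["text"].strip()
        if text and text not in seen:
            seen.add(text)
            texts.append(text)
    return texts


entries = [
    {"keyword": "k1", "topic": "t", "language": "English", "generated_entry": {"text": "First text."}},
    {"keyword": "k2", "topic": "t", "language": "English", "generated_entry": {"text": "First text."}},
    {"keyword": "k3", "topic": "t", "language": "English", "generated_entry": {"text": "Second text."}},
]
print(unique_texts(entries))  # ['First text.', 'Second text.']
```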

Instruction Datasets 💬

To generate an instruction dataset, you can use the following code:

# For asynchronous dataset generation
# import asyncio
import os

from synthgenai import (
    DatasetConfig,
    DatasetGeneratorConfig,
    LLMConfig,
    InstructionDatasetGenerator,
)

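# Setting the API key ("LLM_API_KEY" is a placeholder; use the variable
# for your provider from the list above, e.g. "GROQ_API_KEY")
os.environ["LLM_API_KEY"] = ""

# Optional for Langfuse traceability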
os.environ["LANGFUSE_SECRET_KEY"] = ""
os.environ["LANGFUSE_PUBLIC_KEY"] = ""
os.environ["LANGFUSE_HOST"] = ""

# Optional for Hugging Face Hub upload
os.environ["HF_TOKEN"] = ""

# Creating the LLMConfig
llm_config = LLMConfig(
    model="model_provider/model_name", # Check liteLLM docs for more info
    temperature=0.5,
    top_p=0.9,
    max_tokens=2048,
)

# Creating the DatasetConfig
dataset_config = DatasetConfig(
    topic="topic_name",
    domains=["domain1", "domain2"],
    language="English",
    additional_description="Additional description",
    num_entries=1000
)

# Creating the DatasetGeneratorConfig
dataset_generator_config = DatasetGeneratorConfig(
    llm_config=llm_config,
    dataset_config=dataset_config,
)

# Creating the InstructionDatasetGenerator
instruction_dataset_generator = InstructionDatasetGenerator(dataset_generator_config)

# Generating the dataset
instruction_dataset = instruction_dataset_generator.generate_dataset()

# Generating the dataset asynchronously
# instruction_dataset = asyncio.run(instruction_dataset_generator.agenerate_dataset())

# Name of the Hugging Face repository where the dataset will be saved
hf_repo_name = "organization_or_user_name/dataset_name" # optional

# Saving the dataset locally and, optionally, to the Hugging Face repository
instruction_dataset.save_dataset(
    hf_repo_name=hf_repo_name,
)

Example of generated entry for the instruction dataset:

{
  "keyword": "keyword",
  "topic": "topic",
  "language": "language",
  "generated_entry": {
    "messages": [
      {
        "role": "system",
        "content": "generated system(instruction) prompt"
      },
      {
        "role": "user",
        "content": "generated user prompt"
      },
      {
        "role": "assistant",
        "content": "generated assistant response"
      }
    ]
  }
}
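
Before using instruction entries for fine-tuning, it can be worth checking that each one really has the system, user, assistant sequence shown above. The following is a minimal sketch under that assumption; `has_expected_roles` is a hypothetical helper, not part of SynthGenAI.

```python
EXPECTED_ROLES = ["system", "user", "assistant"]


def has_expected_roles(entry: dict) -> bool:
    """Check that messages follow the system, user, assistant order with non-empty content."""
    messages = entry.get("generated_entry", {}).get("messages", [])
    roles = [message.get("role") for message in messages]
    has_content = all(str(message.get("content", "")).strip() for message in messages)
    return roles == EXPECTED_ROLES and has_content


valid_entry = {
    "generated_entry": {
        "messages": [
            {"role": "system", "content": "You are a helpful tutor."},
            {"role": "user", "content": "What is a prime number?"},
            {"role": "assistant", "content": "A number greater than 1 with exactly two divisors."},
        ]
    }
}
incomplete_entry = {
    "generated_entry": {
        "messages": [{"role": "user", "content": "What is a prime number?"}]
    }
}

print(has_expected_roles(valid_entry))       # True
print(has_expected_roles(incomplete_entry))  # False
```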

Preference Datasets 🌟

To generate a preference dataset, you can use the following code:

# For asynchronous dataset generation
# import asyncio
import os

from synthgenai import (
    DatasetConfig,
    DatasetGeneratorConfig,
    LLMConfig,
    PreferenceDatasetGenerator,
)

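# Setting the API key ("LLM_API_KEY" is a placeholder; use the variable
# for your provider from the list above, e.g. "GROQ_API_KEY")
os.environ["LLM_API_KEY"] = ""

# Optional for Langfuse traceability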
os.environ["LANGFUSE_SECRET_KEY"] = ""
os.environ["LANGFUSE_PUBLIC_KEY"] = ""
os.environ["LANGFUSE_HOST"] = ""

# Optional for Hugging Face Hub upload
os.environ["HF_TOKEN"] = ""

# Creating the LLMConfig
llm_config = LLMConfig(
    model="model_provider/model_name", # Check liteLLM docs for more info
    temperature=0.5,
    top_p=0.9,
    max_tokens=2048,
)

# Creating the DatasetConfig
dataset_config = DatasetConfig(
    topic="topic_name",
    domains=["domain1", "domain2"],
    language="English",
    additional_description="Additional description",
    num_entries=1000
)

# Creating the DatasetGeneratorConfig
dataset_generator_config = DatasetGeneratorConfig(
    llm_config=llm_config,
    dataset_config=dataset_config,
)

# Creating the PreferenceDatasetGenerator
preference_dataset_generator = PreferenceDatasetGenerator(dataset_generator_config)

# Generating the dataset
preference_dataset = preference_dataset_generator.generate_dataset()

# Generating the dataset asynchronously
# preference_dataset = asyncio.run(preference_dataset_generator.agenerate_dataset())

# Name of the Hugging Face repository where the dataset will be saved
hf_repo_name = "organization_or_user_name/dataset_name" # optional

# Saving the dataset locally and, optionally, to the Hugging Face repository
preference_dataset.save_dataset(
    hf_repo_name=hf_repo_name,
)

Example of generated entry for the preference dataset:

{
  "keyword": "keyword",
  "topic": "topic",
  "language": "language",
  "generated_entry": {
    "prompt": [
      { "role": "system", "content": "generated system(instruction) prompt" },
      { "role": "user", "content": "generated user prompt" }
    ],
    "chosen": [
      { "role": "assistant", "content": "generated chosen assistant response" }
    ],
    "rejected": [
      {
        "role": "assistant",
        "content": "generated rejected assistant response"
      }
    ]
  }
}
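
If a downstream tool expects plain strings rather than lists of messages, a preference entry can be flattened. The snippet below is a minimal sketch that assumes the entry structure shown above; `to_plain_strings` and `join_contents` are hypothetical helpers, not part of SynthGenAI, and joining messages with a newline is just one possible convention.

```python
from typing import Dict, List


def join_contents(messages: List[dict]) -> str:
    """Join the content of several messages into one newline-separated string."""
    return "\n".join(message["content"] for message in messages)


def to_plain_strings(entry: dict) -> Dict[str, str]:
    """Flatten a preference entry into prompt, chosen and rejected strings."""
    generated = entry["generated_entry"]
    return {
        "prompt": join_contents(generated["prompt"]),
        "chosen": join_contents(generated["chosen"]),
        "rejected": join_contents(generated["rejected"]),
    }


entry = {
    "generated_entry": {
        "prompt": [
            {"role": "system", "content": "You are helpful."},
            {"role": "user", "content": "What is 2 + 2?"},
        ],
        "chosen": [{"role": "assistant", "content": "4"}],
        "rejected": [{"role": "assistant", "content": "5"}],
    }
}
print(to_plain_strings(entry))
```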

Sentiment Analysis Datasets 🎭

To generate a sentiment analysis dataset, you can use the following code:

# For asynchronous dataset generation
# import asyncio
import os

from synthgenai import (
    DatasetConfig,
    DatasetGeneratorConfig,
    LLMConfig,
    SentimentAnalysisDatasetGenerator,
)

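# Setting the API key ("LLM_API_KEY" is a placeholder; use the variable
# for your provider from the list above, e.g. "GROQ_API_KEY")
os.environ["LLM_API_KEY"] = ""

# Optional for Langfuse traceability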
os.environ["LANGFUSE_SECRET_KEY"] = ""
os.environ["LANGFUSE_PUBLIC_KEY"] = ""
os.environ["LANGFUSE_HOST"] = ""

# Optional for Hugging Face Hub upload
os.environ["HF_TOKEN"] = ""

# Creating the LLMConfig
llm_config = LLMConfig(
    model="model_provider/model_name", # Check liteLLM docs for more info
    temperature=0.5,
    top_p=0.9,
    max_tokens=2048,
)

# Creating the DatasetConfig
dataset_config = DatasetConfig(
    topic="topic_name",
    domains=["domain1", "domain2"],
    language="English",
    additional_description="Additional description",
    num_entries=1000
)

# Creating the DatasetGeneratorConfig
dataset_generator_config = DatasetGeneratorConfig(
    llm_config=llm_config,
    dataset_config=dataset_config,
)

# Creating the SentimentAnalysisDatasetGenerator
sentiment_analysis_dataset_generator = SentimentAnalysisDatasetGenerator(dataset_generator_config)

# Generating the dataset
sentiment_analysis_dataset = sentiment_analysis_dataset_generator.generate_dataset()

# Generating the dataset asynchronously
# sentiment_analysis_dataset = asyncio.run(sentiment_analysis_dataset_generator.agenerate_dataset())

# Name of the Hugging Face repository where the dataset will be saved
hf_repo_name = "organization_or_user_name/dataset_name" # optional

# Saving the dataset locally and, optionally, to the Hugging Face repository
sentiment_analysis_dataset.save_dataset(
    hf_repo_name=hf_repo_name,
)

Example of generated entry for the sentiment analysis dataset:

{
  "keyword": "keyword",
  "topic": "topic",
  "language": "language",
  "generated_entry": {
    "prompt": "generated text",
    "label": "generated sentiment (which can be positive, negative, neutral)"
  }
}
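
A quick look at the label distribution helps spot an unbalanced sentiment dataset. The snippet below is a minimal sketch that assumes entries have the structure shown above; `label_counts` is a hypothetical helper, not part of SynthGenAI, and lowercasing the labels is an assumption made to merge variants such as "Positive" and "positive".

```python
from collections import Counter
from typing import Iterable


def label_counts(entries: Iterable[dict]) -> Counter:
    """Count how often each (trimmed, lowercased) label appears."""
    return Counter(entry["generated_entry"]["label"].strip().lower() for entry in entries)


entries = [
    {"generated_entry": {"prompt": "Great product!", "label": "positive"}},
    {"generated_entry": {"prompt": "Would buy again.", "label": "Positive "}},
    {"generated_entry": {"prompt": "Broke after a day.", "label": "negative"}},
]
counts = label_counts(entries)
print(counts.most_common())  # [('positive', 2), ('negative', 1)]
```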

Text Classification Datasets 🔠

To generate a text classification dataset, you can use the following code:

# For asynchronous dataset generation
# import asyncio
import os

from synthgenai import (
    DatasetConfig,
    DatasetGeneratorConfig,
    LLMConfig,
    TextClassificationDatasetGenerator,
)

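# Setting the API key ("LLM_API_KEY" is a placeholder; use the variable
# for your provider from the list above, e.g. "GROQ_API_KEY")
os.environ["LLM_API_KEY"] = ""

# Optional for Langfuse traceability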
os.environ["LANGFUSE_SECRET_KEY"] = ""
os.environ["LANGFUSE_PUBLIC_KEY"] = ""
os.environ["LANGFUSE_HOST"] = ""

# Optional for Hugging Face Hub upload
os.environ["HF_TOKEN"] = ""

# Creating the LLMConfig
llm_config = LLMConfig(
    model="model_provider/model_name", # Check liteLLM docs for more info
    temperature=0.5,
    top_p=0.9,
    max_tokens=2048,
)

# Creating the DatasetConfig
dataset_config = DatasetConfig(
    topic="topic_name",
    domains=["domain1", "domain2"],
    language="English",
    additional_description="Additional description",
    num_entries=1000
)

# Creating the DatasetGeneratorConfig
dataset_generator_config = DatasetGeneratorConfig(
    llm_config=llm_config,
    dataset_config=dataset_config,
)

# Creating the TextClassificationDatasetGenerator
text_classification_dataset_generator = TextClassificationDatasetGenerator(dataset_generator_config)

# Generating the dataset
text_classification_dataset = text_classification_dataset_generator.generate_dataset()

# Generating the dataset asynchronously
# text_classification_dataset = asyncio.run(text_classification_dataset_generator.agenerate_dataset())

# Name of the Hugging Face repository where the dataset will be saved
hf_repo_name = "organization_or_user_name/dataset_name" # optional

# Saving the dataset locally and, optionally, to the Hugging Face repository
text_classification_dataset.save_dataset(
    hf_repo_name=hf_repo_name,
)

Example of generated entry for the text classification dataset:

{
  "keyword": "keyword",
  "topic": "topic",
  "language": "language",
  "generated_entry": {
    "prompt": "generated text",
    "label": "generated label (which will be from a list of labels, created by the model)"
  }
}
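
Because the labels are created by the model, the label set is only known after generation. Classifier training usually needs integer class ids, so a mapping can be built from the generated entries. This is a minimal sketch that assumes the entry structure shown above; `build_label_mapping` and `encode` are hypothetical helpers, not part of SynthGenAI.

```python
from typing import Dict, Iterable, List, Tuple


def build_label_mapping(entries: Iterable[dict]) -> Dict[str, int]:
    """Assign a stable integer id to each distinct label, in alphabetical order."""
    labels = sorted({entry["generated_entry"]["label"] for entry in entries})
    return {label: index for index, label in enumerate(labels)}


def encode(entries: Iterable[dict], mapping: Dict[str, int]) -> List[Tuple[str, int]]:
    """Turn entries into (text, label id) pairs using `mapping`."""
    return [
        (entry["generated_entry"]["prompt"], mapping[entry["generated_entry"]["label"]])
        for entry in entries
    ]


entries = [
    {"generated_entry": {"prompt": "The team won the final.", "label": "sports"}},
    {"generated_entry": {"prompt": "Parliament passed the bill.", "label": "politics"}},
    {"generated_entry": {"prompt": "A new transfer record was set.", "label": "sports"}},
]
mapping = build_label_mapping(entries)
print(mapping)  # {'politics': 0, 'sports': 1}
print(encode(entries, mapping))
```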

Summarization Datasets 🧾

To generate a summarization dataset, you can use the following code:

# For asynchronous dataset generation
# import asyncio
import os

from synthgenai import (
    DatasetConfig,
    DatasetGeneratorConfig,
    LLMConfig,
    SummarizationDatasetGenerator,
)

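# Setting the API key ("LLM_API_KEY" is a placeholder; use the variable
# for your provider from the list above, e.g. "GROQ_API_KEY")
os.environ["LLM_API_KEY"] = ""

# Optional for Langfuse traceability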
os.environ["LANGFUSE_SECRET_KEY"] = ""
os.environ["LANGFUSE_PUBLIC_KEY"] = ""
os.environ["LANGFUSE_HOST"] = ""

# Optional for Hugging Face Hub upload
os.environ["HF_TOKEN"] = ""

# Creating the LLMConfig
llm_config = LLMConfig(
    model="model_provider/model_name", # Check liteLLM docs for more info
    temperature=0.5,
    top_p=0.9,
    max_tokens=2048,
)

# Creating the DatasetConfig
dataset_config = DatasetConfig(
    topic="topic_name",
    domains=["domain1", "domain2"],
    language="English",
    additional_description="Additional description",
    num_entries=1000
)

# Creating the DatasetGeneratorConfig
dataset_generator_config = DatasetGeneratorConfig(
    llm_config=llm_config,
    dataset_config=dataset_config,
)

# Creating the SummarizationDatasetGenerator
summarization_dataset_generator = SummarizationDatasetGenerator(dataset_generator_config)

# Generating the dataset
summarization_dataset = summarization_dataset_generator.generate_dataset()

# Generating the dataset asynchronously
# summarization_dataset = asyncio.run(summarization_dataset_generator.agenerate_dataset())

# Name of the Hugging Face repository where the dataset will be saved
hf_repo_name = "organization_or_user_name/dataset_name" # optional

# Saving the dataset locally and, optionally, to the Hugging Face repository
summarization_dataset.save_dataset(
    hf_repo_name=hf_repo_name,
)

Example of generated entry for the summarization dataset:

{
  "keyword": "keyword",
  "topic": "topic",
  "language": "language",
  "generated_entry": {
    "text": "generated text",
    "summary": "generated summary"
  }
}
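
A simple sanity check for summarization entries is to compare the length of the summary with the length of the text. The snippet below is a minimal sketch that assumes the entry structure shown above; `summary_ratio`, `keep_concise` and the `max_ratio` threshold are hypothetical and not part of SynthGenAI.

```python
from typing import Iterable, List


def summary_ratio(entry: dict) -> float:
    """Return summary length divided by text length, counted in whitespace-separated words."""
    generated = entry["generated_entry"]
    text_words = len(generated["text"].split())
    summary_words = len(generated["summary"].split())
    return summary_words / text_words if text_words else float("inf")


def keep_concise(entries: Iterable[dict], max_ratio: float = 0.5) -> List[dict]:
    """Keep entries whose summary is at most `max_ratio` times as long as the text."""
    return [entry for entry in entries if summary_ratio(entry) <= max_ratio]


entries = [
    {"generated_entry": {"text": "one two three four five six seven eight", "summary": "one two"}},
    {"generated_entry": {"text": "a b c d", "summary": "a b c"}},
]
print([summary_ratio(entry) for entry in entries])  # [0.25, 0.75]
print(len(keep_concise(entries)))  # 1
```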

More Examples 📖

More examples with different combinations of LLM API providers and dataset configurations can be found in the examples directory.

[!IMPORTANT] Generation of the dataset keywords and entries can sometimes fail because the LLM does not return a valid JSON object (the package handles these failures). For that reason, it is recommended to use models that are capable of generating JSON objects (structured output). A list of models that can generate JSON objects can be found here.
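
To illustrate the kind of failure described above: models without structured output often wrap the JSON object in extra text. The snippet below is a minimal sketch of one way to recover an object from such a response using only the standard library. It is not how SynthGenAI handles this internally, and `extract_json_object` is a hypothetical helper.

```python
import json
from typing import Optional


def extract_json_object(raw: str) -> Optional[dict]:
    """Return the first JSON object found in `raw`, or None if there is none."""
    decoder = json.JSONDecoder()
    start = raw.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores trailing text.
            parsed, _ = decoder.raw_decode(raw, start)
            if isinstance(parsed, dict):
                return parsed
        except json.JSONDecodeError:
            pass
        # Try the next opening brace if this one did not start a valid object.
        start = raw.find("{", start + 1)
    return None


response = 'Here is the entry: {"text": "generated text"} Hope it helps!'
print(extract_json_object(response))  # {'text': 'generated text'}
print(extract_json_object("Sorry, I cannot do that."))  # None
```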

Generated Datasets 📚

Examples of generated synthetic datasets can be found on the SynthGenAI Datasets Collection on Hugging Face Hub.

Supported API Providers 💪

  • Groq - more info about Groq models that can be used, can be found here
  • Mistral AI - more info about Mistral AI models that can be used, can be found here
  • Gemini - more info about Gemini models that can be used, can be found here
  • Bedrock - more info about Bedrock models that can be used, can be found here
  • Anthropic - more info about Anthropic models that can be used, can be found here
  • OpenAI - more info about OpenAI models that can be used, can be found here
  • Hugging Face - more info about Hugging Face models that can be used, can be found here
  • Ollama - more info about Ollama models that can be used, can be found here
  • vLLM - more info about vLLM models that can be used, can be found here
  • SageMaker - more info about SageMaker models that can be used, can be found here
  • Azure - more info about Azure and Azure AI models that can be used, can be found here & here
  • Vertex AI - more info about Vertex AI models that can be used, can be found here

Next Steps 🚀

  • Add CLI or TUI or UI for generating datasets

Contributing 🤝

If you want to contribute to this project and make it better, your help is very welcome. Create a pull request with your changes and I will review it. If you have any questions, open an issue.

License 📝

This project is licensed under the MIT License - see the LICENSE file for details.

Repo Structure 📂

.
├── .github
│   └── workflows
│       ├── build_n_release.yml
│       └── tests.yml
├── assets
│   └── logo_header.png
├── examples
│   ├── anthropic_instruction_dataset_example.py
│   ├── azure_ai_preference_dataset_example.py
│   ├── azure_summarization_dataset_example.py
│   ├── bedrock_raw_dataset_example.py
│   ├── gemini_langfuse_raw_dataset_example.py
│   ├── groq_preference_dataset_example.py
│   ├── huggingface_instruction_dataset_example.py
│   ├── mistral_preference_dataset_example.py
│   ├── ollama_preference_dataset_example.py
│   ├── openai_raw_dataset_example.py
│   ├── sagemaker_summarization_dataset_example.py
│   ├── vertex_ai_text_classification_dataset_example.py
│   └── vllm_sentiment_analysis_dataset_example.py
├── synthgenai
│   ├── data_model.py
│   ├── dataset_generator.py
│   ├── dataset.py
│   ├── __init__.py
│   ├── llm.py
│   ├── prompts.py
│   └── utils.py
├── tests
│   ├── test_dataset.py
│   └── test_llm.py
├── .gitignore
├── .python-version
├── LICENSE
├── pyproject.toml
├── README.md
└── uv.lock
