
Synthetic Data Pipeline

AI-powered synthetic data generation pipeline with web search, topic extraction, and persistent state management.

Installation

Install the package using pip:

pip install Data_Generation_Agents

Requirements

  • Python >= 3.8
  • API Keys for: Gemini, Tavily, ScraperAPI

Quick Start

Step 1: Create Configuration File

Create a .env file in your project directory:

GEMINI_API_KEY=your_gemini_api_key_here
TAVILY_API_KEY=your_tavily_api_key_here
SCRAPERAPI_API_KEY=your_scraper_api_key_here
OUTPUT_DIR=/path/to/your/output

Note: The OUTPUT_DIR variable is mandatory. The pipeline will not start without it. This directory is where all the generated data and state files will be saved.
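Before launching a run, you can sanity-check your environment yourself. This is a minimal sketch, not part of the package (the pipeline performs its own validation); it assumes the four variables above are already exported or loaded from .env:

```python
import os

# The four variables the pipeline requires (see the Environment Variables table).
REQUIRED_VARS = ["GEMINI_API_KEY", "TAVILY_API_KEY", "SCRAPERAPI_API_KEY", "OUTPUT_DIR"]

def check_env(env):
    """Return the names of required variables that are missing or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]

missing = check_env(dict(os.environ))
if missing:
    print("Set these variables in .env first:", ", ".join(missing))
```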

Step 2: Use in Python Code

Basic usage example:

from Data_Generation_Agents import generate_synthetic_data

generate_synthetic_data("prompt")

Advanced usage with custom parameters:

generate_synthetic_data(
    user_query="prompt",
    refined_queries_count=20,
    search_results_per_query=5,
    rows_per_subtopic=5,
    gemini_model_name="gemini-2.0-flash"
)

Step 3: Use CLI

Command line usage:

synthetic-data "prompt"

Configuration

Environment Variables

Variable             Required  Description
GEMINI_API_KEY       Yes       Google Gemini API key
TAVILY_API_KEY       Yes       Tavily search API key
SCRAPERAPI_API_KEY   Yes       ScraperAPI key
OUTPUT_DIR           Yes       Output directory path

Pipeline Output Structure

When you run the pipeline, it will create a new directory for each run inside your specified OUTPUT_DIR. The directory will be named with a unique workflow ID. Inside this directory, you will find the following files, which are updated in real-time:

  • pipeline_state.json: The main state file with metadata about the run.
  • refined_queries.json: The search queries generated by the QueryRefinerAgent.
  • search_results.json: The results from the web search.
  • scraped_content.json: The content scraped from the web pages.
  • all_chunks.json: The scraped content, broken down into smaller chunks.
  • all_extracted_topics.json: The topics extracted from the content chunks.
  • synthetic_data.json: The final generated synthetic data, with each data point saved as it is generated.

This structure provides a complete and real-time record of the data generation process.
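Because the files are written incrementally, you can inspect results while a run is still in progress. The following is a minimal sketch for loading the final rows from the most recent run; it assumes run directories sit directly under OUTPUT_DIR and that synthetic_data.json holds a JSON list, which may differ in practice:

```python
import json
from pathlib import Path

def latest_run_dir(output_dir):
    """Pick the most recently modified run directory inside OUTPUT_DIR."""
    runs = [p for p in Path(output_dir).iterdir() if p.is_dir()]
    return max(runs, key=lambda p: p.stat().st_mtime)

def load_synthetic_data(output_dir):
    """Load the rows written so far to the latest run's synthetic_data.json."""
    path = latest_run_dir(output_dir) / "synthetic_data.json"
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

The same pattern works for any of the other state files (e.g. all_extracted_topics.json) by swapping the file name.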

API Reference

generate_synthetic_data(user_query: str, refined_queries_count: Optional[int] = None, search_results_per_query: Optional[int] = None, rows_per_subtopic: Optional[int] = None, gemini_model_name: Optional[str] = None)

Generate synthetic data based on a natural language prompt. The user_query is parsed to automatically determine the number of samples, data type, language, and a detailed description of the data to be generated.
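The parser itself is internal to the package, but as a toy illustration of the idea (not the pipeline's actual implementation), pulling the sample count out of a prompt could look like:

```python
import re
from typing import Optional

def extract_sample_count(prompt: str) -> Optional[int]:
    """Toy example: take the first integer in the prompt as the sample count.
    The real parser also determines data type, language, and description."""
    match = re.search(r"\b(\d+)\b", prompt)
    return int(match.group(1)) if match else None
```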

Categories Feature: When you specify categories within your domain (e.g., "cardiovascular and neurology" for medical domain), the pipeline will:

  • Focus search queries specifically on those categories
  • Generate more targeted and relevant content
  • Distribute queries across all specified categories
  • Use category-specific terminology and concepts

If no categories are specified, the pipeline will comprehensively cover the entire domain.

Parameters:

  • user_query (str): Required. A natural language description of the data you want to generate. This query should implicitly or explicitly contain:
    • Number of samples: The total count of data entries to generate (e.g., "100"). (required)
    • Data type: The structure or format of the data (e.g., "QA pairs", "product reviews", "customer support conversations"). (required)
    • Language: The desired language for the generated data (e.g., "English", "French", "Egyptian_Arabic"). (required)
    • Description: A detailed explanation of the data's content and context. (required)
    • Domain: The desired domain for the generated data (e.g., "Finance", "Medical", "Law"). (optional)
    • Categories: Specific subcategories within the domain to focus on (e.g., "cardiovascular, neurology" for medical domain). (optional)
  • refined_queries_count (int, optional): Number of refined search queries to generate. Defaults to a value from .env or internal settings.
  • search_results_per_query (int, optional): Number of web search results to consider per refined query. Defaults to a value from .env or internal settings.
  • rows_per_subtopic (int, optional): Number of synthetic data rows to generate per extracted subtopic. Defaults to a value from .env or internal settings.
  • gemini_model_name (str, optional): The name of the Gemini model to use (e.g., "gemini-pro", "gemini-1.5-flash"). Defaults to "gemini-2.5-flash" or a value from .env.

Examples:

from Data_Generation_Agents import generate_synthetic_data

generate_synthetic_data(
    user_query="Generate 5000 diverse, contextually rich English-to-Egyptian Arabic translation pairs in the Law domain with varying sentence complexities, ensuring authentic colloquial Egyptian Arabic translations while preserving English technical terms, proper nouns, and specialized terminology untranslated. The data contains two columns (English, Egyptian Arabic)",
    refined_queries_count=25,
    search_results_per_query=5,
    rows_per_subtopic=5
)
from Data_Generation_Agents import generate_synthetic_data

generate_synthetic_data(
    user_query="Generate 2000 finance classification examples in Arabic covering banking, insurance, and investment topics; the data contains two columns (Text, classification_type)",
    refined_queries_count=30,
    search_results_per_query=5,
    rows_per_subtopic=5,
    gemini_model_name="gemini-1.5-pro"
)

Development

Local Installation

git clone https://github.com/Omar-YYoussef/Data_Gen_Agent
cd Data_Gen_Agent
pip install -e .

License

MIT License - see LICENSE file for details.
