SysGen - A tool for creating high-quality synthetic datasets with the Gemini API

SysGen

SysGen is a powerful CLI tool that creates high-quality synthetic datasets from documents using the Gemini API. It intelligently chunks documents, generates comprehensive questions, and produces detailed answers for machine learning training datasets.

Features

Smart Document Chunking: Automatically splits large documents into manageable chunks with overlap
Comprehensive Question Generation: Extracts ALL possible questions from content using advanced AI prompting
High-Quality Answer Generation: Creates detailed 4-5 sentence answers with supporting evidence
Multiple Output Formats: Supports Alpaca, ChatML, and Conversation formats
Semantic Duplicate Detection: Automatically removes duplicate questions using sentence embeddings
Token-Aware Processing: Uses tiktoken for accurate token counting and chunking
Batch Processing: Process multiple markdown/text files in a single run
Quality Validation: Ensures answer length and content quality standards

Installation

Install with pip

pip install sysgen

Set Up Environment Variables

Before running sysgen, set the API key in your terminal:

# Windows (Command Prompt)
set GEMINI_API_KEY=your_gemini_api_key_here

# Linux/macOS
export GEMINI_API_KEY=your_gemini_api_key_here
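
A script that wraps sysgen may want to fail early with a readable message when the key is missing. The sketch below shows one way to do that check; `require_api_key` is a hypothetical helper written for illustration and is not part of sysgen.

```python
import os


def require_api_key(env=None, name="GEMINI_API_KEY"):
    """Return the API key from the environment, or raise a clear error.

    Hypothetical helper: sysgen's own handling of a missing key may differ.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; set it before running sysgen")
    return key
```

Passing a plain dict as `env` makes the check easy to exercise without touching the real environment.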

Usage

Basic Usage

sysgen --input-folder md --output dataset.json --format alpaca

Advanced Usage

sysgen --input-folder documents --output training_data.json --format chatml --similarity-threshold 0.85

Arguments

--input-folder: Folder containing markdown/text files (default: md)
--output: Output JSON file (default: output.json)
--format: Output format - alpaca, chatml, or conversation (default: alpaca)
--similarity-threshold: Similarity threshold for duplicate detection, 0.0-1.0 (default: 0.85)
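
For readers who want to reproduce the same interface in their own wrapper, the arguments above can be mirrored with argparse. This is a sketch built only from the documented flags and defaults; how sysgen actually parses its command line is not shown here, and `build_parser` and `unit_interval` are illustrative names.

```python
import argparse


def unit_interval(text):
    """Parse a float and require it to lie in the documented 0.0-1.0 range."""
    value = float(text)
    if not 0.0 <= value <= 1.0:
        raise argparse.ArgumentTypeError("must be between 0.0 and 1.0")
    return value


def build_parser():
    """Build a parser that mirrors the documented sysgen arguments."""
    parser = argparse.ArgumentParser(prog="sysgen")
    parser.add_argument("--input-folder", default="md",
                        help="Folder containing markdown/text files")
    parser.add_argument("--output", default="output.json",
                        help="Output JSON file")
    parser.add_argument("--format", default="alpaca",
                        choices=["alpaca", "chatml", "conversation"],
                        help="Output format")
    parser.add_argument("--similarity-threshold", type=unit_interval,
                        default=0.85,
                        help="Similarity threshold for duplicate detection")
    return parser
```

Calling `build_parser().parse_args(["--format", "chatml"])` yields a namespace with `input_folder`, `output`, `format`, and `similarity_threshold` attributes.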

Output Formats

Alpaca Format

{
  "instruction": "What is the main concept discussed in this section?",
  "input": "",
  "output": "The main concept discussed is the implementation of neural networks...",
  "source_document": "document.md"
}

ChatML Format

{
  "messages": [
    {"role": "user", "content": "What is the main concept discussed in this section?"},
    {"role": "assistant", "content": "The main concept discussed is the implementation of neural networks..."}
  ],
  "source_document": "document.md"
}

Conversation Format

{
  "conversations": [
    {"from": "human", "value": "What is the main concept discussed in this section?"},
    {"from": "gpt", "value": "The main concept discussed is the implementation of neural networks..."}
  ],
  "source_document": "document.md"
}
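
The three record shapes above differ only in how the question and answer are wrapped. A minimal sketch of that mapping follows; `to_record` is a hypothetical function name, and only the field names are taken from the examples above.

```python
def to_record(question, answer, source_document, fmt="alpaca"):
    """Wrap one question/answer pair in one of the documented record shapes."""
    if fmt == "alpaca":
        return {
            "instruction": question,
            "input": "",
            "output": answer,
            "source_document": source_document,
        }
    if fmt == "chatml":
        return {
            "messages": [
                {"role": "user", "content": question},
                {"role": "assistant", "content": answer},
            ],
            "source_document": source_document,
        }
    if fmt == "conversation":
        return {
            "conversations": [
                {"from": "human", "value": question},
                {"from": "gpt", "value": answer},
            ],
            "source_document": source_document,
        }
    raise ValueError(f"unknown format: {fmt!r}")
```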

How It Works

Document Chunking: Splits documents into 3000-token chunks with 200-token overlap
Question Extraction: Uses advanced AI prompting to extract ALL possible questions from each chunk
Answer Generation: Creates comprehensive 4-5 sentence answers with supporting evidence
Quality Filtering: Validates answer length (accepts 3-6 sentences, a slightly wider band than the 4-5 sentence target) and content quality
Duplicate Detection: Uses sentence embeddings to identify semantically similar questions
Format Conversion: Converts to specified output format (Alpaca/ChatML/Conversation)
Batch Processing: Processes multiple files and combines results
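
The length check in the quality filtering step can be sketched as a sentence count against the 3-6 range given above. The sentence splitter here is a naive regex chosen for illustration; it is an assumption, not sysgen's implementation, and it will miscount text containing abbreviations such as "e.g.".

```python
import re

# Split after ., ! or ? when followed by whitespace (naive, illustrative only).
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def count_sentences(text):
    """Count sentences using the naive boundary rule above."""
    return len([part for part in _SENTENCE_BOUNDARY.split(text.strip()) if part])


def passes_length_check(answer, min_sentences=3, max_sentences=6):
    """Return True when the answer has between min and max sentences."""
    return min_sentences <= count_sentences(answer) <= max_sentences
```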

Advanced Features

Smart Chunking

Token-Aware: Uses tiktoken for accurate token counting
Sentence Preservation: Keeps sentences intact during chunking
Overlap Management: Maintains context between chunks with configurable overlap
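
The three properties above can be combined in a short sketch: sentences are never split, each chunk stays within a token budget, and trailing sentences are carried into the next chunk as overlap. Everything here is an assumption made for illustration. In particular, the default counter splits on whitespace as a stand-in so the example runs without extra packages; a tiktoken-based counter can be passed in instead, and the encoding sysgen uses is not stated in this document.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def whitespace_token_count(text):
    """Stand-in token counter. With tiktoken installed, one could pass
    lambda s: len(tiktoken.get_encoding("cl100k_base").encode(s)) instead."""
    return len(text.split())


def chunk_sentences(text, max_tokens=3000, overlap_tokens=200,
                    count_tokens=whitespace_token_count):
    """Group whole sentences into chunks with sentence-level overlap."""
    chunks, current, current_tokens = [], [], 0
    for sentence in split_sentences(text):
        n = count_tokens(sentence)
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward, up to the overlap budget.
            carried, carried_tokens = [], 0
            for prev in reversed(current):
                t = count_tokens(prev)
                if carried_tokens + t > overlap_tokens:
                    break
                carried.insert(0, prev)
                carried_tokens += t
            # Drop the overlap if it would push the new chunk over budget.
            if carried_tokens + n > max_tokens:
                carried, carried_tokens = [], 0
            current, current_tokens = carried, carried_tokens
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

With this design a chunk exceeds `max_tokens` only when a single sentence is itself longer than the budget, since sentences are kept intact.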

Comprehensive Question Generation

Multi-Level Questions: Generates factual, conceptual, analytical, and application questions
Exhaustive Extraction: Extracts ALL possible questions from content
Quality Standards: Ensures questions are clear, specific, and answerable
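
To make the idea of multi-level question generation concrete, here is a sketch of a prompt builder and a parser for a line-per-question reply. The prompt wording, the function names, and the reply format are all assumptions for illustration; sysgen's actual prompts are not reproduced in this document, and no API call is made here.

```python
import re

# Question types named in the feature list above.
QUESTION_LEVELS = ("factual", "conceptual", "analytical", "application")


def build_question_prompt(chunk, levels=QUESTION_LEVELS):
    """Build a hypothetical prompt asking for questions of each type."""
    return (
        "Read the passage below and list every question it can answer.\n"
        f"Cover these question types: {', '.join(levels)}.\n"
        "Write one clear, specific question per line.\n\n"
        f"Passage:\n{chunk}"
    )


def parse_questions(response_text):
    """Extract unique questions from a reply, stripping list markers."""
    questions = []
    for line in response_text.splitlines():
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*])\s*", "", line).strip()
        # Keep only lines that look like questions, in first-seen order.
        if cleaned.endswith("?") and cleaned not in questions:
            questions.append(cleaned)
    return questions
```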

Semantic Duplicate Detection

Embedding-Based: Uses sentence-transformers for semantic similarity
Configurable Threshold: Adjust sensitivity with the --similarity-threshold option
Quality Preservation: Keeps highest quality version from duplicate groups
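
The duplicate detection described above can be sketched as a greedy pass over question embeddings: visit questions from highest to lowest quality and keep one only if it is not too similar to anything already kept. This is an illustrative sketch, not sysgen's code. It takes precomputed embedding vectors as plain lists so it runs without extra packages; in practice they might come from a sentence-transformers model's `encode` method. Treating a similarity equal to the threshold as a duplicate, and the `quality` scores themselves, are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def deduplicate(items, embeddings, threshold=0.85, quality=None):
    """Drop items whose embedding is within `threshold` of a kept item.

    Items are visited in descending `quality` order, so the best member of
    each group of near-duplicates survives. Without scores, the earliest
    item wins because the sort is stable.
    """
    if quality is None:
        quality = [0.0] * len(items)
    order = sorted(range(len(items)), key=lambda i: -quality[i])
    kept = []
    for i in order:
        if all(cosine_similarity(embeddings[i], embeddings[j]) < threshold
               for j in kept):
            kept.append(i)
    # Return survivors in their original order.
    return [items[i] for i in sorted(kept)]
```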

Dependencies

google-genai: Gemini API client for question and answer generation
sentence-transformers: Semantic similarity detection for duplicate removal
scikit-learn: Cosine similarity calculations
tiktoken: Token counting for document chunking
torch: PyTorch backend for sentence transformers
transformers: Hugging Face transformers library
numpy: Numerical operations
scipy: Scientific computing utilities

Contributing

We welcome contributions! Please feel free to submit issues, feature requests, or pull requests.

How to Contribute

Fork the Repository: Start by forking the project on GitHub
Clone the Repository: Clone it to your local machine
Create a Branch: Create a new branch for your changes
Make Changes: Implement your improvements or bug fixes
Test Your Changes: Ensure the tool works correctly with your modifications
Submit a Pull Request: Open a PR describing your changes

License

This project is licensed under the MIT License. See LICENSE for details.

Contact

Author: Adhishtanaka
Email: kulasoooriyaa@gmail.com

Release history

0.2.2 (this version): Jul 7, 2025
0.2.1: Jul 7, 2025
0.2.0: Jul 6, 2025
0.1.1: Mar 22, 2025
0.1.0: Mar 22, 2025

Download files

Source Distribution

sysgen-0.2.2.tar.gz (11.3 kB)

Uploaded Jul 7, 2025 Source

Built Distribution

sysgen-0.2.2-py3-none-any.whl (9.7 kB)

Uploaded Jul 7, 2025 Python 3

File details

Details for the file sysgen-0.2.2.tar.gz.

File metadata

Download URL: sysgen-0.2.2.tar.gz
Upload date: Jul 7, 2025
Size: 11.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.4

File hashes

Hashes for sysgen-0.2.2.tar.gz
Algorithm	Hash digest
SHA256	`be67e5ad6292084afbe9c0331608559c55613a980e94dedf577418b972169f4c`
MD5	`e9e823b034abdc68db05f5d459cee71a`
BLAKE2b-256	`196b01bfbf12d875cab27f5ccb2027a9debf7f2f2f595f776c5026aac8943617`

File details

Details for the file sysgen-0.2.2-py3-none-any.whl.

File metadata

Download URL: sysgen-0.2.2-py3-none-any.whl
Upload date: Jul 7, 2025
Size: 9.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.4

File hashes

Hashes for sysgen-0.2.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`317a8d92eef218af56c73fbee2163655042ad69331d8cb369424ec1471de86f2`
MD5	`a5b3abbff48e4634c963b735e39b0984`
BLAKE2b-256	`7fcc85e59c2d5d4e39ca7258c5b85cbd4f7e927a006b2e6b8195d3d35d202a01`
