
QualSynth: Quality-Driven Synthetic Data Generation via LLM-Guided Oversampling

Python 3.9+ · License: MIT

QualSynth is a Python package that leverages Large Language Models (LLMs) with iterative refinement to generate quality-validated synthetic samples for imbalanced classification tasks.

[Figure: QualSynth architecture]

Key Features

  • LLM-Guided Generation: Uses LLMs to generate contextually aware synthetic samples that respect domain constraints
  • Multi-Stage Validation: Every sample passes schema validation, statistical checks, and duplicate detection
  • Anchor-Centric Approach: Generates variations of real minority samples, preserving natural feature correlations
  • Zero Duplicates: Achieves 0% duplicate ratio across all datasets (vs 29.7% for TabFairGDT)
  • Fairness-Aware: Reduces demographic parity difference without explicit fairness constraints
  • Multiple LLM Backends: Supports OpenAI, Ollama (local), OpenRouter, and custom endpoints
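The multi-stage validation idea can be illustrated with a minimal sketch. This is not QualSynth's actual API — `validate_candidates` and its stages are hypothetical — but it shows the three checks every sample must pass: schema match, a z-score band around the real minority distribution, and exact-duplicate rejection.

```python
import numpy as np
import pandas as pd

def validate_candidates(candidates, reference, z_threshold=4.5):
    """Illustrative three-stage filter: schema, statistics, duplicates."""
    # Stage 1: schema -- candidates must carry exactly the reference columns
    # (raises KeyError if any column is missing)
    candidates = candidates[reference.columns]

    # Stage 2: statistics -- reject rows with any numeric value more than
    # z_threshold standard deviations from the reference mean
    num = reference.select_dtypes(include=np.number).columns
    z = ((candidates[num] - reference[num].mean()) / reference[num].std()).abs()
    in_range = (z <= z_threshold).all(axis=1).to_numpy()

    # Stage 3: duplicates -- drop rows that exactly match a reference row
    merged = candidates.merge(reference.drop_duplicates(), how="left",
                              indicator=True)
    novel = (merged["_merge"] == "left_only").to_numpy()

    return candidates[in_range & novel]

ref = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
cand = pd.DataFrame({"a": [2.5, 2.0, 100.0], "b": [25.0, 20.0, 10.0]})
print(validate_candidates(cand, ref))  # keeps only the (2.5, 25.0) row
```

The second candidate is discarded as an exact duplicate of a real row, and the third as a statistical outlier; only genuinely novel, in-distribution rows survive.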

Performance Highlights

Evaluated on 8 benchmark datasets across 320 experiments (8 datasets × 10 seeds × 4 methods). Ranks are mean ranks across datasets; lower is better.

| Metric          | QualSynth | SMOTE | CTGAN | TabFairGDT |
|-----------------|-----------|-------|-------|------------|
| F1 Rank         | 2.12      | 2.25  | 2.50  | 3.12       |
| ROC-AUC Rank    | 1.63      | 2.50  | 3.63  | 2.25       |
| Duplicate Ratio | 0%        | 0%    | 0%    | 29.7%      |
| DPD (Fairness)  | 0.062     | 0.089 | 0.139 | 0.095      |
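Mean ranks like these are computed by ranking the methods within each dataset and averaging down the column. A sketch with made-up scores (illustrative numbers, not the results above):

```python
import numpy as np

# Hypothetical per-dataset F1 scores: rows = datasets, columns = methods
scores = np.array([
    [0.81, 0.79, 0.75, 0.72],
    [0.66, 0.68, 0.61, 0.63],
    [0.74, 0.71, 0.73, 0.69],
])

# Rank methods within each dataset (1 = best); assumes no ties
ranks = np.argsort(np.argsort(-scores, axis=1), axis=1) + 1
mean_ranks = ranks.mean(axis=0)
print(mean_ranks)  # the first column has the lowest (best) mean rank
```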

[Figure: Critical difference diagram]

Quick Start

Installation

# Install from PyPI
pip install qualsynth

# Or install from source
git clone https://github.com/yourusername/qualsynth.git
cd qualsynth
pip install -r requirements.txt

Basic Usage

import pandas as pd
from qualsynth import QualSynthGenerator

# Initialize generator
generator = QualSynthGenerator(
    model_name="gpt-4",
    api_key="your-openai-api-key",  # Or set the OPENAI_API_KEY env var
    temperature=0.7,
    max_iterations=20
)

# Generate synthetic samples for the minority class
X_synthetic, y_synthetic = generator.fit_generate(X_train, y_train)

# Combine with the original data for training
X_augmented = pd.concat([X_train, X_synthetic], ignore_index=True)
y_augmented = pd.concat([y_train, y_synthetic], ignore_index=True)

Using Local LLMs (Ollama)

from qualsynth import QualSynthGenerator

# First, start Ollama server: ollama serve
# Pull a model: ollama pull gemma3:12b

generator = QualSynthGenerator(
    model_name="gemma3:12b",  # Model name from 'ollama list'
    api_base="http://localhost:11434/v1"
)

X_synthetic, y_synthetic = generator.fit_generate(X_train, y_train)

Using OpenRouter (Cloud)

from qualsynth import QualSynthGenerator

generator = QualSynthGenerator(
    model_name="google/gemma-2-9b-it",
    api_key="your-openrouter-api-key",
    api_base="https://openrouter.ai/api/v1"
)

X_synthetic, y_synthetic = generator.fit_generate(X_train, y_train)

Project Structure

qualsynth/
├── src/qualsynth/           # Main package source code
│   ├── core/                # Core workflow logic
│   │   └── iterative_workflow.py
│   ├── generators/          # Sample generation
│   │   └── counterfactual_generator.py
│   ├── validation/          # Multi-stage validation
│   │   ├── adaptive_validator.py
│   │   └── universal_validator.py
│   ├── modules/             # LLM-powered modules
│   │   ├── dataset_profiler.py
│   │   ├── schema_profiler.py
│   │   ├── diversity_planner.py
│   │   ├── fairness_auditor.py
│   │   └── validator.py
│   ├── baselines/           # Baseline implementations
│   │   ├── smote.py
│   │   ├── ctgan_baseline.py
│   │   └── tabfairgdt.py
│   ├── evaluation/          # Metrics and classifiers
│   └── prompts/             # LLM prompt templates
├── configs/                 # Dataset and method configurations
├── scripts/                 # Experiment scripts
├── replication/             # Replication package
│   ├── qualsyn-1.0.0/       # Standalone package
│   └── tables/              # Pre-computed results
└── data/splits/             # Pre-computed dataset splits

Configuration Options

| Parameter              | Default       | Description                                   |
|------------------------|---------------|-----------------------------------------------|
| `model_name`           | `"gemma3:12b"` | LLM model to use                             |
| `api_key`              | `None`        | API key for cloud providers                   |
| `api_base`             | `None`        | Custom API endpoint URL                       |
| `temperature`          | `0.7`         | Generation diversity (lower = more consistent) |
| `batch_size`           | `20`          | Samples per LLM call                          |
| `max_iterations`       | `20`          | Maximum refinement iterations                 |
| `target_ratio`         | `1.0`         | Target class ratio (1.0 = balanced)           |
| `validation_threshold` | `4.5`         | Statistical validation threshold (σ)          |
| `sensitive_attributes` | `None`        | Columns for fairness-aware generation         |
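Putting several of these options together, a configured generator might look like the sketch below (a configuration example only; the sensitive-attribute column name is hypothetical):

```python
from qualsynth import QualSynthGenerator

generator = QualSynthGenerator(
    model_name="gemma3:12b",       # default local model
    temperature=0.5,               # lower = more consistent samples
    batch_size=20,                 # samples per LLM call
    max_iterations=20,             # cap on refinement iterations
    target_ratio=1.0,              # generate until classes are balanced
    validation_threshold=4.5,      # reject samples beyond 4.5 sigma
    sensitive_attributes=["sex"],  # hypothetical column for fairness auditing
)
```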

Datasets

The package has been evaluated on 8 benchmark datasets:

| Dataset       | Domain       | Samples | Features | Imbalance Ratio |
|---------------|--------------|---------|----------|-----------------|
| German Credit | Finance      | 1,000   | 20       | 2.33:1          |
| Breast Cancer | Medical      | 569     | 30       | 1.68:1          |
| Pima Diabetes | Medical      | 768     | 8        | 1.87:1          |
| Haberman      | Medical      | 306     | 3        | 2.78:1          |
| Wine Quality  | Food Science | 4,898   | 11       | 3.39:1          |
| Yeast         | Biology      | 1,484   | 8        | 28.10:1         |
| Thyroid       | Medical      | 3,772   | 25       | 15.09:1         |
| HTRU2         | Astronomy    | 17,898  | 8        | 9.16:1          |
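The imbalance ratio above is the majority-to-minority class count, and together with target_ratio it determines how many synthetic samples are requested. A sketch of the arithmetic (illustrative helper names, not the package's internals):

```python
from collections import Counter

def imbalance_ratio(y):
    """Majority class count divided by minority class count."""
    counts = [n for _, n in Counter(y).most_common()]
    return counts[0] / counts[-1]

def samples_needed(y, target_ratio=1.0):
    """Synthetic minority samples needed to reach the target ratio."""
    counts = [n for _, n in Counter(y).most_common()]
    return max(0, round(target_ratio * counts[0]) - counts[-1])

# German Credit: 700 majority vs 300 minority labels
y = [0] * 700 + [1] * 300
print(round(imbalance_ratio(y), 2))  # 2.33
print(samples_needed(y))             # 400
```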

Reproducing Experiments

Running All Experiments

# Using OpenRouter (recommended)
python scripts/run_openrouter_experiments.py --all --seeds 42 123 456 789 1234

# Using local Ollama
./scripts/run_with_ollama_m4.sh

Running Single Experiment

python scripts/run_experiments.py \
    --dataset german_credit \
    --method qualsynth \
    --seed 42

Running Baselines

# SMOTE
python scripts/run_experiments.py --dataset german_credit --method smote --seed 42

# CTGAN
python scripts/run_experiments.py --dataset german_credit --method ctgan --seed 42

# TabFairGDT
python scripts/run_experiments.py --dataset german_credit --method tabfairgdt --seed 42

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Contact

Asım Sinan Yüksel
Department of Computer Engineering
Süleyman Demirel University
Email: asimyuksel@sdu.edu.tr
