
QualSynth: Quality-Driven Synthetic Data Generation via LLM-Guided Oversampling

Python 3.9+ · License: MIT

QualSynth is a Python package that leverages Large Language Models (LLMs) with iterative refinement to generate quality-validated synthetic samples for imbalanced classification tasks.

[Figure: QualSynth architecture]

Key Features

  • LLM-Guided Generation: Uses LLMs to generate contextually aware synthetic samples that respect domain constraints
  • Multi-Stage Validation: Every sample passes schema validation, statistical checks, and duplicate detection
  • Anchor-Centric Approach: Generates variations of real minority samples, preserving natural feature correlations
  • Zero Duplicates: Achieves 0% duplicate ratio across all datasets (vs 29.7% for TabFairGDT)
  • Fairness-Aware: Reduces demographic parity difference without explicit fairness constraints
  • Multiple LLM Backends: Supports OpenAI, Ollama (local), OpenRouter, and custom endpoints
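The multi-stage validation idea can be illustrated with a minimal sketch. This is not QualSynth's actual API — `validate_candidates` and its stages are hypothetical — but it shows the three checks every sample must pass: schema match, a z-score band around the real minority distribution, and exact-duplicate rejection.

```python
import numpy as np
import pandas as pd

def validate_candidates(candidates, reference, z_threshold=4.5):
    """Illustrative three-stage filter: schema, statistics, duplicates."""
    # Stage 1: schema -- candidates must carry exactly the reference columns
    # (raises KeyError if any column is missing)
    candidates = candidates[reference.columns]

    # Stage 2: statistics -- reject rows with any numeric value more than
    # z_threshold standard deviations from the reference mean
    num = reference.select_dtypes(include=np.number).columns
    z = ((candidates[num] - reference[num].mean()) / reference[num].std()).abs()
    in_range = (z <= z_threshold).all(axis=1).to_numpy()

    # Stage 3: duplicates -- drop rows that exactly match a reference row
    merged = candidates.merge(reference.drop_duplicates(), how="left",
                              indicator=True)
    novel = (merged["_merge"] == "left_only").to_numpy()

    return candidates[in_range & novel]

ref = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
cand = pd.DataFrame({"a": [2.5, 2.0, 100.0], "b": [25.0, 20.0, 10.0]})
print(validate_candidates(cand, ref))  # keeps only the (2.5, 25.0) row
```

The second candidate is discarded as an exact duplicate of a real row, and the third as a statistical outlier; only genuinely novel, in-distribution rows survive.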

Performance Highlights

Evaluated on 8 benchmark datasets across 320 experiments (8 datasets × 10 seeds × 4 methods). Ranks are mean ranks across datasets; lower is better.

| Metric          | QualSynth | SMOTE | CTGAN | TabFairGDT |
|-----------------|-----------|-------|-------|------------|
| F1 Rank         | 2.12      | 2.25  | 2.50  | 3.12       |
| ROC-AUC Rank    | 1.63      | 2.50  | 3.63  | 2.25       |
| Duplicate Ratio | 0%        | 0%    | 0%    | 29.7%      |
| DPD (Fairness)  | 0.062     | 0.089 | 0.139 | 0.095      |
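Mean ranks like these are computed by ranking the methods within each dataset and averaging down the column. A sketch with made-up scores (illustrative numbers, not the results above):

```python
import numpy as np

# Hypothetical per-dataset F1 scores: rows = datasets, columns = methods
scores = np.array([
    [0.81, 0.79, 0.75, 0.72],
    [0.66, 0.68, 0.61, 0.63],
    [0.74, 0.71, 0.73, 0.69],
])

# Rank methods within each dataset (1 = best); assumes no ties
ranks = np.argsort(np.argsort(-scores, axis=1), axis=1) + 1
mean_ranks = ranks.mean(axis=0)
print(mean_ranks)  # the first column has the lowest (best) mean rank
```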

[Figure: Critical difference diagram]

Quick Start

Installation

# Install from PyPI
pip install qualsynth

# Or install from source
git clone https://github.com/yourusername/qualsynth.git
cd qualsynth
pip install -r requirements.txt

Basic Usage

import pandas as pd
from qualsynth import QualSynthGenerator

# Initialize generator
generator = QualSynthGenerator(
    model_name="gpt-4",
    api_key="your-openai-api-key",  # Or set the OPENAI_API_KEY env var
    temperature=0.7,
    max_iterations=20
)

# Generate synthetic samples for the minority class
X_synthetic, y_synthetic = generator.fit_generate(X_train, y_train)

# Combine with the original data for training
X_augmented = pd.concat([X_train, X_synthetic], ignore_index=True)
y_augmented = pd.concat([y_train, y_synthetic], ignore_index=True)

Using Local LLMs (Ollama)

from qualsynth import QualSynthGenerator

# First, start Ollama server: ollama serve
# Pull a model: ollama pull gemma3:12b

generator = QualSynthGenerator(
    model_name="gemma3:12b",  # Model name from 'ollama list'
    api_base="http://localhost:11434/v1"
)

X_synthetic, y_synthetic = generator.fit_generate(X_train, y_train)

Using OpenRouter (Cloud)

from qualsynth import QualSynthGenerator

generator = QualSynthGenerator(
    model_name="google/gemma-2-9b-it",
    api_key="your-openrouter-api-key",
    api_base="https://openrouter.ai/api/v1"
)

X_synthetic, y_synthetic = generator.fit_generate(X_train, y_train)

Project Structure

qualsynth/
├── src/qualsynth/           # Main package source code
│   ├── core/                # Core workflow logic
│   │   └── iterative_workflow.py
│   ├── generators/          # Sample generation
│   │   └── counterfactual_generator.py
│   ├── validation/          # Multi-stage validation
│   │   ├── adaptive_validator.py
│   │   └── universal_validator.py
│   ├── modules/             # LLM-powered modules
│   │   ├── dataset_profiler.py
│   │   ├── schema_profiler.py
│   │   ├── diversity_planner.py
│   │   ├── fairness_auditor.py
│   │   └── validator.py
│   ├── baselines/           # Baseline implementations
│   │   ├── smote.py
│   │   ├── ctgan_baseline.py
│   │   └── tabfairgdt.py
│   ├── evaluation/          # Metrics and classifiers
│   └── prompts/             # LLM prompt templates
├── configs/                 # Dataset and method configurations
├── scripts/                 # Experiment scripts
├── replication/             # Replication package
│   ├── qualsyn-1.0.0/       # Standalone package
│   └── tables/              # Pre-computed results
└── data/splits/             # Pre-computed dataset splits

Configuration Options

| Parameter              | Default       | Description                                   |
|------------------------|---------------|-----------------------------------------------|
| `model_name`           | `"gemma3:12b"` | LLM model to use                             |
| `api_key`              | `None`        | API key for cloud providers                   |
| `api_base`             | `None`        | Custom API endpoint URL                       |
| `temperature`          | `0.7`         | Generation diversity (lower = more consistent) |
| `batch_size`           | `20`          | Samples per LLM call                          |
| `max_iterations`       | `20`          | Maximum refinement iterations                 |
| `target_ratio`         | `1.0`         | Target class ratio (1.0 = balanced)           |
| `validation_threshold` | `4.5`         | Statistical validation threshold (σ)          |
| `sensitive_attributes` | `None`        | Columns for fairness-aware generation         |
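Putting several of these options together, a configured generator might look like the sketch below (a configuration example only; the sensitive-attribute column name is hypothetical):

```python
from qualsynth import QualSynthGenerator

generator = QualSynthGenerator(
    model_name="gemma3:12b",       # default local model
    temperature=0.5,               # lower = more consistent samples
    batch_size=20,                 # samples per LLM call
    max_iterations=20,             # cap on refinement iterations
    target_ratio=1.0,              # generate until classes are balanced
    validation_threshold=4.5,      # reject samples beyond 4.5 sigma
    sensitive_attributes=["sex"],  # hypothetical column for fairness auditing
)
```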

Datasets

The package has been evaluated on 8 benchmark datasets:

| Dataset       | Domain       | Samples | Features | Imbalance Ratio |
|---------------|--------------|---------|----------|-----------------|
| German Credit | Finance      | 1,000   | 20       | 2.33:1          |
| Breast Cancer | Medical      | 569     | 30       | 1.68:1          |
| Pima Diabetes | Medical      | 768     | 8        | 1.87:1          |
| Haberman      | Medical      | 306     | 3        | 2.78:1          |
| Wine Quality  | Food Science | 4,898   | 11       | 3.39:1          |
| Yeast         | Biology      | 1,484   | 8        | 28.10:1         |
| Thyroid       | Medical      | 3,772   | 25       | 15.09:1         |
| HTRU2         | Astronomy    | 17,898  | 8        | 9.16:1          |
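The imbalance ratio above is the majority-to-minority class count, and together with target_ratio it determines how many synthetic samples are requested. A sketch of the arithmetic (illustrative helper names, not the package's internals):

```python
from collections import Counter

def imbalance_ratio(y):
    """Majority class count divided by minority class count."""
    counts = [n for _, n in Counter(y).most_common()]
    return counts[0] / counts[-1]

def samples_needed(y, target_ratio=1.0):
    """Synthetic minority samples needed to reach the target ratio."""
    counts = [n for _, n in Counter(y).most_common()]
    return max(0, round(target_ratio * counts[0]) - counts[-1])

# German Credit: 700 majority vs 300 minority labels
y = [0] * 700 + [1] * 300
print(round(imbalance_ratio(y), 2))  # 2.33
print(samples_needed(y))             # 400
```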

Reproducing Experiments

Running All Experiments

# Using OpenRouter (recommended)
python scripts/run_openrouter_experiments.py --all --seeds 42 123 456 789 1234

# Using local Ollama
./scripts/run_with_ollama_m4.sh

Running Single Experiment

python scripts/run_experiments.py \
    --dataset german_credit \
    --method qualsynth \
    --seed 42

Running Baselines

# SMOTE
python scripts/run_experiments.py --dataset german_credit --method smote --seed 42

# CTGAN
python scripts/run_experiments.py --dataset german_credit --method ctgan --seed 42

# TabFairGDT
python scripts/run_experiments.py --dataset german_credit --method tabfairgdt --seed 42

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Contact

Asım Sinan Yüksel
Department of Computer Engineering
Süleyman Demirel University
Email: asimyuksel@sdu.edu.tr
