Quality Data Extractor (QDE): CES & OES filtering for synthetic data

Quality Data Extractor (QDE)

QDE (Quality Data Extractor) is a Python framework for post-generation filtration of synthetic data.

It introduces two filtering strategies:

  • CES (Comprehensive Extraction Strategy)
  • OES (Optimal Extraction Strategy)

These strategies help researchers and practitioners filter synthetic datasets to retain samples that improve downstream model accuracy.

📄 Published in IEEE Access (2025):
Sachdeva, P., Malhotra, A., & Gupta, K., "Quality Data Extractor (QDE): Elevating Synthetic Data Augmentation through Post-Generation Filtration"


🚀 Installation

From PyPI:

pip install qde

🔧 Quick Start

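import numpy as np
from qde import QDE
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris

# Example: use Iris dataset
X, y = load_iris(return_X_y=True)

# Iris is sorted by class, so shuffle before slicing. Without this, the
# training split has no class-2 samples and the test split is all class 2.
rng = np.random.default_rng(0)
order = rng.permutation(len(X))
X, y = X[order], y[order]

train_X, train_y = X[:80], y[:80]
synth_X, synth_y = X[80:110], y[80:110]   # pretend this is synthetic
test_X,  test_y  = X[110:], y[110:]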

# Initialize QDE with CES
q = QDE(default_strategy="ces")
q.fit(train_X, train_y, synth_X, synth_y, test_X, test_y, encode_labels=True)

# Extract filtered synthetic samples
result, X_sel, y_sel = q.extract(estimator=GaussianNB())
print("Selected indices:", result.indices)
print("Filtered accuracy:", result.meta["filtered-accuracy"])

🖥️ Command-Line Interface (CLI)

QDE also ships a CLI:

qde strategies
# -> ces
# -> oes

qde run --train train.csv --synth synth.csv --test test.csv --target target --strategy ces

📖 Documentation

  • CES
    Adds synthetic samples one by one, retaining only those that do not reduce baseline accuracy.

  • OES
    Selects samples using distance-based neighborhood filtering (configurable with --k-neighbors and --distance-mode).
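To make the CES description concrete, here is a minimal sketch of a greedy accept/reject loop written directly with scikit-learn. It is not QDE's implementation: the function name `greedy_filter` is hypothetical, and it assumes each candidate is compared against the accuracy of the currently accepted set (starting from the baseline trained on real data only), which the description above does not spell out.

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB


def greedy_filter(estimator, train_X, train_y, synth_X, synth_y, eval_X, eval_y):
    """Hypothetical helper: keep a synthetic sample only if adding it
    does not lower accuracy on the evaluation split."""

    def score(X, y):
        # Fit a fresh copy so earlier fits never leak into later ones.
        model = clone(estimator).fit(X, y)
        return accuracy_score(eval_y, model.predict(eval_X))

    cur_X, cur_y = train_X, train_y
    baseline = score(cur_X, cur_y)   # accuracy with real data only
    best = baseline
    kept = []

    for i in range(len(synth_X)):
        cand_X = np.vstack([cur_X, synth_X[i:i + 1]])
        cand_y = np.concatenate([cur_y, synth_y[i:i + 1]])
        acc = score(cand_X, cand_y)
        if acc >= best:              # assumption: ties are accepted
            cur_X, cur_y, best = cand_X, cand_y, acc
            kept.append(i)

    return kept, baseline, best


# Toy data: two Gaussian blobs, shuffled, then split three ways.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(3, 1, (60, 2))])
y = np.array([0] * 60 + [1] * 60)
order = rng.permutation(len(X))
X, y = X[order], y[order]

kept, baseline, best = greedy_filter(
    GaussianNB(), X[:40], y[:40], X[40:80], y[40:80], X[80:], y[80:]
)
print("kept:", len(kept), "baseline:", baseline, "final:", best)
```

The loop refits the estimator once per candidate, so its cost grows linearly with the number of synthetic samples. That is fine for a small classifier such as GaussianNB but becomes expensive for heavier models.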

✅ Each run outputs

  • SelectionResult.indices → indices of accepted synthetic samples
  • SelectionResult.meta → metadata (strategy, accuracy metrics such as "filtered-accuracy", etc.)
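If only the accepted indices are kept from a run, the augmented training set can be rebuilt with plain NumPy. The sketch below assumes the indices are integer positions into the synthetic array, which the Quick Start implies but does not state. The helper name `augment_with_selected` is hypothetical and not part of QDE.

```python
import numpy as np


def augment_with_selected(train_X, train_y, synth_X, synth_y, indices):
    """Hypothetical helper: append the accepted synthetic rows to the real data."""
    idx = np.asarray(indices, dtype=int)
    X_aug = np.vstack([train_X, synth_X[idx]])
    y_aug = np.concatenate([train_y, synth_y[idx]])
    return X_aug, y_aug


# Small made-up arrays, for illustration only.
train_X = np.zeros((4, 2))
train_y = np.array([0, 0, 1, 1])
synth_X = np.arange(10, dtype=float).reshape(5, 2)
synth_y = np.array([0, 1, 0, 1, 0])

X_aug, y_aug = augment_with_selected(train_X, train_y, synth_X, synth_y, [0, 3])
print(X_aug.shape, y_aug.shape)
```

An empty index list leaves the real data unchanged, so the helper can be called unconditionally after a run.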

🛠️ Development

Clone the repo and install in editable mode:

git clone https://github.com/pragatischdv/quality-data-extractor
cd quality-data-extractor
pip install -e .

📄 Citation

If you use QDE in your research, please cite:

@ARTICLE{11142788,
  author={Sachdeva, Pragati and Malhotra, Amarjit and Gupta, Karan},
  journal={IEEE Access}, 
  title={Quality Data Extractor (QDE): Elevating Synthetic Data Augmentation through Post-Generation Filtration}, 
  year={2025},
  doi={10.1109/ACCESS.2025.3603435}}
