Quality Data Extractor (QDE): CES & OES filtering for synthetic data

Quality Data Extractor (QDE)

QDE (Quality Data Extractor) is a Python framework for post-generation filtration of synthetic data.

It introduces two filtering strategies:

  • CES (Comprehensive Extraction Strategy)
  • OES (Optimal Extraction Strategy)

These strategies help researchers and practitioners filter synthetic datasets to retain samples that improve downstream model accuracy.

📄 Published in IEEE Access (2025):
Sachdeva, P., Malhotra, A., & Gupta, K. — Quality Data Extractor (QDE): Elevating Synthetic Data Augmentation through Post-Generation Filtration


🚀 Installation

From PyPI:

pip install qde

🔧 Quick Start

import numpy as np
from qde import QDE
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Example: use the Iris dataset.
# Iris rows are ordered by class, so shuffle before slicing. Without the
# shuffle, the training split would contain no class-2 samples and the
# test split would contain only class 2.
X, y = load_iris(return_X_y=True)
order = np.random.default_rng(0).permutation(len(X))
X, y = X[order], y[order]

train_X, train_y = X[:80], y[:80]
synth_X, synth_y = X[80:110], y[80:110]   # stand-in for synthetic data
test_X,  test_y  = X[110:], y[110:]

# Initialize QDE with CES as the default strategy
q = QDE(default_strategy="ces")
q.fit(train_X, train_y, synth_X, synth_y, test_X, test_y, encode_labels=True)

# Extract the filtered synthetic samples
result, X_sel, y_sel = q.extract(estimator=GaussianNB())
print("Selected indices:", result.indices)
print("Filtered accuracy:", result.meta["filtered-accuracy"])

🖥️ Command-Line Interface (CLI)

QDE also ships a CLI:

qde strategies
# -> ces
# -> oes

qde run --train train.csv --synth synth.csv --test test.csv --target target --strategy ces

📖 Documentation

  • CES
    Adds synthetic samples one by one, retaining only those that do not reduce baseline accuracy.

  • OES
    Selects samples using distance-based neighborhood filtering (configurable with --k-neighbors and --distance-mode).
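
The greedy idea behind the CES description above can be sketched in a few lines. This is not QDE's implementation: the function names (fit_centroids, accuracy, greedy_filter) are hypothetical, a tiny nearest-centroid classifier stands in for the estimator, and comparing each candidate against the accuracy of the currently accepted set (which starts at the baseline) is an assumption about how "does not reduce accuracy" is evaluated.

```python
from collections import defaultdict

def fit_centroids(X, y):
    """Return {label: mean feature vector}; a stand-in classifier, not QDE's."""
    sums, counts = {}, defaultdict(int)
    for row, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(row)
        sums[label] = [s + v for s, v in zip(sums[label], row)]
        counts[label] += 1
    return {lab: [s / counts[lab] for s in total] for lab, total in sums.items()}

def accuracy(centroids, X, y):
    """Fraction of rows whose nearest centroid carries the true label."""
    def predict(row):
        return min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(row, centroids[lab])),
        )
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def greedy_filter(train_X, train_y, synth_X, synth_y, test_X, test_y):
    """Try synthetic samples one by one; keep those that do not lower accuracy."""
    kept_X, kept_y = list(train_X), list(train_y)
    best = accuracy(fit_centroids(kept_X, kept_y), test_X, test_y)  # baseline
    accepted = []
    for i, (row, label) in enumerate(zip(synth_X, synth_y)):
        trial = accuracy(
            fit_centroids(kept_X + [row], kept_y + [label]), test_X, test_y
        )
        if trial >= best:  # assumption: ties are accepted
            kept_X.append(row)
            kept_y.append(label)
            best = trial
            accepted.append(i)
    return accepted, best

# Toy one-feature data; the second synthetic sample carries a misleading label.
train_X, train_y = [[0.0], [1.0], [9.0], [10.0]], [0, 0, 1, 1]
test_X, test_y = [[2.0], [4.0], [6.0], [8.0]], [0, 0, 1, 1]
synth_X, synth_y = [[1.5], [9.0], [8.5]], [0, 0, 1]

accepted, final_acc = greedy_filter(train_X, train_y, synth_X, synth_y, test_X, test_y)
print(accepted)  # [0, 2]
```

In this toy run the mislabeled sample at index 1 pulls the class-0 centroid toward class 1, drops accuracy from 1.0 to 0.75, and is rejected. A filter of this kind retrains once per candidate, so the cost grows with the number of synthetic samples.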

✅ Each run outputs

  • SelectionResult.indices → indices of the accepted synthetic samples
  • SelectionResult.meta → metadata (strategy, accuracy metrics such as "filtered-accuracy", etc.)

🛠️ Development

Clone the repo and install in editable mode:

git clone https://github.com/pragatischdv/quality-data-extractor
cd quality-data-extractor
pip install -e .

📄 Citation

If you use QDE in your research, please cite:

@ARTICLE{11142788,
  author={Sachdeva, Pragati and Malhotra, Amarjit and Gupta, Karan},
  journal={IEEE Access}, 
  title={Quality Data Extractor (QDE): Elevating Synthetic Data Augmentation through Post-Generation Filtration}, 
  year={2025},
  doi={10.1109/ACCESS.2025.3603435}}
