Diagnosis-first synthetic data augmentation for imbalanced tabular datasets

SynthEdge

SMOTE generated 114× more synthetic samples than SynthEdge and scored 22 points lower on minority-class recall than no augmentation.
SynthEdge generated 18 targeted samples and stayed within 3 points of the unaugmented recall.
Framingham Heart Study, 3,658 rows


The problem with SMOTE

SMOTE answers "how do I get more data?" — it generates blindly across all minority samples.

SynthEdge answers "where is my data actually missing, and how do I fix exactly that?"

When your dataset has structural gaps — specific patient subgroups, rare feature combinations, underrepresented demographics — SMOTE fills the wrong places and can actively hurt recall. SynthEdge finds the gaps first, then synthesizes surgically.


Install

pip install synthedge

Quick start

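import pandas as pd
from synthedge import SynthEdge

# df: your training split as a pandas DataFrame (the file name is a placeholder)
df = pd.read_csv("train.csv")

se = SynthEdge(df, target_col="target")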

# Step 1: diagnose
report = se.analyze()
# Severity: SEVERE
# Will SynthEdge help? YES
# Recommendation: SMOTE will likely hurt recall. SynthEdge strongly recommended.

# Step 2: fill
aug_df = se.fill()
# [CTGAN] Training on 446 positive samples...
# [SE] Voxel (1,2,3): 9 samples via CTGAN
# [SE] Voxel (3,0,1): 6 samples via CTGAN
# Added 18 targeted positives (2915 -> 2933 rows)

# Step 3: check quality
report = se.quality_report()
# KL divergence in gap region: 1.969 (vs SMOTE: 1.989)

CLI

# Diagnose your dataset
synthedge analyze data.csv --target diagnosis

# Fill gaps and save augmented CSV
synthedge fill data.csv --target diagnosis --n-top 3 --out augmented.csv

How it works

1. 3D Local Density Scan (Gap Detector)

SynthEdge projects your training data into PCA space and tiles it with an adaptive 3D voxel grid. Each voxel is scored:

gap_score = 0.5 × sparsity + 0.3 × label_entropy + 0.2 × pos_rate

Voxels with high sparsity and minority-class presence are flagged as gaps. With n_bins=6, the grid has 216 voxels — fine-grained enough to isolate a 3-sample gap without merging it with nearby dense regions.

Unlike HDBSCAN (which needs min_cluster_size samples to form a cluster), this approach finds gaps with as few as 1 sample.
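
The scoring formula above can be illustrated with a minimal, standard-library sketch. It is not SynthEdge's implementation. It assumes the data has already been projected to three PCA components, uses a regular grid rather than an adaptive one, and defines sparsity as one minus the voxel's count relative to the densest voxel and label entropy as binary entropy in bits. Those definitions, and the names voxel_index, binary_entropy and score_voxels, are assumptions made for illustration.

```python
import math
from collections import defaultdict


def voxel_index(point, lows, highs, n_bins):
    """Map a 3D point to integer bin coordinates on a regular grid."""
    idx = []
    for x, lo, hi in zip(point, lows, highs):
        width = (hi - lo) / n_bins or 1.0  # guard against a zero-width axis
        idx.append(min(int((x - lo) / width), n_bins - 1))
    return tuple(idx)


def binary_entropy(p):
    """Entropy in bits of a binary label distribution with positive rate p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def score_voxels(points, labels, n_bins=6):
    """Score every occupied voxel with the weighted gap formula."""
    lows = [min(p[d] for p in points) for d in range(3)]
    highs = [max(p[d] for p in points) for d in range(3)]
    counts = defaultdict(int)
    positives = defaultdict(int)
    for p, y in zip(points, labels):
        v = voxel_index(p, lows, highs, n_bins)
        counts[v] += 1
        positives[v] += int(y == 1)
    max_count = max(counts.values())
    scores = {}
    for v, n in counts.items():
        pos_rate = positives[v] / n
        sparsity = 1.0 - n / max_count  # assumed definition of sparsity
        scores[v] = 0.5 * sparsity + 0.3 * binary_entropy(pos_rate) + 0.2 * pos_rate
    return scores


# Ten negatives packed in one corner, a single positive in the opposite corner.
demo_points = [(0.0, 0.0, 0.0)] * 10 + [(1.0, 1.0, 1.0)]
demo_labels = [0] * 10 + [1]
demo_scores = score_voxels(demo_points, demo_labels, n_bins=2)
```

In this toy input the lone positive sits in its own voxel and receives the highest score, which mirrors the claim that a gap can be found with a single sample.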

2. Severity Classifier

Before augmentation, SynthEdge tells you whether it will actually help:

| Severity | Score | Meaning |
|---|---|---|
| NONE | < 0.15 | Dataset is well-distributed. No augmentation needed. |
| MILD | 0.15–0.35 | Minor gaps. SynthEdge will match or slightly improve on SMOTE. |
| MODERATE | 0.35–0.60 | Clear gaps. Meaningful recall improvement expected. |
| SEVERE | > 0.60 | Structural gaps. SMOTE will likely hurt you. Use SynthEdge. |

SynthEdge is designed to tell you not to use it when it won't help.
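
As a rough illustration, the thresholds above translate into a small lookup. The function name classify_severity is hypothetical, and the handling of the exact boundary values (0.15, 0.35, 0.60) is an assumption, since the ranges in the table touch at their edges.

```python
def classify_severity(score):
    """Map an overall severity score to one of the four documented labels."""
    if score < 0.15:
        return "NONE"
    if score < 0.35:
        return "MILD"
    if score <= 0.60:  # the table lists SEVERE as strictly above 0.60
        return "MODERATE"
    return "SEVERE"


labels = [classify_severity(s) for s in (0.05, 0.20, 0.50, 0.75)]
```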

3. CTGAN Targeted Synthesis

SynthEdge trains CTGAN on the positive-class samples to learn the real joint feature distribution. It then generates a large candidate pool and filters to samples that fall within identified gap voxels. A lightweight logistic discriminator rejects candidates that are too easy to identify as fake.

When CTGAN's candidate pool yields too few samples for a voxel (too sparse), SynthEdge falls back to Gaussian sampling around the voxel centroid.
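
The filter-then-fallback logic can be sketched without CTGAN itself. In the snippet below, the candidate pool is a plain list standing in for generator output, in_voxel is a hypothetical membership test, and sigma is an arbitrary spread. The discriminator step is omitted. None of these names come from the SynthEdge API.

```python
import random


def fill_voxel(candidates, in_voxel, centroid, n_needed, sigma=0.1, seed=0):
    """Keep candidates that land in the voxel, then top up with Gaussian draws."""
    rng = random.Random(seed)
    # Step 1: filter the generated pool down to samples inside the gap voxel.
    kept = [c for c in candidates if in_voxel(c)][:n_needed]
    # Step 2: if the pool was too thin, sample around the voxel centroid.
    while len(kept) < n_needed:
        kept.append(tuple(rng.gauss(mu, sigma) for mu in centroid))
    return kept


pool = [(0.1, 0.1, 0.1), (0.9, 0.9, 0.9), (0.2, 0.2, 0.2)]
inside = lambda p: all(x < 0.5 for x in p)  # toy voxel: the low corner of the unit cube
samples = fill_voxel(pool, inside, centroid=(0.25, 0.25, 0.25), n_needed=4)
```

Gaussian draws are not guaranteed to stay inside the voxel, so a real implementation would presumably re-check membership or clip.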

4. Multi-Dataset Gap Transfer

If you have multiple datasets covering the same domain, SynthEdge can find matching gap regions across them using centroid cosine similarity and transfer real samples from the less-sparse dataset into the more-sparse one — no synthesis needed, just rescaling.

from synthedge.transfer import find_matching_gaps, transfer_samples, apply_transfers

datasets_info = [
    {"name": "Cleveland", "top_voxels": top_cl, "scaler": sc_cl,
     "X_tr_sc": X_cl, "y_tr": y_cl, "feature_names": feat_cl},
    {"name": "Framingham", "top_voxels": top_fr, "scaler": sc_fr,
     "X_tr_sc": X_fr, "y_tr": y_fr, "feature_names": feat_fr},
]

matches   = find_matching_gaps(datasets_info, similarity_threshold=0.70)
transfers = transfer_samples(matches, n_transfer=20)
X_aug, y_aug, n_added = apply_transfers("Cleveland", X_cl, y_cl, transfers)
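
For intuition, matching by centroid cosine similarity can be sketched as follows. This is not the body of find_matching_gaps. The helper names are hypothetical, and the sketch assumes the centroids from both datasets are already expressed in a shared feature space.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_gaps(centroids_a, centroids_b, threshold=0.70):
    """Return (index_a, index_b, similarity) for centroid pairs above the threshold."""
    matches = []
    for i, ca in enumerate(centroids_a):
        for j, cb in enumerate(centroids_b):
            sim = cosine_similarity(ca, cb)
            if sim >= threshold:
                matches.append((i, j, sim))
    return sorted(matches, key=lambda m: m[2], reverse=True)


pairs = match_gaps([(1, 0, 0), (0, 1, 0)], [(2, 0, 0), (0, 0, 3)])
```

Cosine similarity ignores vector length, so two centroids pointing in the same direction match even if one lies farther from the origin.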

Benchmark results

Tested on three cardiovascular datasets with artificially carved gap regions (70% of minority samples in a specific demographic subgroup removed).

KL divergence in gap region (lower = better recovery)

| Dataset | No augmentation | SMOTE | ADASYN | SynthEdge | SynthEdge vs SMOTE |
|---|---|---|---|---|---|
| Cleveland (297 rows) | 1.033 | 1.039 | 1.057 | 0.972 | −0.067 |
| Framingham (3,658 rows) | 1.987 | 1.989 | 1.998 | 1.969 | −0.020 |
| Cardiovascular (68,604 rows) | 0.653 | 0.646 | 0.646 | 0.653 | +0.006 |
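
To make the metric concrete, here is a minimal sketch of a discrete KL divergence between two histograms over the same bins. The source does not say how SynthEdge bins the gap region, which distribution is the reference, or which logarithm base it uses. The sketch assumes held-out counts as p, augmented-data counts as q, and natural logarithms.

```python
import math


def kl_divergence(p_counts, q_counts, eps=1e-9):
    """KL(P || Q) in nats for two histograms defined over the same bins."""
    p_total = sum(p_counts)
    q_total = sum(q_counts)
    kl = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p = pc / p_total
        q = max(qc / q_total, eps)  # floor q so empty bins do not divide by zero
        if p > 0:
            kl += p * math.log(p / q)
    return kl


same = kl_divergence([5, 5], [10, 10])
skewed = kl_divergence([1, 1], [3, 1])
```

A value of zero means the two histograms have identical shape. Larger values mean the augmented data covers the region less like the reference does.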

Minority-class recall

| Dataset | No augmentation | SMOTE | ADASYN | SynthEdge | SynthEdge vs SMOTE |
|---|---|---|---|---|---|
| Cleveland (297 rows) | 0.821 | 0.821 | 0.857 | 0.821 | 0.0 pp |
| Framingham (3,658 rows) | 0.514 | 0.288 | 0.351 | 0.486 | +19.8 pp |
| Cardiovascular (68,604 rows) | 0.695 | 0.690 | 0.690 | 0.693 | +0.3 pp |

Synthesis efficiency

| Dataset | SMOTE added | SynthEdge added | Ratio |
|---|---|---|---|
| Cleveland | 22 | 27 | |
| Framingham | 2,045 | 18 | 114× fewer |
| Cardiovascular | 1,690 | 39 | 43× fewer |

Key finding: On Framingham, SMOTE generated 114× more synthetic samples than SynthEdge and scored 22 points lower on recall than no augmentation (0.288 vs 0.514). SynthEdge generated 18 targeted samples and stayed within 3 points of the unaugmented recall (0.486 vs 0.514).
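
The headline figures can be re-derived from the benchmark tables with a few lines of arithmetic. The variable names are illustrative only, and every number is copied from the tables in this section.

```python
# Synthetic samples added per method, from the synthesis efficiency table.
smote_added = {"Framingham": 2045, "Cardiovascular": 1690}
synthedge_added = {"Framingham": 18, "Cardiovascular": 39}

# How many times more samples SMOTE generated, rounded to the nearest integer.
ratios = {
    name: round(smote_added[name] / synthedge_added[name])
    for name in smote_added
}

# Framingham minority-class recall, from the recall table.
no_aug, smote, synthedge = 0.514, 0.288, 0.486
smote_drop_pp = round((no_aug - smote) * 100, 1)       # SMOTE vs no augmentation
synthedge_gain_pp = round((synthedge - smote) * 100, 1)  # SynthEdge vs SMOTE
synthedge_drop_pp = round((no_aug - synthedge) * 100, 1)  # SynthEdge vs no augmentation
```

The 22-point figure is the gap between SMOTE and the unaugmented baseline. The gap between SMOTE and SynthEdge is the 19.8 points shown in the recall table.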

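On balanced datasets (Cardiovascular, 49.5% positive rate), SynthEdge detects low severity and stays close to the unaugmented baseline: gap-region KL divergence is unchanged (0.653), recall differs by 0.2 points (0.693 vs 0.695), and only 39 samples are added. It does not over-generate when data is already sufficient.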


When to use SynthEdge

| Use SynthEdge | Use SMOTE instead |
|---|---|
| Clinical / healthcare data | Generic balanced datasets |
| Specific demographic subgroups underrepresented | Uniform minority-class sparsity |
| Audit / compliance requirements (gap report) | Quick baseline augmentation |
| Multiple datasets from same domain (transfer) | Single small dataset, no CTGAN data |
| Severity = MODERATE or SEVERE | Severity = NONE or MILD |

API reference

SynthEdge(df, target_col, feature_cols, discrete_cols, verbose)

| Method or attribute | Description |
|---|---|
| .analyze(n_bins, top_k) | Run gap detection + severity classification |
| .fill(n_top, ctgan_epochs, use_ctgan) | Synthesize targeted samples |
| .quality_report(held_sc) | KL divergence + feature drift metrics |
| .gap_map | DataFrame of top gap voxels with scores |
| .severity | Severity classification result dict |

synthedge.transfer

| Function | Description |
|---|---|
| find_matching_gaps(datasets_info, similarity_threshold) | Match gap regions across datasets |
| transfer_samples(matches, n_transfer) | Extract real samples for transfer |
| apply_transfers(name, X_sc, y, transfers) | Inject transferred samples |

Citation

If you use SynthEdge in research, please cite:

@software{synthedge2025,
  title  = {SynthEdge: Diagnosis-first synthetic data augmentation},
  author = {Sagnik},
  year   = {2025},
  url    = {https://github.com/yourusername/synthedge}
}

License

MIT
