Diagnosis-first synthetic data augmentation for imbalanced tabular datasets
SynthEdge
Diagnosis-first synthetic data augmentation for imbalanced tabular datasets.
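SMOTE generated 114× more synthetic samples than SynthEdge (2,045 vs. 18) and cut recall by about 23 points relative to no augmentation (0.514 to 0.288).
SynthEdge generated 18 targeted samples and kept recall within 3 points of the unaugmented baseline (0.486).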
— Framingham Heart Study, 3,658 rows
The problem with SMOTE
SMOTE answers "how do I get more data?" — it generates blindly across all minority samples.
SynthEdge answers "where is my data actually missing, and how do I fix exactly that?"
When your dataset has structural gaps — specific patient subgroups, rare feature combinations, underrepresented demographics — SMOTE fills the wrong places and can actively hurt recall. SynthEdge finds the gaps first, then synthesizes surgically.
Install
```bash
pip install synthedge
```
Quick start
```python
import pandas as pd
from synthedge import SynthEdge

# Placeholder path; the DataFrame must contain the column named in target_col
df = pd.read_csv("data.csv")
se = SynthEdge(df, target_col="target")

# Step 1: diagnose
report = se.analyze()
# Severity: SEVERE
# Will SynthEdge help? YES
# Recommendation: SMOTE will likely hurt recall. SynthEdge strongly recommended.

# Step 2: fill
aug_df = se.fill()
# [CTGAN] Training on 446 positive samples...
# [SE] Voxel (1,2,3): 9 samples via CTGAN
# [SE] Voxel (3,0,1): 6 samples via CTGAN
# Added 18 targeted positives (2915 -> 2933 rows)

# Step 3: check quality
report = se.quality_report()
# KL divergence in gap region: 1.969 (vs SMOTE: 1.989)
```
CLI
```bash
# Diagnose your dataset
synthedge analyze data.csv --target diagnosis

# Fill gaps and save augmented CSV
synthedge fill data.csv --target diagnosis --n-top 3 --out augmented.csv
```
How it works
1. 3D Local Density Scan (Gap Detector)
SynthEdge projects your training data into PCA space and tiles it with an adaptive 3D voxel grid. Each voxel is scored:
gap_score = 0.5 × sparsity + 0.3 × label_entropy + 0.2 × pos_rate
Voxels with high sparsity and minority-class presence are flagged as gaps. With n_bins=6, the grid has 216 voxels — fine-grained enough to isolate a 3-sample gap without merging it with nearby dense regions.
Unlike HDBSCAN (which needs min_cluster_size samples to form a cluster), this approach finds gaps with as few as 1 sample.
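The scoring rule can be sketched in a few lines of plain Python. This is a minimal illustration, not SynthEdge's implementation. It assumes the points are already projected onto three PCA components, uses a fixed equal-width grid instead of the adaptive one, scores only occupied voxels, and defines sparsity as `1 - count / max_count` and label entropy as binary entropy in bits. The function name `voxel_gap_scores` and those definitions are assumptions; only the 0.5 / 0.3 / 0.2 weights come from the formula above.

```python
import math
from collections import defaultdict


def voxel_gap_scores(points, labels, n_bins=6):
    """Score each occupied voxel of an equal-width 3D grid (hypothetical helper).

    points: list of 3-element sequences (assumed to be PCA coordinates)
    labels: list of 0/1 class labels, 1 = minority/positive class
    """
    lows = [min(p[d] for p in points) for d in range(3)]
    highs = [max(p[d] for p in points) for d in range(3)]

    def bin_index(value, low, high):
        if high == low:
            return 0
        idx = int((value - low) / (high - low) * n_bins)
        return min(idx, n_bins - 1)  # the maximum value belongs to the last bin

    counts = defaultdict(lambda: [0, 0])  # voxel -> [n_total, n_positive]
    for p, y in zip(points, labels):
        key = tuple(bin_index(p[d], lows[d], highs[d]) for d in range(3))
        counts[key][0] += 1
        counts[key][1] += int(y == 1)

    max_count = max(n for n, _ in counts.values())
    scores = {}
    for key, (n, n_pos) in counts.items():
        sparsity = 1.0 - n / max_count          # assumed definition
        pos_rate = n_pos / n
        entropy = 0.0                           # binary entropy in bits
        for q in (pos_rate, 1.0 - pos_rate):
            if q > 0:
                entropy -= q * math.log2(q)
        scores[key] = 0.5 * sparsity + 0.3 * entropy + 0.2 * pos_rate
    return scores


# Four negatives packed in one corner, one isolated positive in the opposite corner
points = [(0.0, 0.0, 0.0)] * 4 + [(1.0, 1.0, 1.0)]
labels = [0, 0, 0, 0, 1]
scores = voxel_gap_scores(points, labels)
```

In this toy input the single isolated positive lands in its own voxel and receives the higher score, which mirrors the claim that a one-sample gap can be detected.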
2. Severity Classifier
Before augmentation, SynthEdge tells you whether it will actually help:
| Severity | Score | Meaning |
|---|---|---|
| NONE | < 0.15 | Dataset is well-distributed. No augmentation needed. |
| MILD | 0.15–0.35 | Minor gaps. SynthEdge will match or slightly improve on SMOTE. |
| MODERATE | 0.35–0.60 | Clear gaps. Meaningful recall improvement expected. |
| SEVERE | > 0.60 | Structural gaps. SMOTE will likely hurt you. Use SynthEdge. |
SynthEdge tells you when not to use it: a severity of NONE means augmentation is not needed.
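The thresholds in the table translate into a simple lookup. The sketch below is illustrative only: the function name `severity_label` is hypothetical, and the handling of scores that fall exactly on a boundary (0.15, 0.35, 0.60) is an assumption, since the table does not say which side they belong to.

```python
def severity_label(score: float) -> str:
    """Map a severity score to the labels in the table (boundary handling assumed)."""
    if score < 0.15:
        return "NONE"
    if score < 0.35:
        return "MILD"
    if score <= 0.60:      # the table lists SEVERE as strictly greater than 0.60
        return "MODERATE"
    return "SEVERE"


labels = [severity_label(s) for s in (0.05, 0.20, 0.50, 0.75)]
```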
3. CTGAN Targeted Synthesis
SynthEdge trains CTGAN on the positive-class samples to learn the real joint feature distribution. It then generates a large candidate pool and filters to samples that fall within identified gap voxels. A lightweight logistic discriminator rejects candidates that are too easy to identify as fake.
When CTGAN's candidate pool yields too few samples for a voxel (because the voxel is too sparse), SynthEdge falls back to Gaussian sampling around the voxel centroid.
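The filter-then-fallback logic can be sketched without any generator library. This is a simplified illustration under stated assumptions: `candidates` stands in for a pool already produced by some generator, `in_voxel` is a hypothetical membership test, the discriminator step is omitted, and the fallback standard deviation `sigma` is an arbitrary choice. None of these names are SynthEdge's API.

```python
import random


def fill_voxel(candidates, in_voxel, centroid, n_needed, sigma=0.1, seed=0):
    """Keep generator candidates inside the voxel; top up with Gaussian samples.

    Returns the chosen samples and a parallel list naming each sample's source.
    """
    rng = random.Random(seed)
    kept = [c for c in candidates if in_voxel(c)][:n_needed]
    source = ["generator"] * len(kept)
    # Fallback: draw around the centroid until the voxel has enough samples.
    # Note that a Gaussian draw is not guaranteed to land inside the voxel.
    while len(kept) < n_needed:
        kept.append([rng.gauss(mu, sigma) for mu in centroid])
        source.append("gaussian")
    return kept, source


# Only one of the two candidates falls in the unit-cube voxel, so two are sampled
pool = [[0.1, 0.1, 0.1], [5.0, 5.0, 5.0]]
inside_unit_cube = lambda c: all(0.0 <= x < 1.0 for x in c)
samples, source = fill_voxel(pool, inside_unit_cube, centroid=[0.5, 0.5, 0.5], n_needed=3)
```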
4. Multi-Dataset Gap Transfer
If you have multiple datasets covering the same domain, SynthEdge can match gap regions across them by the cosine similarity of their gap centroids. It then transfers real samples from the less-sparse dataset into the more-sparse one. No synthesis is needed, only rescaling.
```python
from synthedge.transfer import find_matching_gaps, transfer_samples, apply_transfers

# top_*, sc_*, X_*, y_* and feat_* come from each dataset's own
# preprocessing and gap analysis (not shown here)
datasets_info = [
    {"name": "Cleveland", "top_voxels": top_cl, "scaler": sc_cl,
     "X_tr_sc": X_cl, "y_tr": y_cl, "feature_names": feat_cl},
    {"name": "Framingham", "top_voxels": top_fr, "scaler": sc_fr,
     "X_tr_sc": X_fr, "y_tr": y_fr, "feature_names": feat_fr},
]

matches = find_matching_gaps(datasets_info, similarity_threshold=0.70)
transfers = transfer_samples(matches, n_transfer=20)
X_aug, y_aug, n_added = apply_transfers("Cleveland", X_cl, y_cl, transfers)
```
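To make the matching criterion concrete, here is a small stand-alone sketch of centroid matching by cosine similarity. It is not the code behind `find_matching_gaps`: the function names are hypothetical, and it assumes both datasets' gap centroids are already expressed in a shared, comparably scaled feature space.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_gap_centroids(centroids_a, centroids_b, threshold=0.70):
    """Return (index_a, index_b, similarity) for pairs at or above the threshold."""
    pairs = []
    for i, ca in enumerate(centroids_a):
        for j, cb in enumerate(centroids_b):
            sim = cosine_similarity(ca, cb)
            if sim >= threshold:
                pairs.append((i, j, sim))
    return sorted(pairs, key=lambda pair: -pair[2])  # best matches first


# Gap 0 of dataset A points the same way as gap 0 of dataset B; nothing else aligns
centroids_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
centroids_b = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]
pairs = match_gap_centroids(centroids_a, centroids_b)
```

Cosine similarity ignores vector length, so two centroids match when they point in the same direction even if their magnitudes differ.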
Benchmark results
Tested on three cardiovascular datasets with artificially carved gap regions (70% of minority samples in a specific demographic subgroup removed).
KL divergence in gap region (lower = better recovery)
| Dataset | No aug | SMOTE | ADASYN | SynthEdge | SE vs SMOTE |
|---|---|---|---|---|---|
| Cleveland (297 rows) | 1.033 | 1.039 | 1.057 | 0.972 | −0.067 ✓ |
| Framingham (3,658 rows) | 1.987 | 1.989 | 1.998 | 1.969 | −0.020 ✓ |
| Cardiovascular (68,604 rows) | 0.653 | 0.646 | 0.646 | 0.653 | +0.006 |
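For readers unfamiliar with the metric, the sketch below computes a discrete KL divergence between two histograms. How SynthEdge bins the gap region and which distribution serves as the reference are not stated here, so the function name, the histogram inputs, and the `eps` floor are all assumptions for illustration.

```python
import math


def kl_divergence(p_counts, q_counts, eps=1e-9):
    """KL(P || Q) in nats for two histograms defined over the same bins."""
    p_total = sum(p_counts)
    q_total = sum(q_counts)
    kl = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p = pc / p_total
        q = max(qc / q_total, eps)  # floor avoids division by zero in empty bins
        if p > 0:
            kl += p * math.log(p / q)
    return kl


same = kl_divergence([1, 1], [1, 1])      # identical distributions
skewed = kl_divergence([1, 1], [3, 1])    # reference is 50/50, comparison is 75/25
```

A value of zero means the two histograms agree exactly; larger values mean the augmented data diverges more from the reference in that region.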
Minority-class recall
| Dataset | No aug | SMOTE | ADASYN | SynthEdge | SE vs SMOTE |
|---|---|---|---|---|---|
| Cleveland (297 rows) | 0.821 | 0.821 | 0.857 | 0.821 | 0.0 pp |
| Framingham (3,658 rows) | 0.514 | 0.288 | 0.351 | 0.486 | +19.8 pp ✓ |
| Cardiovascular (68,604 rows) | 0.695 | 0.690 | 0.690 | 0.693 | +0.3 pp ✓ |
Synthesis efficiency
| Dataset | SMOTE added | SynthEdge added | Ratio |
|---|---|---|---|
| Cleveland | 22 | 27 | 1.2× more |
| Framingham | 2,045 | 18 | 114× fewer |
| Cardiovascular | 1,690 | 39 | 43× fewer |
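Key finding: On Framingham, SMOTE generated 114× more synthetic samples than SynthEdge (2,045 vs. 18) and cut recall by about 23 points relative to no augmentation (0.514 to 0.288). SynthEdge generated 18 targeted samples and kept recall within 3 points of the unaugmented baseline (0.486).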
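These headline figures follow from the Framingham rows of the tables above. A quick arithmetic check, using only values copied from those tables (the variable names are illustrative):

```python
# Framingham values copied from the benchmark tables
smote_added, synthedge_added = 2045, 18
recall = {"no_aug": 0.514, "smote": 0.288, "synthedge": 0.486}

sample_ratio = smote_added / synthedge_added
smote_drop_pp = (recall["no_aug"] - recall["smote"]) * 100
synthedge_drop_pp = (recall["no_aug"] - recall["synthedge"]) * 100
synthedge_vs_smote_pp = (recall["synthedge"] - recall["smote"]) * 100

print(f"SMOTE added {sample_ratio:.1f}x as many samples as SynthEdge")
print(f"SMOTE recall change vs. no augmentation: -{smote_drop_pp:.1f} pp")
print(f"SynthEdge recall change vs. no augmentation: -{synthedge_drop_pp:.1f} pp")
print(f"SynthEdge vs. SMOTE: +{synthedge_vs_smote_pp:.1f} pp")
```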
On balanced datasets (Cardiovascular, 49.5% positive rate), SynthEdge detects low severity, adds only 39 samples, and produces results nearly identical to no augmentation (gap-region KL 0.653 vs. 0.653, recall 0.693 vs. 0.695). It does not over-generate when the data is already sufficient.
When to use SynthEdge
| Use SynthEdge | Use SMOTE instead |
|---|---|
| Clinical / healthcare data | Generic balanced datasets |
| Specific demographic subgroups underrepresented | Uniform minority-class sparsity |
| Audit / compliance requirements (gap report) | Quick baseline augmentation |
| Multiple datasets from same domain (transfer) | Single small dataset with too little data to train CTGAN |
| Severity = MODERATE or SEVERE | Severity = NONE or MILD |
API reference
SynthEdge(df, target_col, feature_cols, discrete_cols, verbose)
| Method | Description |
|---|---|
| .analyze(n_bins, top_k) | Run gap detection + severity classification |
| .fill(n_top, ctgan_epochs, use_ctgan) | Synthesize targeted samples |
| .quality_report(held_sc) | KL divergence + feature drift metrics |
| .gap_map | DataFrame of top gap voxels with scores |
| .severity | Severity classification result dict |
synthedge.transfer
| Function | Description |
|---|---|
| find_matching_gaps(datasets_info, similarity_threshold) | Match gap regions across datasets |
| transfer_samples(matches, n_transfer) | Extract real samples for transfer |
| apply_transfers(name, X_sc, y, transfers) | Inject transferred samples |
Citation
If you use SynthEdge in research, please cite:
```bibtex
@software{synthedge2025,
  title = {SynthEdge: Diagnosis-first synthetic data augmentation},
  author = {Sagnik},
  year = {2025},
  url = {https://github.com/yourusername/synthedge}
}
```
License
MIT
File details
Details for the file synthedge-0.1.0.tar.gz.
File metadata
- Download URL: synthedge-0.1.0.tar.gz
- Upload date:
- Size: 22.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d6bcdf7d69aa5fb724377a0d8847408f785c53a27221374531965ca4a3ffe500 |
| MD5 | ac4147a3bcf90219ad749c9a3b172f57 |
| BLAKE2b-256 | 0b6ef698903b5e476a3646e2fa4b8c9ff72c474b56b7daf9224fd2c8c5720632 |
File details
Details for the file synthedge-0.1.0-py3-none-any.whl.
File metadata
- Download URL: synthedge-0.1.0-py3-none-any.whl
- Upload date:
- Size: 19.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 080b938416858909687fd46451176cad117815b69abd52ee0efb66c646097515 |
| MD5 | ab7387209a355cd75317d8958e8da619 |
| BLAKE2b-256 | cea9bbfbf116b785eb54ceb3d9aef287fa484ea3cf0711e48081e29e3fc3af6f |