Diagnosis-first synthetic data augmentation for imbalanced tabular datasets

These details have not been verified by PyPI

Project links

Project description

SynthEdge

Python License

Diagnosis-first synthetic data augmentation for imbalanced tabular datasets.

SMOTE generated 114x more synthetic samples and achieved 22 points worse recall.
SynthEdge generated 18 targeted samples and preserved performance.
— Framingham Heart Study, 3,658 rows

Python

The problem with SMOTE

SMOTE answers "how do I get more data?" — it generates blindly across all minority samples.

SynthEdge answers "where is my data actually missing, and how do I fix exactly that?"

When your dataset has structural gaps — specific patient subgroups, rare feature combinations, underrepresented demographics — SMOTE fills the wrong places and can actively hurt recall. SynthEdge finds the gaps first, then synthesizes surgically.

Install

pip install synthedge

Quick start

from synthedge import SynthEdge

se = SynthEdge(df, target_col="target")

# Step 1: diagnose
report = se.analyze()
# Severity: SEVERE
# Will SynthEdge help? YES
# Recommendation: SMOTE will likely hurt recall. SynthEdge strongly recommended.

# Step 2: fill
aug_df = se.fill()
# [CTGAN] Training on 446 positive samples...
# [SE] Voxel (1,2,3): 9 samples via CTGAN
# [SE] Voxel (3,0,1): 6 samples via CTGAN
# Added 18 targeted positives (2915 -> 2933 rows)

# Step 3: check quality
report = se.quality_report()
# KL divergence in gap region: 1.969 (vs SMOTE: 1.989)

CLI

# Diagnose your dataset
synthedge analyze data.csv --target diagnosis

# Fill gaps and save augmented CSV
synthedge fill data.csv --target diagnosis --n-top 3 --out augmented.csv

How it works

1. 3D Local Density Scan (Gap Detector)

SynthEdge projects your training data into PCA space and tiles it with an adaptive 3D voxel grid. Each voxel is scored:

gap_score = 0.5 × sparsity + 0.3 × label_entropy + 0.2 × pos_rate

Voxels with high sparsity and minority-class presence are flagged as gaps. With n_bins=6, the grid has 216 voxels — fine-grained enough to isolate a 3-sample gap without merging it with nearby dense regions.

Unlike HDBSCAN (which needs min_cluster_size samples to form a cluster), this approach finds gaps with as few as 1 sample.

2. Severity Classifier

Before augmentation, SynthEdge tells you whether it will actually help:

Severity	Score	Meaning
NONE	< 0.15	Dataset is well-distributed. No augmentation needed.
MILD	0.15–0.35	Minor gaps. SynthEdge will match or slightly improve SMOTE.
MODERATE	0.35–0.60	Clear gaps. Meaningful recall improvement expected.
SEVERE	> 0.60	Structural gaps. SMOTE will likely hurt you. Use SynthEdge.

This is the only augmentation tool that tells you not to use it when it won't help.

3. CTGAN Targeted Synthesis

SynthEdge trains CTGAN on the positive-class samples to learn the real joint feature distribution. It then generates a large candidate pool and filters to samples that fall within identified gap voxels. A lightweight logistic discriminator rejects candidates that are too easy to identify as fake.

When CTGAN's pool is insufficient for a voxel (too sparse), it falls back to Gaussian sampling around the voxel centroid.

4. Multi-Dataset Gap Transfer

If you have multiple datasets covering the same domain, SynthEdge can find matching gap regions across them using centroid cosine similarity and transfer real samples from the less-sparse dataset into the more-sparse one — no synthesis needed, just rescaling.

from synthedge.transfer import find_matching_gaps, transfer_samples, apply_transfers

datasets_info = [
    {"name": "Cleveland", "top_voxels": top_cl, "scaler": sc_cl,
     "X_tr_sc": X_cl, "y_tr": y_cl, "feature_names": feat_cl},
    {"name": "Framingham", "top_voxels": top_fr, "scaler": sc_fr,
     "X_tr_sc": X_fr, "y_tr": y_fr, "feature_names": feat_fr},
]

matches   = find_matching_gaps(datasets_info, similarity_threshold=0.70)
transfers = transfer_samples(matches, n_transfer=20)
X_aug, y_aug, n_added = apply_transfers("Cleveland", X_cl, y_cl, transfers)

Benchmark results

Tested on three cardiovascular datasets with artificially carved gap regions (70% of minority samples in a specific demographic subgroup removed).

KL divergence in gap region (lower = better recovery)

Dataset	No aug	SMOTE	ADASYN	SynthEdge	SE vs SMOTE
Cleveland (297 rows)	1.033	1.039	1.057	0.972	−0.067 ✓
Framingham (3,658 rows)	1.987	1.989	1.998	1.969	−0.020 ✓
Cardiovascular (68,604 rows)	0.653	0.646	0.646	0.653	+0.006

Minority-class recall

Dataset	No aug	SMOTE	ADASYN	SynthEdge	SE vs SMOTE
Cleveland (297 rows)	0.821	0.821	0.857	0.821	0.0 pp
Framingham (3,658 rows)	0.514	0.288	0.351	0.486	+19.8 pp ✓
Cardiovascular (68,604 rows)	0.695	0.690	0.690	0.693	+0.3 pp ✓

Synthesis efficiency

Dataset	SMOTE added	SynthEdge added	Ratio
Cleveland	22	27	1×
Framingham	2,045	18	114× fewer
Cardiovascular	1,690	39	43× fewer

Key finding: On Framingham, SMOTE generated 114× more synthetic samples and achieved 22 points worse recall. SynthEdge generated 18 targeted samples and preserved performance.

On balanced datasets (Cardiovascular, 49.5% positive rate): SynthEdge correctly detects low severity and produces results identical to no augmentation — it does not over-generate when data is already sufficient.

When to use SynthEdge

Use SynthEdge	Use SMOTE instead
Clinical / healthcare data	Generic balanced datasets
Specific demographic subgroups underrepresented	Uniform minority-class sparsity
Audit / compliance requirements (gap report)	Quick baseline augmentation
Multiple datasets from same domain (transfer)	Single small dataset, no CTGAN data
Severity = MODERATE or SEVERE	Severity = NONE or MILD

API reference

`SynthEdge(df, target_col, feature_cols, discrete_cols, verbose)`

Method	Description
`.analyze(n_bins, top_k)`	Run gap detection + severity classification
`.fill(n_top, ctgan_epochs, use_ctgan)`	Synthesize targeted samples
`.quality_report(held_sc)`	KL divergence + feature drift metrics
`.gap_map`	DataFrame of top gap voxels with scores
`.severity`	Severity classification result dict

`synthedge.transfer`

Function	Description
`find_matching_gaps(datasets_info, threshold)`	Match gap regions across datasets
`transfer_samples(matches, n_transfer)`	Extract real samples for transfer
`apply_transfers(name, X_sc, y, transfers)`	Inject transferred samples

Citation

If you use SynthEdge in research, please cite:

@software{synthedge2025,
  title  = {SynthEdge: Diagnosis-first synthetic data augmentation},
  author = {Sagnik},
  year   = {2025},
  url    = {https://github.com/yourusername/synthedge}
}

License

MIT

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.2.0

Apr 27, 2026

This version

0.1.0

Apr 23, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

synthedge-0.1.0.tar.gz (22.1 kB view details)

Uploaded Apr 23, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

synthedge-0.1.0-py3-none-any.whl (19.2 kB view details)

Uploaded Apr 23, 2026 Python 3

File details

Details for the file synthedge-0.1.0.tar.gz.

File metadata

Download URL: synthedge-0.1.0.tar.gz
Upload date: Apr 23, 2026
Size: 22.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.6

File hashes

Hashes for synthedge-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`d6bcdf7d69aa5fb724377a0d8847408f785c53a27221374531965ca4a3ffe500`
MD5	`ac4147a3bcf90219ad749c9a3b172f57`
BLAKE2b-256	`0b6ef698903b5e476a3646e2fa4b8c9ff72c474b56b7daf9224fd2c8c5720632`

See more details on using hashes here.

File details

Details for the file synthedge-0.1.0-py3-none-any.whl.

File metadata

Download URL: synthedge-0.1.0-py3-none-any.whl
Upload date: Apr 23, 2026
Size: 19.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.6

File hashes

Hashes for synthedge-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`080b938416858909687fd46451176cad117815b69abd52ee0efb66c646097515`
MD5	`ab7387209a355cd75317d8958e8da619`
BLAKE2b-256	`cea9bbfbf116b785eb54ceb3d9aef287fa484ea3cf0711e48081e29e3fc3af6f`

See more details on using hashes here.

synthedge 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

SynthEdge

The problem with SMOTE

Install

Quick start

CLI

How it works

1. 3D Local Density Scan (Gap Detector)

2. Severity Classifier

3. CTGAN Targeted Synthesis

4. Multi-Dataset Gap Transfer

Benchmark results

KL divergence in gap region (lower = better recovery)

Minority-class recall

Synthesis efficiency

When to use SynthEdge

API reference

SynthEdge(df, target_col, feature_cols, discrete_cols, verbose)

synthedge.transfer

Citation

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

`SynthEdge(df, target_col, feature_cols, discrete_cols, verbose)`

`synthedge.transfer`