Synthesis of Next Generation Bulk Transcriptomic Sequencing - A data augmentation tool for transcriptomics data using deep generative models
SyNG-BTS: Synthesis of Next Generation Bulk Transcriptomic Sequencing
SyNG-BTS is a Python package for data augmentation of bulk transcriptomic sequencing data using deep generative models. It synthesizes realistic transcriptomic data without relying on predefined formulas, enabling researchers to augment small pilot datasets for more robust machine learning analyses. SyNG-BTS supports three generative model families: variational autoencoders (VAE), generative adversarial networks (GAN), and flow-based models. These models are trained on a pilot dataset and can synthesize additional samples at any desired scale.
Features
- Multiple Generative Models: VAE, CVAE, GAN, WGANGP, and flow-based models (MAF)
- Unified API: Run pilot experiments, generate synthetic data, and perform transfer learning
- DataFrame-First API: Accept pandas DataFrames, CSV file paths, or bundled dataset names
- Rich Result Objects: SyngResult/PilotResult with built-in plotting and export
- In-Memory Pipeline: No disk I/O by default; results stay in memory until you choose to save
- Built-in Evaluation: Heatmap and UMAP visualization functions
- Sample-Size Evaluation: Integrated SyntheSize methodology for classifier learning curves
- Bundled Datasets: Example TCGA datasets included for immediate experimentation
Installation
From PyPI (Recommended)
pip install syng-bts
From Source
git clone https://github.com/Omics-Data-Synthesis/SyNG-BTS
cd SyNG-BTS
pip install -e .
Optional Dependencies
For documentation building:
pip install syng-bts[docs]
For development (testing, linting):
pip install syng-bts[dev]
Quick Start
Generate Synthetic Data
from syng_bts import generate
# Train a VAE on bundled data and generate 500 synthetic samples
result = generate(
    data="SKCMPositive_4",  # bundled dataset name, CSV path, or DataFrame
    model="VAE1-10",        # model specification (type + kl_weight)
    new_size=500,           # number of synthetic samples
    batch_frac=0.1,         # batch fraction
    learning_rate=0.0005,   # learning rate
)
# Grouped-size controls
# - int: exact total sample count (grouped data is split by input ratio)
# - list[int]: explicit grouped counts [n_group_0, n_group_1]
# where group_0 is the first group value encountered in input data.
# Access results in memory
print(result.generated_data.shape) # (500, n_features)
print(result.loss.columns.tolist()) # ['kl', 'recons']
print(result.summary())
# Plot training loss (one figure per loss column)
figs = result.plot_loss() # dict[str, Figure]
# Optionally save to disk
result.save("./my_output/")
# Load a previously saved result
from syng_bts import SyngResult
loaded = SyngResult.load("./my_output/")
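The grouped-size comments above say that an integer sample count is split across groups by the input ratio. The sketch below shows one plausible way to do such a split, using largest-remainder rounding so the parts always add up to the requested total. The helper name `split_by_ratio` and the labels are hypothetical; this is not taken from the SyNG-BTS source, which may round differently.

```python
from collections import Counter


def split_by_ratio(labels, total):
    """Split `total` across groups in proportion to their frequency in `labels`.

    Hypothetical helper for illustration only. Groups keep the order in
    which they first appear in `labels`.
    """
    counts = Counter(labels)  # Counter preserves first-seen order
    n = len(labels)
    exact = {g: total * c / n for g, c in counts.items()}
    alloc = {g: int(x) for g, x in exact.items()}  # floor each share
    leftover = total - sum(alloc.values())
    # Give the remaining samples to the groups with the largest fractional parts.
    by_fraction = sorted(exact, key=lambda g: exact[g] - alloc[g], reverse=True)
    for g in by_fraction[:leftover]:
        alloc[g] += 1
    return alloc


# Made-up group labels: 30 samples in one group, 10 in the other.
labels = ["tumor"] * 30 + ["normal"] * 10
print(split_by_ratio(labels, 500))  # {'tumor': 375, 'normal': 125}
```

Passing an explicit list of counts instead avoids any rounding question, at the cost of choosing the group sizes yourself.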
Run a Pilot Study
from syng_bts import pilot_study
# Sweep over multiple pilot sizes (5 random draws each)
pilot = pilot_study(
    data="SKCMPositive_4",
    pilot_size=[50, 100],
    model="VAE1-10",
    batch_frac=0.1,
    learning_rate=0.0005,
)
# Access individual runs
run = pilot.runs[(50, 1)] # (pilot_size, draw_index)
print(run.generated_data.head())
# All runs overlaid on one plot per loss column
figs = pilot.plot_loss(style="overlay_runs") # dict[str, Figure]
# Mean ± std across runs
figs = pilot.plot_loss(style="mean_band") # dict[str, Figure]
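The mean_band style summarizes the loss curves of all runs as a mean with a standard-deviation band. The sketch below shows that kind of computation on made-up loss curves using only the standard library. The function name and data are illustrative, it does not call SyNG-BTS, and the package may compute its band differently (for example with a population rather than a sample standard deviation).

```python
from statistics import mean, stdev


def mean_band(curves):
    """Return per-epoch (mean, lower, upper) across equally long loss curves."""
    band = []
    for values in zip(*curves):  # one tuple of losses per epoch
        m = mean(values)
        s = stdev(values) if len(values) > 1 else 0.0  # sample std
        band.append((m, m - s, m + s))
    return band


# Illustrative loss curves from three hypothetical runs (not real output).
curves = [
    [1.0, 0.5, 0.25],
    [2.0, 1.0, 0.5],
    [3.0, 1.5, 0.75],
]
for epoch, (m, lo, hi) in enumerate(mean_band(curves), start=1):
    print(f"epoch {epoch}: mean={m:.2f}, band=[{lo:.2f}, {hi:.2f}]")
```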
Transfer Learning
from syng_bts import transfer
# Pre-train on PRAD, fine-tune and generate on BRCA
result = transfer(
    source_data="PRAD",
    target_data="BRCA",
    new_size=500,
    model="maf",
    apply_log=True,
    epoch=10,
)
print(result.generated_data.shape)
result.save("./transfer_output/")
Use DataFrame Input
import pandas as pd
from syng_bts import generate
my_data = pd.read_csv("my_dataset.csv")
result = generate(
    data=my_data,
    name="my_dataset",  # used in output filenames
    model="WGANGP",
    new_size=1000,
    epoch=50,
)
Evaluate Generated Data
from syng_bts import generate, resolve_data, heatmap_eval, UMAP_eval
result = generate(data="SKCMPositive_4", model="VAE1-10", epoch=5)
real_data, _groups = resolve_data("SKCMPositive_4")
# Built-in heatmap on result object
fig = result.plot_heatmap()
# Standalone evaluation comparing real vs generated
fig_heatmap = heatmap_eval(real_data=real_data.head(50), generated_data=result.generated_data.head(50))
fig_umap = UMAP_eval(real_data=real_data, generated_data=result.generated_data, random_seed=42)
Sample-Size Evaluation (SyntheSize)
Evaluate how classifier performance scales with sample size using the integrated SyntheSize methodology, which is also available as standalone R and Python versions:
from syng_bts import evaluate_sample_sizes, plot_sample_sizes, resolve_data
# Load real data with group labels
data, groups = resolve_data("BRCASubtypeSel_test")
# Evaluate classifiers at increasing sample sizes
metrics = evaluate_sample_sizes(
    data=data,
    sample_sizes=[50, 100, 150],
    groups=groups,
    n_draws=5,
    methods=["LOGIS", "RF", "XGB"],
)
# Plot inverse power-law learning curves
fig = plot_sample_sizes(metrics, n_target=200)
fig.savefig("learning_curves.png")
evaluate_sample_sizes applies log2(x + 1) by default (apply_log=True).
Set apply_log=False if your inputs are already log-transformed.
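For reference, log2(x + 1) maps raw counts to a log scale while keeping zeros at zero. The standard-library sketch below shows the transform and its inverse on made-up counts; the function names are illustrative and are not part of the SyNG-BTS API.

```python
import math


def log2p1(counts):
    """Apply log2(x + 1) to each non-negative count."""
    return [math.log2(x + 1) for x in counts]


def inverse_log2p1(values):
    """Undo log2(x + 1): x = 2**y - 1."""
    return [2 ** y - 1 for y in values]


raw = [0, 1, 3, 1023]          # made-up read counts
logged = log2p1(raw)
print(logged)                  # [0.0, 1.0, 2.0, 10.0]
print(inverse_log2p1(logged))  # [0.0, 1.0, 3.0, 1023.0]
```

Applying the transform a second time to already logged values compresses them further, which is why apply_log=False matters for pre-transformed inputs.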
You can also pass a SyngResult directly — groups are auto-resolved:
from syng_bts import generate, evaluate_sample_sizes
result = generate(data="BRCASubtypeSel_train", model="CVAE1-20", epoch=50)
metrics = evaluate_sample_sizes(result, sample_sizes=[50, 100], which="generated")
List Available Datasets
from syng_bts import list_bundled_datasets
print(list_bundled_datasets())
# ['SKCMPositive_4', 'BRCA', 'PRAD', 'BRCASubtypeSel', ...]
Available Models
| Model | Description |
|---|---|
| VAE1-10 | Variational Auto-Encoder with 1:10 loss ratio |
| CVAE1-10 | Conditional VAE with 1:10 loss ratio |
| GAN | Standard Generative Adversarial Network |
| WGANGP | Wasserstein GAN with Gradient Penalty |
| maf | Masked Autoregressive Flow |
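Model names such as VAE1-10 combine a model type with a loss ratio, while names such as GAN or maf carry no ratio. The sketch below shows how such a string can be read apart with a regular expression. The function `parse_model_spec` is hypothetical and written for this README only; it is not the parser SyNG-BTS uses, and it does not check whether the type is one the package supports.

```python
import re

# Letters for the model type, optionally followed by "<int>-<int>" for the ratio.
_SPEC = re.compile(r"([A-Za-z]+)(?:(\d+)-(\d+))?")


def parse_model_spec(spec):
    """Split a spec like 'VAE1-10' into ('VAE', (1, 10)); ratio is None if absent."""
    match = _SPEC.fullmatch(spec)
    if match is None:
        raise ValueError(f"unrecognized model spec: {spec!r}")
    name, left, right = match.groups()
    ratio = (int(left), int(right)) if left is not None else None
    return name, ratio


for spec in ["VAE1-10", "CVAE1-20", "WGANGP", "maf"]:
    print(spec, "->", parse_model_spec(spec))
```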
Dependencies
SyNG-BTS requires Python 3.10+ and the following packages:
- torch (>=2.0.0)
- pandas (>=1.5.0)
- numpy (>=1.23.0)
- scipy (>=1.9.0)
- matplotlib (>=3.6.0)
- seaborn (>=0.12.0)
- scikit-learn (>=1.3.0)
- xgboost (>=2.0.0)
- tensorboardX (>=2.6.0)
- umap-learn (>=0.5.6)
- pyarrow (>=14.0.0)
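To see which of these packages are already present before installing, a small standard-library check like the following can help. It is only a convenience sketch: it reports the Python version test and the installed versions, and it does not compare them against the minimums listed above.

```python
import sys
from importlib import metadata

# Distribution names copied from the dependency list above.
REQUIRED = [
    "torch", "pandas", "numpy", "scipy", "matplotlib", "seaborn",
    "scikit-learn", "xgboost", "tensorboardX", "umap-learn", "pyarrow",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print("Python 3.10+:", sys.version_info >= (3, 10))
for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'not installed'}")
```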
Documentation
Full documentation is available at syng-bts.readthedocs.io.
Development
Quick Setup
git clone https://github.com/Omics-Data-Synthesis/SyNG-BTS
cd SyNG-BTS
make init-dev # Install package + dev dependencies
Makefile Commands
The project includes a Makefile for common development tasks:
| Command | Description |
|---|---|
| make help | Show all available commands |
| make install | Install package in editable mode |
| make install-dev | Install development dependencies |
| make init-dev | Full dev setup (install + dev deps in venv) |
| make test | Run tests with pytest |
| make test-cov | Run tests with coverage report |
| make lint | Check code with ruff |
| make format | Auto-format code with ruff |
| make check | Run lint + tests |
| make docs | Build documentation |
| make clean | Remove build artifacts |
Citation
If you use SyNG-BTS in your research, please cite:
Qi Y, Wang X, Qin LX. Optimizing sample size for supervised machine learning with bulk transcriptomic sequencing: a learning curve approach. Brief Bioinform. 2025 Mar 4;26(2):bbaf097. doi: 10.1093/bib/bbaf097. PMID: 40072846; PMCID: PMC11899567.
BibTeX:
@article{qin2025optimizing,
  title = {Optimizing sample size for supervised machine learning with bulk transcriptomic sequencing: a learning curve approach},
  author = {Qi, Yunhui and Wang, Xinyi and Qin, Li-Xuan},
  journal = {Briefings in Bioinformatics},
  year = {2025},
  volume = {26},
  number = {2},
  pages = {bbaf097},
  doi = {10.1093/bib/bbaf097},
  url = {https://pmc.ncbi.nlm.nih.gov/articles/PMC11899567/}
}
License
SyNG-BTS is licensed under the GNU Affero General Public License v3.0.
Acknowledgments
This package was developed at Memorial Sloan Kettering Cancer Center. We thank Sebastian Raschka for the STAT 453 course materials that provided foundational concepts for the deep generative models.