Synthesis of Next Generation Bulk Transcriptomic Sequencing - A data augmentation tool for transcriptomics data using deep generative models

SyNG-BTS: Synthesis of Next Generation Bulk Transcriptomic Sequencing

SyNG-BTS is a Python package for data augmentation of bulk transcriptomic sequencing data using deep generative models. It synthesizes realistic transcriptomic data without relying on predefined formulas, enabling researchers to augment small pilot datasets for more robust machine learning analyses. SyNG-BTS supports three generative model families: variational autoencoders (VAE), generative adversarial networks (GAN), and flow-based models. These models are trained on a pilot dataset and can synthesize additional samples at any desired scale.

Figure: SyNG-BTS workflow.

Features

Multiple Generative Models: VAE, CVAE, GAN, WGANGP, and flow-based models (MAF)
Unified API: Run pilot experiments, generate synthetic data, and perform transfer learning
DataFrame-First API: Accept pandas DataFrames, CSV file paths, or bundled dataset names
Rich Result Objects: SyngResult / PilotResult with built-in plotting and export
In-Memory Pipeline: No disk I/O by default — results stay in memory until you choose to save
Built-in Evaluation: Heatmap and UMAP visualization functions
Sample-Size Evaluation: Integrated SyntheSize methodology for classifier learning curves
Bundled Datasets: Example TCGA datasets included for immediate experimentation

Installation

From PyPI (Recommended)

pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ syng-bts

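Note: this command currently installs from TestPyPI and uses the main PyPI index only to resolve dependencies. TODO: Update to main PyPI installation upon release.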

From Source

git clone https://github.com/Omics-Data-Synthesis/SyNG-BTS
cd SyNG-BTS
pip install -e .

Optional Dependencies

For documentation building:

pip install "syng-bts[docs]"  # quotes keep shells such as zsh from expanding the brackets

For development (testing, linting):

pip install "syng-bts[dev]"  # quotes keep shells such as zsh from expanding the brackets

Quick Start

Generate Synthetic Data

from syng_bts import generate

# Train a VAE on bundled data and generate 500 synthetic samples
result = generate(
    data="SKCMPositive_4",   # bundled dataset name, CSV path, or DataFrame
    model="VAE1-10",         # model specification (type + kl_weight)
    new_size=500,            # number of synthetic samples
    batch_frac=0.1,          # batch fraction
    learning_rate=0.0005,    # learning rate
)

# new_size controls for grouped data
# - int: exact total sample count (grouped data is split by input ratio)
# - list[int]: explicit per-group counts [n_group_0, n_group_1],
#   where group_0 is the first group value encountered in the input data.

# Access results in memory
print(result.generated_data.shape)   # (500, n_features)
print(result.loss.columns.tolist())  # ['kl', 'recons']
print(result.summary())

# Plot training loss (one figure per loss column)
figs = result.plot_loss()  # dict[str, Figure]

# Optionally save to disk
result.save("./my_output/")

# Load a previously saved result
from syng_bts import SyngResult
loaded = SyngResult.load("./my_output/")
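
The comment above says that an integer new_size is split across groups by the input ratio. The following is a minimal sketch of one way such a split could work, not the package's actual code. The helper name split_by_ratio and the largest-remainder rounding rule are assumptions made for illustration; SyNG-BTS may round differently.

```python
from collections import Counter


def split_by_ratio(group_labels, new_size):
    """Split a total sample count across groups in proportion to the input.

    Hypothetical helper, not part of syng_bts. Uses largest-remainder
    rounding so the per-group counts always sum to new_size.
    """
    counts = Counter(group_labels)  # keys keep first-encountered order
    total = sum(counts.values())
    exact = {g: new_size * c / total for g, c in counts.items()}
    alloc = {g: int(v) for g, v in exact.items()}  # floor of each share
    leftover = new_size - sum(alloc.values())
    # Give the remaining samples to the groups with the largest fractions
    by_remainder = sorted(exact, key=lambda g: exact[g] - alloc[g], reverse=True)
    for g in by_remainder[:leftover]:
        alloc[g] += 1
    return alloc


labels = ["tumor"] * 30 + ["normal"] * 10
print(split_by_ratio(labels, 500))  # {'tumor': 375, 'normal': 125}
```

With a 3:1 input ratio, a request for 500 samples yields 375 and 125. Passing a list such as [250, 250] to new_size would instead fix the per-group counts explicitly.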

Run a Pilot Study

from syng_bts import pilot_study

# Sweep over multiple pilot sizes (5 random draws each)
pilot = pilot_study(
    data="SKCMPositive_4",
    pilot_size=[50, 100],
    model="VAE1-10",
    batch_frac=0.1,
    learning_rate=0.0005,
)

# Access individual runs
run = pilot.runs[(50, 1)]  # (pilot_size, draw_index)
print(run.generated_data.head())

# All runs overlaid on one plot per loss column
figs = pilot.plot_loss(style="overlay_runs")       # dict[str, Figure]

# Mean ± std across runs
figs = pilot.plot_loss(style="mean_band")  # dict[str, Figure]
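
To make the "mean ± std across runs" idea concrete, here is a small sketch of how such a band can be computed from several loss curves. It does not call SyNG-BTS. The loss values are made up, the function name mean_band is hypothetical, and the use of the sample standard deviation is an assumption about what the plot shows.

```python
from statistics import mean, stdev

# Made-up per-epoch loss values for three runs, for illustration only
curves = [
    [1.0, 0.8, 0.6],
    [1.2, 0.9, 0.7],
    [1.1, 0.7, 0.5],
]


def mean_band(curves):
    """Return per-epoch (center, lower, upper) lists across runs."""
    center, lower, upper = [], [], []
    for values in zip(*curves):  # one tuple of losses per epoch
        m = mean(values)
        s = stdev(values)  # sample standard deviation (an assumption)
        center.append(m)
        lower.append(m - s)
        upper.append(m + s)
    return center, lower, upper


center, lower, upper = mean_band(curves)
for epoch, (m, lo, hi) in enumerate(zip(center, lower, upper), start=1):
    print(f"epoch {epoch}: {m:.3f} (band {lo:.3f} to {hi:.3f})")
```

A narrow band suggests that training behaves similarly across random draws of the same pilot size; a wide band suggests sensitivity to which samples were drawn.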

Transfer Learning

from syng_bts import transfer

# Pre-train on PRAD, fine-tune and generate on BRCA
result = transfer(
    source_data="PRAD",
    target_data="BRCA",
    new_size=500,
    model="maf",
    apply_log=True,
    epoch=10,
)

print(result.generated_data.shape)
result.save("./transfer_output/")

Use DataFrame Input

import pandas as pd
from syng_bts import generate

my_data = pd.read_csv("my_dataset.csv")
result = generate(
    data=my_data,
    name="my_dataset",     # used in output filenames
    model="WGANGP",
    new_size=1000,
    epoch=50,
)

Evaluate Generated Data

from syng_bts import generate, resolve_data, heatmap_eval, UMAP_eval

result = generate(data="SKCMPositive_4", model="VAE1-10", epoch=5)
real_data, _groups = resolve_data("SKCMPositive_4")

# Built-in heatmap on result object
fig = result.plot_heatmap()

# Standalone evaluation comparing real vs generated
fig_heatmap = heatmap_eval(real_data=real_data.head(50), generated_data=result.generated_data.head(50))
fig_umap = UMAP_eval(real_data=real_data, generated_data=result.generated_data, random_seed=42)

Sample-Size Evaluation (SyntheSize)

Evaluate how classifier performance scales with sample size using the integrated SyntheSize methodology (R version, Python version):

from syng_bts import evaluate_sample_sizes, plot_sample_sizes, resolve_data

# Load real data with group labels
data, groups = resolve_data("BRCASubtypeSel_test")

# Evaluate classifiers at increasing sample sizes
metrics = evaluate_sample_sizes(
    data=data,
    sample_sizes=[50, 100, 150],
    groups=groups,
    n_draws=5,
    methods=["LOGIS", "RF", "XGB"],
)

# Plot inverse power-law learning curves
fig = plot_sample_sizes(metrics, n_target=200)
fig.savefig("learning_curves.png")

evaluate_sample_sizes applies log2(x + 1) by default (apply_log=True). Set apply_log=False if your inputs are already log-transformed.
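
The transform itself is simple arithmetic. The sketch below uses only the standard library and made-up counts to show what log2(x + 1) does to raw values, and why applying it a second time to already-transformed data compresses the values further. The helper name log2p1 is hypothetical.

```python
import math


def log2p1(x):
    """log2(x + 1): maps a zero count to 0 and compresses large counts."""
    return math.log2(x + 1)


raw_counts = [0, 1, 3, 1023]  # made-up raw counts
once = [log2p1(x) for x in raw_counts]
twice = [log2p1(x) for x in once]  # what happens if the log is applied again

print(once)  # [0.0, 1.0, 2.0, 10.0]
print([round(v, 3) for v in twice])
```

A raw count of 1023 becomes 10.0 after one pass but only about 3.459 after two, which is why apply_log=False matters for inputs that are already on the log scale.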

You can also pass a SyngResult directly — groups are auto-resolved:

from syng_bts import generate, evaluate_sample_sizes

result = generate(data="BRCASubtypeSel_train", model="CVAE1-20", epoch=50)
metrics = evaluate_sample_sizes(result, sample_sizes=[50, 100], which="generated")
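
As an illustration of the inverse power-law learning curves mentioned above, the sketch below evaluates one commonly used form, f(n) = (1 - a) - b * n^c with c < 0. Whether plot_sample_sizes fits exactly this parameterization is an assumption, and the parameter values are made up rather than fitted to any dataset.

```python
def inverse_power_law(n, a, b, c):
    """Predicted performance at sample size n.

    (1 - a) is the plateau approached as n grows, b scales the gap to the
    plateau, and c < 0 controls how quickly the gap shrinks.
    """
    return (1 - a) - b * n ** c


# Hypothetical parameters, chosen only for illustration
a, b, c = 0.05, 1.0, -0.5

for n in [50, 100, 150, 200]:
    print(f"n={n}: predicted accuracy {inverse_power_law(n, a, b, c):.3f}")
```

Under this form, predicted performance rises with n but with diminishing returns, which is what makes extrapolation to a target size such as n_target=200 meaningful.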

List Available Datasets

from syng_bts import list_bundled_datasets

print(list_bundled_datasets())
# ['SKCMPositive_4', 'BRCA', 'PRAD', 'BRCASubtypeSel', ...]

Available Models

Model	Description
`VAE1-10`	Variational Auto-Encoder with 1:10 loss ratio
`CVAE1-10`	Conditional VAE with 1:10 loss ratio
`GAN`	Standard Generative Adversarial Network
`WGANGP`	Wasserstein GAN with Gradient Penalty
`maf`	Masked Autoregressive Flow
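
The model names in the table combine a family with an optional ratio suffix. The sketch below shows one way to split such a string into its parts. The function parse_model_spec is hypothetical and is not the package's parser; it only illustrates the naming pattern visible in the table.

```python
import re


def parse_model_spec(spec):
    """Split a spec such as 'VAE1-10' into (family, ratio).

    Hypothetical parser for illustration; syng_bts may parse specs
    differently. ratio is a (left, right) tuple of ints, or None when the
    spec has no ratio suffix.
    """
    match = re.fullmatch(r"([A-Za-z]+)(?:(\d+)-(\d+))?", spec)
    if match is None:
        raise ValueError(f"unrecognized model spec: {spec!r}")
    family, left, right = match.groups()
    ratio = (int(left), int(right)) if left is not None else None
    return family, ratio


for spec in ["VAE1-10", "CVAE1-20", "WGANGP", "maf"]:
    print(spec, "->", parse_model_spec(spec))
```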

Dependencies

SyNG-BTS requires Python 3.10+ and the following packages:

torch (>=2.0.0)
pandas (>=1.5.0)
numpy (>=1.23.0)
scipy (>=1.9.0)
matplotlib (>=3.6.0)
seaborn (>=0.12.0)
scikit-learn (>=1.3.0)
xgboost (>=2.0.0)
tensorboardX (>=2.6.0)
umap-learn (>=0.5.6)
pyarrow (>=14.0.0)
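
To check an existing environment against these minimums, a short script along the following lines may help. It is a sketch using only the standard library. The helper names as_tuple and check are hypothetical, and the version comparison is deliberately simplified (it looks only at leading numeric components); the packaging library offers a more robust comparison.

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions copied from the list above
MINIMUMS = {
    "torch": "2.0.0", "pandas": "1.5.0", "numpy": "1.23.0",
    "scipy": "1.9.0", "matplotlib": "3.6.0", "seaborn": "0.12.0",
    "scikit-learn": "1.3.0", "xgboost": "2.0.0", "tensorboardX": "2.6.0",
    "umap-learn": "0.5.6", "pyarrow": "14.0.0",
}


def as_tuple(v):
    """Turn '2.1.0+cu118' or '1.5.0rc1' into a 3-part tuple of ints."""
    parts = []
    for piece in v.split("+")[0].split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts + [0] * (3 - len(parts)))  # pad '1.5' to (1, 5, 0)


def check(minimums):
    """Return {package: status} for each required package."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = as_tuple(installed) >= as_tuple(minimum)
        report[name] = "ok" if ok else f"too old ({installed} < {minimum})"
    return report


for name, status in check(MINIMUMS).items():
    print(f"{name}: {status}")
```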

Documentation

Full documentation is available at syng-bts.readthedocs.io.

Development

Quick Setup

git clone https://github.com/Omics-Data-Synthesis/SyNG-BTS
cd SyNG-BTS
make init-dev  # Install package + dev dependencies

Makefile Commands

The project includes a Makefile for common development tasks:

Command	Description
`make help`	Show all available commands
`make install`	Install package in editable mode
`make install-dev`	Install development dependencies
`make init-dev`	Full dev setup (install + dev deps in venv)
`make test`	Run tests with pytest
`make test-cov`	Run tests with coverage report
`make lint`	Check code with ruff
`make format`	Auto-format code with ruff
`make check`	Run lint + tests
`make docs`	Build documentation
`make clean`	Remove build artifacts

Citation

If you use SyNG-BTS in your research, please cite:

Qi Y, Wang X, Qin LX. Optimizing sample size for supervised machine learning with bulk transcriptomic sequencing: a learning curve approach. Brief Bioinform. 2025 Mar 4;26(2):bbaf097. doi: 10.1093/bib/bbaf097. PMID: 40072846; PMCID: PMC11899567.

BibTeX:

@article{qin2025optimizing,
  title = {Optimizing sample size for supervised machine learning with bulk transcriptomic sequencing: a learning curve approach},
  author = {Qi, Yunhui and Wang, Xinyi and Qin, Li-Xuan},
  journal = {Briefings in Bioinformatics},
  year = {2025},
  volume = {26},
  number = {2},
  pages = {bbaf097},
  doi = {10.1093/bib/bbaf097},
  url = {https://pmc.ncbi.nlm.nih.gov/articles/PMC11899567/}
}

License

SyNG-BTS is licensed under the GNU Affero General Public License v3.0.

Acknowledgments

This package was developed at Memorial Sloan Kettering Cancer Center. We thank Sebastian Raschka for the STAT 453 course materials that provided foundational concepts for the deep generative models.
