
Synthesis of Next Generation Bulk Transcriptomic Sequencing - A data augmentation tool for transcriptomics data using deep generative models


SyNG-BTS: Synthesis of Next Generation Bulk Transcriptomic Sequencing

Python 3.10+ License: AGPL v3

SyNG-BTS is a Python package for data augmentation of bulk transcriptomic sequencing data using deep generative models. It synthesizes realistic transcriptomic data without relying on predefined formulas, enabling researchers to augment small pilot datasets for more robust machine learning analyses. SyNG-BTS supports three generative model families: variational autoencoders (VAE), generative adversarial networks (GAN), and flow-based models. These models are trained on a pilot dataset and can synthesize additional samples at any desired scale.

[Figure: SyNG-BTS workflow]

Features

  • Multiple Generative Models: VAE, CVAE, GAN, WGANGP, and flow-based models (MAF)
  • Unified API: Run pilot experiments, generate synthetic data, and perform transfer learning
  • DataFrame-First API: Accept pandas DataFrames, CSV file paths, or bundled dataset names
  • Rich Result Objects: SyngResult / PilotResult with built-in plotting and export
  • In-Memory Pipeline: No disk I/O by default — results stay in memory until you choose to save
  • Built-in Evaluation: Heatmap and UMAP visualization functions
  • Sample-Size Evaluation: Integrated SyntheSize methodology for classifier learning curves
  • Bundled Datasets: Example TCGA datasets included for immediate experimentation

Installation

From PyPI (Recommended)

pip install syng-bts

From Source

git clone https://github.com/Omics-Data-Synthesis/SyNG-BTS
cd SyNG-BTS
pip install -e .

Optional Dependencies

For documentation building:

pip install syng-bts[docs]

For development (testing, linting):

pip install syng-bts[dev]

Quick Start

Generate Synthetic Data

from syng_bts import generate

# Train a VAE on bundled data and generate 500 synthetic samples
result = generate(
    data="SKCMPositive_4",   # bundled dataset name, CSV path, or DataFrame
    model="VAE1-10",         # model specification (type + kl_weight)
    new_size=500,            # number of synthetic samples
    batch_frac=0.1,          # batch fraction
    learning_rate=0.0005,    # learning rate
)

# Grouped-size controls
# - int: exact total sample count (grouped data is split by input ratio)
# - list[int]: explicit grouped counts [n_group_0, n_group_1]
#   where group_0 is the first group value encountered in input data.

# Access results in memory
print(result.generated_data.shape)   # (500, n_features)
print(result.loss.columns.tolist())  # ['kl', 'recons']
print(result.summary())

# Plot training loss (one figure per loss column)
figs = result.plot_loss()  # dict[str, Figure]

# Optionally save to disk
result.save("./my_output/")

# Load a previously saved result
from syng_bts import SyngResult
loaded = SyngResult.load("./my_output/")
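
The grouped-size comments above say that an integer new_size is split across groups by their ratio in the input. As a rough illustration of what such a split could look like, here is a minimal standalone sketch. The function name split_by_ratio and the largest-remainder rounding rule are assumptions made for this example; they are not taken from the SyNG-BTS source, which may round differently.

```python
from collections import Counter


def split_by_ratio(labels, total):
    """Split `total` across groups in proportion to their share of `labels`.

    Groups keep the order in which they first appear. Largest-remainder
    rounding is used so the counts always sum exactly to `total`.
    """
    counts = Counter(labels)  # preserves first-seen order of group values
    n = len(labels)
    exact = {g: total * c / n for g, c in counts.items()}
    alloc = {g: int(v) for g, v in exact.items()}  # round down first
    leftover = total - sum(alloc.values())
    # Give the remaining samples to the groups with the largest fractional parts
    by_remainder = sorted(exact, key=lambda g: exact[g] - alloc[g], reverse=True)
    for g in by_remainder[:leftover]:
        alloc[g] += 1
    return alloc


# Hypothetical group labels: 30 samples in one group, 10 in the other
labels = ["tumor"] * 30 + ["normal"] * 10
print(split_by_ratio(labels, 500))  # {'tumor': 375, 'normal': 125}
```

Passing an explicit list such as [375, 125] to new_size would, per the comments above, set these per-group counts directly instead.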

Run a Pilot Study

from syng_bts import pilot_study

# Sweep over multiple pilot sizes (5 random draws each)
pilot = pilot_study(
    data="SKCMPositive_4",
    pilot_size=[50, 100],
    model="VAE1-10",
    batch_frac=0.1,
    learning_rate=0.0005,
)

# Access individual runs
run = pilot.runs[(50, 1)]  # (pilot_size, draw_index)
print(run.generated_data.head())

# All runs overlaid on one plot per loss column
figs = pilot.plot_loss(style="overlay_runs")  # dict[str, Figure]

# Mean ± std across runs
figs = pilot.plot_loss(style="mean_band")  # dict[str, Figure]
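
To make the "mean ± std across runs" idea concrete, the sketch below computes a per-epoch mean and standard deviation from a few loss curves keyed the same way as pilot.runs, that is by (pilot_size, draw_index). The loss values are made up, the helper mean_band is hypothetical, and the use of the sample standard deviation is an assumption; the library's own plotting code may aggregate differently.

```python
from statistics import mean, stdev

# Made-up per-run loss curves keyed by (pilot_size, draw_index)
losses = {
    (50, 1): [1.0, 0.8, 0.6],
    (50, 2): [1.2, 0.9, 0.7],
    (50, 3): [1.1, 0.7, 0.5],
}


def mean_band(curves):
    """Return per-epoch (mean, sample std) across equally long loss curves."""
    per_epoch = zip(*curves)  # transpose: one tuple of run values per epoch
    return [(mean(vals), stdev(vals)) for vals in per_epoch]


# Collect all draws for one pilot size, then aggregate them epoch by epoch
band = mean_band([curve for (size, _), curve in losses.items() if size == 50])
for epoch, (m, s) in enumerate(band, start=1):
    print(f"epoch {epoch}: {m:.2f} ± {s:.2f}")
```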

Transfer Learning

from syng_bts import transfer

# Pre-train on PRAD, fine-tune and generate on BRCA
result = transfer(
    source_data="PRAD",
    target_data="BRCA",
    new_size=500,
    model="maf",
    apply_log=True,
    epoch=10,
)

print(result.generated_data.shape)
result.save("./transfer_output/")

Use DataFrame Input

import pandas as pd
from syng_bts import generate

my_data = pd.read_csv("my_dataset.csv")
result = generate(
    data=my_data,
    name="my_dataset",     # used in output filenames
    model="WGANGP",
    new_size=1000,
    epoch=50,
)

Evaluate Generated Data

from syng_bts import generate, resolve_data, heatmap_eval, UMAP_eval

result = generate(data="SKCMPositive_4", model="VAE1-10", epoch=5)
real_data, _groups = resolve_data("SKCMPositive_4")

# Built-in heatmap on result object
fig = result.plot_heatmap()

# Standalone evaluation comparing real vs generated
fig_heatmap = heatmap_eval(real_data=real_data.head(50), generated_data=result.generated_data.head(50))
fig_umap = UMAP_eval(real_data=real_data, generated_data=result.generated_data, random_seed=42)

Sample-Size Evaluation (SyntheSize)

Evaluate how classifier performance scales with sample size using the integrated SyntheSize methodology, which is also available as standalone R and Python versions:

from syng_bts import evaluate_sample_sizes, plot_sample_sizes, resolve_data

# Load real data with group labels
data, groups = resolve_data("BRCASubtypeSel_test")

# Evaluate classifiers at increasing sample sizes
metrics = evaluate_sample_sizes(
    data=data,
    sample_sizes=[50, 100, 150],
    groups=groups,
    n_draws=5,
    methods=["LOGIS", "RF", "XGB"],
)

# Plot inverse power-law learning curves
fig = plot_sample_sizes(metrics, n_target=200)
fig.savefig("learning_curves.png")
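
For readers unfamiliar with inverse power-law learning curves, the following standalone sketch fits a deliberately simplified two-parameter form, error(n) = b * n**(-c), and extrapolates it to a larger sample size. This is an assumption-laden illustration: the exact curve form and fitting procedure used by plot_sample_sizes are not stated here, the helper fit_power_law is hypothetical, and the error rates are invented.

```python
import math


def fit_power_law(sizes, errors):
    """Least-squares fit of error(n) = b * n**(-c) as a line in log-log space."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return math.exp(intercept), -slope  # (b, c)


# Invented error rates (1 - accuracy) at the sample sizes used above
sizes = [50, 100, 150]
errors = [0.20, 0.15, 0.13]

b, c = fit_power_law(sizes, errors)
predicted_error = b * 200 ** (-c)  # extrapolate to n = 200
print(f"b={b:.3f}, c={c:.3f}, predicted error at n=200: {predicted_error:.3f}")
```

Fixing the asymptote at zero error keeps the fit linear in log space; a three-parameter curve with a nonzero error floor would need a nonlinear optimizer instead.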

evaluate_sample_sizes applies log2(x + 1) by default (apply_log=True). Set apply_log=False if your inputs are already log-transformed.
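
The short sketch below shows why this flag matters: applying log2(x + 1) a second time to already-transformed values compresses them further. It uses only the standard library and made-up counts; the helper log2p1 is a hypothetical stand-in written for this example, not a function exported by syng_bts.

```python
import math


def log2p1(values):
    """Apply log2(x + 1) element-wise to non-negative values."""
    return [math.log2(v + 1) for v in values]


raw_counts = [0, 1, 7, 1023]  # made-up raw expression counts
once = log2p1(raw_counts)     # the intended transform
twice = log2p1(once)          # accidental second transform

print(once)   # [0.0, 1.0, 3.0, 10.0]
print(twice)  # the largest value shrinks from 10.0 to about 3.46
```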

You can also pass a SyngResult directly — groups are auto-resolved:

from syng_bts import generate, evaluate_sample_sizes

result = generate(data="BRCASubtypeSel_train", model="CVAE1-20", epoch=50)
metrics = evaluate_sample_sizes(result, sample_sizes=[50, 100], which="generated")

List Available Datasets

from syng_bts import list_bundled_datasets

print(list_bundled_datasets())
# ['SKCMPositive_4', 'BRCA', 'PRAD', 'BRCASubtypeSel', ...]

Available Models

Model      Description
VAE1-10    Variational Auto-Encoder with 1:10 loss ratio
CVAE1-10   Conditional VAE with 1:10 loss ratio
GAN        Standard Generative Adversarial Network
WGANGP     Wasserstein GAN with Gradient Penalty
maf        Masked Autoregressive Flow
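
The model names above appear to combine a model family with an optional loss ratio (for example, VAE plus 1-10). As an illustration of that naming pattern only, the sketch below splits such a string into its parts. The function parse_model_spec and its regular expression are assumptions made for this example; the way SyNG-BTS actually interprets the model argument may differ.

```python
import re


def parse_model_spec(spec):
    """Split a spec such as 'VAE1-10' into (family, ratio).

    `ratio` is a pair of ints, or None when the spec has no ratio suffix.
    """
    match = re.fullmatch(r"([A-Za-z]+)(?:(\d+)-(\d+))?", spec)
    if match is None:
        raise ValueError(f"unrecognized model spec: {spec!r}")
    family, left, right = match.groups()
    ratio = (int(left), int(right)) if left is not None else None
    return family, ratio


for spec in ["VAE1-10", "CVAE1-10", "GAN", "WGANGP", "maf"]:
    print(spec, "->", parse_model_spec(spec))
```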

Dependencies

SyNG-BTS requires Python 3.10+ and the following packages:

  • torch (>=2.0.0)
  • pandas (>=1.5.0)
  • numpy (>=1.23.0)
  • scipy (>=1.9.0)
  • matplotlib (>=3.6.0)
  • seaborn (>=0.12.0)
  • scikit-learn (>=1.3.0)
  • xgboost (>=2.0.0)
  • tensorboardX (>=2.6.0)
  • umap-learn (>=0.5.6)
  • pyarrow (>=14.0.0)

Documentation

Full documentation is available at syng-bts.readthedocs.io.

Development

Quick Setup

git clone https://github.com/Omics-Data-Synthesis/SyNG-BTS
cd SyNG-BTS
make init-dev  # Install package + dev dependencies

Makefile Commands

The project includes a Makefile for common development tasks:

Command            Description
make help          Show all available commands
make install       Install package in editable mode
make install-dev   Install development dependencies
make init-dev      Full dev setup (install + dev deps in venv)
make test          Run tests with pytest
make test-cov      Run tests with coverage report
make lint          Check code with ruff
make format        Auto-format code with ruff
make check         Run lint + tests
make docs          Build documentation
make clean         Remove build artifacts

Citation

If you use SyNG-BTS in your research, please cite:

Qi Y, Wang X, Qin LX. Optimizing sample size for supervised machine learning with bulk transcriptomic sequencing: a learning curve approach. Brief Bioinform. 2025 Mar 4;26(2):bbaf097. doi: 10.1093/bib/bbaf097. PMID: 40072846; PMCID: PMC11899567.

BibTeX:

@article{qin2025optimizing,
  title = {Optimizing sample size for supervised machine learning with bulk transcriptomic sequencing: a learning curve approach},
  author = {Qi, Yunhui and Wang, Xinyi and Qin, Li-Xuan},
  journal = {Briefings in Bioinformatics},
  year = {2025},
  volume = {26},
  number = {2},
  pages = {bbaf097},
  doi = {10.1093/bib/bbaf097},
  url = {https://pmc.ncbi.nlm.nih.gov/articles/PMC11899567/}
}

License

SyNG-BTS is licensed under the GNU Affero General Public License v3.0.

Acknowledgments

This package was developed at Memorial Sloan Kettering Cancer Center. We thank Sebastian Raschka for the STAT 453 course materials that provided foundational concepts for the deep generative models.
