Synthesis of Next Generation Bulk Transcriptomic Sequencing - A data augmentation tool for transcriptomics data using deep generative models

SyNG-BTS: Synthesis of Next Generation Bulk Transcriptomic Sequencing

Python 3.10+ | License: AGPL v3

SyNG-BTS is a Python package for data augmentation of bulk transcriptomic sequencing data using deep generative models. It synthesizes realistic transcriptomic data without relying on predefined formulas, enabling researchers to augment small pilot datasets for more robust machine learning analyses. SyNG-BTS supports three generative model families: variational autoencoders (VAE), generative adversarial networks (GAN), and flow-based models. These models are trained on a pilot dataset and can synthesize additional samples at any desired scale.

[Figure: SyNG-BTS workflow]

Features

  • Multiple Generative Models: VAE, CVAE, GAN, WGANGP, and flow-based models (MAF)
  • Unified API: Run pilot experiments, generate synthetic data, and perform transfer learning
  • DataFrame-First API: Accept pandas DataFrames, CSV file paths, or bundled dataset names
  • Rich Result Objects: SyngResult / PilotResult with built-in plotting and export
  • In-Memory Pipeline: No disk I/O by default — results stay in memory until you choose to save
  • Built-in Evaluation: Heatmap and UMAP visualization functions
  • Sample-Size Evaluation: Integrated SyntheSize methodology for classifier learning curves
  • Bundled Datasets: Example TCGA datasets included for immediate experimentation

Installation

From PyPI (Recommended)

pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ syng-bts

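Note: this command searches TestPyPI in addition to the main PyPI index, which supplies the dependencies. The instructions will be updated to a plain PyPI install upon release.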

From Source

git clone https://github.com/Omics-Data-Synthesis/SyNG-BTS
cd SyNG-BTS
pip install -e .

Optional Dependencies

For documentation building:

pip install "syng-bts[docs]"

For development (testing, linting):

pip install "syng-bts[dev]"

Quick Start

Generate Synthetic Data

from syng_bts import generate

# Train a VAE on bundled data and generate 500 synthetic samples
result = generate(
    data="SKCMPositive_4",   # bundled dataset name, CSV path, or DataFrame
    model="VAE1-10",         # model specification (type + kl_weight)
    new_size=500,            # number of synthetic samples
    batch_frac=0.1,          # batch fraction
    learning_rate=0.0005,    # learning rate
)

# Grouped-size controls
# - int: exact total sample count (grouped data is split by input ratio)
# - list[int]: explicit grouped counts [n_group_0, n_group_1]
#   where group_0 is the first group value encountered in input data.

# Access results in memory
print(result.generated_data.shape)   # (500, n_features)
print(result.loss.columns.tolist())  # ['kl', 'recons']
print(result.summary())

# Plot training loss (one figure per loss column)
figs = result.plot_loss()  # dict[str, Figure]

# Optionally save to disk
result.save("./my_output/")

# Load a previously saved result
from syng_bts import SyngResult
loaded = SyngResult.load("./my_output/")
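
The grouped-size comments above say that an integer size is split across groups by the input ratio. The exact rounding rule the package uses is not stated here, so the following is only a minimal sketch of one way such a split could work, using largest-remainder rounding. `split_by_input_ratio` is a hypothetical helper, not part of the syng_bts API, and the labels are made up.

```python
from collections import Counter


def split_by_input_ratio(labels, total):
    """Split `total` across groups in proportion to their frequency in `labels`.

    Hypothetical helper for illustration; not part of syng_bts.
    Groups are ordered by first appearance, and leftover samples go to the
    groups with the largest fractional remainders.
    """
    counts = Counter(labels)  # a Counter keeps first-seen order
    n = len(labels)
    exact = {g: total * c / n for g, c in counts.items()}
    alloc = {g: int(x) for g, x in exact.items()}  # floor of each share
    leftover = total - sum(alloc.values())
    # Hand out the remaining samples by descending fractional part
    by_remainder = sorted(exact, key=lambda g: exact[g] - alloc[g], reverse=True)
    for g in by_remainder[:leftover]:
        alloc[g] += 1
    return alloc


labels = ["tumor"] * 30 + ["normal"] * 10  # made-up labels with a 3:1 ratio
sizes = split_by_input_ratio(labels, 500)
print(sizes)  # {'tumor': 375, 'normal': 125}
```

Passing an explicit list of counts, as described above, avoids any dependence on a rounding rule.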

Run a Pilot Study

from syng_bts import pilot_study

# Sweep over multiple pilot sizes (5 random draws each)
pilot = pilot_study(
    data="SKCMPositive_4",
    pilot_size=[50, 100],
    model="VAE1-10",
    batch_frac=0.1,
    learning_rate=0.0005,
)

# Access individual runs
run = pilot.runs[(50, 1)]  # (pilot_size, draw_index)
print(run.generated_data.head())

# All runs overlaid on one plot per loss column
figs = pilot.plot_loss(style="overlay_runs")  # dict[str, Figure]

# Mean ± std across runs
figs = pilot.plot_loss(style="mean_band")  # dict[str, Figure]
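
To make the "mean ± std across runs" idea concrete, here is a minimal standard-library sketch of how such a band can be computed from several loss curves. It is not the package's plotting code: `mean_band` is a hypothetical helper, the loss values are made up, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def mean_band(curves):
    """Return per-epoch (mean, lower, upper) across equally long loss curves."""
    band = []
    for values in zip(*curves):  # values at the same epoch across runs
        m = mean(values)
        s = stdev(values) if len(values) > 1 else 0.0
        band.append((m, m - s, m + s))
    return band


# Made-up loss values for three runs over three epochs (illustrative only),
# keyed like pilot.runs by (pilot_size, draw_index)
runs = {
    (50, 1): [1.0, 0.8, 0.6],
    (50, 2): [1.2, 0.9, 0.7],
    (50, 3): [1.4, 1.0, 0.8],
}
for epoch, (m, lo, hi) in enumerate(mean_band(list(runs.values())), start=1):
    print(f"epoch {epoch}: mean={m:.2f}, band=[{lo:.2f}, {hi:.2f}]")
```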

Transfer Learning

from syng_bts import transfer

# Pre-train on PRAD, fine-tune and generate on BRCA
result = transfer(
    source_data="PRAD",
    target_data="BRCA",
    new_size=500,
    model="maf",
    apply_log=True,
    epoch=10,
)

print(result.generated_data.shape)
result.save("./transfer_output/")

Use DataFrame Input

import pandas as pd
from syng_bts import generate

my_data = pd.read_csv("my_dataset.csv")
result = generate(
    data=my_data,
    name="my_dataset",     # used in output filenames
    model="WGANGP",
    new_size=1000,
    epoch=50,
)

Evaluate Generated Data

from syng_bts import generate, resolve_data, heatmap_eval, UMAP_eval

result = generate(data="SKCMPositive_4", model="VAE1-10", epoch=5)
real_data, _groups = resolve_data("SKCMPositive_4")

# Built-in heatmap on result object
fig = result.plot_heatmap()

# Standalone evaluation comparing real vs generated
fig_heatmap = heatmap_eval(real_data=real_data.head(50), generated_data=result.generated_data.head(50))
fig_umap = UMAP_eval(real_data=real_data, generated_data=result.generated_data, random_seed=42)

Sample-Size Evaluation (SyntheSize)

Evaluate how classifier performance scales with sample size using the integrated SyntheSize methodology (R version, Python version):

from syng_bts import evaluate_sample_sizes, plot_sample_sizes, resolve_data

# Load real data with group labels
data, groups = resolve_data("BRCASubtypeSel_test")

# Evaluate classifiers at increasing sample sizes
metrics = evaluate_sample_sizes(
    data=data,
    sample_sizes=[50, 100, 150],
    groups=groups,
    n_draws=5,
    methods=["LOGIS", "RF", "XGB"],
)

# Plot inverse power-law learning curves
fig = plot_sample_sizes(metrics, n_target=200)
fig.savefig("learning_curves.png")

evaluate_sample_sizes applies log2(x + 1) by default (apply_log=True). Set apply_log=False if your inputs are already log-transformed.
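
As a small worked example of the transform named above, the sketch below applies log2(x + 1) to a few made-up counts and shows why it should not be applied twice. The helper names are illustrative and are not part of syng_bts.

```python
import math


def log2p1(x):
    """Forward transform: log2(x + 1)."""
    return math.log2(x + 1)


def inverse_log2p1(y):
    """Inverse transform: 2**y - 1."""
    return 2 ** y - 1


counts = [0, 1, 3, 1023]  # made-up raw counts
logged = [log2p1(c) for c in counts]
print(logged)  # [0.0, 1.0, 2.0, 10.0]

# Transforming already-logged values compresses them a second time
double_logged = [log2p1(v) for v in logged]
print(double_logged)
```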

You can also pass a SyngResult directly — groups are auto-resolved:

from syng_bts import generate, evaluate_sample_sizes

result = generate(data="BRCASubtypeSel_train", model="CVAE1-20", epoch=50)
metrics = evaluate_sample_sizes(result, sample_sizes=[50, 100], which="generated")

List Available Datasets

from syng_bts import list_bundled_datasets

print(list_bundled_datasets())
# ['SKCMPositive_4', 'BRCA', 'PRAD', 'BRCASubtypeSel', ...]

Available Models

Model      Description
VAE1-10    Variational Auto-Encoder with 1:10 loss ratio
CVAE1-10   Conditional VAE with 1:10 loss ratio
GAN        Standard Generative Adversarial Network
WGANGP     Wasserstein GAN with Gradient Penalty
maf        Masked Autoregressive Flow
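
The model names above combine a model family with an optional loss ratio, for example VAE1-10 for a VAE with a 1:10 ratio. How syng_bts parses these strings internally is not shown here. The sketch below is a hypothetical parser that only illustrates the naming convention.

```python
import re


def parse_model_spec(spec):
    """Split a spec such as 'VAE1-10' into (family, (left, right)).

    Hypothetical helper for illustration; syng_bts may parse specs differently.
    Specs without a ratio suffix, such as 'GAN' or 'maf', return (family, None).
    """
    match = re.fullmatch(r"([A-Za-z]+)(?:(\d+)-(\d+))?", spec)
    if match is None:
        raise ValueError(f"unrecognized model spec: {spec!r}")
    family, left, right = match.groups()
    ratio = (int(left), int(right)) if left is not None else None
    return family, ratio


print(parse_model_spec("VAE1-10"))   # ('VAE', (1, 10))
print(parse_model_spec("CVAE1-20"))  # ('CVAE', (1, 20))
print(parse_model_spec("WGANGP"))    # ('WGANGP', None)
```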

Dependencies

SyNG-BTS requires Python 3.10+ and the following packages:

  • torch (>=2.0.0)
  • pandas (>=1.5.0)
  • numpy (>=1.23.0)
  • scipy (>=1.9.0)
  • matplotlib (>=3.6.0)
  • seaborn (>=0.12.0)
  • scikit-learn (>=1.3.0)
  • xgboost (>=2.0.0)
  • tensorboardX (>=2.6.0)
  • umap-learn (>=0.5.6)
  • pyarrow (>=14.0.0)
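
To see which of these packages are present in the current environment, a short standard-library check such as the following can help. It is a sketch rather than part of SyNG-BTS: it only reports installed versions and does not compare them against the minimums listed above.

```python
import sys
from importlib.metadata import PackageNotFoundError, version

# Distribution names copied from the dependency list above
REQUIRED = [
    "torch", "pandas", "numpy", "scipy", "matplotlib", "seaborn",
    "scikit-learn", "xgboost", "tensorboardX", "umap-learn", "pyarrow",
]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


print("Python 3.10+:", sys.version_info >= (3, 10))
for name, ver in installed_versions(REQUIRED).items():
    print(f"{name}: {ver or 'not installed'}")
```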

Documentation

Full documentation is available at syng-bts.readthedocs.io.

Development

Quick Setup

git clone https://github.com/Omics-Data-Synthesis/SyNG-BTS
cd SyNG-BTS
make init-dev  # Install package + dev dependencies

Makefile Commands

The project includes a Makefile for common development tasks:

Command            Description
make help          Show all available commands
make install       Install package in editable mode
make install-dev   Install development dependencies
make init-dev      Full dev setup (install + dev deps in venv)
make test          Run tests with pytest
make test-cov      Run tests with coverage report
make lint          Check code with ruff
make format        Auto-format code with ruff
make check         Run lint + tests
make docs          Build documentation
make clean         Remove build artifacts

Citation

If you use SyNG-BTS in your research, please cite:

Qi Y, Wang X, Qin LX. Optimizing sample size for supervised machine learning with bulk transcriptomic sequencing: a learning curve approach. Brief Bioinform. 2025 Mar 4;26(2):bbaf097. doi: 10.1093/bib/bbaf097. PMID: 40072846; PMCID: PMC11899567.

BibTeX:

@article{qin2025optimizing,
  title = {Optimizing sample size for supervised machine learning with bulk transcriptomic sequencing: a learning curve approach},
  author = {Qi, Yunhui and Wang, Xinyi and Qin, Li-Xuan},
  journal = {Briefings in Bioinformatics},
  year = {2025},
  volume = {26},
  number = {2},
  pages = {bbaf097},
  doi = {10.1093/bib/bbaf097},
  url = {https://pmc.ncbi.nlm.nih.gov/articles/PMC11899567/}
}

License

SyNG-BTS is licensed under the GNU Affero General Public License v3.0.

Acknowledgments

This package was developed at Memorial Sloan Kettering Cancer Center. We thank Sebastian Raschka for the STAT 453 course materials that provided foundational concepts for the deep generative models.
