
Biologically structured variational autoencoder for gene expression data with PPI priors.


BSVAE: Biologically Structured Variational Autoencoder


BSVAE is a PyTorch package for Structured Factor Variational Autoencoders (StructuredFactorVAE).
It is designed for gene expression modeling with biological priors, integrating protein–protein interaction (PPI) networks and sparsity constraints for interpretable latent representations.


Features

  • Structured VAE architecture (StructuredFactorVAE)
    • Factorized encoder/decoder with group sparsity
    • Optional Laplacian regularization from PPI networks
  • Dataset utilities
    • Load gene expression matrices (GeneExpression)
    • Support for CSV full-matrix mode or pre-split train/test mode
  • Biological priors
    • Fetch and cache STRING PPI networks by NCBI TaxID
    • Map gene symbols / Ensembl IDs to protein IDs using MyGene.info or BioMart
  • Training and evaluation
    • Unified training loop (Trainer)
    • Evaluation with reconstruction, KL, sparsity, and Laplacian penalties (Evaluator)
  • Reproducibility
    • Save/load models + metadata (modelIO)
    • Configurable hyperparameters via hyperparam.ini
  • Post-training analysis
    • Gene–gene network extraction via decoder similarity, propagated covariance, Graphical Lasso, and Laplacian refinement
    • Latent export (mu, logvar) to CSV or AnnData for downstream workflows
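
As an illustration of the Laplacian regularization listed above, the following minimal sketch computes a graph-smoothness penalty tr(WᵀLW) with L = D − A. It assumes that the penalty acts on a gene-by-latent loading matrix W and that the PPI network is supplied as a symmetric adjacency matrix A; the function names are hypothetical and this is not BSVAE's actual implementation.

```python
import numpy as np


def graph_laplacian(adjacency: np.ndarray) -> np.ndarray:
    """Return the unnormalized Laplacian L = D - A of a symmetric adjacency matrix."""
    degree = np.diag(adjacency.sum(axis=1))
    return degree - adjacency


def laplacian_penalty(weights: np.ndarray, adjacency: np.ndarray) -> float:
    """Compute tr(W^T L W) for loadings W of shape (n_genes, latent_dim).

    Assumption: this is one common form of a graph-smoothness penalty;
    the form BSVAE uses internally may differ.
    """
    lap = graph_laplacian(adjacency)
    return float(np.trace(weights.T @ lap @ weights))


# Toy PPI graph: genes 0 and 1 interact, gene 2 has no partners.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])

# Toy loadings: one row per gene, one column per latent factor.
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [5.0, 5.0]])

penalty = laplacian_penalty(W, A)
```

The penalty grows only when interacting genes have different loadings, so the isolated gene contributes nothing regardless of its values.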

Installation

Install from PyPI:

pip install bsvae

Or install from source with Poetry:

git clone https://github.com/YOUR-LAB/BSVAE.git
cd BSVAE
poetry install

Dependencies:

  • Python 3.11+
  • PyTorch ≥ 2.8
  • pandas, numpy, scikit-learn
  • networkx, scipy
  • mygene (for gene annotation)

Quickstart

1. Prepare gene expression data

BSVAE expects CSV files laid out as genes × samples (genes as rows, samples as columns).

  • Full-matrix mode: Provide expr.csv with all samples; a 10-fold cross-validation split is created from it.
  • Pre-split mode: Provide a directory containing X_train.csv and X_test.csv.
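
For pre-split mode, the two files could be produced with a sketch like the one below. It assumes gene identifiers are stored in the DataFrame index and that a random 80/20 split by sample is acceptable; the helper name `split_samples` and the output directory are hypothetical, and the exact CSV layout BSVAE expects (for example, the index column) should be checked against the package.

```python
import numpy as np
import pandas as pd


def split_samples(expr: pd.DataFrame, test_fraction: float = 0.2, seed: int = 42):
    """Split a genes x samples matrix by sample (column) into train and test parts."""
    rng = np.random.default_rng(seed)
    samples = rng.permutation(expr.columns.to_numpy())
    n_test = max(1, int(round(len(samples) * test_fraction)))
    test_cols, train_cols = samples[:n_test], samples[n_test:]
    return expr[list(train_cols)], expr[list(test_cols)]


# Toy matrix: 3 genes (rows) x 5 samples (columns).
expr = pd.DataFrame(
    np.arange(15, dtype=float).reshape(3, 5),
    index=["GENE_A", "GENE_B", "GENE_C"],
    columns=[f"sample_{i}" for i in range(5)],
)

X_train, X_test = split_samples(expr, test_fraction=0.2)

# Writing the pre-split files (the directory name is a placeholder):
# X_train.to_csv("data/split/X_train.csv")
# X_test.to_csv("data/split/X_test.csv")
```

Both outputs keep every gene row, so the train and test files share the same gene ordering.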

2. Train a model

bsvae-train exp1 \
    --gene-expression-filename data/expr.csv \
    --epochs 50 \
    --latent-dim 10 \
    --ppi-taxid 9606

  • Results (checkpoints, logs, and metadata) are saved under results/exp1/.

3. Evaluate a trained model

bsvae-train exp1 \
    --gene-expression-filename data/expr.csv \
    --is-eval-only

4. Extract networks and export latents

bsvae-networks extract-networks \
    --model-path results/exp1 \
    --dataset data/expr.csv \
    --output-dir results/exp1/networks
# optional: --methods latent_cov graphical_lasso laplacian

bsvae-networks export-latents \
    --model-path results/exp1 \
    --dataset data/expr.csv \
    --output results/exp1/latents.h5ad

The extractor writes adjacency matrices, edge lists, and optional heatmaps for each requested method. By default the decoder-loading cosine similarity (w_similarity) is computed; add other methods with --methods. Latent exports include per-sample mu and logvar as tidy CSV or AnnData files.
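
To make the default method concrete, here is a minimal sketch of cosine similarity between decoder loadings, followed by a simple thresholded edge list. It assumes the decoder weights form a matrix with one row per gene and one column per latent factor; the function name and the 0.5 threshold are illustrative choices, not the behaviour of bsvae-networks.

```python
import numpy as np


def cosine_similarity_rows(weights: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Pairwise cosine similarity between the rows (genes) of a loading matrix."""
    norms = np.linalg.norm(weights, axis=1, keepdims=True)
    unit = weights / np.maximum(norms, eps)  # eps guards against all-zero rows
    return unit @ unit.T


# Toy loadings: genes 0 and 1 load on the same factor, gene 2 on another.
W = np.array([[1.0, 0.0],
              [2.0, 0.0],
              [0.0, 3.0]])

similarity = cosine_similarity_rows(W)

# Keep upper-triangle pairs above an arbitrary threshold as an edge list.
rows, cols = np.triu_indices(similarity.shape[0], k=1)
edges = [
    (int(i), int(j), float(similarity[i, j]))
    for i, j in zip(rows, cols)
    if similarity[i, j] > 0.5
]
```

Because cosine similarity ignores magnitude, genes 0 and 1 are linked even though their loadings differ in scale.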


Configuration

Hyperparameters can be set via hyperparam.ini:

[Custom]
seed = 42
no_cuda = False
epochs = 100
batch_size = 64
latent_dim = 10
hidden_dims = [128, 64]
dropout = 0.1
l1_strength = 1e-3
lap_strength = 1e-4

Override from CLI if needed:

bsvae-train my_experiment --epochs 50 --latent-dim 20
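
For scripts that need the same values outside the CLI, the [Custom] section shown above can be read with Python's standard configparser. This is a sketch under the assumption that the keys and types match the example file; `parse_hyperparams` is a hypothetical helper, and the final dictionary update only mimics a CLI override rather than reproducing how bsvae-train merges options.

```python
import ast
import configparser

INI_TEXT = """
[Custom]
seed = 42
no_cuda = False
epochs = 100
batch_size = 64
latent_dim = 10
hidden_dims = [128, 64]
dropout = 0.1
l1_strength = 1e-3
lap_strength = 1e-4
"""


def parse_hyperparams(text: str, section: str = "Custom") -> dict:
    """Parse the example hyperparameter section into typed Python values."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    raw = parser[section]
    return {
        "seed": raw.getint("seed"),
        "no_cuda": raw.getboolean("no_cuda"),
        "epochs": raw.getint("epochs"),
        "batch_size": raw.getint("batch_size"),
        "latent_dim": raw.getint("latent_dim"),
        # The list is stored as text, so evaluate it as a Python literal.
        "hidden_dims": ast.literal_eval(raw["hidden_dims"]),
        "dropout": raw.getfloat("dropout"),
        "l1_strength": raw.getfloat("l1_strength"),
        "lap_strength": raw.getfloat("lap_strength"),
    }


params = parse_hyperparams(INI_TEXT)

# Mimic "--epochs 50 --latent-dim 20" taking precedence over the file.
overridden = {**params, "epochs": 50, "latent_dim": 20}
```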

PPI Priors

BSVAE supports automatic download & caching of STRING v12.0 PPI networks.

  • Supported species (via NCBI TaxID):

    • Human (9606)
    • Mouse (10090)
    • Rat (10116)
    • Fly (7227)
  • Cache location defaults to ~/.bsvae/ppi (override via --ppi-cache).

Prefetch PPI cache from the CLI

Use the lightweight downloader to cache a STRING network ahead of training:

bsvae-download-ppi --taxid 9606 --cache-dir ~/.bsvae/ppi

Troubleshooting PPI downloads on HPC systems

On some clusters, HTTPS certificate verification fails for outbound downloads. If bsvae-download-ppi cannot reach STRING, manually cache the file with wget (or curl) using --no-check-certificate, which skips certificate verification, and point --ppi-cache to the same directory:

OUTDIR="$HOME/.bsvae/ppi"
mkdir -p "${OUTDIR}"
wget --no-check-certificate \
  "https://stringdb-static.org/download/protein.links.detailed.v12.0/9606.protein.links.detailed.v12.0.txt.gz" \
  -O "${OUTDIR}/9606_string.txt.gz"

Use curl -k -L "<url>" -o "${OUTDIR}/9606_string.txt.gz" if wget is unavailable.
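
Where neither wget nor curl is convenient, the same manual caching step can be sketched with the Python standard library. The URL pattern and the `<taxid>_string.txt.gz` file name are generalized from the human (9606) example above, which is an assumption for other species; all three function names are hypothetical and not part of BSVAE.

```python
import ssl
import urllib.request
from pathlib import Path

STRING_VERSION = "12.0"


def string_url(taxid: int, version: str = STRING_VERSION) -> str:
    """Build the STRING detailed-links URL, following the 9606 example."""
    name = f"protein.links.detailed.v{version}"
    return f"https://stringdb-static.org/download/{name}/{taxid}.{name}.txt.gz"


def cache_path(taxid: int, cache_dir: str = "~/.bsvae/ppi") -> Path:
    """Return the cache file path, assuming the <taxid>_string.txt.gz naming."""
    return Path(cache_dir).expanduser() / f"{taxid}_string.txt.gz"


def download_ppi(taxid: int, cache_dir: str = "~/.bsvae/ppi", verify: bool = True) -> Path:
    """Download the network into the cache directory and return the file path.

    verify=False skips certificate checks, like wget --no-check-certificate.
    """
    target = cache_path(taxid, cache_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    context = ssl.create_default_context()
    if not verify:
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(string_url(taxid), context=context) as response:
        target.write_bytes(response.read())
    return target


# Example call (performs a network request, so it is left commented out):
# download_ppi(9606, verify=False)
```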


Integration notes

  • The bsvae-networks workflows reuse the same gene ordering as training. When loading a standalone expression file, ensure columns correspond to the genes seen by the checkpoint.
  • The CLI automatically handles CPU/GPU placement based on availability; models are loaded in evaluation mode without modifying training metadata.
  • Network extraction functions are written to be test-friendly: they accept PyTorch DataLoader instances, operate without global state, and persist outputs as CSV/TSV/NPY for interoperability with graph toolchains.
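
The gene-ordering requirement in the first note can be checked before running bsvae-networks with a sketch like the following. It assumes the genes × samples layout from the Quickstart (genes in the DataFrame index) and that the training gene list is available as a plain Python list; transpose the matrix first if your file stores genes as columns. The helper `align_genes` is hypothetical.

```python
import pandas as pd


def align_genes(expr: pd.DataFrame, gene_order: list[str]) -> pd.DataFrame:
    """Reorder gene rows to match the training order, failing on missing genes."""
    missing = [gene for gene in gene_order if gene not in expr.index]
    if missing:
        raise ValueError(
            f"{len(missing)} training genes are missing, e.g. {missing[:5]}"
        )
    # Genes not seen during training are dropped; the rest follow gene_order.
    return expr.loc[gene_order]


# Toy standalone file: rows in a different order, plus one extra gene.
expr = pd.DataFrame(
    {"sample_1": [3.0, 1.0, 2.0, 9.0], "sample_2": [6.0, 4.0, 5.0, 9.0]},
    index=["GENE_C", "GENE_A", "GENE_B", "GENE_X"],
)
training_genes = ["GENE_A", "GENE_B", "GENE_C"]

aligned = align_genes(expr, training_genes)
```

Raising on missing genes, rather than filling them silently, keeps a mismatch between the dataset and the checkpoint visible.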

Citation

If you use BSVAE in your research, please cite:

@article{Benjamin2025bsvae,
  title={Structured Factor Variational Autoencoder with Biological Priors},
  author={Kynon J. M. Benjamin},
  year={2025},
  journal={N/A}
}

License

This project is licensed under the GNU General Public License v3.0.
