Biologically structured variational autoencoder for gene expression data with PPI priors.
BSVAE: Biologically Structured Variational Autoencoder
BSVAE is a PyTorch package for Structured Factor Variational Autoencoders (StructuredFactorVAE).
It is designed for gene expression modeling with biological priors, integrating protein–protein interaction (PPI) networks and sparsity constraints for interpretable latent representations.
Features
- Structured VAE architecture (StructuredFactorVAE)
  - Factorized encoder/decoder with group sparsity
  - Optional Laplacian regularization from PPI networks
- Dataset utilities
  - Load gene expression matrices (GeneExpression)
  - Support for CSV full-matrix mode or pre-split train/test mode
- Biological priors
  - Fetch and cache STRING PPI networks by NCBI TaxID
  - Map gene symbols / Ensembl IDs to protein IDs using MyGene.info or BioMart
- Training and evaluation
  - Unified training loop (Trainer)
  - Evaluation with reconstruction, KL, sparsity, and Laplacian penalties (Evaluator)
- Reproducibility
  - Save/load models + metadata (modelIO)
  - Configurable hyperparameters via hyperparam.ini
- Post-training analysis
  - Gene–gene network extraction via decoder similarity, propagated covariance, Graphical Lasso, and Laplacian refinement
  - Latent export (mu, logvar) to CSV or AnnData for downstream workflows
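To make the Laplacian regularization idea concrete, here is a minimal sketch of a graph smoothness penalty. It assumes the common form tr(W^T L W) with L = D - A, which equals the sum of squared differences between the loadings of connected genes. The function name, the toy gene names, and the choice of this exact form are assumptions for illustration; the README does not confirm how BSVAE computes its penalty.

```python
def laplacian_penalty(weights, edges):
    """Return tr(W^T L W) for an unweighted graph with Laplacian L = D - A.

    weights: dict mapping gene -> list of loadings (one per latent factor).
    edges:   iterable of (gene_a, gene_b) pairs, each undirected edge listed once.
    Uses the identity tr(W^T L W) = sum over edges of ||W_a - W_b||^2.
    """
    total = 0.0
    for a, b in edges:
        if a not in weights or b not in weights:
            continue  # skip proteins that are absent from the expression matrix
        total += sum((x - y) ** 2 for x, y in zip(weights[a], weights[b]))
    return total


# Toy example: three hypothetical genes, two latent factors (values are made up).
toy_weights = {
    "GENE_A": [1.0, 0.0],
    "GENE_B": [1.0, 0.0],
    "GENE_C": [0.0, 2.0],
}
toy_edges = [("GENE_A", "GENE_B"), ("GENE_B", "GENE_C")]
penalty = laplacian_penalty(toy_weights, toy_edges)
print(penalty)  # 5.0: the A-B edge adds 0, the B-C edge adds 1 + 4
```

Connected genes with identical loadings contribute nothing, so a penalty of this kind pushes interacting proteins toward similar latent factor loadings.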
Installation
Install from PyPI:
pip install bsvae
Or install from source with Poetry:
git clone https://github.com/YOUR-LAB/BSVAE.git
cd BSVAE
poetry install
Dependencies:
- Python 3.11+
- PyTorch ≥ 2.8
- pandas, numpy, scikit-learn
- networkx, scipy
- mygene (for gene annotation)
Quickstart
1. Prepare gene expression data
BSVAE expects genes × samples CSVs.
- Full-matrix mode: provide expr.csv with all samples; a 10-fold CV split is created.
- Pre-split mode: provide a directory with X_train.csv and X_test.csv.
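For pre-split mode, the two files can be produced from a single matrix by partitioning the sample columns. The sketch below assumes that the first column holds gene identifiers and the first row holds sample names; that layout, the helper name split_samples, and the 80/20 default are illustrative assumptions rather than documented BSVAE behavior.

```python
import csv
import random
from pathlib import Path


def split_samples(expr_csv, out_dir, test_fraction=0.2, seed=42):
    """Split a genes x samples CSV into X_train.csv and X_test.csv by sample column."""
    with open(expr_csv, newline="") as handle:
        rows = list(csv.reader(handle))

    # Column 0 is assumed to hold gene IDs, so samples start at column 1.
    sample_idx = list(range(1, len(rows[0])))
    random.Random(seed).shuffle(sample_idx)
    n_test = max(1, int(len(sample_idx) * test_fraction))
    test_idx = sorted(sample_idx[:n_test])
    train_idx = sorted(sample_idx[n_test:])

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, keep in (("X_train.csv", train_idx), ("X_test.csv", test_idx)):
        with open(out_dir / name, "w", newline="") as handle:
            writer = csv.writer(handle)
            for row in rows:
                # Keep the gene ID column plus the selected sample columns.
                writer.writerow([row[0]] + [row[i] for i in keep])
    return train_idx, test_idx
```

Calling split_samples("data/expr.csv", "data/split") would write both files into data/split, with every gene present in each file.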
2. Train a model
bsvae-train exp1 \
--gene-expression-filename data/expr.csv \
--epochs 50 \
--latent-dim 10 \
--ppi-taxid 9606
- Results (checkpoints, logs, metadata) are saved under results/exp1/.
3. Evaluate a trained model
bsvae-train exp1 \
--gene-expression-filename data/expr.csv \
--is-eval-only
4. Extract networks and export latents
bsvae-networks extract-networks \
--model-path results/exp1 \
--dataset data/expr.csv \
--output-dir results/exp1/networks
# optional: --methods latent_cov graphical_lasso laplacian
bsvae-networks export-latents \
--model-path results/exp1 \
--dataset data/expr.csv \
--output results/exp1/latents.h5ad
The extractor writes adjacency matrices, edge lists, and optional heatmaps for
each requested method. By default the decoder-loading cosine similarity
(w_similarity) is computed; add other methods with --methods. Latent exports include per-sample mu and
logvar as tidy CSV or AnnData files.
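As an illustration of what a decoder-loading cosine similarity involves, the sketch below computes pairwise cosine similarity between the rows of a small loading matrix. It assumes one row per gene and one column per latent factor; the function name and toy values are hypothetical, and the actual w_similarity implementation may differ (for example in thresholding or normalization).

```python
import math


def cosine_similarity_matrix(loadings):
    """Pairwise cosine similarity between rows of a genes x latent loading matrix."""
    norms = [math.sqrt(sum(v * v for v in row)) for row in loadings]
    n = len(loadings)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if norms[i] == 0.0 or norms[j] == 0.0:
                continue  # an all-zero row has no direction; leave similarity at 0
            dot = sum(a * b for a, b in zip(loadings[i], loadings[j]))
            sim[i][j] = dot / (norms[i] * norms[j])
    return sim


toy_loadings = [
    [1.0, 0.0],  # gene 0
    [2.0, 0.0],  # gene 1: same direction as gene 0
    [0.0, 3.0],  # gene 2: orthogonal to genes 0 and 1
]
similarity = cosine_similarity_matrix(toy_loadings)
```

Genes 0 and 1 load on the same factor and get similarity 1, while gene 2 is orthogonal to both and gets 0; a matrix like this can be read directly as a weighted adjacency matrix.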
Configuration
Hyperparameters can be set via hyperparam.ini:
[Custom]
seed = 42
no_cuda = False
epochs = 100
batch_size = 64
latent_dim = 10
hidden_dims = [128, 64]
dropout = 0.1
l1_strength = 1e-3
lap_strength = 1e-4
Override from CLI if needed:
bsvae-train my_experiment --epochs 50 --latent-dim 20
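The [Custom] section shown above is plain INI, so it can be inspected with the standard library. This sketch shows one way to read it and convert values such as hidden_dims = [128, 64] into Python objects; the helper read_section is hypothetical and is not BSVAE's own config loader.

```python
import ast
import configparser

EXAMPLE_INI = """
[Custom]
seed = 42
no_cuda = False
epochs = 100
batch_size = 64
latent_dim = 10
hidden_dims = [128, 64]
dropout = 0.1
l1_strength = 1e-3
lap_strength = 1e-4
"""


def read_section(text, section="Custom"):
    """Parse one INI section, converting values to Python literals where possible."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    parsed = {}
    for key, raw in parser[section].items():
        try:
            parsed[key] = ast.literal_eval(raw)  # handles 42, False, [128, 64], 1e-3
        except (ValueError, SyntaxError):
            parsed[key] = raw  # keep anything else as a plain string
    return parsed


config = read_section(EXAMPLE_INI)
print(config["hidden_dims"], config["l1_strength"])
```

Using ast.literal_eval rather than eval keeps parsing limited to literals, which is a reasonable default for config files shared between collaborators.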
PPI Priors
BSVAE supports automatic download & caching of STRING v12.0 PPI networks.
- Supported species (via NCBI TaxID):
  - Human (9606)
  - Mouse (10090)
  - Rat (10116)
  - Fly (7227)
- Cache location defaults to ~/.bsvae/ppi (override via --ppi-cache).
Prefetch PPI cache from the CLI
Use the lightweight downloader to cache a STRING network ahead of training:
bsvae-download-ppi --taxid 9606 --cache-dir ~/.bsvae/ppi
Troubleshooting PPI downloads on HPC systems
Some clusters cannot verify HTTPS certificates for outbound downloads. If bsvae-download-ppi cannot reach STRING, manually
cache the file with wget --no-check-certificate (or curl -k) and point --ppi-cache to the same directory:
OUTDIR="$HOME/.bsvae/ppi"
mkdir -p "${OUTDIR}"
wget --no-check-certificate \
"https://stringdb-static.org/download/protein.links.detailed.v12.0/9606.protein.links.detailed.v12.0.txt.gz" \
-O "${OUTDIR}/9606_string.txt.gz"
Use curl -k -L "<url>" -o "${OUTDIR}/9606_string.txt.gz" if wget is unavailable.
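After a manual download it can be worth confirming that the cached file parses before starting a training run. The sketch below reads a gzipped STRING links file and keeps edges above a score threshold. It assumes the whitespace-separated STRING layout with a header row naming protein1, protein2, and combined_score; the function name and the default threshold of 700 are illustrative choices, not part of BSVAE.

```python
import gzip


def read_string_edges(path, min_score=700):
    """Yield (protein1, protein2, combined_score) from a gzipped STRING links file.

    Assumes a header row whose column names include protein1, protein2,
    and combined_score, with whitespace-separated fields.
    """
    with gzip.open(path, "rt") as handle:
        header = handle.readline().split()
        p1_col = header.index("protein1")
        p2_col = header.index("protein2")
        score_col = header.index("combined_score")
        for line in handle:
            fields = line.split()
            if not fields:
                continue  # tolerate blank lines
            score = int(fields[score_col])
            if score >= min_score:
                yield fields[p1_col], fields[p2_col], score
```

For example, sum(1 for _ in read_string_edges(path)) with path pointing at the cached 9606_string.txt.gz counts the edges at or above the threshold without loading the whole file into memory.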
Integration notes
- The bsvae-networks workflows reuse the same gene ordering as training. When loading a standalone expression file, ensure columns correspond to the genes seen by the checkpoint.
- The CLI automatically handles CPU/GPU placement based on availability; models are loaded in evaluation mode without modifying training metadata.
- Network extraction functions are written to be test-friendly: they accept PyTorch DataLoader instances, operate without global state, and persist outputs as CSV/TSV/NPY for interoperability with graph toolchains.
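The gene-ordering requirement above can be checked before calling the network tools. This sketch reorders per-gene vectors from a standalone file to match a training gene list and fails loudly if any gene is missing. The helper name and inputs are hypothetical; how the training gene list is read from a checkpoint is not shown because the README does not describe it, and whether genes sit on rows or columns depends on the file layout.

```python
def align_to_training_genes(vectors, genes, training_genes):
    """Reorder per-gene vectors so they follow the training gene order.

    vectors:        list of per-gene value lists, parallel to `genes`.
    genes:          gene IDs in the order found in the standalone file.
    training_genes: gene IDs in the order used during training.
    """
    position = {gene: i for i, gene in enumerate(genes)}
    missing = [g for g in training_genes if g not in position]
    if missing:
        raise KeyError(f"{len(missing)} training genes missing, e.g. {missing[:3]}")
    # Genes present in the file but unseen during training are dropped.
    return [vectors[position[g]] for g in training_genes]


aligned = align_to_training_genes(
    vectors=[[1, 2], [3, 4], [5, 6]],
    genes=["GENE_A", "GENE_B", "GENE_C"],
    training_genes=["GENE_C", "GENE_A"],
)
```

Raising on missing genes, rather than silently filling zeros, makes a mismatch between the checkpoint and the expression file visible immediately.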
Citation
If you use BSVAE in your research, please cite:
@article{Benjamin2025bsvae,
title={Structured Factor Variational Autoencoder with Biological Priors},
author={Kynon J. M. Benjamin},
year={2025},
journal={N/A}
}
License
This project is licensed under the GNU General Public License v3.0.