Spatial isoform statistical modeling (SPLISOSM)
Project description
SPLISOSM - Spatial Isoform Statistical Modeling
SPLISOSM (SPatiaL ISOform Statistical Modeling) is a Python package for analyzing isoform usage patterns in spatial transcriptomics data. It employs optimized spatial and isoform kernels to capture complex dependencies between spatial locations and transcript usage, maximizing statistical power while maintaining well-calibrated, permutation-free test statistics even with extremely sparse data. The differential usage tests are conditioned on spatial locations to reduce false positives due to spatial autocorrelation.
SPLISOSM accommodates isoform quantification results of both full-length isoforms from long-read sequencing and local transcript diversity events from short-read sequencing platforms including 10X Visium and Slide-seqV2.
This repository is under active development. Please check back later for detailed documentations, tutorials and end-to-end analysis examples using public long-read and short-read datasets.
Installation
Via GitHub (latest version):
pip install git+https://github.com/JiayuSuPKU/SPLISOSM.git#egg=splisosm
Via PyPI (stable version):
pip install splisosm
Quick start
The basic unit of SPLISOSM analysis is the isoform-level event, which can be a full-length transcript from long-read sequencing or a local variable structure from short-read 3'end sequencing. Given spatial isoform quantification results, SPLISOSM runs two types of tests:
- Spatial variability (SV): Find spatially variable transcript usage (HSIC-IR), transcript expression (HSIC-IC) or gene expression (HSIC-GC).
- Differential usage (DU): Testing the conditional association between transcript usage and spatial covariates such as spatial domains and expression of potential regulators like RNA binding proteins (RBPs).
Both SV and DU tests are multivariate performed at the gene level (i.e. one test per gene or per gene-covariate pair).
SPLISOSM takes the following inputs:
data: list of per-gene isoform quantification for 'n_gene' genes. Each element is a (n_spot, n_iso) tensor, where 'n_spot' is the number of spatial spots and 'n_iso' is the number of isoforms (or TREND events) of that given gene.coordinates: spatial coordinates. (n_spot, 2).
And additionally for differential usage testing:
covariates: spatial variable such as RBP expression to test for association. (n_spot, n_covariate).
See the Isoform quantification and data preparation section below for tips on isoform quantification and data preparation. Given an AnnData object with isoform-level quantification results, inputs for SPLISOSM can be extracted using the extract_counts_n_ratios function from splisosm.utils.
The output of the SV test is a data frame of per-gene test statistics and p-values, and the output of the DU test is a data frame of per-gene-covariate-pair test statistics and p-values.
Example data
A small demo dataset of Visium-ONT mouse olfactory bulb (SiT-MOB) can be downloaded from here via Dropbox (~100Mb) to test the package functionality.
mob_ont_filtered_1107.h5ad: AnnData object with isoform quantification results.mob_visium_rbp_1107.h5ad: AnnData object with short-read-based RBP gene expression.
import scanpy as sc
from splisosm.utils import extract_counts_n_ratios
adata_ont = sc.read("mob_ont_filtered_1107.h5ad")
adata_rbp = sc.read("mob_visium_rbp_1107.h5ad")
# prepare per gene isoform tensor list
# data[i] = (n_spot, n_iso) tensor for gene i
# assuming isoforms (adata_ont.var_names) are grouped by adata_ont.var['gene_symbol']
data, _, gene_names, _ = extract_counts_n_ratios(
adata_ont, layer = 'counts', group_iso_by = 'gene_symbol'
)
# spatial coordinates
coordinates = adata_ont.obs.loc[:, ['array_row', 'array_col']]
# prepare covariates for differential usage testing
adata_rbp = adata_rbp[adata_ont.obs_names, :].copy() # align the RBP data with the isoform data
# focus on spatially variably expressed RBPs only
# adata_rbp.var['is_visium_sve'] = adata_rbp.var['pvalue_adj_sparkx'] < 0.01
covariates = adata_rbp[:, adata_rbp.var['is_visium_sve']].layers['log1p'].toarray()
covariate_names = adata_rbp.var.loc[adata_rbp.var['is_visium_sve'], 'features']
Testing for spatial variability (SV)
SPLISOSM uses Hilbert-Schmidt Independence Criterion (HSIC) to test for kernel independence between isoform quantities and spatial coordinates. Specifically, we have the following SV tests for variability in three aspects:
- HSIC-IR for isoform relative ratio.
- HSIC-GC for total gene expression. This test is similar to SPARK-X but with a different spatial kernel design and improved statistical power.
- HSIC-IC for isoform count. Biologically, isoform expression variability is the joint result of variability in total gene expression and in isoform usage. In practice the results of HSIC-IC and HSIC-GC are usually similar.
from splisosm import SplisosmNP
# minimal input data
data = ... # list of length n_gene, each a (n_spot, n_iso) tensor
coordinates = ..., # (n_spot, 2), spatial coordinates
gene_names = ... # list of length n_gene, gene names
# initialize the model
model_np = SplisosmNP()
model_np.setup_data(data, coordinates, gene_names = gene_names)
# per-gene test for spatial variability
# method can be 'hsic-ir' (isoform ratio), 'hsic-ic' (isoform counts)
# 'hsic-gc' (gene counts), 'spark-x' (gene counts)
model_np.test_spatial_variability(
method = "hsic-ir",
ratio_transformation = 'none', # only applicable to 'hsic-ir', can be 'none', 'clr', 'ilr', 'alr', 'radial'
nan_filling = 'mean', # how to fill NaN values in the data, can be 'mean' (global mean), 'none' (ignoring NaN spots)
)
# extract the per-gene test statistics
df_sv_res = model_np.get_formatted_test_results(test_type = 'sv')
Testing for differential isoform usage (DU)
We implemented conditional association tests for isoform usage in both parametric and non-parametric settings.
The non-parametric test is again based on the HSIC kernel independence test and relies on Gaussian process regression for spatial conditioning.
# from splisosm.hyptest import IsoSDE # will be deprecated in the future
from splisosm.hyptest_np import SplisosmNP
# minimal input data
# after running the SV test, keep only SVS genes for DU testing
data_svs = ... # list of length n_gene, each a (n_spot, n_iso) tensor
gene_svs_names = ... # list of length n_gene, gene names
coordinates = ..., # (n_spot, 2), spatial coordinates
# covariates for differential usage testing
# e.g. spatial domains, RBP expression, etc.
covariates = ... # (n_spot, n_factor), design matrix of covariates
covariate_names = ... # list of length n_factor, covariate names
# initialize the model with covariates
model_np = SplisosmNP()
model_np.setup_data(
data_svs,
coordinates,
design_mtx = covariates,
gene_names = gene_svs_names,
covariate_names = covariate_names
)
# run the conditional HSIC test for differential usage
model_np.test_differential_usage(
method = "hsic-gp", # can be "hsic", "hsic-knn", "hsic-gp", "t-fisher", "t-tippett". See the function docstring for details.
ratio_transformation = 'none', nan_filling = 'mean', # same as above
hsic_eps = 1e-3, # regularization parameter kernel regression, only applicable to 'hsic'. If set to None, will be the unconditional HSIC test.
gp_configs = None, # dictionary of configs for the Gaussian process regression, only applicable to 'hsic-gp'
print_progress = True, return_results = False
)
# extract per gene-factor pair test statistics
df_du_res = model_np.get_formatted_test_results(test_type = 'du')
The parametric test is based on a generalized linear mixed model (GLMM) with the spatial random effect term following a Gaussian random field. The model is fitted using the PyTorch Lightning framework with marginal likelihood (i.e. integrating out the random effect) approximated by the Laplace's method at the mode.
from splisosm.hyptest_glmm import SplisosmGLMM
# parametric model fitting
model_p = SplisosmGLMM(
model_type = 'glmm-full', # can be 'glmm-full', 'glmm-null', 'glm'
share_variance = True,
var_parameterization_sigma_theta = True,
var_fix_sigma = False,
var_prior_model = "none",
var_prior_model_params = {},
init_ratio = "observed",
fitting_method = "joint_gd",
fitting_configs = {'max_epochs': -1}
)
model_p.setup_data(
data_svs, # list of length n_genes, each element is a (n_spots, n_isos) tensor
coordinates, # (n_spots, 2), 2D array/tensor of spatial coordinates
design_mtx = covariates, # (n_spots, n_covariates), 2D array/tensor of covariates
gene_names = gene_svs_names, # list of length n_genes, gene names
covariate_names = covariate_names, # list of length n_covariates, covariate names
group_gene_by_n_iso = True, # whether to group genes by the number of isoforms for batch processing
)
model_p.fit(
n_jobs = 2, # number of cores to use
batch_size = 20, # number of genes with the same number of isoforms to process in parallel per core
quiet=True, print_progress=True,
with_design_mtx = False, # fit the model without covariates for the score DU test
from_null = False, refit_null = True, # for LLR test
random_seed = None
)
model_p.save("model_p.pkl") # save the fitted model
per_gene_glmm_models = model_p.get_fitted_models() # list of length n_genes, each element is a fitted model
# differential usage testing
model_p.test_differential_usage(method = "score", print_progress = True, return_results = False)
# extract per gene-factor pair test statistics
df_du_res = model_p.get_formatted_test_results(test_type = 'du')
Isoform quantification and data preparation
If you have long-read spatial transcriptomics data, feel free to pick your favorite isoform quantification tools (e.g. IsoQuant) to get the isoform-level quantification results.
If you have 3'end short-read spatial transcriptomics data, we recommend using Sierra to extract transcriptome 3'end diversity (TREND) events de novo. Please refer to the Sierra documentation for detailed instructions on how to run Sierra. For multiple samples, we do NOT recommend running Sierra's MergePeakCoordinates function as it sometimes creates overlapping peaks. Note that the 10X Visium FFPE protocol uses targeted gene panels and thus does not provide isoform-specific information.
SPLISOSM is agnostic to isoform/event structure and will compare all events associated with the same gene. For computational efficiency, we recommend filtering out low-abundance isoforms/events before running SPLISOSM.
Here is an example of Sierra run with bam from SpaceRanger
library(Sierra)
FindPeaks(
output.file = '${peak_file}',
gtf.file = '${gtf_file}', # SpaceRanger gtf file
bamfile = '${bam_file}', # bam file from SpaceRanger
junctions.file = '${junc_file}', # junctions bed files extracted using regtools junctions extract
# optional arguments for retainning low-abundance peaks
# min.jcutoff.prop = 0.0,
# min.cov.prop = 0.0,
# min.peak.prop = 0.0
)
CountPeaks(
peak.sites.file = '${peak_file}',
gtf.file = '${gtf_file}',
bamfile = '${bam_file}',
whitelist.file = '${whitelist_file}', # barcodes.tsv file from SpaceRanger
output.dir = '${output_dir}',
)
Here is an example of data preprocessing using scanpy and anndata
import scanpy as sc
import pandas as pd
from splisosm.utils import load_visium_sp_meta
sierra_out_dir = "path/to/sierra/output" # 'output.dir' in CountPeaks
sp_meta_dir = "path/to/visium/spatial/metadata" # 'spatial' directory in 10X Visium data
# load the Sierra outputs as an AnnData object
adata = sc.read(
f"{sierra_out_dir}/matrix.mtx.gz",
cache_compression='cache_compression',
).T
# load TREND peak metadata
peaks = pd.read_csv(
f"{sierra_out_dir}/sitenames.tsv.gz",
header=None,
sep="\t",
)
df_var = peaks[0].str.split(':', expand = True)
df_var.columns = ['gene_symbol', 'chr', 'position', 'strand']
df_var.index = peaks[0].values
# load spatial barcode metadata
barcodes = pd.read_csv(f"{sierra_out_dir}/barcodes.tsv.gz", header=None)
# add metadata to the AnnData object
adata.var_names = peaks[0].values
adata.obs_names = barcodes[0].values
adata.var = df_var
adata.var['gene_id'] = adata.var['gene_symbol']
# load Visium spatial metadata
adata = load_visium_sp_meta(adata, f"{sp_meta_dir}/", library_id='adata_peak')
adata = adata[adata.obs['in_tissue'].astype(bool), :].copy()
# filter out lowly expressed peaks
sc.pp.filter_genes(adata, min_cells=0.01 * adata.shape[0])
# extract gene symbols and peak ids
df_iso_meta = adata.var.copy() # gene_symbol, chr, position, strand, gene_id
df_iso_meta['peak_id'] = adata.var_names
# prepare gene-level metadata
df_gene_meta = df_iso_meta.groupby('gene_symbol').size().reset_index(name='n_peak')
df_gene_meta = df_gene_meta.set_index('gene_symbol')
print(f"Number of spots: {adata.shape[0]}")
print(f"Number of genes before QC: {df_gene_meta.shape[0]}")
print(f"Number of peaks before QC: {adata.shape[1]}")
print(f"Average number of peaks per gene before QC: {adata.shape[1] / df_gene_meta.shape[0]}")
# calculate the total counts per gene
mapping_matrix = pd.get_dummies(df_iso_meta['gene_symbol'])
mapping_matrix = mapping_matrix.loc[df_iso_meta.index, df_gene_meta.index]
isog_counts = adata[:, mapping_matrix.index].layers['counts'] @ mapping_matrix
# calculate mean and sd of total gene counts
df_gene_meta['pct_spot_on'] = (isog_counts > 0).mean(axis = 0)
df_gene_meta['count_avg'] = isog_counts.mean(axis = 0)
df_gene_meta['count_std'] = isog_counts.std(axis = 0)
# filter out lowly expressed genes
_gene_keep = df_gene_meta['pct_spot_on'] > 0.01
# _gene_keep = (df_gene_meta['count_avg'] > 0.5) & _gene_keep
# filter out genes with single isoform
_gene_keep = (df_gene_meta['n_peak'] > 1) & _gene_keep
# filter for isoforms
_iso_keep = df_iso_meta['gene_symbol'].isin(df_gene_meta.index[_gene_keep])
# update feature meta
df_gene_meta = df_gene_meta.loc[_gene_keep, :]
adata = adata[:, _iso_keep]
adata.var = df_iso_meta.loc[_iso_keep, :].copy()
print(f"Number of genes after QC: {sum(_gene_keep)}")
print(f"Number of peaks after QC: {sum(_iso_keep)}")
print(f"Average number of peaks per gene after QC: {sum(_iso_keep) / sum(_gene_keep)}")
Citation
Su, Jiayu, et al. "A computational framework for mapping isoform landscape and regulatory mechanisms from spatial transcriptomics data." bioRxiv (2025): 2025-05. link to preprint
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file splisosm-1.0.2.tar.gz.
File metadata
- Download URL: splisosm-1.0.2.tar.gz
- Upload date:
- Size: 76.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
903937693c2ce62044dae4fa998d61e294eb3b5c97205d0c70a9466f2c9464de
|
|
| MD5 |
2aea4444436ec9b61511cf37de5e9645
|
|
| BLAKE2b-256 |
0b07e0dcd793f20a3843e2e1670fe3e44d9cb7d048455a82917b3e5f19cf6146
|
File details
Details for the file splisosm-1.0.2-py3-none-any.whl.
File metadata
- Download URL: splisosm-1.0.2-py3-none-any.whl
- Upload date:
- Size: 68.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
46d7326a47c898ca7d22af1a303d5f58b5b29f2008aa306028fb66616999076b
|
|
| MD5 |
67039c5ee7cd2095090b9ba06fc86a5b
|
|
| BLAKE2b-256 |
2d5ebf3bc6dd7c2a311e32604b22fffee3eb9b47b543c8730f18a589010a30c6
|