Skip to main content

Spatial isoform statistical modeling (SPLISOSM)

Project description

SPLISOSM - Spatial Isoform Statistical Modeling

overview

SPLISOSM (SPatiaL ISOform Statistical Modeling) is a Python package for analyzing isoform usage patterns in spatial transcriptomics data. It employs optimized spatial and isoform kernels to capture complex dependencies between spatial locations and transcript usage, maximizing statistical power while maintaining well-calibrated, permutation-free test statistics even with extremely sparse data. The differential usage tests are conditioned on spatial locations to reduce false positives due to spatial autocorrelation.

SPLISOSM accommodates isoform quantification results of both full-length isoforms from long-read sequencing and local transcript diversity events from short-read sequencing platforms including 10X Visium and Slide-seqV2.

This repository is under active development. Please check back later for detailed documentations, tutorials and end-to-end analysis examples using public long-read and short-read datasets.

Installation

Via GitHub (latest version):

pip install git+https://github.com/JiayuSuPKU/SPLISOSM.git#egg=splisosm

Via PyPI (stable version):

pip install splisosm

Quick start

The basic unit of SPLISOSM analysis is the isoform-level event, which can be a full-length transcript from long-read sequencing or a local variable structure from short-read 3'end sequencing. Given spatial isoform quantification results, SPLISOSM runs two types of tests:

  1. Spatial variability (SV): Find spatially variable transcript usage (HSIC-IR), transcript expression (HSIC-IC) or gene expression (HSIC-GC).
  2. Differential usage (DU): Testing the conditional association between transcript usage and spatial covariates such as spatial domains and expression of potential regulators like RNA binding proteins (RBPs).

Both SV and DU tests are multivariate performed at the gene level (i.e. one test per gene or per gene-covariate pair).

SPLISOSM takes the following inputs:

  • data: list of per-gene isoform quantification for 'n_gene' genes. Each element is a (n_spot, n_iso) tensor, where 'n_spot' is the number of spatial spots and 'n_iso' is the number of isoforms (or TREND events) of that given gene.
  • coordinates: spatial coordinates. (n_spot, 2).

And additionally for differential usage testing:

  • covariates: spatial variable such as RBP expression to test for association. (n_spot, n_covariate).

See the Isoform quantification and data preparation section below for tips on isoform quantification and data preparation. Given an AnnData object with isoform-level quantification results, inputs for SPLISOSM can be extracted using the extract_counts_n_ratios function from splisosm.utils.

The output of the SV test is a data frame of per-gene test statistics and p-values, and the output of the DU test is a data frame of per-gene-covariate-pair test statistics and p-values.

Example data

A small demo dataset of Visium-ONT mouse olfactory bulb (SiT-MOB) can be downloaded from here via Dropbox (~100Mb) to test the package functionality.

  • mob_ont_filtered_1107.h5ad: AnnData object with isoform quantification results.
  • mob_visium_rbp_1107.h5ad: AnnData object with short-read-based RBP gene expression.
import scanpy as sc
from splisosm.utils import extract_counts_n_ratios

adata_ont = sc.read("mob_ont_filtered_1107.h5ad")
adata_rbp = sc.read("mob_visium_rbp_1107.h5ad")

# prepare per gene isoform tensor list
# data[i] = (n_spot, n_iso) tensor for gene i
# assuming isoforms (adata_ont.var_names) are grouped by adata_ont.var['gene_symbol']
data, _, gene_names, _ = extract_counts_n_ratios(
    adata_ont, layer = 'counts', group_iso_by = 'gene_symbol'
)

# spatial coordinates
coordinates = adata_ont.obs.loc[:, ['array_row', 'array_col']]

# prepare covariates for differential usage testing
adata_rbp = adata_rbp[adata_ont.obs_names, :].copy() # align the RBP data with the isoform data
# focus on spatially variably expressed RBPs only
# adata_rbp.var['is_visium_sve'] = adata_rbp.var['pvalue_adj_sparkx'] < 0.01
covariates = adata_rbp[:, adata_rbp.var['is_visium_sve']].layers['log1p'].toarray()
covariate_names = adata_rbp.var.loc[adata_rbp.var['is_visium_sve'], 'features']

Testing for spatial variability (SV)

SPLISOSM uses Hilbert-Schmidt Independence Criterion (HSIC) to test for kernel independence between isoform quantities and spatial coordinates. Specifically, we have the following SV tests for variability in three aspects:

  1. HSIC-IR for isoform relative ratio.
  2. HSIC-GC for total gene expression. This test is similar to SPARK-X but with a different spatial kernel design and improved statistical power.
  3. HSIC-IC for isoform count. Biologically, isoform expression variability is the joint result of variability in total gene expression and in isoform usage. In practice the results of HSIC-IC and HSIC-GC are usually similar.
from splisosm import SplisosmNP

# minimal input data
data = ... # list of length n_gene, each a (n_spot, n_iso) tensor
coordinates = ..., # (n_spot, 2), spatial coordinates
gene_names = ... # list of length n_gene, gene names

# initialize the model
model_np = SplisosmNP()
model_np.setup_data(data, coordinates, gene_names = gene_names)

# per-gene test for spatial variability
# method can be 'hsic-ir' (isoform ratio), 'hsic-ic' (isoform counts)
# 'hsic-gc' (gene counts), 'spark-x' (gene counts)
model_np.test_spatial_variability(
    method = "hsic-ir",
    ratio_transformation = 'none', # only applicable to 'hsic-ir', can be 'none', 'clr', 'ilr', 'alr', 'radial'
    nan_filling = 'mean', # how to fill NaN values in the data, can be 'mean' (global mean), 'none' (ignoring NaN spots)
)

# extract the per-gene test statistics
df_sv_res = model_np.get_formatted_test_results(test_type = 'sv')

Testing for differential isoform usage (DU)

We implemented conditional association tests for isoform usage in both parametric and non-parametric settings.

The non-parametric test is again based on the HSIC kernel independence test and relies on Gaussian process regression for spatial conditioning.

# from splisosm.hyptest import IsoSDE # will be deprecated in the future
from splisosm.hyptest_np import SplisosmNP

# minimal input data
# after running the SV test, keep only SVS genes for DU testing
data_svs = ... # list of length n_gene, each a (n_spot, n_iso) tensor
gene_svs_names = ... # list of length n_gene, gene names
coordinates = ..., # (n_spot, 2), spatial coordinates

# covariates for differential usage testing
# e.g. spatial domains, RBP expression, etc.
covariates = ... # (n_spot, n_factor), design matrix of covariates
covariate_names = ... # list of length n_factor, covariate names

# initialize the model with covariates
model_np = SplisosmNP()
model_np.setup_data(
    data_svs,
    coordinates,
    design_mtx = covariates,
    gene_names = gene_svs_names,
    covariate_names = covariate_names
)

# run the conditional HSIC test for differential usage
model_np.test_differential_usage(
    method = "hsic-gp", # can be "hsic", "hsic-knn", "hsic-gp", "t-fisher", "t-tippett". See the function docstring for details.
    ratio_transformation = 'none', nan_filling = 'mean', # same as above
    hsic_eps = 1e-3, # regularization parameter kernel regression, only applicable to 'hsic'. If set to None, will be the unconditional HSIC test.
    gp_configs = None, # dictionary of configs for the Gaussian process regression, only applicable to 'hsic-gp'
    print_progress = True, return_results = False
)

# extract per gene-factor pair test statistics
df_du_res = model_np.get_formatted_test_results(test_type = 'du')

The parametric test is based on a generalized linear mixed model (GLMM) with the spatial random effect term following a Gaussian random field. The model is fitted using the PyTorch Lightning framework with marginal likelihood (i.e. integrating out the random effect) approximated by the Laplace's method at the mode.

from splisosm.hyptest_glmm import SplisosmGLMM

# parametric model fitting
model_p = SplisosmGLMM(
    model_type = 'glmm-full', # can be 'glmm-full', 'glmm-null', 'glm'
    share_variance = True,
    var_parameterization_sigma_theta = True,
    var_fix_sigma = False,
    var_prior_model = "none",
    var_prior_model_params = {},
    init_ratio = "observed",
    fitting_method = "joint_gd",
    fitting_configs = {'max_epochs': -1}
)
model_p.setup_data(
    data_svs, # list of length n_genes, each element is a (n_spots, n_isos) tensor
    coordinates, # (n_spots, 2), 2D array/tensor of spatial coordinates
    design_mtx = covariates, # (n_spots, n_covariates), 2D array/tensor of covariates
    gene_names = gene_svs_names, # list of length n_genes, gene names
    covariate_names = covariate_names, # list of length n_covariates, covariate names
    group_gene_by_n_iso = True, # whether to group genes by the number of isoforms for batch processing
)
model_p.fit(
    n_jobs = 2, # number of cores to use
    batch_size = 20, # number of genes with the same number of isoforms to process in parallel per core
    quiet=True, print_progress=True,
    with_design_mtx = False, # fit the model without covariates for the score DU test
    from_null = False, refit_null = True, # for LLR test
    random_seed = None
)
model_p.save("model_p.pkl") # save the fitted model
per_gene_glmm_models = model_p.get_fitted_models() # list of length n_genes, each element is a fitted model

# differential usage testing
model_p.test_differential_usage(method = "score", print_progress = True, return_results = False)

# extract per gene-factor pair test statistics
df_du_res = model_p.get_formatted_test_results(test_type = 'du')

Isoform quantification and data preparation

If you have long-read spatial transcriptomics data, feel free to pick your favorite isoform quantification tools (e.g. IsoQuant) to get the isoform-level quantification results.

If you have 3'end short-read spatial transcriptomics data, we recommend using Sierra to extract transcriptome 3'end diversity (TREND) events de novo. Please refer to the Sierra documentation for detailed instructions on how to run Sierra. For multiple samples, we do NOT recommend running Sierra's MergePeakCoordinates function as it sometimes creates overlapping peaks. Note that the 10X Visium FFPE protocol uses targeted gene panels and thus does not provide isoform-specific information.

SPLISOSM is agnostic to isoform/event structure and will compare all events associated with the same gene. For computational efficiency, we recommend filtering out low-abundance isoforms/events before running SPLISOSM.

Here is an example of Sierra run with bam from SpaceRanger
library(Sierra)
FindPeaks(
  output.file = '${peak_file}', 
  gtf.file = '${gtf_file}', # SpaceRanger gtf file
  bamfile = '${bam_file}', # bam file from SpaceRanger
  junctions.file = '${junc_file}', # junctions bed files extracted using regtools junctions extract
# optional arguments for retainning low-abundance peaks
#   min.jcutoff.prop = 0.0,
#   min.cov.prop = 0.0, 
#   min.peak.prop = 0.0
)

CountPeaks(
  peak.sites.file = '${peak_file}', 
  gtf.file = '${gtf_file}', 
  bamfile = '${bam_file}', 
  whitelist.file = '${whitelist_file}', # barcodes.tsv file from SpaceRanger
  output.dir = '${output_dir}',
)
Here is an example of data preprocessing using scanpy and anndata
import scanpy as sc
import pandas as pd

from splisosm.utils import load_visium_sp_meta

sierra_out_dir = "path/to/sierra/output" # 'output.dir' in CountPeaks
sp_meta_dir = "path/to/visium/spatial/metadata" # 'spatial' directory in 10X Visium data

# load the Sierra outputs as an AnnData object
adata = sc.read(
    f"{sierra_out_dir}/matrix.mtx.gz",
    cache_compression='cache_compression',
).T

# load TREND peak metadata
peaks = pd.read_csv(
    f"{sierra_out_dir}/sitenames.tsv.gz",
    header=None,
    sep="\t",
)
df_var = peaks[0].str.split(':', expand = True)
df_var.columns = ['gene_symbol', 'chr', 'position', 'strand']
df_var.index = peaks[0].values

# load spatial barcode metadata
barcodes = pd.read_csv(f"{sierra_out_dir}/barcodes.tsv.gz", header=None)

# add metadata to the AnnData object
adata.var_names = peaks[0].values
adata.obs_names = barcodes[0].values
adata.var = df_var
adata.var['gene_id'] = adata.var['gene_symbol']

# load Visium spatial metadata
adata = load_visium_sp_meta(adata, f"{sp_meta_dir}/", library_id='adata_peak')
adata = adata[adata.obs['in_tissue'].astype(bool), :].copy()

# filter out lowly expressed peaks
sc.pp.filter_genes(adata, min_cells=0.01 * adata.shape[0])

# extract gene symbols and peak ids
df_iso_meta = adata.var.copy() # gene_symbol, chr, position, strand, gene_id
df_iso_meta['peak_id'] = adata.var_names

# prepare gene-level metadata
df_gene_meta = df_iso_meta.groupby('gene_symbol').size().reset_index(name='n_peak')
df_gene_meta = df_gene_meta.set_index('gene_symbol')

print(f"Number of spots: {adata.shape[0]}")
print(f"Number of genes before QC: {df_gene_meta.shape[0]}")
print(f"Number of peaks before QC: {adata.shape[1]}")
print(f"Average number of peaks per gene before QC: {adata.shape[1] / df_gene_meta.shape[0]}")

# calculate the total counts per gene
mapping_matrix = pd.get_dummies(df_iso_meta['gene_symbol'])
mapping_matrix = mapping_matrix.loc[df_iso_meta.index, df_gene_meta.index]
isog_counts = adata[:, mapping_matrix.index].layers['counts'] @ mapping_matrix

# calculate mean and sd of total gene counts
df_gene_meta['pct_spot_on'] = (isog_counts > 0).mean(axis = 0)
df_gene_meta['count_avg'] = isog_counts.mean(axis = 0)
df_gene_meta['count_std'] = isog_counts.std(axis = 0)

# filter out lowly expressed genes
_gene_keep = df_gene_meta['pct_spot_on'] > 0.01
# _gene_keep = (df_gene_meta['count_avg'] > 0.5) & _gene_keep

# filter out genes with single isoform
_gene_keep = (df_gene_meta['n_peak'] > 1) & _gene_keep

# filter for isoforms
_iso_keep = df_iso_meta['gene_symbol'].isin(df_gene_meta.index[_gene_keep])

# update feature meta
df_gene_meta = df_gene_meta.loc[_gene_keep, :]
adata = adata[:, _iso_keep]
adata.var = df_iso_meta.loc[_iso_keep, :].copy()

print(f"Number of genes after QC: {sum(_gene_keep)}")
print(f"Number of peaks after QC: {sum(_iso_keep)}")
print(f"Average number of peaks per gene after QC: {sum(_iso_keep) / sum(_gene_keep)}")

Citation

Su, Jiayu, et al. "A computational framework for mapping isoform landscape and regulatory mechanisms from spatial transcriptomics data." bioRxiv (2025): 2025-05. link to preprint

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

splisosm-1.0.2.tar.gz (76.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

splisosm-1.0.2-py3-none-any.whl (68.6 kB view details)

Uploaded Python 3

File details

Details for the file splisosm-1.0.2.tar.gz.

File metadata

  • Download URL: splisosm-1.0.2.tar.gz
  • Upload date:
  • Size: 76.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for splisosm-1.0.2.tar.gz
Algorithm Hash digest
SHA256 903937693c2ce62044dae4fa998d61e294eb3b5c97205d0c70a9466f2c9464de
MD5 2aea4444436ec9b61511cf37de5e9645
BLAKE2b-256 0b07e0dcd793f20a3843e2e1670fe3e44d9cb7d048455a82917b3e5f19cf6146

See more details on using hashes here.

File details

Details for the file splisosm-1.0.2-py3-none-any.whl.

File metadata

  • Download URL: splisosm-1.0.2-py3-none-any.whl
  • Upload date:
  • Size: 68.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for splisosm-1.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 46d7326a47c898ca7d22af1a303d5f58b5b29f2008aa306028fb66616999076b
MD5 67039c5ee7cd2095090b9ba06fc86a5b
BLAKE2b-256 2d5ebf3bc6dd7c2a311e32604b22fffee3eb9b47b543c8730f18a589010a30c6

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page