Fast and accurate cell segmentation for single-molecule spatial omics (Stereo-seq)
StereoSegger: Fast and Accurate Cell Segmentation for Spatial Omics
Note: This project is heavily inspired by the original Segger implementation by Elyas Heidari. You can find the original repository at EliHei2/segger_dev.
Installation
StereoSegger requires CUDA 12 (specifically configured for CUDA 12.4 compatibility) for GPU acceleration.
Option 1: Automated Setup (Recommended)
We provide a setup script that handles the complex dependency chain (PyTorch 2.5.1, RAPIDS 24.08, CUDA 12.4) automatically. This is the most reliable method to ensure GPU acceleration works.
# Clone the repository
git clone https://github.com/nrclaudio/stereosegger.git
cd stereosegger
# Run the setup script (requires Conda installed)
bash scripts/setup_segger_env.sh
# Activate the environment
conda activate segger_env
Option 2: Pip Install (Advanced)
If you are managing your own CUDA environment, you can install via pip. Note that you must include the NVIDIA and PyTorch indices to get the correct GPU-accelerated wheels.
pip install stereosegger \
--extra-index-url https://pypi.nvidia.com \
--extra-index-url https://download.pytorch.org/whl/cu124
Inputs & Outputs
1. Inputs
StereoSegger primarily operates on Parquet files derived from standard spatial formats.
A. Raw Input (SAW Output)
- Format: h5ad (AnnData)
- Source: Output from the SAW pipeline (Stereo-seq Analysis Workflow).
- Requirements:
  - .X: Sparse matrix of gene counts.
  - .obsm['spatial']: (x, y) coordinates of the bins.
  - .var: Index must contain unique gene names.
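The requirements above can be checked before running the conversion step. The following is a minimal sketch, not part of StereoSegger: `check_saw_input` is a hypothetical helper that relies only on attributes an AnnData object exposes (`X.shape`, `obsm`, `var_names`). It does not verify that `.X` is sparse.

```python
from types import SimpleNamespace


def check_saw_input(adata):
    """Return a list of problems found in an AnnData-like object (empty if none).

    Hypothetical helper: it only checks the requirements listed above.
    """
    problems = []
    n_obs, n_vars = adata.X.shape

    # .obsm['spatial'] must hold one (x, y) pair per bin.
    if "spatial" not in adata.obsm:
        problems.append("missing obsm['spatial']")
    elif tuple(adata.obsm["spatial"].shape) != (n_obs, 2):
        problems.append("obsm['spatial'] must have shape (n_bins, 2)")

    # The .var index must contain one unique name per gene.
    names = list(adata.var_names)
    if len(names) != n_vars:
        problems.append("number of gene names does not match X")
    if len(set(names)) != len(names):
        problems.append("gene names in .var are not unique")
    return problems


# Stand-in object for illustration; in practice pass anndata.read_h5ad(path).
demo = SimpleNamespace(
    X=SimpleNamespace(shape=(3, 2)),
    obsm={"spatial": SimpleNamespace(shape=(3, 2))},
    var_names=["Actb", "Actb"],
)
print(check_saw_input(demo))
```

An empty list means the three requirements hold; duplicated gene names are the most common failure and can usually be fixed with AnnData's `var_names_make_unique()`.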
B. Processed Input (StereoSegger Native)
If you are skipping the conversion step, provide a directory containing:
- transcripts.parquet: Long-form table of gene-location occurrences (transcript_id, gene_id, x, y, count, bx, by).
- genes.parquet: Mapping of gene_id to gene_name.
- boundaries.parquet (Optional): WKB-encoded polygons (e.g., nuclei masks).
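To illustrate the long-form layout of transcripts.parquet, here is a small sketch that turns nonzero (bin, gene) entries into rows with the listed columns. It is an illustration only: the function name is hypothetical, and two details are assumptions rather than documented behavior, namely that bx and by are integer bin indices obtained by floor-dividing the coordinates by the bin pitch, and that entries below `min_count` are dropped.

```python
def to_long_form(entries, gene_ids, bin_pitch=1.0, min_count=1):
    """Convert (x, y, gene_name, count) entries into long-form row dicts.

    Assumption: bx/by are floor(x / bin_pitch) and floor(y / bin_pitch).
    Assumption: entries with count < min_count are skipped.
    """
    rows = []
    for x, y, gene, count in entries:
        if count < min_count:
            continue
        rows.append({
            "transcript_id": len(rows),  # running index over kept rows
            "gene_id": gene_ids[gene],
            "x": float(x),
            "y": float(y),
            "count": int(count),
            "bx": int(x // bin_pitch),
            "by": int(y // bin_pitch),
        })
    return rows


gene_ids = {"Actb": 0, "Gapdh": 1}
entries = [(0.0, 0.0, "Actb", 3), (2.0, 5.0, "Gapdh", 1), (2.0, 5.0, "Actb", 0)]
rows = to_long_form(entries, gene_ids)
for row in rows:
    print(row)
```

Each row describes one gene at one location with its count, rather than one row per molecule, which is what keeps the table compact for bin1 data.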
2. Outputs
The pipeline produces three main types of output files, depending on the stage and your configuration.
A. Segmentation Results (.h5ad) - Recommended
The primary output for downstream analysis. Generated when file_format=anndata.
- Expression Matrix (X): A sparse matrix of shape (n_cells, n_genes) containing UMI counts.
- Cell Metadata (obs):
  - transcripts: Total UMI count per cell.
  - unique_transcripts: Number of unique genes detected in the cell.
  - cell_centroid_x, cell_centroid_y: Spatial center of the segmented cell.
  - cell_area: Area of the cell (computed via convex hull).
- Gene Metadata (var):
  - total_assigned: Number of transcripts of this gene assigned to cells.
  - total_unassigned: Number of transcripts of this gene that remain unassigned.
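The per-cell fields in obs can be reproduced from the transcripts assigned to a cell. The sketch below is a plain-Python illustration, not the package's implementation: it computes the convex hull with Andrew's monotone chain and the area with the shoelace formula. Taking the centroid as the unweighted mean of transcript positions is an assumption; the function names are hypothetical.

```python
def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Shoelace formula; returns 0.0 for fewer than three vertices."""
    n = len(vertices)
    if n < 3:
        return 0.0
    twice_area = sum(
        vertices[i][0] * vertices[(i + 1) % n][1]
        - vertices[(i + 1) % n][0] * vertices[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2.0


def cell_summary(transcripts):
    """transcripts: list of (x, y, gene, count) assigned to one cell."""
    xs = [t[0] for t in transcripts]
    ys = [t[1] for t in transcripts]
    return {
        "transcripts": sum(t[3] for t in transcripts),
        "unique_transcripts": len({t[2] for t in transcripts}),
        # Assumption: centroid is the unweighted mean of transcript positions.
        "cell_centroid_x": sum(xs) / len(xs),
        "cell_centroid_y": sum(ys) / len(ys),
        "cell_area": polygon_area(convex_hull(list(zip(xs, ys)))),
    }


cell = [(0, 0, "A", 1), (2, 0, "B", 2), (2, 2, "A", 1), (0, 2, "C", 1), (1, 1, "A", 3)]
summary = cell_summary(cell)
print(summary)
```

Because the area comes from a convex hull, concave cells are overestimated; this is worth keeping in mind when comparing cell_area against mask-based areas.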
B. Segmentation Table (.csv or .parquet)
A long-form record of every transcript's assignment.
- Columns:
  - transcript_id: The unique ID of the input transcript.
  - seg_label: The ID of the cell this transcript was assigned to.
  - score: The model's confidence score for the assignment.
  - bound: Boolean flag (1 = assigned to a nucleus/seed; 0 = assigned via graph-based connected components).
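As an example of working with this table, the sketch below counts, per cell, how many transcripts were assigned through a nucleus/seed and how many through connected components. It is illustrative only: the function name and the `min_score` threshold are hypothetical, and treating a missing seg_label (None) as "unassigned" is an assumption about how such rows are encoded.

```python
from collections import Counter, defaultdict


def summarize_segmentation(rows, min_score=0.0):
    """Count assignments per cell, split by the `bound` flag.

    Assumption: rows with seg_label None are unassigned and are skipped.
    """
    per_cell = defaultdict(Counter)
    for row in rows:
        if row["seg_label"] is None or row["score"] < min_score:
            continue
        source = "seed" if row["bound"] == 1 else "components"
        per_cell[row["seg_label"]][source] += 1
    return {cell: dict(counts) for cell, counts in per_cell.items()}


table = [
    {"transcript_id": 0, "seg_label": "c1", "score": 0.9, "bound": 1},
    {"transcript_id": 1, "seg_label": "c1", "score": 0.7, "bound": 0},
    {"transcript_id": 2, "seg_label": "c2", "score": 0.2, "bound": 0},
    {"transcript_id": 3, "seg_label": None, "score": 0.0, "bound": 0},
]
print(summarize_segmentation(table, min_score=0.5))
```

With real output, the rows would come from the .csv or .parquet file rather than an in-memory list.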
C. Intermediate Tiled Dataset (.pt)
Generated by create_dataset_fast in your data_dir.
- Content: Serialized PyTorch Geometric HeteroData objects.
- Use Case: These are used for training and as the immediate input for the predict step. They contain the spatial graph (nodes, edges, features) for 1000x1000 pixel tiles.
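The tiling idea can be sketched as a simple bucketing of coordinates into 1000x1000 pixel tiles. This is a simplified illustration with a hypothetical function name; whether the actual dataset builder uses margins, overlap, or a different tile origin is not stated here, so the sketch assumes non-overlapping tiles anchored at (0, 0).

```python
from collections import defaultdict


def assign_tiles(points, tile_size=1000):
    """Group point indices by the (column, row) of the tile containing them.

    Assumption: tiles are non-overlapping and anchored at the origin.
    """
    tiles = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        tiles[(int(x // tile_size), int(y // tile_size))].append(idx)
    return dict(tiles)


points = [(10, 10), (999, 999), (1000, 0), (2500, 1500)]
print(assign_tiles(points))
```

Each group of indices would then become one graph object, so memory use per tile stays bounded regardless of the size of the chip.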
Quickstart: Stereo-seq SAW bin1
1. Convert Data & Create Dataset
# 1. Convert H5AD to Parquet
python -m stereosegger.cli.convert_saw_h5ad_to_segger_parquet \
--h5ad C04895D5_tissue.h5ad \
--out_dir ./raw_data \
--bin_pitch 1.0 \
--min_count 1
# 2. Build Graph Dataset
python -m stereosegger.cli.create_dataset_fast \
--base_dir ./raw_data \
--data_dir ./processed_dataset \
--sample_type saw_bin1 \
--tx_graph_mode grid_bins \
--grid_connectivity 8 \
--within_bin_edges star
2. Train Model
python -m stereosegger.cli.train_model \
--dataset_dir ./processed_dataset \
--models_dir ./models \
--sample_tag my_sample \
--max_epochs 200 \
--accelerator cuda \
--devices 1
3. Run Segmentation (Predict)
python -m stereosegger.cli.predict \
--segger_data_dir ./processed_dataset \
--models_dir ./models \
--benchmarks_dir ./results \
--transcripts_file ./raw_data/transcripts.parquet \
--model_version 0
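The Quickstart stages can also be chained from Python. The sketch below only assembles the commands shown above as argument lists and runs them with `subprocess`; `build_pipeline` and `run_pipeline` are hypothetical helpers, not part of the package, and the directory defaults simply mirror the Quickstart paths.

```python
import subprocess
import sys


def build_pipeline(h5ad, raw_dir="./raw_data", data_dir="./processed_dataset",
                   models_dir="./models", results_dir="./results",
                   sample_tag="my_sample"):
    """Return the four Quickstart commands as argv lists (hypothetical helper)."""
    py = [sys.executable, "-m"]
    return [
        py + ["stereosegger.cli.convert_saw_h5ad_to_segger_parquet",
              "--h5ad", h5ad, "--out_dir", raw_dir,
              "--bin_pitch", "1.0", "--min_count", "1"],
        py + ["stereosegger.cli.create_dataset_fast",
              "--base_dir", raw_dir, "--data_dir", data_dir,
              "--sample_type", "saw_bin1", "--tx_graph_mode", "grid_bins",
              "--grid_connectivity", "8", "--within_bin_edges", "star"],
        py + ["stereosegger.cli.train_model",
              "--dataset_dir", data_dir, "--models_dir", models_dir,
              "--sample_tag", sample_tag, "--max_epochs", "200",
              "--accelerator", "cuda", "--devices", "1"],
        py + ["stereosegger.cli.predict",
              "--segger_data_dir", data_dir, "--models_dir", models_dir,
              "--benchmarks_dir", results_dir,
              "--transcripts_file", f"{raw_dir}/transcripts.parquet",
              "--model_version", "0"],
    ]


def run_pipeline(commands):
    """Run the stages in order; check=True stops at the first failing stage."""
    for argv in commands:
        subprocess.run(argv, check=True)


commands = build_pipeline("C04895D5_tissue.h5ad")
for argv in commands:
    print(" ".join(argv))
# Call run_pipeline(commands) in an environment where stereosegger is installed.
```

Using `sys.executable` keeps all stages inside the currently active environment, which matters when several Conda environments are installed.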
Technical Details
Stereo-seq SAW bin1 Methodology
StereoSegger implements specific logic to handle SAW bin1 data efficiently:
- Regular Grid: SAW bin1 data is already a regular grid. We leverage this by using grid adjacency (neighbors are the pixels up/down/left/right, plus the diagonals when grid_connectivity=8), which costs O(1) per neighbor lookup, compared to O(N log N) for distance-based kNN on pseudo-points.
- Consistency: Grid adjacency keeps local structure consistent with the chip layout and avoids sensitivity to sparsity or count magnitude.
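The constant-time lookup can be sketched with a dictionary keyed by integer bin coordinates. This is an illustration of the idea rather than the package's code: the function name is hypothetical, and supporting a connectivity of 4 alongside 8 is an assumption.

```python
OFFSETS_4 = [(1, 0), (-1, 0), (0, 1), (0, -1)]
OFFSETS_8 = OFFSETS_4 + [(1, 1), (1, -1), (-1, 1), (-1, -1)]


def grid_edges(bins, connectivity=8):
    """Return directed edges (i, j) between occupied bins that are grid neighbors.

    bins: list of unique (bx, by) integer coordinates.
    Each neighbor check is a single dictionary lookup, so the total cost is
    linear in the number of occupied bins.
    """
    if connectivity not in (4, 8):
        raise ValueError("connectivity must be 4 or 8")
    offsets = OFFSETS_8 if connectivity == 8 else OFFSETS_4
    index = {b: i for i, b in enumerate(bins)}
    edges = []
    for (bx, by), i in index.items():
        for dx, dy in offsets:
            j = index.get((bx + dx, by + dy))
            if j is not None:
                edges.append((i, j))
    return edges


bins = [(0, 0), (1, 0), (1, 1), (5, 5)]
print(sorted(grid_edges(bins, connectivity=4)))
print(sorted(grid_edges(bins, connectivity=8)))
```

The isolated bin at (5, 5) gets no edges in either mode, which is the behavior a distance-based kNN would not give: kNN always returns k neighbors, however far away they are.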
Graph Modes & Definitions
- Pseudo-transcript (Gene-Bin Node): A node created from a nonzero (bin, gene) entry. In this mode, all genes in a bin are connected to a central "hub" gene, and hubs are connected across adjacent bins. Recommended for SAW.
- Aggregated Bin Node: A node representing an entire spatial bin, aggregating all transcripts within it. Features are [log(total_count), log(n_genes)].
- Grid Adjacency: Two bins are neighbors if their integer grid coordinates differ by one step. grid_connectivity=8 includes diagonals.
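The aggregated bin features can be computed as in the following sketch. The function name is hypothetical, and using the natural logarithm is an assumption; every occupied bin has at least one count and one gene, so both logarithms are defined.

```python
import math
from collections import defaultdict


def bin_features(rows):
    """rows: iterable of (bx, by, gene_id, count) with count >= 1.

    Returns {(bx, by): [log(total_count), log(n_genes)]}.
    Assumption: natural logarithm.
    """
    totals = defaultdict(int)
    genes = defaultdict(set)
    for bx, by, gene_id, count in rows:
        totals[(bx, by)] += count
        genes[(bx, by)].add(gene_id)
    return {b: [math.log(totals[b]), math.log(len(genes[b]))] for b in totals}


rows = [(0, 0, 1, 3), (0, 0, 2, 1), (1, 0, 1, 1)]
features = bin_features(rows)
print(features)
```

A bin holding a single molecule of a single gene maps to [0.0, 0.0], so the two features encode how much a bin exceeds that minimum.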
Architecture
StereoSegger employs a Heterogeneous Graph Attention Network (GATv2) to segment transcripts based on their spatial neighborhood and identity.
1. Nodes (The Graph Components)
- Transcript Nodes (tx): Represent a specific gene at a spatial location. Gene embeddings are scaled by (1 + log(count)) to represent signal intensity without exploding graph size.
- Boundary Nodes (bd): Represent polygon boundaries (e.g., nuclei). Features like area are log-transformed for numerical stability.
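The count scaling of transcript node embeddings can be illustrated with a short worked example. This is a sketch with a hypothetical function name, assuming the natural logarithm and counts of at least 1: a single node with count 100 carries a scale factor of about 5.6 instead of being expanded into 100 separate nodes.

```python
import math


def scale_embedding(embedding, count):
    """Scale a gene embedding by (1 + log(count)); count must be >= 1.

    Assumption: natural logarithm.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    factor = 1.0 + math.log(count)
    return [factor * value for value in embedding]


embedding = [0.5, -1.0]
print(scale_embedding(embedding, 1))    # factor 1.0: embedding unchanged
print(scale_embedding(embedding, 100))  # factor 1 + log(100)
```

The logarithm compresses large counts, so highly expressed genes do not dominate the node features while a count of 1 leaves the embedding untouched.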
2. Edges (The Connections)
- tx $\leftrightarrow$ tx (Transcript-Transcript): Star topology (within bin) + Grid adjacency (across bins).
- tx $\rightarrow$ bd (Transcript-Boundary Neighbors): Connects transcripts to nearby candidate cells.
- tx $\rightarrow$ bd (Supervision): Connects a transcript to the correct ground-truth boundary during training.
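The transcript-transcript edges (star within a bin, grid adjacency across bins) can be sketched as below. This is an illustration under assumptions, not the package's implementation: the function name is hypothetical, and choosing the highest-count gene of each bin as the hub is an assumption, since the hub selection rule is not specified here.

```python
from collections import defaultdict


def build_tx_edges(nodes, connectivity=8):
    """nodes: list of (bx, by, gene_id, count); returns sorted directed edges.

    Assumption: the hub of a bin is its highest-count gene node.
    """
    by_bin = defaultdict(list)
    for idx, (bx, by, _gene, _count) in enumerate(nodes):
        by_bin[(bx, by)].append(idx)
    hubs = {b: max(members, key=lambda i: nodes[i][3])
            for b, members in by_bin.items()}

    edges = set()
    # Star topology: every non-hub node in a bin connects to the bin's hub.
    for b, members in by_bin.items():
        for i in members:
            if i != hubs[b]:
                edges.add((hubs[b], i))
                edges.add((i, hubs[b]))

    # Grid adjacency: hubs of neighboring bins connect to each other.
    offsets = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
               if (dx, dy) != (0, 0) and (connectivity == 8 or dx == 0 or dy == 0)]
    for (bx, by), hub in hubs.items():
        for dx, dy in offsets:
            other = hubs.get((bx + dx, by + dy))
            if other is not None:
                edges.add((hub, other))
    return sorted(edges)


nodes = [(0, 0, 10, 5), (0, 0, 11, 1), (1, 0, 10, 2)]
print(build_tx_edges(nodes))
```

With this layout, the number of edges grows linearly with the number of gene-bin nodes: each non-hub node contributes two edges, and each hub contributes at most `connectivity` edges.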