Fast and accurate cell segmentation for single-molecule spatial omics (Stereo-seq)

StereoSegger: Fast and Accurate Cell Segmentation for Spatial Omics

Note: This project is heavily inspired by the original Segger implementation by Elyas Heidari. You can find the original repository at EliHei2/segger_dev.

Installation

GPU acceleration in StereoSegger requires CUDA 12; the pinned dependencies are configured for CUDA 12.4.

Option 1: Automated Setup (Recommended)

We provide a setup script that handles the complex dependency chain (PyTorch 2.5.1, RAPIDS 24.08, CUDA 12.4) automatically. This is the most reliable method to ensure GPU acceleration works.

# Clone the repository
git clone https://github.com/nrclaudio/stereosegger.git
cd stereosegger

# Run the setup script (requires Conda installed)
bash scripts/setup_segger_env.sh

# Activate the environment
conda activate segger_env

Option 2: Pip Install (Advanced)

If you are managing your own CUDA environment, you can install via pip. Note that you must include the NVIDIA and PyTorch indices to get the correct GPU-accelerated wheels.

pip install stereosegger \
  --extra-index-url https://pypi.nvidia.com \
  --extra-index-url https://download.pytorch.org/whl/cu124
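
After either installation route, it can help to confirm that PyTorch actually sees a CUDA device. The following is a minimal sketch, not part of StereoSegger: check_gpu_stack is a hypothetical helper name, and it relies only on the standard torch.cuda.is_available() and torch.version.cuda attributes.

```python
import importlib.util


def check_gpu_stack():
    """Report whether PyTorch is importable and whether it can see a CUDA device."""
    report = {"torch_installed": False, "cuda_available": False, "cuda_version": None}
    if importlib.util.find_spec("torch") is None:
        return report  # PyTorch is not installed in this environment

    import torch

    report["torch_installed"] = True
    report["cuda_available"] = torch.cuda.is_available()
    report["cuda_version"] = torch.version.cuda  # None for CPU-only builds
    return report


if __name__ == "__main__":
    print(check_gpu_stack())
```

If cuda_version is reported as None, a CPU-only wheel was probably installed, which usually means the extra index URLs were not picked up.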

Inputs & Outputs

1. Inputs

StereoSegger primarily operates on Parquet files derived from standard spatial formats.

A. Raw Input (SAW Output)

  • Format: h5ad (AnnData)
  • Source: Output from the SAW pipeline (Stereo-seq Analysis Workflow).
  • Requirements:
    • .X: Sparse matrix of gene counts.
    • .obsm['spatial']: (x, y) coordinates of the bins.
    • .var: Index must contain unique gene names.

B. Processed Input (StereoSegger Native)

If you are skipping the conversion step, provide a directory containing:

  • transcripts.parquet: Long-form table of gene-location occurrences (transcript_id, gene_id, x, y, count, bx, by).
  • genes.parquet: Mapping of gene_id to gene_name.
  • boundaries.parquet (Optional): WKB-encoded polygons (e.g., nuclei masks).
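
To make the long-form layout concrete, here is a minimal sketch of how nonzero (bin, gene) entries could be flattened into rows with the transcripts.parquet columns. It is illustrative only: to_long_form is a hypothetical helper, and it assumes that bx and by are integer bin indices, that x and y are those indices multiplied by the bin pitch, and that transcript_id is a running row number. The actual converter may differ on all three points.

```python
def to_long_form(bin_counts, gene_ids, bin_pitch=1.0, min_count=1):
    """Flatten {(bx, by): {gene_name: count}} into long-form row dicts.

    Assumptions (not confirmed by the project): x/y are bin indices scaled by
    bin_pitch, and transcript_id is a sequential row number.
    """
    rows = []
    for (bx, by), genes in sorted(bin_counts.items()):
        for gene_name, count in sorted(genes.items()):
            if count < min_count:
                continue  # drop entries below the count threshold
            rows.append({
                "transcript_id": len(rows),
                "gene_id": gene_ids[gene_name],
                "x": bx * bin_pitch,
                "y": by * bin_pitch,
                "count": count,
                "bx": bx,
                "by": by,
            })
    return rows


if __name__ == "__main__":
    counts = {(0, 0): {"Actb": 2, "Gapdh": 0}, (1, 0): {"Actb": 1}}
    ids = {"Actb": 0, "Gapdh": 1}
    for row in to_long_form(counts, ids):
        print(row)
```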

2. Outputs

The pipeline produces three main types of output files, depending on the stage and your configuration.

A. Segmentation Results (.h5ad) - Recommended

The primary output for downstream analysis. Generated when file_format=anndata.

  • Expression Matrix (X): A sparse matrix of shape (n_cells, n_genes) containing UMI counts.
  • Cell Metadata (obs):
    • transcripts: Total UMI count per cell.
    • unique_transcripts: Number of unique genes detected in the cell.
    • cell_centroid_x, cell_centroid_y: Spatial center of the segmented cell.
    • cell_area: Area of the cell, computed from its convex hull.
  • Gene Metadata (var):
    • total_assigned: Number of transcripts of this gene assigned to cells.
    • total_unassigned: Number of transcripts of this gene that remain unassigned.
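
The per-cell fields above can be illustrated with a small pure-Python sketch. This is not StereoSegger's code: the function names are hypothetical, the centroid is assumed to be the unweighted mean of transcript positions, and the hull is assumed to be taken over the cell's assigned transcript coordinates.

```python
def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Shoelace formula; returns 0.0 for degenerate polygons."""
    n = len(vertices)
    if n < 3:
        return 0.0
    twice = sum(
        vertices[i][0] * vertices[(i + 1) % n][1]
        - vertices[(i + 1) % n][0] * vertices[i][1]
        for i in range(n)
    )
    return abs(twice) / 2.0


def cell_summary(transcripts):
    """transcripts: list of (x, y, gene, count) tuples assigned to one cell."""
    xs = [t[0] for t in transcripts]
    ys = [t[1] for t in transcripts]
    return {
        "transcripts": sum(t[3] for t in transcripts),        # total UMI count
        "unique_transcripts": len({t[2] for t in transcripts}),  # distinct genes
        "cell_centroid_x": sum(xs) / len(xs),  # assumption: unweighted mean
        "cell_centroid_y": sum(ys) / len(ys),
        "cell_area": polygon_area(convex_hull(list(zip(xs, ys)))),
    }


if __name__ == "__main__":
    cell = [(0, 0, "A", 2), (2, 0, "B", 1), (2, 2, "A", 1), (0, 2, "C", 3), (1, 1, "B", 1)]
    print(cell_summary(cell))
```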

B. Segmentation Table (.csv or .parquet)

A long-form record of every transcript's assignment.

  • Columns:
    • transcript_id: The unique ID of the input transcript.
    • seg_label: The ID of the cell this transcript was assigned to.
    • score: The model's confidence score for the assignment.
    • bound: Boolean flag (1 = assigned to a nucleus/seed; 0 = assigned via graph-based connected components).
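
As a usage sketch for the CSV variant of this table, the snippet below keeps assignments above a confidence cutoff and counts transcripts per cell. The helper name and the 0.5 threshold are hypothetical choices, and the code assumes that bound is stored as 0/1 as described above.

```python
import csv
import io
from collections import Counter


def summarize_assignments(csv_text, min_score=0.5):
    """Count confident transcripts per cell and how many were nucleus/seed-bound.

    min_score is an illustrative cutoff, not a project default.
    """
    per_cell = Counter()
    seeded = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        if float(row["score"]) < min_score:
            continue  # skip low-confidence assignments
        per_cell[row["seg_label"]] += 1
        seeded += int(row["bound"])  # assumption: bound is encoded as 0 or 1
    return per_cell, seeded


if __name__ == "__main__":
    table = (
        "transcript_id,seg_label,score,bound\n"
        "0,cell_1,0.92,1\n"
        "1,cell_1,0.40,0\n"
        "2,cell_2,0.75,0\n"
        "3,cell_1,0.81,1\n"
    )
    print(summarize_assignments(table))
```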

C. Intermediate Tiled Dataset (.pt)

Generated by create_dataset_fast in your data_dir.

  • Content: Serialized PyTorch Geometric HeteroData objects.
  • Use Case: These are used for training and as the immediate input for the predict step. They contain the spatial graph (nodes, edges, features) for 1000x1000 pixel tiles.
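
To illustrate the tiling idea, here is a minimal sketch that groups coordinates into 1000 x 1000 tiles. It assumes axis-aligned, non-overlapping tiles anchored at the origin; assign_tiles is a hypothetical helper, and the real dataset builder may add margins or use a different tile origin.

```python
from collections import defaultdict


def assign_tiles(coords, tile_size=1000.0):
    """Map each (x, y) coordinate to a (tile_col, tile_row) key.

    Returns {tile_key: [indices of coordinates falling in that tile]}.
    """
    tiles = defaultdict(list)
    for idx, (x, y) in enumerate(coords):
        key = (int(x // tile_size), int(y // tile_size))
        tiles[key].append(idx)
    return dict(tiles)


if __name__ == "__main__":
    points = [(10, 20), (999.9, 0), (1000, 0), (2500, 1500)]
    print(assign_tiles(points))
```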

Quickstart: Stereo-seq SAW bin1

1. Convert Data & Create Dataset

# 1. Convert H5AD to Parquet
python -m stereosegger.cli.convert_saw_h5ad_to_segger_parquet \
  --h5ad C04895D5_tissue.h5ad \
  --out_dir ./raw_data \
  --bin_pitch 1.0 \
  --min_count 1

# 2. Build Graph Dataset
python -m stereosegger.cli.create_dataset_fast \
  --base_dir ./raw_data \
  --data_dir ./processed_dataset \
  --sample_type saw_bin1 \
  --tx_graph_mode grid_bins \
  --grid_connectivity 8 \
  --within_bin_edges star

2. Train Model

python -m stereosegger.cli.train_model \
  --dataset_dir ./processed_dataset \
  --models_dir ./models \
  --sample_tag my_sample \
  --max_epochs 200 \
  --accelerator cuda \
  --devices 1

3. Run Segmentation (Predict)

python -m stereosegger.cli.predict \
  --segger_data_dir ./processed_dataset \
  --models_dir ./models \
  --benchmarks_dir ./results \
  --transcripts_file ./raw_data/transcripts.parquet \
  --model_version 0
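
The four commands above can also be chained from Python. The sketch below only assembles the argument lists, using the same module names and flags shown in this quickstart; build_pipeline_commands and run_pipeline are hypothetical helpers, not part of the package.

```python
import subprocess
import sys


def build_pipeline_commands(h5ad, raw_dir="./raw_data", data_dir="./processed_dataset",
                            models_dir="./models", results_dir="./results",
                            sample_tag="my_sample"):
    """Return the quickstart steps as argv lists, in execution order."""
    py = sys.executable
    return [
        [py, "-m", "stereosegger.cli.convert_saw_h5ad_to_segger_parquet",
         "--h5ad", h5ad, "--out_dir", raw_dir,
         "--bin_pitch", "1.0", "--min_count", "1"],
        [py, "-m", "stereosegger.cli.create_dataset_fast",
         "--base_dir", raw_dir, "--data_dir", data_dir,
         "--sample_type", "saw_bin1", "--tx_graph_mode", "grid_bins",
         "--grid_connectivity", "8", "--within_bin_edges", "star"],
        [py, "-m", "stereosegger.cli.train_model",
         "--dataset_dir", data_dir, "--models_dir", models_dir,
         "--sample_tag", sample_tag, "--max_epochs", "200",
         "--accelerator", "cuda", "--devices", "1"],
        [py, "-m", "stereosegger.cli.predict",
         "--segger_data_dir", data_dir, "--models_dir", models_dir,
         "--benchmarks_dir", results_dir,
         "--transcripts_file", f"{raw_dir}/transcripts.parquet",
         "--model_version", "0"],
    ]


def run_pipeline(commands):
    """Run each step in order and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    for cmd in build_pipeline_commands("C04895D5_tissue.h5ad"):
        print(" ".join(cmd))
```

Calling run_pipeline on the result executes the steps sequentially; check=True makes a failing step raise instead of silently continuing to the next one.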

Technical Details

Stereo-seq SAW bin1 Methodology

StereoSegger implements specific logic to handle SAW bin1 data efficiently:

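  • Regular Grid: SAW bin1 data already lies on a regular grid. StereoSegger exploits this with grid adjacency: each bin's neighbors are the bins one step up, down, left, or right (plus the diagonals when grid_connectivity=8). Each neighbor lookup is O(1), so building the graph is O(N) in the number of occupied bins, compared with O(N log N) for distance-based kNN on pseudo-points.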
  • Consistency: Grid adjacency keeps local structure consistent with the chip layout and avoids sensitivity to sparsity or count magnitude.
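
The grid-adjacency idea can be sketched in a few lines: store the occupied bins in a hash set and probe a fixed list of offsets for each bin. This is an illustrative sketch under the assumption that bins are identified by integer (bx, by) pairs; grid_edges is a hypothetical name, not the package's function.

```python
OFFSETS_4 = [(1, 0), (-1, 0), (0, 1), (0, -1)]
OFFSETS_8 = OFFSETS_4 + [(1, 1), (1, -1), (-1, 1), (-1, -1)]


def grid_edges(bins, connectivity=8):
    """Return directed edges between occupied bins that are grid neighbors.

    Each bin probes a constant number of offsets, and each probe is an O(1)
    set lookup, so the total cost is linear in the number of occupied bins.
    """
    if connectivity not in (4, 8):
        raise ValueError("connectivity must be 4 or 8")
    offsets = OFFSETS_8 if connectivity == 8 else OFFSETS_4
    occupied = set(bins)
    edges = []
    for bx, by in sorted(occupied):
        for dx, dy in offsets:
            neighbor = (bx + dx, by + dy)
            if neighbor in occupied:
                edges.append(((bx, by), neighbor))
    return edges


if __name__ == "__main__":
    occupied_bins = [(0, 0), (1, 0), (1, 1), (5, 5)]
    print(len(grid_edges(occupied_bins, connectivity=4)))
    print(len(grid_edges(occupied_bins, connectivity=8)))
```

Because the neighborhood depends only on integer offsets, an isolated bin such as (5, 5) simply gets no edges, regardless of how many counts it holds.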

Graph Modes & Definitions

  1. Pseudo-transcript (Gene-Bin Node): A node created from a nonzero (bin, gene) entry. Within each bin, all gene nodes connect to a central "hub" gene node (star topology), and hubs connect across adjacent bins. Recommended for SAW.
  2. Aggregated Bin Node: A node representing an entire spatial bin, aggregating all transcripts within it. Features are [log(total_count), log(n_genes)].
  3. Grid Adjacency: Two bins are neighbors if their integer grid coordinates differ by exactly one step along a single axis. grid_connectivity=8 also includes diagonal neighbors.
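
The first two definitions can be sketched for a single bin as follows. Both helpers are hypothetical. How the hub gene is chosen is not stated here, so the sketch assumes the highest-count gene (ties broken alphabetically), and it assumes natural logarithms for the aggregated features.

```python
import math


def star_edges(gene_counts):
    """Connect every gene node in a bin to one hub gene node.

    gene_counts: {gene_name: count} for a single bin.
    Assumption: the hub is the highest-count gene, ties broken by name.
    """
    hub = max(sorted(gene_counts), key=lambda g: gene_counts[g])
    edges = [(hub, gene) for gene in sorted(gene_counts) if gene != hub]
    return hub, edges


def aggregated_bin_features(gene_counts):
    """Features of an aggregated bin node: [log(total_count), log(n_genes)]."""
    total = sum(gene_counts.values())
    return [math.log(total), math.log(len(gene_counts))]


if __name__ == "__main__":
    one_bin = {"A": 3, "B": 5, "C": 1}
    print(star_edges(one_bin))
    print(aggregated_bin_features(one_bin))
```

A star needs only n - 1 edges for n genes in a bin, whereas fully connecting the same genes would need n(n - 1)/2.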

Architecture

StereoSegger employs a Heterogeneous Graph Attention Network (GATv2) to segment transcripts based on their spatial neighborhood and identity.

1. Nodes (The Graph Components)

  • Transcript Nodes (tx): Each node represents a specific gene at a spatial location. Its gene embedding is scaled by (1 + log(count)), which encodes signal intensity without adding a separate node for every count.
  • Boundary Nodes (bd): Each node represents a polygon boundary (e.g., a nucleus). Features such as area are log-transformed for numerical stability.
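
A minimal sketch of the two feature transforms described above, using plain Python lists in place of tensors. The function names are hypothetical; the natural logarithm and the use of log1p for area are assumptions, since the exact transform is not specified here.

```python
import math


def scale_embedding(embedding, count):
    """Scale a gene embedding by (1 + log(count)); count must be at least 1."""
    if count < 1:
        raise ValueError("count must be >= 1")
    factor = 1.0 + math.log(count)  # assumption: natural logarithm
    return [factor * value for value in embedding]


def log_area(area):
    """Log-transform a polygon area (assumption: log1p, which is safe at zero)."""
    return math.log1p(area)


if __name__ == "__main__":
    print(scale_embedding([0.5, -1.0], count=1))    # factor 1.0: unchanged
    print(scale_embedding([0.5, -1.0], count=100))  # factor grows only logarithmically
    print(log_area(250.0))
```

With count = 1 the factor is exactly 1, so single-count entries keep their original embedding, while a 100-fold higher count raises the factor only to about 5.6.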

2. Edges (The Connections)

  • tx ↔ tx (Transcript-Transcript): Star topology (within bin) + Grid adjacency (across bins).
  • tx → bd (Transcript-Boundary Neighbors): Connects transcripts to nearby candidate cells.
  • tx → bd (Supervision): Connects a transcript to the correct ground-truth boundary during training.
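
One way to picture the three edge types is a dictionary keyed by (source, relation, target) triples, similar in spirit to how heterogeneous graph libraries organize typed edges. Everything in this sketch is an assumption made for illustration: the relation names "neighbors" and "belongs", the radius-based rule for candidate boundaries, and the helper itself are not taken from StereoSegger.

```python
import math


def build_edge_store(tx_xy, bd_centroids, tx_pairs, truth, radius=10.0):
    """Assemble typed edge lists for a small heterogeneous graph.

    tx_xy:        list of (x, y), one per transcript node
    bd_centroids: list of (x, y), one per boundary node
    tx_pairs:     undirected transcript-transcript pairs (i, j)
    truth:        {transcript index: boundary index} for labeled transcripts
    radius:       illustrative cutoff for "nearby" boundaries
    """
    edges = {
        ("tx", "neighbors", "tx"): [],
        ("tx", "neighbors", "bd"): [],
        ("tx", "belongs", "bd"): [],  # supervision edges, used only in training
    }
    for i, j in tx_pairs:
        # tx <-> tx is bidirectional, so store both directions
        edges[("tx", "neighbors", "tx")].extend([(i, j), (j, i)])
    for i, (x, y) in enumerate(tx_xy):
        for b, (cx, cy) in enumerate(bd_centroids):
            if math.hypot(x - cx, y - cy) <= radius:
                edges[("tx", "neighbors", "bd")].append((i, b))
    for i, b in sorted(truth.items()):
        edges[("tx", "belongs", "bd")].append((i, b))
    return edges


if __name__ == "__main__":
    store = build_edge_store(
        tx_xy=[(0, 0), (3, 4), (50, 50)],
        bd_centroids=[(0, 0), (48, 50)],
        tx_pairs=[(0, 1)],
        truth={0: 0},
    )
    for edge_type, pairs in store.items():
        print(edge_type, pairs)
```

Keeping supervision edges under their own key makes it straightforward to leave them out at prediction time while reusing the same neighbor edges.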
