Fast and accurate cell segmentation for single-molecule spatial omics (Stereo-seq)

StereoSegger: Fast and Accurate Cell Segmentation for Spatial Omics

Stereo-seq SAW bin1 Quickstart

Convert a SAW bin1 h5ad to StereoSegger parquet and build a dataset with grid-based transcript graphs:

python -m segger.cli.convert_saw_h5ad_to_segger_parquet \
  --h5ad C04895D5_tissue.h5ad \
  --out_dir /path/to/base_dir \
  --bin_pitch 1.0 \
  --min_count 1

python -m segger.cli.create_dataset_fast \
  --base_dir /path/to/base_dir \
  --data_dir /path/to/stereosegger_dataset \
  --sample_type saw_bin1 \
  --tx_graph_mode grid_bins \
  --grid_connectivity 8 \
  --within_bin_edges star

Recommended defaults: grid_connectivity=8, within_bin_edges=star (for gene-specific nodes), and bin_pitch=1.0 when the coordinates are already in bin units.

End-to-End Workflow

After generating the dataset (see Stereo-seq SAW bin1 Quickstart), you can train a model and run segmentation.

1. Train Model

Train the StereoSegger model on your processed dataset.

python -m segger.cli.train_model \
  --dataset_dir /path/to/stereosegger_dataset \
  --models_dir /path/to/models_dir \
  --sample_tag my_sample \
  --max_epochs 200 \
  --accelerator cuda \
  --devices 1

--dataset_dir: Path to the processed StereoSegger dataset (output of create_dataset_fast).
--models_dir: Directory where checkpoints and logs will be saved.
--sample_tag: Tag used during dataset creation (e.g., sample name).

2. Run Segmentation (Predict)

Apply the trained model to segment transcripts.

python -m segger.cli.predict \
  --segger_data_dir /path/to/stereosegger_dataset \
  --models_dir /path/to/models_dir \
  --benchmarks_dir /path/to/output_dir \
  --transcripts_file /path/to/base_dir/transcripts.parquet \
  --model_version 0

--segger_data_dir: Path to the processed StereoSegger dataset.
--models_dir: Directory containing the trained model (same as used in training).
--benchmarks_dir: Directory to save segmentation results (.h5ad, .csv, etc.).
--transcripts_file: Path to the transcripts.parquet file (generated during convert_saw_h5ad_to_segger_parquet or dataset creation).
--model_version: Version number of the training run (e.g., 0 for version_0).

Inputs & Outputs

Input Format

StereoSegger relies on a standard H5AD file (AnnData) for Stereo-seq data, which is then converted into intermediate Parquet files.

Source H5AD:
- Expression Matrix (.X): Sparse matrix of gene counts.
- Coordinates (.obsm['spatial']): (x, y) coordinates of the bins.
- Variables (.var): Must contain gene names (index or a specific column).
Intermediate Parquet: The conversion tool (convert_saw_h5ad_to_segger_parquet) generates a directory with:
- transcripts.parquet: Contains pseudo-transcripts derived from non-zero bins. Columns: x, y, gene_id, count, bx (bin x), by (bin y).
- genes.parquet: Mapping of gene_id to gene_name.
- boundaries.parquet (Optional): Polygon geometries from existing segmentation masks (if provided).
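
The step from a bin-by-gene count matrix to long-format pseudo-transcript rows can be pictured with a small pure-Python sketch. This is not the converter's actual code: the function name `bins_to_pseudo_transcripts`, the dictionary used as a stand-in for the sparse matrix, and the rule `bx = round(x / bin_pitch)` are all assumptions made for illustration.

```python
def bins_to_pseudo_transcripts(counts, coords, bin_pitch=1.0, min_count=1):
    """Turn nonzero (bin, gene) counts into long-format pseudo-transcript rows.

    counts: dict mapping (bin_index, gene_id) -> UMI count (sparse-matrix stand-in).
    coords: list of (x, y) per bin, as stored in .obsm['spatial'].
    Hypothetical helper; the bx/by rule below is an assumption.
    """
    rows = []
    for (bin_index, gene_id), count in sorted(counts.items()):
        if count < min_count:
            continue  # skip entries below the count threshold
        x, y = coords[bin_index]
        rows.append({
            "transcript_id": len(rows),          # running row index
            "x": float(x),
            "y": float(y),
            "gene_id": gene_id,
            "count": count,
            "bx": int(round(x / bin_pitch)),     # assumed grid-index rule
            "by": int(round(y / bin_pitch)),
        })
    return rows


counts = {(0, 2): 3, (0, 5): 1, (1, 2): 0, (1, 7): 4}
coords = [(10.0, 20.0), (11.0, 20.0)]
for row in bins_to_pseudo_transcripts(counts, coords):
    print(row)
```

Each nonzero (bin, gene) entry becomes exactly one row, so the table grows with the number of nonzero entries rather than with the total UMI count.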

Output

StereoSegger Dataset:
- A directory of processed PyTorch Geometric graphs (.pt files) representing tiled regions of the tissue.
- Nodes: Pseudo-transcripts (with count features) and boundaries (if available).
- Edges: Spatial grid connections (same gene across neighbors) and local co-expression (within-bin).
Trained Model:
- PyTorch Lightning checkpoints (.ckpt) saved in the models_dir.

Data Requirements

If bypassing the H5AD conversion tools, ensure your input Parquet files adhere to these schemas:

transcripts.parquet (Required):
- One row per detected gene-location (long format).
- Columns:
  - transcript_id (int64): Unique identifier for the transcript node.
  - x, y (float): Spatial coordinates.
  - gene_id (int32): Integer index corresponding to genes.parquet.
  - count (int32, optional): UMI count for this gene at this location (used for scaling embeddings).
  - bx, by (int32, optional): Grid indices (required if using grid-based graph construction).
  - overlaps_nucleus (int/bool, optional): 1 if inside a nucleus, 0 otherwise (for supervision).
  - cell_id (int/str, optional): Ground truth cell assignment (for supervision).
genes.parquet (Required):
- Mapping between integer IDs and gene names.
- Columns:
  - gene_id (int32): Matching transcripts.parquet.
  - gene_name (string): Human-readable gene symbol.
boundaries.parquet (Optional):
- Polygon geometries for cell/nuclei boundaries.
- Columns:
  - boundary_id (int/str): Unique identifier.
  - geometry (bytes/binary): WKB (Well-Known Binary) encoded polygon.
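
The geometry column holds WKB bytes. A geometry library such as Shapely would normally produce them; the standard-library sketch below only illustrates the byte layout of a single-ring polygon. `polygon_to_wkb` is a hypothetical helper and not part of StereoSegger.

```python
import struct


def polygon_to_wkb(exterior):
    """Encode a single-ring polygon as little-endian WKB bytes.

    exterior: list of (x, y) vertices; the ring is closed if needed.
    """
    ring = list(exterior)
    if ring[0] != ring[-1]:
        ring.append(ring[0])  # WKB rings repeat the first vertex at the end

    # Header: byte order (1 = little-endian), geometry type (3 = Polygon), ring count.
    out = struct.pack("<BII", 1, 3, 1)
    out += struct.pack("<I", len(ring))   # number of points in the ring
    for x, y in ring:
        out += struct.pack("<dd", x, y)   # each point is two float64 values
    return out


square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
wkb = polygon_to_wkb(square)
print(len(wkb), wkb[:9].hex())
```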

Installation (Linux with NVIDIA GPU)

StereoSegger requires CUDA 11 or CUDA 12 for GPU acceleration. The recommended way to install is using pip inside a clean environment.

Step 1: Create a clean environment

# Using conda for isolation
conda create -n stereosegger python=3.10 pip -y
conda activate stereosegger

Step 2: Install Core Dependencies (PyTorch, RAPIDS, CuPy)

StereoSegger relies on GPU acceleration. We recommend installing the core stack first to ensure compatibility.

Using Pip (Recommended for Linux/WSL with CUDA 12):

# 1. Install PyTorch (match your CUDA version, e.g., 12.4)
pip install torch==2.5.1 torchvision --index-url https://download.pytorch.org/whl/cu124

# 2. Install RAPIDS (cuDF, cuML, cuGraph) and CuPy
pip install cudf-cu12==24.8.* cuml-cu12==24.8.* cugraph-cu12==24.8.* cuspatial-cu12==24.8.* cupy-cuda12x \
    --extra-index-url https://pypi.nvidia.com

# 3. Install PyTorch Geometric and optimized kernels
pip install torch_geometric
pip install torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.5.1+cu124.html

Note: If a pip-based RAPIDS install fails at import time with a missing shared library (e.g., libcusolver.so.11: cannot open shared object file), add the NVIDIA library directories inside your environment to LD_LIBRARY_PATH:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/your/env/lib/python3.10/site-packages/nvidia/cusparse/lib:/path/to/your/env/lib/python3.10/site-packages/nvidia/cublas/lib
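
Rather than typing the paths by hand, the directories can be discovered from the active environment. The following is a minimal sketch, assuming the pip-installed NVIDIA wheels follow the site-packages/nvidia/<component>/lib layout shown in the export line above; `nvidia_lib_dirs` is a hypothetical helper name.

```python
import os
import sysconfig
from pathlib import Path


def nvidia_lib_dirs(site_packages=None):
    """List nvidia/<component>/lib directories under a site-packages root."""
    root = Path(site_packages or sysconfig.get_paths()["purelib"])
    return sorted(str(p) for p in root.glob("nvidia/*/lib") if p.is_dir())


dirs = nvidia_lib_dirs()
if dirs:
    # Print a line that can be pasted into the shell.
    print("export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:" + os.pathsep.join(dirs))
else:
    print("No pip-installed NVIDIA library directories found.")
```

Run it with the Python interpreter of the stereosegger environment so that the default site-packages root points at the right place.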

Using Conda:

# Install compatible versions of PyTorch, RAPIDS, and CuPy
conda install -c pytorch -c nvidia -c rapidsai -c conda-forge \
    pytorch=2.5.1 torchvision pytorch-cuda=12.4 \
    rapids=24.8 python=3.10 cuda-version=12.4 \
    cupy
pip install torch_geometric

Step 3: Install StereoSegger

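# From a local clone of the repository (editable install)
pip install -e .

# Or install the published release from PyPI
pip install stereosegger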
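
After installation, a quick check that the main packages are discoverable can save a failed training run. This sketch only tests whether each module can be found, without importing it; the module list is an assumption based on the packages named in the steps above and on the segger module used in the CLI examples.

```python
from importlib.util import find_spec

# Assumed import names for the packages installed above.
MODULES = ["torch", "torch_geometric", "cudf", "cuml", "cugraph", "cupy", "segger"]


def check_modules(names=MODULES):
    """Map each top-level module name to True if it can be located."""
    return {name: find_spec(name) is not None for name in names}


for name, found in check_modules().items():
    print(f"{name:16s} {'ok' if found else 'MISSING'}")
```

A module reported as found can still fail at import time (for example, because of the shared-library issue noted earlier), so treat this as a first-pass check only.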

Stereo-seq SAW bin1 Notes and Design Choices

Why SAW bin1 is handled differently:

SAW bin1 is already a regular grid with counts per (bin, gene), not per-molecule coordinates.
Grid adjacency is a more faithful neighborhood for bins than a distance-based kNN on pseudo-points. It keeps local structure consistent with the chip layout and avoids sensitivity to sparsity or count magnitude.
The added log1p(count) feature lets the model see expression strength without exploding the number of nodes (no count expansion).

Definitions:

Pseudo-transcript (Gene-Bin Node): A node created from a nonzero (bin, gene) entry. It carries the gene identity, uses the bin’s (x, y) coordinates, and stores log1p(count) as an extra scalar feature. This is the behavior when within_bin_edges=star.
Aggregated Bin Node: A node representing an entire spatial bin, aggregating all transcripts within it. Features are [log(total_count), log(n_genes)]. Gene identity is implicit. This is the behavior when within_bin_edges=none.
Grid adjacency: Two bins are neighbors if their integer grid coordinates differ by one step. With 4-connectivity this is up/down/left/right; with 8-connectivity it also includes diagonals.
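
The grid adjacency definition can be sketched in a few lines. `grid_neighbors` is a hypothetical helper written for illustration, not StereoSegger's internal function; it returns only neighbors that are actually occupied.

```python
def grid_neighbors(bx, by, occupied, connectivity=8):
    """Return occupied bins adjacent to (bx, by) on the integer grid.

    occupied: set of (bx, by) tuples for bins that contain at least one count.
    """
    if connectivity == 4:
        offsets = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    elif connectivity == 8:
        offsets = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                   if (dx, dy) != (0, 0)]
    else:
        raise ValueError("connectivity must be 4 or 8")
    return sorted((bx + dx, by + dy) for dx, dy in offsets
                  if (bx + dx, by + dy) in occupied)


occupied = {(0, 0), (1, 0), (1, 1), (2, 2)}
print(grid_neighbors(0, 0, occupied, connectivity=4))  # [(1, 0)]
print(grid_neighbors(0, 0, occupied, connectivity=8))  # [(1, 0), (1, 1)]
```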

Graph mode guidance (--tx_graph_mode grid_bins):

within_bin_edges=star: Recommended for SAW. Creates independent nodes for each gene present in a bin. Connects all genes in a bin to a central "hub" gene, and connects hubs across adjacent bins. Preserves gene identity while leveraging the grid structure.
within_bin_edges=none: Aggregates all transcripts in a bin into a single node. Fastest, but loses gene-specific identity in the graph topology (features are just counts). Useful for very high density or quick prototyping.
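
The star layout can be illustrated with a small sketch. The text above does not say how the hub is chosen, so this example assumes the highest-count gene in each bin acts as hub and emits each undirected edge once; `star_edges` is a hypothetical helper, not the library's API.

```python
from collections import defaultdict


def star_edges(nodes, connectivity=8):
    """Build within-bin star edges and hub-to-hub grid edges.

    nodes: list of (bx, by, count); a node's id is its list position.
    Returns (within_bin_edges, across_bin_edges) as sorted lists of id pairs.
    """
    if connectivity not in (4, 8):
        raise ValueError("connectivity must be 4 or 8")

    by_bin = defaultdict(list)
    for node_id, (bx, by, _count) in enumerate(nodes):
        by_bin[(bx, by)].append(node_id)

    # Assumption: hub = highest-count node in the bin (ties go to the lowest id).
    hubs = {b: max(ids, key=lambda i: (nodes[i][2], -i))
            for b, ids in by_bin.items()}

    within = [(hubs[b], i) for b, ids in by_bin.items()
              for i in ids if i != hubs[b]]

    # Half of the neighbor offsets, so each pair of bins is visited once.
    steps = [(1, 0), (0, 1)]
    if connectivity == 8:
        steps += [(1, 1), (1, -1)]

    across = []
    for (bx, by), hub in hubs.items():
        for dx, dy in steps:
            other = hubs.get((bx + dx, by + dy))
            if other is not None:
                across.append((hub, other))
    return sorted(within), sorted(across)


nodes = [(0, 0, 5), (0, 0, 2), (0, 0, 1), (1, 0, 3), (1, 1, 1), (1, 1, 4)]
within, across = star_edges(nodes, connectivity=8)
print(within)  # [(0, 1), (0, 2), (5, 4)]
print(across)  # [(0, 3), (0, 5), (3, 5)]
```

With this layout the edge count grows roughly linearly with the number of nodes, whereas connecting every pair of genes within a bin would grow quadratically in dense bins.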

Other modes:

kdtree: Original behavior for Xenium/MERSCOPE (single-molecule resolution); still supported for SAW if you want distance-based adjacency.

Modeling choices and impact:

Token-based gene embeddings (default): When using kdtree or grid_bins (star), each node carries a gene ID token, and the model learns an embedding per gene.
Count feature (log1p(count)): When a count column exists, StereoSegger encodes expression strength without expanding nodes. With token-based embeddings, it scales the gene embedding by (1 + log1p(count)).
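
The scaling rule is simple enough to show directly. This sketch uses plain Python lists in place of the model's embedding tensors; `scale_embedding` is a hypothetical name.

```python
import math


def scale_embedding(embedding, count):
    """Scale a gene embedding by (1 + log1p(count))."""
    factor = 1.0 + math.log1p(count)
    return [factor * v for v in embedding]


gene_vector = [0.5, -1.0, 2.0]
for count in (1, 10, 100):
    factor = 1.0 + math.log1p(count)
    print(count, round(factor, 3), scale_embedding(gene_vector, count))
```

Because the factor grows logarithmically, a bin with 100 UMIs of a gene is weighted only a few times more strongly than a bin with a single UMI, which keeps high-count bins from dominating.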

Architecture

StereoSegger employs a Heterogeneous Graph Attention Network (GATv2) to segment transcripts based on their spatial neighborhood and identity.

1. Nodes (The Graph Components)

The graph consists of two distinct node types:

Transcript Nodes (tx):
- Identity: Represents a specific gene detected at a specific spatial location.
- Resolution: For Stereo-seq (SAW bin1), each node corresponds to a (bin, gene) tuple. If a single bin contains 10 distinct genes, it generates 10 distinct nodes (in "star" mode).
- Features:
  - Gene Embedding: A learnable vector representing the gene identity.
  - UMI Count: The embedding vector is scaled by (1 + log1p(count)) to represent the signal intensity.
Boundary Nodes (bd):
- Identity: Represents a polygon boundary (e.g., a cell or nucleus segmentation).
- Features: Geometric properties like Area, Convexity, Elongation, and Circularity.
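
The exact definitions of the boundary features are not given here. As a rough illustration, the sketch below computes area with the shoelace formula and circularity with the common definition 4πA/P²; both choices are assumptions and may differ from what StereoSegger computes. Convexity and elongation are omitted.

```python
import math


def polygon_area_perimeter(vertices):
    """Return (area, perimeter) of a simple polygon given as (x, y) vertices."""
    n = len(vertices)
    twice_area = 0.0
    perimeter = 0.0
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]       # wrap around to close the ring
        twice_area += x0 * y1 - x1 * y0      # shoelace term
        perimeter += math.hypot(x1 - x0, y1 - y0)
    return abs(twice_area) / 2.0, perimeter


def circularity(vertices):
    """Assumed definition: 4*pi*area / perimeter**2 (1.0 for a perfect circle)."""
    area, perimeter = polygon_area_perimeter(vertices)
    return 4.0 * math.pi * area / perimeter ** 2


square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
print(polygon_area_perimeter(square))  # (4.0, 8.0)
print(circularity(square))
```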

2. Edges (The Connections)

Information flows between nodes via three edge types:

tx ↔ tx (Transcript-Transcript):
- Grid Star Topology:
  - Within Bin: All gene nodes in a bin connect to a central "hub" node (one of the genes in that bin).
  - Across Bins: Hub nodes connect to hub nodes of adjacent spatial bins (Grid Adjacency).
- Allows the model to aggregate local co-expression (within bin) and spatial continuity (across bins).
tx → bd (Transcript-Boundary Neighbors):
- Connects a transcript to nearby boundaries (potential candidate cells).
- The model uses this edge to decide if the transcript belongs to that boundary.
tx → bd (Transcript-Boundary Assignment - Supervision):
- Connects a transcript to the correct boundary (Ground Truth) during training.
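
How "nearby" is defined for the tx → bd neighbor edges is not stated here. One plausible reading, shown purely as an assumption, is a distance cutoff between each transcript and each boundary centroid. `candidate_edges` and the `radius` parameter are hypothetical.

```python
import math


def candidate_edges(transcripts, centroids, radius):
    """Return (tx_index, bd_index) pairs whose distance is at most radius.

    transcripts: list of (x, y) transcript coordinates.
    centroids: list of (x, y) boundary centroids (assumed proxy for the polygon).
    """
    edges = []
    for ti, (tx, ty) in enumerate(transcripts):
        for bi, (cx, cy) in enumerate(centroids):
            if math.hypot(tx - cx, ty - cy) <= radius:
                edges.append((ti, bi))
    return edges


transcripts = [(0.0, 0.0), (5.0, 0.0), (20.0, 20.0)]
centroids = [(1.0, 0.0), (6.0, 0.0)]
print(candidate_edges(transcripts, centroids, radius=3.0))  # [(0, 0), (1, 1)]
print(candidate_edges(transcripts, centroids, radius=5.0))  # [(0, 0), (1, 0), (1, 1)]
```

The brute-force double loop is only for readability; on real data a spatial index would be needed.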

Release history

0.2.3: Jan 29, 2026
0.2.2: Jan 29, 2026
0.2.1: Jan 29, 2026
0.2.0: Jan 29, 2026
0.1.3: Jan 29, 2026
0.1.1: Jan 28, 2026
0.1.0: Jan 27, 2026 (this version)

