Fast and accurate cell segmentation for single-molecule spatial omics (Stereo-seq)

StereoSegger: Fast and Accurate Cell Segmentation for Spatial Omics

Note: This project is heavily inspired by the original Segger implementation by Elyas Heidari. You can find the original repository at EliHei2/segger_dev.

Installation

StereoSegger requires CUDA 12 for GPU acceleration. The provided setup is configured for CUDA 12.4.

Option 1: Automated Setup (Recommended)

We provide a setup script that handles the complex dependency chain (PyTorch 2.5.1, RAPIDS 24.08, CUDA 12.4) automatically. This is the most reliable method to ensure GPU acceleration works.

# Clone the repository
git clone https://github.com/nrclaudio/stereosegger.git
cd stereosegger

# Run the setup script (requires Conda installed)
bash scripts/setup_segger_env.sh

# Activate the environment
conda activate segger_env

Option 2: Pip Install (Advanced)

If you are managing your own CUDA environment, you can install via pip. Note that you must include the NVIDIA and PyTorch indices to get the correct GPU-accelerated wheels.

pip install stereosegger \
  --extra-index-url https://pypi.nvidia.com \
  --extra-index-url https://download.pytorch.org/whl/cu124
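
After either install route, it can help to confirm that PyTorch actually sees a GPU before building datasets. The following is a minimal sketch, not part of StereoSegger: `gpu_stack_report` is a hypothetical helper name, and it relies only on `torch.__version__`, `torch.version.cuda`, and `torch.cuda.is_available()`.

```python
import importlib.util


def gpu_stack_report():
    """Report whether PyTorch is importable and whether it can see a CUDA device.

    Hypothetical helper for illustration; not part of the StereoSegger package.
    """
    report = {
        "torch_installed": False,
        "torch_version": None,
        "cuda_build": None,       # CUDA version the PyTorch wheel was built against
        "cuda_available": False,  # True only if a usable GPU and driver are present
    }
    if importlib.util.find_spec("torch") is None:
        return report

    import torch

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    report["cuda_build"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in gpu_stack_report().items():
        print(f"{key}: {value}")
```

If `cuda_build` is `None` while PyTorch is installed, a CPU-only wheel was most likely picked up, which usually means the extra index URLs were not applied.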

Inputs & Outputs

1. Inputs

StereoSegger primarily operates on Parquet files derived from standard spatial formats.

A. Raw Input (SAW Output)

Format: h5ad (AnnData)
Source: Output from the SAW pipeline (Stereo-seq Analysis Workflow).
Requirements:
- .X: Sparse matrix of gene counts.
- .obsm['spatial']: (x, y) coordinates of the bins.
- .var: Index must contain unique gene names.
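
The requirements above can be checked before conversion. This is a sketch under stated assumptions: `check_saw_input` is a hypothetical helper, not a StereoSegger function, and it works on an AnnData object (for example one loaded with `anndata.read_h5ad`) or on anything exposing the same `.X`, `.obsm`, and `.var` attributes. The sparsity check runs only when SciPy is installed.

```python
def check_saw_input(adata):
    """Return a list of problems found; an empty list means all checks passed.

    Hypothetical helper for illustration; not part of the StereoSegger package.
    """
    problems = []
    n_bins, n_genes = adata.X.shape

    # .obsm['spatial'] must hold one (x, y) pair per bin.
    if "spatial" not in adata.obsm:
        problems.append("missing .obsm['spatial']")
    else:
        shape = tuple(adata.obsm["spatial"].shape)
        if len(shape) != 2 or shape[0] != n_bins or shape[1] != 2:
            problems.append(f".obsm['spatial'] has shape {shape}, expected ({n_bins}, 2)")

    # .var index must contain unique gene names, one per column of .X.
    names = list(adata.var.index)
    if len(set(names)) != len(names):
        problems.append(".var index contains duplicate gene names")
    if len(names) != n_genes:
        problems.append(".X column count does not match .var")

    # .X should be sparse; skip this check if SciPy is not available.
    try:
        from scipy.sparse import issparse
    except ImportError:
        issparse = None
    if issparse is not None and not issparse(adata.X):
        problems.append(".X is not a SciPy sparse matrix")

    return problems
```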

B. Processed Input (StereoSegger Native)

If you are skipping the conversion step, provide a directory containing:

transcripts.parquet: Long-form table of gene-location occurrences (transcript_id, gene_id, x, y, count, bx, by).
genes.parquet: Mapping of gene_id to gene_name.
boundaries.parquet (Optional): WKB-encoded polygons (e.g., nuclei masks).
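
To make the long-form layout concrete, the sketch below builds `transcripts` and `genes` records from a toy set of bins in plain Python. It is an illustration only: `to_long_form` is a hypothetical name, and it assumes that `bx`, `by` are integer bin indices, that `x`, `y` are those indices multiplied by the bin pitch, and that IDs are assigned in order of first appearance. The converter shipped with StereoSegger may differ on all three points.

```python
def to_long_form(bins, bin_pitch=1.0, min_count=1):
    """Convert per-bin gene counts into long-form transcript and gene records.

    bins: iterable of ((bx, by), {gene_name: count}) pairs.
    Hypothetical helper; the column semantics are assumptions (see lead-in).
    """
    gene_ids = {}      # gene_name -> gene_id, in order of first appearance
    transcripts = []
    for (bx, by), gene_counts in bins:
        for gene_name, count in gene_counts.items():
            if count < min_count:
                continue  # drop zero or low-count entries
            gene_id = gene_ids.setdefault(gene_name, len(gene_ids))
            transcripts.append({
                "transcript_id": len(transcripts),
                "gene_id": gene_id,
                "x": bx * bin_pitch,
                "y": by * bin_pitch,
                "count": count,
                "bx": bx,
                "by": by,
            })
    genes = [{"gene_id": gid, "gene_name": name} for name, gid in gene_ids.items()]
    return transcripts, genes


toy_bins = [
    ((0, 0), {"Actb": 3, "Gapdh": 0}),
    ((1, 0), {"Actb": 1}),
]
transcripts, genes = to_long_form(toy_bins)
```

Each row is one nonzero (bin, gene) entry rather than one molecule, which is why the table carries a `count` column.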

2. Outputs

The pipeline produces three main types of output files, depending on the stage and your configuration.

A. Segmentation Results (`.h5ad`) - Recommended

The primary output for downstream analysis. Generated when file_format=anndata.

Expression Matrix (X): A sparse matrix of shape (n_cells, n_genes) containing UMI counts.
Cell Metadata (obs):
- transcripts: Total UMI count per cell.
- unique_transcripts: Number of unique genes detected in the cell.
- cell_centroid_x, cell_centroid_y: Spatial center of the segmented cell.
- cell_area: Area of the cell (computed via Convex Hull).
Gene Metadata (var):
- total_assigned: Number of transcripts of this gene assigned to cells.
- total_unassigned: Number of transcripts of this gene that remain unassigned.
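
As an illustration of the `cell_area` and centroid fields above, here is a self-contained sketch of a convex-hull area computation (Andrew's monotone chain followed by the shoelace formula). It is not StereoSegger's code, the function names are hypothetical, and taking the centroid as the mean of the transcript coordinates is an assumption.

```python
def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Shoelace formula; returns 0.0 for fewer than three vertices."""
    n = len(vertices)
    twice_area = sum(
        vertices[i][0] * vertices[(i + 1) % n][1]
        - vertices[(i + 1) % n][0] * vertices[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2.0


def cell_summary(transcript_xy):
    """Centroid (mean of coordinates, an assumption) and convex-hull area of one cell."""
    n = len(transcript_xy)
    return {
        "cell_centroid_x": sum(x for x, _ in transcript_xy) / n,
        "cell_centroid_y": sum(y for _, y in transcript_xy) / n,
        "cell_area": polygon_area(convex_hull(transcript_xy)),
    }


summary = cell_summary([(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)])
```

Because the hull is convex, this area is an upper bound on the footprint of a concave cell.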

B. Segmentation Table (`.csv` or `.parquet`)

A long-form record of every transcript's assignment.

Columns:
- transcript_id: The unique ID of the input transcript.
- seg_label: The ID of the cell this transcript was assigned to.
- score: The model's confidence score for the assignment.
- bound: Boolean flag (1 = assigned to a nucleus/seed; 0 = assigned via graph-based connected components).
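
A typical first step with this table is to filter by score and count assignments per cell. The sketch below does so on in-memory rows. `summarize_assignments` is a hypothetical helper, the 0.5 threshold is an arbitrary example value, and representing an unassigned transcript as `seg_label=None` is an assumption about the file contents.

```python
from collections import Counter


def summarize_assignments(rows, min_score=0.5):
    """Summarize a segmentation table given as a list of dict rows.

    Hypothetical helper; min_score=0.5 is an example value, not a recommendation.
    """
    kept = [
        r for r in rows
        if r["seg_label"] is not None and r["score"] >= min_score
    ]
    n_bound = sum(1 for r in kept if r["bound"] == 1)
    return {
        "n_kept": len(kept),
        "per_cell": dict(Counter(r["seg_label"] for r in kept)),
        "n_bound": n_bound,               # assigned to a nucleus/seed
        "n_graph": len(kept) - n_bound,   # assigned via connected components
    }


rows = [
    {"transcript_id": 0, "seg_label": "cell_1", "score": 0.9, "bound": 1},
    {"transcript_id": 1, "seg_label": "cell_1", "score": 0.6, "bound": 0},
    {"transcript_id": 2, "seg_label": "cell_2", "score": 0.3, "bound": 1},
    {"transcript_id": 3, "seg_label": None, "score": 0.0, "bound": 0},
]
stats = summarize_assignments(rows)
```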

C. Intermediate Tiled Dataset (`.pt`)

Generated by create_dataset_fast in your data_dir.

Content: Serialized PyTorch Geometric HeteroData objects.
Use Case: These are used for training and as the immediate input for the predict step. They contain the spatial graph (nodes, edges, features) for 1000x1000 pixel tiles.
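
The tiling idea can be sketched as follows. This is a simplified illustration, assuming non-overlapping tiles anchored at the origin; `assign_tiles` is a hypothetical name, and the real `create_dataset_fast` may add margins or use a different tile origin.

```python
from collections import defaultdict


def assign_tiles(points, tile_size=1000):
    """Group point indices by the (column, row) of the tile that contains them.

    Hypothetical helper; assumes tiles start at (0, 0) and do not overlap.
    """
    tiles = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        key = (int(x // tile_size), int(y // tile_size))
        tiles[key].append(idx)
    return dict(tiles)


tiles = assign_tiles([(10, 10), (999.5, 0), (1000, 0), (2500, 1500)])
```

Each tile can then be turned into its own graph, which keeps individual `.pt` files small enough to load one at a time.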

Quickstart: Stereo-seq SAW bin1

1. Convert Data & Create Dataset

# 1. Convert H5AD to Parquet
python -m stereosegger.cli.convert_saw_h5ad_to_segger_parquet \
  --h5ad C04895D5_tissue.h5ad \
  --out_dir ./raw_data \
  --bin_pitch 1.0 \
  --min_count 1

# 2. Build Graph Dataset
python -m stereosegger.cli.create_dataset_fast \
  --base_dir ./raw_data \
  --data_dir ./processed_dataset \
  --sample_type saw_bin1 \
  --tx_graph_mode grid_bins \
  --grid_connectivity 8 \
  --within_bin_edges star

2. Train Model

python -m stereosegger.cli.train_model \
  --dataset_dir ./processed_dataset \
  --models_dir ./models \
  --sample_tag my_sample \
  --max_epochs 200 \
  --accelerator cuda \
  --devices 1

3. Run Segmentation (Predict)

python -m stereosegger.cli.predict \
  --segger_data_dir ./processed_dataset \
  --models_dir ./models \
  --benchmarks_dir ./results \
  --transcripts_file ./raw_data/transcripts.parquet \
  --model_version 0
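
The four commands above can also be chained from Python, which is convenient on a cluster or in a notebook. The sketch below only assembles the same modules and flags shown in this quickstart and runs them with `subprocess`. `build_commands` and `run_pipeline` are hypothetical helper names, and the default paths are the example paths used above.

```python
import subprocess
import sys


def build_commands(h5ad, raw_dir="./raw_data", dataset_dir="./processed_dataset",
                   models_dir="./models", results_dir="./results",
                   sample_tag="my_sample"):
    """Return the quickstart commands as argument lists (hypothetical helper)."""
    py = [sys.executable, "-m"]
    return [
        py + ["stereosegger.cli.convert_saw_h5ad_to_segger_parquet",
              "--h5ad", h5ad, "--out_dir", raw_dir,
              "--bin_pitch", "1.0", "--min_count", "1"],
        py + ["stereosegger.cli.create_dataset_fast",
              "--base_dir", raw_dir, "--data_dir", dataset_dir,
              "--sample_type", "saw_bin1", "--tx_graph_mode", "grid_bins",
              "--grid_connectivity", "8", "--within_bin_edges", "star"],
        py + ["stereosegger.cli.train_model",
              "--dataset_dir", dataset_dir, "--models_dir", models_dir,
              "--sample_tag", sample_tag, "--max_epochs", "200",
              "--accelerator", "cuda", "--devices", "1"],
        py + ["stereosegger.cli.predict",
              "--segger_data_dir", dataset_dir, "--models_dir", models_dir,
              "--benchmarks_dir", results_dir,
              "--transcripts_file", f"{raw_dir}/transcripts.parquet",
              "--model_version", "0"],
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing step


commands = build_commands("C04895D5_tissue.h5ad")
run_pipeline(commands, dry_run=True)
```

With `check=True`, a failure in conversion or dataset creation stops the run before training starts.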

Technical Details

Stereo-seq SAW bin1 Methodology

StereoSegger implements specific logic to handle SAW bin1 data efficiently:

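Regular Grid: SAW bin1 data already lies on a regular grid. StereoSegger exploits this with grid adjacency: a bin's neighbors are the bins one step up, down, left, or right (plus the diagonals when grid_connectivity=8). Each neighbor lookup is O(1), whereas distance-based kNN on pseudo-points typically needs an O(N log N) spatial index.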
Consistency: Grid adjacency keeps local structure consistent with the chip layout and avoids sensitivity to sparsity or count magnitude.

Graph Modes & Definitions

Pseudo-transcript (Gene-Bin Node): A node created from a nonzero (bin, gene) entry. In this mode, all gene nodes in a bin connect to a central "hub" gene node, and hubs connect to the hubs of adjacent bins. Recommended for SAW.
Aggregated Bin Node: A node representing an entire spatial bin, aggregating all transcripts within it. Features are [log(total_count), log(n_genes)].
Grid Adjacency: Two bins are neighbors if their integer grid coordinates differ by one step. grid_connectivity=8 includes diagonals.
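
The definitions above can be sketched in a few lines. This is an illustration, not the package's implementation: the function names are hypothetical, occupied bins are held in a plain Python set so that each neighbor test is a constant-time hash lookup, and the natural logarithm is assumed for the bin features.

```python
import math

OFFSETS_4 = [(1, 0), (-1, 0), (0, 1), (0, -1)]
OFFSETS_8 = OFFSETS_4 + [(1, 1), (1, -1), (-1, 1), (-1, -1)]


def grid_neighbors(bin_xy, occupied, connectivity=8):
    """Return the occupied bins adjacent to bin_xy under 4- or 8-connectivity."""
    if connectivity not in (4, 8):
        raise ValueError("connectivity must be 4 or 8")
    offsets = OFFSETS_8 if connectivity == 8 else OFFSETS_4
    bx, by = bin_xy
    return [(bx + dx, by + dy) for dx, dy in offsets if (bx + dx, by + dy) in occupied]


def bin_features(gene_counts):
    """Aggregated bin node features [log(total_count), log(n_genes)].

    gene_counts: {gene_name: count} for one bin with at least one nonzero entry.
    """
    total = sum(gene_counts.values())
    n_genes = sum(1 for c in gene_counts.values() if c > 0)
    return [math.log(total), math.log(n_genes)]


occupied = {(0, 0), (1, 0), (1, 1), (5, 5)}
four = grid_neighbors((0, 0), occupied, connectivity=4)
eight = grid_neighbors((0, 0), occupied, connectivity=8)
features = bin_features({"Actb": 3, "Gapdh": 1})
```

The neighbor set depends only on integer grid coordinates, so it does not change when counts or sparsity change, which is the consistency property described in the methodology section.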

Architecture

StereoSegger employs a Heterogeneous Graph Attention Network (GATv2) to segment transcripts based on their spatial neighborhood and identity.
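
For reference, the standard GATv2 attention score between a node $i$ and a neighbor $j$, as defined in the original GATv2 formulation, is shown below. How StereoSegger parameterizes its layers for each edge type is not stated here, so treat this as background on the operator rather than a description of the model. $\mathbf{h}_i$ is the feature vector of node $i$, $\mathbf{W}$ and $\mathbf{a}$ are learned parameters, and $\mathcal{N}(i)$ is the neighborhood of $i$.

```latex
e_{ij} = \mathbf{a}^{\top}\,\mathrm{LeakyReLU}\!\left(\mathbf{W}\,[\mathbf{h}_i \,\Vert\, \mathbf{h}_j]\right),
\qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}(i)} \exp(e_{ik})}
```

Each node's updated representation is then a sum of its neighbors' transformed features weighted by $\alpha_{ij}$.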

1. Nodes (The Graph Components)

Transcript Nodes (tx): Each represents a specific gene at a spatial location. Gene embeddings are scaled by (1 + log(count)) to represent signal intensity without inflating the graph size.
Boundary Nodes (bd): Each represents a polygon boundary (e.g., a nucleus). Features such as area are log-transformed for numerical stability.
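
The count scaling described for transcript nodes can be written out as below. This is a minimal sketch with a plain list standing in for the gene embedding; `scale_embedding` is a hypothetical name, and the natural logarithm is assumed.

```python
import math


def scale_embedding(embedding, count):
    """Scale a gene embedding by (1 + log(count)) for a node with the given count."""
    if count < 1:
        raise ValueError("count must be at least 1 for a nonzero (bin, gene) entry")
    factor = 1.0 + math.log(count)
    return [factor * value for value in embedding]


single = scale_embedding([1.0, -2.0], count=1)     # factor is exactly 1.0
strong = scale_embedding([1.0, -2.0], count=100)   # factor is about 5.6
```

A (bin, gene) entry with count 100 stays a single node whose embedding is scaled by roughly 5.6, instead of becoming 100 separate nodes.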

2. Edges (The Connections)

tx $\leftrightarrow$ tx (Transcript-Transcript): Star topology (within bin) + Grid adjacency (across bins).
tx $\rightarrow$ bd (Transcript-Boundary Neighbors): Connects transcripts to nearby candidate cells.
tx $\rightarrow$ bd (Supervision): Connects a transcript to the correct ground-truth boundary during training.
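
The transcript-transcript edge rule (star within a bin, grid adjacency across bins) can be sketched as follows. This is an illustration under stated assumptions, not the package's code: `build_tx_edges` is a hypothetical name, edges are stored as undirected index pairs, and the hub of each bin is taken to be its highest-count gene node, which is a guess about how the hub is chosen.

```python
def build_tx_edges(nodes, connectivity=8):
    """Build undirected tx-tx edges for gene-bin nodes.

    nodes: list of (bx, by, gene_id, count) tuples; the list index is the node ID.
    Hub choice (highest count in the bin) is an assumption made for this sketch.
    """
    offsets = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    if connectivity == 8:
        offsets += [(1, 1), (1, -1), (-1, 1), (-1, -1)]

    # Group node indices by bin.
    by_bin = {}
    for idx, (bx, by, _gene, _count) in enumerate(nodes):
        by_bin.setdefault((bx, by), []).append(idx)

    # One hub per bin.
    hubs = {b: max(members, key=lambda i: nodes[i][3]) for b, members in by_bin.items()}

    edges = set()
    # Star topology: every non-hub node in a bin connects to that bin's hub.
    for b, members in by_bin.items():
        for i in members:
            if i != hubs[b]:
                edges.add((min(i, hubs[b]), max(i, hubs[b])))
    # Grid adjacency: hubs of neighboring bins connect to each other.
    for (bx, by), hub in hubs.items():
        for dx, dy in offsets:
            other = hubs.get((bx + dx, by + dy))
            if other is not None:
                edges.add((min(hub, other), max(hub, other)))
    return sorted(edges)


nodes = [
    (0, 0, "A", 5), (0, 0, "B", 1),  # two genes in bin (0, 0); node 0 is the hub
    (1, 0, "A", 2),
    (1, 1, "C", 1),
    (5, 5, "A", 1),                  # isolated bin, no edges
]
edges_8 = build_tx_edges(nodes, connectivity=8)
edges_4 = build_tx_edges(nodes, connectivity=4)
```

With a star, a bin holding k genes contributes k - 1 within-bin edges rather than the k(k - 1)/2 of a fully connected bin.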

Release history

0.2.3: Jan 29, 2026
0.2.2: Jan 29, 2026
0.2.1: Jan 29, 2026
0.2.0: Jan 29, 2026
0.1.3: Jan 29, 2026
0.1.1: Jan 28, 2026 (this version)
0.1.0: Jan 27, 2026

Download files

Source Distribution

stereosegger-0.1.1.tar.gz (92.0 kB)

Uploaded Jan 28, 2026 Source

Built Distribution

stereosegger-0.1.1-py3-none-any.whl (101.7 kB)

Uploaded Jan 28, 2026 Python 3

File details

Details for the file stereosegger-0.1.1.tar.gz.

File metadata

Download URL: stereosegger-0.1.1.tar.gz
Upload date: Jan 28, 2026
Size: 92.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for stereosegger-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`7941d69d4b8f278c4f12e95ef99642f056380746ec084d17fee828450f815733`
MD5	`d63d12b430b5e855f2d3be3a58d51cd6`
BLAKE2b-256	`29a4e478582d06d4c9be6d8162bfe8a4e5e344a42fd9910ceef8e86af35ba83b`

File details

Details for the file stereosegger-0.1.1-py3-none-any.whl.

File metadata

Download URL: stereosegger-0.1.1-py3-none-any.whl
Upload date: Jan 28, 2026
Size: 101.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for stereosegger-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c467d7aa16dcd8e29a542df70a5158dc953e0db363044ea7b2f060bd8de829f4`
MD5	`3c22a94ee6b587b50f66baecd3053e6f`
BLAKE2b-256	`67ae02ecbc06c2256f75335ec1294c07239e22e4562e2c92b205132dc6f39e9b`
