Fast and accurate cell segmentation for single-molecule spatial omics (Stereo-seq)

StereoSegger: Fast and Accurate Cell Segmentation for Spatial Omics

Note: This project is heavily inspired by the original Segger implementation by Elyas Heidari. You can find the original repository at EliHei2/segger_dev.

Installation

StereoSegger requires CUDA 12 for GPU acceleration. The pinned PyTorch and RAPIDS builds target CUDA 12.4.

Quick Install (One-Liner)

To install StereoSegger with full GPU acceleration (PyTorch + RAPIDS), use the following command:

pip install stereosegger --extra-index-url https://download.pytorch.org/whl/cu124 --extra-index-url https://pypi.nvidia.com

Option 1: Automated Setup (Recommended for HPC/Conda)

We provide a setup script that handles the complex dependency chain (PyTorch 2.5.1, RAPIDS 24.08, CUDA 12.4) automatically inside a clean Conda environment.

# Clone the repository
git clone https://github.com/nrclaudio/stereosegger.git
cd stereosegger

# Run the setup script (requires Conda)
bash scripts/setup_segger_env.sh

# Activate the environment
conda activate segger_env

Why the extra index URLs?

https://download.pytorch.org/whl/cu124: Ensures you get the version of PyTorch compiled for CUDA 12.4. Without this, pip may download a CPU-only version or an incompatible CUDA build.
https://pypi.nvidia.com: Provides the RAPIDS (cuDF, cuML, etc.) wheels. While some RAPIDS components are moving to standard PyPI, this index ensures you get the most stable, CUDA-linked binaries.
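After installing, it can help to confirm that the GPU stack actually resolved from those indexes. The following is a minimal sketch, not part of StereoSegger: the helper name `check_gpu_stack` is hypothetical, and it only checks that the packages are importable, not that they were built for CUDA 12.4.

```python
import importlib.util


def check_gpu_stack(packages=("torch", "cudf", "cuml")):
    """Report which packages are importable, without actually importing them.

    find_spec() only locates the module, so this stays cheap and does not
    initialize CUDA.
    """
    return {name: importlib.util.find_spec(name) is not None for name in packages}


if __name__ == "__main__":
    for name, found in check_gpu_stack().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

If `torch` is found, `torch.cuda.is_available()` tells you whether the installed build can see a GPU; a `False` result often means pip fell back to a CPU-only wheel.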

Inputs & Outputs

1. Inputs

StereoSegger primarily operates on Parquet files derived from standard spatial formats.

A. Raw Input (SAW Output)

Format: h5ad (AnnData)
Source: Output from the SAW pipeline (Stereo-seq Analysis Workflow).
Requirements:
- .X: Sparse matrix of gene counts.
- .obsm['spatial']: (x, y) coordinates of the bins.
- .var: Index must contain unique gene names.

B. Processed Input (StereoSegger Native)

If you are skipping the conversion step, provide a directory containing:

transcripts.parquet: Long-form table of gene-location occurrences (transcript_id, gene_id, x, y, count, bx, by).
genes.parquet: Mapping of gene_id to gene_name.
boundaries.parquet (Optional): WKB-encoded polygons (e.g., nuclei masks).
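To make the long-form layout concrete, here is a rough sketch of how per-bin gene counts could be flattened into rows with the columns listed for transcripts.parquet. It is an illustration only, not the converter's actual code: the function name `to_long_form`, the input shape, and the rule `bx = round(x / bin_pitch)` are assumptions.

```python
def to_long_form(bins, gene_names, bin_pitch=1.0, min_count=1):
    """Flatten per-bin counts into one row per nonzero (bin, gene) entry.

    bins: iterable of (x, y, {gene_name: count}) tuples (hypothetical input shape).
    Returns a list of dicts keyed by the transcripts.parquet column names.
    """
    gene_ids = {name: i for i, name in enumerate(gene_names)}
    rows = []
    for x, y, counts in bins:
        # ASSUMPTION: grid coordinates are the position divided by the pitch,
        # rounded to the nearest integer.
        bx, by = round(x / bin_pitch), round(y / bin_pitch)
        for name, count in counts.items():
            if count < min_count:
                continue  # drop entries below the threshold, like --min_count
            rows.append({
                "transcript_id": len(rows),
                "gene_id": gene_ids[name],
                "x": x, "y": y,
                "count": count,
                "bx": bx, "by": by,
            })
    return rows


rows = to_long_form(
    [(0.0, 0.0, {"A": 2, "B": 0}), (1.0, 0.0, {"B": 3})],
    gene_names=["A", "B"],
)
```

With pandas and pyarrow installed, `pandas.DataFrame(rows).to_parquet("transcripts.parquet")` would write such rows to disk. The exact column dtypes StereoSegger expects are not specified here.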

Quickstart: Stereo-seq SAW bin1

1. Convert Data & Create Dataset

# 1. Convert H5AD to Parquet
python -m stereosegger.cli.convert_saw_h5ad_to_segger_parquet \
  --h5ad C04895D5_tissue.h5ad \
  --out_dir ./raw_data \
  --bin_pitch 1.0 \
  --min_count 1

# 2. Build Graph Dataset
python -m stereosegger.cli.create_dataset_fast \
  --base_dir ./raw_data \
  --data_dir ./processed_dataset \
  --sample_type saw_bin1 \
  --tx_graph_mode grid_bins \
  --grid_connectivity 8 \
  --within_bin_edges star

2. Train Model

python -m stereosegger.cli.train_model \
  --dataset_dir ./processed_dataset \
  --models_dir ./models \
  --sample_tag my_sample \
  --max_epochs 200 \
  --accelerator cuda \
  --devices 1

3. Run Segmentation (Predict)

For large datasets (like SAW bin1) using the grid optimizations, use the fast prediction script:

python -m stereosegger.cli.predict_fast \
  --segger_data_dir ./processed_dataset \
  --models_dir ./models \
  --benchmarks_dir ./results \
  --transcripts_file ./raw_data/transcripts.parquet \
  --model_version 0 \
  --tx_graph_mode grid_bins \
  --grid_connectivity 8
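The four Quickstart commands can also be chained from one script. The sketch below only assembles the module names and flags shown above; the helpers `build_pipeline` and `run_pipeline` are hypothetical and not part of the package.

```python
import subprocess
import sys


def build_pipeline(h5ad, raw_dir="./raw_data", dataset_dir="./processed_dataset",
                   models_dir="./models", results_dir="./results",
                   sample_tag="my_sample"):
    """Return the four Quickstart commands as argv lists, in execution order."""
    cli = "stereosegger.cli."
    grid = ["--tx_graph_mode", "grid_bins", "--grid_connectivity", "8"]
    return [
        [sys.executable, "-m", cli + "convert_saw_h5ad_to_segger_parquet",
         "--h5ad", h5ad, "--out_dir", raw_dir,
         "--bin_pitch", "1.0", "--min_count", "1"],
        [sys.executable, "-m", cli + "create_dataset_fast",
         "--base_dir", raw_dir, "--data_dir", dataset_dir,
         "--sample_type", "saw_bin1", *grid, "--within_bin_edges", "star"],
        [sys.executable, "-m", cli + "train_model",
         "--dataset_dir", dataset_dir, "--models_dir", models_dir,
         "--sample_tag", sample_tag, "--max_epochs", "200",
         "--accelerator", "cuda", "--devices", "1"],
        [sys.executable, "-m", cli + "predict_fast",
         "--segger_data_dir", dataset_dir, "--models_dir", models_dir,
         "--benchmarks_dir", results_dir,
         "--transcripts_file", f"{raw_dir}/transcripts.parquet",
         "--model_version", "0", *grid],
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(argv, check=True)
```

Calling `run_pipeline(build_pipeline("C04895D5_tissue.h5ad"))` prints the commands without running them, which is a cheap way to review paths before committing GPU time.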

Command Reference

1. `convert_saw_h5ad_to_segger_parquet`

Converts Stereo-seq SAW pipeline output (H5AD) into the Parquet format required by StereoSegger.

Options:

--h5ad PATH: Path to SAW bin1 h5ad file. (Required)
--out_dir PATH: Output directory for Segger parquet files. (Required)
--bin_pitch FLOAT: Bin pitch for rounding to grid coordinates. Default: 1.0.
--min_count INT: Minimum count to keep a bin-gene entry. Default: 1.
--labels_tif PATH: Optional label TIFF for boundary polygons.
--tissue_mask_tif PATH: Optional tissue mask TIFF.
--bbox FLOAT FLOAT FLOAT FLOAT: Bounding box xmin xmax ymin ymax.
--gene_name_source TEXT: Column in adata.var for gene names. Default: "real_gene_name".
--top_genes INT: Keep only top K genes by total counts.

2. `create_dataset` (Fast)

Creates the graph-based dataset used for training and inference.

Options:

--base_dir PATH: Directory containing raw parquet files. (Required)
--data_dir PATH: Directory to save the processed dataset. (Required)
--sample_type TEXT: e.g., "xenium", "merscope", "saw_bin1".
--tx_graph_mode [kdtree|grid_bins]: Strategy for transcript edges. Default: "grid_bins".
--grid_connectivity INT: Grid connectivity (4 or 8). Default: 8.
--within_bin_edges [none|star]: Within-bin edge strategy. Default: "star".
--tile_size INT: Size of the spatial tiles.
--n_workers INT: Number of parallel workers. Default: 0 (serial).

3. `train_model`

Trains the Segger segmentation model.

Options:

--dataset_dir PATH: Directory containing the processed dataset. (Required)
--models_dir PATH: Directory to save the model and logs. (Required)
--sample_tag TEXT: Unique tag for the sample. (Required)
--batch_size INT: Training batch size. Default: 1.
--max_epochs INT: Number of training epochs. Default: 300.
--accelerator TEXT: "cuda" or "cpu". Default: "cuda".
--devices INT: Number of GPUs to use. Default: 4.
--learning_rate FLOAT: Learning rate. Default: 1e-4.

4. `predict` / `predict_fast`

Runs the segmentation inference. predict_fast is optimized for large grid-based datasets.

Options:

--segger_data_dir PATH: Processed dataset directory. (Required)
--models_dir PATH: Trained models directory. (Required)
--benchmarks_dir PATH: Output results directory. (Required)
--transcripts_file PATH: Original transcripts parquet file. (Required)
--model_version INT: Version of the model to load. Default: 0.
--tx_graph_mode [kdtree|grid_bins]: Strategy for transcript edges. Default: "grid_bins".
--grid_connectivity INT: Grid connectivity (4 or 8). Default: 8.
--within_bin_edges [none|star]: Within-bin edge strategy. Default: "star".
--use_cc BOOL: Use connected components for unassigned transcripts. Default: False.
--file_format TEXT: Output format ("anndata", "parquet", "csv"). Default: "anndata".
--k_bd / --dist_bd: Boundary neighborhood parameters.
--k_tx / --dist_tx: Transcript neighborhood parameters.

Technical Details

Stereo-seq SAW bin1 Methodology

StereoSegger implements specific logic to handle SAW bin1 data efficiently:

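Regular Grid: SAW bin1 data already lies on a regular grid. We exploit this with grid adjacency: a bin's neighbors are the bins one step up, down, left, or right (plus the diagonals when grid_connectivity=8). Each neighbor is found by an O(1) coordinate lookup, so building all edges is O(N), compared with O(N log N) for distance-based kNN on pseudo-points.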
Consistency: Grid adjacency keeps local structure consistent with the chip layout and avoids sensitivity to sparsity or count magnitude.
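The constant-time lookup can be sketched with a hash set of occupied bin coordinates. This is an illustrative reimplementation under simple assumptions (integer bin coordinates, directed edges), not StereoSegger's own code; `grid_edges` is a hypothetical name.

```python
OFFSETS_4 = [(1, 0), (-1, 0), (0, 1), (0, -1)]
OFFSETS_8 = OFFSETS_4 + [(1, 1), (1, -1), (-1, 1), (-1, -1)]


def grid_edges(occupied_bins, connectivity=8):
    """Return directed edges between occupied bins that are grid neighbors.

    Each candidate neighbor is tested with a set membership check, so the
    cost per bin is constant and the total cost is linear in the bin count.
    """
    if connectivity not in (4, 8):
        raise ValueError("connectivity must be 4 or 8")
    offsets = OFFSETS_8 if connectivity == 8 else OFFSETS_4
    occupied = set(occupied_bins)
    edges = []
    for bx, by in sorted(occupied):
        for dx, dy in offsets:
            neighbor = (bx + dx, by + dy)
            if neighbor in occupied:
                edges.append(((bx, by), neighbor))
    return edges


bins = [(0, 0), (1, 0), (1, 1), (5, 5)]
edges_4 = grid_edges(bins, connectivity=4)
edges_8 = grid_edges(bins, connectivity=8)
```

In this toy layout the bin at (5, 5) has no occupied neighbors and gets no edges, and switching from 4 to 8 adds only the diagonal pair between (0, 0) and (1, 1). No distances are computed, which is why the result does not depend on sparsity or count magnitude.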

Graph Modes & Definitions

Pseudo-transcript (Gene-Bin Node): A node created from a nonzero (bin, gene) entry. In this mode, every gene node in a bin is connected to a central "hub" gene node, and the hubs of adjacent bins are connected to each other. Recommended for SAW.
Aggregated Bin Node: A node representing an entire spatial bin, aggregating all transcripts within it. Features are [log(total_count), log(n_genes)].
Grid Adjacency: Two bins are neighbors if their integer grid coordinates differ by one step. grid_connectivity=8 includes diagonals.
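The within-bin star topology and the aggregated bin features can be sketched as follows. How StereoSegger picks the hub is not stated here, so the rule below (the entry with the highest count) is an assumption, as are the function names and the use of the natural logarithm.

```python
import math
from collections import defaultdict


def star_edges(entries):
    """Build within-bin star edges for pseudo-transcript nodes.

    entries: list of (bx, by, gene_id, count); node ids are list indices.
    Returns (hubs, edges): the hub node per bin and (hub, member) pairs.
    """
    by_bin = defaultdict(list)
    for idx, (bx, by, _gene, _count) in enumerate(entries):
        by_bin[(bx, by)].append(idx)
    hubs, edges = {}, []
    for key, members in by_bin.items():
        # ASSUMPTION: the hub is the entry with the highest count in the bin.
        hub = max(members, key=lambda i: entries[i][3])
        hubs[key] = hub
        edges.extend((hub, i) for i in members if i != hub)
    return hubs, edges


def aggregated_bin_features(entries):
    """Return {bin: [log(total_count), log(n_genes)]} for aggregated bin nodes."""
    totals, genes = defaultdict(int), defaultdict(set)
    for bx, by, gene, count in entries:
        totals[(bx, by)] += count
        genes[(bx, by)].add(gene)
    return {key: [math.log(totals[key]), math.log(len(genes[key]))] for key in totals}


entries = [(0, 0, 10, 1), (0, 0, 11, 5), (0, 0, 12, 2), (1, 0, 10, 3)]
hubs, edges = star_edges(entries)
features = aggregated_bin_features(entries)
```

A star uses one edge per non-hub gene in a bin, rather than one per pair of genes, so edge count grows linearly with the number of genes in the bin. The hubs would then be linked to each other using the grid adjacency rule defined above.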

Architecture

StereoSegger employs a Heterogeneous Graph Attention Network (GATv2) to segment transcripts based on their spatial neighborhood and identity.
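For reference, the standard GATv2 attention update can be written as below. This is the generic formulation, not a description of StereoSegger's exact layer configuration; the symbol names are chosen here for illustration, with $\mathbf{h}_i$ the features of the receiving node, $\mathbf{h}_j$ those of a neighbor, and $\mathcal{N}(i)$ the neighbors of $i$ under one edge type.

```latex
\begin{align}
e_{ij} &= \mathbf{a}^{\top}\,\mathrm{LeakyReLU}\!\left(\mathbf{W}_{\mathrm{self}}\,\mathbf{h}_i + \mathbf{W}_{\mathrm{nbr}}\,\mathbf{h}_j\right) \\
\alpha_{ij} &= \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}(i)} \exp(e_{ik})} \\
\mathbf{h}_i' &= \sum_{j \in \mathcal{N}(i)} \alpha_{ij}\,\mathbf{W}_{\mathrm{nbr}}\,\mathbf{h}_j
\end{align}
```

In a heterogeneous graph, each edge type typically carries its own $\mathbf{a}$, $\mathbf{W}_{\mathrm{self}}$, and $\mathbf{W}_{\mathrm{nbr}}$, so transcript and boundary nodes can have features of different dimensions.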

1. Nodes (The Graph Components)

Transcript Nodes (tx): Each node represents a specific gene at a spatial location. Gene embeddings are scaled by (1 + log(count)), so signal intensity is encoded without inflating the graph with one node per count.
Boundary Nodes (bd): Each node represents a polygon boundary (e.g., a nucleus). Features such as area are log-transformed for numerical stability.
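A small worked example of the two transforms, assuming the natural logarithm and plain Python lists in place of tensors. Both function names are hypothetical.

```python
import math


def scale_gene_embedding(embedding, count):
    """Scale a gene embedding by (1 + log(count)).

    ASSUMPTION: natural log, and count >= 1 (only nonzero entries become nodes).
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    factor = 1.0 + math.log(count)
    return [factor * value for value in embedding]


def log_area(area):
    """Log-transform a polygon area so large and small cells share a scale.

    ASSUMPTION: plain log(area); the actual transform may differ (e.g. log1p).
    """
    return math.log(area)


unchanged = scale_gene_embedding([1.0, -2.0], count=1)
scaled = scale_gene_embedding([1.0], count=100)
```

A count of 1 leaves the embedding unchanged because log(1) = 0, while a count of 100 scales it by about 5.6. The same bin therefore stays a single node whatever its count.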

2. Edges (The Connections)

tx $\leftrightarrow$ tx (Transcript-Transcript): Star topology (within bin) + Grid adjacency (across bins).
tx $\rightarrow$ bd (Transcript-Boundary Neighbors): Connects transcripts to nearby candidate cells.
tx $\rightarrow$ bd (Supervision): Connects a transcript to the correct ground-truth boundary during training.
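The transcript-to-boundary neighbor edges can be pictured as a bounded nearest-neighbor search. The sketch below is a brute-force illustration only: it assumes boundaries are reduced to centroids, uses Euclidean distance, and picks arbitrary default values for `k_bd` and `dist_bd`. A real implementation would likely use a spatial index.

```python
import math


def tx_to_bd_edges(tx_xy, bd_centroids, k_bd=3, dist_bd=10.0):
    """Connect each transcript to at most k_bd boundaries within dist_bd.

    tx_xy: list of (x, y) transcript positions.
    bd_centroids: list of (x, y) boundary centroids (ASSUMPTION: centroids
    stand in for full polygons).
    Returns (transcript_index, boundary_index) pairs, nearest first.
    """
    edges = []
    for t, (tx, ty) in enumerate(tx_xy):
        nearest = sorted(
            (math.hypot(tx - bx, ty - by), b)
            for b, (bx, by) in enumerate(bd_centroids)
        )
        edges.extend((t, b) for dist, b in nearest[:k_bd] if dist <= dist_bd)
    return edges


candidate_edges = tx_to_bd_edges(
    tx_xy=[(0.0, 0.0), (100.0, 100.0)],
    bd_centroids=[(1.0, 0.0), (0.0, 2.0), (50.0, 50.0)],
    k_bd=2,
    dist_bd=5.0,
)
```

The first transcript is linked to its two nearby boundaries, and the second is too far from any boundary to receive a candidate edge. Supervision edges, by contrast, come from ground-truth assignments rather than from distance.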

Release history

0.2.3: Jan 29, 2026
0.2.2: Jan 29, 2026
0.2.1: Jan 29, 2026
0.2.0 (this version): Jan 29, 2026
0.1.3: Jan 29, 2026
0.1.1: Jan 28, 2026
0.1.0: Jan 27, 2026

Download files

Source Distribution

stereosegger-0.2.0.tar.gz (81.6 kB)

Uploaded Jan 29, 2026 Source

Built Distribution

stereosegger-0.2.0-py3-none-any.whl (89.3 kB)

Uploaded Jan 29, 2026 Python 3

File details

Details for the file stereosegger-0.2.0.tar.gz.

File metadata

Download URL: stereosegger-0.2.0.tar.gz
Upload date: Jan 29, 2026
Size: 81.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for stereosegger-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`674a8a74ee1f79f917c9f252ab530e7d6f0326f1b0f4112c4478764b57dc4423`
MD5	`f1d156dfb17111ae886487a1f18ece66`
BLAKE2b-256	`d628547f4d9e87f145e7ab3ccb380d01491ed880dea189b80421db5160e65ea1`

File details

Details for the file stereosegger-0.2.0-py3-none-any.whl.

File metadata

Download URL: stereosegger-0.2.0-py3-none-any.whl
Upload date: Jan 29, 2026
Size: 89.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for stereosegger-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`42248a503b5f64de681b55027a4085fc005f7dcfc6d6f37d59b9164758e83be2`
MD5	`a9080c40502a5ed06e960fd5e1ac648c`
BLAKE2b-256	`eabb45b8d5f8906d17f14a611f5510edd4d05641ac7bfc16778b703f68dc9ef1`
