Fast and accurate cell segmentation for single-molecule spatial omics (Stereo-seq)

StereoSegger: Fast and Accurate Cell Segmentation for Stereo-seq

Note: This project is heavily inspired by the original Segger implementation by Elyas Heidari. You can find the original repository at EliHei2/segger_dev. This version is specifically optimized and refactored for Stereo-seq (SAW bin1) workflows.

Installation

StereoSegger requires a CUDA 12 toolchain for GPU acceleration; the packages below are built against CUDA 12.4.

Quick Install (One-Liner)

pip install stereosegger --extra-index-url https://download.pytorch.org/whl/cu124 --extra-index-url https://pypi.nvidia.com

Option 1: Automated Setup (Recommended for HPC/Conda)

We provide a setup script that handles the complex dependency chain (PyTorch 2.5.1, RAPIDS 24.08, CUDA 12.4) automatically inside a clean Conda environment.

# Clone the repository
git clone https://github.com/nrclaudio/stereosegger.git
cd stereosegger

# Run the setup script (requires Conda)
bash scripts/setup_segger_env.sh

# Activate the environment
conda activate segger_env

Inputs & Outputs

StereoSegger operates on Parquet files. We provide a built-in command to convert raw Stereo-seq H5AD files into this format.

1. Raw Input (SAW Output)

  • Format: h5ad (AnnData)
  • Source: Output from the SAW pipeline (Stereo-seq Analysis Workflow).
  • Conversion: Use stereosegger convert_saw to prepare this for the pipeline.

2. Processed Input (Parquet)

The core pipeline expects a directory containing:

  • transcripts.parquet: Long-form table of gene-location occurrences.
  • genes.parquet: Mapping of gene_id to gene_name.
  • boundaries.parquet: Polygons (required for training, optional for prediction).
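Before building a dataset, it can help to confirm that the directory holds the files listed above. The following is a minimal sketch that relies only on those file names; `missing_inputs` is a hypothetical helper written for illustration and is not part of the stereosegger API.

```python
from pathlib import Path

# File names taken from the list above; boundaries are only needed for training.
REQUIRED = ("transcripts.parquet", "genes.parquet")
TRAINING_ONLY = ("boundaries.parquet",)


def missing_inputs(base_dir, for_training=False):
    """Return the expected Parquet files that are absent from base_dir."""
    expected = REQUIRED + (TRAINING_ONLY if for_training else ())
    base = Path(base_dir)
    return [name for name in expected if not (base / name).is_file()]
```

An empty list means the directory looks complete for the chosen mode; the check only tests for presence and does not validate the Parquet schema.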

1. Prepare Data

Path A: For Training (Kidneys)

Training requires ground-truth labels. You must provide a label TIFF (e.g., ssdna_mask).

# 1. Convert with labels
stereosegger convert_saw \
  --h5ad kidney_sample.h5ad \
  --labels_tif ssdna_mask.tif \
  --out_dir ./raw_data_labeled

# 2. Build Dataset
stereosegger create_dataset \
  --base_dir ./raw_data_labeled \
  --data_dir ./dataset_labeled

Path B: For Prediction (Whole Chip)

Prediction on new data uses a pre-trained model and does not require a mask.

# 1. Convert without labels
stereosegger convert_saw \
  --h5ad whole_chip.h5ad \
  --out_dir ./raw_data_unlabeled

# 2. Build Dataset
stereosegger create_dataset \
  --base_dir ./raw_data_unlabeled \
  --data_dir ./dataset_unlabeled
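When the two preparation steps need to run for many samples, the commands can be assembled from Python. The sketch below only builds argument lists from the flags shown above; the helper names `convert_saw_argv` and `create_dataset_argv` are hypothetical and do not exist in stereosegger.

```python
import shlex


def convert_saw_argv(h5ad, out_dir, labels_tif=None):
    """Build the argument list for `stereosegger convert_saw`."""
    argv = ["stereosegger", "convert_saw", "--h5ad", h5ad]
    if labels_tif is not None:
        # Only the training path passes a label TIFF.
        argv += ["--labels_tif", labels_tif]
    argv += ["--out_dir", out_dir]
    return argv


def create_dataset_argv(base_dir, data_dir):
    """Build the argument list for `stereosegger create_dataset`."""
    return ["stereosegger", "create_dataset",
            "--base_dir", base_dir, "--data_dir", data_dir]


if __name__ == "__main__":
    # Print the Path B commands as shell-quoted strings.
    for argv in (
        convert_saw_argv("whole_chip.h5ad", "./raw_data_unlabeled"),
        create_dataset_argv("./raw_data_unlabeled", "./dataset_unlabeled"),
    ):
        print(shlex.join(argv))
```

To execute a command instead of printing it, each list can be passed to `subprocess.run(argv, check=True)`, which raises if the step fails so that the dataset build never runs on a failed conversion.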

2. Train Model (Requires Labeled Data)

Training only works on a dataset that was converted with --labels_tif in the convert_saw step.

stereosegger train_model \
  --dataset_dir ./dataset_labeled \
  --models_dir ./models \
  --sample_tag my_sample \
  --max_epochs 300 \
  --devices 1

3. Run Segmentation (Predict)

stereosegger predict_fast \
  --segger_data_dir ./dataset_unlabeled \
  --models_dir ./models \
  --benchmarks_dir ./results \
  --transcripts_file ./raw_data_unlabeled/transcripts.parquet \
  --model_version 0

Command Reference

1. convert_saw

Converts Stereo-seq SAW pipeline output (H5AD) into Parquet format.

Options:

  • --h5ad PATH: Path to SAW bin1 h5ad file.
  • --out_dir PATH: Output directory.
  • --labels_tif PATH: Label TIFF used to produce boundary polygons. Optional for prediction; required if you intend to train.
  • --bin_pitch FLOAT: Bin pitch for rounding. Default: 1.0.
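The option list only says that the bin pitch is used "for rounding". One plausible reading, which is an assumption and not documented behavior, is that coordinates are snapped to the nearest multiple of the pitch. A minimal sketch of that idea, with the hypothetical helper `snap_to_bin`:

```python
import math


def snap_to_bin(coord, bin_pitch=1.0):
    """Round a coordinate to the nearest multiple of bin_pitch (halves round up).

    Assumption: this illustrates grid snapping in general, not the exact
    rule that convert_saw applies.
    """
    return math.floor(coord / bin_pitch + 0.5) * bin_pitch
```

`math.floor(x + 0.5)` is used instead of the built-in `round` because `round` sends exact halves to the nearest even integer, which would treat neighbouring bins unevenly.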

2. create_dataset

Creates the graph-based dataset used for training and inference.

Options:

  • --base_dir PATH: Directory containing raw parquet files.
  • --data_dir PATH: Directory to save the processed dataset.
  • --tx_graph_mode [kdtree|grid_bins]: Transcript edge strategy. Default: "grid_bins".
  • --grid_connectivity INT: Grid connectivity (4 or 8). Default: 8.
  • --within_bin_edges [none|star]: Within-bin edge strategy. Default: "star".
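To make the grid options concrete, here is a sketch of what 4- versus 8-connectivity and a "star" within-bin strategy could mean. It is written under assumptions: the function names are hypothetical, bins are integer (x, y) pairs, and "star" is read as linking one hub transcript per bin to the others. The actual edge construction in stereosegger may differ.

```python
from collections import defaultdict
from itertools import product


def grid_edges(occupied_bins, connectivity=8):
    """Undirected edges between occupied bins that are grid neighbours."""
    if connectivity not in (4, 8):
        raise ValueError("connectivity must be 4 or 8")
    occupied = set(occupied_bins)
    # 8-connectivity keeps diagonal offsets; 4-connectivity drops them.
    offsets = [
        (dx, dy)
        for dx, dy in product((-1, 0, 1), repeat=2)
        if (dx, dy) != (0, 0) and (connectivity == 8 or dx == 0 or dy == 0)
    ]
    edges = set()
    for x, y in occupied:
        for dx, dy in offsets:
            neighbour = (x + dx, y + dy)
            if neighbour in occupied:
                # Store each pair once, in sorted order.
                edges.add(tuple(sorted(((x, y), neighbour))))
    return edges


def star_edges(transcript_bins):
    """Within each bin, link the first transcript (the hub) to every other one."""
    by_bin = defaultdict(list)
    for tx_index, bin_xy in enumerate(transcript_bins):
        by_bin[bin_xy].append(tx_index)
    return [(members[0], other)
            for members in by_bin.values()
            for other in members[1:]]
```

A star keeps the number of within-bin edges linear in the number of transcripts per bin, whereas connecting every pair would grow quadratically.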

3. train_model

Trains the Segger model. Training stops if the dataset is unlabeled.

Options:

  • --dataset_dir PATH: Processed dataset directory.
  • --models_dir PATH: Directory to save the model.
  • --sample_tag TEXT: Unique tag for the sample.
  • --max_epochs INT: Number of training epochs. Default: 300.

4. predict_fast

Runs fast segmentation inference for large grid-based datasets.

Options:

  • --segger_data_dir PATH: Processed dataset directory.
  • --models_dir PATH: Trained models directory.
  • --benchmarks_dir PATH: Output results directory.
  • --transcripts_file PATH: Original transcripts parquet file.
  • --model_version INT: Version of the model to load. Default: 0.

Technical Details

Architecture

StereoSegger employs a Heterogeneous Graph Attention Network (GATv2) to segment transcripts based on their spatial neighborhood and identity.

  • Transcript Nodes (tx): Each node represents a specific gene at a spatial location.
  • Boundary Nodes (bd): Each node represents a polygon boundary (e.g., a nucleus).
  • Supervision: During training, the model learns to predict "belongs" edges between transcripts and ground-truth boundaries.
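The supervision signal described above can be pictured as a list of (transcript, boundary) pairs. The sketch below derives such pairs by geometric containment, which is an assumption: the source does not state how stereosegger computes its "belongs" labels, and the function names are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon's vertex list."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Count edges crossed by a horizontal ray from (x, y) towards +infinity.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def belongs_edges(transcripts, boundaries):
    """Return (tx_index, bd_index) pairs for transcripts inside a boundary."""
    return [
        (t, b)
        for t, (x, y) in enumerate(transcripts)
        for b, polygon in enumerate(boundaries)
        if point_in_polygon(x, y, polygon)
    ]
```

For example, with one square boundary `[(0, 0), (4, 0), (4, 4), (0, 4)]` and transcripts at `(1, 1)`, `(5, 5)` and `(2, 3)`, the first and third transcripts receive a "belongs" edge and the second does not. The nested loop is quadratic and only suitable for illustration; a spatial index would be needed at chip scale.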
