raw2features

raw2features: OME-Zarr whole-slide image in, patch-level and slide-level foundation-model embeddings out

Read a whole-slide image in OME-Zarr / OME-NGFF and emit patch- and slide-level foundation-model embeddings - with storage backend and embedding models independently swappable.

Cloud-native and FAIR: slides read directly from cloud storage, and each embedding carries the metadata needed to interpret and reuse it.

By analogy to bioformats2raw and raw2ometiff, but for features: point it at a raw OME-Zarr WSI, choose from 30+ feature extractors (UNI/UNI2, Virchow/Virchow2, CONCH, GigaPath, H-optimus, Phikon, CTransPath, …; full list in MODELS.md), and get back a compact, self-describing *.embeddings.zarr with per-patch coordinates, so that every embedding can be located back on the slide.

Status: alpha, under active development. Contributions welcome.

What is OME-Zarr?

Zarr stores large N-dimensional arrays as chunked, compressed pieces that can be read individually, so you can stream just the region you need directly from cloud storage. OME-Zarr (OME-NGFF) is the bioimaging convention on top of Zarr: a multi-resolution pyramid plus standard metadata (pixel size, axes, channels). It is a community-driven, widely adopted format. The BioImage Archive and IDR are adopting it at scale: the IDR has migrated its infrastructure to OME-Zarr (image viewing and raw-pixel access are now served directly from OME-Zarr in public object storage), and the BioImage Archive publishes whole-slide images in the same FAIR format. raw2features reads OME-Zarr (local or remote) and writes its embeddings back out as a Zarr store too.
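
To make the "read just the region you need" idea concrete, here is a minimal, library-free sketch of how a region maps onto a regular chunk grid. It is not raw2features or zarr code: the function name `chunks_for_region` and the 512×512 chunk shape are assumptions chosen for illustration.

```python
from itertools import product


def chunks_for_region(start, stop, chunk_shape):
    """Return the index of every chunk overlapping the half-open region [start, stop).

    start, stop and chunk_shape are per-axis tuples in array (pixel) units.
    """
    ranges = []
    for lo, hi, size in zip(start, stop, chunk_shape):
        if hi <= lo:
            return []  # an empty region touches no chunks
        first = lo // size          # chunk containing the first pixel
        last = (hi - 1) // size     # chunk containing the last pixel
        ranges.append(range(first, last + 1))
    return list(product(*ranges))


# A 224x224 patch whose top-left corner is at (y=1000, x=2000),
# in an array stored as 512x512 chunks (hypothetical chunk shape).
needed = chunks_for_region((1000, 2000), (1224, 2224), (512, 512))
print(needed)  # [(1, 3), (1, 4), (2, 3), (2, 4)]
```

Only those four chunks need to be fetched and decompressed, regardless of how large the full slide is.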

Why

  • OME-Zarr in, embeddings out. raw2features focuses on cloud-optimised, parallel-friendly NGFF reads → embeddings.
  • Exact MPP. Patches are extracted at the requested microns per pixel (default: 0.5 µm/px at 224 px) by downsampling from the nearest finer pyramid level, so that embeddings are comparable across slides and datasets.
  • Modular implementation. Reader, segmenter, patcher, embedder and sink are plugin seams exposed via Python entry-points: add a model or backend by shipping a package.
  • FAIR & provenance-first. Each model's weights are pinned to an immutable HuggingFace revision (or a sha256-pinned URL), with preprocessing sourced from each model's card. Every output records that provenance plus a 1:1 coords↔features mapping, so an embedding is reproducible and traceable to the exact weights that made it.
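
The "Exact MPP" point can be sketched as follows. This is one plausible reading of "nearest finer pyramid level" (the coarsest level whose pixel size does not exceed the target), not the package's actual implementation; `pick_level` and the example pixel sizes are hypothetical.

```python
def pick_level(level_mpps, target_mpp):
    """Pick the coarsest pyramid level that is still at least as fine as target_mpp.

    Returns (level_index, scale), where scale = target_mpp / level_mpp >= 1
    is the factor by which a region read at that level must be downsampled.
    """
    candidates = [(mpp, i) for i, mpp in enumerate(level_mpps) if mpp <= target_mpp]
    if not candidates:
        raise ValueError("no pyramid level is as fine as the requested MPP")
    mpp, idx = max(candidates)  # largest pixel size that is still <= target
    return idx, target_mpp / mpp


# Hypothetical scanner pyramid: level 0 at 0.2431 um/px, each level 2x coarser.
level_mpps = [0.2431, 0.4862, 0.9724]
idx, scale = pick_level(level_mpps, target_mpp=0.5)

# Pixels to read at the chosen level so the region resizes to a 224 px patch.
read_px = round(224 * scale)
print(idx, read_px)  # 1 230
```

Under these assumptions, a 230 px region is read from level 1 and resized to 224 px, so the patch covers the same physical area (224 × 0.5 µm) whatever the scanner's native resolution.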

Install

pip install "raw2features[all]"     # full stack: OME-Zarr reads + segmentation + torch + models
pip install "raw2features[zarr]"    # lean: remote/Zarr reads only, no torch
pip install raw2features            # core only (bring your own reader/model extras)

Extras are composable - e.g. raw2features[zarr,torch,models]. The export bridges (spatialdata, h5) stay opt-in; see MODEL_LICENSES.md and INTEROP.md.

Gated git-package encoders. A few encoders (CONCH, KRONOS, MUSK) ship as gated, non-PyPI git packages, so they install in two steps: the extra pulls in the PyPI dependencies, then a second command installs the encoder package itself:

pip install "raw2features[conch]"  && pip install git+https://github.com/Mahmoodlab/CONCH.git
pip install "raw2features[kronos]" && pip install git+https://github.com/mahmoodlab/KRONOS.git
pip install "raw2features[musk]"   && pip install git+https://github.com/lilab-stanford/MUSK

The same pattern covers the other gated encoders - mostly slide encoders (e.g. madeleine, gigapath_slide, seal), a few with extra model-specific steps (a pinned fork, flash-attn, or Drive-hosted weights). Each model's exact install is in its MODELS.md row and the matching extra's comment in pyproject.toml.

Development (from a clone, with uv):

uv sync                      # core
uv sync --extra zarr --extra image --extra torch --extra models   # full stack

Quickstart

With the stack installed (above):

raw2features sample sample.ome.zarr                          # synthetic slide
raw2features embed  sample.ome.zarr out/ -m resnet50 --device auto

--device auto picks CUDA → Apple MPS → CPU, so this runs anywhere. Tested on A100, L40S, GB10, and CPU.
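
The CUDA → Apple MPS → CPU fallback order can be reproduced in a few lines of PyTorch. This is a sketch of the selection order described above, not the package's own code; `pick_device` is a hypothetical name, and the CPU fallback when torch is missing is an assumption added so the snippet runs anywhere.

```python
def pick_device():
    """Return "cuda", "mps" or "cpu", preferring them in that order."""
    try:
        import torch
    except ImportError:
        return "cpu"  # assumption: without torch, fall back to CPU
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps is absent on older PyTorch builds, hence getattr.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())
```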

Notebooks

Runnable tutorials live in notebooks/. Start with the visual walkthrough: a real SurGen H&E slide is resolved from the BioImage Archive and processed cloud-direct (nothing downloaded), from thumbnail → tissue segmentation → patch tiles → a ResNet-50 feature map of the slide, all on CPU and without a model access token. Its figures are pre-rendered on GitHub.

Usage

Full guide: docs/usage.md - every command, what actually happens under the hood (exact MPP, decode-once fan-out, output schema), the rerun-safe / skip-if-complete model, thumbnails, and example SLURM cohort runs.

raw2features info slide.ome.zarr
raw2features embed slide.ome.zarr out/ \
    --model uni --model resnet50 \
    --mpp 1.0 --patch-size 224 --hf-token "$HF_TOKEN" \
    --emit-thumbnail                                  # optional QC thumbnail + overlay
raw2features list embedders

# Thumbnails can also be made standalone, before/after the embed run. By default
# they render at the segmentation MPP, so --overlay aligns the tissue mask + the
# kept-patch grid with no resampling (--thumbnail-mpp / --max-px to override).
raw2features thumbnail slide.ome.zarr out/ --overlay

# Optional post-hoc exports from the native out/slide.embeddings.zarr store:
# SpatialData for squidpy/napari, or HDF5 for TRIDENT/CLAM/TITAN/STAMP.
# These never re-compute embeddings; install [spatialdata] or [h5] as needed.
raw2features export-spatialdata out/slide.embeddings.zarr   # -> slide.spatialdata.zarr
raw2features export-h5 out/slide.embeddings.zarr --layout trident   # or --layout clam / stamp

Output

<slide_id>.embeddings.zarr/
├── .zattrs                  # source, provenance + a grids index
└── grids/<mpp>_<px>/        # one per geometry (usually just one, e.g. mpp0.5_px224)
    ├── .zattrs              # this grid's full header (patching, models, provenance)
    ├── coords/              # (N,2) int32 level-0 (x,y) - 1:1 with every features/<model>
    ├── grid_index/          # (N,2) int32 (row,col)
    ├── mask/                # (rows,cols) uint8 fraction of each cell that is tissue, 0-255 (unless --no-seg)
    └── features/<model>/    # (N, dim) float16

<slide_id>.thumbnail.png            # optional (--emit-thumbnail / thumbnail cmd)
<slide_id>.thumbnail.overlay.png    # optional QC overlay: tissue tint + kept-patch grid

<slide_id>.spatialdata.zarr/        # optional - `export-spatialdata`, see docs/INTEROP.md
<slide_id>.h5                       # optional - `export-h5` (TRIDENT/STAMP), see docs/INTEROP.md

Interop (optional export for supported packages): export to scverse SpatialData (squidpy / napari-spatialdata) or to pathology-MIL HDF5 (TRIDENT/CLAM/TITAN, KatherLab STAMP). These are one-way export bridges so you can feed existing toolchains; for full FAIR provenance use the default .embeddings.zarr. See INTEROP.md.
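
As a small illustration of how the coords and mask arrays in the layout above might be interpreted, the sketch below converts one coords row into a level-0 bounding box and one mask byte into a tissue fraction. It assumes that each (x, y) is the patch's top-left corner and that the level-0 pixel size is known; neither is stated here, and both helper names are hypothetical.

```python
def patch_box_level0(x, y, patch_px, target_mpp, level0_mpp):
    """Level-0 bounding box (x0, y0, x1, y1) of a patch.

    Assumes (x, y) is the top-left corner in level-0 pixels.
    """
    side = round(patch_px * target_mpp / level0_mpp)  # patch edge in level-0 pixels
    return (x, y, x + side, y + side)


def tissue_fraction(mask_value):
    """Map a uint8 mask value (0-255) to a fraction in [0, 1]."""
    return mask_value / 255.0


# A 224 px patch at 0.5 um/px on a slide whose level 0 is 0.25 um/px (hypothetical).
box = patch_box_level0(4480, 8960, patch_px=224, target_mpp=0.5, level0_mpp=0.25)
print(box)                  # (4480, 8960, 4928, 9408)
print(tissue_fraction(255)) # 1.0
```

Because coords is 1:1 with every features/<model> array, row i of a feature matrix belongs to the box computed from row i of coords.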

Remote / cloud reads (no download)

Any command that takes a slide path also takes a remote OME-Zarr URL. The reader opens http(s)://, s3://, gs://, etc. via fsspec/zarr, so the whole pipeline (segment → tile → embed) runs directly against a cloud store without downloading the slide. This needs the [zarr] extra (which ships fsspec); s3:// and gs:// URLs additionally need s3fs and gcsfs respectively.

# Extract straight from the EBI BioImage Archive - nothing lands on local disk.
raw2features embed \
  https://uk1s3.embassy.ebi.ac.uk/bia-integrator-data/S-BIAD1285/.../image.ome.zarr/0 \
  out/ -m uni --mpp 0.5 --read-block 16        # fewer, larger reads cut round-trips

Validated end-to-end against the EBI BioImage Archive. Remote reads are latency-bound, but in our read benchmark (raw2features benchmark) a cold embed-once run (the normal case) was only about 1.6x slower than local. The GPU, segmentation, and write work dominate and do not depend on where the slide lives, so the raw-read gap (around 16x on warm re-reads) mostly disappears. On a slow store, --read-block N groups patches into N×N reads to cut round-trips (bit-identical output; try 16 for remote and 8 for local), and 8 read workers was the sweet spot in both cases. For large cohorts, staging slides to local storage is still faster. See docs/usage.md for the remote-read and --read-block guidance.
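
The effect of --read-block can be pictured as bucketing patches by their grid cell. The sketch below is an assumption about how such grouping could work (blocks aligned to the patch grid), not the package's implementation; `group_into_blocks` is a hypothetical helper.

```python
from collections import defaultdict


def group_into_blocks(grid_index, n):
    """Group patch indices into n x n blocks of the patch grid.

    grid_index is a sequence of (row, col) pairs, one per patch.
    Returns {(block_row, block_col): [patch indices]}.
    """
    blocks = defaultdict(list)
    for i, (row, col) in enumerate(grid_index):
        blocks[(row // n, col // n)].append(i)
    return dict(blocks)


# Six kept patches on a sparse grid, grouped with a block size of 16.
grid_index = [(0, 0), (0, 1), (1, 0), (5, 17), (15, 15), (16, 0)]
blocks = group_into_blocks(grid_index, 16)
print(len(grid_index), "patches ->", len(blocks), "reads")  # 6 patches -> 3 reads
```

Each block can then be fetched with one larger read and sliced into patches in memory, which trades some over-read for fewer round-trips.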

Licence

MIT - see LICENSE. If you use raw2features, please cite it (see CITATION.cff).

raw2features does not ship model weights and grants no rights to them. When using a pretrained encoder please refer to that model's own licence (several are non-commercial, e.g. CC-BY-NC-ND). See MODEL_LICENSES.md.
