Read OME-Zarr whole-slide images and emit patch- and slide-level foundation-model embeddings - storage backend and model independently swappable.
raw2features
Read a whole-slide image in OME-Zarr / OME-NGFF and emit patch- and slide-level foundation-model embeddings - with storage backend and embedding models independently swappable.
Cloud-native and FAIR: slides read directly from cloud storage, and each embedding carries the metadata needed to interpret and reuse it.
By analogy to bioformats2raw and raw2ometiff, but for features: point it at a raw OME-Zarr WSI,
choose from 30+ feature extractors (UNI/UNI2, Virchow/Virchow2, CONCH,
GigaPath, H-optimus, Phikon, CTransPath, …; full list in MODELS.md),
and get back a compact, self-describing *.embeddings.zarr with per-patch
coordinates, so that every embedding can be located back on the slide.
Status: alpha, under active development. Contributions welcome.
What is OME-Zarr?
Zarr stores large N-dimensional arrays as chunked, compressed pieces you can read individually, so you can stream just the region you need, directly from cloud storage. OME-Zarr (OME-NGFF) is the bioimaging convention on top of Zarr: a multi-resolution pyramid plus standard metadata (pixel size, axes, channels). It is a community-driven, widely adopted format for bioimaging. The BioImage Archive and IDR are adopting it at scale: the IDR has migrated its infrastructure to OME-Zarr (image viewing and raw-pixel access are now served directly from OME-Zarr in public object storage), and the BioImage Archive publishes whole-slide images in the same FAIR format. raw2features reads OME-Zarr (local or remote) and writes its embeddings back out as a Zarr store too.
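To make the "read just the region you need" point concrete, here is a minimal sketch of the chunk arithmetic involved. It is an illustration only, not raw2features code: the function name and the 512 x 512 chunk shape are assumptions chosen for the example.

```python
from itertools import product


def chunks_for_region(y0, y1, x0, x1, chunk_shape=(512, 512)):
    """Return the (row, col) indices of chunks overlapping [y0, y1) x [x0, x1).

    chunk_shape is an assumed example value; real stores declare their own.
    """
    ch, cw = chunk_shape
    if y1 <= y0 or x1 <= x0:
        return []
    rows = range(y0 // ch, (y1 - 1) // ch + 1)
    cols = range(x0 // cw, (x1 - 1) // cw + 1)
    return list(product(rows, cols))


# A 224 x 224 patch whose top-left corner is at y=1000, x=400
# straddles a chunk boundary on both axes, so it touches four chunks.
touched = chunks_for_region(1000, 1224, 400, 624)
print(touched)
```

Only those chunks need to be fetched and decompressed, however large the full slide is.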
Why
- OME-Zarr in, embeddings out. raw2features focuses on cloud-optimised, parallel-friendly NGFF reads → embeddings.
- Exact MPP. Patches are extracted at the requested microns per pixel (e.g. the default 0.5 µm/px at 224 px) by downsampling from the nearest finer pyramid level, so that embeddings are comparable across slides and datasets.
- Modular implementation. Reader, segmenter, patcher, embedder and sink are plugin seams exposed via Python entry points: add a model or backend by shipping a package.
- FAIR & provenance-first. Each model's weights are pinned to an immutable HuggingFace revision (or a sha256-pinned URL), with preprocessing sourced from each model's card. Every output records that provenance plus a 1:1 coords↔features mapping, so an embedding is reproducible and traceable to the exact weights that made it.
Install
pip install "raw2features[all]" # full stack: OME-Zarr reads + segmentation + torch + models
pip install "raw2features[zarr]" # lean: remote/Zarr reads only, no torch
pip install raw2features # core only (bring your own reader/model extras)
Extras are composable - e.g. raw2features[zarr,torch,models]. The export bridges
(spatialdata, h5) stay opt-in; see MODEL_LICENSES.md and
INTEROP.md.
Gated git-package encoders. A few encoders (CONCH, KRONOS, MUSK) ship as gated, non-PyPI git packages, so they install in two steps. The extra pulls the PyPI stack, then one command installs the package itself:
pip install "raw2features[conch]" && pip install git+https://github.com/Mahmoodlab/CONCH.git
pip install "raw2features[kronos]" && pip install git+https://github.com/mahmoodlab/KRONOS.git
pip install "raw2features[musk]" && pip install git+https://github.com/lilab-stanford/MUSK
The same pattern covers the other gated encoders - mostly slide encoders (e.g. madeleine,
gigapath_slide, seal), a few with extra model-specific steps (a pinned fork, flash-attn,
or Drive-hosted weights). Each model's exact install is in its MODELS.md row
and the matching extra's comment in pyproject.toml.
Development (from a clone, with uv):
uv sync # core
uv sync --extra zarr --extra image --extra torch --extra models # full stack
Quickstart
With the stack installed (above):
raw2features sample sample.ome.zarr # synthetic slide
raw2features embed sample.ome.zarr out/ -m resnet50 --device auto
--device auto picks CUDA → Apple MPS → CPU, so this runs anywhere. Tested on A100, L40S,
GB10, and CPU.
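The CUDA → Apple MPS → CPU fallback can be sketched roughly as follows. This is an assumption about how such a selection is typically written with PyTorch, not the package's actual implementation; pick_device is a hypothetical helper name.

```python
def pick_device(requested="auto"):
    """Resolve a device string, preferring CUDA, then Apple MPS, then CPU."""
    if requested != "auto":
        return requested  # the caller chose explicitly
    try:
        import torch
    except ImportError:
        return "cpu"  # no torch installed, nothing to accelerate
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps only exists in newer PyTorch builds, so guard the lookup.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())
```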
Notebooks
Runnable tutorials live in notebooks/. Start with the
visual walkthrough - a real SurGen H&E slide
resolved from the BioImage Archive and taken cloud-direct (nothing downloaded) from
thumbnail → tissue segmentation → patch tiles → a ResNet-50 feature map of the slide, all
on CPU with no model-access-token. Its figures are pre-rendered on GitHub.
Usage
Full guide: docs/usage.md - every command, what actually
happens under the hood (exact MPP, decode-once fan-out, output schema), the
rerun-safe / skip-if-complete model, thumbnails, and example SLURM cohort runs.
raw2features info slide.ome.zarr
raw2features embed slide.ome.zarr out/ \
--model uni --model resnet50 \
--mpp 1.0 --patch-size 224 --hf-token "$HF_TOKEN" \
--emit-thumbnail # optional QC thumbnail + overlay
raw2features list embedders
# Thumbnails can also be made standalone, before/after the embed run. By default
# they render at the segmentation MPP, so --overlay aligns the tissue mask + the
# kept-patch grid with no resampling (--thumbnail-mpp / --max-px to override).
raw2features thumbnail slide.ome.zarr out/ --overlay
# Optional post-hoc exports from the native out/slide.embeddings.zarr store:
# SpatialData for squidpy/napari, or HDF5 for TRIDENT/CLAM/TITAN/STAMP.
# These never re-compute embeddings; install [spatialdata] or [h5] as needed.
raw2features export-spatialdata out/slide.embeddings.zarr # -> slide.spatialdata.zarr
raw2features export-h5 out/slide.embeddings.zarr --layout trident # or --layout clam / stamp
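As a rough illustration of what --mpp and --patch-size imply, the sketch below picks the nearest finer pyramid level for a requested resolution and works out how many pixels must be read there before downsampling. The function and the example level resolutions are assumptions for illustration; docs/usage.md describes the actual behaviour.

```python
def pick_level(level_mpps, target_mpp, patch_px=224):
    """Pick the coarsest level that is still at least as fine as target_mpp.

    Returns (level_index, downsample_factor, read_px), where read_px is the
    side length to read at that level so that downsampling by
    downsample_factor yields a patch_px patch at target_mpp.
    """
    finer = [(mpp, i) for i, mpp in enumerate(level_mpps) if mpp <= target_mpp]
    if not finer:
        raise ValueError("target MPP is finer than the highest-resolution level")
    mpp, idx = max(finer)  # largest MPP that does not exceed the target
    factor = target_mpp / mpp
    return idx, factor, round(patch_px * factor)


# Assumed example pyramid: 0.25, 0.5, 1.0 and 2.0 µm/px.
print(pick_level([0.25, 0.5, 1.0, 2.0], 0.5))  # exact match at level 1
print(pick_level([0.25, 1.0], 0.5))            # read 448 px at level 0, halve it
```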
Output
<slide_id>.embeddings.zarr/
├── .zattrs # source, provenance + a grids index
└── grids/<mpp>_<px>/ # one per geometry (usually just one, e.g. mpp0.5_px224)
├── .zattrs # this grid's full header (patching, models, provenance)
├── coords/ # (N,2) int32 level-0 (x,y) - 1:1 with every features/<model>
├── grid_index/ # (N,2) int32 (row,col)
├── mask/ # (rows,cols) uint8 fraction of each cell that is tissue, 0-255 (unless --no-seg)
└── features/<model>/ # (N, dim) float16
<slide_id>.thumbnail.png # optional (--emit-thumbnail / thumbnail cmd)
<slide_id>.thumbnail.overlay.png # optional QC overlay: tissue tint + kept-patch grid
<slide_id>.spatialdata.zarr/ # optional - `export-spatialdata`, see docs/INTEROP.md
<slide_id>.h5 # optional - `export-h5` (TRIDENT/STAMP), see docs/INTEROP.md
Interop (optional export for supported packages):
export to scverse SpatialData (squidpy / napari-spatialdata) or to pathology-MIL
HDF5 (TRIDENT/CLAM/TITAN, KatherLab STAMP). These are one-way export bridges so you
can feed existing toolchains; for full FAIR provenance use the default
.embeddings.zarr. See INTEROP.md.
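Two properties of the layout above are worth checking when consuming a store: coords is 1:1 with every features/<model> array, and mask values are tissue fractions scaled to 0-255. The helpers below sketch both checks on plain Python lists standing in for the arrays; the function names are hypothetical, and in practice the inputs would be arrays loaded from the store.

```python
def check_alignment(coords, features_by_model):
    """Verify every model's feature array has one row per coordinate; return N."""
    n = len(coords)
    for name, feats in features_by_model.items():
        if len(feats) != n:
            raise ValueError(f"{name}: {len(feats)} feature rows for {n} coords")
    return n


def tissue_fraction(mask_value):
    """Convert a uint8 mask cell (0-255) to a tissue fraction in [0, 1]."""
    return mask_value / 255


# Toy stand-ins: two patches with level-0 (x, y) coords and 3-dim features.
coords = [(0, 0), (224, 0)]
features = {"resnet50": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]}
print(check_alignment(coords, features), tissue_fraction(255))
```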
Remote / cloud reads (no download)
Any command that takes a slide path also takes a remote OME-Zarr URL - the reader
opens http(s)://, s3://, gs://, etc. via fsspec/zarr, so the whole pipeline
(segment → tile → embed) runs directly against a cloud store without downloading the
slide. Needs the [zarr] extra (ships fsspec); s3:// and gs:// additionally need s3fs and gcsfs.
# Extract straight from the EBI BioImage Archive - nothing lands on local disk.
raw2features embed \
https://uk1s3.embassy.ebi.ac.uk/bia-integrator-data/S-BIAD1285/.../image.ome.zarr/0 \
out/ -m uni --mpp 0.5 --read-block 16 # fewer, larger reads cut round-trips
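A quick way to see which extra package a given slide path would need is to inspect its URL scheme. The helper below is a hypothetical pre-flight check built only on the scheme-to-package pairs named above; it is not part of raw2features, and schemes beyond those are left unresolved rather than guessed.

```python
from urllib.parse import urlparse

# Scheme -> extra filesystem package, as stated in the text above.
EXTRA_PACKAGE = {"s3": "s3fs", "gs": "gcsfs"}


def backend_for(path):
    """Return 'local', 'fsspec', an extra package name, or None if unknown."""
    scheme = urlparse(path).scheme
    if scheme in ("", "file"):
        return "local"
    if scheme in ("http", "https"):
        return "fsspec"  # covered by the [zarr] extra
    return EXTRA_PACKAGE.get(scheme)


for p in ("slide.ome.zarr", "https://example.org/slide.ome.zarr", "s3://bucket/slide.ome.zarr"):
    print(p, "->", backend_for(p))
```

Note that Windows drive paths such as C:\slides would parse as scheme "c", so a real check would need to special-case them.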
Validated end-to-end against the EBI BioImage Archive. Remote reads are latency-bound, but
in our read benchmark (raw2features benchmark) a cold embed-once run (the normal case)
was only about 1.6x slower than local: the GPU, segmentation, and write work dominate and
don't depend on where the slide lives, so the raw-read gap (around 16x on warm re-reads)
mostly disappears. On a slow store, --read-block N groups patches
into N×N reads to cut round-trips (bit-identical output; try 16 remote, 8 local), and 8
read-workers was the sweet spot either way. For large cohorts, staging slides to local
storage is still faster. See docs/usage.md for the remote-read and
--read-block guidance.
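The round-trip saving from --read-block N can be pictured as grouping patches by N×N cell of the patch grid, so each group is fetched with one larger read. The sketch below shows only that grouping step, on assumed (row, col) grid indices; how raw2features actually schedules and slices the reads is not shown here.

```python
from collections import defaultdict


def group_into_blocks(grid_index, n):
    """Group patch indices by the N x N block of the patch grid they fall in."""
    blocks = defaultdict(list)
    for i, (row, col) in enumerate(grid_index):
        blocks[(row // n, col // n)].append(i)
    return dict(blocks)


# A dense 32 x 32 patch grid: 1024 per-patch reads collapse to 4 block reads at N=16.
grid = [(r, c) for r in range(32) for c in range(32)]
blocks = group_into_blocks(grid, 16)
print(len(grid), "patches ->", len(blocks), "block reads")
```

Because each patch is still cut from the same pixels, grouping changes only how many requests are made, which is consistent with the bit-identical output noted above.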
Licence
MIT - see LICENSE. If you use raw2features, please cite it (see CITATION.cff).
raw2features does not ship model weights and grants no rights to them. When using a pretrained encoder please refer to that model's own licence (several are non-commercial, e.g. CC-BY-NC-ND). See MODEL_LICENSES.md.