Read OME-Zarr whole-slide images and emit patch- and slide-level foundation-model embeddings - storage backend and model independently swappable.

Maintainers

craigmyles

Project description

raw2features

raw2features: OME-Zarr whole-slide image in, patch-level and slide-level foundation-model embeddings out

Read a whole-slide image in OME-Zarr / OME-NGFF and emit patch- and slide-level foundation-model embeddings - with storage backend and embedding models independently swappable.

Cloud-native and FAIR: slides read directly from cloud storage, and each embedding carries the metadata needed to interpret and reuse it.

By analogy to bioformats2raw and raw2ometiff, but for features: point it at a raw OME-Zarr WSI, choose from 30+ feature extractors (UNI/UNI2, Virchow/Virchow2, CONCH, GigaPath, H-optimus, Phikon, CTransPath, …; full list in MODELS.md), and get back a compact, self-describing *.embeddings.zarr with per-patch coordinates, so that every embedding can be mapped back to its location on the slide.

Status: alpha, under active development. Contributions welcome.

What is OME-Zarr?

Zarr stores large N-dimensional arrays as chunked, compressed pieces that can be read individually, so you can stream just the region you need directly from cloud storage. OME-Zarr (OME-NGFF) is the bioimaging convention on top of Zarr: a multi-resolution pyramid plus standard metadata (pixel size, axes, channels). It is a community-driven, widely adopted format for bioimaging. The BioImage Archive and IDR are adopting it at scale: the IDR has migrated its infrastructure to OME-Zarr (image viewing and raw-pixel access are now served directly from OME-Zarr in public object storage), and the BioImage Archive publishes whole-slide images in the same FAIR format. raw2features reads OME-Zarr (local or remote) and writes its embeddings back out as a Zarr store too.

Why

OME-Zarr in, embeddings out. raw2features focuses on cloud-optimised, parallel-friendly NGFF reads → embeddings.
Exact MPP. Patches are extracted at the requested microns per pixel (e.g. the default, 0.5 µm/px at 224 px) by downsampling from the nearest finer pyramid level, so that embeddings are comparable across slides and datasets.
Modular implementation. Reader, segmenter, patcher, embedder and sink are plugin seams exposed via Python entry-points: add a model or backend by shipping a package.
FAIR & provenance-first. Each model's weights are pinned to an immutable HuggingFace revision (or a sha256-pinned URL), with preprocessing sourced from each model's card. Every output records that provenance plus a 1:1 coords↔features mapping, so an embedding is reproducible and traceable to the exact weights that made it.
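The "Exact MPP" point above can be illustrated with a small sketch. This is not the package's implementation; it only shows the arithmetic of choosing the nearest finer pyramid level and the downsampling factor that follows. The function names (`pick_level`, `source_window_px`) and the example pyramid are hypothetical.

```python
def pick_level(level_mpps, target_mpp):
    """Pick the coarsest pyramid level that is still at least as fine as target_mpp.

    Returns (level_index, scale), where scale = target_mpp / level_mpp >= 1 is the
    downsampling factor to apply after reading from that level.
    """
    finer = [(mpp, i) for i, mpp in enumerate(level_mpps) if mpp <= target_mpp]
    if not finer:
        # Every level is coarser than requested; this sketch refuses to upsample.
        raise ValueError("no pyramid level is fine enough for the requested MPP")
    level_mpp, index = max(finer)
    return index, target_mpp / level_mpp


def source_window_px(patch_px, scale):
    """Side length, in pixels of the chosen level, that covers one output patch."""
    return round(patch_px * scale)


# Hypothetical pyramid (finest level first) and a target that matches no level exactly.
level_mpps = [0.25, 0.5, 1.0, 2.0]
index, scale = pick_level(level_mpps, target_mpp=0.8)
window = source_window_px(224, scale)
print(index, scale, window)
```

Reading from the nearest finer level rather than the nearest level overall means the patch is only ever downsampled, never upsampled, which is one plausible reason for the design described above.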

Install

pip install "raw2features[all]"     # full stack: OME-Zarr reads + segmentation + torch + models
pip install "raw2features[zarr]"    # lean: remote/Zarr reads only, no torch
pip install raw2features            # core only (bring your own reader/model extras)

Extras are composable - e.g. raw2features[zarr,torch,models]. The export bridges (spatialdata, h5) stay opt-in; see MODEL_LICENSES.md and INTEROP.md.

Gated git-package encoders. A few encoders (CONCH, KRONOS, MUSK) ship as gated, non-PyPI git packages, so they install in two steps. The extra pulls the PyPI stack, then one command installs the package itself:

pip install "raw2features[conch]"  && pip install git+https://github.com/Mahmoodlab/CONCH.git
pip install "raw2features[kronos]" && pip install git+https://github.com/mahmoodlab/KRONOS.git
pip install "raw2features[musk]"   && pip install git+https://github.com/lilab-stanford/MUSK

The same pattern covers the other gated encoders - mostly slide encoders (e.g. madeleine, gigapath_slide, seal), a few with extra model-specific steps (a pinned fork, flash-attn, or Drive-hosted weights). Each model's exact install is in its MODELS.md row and the matching extra's comment in pyproject.toml.

Development (from a clone, with uv):

uv sync                      # core
uv sync --extra zarr --extra image --extra torch --extra models   # full stack

Quickstart

With the stack installed (above):

raw2features sample sample.ome.zarr                          # synthetic slide
raw2features embed  sample.ome.zarr out/ -m resnet50 --device auto

--device auto picks CUDA → Apple MPS → CPU, so this runs anywhere. Tested on A100, L40S, GB10, and CPU.
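The fallback order described above can be sketched as follows. This is an illustration under assumptions, not the code behind `--device auto`: the helper names `pick_device` and `auto_device` are hypothetical, and the sketch falls back to CPU when PyTorch is not installed.

```python
def pick_device(cuda_available, mps_available):
    """Apply the CUDA -> Apple MPS -> CPU preference order."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def auto_device():
    """Probe PyTorch for available backends; use CPU if PyTorch is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # Older PyTorch builds may not expose torch.backends.mps at all.
    mps = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps is not None and mps.is_available())
    return pick_device(torch.cuda.is_available(), mps_ok)


print(auto_device())
```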

Notebooks

Runnable tutorials live in notebooks/. Start with the visual walkthrough: a real SurGen H&E slide is resolved from the BioImage Archive and read cloud-direct (nothing downloaded), going from thumbnail → tissue segmentation → patch tiles → a ResNet-50 feature map of the slide, all on CPU and with no model access token. Its figures are pre-rendered on GitHub.

Usage

Full guide: docs/usage.md - every command, what actually happens under the hood (exact MPP, decode-once fan-out, output schema), the rerun-safe / skip-if-complete model, thumbnails, and example SLURM cohort runs.

raw2features info slide.ome.zarr
raw2features embed slide.ome.zarr out/ \
    --model uni --model resnet50 \
    --mpp 1.0 --patch-size 224 --hf-token "$HF_TOKEN" \
    --emit-thumbnail                                  # optional QC thumbnail + overlay
raw2features list embedders

# Thumbnails can also be made standalone, before/after the embed run. By default
# they render at the segmentation MPP, so --overlay aligns the tissue mask + the
# kept-patch grid with no resampling (--thumbnail-mpp / --max-px to override).
raw2features thumbnail slide.ome.zarr out/ --overlay

# Optional post-hoc exports from the native out/slide.embeddings.zarr store:
# SpatialData for squidpy/napari, or HDF5 for TRIDENT/CLAM/TITAN/STAMP.
# These never re-compute embeddings; install [spatialdata] or [h5] as needed.
raw2features export-spatialdata out/slide.embeddings.zarr   # -> slide.spatialdata.zarr
raw2features export-h5 out/slide.embeddings.zarr --layout trident   # or --layout clam / stamp

Output

<slide_id>.embeddings.zarr/
├── .zattrs                  # source, provenance + a grids index
└── grids/<mpp>_<px>/        # one per geometry (usually just one, e.g. mpp0.5_px224)
    ├── .zattrs              # this grid's full header (patching, models, provenance)
    ├── coords/              # (N,2) int32 level-0 (x,y) - 1:1 with every features/<model>
    ├── grid_index/          # (N,2) int32 (row,col)
    ├── mask/                # (rows,cols) uint8 fraction of each cell that is tissue, 0-255 (unless --no-seg)
    └── features/<model>/    # (N, dim) float16

<slide_id>.thumbnail.png            # optional (--emit-thumbnail / thumbnail cmd)
<slide_id>.thumbnail.overlay.png    # optional QC overlay: tissue tint + kept-patch grid

<slide_id>.spatialdata.zarr/        # optional - `export-spatialdata`, see docs/INTEROP.md
<slide_id>.h5                       # optional - `export-h5` (TRIDENT/STAMP), see docs/INTEROP.md
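To show how the arrays in the layout above relate to each other, here is a minimal sketch that turns per-patch values into a 2D map using `grid_index`, and converts the uint8 `mask` to tissue fractions. It assumes the arrays have already been loaded from the store (for example with the zarr library); plain Python lists stand in for them here, and the helper names are hypothetical rather than part of raw2features.

```python
import math


def mask_fraction(mask_u8):
    """Convert a uint8 tissue mask (0-255) into fractions in [0, 1]."""
    return [[value / 255 for value in row] for row in mask_u8]


def scatter_to_grid(grid_index, values, rows, cols, fill=None):
    """Place one value per patch at its (row, col) cell; other cells keep `fill`."""
    if len(grid_index) != len(values):
        raise ValueError("grid_index and values must be 1:1")
    grid = [[fill] * cols for _ in range(rows)]
    for (row, col), value in zip(grid_index, values):
        grid[row][col] = value
    return grid


# Stand-ins for grid_index/ (N,2) and features/<model>/ (N, dim); values are made up.
grid_index = [(0, 0), (0, 2), (1, 1)]
features = [[3.0, 4.0], [0.0, 0.0], [6.0, 8.0]]

# One scalar per patch (here the L2 norm), laid out on the (rows, cols) grid.
norms = [math.sqrt(sum(x * x for x in vec)) for vec in features]
norm_map = scatter_to_grid(grid_index, norms, rows=2, cols=3)
tissue = mask_fraction([[255, 0, 51], [0, 255, 0]])
print(norm_map)
print(tissue)
```

Because `coords`, `grid_index` and every `features/<model>` share the same row order, the same index `i` addresses a patch's position and its embedding.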

Interop (optional export for supported packages): export to scverse SpatialData (squidpy / napari-spatialdata) or to pathology-MIL HDF5 (TRIDENT/CLAM/TITAN, KatherLab STAMP). These are one-way export bridges so you can feed existing toolchains; for full FAIR provenance use the default .embeddings.zarr. See INTEROP.md.

Remote / cloud reads (no download)

Any command that takes a slide path also takes a remote OME-Zarr URL: the reader opens http(s)://, s3://, gs://, etc. via fsspec/zarr, so the whole pipeline (segment → tile → embed) runs directly against a cloud store without downloading the slide. This needs the [zarr] extra (which ships fsspec); s3:// and gs:// URLs additionally need s3fs and gcsfs respectively.
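As a small pre-flight check for the backend requirement above, the sketch below maps a URL scheme to the extra fsspec backend package it needs and reports whether that package is importable. The helper name `missing_backend` and the example URLs are hypothetical; only the s3fs and gcsfs mapping comes from the text above.

```python
from importlib.util import find_spec
from urllib.parse import urlparse

# Scheme -> extra backend package, per the note above.
EXTRA_BACKENDS = {"s3": "s3fs", "gs": "gcsfs"}


def missing_backend(url):
    """Return the backend package that is needed but not importable, else None."""
    scheme = urlparse(url).scheme
    package = EXTRA_BACKENDS.get(scheme)
    if package is None:
        # Local paths and http(s):// have no extra backend listed in this sketch.
        return None
    return None if find_spec(package) is not None else package


for url in ("slide.ome.zarr", "https://example.org/slide.ome.zarr", "s3://bucket/slide.ome.zarr"):
    print(url, "->", missing_backend(url))
```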

# Extract straight from the EBI BioImage Archive - nothing lands on local disk.
raw2features embed \
  https://uk1s3.embassy.ebi.ac.uk/bia-integrator-data/S-BIAD1285/.../image.ome.zarr/0 \
  out/ -m uni --mpp 0.5 --read-block 16        # fewer, larger reads cut round-trips

Validated end-to-end against the EBI BioImage Archive. Remote reads are latency-bound, but in our read benchmark (raw2features benchmark) a cold embed-once run (the normal case) was only about 1.6x slower than local: the GPU, segmentation, and write work dominate and don't depend on where the slide lives, so the raw-read gap (around 16x on warm re-reads) mostly disappears. On a slow store, --read-block N groups patches into N×N reads to cut round-trips (bit-identical output; try 16 remote, 8 local), and 8 read-workers was the sweet spot either way. For large cohorts, staging slides to local storage is still faster. See docs/usage.md for the remote-read and --read-block guidance.
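The idea behind --read-block N, grouping patches into N×N reads, can be sketched as below. This is an illustration of the grouping only, under the assumption that patches are addressed by (row, col) grid cells; it is not the package's reader, and `group_into_blocks` is a hypothetical name.

```python
from collections import defaultdict


def group_into_blocks(grid_index, n):
    """Group patch indices by the n x n block of grid cells they fall in.

    Each block can then be fetched with one larger read instead of up to
    n * n separate patch reads.
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    blocks = defaultdict(list)
    for i, (row, col) in enumerate(grid_index):
        blocks[(row // n, col // n)].append(i)
    return dict(blocks)


# Five kept patches on a sparse grid, grouped into 2 x 2 blocks.
grid_index = [(0, 0), (0, 1), (1, 1), (0, 2), (5, 5)]
blocks = group_into_blocks(grid_index, n=2)
print(len(grid_index), "patch reads ->", len(blocks), "block reads")
```

Fewer, larger reads trade some over-read (cells in a block that hold no kept patch) for fewer round-trips, which matters most when each request to the store is slow.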

Licence

MIT - see LICENSE. If you use raw2features, please cite it (see CITATION.cff).

raw2features does not ship model weights and grants no rights to them. When using a pretrained encoder please refer to that model's own licence (several are non-commercial, e.g. CC-BY-NC-ND). See MODEL_LICENSES.md.

Project details

Release history

0.1.0 (this version) - Jun 30, 2026

Download files


Source Distribution

raw2features-0.1.0.tar.gz (163.5 kB)

Uploaded Jun 30, 2026 Source

Built Distribution


raw2features-0.1.0-py3-none-any.whl (207.3 kB)

Uploaded Jun 30, 2026 Python 3

File details

Details for the file raw2features-0.1.0.tar.gz.

File metadata

Download URL: raw2features-0.1.0.tar.gz
Upload date: Jun 30, 2026
Size: 163.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for raw2features-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`ccbf63b32cfd1d184cef925152a02746f3b8872362609fcc6bf0031bb89e2787`
MD5	`36386465ab94b178ccf4a369885b88c5`
BLAKE2b-256	`4200fa36063b6f801e9fd5d110f3cfca796907555432266689a4b6df0d29142e`


Provenance

The following attestation bundles were made for raw2features-0.1.0.tar.gz:

Publisher: release.yml on CraigMyles/raw2features

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: raw2features-0.1.0.tar.gz
- Subject digest: ccbf63b32cfd1d184cef925152a02746f3b8872362609fcc6bf0031bb89e2787
- Sigstore transparency entry: 2027754765
- Sigstore integration time: Jun 30, 2026
Source repository:
- Permalink: CraigMyles/raw2features@6c8e3908f073ab0476afc1d23203353cf807c4b1
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/CraigMyles
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@6c8e3908f073ab0476afc1d23203353cf807c4b1
- Trigger Event: release

File details

Details for the file raw2features-0.1.0-py3-none-any.whl.

File metadata

Download URL: raw2features-0.1.0-py3-none-any.whl
Upload date: Jun 30, 2026
Size: 207.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for raw2features-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d199cfce6e724f874ce9cddb700894901711a11af8ccd00698c3a2697876637a`
MD5	`d3a74d47bd17f2d0be22d35f501d11cd`
BLAKE2b-256	`c5f86511cd28d84abf570ee9c8d9115afdb87f30e902be324546e44d455ac430`


Provenance

The following attestation bundles were made for raw2features-0.1.0-py3-none-any.whl:

Publisher: release.yml on CraigMyles/raw2features

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: raw2features-0.1.0-py3-none-any.whl
- Subject digest: d199cfce6e724f874ce9cddb700894901711a11af8ccd00698c3a2697876637a
- Sigstore transparency entry: 2027755070
- Sigstore integration time: Jun 30, 2026
Source repository:
- Permalink: CraigMyles/raw2features@6c8e3908f073ab0476afc1d23203353cf807c4b1
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/CraigMyles
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@6c8e3908f073ab0476afc1d23203353cf807c4b1
- Trigger Event: release
