
wsistream

Modular online patch streaming from whole-slide images for computational pathology.


Stream patches directly from WSIs during training — no disk pre-extraction, no storage overhead. Every component is pluggable: backends, tissue detectors, samplers, filters, transforms, dataset adapters.

Install

pip install "wsistream[openslide]"   # with OpenSlide
pip install "wsistream[tiffslide]"   # with TiffSlide (pure Python)
pip install "wsistream[torch]"       # add PyTorch integration (WsiStreamDataset, DDP)
pip install "wsistream[all]"         # everything (OpenSlide + TiffSlide + PyTorch + albumentations + matplotlib)

For development:

git clone https://github.com/RamonKaspar/wsistream.git
cd wsistream
pip install -e ".[dev]"

Documentation

Full documentation: ramonkaspar.github.io/wsistream

To build locally:

pip install mkdocs-material
mkdocs serve          # local preview at http://127.0.0.1:8000

How it works

Each slide goes through a fixed pipeline:

  1. Open slide: via an explicit backend (OpenSlideBackend or TiffSlideBackend)
  2. Detect tissue: run a TissueDetector on a low-res thumbnail to get a binary mask
  3. Sample coordinates: a PatchSampler proposes (x, y) locations within tissue regions
  4. Extract patch: read the pixel data from the slide at each coordinate
  5. Filter patch: a PatchFilter accepts or rejects the tile based on its pixels
  6. Transform patch: apply augmentations (HEDColorAugmentation, RandomFlipRotate, etc.)
  7. Yield result: PatchResult with image, coordinates, tissue fraction, and metadata
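The control flow of these stages can be sketched as a small generator. This is an illustrative skeleton with placeholder callables, not wsistream's actual implementation; the real component classes appear in the Quick start below.

```python
from typing import Callable, Iterator, Tuple

# Illustrative skeleton of the per-slide pipeline. All parameter names
# here are placeholders standing in for wsistream's component objects.
def stream_patches(
    slide,                                   # an opened slide handle
    detect_tissue: Callable,                 # thumbnail -> binary mask
    sample_coords: Callable,                 # mask -> iterable of (x, y)
    read_patch: Callable,                    # (slide, x, y) -> image
    keep_patch: Callable,                    # image -> bool
    transform: Callable,                     # image -> image
) -> Iterator[Tuple[int, int, object]]:
    mask = detect_tissue(slide)              # step 2: low-res tissue mask
    for x, y in sample_coords(mask):         # step 3: propose coordinates
        patch = read_patch(slide, x, y)      # step 4: extract pixels
        if not keep_patch(patch):            # step 5: filter
            continue
        yield x, y, transform(patch)         # steps 6-7: transform + yield
```

Because each stage is just a callable, swapping a detector, sampler, or filter leaves the rest of the loop untouched, which is the modularity the real pipeline exposes.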

Quick start

from wsistream.pipeline import PatchPipeline
from wsistream.backends import OpenSlideBackend
from wsistream.tissue import CLAMTissueDetector
from wsistream.sampling import RandomSampler
from wsistream.filters import HSVPatchFilter
from wsistream.transforms import ComposeTransforms, HEDColorAugmentation, RandomFlipRotate, ResizeTransform
from wsistream.datasets import TCGAAdapter

pipeline = PatchPipeline(
    slide_paths="/data/tcga",  # directory or list of files
    backend=OpenSlideBackend(),
    tissue_detector=CLAMTissueDetector(),
    sampler=RandomSampler(patch_size=256, num_patches=-1, target_mpp=0.5),
    patch_filter=HSVPatchFilter(min_pixel_fraction=0.6),
    transforms=ComposeTransforms(transforms=[
        HEDColorAugmentation(sigma=0.05),
        RandomFlipRotate(),
        ResizeTransform(target_size=224),
    ]),
    dataset_adapter=TCGAAdapter(),
    pool_size=8,
    patches_per_slide=100,
    cycle=True,
)

for result in pipeline:
    print(result.image.shape)                # (224, 224, 3) uint8
    print(result.coordinate.mpp)             # ~0.5
    print(result.tissue_fraction)            # 0.87
    print(result.slide_metadata.patient_id)  # TCGA-3L-AA1B

Pool-based slide interleaving

The pipeline keeps pool_size slides open simultaneously and takes patches_per_slide patches from each before closing it and opening the next. With cycle=True, exhausted slides are re-queued for infinite streaming. Set patches_per_visit (default 1) to read several patches from the same slide before moving on to the next slide in the round-robin, which can significantly improve I/O throughput on network filesystems.
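The pooling strategy can be illustrated with a generic round-robin scheduler over per-slide iterators. This is a simplified sketch under the assumptions above, not wsistream's implementation; it ignores slide open/close costs and uses a hypothetical open_slide callable.

```python
from collections import deque
from itertools import islice
from typing import Callable, Iterator

def interleave_pool(
    slide_ids: list,
    open_slide: Callable[[object], Iterator],  # slide id -> patch iterator
    pool_size: int = 8,
    patches_per_slide: int = 100,
    patches_per_visit: int = 1,
    cycle: bool = False,
) -> Iterator:
    """Round-robin over a pool of open slides (illustrative sketch)."""
    queue = deque(slide_ids)
    pool = deque()  # entries: [patch iterator, remaining budget, slide id]
    while queue or pool:
        # Refill the pool up to pool_size open slides.
        while queue and len(pool) < pool_size:
            sid = queue.popleft()
            pool.append([open_slide(sid), patches_per_slide, sid])
        it, remaining, sid = pool.popleft()
        take = min(patches_per_visit, remaining)
        batch = list(islice(it, take))          # patches_per_visit reads
        yield from batch
        remaining -= len(batch)
        if remaining > 0 and len(batch) == take:
            pool.append([it, remaining, sid])   # slide stays in the pool
        elif cycle:
            queue.append(sid)                   # re-queue for infinite streaming
```

Raising patches_per_visit amortizes per-slide seek overhead across several reads, which is why it helps on high-latency network filesystems.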

PyTorch integration

wsistream.torch provides WsiStreamDataset (an IterableDataset), MonitoredLoader for throughput tracking, and partition_slides_by_rank for DDP. Worker-level slide partitioning is handled automatically.

from torch.utils.data import DataLoader
from wsistream.backends import OpenSlideBackend
from wsistream.sampling import RandomSampler
from wsistream.tissue import OtsuTissueDetector
from wsistream.torch import WsiStreamDataset, partition_slides_by_rank

my_slides = partition_slides_by_rank("/data/tcga", rank=rank, world_size=world_size)

dataset = WsiStreamDataset(
    slide_paths=my_slides,
    backend=OpenSlideBackend(),
    tissue_detector=OtsuTissueDetector(),
    sampler=RandomSampler(patch_size=256, num_patches=-1, target_mpp=0.5),
)

loader = DataLoader(dataset, batch_size=64, num_workers=4, pin_memory=True)
loader_iter = iter(loader)

for step in range(total_steps):
    batch = next(loader_iter)
    images = batch["image"].to(device, non_blocking=True)  # (B, 3, H, W) float32
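For context, rank-based slide partitioning of the kind partition_slides_by_rank performs can be approximated with simple striding. The function below is an illustrative stand-in, not the library's actual implementation.

```python
def partition_by_rank(slides: list, rank: int, world_size: int) -> list:
    """Assign every world_size-th slide to this rank (illustrative sketch).

    Striding (rather than contiguous chunks) keeps per-rank slide counts
    balanced to within one slide even when len(slides) is not divisible
    by world_size, and guarantees disjoint, exhaustive coverage.
    """
    return slides[rank::world_size]
```

Disjoint per-rank slide lists matter for DDP because each process streams patches independently; overlapping lists would duplicate data across ranks within an epoch.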
