wsistream

Modular online patch streaming from whole-slide images for computational pathology.

Stream patches directly from WSIs during training, with no disk pre-extraction and no storage overhead. Every component is pluggable: backends, tissue detectors, samplers, filters, transforms, and dataset adapters.

Install

pip install "wsistream[openslide]"   # with OpenSlide
pip install "wsistream[tiffslide]"   # with TiffSlide (pure Python)
pip install "wsistream[torch]"       # add PyTorch integration (WsiStreamDataset, DDP)
pip install "wsistream[all]"         # everything (OpenSlide + TiffSlide + PyTorch + albumentations + matplotlib)
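To confirm which optional dependencies ended up importable in the current environment, a small standard-library check along these lines may help. The helper `available_extras` is hypothetical and not part of wsistream; `openslide`, `tiffslide`, and `torch` are the import names of the respective packages.

```python
from importlib.util import find_spec

# Import names of the optional dependencies.
# This helper is illustrative only and is not part of wsistream.
OPTIONAL_MODULES = ("openslide", "tiffslide", "torch")


def available_extras(modules=OPTIONAL_MODULES):
    """Return {module_name: bool} telling whether each module can be located.

    Note: this only checks that the Python package is present; it does not
    verify that native libraries (e.g. the OpenSlide C library) load correctly.
    """
    return {name: find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, found in available_extras().items():
        print(f"{name}: {'installed' if found else 'missing'}")
```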

For development:

git clone https://github.com/RamonKaspar/wsistream.git
cd wsistream
pip install -e ".[dev]"

Documentation

Full documentation: ramonkaspar.github.io/wsistream

To build locally:

pip install mkdocs-material
mkdocs serve          # local preview at http://127.0.0.1:8000

How it works

Each slide goes through a fixed pipeline:

  1. Open slide: via an explicit backend (OpenSlideBackend or TiffSlideBackend)
  2. Detect tissue: run a TissueDetector on a low-res thumbnail to get a binary mask
  3. Sample coordinates: a PatchSampler proposes (x, y) locations within tissue regions
  4. Extract patch: read the pixel data from the slide at each coordinate
  5. Filter patch: a PatchFilter accepts or rejects the tile based on its pixels
  6. Transform patch: apply augmentations (HEDColorAugmentation, RandomFlipRotate, etc.)
  7. Yield result: PatchResult with image, coordinates, tissue fraction, and metadata
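The seven steps above can be sketched on a toy "slide" using only the standard library. This is not wsistream's implementation: the slide is a nested list of grayscale values, the tissue detector is a fixed threshold, and all names (`ToyPatch`, `detect_tissue`, `sample_coordinates`, `stream_patches`) are hypothetical. It only illustrates the order of operations.

```python
import random
from dataclasses import dataclass

BACKGROUND_THRESHOLD = 200  # assumed: values below this count as tissue


@dataclass
class ToyPatch:
    x: int
    y: int
    pixels: list            # rows of grayscale values
    tissue_fraction: float


def detect_tissue(slide, step):
    """Step 2: build a coarse binary mask from every `step`-th pixel (a crude thumbnail)."""
    return [[slide[y][x] < BACKGROUND_THRESHOLD for x in range(0, len(slide[0]), step)]
            for y in range(0, len(slide), step)]


def sample_coordinates(mask, step, n, rng):
    """Step 3: draw n random tissue cells and map them back to full-resolution (x, y)."""
    cells = [(mx * step, my * step)
             for my, row in enumerate(mask)
             for mx, is_tissue in enumerate(row) if is_tissue]
    return [rng.choice(cells) for _ in range(n)] if cells else []


def stream_patches(slide, patch_size=4, step=4, n=20, min_tissue=0.6, seed=0):
    """Steps 1-7 for a single in-memory slide (step 1, opening, is implicit here)."""
    rng = random.Random(seed)
    mask = detect_tissue(slide, step)
    for x, y in sample_coordinates(mask, step, n, rng):
        # Step 4: extract the patch pixels.
        rows = [row[x:x + patch_size] for row in slide[y:y + patch_size]]
        flat = [value for row in rows for value in row]
        fraction = sum(value < BACKGROUND_THRESHOLD for value in flat) / len(flat)
        # Step 5: reject patches with too little tissue.
        if fraction < min_tissue:
            continue
        # Step 6: a simple augmentation, random horizontal flip.
        if rng.random() < 0.5:
            rows = [row[::-1] for row in rows]
        # Step 7: yield the result with its coordinates and tissue fraction.
        yield ToyPatch(x, y, rows, fraction)


if __name__ == "__main__":
    # 16x16 slide: columns 0-5 are tissue (80), the rest is background (255).
    toy_slide = [[80 if x < 6 else 255 for x in range(16)] for _ in range(16)]
    for patch in stream_patches(toy_slide):
        print(patch.x, patch.y, patch.tissue_fraction)
```

In this toy slide, patches sampled at x=4 straddle the tissue boundary and are dropped by the filter, which is why the mask-based sampler and the pixel-based filter are separate stages.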

Quick start

from wsistream.pipeline import PatchPipeline
from wsistream.backends import OpenSlideBackend
from wsistream.tissue import CLAMTissueDetector
from wsistream.sampling import RandomSampler
from wsistream.filters import HSVPatchFilter
from wsistream.transforms import ComposeTransforms, HEDColorAugmentation, RandomFlipRotate, ResizeTransform
from wsistream.datasets import TCGAAdapter

pipeline = PatchPipeline(
    slide_paths="/data/tcga",  # directory or list of files
    backend=OpenSlideBackend(),
    tissue_detector=CLAMTissueDetector(),
    sampler=RandomSampler(patch_size=256, num_patches=-1, target_mpp=0.5),
    patch_filter=HSVPatchFilter(min_pixel_fraction=0.6),
    transforms=ComposeTransforms(transforms=[
        HEDColorAugmentation(sigma=0.05),
        RandomFlipRotate(),
        ResizeTransform(target_size=224),
    ]),
    dataset_adapter=TCGAAdapter(),
    pool_size=8,
    patches_per_slide=100,
    cycle=True,
)

for result in pipeline:
    print(result.image.shape)                # (224, 224, 3) uint8
    print(result.coordinate.mpp)             # ~0.5
    print(result.tissue_fraction)            # 0.87
    print(result.slide_metadata.patient_id)  # TCGA-3L-AA1B
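The `target_mpp=0.5` argument in the example requests patches at roughly 0.5 microns per pixel. How wsistream resolves this internally is not described here; the sketch below only shows the general arithmetic, assuming a slide whose pyramid levels are described by downsample factors relative to the base level. The function `plan_read` and the example values are hypothetical.

```python
def plan_read(base_mpp, level_downsamples, patch_size, target_mpp):
    """Choose a pyramid level and read size for a patch at target_mpp.

    Picks the coarsest level that is still at least as fine as target_mpp,
    then computes how many pixels to read at that level so the region covers
    the same physical area as patch_size pixels at target_mpp.
    Returns (level, level_mpp, read_size).
    """
    # Physical width of the requested patch, in microns.
    field_of_view_um = patch_size * target_mpp
    # Levels whose resolution is at least as fine as the target.
    candidates = [i for i, ds in enumerate(level_downsamples)
                  if base_mpp * ds <= target_mpp]
    # Fall back to the base level if the slide is coarser than the target.
    level = max(candidates, key=lambda i: level_downsamples[i]) if candidates else 0
    level_mpp = base_mpp * level_downsamples[level]
    read_size = round(field_of_view_um / level_mpp)
    return level, level_mpp, read_size


if __name__ == "__main__":
    # A 0.25 mpp slide with levels at 1x, 4x, 16x: read 512 px at level 0,
    # then downscale to 256 px to reach 0.5 mpp.
    print(plan_read(0.25, [1.0, 4.0, 16.0], patch_size=256, target_mpp=0.5))
```

When a level matches the target exactly, no resampling is needed; otherwise the region read from the finer level is resized to `patch_size`.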

Pool-based slide interleaving

The pipeline keeps pool_size slides open simultaneously and takes patches_per_slide patches from each before closing it and opening the next. With cycle=True, slides are re-queued for infinite streaming. Set patches_per_visit (default 1) to read multiple patches from the same slide before round-robining, which can significantly improve I/O throughput on network filesystems.
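The scheduling described above can be illustrated with a small generator. This is a sketch of the idea only, not wsistream's code: the function `interleave` is hypothetical, slides are plain strings, and it assumes `patches_per_slide` is at least 1.

```python
from collections import deque
from itertools import islice


def interleave(slides, pool_size, patches_per_slide, patches_per_visit=1, cycle=False):
    """Yield (slide, patch_index) pairs in pool-based round-robin order."""
    queue = deque(slides)   # slides waiting to be opened
    pool = deque()          # (slide, patches_taken) for currently open slides

    def refill():
        # Open waiting slides until the pool is full or the queue is empty.
        while len(pool) < pool_size and queue:
            pool.append((queue.popleft(), 0))

    refill()
    while pool:
        slide, taken = pool.popleft()
        # Read up to patches_per_visit patches without exceeding the per-slide budget.
        for _ in range(min(patches_per_visit, patches_per_slide - taken)):
            yield slide, taken
            taken += 1
        if taken < patches_per_slide:
            pool.append((slide, taken))   # budget left: back of the pool
        else:
            if cycle:
                queue.append(slide)       # closed, but re-queued for a later pass
            refill()


if __name__ == "__main__":
    print(list(interleave(["A", "B", "C"], pool_size=2, patches_per_slide=2)))
    # With cycle=True the stream never ends, so take a finite slice.
    print(list(islice(interleave(["A"], 1, 2, cycle=True), 5)))
```

With `patches_per_visit=1` consecutive patches come from different slides; raising it groups reads from the same slide, trading some mixing for fewer slide switches.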

PyTorch integration

wsistream.torch provides WsiStreamDataset (an IterableDataset), MonitoredLoader for throughput tracking, and partition_slides_by_rank for DDP. Worker-level slide partitioning is handled automatically.

from torch.utils.data import DataLoader
from wsistream.backends import OpenSlideBackend
from wsistream.sampling import RandomSampler
from wsistream.tissue import OtsuTissueDetector
from wsistream.torch import WsiStreamDataset, partition_slides_by_rank

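# rank and world_size come from your distributed setup,
# e.g. torch.distributed.get_rank() and torch.distributed.get_world_size()
my_slides = partition_slides_by_rank("/data/tcga", rank=rank, world_size=world_size)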

dataset = WsiStreamDataset(
    slide_paths=my_slides,
    backend=OpenSlideBackend(),
    tissue_detector=OtsuTissueDetector(),
    sampler=RandomSampler(patch_size=256, num_patches=-1, target_mpp=0.5),
)

loader = DataLoader(dataset, batch_size=64, num_workers=4, pin_memory=True)
loader_iter = iter(loader)

# total_steps and device are defined by your training script
for step in range(total_steps):
    batch = next(loader_iter)
    images = batch["image"].to(device, non_blocking=True)  # (B, 3, H, W) float32
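How `partition_slides_by_rank` divides the slides is not specified here. One common scheme for this kind of DDP sharding is a strided split over a sorted list, so that every rank sees a disjoint subset and the assignment is deterministic. The function below is a hypothetical stand-in that illustrates that scheme; it is not the library's implementation.

```python
def partition_by_rank(slide_paths, rank, world_size):
    """Give each rank every world_size-th slide from the sorted path list.

    Hypothetical illustration of strided sharding; sorting first makes the
    split identical on every process regardless of directory listing order.
    """
    if not 0 <= rank < world_size:
        raise ValueError(f"rank must be in [0, {world_size}), got {rank}")
    return sorted(slide_paths)[rank::world_size]


if __name__ == "__main__":
    paths = [f"slide_{i}.svs" for i in range(5)]
    for r in range(2):
        print(r, partition_by_rank(paths, rank=r, world_size=2))
```

A strided split keeps shard sizes within one slide of each other, which matters when each rank is expected to run the same number of training steps.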
