
wsistream

Modular online patch streaming from whole-slide images for computational pathology.

Stream patches directly from WSIs during training — no disk pre-extraction, no storage overhead. Every component is pluggable: backends, tissue detectors, samplers, filters, transforms, dataset adapters.

Install

pip install "wsistream[openslide]"   # with OpenSlide
pip install "wsistream[tiffslide]"   # with TiffSlide (pure Python)
pip install "wsistream[torch]"       # add PyTorch integration (WsiStreamDataset, DDP)
pip install "wsistream[all]"         # everything (OpenSlide + TiffSlide + PyTorch + albumentations + matplotlib)

For development:

git clone https://github.com/RamonKaspar/wsistream.git
cd wsistream
pip install -e ".[dev]"

Documentation

Full documentation: ramonkaspar.github.io/wsistream

To build locally:

pip install mkdocs-material
mkdocs serve          # local preview at http://127.0.0.1:8000

How it works

Each slide goes through a fixed pipeline:

  1. Open slide: via an explicit backend (OpenSlideBackend or TiffSlideBackend)
  2. Detect tissue: run a TissueDetector on a low-res thumbnail to get a binary mask
  3. Sample coordinates: a PatchSampler proposes (x, y) locations within tissue regions
  4. Extract patch: read the pixel data from the slide at each coordinate
  5. Filter patch: a PatchFilter accepts or rejects the tile based on its pixels
  6. Transform patch: apply augmentations (HEDColorAugmentation, RandomFlipRotate, etc.)
  7. Yield result: PatchResult with image, coordinates, tissue fraction, and metadata
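
The seven stages above can be sketched with stub components in pure Python. None of the helpers below are wsistream APIs (the real library wires a backend, detector, sampler, filter, and transforms together); this just shows the order of operations on a toy in-memory "thumbnail" standing in for an opened slide:

```python
import random

def detect_tissue(thumbnail):
    # Stub detector: mark every pixel as tissue (a real one thresholds a low-res thumbnail).
    return [[1] * len(row) for row in thumbnail]

def sample_coordinates(mask, num_patches, rng):
    # Stub sampler: propose (x, y) locations inside the tissue mask.
    coords = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    return rng.sample(coords, min(num_patches, len(coords)))

def stream_patches(thumbnail, num_patches=4, seed=0):
    rng = random.Random(seed)
    mask = detect_tissue(thumbnail)                           # 2. detect tissue
    for x, y in sample_coordinates(mask, num_patches, rng):   # 3. sample coordinates
        pixel = thumbnail[y][x]                               # 4. extract patch (one pixel here)
        if pixel < 50:                                        # 5. filter: reject near-black tiles
            continue
        pixel = 255 - pixel                                   # 6. transform (toy inversion)
        yield {"image": pixel, "xy": (x, y)}                  # 7. yield result

results = list(stream_patches([[255, 200], [180, 30]]))
```

Because each stage is a plain function here, any one of them can be swapped out independently, which mirrors the pluggable design described above.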

Quick start

from wsistream.pipeline import PatchPipeline
from wsistream.backends import OpenSlideBackend
from wsistream.tissue import CLAMTissueDetector
from wsistream.sampling import RandomSampler
from wsistream.filters import HSVPatchFilter
from wsistream.transforms import ComposeTransforms, HEDColorAugmentation, RandomFlipRotate, ResizeTransform
from wsistream.datasets import TCGAAdapter

pipeline = PatchPipeline(
    slide_paths="/data/tcga",  # directory or list of files
    backend=OpenSlideBackend(),
    tissue_detector=CLAMTissueDetector(),
    sampler=RandomSampler(patch_size=256, num_patches=-1, target_mpp=0.5),
    patch_filter=HSVPatchFilter(min_pixel_fraction=0.6),
    transforms=ComposeTransforms(transforms=[
        HEDColorAugmentation(sigma=0.05),
        RandomFlipRotate(),
        ResizeTransform(target_size=224),
    ]),
    dataset_adapter=TCGAAdapter(),
    pool_size=8,
    patches_per_slide=100,
    cycle=True,
)

for result in pipeline:
    print(result.image.shape)                # (224, 224, 3) uint8
    print(result.coordinate.mpp)             # ~0.5
    print(result.tissue_fraction)            # 0.87
    print(result.slide_metadata.patient_id)  # TCGA-3L-AA1B

Pool-based slide interleaving

The pipeline keeps pool_size slides open simultaneously and takes patches_per_slide patches from each before closing it and opening the next. With cycle=True, slides are re-queued for infinite streaming. Set patches_per_visit (default 1) to read multiple patches from the same slide before round-robining, which can significantly improve I/O throughput on network filesystems.
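
A minimal pure-Python sketch of that scheduling logic (illustrative only, not the wsistream internals): a bounded pool is filled from a queue, each visit takes up to `patches_per_visit` patches, and a slide is replaced once it has yielded `patches_per_slide` patches.

```python
from collections import deque

def interleave(slides, pool_size=2, patches_per_slide=4, patches_per_visit=1):
    """Yield (slide, patch_index) in the order a round-robin pool would emit them."""
    queue = deque(slides)
    pool = deque()
    while len(pool) < pool_size and queue:
        pool.append([queue.popleft(), 0])          # [slide, patches taken so far]
    while pool:
        entry = pool.popleft()
        slide, taken = entry
        for _ in range(patches_per_visit):         # burst reads from the same slide
            if taken >= patches_per_slide:
                break
            yield (slide, taken)
            taken += 1
        if taken < patches_per_slide:
            entry[1] = taken
            pool.append(entry)                     # round-robin back into the pool
        elif queue:
            pool.append([queue.popleft(), 0])      # slide exhausted: open the next one

order = list(interleave(["a", "b", "c"], pool_size=2, patches_per_slide=2))
# alternates between the two open slides, then drains the third
```

Raising `patches_per_visit` trades interleaving diversity for fewer slide switches per patch, which is why it helps on high-latency network filesystems.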

PyTorch integration

wsistream.torch provides WsiStreamDataset (an IterableDataset), MonitoredLoader for throughput tracking, and partition_slides_by_rank for DDP. Worker-level slide partitioning is handled automatically.

from torch.utils.data import DataLoader
from wsistream.backends import OpenSlideBackend
from wsistream.sampling import RandomSampler
from wsistream.tissue import OtsuTissueDetector
from wsistream.torch import WsiStreamDataset, partition_slides_by_rank

# rank and world_size come from your distributed setup (e.g. torch.distributed)
my_slides = partition_slides_by_rank("/data/tcga", rank=rank, world_size=world_size)

dataset = WsiStreamDataset(
    slide_paths=my_slides,
    backend=OpenSlideBackend(),
    tissue_detector=OtsuTissueDetector(),
    sampler=RandomSampler(patch_size=256, num_patches=-1, target_mpp=0.5),
)

loader = DataLoader(dataset, batch_size=64, num_workers=4, pin_memory=True)
loader_iter = iter(loader)

for step in range(total_steps):  # total_steps and device come from your training setup
    batch = next(loader_iter)
    images = batch["image"].to(device, non_blocking=True)  # (B, 3, H, W) float32
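
The idea behind `partition_slides_by_rank` is the usual disjoint modulo split (the function's exact behavior may differ; this is a hedged sketch of the common pattern, with `partition_by_rank` a hypothetical stand-in):

```python
def partition_by_rank(slides, rank, world_size):
    """Give each DDP rank a disjoint, near-equal subset of the slide list."""
    # Sort first so every rank computes the same global ordering.
    return [s for i, s in enumerate(sorted(slides)) if i % world_size == rank]

slides = [f"slide_{i}.svs" for i in range(10)]
shards = [partition_by_rank(slides, r, 4) for r in range(4)]
# every slide lands on exactly one rank; shard sizes differ by at most one
```

Partitioning whole slides (rather than patches) keeps each open slide handle on a single process, which avoids duplicate I/O across ranks.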
