Skip to main content

Dataset preparation, dataloaders, and ecosystem bridges for 2D anatomical segmentation.

Project description

2D Anatomy Segmentation Datasets Repository

Welcome to the 2D Anatomy Segmentation Datasets repository! This project provides a collection of scripts, tools, and datasets to streamline training and evaluation for anatomical segmentation tasks in medical imaging. Whether you're a researcher or practitioner in the medical imaging field, this repository is designed to collect 2D anatomical datasets and make it easy to download, process, and visualize data from various imaging modalities.

Prepared datasets plug directly into SMP, mmsegmentation, mmdetection, or HuggingFace through a single splits-JSON contract that carries license, citation, normalization stats, and split seed alongside the data.

Example image / mask overlays per dataset


60-second demo (CPU)

git clone https://github.com/ConstantinSeibold/2DAnatomyDatasets.git
cd 2DAnatomyDatasets
pip install -e '.[smp]'
sh src/prepare_data/prepare_drive/get_drive_full.sh
export DRIVE_ROOT_FOLDER=./datasets/drive
python configs/smp/drive_unet.py        # one forward + backward step on CPU

Install

pip install -e .                # core (dataset prep + raw PyTorch dataloaders)
pip install -e '.[smp]'         # + segmentation_models_pytorch
pip install -e '.[mmseg]'       # + mmsegmentation adapter / exporter
pip install -e '.[mmdet]'       # + mmdetection adapter / exporter
pip install -e '.[hf]'          # + HuggingFace datasets
pip install -e '.[all]'         # everything

Pick a path

You want… Start with
Fastest end-to-end training SMPfrom anatomy_datasets import DRIVE; DRIVE(split="train") works with any SMP model. See configs/smp/.
Already using mmseg mmseg runtime adapterAnatomyMulticlassDataset in your config; no on-disk export. See configs/mmseg/drive_unet.py.
Training Mask R-CNN / detection mmdet — small datasets use shipped COCO; PAXRay-scale uses sharded export. See configs/mmdet/.
Training SegFormer / Mask2Former HuggingFaceto_hf_dataset(splits_json, root) returns a DatasetDict. See configs/hf/.
Just the raw masks Use the splits JSON directly. Format spec below.

Datasets

Run sh src/prepare_data/prepare_<name>/get_<name>_full.sh to download + prepare. Each emits a splits JSON with full metadata (license, citation, BibTeX, normalization stats, seed). Grouped by modality, sorted ascending by image count within each group.

Modality Dataset #Images #Classes Task License Env var Notes
Chest X-ray Montgomery 138 3 multiclass (left / right lung) custom-academic MONTGOMERY_ROOT_FOLDER PA chest radiographs from the NLM Montgomery County, MD TB-screening program — 80 normal + 58 TB-positive cases — with bilateral lung-field masks. (Jaeger et al., QIMS 2014.)
JSRT 247 5 multilabel (heart, clavicles, lungs) custom-academic JSRT_ROOT_PATH Japanese Society of Radiological Technology lung-nodule reference set: 247 PA chest radiographs (originally for CAD ROC analysis of nodule detection) with subsequent manual heart, lung-field, and clavicle annotations from the segmentation02 release. (Shiraishi et al., AJR 2000.)
PAXRay 852 166 multilabel (fine-grained anatomy) custom-academic PAXRAY_ROOT_PATH Synthetic chest radiographs generated by projecting fine-grained anatomical labels from CT volumes; 166 classes covering individual bones, lung lobes, mediastinal organs and vessels. (Seibold et al., BMVC 2022.)
PAXRay++ 14,754 157 multilabel custom-academic PAXRAYPP_ROOT Scaled-up PAXRay: 14,754 CT-projected chest radiographs (frontal + lateral) with 157 anatomy classes and >2M instance masks, built by ensembling 3D segmentation models over 10,021 thoracic CTs (RSNA-PE, RibFrac, LIDC-IDRI, COVID-19 cohorts) and projecting the volumetric pseudo-labels to 2D. (Seibold et al., 2023.)
Fundus (color) STARE 20 2 multiclass (vessel) unknown STARE_ROOT_FOLDER Structured Analysis of the REtina — classic 20-image vessel-segmentation benchmark with two independent expert annotators. (Hoover et al., IEEE TMI 2000.)
CHASE_DB1 28 2 multiclass (vessel) custom-academic CHASEDB1_ROOT_FOLDER Retinal fundus images collected by the Child Heart And health Study in England; both eyes of 14 subjects with manual vessel annotations. (Fraz et al., IEEE TBME 2012.)
DRIVE 40 2 multiclass (vessel) custom-academic DRIVE_ROOT_FOLDER Digital Retinal Images for Vessel Extraction — canonical vessel-seg benchmark; 40 fundus images from a Dutch diabetic-retinopathy screening program. (Staal et al., IEEE TMI 2004.)
HRF 45 2 multiclass (vessel) CC-BY-4.0 HRF_ROOT_FOLDER High-Resolution Fundus images (3504×2336 or 3072×2048) across three cohorts — 15 healthy, 15 glaucoma, 15 diabetic retinopathy — with manual vessel annotations and FOV masks. (Budai et al., IJBI 2013.)
FIVES 800 2 multiclass (vessel) CC-BY-4.0 FIVES_ROOT_FOLDER Fundus Image Vessel Segmentation set: 800 high-resolution color fundus images balanced across Normal / AMD / Diabetic retinopathy / Glaucoma; binary vessel annotations from expert crowdsourcing. (Jin et al., Scientific Data 2022.)
IR retina RAVIR 23 3 multiclass (arteries, veins) custom-academic RAVIR_ROOT_FOLDER Retinal Arteries and Veins in Infrared Reflectance imaging — semantic segmentation dataset distinguishing arteries from veins, used for vessel-morphometry analyses. (Hatamizadeh et al., IEEE JBHI 2022.)
OCT DUKE_OCT 110 16 multiclass (retinal layers) custom-academic DUKE_ROOT_PATH Spectral-domain OCT B-scans from patients with diabetic macular edema; retinal-layer boundaries plus fluid regions annotated for layer-segmentation research. (Chiu et al., BOE 2015.)
Panoramic X-ray Teeth 598 1 detection (tooth) custom-academic TEETH_ROOT Panoramic dental radiographs with per-tooth polygon instance annotations; Kaggle release curated by Humans in the Loop.
Microscopy MedakaHeart 805 4 multiclass (bulbus, atrium, heart) GPL-3.0 MEDAKA_HEART_ROOT_FOLDER Brightfield microscopy of medaka (Japanese rice fish) hatchlings, ventral view; cardiac-chamber segmentation supporting developmental-biology and frugal-labeler methodology research. (Schutera et al., PLOS ONE 2022.)
RGB video JumpBroadcast 1,809 15 multiclass (body parts) custom-academic JUMP_BROADCAST_ROOT_FOLDER Cropped frames from TV broadcasts of triple-, high-, and long-jump athletes; per-frame body-part segmentation paired with an arbitrary-keypoint annotation set. (Ludwig et al., CVPRW 2023.)
Scintigraphy BS80k 82,544 many multilabel (bone regions) custom-academic BS80K_ROOT First large open-access bone-scan dataset: 82,544 images from 3,247 patients at West China Hospital, with anatomical-region segmentations. (Huang et al., Computers in Biology and Medicine 2022.)

Programmatic metadata + BibTeX access:

from anatomy_datasets import get_dataset_info
info = get_dataset_info("DRIVE")
print(info.license, info.source_url, info.citation, info.bibtex)

Visualization

After preparing a dataset, two ways to inspect image / mask overlays.

Per-dataset verification script — writes overlays to ./verify_visually/<dataset>/:

python src/prepare_data/prepare_drive/verify_drive_visually.py
python src/prepare_data/prepare_hrf/verify_hrf_visually.py
# ...one per dataset under src/prepare_data/prepare_<name>/verify_<name>_visually.py

Programmatic — use the helpers in src/visualization/visualize.py:

from anatomy_datasets import DRIVE
from src.visualization.visualize import visualize_multiclass
import numpy as np

ds = DRIVE(split="train")
image, mask = ds[0]
overlay = visualize_multiclass(np.array(image), np.array(mask).astype(int), ds.label_dict)

Interactivenotebooks/Dataloader_example.ipynb walks through every dataset with rendered overlays.


Splits JSON contract

Every prepared dataset writes one file at <root>/<name>_splits.json:

{
  "name": "DRIVE",
  "version": "2026-05-18",
  "seed": 42,                                    // null if upstream split is fixed
  "modality": "fundus",
  "license": "custom-academic",
  "license_url": "...",
  "source_url": "...",
  "paper_url": "...",
  "citation": "Staal et al., IEEE TMI 2004.",
  "bibtex": "@article{...}",
  "normalization": {"mean": [r, g, b], "std": [r, g, b]},   // [0,1] float, train images only
  "label_dict": {"0": "background", "1": "vessel"},          // keys = class IDs
  "train": [{"image": "rel/path.png", "target": "rel/path.png"}, ...],
  "val":   [...],
  "test":  [{"image": "..."} , ...]   // target may be missing (e.g. DRIVE test); dataloader returns (image, None)
}

Target file conventions:

  • Multilabel (.npy): 3D mask stack (C, H, W). Channel = class ID = label_dict key.
  • Multiclass (PNG): single-channel integer mask. Pixel value = class ID.

Missing-GT entries return (image, None). Use collate_optional_target for batching:

from anatomy_datasets import collate_optional_target
DataLoader(ds_test, collate_fn=collate_optional_target)

Exporters (on-disk → other frameworks)

# mmsegmentation (img_dir / ann_dir / palette PNG; auto multiclass vs per-channel)
python -m anatomy_datasets.exporters.mmseg \
    --splits ./datasets/drive/drive_splits.json --root ./datasets/drive \
    --out ./exports/mmseg_drive

# COCO (mmdet / detectron2 / generic; auto-shards above 50k anns)
python -m anatomy_datasets.exporters.coco \
    --splits ./datasets/paxray/paxray.json --root ./datasets/paxray \
    --out ./exports/coco_paxray

# Sharded SA-1B per-image RLE JSON (lazy load; for PAXRay-scale instance seg)
python -m anatomy_datasets.formats.sharded_coco \
    --splits ./datasets/paxray/paxray.json --root ./datasets/paxray \
    --out ./exports/sharded_paxray --image-link-mode symlink

# HuggingFace DatasetDict (Python only, no CLI)
python -c "from anatomy_datasets.exporters import to_hf_dataset; \
           to_hf_dataset('./datasets/drive/drive_splits.json', './datasets/drive')"

Adapters (runtime, no on-disk export)

# mmseg config side-effect import
custom_imports = dict(imports=["anatomy_datasets.adapters.mmseg"])
train_dataloader = dict(dataset=dict(
    type="AnatomyMulticlassDataset",
    splits_json="./datasets/drive/drive_splits.json",
    split="train",
))

# mmdet config (reads from sharded export)
custom_imports = dict(imports=["anatomy_datasets.adapters.mmdet"])
train_dataloader = dict(dataset=dict(
    type="ShardedCocoMMDetDataset",
    sharded_root="./exports/sharded_paxray",
    split="train",
))

Layout

src/anatomy_datasets/        # the importable package
  base.py                    # BaseMultiClassDataset, BaseMultiLabelDataset, BaseDetectionDataset, collate_optional_target
  datasets/                  # per-dataset class aliases
  transforms.py              # get_transform / get_transform_det (albumentations)
  registry.py                # DATASET_REGISTRY (license / citation / modality / BibTeX)
  postprocess.py             # add_metadata_to_splits_json
  stats.py                   # compute_image_stats
  formats/sharded_coco.py    # SA-1B-style writer + reader
  exporters/{coco,mmseg,hf}  # one-shot exporters + CLIs
  adapters/{mmseg,mmdet}     # runtime adapters (lazy lib imports)
src/prepare_data/prepare_<name>/  # per-dataset download / labels / splits / verify scripts
src/visualization/visualize.py    # overlay helpers (no CLI; import or use verify scripts)
configs/{smp,mmseg,mmdet,hf}/     # copy-paste training templates
notebooks/                        # Download_datasets.ipynb, Dataloader_example.ipynb
src/training/                     # legacy backcompat shim (train.py example, deprecated)

Adding a dataset

  1. Add a DatasetInfo (including bibtex) to src/anatomy_datasets/registry.py.
  2. Write src/prepare_data/prepare_<name>/{download_<name>.py, prepare_<name>_splits.py, get_<name>_full.sh}. Splits script must call add_metadata_to_splits_json(...) at the end.
  3. Add an entry to src/anatomy_datasets/_discovery.py (alias → class + env var + default JSON name).
  4. Add a thin alias module under src/anatomy_datasets/datasets/<name>.py.
  5. Add verify_<name>_visually.py under the prep dir and verify by loading + overlaying masks.

Caveats

  • mmseg has no native multilabel. Use per-channel export (--mode per_channel) or AnatomyMultilabelDataset with target_channel.
  • Sharded format currently only writes from .npy multilabel sources; polygon/COCO passthrough is TODO.
  • COCO exporter is multilabel-only; for multiclass PNG datasets use the mmseg exporter.
  • HF exporter has no CLI yet (Python function only).
  • Some registry license fields are "unknown" — check the source URL before commercial use.

License

This project: MIT. Each upstream dataset retains its own license — see the table above and the license / license_url fields in each splits JSON.

Citation

If you use any dataset prepared by this repo, cite the upstream paper. BibTeX is embedded in every splits JSON under the bibtex key, accessible via anatomy_datasets.get_dataset_info(name).bibtex.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

anatomy_datasets-0.1.0.tar.gz (46.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

anatomy_datasets-0.1.0-py3-none-any.whl (52.1 kB view details)

Uploaded Python 3

File details

Details for the file anatomy_datasets-0.1.0.tar.gz.

File metadata

  • Download URL: anatomy_datasets-0.1.0.tar.gz
  • Upload date:
  • Size: 46.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.4

File hashes

Hashes for anatomy_datasets-0.1.0.tar.gz
Algorithm Hash digest
SHA256 09c138d3ff936d6952c7b85bfebc1b2e8e3d866626372f37ebae6124fe67941c
MD5 da23fdfaadc25b5848e9e990b9e21e75
BLAKE2b-256 873654735578090b6b627e9be983cb49be63e13d135740e9f545b12af491b6ea

See more details on using hashes here.

File details

Details for the file anatomy_datasets-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for anatomy_datasets-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 ef376cac10eb4314c40fd16774290fd51a5d1a6f1e8c3189b58156cd281f93be
MD5 a75997fc0669b70b1a47376d195c867a
BLAKE2b-256 b69ac78c47887266c40e21e2ccf83a0e92d5a28e844842dec70833d35aff8f5b

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page