Dataset validation and preprocessing toolkit for neurology brain imaging (NIfTI)

Project description

NeuroTK: Dataset Validation for Neurology Brain Imaging

Motivation

Neurology brain imaging datasets are heterogeneous and frequently contain inconsistencies. Geometry, spacing, orientation, and annotation issues occur commonly across CT and MRI collections. These problems often surface late in modeling, when remediation is costly and compromises reproducibility. NeuroTK surfaces issues early, explicitly, and reproducibly to support dataset hygiene prior to analysis.

Scope

NeuroTK focuses on dataset quality assurance prior to downstream analysis. It provides:

Dataset-level and file-level validation
Structural and geometric consistency checks
Annotation presence and integrity assessment

NeuroTK does not modify scientific data.

Installation

pip install neurotk

Quickstart

neurotk validate --images imagesTr --labels labelsTr --out report.json

The validate command scans the image and label directories recursively for .nii and .nii.gz files. An image is paired with a label only when their filenames match exactly, as in this layout:

dataset/
  imagesTr/
    case_001.nii.gz
    case_002.nii.gz
  labelsTr/
    case_001.nii.gz
    case_002.nii.gz
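
The pairing rule above can be sketched in a few lines of Python. This is an illustrative sketch, not NeuroTK's implementation: the helper names find_nifti and pair_by_filename are hypothetical, and it assumes filenames are unique within each directory tree.

```python
from pathlib import Path

NIFTI_SUFFIXES = (".nii", ".nii.gz")


def find_nifti(root: Path) -> dict[str, Path]:
    """Recursively map each NIfTI filename under root to its path.

    Assumption: filenames are unique within the tree; a later duplicate
    would overwrite an earlier one in this sketch.
    """
    return {
        p.name: p
        for p in sorted(root.rglob("*"))
        if p.is_file() and p.name.endswith(NIFTI_SUFFIXES)
    }


def pair_by_filename(images_dir: Path, labels_dir: Path):
    """Return (pairs, unlabeled) using exact filename equality."""
    images = find_nifti(images_dir)
    labels = find_nifti(labels_dir)
    pairs = {name: (images[name], labels[name]) for name in images if name in labels}
    unlabeled = sorted(name for name in images if name not in labels)
    return pairs, unlabeled
```

Calling pair_by_filename(Path("dataset/imagesTr"), Path("dataset/labelsTr")) on the layout above would pair both cases. An image named case_001_0000.nii.gz would not pair with case_001.nii.gz, because the names differ.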

CLI Reference

Validate:

neurotk validate \
  --images imagesTr \
  --labels labelsTr \
  --out report.json \
  --max-samples 10 \
  --html report.html \
  --summary-only

Key options:

--images (required): directory of input NIfTI images.
--labels (optional): directory of label NIfTI files.
--out (required): output JSON report path.
--max-samples (optional): limit number of images processed.
--html (optional): write HTML report.
--summary-only (optional): print text summary to stdout.

Preprocess:

neurotk preprocess \
  --images imagesTr \
  --labels labelsTr \
  --out preprocessed/ \
  --spacing 1.0 1.0 1.0 \
  --orientation RAS \
  --copy-metadata

Key options:

--images (required): directory of input NIfTI images.
--labels (optional): directory of label NIfTI files.
--out (required): output directory for preprocessed files.
--spacing (required): target spacing as 3 floats.
--orientation (optional): target orientation (default RAS).
--dry-run (optional): preview preprocessing without writing outputs.
--copy-metadata (optional): preserve metadata when applicable.
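
To get a feel for what a --spacing target implies, the sketch below estimates the voxel grid size after resampling when the physical extent of the volume is kept constant. The function name resampled_shape is hypothetical, and the rounding rule is an assumption; NeuroTK's actual resampling may round or pad differently.

```python
def resampled_shape(shape, spacing, target_spacing):
    """Estimate the grid size after resampling to target_spacing.

    Assumption: the physical extent (voxels * spacing) is preserved along
    each axis and the result is rounded to the nearest whole voxel.
    """
    if not (len(shape) == len(spacing) == len(target_spacing) == 3):
        raise ValueError("expected three values for shape and each spacing")
    return tuple(
        max(1, round(n * s / t))
        for n, s, t in zip(shape, spacing, target_spacing)
    )


# A 512 x 512 x 40 volume at 0.5 x 0.5 x 5.0 mm, resampled to 1.0 mm isotropic.
print(resampled_shape((512, 512, 40), (0.5, 0.5, 5.0), (1.0, 1.0, 1.0)))
```

Thick-slice acquisitions grow considerably along the slice axis at isotropic spacing, which is worth checking with --dry-run before writing outputs.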

Inference (MONAI bundles)

NeuroTK can run inference from external MONAI bundles via the optional inference extras:

pip install neurotk[inference]

Single image:

neurotk infer \
  --bundle-dir /path/to/bundle \
  --input image.nii.gz \
  --output-dir outputs/

Default bundle (uses NEUROTK_DEFAULT_BUNDLE or UMNSHAMLAB/segresnet):

neurotk infer \
  --input image.nii.gz \
  --output-dir outputs/

From Hugging Face (auto-download + cache full bundle):

neurotk infer \
  --bundle-dir hf:UMNSHAMLAB/segresnet \
  --input image.nii.gz \
  --output-dir outputs/

You can also pass a Hugging Face repo URL:

neurotk infer \
  --bundle-dir https://huggingface.co/UMNSHAMLAB/segresnet \
  --input image.nii.gz \
  --output-dir outputs/

Batch mode:

neurotk infer \
  --bundle-dir /path/to/bundle \
  --input-list images.txt \
  --output-dir outputs/
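
A list file for batch mode is a plain text file with one image path per line. The sketch below shows one way to generate it from a directory; write_input_list is a hypothetical helper, and writing paths exactly as discovered (rather than resolving them to absolute paths) is an assumption about what neurotk infer accepts.

```python
from pathlib import Path


def write_input_list(images_dir: Path, list_path: Path) -> int:
    """Write one NIfTI path per line to list_path and return the count."""
    paths = sorted(
        p
        for p in images_dir.rglob("*")
        if p.is_file() and p.name.endswith((".nii", ".nii.gz"))
    )
    # One path per line, with a trailing newline after the last entry.
    list_path.write_text("".join(f"{p}\n" for p in paths))
    return len(paths)
```

For example, write_input_list(Path("imagesTr"), Path("images.txt")) produces a file that can be passed to --input-list.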

Key options:

--bundle-dir (optional): local MONAI bundle path, org/model, hf:org/model, or HF URL.
--input (optional): one NIfTI file or a directory of NIfTI files.
--input-list (optional): text file with one image path per line.
Use exactly one of --input or --input-list.
--output-dir (required): output directory for predictions.
--device (optional): inference device (for example cuda, cuda:0, mps, cpu).
--save-probs (optional): save probability output (*_prob.nii.gz) instead of segmentation (*_seg.nii.gz).
--force (optional): recompute outputs even if prediction files already exist.
--skip-invalid-inputs (optional): continue inference by skipping files that fail (for example incompatible channels/dimensions).
--labels-dir (optional): labels directory used to compute Dice during inference.
--reference-image (optional): image whose affine/header are used for saved outputs.

Device selection:

# CUDA
neurotk infer --device cuda --input image.nii.gz --output-dir outputs/

# Apple Silicon
neurotk infer --device mps --input image.nii.gz --output-dir outputs/

# CPU
neurotk infer --device cpu --input image.nii.gz --output-dir outputs/

If inference runs on CPU (explicitly or via fallback), NeuroTK prints a warning because runtime may be significantly slower.

Dice during inference:

neurotk infer computes Dice and writes outputs/dice_scores.csv only when labels are available.
If --labels-dir is omitted and --input is a directory, NeuroTK auto-detects sibling labels directories such as images -> labels and imagesTr -> labelsTr.
If labels are not present, Dice is skipped.
If --input path does not exist, inference fails fast with a clear error.
Existing prediction outputs are skipped by default; pass --force to recompute.
With --skip-invalid-inputs, invalid files are skipped and recorded in outputs/skipped_inputs.csv.
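
The sibling-directory detection described above can be approximated as follows. This is a sketch under stated assumptions: guess_labels_dir is a hypothetical name, and the rule "replace the leading images with labels" is inferred only from the two examples given (images -> labels, imagesTr -> labelsTr).

```python
from pathlib import Path


def guess_labels_dir(images_dir: Path):
    """Return a sibling labels directory for images_dir, or None.

    Assumption: the labels directory has the same name as the images
    directory with the leading "images" replaced by "labels".
    """
    name = images_dir.name
    if not name.startswith("images"):
        return None
    candidate = images_dir.with_name("labels" + name[len("images"):])
    return candidate if candidate.is_dir() else None
```

When this returns None, the documented behavior applies: Dice is skipped unless --labels-dir is passed explicitly.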

Dice after inference:

neurotk dice \
  --preds outputs/ \
  --labels-dir labels/ \
  --output outputs/dice_scores.csv
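
For reference, the Dice coefficient between a binary prediction and a binary label is 2|P ∩ L| / (|P| + |L|). The sketch below computes it over flattened voxel sequences in plain Python. The function name dice_score is hypothetical, and returning 1.0 when both masks are empty is an assumption; NeuroTK may handle that case differently.

```python
def dice_score(pred, label):
    """Dice coefficient between two flattened binary masks."""
    if len(pred) != len(label):
        raise ValueError("prediction and label must have the same number of voxels")
    pred_fg = sum(1 for p in pred if p)
    label_fg = sum(1 for v in label if v)
    overlap = sum(1 for p, v in zip(pred, label) if p and v)
    denom = pred_fg + label_fg
    if denom == 0:
        # Assumption: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * overlap / denom


# One overlapping voxel, two predicted, one labeled: 2 * 1 / (2 + 1).
print(dice_score([1, 1, 0, 0], [1, 0, 0, 0]))
```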

Lesion volume from predictions:

neurotk lesion-volume \
  --preds outputs/ \
  --output outputs/lesion_volumes.csv \
  --summary-output outputs/lesion_volumes_summary.csv

With histogram:

neurotk lesion-volume \
  --preds outputs/ \
  --output outputs/lesion_volumes.csv \
  --histogram outputs/lesion_volume_hist.png \
  --hist-bins 30

Output columns:

image
lesion_voxels
voxel_volume_mm3
lesion_volume_mm3
lesion_volume_ml
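
The columns above are related by simple arithmetic: the voxel volume is the product of the three spacings in millimetres, the lesion volume is the voxel count times the voxel volume, and 1 mL equals 1000 mm3. A minimal sketch, assuming the spacing is given in mm and using the hypothetical helper name lesion_volume_row:

```python
def lesion_volume_row(image, lesion_voxels, spacing_mm):
    """Build one report row from a lesion voxel count and voxel spacing (mm)."""
    sx, sy, sz = spacing_mm
    voxel_volume_mm3 = sx * sy * sz
    lesion_volume_mm3 = lesion_voxels * voxel_volume_mm3
    return {
        "image": image,
        "lesion_voxels": lesion_voxels,
        "voxel_volume_mm3": voxel_volume_mm3,
        "lesion_volume_mm3": lesion_volume_mm3,
        "lesion_volume_ml": lesion_volume_mm3 / 1000.0,  # 1 mL = 1000 mm^3
    }


# 2500 lesion voxels at 1 x 1 x 2 mm spacing.
print(lesion_volume_row("case_001.nii.gz", 2500, (1.0, 1.0, 2.0)))
```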

Summary CSV columns:

category (range or overall)
metric (range label or stat name)
count
percent
value_ml

Included overall stats:

total_images
min_ml
p25_ml
median_ml
p75_ml
max_ml
mean_ml
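
The overall stats can be reproduced from the per-image lesion_volume_ml values with Python's statistics module. This is a sketch only: overall_stats is a hypothetical name, and the "inclusive" quantile method is an assumption, so quartiles may differ slightly from NeuroTK's output if it interpolates differently.

```python
import statistics


def overall_stats(volumes_ml):
    """Summarize a list of lesion volumes (mL) with the stats listed above."""
    if len(volumes_ml) < 2:
        raise ValueError("need at least two volumes to compute quartiles")
    # Assumption: inclusive (linear interpolation) quartiles.
    p25, median, p75 = statistics.quantiles(volumes_ml, n=4, method="inclusive")
    return {
        "total_images": len(volumes_ml),
        "min_ml": min(volumes_ml),
        "p25_ml": p25,
        "median_ml": median,
        "p75_ml": p75,
        "max_ml": max(volumes_ml),
        "mean_ml": statistics.fmean(volumes_ml),
    }


print(overall_stats([0.0, 1.0, 2.0, 3.0, 4.0]))
```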

Dice options (neurotk dice):

--preds (optional): one prediction NIfTI file or a directory of predictions.
--preds-list (optional): text file with one prediction path per line.
Use exactly one of --preds or --preds-list.
--labels-dir (required): labels directory.
--output (required): CSV output path for Dice/Hausdorff metrics.

Lesion volume options:

--preds (optional): one prediction NIfTI file or a directory of predictions.
--preds-list (optional): text file with one prediction path per line.
--output (required): CSV output path for lesion volume report.
--summary-output (optional): CSV output path for lesion-volume range summary.
--threshold (optional): threshold for binarizing 3D probability maps (default 0.5).
--histogram (optional): path to save histogram image of lesion volumes (mL).
--hist-bins (optional): number of histogram bins (default 30).

Cohort selection stats from original label scans:

neurotk cohort-stats \
  --labels labelsTr/ \
  --normal-csv normal_ct_flags.csv \
  --output cohort_classification.csv \
  --summary-output cohort_summary.csv \
  --tn-threshold-ml 0.2 \
  --low-max-ml 5.0 \
  --medium-max-ml 20.0

Generate normal_ct_flags.csv from original labels:

neurotk make-normal-csv \
  --images imagesTr/ \
  --labels labelsTr/ \
  --output normal_ct_flags.csv \
  --threshold-ml 0.2 \
  --train-selection-json train_selection.json \
  --train-min-lesion-ml 1.0

Classification rule:

true_negative: normal_ct == true and lesion volume <= tn-threshold-ml (default 0.2 mL).
true_positive: all other cases.
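True positives are subdivided into low, medium, and high groups by lesion volume, using --low-max-ml and --medium-max-ml as the upper bounds of the low and medium groups.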
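
The rule above can be written as a small function. This is a sketch, not NeuroTK's code: classify_case is a hypothetical name, the defaults mirror the documented CLI defaults, and treating the low and medium upper bounds as inclusive is an assumption.

```python
def classify_case(
    normal_ct,
    lesion_volume_ml,
    tn_threshold_ml=0.2,
    low_max_ml=5.0,
    medium_max_ml=20.0,
):
    """Return (category, group) for one case; group is None for true negatives."""
    if normal_ct and lesion_volume_ml <= tn_threshold_ml:
        return ("true_negative", None)
    # All other cases are true positives, grouped by lesion volume.
    # Assumption: upper bounds are inclusive.
    if lesion_volume_ml <= low_max_ml:
        group = "low"
    elif lesion_volume_ml <= medium_max_ml:
        group = "medium"
    else:
        group = "high"
    return ("true_positive", group)


print(classify_case(True, 0.1))
print(classify_case(False, 12.5))
```

Note that under the stated rule a case flagged normal_ct == false is a true positive even when its lesion volume is below the TN threshold.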

Cohort stats options:

--labels (optional): one label NIfTI file or directory of label files.
--labels-list (optional): text file with one label path per line.
--normal-csv (required): CSV with normal CT flag. Supported columns include image/id and normal_ct/normal/is_normal.
--output (required): per-case classification CSV path.
--summary-output (required): cohort summary CSV path.
--tn-threshold-ml (optional): TN threshold in mL (default 0.2).
--low-max-ml (optional): upper bound for TP low group (default 5.0).
--medium-max-ml (optional): upper bound for TP medium group (default 20.0).

Normal-CT CSV generator options:

--images (optional): one image NIfTI file or directory of image files.
--images-list (optional): text file with one image path per line.
When --train-selection-json is used, one of --images or --images-list must be provided.
--labels (optional): one label NIfTI file or directory of label files.
--labels-list (optional): text file with one label path per line.
--output (required): output CSV path.
--threshold-ml (optional): threshold used to set normal_ct=true from label lesion volume (default 0.2).
--train-selection-json (optional): write MONAI datalist-style JSON for selected training cases.
--train-min-lesion-ml (optional): include only labels with lesion volume > this threshold in training JSON (default 1.0).
--num-folds (optional): number of CV folds for assigning fold in training entries (default 5).

Train-selection JSON structure (MONAI-style):

description
labels
training with {image, label, fold}
validation (empty by default)
testing with {image}
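
A datalist with this structure could be assembled as in the sketch below. Several details are assumptions made for illustration: the helper name build_train_selection, the round-robin fold assignment, the contents of description and labels, and the choice to pass test images in explicitly. Only the top-level keys, the strict "greater than" filter on lesion volume, and the defaults come from the options above.

```python
import json


def build_train_selection(cases, test_images=(), train_min_lesion_ml=1.0, num_folds=5):
    """Build a MONAI datalist-style dict from per-case records.

    cases: iterable of dicts with "image", "label" and "lesion_volume_ml".
    """
    # Keep only cases whose lesion volume exceeds the training threshold.
    selected = [c for c in cases if c["lesion_volume_ml"] > train_min_lesion_ml]
    training = [
        # Assumption: folds are assigned round-robin in input order.
        {"image": c["image"], "label": c["label"], "fold": i % num_folds}
        for i, c in enumerate(selected)
    ]
    return {
        "description": "hypothetical example datalist",
        "labels": {"0": "background", "1": "lesion"},  # hypothetical label map
        "training": training,
        "validation": [],
        "testing": [{"image": path} for path in test_images],
    }


cases = [
    {"image": "imagesTr/case_001.nii.gz", "label": "labelsTr/case_001.nii.gz", "lesion_volume_ml": 0.5},
    {"image": "imagesTr/case_002.nii.gz", "label": "labelsTr/case_002.nii.gz", "lesion_volume_ml": 3.2},
]
print(json.dumps(build_train_selection(cases), indent=2))
```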

Note: for full-bundle HF usage, the repo must contain a valid MONAI bundle layout (e.g., configs/ with inference/evaluate config and models/ checkpoints).

Output

NeuroTK emits a JSON report containing a dataset-level summary, per-file diagnostics, and explicit listings of detected issues. For validate+preprocess runs, the report includes a processed summary and preprocess traceability so original and processed states are unambiguous.

{
  "summary": {"scope": "original_inputs", "num_images": 100, "files_with_issues": 7},
  "summary_processed": {"scope": "processed_outputs", "num_images": 100},
  "files": {"case_001.nii.gz": {"issues": ["label_missing"]}}
}

Validate vs preprocess semantics

summary always reflects original inputs.
summary_processed is present only for validate+preprocess runs and reflects outputs after preprocessing.
run_mode indicates whether preprocessing was requested.
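
A report with this shape is straightforward to consume downstream. The sketch below parses the sample report shown above and lists the files that have issues; files_with_issues and was_preprocessed are hypothetical helper names, and only the keys visible in the sample are assumed to exist.

```python
import json

SAMPLE_REPORT = """
{
  "summary": {"scope": "original_inputs", "num_images": 100, "files_with_issues": 7},
  "summary_processed": {"scope": "processed_outputs", "num_images": 100},
  "files": {"case_001.nii.gz": {"issues": ["label_missing"]}}
}
"""


def files_with_issues(report):
    """Map each filename to its non-empty list of issues."""
    return {
        name: entry["issues"]
        for name, entry in report.get("files", {}).items()
        if entry.get("issues")
    }


def was_preprocessed(report):
    """True when the report carries a processed-outputs summary."""
    return "summary_processed" in report


report = json.loads(SAMPLE_REPORT)
print(report["summary"]["num_images"], "images scanned")
print(files_with_issues(report))
print("preprocessed:", was_preprocessed(report))
```

In practice the report would be loaded from the path given to --out, for example with json.load(open("report.json")).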

Upgrading to v0.3.0

Reports now include explicit scope fields and preprocess traceability blocks. These additions are backward-compatible for validation-only users.

Web UI

The FastAPI app in webapp/ is the primary landing page and execution interface. The older site/ Next.js prototype is deprecated and should not be used for deployment.

Citation

If you use NeuroTK in your research, please cite it as follows:

@software{neurotk,
  title  = {NeuroTK: Dataset Validation for Neurology Brain Imaging},
  author = {Sakshi Rathi},
  year   = {2026},
  doi    = {10.5281/zenodo.18252017},
  url    = {https://github.com/SakshiRa/neurotk},
  note   = {Open-source toolkit for dataset validation and quality assurance in neurology brain imaging}
}

Release history

0.3.4 (Jul 3, 2026)
0.3.3 (Feb 8, 2026), this version
0.3.2 (Feb 8, 2026)
0.3.1 (Feb 7, 2026)
0.3.0 (Jan 24, 2026)
0.2.1 (Jan 17, 2026)
0.1.1 (Jan 17, 2026)
0.1.0 (Jan 15, 2026)

