
UMIE_datasets


🤩 About the Project

Warning: This project is currently in its alpha stage and may be subject to major changes.

This repository provides a suite of unified scripts to standardize, preprocess, and integrate 882,774 images from 20 open-source medical imaging datasets, spanning modalities such as X-ray, CT, and MRI. The scripts allow for a seamless and fast download of a diverse medical dataset. We create a unified set of annotations that allows the datasets to be merged without mislabelling. Each dataset is preprocessed with a custom sklearn pipeline whose steps are reusable across datasets. The code was designed so that preprocessing a new dataset is simple and requires only reusing the available pipeline steps, with customization done by setting the appropriate pipeline params.
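As a rough illustration of this design (the step classes and parameters below are invented for the example, not the package's actual API), a dataset pipeline might be assembled like this:

    # Minimal sketch of the reusable-step pattern, assuming sklearn-style
    # transformers.  Step classes and params here are hypothetical.
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.pipeline import Pipeline

    class ConvertToPng(BaseEstimator, TransformerMixin):
        """Hypothetical reusable step: convert source images to .png."""
        def __init__(self, source_format="dcm"):
            self.source_format = source_format
        def fit(self, X, y=None):
            return self
        def transform(self, X):
            # convert each file referenced in X from source_format to .png
            return X

    class AddUmieIds(BaseEstimator, TransformerMixin):
        """Hypothetical reusable step: apply the unified naming scheme."""
        def fit(self, X, y=None):
            return self
        def transform(self, X):
            # rename files in X to the unified uid-based scheme
            return X

    # The same steps are reused across datasets; only the params change.
    example_pipeline = Pipeline([
        ("convert_to_png", ConvertToPng(source_format="nii")),
        ("add_umie_ids", AddUmieIds()),
    ])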

The labels and segmentation masks were unified to be compliant with the RadLex ontology.
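In practice, unification means mapping each source dataset's native labels onto one shared, RadLex-derived vocabulary. A minimal sketch (the placeholder names below are not the package's real encoding):

    # Illustrative only: source-specific labels mapped to a shared,
    # RadLex-derived vocabulary.  Keys and values are placeholders.
    radlex_label_map = {
        "source_dataset_a/PNEUMONIA": "pneumonia",
        "source_dataset_b/tumor": "neoplasm",
    }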

(Figure: preprocessing modules)

Datasets

uid  Dataset                                             Modality  Task
0    KITS-23                                             CT        Classification, Segmentation
1    CoronaHack                                          X-Ray     Classification
2    Alzheimer's Dataset                                 MRI       Classification
3    Brain Tumor Classification                          MRI       Classification
4    COVID-19 Detection X-Ray                            X-Ray     Classification
5    Finding and Measuring Lungs in CT Data              CT        Segmentation
6    Brain CT Images with Intracranial Hemorrhage Masks  CT        Classification
7    Liver and Liver Tumor Segmentation                  CT        Classification, Segmentation
8    Brain MRI Images for Brain Tumor Detection          MRI       Classification
9    Knee Osteoarthritis Dataset with Severity Grading   X-Ray     Classification
10   Brain Tumor Progression                             MRI       Segmentation
11   Chest X-ray 14                                      X-Ray     Classification
12   COCA - Coronary Calcium and chest CTs               CT        Segmentation
13   BrainMetShare                                       MRI       Segmentation

Using the datasets

Installing requirements

poetry install

Creating the dataset

Due to the copyright restrictions of the source datasets, we cannot share the files directly. To obtain the full dataset, you have to download the source datasets yourself and run the preprocessing scripts.

0. KITS-23

  1. Clone the KITS-23 repository.
  2. Enter the KITS-23 directory and install the packages with pip.
    cd kits23
    pip3 install -e .
    
  3. Run the following command to download the data to the dataset/ folder.
    kits23_download_data
    
  4. Fill in the source_path and target_path of KITS23Pipeline() in config/runner_config.py, e.g.:
     KITS23Pipeline(
          path_args={
              "source_path": "kits23/dataset",  # Path to the dataset directory in KITS23 repo
              "target_path": TARGET_PATH,
              "labels_path": "kits23/dataset/kits23.json",  # Path to kits23.json
          },
          dataset_args=dataset_config.KITS23
      ),
    
1. CoronaHack - Chest X-Ray Dataset

  1. Go to CoronaHack page on Kaggle.
  2. Login to your Kaggle account.
  3. Download the data.
  4. Extract archive.zip.
  5. Fill in the source_path to the location of the archive folder in CoronaHackPipeline() in config/runner_config.py.
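For the Kaggle-hosted datasets in this guide, steps 1-4 can optionally be scripted with the official kaggle package instead of the browser. A minimal sketch, assuming an API token in ~/.kaggle/kaggle.json; the dataset slug is a placeholder to copy from the dataset's Kaggle page:

    # Optional: download a Kaggle dataset programmatically.
    from kaggle.api.kaggle_api_extended import KaggleApi

    api = KaggleApi()
    api.authenticate()
    api.dataset_download_files(
        "owner/coronahack-chest-xraydataset",  # placeholder slug
        path="data/coronahack",
        unzip=True,  # extracts archive.zip automatically
    )
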
2. Alzheimer's Dataset (4 class of Images)

  1. Go to Alzheimer's Dataset page on Kaggle.
  2. Login to your Kaggle account.
  3. Download the data.
  4. Extract archive.zip.
  5. Fill in the source_path to the location of the archive folder in AlzheimersPipeline() in config/runner_config.py.
3. Brain Tumor Classification (MRI)

  1. Go to Brain Tumor Classification page on Kaggle.
  2. Login to your Kaggle account.
  3. Download the data.
  4. Extract archive.zip.
  5. Fill in the source_path to the location of the archive folder in BrainTumorClassificationPipeline() in config/runner_config.py.
4. COVID-19 Detection X-Ray

  1. Go to COVID-19 Detection X-Ray page on Kaggle.
  2. Login to your Kaggle account.
  3. Download the data.
  4. Extract archive.zip.
  5. Remove the TrainData folder; we do not want augmented data at this stage.
  6. Fill in the source_path to the location of the archive folder in COVID19DetectionPipeline() in config/runner_config.py.
5. Finding and Measuring Lungs in CT Data

  1. Go to Finding and Measuring Lungs in CT Data page on Kaggle.
  2. Login to your Kaggle account.
  3. Download the data.
  4. Extract archive.zip.
  5. Fill in the source_path to the location of the archive/2d_images folder in FindingAndMeasuringLungsPipeline() in config/runner_config.py. Fill in masks_path with the location of the archive/2d_masks folder.
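By analogy with the KITS-23 example above (TARGET_PATH and the dataset_args value are assumptions carried over from that example, not verified names), the entry might look like:

     FindingAndMeasuringLungsPipeline(
          path_args={
              "source_path": "archive/2d_images",  # Path to the extracted images
              "target_path": TARGET_PATH,
              "masks_path": "archive/2d_masks",  # Path to the extracted masks
          },
          dataset_args=dataset_config.FindingAndMeasuringLungs,  # assumed name
      ),
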
6. Brain CT Images with Intracranial Hemorrhage Masks

  1. Go to Brain With Intracranial Hemorrhage page on Kaggle.
  2. Login to your Kaggle account.
  3. Download the data.
  4. Extract archive.zip.
  5. Fill in the source_path to the location of the archive folder in BrainWithIntracranialHemorrhagePipeline() in config/runner_config.py. Fill in masks_path with the same path as the source_path.
7. Liver and Liver Tumor Segmentation (LITS)

  1. Go to Liver and Liver Tumor Segmentation page on Kaggle.
  2. Login to your Kaggle account.
  3. Download the data.
  4. Extract archive.zip.
  5. Fill in the source_path to the location of the archive folder in the LITS pipeline entry in config/runner_config.py. Fill in the masks_path as well.
8. Brain MRI Images for Brain Tumor Detection

  1. Go to Brain MRI Images for Brain Tumor Detection page on Kaggle.
  2. Login to your Kaggle account.
  3. Download the data.
  4. Extract archive.zip.
  5. Fill in the source_path to the location of the archive folder in BrainTumorDetectionPipeline() in config/runner_config.py.
9. Knee Osteoarthritis Dataset with Severity Grading

  1. Go to Knee Osteoarthritis Dataset with Severity Grading page on Kaggle.
  2. Login to your Kaggle account.
  3. Download the data.
  4. Extract archive.zip.
  5. Fill in the source_path to the location of the archive folder in the Knee Osteoarthritis pipeline entry in config/runner_config.py.

10. Brain-Tumor-Progression

  1. Go to Brain Tumor Progression page on The Cancer Imaging Archive and download the data.
11. Chest X-ray 14

  1. Go to Chest X-ray 14.
  2. Create an account.
  3. Download the images folder and Data_Entry_2017_v2020.csv.
12. COCA - Coronary Calcium and chest CTs

  1. Go to COCA- Coronary Calcium and chest CTs.
  2. Log in or sign up for a Stanford AIMI account.
  3. Fill in your contact details.
  4. Download the data with azcopy.
  5. Fill in the source_path with the location of the cocacoronarycalciumandchestcts-2/Gated_release_final/patient folder. Fill in masks_path with the path to the cocacoronarycalciumandchestcts-2/Gated_release_final/calcium_xml XML file.
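For both this dataset and BrainMetShare below, Stanford AIMI provides the storage URL after you register; the azcopy download then generally takes the form below, with the URL a placeholder for the link you receive:

    azcopy copy "<storage-url-from-stanford-aimi>" "." --recursive
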
13. BrainMetShare

  1. Go to BrainMetShare.
  2. Log in or sign up for a Stanford AIMI account.
  3. Fill in your contact details.
  4. Download the data with azcopy.

To preprocess a dataset that is not among the above, look through the preprocessing folder. It contains the reusable steps for changing imaging formats, extracting masks, creating file trees, etc. Go to the config file to check which masks and label encodings are available, and append new labels and mask encodings if needed, as sketched below.
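As an illustration of appending encodings (the names and structure below are hypothetical, not the package's actual config schema):

    # Hypothetical label/mask encodings: RadLex-style names mapped to
    # integer ids, as a plain Python config.
    mask_encodings = {
        "kidney": 1,
        "neoplasm": 2,
    }
    label_encodings = {
        "normal": 0,
        "neoplasm": 1,
    }

    # Appending encodings for a new dataset: pick ids not already in use.
    mask_encodings["lung"] = 3
    label_encodings["pneumonia"] = 2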

Overall, the dataset should have 882,774 images in .png format:

  • CT - 500k+
  • X-Ray - 250k+
  • MRI - 100k+

🎯 Roadmap

  • dcm
  • jpg
  • nii
  • tif
  • Shared RadLex ontology
  • Hugging Face datasets
  • Data dashboards

👋 Contributors

🤝 Contact

Barbara Klaudel

TheLion.AI

Development

Pre-commits

Install pre-commit: https://pre-commit.com/#installation

If you are using VS Code, install the extension: https://marketplace.visualstudio.com/items?itemName=MarkLarah.pre-commit-vscode

To do a dry run of the pre-commit hooks and see whether your code passes, run:

pre-commit run --all-files

Adding python packages

Dependencies are handled by the Poetry framework. To add a new dependency, run:

poetry add <package_name>

Debugging

To modify and debug the app, development in containers can be useful.

Testing

To run the tests, execute:

run_tests.sh
