UMIE_datasets
🤩 About the Project
Warning: This project is currently in alpha stage and may be subject to major changes
This repository presents a suite of unified scripts to standardize, preprocess, and integrate 882,774 images from 20 open-source medical imaging datasets, spanning modalities such as X-ray, CT, and MRI. The scripts allow for fast download of a diverse medical dataset. We create a unified set of annotations so that the datasets can be merged without mislabelling. Each dataset is preprocessed with a custom sklearn pipeline whose steps are reusable across datasets. The code is designed so that preprocessing a new dataset is simple: it only requires reusing the available pipeline steps and customizing them by setting the appropriate pipeline parameters.
The labels and segmentation masks were unified to be compliant with the RadLex ontology.
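As a rough illustration of the reusable-step idea described above, the sketch below chains small steps that expose a `transform` method, the same convention sklearn pipeline steps follow. It uses only the standard library so that it runs without extra dependencies. The step names (`KeepExtension`, `AddPrefix`), their parameters, and the `run_steps` helper are hypothetical and are not taken from this repository.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class KeepExtension:
    """Hypothetical step: keep only paths with a given extension."""

    extension: str = ".png"

    def transform(self, paths: Sequence[str]) -> List[str]:
        return [p for p in paths if p.lower().endswith(self.extension)]


@dataclass
class AddPrefix:
    """Hypothetical step: prefix each file name with a two-digit dataset uid."""

    dataset_uid: int = 0

    def transform(self, paths: Sequence[str]) -> List[str]:
        return [f"{self.dataset_uid:02d}_{p}" for p in paths]


def run_steps(steps, paths):
    """Apply each step's transform in order, as a pipeline would."""
    for step in steps:
        paths = step.transform(paths)
    return list(paths)


# The same step classes are reused; only the parameters differ per dataset.
kits_steps = [KeepExtension(".png"), AddPrefix(dataset_uid=0)]
corona_steps = [KeepExtension(".jpeg"), AddPrefix(dataset_uid=1)]

kits_result = run_steps(kits_steps, ["case_1.png", "notes.txt"])
corona_result = run_steps(corona_steps, ["img_7.jpeg", "img_8.png"])
print(kits_result)
print(corona_result)
```

The point of the design is that adding a dataset means writing a new list of configured steps, not new step code.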
Datasets
| uid | Dataset | Modality | Task |
|---|---|---|---|
| 0 | KITS-23 | CT | Classification, Segmentation |
| 1 | CoronaHack | X-ray | Classification |
| 2 | Alzheimer's Dataset | MRI | Classification |
| 3 | Brain Tumor Classification | MRI | Classification |
| 4 | COVID-19 Detection X-Ray | X-ray | Classification |
| 5 | Finding and Measuring Lungs in CT Data | CT | Segmentation |
| 6 | Brain CT Images with Intracranial Hemorrhage Masks | CT | Classification |
| 7 | Liver and Liver Tumor Segmentation | CT | Classification, Segmentation |
| 8 | Brain MRI Images for Brain Tumor Detection | MRI | Classification |
| 9 | Knee Osteoarthritis Dataset with Severity Grading | X-ray | Classification |
| 10 | Brain Tumor Progression | MRI | Segmentation |
| 11 | Chest X-ray 14 | X-ray | Classification |
| 12 | COCA- Coronary Calcium and chest CTs | CT | Segmentation |
| 13 | BrainMetShare | MRI | Segmentation |
Using the datasets
Installing requirements
poetry install
Creating the dataset
Due to the copyright restrictions of the source datasets, we can't share the files directly. To obtain the full dataset you have to download the source datasets yourself and run the preprocessing scripts.
0. KITS-23
- Clone the KITS-23 repository.
- Enter the KITS-23 directory and install the packages with pip.

      cd kits23
      pip3 install -e .

- Run the following command to download the data to the `dataset/` folder.

      kits23_download_data

- Fill in the `source_path` and `target_path` of `KITS23Pipeline()` in `config/runner_config.py`, e.g.

      KITS23Pipeline(
          path_args={
              "source_path": "kits23/dataset",  # Path to the dataset directory in KITS23 repo
              "target_path": TARGET_PATH,
              "labels_path": "kits23/dataset/kits23.json",  # Path to kits23.json
          },
          dataset_args=dataset_config.KITS23
      ),
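Before running the preprocessing, it can help to confirm that the configured input paths actually exist. The following is a minimal sketch, not part of this repository: `find_missing_paths` is a hypothetical helper, and `"output"` is a placeholder standing in for `TARGET_PATH`.

```python
from pathlib import Path


def find_missing_paths(path_args: dict, keys=("source_path", "labels_path")) -> list:
    """Return the keys whose value is unset or does not exist on disk."""
    missing = []
    for key in keys:
        value = path_args.get(key)
        if not value or not Path(value).exists():
            missing.append(key)
    return missing


path_args = {
    "source_path": "kits23/dataset",
    "target_path": "output",  # hypothetical placeholder for TARGET_PATH
    "labels_path": "kits23/dataset/kits23.json",
}
# Prints the keys that still need attention (an empty list means all inputs exist).
print(find_missing_paths(path_args))
```

`target_path` is left out of the default check because it is an output location that may not exist before the first run.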
1. Xray CoronaHack -Chest X-Ray-Dataset
- Go to CoronaHack page on Kaggle.
- Login to your Kaggle account.
- Download the data.
- Extract `archive.zip`.
- Fill in the `source_path` to the location of the `archive` folder in `CoronaHackPipeline()` in `config/runner_config.py`.
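The extraction step above can also be scripted, which is convenient because the same step recurs for every Kaggle dataset below. A minimal sketch using Python's standard `zipfile` module; the `extract_archive` helper and the example paths are hypothetical.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path: str, out_dir: str) -> Path:
    """Extract zip_path into out_dir (created if needed) and return out_dir."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target
```

For example, `extract_archive("archive.zip", "archive")` would produce an `archive` folder whose location can then be used as `source_path`.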
2. Alzheimer's Dataset ( 4 class of Images)
- Go to Alzheimer's Dataset page on Kaggle.
- Login to your Kaggle account.
- Download the data.
- Extract `archive.zip`.
- Fill in the `source_path` to the location of the `archive` folder in `AlzheimersPipeline()` in `config/runner_config.py`.
3. Brain Tumor Classification (MRI)
- Go to Brain Tumor Classification page on Kaggle.
- Login to your Kaggle account.
- Download the data.
- Extract `archive.zip`.
- Fill in the `source_path` to the location of the `archive` folder in `BrainTumorClassificationPipeline()` in `config/runner_config.py`.
4. COVID-19 Detection X-Ray
- Go to COVID-19 Detection X-Ray page on Kaggle.
- Login to your Kaggle account.
- Download the data.
- Extract `archive.zip`.
- Remove the `TrainData` folder. We do not want augmented data at this stage.
- Fill in the `source_path` to the location of the `archive` folder in `COVID19DetectionPipeline()` in `config/runner_config.py`.
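Removing the augmented data can be done by hand or with a short script. The sketch below assumes the exact location of `TrainData` inside the extracted archive is not known in advance, so it searches for it. `remove_named_folders` is a hypothetical helper, and it deletes data permanently, so check the returned list on a copy first.

```python
import shutil
from pathlib import Path


def remove_named_folders(root: str, folder_name: str = "TrainData") -> list:
    """Delete every directory called folder_name under root; return removed paths."""
    # Collect matches before deleting so the directory walk is not disturbed.
    matches = [p for p in Path(root).rglob(folder_name) if p.is_dir()]
    removed = []
    # Remove the deepest matches first; skip any already gone with a parent.
    for path in sorted(matches, key=lambda p: len(p.parts), reverse=True):
        if path.exists():
            shutil.rmtree(path)
            removed.append(str(path))
    return removed
```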
5. Finding and Measuring Lungs in CT Data
- Go to Finding and Measuring Lungs in CT Data page on Kaggle.
- Login to your Kaggle account.
- Download the data.
- Extract `archive.zip`.
- Fill in the `source_path` to the location of the `archive/2d_images` folder in `FindingAndMeasuringLungsPipeline()` in `config/runner_config.py`. Fill in `masks_path` with the location of the `archive/2d_masks` folder.
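When a dataset ships images and masks in separate folders, a quick consistency check can catch a wrong `masks_path` early. The sketch below assumes, without confirmation from this repository, that an image and its mask share the same file name stem. `pair_images_with_masks` is a hypothetical helper.

```python
from pathlib import Path


def pair_images_with_masks(images_dir: str, masks_dir: str):
    """Match images to masks by file stem.

    Returns (pairs, missing): a list of (image, mask) paths and a list of
    image file names for which no mask was found.
    """
    masks = {p.stem: p for p in Path(masks_dir).iterdir() if p.is_file()}
    pairs, missing = [], []
    for image in sorted(Path(images_dir).iterdir()):
        if not image.is_file():
            continue
        mask = masks.get(image.stem)
        if mask is None:
            missing.append(image.name)
        else:
            pairs.append((image, mask))
    return pairs, missing
```

A non-empty `missing` list suggests either an incomplete download or paths that point at the wrong folders.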
6. Brain CT Images with Intracranial Hemorrhage Masks
- Go to Brain With Intracranial Hemorrhage page on Kaggle.
- Login to your Kaggle account.
- Download the data.
- Extract `archive.zip`.
- Fill in the `source_path` to the location of the `archive` folder in `BrainWithIntracranialHemorrhagePipeline()` in `config/runner_config.py`. Fill in `masks_path` with the same path as the `source_path`.
7. Liver and Liver Tumor Segmentation (LITS)
- Go to Liver and Liver Tumor Segmentation.
- Login to your Kaggle account.
- Download the data.
- Extract `archive.zip`.
- Fill in the `source_path` to the location of the `archive` folder in `COVID19DetectionPipeline()` in `config/runner_config.py`. Fill in `masks_path` too.
8. Brain MRI Images for Brain Tumor Detection
- Go to Brain MRI Images for Brain Tumor Detection page on Kaggle.
- Login to your Kaggle account.
- Download the data.
- Extract `archive.zip`.
- Fill in the `source_path` to the location of the `archive` folder in `BrainTumorDetectionPipeline()` in `config/runner_config.py`.
9. Knee Osteoarthritis Dataset with Severity Grading
1. Go to Knee Osteoarthritis Dataset with Severity Grading.
2. Login to your Kaggle account.
3. Download the data.
4. Extract `archive.zip`.
5. Fill in the `source_path` to the location of the `archive` folder in `COVID19DetectionPipeline()` in `config/runner_config.py`.
10. Brain-Tumor-Progression
- Go to Brain Tumor Progression dataset from the cancer imaging archive.
11. Chest X-ray 14
- Go to Chest X-ray 14.
- Create an account.
- Download the `images` folder and `DataEntry2017_v2020.csv`.
12. COCA- Coronary Calcium and chest CTs
- Go to COCA- Coronary Calcium and chest CTs.
- Log in or sign up for a Stanford AIMI account.
- Fill in your contact details.
- Download the data with azcopy.
- Fill in the `source_path` with the location of the `cocacoronarycalciumandchestcts-2/Gated_release_final/patient` folder. Fill in `masks_path` with the location of the `cocacoronarycalciumandchestcts-2/Gated_release_final/calcium_xml` xml file.
13. BrainMetShare
- Go to BrainMetShare.
- Log in or sign up for a Stanford AIMI account.
- Fill in your contact details.
- Download the data with azcopy.
To preprocess a dataset that is not listed above, look in the preprocessing folder. It contains the reusable steps for changing imaging formats, extracting masks, creating file trees, etc. Check the config file to see which mask and label encodings are available, and append new label and mask encodings if needed.
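Appending an encoding safely mostly means not reusing or shifting codes that existing datasets already depend on. The sketch below shows one way to do that with a plain dictionary. The dictionary layout, the label names, and the `append_encoding` helper are all hypothetical; the actual structure of the encodings in the config file may differ.

```python
def append_encoding(encodings: dict, name: str) -> int:
    """Give name the next free integer code and return it.

    Names that are already present keep their existing code.
    """
    if name in encodings:
        return encodings[name]
    code = max(encodings.values(), default=-1) + 1
    encodings[name] = code
    return code


# Hypothetical label encodings, for illustration only.
labels = {"normal": 0, "pneumonia": 1}
new_code = append_encoding(labels, "effusion")
same_code = append_encoding(labels, "normal")
print(labels, new_code, same_code)
```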
Overall, the dataset should have **882,774** images in .png format:
- CT - 500k+
- X-Ray - 250k+
- MRI - 100k+
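To compare your local result against these totals, you can count the `.png` files under your target path. The sketch below groups counts by first-level folder. Whether the output tree is organized by dataset or modality at that level is an assumption, and `count_png_by_top_folder` is a hypothetical helper.

```python
from collections import Counter
from pathlib import Path


def count_png_by_top_folder(target_path: str) -> Counter:
    """Count .png files under target_path, grouped by first-level folder.

    Files lying directly in target_path are counted under ".".
    """
    root = Path(target_path)
    counts = Counter()
    for png in root.rglob("*.png"):
        relative = png.relative_to(root)
        group = relative.parts[0] if len(relative.parts) > 1 else "."
        counts[group] += 1
    return counts
```

`sum(count_png_by_top_folder(path).values())` then gives the overall number of images found.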
🎯 Roadmap
- dcm
- jpg
- nii
- tif
- Shared radlex ontology
- Huggingface datasets
- Data dashboards
:wave: Contributors
:handshake: Contact
Development
Pre-commits
Install pre-commits: https://pre-commit.com/#installation
If you are using VS Code, install the extension: https://marketplace.visualstudio.com/items?itemName=MarkLarah.pre-commit-vscode
To do a dry run of the pre-commits and see whether your code passes, run:
pre-commit run --all-files
Adding python packages
Dependencies are handled by Poetry. To add a new dependency, run:
poetry add <package_name>
Debugging
To modify and debug the app, development in containers can be useful.
Testing
run_tests.sh