Python library for cleaning data in large datasets of X-rays
CleanX is an open source python library for exploring, cleaning and augmenting large datasets of X-rays, or certain other types of radiological images. The images can be extracted from DICOM files or used directly. The primary authors are Candace Makeda H. Moore, Oleg Sivokon, and Andrew Murphy.
Online documentation is at https://drcandacemakedamoore.github.io/cleanX/. You can also build up-to-date documentation, which will be generated in the ./build/sphinx/html directory, by running the following commands:
python setup.py apidoc
python setup.py build_sphinx
Special additional documentation for medical professionals with limited programming ability is available on the project wiki. To get a high-level overview of some of the functionality of the program, you can look at the Jupyter notebooks inside the workflow_demo folder.
- Python 3.7, 3.8, 3.9. Python 3.10 has not been tested yet.
- Ability to create virtual environments (recommended, not absolutely necessary)
- Optional recommendation of pydicom for DICOM/dcm to JPG conversion
- Anaconda is now supported, but not technically necessary
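As a quick sanity check of the version requirement above, the snippet below tests whether the running interpreter is one of the versions listed as tested. It is only a sketch: the names `SUPPORTED_MINORS` and `is_tested_python` are made up for this example and are not part of cleanX.

```python
import sys

# Versions the requirements list names as tested; 3.10 is listed as untested.
# (Hypothetical helper names, not part of cleanX.)
SUPPORTED_MINORS = {(3, 7), (3, 8), (3, 9)}


def is_tested_python(version_info=None):
    """Return True if major.minor is one of the versions listed as tested."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) in SUPPORTED_MINORS


if __name__ == '__main__':
    if not is_tested_python():
        print('Warning: cleanX has not been tested on Python %d.%d'
              % tuple(sys.version_info[:2]))
```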
cleanX is a pure Python package, but it has many dependencies on native libraries. We try to test it on as many platforms as we can to see if dependencies can be installed there. Below is the list of platforms that will potentially work. Please note that where python.org Python or Anaconda Python is stated as supported, it means that versions 3.7, 3.8 and 3.9 (but not 3.10) are supported.
Unsupported at the moment on both Linux and OSX, but it's likely that support will be added in the future.
32-bit Intel and ARM
We don't know if either one of these is supported. There's a good chance that 32-bit Intel will work. There's a good chance that ARM won't. It's unlikely that support for ARM will be added in the future.
- setting up a virtual environment is desirable, but not absolutely necessary
- activate the environment
- use command for conda as below
conda install -c doctormakeda -c conda-forge cleanx
You need to specify both channels because there are some cleanX dependencies that exist in both Anaconda main channel and in conda-forge
- use pip as below
pip install cleanX
The tesserocr package deserves a special mention. It is not possible to install the tesseract library from the PyPI server. tesserocr is simply a binding to the library. You will need to install the library yourself. For example, on Debian-flavored Linux, this might work:
sudo apt-get install libleptonica-dev \
    tesseract-ocr-all \
    libtesseract-dev
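To see whether the native Tesseract library is visible on a machine before installing tesserocr, a check along the following lines may help. This is a sketch using only the Python standard library; `find_tesseract` is a hypothetical helper, and a positive result does not guarantee that tesserocr itself will build or import.

```python
import ctypes.util
import shutil


def find_tesseract():
    """Report where (if anywhere) the Tesseract library and CLI tool are found."""
    return {
        # Name or path of the shared library as located by ctypes, or None.
        'library': ctypes.util.find_library('tesseract'),
        # Full path of the `tesseract` executable on PATH, or None.
        'executable': shutil.which('tesseract'),
    }


if __name__ == '__main__':
    for kind, location in find_tesseract().items():
        print(kind, '->', location or 'not found')
```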
We will imagine a very simple scenario, where we need to automate normalization of the images we have. We stored the images in /images/to/clean/ and they all have a jpg extension. We want the cleaned images to be saved in the cleaned directory.
Normalization here means ensuring that the lowest pixel value (the darkest part of the image) is as dark as possible and that the lightest part of the image is as light as possible.
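The idea of normalization described here can be illustrated with a simple min-max stretch. The sketch below works on plain nested lists so that it needs no third-party packages; it is not cleanX's implementation (which offers multiple methods), and the function name `normalize` and the 0 to 255 range are assumptions made for the example.

```python
def normalize(pixels, low=0, high=255):
    """Linearly stretch values so the minimum maps to `low` and the maximum to `high`."""
    flat = [value for row in pixels for value in row]
    darkest, lightest = min(flat), max(flat)
    if darkest == lightest:
        # A uniform image has no contrast to stretch; return an unchanged copy.
        return [list(row) for row in pixels]
    span = lightest - darkest
    return [
        [round(low + (value - darkest) * (high - low) / span) for value in row]
        for row in pixels
    ]


image = [
    [50, 100],
    [150, 200],
]
print(normalize(image))  # prints [[0, 85], [170, 255]]
```

After the stretch, the darkest pixel (50) becomes 0 and the lightest (200) becomes 255, with the values in between spaced proportionally.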
The problem above doesn't require writing any new Python code. We can accomplish our task by calling the cleanX command like this:
mkdir cleaned
python -m cleanX images run-pipeline \
    -s Acquire \
    -s Normalize \
    -s "Save(target='cleaned')" \
    -j \
    -r "/images/to/clean/*.jpg"
Let's look at the command's options and arguments:
- python -m cleanX is Python's command-line option for loading the cleanX package. All command-line arguments that follow this part are interpreted by cleanX.
- The images sub-command is used for processing images.
- The run-pipeline sub-command is used to start a Pipeline to process the images.
- The -s (repeatable) option specifies a Step. Steps map to their class names as found in the cleanX.image_work.steps module. If the __init__ function of a step doesn't take any arguments, only the class name is necessary. If, however, it takes arguments, they must be given as Python literals, using Python's named-argument syntax.
- The -j option instructs cleanX to create a journaling pipeline. Journaling pipelines can be restarted from the point where they failed or were interrupted.
- The -r option specifies the source for the pipeline. While we will normally want to start with the Acquire step, if the pipeline was interrupted, we need to tell it where to look for the initial sources.
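To make the step syntax more concrete, here is one way a specification such as Save(target='cleaned') could be split into a class name and keyword arguments using Python's ast module. This is only a sketch of the idea, not cleanX's actual parser; `parse_step_spec` is a hypothetical name, and rejecting positional arguments is our reading of the rule that arguments use named-argument syntax.

```python
import ast


def parse_step_spec(spec):
    """Split a spec like "Save(target='cleaned')" into (class_name, kwargs)."""
    node = ast.parse(spec.strip(), mode='eval').body
    if isinstance(node, ast.Name):
        # Bare class name, e.g. "Acquire": no arguments.
        return node.id, {}
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
        if node.args:
            raise ValueError('only named arguments are accepted: ' + spec)
        kwargs = {}
        for keyword in node.keywords:
            if keyword.arg is None:
                raise ValueError('**kwargs is not accepted: ' + spec)
            # literal_eval only accepts literals, so no arbitrary code runs.
            kwargs[keyword.arg] = ast.literal_eval(keyword.value)
        return node.func.id, kwargs
    raise ValueError('not a step specification: ' + spec)


print(parse_step_spec('Acquire'))
print(parse_step_spec("Save(target='cleaned')"))
```

Using ast.literal_eval rather than eval keeps the values restricted to literals such as strings, numbers, tuples and dictionaries.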
Once the command finishes, we should see the cleaned directory filled with images with the same names they had in the source directory.
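One way to confirm that every source image has a counterpart with the same name in the output directory is sketched below. `missing_outputs` is a hypothetical helper written for this example; it compares file names only and says nothing about the image contents.

```python
from pathlib import Path


def missing_outputs(src_dir, dst_dir, pattern='*.jpg'):
    """Return sorted names matching `pattern` in src_dir that are absent from dst_dir."""
    src_names = {p.name for p in Path(src_dir).glob(pattern)}
    dst_names = {p.name for p in Path(dst_dir).iterdir()}
    return sorted(src_names - dst_names)


if __name__ == '__main__':
    # Paths taken from the example scenario above.
    print(missing_outputs('/images/to/clean', 'cleaned'))
```

An empty list means every matching source file name was found in the output directory.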
Let's consider another simple task: batch-extraction of images from DICOM files:
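mkdir extracted
python -m cleanX dicom extract \
    -i dir /path/to/dicoms/ -o extracted
This command uses the cleanX CLI in a way similar to the example above; however, it calls the dicom sub-command with the extract sub-sub-command. -i dir /path/to/dicoms/ tells cleanX to look for a directory named /path/to/dicoms/, and -o extracted tells cleanX to save the extracted JPGs in the extracted directory.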
If you have any problems with this, check #40 and add issues or discussions.
Below is the equivalent code in Python:
import os

from cleanX.image_work import (
    Acquire,
    Save,
    GlobSource,
    Normalize,
    create_pipeline,
)

dst = 'cleaned'
os.mkdir(dst)

src = GlobSource('/images/to/clean/*.jpg')
p = create_pipeline(
    steps=(
        Acquire(),
        Normalize(),
        Save(dst),
    ),
    journal=True,
)

p.process(src)
Let's look at what's going on here. As before, we've created a pipeline, this time using create_pipeline, with three steps: Acquire, Normalize, and Save. There are several kinds of sources available for pipelines. We'll use the GlobSource to match our CLI example. We'll specify journal=True to match the -j flag in our CLI example.
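The general idea behind a journaling pipeline, recording finished work so that a rerun can pick up where the previous run stopped, can be sketched with a small JSON file. This is a toy illustration and not how cleanX implements journaling; `process_with_journal` and the journal file layout are assumptions, and items are assumed to be strings such as file names.

```python
import json
import os


def process_with_journal(items, step, journal_path):
    """Apply `step` to each item, recording finished items so a rerun skips them."""
    done = []
    if os.path.exists(journal_path):
        with open(journal_path) as f:
            done = json.load(f)
    results = {}
    for item in items:
        if item in done:
            continue  # finished in an earlier run
        results[item] = step(item)
        done.append(item)
        # Rewrite the journal after every item so an interruption
        # costs at most the item that was in progress.
        with open(journal_path, 'w') as f:
            json.dump(done, f)
    return results
```

If `step` raises for one item, the journal already lists everything completed before it, so calling the function again with the same arguments resumes at the failed item.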
And for the DICOM extraction we might use similar code:
import os

from cleanX.dicom_processing import DicomReader, DirectorySource

dst = 'extracted'
os.mkdir(dst)

reader = DicomReader()
reader.rip_out_jpgs(DirectorySource('/path/to/dicoms/', 'file'), dst)
This will look for the files with dcm extension in /path/to/dicoms/ and try to extract images found in those files, saving them in the extracted directory.
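Before running an extraction it can be useful to see which files in a directory actually look like DICOM files. The sketch below checks for the "DICM" marker that follows the 128-byte preamble in the standard DICOM file format. It uses only the standard library, is not how cleanX selects files, and the names `looks_like_dicom` and `find_dicoms` are made up for the example. Files written without the standard header would be missed.

```python
from pathlib import Path


def looks_like_dicom(path):
    """Check for the 'DICM' marker that follows the 128-byte preamble."""
    with open(path, 'rb') as f:
        header = f.read(132)
    return len(header) == 132 and header[128:132] == b'DICM'


def find_dicoms(directory):
    """Return sorted paths of *.dcm files in `directory` that carry the marker."""
    return sorted(p for p in Path(directory).glob('*.dcm') if looks_like_dicom(p))


if __name__ == '__main__':
    # Directory taken from the example above.
    for path in find_dicoms('/path/to/dicoms/'):
        print(path)
```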
Please refer to Developer's Guide for more detailed explanation.
Developing Using Anaconda's Python
Use Git to check out the project's source, then, in the source directory run:
conda create -n cleanx
conda activate cleanx
python ./setup.py genconda
python ./setup.py install_dev
Note that the last command may result in errors related to being unable to delete Microsoft's C++ runtime DLL. This is typical of conda-build. The workaround is to run:
conda config --set always_copy true
and then re-run the last step (this will make the virtual environment created with conda noticeably bigger).
You may have to do this for Python 3.7, Python 3.8 and Python 3.9 if you need to check that your changes will work in all supported versions.
The genconda command needs to run only once per checkout and version of Python used. At the moment, it's not possible to have multiple conda package configurations generated at the same time. So, if you are switching Python versions, you will need to rerun this command.
Also note that the build will package only the changes committed to the Git repository. This means that if you are building with uncommitted changes, they will not make it into the built package. The decision to do this was motivated by the presence of symbolic links in the working directory, which makes it impossible to build without superuser permissions on MS Windows. It is possible that in the future we will add a flag to setup.py install to allow "dirty" builds.
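Because uncommitted changes are left out of the build, it may be worth checking the working tree before building. The sketch below lists paths reported by git status --porcelain; the helper names `parse_porcelain` and `uncommitted_paths` are hypothetical and are not part of cleanX's setup.py.

```python
import subprocess


def parse_porcelain(output):
    """Extract paths from `git status --porcelain` (version 1) output."""
    paths = []
    for line in output.splitlines():
        if len(line) > 3:
            # Each line is a two-character status code, a space, then the path.
            # Renames appear as "old -> new" and are returned as-is.
            paths.append(line[3:])
    return paths


def uncommitted_paths(repo_dir='.'):
    """Run git in `repo_dir` and return modified or untracked paths."""
    result = subprocess.run(
        ['git', 'status', '--porcelain'],
        cwd=repo_dir,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return parse_porcelain(result.stdout)


if __name__ == '__main__':
    for path in uncommitted_paths():
        print('not committed, will be missing from the build:', path)
```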
To run the linter and the unit tests you may use:
python setup.py lint
python setup.py test
respectively. Note that by default, these commands will try to install
cleanX and its dependencies before doing any work. This may take a
very long time, especially on MS Windows. There is a way to skip the
installation part by running:
python setup.py lint --fast
python setup.py test --fast
Developing Using python.org's Python
Use Git to check out the project's source, then in the source directory run:
python -m venv .venv
. ./.venv/bin/activate
python ./setup.py install_dev
As with the conda-based setup, you may have to use Python versions 3.7, 3.8 and 3.9 to create three different environments to recreate our CI process.
Build up-to-date documentation
Documentation can be generated by running python setup.py apidoc followed by python setup.py build_sphinx. The documentation will be generated in the ./build/sphinx/html directory. Documentation is generated automatically as new functions are added.
About using this library
If you use the library, please cite the package. CleanX is free ONLY when used according to license.
You can get in touch with me by starting a discussion if you have a legitimate reason to use my library without open-sourcing your code base or following the other conditions of the license, and I can make you, specifically, a different license.
We are adding new functions and classes all the time. Many unit tests are available in the test folder. Test coverage is currently partial. Some newly added functions allow for rapid automated data augmentation (in ways that are realistic for radiological data) and some preliminary image quality checks. Some other classes and functions are for cleaning datasets including ones that:
- Get image and metadata out of dcm (DICOM) files into jpeg and csv files
- Process datasets from csv or json or other formats to generate reports
- Run on dataframes to make sure there is no image leakage
- Run on a dataframe to look for demographic or other biases in patients
- Crop off excessive black frames (run this on single images) one at a time
- Run on a list to make a prototype tiny Xray others can be compared to
- Run on image files which are inside a folder to check if they are "clean"
- Take a dataframe with image names and return plotted (visualized) images
- Run to make a dataframe of pics in a folder (assuming they all have the same 'label'/diagnosis)
- Normalize images in terms of pixel values (multiple methods)
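As an illustration of what checking for image leakage can mean, the sketch below finds files that appear with byte-identical contents in both a training and a test set. It is not cleanX's implementation, which works on dataframes; `file_digest` and `find_leakage` are hypothetical names, and exact hashing will not catch re-encoded or resized copies of the same image.

```python
import hashlib
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def find_leakage(train_paths, test_paths):
    """Return (train_path, test_path) pairs whose contents are byte-identical."""
    by_digest = defaultdict(list)
    for path in train_paths:
        by_digest[file_digest(path)].append(path)
    leaks = []
    for path in test_paths:
        for train_path in by_digest.get(file_digest(path), []):
            leaks.append((train_path, path))
    return leaks
```

An empty result means no test file is an exact copy of a training file.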
All important functions are documented in the online documentation for programmers. You can also check out one of our videos by clicking the linked picture below:
cleanX: video demonstration