
Python library for cleaning data in large datasets of X-rays




CleanX is an open-source Python library for exploring, cleaning and augmenting large datasets of X-rays, or certain other types of radiological images. The images can be extracted from DICOM files or used directly. The primary authors are Candace Makeda H. Moore, Oleg Sivokon, and Andrew Murphy.


Online documentation is available. You can also build up-to-date documentation, which will be generated in the ./build/sphinx/html directory, with the following commands:

python apidoc
python build_sphinx

Additional documentation for medical professionals with limited programming experience is available on the project wiki. To get a high-level overview of some of the program's functionality, look at the Jupyter notebooks inside the workflow_demo folder.


Requirements

  • Python 3.7, 3.8, or 3.9. Python 3.10 has not been tested yet.
  • Ability to create virtual environments (recommended, not absolutely necessary)
  • tesserocr, matplotlib, pandas, and opencv
  • Optional recommendation of SimpleITK or pydicom for DICOM/dcm to JPG conversion
  • Anaconda is now supported, but not technically necessary
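
The version requirement above can be checked before installing. The following is a minimal sketch, not part of cleanX; the helper name is_supported is made up for illustration, and the set of versions is copied from the list above.

```python
import sys

# Versions the requirements list names as tested (3.10 is not among them).
SUPPORTED_MINORS = {(3, 7), (3, 8), (3, 9)}


def is_supported(version_info=sys.version_info):
    """Return True if the interpreter's major.minor is in the tested set."""
    return tuple(version_info[:2]) in SUPPORTED_MINORS


if __name__ == "__main__":
    running = "{}.{}".format(*sys.version_info[:2])
    if is_supported():
        print(f"Python {running} is one of the tested versions.")
    else:
        print(f"Python {running} has not been tested with cleanX.")
```

An untested version is not necessarily a broken one; the check only tells you whether you are inside the range the authors have tried.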

Supported Platforms

cleanX is a pure Python package, but it has many dependencies on native libraries. We try to test it on as many platforms as we can to see if the dependencies can be installed there. Below is the list of platforms that will potentially work. Please note that where Python or Anaconda Python is stated as supported, it means that versions 3.7, 3.8 and 3.9 (but not 3.10) are supported.

AMD64 (x86)

                  Linux      Win        OSX
Python            Supported  Unknown    Unknown
Anaconda Python   Supported  Supported  Supported


Unsupported at the moment on both Linux and OSX, but it's likely that support will be added in the future.

32-bit Intel and ARM

We don't know if either of these is supported. There's a good chance that 32-bit Intel will work, and a good chance that ARM won't. It's unlikely that support for ARM will be added in the future.


Installation

  • Setting up a virtual environment is desirable, but not absolutely necessary.

  • Activate the environment, if you created one.

Anaconda Installation

  • Use the conda command below:
conda install -c doctormakeda -c conda-forge cleanx

You need to specify both channels because there are some cleanX dependencies that exist in both the Anaconda main channel and conda-forge.

pip installation

  • Use pip as below:
pip install cleanX

The tesserocr package deserves a special mention. It is not possible to install the tesseract library from the PyPI server; tesserocr is simply a binding to that library, so you will need to install the library yourself. For example, on Debian-flavored Linux, this might work:

sudo apt-get install libleptonica-dev \
    tesseract-ocr-all \
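
After installing, a quick check can tell you whether tesseract is visible on your system. This is a minimal sketch under one assumption: it looks for the tesseract executable on PATH, which is only a hint that the library tesserocr binds to is also installed. The helper name find_tesseract is made up for illustration.

```python
import shutil


def find_tesseract(which=shutil.which):
    """Return the path of the tesseract executable, or None if not on PATH."""
    return which("tesseract")


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract not found; install it with your system package manager")
    else:
        print(f"tesseract found at {path}")
```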

Getting Started

We will imagine a very simple scenario, where we need to automate normalization of the images we have. We stored the images in directory /images/to/clean/ and they all have a jpg extension. We want the cleaned images to be saved in the cleaned directory.

Normalization here means ensuring that the lowest pixel value (the darkest part of the image) is as dark as possible and that the lightest part of the image is as light as possible.
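
To make that definition concrete, here is a minimal sketch of a linear (min-max) contrast stretch in plain Python. It is an illustration of the idea only, not necessarily what the cleanX Normalize step does internally; the function name normalize and the 0-255 range are assumptions, and real code would normally work on image arrays rather than nested lists.

```python
def normalize(image, low=0, high=255):
    """Linearly stretch pixel values so the minimum maps to `low`
    and the maximum maps to `high`.

    `image` is a non-empty list of rows, each a list of grayscale values.
    """
    pixels = [value for row in image for value in row]
    darkest, lightest = min(pixels), max(pixels)
    if darkest == lightest:
        # A flat image has no contrast to stretch; return a copy unchanged.
        return [list(row) for row in image]
    scale = (high - low) / (lightest - darkest)
    return [
        [round(low + (value - darkest) * scale) for value in row]
        for row in image
    ]


image = [
    [50, 100],
    [150, 200],
]
print(normalize(image))  # [[0, 85], [170, 255]]
```

The darkest pixel (50) becomes 0 and the lightest (200) becomes 255, with everything in between scaled proportionally.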

CLI Example

The problem above doesn't require writing any new Python code. We can accomplish our task by calling the cleanX command like this:

mkdir cleaned

python -m cleanX images run-pipeline \
    -s Acquire \
    -s Normalize \
    -s "Save(target='cleaned')" \
    -j \
    -r "/images/to/clean/*.jpg"

Let's look at the command's options and arguments:

  • python -m cleanX uses Python's -m command-line option to load the cleanX package. All command-line arguments that follow this part are interpreted by cleanX.
  • The images sub-command is used for processing images.
  • The run-pipeline sub-command starts a Pipeline to process the images.
  • The -s option (repeatable) specifies a Pipeline Step. Steps map to their class names as found in the cleanX.image_work.steps module. If the __init__ function of a step doesn't take any arguments, only the class name is necessary. If it does take arguments, they must be given as Python literals, using Python's named-argument syntax.
  • The -j option creates a journaling pipeline. Journaling pipelines can be restarted from the point where they failed or were interrupted.
  • The -r option specifies the source for the pipeline. While we will normally want to start with the Acquire step, if the pipeline was interrupted, we need to tell it where to look for the initial sources.
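
To illustrate the step syntax accepted by -s, the sketch below splits a step specification into a class name and its named arguments using the standard ast module. This is not cleanX's own parser; parse_step is a hypothetical helper written only to show what "Python literals with named-argument syntax" means in practice.

```python
import ast


def parse_step(spec):
    """Split a step spec into (class_name, keyword_arguments).

    Accepts a bare name such as "Acquire" or a call with literal
    keyword arguments such as "Save(target='cleaned')".
    """
    node = ast.parse(spec, mode="eval").body
    if isinstance(node, ast.Name):
        return node.id, {}
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
        if node.args:
            raise ValueError("only named arguments are accepted")
        # literal_eval rejects anything that is not a plain Python literal.
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        return node.func.id, kwargs
    raise ValueError(f"not a valid step spec: {spec!r}")


for spec in ("Acquire", "Normalize", "Save(target='cleaned')"):
    print(parse_step(spec))
```

Restricting arguments to literals means a specification typed on the command line can be read without executing arbitrary code.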

Once the command finishes, we should see the cleaned directory filled with images with the same names they had in the source directory.

Let's consider another simple task: batch-extraction of images from DICOM files:

mkdir extracted

python -m cleanX dicom extract \
    -i dir /path/to/dicoms/ \
    -o extracted

This calls the cleanX CLI in a way similar to the example above; however, it uses the dicom sub-command with its extract sub-command.

  • -i tells cleanX to look for a directory named /path/to/dicoms
  • -o tells cleanX to save the extracted JPGs in the extracted directory.

If you have any problems with this, check issue #40, and open an issue or start a discussion.

Coding Example

Below is the equivalent code in Python:

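import os

from cleanX.image_work import (
    Acquire, Normalize, Save, GlobSource, create_pipeline,
)

dst = 'cleaned'
os.makedirs(dst, exist_ok=True)  # create the output directory if needed

src = GlobSource('/images/to/clean/*.jpg')
# Same three steps and journaling as in the CLI example.
p = create_pipeline(
    steps=(Acquire(), Normalize(), Save(target=dst)),
    journal=True,
)

# Assumed entry point for running the pipeline on the source.
p.process(src)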

Let's look at what's going on here. As before, we've created a pipeline using create_pipeline with three steps: Acquire, Normalize and Save. There are several kinds of sources available for pipelines. We'll use the GlobSource to match our CLI example. We'll specify journal=True to match the -j flag in our CLI example.
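
The idea behind journaling can be shown with a small standalone sketch. It is an assumption-laden illustration, not cleanX's implementation: run_with_journal, the JSON journal file, and the simulated failure are all made up here to show how recording finished work lets a rerun continue where the previous run stopped.

```python
import json
import os
import tempfile


def run_with_journal(items, step, journal_path):
    """Apply `step` to each item, recording finished items in a JSON journal.

    Items already listed in the journal are skipped, so a rerun after a
    failure continues from the first unfinished item.
    """
    done = []
    if os.path.exists(journal_path):
        with open(journal_path) as f:
            done = json.load(f)
    for item in items:
        if item in done:
            continue
        step(item)  # may raise; the journal keeps everything finished so far
        done.append(item)
        with open(journal_path, "w") as f:
            json.dump(done, f)
    return done


with tempfile.TemporaryDirectory() as tmp:
    journal = os.path.join(tmp, "journal.json")
    processed = []
    fail_once = {"b.jpg"}

    def step(name):
        if name in fail_once:
            fail_once.discard(name)
            raise RuntimeError(f"simulated failure on {name}")
        processed.append(name)

    images = ["a.jpg", "b.jpg", "c.jpg"]
    try:
        run_with_journal(images, step, journal)
    except RuntimeError:
        pass  # the first run stops at b.jpg; a.jpg is already journaled
    run_with_journal(images, step, journal)  # resumes at b.jpg
    print(processed)  # ['a.jpg', 'b.jpg', 'c.jpg']
```

Each image is processed exactly once even though the first run was interrupted.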

And for the DICOM extraction we might use similar code:

import os

from cleanX.dicom_processing import DicomReader, DirectorySource

dst = 'extracted'
os.makedirs(dst, exist_ok=True)  # create the output directory if needed

reader = DicomReader()
reader.rip_out_jpgs(DirectorySource('/path/to/dicoms/', 'file'), dst)

This will look for files with the dcm extension in /path/to/dicoms/ and try to extract the images found in those files, saving them in the extracted directory.
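
Before running an extraction, it can be useful to see which files would be picked up. The sketch below lists files with the dcm extension using only the standard library. It is independent of cleanX: find_dicoms is a hypothetical helper, and it assumes a non-recursive, case-insensitive match, which may differ from how DirectorySource selects files.

```python
import tempfile
from pathlib import Path


def find_dicoms(directory):
    """Return the sorted .dcm files located directly inside `directory`."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".dcm"
    )


# Demonstration on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("scan1.dcm", "scan2.DCM", "notes.txt"):
        (Path(tmp) / name).touch()
    print([p.name for p in find_dicoms(tmp)])  # ['scan1.dcm', 'scan2.DCM']
```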

Developer's Guide

Please refer to Developer's Guide for more detailed explanation.

Developing Using Anaconda's Python

Use Git to check out the project's source, then, in the source directory run:

conda create -n cleanx
conda activate cleanx
python ./ genconda
python ./ install_dev

Note that the last command may result in errors related to conda-build being unable to delete Microsoft's C++ runtime DLL. This is typical behavior of conda-build.

The workaround is to run:

conda config --set always_copy true

Then re-run the last step (this will make the virtual environment created with conda noticeably bigger).

You may have to do this for Python 3.7, Python 3.8 and Python 3.9 if you need to check that your changes will work in all supported versions.

The genconda command needs to run only once per checkout and version of Python used. At the moment, it's not possible to have multiple conda package configurations generated at the same time. So, if you are switching Python versions, you will need to rerun this command.

Also note that the build will package only the changes committed to the Git repository. This means that if you are building with uncommitted changes, they will not make it into the built package. The decision to do this was motivated by the presence of symbolic links in the working directory, which makes it impossible to build without superuser permissions on MS Windows. It is possible that in the future we will add a flag to install to allow "dirty" builds.
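
Because uncommitted changes are silently left out of the package, it may help to check for them before building. The sketch below does this with git status --porcelain. It is not part of the cleanX build tooling; the helper names and the sample output are made up for illustration.

```python
import subprocess


def parse_porcelain(output):
    """Return the paths listed in `git status --porcelain` output.

    Each line is a two-character status, a space, then the path.
    """
    return [line[3:] for line in output.splitlines() if line.strip()]


def uncommitted_changes(repo_dir="."):
    """Return paths with uncommitted changes in the checkout at repo_dir."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return parse_porcelain(result.stdout)


# Made-up sample of what git might print for one modified and one new file.
sample = " M cleanX/image_work/steps.py\n?? notes.txt\n"
print(parse_porcelain(sample))  # ['cleanX/image_work/steps.py', 'notes.txt']
```

Calling uncommitted_changes() in the source directory and getting an empty list back suggests that the build will include everything in the working tree.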

To run the linter and the unit tests, you may use:

python lint

and

python test

respectively. Note that by default, these commands will try to install cleanX and its dependencies before doing any work. This may take a very long time, especially on MS Windows. There is a way to skip the installation part by running:

python lint --fast

and

python test --fast

Developing Using Python with venv

Use Git to check out the project's source, then in the source directory run:

python -m venv .venv
. ./.venv/bin/activate
python ./ install_dev

As with the conda-based setup, you may have to use Python versions 3.7, 3.8 and 3.9 to create three different environments to recreate our CI process.

Build up-to-date documentation

Documentation can be generated with the apidoc and build_sphinx commands shown earlier. It will be generated in the ./build/sphinx/html directory. Documentation is generated automatically as new functions are added.

About using this library

If you use the library, please cite the package. CleanX is free ONLY when used according to license.

If you have a legitimate reason to use my library without open-sourcing your code base or following the other license conditions, you can get in touch with me by starting a discussion, and I can make a different license specifically for you.

We are adding new functions and classes all the time. Many unit tests are available in the test folder. Test coverage is currently partial. Some newly added functions allow for rapid automated data augmentation (in ways that are realistic for radiological data) and some preliminary image quality checks. Some other classes and functions are for cleaning datasets including ones that:

  • Get image and metadata out of dcm (DICOM) files into jpeg and csv files
  • Process datasets from csv or json or other formats to generate reports
  • Run on dataframes to make sure there is no image leakage
  • Run on a dataframe to look for demographic or other biases in patients
  • Crop off excessive black frames (run this on single images, one at a time)
  • Run on a list to make a prototype tiny X-ray others can be compared to
  • Run on image files which are inside a folder to check if they are "clean"
  • Take a dataframe with image names and return plotted (visualized) images
  • Run to make a dataframe of pics in a folder (assuming they all have the same 'label'/diagnosis)
  • Normalize images in terms of pixel values (multiple methods)
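
As an illustration of what checking for image leakage involves, the sketch below finds files that appear, byte for byte, in both a training set and a test set. It is a standalone example and not the cleanX API: file_digest and find_leaks are hypothetical names, and comparing SHA-256 digests only catches exact duplicates, not resized or re-encoded copies.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def find_leaks(train_files, test_files):
    """Return (train, test) pairs of paths whose contents are byte-identical."""
    by_digest = {}
    for path in train_files:
        by_digest.setdefault(file_digest(path), []).append(path)
    return [
        (train_path, test_path)
        for test_path in test_files
        for train_path in by_digest.get(file_digest(test_path), [])
    ]


# Demonstration with made-up files in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "train_a.jpg").write_bytes(b"image-one")
    (tmp / "train_b.jpg").write_bytes(b"image-two")
    (tmp / "test_a.jpg").write_bytes(b"image-one")  # same bytes as train_a
    leaks = find_leaks(
        [tmp / "train_a.jpg", tmp / "train_b.jpg"], [tmp / "test_a.jpg"]
    )
    leak_names = [(a.name, b.name) for a, b in leaks]
    print(leak_names)  # [('train_a.jpg', 'test_a.jpg')]
```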

All important functions are documented in the online documentation for programmers. You can also check out one of our videos by clicking the linked picture below:

cleanX: video demonstration

