Skip to main content

Command Line Package that can be used to analyse locations datasets

Project description

  1. About the Project
  2. Getting Started
  3. Usage
  4. Report Issues
  5. License
  6. Contact

About The Project

The checkContaminants package provides a quick way to look for contaminants in a location dataset. It runs as an independent application with a simple command line interface so no coding is required to use it. It provides summary tables and charts that help identify the major contaminants.

There are three kinds of outputs:

  • Text output: You can choose from 3 verbosity modes for text output. The result can be output to the terminal or to a command file.
  • Charts PDF: a PDF with three bar charts detailing the top 10 most prevalent species and most contaminated locations
  • Venn Diagrams with traits (radiation resistance, thermophilic) as the sets

How does it work?

The user provides a location dataset with columns as locations on the spacecraft surface, rows as species, and cells with number of reads of each species at the location.

We aim to look for the most concerning contaminants. For each species, we look for information about their traits in our curated dataset (included in the package). We look for a presence of the following 5 traits: radiation resistance, sporulation, thermophiles, psychrophiles, and anaerobic metabolism. Depending on how many of these traits the species has we assign a score (the default is 1 point for each of these traits). Both the score weights and the curated data file can be changed (using flags -config and -datfile respectively).

Once we have information about the species, we report positives as follows:

  • Species must have score above a threshold (default is 1 but may be changed using the -t flag)
  • Species must be present on at least one location. The threshold for the minimum number of reads that qualifies as a species as present is the local threshold. (default is 2000, may be changed using the -local flag)

These species are reported as positive contaminants. You can also generate pdfs with relevant charts and venn diagrams using the package.

This was built as a part of my SURF Project over Summer 2021.

Built With

This section should list any major frameworks that you built your project using.

Environment: Visual Studio (1.49), MacOS Catalina (10.15.7)

Getting Started

To install and run the package follow these simple steps.


You only require python3 and pip to install the package.


Use the following command in your terminal to install the package.

pip install checkContaminants

or alternatively

pip3 install checkContaminants

To check if the package has installed, run the following command.

pip show checkContaminants

Test the package

pytest --pyargs checkContaminants


This package can be used both using a command line interface. It can also be imported into a python script and the methods can be used individually.

I. Through the Command Line Interface

Use the commands:



checkContaminants -h


checkContaminants --help

to view the help menu with a complete list of options.

This is the menu that you should see:

Help Menu Image

There are 3 categories of arguments:

  • basic usage
  • configuration setup
  • output preferences

Basic Usage Arguments


infile is a required argument. Follow the infile argument with the file name of thhe file that is to be analyzed.

checkContaminants -infile inputdatafile.csv

Details of the input file:

  • The column names must be location names, the rows must be species names. Each cell contains the number of reads of the particular specie at that location. This value must be a non-negative integer.
  • The input file may be a csv, tsv, or json. It could also be a compressed file (with ending .csv.gz or .tsv.gz)
  • If the location names are unavailable (columns are not named in the csv/tsv), then we name the columns with loc1, loc2, etc. for the output and the charts. If location names are unavailable, use the flag '-noheader'

Example csv:


outfile is an optional argument. If it is unspecified, the output is printed to the terminal.

There are three possible output file types (.txt, .csv, .tsv, or .json)

checkContaminants -infile inputdatafile.csv -outfile results.txt

A text output looks as follows (different verbosities are detailed later):

A csv output looks as follows:

Configuration Setup


The output may be sorted by score (S), number of positive locations (L), or alphabetically by species name (A). You can also give it a combination of two (SL means to first sort by score then by number of positive locations). The default value is SLA.

If you do not want the order of species to change, use flag -sort I which will leave the result in the input's order.


The local threshold the number of reads beyond which we consider a location to be contaminated by the species. The default value is 2000. It can be changed by including a tag as follows:

checkContaminants -infile inputdatafile.csv -local 3000

The score threshold may also be changed. By default, all species with 1 or more of the 5 concerning traits (radiation resistance, thermophilic, psychrophilic, sporulating, anaerobic) are considered contaminants.


This is where a datafile may be specified from where the program can access information about the species.

We have provided a default datafile with 1857 unique species and a value of 0 or 1 assigned to 5 columns (psychrophilic, thermophilic, anaerobe, Radiation Tolerance, Spore formation). There are also other columns (aerobe, mesophilic, etc.) By changing the configuration file using the -config flag detailed below, you can vary how much weight to give these traits in the final score. Default weight for these is 0.


The default values for configuration are as follows:

{"psychrophilic": 1,
 "mesophilic": 0,
 "thermophilic": 1,
 "Spore formation": 1,
 "aerobe": 0,
 "anaerobe": 1,
 "obligate aerobe": 0,
 "obligate anaerobe": 1,
 "facultative aerobe": 0,
 "facultative anaerobe": 1,
 "microaerophile": 1,
 "aerotolerant": 1,
 "Radiation Tolerance": 1}

Use the -config flag to specify a text file in the same format to change the weights given to each trait. For instance:

{"psychrophilic": 1.2,
 "mesophilic": 0.3,
 "thermophilic": 1.6,
 "Spore formation": 1.6,
 "anaerobe": 1.2,
 "Radiation Tolerance": 2.5}
  • If non-integer values are used for this configuration the -pdf flag cannot be used
  • Keys specified in this file must also be columns in the datfile above.

Output Preferences

Verbosity Modes:

There are 3 verbosity modes.

The first is the least verbose, it is the default (with no flags). It prints the number of species above the threshold value. It lists the contaminants (species with scores above the threshold).

The second is with -v as flag. It prints Species name (score; number of positive location). It also outputs a summary table with scores, number of species of that score, and the number of locations over which they are spread. Verbosity -v outputs also provide the number of species that were not found in the curated species in the datfile as well as the total number of locations processed.

The verbosity -vv output is mostly the same as the verbosity -v output. Except that the names of the species their scores and # of locations are tabulated. There is also a column where the location names are listed.


By including this flag, a pdf with 3 charts is generated. Also, a pdf with relevant venn diagrams are generated. These are saved to the directory in which the script is run.

The pdf looks as follows:

The venn diagrams file looks like this:

The second chart of the pdf may switch to a log scale y-axis when reasonable. For example:

Linear Scale Chart Example:

Log Scale Chart Example:

The scale of the x-axis of the third chart is controlled by the -logchart flag. If the flag is included, the chart may have a log scale x-axis if the ratio of the biggest bar to the smallest bar is greater than 100. The two diagrams look as follows:

Linear Scale Chart Example:

Log Scale Chart Example:

II By Importing as a Module

Available methods:

  • get_score
  • get_score_dict


  • bar_species_for_each_score
  • bar_locs_for_top10_species
  • survey_reads_at_top10_locs


Report Issues

See the open issues for a list of proposed features (and known issues)


Distributed under the MIT License. See LICENSE for more information.


Dr. Ashish Mahabal -

Dr. Nitin Singh -

Nishka Arora -

Dr. Moogega Cooper -

Pypi Link:

Project Link:

Project details

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

checkContaminants-0.99.4.tar.gz (7.1 kB view hashes)

Uploaded source

Built Distribution

checkContaminants-0.99.4-py3-none-any.whl (6.6 kB view hashes)

Uploaded py3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page