WinstonCleaner - transcriptomic data cross-contamination eliminator

WinstonCleaner

WinstonCleaner is a software tool for detecting and removing cross-contaminated contigs from assembled transcriptomes. The program uses BLAST to identify suspicious contigs and RPKM values to classify each of them as either genuine or contamination.

Requirements

WinstonCleaner runs under Python 2 and relies on BLAST, bowtie2 (including bowtie2-build) and the BBTools pileup.sh script being available on the system.

Installation

  1. Clone the repository:

    git clone https://github.com/kolecko007/WinstonCleaner.git

    cd WinstonCleaner

  2. Install pip dependencies:

    pip2 install --user -r requirements.txt

  3. Initialize settings:

    cp config/settings.yml.default config/settings.yml

  4. Check the installation by running test/integration/run.sh from the WinstonCleaner folder

Quick Start

  1. Prepare a folder with the input data and an empty folder for the results
  2. Open config/settings.yml and specify the input and output paths
  3. Run bin/prepare_data.py
  4. Run bin/find_contaminations.py
  5. Inspect the results in the output folder

Usage

Input

The input consists of three files per dataset:

  • left reads .fastq
  • right reads .fastq
  • assembled transcriptome .fasta file

The files must be named as follows:

  • NAME_1.fastq
  • NAME_2.fastq
  • NAME.fasta

For example:

  • brucei_1.fastq
  • brucei_2.fastq
  • brucei.fasta
  • giardia_1.fastq
  • giardia_2.fastq
  • giardia.fasta

File names may contain only letters, digits and the _ symbol.

All the files must be placed together in one folder.
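The naming rules above can be checked before running the tool. The following is a minimal sketch, not part of WinstonCleaner: the helper names `group_datasets` and `missing_files` are hypothetical, and the pattern assumes that "letters" means ASCII letters.

```python
import re
from collections import defaultdict

# Hypothetical helpers, not part of WinstonCleaner.
# Assumption: "letters" in the naming rule means ASCII letters.
NAME_RE = re.compile(r"^([A-Za-z0-9_]+?)(_1\.fastq|_2\.fastq|\.fasta)$")
EXPECTED = ("_1.fastq", "_2.fastq", ".fasta")


def group_datasets(file_names):
    """Return ({dataset: suffixes found}, [names that break the convention])."""
    datasets = defaultdict(set)
    invalid = []
    for name in file_names:
        match = NAME_RE.match(name)
        if match:
            datasets[match.group(1)].add(match.group(2))
        else:
            invalid.append(name)
    return dict(datasets), invalid


def missing_files(datasets):
    """Map each incomplete dataset to the file names it still lacks."""
    missing = {}
    for name, found in datasets.items():
        lacking = [name + suffix for suffix in EXPECTED if suffix not in found]
        if lacking:
            missing[name] = lacking
    return missing
```

In practice the list of names would come from something like `os.listdir()` on the input folder; any entry in the second return value of `group_datasets`, or any entry in `missing_files`, points at a file that needs renaming or adding.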

Configuration

All the settings are declared in config/settings.yml.

  • winston.paths.input — input folder with reads and contigs
  • winston.paths.output — output folder with the results
  • winston.paths.tools.pileup_sh — (optional) bbtools pileup.sh execution command
  • winston.paths.tools.bowtie2 — (optional) bowtie2 execution command
  • winston.paths.tools.bowtie2_build — (optional) bowtie2-build execution command
  • winston.hits_filtering.len_ratio — minimum qcovhsp (query coverage per HSP, in percent) for a BLAST hit to be kept
  • winston.hits_filtering.len_minimum — minimum hit length for a BLAST hit to be kept
  • winston.coverage_ratio.REGULAR — coverage ratio for the REGULAR dataset pair type (lower values make contamination prediction stricter, so fewer contaminations are found)
  • winston.coverage_ratio.CLOSE — coverage ratio for the CLOSE dataset pair type
  • winston.threads.multithreading — enable multithreading (disabling it is convenient for debugging)
  • winston.threads.count — number of threads when multithreading is enabled
  • winston.tools.blast.threads — number of threads for BLAST processing
  • winston.tools.bowtie.threads — number of threads for bowtie2 processing
  • winston.in_memory_db — load the coverage database into RAM at startup. This makes contamination lookup faster but requires a substantial amount of memory.

The default configuration can be found in the file config/settings.yml.default.

winston:
  in_memory_db: false

  paths:
    input: /path/to/folder/with/data/
    output: /path/to/output/folder

  hits_filtering:
    len_ratio: 70
    len_minimum: 100

  coverage_ratio:
    REGULAR: 1.1
    CLOSE: 0.04

  threads:
    multithreading: true
    count: 8

  tools:
    blast:
      threads: 8
    bowtie:
      threads: 8
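To illustrate what the two hits_filtering thresholds mean, here is a minimal sketch of such a filter. It is not WinstonCleaner's actual code: the `Hit` record and `filter_hits` function are hypothetical, the field names are borrowed from BLAST's tabular output specifiers, and treating the thresholds as inclusive is an assumption.

```python
from collections import namedtuple

# Hypothetical record for one BLAST hit; field names follow BLAST's
# tabular output specifiers (qseqid, sseqid, length, qcovhsp).
Hit = namedtuple("Hit", ["qseqid", "sseqid", "length", "qcovhsp"])


def filter_hits(hits, len_ratio=70, len_minimum=100):
    """Keep hits that reach both thresholds.

    Assumption: both comparisons are inclusive (>=).
    """
    return [
        hit for hit in hits
        if hit.qcovhsp >= len_ratio and hit.length >= len_minimum
    ]
```

With the default values, a 250 bp hit covering 85% of the query would be kept, while a 90 bp hit or a hit covering only 40% of the query would be dropped.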

Data preparation

The first step is to prepare the data for WinstonCleaner processing.

bin/prepare_data.py

The results are stored in the folder specified by the winston.paths.output option.

After preparation, the file types.csv can be inspected and edited. It lists all possible pairs of datasets and the type of each pair.

The default types are:

  • CLOSE - taxonomically close organisms
  • REGULAR - simple pair of organisms

Any number of custom types can also be specified in types.csv. Their names must be in upper case.

predator,prey,95.0,LEFT_EATS_RIGHT
prey,predator,95.0,RIGHT_EATS_LEFT

In this case, a coverage ratio for each custom type must be specified in the winston.coverage_ratio section of the settings.yml file:

...
  coverage_ratio:
    REGULAR: 1.1
    CLOSE: 0.04
    LEFT_EATS_RIGHT: 10
    RIGHT_EATS_LEFT: 0.1
...
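Because every type used in types.csv needs a matching coverage ratio, it can help to cross-check the two before starting the cleanup. The sketch below is a hypothetical helper, not part of WinstonCleaner. It assumes that types.csv has no header row and that the pair type is the last column, as in the example rows above.

```python
import csv
import io


def undeclared_types(types_csv_text, coverage_ratio):
    """Return pair types that are not upper case or have no coverage ratio.

    Assumptions: no header row, and the type is the last column of each row.
    """
    problems = set()
    for row in csv.reader(io.StringIO(types_csv_text)):
        if not row:
            continue  # skip blank lines
        pair_type = row[-1].strip()
        if pair_type != pair_type.upper() or pair_type not in coverage_ratio:
            problems.add(pair_type)
    return sorted(problems)
```

The `coverage_ratio` argument is expected to be a plain dictionary mirroring the winston.coverage_ratio section; an empty result means every pair type has a ratio.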

Contamination cleanup

bin/find_contaminations.py
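The exact decision rule is not spelled out here. The sketch below shows one possible reading that is consistent with the configuration notes (lower coverage ratios flag fewer contigs); it should not be taken as the tool's implementation. The function names are hypothetical, and the direction and inclusiveness of the comparison are assumptions. The RPKM formula itself is the standard one.

```python
def rpkm(mapped_reads, contig_length, total_mapped_reads):
    """Reads per kilobase of contig per million mapped reads."""
    return mapped_reads * 1e9 / (contig_length * total_mapped_reads)


def looks_contaminated(own_rpkm, other_rpkm, ratio):
    """Assumed rule: flag a contig when its own coverage, relative to the
    coverage of the matching contig in the other dataset, does not exceed
    the coverage ratio configured for this dataset pair type.
    """
    if other_rpkm == 0:
        # Without coverage in the other dataset there is no evidence
        # that the contig originates from it.
        return False
    return own_rpkm / other_rpkm <= ratio
```

Under this reading, a contig with RPKM 5 that matches a contig with RPKM 500 in another dataset would be flagged at the REGULAR ratio of 1.1, while the 500 RPKM contig would be kept.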

Output

The results are saved in the folder specified by the winston.paths.output option.

For each dataset, the following files are created:

  • DATASET_NAME_clean.fasta — clean contigs
  • DATASET_NAME_deleted.fasta — contaminated contigs
  • DATASET_NAME_suspicious_hits.csv — all suspicious BLAST hits
  • DATASET_NAME_contamination_sources.csv — sources of contamination, with the following columns: name of the source dataset, number of sequences
  • DATASET_NAME_contaminations.csv — list of BLAST hits from which contaminations were detected
  • DATASET_NAME_missing_coverage.csv — list of IDs of contigs that have no coverage
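A quick way to inspect the results is to count how many contigs were kept and removed for a dataset. This is a minimal sketch with hypothetical helper names; it relies only on the _clean.fasta and _deleted.fasta file names listed above and on the FASTA convention that each record starts with a ">" header line.

```python
import os


def count_fasta_records(path):
    """Count sequences in a FASTA file by counting header lines."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def removal_summary(output_dir, dataset):
    """Return (kept, removed, fraction removed) for one dataset."""
    kept = count_fasta_records(
        os.path.join(output_dir, dataset + "_clean.fasta"))
    removed = count_fasta_records(
        os.path.join(output_dir, dataset + "_deleted.fasta"))
    total = kept + removed
    fraction = removed / total if total else 0.0
    return kept, removed, fraction
```

For example, `removal_summary("/path/to/output/folder", "brucei")` would report the counts for the brucei dataset from the naming example.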

TODO

  • Moving to Python 3
  • Logging system
  • Extended testing
  • Export to graph format

Download files

Source Distribution

winston_cleaner-0.1.0.tar.gz (16.8 kB view details)

Uploaded Source

Built Distribution

winston_cleaner-0.1.0-py2-none-any.whl (26.9 kB view details)

Uploaded Python 2

File details

Details for the file winston_cleaner-0.1.0.tar.gz.

File metadata

  • Download URL: winston_cleaner-0.1.0.tar.gz
  • Upload date:
  • Size: 16.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/39.0.1 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/2.7.15

File hashes

Hashes for winston_cleaner-0.1.0.tar.gz
Algorithm Hash digest
SHA256 2b5a8307b10ab9244379fb5982ba7f5ba30844ea6e8728d1d19b39b05535b879
MD5 a595e8e27ab7e6088b0c7479194e3927
BLAKE2b-256 58f375ffb3796dda3a8b55aac6b9e769029177b761e06f1659385f4701d953c3

File details

Details for the file winston_cleaner-0.1.0-py2-none-any.whl.

File metadata

  • Download URL: winston_cleaner-0.1.0-py2-none-any.whl
  • Upload date:
  • Size: 26.9 kB
  • Tags: Python 2
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/39.0.1 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/2.7.15

File hashes

Hashes for winston_cleaner-0.1.0-py2-none-any.whl
Algorithm Hash digest
SHA256 1110bfe6c0b028fc7076ac359872fb03275356437638a2530cfd3419936a92fb
MD5 2f2f53e6deb8db7c9e9d3af99bad24c4
BLAKE2b-256 70f2388759daf73edd64d412c3f7b2b20258bc1e40404f3422de192cb0849b5e
