
Accurate and rapid RiboRNA sequences Detector based on deep learning.


RiboDetector - Accurate and rapid RiboRNA sequences Detector based on deep learning

About RiboDetector

RiboDetector is a software tool developed to accurately yet rapidly detect and remove rRNA sequences from metagenomic, metatranscriptomic, and ncRNA sequencing data. It is based on LSTMs and optimized for both GPU and CPU usage, running about 10 times faster on CPU and 50 times faster on a consumer GPU than the current state-of-the-art software. It is also very accurate, with ~10 times fewer false classifications, and it has a low level of bias towards any GO functional groups.

Prerequisites

To use RiboDetector, install Python v3.8 with conda (make sure you have version 3.8, because 3.7 cannot serialize a string larger than 4 GiB):

conda create -n ribodetector python=3.8
conda activate ribodetector

Note: To install a torch build compatible with your CUDA version, please follow the instructions at https://pytorch.org/get-started/locally/. Our code was tested with torch v1.7 and v1.7.1.
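
Before installing, it can help to confirm that the active environment matches the requirements above. The following is a minimal sketch and not part of RiboDetector: the helper name `check_environment` is hypothetical, and accepting any Python version from 3.8 upward is an assumption (the text above names 3.8 specifically). It uses only the standard library plus, if torch is already installed, `torch.__version__` and `torch.cuda.is_available()`.

```python
import importlib.util
import sys


def check_environment():
    """Summarize the interpreter and torch setup.

    Hypothetical helper for illustration; not shipped with RiboDetector.
    """
    report = {
        "python_version": "{}.{}".format(*sys.version_info[:2]),
        # Assumption: any version >= 3.8 is treated as acceptable here.
        "python_ok": sys.version_info[:2] >= (3, 8),
        "torch_version": None,
        "cuda_available": False,
    }
    # Import torch only if it is installed, so the check also runs beforehand.
    if importlib.util.find_spec("torch") is not None:
        import torch

        report["torch_version"] = torch.__version__
        report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If `cuda_available` is reported as False, the CPU mode described below is the one to use.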

Installation

git clone https://github.com/hzi-bifo/RiboDetector.git
cd RiboDetector
python setup.py install

Usage

GPU mode

usage: ribodetector [-h] [-c CONFIG] [-d DEVICEID] -l LEN -i [INPUT [INPUT ...]]
  -o [OUTPUT [OUTPUT ...]] [-r [RRNA [RRNA ...]]] [-e {rrna,norrna,both,none}] 
  [-t THREADS] [-m MEMORY] [--chunk_size CHUNK_SIZE] [-v]

rRNA sequence detector

optional arguments:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        Path of config file
  -d DEVICEID, --deviceid DEVICEID
                        Indices of GPUs to enable. Quoted comma-separated device ID 
                        numbers. (default: all)
  -l LEN, --len LEN     Sequencing read length, should be not smaller than 50.
  -i [INPUT [INPUT ...]], --input [INPUT [INPUT ...]]
                        Path of input sequence files (fasta and fastq), the second 
                        file will be considered as second end if two files given.
  -o [OUTPUT [OUTPUT ...]], --output [OUTPUT [OUTPUT ...]]
                        Path of the output sequence files after rRNAs removal (same 
                        number of files as input). (Note: 2 times slower to write gz 
                        files)
  -r [RRNA [RRNA ...]], --rrna [RRNA [RRNA ...]]
                        Path of the output sequence file of detected rRNAs (same 
                        number of files as input)
  -e {rrna,norrna,both,none}, --ensure {rrna,norrna,both,none}
                        Only output certain sequences with high confidence
                        norrna: output non-rRNAs with high confidence, remove as many 
                        rRNAs as possible;
                        rrna: vice versa, output rRNAs with high confidence;
                        both: both non-rRNA and rRNA prediction with high confidence;
                        none: give label based on the mean probability of read pair.
                          (Only applicable for paired end reads, discard the read 
                           pair when their predictions are discordant)
  -t THREADS, --threads THREADS
                        number of threads to use. (default: 10)
  -m MEMORY, --memory MEMORY
                        amount (GB) of GPU RAM. (default: 12)
  --chunk_size CHUNK_SIZE
                        Use this parameter when having low memory. Parsing the file in 
                        chunks.
                        Not needed when free RAM >=5 * your_file_size (uncompressed, 
                        sum of paired ends).
                        When chunk_size=256, memory=16 it will load 256 * 16 * 1024 
                        reads each chunk (use ~20 GB for 100bp paired end).
  -v, --version         show program's version number and exit
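
The `--chunk_size` help text above implies some simple arithmetic, worked through in the sketch below. The function names are hypothetical, and the formulas are taken directly from the help text rather than from RiboDetector's source code.

```python
def reads_per_chunk(chunk_size, memory_gb):
    """Reads loaded per chunk in GPU mode: chunk_size * memory * 1024."""
    return chunk_size * memory_gb * 1024


def chunking_recommended(file_size_bytes, free_ram_bytes):
    """True when free RAM is below 5x the uncompressed input size.

    file_size_bytes is the sum over both paired-end files, uncompressed.
    """
    return free_ram_bytes < 5 * file_size_bytes


GIB = 1024 ** 3

print(reads_per_chunk(256, 16))                   # 4194304
print(chunking_recommended(10 * GIB, 32 * GIB))   # True
print(chunking_recommended(4 * GIB, 32 * GIB))    # False
```

With `--chunk_size 256` and `-m 16`, each chunk therefore holds 4,194,304 reads, which the help text estimates at about 20 GB of memory for 100 bp paired-end data.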

CPU mode

usage: ribodetector_cpu [-h] [-c CONFIG] -l LEN -i [INPUT [INPUT ...]] 
  -o [OUTPUT [OUTPUT ...]] [-r [RRNA [RRNA ...]]] [-e {rrna,norrna,both,none}] 
  [-t THREADS] [--chunk_size CHUNK_SIZE] [-v]

rRNA sequence detector

optional arguments:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        Path of config file
  -l LEN, --len LEN     Sequencing read length, should be not smaller than 50.
  -i [INPUT [INPUT ...]], --input [INPUT [INPUT ...]]
                        Path of input sequence files (fasta and fastq), the second 
                        file will be considered as second end if two files given.
  -o [OUTPUT [OUTPUT ...]], --output [OUTPUT [OUTPUT ...]]
                        Path of the output sequence files after rRNAs removal (same 
                        number of files as input).
                        (Note: 2 times slower to write gz files)
  -r [RRNA [RRNA ...]], --rrna [RRNA [RRNA ...]]
                        Path of the output sequence file of detected rRNAs (same 
                        number of files as input)
  -e {rrna,norrna,both,none}, --ensure {rrna,norrna,both,none}
                        Only output certain sequences with high confidence
                        norrna: output non-rRNAs with high confidence, remove as many 
                        rRNAs as possible;
                        rrna: vice versa, output rRNAs with high confidence;
                        both: both non-rRNA and rRNA prediction with high confidence;
                        none: give label based on the mean probability of read pair.
                         (Only applicable for paired end reads, discard the read 
                          pair when their predictions are discordant)
  -t THREADS, --threads THREADS
                        number of threads to use. (default: 10)
  --chunk_size CHUNK_SIZE
                        chunk_size * threads reads to process per thread. (default: 
                        1024)
                        When chunk_size=1024 and threads=20, each process will load 
                        1024 reads, in total consuming ~20 GB memory.
  -v, --version         show program's version number and exit
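
To tie the options together, here is a sketch that assembles a paired-end `ribodetector_cpu` command line from Python without running it. The helper `build_cpu_command` is hypothetical and the file names are placeholders. Only flags listed in the help text above are used, and the checks mirror its stated constraints (read length of at least 50, same number of output and input files).

```python
import shlex

ENSURE_CHOICES = {"rrna", "norrna", "both", "none"}


def build_cpu_command(read_len, inputs, outputs, threads=10,
                      ensure=None, chunk_size=None):
    """Assemble an argument list for ribodetector_cpu (hypothetical helper)."""
    if read_len < 50:
        raise ValueError("read length should not be smaller than 50")
    if len(inputs) != len(outputs):
        raise ValueError("need the same number of output and input files")
    if ensure is not None and ensure not in ENSURE_CHOICES:
        raise ValueError(f"ensure must be one of {sorted(ENSURE_CHOICES)}")
    cmd = ["ribodetector_cpu", "-t", str(threads), "-l", str(read_len)]
    cmd += ["-i", *inputs]
    if ensure is not None:
        cmd += ["-e", ensure]
    if chunk_size is not None:
        cmd += ["--chunk_size", str(chunk_size)]
    cmd += ["-o", *outputs]
    return cmd


cmd = build_cpu_command(
    100,
    ["reads.1.fq", "reads.2.fq"],        # placeholder input files
    ["nonrrna.1.fq", "nonrrna.2.fq"],    # placeholder output files
    threads=20,
    ensure="rrna",
    chunk_size=256,
)
print(shlex.join(cmd))
```

The printed line is `ribodetector_cpu -t 20 -l 100 -i reads.1.fq reads.2.fq -e rrna --chunk_size 256 -o nonrrna.1.fq nonrrna.2.fq`. Assuming RiboDetector is installed and the input files exist, the same list could be passed to `subprocess.run(cmd, check=True)` to execute it.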

Acknowledgements

The scripts in the base directory come from the pytorch-template project by Victor Huang and other contributors.

