Pre-classification of Long-reads for Memory Efficient Taxonomic assignment
PLoT-ME
Sylvain Riondet, K. Križanović, J. Marić, M. Šikić and Niranjan Nagarajan
NUS/SoC, Biopolis/GIS, Singapore
This tool is in active development. Feedback and bug reports are welcome, either through GitHub or on Twitter.
Description
Pre-Processing
- Segmentation of NCBI RefSeq into clusters
- Building of taxonomic classifiers' indexes for each cluster
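As an illustration of the segmentation step, here is a minimal sketch that cuts a sequence into fixed-size windows and counts k-mers per window. The function names and the toy sequence are hypothetical, and skipping k-mers that contain non-ACGT letters is an assumption; this is not PLoT-ME's actual k-mer counter.

```python
from collections import Counter
from itertools import product


def kmer_counts(segment, k):
    """Count overlapping k-mers; k-mers with non-ACGT letters are skipped (assumption)."""
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    seen = Counter(segment[i:i + k] for i in range(len(segment) - k + 1))
    return {kmer: seen.get(kmer, 0) for kmer in all_kmers}


def segment_genome(sequence, window):
    """Yield (start, end, subsequence) for consecutive, non-overlapping windows."""
    for start in range(0, len(sequence), window):
        end = min(start + window, len(sequence))
        yield start, end, sequence[start:end]


# Toy sequence, not real data; real runs would use e.g. k=4 and a window of 10000.
genome = "ACGTACGTNNACGT"
rows = [(start, end, kmer_counts(seq.upper(), k=2))
        for start, end, seq in segment_genome(genome, window=8)]
```

Each entry of `rows` holds the segment coordinates and one count per possible k-mer, which is the kind of vector the clustering works on.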
Classification
Taxonomic classification of mock communities / metagenomic fastq files
- Assignment of long DNA reads (Nanopore/PacBio) to each cluster
- Classification by the classifier with a subset of RefSeq
- Merging of reports
Kraken2 (Derrick E. Wood et al. 2019) and Centrifuge (D. Kim et al. 2016) are currently automated. Any classifier able to build its index from a set of .fna files with a provided taxid should work.
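The pre-classification idea (assigning each read to a cluster before running the classifier) can be sketched as follows. This is a simplified illustration under stated assumptions: reads are assigned to the nearest centroid by Euclidean distance on k-mer frequencies, and the centroids, read names and helper names are all hypothetical. PLoT-ME itself loads its own saved model.

```python
from collections import defaultdict
from itertools import product

K = 2  # toy k-mer size for the example
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def kmer_frequencies(read):
    """Relative k-mer frequencies of a read, in the fixed order of KMERS."""
    counts = dict.fromkeys(KMERS, 0)
    for i in range(len(read) - K + 1):
        kmer = read[i:i + K]
        if kmer in counts:
            counts[kmer] += 1
    total = sum(counts.values()) or 1
    return [counts[kmer] / total for kmer in KMERS]


def assign_cluster(read, centroids):
    """Index of the centroid closest to the read (squared Euclidean distance)."""
    freqs = kmer_frequencies(read)
    return min(range(len(centroids)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(freqs, centroids[c])))


def bin_reads(reads, centroids):
    """Group read names by assigned cluster; each group would go to that cluster's index."""
    bins = defaultdict(list)
    for name, sequence in reads:
        bins[assign_cluster(sequence, centroids)].append(name)
    return dict(bins)


# Hypothetical centroids: one AT-rich cluster, one GC-rich cluster.
centroids = [kmer_frequencies("ATATATAT"), kmer_frequencies("GCGCGCGC")]
bins = bin_reads([("r1", "ATATATTA"), ("r2", "GCGCGGCG")], centroids)
```

Each bin is then classified against the index of its own cluster only, which is where the memory saving comes from.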
Take-aways
- Large reduction in memory needs, determined by the number of clusters *
- Compatible with existing taxonomic classifiers, and enhances them
- Time overhead from the pre-classification (currently ~3-5x; improvements planned for future releases)
* Mini Batch K-Means: D. Sculley, Web-Scale K-Means Clustering, 2010
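For readers unfamiliar with the cited algorithm, the following is a plain-Python sketch of the mini-batch k-means update described in the paper: sample a batch, assign each point to its nearest centre, then move each centre with a per-centre learning rate of 1/count. It is not PLoT-ME's code; the function names, parameters and toy points are assumptions for illustration only.

```python
import random


def nearest(centers, x):
    """Index of the centre closest to x (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(centers[j], x)))


def mini_batch_kmeans(points, n_clusters, batch_size, n_iter, seed=0):
    """Mini-batch k-means; batch_size must not exceed len(points)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, n_clusters)]
    seen = [0] * n_clusters  # number of points each centre has absorbed so far
    for _ in range(n_iter):
        batch = rng.sample(points, batch_size)
        # Assign the whole batch first, then update the centres.
        assigned = [nearest(centers, x) for x in batch]
        for x, j in zip(batch, assigned):
            seen[j] += 1
            eta = 1.0 / seen[j]  # per-centre learning rate
            centers[j] = [(1 - eta) * c + eta * xi for c, xi in zip(centers[j], x)]
    return centers


# Toy 2-D points, not k-mer data.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centers = mini_batch_kmeans(points, n_clusters=2, batch_size=3, n_iter=20)
```

Only one batch is held at a time during an update, which is why this family of algorithms suits large collections of k-mer count vectors.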
Requirements
- Database of Genomes, in .fna / .fasta format, with an associated taxonomy id. Tested with NCBI RefSeq (ftp server)
- Taxonomic classifier, must be installed and added to PATH. Currently supported:
  - Kraken2
  - Centrifuge
  - (feel free to request support for more)
- Linux (tested on Ubuntu 18.04)
- Python >= 3.7
Package | Version |
---|---|
biopython | >= 1.72 |
ete3 | >= 3.1.1 |
numpy | >= 1.17.3 |
pandas | >= 0.23 |
scikit-learn | >= 0.18 |
tqdm | >= 4.24.0 |
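A quick way to check the requirements above before installing is sketched below. It assumes the classifier executables are named `kraken2` and `centrifuge`; the function name `check_environment` is hypothetical.

```python
import shutil
import sys

# Executable names are assumed to match the tool names.
SUPPORTED_CLASSIFIERS = ("kraken2", "centrifuge")


def check_environment(min_python=(3, 7)):
    """Report whether Python is recent enough and which classifiers are on PATH."""
    found = {name: shutil.which(name) for name in SUPPORTED_CLASSIFIERS}
    return {
        "python_ok": sys.version_info[:2] >= min_python,
        "classifiers": found,  # maps name to full path, or None if not on PATH
        "any_classifier": any(found.values()),
    }


report = check_environment()
```

`report["classifiers"]` shows the resolved path of each classifier, or `None` when it is not found on PATH.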
Installation
Create a Python 3 environment with conda or pyenv.
Installation is then done with pip:
python3 -m pip install plot-me
This creates 2 commands, plot-me.preprocess and plot-me.classify, detailed in the 'Usage' section.
It is also possible to clone PLoT-ME's repo and launch the commands directly with python path/to/PLoT-ME/parse_DB.py or classify.py
Usage
Pre-Processing
For the full help: plot-me.preprocess -h
Typical usage:
plot-me.preprocess <path/NCBI/refseq> <folder/for/clusters> <path/taxonomy> -k 4 -w 10000 -n 10 -o <OmitFoldersContainingString>
Pre-classification + classification
For the full help: plot-me.classify -h
Typical usage:
plot-me.classify <folder/with/clusters> <folder/reports> -i <fastq files to preclassify>
Example
/mnt/data
|-- mock_files
|   |-- mock_community_1.fastq
|   |   \-- minikm_b10_k3_s10000_oplant-vertebrate (one tmp file per cluster, generated by PLoT-ME)
|   \-- mock_community_2.fastq
|-- PLoT-ME
|   |-- k3_s10000
|   |   |-- kmer_counts
|   |   |   |-- counts.k3_s10000 (same tree as RefSeq, with <sequencing_name>.3mer_count.pd)
|   |   |   \-- all-counts.k3_s10000_oplant-vertebrate.csv
|   |   |-- minikm_b10_k3_s10000_oplant-vertebrate <*>
|   |   |   |-- centrifuge (10 folders with indexes)
|   |   |   |-- kraken2 (10 folders with indexes)
|   |   |   |-- RefSeq_binned (10 folders with fna files)
|   |   |   |-- model.minikm_b10_k3_s10000_oplant-vertebrate.pkl
|   |   |   \-- segments-clustered.minikm_b10_k3_s10000_oplant-vertebrate.pd
|   |   \-- minikm_b20_k3_s10000_oplant-vertebrate
|   |       \-- (same structure)
|   |-- k4_s10000
|   |   \-- (same structure)
|   \-- no-binning
|       |-- oAllRefSeq
|       \-- oplant-vertebrate
|           |-- centrifuge
|           \-- kraken2
|-- NCBI
|   \-- refseq
|-- reports
|   \-- mock_community_1 (one report per cluster)
\-- taxonomy
The folder marked <*> can be generated with:
plot-me.preprocess /mnt/data/NCBI/refseq /mnt/data/PLoT-ME /mnt/data/taxonomy -k 3 -w 10000 -n 10 -o plant vertebrate
It can then be used with:
plot-me.classify /mnt/data/PLoT-ME/k3_s10000/minikm_b10_k3_s10000_oplant-vertebrate /mnt/data/reports -i /mnt/data/mock_files/mock_community_1.fastq
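When scripting several runs, the cluster folder and the classify command can be assembled programmatically. The sketch below infers the folder naming pattern (`k<k>_s<w>/minikm_b<n>_k<k>_s<w>_o<omitted>`) from this single example only, so treat it as an assumption rather than a documented rule. The helper names are hypothetical, and passing several fastq files after `-i` is assumed from the usage string.

```python
from pathlib import PurePosixPath


def cluster_folder(root, k, w, n, omit):
    """Expected cluster folder for given parameters (pattern inferred from the example)."""
    omitted = "-".join(omit)
    return PurePosixPath(root) / f"k{k}_s{w}" / f"minikm_b{n}_k{k}_s{w}_o{omitted}"


def classify_command(clusters, reports, fastq_files):
    """Argument list for plot-me.classify, suitable for subprocess.run."""
    return ["plot-me.classify", str(clusters), str(reports), "-i",
            *[str(f) for f in fastq_files]]


folder = cluster_folder("/mnt/data/PLoT-ME", k=3, w=10000, n=10, omit=["plant", "vertebrate"])
command = classify_command(folder, "/mnt/data/reports",
                           ["/mnt/data/mock_files/mock_community_1.fastq"])
```

The resulting `command` list reproduces the classify call shown above and can be handed to `subprocess.run(command, check=True)`.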
Technical details
Python 3 is the main programming language, with extensive use of libraries. Dependencies are resolved using pip.
Intermediate Data
Data is saved as pickle (.pkl) or Pandas DataFrame (.pd).
- Kmer counts: Pandas DataFrames are saved under .../kmer_counts/counts.<param> and have the following columns:
  taxon category start end name description fna_path AAAA ... TTTT
- Cluster assignments: segments-clustered.<param>.pd trades the nucleotide columns for a cluster column.
- RefSeq_binned is the clustering made by PLoT-ME, and holds one folder per cluster, with concatenated segments of genomes (one .fna file per taxon).
- Libraries generated by the classifiers; their content depends on each classifier.
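To make the two table layouts concrete, here is a small sketch that builds the column list of a k-mer count table for k=4 and derives the corresponding cluster-assignment row. It uses plain dictionaries rather than Pandas, and the helper names and toy values are hypothetical.

```python
from itertools import product

# Metadata columns as listed above; the k-mer columns follow them.
METADATA_COLUMNS = ["taxon", "category", "start", "end", "name", "description", "fna_path"]


def kmer_columns(k):
    """All 4**k k-mers in alphabetical order, e.g. AAAA ... TTTT for k=4."""
    return ["".join(p) for p in product("ACGT", repeat=k)]


def to_clustered_row(count_row, cluster):
    """Keep the metadata of a k-mer count row and replace its k-mer columns by 'cluster'."""
    clustered = {col: count_row[col] for col in METADATA_COLUMNS}
    clustered["cluster"] = cluster
    return clustered


count_columns = METADATA_COLUMNS + kmer_columns(4)
# Toy row with placeholder values only (not real RefSeq data).
toy_row = dict.fromkeys(count_columns, 0)
toy_row.update(taxon=1, name="toy_sequence", start=0, end=10000)
clustered_row = to_clustered_row(toy_row, cluster=3)
```

For k=4 the count table has 7 metadata columns plus 256 k-mer columns, while the clustered table keeps the metadata and a single cluster column.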
Final files
The model*.pkl and the folder kraken2 or centrifuge are needed for PLoT-ME to work. The folder tree needs to remain intact.
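Before launching a classification, the presence of these files can be verified with a few lines. This is a minimal sketch; the function name `missing_final_files` is hypothetical and it only checks that the entries exist, not that they are valid.

```python
from pathlib import Path


def missing_final_files(cluster_dir, classifier):
    """List the required entries missing from a cluster folder (empty list if complete)."""
    cluster_dir = Path(cluster_dir)
    problems = []
    if not any(cluster_dir.glob("model*.pkl")):
        problems.append("model*.pkl")
    if not (cluster_dir / classifier).is_dir():
        problems.append(classifier + "/")
    return problems
```

For example, `missing_final_files("/mnt/data/PLoT-ME/k3_s10000/minikm_b10_k3_s10000_oplant-vertebrate", "kraken2")` returns an empty list when both the model and the kraken2 folder are in place.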
Work in progress
April 2021
- Implementation of Cython version of the kmer counter
- Adding reverse complement to forward strand
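One possible reading of "adding reverse complement to forward strand" is to fold each k-mer count into a canonical k-mer, so that a k-mer and its reverse complement share one count. The sketch below illustrates that interpretation only; it is an assumption, not a description of the Cython implementation, and the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Reverse complement of an ACGT k-mer, e.g. AC -> GT."""
    return kmer.translate(_COMPLEMENT)[::-1]


def fold_reverse_complement(counts):
    """Sum the counts of each k-mer and its reverse complement under one canonical key."""
    folded = {}
    for kmer, n in counts.items():
        # The canonical key is the alphabetically smaller of the two strands.
        canonical = min(kmer, reverse_complement(kmer))
        folded[canonical] = folded.get(canonical, 0) + n
    return folded


folded = fold_reverse_complement({"AC": 2, "GT": 3, "AT": 1, "CA": 4})
```

Folding makes the counts independent of which strand a read was sequenced from, and roughly halves the number of k-mer columns.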
July 2020:
- pre-process: Using a large k (5+) and a small s (10000-) yields very large kmer counts, costing high amounts of RAM (esp. when combining all kmer counts together, RAM needs reach ~30GB or more).
- classify: Merging of reports
- pre-process: Cleaning of pre-processing files (--clean)
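The RAM remark about large k and small s follows from simple arithmetic: a dense count table has one row per segment and 4^k columns. The sketch below gives a rough lower bound for that table alone. The database size and the 4 bytes per count are hypothetical values chosen for illustration, not measurements of PLoT-ME.

```python
def kmer_table_bytes(total_bases, segment_size, k, bytes_per_count=4):
    """Rough size of a dense k-mer count table: segments x 4**k x bytes per count."""
    n_segments = -(-total_bases // segment_size)  # ceiling division
    n_columns = 4 ** k
    return n_segments * n_columns * bytes_per_count


# Hypothetical database of 10**11 bases, segments of 10000 bases.
size_k3 = kmer_table_bytes(10**11, 10_000, k=3)
size_k5 = kmer_table_bytes(10**11, 10_000, k=5)
```

Each increment of k multiplies the table size by 4, and halving s doubles it, so both parameters are worth choosing with the available RAM in mind.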
Future work
- classify: Cleaning of pre-classification tmp files
- classify: Multi cores
- classify / pre-process: Speed up kmer counting
- pre-process: Even sized bins
- pre-process: Overlapping clusters or tricks for higher accuracy
Contact
Author: Sylvain Riondet, PhD student at the National University of Singapore, School of Computing
Email: sylvainriondet@gmail.com
Lab: Genome Institute of Singapore / National University of Singapore
Supervisors: Niranjan Nagarajan & Martin Henz
Thanks
Thanks for your support and supervision all along my PhD and this project: Martin Henz, Chenhao Li, Rafael Peres, D. Bertrand and the whole MTMS lab