Pre-classification of Long-reads for Memory Efficient Taxonomic assignment

PLoT-ME

Sylvain Riondet, K. Križanović, J. Marić, M. Šikić and Niranjan Nagarajan
NUS/SoC, Biopolis/GIS, Singapore

This tool is in active development. Feedback and bug reports are welcome, either through GitHub or on Twitter.

Description

Pre-Processing

  • Segmentation of NCBI RefSeq into clusters
  • Building of taxonomic classifiers' indexes for each cluster
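
The segmentation step can be pictured with a minimal sketch, assuming each genome is cut into fixed-size windows and k-mers are counted per window. The function names below are hypothetical and are not PLoT-ME's own API; k-mers containing characters other than A, C, G, T are skipped, which is also an assumption.

```python
from collections import Counter


def split_into_segments(sequence, window):
    """Cut a sequence into consecutive windows: (start, end, segment)."""
    return [
        (start, min(start + window, len(sequence)), sequence[start:start + window])
        for start in range(0, len(sequence), window)
    ]


def count_kmers(segment, k):
    """Count the k-mers of one segment, skipping any k-mer with a non-ACGT base."""
    segment = segment.upper()
    counts = Counter()
    for i in range(len(segment) - k + 1):
        kmer = segment[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts
```

Each window then yields one vector of k-mer counts, which is the kind of input a clustering algorithm needs to group segments into clusters.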

Classification

Taxonomic classification of mock communities / metagenomic fastq files

  • Assignment of long DNA reads (Nanopore/PacBio) to each cluster
  • Classification by the classifier with a subset of RefSeq
  • Merging of reports
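
The first and last of these steps can be sketched as follows, assuming a read is assigned to the cluster whose centroid is closest to its k-mer frequency vector, and that each per-cluster report is reduced to a mapping from taxid to read count. Both are simplifications for illustration: real classifier reports carry more columns, and the function names are hypothetical.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_cluster(vector, centroids):
    """Index of the centroid closest to the read's k-mer frequency vector."""
    return min(
        range(len(centroids)),
        key=lambda i: squared_distance(vector, centroids[i]),
    )


def merge_reports(reports):
    """Sum read counts per taxid over all per-cluster reports."""
    merged = Counter()
    for report in reports:
        merged.update(report)
    return dict(merged)
```

Because every read goes to exactly one cluster, summing the per-cluster counts gives totals over the whole sample.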

Kraken2 (Derrick E. Wood et al. 2019) and Centrifuge (D. Kim et al. 2016) are currently automated. Any classifier able to build its index from a set of .fna files with provided taxids should also work.

Take-aways

Memory Consumption - Bare Classifier (33 GB) vs PLoT-ME (3.6 GB for 20 bins)

  • Large reduction in memory needs, determined by the number of clusters *
  • Compatible with existing taxonomic classifiers, and enhances them
  • Time overhead from the pre-classification (currently ~3-5x; improvements planned for future releases)

* Clusters are built with Mini Batch K-Means: D. Sculley, Web-Scale K-Means Clustering, 2010
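
As a quick worked check of the memory figures quoted above (33 GB bare, 3.6 GB with 20 bins), the arithmetic below computes the reduction factor and, for comparison, what a perfectly even split would give. The even-split figure is only a reference point, not a measurement.

```python
bare_gb = 33.0     # bare classifier, from the figures above
plotme_gb = 3.6    # PLoT-ME with 20 bins, from the figures above
bins = 20

reduction = bare_gb / plotme_gb   # roughly a 9x reduction
even_split_gb = bare_gb / bins    # 1.65 GB if the index split evenly across bins
```

The measured 3.6 GB is above the even-split value, which would be consistent with bins of unequal size, although other overheads could also explain the gap.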

Requirements

  • Database of Genomes, in .fna / .fasta format, with an associated taxonomy id. Tested with NCBI RefSeq (ftp server)
  • Taxonomic classifier, installed and added to PATH. Currently supported: Kraken2 and Centrifuge
  • Linux (tested on Ubuntu 18.04)
  • Python >= 3.7
  • Python packages:

    Package        Version
    biopython      >= 1.72
    ete3           >= 3.1.1
    numpy          >= 1.17.3
    pandas         >= 0.23
    scikit-learn   >= 0.18
    tqdm           >= 4.24.0
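
pip normally enforces these minimums at install time. For an environment that was set up by hand, a small check along the following lines may help. It is a sketch only: the version parser is deliberately crude (it stops at the first non-numeric component), and the helper names are hypothetical.

```python
MINIMUM_VERSIONS = {
    "biopython": "1.72",
    "ete3": "3.1.1",
    "numpy": "1.17.3",
    "pandas": "0.23",
    "scikit-learn": "0.18",
    "tqdm": "4.24.0",
}


def version_tuple(text):
    """Turn '1.17.3' into (1, 17, 3); stop at the first non-numeric part."""
    parts = []
    for part in text.split("."):
        if not part.isdigit():
            break
        parts.append(int(part))
    return tuple(parts)


def unmet_requirements(installed):
    """Return the packages that are missing or older than the minimum."""
    problems = []
    for package, minimum in MINIMUM_VERSIONS.items():
        found = installed.get(package)
        if found is None or version_tuple(found) < version_tuple(minimum):
            problems.append(package)
    return problems
```

On Python 3.8 and later, the installed versions can be collected with importlib.metadata.version(name) and passed in as a dictionary.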

Installation

Create a Python 3 environment with conda or pyenv.
Installation is then done with pip:
python3 -m pip install plot-me
This creates two commands, plot-me.preprocess and plot-me.classify, detailed under 'Usage'.

It is also possible to clone PLoT-ME's repository and launch the scripts directly with python path/to/PLoT-ME/parse_DB.py or classify.py

Usage

Pre-Processing

For the full help: plot-me.preprocess -h
Typical usage:
plot-me.preprocess <path/NCBI/refseq> <folder/for/clusters> <path/taxonomy> -k 4 -w 10000 -n 10 -o <OmitFoldersContainingString>

Pre-classification + classification

For the full help: plot-me.classify -h
Typical usage:
plot-me.classify <folder/with/clusters> <folder/reports> -i <fastq files to preclassify>
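
When several samples need to be processed from a script, the command line above can be assembled programmatically. The sketch below only builds the argument list; it assumes that -i accepts several fastq files, as the plural in the usage line suggests, and the helper names are hypothetical.

```python
import shlex


def classify_command(clusters_dir, reports_dir, fastq_files):
    """Build the argument list for plot-me.classify (nothing is executed)."""
    return ["plot-me.classify", clusters_dir, reports_dir, "-i", *fastq_files]


def shell_preview(command):
    """Render the argument list as a copy-pasteable shell line."""
    return " ".join(shlex.quote(arg) for arg in command)
```

The resulting list can be passed to subprocess.run(command, check=True), which avoids shell quoting issues with paths that contain spaces.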

Example

/mnt/data
|-- mock_files
|   |-- mock_community_1.fastq
|   |   \-- minikm_b10_k3_s10000_oplant-vertebrate (one tmp file per cluster, generated by PLoT-ME)
|   \-- mock_community_2.fastq
|-- PLoT-ME
|   |-- k3_s10000
|   |   |-- kmer_counts
|   |   |   |-- counts.k3_s10000 (same tree as RefSeq, with <sequencing_name>.3mer_count.pd)
|   |   |   \-- all-counts.k3_s10000_oplant-vertebrate.csv
|   |   |-- minikm_b10_k3_s10000_oplant-vertebrate               <*>
|   |   |   |-- centrifuge       (10 folders with indexes)
|   |   |   |-- kraken2          (10 folders with indexes)
|   |   |   |-- RefSeq_binned    (10 folders with fna files)
|   |   |   |-- model.minikm_b10_k3_s10000_oplant-vertebrate.pkl
|   |   |   \-- segments-clustered.minikm_b10_k3_s10000_oplant-vertebrate.pd
|   |   \-- minikm_b20_k3_s10000_oplant-vertebrate
|   |       \-- (same structure)
|   |-- k4_s10000
|   |   \-- (same structure)
|   \-- no-binning
|       |-- oAllRefSeq
|       \-- oplant-vertebrate
|           |-- centrifuge
|           \-- kraken2
|-- NCBI
|   \-- refseq
|-- reports
|   \-- mock_community_1 (one report per cluster)
\-- taxonomy

The folder marked <*> can be generated with:
plot-me.preprocess /mnt/data/NCBI/refseq /mnt/data/PLoT-ME /mnt/data/taxonomy -k 3 -w 10000 -n 10 -o plant vertebrate
It can then be used with:
plot-me.classify /mnt/data/PLoT-ME/k3_s10000/minikm_b10_k3_s10000_oplant-vertebrate /mnt/data/reports -i /mnt/data/mock_files/mock_community_1.fastq
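
The folder names in this example appear to encode the pre-processing parameters (b for the number of bins, k for the k-mer size, s for the segment size, o for the omitted folders). The helper below reproduces that pattern so the path for plot-me.classify can be derived from the values given to plot-me.preprocess. The naming scheme is inferred from this one example only and the function is hypothetical, so treat it as an assumption.

```python
from pathlib import PurePosixPath


def cluster_folder(root, k, w, n, omit):
    """Guess the cluster folder path from the preprocess parameters."""
    tag = f"k{k}_s{w}"
    name = f"minikm_b{n}_{tag}_o{'-'.join(omit)}"
    return PurePosixPath(root) / tag / name
```

For instance, cluster_folder("/mnt/data/PLoT-ME", 3, 10000, 10, ["plant", "vertebrate"]) gives the path passed to plot-me.classify above.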

Technical details

PLoT-ME is written mainly in Python 3 and relies heavily on third-party libraries. Dependencies are resolved with pip.

Intermediate Data

Intermediate data is saved either as pickle files (.pkl) or as pandas DataFrames (.pd).

  • K-mer count DataFrames are saved under .../kmer_counts/counts.<param> and have the following columns:
    taxon category start end name description fna_path AAAA ... TTTT
  • Cluster assignments, in segments-clustered.<param>.pd, replace the k-mer columns with a single cluster column.
  • RefSeq_binned is the clustering made by PLoT-ME. It holds one folder per cluster, with concatenated segments of genomes (one .fna file per taxon).
  • Classifier libraries (indexes) are generated by each classifier, and their layout depends on the classifier.
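
To make the two table layouts concrete, the sketch below lists the column names of a k-mer count table and of a cluster-assignment table. It assumes the k-mer columns follow lexicographic A, C, G, T order, which matches the AAAA ... TTTT range above but is not confirmed beyond that; the function names are hypothetical.

```python
from itertools import product

METADATA_COLUMNS = [
    "taxon", "category", "start", "end", "name", "description", "fna_path",
]


def kmer_count_columns(k):
    """Metadata columns followed by all 4**k k-mers, from AAA... to TTT..."""
    return METADATA_COLUMNS + ["".join(p) for p in product("ACGT", repeat=k)]


def clustered_columns():
    """Same metadata, with the k-mer columns replaced by one cluster column."""
    return METADATA_COLUMNS + ["cluster"]
```

With k = 4 this gives 7 metadata columns plus 256 k-mer columns per segment, and each increment of k multiplies the number of k-mer columns by 4.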

Final files

The model*.pkl file and the kraken2 or centrifuge folder are needed for PLoT-ME to work. The folder tree must remain intact.
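
Before moving or archiving a cluster folder, a check like the following can confirm that these pieces are still present. It is a minimal sketch with a hypothetical function name; it only tests that the files and folders exist, not that the indexes inside them are complete.

```python
from pathlib import Path


def is_usable_cluster_folder(folder):
    """True if the folder holds a model*.pkl and a kraken2 or centrifuge folder."""
    folder = Path(folder)
    has_model = any(folder.glob("model*.pkl"))
    has_index = any(
        (folder / name).is_dir() for name in ("kraken2", "centrifuge")
    )
    return has_model and has_index
```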

Work in progress

April 2021:

  • Implementation of Cython version of the kmer counter
  • Adding reverse complement to forward strand
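
One plausible reading of the second item is that each k-mer's count is combined with the count of its reverse complement, so that a read gives the same profile whichever strand was sequenced. The sketch below illustrates that reading only; it is not taken from PLoT-ME's code, and the function names are hypothetical.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Reverse complement of an upper-case DNA k-mer."""
    return kmer.translate(COMPLEMENT)[::-1]


def add_reverse_complement(counts):
    """Add each k-mer's count to itself and to its reverse complement."""
    combined = Counter()
    for kmer, n in counts.items():
        combined[kmer] += n
        combined[reverse_complement(kmer)] += n
    return combined
```

After this step a k-mer and its reverse complement always carry the same count, and a palindromic k-mer such as ACGT is counted twice because it occurs on both strands.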

July 2020:

  • pre-process: Using a large k (5+) and a small s (10000-) yields very large k-mer count tables that cost a lot of RAM (especially when combining all k-mer counts together, where RAM needs reach ~30 GB or more).
  • classify: Merging of reports
  • pre-process: Cleaning of pre-processing files with --clean
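
The RAM issue in the first item follows from how the table grows: one row per segment and 4^k count columns per row. The rough estimate below makes that scaling explicit. The 4 bytes per count and the total of 10^9 bases used in the note are placeholder assumptions, not measurements of RefSeq or of PLoT-ME, and metadata columns and pandas overhead are ignored.

```python
def kmer_table_bytes(total_bases, k, s, bytes_per_count=4):
    """Rough size of a k-mer count table: segments x 4**k x bytes per count."""
    segments = -(-total_bases // s)  # ceiling division
    return segments * 4 ** k * bytes_per_count
```

Under these assumptions, 10^9 bases with s = 10000 need about 0.1 GB at k = 4 and about 0.4 GB at k = 5. Each increment of k multiplies the size by 4, and halving s doubles it.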

Future work

  • classify: Cleaning of pre-classification tmp files
  • classify: Multi cores
  • classify/pre-process: Speed up kmer counting
  • pre-process: Even sized bins
  • pre-process: Overlapping clusters or tricks for higher accuracy

Contact

Author: Sylvain Riondet, PhD student at the National University of Singapore, School of Computing
Email: sylvainriondet@gmail.com
Lab: Genome Institute of Singapore / National University of Singapore
Supervisors: Niranjan Nagarajan & Martin Henz

Thanks

Thanks for your support and supervision throughout my PhD and this project: Martin Henz, Chenhao Li, Rafael Peres, D. Bertrand and the whole MTMS lab
