
Pre-classification of Long-reads for Memory Efficient Taxonomic assignment



Sylvain Riondet, Niranjan Nagarajan, NUS/SoC, GIS/Biopolis, Singapore



  • Segmentation of NCBI RefSeq into clusters
  • Building of taxonomic classifiers' indexes for each cluster
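The segmentation step can be pictured as cutting each genome into fixed-size windows and counting k-mers in each window. The sketch below is illustrative only and is not PLoT-ME's code: the function names are hypothetical, only the forward strand is counted, and k-mers containing letters other than A/C/G/T are ignored (both are assumptions).

```python
from collections import Counter
from itertools import product


def kmer_columns(k):
    """All 4**k k-mers in lexicographic order (AAAA ... TTTT for k=4)."""
    return ["".join(p) for p in product("ACGT", repeat=k)]


def segment_kmer_counts(sequence, k=4, window=10000):
    """Yield (start, end, counts) for each window of the sequence.

    counts is a list aligned with kmer_columns(k). K-mers that contain
    anything other than A/C/G/T (e.g. N) are skipped (assumption).
    """
    columns = kmer_columns(k)
    sequence = sequence.upper()
    for start in range(0, len(sequence), window):
        segment = sequence[start:start + window]
        seen = Counter(segment[i:i + k] for i in range(len(segment) - k + 1))
        yield start, start + len(segment), [seen.get(c, 0) for c in columns]


# Toy example: a 13-base sequence, 2-mers, windows of 8 bases.
rows = list(segment_kmer_counts("ACGTACGTNACGT", k=2, window=8))
```

Each row then plays the role of one segment in the k-mer count tables, with the window size standing in for the -w parameter.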


Taxonomic classification of mock communities / metagenomic fastq files

  • Assignment of long DNA reads (Nanopore/PacBio) to each cluster
  • Classification by the classifier with a subset of RefSeq
  • Merging of reports
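The three steps above can be sketched roughly as follows, assuming each read has already been turned into a k-mer frequency vector and each cluster is summarised by a centroid. The helper names, the nearest-centroid rule, and the report layout (a plain taxid-to-read-count mapping) are assumptions made for illustration, not PLoT-ME's actual implementation or file format.

```python
from collections import Counter


def nearest_cluster(vector, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    def sq_dist(centroid):
        return sum((v - c) ** 2 for v, c in zip(vector, centroid))
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i]))


def split_reads(read_vectors, centroids):
    """Group read ids by the cluster whose centroid is nearest."""
    bins = {}
    for read_id, vector in read_vectors.items():
        bins.setdefault(nearest_cluster(vector, centroids), []).append(read_id)
    return bins


def merge_reports(reports):
    """Sum per-taxid read counts over the per-cluster reports."""
    merged = Counter()
    for report in reports:
        merged.update(report)  # Counter.update adds counts
    return dict(merged)


# Toy data: two clusters, three reads, two per-cluster reports.
centroids = [[0.9, 0.1], [0.1, 0.9]]
read_vectors = {"r1": [0.8, 0.2], "r2": [0.2, 0.8], "r3": [1.0, 0.0]}
bins = split_reads(read_vectors, centroids)
merged = merge_reports([{562: 2, 1280: 1}, {562: 1}])
```

Each bin of reads would then be classified against the index built for its cluster only, which is where the memory saving comes from.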

Kraken2 (Derrick E. Wood et al. 2019) and Centrifuge (D. Kim et al. 2016) are currently automated; any classifier able to build its index from a set of .fna files with a provided taxid should also work.


Memory Consumption - Bare Classifier (33 GB) vs PLoT-ME (3.6 GB for 20 bins)

  • Large reduction in memory needs, determined by the number of clusters *
  • Compatible with, and enhances, existing taxonomic classifiers
  • Time overhead from the pre-classification (currently ~3-5x; improvements planned for future releases)

* Mini Batch K-Means: D. Sculley, "Web-Scale K-Means Clustering", 2010
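For readers unfamiliar with the cited algorithm, here is a minimal pure-Python sketch of mini-batch k-means in the spirit of Sculley (2010): sample a small batch, assign each point to its nearest center, then move each center with a per-center learning rate of 1/count. It is a teaching sketch under those assumptions; the function name and parameters are hypothetical, and this is not the implementation PLoT-ME relies on.

```python
import random


def mini_batch_kmeans(points, n_clusters, batch_size=10, n_iter=100,
                      init=None, seed=0):
    """Return (centers, labels) for a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Start from the given centers, or from randomly picked data points.
    centers = [list(p) for p in (init or rng.sample(points, n_clusters))]
    counts = [0] * n_clusters

    def nearest(p):
        return min(range(n_clusters),
                   key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))

    for _ in range(n_iter):
        batch = [rng.choice(points) for _ in range(batch_size)]
        assigned = [nearest(p) for p in batch]  # assign before moving centers
        for p, j in zip(batch, assigned):
            counts[j] += 1
            eta = 1.0 / counts[j]  # per-center learning rate
            centers[j] = [(1 - eta) * c + eta * x for c, x in zip(centers[j], p)]
    return centers, [nearest(p) for p in points]


# Toy data: two well-separated groups, one starting center taken from each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = mini_batch_kmeans(points, 2, init=[points[0], points[3]])
```

Because only a small batch is touched per iteration, the approach scales to the very large number of genome segments produced by the pre-processing.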


  • Linux (tested on Ubuntu 18.04)

  • Taxonomic classifier

    • Kraken2 or Centrifuge (feel free to request support for more)
  • Python >= 3.7

    Package Version
    biopython >= 1.72
    ete3 >= 3.1.1
    numpy >= 1.17.3
    pandas >= 0.23
    scikit-learn >= 0.18
    tqdm >= 4.24.0


Create a Python 3 environment with conda or pyenv. Installation is then done with pip:

    python3 -m pip install plot-me

This creates two commands, plot-me.preprocess and plot-me.classify, detailed in the 'Usage'. It is also possible to clone PLoT-ME's repo and launch the commands directly with python path/to/PLoT-ME/ or



For the full help:

    plot-me.preprocess -h

Typical usage:

    plot-me.preprocess <path/NCBI/refseq> <folder/for/clusters> <path/taxonomy> -k 4 -w 10000 -n 10 -o <OmitFoldersContainingString>

Pre-classification + classification

For the full help:

    plot-me.classify -h

Typical usage:

    plot-me.classify <folder/with/clusters> <folder/reports> -i <fastq files to preclassify>


|-- mock_files
|   |-- mock_community_1.fastq
|   |   \-- minikm_b10_k3_s10000_oplant-vertebrate (one tmp file per cluster, generated by PLoT-ME)
|   \-- mock_community_2.fastq
|-- PLoT-ME
|   |-- k3_s10000
|   |   |-- kmer_counts
|   |   |   |-- counts.k3_s10000 (same tree as RefSeq, with <sequencing_name>.3mer_count.pd)
|   |   |   \-- all-counts.k3_s10000_oplant-vertebrate.csv
|   |   |-- minikm_b10_k3_s10000_oplant-vertebrate               <*>
|   |   |   |-- centrifuge       (10 folders with indexes)
|   |   |   |-- kraken2          (10 folders with indexes)
|   |   |   |-- RefSeq_binned    (10 folders with fna files)
|   |   |   |-- model.minikm_b10_k3_s10000_oplant-vertebrate.pkl
|   |   |   \-- segments-clustered.minikm_b10_k3_s10000_oplant-vertebrate.pd
|   |   \-- minikm_b20_k3_s10000_oplant-vertebrate
|   |       \-- (same structure)
|   |-- k4_s10000
|   |   \-- (same structure)
|   \-- no-binning
|       |-- oAllRefSeq
|       \-- oplant-vertebrate
|           |-- centrifuge
|           \-- kraken2
|-- NCBI
|   \-- refseq
|-- reports
|   \-- mock_community_1 (one report per cluster)
\-- taxonomy

The folder marked <*> can be generated with:

    plot-me.preprocess /mnt/data/NCBI/refseq /mnt/data/PLoT-ME /mnt/data/taxonomy -k 3 -w 10000 -n 10 -o plant vertebrate

and can then be used with:

    plot-me.classify /mnt/data/PLoT-ME/k3_s10000/minikm_b10_k3_s10000_oplant-vertebrate /mnt/data/reports -i /mnt/data/mock_files/mock_community_1.fastq
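The folder names in the tree appear to encode the pre-processing parameters. The helper below reconstructs the path of a cluster folder from -k, -w, -n and -o; the naming pattern is inferred from the example tree only, and the function itself is hypothetical (PLoT-ME may build its paths differently).

```python
from pathlib import PurePosixPath


def cluster_folder(root, k, w, n, omit):
    """Compose <root>/k<k>_s<w>/minikm_b<n>_k<k>_s<w>_o<omit joined by '-'>.

    Pattern inferred from the example tree; not taken from PLoT-ME's code.
    """
    params = f"k{k}_s{w}"
    omitted = "-".join(omit)
    return PurePosixPath(root) / params / f"minikm_b{n}_{params}_o{omitted}"


folder = cluster_folder("/mnt/data/PLoT-ME", k=3, w=10000, n=10,
                        omit=["plant", "vertebrate"])
print(folder)
```

With the parameters of the example command, this yields the same folder that is passed to plot-me.classify.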

Technical details

Python 3 is the main programming language, with extensive use of libraries. Dependencies are resolved using pip.

Intermediate Data

Data is saved as pickle .pkl or Pandas DataFrame .pd

  • K-mer counts: Pandas DataFrames saved under .../kmer_counts/counts.<param>, with the following columns: taxon category start end name description fna_path AAAA ... TTTT
  • Cluster assignments: segments-clustered.<param>.pd replaces the k-mer count columns with a single cluster column.
  • RefSeq_binned is the clustering made by PLoT-ME. It holds one folder per cluster, with concatenated segments of genomes (one .fna file per taxon).
  • Libraries (indexes) generated by each classifier; their content depends on the classifier.
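The RefSeq_binned layout described above can be illustrated by grouping clustered segments by cluster and taxon, then concatenating each group into one FASTA entry. This is a sketch under assumptions: the function name, the `<cluster>/<taxon>.fna` naming and the FASTA header are hypothetical, and the way PLoT-ME actually joins segments is not specified here.

```python
from collections import defaultdict


def bin_segments(segments):
    """Concatenate segment sequences per (cluster, taxon).

    segments: iterable of (taxon, cluster, sequence) tuples in genome order.
    Returns {"<cluster>/<taxon>.fna": FASTA text} (hypothetical naming).
    """
    grouped = defaultdict(list)
    for taxon, cluster, sequence in segments:
        grouped[(cluster, taxon)].append(sequence)
    files = {}
    for (cluster, taxon), parts in grouped.items():
        joined = "".join(parts)
        files[f"{cluster}/{taxon}.fna"] = f">{taxon}\n{joined}\n"
    return files


# Toy data: taxon 562 is split over clusters 0 and 1, taxon 1280 sits in cluster 1.
files = bin_segments([
    (562, 0, "ACGT"),
    (562, 1, "GGGG"),
    (562, 0, "TTTT"),
    (1280, 1, "CCCC"),
])
```

A classifier index is then built from the .fna files of each cluster folder, so each index only covers a subset of RefSeq.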

Final files

The model*.pkl file and the kraken2 or centrifuge folder are needed for PLoT-ME to work. The folder tree needs to remain intact.
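Before launching a classification, it can be handy to check that a cluster folder still holds these pieces. The small check below is a hypothetical helper, not part of PLoT-ME; it only looks for a file matching model*.pkl and a folder named after the chosen classifier.

```python
import tempfile
from pathlib import Path


def missing_parts(cluster_dir, classifier="kraken2"):
    """List what is missing from a cluster folder (empty list if complete)."""
    cluster_dir = Path(cluster_dir)
    problems = []
    if not any(cluster_dir.glob("model*.pkl")):
        problems.append("model*.pkl")
    if not (cluster_dir / classifier).is_dir():
        problems.append(classifier)
    return problems


# Demonstration on a throw-away folder.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    before = missing_parts(root)
    (root / "model.demo.pkl").touch()
    (root / "kraken2").mkdir()
    after = missing_parts(root)

print(before, after)
```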

Work in progress

As of July 2020:

  • pre-process Using a large k (5+) and a small s (10000-) yields very large k-mer counts, costing high amounts of RAM (esp. when combining all k-mer counts together, RAM needs reach ~30 GB or more).
  • classify Merging of reports
  • pre-process Cleaning of pre-processing files --clean
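The RAM pressure mentioned in the first item follows from the shape of the k-mer count table: one row per segment and 4^k count columns. The back-of-the-envelope estimate below assumes a dense table with 4-byte counts and ignores the metadata columns; the function name and the 10^11-base input are illustrative assumptions, not measurements from PLoT-ME.

```python
def kmer_table_bytes(total_bases, k, s, bytes_per_count=4):
    """Rough size of a dense k-mer count table.

    One row per segment of s bases, 4**k columns, fixed-width counts.
    """
    n_segments = -(-total_bases // s)  # ceiling division
    return n_segments * 4 ** k * bytes_per_count


# Illustrative input: 10**11 bases cut into segments of 10,000 bases.
size_k4 = kmer_table_bytes(10**11, k=4, s=10_000)
size_k5 = kmer_table_bytes(10**11, k=5, s=10_000)
print(f"k=4: {size_k4 / 1e9:.1f} GB, k=5: {size_k5 / 1e9:.1f} GB")
```

Under these assumptions each increment of k multiplies the table size by 4, and halving s doubles it, which is why large k combined with small s becomes costly.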

Future work

  • classify Cleaning of pre-classification tmp files
  • classify Multi cores
  • classify/pre-process Speed up kmer counting
  • pre-process Even sized bins
  • pre-process Overlapping clusters or tricks for higher accuracy


Author: Sylvain Riondet, PhD student at the National University of Singapore, School of Computing
Email:
Lab: Genome Institute of Singapore / National University of Singapore
Supervisors: Niranjan Nagarajan & Martin Henz


Files for PLoT-ME, version 0.8.3

    Filename                          Size      File type   Python version
    PLoT_ME-0.8.3-py3-none-any.whl    39.5 kB   Wheel       py3
    PLoT-ME-0.8.3.tar.gz              48.9 kB   Source      None
