A utility to generate input files for taxonomy propagation and assignment in QIIME/QIIME2 from the NCBI database

GTAXOPROP (Genbinesia Taxonomy Propagator)

GTAXOPROP is a utility to generate input files for taxonomy propagation and assignment in QIIME/QIIME2 from the NCBI database. It converts NCBI accession numbers to QIIME/QIIME2-compatible taxonomy files with API fallback.

⚠️ Derivative Work Notice

GTAXOPROP is a derivative work based on entrez_qiime v2.0 by Christopher C. M. Baker. This version includes substantial modifications and enhancements while maintaining GPL v3 compliance.

Original work: Baker, C.C.M. (2016). entrez_qiime. v2.0. https://github.com/bakerccm/entrez_qiime

Major Enhancements from Original

  • ✅ Complete Python 3 migration
  • ✅ cogent3 integration (replaced PyCogent)
  • ✅ Better NCBI Entrez communication using Biopython
  • ✅ Advanced caching with resume capability
  • ✅ Batch API processing with rate limiting
  • ✅ Improved error handling and logging
  • ✅ Enhanced file encoding detection
  • ✅ Better taxonomy rank handling
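
The batch processing and rate limiting mentioned in the list above can be pictured with a small, standard-library-only sketch. It is not GTAXOPROP's actual implementation: `fetch_in_batches`, its parameters, and the `fetch` callable are hypothetical names chosen for illustration, and the default interval of roughly a third of a second only reflects NCBI's general guideline of about three E-utilities requests per second without an API key.

```python
import time
from typing import Callable, Dict, Iterable, Iterator, List


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists holding at most `size` items each."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def fetch_in_batches(
    accessions: Iterable[str],
    fetch: Callable[[List[str]], Dict[str, int]],  # hypothetical lookup function
    batch_size: int = 200,       # assumed batch size, not taken from GTAXOPROP
    min_interval: float = 0.34,  # assumed pause between requests, in seconds
) -> Dict[str, int]:
    """Call `fetch` once per batch, waiting so calls start at least `min_interval` apart."""
    results: Dict[str, int] = {}
    last_call = None
    for batch in batched(accessions, batch_size):
        if last_call is not None:
            wait = min_interval - (time.monotonic() - last_call)
            if wait > 0:
                time.sleep(wait)
        last_call = time.monotonic()
        results.update(fetch(batch))
    return results
```

Passing the lookup as a callable keeps the pacing logic independent of the network layer, so it can be exercised with a stub before any request is sent to NCBI.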

Authors

  • Maulana Malik Nashrulloh (Division of Biomics Research, Department of Sciences, Generasi Biologi Indonesia Foundation)
  • Sonia Az Zahra Defi (Department of Biology, Faculty of Mathematics and Natural Sciences, Brawijaya University)
  • Brian Rahardi (Department of Bioinformatics, Faculty of Mathematics and Natural Sciences, Brawijaya University)
  • Muhammad Badrut Tamam (Division of Biomics Research, Department of Sciences, Generasi Biologi Indonesia Foundation & Biology Program, Faculty of Science, Technology, and Education, Muhammadiyah University of Lamongan)
  • Riki Ruhimat (Research Center for Applied Microbiology, Research Organization for Life Sciences, National Research and Innovation Agency)
  • Hessy Novita (Research Center for Veterinary Science, Research Organization for Health, National Research and Innovation Agency)

Quick Start

Dependencies

Make sure your system has Python >=3.10 and the following packages installed:

  • tinydb==4.8.2
  • pbr>=6.1.1
  • stevedore>=5.5.0
  • cogent3>=2025.9.8a2
  • biopython>=1.85
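
To see which of these requirements are already present, a short check along the following lines may help. It is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the script only reports versions without comparing them against the minimums listed above.

```python
import sys
from importlib.metadata import PackageNotFoundError, version

# Distribution names copied from the dependency list above.
REQUIRED = ["tinydb", "pbr", "stevedore", "cogent3", "biopython"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    python_ok = sys.version_info >= (3, 10)
    print(f"Python {sys.version.split()[0]}: {'OK' if python_ok else 'too old'}")
    for name, ver in installed_versions(REQUIRED).items():
        print(f"{name}: {ver or 'NOT INSTALLED'}")
```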

Installation

Currently, installation is supported only through pip.

pip install gtaxoprop

Usage

To use this program, you need the NCBI taxdump and accession2taxid data.

The unpacked contents of nucl_gb.accession2taxid.gz and nucl_wgs.accession2taxid.gz are very large (10 GB+ and 40 GB+ respectively), so manage your disk space accordingly. Alternatively, you may use only one of the two files, but it may not cover all of your data.
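
Before downloading, it can be worth confirming that the target disk has room for the unpacked files. The sketch below uses `shutil.disk_usage` from the standard library; the helper names `free_gb` and `has_enough_space` and the 50 GB figure (the sum of the two sizes quoted above) are illustrative assumptions, not part of GTAXOPROP.

```python
import os
import shutil


def free_gb(path):
    """Return the free space, in decimal gigabytes, on the disk holding `path`."""
    return shutil.disk_usage(path).free / 10**9


def has_enough_space(path, required_gb):
    """True if the disk holding `path` has at least `required_gb` free."""
    return free_gb(path) >= required_gb


if __name__ == "__main__":
    home = os.path.expanduser("~")
    # 50 GB covers only the two unpacked tables; merging needs extra headroom.
    needed = 50
    print(f"Free space in {home}: {free_gb(home):.1f} GB")
    print("Enough space" if has_enough_space(home, needed) else "Not enough space")
```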

Assuming you have 100-150 GB+ of free space in your home directory (~, i.e. /home/username/), run the following commands one by one to set up your data:

cd ~
mkdir -p ~/path/to/your/NCBI/taxdump
cd ~/path/to/your/NCBI/taxdump
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
tar -zxvf taxdump.tar.gz
mkdir -p ~/path/to/your/NCBI/accession2taxid
cd ~/path/to/your/NCBI/accession2taxid
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz
gunzip nucl_gb.accession2taxid.gz
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_wgs.accession2taxid.gz
gunzip nucl_wgs.accession2taxid.gz
cp nucl_gb.accession2taxid nucl_merged.accession2taxid
tail -n+2 nucl_wgs.accession2taxid >> nucl_merged.accession2taxid
rm nucl_gb.accession2taxid nucl_wgs.accession2taxid
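
The last three commands append the WGS table to the GenBank table while skipping the second file's header line. For readers who prefer to see that step spelled out, here is a rough Python equivalent together with a simple streaming lookup. It assumes the standard tab-separated layout of NCBI accession2taxid files (header columns `accession`, `accession.version`, `taxid`, `gi`); the function names are hypothetical, and this is not how GTAXOPROP itself reads the file.

```python
import csv


def merge_accession2taxid(sources, destination):
    """Concatenate the tables, keeping the header line of the first file only."""
    with open(destination, "w", encoding="utf-8", newline="") as out:
        for index, source in enumerate(sources):
            with open(source, encoding="utf-8", newline="") as handle:
                header = handle.readline()
                if index == 0:
                    out.write(header)
                for line in handle:
                    out.write(line)


def lookup_taxids(table_path, wanted):
    """Stream the table once and return {accession.version: taxid} for `wanted`."""
    wanted = set(wanted)
    found = {}
    with open(table_path, encoding="utf-8", newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            accession = row["accession.version"]
            if accession in wanted:
                found[accession] = int(row["taxid"])
                if len(found) == len(wanted):
                    break  # stop early once every accession is resolved
    return found
```

Streaming line by line keeps memory use flat, which matters for tables of this size. Accessions missing from the table are simply absent from the returned dictionary.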

For propagating taxonomy of Archaea, Bacteria, and Eukaryota:

gtaxoprop -i ~/path/to/your/your_sequences.fasta \
          -o ~/path/to/your/your_taxdumps.txt \
          -g ~/path/to/your/your_execution.log \
          -n ~/path/to/your/NCBI/taxdump/ \
          -a ~/path/to/your/NCBI/accession2taxid/nucl_merged.accession2taxid \
          -r domain,kingdom,phylum,class,order,family,genus,species \
          -d \
          --email your_mail@email.xxx
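
The -r option above lists the ranks to report for each sequence. As a rough picture of what resolving those ranks from the taxdump involves, the sketch below walks from a taxid up to the root using nodes.dmp and names.dmp, whose fields are separated by a tab, a pipe, and a tab. It is an illustration only, not GTAXOPROP's code: the function names are hypothetical, and filling a missing rank with "NA" is an assumption.

```python
def parse_dmp_line(line):
    """Split one taxdump line into fields (tab-pipe-tab separated, tab-pipe terminated)."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]
    return line.split("\t|\t")


def load_nodes(lines):
    """Read nodes.dmp lines into {taxid: parent_taxid} and {taxid: rank}."""
    parent, rank = {}, {}
    for line in lines:
        fields = parse_dmp_line(line)
        tax_id = int(fields[0])
        parent[tax_id] = int(fields[1])
        rank[tax_id] = fields[2]
    return parent, rank


def load_scientific_names(lines):
    """Read names.dmp lines into {taxid: scientific name}."""
    names = {}
    for line in lines:
        fields = parse_dmp_line(line)
        if fields[3] == "scientific name":
            names[int(fields[0])] = fields[1]
    return names


def lineage_for_ranks(tax_id, ranks, parent, rank, names, missing="NA"):
    """Return one name per requested rank, walking upward until the root."""
    by_rank = {}
    current = tax_id
    while True:
        # Keep the first (most specific) name seen for each rank.
        by_rank.setdefault(rank[current], names.get(current, missing))
        next_id = parent[current]
        if next_id == current:  # the root node is its own parent
            break
        current = next_id
    return [by_rank.get(r, missing) for r in ranks]
```

With the real files, `load_nodes(open("nodes.dmp"))` and `load_scientific_names(open("names.dmp"))` would build the lookup tables once, after which each taxid can be resolved without rereading the dump.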

For propagating taxonomy of viruses:

gtaxoprop -i ~/path/to/your/your_sequences.fasta \
          -o ~/path/to/your/your_taxdumps.txt \
          -g ~/path/to/your/your_execution.log \
          -n ~/path/to/your/NCBI/taxdump/ \
          -a ~/path/to/your/NCBI/accession2taxid/nucl_merged.accession2taxid \
          -r realm,kingdom,phylum,class,order,family,genus,species \
          -d \
          --email your_mail@email.xxx
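
The two invocations above differ only in the rank list passed to -r (domain for cellular organisms, realm for viruses). When the tool is called from a script, the command can be assembled programmatically, as in this sketch. It uses only the flags shown in the examples above; `build_gtaxoprop_command` and the two rank constants are hypothetical names, and the paths are the same placeholders as in the examples.

```python
import os
import shlex

CELLULAR_RANKS = ["domain", "kingdom", "phylum", "class", "order", "family", "genus", "species"]
VIRUS_RANKS = ["realm", "kingdom", "phylum", "class", "order", "family", "genus", "species"]


def build_gtaxoprop_command(fasta, output, log, taxdump_dir, accession2taxid, ranks, email):
    """Return the gtaxoprop argument list, expanding any leading ~ in the paths."""
    expand = os.path.expanduser  # needed because no shell expands ~ for us
    return [
        "gtaxoprop",
        "-i", expand(fasta),
        "-o", expand(output),
        "-g", expand(log),
        "-n", expand(taxdump_dir),
        "-a", expand(accession2taxid),
        "-r", ",".join(ranks),
        "-d",  # copied from the examples above; run `gtaxoprop -h` for its meaning
        "--email", email,
    ]


if __name__ == "__main__":
    command = build_gtaxoprop_command(
        "~/path/to/your/your_sequences.fasta",
        "~/path/to/your/your_taxdumps.txt",
        "~/path/to/your/your_execution.log",
        "~/path/to/your/NCBI/taxdump/",
        "~/path/to/your/NCBI/accession2taxid/nucl_merged.accession2taxid",
        VIRUS_RANKS,
        "your_mail@email.xxx",
    )
    # Print the command for inspection; to run it, pass the list to
    # subprocess.run(command, check=True).
    print(shlex.join(command))
```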

Help

To access the help, use:

gtaxoprop -h

Acknowledgments

  • This program is based on entrez_qiime Version 2.0 by Chris Baker (https://github.com/bakerccm/entrez_qiime)
  • Part of this program was presented at the 4th International Conference on Biological Sciences (ICoBioS 2025) (https://www.icobios.org/)
  • This program was developed as part of the research mini-project "In silico metagenomic assessment of aCPSF1 phylogenetic marker for the identification and classification of archaea using publicly available Metagenomic Whole-genome Shotgun Sequencing data", funded internally by Generasi Biologi Indonesia Foundation.

Citation

A dedicated publication for this program is not yet available. For citation purposes, please refer to the following technical report:

Nashrulloh, M.M., Defi, S.A.Z., Rahardi, B., Tamam, Mh. B., Ruhimat, R., & Novita, H. (2025). GTAXOPROP: A utility to generate input files for taxonomy propagation and assignment in QIIME/QIIME2 from the NCBI database (Technical Report No. GBR-TR-BIOMIKA-01/Genbinesia/IX/2025). Generasi Biologi Indonesia Foundation. Gresik, Indonesia.

If you wish to cite this repository, you may use the following APA-style reference entry:

Nashrulloh, M.M., Defi, S.A.Z., Rahardi, B., Tamam, Mh. B., Ruhimat, R., & Novita, H. (2025). GTAXOPROP: A utility to generate input files for taxonomy propagation and assignment in QIIME/QIIME2 from the NCBI database (Version 1.0.post3) [Computer software]. https://gitlab.com/biomikalab/GTAXOPROP

License

This project is licensed under the GNU General Public License v3.0. See the LICENSE file for details.
