A utility to generate input files for taxonomy propagation and assignment in QIIME/QIIME2 from the NCBI database
GTAXOPROP (Genbinesia Taxonomy Propagator)
GTAXOPROP is a utility to generate input files for taxonomy propagation and assignment in QIIME/QIIME2 from the NCBI database. It converts NCBI accession numbers to QIIME/QIIME2-compatible taxonomy files with API fallback.
⚠️ Derivative Work Notice
GTAXOPROP is a derivative work based on entrez_qiime v2.0 by Christopher C. M. Baker.
This version includes substantial modifications and enhancements while maintaining GPL v3 compliance.
Original work: Baker, C.C.M. (2016). entrez_qiime. v2.0. https://github.com/bakerccm/entrez_qiime
Major Enhancements from Original
- ✅ Complete Python 3 migration
- ✅ cogent3 integration (replaced PyCogent)
- ✅ Better NCBI Entrez communication using Biopython
- ✅ Advanced caching with resume capability
- ✅ Batch API processing with rate limiting
- ✅ Improved error handling and logging
- ✅ Enhanced file encoding detection
- ✅ Better taxonomy rank handling
Authors
- Maulana Malik Nashrulloh (Division of Biomics Research, Department of Sciences, Generasi Biologi Indonesia Foundation)
- Sonia Az Zahra Defi (Department of Biology, Faculty of Mathematics and Natural Sciences, Brawijaya University)
- Brian Rahardi (Department of Bioinformatics, Faculty of Mathematics and Natural Sciences, Brawijaya University)
- Muhammad Badrut Tamam (Division of Biomics Research, Department of Sciences, Generasi Biologi Indonesia Foundation & Biology Program, Faculty of Science, Technology, and Education, Muhammadiyah University of Lamongan)
- Riki Ruhimat (Research Center for Applied Microbiology, Research Organization for Life Sciences, National Research and Innovation Agency)
- Hessy Novita (Research Center for Veterinary Science, Research Organization for Health, National Research and Innovation Agency)
Quick Start
Dependencies
Make sure your system has Python >=3.10 and the following packages installed:
- tinydb==4.8.2
- pbr>=6.1.1
- stevedore>=5.5.0
- cogent3>=2025.9.8a2
- biopython>=1.85
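To check an environment against the list above, a small script along these lines may help. It is only a sketch: the helper names are hypothetical, and it reports the installed versions without comparing them to the minimum versions listed.

```python
import sys
from importlib import metadata

# Distribution names copied from the dependency list above.
REQUIRED = ["tinydb", "pbr", "stevedore", "cogent3", "biopython"]


def python_is_supported(version_info=sys.version_info, minimum=(3, 10)):
    """Return True when the interpreter is at least `minimum` (major, minor)."""
    return tuple(version_info[:2]) >= minimum


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    print("Python OK:", python_is_supported())
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name}: {version or 'not installed'}")
```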
Installation
Currently, installation is supported only through pip:
pip install gtaxoprop
Usage
To use this program, you need the NCBI taxdump and accession2taxid data:
- taxdump.tar.gz (https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz). This tarball contains the files that constitute the full NCBI taxonomy database, primarily used for local installations and bioinformatics tools that require taxonomic information.
- nucl_gb.accession2taxid.gz (https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz). This file stores the TaxID mapping for live nucleotide sequence records that are not WGS or TSA.
- nucl_wgs.accession2taxid.gz (https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_wgs.accession2taxid.gz). This file stores the TaxID mapping for live nucleotide sequence records of type WGS or TSA.
The unpacked nucl_gb.accession2taxid and nucl_wgs.accession2taxid files are very large (10 GB+ and 40 GB+ respectively), so manage your disk space accordingly. Alternatively, you may download only one of the two files, but it may not cover all of your data.
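Given the file sizes mentioned above, it may be worth checking free disk space before downloading anything. The following is a minimal sketch using only the standard library; the helper name `has_free_space` is hypothetical, and the 150 GB threshold in the example is an assumption you should adjust to your own setup.

```python
import shutil
from pathlib import Path


def has_free_space(path, required_gb):
    """Return True if the filesystem holding `path` has at least `required_gb` GiB free."""
    usage = shutil.disk_usage(path)
    return usage.free >= required_gb * 1024**3


if __name__ == "__main__":
    home = Path.home()
    # 150 is an assumed safety margin for the downloads and the merged file.
    if has_free_space(home, 150):
        print(f"Enough free space under {home}")
    else:
        print(f"Less than 150 GiB free under {home}")
```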
Assuming you have at least 100-150 GB of free space in your home directory (~, i.e. /home/username/), run the following commands one by one to set up your data:
cd ~
mkdir -p ~/path/to/your/NCBI/taxdump
cd ~/path/to/your/NCBI/taxdump
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
tar -zxvf taxdump.tar.gz
mkdir -p ~/path/to/your/NCBI/accession2taxid
cd ~/path/to/your/NCBI/accession2taxid
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz
gunzip nucl_gb.accession2taxid.gz
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_wgs.accession2taxid.gz
gunzip nucl_wgs.accession2taxid.gz
cp nucl_gb.accession2taxid nucl_merged.accession2taxid
tail -n+2 nucl_wgs.accession2taxid >> nucl_merged.accession2taxid
rm nucl_gb.accession2taxid nucl_wgs.accession2taxid
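The commands above skip the first line of the second file because each accession2taxid file begins with one header row, followed by four tab-separated columns: accession, accession.version, taxid, and gi. As an illustration of how such a file can be queried without loading it into memory, here is a minimal sketch. The function name and the sample rows are made up for the example, and this is not necessarily how GTAXOPROP reads the file.

```python
import io


def lookup_taxids(handle, wanted):
    """Stream an accession2taxid file and return {accession: taxid} for `wanted`.

    Entries in `wanted` may be given with or without a version suffix.
    Header rows (including a stray one left in the middle of a merged
    file) are skipped because their taxid column is not numeric.
    """
    wanted = set(wanted)
    hits = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or not fields[2].isdigit():
            continue
        accession, versioned, taxid = fields[0], fields[1], int(fields[2])
        if versioned in wanted:
            hits[versioned] = taxid
        elif accession in wanted:
            hits[accession] = taxid
    return hits


# Hand-written sample rows in the accession2taxid layout (illustrative only).
SAMPLE = (
    "accession\taccession.version\ttaxid\tgi\n"
    "AB000001\tAB000001.1\t562\t111\n"
    "AB000002\tAB000002.2\t2157\t222\n"
)

if __name__ == "__main__":
    print(lookup_taxids(io.StringIO(SAMPLE), ["AB000002.2", "AB000001"]))
```

For the real merged file, replace the `io.StringIO` object with `open(path, encoding="ascii")` so the file is read line by line.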
For propagating taxonomy of Archaea, Bacteria, and Eukaryota:
gtaxoprop -i ~/path/to/your/your_sequences.fasta \
-o ~/path/to/your/your_taxdumps.txt \
-g ~/path/to/your/your_execution.log \
-n ~/path/to/your/NCBI/taxdump/ \
-a ~/path/to/your/NCBI/accession2taxid/nucl_merged.accession2taxid \
-r domain,kingdom,phylum,class,order,family,genus,species \
-d \
--email your_mail@email.xxx
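Before a long run, it can help to confirm that the FASTA headers carry the accessions you expect. The sketch below assumes that the accession is the first whitespace-delimited token after `>` in each header; that is a common convention for NCBI downloads, but it is an assumption here and not a statement about how GTAXOPROP parses its input. The function name and the sample records are hypothetical.

```python
def accessions_from_fasta(lines):
    """Collect the first token of every FASTA header line, in file order."""
    accessions = []
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        fields = line[1:].split()
        if fields:  # ignore a bare ">" with no identifier
            accessions.append(fields[0])
    return accessions


# Made-up records for illustration.
EXAMPLE = [
    ">AB000001.1 example description\n",
    "ACGTACGT\n",
    ">AB000002.2\n",
    "GGCCGGCC\n",
]

if __name__ == "__main__":
    print(accessions_from_fasta(EXAMPLE))
```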
For propagating taxonomy of viruses:
gtaxoprop -i ~/path/to/your/your_sequences.fasta \
-o ~/path/to/your/your_taxdumps.txt \
-g ~/path/to/your/your_execution.log \
-n ~/path/to/your/NCBI/taxdump/ \
-a ~/path/to/your/NCBI/accession2taxid/nucl_merged.accession2taxid \
-r realm,kingdom,phylum,class,order,family,genus,species \
-d \
--email your_mail@email.xxx
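The rank lists passed with -r in the commands above name the levels to report for each sequence. Conceptually, a lineage is resolved by starting at a TaxID and following parent links in nodes.dmp up to the root, keeping the scientific name from names.dmp at each requested rank. The sketch below illustrates that idea on a few hand-written rows in the taxdump layout (fields separated by tab, pipe, tab). The function names and the "NA" placeholder for missing ranks are assumptions, not GTAXOPROP's actual behavior.

```python
def parse_dmp(lines):
    """Split NCBI .dmp rows into fields, dropping the trailing tab-pipe."""
    for line in lines:
        line = line.rstrip("\n")
        if line.endswith("\t|"):
            line = line[:-2]
        yield line.split("\t|\t")


def load_taxonomy(nodes_lines, names_lines):
    """Build parent, rank, and scientific-name lookups keyed by TaxID."""
    parents, ranks, names = {}, {}, {}
    for fields in parse_dmp(nodes_lines):
        taxid = int(fields[0])
        parents[taxid] = int(fields[1])
        ranks[taxid] = fields[2]
    for fields in parse_dmp(names_lines):
        if fields[3] == "scientific name":
            names[int(fields[0])] = fields[1]
    return parents, ranks, names


def lineage(taxid, wanted_ranks, parents, ranks, names):
    """Walk from `taxid` to the root and return one name per wanted rank."""
    found = {}
    while True:
        rank = ranks.get(taxid)
        if rank in wanted_ranks:
            found[rank] = names.get(taxid, "NA")
        parent = parents.get(taxid)
        if parent is None or parent == taxid:  # unknown node, or the root
            break
        taxid = parent
    return [found.get(rank, "NA") for rank in wanted_ranks]


# Trimmed, hand-written rows for illustration; real files have more columns
# and intermediate nodes.
NODES = [
    "1\t|\t1\t|\tno rank\t|\n",
    "2\t|\t1\t|\tdomain\t|\n",
    "1224\t|\t2\t|\tphylum\t|\n",
    "561\t|\t1224\t|\tgenus\t|\n",
    "562\t|\t561\t|\tspecies\t|\n",
]
NAMES = [
    "1\t|\troot\t|\t\t|\tscientific name\t|\n",
    "2\t|\tBacteria\t|\tBacteria <bacteria>\t|\tscientific name\t|\n",
    "1224\t|\tPseudomonadota\t|\t\t|\tscientific name\t|\n",
    "561\t|\tEscherichia\t|\t\t|\tscientific name\t|\n",
    "562\t|\tEscherichia coli\t|\t\t|\tscientific name\t|\n",
    "562\t|\tE. coli\t|\t\t|\tcommon name\t|\n",
]

if __name__ == "__main__":
    parents, ranks, names = load_taxonomy(NODES, NAMES)
    wanted = ["domain", "kingdom", "phylum", "genus", "species"]
    print(lineage(562, wanted, parents, ranks, names))
```

Ranks that do not occur on the path to the root, such as kingdom in this sample, come back as the placeholder, which keeps every output row at the same number of levels.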
Help
To access the help, use:
gtaxoprop -h
Acknowledgments
- This program is based on entrez_qiime Version 2.0 by Chris Baker (https://github.com/bakerccm/entrez_qiime)
- Part of this program was presented at the 4th International Conference on Biological Sciences (ICoBioS 2025) (https://www.icobios.org/)
- This program was made as part of the research mini-project "In silico metagenomic assessment of aCPSF1 phylogenetic marker for the identification and classification of archaea using publicly available Metagenomic Whole-genome Shotgun Sequencing data", funded internally by Generasi Biologi Indonesia Foundation.
Citation
A dedicated publication for this program is not yet available. For citation purposes, please refer to the following technical report:
Nashrulloh, M.M., Defi, S.A.Z., Rahardi, B., Tamam, Mh. B., Ruhimat, R., & Novita, H. (2025). GTAXOPROP: A utility to generate input files for taxonomy propagation and assignment in QIIME/QIIME2 from the NCBI database (Technical Report No. GBR-TR-BIOMIKA-01/Genbinesia/IX/2025). Generasi Biologi Indonesia Foundation. Gresik, Indonesia.
If you wish to cite this repository, you may use the following APA-style reference entry:
Nashrulloh, M.M., Defi, S.A.Z., Rahardi, B., Tamam, Mh. B., Ruhimat, R., & Novita, H. (2025). GTAXOPROP: A utility to generate input files for taxonomy propagation and assignment in QIIME/QIIME2 from the NCBI database (Version 1.0.post3) [Computer software]. https://gitlab.com/biomikalab/GTAXOPROP
License
This project is licensed under the GNU General Public License v3.0 - See the LICENSE file for details.
Download files
File details
Details for the file gtaxoprop-1.0.post3.tar.gz.
File metadata
- Download URL: gtaxoprop-1.0.post3.tar.gz
- Upload date:
- Size: 14.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 685f1d4007aa63a0610bf71acdb3a0ad16d8d9c175ff986a9d1175e8a2d7be12 |
| MD5 | 6826f8672b746b4ecaabcc213f2ec732 |
| BLAKE2b-256 | 511eeeba2bdbb3064181ac6a84e2a08ea3d48dbf434524bd76b9928051da5543 |
File details
Details for the file gtaxoprop-1.0.post3-py3-none-any.whl.
File metadata
- Download URL: gtaxoprop-1.0.post3-py3-none-any.whl
- Upload date:
- Size: 14.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 25537cdbd7225b90183a69d73dae71aca44fb3cf4896e82ccfd6bd3961ccd547 |
| MD5 | a185b9a4a494a523ed3d7328707c0fda |
| BLAKE2b-256 | 5ad030db01af3c73f42cad114c4b3c789dba860b38aa5849f42f09acc1abb051 |