A tool that converts NCBI taxonomy dump into lineages
NCBItax2lin
Convert the NCBI taxonomy dump into lineages. For example, the lineage for human (tax_id=9606) looks like this:
| tax_id | superkingdom | phylum | class | order | family | genus | species | family1 | forma | genus1 | infraclass | infraorder | kingdom | no rank | no rank1 | no rank10 | no rank11 | no rank12 | no rank13 | no rank14 | no rank15 | no rank16 | no rank17 | no rank18 | no rank19 | no rank2 | no rank20 | no rank21 | no rank22 | no rank3 | no rank4 | no rank5 | no rank6 | no rank7 | no rank8 | no rank9 | parvorder | species group | species subgroup | species1 | subclass | subfamily | subgenus | subkingdom | suborder | subphylum | subspecies | subtribe | superclass | superfamily | superorder | superorder1 | superphylum | tribe | varietas |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9606 | Eukaryota | Chordata | Mammalia | Primates | Hominidae | Homo | Homo sapiens | | | | | Simiiformes | Metazoa | cellular organisms | Opisthokonta | Dipnotetrapodomorpha | Tetrapoda | Amniota | Theria | Eutheria | Boreoeutheria | | | | | Eumetazoa | | | | Bilateria | Deuterostomia | Vertebrata | Gnathostomata | Teleostomi | Euteleostomi | Sarcopterygii | Catarrhini | | | | | Homininae | | | Haplorrhini | Craniata | | | | Hominoidea | Euarchontoglires | | | | |
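To make concrete what turning the taxonomy dump into a lineage involves, here is a minimal sketch that walks parent pointers from a tax ID up to the root. It is not ncbitax2lin's actual implementation. It assumes the usual taxdump layout: fields separated by tab-pipe-tab, nodes.dmp rows starting with tax_id, parent tax_id and rank, and names.dmp rows holding tax_id, name, unique name and name class. The function names are made up for this example.

```python
from typing import Dict, Iterable, List, Tuple


def parse_dmp_line(line: str) -> List[str]:
    """Split one taxdump row into fields (assumed delimiter: tab, pipe, tab)."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the row terminator
    return line.split("\t|\t")


def load_nodes(lines: Iterable[str]) -> Dict[int, Tuple[int, str]]:
    """Map tax_id -> (parent tax_id, rank) from nodes.dmp rows."""
    nodes = {}
    for line in lines:
        fields = parse_dmp_line(line)
        nodes[int(fields[0])] = (int(fields[1]), fields[2])
    return nodes


def load_scientific_names(lines: Iterable[str]) -> Dict[int, str]:
    """Map tax_id -> scientific name, skipping synonyms and common names."""
    names = {}
    for line in lines:
        fields = parse_dmp_line(line)
        if fields[3] == "scientific name":
            names[int(fields[0])] = fields[1]
    return names


def trace_lineage(
    tax_id: int,
    nodes: Dict[int, Tuple[int, str]],
    names: Dict[int, str],
) -> List[Tuple[str, str]]:
    """Return (rank, name) pairs ordered from the root down to tax_id."""
    lineage = []
    while True:
        parent_id, rank = nodes[tax_id]
        lineage.append((rank, names[tax_id]))
        if parent_id == tax_id:  # the root node is its own parent
            break
        tax_id = parent_id
    return lineage[::-1]
```

With the real files, `load_nodes(open("taxdump/nodes.dmp"))` and `load_scientific_names(open("taxdump/names.dmp"))` would provide the two dictionaries. ncbitax2lin does this for every tax ID at once and writes one column per rank.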
Install
ncbitax2lin supports Python 3.9 through 3.13.
pip install -U ncbitax2lin
Generate lineages
First, download the taxonomy dump from NCBI:
wget -N ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
mkdir -p taxdump && tar zxf taxdump.tar.gz -C ./taxdump
Then run ncbitax2lin:
ncbitax2lin --nodes-file taxdump/nodes.dmp --names-file taxdump/names.dmp
By default, the generated lineages are saved to
ncbi_lineages_[date_of_utcnow].csv.gz. The output path can be changed with the
--output option.
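Once the file is generated, it can be read with the standard library alone. The sketch below streams the gzipped CSV and returns the non-empty ranks for one tax ID. It assumes the file has a header row with a tax_id column followed by one column per rank, as in the example table above. The function name and any file path passed to it are placeholders.

```python
import csv
import gzip
from typing import Dict, Optional


def find_lineage(csv_gz_path: str, tax_id: int) -> Optional[Dict[str, str]]:
    """Return the non-empty rank -> name pairs for one tax_id, or None if absent."""
    with gzip.open(csv_gz_path, mode="rt", newline="") as handle:
        for row in csv.DictReader(handle):
            if row["tax_id"] == str(tax_id):
                # Drop the key column and ranks that are empty for this taxon.
                return {
                    rank: name
                    for rank, name in row.items()
                    if rank != "tax_id" and name
                }
    return None
```

Streaming row by row keeps memory use low, at the cost of rescanning the file for each lookup. For many lookups, loading the rows into a dictionary keyed by tax_id is likely the better trade-off.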
FAQ
Q: I have a large number of sequences with their corresponding accession numbers from NCBI. How do I get their lineages?
A: First, map the accession numbers (GI is deprecated) to tax IDs
using the nucl_*accession2taxid.gz files from
ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/. Second, trace each
sequence's whole lineage from its tax ID. The tax-id-to-lineage mapping is
what NCBItax2lin generates for you.
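The two steps in this answer can be sketched as follows. This is an illustration under assumptions, not part of ncbitax2lin: it assumes the accession2taxid files are tab-separated with a header row containing `accession.version` and `taxid` columns, and that the lineage CSV has a `tax_id` column. All function names are hypothetical.

```python
import csv
import gzip
from typing import Dict, Iterable, Optional


def map_accessions_to_taxids(
    accession2taxid_gz: str, accessions: Iterable[str]
) -> Dict[str, str]:
    """Step 1: scan an accession2taxid file and keep only the wanted accessions."""
    wanted = set(accessions)
    found: Dict[str, str] = {}
    if not wanted:
        return found
    with gzip.open(accession2taxid_gz, mode="rt", newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            version = row["accession.version"]  # assumed column name
            if version in wanted:
                found[version] = row["taxid"]  # assumed column name
                if len(found) == len(wanted):
                    break  # stop early once every accession is resolved
    return found


def lineages_for_taxids(
    lineages_csv_gz: str, tax_ids: Iterable[str]
) -> Dict[str, Dict[str, str]]:
    """Step 2: pull the lineage rows for the given tax IDs from the generated CSV."""
    wanted = set(tax_ids)
    lineages: Dict[str, Dict[str, str]] = {}
    with gzip.open(lineages_csv_gz, mode="rt", newline="") as handle:
        for row in csv.DictReader(handle):
            if row["tax_id"] in wanted:
                lineages[row["tax_id"]] = row
    return lineages


def lineages_for_accessions(
    accession2taxid_gz: str, lineages_csv_gz: str, accessions: Iterable[str]
) -> Dict[str, Optional[Dict[str, str]]]:
    """Combine both steps: accession -> tax ID -> lineage row (None if no lineage)."""
    acc_to_taxid = map_accessions_to_taxids(accession2taxid_gz, accessions)
    by_taxid = lineages_for_taxids(lineages_csv_gz, acc_to_taxid.values())
    return {acc: by_taxid.get(taxid) for acc, taxid in acc_to_taxid.items()}
```

Both files are read as streams, so only the requested accessions and their lineage rows are held in memory. Accessions that do not appear in the mapping file are simply missing from the result.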
If you have any questions about this project, please feel free to create a new issue.
Note on taxdump.tar.gz.md5
It appears that NCBI periodically regenerates taxdump.tar.gz and
taxdump.tar.gz.md5 even when the content is unchanged. I am not sure how
the regeneration works, but taxdump.tar.gz.md5 will differ simply because
of a different timestamp.
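If the goal is to tell whether the taxonomy data itself changed between two downloads, one possible workaround is to hash the archive members rather than the archive file. The sketch below is only an illustration and assumes the difference lies in archive metadata such as timestamps. The function name is hypothetical.

```python
import hashlib
import tarfile


def content_digest(tar_gz_path: str) -> str:
    """Hash member names, sizes and bytes, ignoring timestamps and other metadata."""
    digest = hashlib.sha256()
    with tarfile.open(tar_gz_path, "r:gz") as archive:
        # Sort by name so the result does not depend on member order.
        for member in sorted(archive.getmembers(), key=lambda m: m.name):
            if not member.isfile():
                continue
            digest.update(f"{member.name}\0{member.size}\0".encode("utf-8"))
            extracted = archive.extractfile(member)
            for chunk in iter(lambda: extracted.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()
```

Two downloads whose `content_digest` values match contain the same files with the same bytes, even if the archives and their published MD5 sums differ.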
Used in
- Mahmoudabadi, G., & Phillips, R. (2018). A comprehensive and quantitative exploration of thousands of viral genomes. ELife, 7. https://doi.org/10.7554/eLife.31955
- Dombrowski, N. et al. (2020) Undinarchaeota illuminate DPANN phylogeny and the impact of gene transfer on archaeal evolution, Nature Communications. Springer US, 11(1). doi: 10.1038/s41467-020-17408-w. https://www.nature.com/articles/s41467-020-17408-w
- Schenberger Santos, A. R. et al. (2020) NAD+ biosynthesis in bacteria is controlled by global carbon/nitrogen levels via PII signaling, Journal of Biological Chemistry, 295(18), pp. 6165–6176. doi: 10.1074/jbc.RA120.012793. https://www.sciencedirect.com/science/article/pii/S0021925817482433
- Villada, J. C., Duran, M. F. and Lee, P. K. H. (2020) Interplay between Position-Dependent Codon Usage Bias and Hydrogen Bonding at the 5' End of ORFeomes, mSystems, 5(4), pp. 1–18. doi: 10.1128/msystems.00613-20. https://msystems.asm.org/content/5/4/e00613-20
- Byadgi, O. et al. (2020) Transcriptome analysis of Amyloodinium ocellatum tomonts revealed basic information on the major potential virulence factors, Genes, 11(11), pp. 1–12. doi: 10.3390/genes11111252. https://www.mdpi.com/2073-4425/11/11/1252
Development
Install dependencies
poetry install --sync
Testing
make format
make all
Publish (only for administrator)
poetry version [minor/major etc.]
poetry publish --build -u __token__ --password pypi-<token-from-pypi>
File details
Details for the file ncbitax2lin-3.0.0.tar.gz.
File metadata
- Download URL: ncbitax2lin-3.0.0.tar.gz
- Upload date:
- Size: 10.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.4 CPython/3.12.7 Darwin/21.6.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fa57570acc72a4846c76e36002a7222c2b9d10fff9685f36ddf022c8f02d4944 |
| MD5 | 98cd018e0a6a11ede6fed8ae47d73361 |
| BLAKE2b-256 | 89b5ade4f767f55a51f906cf16007ab0c847994fef5c5796e02d105a2ab09646 |
File details
Details for the file ncbitax2lin-3.0.0-py3-none-any.whl.
File metadata
- Download URL: ncbitax2lin-3.0.0-py3-none-any.whl
- Upload date:
- Size: 11.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.4 CPython/3.12.7 Darwin/21.6.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 86b8fe36ea730ff6bc6aeddd9154f6cd833d2c6e0da67c8cb2b49f59e9c890e4 |
| MD5 | 1b13aaaef236a3d85c3b83869d52a2e2 |
| BLAKE2b-256 | 399546bbbace76f6cce35255d405c7ad1b4c300c16fa3402d596e7e0e5ef98e8 |