A tool that converts NCBI taxonomy dump into lineages

Project description

NCBItax2lin

Convert the NCBI taxonomy dump into lineages. For example, the lineage for human (tax_id=9606) looks like this:

tax_id superkingdom phylum class order family genus species family1 forma genus1 infraclass infraorder kingdom no rank no rank1 no rank10 no rank11 no rank12 no rank13 no rank14 no rank15 no rank16 no rank17 no rank18 no rank19 no rank2 no rank20 no rank21 no rank22 no rank3 no rank4 no rank5 no rank6 no rank7 no rank8 no rank9 parvorder species group species subgroup species1 subclass subfamily subgenus subkingdom suborder subphylum subspecies subtribe superclass superfamily superorder superorder1 superphylum tribe varietas
9606 Eukaryota Chordata Mammalia Primates Hominidae Homo Homo sapiens Simiiformes Metazoa cellular organisms Opisthokonta Dipnotetrapodomorpha Tetrapoda Amniota Theria Eutheria Boreoeutheria Eumetazoa Bilateria Deuterostomia Vertebrata Gnathostomata Teleostomi Euteleostomi Sarcopterygii Catarrhini Homininae Haplorrhini Craniata Hominoidea Euarchontoglires

Install

ncbitax2lin supports Python 3.9 through 3.13.

pip install -U ncbitax2lin

Generate lineages

First, download the taxonomy dump from NCBI:

wget -N ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
mkdir -p taxdump && tar zxf taxdump.tar.gz -C ./taxdump

Then run ncbitax2lin:

ncbitax2lin --nodes-file taxdump/nodes.dmp --names-file taxdump/names.dmp

By default, the generated lineages are saved to ncbi_lineages_[date_of_utcnow].csv.gz. The output path can be overridden with the --output option.
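
Once the file exists, a single lineage can be read back with the Python standard library. The sketch below is not part of NCBItax2lin; the helper name lookup_lineage is hypothetical, and it assumes the output is a gzip-compressed, comma-separated file whose header row starts with tax_id, as in the example at the top of this page.

```python
import csv
import gzip


def lookup_lineage(lineages_path, tax_id):
    """Return {rank: name} for one tax ID, or None if the ID is not in the file.

    Assumes a gzip-compressed CSV with a header row containing a "tax_id" column.
    Ranks with no value for this tax ID are skipped.
    """
    with gzip.open(lineages_path, mode="rt", newline="") as handle:
        for row in csv.DictReader(handle):
            if row["tax_id"] == str(tax_id):
                return {
                    rank: name
                    for rank, name in row.items()
                    if rank != "tax_id" and name
                }
    return None
```

Calling lookup_lineage with the path of the generated file and 9606 would return the non-empty ranks for human. The file is scanned line by line, so memory use stays small even though the full lineage table is large.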

FAQ

Q: I have a large number of sequences with their corresponding accession numbers from NCBI. How do I get their lineages?

A: First, map the accession numbers (GI numbers are deprecated) to tax IDs using the nucl_*accession2taxid.gz files from ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/. Second, trace each sequence's full lineage from its tax ID. The tax-ID-to-lineage mapping is what NCBItax2lin generates for you.
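
The two steps in this answer can be sketched in Python as follows. This is an illustration under stated assumptions rather than code shipped with NCBItax2lin: all three function names are hypothetical, the accession2taxid file is assumed to be gzip-compressed and tab-separated with a header row containing accession.version and taxid columns, and the lineage file is assumed to be a gzip-compressed CSV with a tax_id column.

```python
import csv
import gzip


def accessions_to_taxids(accession2taxid_path, accessions):
    """Step 1: stream the mapping file once and return {accession.version: taxid}."""
    wanted = set(accessions)
    found = {}
    with gzip.open(accession2taxid_path, mode="rt", newline="") as handle:
        # Assumed layout: tab-separated, header includes accession.version and taxid.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            accession = row["accession.version"]
            if accession in wanted:
                found[accession] = row["taxid"]
                if len(found) == len(wanted):
                    break  # every requested accession has been resolved
    return found


def lineages_for_taxids(lineages_path, tax_ids):
    """Step 2: load only the lineage rows whose tax_id is needed."""
    wanted = {str(tax_id) for tax_id in tax_ids}
    with gzip.open(lineages_path, mode="rt", newline="") as handle:
        return {
            row["tax_id"]: row
            for row in csv.DictReader(handle)
            if row["tax_id"] in wanted
        }


def accession_lineages(accession2taxid_path, lineages_path, accessions):
    """Combine both steps; an accession whose tax ID has no lineage row maps to None."""
    taxids = accessions_to_taxids(accession2taxid_path, accessions)
    lineages = lineages_for_taxids(lineages_path, taxids.values())
    return {acc: lineages.get(taxid) for acc, taxid in taxids.items()}
```

Both files are streamed rather than loaded whole, which matters because the accession2taxid files are large. Accessions that never appear in the mapping file are simply absent from the result.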

If you have any questions about this project, please feel free to create a new issue.

Note on taxdump.tar.gz.md5

It appears that NCBI periodically regenerates taxdump.tar.gz and taxdump.tar.gz.md5 even when the content is unchanged. I am not sure how the regeneration works, but taxdump.tar.gz.md5 will differ simply because of a different timestamp.
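
If the goal is to tell whether the dump's content actually changed, one possible workaround is to hash the files inside the archive instead of relying on the archive-level MD5. The sketch below is an assumption-laden illustration, not something NCBItax2lin provides: the function name taxdump_content_digest is hypothetical, and it treats two archives as equal when their regular files have the same names and bytes, ignoring timestamps and other tar metadata.

```python
import hashlib
import tarfile


def taxdump_content_digest(archive_path):
    """Return a SHA-256 hex digest over member names and contents of a tar.gz.

    Timestamps, ownership, and gzip header fields do not affect the result.
    """
    per_member = []
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, and other non-regular entries
            member_hash = hashlib.sha256()
            extracted = archive.extractfile(member)
            for chunk in iter(lambda: extracted.read(1 << 20), b""):
                member_hash.update(chunk)
            per_member.append((member.name, member_hash.hexdigest()))

    # Sort so that the order of members inside the archive does not matter.
    combined = hashlib.sha256()
    for name, hex_digest in sorted(per_member):
        combined.update(f"{name}\0{hex_digest}\n".encode("utf-8"))
    return combined.hexdigest()
```

Comparing this digest between two downloads of taxdump.tar.gz would show whether any contained file changed, regardless of when the archive was regenerated.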

Used in

  • Mahmoudabadi, G., & Phillips, R. (2018). A comprehensive and quantitative exploration of thousands of viral genomes. ELife, 7. https://doi.org/10.7554/eLife.31955
  • Dombrowski, N. et al. (2020) Undinarchaeota illuminate DPANN phylogeny and the impact of gene transfer on archaeal evolution, Nature Communications. Springer US, 11(1). doi: 10.1038/s41467-020-17408-w. https://www.nature.com/articles/s41467-020-17408-w
  • Schenberger Santos, A. R. et al. (2020) NAD+ biosynthesis in bacteria is controlled by global carbon/nitrogen levels via PII signaling, Journal of Biological Chemistry, 295(18), pp. 6165–6176. doi: 10.1074/jbc.RA120.012793. https://www.sciencedirect.com/science/article/pii/S0021925817482433
  • Villada, J. C., Duran, M. F. and Lee, P. K. H. (2020) Interplay between Position-Dependent Codon Usage Bias and Hydrogen Bonding at the 5' End of ORFeomes, mSystems, 5(4), pp. 1–18. doi: 10.1128/msystems.00613-20. https://msystems.asm.org/content/5/4/e00613-20
  • Byadgi, O. et al. (2020) Transcriptome analysis of Amyloodinium ocellatum tomonts revealed basic information on the major potential virulence factors, Genes, 11(11), pp. 1–12. doi: 10.3390/genes11111252. https://www.mdpi.com/2073-4425/11/11/1252

Development

Install dependencies

poetry install --sync

Testing

make format
make all

Publish (only for administrator)

poetry version [minor/major etc.]
poetry publish --build -u __token__ --password pypi-<token-from-pypi>
