A tool that converts NCBI taxonomy dump into lineages
NCBItax2lin
Convert the NCBI taxonomy dump from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz into lineages. For example, the lineage for human (tax_id=9606) looks like this:
tax_id | superkingdom | phylum | class | order | family | genus | species | family1 | forma | genus1 | infraclass | infraorder | kingdom | no rank | no rank1 | no rank10 | no rank11 | no rank12 | no rank13 | no rank14 | no rank15 | no rank16 | no rank17 | no rank18 | no rank19 | no rank2 | no rank20 | no rank21 | no rank22 | no rank3 | no rank4 | no rank5 | no rank6 | no rank7 | no rank8 | no rank9 | parvorder | species group | species subgroup | species1 | subclass | subfamily | subgenus | subkingdom | suborder | subphylum | subspecies | subtribe | superclass | superfamily | superorder | superorder1 | superphylum | tribe | varietas |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9606 | Eukaryota | Chordata | Mammalia | Primates | Hominidae | Homo | Homo sapiens | | | | | Simiiformes | Metazoa | cellular organisms | Opisthokonta | Dipnotetrapodomorpha | Tetrapoda | Amniota | Theria | Eutheria | Boreoeutheria | | | | | Eumetazoa | | | | Bilateria | Deuterostomia | Vertebrata | Gnathostomata | Teleostomi | Euteleostomi | Sarcopterygii | Catarrhini | | | | | Homininae | | | Haplorrhini | Craniata | | | | Hominoidea | Euarchontoglires | | | | |
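The general idea of turning a taxonomy dump into a lineage can be sketched as follows. This is a minimal illustration, not the project's actual implementation: it assumes the standard taxdump layout (fields separated by a tab, a pipe, and a tab in nodes.dmp and names.dmp, with the root node being its own parent), and it assumes that repeated ranks get a numeric suffix (no rank, no rank1, ...), which is inferred from the column names in the table above. The function names and the shortened toy tree are hypothetical.

```python
def dmp_line(*fields):
    # Build one taxdump-style line: fields joined by "\t|\t", ending in "\t|".
    return "\t|\t".join(fields) + "\t|\n"

def parse_dmp_line(line):
    # Inverse of dmp_line: strip the trailing "\t|" and split into fields.
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]
    return line.split("\t|\t")

# Toy, heavily shortened tree (illustrative only): tax_id, parent tax_id, rank
NODES = [
    dmp_line("1", "1", "no rank"),
    dmp_line("131567", "1", "no rank"),
    dmp_line("2759", "131567", "superkingdom"),
    dmp_line("33154", "2759", "no rank"),
    dmp_line("9605", "33154", "genus"),
    dmp_line("9606", "9605", "species"),
]
# tax_id, name, unique name, name class
NAMES = [
    dmp_line("1", "root", "", "scientific name"),
    dmp_line("131567", "cellular organisms", "", "scientific name"),
    dmp_line("2759", "Eukaryota", "", "scientific name"),
    dmp_line("33154", "Opisthokonta", "", "scientific name"),
    dmp_line("9605", "Homo", "", "scientific name"),
    dmp_line("9606", "Homo sapiens", "", "scientific name"),
    dmp_line("9606", "human", "", "genbank common name"),
]

parent, rank, name = {}, {}, {}
for fields in map(parse_dmp_line, NODES):
    parent[fields[0]], rank[fields[0]] = fields[1], fields[2]
for fields in map(parse_dmp_line, NAMES):
    if fields[3] == "scientific name":  # ignore synonyms and common names
        name[fields[0]] = fields[1]

def lineage(tax_id):
    # Walk parent links up to the root, then label each ancestor by rank.
    path = []
    while parent[tax_id] != tax_id:
        path.append(tax_id)
        tax_id = parent[tax_id]
    path.reverse()  # root-to-leaf order
    result, seen = {}, {}
    for tid in path:
        count = seen.get(rank[tid], 0)
        key = rank[tid] if count == 0 else "{}{}".format(rank[tid], count)
        seen[rank[tid]] = count + 1
        result[key] = name[tid]
    return result

human = lineage("9606")
```

With this toy tree, `human` maps "no rank" to cellular organisms and "no rank1" to Opisthokonta, mirroring the first two no-rank columns of the table.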
Regenerate the lineages yourself
Regeneration is straightforward, but it may use quite a bit of memory (~20 GB). I generated lineages.csv.gz on a machine with 32 GB of memory. Pull requests that refactor the code to lower its memory usage are welcome. The memory usage is mainly due to the line where the pool.map call takes place.
If you need an updated version but lack the hardware resources, you can also notify me on GitHub, and I will update it for you.
Install
git clone git@github.com:zyxue/ncbitax2lin.git
cd ncbitax2lin/
Set up a virtual environment
Currently, it only works with Python 2.7 and needs pandas, so make sure you are in a suitable virtual environment. If you already have one available, just use that one.
Otherwise, you can create a new one with miniconda/anaconda (recommended),
conda create -y -p venv/ --file env-conda.txt
# or effectively the same
# conda create -y -p venv python=2 pandas
source activate venv/
or with virtualenv + pip
virtualenv venv/
source venv/bin/activate
pip install -r env-pip.txt
Regenerate
Then run the following. This downloads the latest taxdump from NCBI and runs the scripts to regenerate all lineages from it:
make
FAQ
Q: I have a large number of sequences with their corresponding accession numbers from NCBI. How do I get their lineages?
A: First, map the accession numbers (GI is deprecated) to tax IDs using the nucl_*accession2taxid.gz files from ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/. Second, trace each sequence's whole lineage from its tax ID. The tax-id-to-lineage mapping is what NCBItax2lin generates for you, and it is available at ncbitax2lin-lineages.
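The two lookups described in this answer could be sketched like this. It is only an illustration under stated assumptions: the accession2taxid files are taken to be tab-separated with the columns accession, accession.version, taxid and gi, the lineages file is taken to be a CSV keyed by a tax_id column as in the table above, and the accessions shown are made up. The function names are hypothetical. For the real gzipped files, a handle from `gzip.open(path, "rt")` can be passed in place of the in-memory strings.

```python
import csv
import io

def accession_to_taxid(handle, wanted):
    # Stream the (large) accession2taxid table and keep only wanted rows.
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        row["accession.version"]: row["taxid"]
        for row in reader
        if row["accession.version"] in wanted
    }

def taxid_to_lineage(handle, tax_ids):
    # Stream the lineages CSV and keep only rows for the needed tax IDs.
    reader = csv.DictReader(handle)
    return {row["tax_id"]: row for row in reader if row["tax_id"] in tax_ids}

# Tiny in-memory stand-ins for the real files (accessions are made up)
ACC2TAXID_TSV = (
    "accession\taccession.version\ttaxid\tgi\n"
    "ABC00001\tABC00001.1\t9606\t0\n"
    "ABC00002\tABC00002.1\t562\t0\n"
)
LINEAGES_CSV = (
    "tax_id,superkingdom,phylum,class,order,family,genus,species\n"
    "9606,Eukaryota,Chordata,Mammalia,Primates,Hominidae,Homo,Homo sapiens\n"
)

wanted = {"ABC00001.1", "ABC00002.1"}
acc2tax = accession_to_taxid(io.StringIO(ACC2TAXID_TSV), wanted)
lineages = taxid_to_lineage(io.StringIO(LINEAGES_CSV), set(acc2tax.values()))

# Accessions whose tax ID is missing from the lineages file map to None.
result = {acc: lineages.get(tax_id) for acc, tax_id in acc2tax.items()}
```

Filtering while streaming keeps memory use proportional to the number of accessions of interest rather than to the size of the full tables.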
If you have any questions about this project, please feel free to create a new issue.
Note on taxdump.tar.gz.md5
It appears that NCBI periodically regenerates taxdump.tar.gz and taxdump.tar.gz.md5 even when the content is still the same. I am not sure how their regeneration works, but taxdump.tar.gz.md5 will differ simply because of a different timestamp.
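If you want to check whether the taxonomy data itself changed, one option is to hash the files inside the archive rather than the archive as a whole. The sketch below is an assumption about how such a check could be done, not something this project provides; the function names are hypothetical, the toy archives are built in memory, and SHA-256 is used for the content digest.

```python
import hashlib
import io
import tarfile

def make_archive(files, mtime):
    # Build a small .tar.gz in memory, stamping every member with `mtime`.
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for member_name, data in sorted(files.items()):
            info = tarfile.TarInfo(member_name)
            info.size = len(data)
            info.mtime = mtime
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

def content_digest(archive_bytes):
    # Hash member names and contents only, ignoring timestamps and
    # other archive metadata.
    digest = hashlib.sha256()
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        for member in sorted(tar.getmembers(), key=lambda m: m.name):
            if member.isfile():
                digest.update(member.name.encode("utf-8"))
                digest.update(tar.extractfile(member).read())
    return digest.hexdigest()

files = {"nodes.dmp": b"1\t|\t1\t|\tno rank\t|\n"}
old_archive = make_archive(files, mtime=1000000000)
new_archive = make_archive(files, mtime=1500000000)

archives_differ = old_archive != new_archive
contents_match = content_digest(old_archive) == content_digest(new_archive)
```

Here the two archives differ byte for byte because of the member timestamps, while the content digests agree.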
Used in
- Mahmoudabadi, G., & Phillips, R. (2018). A comprehensive and quantitative exploration of thousands of viral genomes. eLife, 7. https://doi.org/10.7554/eLife.31955