Taxoniq: Taxon Information Query - fast, offline querying of NCBI Taxonomy and related data

Taxoniq is a Python and command-line interface to the NCBI Taxonomy database and selected data sources that cross-reference it.

Taxoniq's features include:

Pre-computed indexes updated monthly from NCBI, WoL and cross-referenced databases
Offline operation: all indexes are bundled with the package; no network calls are made when querying taxon information (separately, Taxoniq can fetch the nucleotide or protein sequences over the network given a taxon or accession - see Retrieving sequences below)
A CLI capable of JSON I/O, batch processing and streaming of inputs for ease of use and pipelining in shell scripts
A stable, well-documented, type-hinted Python API (Python 3.6 and higher is supported)
Comprehensive testing and continuous integration
An intuitive interface with useful defaults
Compactness, readability, and extensibility

The Taxoniq package bundles an indexed, compressed copy of the NCBI taxonomy database files, the NCBI RefSeq nucleotide and protein accessions associated with each taxon, the WoL kingdom-wide phylogenetic distance database, and relevant information from other databases. Accessions that appear in the NCBI RefSeq BLAST databases are indexed, so given a taxon ID, accession ID, or taxon name you can quickly retrieve the taxon's rank, lineage, description, citations, representative RefSeq IDs, LCA information, evolutionary distance, and more, as described in the Cookbook section below.

Installation

pip3 install taxoniq

Synopsis

import taxoniq

t = taxoniq.Taxon(9606)
assert t.scientific_name == "Homo sapiens"
assert t.common_name == "human"
assert t.ranked_lineage == [taxoniq.Taxon(scientific_name='Homo sapiens'),
                            taxoniq.Taxon(scientific_name='Homo'),
                            taxoniq.Taxon(scientific_name='Hominidae'),
                            taxoniq.Taxon(scientific_name='Primates'),
                            taxoniq.Taxon(scientific_name='Mammalia'),
                            taxoniq.Taxon(scientific_name='Chordata'),
                            taxoniq.Taxon(scientific_name='Metazoa'),
                            taxoniq.Taxon(scientific_name='Eukaryota')]

t2 = taxoniq.Taxon(accession_id="NC_000913.3")
assert t2 == taxoniq.Taxon(scientific_name="Escherichia coli str. K-12 substr. MG1655")
assert t2.parent.parent.common_name == "E. coli"
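
The ranked lineage in the synopsis is ordered from the taxon itself up to the root. To illustrate what a lowest common ancestor (LCA) of two taxa means in terms of such lineages, here is a minimal pure-Python sketch. It does not use or reproduce Taxoniq's own LCA support: the function name lowest_common_ancestor is hypothetical, and the mouse lineage is written out by hand for illustration rather than queried from the package.

```python
from typing import Optional, Sequence


def lowest_common_ancestor(lineage_a: Sequence[str],
                           lineage_b: Sequence[str]) -> Optional[str]:
    """Return the most specific name shared by two lineages, or None.

    Both lineages are assumed to be ordered from the taxon itself up to
    the root, like the ranked_lineage list in the synopsis.
    """
    ancestors_b = set(lineage_b)
    # Walk lineage_a from most specific to least specific; the first
    # name that also occurs in lineage_b is the lowest common ancestor.
    for name in lineage_a:
        if name in ancestors_b:
            return name
    return None


# Scientific names only; the human lineage matches the synopsis above.
human = ["Homo sapiens", "Homo", "Hominidae", "Primates",
         "Mammalia", "Chordata", "Metazoa", "Eukaryota"]
# Hand-written illustrative lineage (assumption, not package output).
mouse = ["Mus musculus", "Mus", "Muridae", "Rodentia",
         "Mammalia", "Chordata", "Metazoa", "Eukaryota"]

print(lowest_common_ancestor(human, mouse))  # Mammalia
```

Comparing names keeps the sketch self-contained; with real data, comparing taxon IDs would be safer because scientific names are not guaranteed to be unique.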

Retrieving sequences

Mirrors of the NCBI BLAST databases are maintained on AWS S3 (s3://ncbi-blast-databases) and Google Storage (gs://blast-db). These mirrors are a key resource: S3 and GS offer higher bandwidth and throughput than the NCBI FTP server, and they support range requests, so individual sequences can be retrieved from the database files without downloading and storing a copy of the whole database.
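
To make the idea of a range request concrete, the following sketch builds (but does not send) an HTTP request for a byte range using only the Python standard library. It is an illustration of the general HTTP mechanism, not Taxoniq's internal code: the helper name build_range_request, the URL, and the offset and length values are all hypothetical.

```python
import urllib.request


def build_range_request(url: str, offset: int, length: int) -> urllib.request.Request:
    """Build a GET request for `length` bytes starting at byte `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # HTTP byte ranges are inclusive on both ends, hence the "- 1".
    last_byte = offset + length - 1
    return urllib.request.Request(
        url, headers={"Range": f"bytes={offset}-{last_byte}"}
    )


# Hypothetical object URL and byte positions, for illustration only.
request = build_range_request(
    "https://example.com/blast-db/some_volume.nsq", offset=1024, length=4096
)
print(request.get_header("Range"))  # bytes=1024-5119
```

Passing such a request to urllib.request.urlopen would transfer only the requested bytes; a server that honors the range answers with status 206 (Partial Content).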

The Taxoniq PyPI distribution (the package you install using pip3 install taxoniq) indexes accessions for the following NCBI BLAST databases:

Refseq viruses representative genomes (ref_viruses_rep_genomes) (nucleotide)
Refseq prokaryote representative genomes (contains refseq assembly) (ref_prok_rep_genomes) (nucleotide)
RefSeq Eukaryotic Representative Genome Database (ref_euk_rep_genomes) (nucleotide)
Betacoronavirus (nucleotide)

Given an accession ID, Taxoniq can issue a single HTTP request and return a file-like object streaming the nucleotide sequence for this accession from the S3 or GS mirror as follows:

with taxoniq.Accession("NC_000913.3").get_from_s3() as fh:
    seq = fh.read()

To retrieve many sequences quickly, you may want to use a thread pool to open multiple network connections at once:

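from concurrent.futures import ThreadPoolExecutor

import taxoniq

def fetch_seq(accession_id):
    accession = taxoniq.Accession(accession_id)
    # Close the stream once the sequence has been read
    with accession.get_from_s3() as fh:
        seq = fh.read()
    return (accession, seq)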

taxon = taxoniq.Taxon(scientific_name="Apis mellifera")
for accession, seq in ThreadPoolExecutor().map(fetch_seq, taxon.refseq_representative_genome_accessions):
    print(accession, len(seq))
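
When fetching many sequences over the network, individual requests can fail. The sketch below shows one possible way to cap the number of concurrent connections and collect failures instead of aborting on the first one. It is not part of Taxoniq: fetch_all and fake_fetch are hypothetical names, and fake_fetch is a stand-in for a real fetch function such as fetch_seq so that the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Tuple


def fetch_all(ids: Iterable[str],
              fetch: Callable[[str], bytes],
              max_workers: int = 8) -> Tuple[Dict[str, bytes], Dict[str, Exception]]:
    """Run `fetch` for every ID in a bounded thread pool.

    Returns (results, errors): both are keyed by ID, so a failed fetch
    does not prevent the other results from being collected.
    """
    results: Dict[str, bytes] = {}
    errors: Dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, item): item for item in ids}
        for future in as_completed(futures):
            item = futures[future]
            try:
                results[item] = future.result()
            except Exception as exc:  # record the failure and keep going
                errors[item] = exc
    return results, errors


def fake_fetch(accession_id: str) -> bytes:
    """Offline stand-in for a network fetch (illustration only)."""
    if accession_id == "BAD":
        raise IOError("simulated network failure")
    return b"ACGT" * len(accession_id)


results, errors = fetch_all(["NC_000913.3", "BAD"], fake_fetch, max_workers=2)
print(sorted(results), sorted(errors))
```

A suitable max_workers value depends on your network and on the size of the sequences; the default of 8 here is an arbitrary choice.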

Using the nr/nt databases

In progress

Cookbook

In progress

Bugs

Please report bugs, issues, feature requests, etc. on GitHub.

Release history

1.0.3

Sep 18, 2024

1.0.2

Sep 17, 2024

1.0.1

Nov 19, 2023

1.0.0

Nov 19, 2023

0.6.0

Apr 20, 2021

0.5.2

Apr 14, 2021

0.5.1

Apr 14, 2021

0.5.0

Apr 13, 2021

0.4.0

Apr 2, 2021

0.3.4

Apr 2, 2021

0.3.0

Mar 27, 2021

0.2.0

Mar 26, 2021

0.1.7

Mar 25, 2021

0.1.6

Mar 25, 2021

0.1.5

Mar 25, 2021

0.1.4

Mar 25, 2021

0.1.3

Mar 24, 2021

0.1.2

Mar 24, 2021

0.0.8

Feb 22, 2021

0.0.7

Feb 19, 2021

0.0.6

Feb 19, 2021

This version

0.0.5

Feb 15, 2021

0.0.3

Feb 7, 2021
