Python package to download and filter GenBank database based on taxonomy

Project description

Taxonize_genbank

Taxonize_genbank is a Python package for downloading, filtering, and curating GenBank's non-redundant protein (nr) and nucleotide (nt) databases by taxonomy ID (TaxID) and/or keyword. It is aimed at researchers who need to extract taxon-specific subsets from these very large databases.

The tool supports advanced filtering options, allowing users to:

Include Multiple TaxIDs: Extract sequences associated with multiple taxa by specifying a list of TaxIDs.
Exclude Specific TaxIDs: Remove unwanted or contaminant taxa from the dataset during filtering.

These features make Taxonize_genbank highly flexible and customizable for a variety of research applications.
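The include/exclude options above can be pictured as set arithmetic over clades of the taxonomy tree. The following is a minimal sketch of that idea, not the package's actual implementation: the `PARENT` map, its TaxIDs, and the helper names `descendants` and `select_taxids` are all hypothetical, and it assumes that selecting a TaxID also selects every taxon below it.

```python
from collections import defaultdict

# Toy child -> parent map standing in for the NCBI taxonomy (hypothetical IDs).
PARENT = {
    2: 1,  # 2 and 3 sit directly under the root (1)
    3: 1,
    4: 2,
    5: 2,
    6: 4,
    7: 3,
}


def descendants(root, parent_map):
    """Return root plus every TaxID below it in the tree."""
    children = defaultdict(list)
    for child, parent in parent_map.items():
        children[parent].append(child)
    found, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(children[node])
    return found


def select_taxids(include, exclude, parent_map):
    """Union of the included clades minus the union of the excluded clades."""
    keep = set()
    for taxid in include:
        keep |= descendants(taxid, parent_map)
    for taxid in exclude:
        keep -= descendants(taxid, parent_map)
    return keep


selected = select_taxids(include=[2, 3], exclude=[4], parent_map=PARENT)
print(sorted(selected))
```

Here clade 4 (which contains 6) is removed from the union of clades 2 and 3, so only sequences assigned to the remaining TaxIDs would be kept.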

Features

Download NCBI databases, including taxonomy, protein, and nucleotide datasets.
Filter GenBank's non-redundant databases (nr or nt) by taxonomy ID or keywords.
Retrieve taxonomic lineages for FASTA accessions.
Support for advanced filtering using multiple TaxIDs or excluding specific TaxIDs.

Installation

Prerequisites

Ensure you have the following dependencies installed:

Python 3.7 or higher
Biopython: 1.81
tqdm: 4.64.1
ete3: 3.1.3
networkx: 2.6.3
six: 1.16.0
isal: 1.7.1
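To see which of these dependencies are already present in an environment, a short check along the following lines may help. It is a sketch under two assumptions: that the distribution names on PyPI match the lowercase names used here, and that `importlib.metadata` is available (Python 3.8 or later, so slightly newer than the minimum stated above). The README does not say whether other versions also work, so a "differs" result is not necessarily an error.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed in the prerequisites; distribution names are assumed
# to match the names used on PyPI.
LISTED = {
    "biopython": "1.81",
    "tqdm": "4.64.1",
    "ete3": "3.1.3",
    "networkx": "2.6.3",
    "six": "1.16.0",
    "isal": "1.7.1",
}


def check_versions(listed):
    """Return {name: (listed_version, installed_version_or_None)}."""
    report = {}
    for name, expected in listed.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # distribution is not installed
        report[name] = (expected, installed)
    return report


for name, (expected, installed) in check_versions(LISTED).items():
    status = "ok" if installed == expected else "differs"
    print(f"{name}: listed {expected}, found {installed} ({status})")
```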

Installation Steps

Clone the Repository
Clone the GitHub repository to your local machine:

git clone https://github.com/msabrysarhan/taxonize_genbank

Install via pip (Recommended)
Alternatively, install taxonize-gb directly from PyPI using pip, without cloning the repository:

pip install taxonize-gb

Usage

The Taxonize_genbank package includes three main modules:

1. `get_db.py`: Download NCBI Databases

This module allows you to download NCBI databases required for filtering.

Command:

get_db.py --db_name <DB_NAME> --out <OUTPUT_DIRECTORY>

Options:

--db_name: Database to download (e.g., taxdb, nr, nt, prot_acc2taxid).
--out: Path to the output directory where the database will be stored.

Example:

Download the non-redundant protein database (nr):

get_db.py --db_name nr --out databases/
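When several databases are needed, the same call can be repeated from a small Python wrapper. The sketch below only uses the two flags documented above; the helper names `get_db_command` and `download_all` are hypothetical, and it defaults to a dry run that prints the commands instead of starting large downloads.

```python
import shlex
import subprocess


def get_db_command(db_name, out_dir):
    """Build the argument list for one get_db.py call."""
    return ["get_db.py", "--db_name", db_name, "--out", out_dir]


def download_all(db_names, out_dir, dry_run=True):
    """Print each command; run it only when dry_run is False."""
    commands = [get_db_command(name, out_dir) for name in db_names]
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops at the first failed download.
            subprocess.run(cmd, check=True)
    return commands


download_all(["nr", "prot_acc2taxid", "pdb_acc2taxid", "taxdb"], "databases/")
```

Passing `dry_run=False` would execute the commands, which assumes `get_db.py` is on the `PATH` after installation.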

2. `taxonize_gb.py`: Filter Databases by TaxID or Keywords

This module filters the downloaded database based on taxonomy ID or keywords.

Command:

taxonize_gb.py --db <DB> --db_path <DB_PATH> [OPTIONS] --out <OUTPUT_DIRECTORY>

Required Arguments:

--db: Specify the database type (nt for nucleotide or nr for protein).
--db_path: Path to the gzipped FASTA file. If omitted, the database is downloaded automatically.
--out: Path to the output directory.

Optional Arguments:

--taxid: Target taxonomy ID to filter for.
--keywords: Keywords to include in FASTA headers.
Additional arguments for the taxonomy dump (--taxdb) and the accession-to-TaxID mapping files (--prot_acc2taxid, --pdb_acc2taxid, --nucl_gb_acc2taxid, --nucl_wgs_acc2taxid, etc.).

Example:

Filter the non-redundant protein database (nr) for plant proteins (TaxID: 33090):

taxonize_gb.py --db nr --db_path databases/nr.gz \
  --taxdb databases/taxdump.tar.gz \
  --prot_acc2taxid databases/prot.accession2taxid.gz \
  --pdb_acc2taxid databases/pdb.accession2taxid.gz \
  --taxid 33090 \
  --out plant_nr/
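For intuition about what keyword filtering on FASTA headers involves, here is a minimal standard-library sketch. It is not the code used by `taxonize_gb.py`: the function name `filter_fasta_by_keywords` is hypothetical, the records are made up, and case-insensitive substring matching is an assumption rather than documented behaviour of `--keywords`.

```python
import gzip
import io


def filter_fasta_by_keywords(lines, keywords):
    """Yield the lines of records whose header contains any keyword."""
    wanted = [k.lower() for k in keywords]
    keep = False
    for line in lines:
        if line.startswith(">"):
            # A new record starts: decide once whether to keep it.
            header = line.lower()
            keep = any(k in header for k in wanted)
        if keep:
            yield line


# Tiny in-memory gzipped FASTA standing in for nr.gz (made-up records).
raw = (
    ">ACC00001.1 hypothetical chlorophyll binding protein [Example plant]\n"
    "MKTAYIAK\n"
    ">ACC00002.1 hypothetical kinase [Example insect]\n"
    "MSDNE\n"
)
blob = gzip.compress(raw.encode())

with gzip.open(io.BytesIO(blob), "rt") as handle:
    kept = list(filter_fasta_by_keywords(handle, ["chlorophyll"]))

print("".join(kept), end="")
```

Streaming line by line keeps memory use flat, which matters for files the size of nr or nt.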

3. `get_taxonomy.py`: Retrieve Taxonomic Lineages

This module extracts taxonomic lineages from GenBank FASTA files.

Command:

get_taxonomy.py --fasta <FASTA_FILE> --map <MAPPING_FILE> --out <OUTPUT_FILE>

Options:

--fasta: Path to the FASTA file.
--map: Path to the mapping file (e.g., accession-to-taxonomy mapping).
--out: Path to save the output file.

Example:

Retrieve taxonomic lineages from a FASTA file:

get_taxonomy.py --fasta input.fasta \
  --map databases/prot.accession2taxid.gz \
  --out taxonomy_lineages.txt
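Conceptually, lineage retrieval combines two lookups: accession to TaxID through the mapping file, then TaxID to lineage by walking up the taxonomy tree. The sketch below illustrates this with made-up data and hypothetical helper names (`load_mapping`, `lineage`); it is not how `get_taxonomy.py` is implemented. It assumes the four-column, tab-separated layout of NCBI accession2taxid files (accession, accession.version, taxid, gi).

```python
import csv
import io

# Made-up rows in the four-column layout of NCBI accession2taxid files.
MAPPING = (
    "accession\taccession.version\ttaxid\tgi\n"
    "ACC00001\tACC00001.1\t30\t0\n"
    "ACC00002\tACC00002.1\t40\t0\n"
)

# Toy taxonomy: taxid -> (parent taxid, name); the root is its own parent.
NODES = {
    1: (1, "root"),
    10: (1, "Kingdom A"),
    30: (10, "Genus B"),
    40: (1, "Kingdom C"),
}


def load_mapping(handle):
    """Map accession.version -> taxid from a tab-separated mapping file."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["accession.version"]: int(row["taxid"]) for row in reader}


def lineage(taxid, nodes):
    """Walk parent links up to the root and return names root-first."""
    names = []
    while True:
        parent, name = nodes[taxid]
        names.append(name)
        if parent == taxid:  # reached the root
            break
        taxid = parent
    return list(reversed(names))


acc2taxid = load_mapping(io.StringIO(MAPPING))
headers = [">ACC00001.1 some protein", ">ACC00002.1 another protein"]
for header in headers:
    accession = header[1:].split()[0]
    print(accession, "; ".join(lineage(acc2taxid[accession], NODES)))
```

A real mapping file has far too many rows to hold comfortably in a dictionary, so a practical version would first collect the accessions present in the FASTA file and keep only the matching rows.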

Examples

Example 1: Plant Non-redundant Protein Database

Download required files:

get_db.py --db_name nr --out databases/
get_db.py --db_name prot_acc2taxid --out databases/
get_db.py --db_name pdb_acc2taxid --out databases/
get_db.py --db_name taxdb --out databases/

Filter for plant proteins (TaxID: 33090):

taxonize_gb.py --db nr \
  --db_path databases/nr.gz \
  --taxdb databases/taxdump.tar.gz \
  --prot_acc2taxid databases/prot.accession2taxid.gz \
  --pdb_acc2taxid databases/pdb.accession2taxid.gz \
  --taxid 33090 \
  --out plant_nr/

Example 2: Insect Non-redundant Nucleotide Database

Download required files:

get_db.py --db_name nt --out databases/
get_db.py --db_name nucl_gb_acc2taxid --out databases/
get_db.py --db_name nucl_wgs_acc2taxid --out databases/
get_db.py --db_name taxdb --out databases/

Filter for insect nucleotides (TaxID: 50557):

taxonize_gb.py --db nt \
  --db_path databases/nt.gz \
  --taxdb databases/taxdump.tar.gz \
  --nucl_gb_acc2taxid databases/nucl_gb.accession2taxid.gz \
  --nucl_wgs_acc2taxid databases/nucl_wgs.accession2taxid.gz \
  --taxid 50557 \
  --out insect_nt/
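Because a filter run over nt or nr can take a long time, it may be worth assembling the command in a script and checking that the input files exist before launching it. The following sketch does this for Example 2. The helpers `taxonize_command` and `missing_inputs` are hypothetical; only the flags already shown in the examples above are used.

```python
import shlex
from pathlib import Path


def taxonize_command(db, db_path, taxid, out_dir, **mapping_files):
    """Build the taxonize_gb.py argument list; extra keywords become --flags."""
    cmd = ["taxonize_gb.py", "--db", db, "--db_path", str(db_path)]
    for flag, path in mapping_files.items():
        cmd += [f"--{flag}", str(path)]
    cmd += ["--taxid", str(taxid), "--out", str(out_dir)]
    return cmd


def missing_inputs(cmd):
    """Return the .gz arguments in cmd that do not exist on disk."""
    return [arg for arg in cmd if arg.endswith(".gz") and not Path(arg).exists()]


cmd = taxonize_command(
    "nt", "databases/nt.gz", 50557, "insect_nt/",
    taxdb="databases/taxdump.tar.gz",
    nucl_gb_acc2taxid="databases/nucl_gb.accession2taxid.gz",
    nucl_wgs_acc2taxid="databases/nucl_wgs.accession2taxid.gz",
)
print(shlex.join(cmd))
print("missing:", missing_inputs(cmd))
```

If the list of missing files is empty, the command can be passed to `subprocess.run(cmd, check=True)`.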

License

This project is licensed under the MIT License.
See the LICENSE file for full details.

Project details

Release history

1.1.16 (this version): Mar 4, 2025
1.1.6: Mar 3, 2025
1.1.5: Feb 14, 2025
1.1.4: Feb 14, 2025
1.1.3: Feb 14, 2025
1.1.2: Feb 13, 2025
1.1.1: Feb 13, 2025
1.1.0: Feb 13, 2025
1.0.14: Sep 16, 2023
1.0.13: Sep 15, 2023
1.0.12: Sep 14, 2023
1.0.11: Sep 13, 2023
1.0.10: Sep 13, 2023
1.0.9: Sep 13, 2023
1.0.8: Sep 13, 2023
1.0.7: Sep 13, 2023
1.0.6: Sep 13, 2023
1.0.5: Sep 13, 2023
1.0.4: Sep 13, 2023
1.0.3: Sep 13, 2023
1.0.2: Sep 13, 2023
1.0.1: Sep 13, 2023
1.0.0: Sep 12, 2023

Download files

Source Distribution

taxonize_gb-1.1.16.tar.gz (15.2 kB)

Uploaded Mar 4, 2025 Source

Built Distribution

taxonize_gb-1.1.16-py3-none-any.whl (21.6 kB)

Uploaded Mar 4, 2025 Python 3

File details

Details for the file taxonize_gb-1.1.16.tar.gz.

File metadata

Download URL: taxonize_gb-1.1.16.tar.gz
Upload date: Mar 4, 2025
Size: 15.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.2

File hashes

Hashes for taxonize_gb-1.1.16.tar.gz
Algorithm	Hash digest
SHA256	`4e39751fb81183783ccde0ee84164d7b9d7699e26617e492fd310bb0c3bb56d7`
MD5	`b3adc839421641bccc5e3c314b1c9763`
BLAKE2b-256	`daecc84c7446099bd822fb5ded684f2e2f090fa720dc68cb1d7863c211338305`
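A downloaded archive can be compared against the SHA256 digest listed above with Python's `hashlib`. This is a generic sketch: the helper names `sha256_of` and `matches` are hypothetical, and the demonstration hashes an empty temporary file (whose SHA256 digest is well known) because the real archive may not be present locally.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives are not read into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """True when the file's SHA256 equals the expected hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# Demonstration on an empty temporary file.
EMPTY_SHA256 = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    empty_path = tmp.name
print(matches(empty_path, EMPTY_SHA256))
os.remove(empty_path)
```

For the real file, the call would be `matches("taxonize_gb-1.1.16.tar.gz", "<SHA256 from the table above>")`.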

File details

Details for the file taxonize_gb-1.1.16-py3-none-any.whl.

File metadata

Download URL: taxonize_gb-1.1.16-py3-none-any.whl
Upload date: Mar 4, 2025
Size: 21.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.2

File hashes

Hashes for taxonize_gb-1.1.16-py3-none-any.whl
Algorithm	Hash digest
SHA256	`86f6d064f000a79d29e7bdd2a72dd5c650992cb51ffbe4aaf93299f7a0d3e991`
MD5	`34502a6673d728a419fd22009c37eb90`
BLAKE2b-256	`1707735782d0633a05e02576efcdd37e234d6ea626931551f5c70192b2f4a3de`
