
Python package to download and filter GenBank database based on taxonomy


Taxonize_genbank


Taxonize_genbank is a Python package designed to simplify the process of downloading, filtering, and curating GenBank's non-redundant protein and nucleotide databases based on taxonomy IDs (TaxIDs) and/or specific keywords. This tool is particularly useful for researchers working with large-scale genomic datasets who need to extract specific subsets of data.

The tool supports advanced filtering options, allowing users to:

  • Include Multiple TaxIDs: Extract sequences associated with multiple taxa by specifying a list of TaxIDs.
  • Exclude Specific TaxIDs: Remove unwanted or contaminant taxa from the dataset during filtering.

These features make Taxonize_genbank highly flexible and customizable for a variety of research applications.
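To illustrate how an include list and an exclude list might combine, here is a minimal sketch that walks a toy child-to-parent taxonomy map. The function names (`ancestors`, `keep`) and the heavily abridged `PARENT` map are hypothetical and are not taken from the package; in practice the parent relationships would come from the NCBI taxonomy dump.

```python
from typing import Dict, Iterable, Set

# Toy child -> parent map (hypothetical, abridged; intermediate ranks omitted).
PARENT: Dict[int, int] = {
    1: 1,          # root points to itself
    2759: 1,       # Eukaryota
    33090: 2759,   # Viridiplantae
    3193: 33090,   # Embryophyta
    3702: 3193,    # Arabidopsis thaliana
    50557: 2759,   # Insecta
    7227: 50557,   # Drosophila melanogaster
}


def ancestors(taxid: int, parent: Dict[int, int]) -> Set[int]:
    """Return the TaxID itself plus every ancestor up to the root."""
    seen: Set[int] = set()
    while taxid not in seen:
        seen.add(taxid)
        # Unknown TaxIDs map to themselves, which ends the walk.
        taxid = parent.get(taxid, taxid)
    return seen


def keep(taxid: int, include: Iterable[int], exclude: Iterable[int],
         parent: Dict[int, int] = PARENT) -> bool:
    """Keep a TaxID if it falls under an included taxon and under no excluded one."""
    line = ancestors(taxid, parent)
    if line & set(exclude):
        return False
    return bool(line & set(include))
```

In this sketch exclusion takes precedence over inclusion, so `keep(3702, include=[2759], exclude=[33090])` rejects a plant even though it is a eukaryote. That precedence is a design assumption of the example, not documented behavior of the tool.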

Features

  • Download NCBI databases, including taxonomy, protein, and nucleotide datasets.
  • Filter GenBank's non-redundant databases (nr or nt) by taxonomy ID or keywords.
  • Retrieve taxonomic lineages for FASTA accessions.
  • Support for advanced filtering using multiple TaxIDs or excluding specific TaxIDs.

Installation

Prerequisites

Ensure you have the following dependencies installed:

  • Python 3.7 or higher
  • Biopython: 1.81
  • tqdm: 4.64.1
  • ete3: 3.1.3
  • networkx: 2.6.3
  • six: 1.16.0
  • isal: 1.7.1
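A quick way to compare an environment against the versions listed above is sketched below. It assumes Python 3.8 or newer (for `importlib.metadata`) and assumes the PyPI distribution names shown in `LISTED`; the helper name `installed_versions` is hypothetical.

```python
from importlib import metadata  # standard library in Python 3.8+

# Versions as listed in the prerequisites; distribution names are assumed.
LISTED = {
    "biopython": "1.81",
    "tqdm": "4.64.1",
    "ete3": "3.1.3",
    "networkx": "2.6.3",
    "six": "1.16.0",
    "isal": "1.7.1",
}


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, have in installed_versions(LISTED).items():
        print(f"{name}: listed {LISTED[name]}, installed {have or 'missing'}")
```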

Installation Steps

  1. Clone the Repository
    Clone the GitHub repository to your local machine:
git clone https://github.com/msabrysarhan/taxonize_genbank
  2. Install via pip (Recommended)
    Alternatively, install taxonize_gb directly using pip:
pip install taxonize-gb

Usage

The Taxonize_genbank package includes three main modules:

1. get_db.py: Download NCBI Databases

This module allows you to download NCBI databases required for filtering.

Command:

get_db.py --db_name <DB_NAME> --out <OUTPUT_DIRECTORY>

Options:

  • --db_name: Specify the database to download (e.g., taxdb, nr, nt, etc.).
  • --out: Path to the output directory where the database will be stored.

Example:

Download the non-redundant protein database (nr):

get_db.py --db_name nr --out databases/
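Because these downloads are large, it can be worth confirming that a file decompresses and looks like FASTA before filtering it. The following is a minimal sketch of such a check; `first_fasta_header` is a hypothetical helper, not part of the package, and it only inspects the first non-empty line.

```python
import gzip


def first_fasta_header(path):
    """Return the first header (without '>') of a gzipped FASTA file, or None."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip leading blank lines
            # A FASTA file must start with a header line.
            return line[1:] if line.startswith(">") else None
    return None  # empty file
```

For example, `first_fasta_header("databases/nr.gz")` should return a header string if the download completed and the file is a gzipped FASTA; a truncated archive may instead raise an error while reading.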

2. taxonize_gb.py: Filter Databases by TaxID or Keywords

This module filters the downloaded database based on taxonomy ID or keywords.

Command:

taxonize_gb.py --db <DB> --db_path <DB_PATH> [OPTIONS] --out <OUTPUT_DIRECTORY>

Required Arguments:

  • --db: Specify the database type (nt for nucleotide or nr for protein).
  • --db_path: Path to the gzipped FASTA file (if not provided, it will be downloaded automatically).
  • --out: Path to the output directory.

Optional Arguments:

  • --taxid: Target taxonomy ID to filter for.
  • --keywords: Keywords to include in FASTA headers.
  • Additional arguments for mapping files (--prot_acc2taxid, --nucl_gb_acc2taxid, etc.).
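Keyword filtering on FASTA headers can be pictured with the sketch below. It is an illustration under stated assumptions, not the package's implementation: matching is case-insensitive substring matching, a record is kept if any keyword matches, and the helper names `read_fasta` and `filter_by_keywords` are hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(handle) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    header, seq = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line.strip())
    if header is not None:
        yield header, "".join(seq)


def filter_by_keywords(records: Iterable[Tuple[str, str]],
                       keywords: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Keep records whose header contains at least one keyword (case-insensitive)."""
    wanted = [k.lower() for k in keywords]
    for header, seq in records:
        low = header.lower()
        if any(k in low for k in wanted):
            yield header, seq
```

Both functions are generators, so a multi-gigabyte file opened with `gzip.open(path, "rt")` can be streamed record by record without loading it into memory.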

Example:

Filter the non-redundant protein database (nr) for plant proteins (TaxID: 33090):

taxonize_gb.py --db nr --db_path databases/nr.gz \
  --taxdb databases/taxdump.tar.gz \
  --prot_acc2taxid databases/prot.accession2taxid.gz \
  --pdb_acc2taxid databases/pdb.accession2taxid.gz \
  --taxid 33090 \
  --out plant_nr/

3. get_taxonomy.py: Retrieve Taxonomic Lineages

This module extracts taxonomic lineages from GenBank FASTA files.

Command:

get_taxonomy.py --fasta <FASTA_FILE> --map <MAPPING_FILE> --out <OUTPUT_FILE>

Options:

  • --fasta: Path to the FASTA file.
  • --map: Path to the mapping file (e.g., accession-to-taxonomy mapping).
  • --out: Path to save the output file.

Example:

Retrieve taxonomic lineages from a FASTA file:

get_taxonomy.py --fasta input.fasta \
  --map databases/prot.accession2taxid.gz \
  --out taxonomy_lineages.txt
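Conceptually, lineage retrieval chains two lookups: accession to TaxID, then TaxID to its ancestors. The sketch below shows that chain on toy data. It assumes the tab-separated NCBI accession2taxid layout (accession, accession.version, taxid, gi) and uses hypothetical helpers and a hypothetical, abridged taxonomy; it is not the module's actual code.

```python
def load_acc2taxid(handle):
    """Build {accession.version: taxid} from an open accession2taxid text handle."""
    mapping = {}
    next(handle, None)  # skip the header row
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            mapping[fields[1]] = int(fields[2])
    return mapping


def accession_of(header):
    """Take the first whitespace-delimited token of a FASTA header as the accession."""
    return header.split()[0]


def lineage(taxid, parent, names):
    """Return taxon names from the root down to the given TaxID."""
    path, seen = [], set()
    while taxid in parent and taxid not in seen:
        seen.add(taxid)
        path.append(names.get(taxid, str(taxid)))
        if parent[taxid] == taxid:  # reached the root
            break
        taxid = parent[taxid]
    return list(reversed(path))
```

A header is resolved with `lineage(mapping[accession_of(header)], parent, names)`, where `parent` and `names` would in practice be built from the taxonomy dump rather than typed by hand.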

Examples

Example 1: Plant Non-redundant Protein Database

  1. Download required files:
get_db.py --db_name nr --out databases/
get_db.py --db_name prot_acc2taxid --out databases/
get_db.py --db_name pdb_acc2taxid --out databases/
get_db.py --db_name taxdb --out databases/
  2. Filter for plant proteins (TaxID: 33090):
taxonize_gb.py --db nr \
  --db_path databases/nr.gz \
  --taxdb databases/taxdump.tar.gz \
  --prot_acc2taxid databases/prot.accession2taxid.gz \
  --pdb_acc2taxid databases/pdb.accession2taxid.gz \
  --taxid 33090 \
  --out plant_nr/

Example 2: Insect Non-redundant Nucleotide Database

  1. Download required files:
get_db.py --db_name nt --out databases/
get_db.py --db_name nucl_gb_acc2taxid --out databases/
get_db.py --db_name nucl_wgs_acc2taxid --out databases/
get_db.py --db_name taxdb --out databases/
  2. Filter for insect nucleotides (TaxID: 50557):
taxonize_gb.py --db nt \
  --db_path databases/nt.gz \
  --taxdb databases/taxdump.tar.gz \
  --nucl_gb_acc2taxid databases/nucl_gb.accession2taxid.gz \
  --nucl_wgs_acc2taxid databases/nucl_wgs.accession2taxid.gz \
  --taxid 50557 \
  --out insect_nt/
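When the two-step workflow in these examples is repeated for several taxa, it may be convenient to assemble the command lines from Python. The sketch below only builds argument lists from the flags shown in this README; the helper names are hypothetical, and actually running the commands (for instance with `subprocess.run(cmd, check=True)`) assumes the scripts are on your `PATH`.

```python
import shlex


def get_db_command(db_name, out_dir):
    """Argument list for one get_db.py download."""
    return ["get_db.py", "--db_name", db_name, "--out", out_dir]


def taxonize_command(db, db_path, taxid, out_dir, mapping_args):
    """Argument list for taxonize_gb.py; mapping_args maps flag -> file path."""
    cmd = ["taxonize_gb.py", "--db", db, "--db_path", db_path]
    for flag, value in mapping_args.items():
        cmd += [flag, value]
    cmd += ["--taxid", str(taxid), "--out", out_dir]
    return cmd


def render(cmd):
    """Render an argument list as a shell-safe string for logging."""
    return " ".join(shlex.quote(part) for part in cmd)


# Reproduces the insect example: four downloads, then one filter command.
downloads = [get_db_command(name, "databases/")
             for name in ("nt", "nucl_gb_acc2taxid", "nucl_wgs_acc2taxid", "taxdb")]
insect = taxonize_command(
    "nt", "databases/nt.gz", 50557, "insect_nt/",
    {
        "--taxdb": "databases/taxdump.tar.gz",
        "--nucl_gb_acc2taxid": "databases/nucl_gb.accession2taxid.gz",
        "--nucl_wgs_acc2taxid": "databases/nucl_wgs.accession2taxid.gz",
    },
)
```

Passing argument lists rather than a single shell string avoids quoting problems if a path contains spaces.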

License

This project is licensed under the MIT License.
See the LICENSE file for full details.


