Python package to download and filter GenBank database based on taxonomy
Project description
Taxonize_genbank
Taxonize_genbank is a Python package designed to simplify the process of downloading, filtering, and curating GenBank's non-redundant protein and nucleotide databases based on taxonomy IDs (TaxIDs) and/or specific keywords. This tool is particularly useful for researchers working with large-scale genomic datasets who need to extract specific subsets of data.
The tool supports advanced filtering options, allowing users to:
- Include Multiple TaxIDs: Extract sequences associated with multiple taxa by specifying a list of TaxIDs.
- Exclude Specific TaxIDs: Remove unwanted or contaminant taxa from the dataset during filtering.
These features make Taxonize_genbank highly flexible and customizable for a variety of research applications.
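The include/exclude behaviour described above can be pictured as a lineage test: a sequence is kept when its TaxID descends from at least one included taxon and from no excluded taxon. The following is a minimal sketch of that idea only, not the package's implementation. The `PARENT` map is a simplified, hand-written subset of the NCBI taxonomy, and the `ancestors` and `keep` helpers are hypothetical names introduced for illustration.

```python
# Simplified child -> parent TaxID map (hypothetical toy subset of the
# NCBI taxonomy; intermediate ranks are omitted on purpose).
PARENT = {
    3702: 3193,    # Arabidopsis thaliana -> Embryophyta
    3193: 33090,   # Embryophyta -> Viridiplantae
    33090: 2759,   # Viridiplantae -> Eukaryota
    4751: 2759,    # Fungi -> Eukaryota
    2759: 1,       # Eukaryota -> root
}


def ancestors(taxid, parent=PARENT):
    """Yield the TaxID itself, then each ancestor up to the root."""
    seen = set()
    while taxid not in seen:  # guard against accidental cycles
        seen.add(taxid)
        yield taxid
        if taxid not in parent:
            break
        taxid = parent[taxid]


def keep(taxid, include, exclude=()):
    """Assumed rule: keep if the lineage hits `include` and misses `exclude`."""
    lineage = set(ancestors(taxid))
    return bool(lineage & set(include)) and not (lineage & set(exclude))


if __name__ == "__main__":
    print(keep(3702, include=[33090]))                   # plant, included
    print(keep(4751, include=[33090]))                   # fungus, not included
    print(keep(3702, include=[33090], exclude=[3193]))   # excluded subclade
```

Walking up the lineage once per TaxID keeps the rule easy to read; for a full database, precomputing the set of allowed TaxIDs would avoid repeating the walk for every sequence.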
Features
- Download NCBI databases, including taxonomy, protein, and nucleotide datasets.
- Filter GenBank's non-redundant databases (nr or nt) by taxonomy ID or keywords.
- Retrieve taxonomic lineages for FASTA accessions.
- Support for advanced filtering using multiple TaxIDs or excluding specific TaxIDs.
Installation
Prerequisites
Ensure you have the following dependencies installed:
- Python 3.7 or higher
- Biopython: 1.81
- tqdm: 4.64.1
- ete3: 3.1.3
- networkx: 2.6.3
- six: 1.16.0
- isal: 1.7.1
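To see which of these dependencies are already present in the current environment, a short check with the standard library can help. This is a sketch under assumptions: it uses `importlib.metadata`, which requires Python 3.8 or newer, and it assumes the PyPI distribution names shown in `LISTED`. The `installed_versions` helper is a hypothetical name, not something the package provides.

```python
from importlib import metadata  # standard library since Python 3.8

# Versions as listed in the prerequisites (distribution names are assumed).
LISTED = {
    "biopython": "1.81",
    "tqdm": "4.64.1",
    "ete3": "3.1.3",
    "networkx": "2.6.3",
    "six": "1.16.0",
    "isal": "1.7.1",
}


def installed_versions(listed=LISTED):
    """Return {distribution name: installed version, or None if missing}."""
    found = {}
    for name in listed:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, version in installed_versions().items():
        status = version if version is not None else "not installed"
        print(f"{name}: listed {LISTED[name]}, found {status}")
```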
Installation Steps
- Clone the Repository
Clone the GitHub repository to your local machine:
git clone https://github.com/msabrysarhan/taxonize_genbank
- Install via pip (Recommended)
Alternatively, install taxonize_gb directly using pip:
pip install taxonize-gb
Usage
The Taxonize_genbank package includes three main modules:
1. get_db.py: Download NCBI Databases
This module allows you to download NCBI databases required for filtering.
Command:
get_db.py --db_name <DB_NAME> --out <OUTPUT_DIRECTORY>
Options:
- --db_name: Specify the database to download (e.g., taxdb, nr, nt).
- --out: Path to the output directory where the database will be stored.
Example:
Download the non-redundant protein database (nr):
get_db.py --db_name nr --out databases/
2. taxonize_gb.py: Filter Databases by TaxID or Keywords
This module filters the downloaded database based on taxonomy ID or keywords.
Command:
taxonize_gb.py --db <DB> --db_path <DB_PATH> [OPTIONS] --out <OUTPUT_DIRECTORY>
Required Arguments:
- --db: Specify the database type (nt for nucleotide or nr for protein).
- --db_path: Path to the gzipped FASTA file (if not provided, it will be downloaded automatically).
- --out: Path to the output directory.
Optional Arguments:
- --taxid: Target taxonomy ID to filter for.
- --keywords: Keywords to include in FASTA headers.
- Additional arguments for mapping files (--prot_acc2taxid, --nucl_gb_acc2taxid, etc.).
Example:
Filter the non-redundant protein database (nr) for plant proteins (TaxID: 33090):
taxonize_gb.py --db nr --db_path databases/nr.gz \
  --taxdb databases/taxdump.tar.gz \
  --prot_acc2taxid databases/prot.accession2taxid.gz \
  --pdb_acc2taxid databases/pdb.accession2taxid.gz \
  --taxid 33090 \
  --out plant_nr/
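Keyword filtering amounts to scanning FASTA headers and keeping the records that match. The sketch below illustrates the idea on a small in-memory gzipped FASTA. It is not the package's code: the `filter_fasta_by_keywords` function, the case-insensitive "match any keyword" rule, and the two example records (with made-up accessions `P1` and `P2`) are all assumptions made for illustration.

```python
import gzip
import io


def filter_fasta_by_keywords(handle, keywords):
    """Yield (header, sequence) for records whose header contains any keyword.

    Matching is case-insensitive here; that is an assumption of this sketch.
    """
    wanted = [k.lower() for k in keywords]

    def matches(header):
        return any(k in header.lower() for k in wanted)

    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new record starts: emit the previous one if it matched.
            if header is not None and matches(header):
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    # Emit the final record, which has no following ">" line.
    if header is not None and matches(header):
        yield header, "".join(chunks)


# Hypothetical two-record FASTA, gzipped in memory to mimic nr.gz.
RAW = (
    ">P1 chlorophyll a-b binding protein [Arabidopsis thaliana]\n"
    "MAAS\nTMAL\n"
    ">P2 actin [Homo sapiens]\n"
    "MDDD\n"
)

if __name__ == "__main__":
    buffer = io.BytesIO(gzip.compress(RAW.encode()))
    with gzip.open(buffer, "rt") as handle:
        for header, sequence in filter_fasta_by_keywords(handle, ["chlorophyll"]):
            print(header, len(sequence))
```

Streaming line by line avoids loading the whole file, which matters for databases as large as nr or nt.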
3. get_taxonomy.py: Retrieve Taxonomic Lineages
This module extracts taxonomic lineages from GenBank FASTA files.
Command:
get_taxonomy.py --fasta <FASTA_FILE> --map <MAPPING_FILE> --out <OUTPUT_FILE>
Options:
- --fasta: Path to the FASTA file.
- --map: Path to the mapping file (e.g., accession-to-taxonomy mapping).
- --out: Path to save the output file.
Example:
Retrieve taxonomic lineages from a FASTA file:
get_taxonomy.py --fasta input.fasta \
  --map databases/prot.accession2taxid.gz \
  --out taxonomy_lineages.txt
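Conceptually, lineage retrieval is two lookups: accession to TaxID through the mapping file, then TaxID to lineage through the taxonomy. The sketch below shows both steps on toy data. It assumes the four-column, tab-separated layout of NCBI accession2taxid files (`accession`, `accession.version`, `taxid`, `gi`). The accession `ABC12345.1`, the simplified `NODES` table, the helper names, and the semicolon-joined output are hypothetical, and the actual output format of get_taxonomy.py may differ.

```python
import csv
import io

# Hypothetical one-row mapping in the NCBI accession2taxid column layout.
MAPPING_TSV = (
    "accession\taccession.version\ttaxid\tgi\n"
    "ABC12345\tABC12345.1\t3702\t0\n"
)

# Simplified TaxID -> (parent TaxID, name) table; intermediate ranks omitted.
NODES = {
    3702: (3193, "Arabidopsis thaliana"),
    3193: (33090, "Embryophyta"),
    33090: (2759, "Viridiplantae"),
    2759: (1, "Eukaryota"),
}


def load_mapping(handle):
    """Read an accession2taxid table into {accession.version: taxid}."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["accession.version"]: int(row["taxid"]) for row in reader}


def lineage(taxid, nodes=NODES):
    """Return names from the highest known ancestor down to `taxid`."""
    names = []
    while taxid in nodes:
        parent, name = nodes[taxid]
        names.append(name)
        taxid = parent
    return "; ".join(reversed(names))


if __name__ == "__main__":
    mapping = load_mapping(io.StringIO(MAPPING_TSV))
    for accession, taxid in mapping.items():
        print(accession, taxid, lineage(taxid), sep="\t")
```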
Examples
Example 1: Plant Non-redundant Protein Database
- Download required files:
get_db.py --db_name nr --out databases/
get_db.py --db_name prot_acc2taxid --out databases/
get_db.py --db_name pdb_acc2taxid --out databases/
get_db.py --db_name taxdb --out databases/
- Filter for plant proteins (TaxID: 33090):
taxonize_gb.py --db nr \
  --db_path databases/nr.gz \
  --taxdb databases/taxdump.tar.gz \
  --prot_acc2taxid databases/prot.accession2taxid.gz \
  --pdb_acc2taxid databases/pdb.accession2taxid.gz \
  --taxid 33090 \
  --out plant_nr/
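For repeated runs, the two steps of Example 1 can be assembled programmatically. The sketch below only builds and prints the command lines as a dry run and does not execute anything. The `example1_commands` helper is a hypothetical name; the flags and file names are the ones shown in the example above. `shlex.join` requires Python 3.8 or newer.

```python
import shlex


def example1_commands(db_dir="databases", out_dir="plant_nr", taxid=33090):
    """Build the download and filter commands of Example 1 as argument lists."""
    downloads = [
        ["get_db.py", "--db_name", name, "--out", f"{db_dir}/"]
        for name in ("nr", "prot_acc2taxid", "pdb_acc2taxid", "taxdb")
    ]
    filter_cmd = [
        "taxonize_gb.py",
        "--db", "nr",
        "--db_path", f"{db_dir}/nr.gz",
        "--taxdb", f"{db_dir}/taxdump.tar.gz",
        "--prot_acc2taxid", f"{db_dir}/prot.accession2taxid.gz",
        "--pdb_acc2taxid", f"{db_dir}/pdb.accession2taxid.gz",
        "--taxid", str(taxid),
        "--out", f"{out_dir}/",
    ]
    return downloads + [filter_cmd]


if __name__ == "__main__":
    # Dry run: print each command. To execute, one could pass each list to
    # subprocess.run(cmd, check=True) once the package is installed.
    for cmd in example1_commands():
        print(shlex.join(cmd))
```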
Example 2: Insect Non-redundant Nucleotide Database
- Download required files:
get_db.py --db_name nt --out databases/
get_db.py --db_name nucl_gb_acc2taxid --out databases/
get_db.py --db_name nucl_wgs_acc2taxid --out databases/
get_db.py --db_name taxdb --out databases/
- Filter for insect nucleotides (TaxID: 50557):
taxonize_gb.py --db nt \
  --db_path databases/nt.gz \
  --taxdb databases/taxdump.tar.gz \
  --nucl_gb_acc2taxid databases/nucl_gb.accession2taxid.gz \
  --nucl_wgs_acc2taxid databases/nucl_wgs.accession2taxid.gz \
  --taxid 50557 \
  --out insect_nt/
License
This project is licensed under the MIT License.
See the LICENSE file for full details.
File details
Details for the file taxonize_gb-1.1.16.tar.gz.
File metadata
- Download URL: taxonize_gb-1.1.16.tar.gz
- Upload date:
- Size: 15.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4e39751fb81183783ccde0ee84164d7b9d7699e26617e492fd310bb0c3bb56d7 |
| MD5 | b3adc839421641bccc5e3c314b1c9763 |
| BLAKE2b-256 | daecc84c7446099bd822fb5ded684f2e2f090fa720dc68cb1d7863c211338305 |
File details
Details for the file taxonize_gb-1.1.16-py3-none-any.whl.
File metadata
- Download URL: taxonize_gb-1.1.16-py3-none-any.whl
- Upload date:
- Size: 21.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 86f6d064f000a79d29e7bdd2a72dd5c650992cb51ffbe4aaf93299f7a0d3e991 |
| MD5 | 34502a6673d728a419fd22009c37eb90 |
| BLAKE2b-256 | 1707735782d0633a05e02576efcdd37e234d6ea626931551f5c70192b2f4a3de |