Gene Fetch: High-throughput NCBI Sequence Retrieval Tool
Gene Fetch
Gene Fetch enables high-throughput retrieval of sequence data from NCBI databases based on taxonomy IDs (taxids) or taxonomic hierarchies. It can retrieve protein and/or nucleotide sequences for various genes, including protein-coding genes (e.g., cox1, cytb, rbcl, matk) and rRNA genes (e.g., 16S, 18S).
Installation:
Install from PyPI
pip install gene-fetch
Post-installation testing:
- The Gene Fetch package includes basic tests for each module, which can be run as follows:
# Clone the repository
git clone https://github.com/bge-barcoding/gene_fetch.git
cd gene_fetch
# Install pytest
pip install pytest
# Run tests
pytest
- The suite contains 65 tests across 8 test modules (tests/test_*.py) and takes a few minutes to run. Expect 1 warning about API credentials, because the basic tests do not provide them.
Usage:
python gene_fetch.py -g/--gene <gene_name> --type <sequence_type> -i/--in <samples.csv> -o/--out <output_directory>
-h/--help: Show help and exit.
Required arguments:
- -g/--gene: Name of the gene to search for in the NCBI GenBank database (e.g., cox1/16s/rbcl).
- --type: Sequence type to fetch: 'protein', 'nucleotide', or 'both'. With 'both', the tool first searches for and fetches a protein sequence, then fetches the corresponding nucleotide CDS for that protein sequence.
- -i/--in: Path to the input CSV file containing sample IDs and TaxIDs (see Input section below).
- -i2/--in2: Path to the alternative input CSV file containing sample IDs and taxonomic information for each sample (see Input section below).
- -o/--out: Path to the output directory. The directory will be created if it does not exist.
- -e/--email and -k/--api-key: Email address and associated API key for your NCBI account. An NCBI account is required to run this tool (API limits are otherwise strict). Information on creating an NCBI account and finding your API key is available from NCBI.
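The required arguments above can be assembled programmatically, for example when launching many runs from a script. The following is a minimal sketch, not part of Gene Fetch itself: the helper name `build_gene_fetch_command` and the placeholder email and API key are hypothetical, and only flags listed above are used. It builds the argument list without running it.

```python
import shlex

# Values accepted by --type, as listed in the required arguments.
VALID_TYPES = {"protein", "nucleotide", "both"}


def build_gene_fetch_command(gene, seq_type, input_csv, out_dir, email, api_key):
    """Return an argv list for a Gene Fetch run (hypothetical helper)."""
    if seq_type not in VALID_TYPES:
        raise ValueError(f"--type must be one of {sorted(VALID_TYPES)}")
    return [
        "python", "gene_fetch.py",
        "-g", gene,
        "--type", seq_type,
        "-i", input_csv,
        "-o", out_dir,
        "-e", email,
        "-k", api_key,
    ]


# Placeholder credentials; substitute your own NCBI email and API key.
cmd = build_gene_fetch_command(
    "cox1", "both", "samples.csv", "output", "user@example.org", "YOUR_API_KEY"
)
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` once real credentials are supplied.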
Optional arguments:
- --protein-size: Minimum protein sequence length filter. Applies to 'batch' and 'single' search modes (default: 500).
- --nucleotide-size: Minimum nucleotide sequence length filter. Applies to 'batch' and 'single' search modes (default: 1500).
- -s/--single: Taxonomic ID for 'single' sequence search mode (-i and -i2 are ignored when run with -s). 'single' mode fetches all (or N, if --max-sequences is specified) target gene or protein sequences on GenBank for a specific taxonomic ID.
- --max-sequences: Maximum number of sequences to fetch for a specific taxonomic ID (only applies in 'single' mode).
- -b/--genbank: Saves GenBank (.gb) files for fetched nucleotide and/or protein sequences to genbank/ (applies in 'batch' and 'single' modes).
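To illustrate what a minimum-length filter does, here is a small sketch using the documented defaults (500 for protein, 1500 for nucleotide). It is an assumption that the threshold is inclusive and measured in residues or bases; the function `passes_length_filter` is hypothetical and does not reproduce Gene Fetch's internal logic.

```python
# Documented defaults for --protein-size and --nucleotide-size.
DEFAULT_MIN_LENGTH = {"protein": 500, "nucleotide": 1500}


def passes_length_filter(sequence, seq_type, min_length=None):
    """Return True if the sequence is at least the minimum length.

    Assumes an inclusive threshold; this is an illustration only.
    """
    if seq_type not in DEFAULT_MIN_LENGTH:
        raise ValueError("seq_type must be 'protein' or 'nucleotide'")
    threshold = DEFAULT_MIN_LENGTH[seq_type] if min_length is None else min_length
    return len(sequence) >= threshold


print(passes_length_filter("M" * 500, "protein"))        # meets the default of 500
print(passes_length_filter("ACGT" * 100, "nucleotide"))  # 400 bases, below 1500
```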
Input:
Example 'samples.csv' input file (-i/--in)
| ID | taxid |
|---|---|
| sample-1 | 177658 |
| sample-2 | 177627 |
| sample-3 | 3084599 |
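Before a long run it can help to check that the input file is well formed. The sketch below parses the example rows shown above with Python's standard `csv` module. It assumes a comma-delimited file whose header row uses the column names `ID` and `taxid`; the function `parse_samples` is hypothetical and not part of Gene Fetch.

```python
import csv
import io

# The example rows from the table above, as comma-separated text.
EXAMPLE_CSV = "ID,taxid\nsample-1,177658\nsample-2,177627\nsample-3,3084599\n"


def parse_samples(text):
    """Return a list of (sample_id, taxid) tuples, validating each row."""
    reader = csv.DictReader(io.StringIO(text))
    missing = {"ID", "taxid"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing column(s): {sorted(missing)}")
    samples = []
    for line_no, row in enumerate(reader, start=2):
        taxid = (row["taxid"] or "").strip()
        if not taxid.isdigit():
            raise ValueError(f"line {line_no}: taxid {taxid!r} is not an integer")
        samples.append((row["ID"].strip(), int(taxid)))
    return samples


samples = parse_samples(EXAMPLE_CSV)
print(samples)
```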
Example 'samples_taxonomy.csv' input file (-i2/--in2)
| ID | phylum | class | order | family | genus | species |
|---|---|---|---|---|---|---|
| sample-1 | Arthropoda | Insecta | Diptera | Acroceridae | Astomella | |
| sample-2 | Arthropoda | Insecta | Hemiptera | Cicadellidae | Psammotettix | Psammotettix sabulicola |
| sample-3 | Arthropoda | Insecta | Trichoptera | Limnephilidae | Dicosmoecus | Dicosmoecus palatus |
- Leave a field blank if the taxonomic information is not known or not needed.
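Because fields may be left blank, a script preparing this file may want to know the most specific rank supplied for each sample. The sketch below finds it for a single row. The rank order is taken from the table header; the function `most_specific_rank` is hypothetical, and how Gene Fetch itself resolves the hierarchy is not described here.

```python
# Rank columns from the example table, ordered from broadest to most specific.
RANKS = ["phylum", "class", "order", "family", "genus", "species"]


def most_specific_rank(row):
    """Return (rank, name) for the most specific non-blank rank, or None."""
    for rank in reversed(RANKS):
        value = (row.get(rank) or "").strip()
        if value:
            return rank, value
    return None


# sample-1 from the table above, with the species field left blank.
sample_1 = {
    "ID": "sample-1",
    "phylum": "Arthropoda",
    "class": "Insecta",
    "order": "Diptera",
    "family": "Acroceridae",
    "genus": "Astomella",
    "species": "",
}
print(most_specific_rank(sample_1))
```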
**Authored by Dan Parsons and Ben Price @ NHMUK (2025).**
Download files
File details
Details for the file gene_fetch-1.0.8.tar.gz.
File metadata
- Download URL: gene_fetch-1.0.8.tar.gz
- Upload date:
- Size: 53.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.2 CPython/3.12.8 Linux/6.1.0-31-amd64
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 50e43af1b25169b69f46477530404958de0154aa3f3774ff6e61cf76d368a173 |
| MD5 | 463119cd1b6c24661186482bac6dd1dd |
| BLAKE2b-256 | 113c78ee00a1f5f0b9db549bb4a2c762637ddc87df1b9a0ed060dd469ffe2557 |
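A downloaded archive can be checked against the SHA256 digest listed above. This is a minimal sketch using Python's standard `hashlib`; the helper names `sha256_of_file` and `verify_download` are hypothetical, and the script assumes the archive was saved in the current directory under its original file name.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for gene_fetch-1.0.8.tar.gz.
EXPECTED_SHA256 = "50e43af1b25169b69f46477530404958de0154aa3f3774ff6e61cf76d368a173"


def sha256_of_file(path, chunk_size=65536):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest matches the expected value."""
    return sha256_of_file(path) == expected.lower()


archive = Path("gene_fetch-1.0.8.tar.gz")
if archive.exists():
    print("digest matches:", verify_download(archive))
else:
    print(f"{archive} not found; download it first")
```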
File details
Details for the file gene_fetch-1.0.8-py3-none-any.whl.
File metadata
- Download URL: gene_fetch-1.0.8-py3-none-any.whl
- Upload date:
- Size: 35.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.2 CPython/3.12.8 Linux/6.1.0-31-amd64
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0c21490f2494044c5c020cbfc33d55bfe66820e26d23112c31182b66f129220e |
| MD5 | f1e76a0f71025fa7486f1a7178b680ad |
| BLAKE2b-256 | 2599af57972a6f2a32d4fdd37669fdffb9b2f878a0b45577545d5d850691f4c8 |