Gene Fetch: High-throughput NCBI Sequence Retrieval Tool

Project description

Gene Fetch

Gene Fetch enables high-throughput retrieval of sequence data from NCBI databases based on taxonomy IDs (taxids) or taxonomic hierarchies. It can retrieve protein and/or nucleotide sequences for various genes, including protein-coding genes (e.g., cox1, cytb, rbcl, matk) and rRNA genes (e.g., 16S, 18S).

Installation:

Install from PyPI

pip install gene-fetch

Post-installation testing:

  • The Gene Fetch package includes basic tests for each module, which can be run as follows:
# Clone the repository
git clone https://github.com/bge-barcoding/gene_fetch.git
cd gene_fetch

# Install pytest
pip install pytest

# Run tests
pytest
  • The tests take a few minutes to run. Expect one warning about API credentials, as these are not provided in the basic tests.

Usage:

python gene_fetch.py -g/--gene <gene_name> --type <sequence_type> -i/--in <samples.csv> -o/--out <output_directory> -e/--email <email> -k/--api-key <api_key>

-h/--help: Show help and exit.

Required arguments:

  • -g/--gene: Name of the gene to search for in the NCBI GenBank database (e.g., cox1/16s/rbcl).
  • --type: Sequence type to fetch: 'protein', 'nucleotide', or 'both' ('both' first searches for and fetches a protein sequence, then fetches the corresponding nucleotide CDS for that protein sequence).
  • -i/--in: Path to input CSV file containing sample IDs and TaxIDs (see Input section below).
  • -i2/--in2: Path to alternative input CSV file containing sample IDs and taxonomic information for each sample, for use in place of -i/--in (see Input section below).
  • -o/--out: Path to output directory. The directory will be created if it does not exist.
  • -e/--email and -k/--api-key: Email address and associated API key for your NCBI account. An NCBI account is required to run this tool, as API limits are otherwise strict. See NCBI's documentation for how to create an account and find your API key.
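As an illustration of how the required arguments fit together, the following minimal sketch assembles an argument list for gene_fetch.py from Python. The helper name build_gene_fetch_command is hypothetical and not part of Gene Fetch, and the check that exactly one of -i/--in or -i2/--in2 is given is an assumption based on -i2 being described as an alternative input.

```python
from typing import List, Optional


def build_gene_fetch_command(
    gene: str,
    seq_type: str,
    out_dir: str,
    email: str,
    api_key: str,
    in_csv: Optional[str] = None,
    in2_csv: Optional[str] = None,
) -> List[str]:
    """Assemble a gene_fetch.py argument list (hypothetical helper)."""
    if seq_type not in ("protein", "nucleotide", "both"):
        raise ValueError("--type must be 'protein', 'nucleotide', or 'both'")
    # Assumption: a batch run takes exactly one of -i/--in or -i2/--in2.
    if (in_csv is None) == (in2_csv is None):
        raise ValueError("provide exactly one of in_csv or in2_csv")

    cmd = ["python", "gene_fetch.py", "--gene", gene, "--type", seq_type]
    if in_csv is not None:
        cmd += ["--in", in_csv]
    else:
        cmd += ["--in2", in2_csv]
    cmd += ["--out", out_dir, "--email", email, "--api-key", api_key]
    return cmd


if __name__ == "__main__":
    # Placeholder credentials; substitute your own NCBI email and API key.
    command = build_gene_fetch_command(
        "cox1", "both", "output", "you@example.org", "YOUR_API_KEY",
        in_csv="samples.csv",
    )
    print(" ".join(command))
```

The resulting list can be passed to subprocess.run if the tool should be launched from a script rather than a shell.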

Optional arguments:

  • --protein-size: Minimum protein sequence length filter. Applies to 'batch' and 'single' search modes (default: 500; can be bypassed by setting the value to zero or a negative number).
  • --nucleotide-size: Minimum nucleotide sequence length filter. Applies to 'batch' and 'single' search modes (default: 1000; can be bypassed by setting the value to zero or a negative number).
  • -s/--single: Taxonomic ID for 'single' sequence search mode (-i and -i2 are ignored when run with -s). 'single' mode fetches all target gene or protein sequences on GenBank for a specific taxonomic ID, or N sequences if --max-sequences is specified.
  • --max-sequences: Maximum number of sequences to fetch for a specific taxonomic ID (only applies when run in 'single' mode).
  • -b/--genbank: Saves GenBank (.gb) files for fetched nucleotide and/or protein sequences to genbank/ (applies when run in 'batch' or 'single' mode).
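The behaviour of the size filters described above can be sketched as follows. This is not Gene Fetch's own code: the function and constant names are hypothetical, and treating the threshold as inclusive (length greater than or equal to the minimum) is an assumption.

```python
# Defaults as documented for --protein-size and --nucleotide-size.
DEFAULT_PROTEIN_SIZE = 500
DEFAULT_NUCLEOTIDE_SIZE = 1000


def passes_length_filter(sequence: str, min_length: int) -> bool:
    """Return True if the sequence is kept by the minimum-length filter.

    A zero or negative min_length bypasses the filter entirely.
    Assumption: the minimum is inclusive (len(sequence) >= min_length).
    """
    if min_length <= 0:
        return True
    return len(sequence) >= min_length


if __name__ == "__main__":
    short_protein = "M" * 120
    print(passes_length_filter(short_protein, DEFAULT_PROTEIN_SIZE))  # filtered out
    print(passes_length_filter(short_protein, 0))  # filter bypassed
```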

Input:

Example 'samples.csv' input file (-i/--in)

ID taxid
sample-1 177658
sample-2 177627
sample-3 3084599
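A file of this shape can be generated programmatically. The sketch below uses Python's csv module and assumes a comma-delimited file with an ID,taxid header row; the helper name samples_to_csv is hypothetical.

```python
import csv
import io
from typing import Iterable, Tuple


def samples_to_csv(samples: Iterable[Tuple[str, int]]) -> str:
    """Render (sample ID, taxid) pairs as CSV text with an ID,taxid header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["ID", "taxid"])
    for sample_id, taxid in samples:
        # int() rejects non-numeric taxids early.
        writer.writerow([sample_id, int(taxid)])
    return buffer.getvalue()


if __name__ == "__main__":
    text = samples_to_csv([
        ("sample-1", 177658),
        ("sample-2", 177627),
        ("sample-3", 3084599),
    ])
    with open("samples.csv", "w", newline="") as handle:
        handle.write(text)
```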

Example 'samples_taxonomy.csv' input file (-i2/--in2)

ID | phylum | class | order | family | genus | species
sample-1 | Arthropoda | Insecta | Diptera | Acroceridae | Astomella |
sample-2 | Arthropoda | Insecta | Hemiptera | Cicadellidae | Psammotettix | Psammotettix sabulicola
sample-3 | Arthropoda | Insecta | Trichoptera | Limnephilidae | Dicosmoecus | Dicosmoecus palatus
  • Leave a field blank if that taxonomic information is unknown or not needed (as with the species field for sample-1).
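To show how blank taxonomic fields end up in the file, here is a minimal sketch that writes rows with missing ranks as empty cells. It assumes a comma-delimited file with the column names shown above; the helper name taxonomy_to_csv is hypothetical.

```python
import csv
import io
from typing import Dict, Iterable

COLUMNS = ["ID", "phylum", "class", "order", "family", "genus", "species"]


def taxonomy_to_csv(rows: Iterable[Dict[str, str]]) -> str:
    """Render taxonomy rows as CSV text; ranks absent from a row are left blank."""
    buffer = io.StringIO()
    # restval="" fills any column missing from a row with an empty cell.
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, restval="", lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()


if __name__ == "__main__":
    print(taxonomy_to_csv([
        {"ID": "sample-1", "phylum": "Arthropoda", "class": "Insecta",
         "order": "Diptera", "family": "Acroceridae", "genus": "Astomella"},
    ]))
```

For the sample-1 row, the species cell is empty, so the line ends with a trailing comma.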

Authored by Dan Parsons and Ben Price @ NHMUK (2025).

Download files

Source Distribution

gene_fetch-1.0.11.tar.gz (54.1 kB)

Uploaded Source

Built Distribution

gene_fetch-1.0.11-py3-none-any.whl (35.6 kB)

Uploaded Python 3

File details

Details for the file gene_fetch-1.0.11.tar.gz.

File metadata

  • Download URL: gene_fetch-1.0.11.tar.gz
  • Upload date:
  • Size: 54.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.2 CPython/3.12.8 Linux/6.1.0-31-amd64

File hashes

Hashes for gene_fetch-1.0.11.tar.gz
Algorithm Hash digest
SHA256 f0d364ff9b9ada2e881880cc9a19ca812d337a2db1523a25f5f28415f622be30
MD5 f1d606ef93857b7cc557eaf8ee5afe4a
BLAKE2b-256 e01c889b143b4243e1f937f1fef349d5759fa2e85f0a0da749b9e60a4fe10a12
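A downloaded file can be checked against a published SHA256 digest with Python's standard hashlib module. The following is a minimal sketch; the helper names sha256_of_file and matches_expected are hypothetical.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns empty bytes at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Calling matches_expected with the path to the downloaded archive and the SHA256 value listed for it returns True when the file is intact.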

File details

Details for the file gene_fetch-1.0.11-py3-none-any.whl.

File metadata

  • Download URL: gene_fetch-1.0.11-py3-none-any.whl
  • Upload date:
  • Size: 35.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.2 CPython/3.12.8 Linux/6.1.0-31-amd64

File hashes

Hashes for gene_fetch-1.0.11-py3-none-any.whl
Algorithm Hash digest
SHA256 e036cfdf5c930213c69e7df21f7716e5927794fc6ae1b6a68cf93b4ba2090988
MD5 3b2440591ecda30d81718d6225aac4ec
BLAKE2b-256 4740f2d6df39540aa2500f5a63c2e36cf9740d2d9cb37301ee118d8d0b4f6b99
