
Gene Fetch: High-throughput NCBI Sequence Retrieval Tool

GeneFetch

Gene Fetch enables high-throughput retrieval of sequence data from NCBI databases based on taxonomy IDs (taxids) or taxonomic hierarchies. It can retrieve protein and/or nucleotide sequences for various genes, including protein-coding genes (e.g., cox1, cytb, rbcL, matK) and rRNA genes (e.g., 16S, 18S).

Feature highlights

  • Fetch protein and/or nucleotide sequences from the NCBI GenBank database.
  • Handles both direct nucleotide sequences and protein-linked nucleotide searches (CDS extraction includes fallback mechanisms for atypical annotation formats).
  • Supports both protein-coding and rDNA genes.
  • Customisable length filtering thresholds for protein and nucleotide sequences (default: protein=500aa, nucleotide=1000bp).
  • Default "batch" mode processes multiple input taxa based on a user specified CSV file.
  • Configurable "single" mode (-s/--single) for retrieving a specified number of target sequences for a particular taxon (default length thresholds can be bypassed by setting the value to zero or a negative number).
  • Automatic taxonomy traversal: uses the fetched NCBI taxonomic lineage for a given taxid when sequences are not found at the input taxonomic level. That is, the search starts at the given taxid level (e.g., species) and, if no sequences are found, escalates rank by rank from species towards phylum until a suitable sequence is found.
  • Taxonomic validation: validates fetched sequence taxonomy against the input taxonomic hierarchy, avoiding potential taxonomic homonyms (i.e., when the same taxon name is used for different taxa across the tree of life).
  • Robust error handling, progress tracking, and logging, with compliance to NCBI API rate limits (10 requests/second). Caches taxonomy lookups for reduced API calls.
  • Handles complex sequence features (e.g., complement strands, joined sequences, WGS entries) in addition to 'simple' CDS extraction (if --type nucleotide/both). The tool avoids "unverified" sequences and WGS entries that contain no sequence data (i.e., master records).
  • 'Checkpointing': if a run fails/crashes, the script can be rerun using the same arguments and it will resume from where it stopped.
  • When more than 50 matching GenBank records are found for a sample, the tool fetches summary information for all matches (using NCBI esummary API), orders the records by sequence length, and processes the longest sequences first.
  • Can output the corresponding GenBank (.gb) file for each fetched nucleotide and/or protein sequence.
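
The taxonomy traversal and length filtering described in the list above can be pictured as a loop over the lineage, from the most specific rank upwards. The following is a minimal sketch under stated assumptions, not Gene Fetch's actual implementation: `search_at_taxid` is a hypothetical callable standing in for an NCBI query, the lineage is assumed to be a list of (rank, taxid) pairs ordered from most to least specific, and picking the longest hit is a simplification of the "longest first" behaviour.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical types for illustration only; not part of the Gene Fetch API.
Lineage = List[Tuple[str, int]]        # [(rank, taxid), ...], most specific first
SearchFn = Callable[[int], List[str]]  # taxid -> candidate sequences


def passes_length_filter(sequence: str, min_length: int) -> bool:
    """A threshold of zero or below disables the length filter."""
    return min_length <= 0 or len(sequence) >= min_length


def find_with_escalation(
    lineage: Lineage, search_at_taxid: SearchFn, min_length: int
) -> Optional[Tuple[str, str]]:
    """Return (matched_rank, longest passing sequence) from the first rank with a hit."""
    for rank, taxid in lineage:
        hits = [s for s in search_at_taxid(taxid) if passes_length_filter(s, min_length)]
        if hits:
            return rank, max(hits, key=len)
    return None  # nothing suitable anywhere in the lineage


# Toy data: nothing at species level, one sufficiently long hit at genus level.
fake_db = {101: [], 102: ["A" * 300, "A" * 1200], 103: ["A" * 2000]}
lineage = [("species", 101), ("genus", 102), ("family", 103)]
result = find_with_escalation(lineage, lambda taxid: fake_db.get(taxid, []), min_length=1000)
print(result[0], len(result[1]))  # genus 1200
```

In this toy run the species-level search returns nothing, so the match is reported at genus level, which mirrors the role of the matched_rank column in the batch-mode output.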

Installation

  • Due to the risk of dependency conflicts, it's recommended to install Gene Fetch in a Conda environment.
  • First Conda needs to be installed, which can be done from here.
  • Once installed:
# Create new environment
conda create -n gene-fetch

# Activate environment
conda activate gene-fetch
  • Gene Fetch and all necessary dependencies can then be installed via Bioconda, PyPI, or by specifying environment.yaml:
# Install via bioconda
conda install bioconda::gene-fetch

# Or, install via pip
pip install gene-fetch

# Or, via environment specification
conda env update --name gene-fetch -f environment.yaml --prune

# Verify installation
gene-fetch --help
  • If you would rather clone this repository and run a standalone version of Gene Fetch, you can do so as follows:
# Clone the repository
git clone https://github.com/bge-barcoding/gene_fetch.git
cd gene_fetch

# Activate conda environment (once created), and install gene-fetch (+ dependencies) via your preferred method.

# Run standalone Gene Fetch
python /path/to/gene_fetch.py [options]

Recommended: Testing

  • The Gene Fetch package includes basic tests for each module; we recommend running them after installation.
# Clone the repository
git clone https://github.com/bge-barcoding/gene_fetch.git
cd gene_fetch

# Install pytest
pip install pytest

# Run tests
pytest
  • The tests take a few minutes to run. Expect 1 warning regarding API credentials, as these are not provided in the basic tests.

Usage

gene-fetch -g/--gene <gene_name> --type <sequence_type> -i/--in <samples.csv> -o/--out <output_directory> 
  • --help: Show usage help and exit.

Required arguments

  • -g/--gene: Name of gene to search for in NCBI GenBank database (e.g., cox1/16s/rbcl).
  • --type: Sequence type to fetch: 'protein', 'nucleotide', or 'both' ('both' first searches for and fetches a protein sequence, then fetches the corresponding nucleotide CDS for that protein sequence).
  • -i/--in: Path to input CSV file containing sample IDs and TaxIDs (see Input section below).
  • -i2/--in2: Path to alternative input CSV file containing sample IDs and taxonomic information for each sample (see Input section below).
  • -o/--out: Path to output directory. The directory will be created if it does not exist.
  • -e/--email and -k/--api-key: Email address and associated API key for your NCBI account. An NCBI account is required to run this tool (due to otherwise strict API limitations); information on how to create an NCBI account and find your API key can be found here.

Optional arguments

  • --protein-size: Minimum protein sequence length filter. Applies to both 'batch' and 'single' search modes (default: 500).
  • --nucleotide-size: Minimum nucleotide sequence length filter. Applies to both 'batch' and 'single' search modes (default: 1000).
  • -s/--single: Taxonomic ID for 'single' sequence search mode (-i and -i2 are ignored when run with -s). 'single' mode will fetch all (or N, if --max-sequences is specified) target gene or protein sequences on GenBank for a specific taxonomic ID.
  • --max-sequences: Maximum number of sequences to fetch for a specific taxonomic ID (only applies when run in 'single' mode).
  • -b/--genbank: Saves genbank (.gb) files for fetched nucleotide and/or protein sequences to genbank/ (applies when run in 'batch' or 'single' mode).

Examples

Fetch both protein and nucleotide sequences for COI with default sequence length thresholds, and store the corresponding genbank records.

gene-fetch -e your.email@domain.com -k your_api_key \
            -g cox1 -o ./output_dir -i ./samples.csv \
            --type both --genbank

Fetch rbcL nucleotide sequences using sample taxonomic information, applying a minimum nucleotide sequence length of 1000bp.

gene-fetch -e your.email@domain.com -k your_api_key \
            -g rbcl -o ./output_dir -i2 ./taxonomy.csv \
            --type nucleotide --nucleotide-size 1000

Retrieve up to 1000 available matK protein sequences >400aa for Arabidopsis thaliana (taxid: 3702).

gene-fetch -e your.email@domain.com -k your_api_key \
            -g matk -o ./output_dir -s 3702 \
            --type protein --protein-size 400 --max-sequences 1000
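
For scripted pipelines, commands like the ones above can be assembled from Python. The sketch below only builds an argument list from flags documented in this README. `build_gene_fetch_command` is a hypothetical helper name, not something shipped with Gene Fetch, and actually running the result (for example with `subprocess.run`) assumes `gene-fetch` is installed and on your PATH.

```python
import shlex
from typing import List, Optional


def build_gene_fetch_command(
    gene: str,
    seq_type: str,
    out_dir: str,
    email: str,
    api_key: str,
    samples_csv: Optional[str] = None,
    single_taxid: Optional[int] = None,
    max_sequences: Optional[int] = None,
    genbank: bool = False,
) -> List[str]:
    """Hypothetical helper: assemble a gene-fetch argument list."""
    if seq_type not in {"protein", "nucleotide", "both"}:
        raise ValueError(f"unsupported --type: {seq_type}")
    # Stricter than the CLI itself, which ignores -i when -s is given.
    if (samples_csv is None) == (single_taxid is None):
        raise ValueError("give exactly one of samples_csv or single_taxid")
    cmd = ["gene-fetch", "-e", email, "-k", api_key,
           "-g", gene, "-o", out_dir, "--type", seq_type]
    if samples_csv is not None:
        cmd += ["-i", samples_csv]                      # batch mode
    else:
        cmd += ["-s", str(single_taxid)]                # single mode
        if max_sequences is not None:
            cmd += ["--max-sequences", str(max_sequences)]
    if genbank:
        cmd.append("--genbank")
    return cmd


cmd = build_gene_fetch_command(
    "cox1", "both", "./output_dir", "your.email@domain.com", "your_api_key",
    samples_csv="./samples.csv", genbank=True,
)
print(shlex.join(cmd))  # shell-quoted command line, ready to copy or log
```

Building a list rather than a single string avoids shell-quoting problems when paths contain spaces.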

Input

Example 'samples.csv' input file (-i/--in)

ID,taxid
sample-1,177658
sample-2,177627
sample-3,3084599

Example 'samples_taxonomy.csv' input file (-i2/--in2)

ID,phylum,class,order,family,genus,species
sample-1,Arthropoda,Insecta,Diptera,Acroceridae,Astomella,
sample-2,Arthropoda,Insecta,Hemiptera,Cicadellidae,Psammotettix,Psammotettix sabulicola
sample-3,Arthropoda,Insecta,Trichoptera,Limnephilidae,Dicosmoecus,Dicosmoecus palatus
  • Leave a field blank if that taxonomic information is not known or not needed.
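
Before a long batch run it can be worth sanity-checking the -i input file. The sketch below is an illustration rather than part of Gene Fetch: it assumes a comma-separated file with a header row containing `ID` and `taxid`, and it treats duplicate sample IDs as an error on the assumption that output files are named after the sample ID.

```python
import csv
import io
from typing import List, Tuple


def read_samples(handle) -> List[Tuple[str, int]]:
    """Parse a samples CSV with 'ID' and 'taxid' columns (assumed comma-separated)."""
    reader = csv.DictReader(handle)
    missing = {"ID", "taxid"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing column(s): {sorted(missing)}")
    samples = []
    seen = set()
    for line_no, row in enumerate(reader, start=2):  # the header is line 1
        sample_id = row["ID"].strip()
        if sample_id in seen:
            raise ValueError(f"line {line_no}: duplicate ID {sample_id!r}")
        seen.add(sample_id)
        samples.append((sample_id, int(row["taxid"])))  # int() rejects non-numeric taxids
    return samples


example = io.StringIO("ID,taxid\nsample-1,177658\nsample-2,177627\nsample-3,3084599\n")
samples = read_samples(example)
print(samples)
```

For a file on disk, pass an open handle instead, e.g. `open("samples.csv", newline="")`.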

Output

'Batch' mode

output_dir/
├── genbank/                    # Genbank (.gb) files for each fetched nucleotide and/or protein sequence.
├── nucleotide/                 # Nucleotide sequences. Only populated if '--type nucleotide/both' utilised.
│   ├── sample-1_dna.fasta   
│   ├── sample-2_dna.fasta
│   └── ...
├── sample-1.fasta              # Protein sequences.
├── sample-2.fasta
├── sequence_references.csv     # Sequence metadata.
├── failed_searches.csv         # Failed search attempts (if any).
└── gene_fetch.log              # Log.

sequence_references.csv output example

ID,taxid,protein_accession,protein_length,nucleotide_accession,nucleotide_length,matched_rank,ncbi_taxonomy,reference_name,protein_reference_path,nucleotide_reference_path
sample-1,177658,AHF21732.1,510,KF756944.1,1530,genus:Apatania,Eukaryota; ...; Apataniinae; Apatania,sample-1,abs/path/to/protein_references/sample-1.fasta,abs/path/to/protein_references/sample-1_dna.fasta
sample-2,2719103,QNE85983.1,518,MT410852.1,1557,species:Isoptena serricornis,Eukaryota; ...; Chloroperlinae; Isoptena,sample-2,abs/path/to/protein_references/sample-2.fasta,abs/path/to/protein_references/sample-2_dna.fasta
sample-3,1876143,YP_009526503.1,512,NC_039659.1,1539,genus:Triaenodes,Eukaryota; ...; Triaenodini; Triaenodes,sample-3,abs/path/to/protein_references/sample-3.fasta,abs/path/to/protein_references/sample-3_dna.fasta
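
Because the taxonomy traversal can match a sequence at a higher rank than the one requested, it is often useful to check how many samples matched at each rank. The sketch below assumes that sequence_references.csv is comma-separated and that matched_rank values follow the `rank:name` pattern shown in the example; `summarise_matched_ranks` is a hypothetical helper, not part of Gene Fetch.

```python
import csv
import io
from collections import Counter


def summarise_matched_ranks(handle) -> Counter:
    """Count rows per rank in the 'matched_rank' column (e.g. 'genus:Apatania' -> 'genus')."""
    counts = Counter()
    for row in csv.DictReader(handle):
        rank = row["matched_rank"].split(":", 1)[0]
        counts[rank] += 1
    return counts


# Trimmed-down stand-in for sequence_references.csv (only the columns used here).
example = io.StringIO(
    "ID,taxid,matched_rank\n"
    "sample-1,177658,genus:Apatania\n"
    "sample-2,2719103,species:Isoptena serricornis\n"
    "sample-3,1876143,genus:Triaenodes\n"
)
ranks = summarise_matched_ranks(example)
print(ranks.most_common())  # [('genus', 2), ('species', 1)]
```

A large share of genus-level or higher matches may indicate that species-level reference sequences are sparse for your taxa.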

'Single' mode

output_dir/
├── genbank/                         # Genbank (.gb) files for each fetched nucleotide and/or protein sequence.
├── nucleotide/                      # Nucleotide sequences. Only populated if '--type nucleotide/both' utilised.
│   ├── ACCESSION1_dna.fasta   
│   ├── ACCESSION2_dna.fasta
│   └── ...
├── ACCESSION1.fasta                 # Protein sequences.
├── ACCESSION2.fasta
├── fetched_nucleotide_sequences.csv # Only populated if '--type nucleotide/both' utilised. Sequence metadata.
├── fetched_protein_sequences.csv    # Only populated if '--type protein/both' utilised. Sequence metadata.
├── failed_searches.csv              # Failed search attempts (if any).
└── gene_fetch.log                   # Log.

fetched_protein|nucleotide_sequences.csv output example

ID,taxid,Description
PQ645072.1,1501,"Ochlerotatus nigripes isolate Pool11 cytochrome c oxidase subunit I (COX1) gene, partial cds; mitochondrial"
PQ645071.1,1537,"Ochlerotatus nigripes isolate Pool10 cytochrome c oxidase subunit I (COX1) gene, partial cds; mitochondrial"
PQ645070.1,1501,"Ochlerotatus impiger isolate Pool2 cytochrome c oxidase subunit I (COX1) gene, partial cds; mitochondrial"
PQ645069.1,1518,"Ochlerotatus impiger isolate Pool1 cytochrome c oxidase subunit I (COX1) gene, partial cds; mitochondrial"
PP355486.1,581,"Aedes scutellaris isolate NC.033 cytochrome c oxidase subunit I (COX1) gene, partial cds; mitochondrial"

Running GeneFetch on a cluster

  • See 'gene_fetch.sh' for running gene_fetch.py on an HPC cluster (SLURM job scheduler).
  • Edit 'mem' and/or 'cpus-per-task' to set memory and CPU/threads. Allocating lots of CPUs is unnecessary as Gene Fetch is not parallelised (yet). The tool should run well with 4-10G memory and 1-2 CPUs.
  • Change paths and variables as needed.
  • Run 'gene_fetch.sh' with:
sbatch gene_fetch.sh

Supported targets

GeneFetch will function with targets other than those listed below, but it has hard-coded name variations and 'smarter' searching for the listed targets. More targets can be added to the script if necessary (see 'class config').

  • cox1/COI/cytochrome c oxidase subunit I
  • cox2/COII/cytochrome c oxidase subunit II
  • cox3/COIII/cytochrome c oxidase subunit III
  • cytb/cob/cytochrome b
  • nd1/NAD1/NADH dehydrogenase subunit 1
  • nd2/NAD2/NADH dehydrogenase subunit 2
  • rbcL/RuBisCO/ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit
  • matK/maturase K/maturase type II intron splicing factor
  • 16S ribosomal RNA/16s
  • SSU/18s
  • LSU/28s
  • 12S ribosomal RNA/12s
  • ITS (ITS1-5.8S-ITS2)
  • ITS1/internal transcribed spacer 1
  • ITS2/internal transcribed spacer 2
  • tRNA-Leucine/trnL
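
To illustrate what "hard-coded name variations" can look like, here is a minimal sketch of alias normalisation using a few of the names listed above. The table layout, the lowercase canonical keys, and the `canonical_target` helper are assumptions made for this example; the real configuration inside Gene Fetch may be structured differently.

```python
# Illustrative alias table built from names in the list above (assumed layout).
TARGET_ALIASES = {
    "cox1": ["cox1", "COI", "cytochrome c oxidase subunit I"],
    "cytb": ["cytb", "cob", "cytochrome b"],
    "rbcl": ["rbcL", "RuBisCO", "ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit"],
    "matk": ["matK", "maturase K", "maturase type II intron splicing factor"],
}

# Reverse lookup: lowercased alias -> canonical key.
_LOOKUP = {
    alias.lower(): canonical
    for canonical, aliases in TARGET_ALIASES.items()
    for alias in aliases
}


def canonical_target(name: str) -> str:
    """Map a user-supplied gene name to a canonical key; unknown names pass through lowercased."""
    key = name.strip().lower()
    return _LOOKUP.get(key, key)


print(canonical_target("COI"), canonical_target("Cob"), canonical_target("atp6"))
# cox1 cytb atp6
```

Letting unknown names pass through unchanged reflects the note above that unlisted targets still work, just without the extra name variations.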

Benchmarking

| Sample Description | Run Mode | Target | Input File | Data Type | Memory | CPUs | Run Time (hh:mm:ss) |
|---|---|---|---|---|---|---|---|
| 570 Arthropod samples | Batch | COI | taxonomy.csv | Both | 4G | 1 | 01:34:47 |
| 570 Arthropod samples | Batch | COI | samples.csv | Both (+ genbank) | 4G | 1 | 01:42:37 |
| 570 Arthropod samples | Batch | COI | samples.csv | Nucleotide | 4G | 1 | 01:07:53 |
| 570 Arthropod samples | Batch | ND1 | samples.csv | Nucleotide (>500bp) | 4G | 1 | 01:23:26 |
| All available (30) A. thaliana sequences | Single | rbcL | N/A | Protein (>300aa) | 4G | 1 | 00:00:25 |
| 1000 Culicidae sequences | Single | COI | N/A | Nucleotide (>500bp) | 4G | 1 | 00:31:05 |
| 1000 M. tuberculosis sequences | Single | 16S | N/A | Nucleotide | 4G | 1 | 01:23:54 |
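
The benchmark figures can be turned into a rough per-sample rate for planning your own runs. The sketch below uses the first batch benchmark (570 samples in 01:34:47). Scaling this linearly to other datasets is an assumption, since run time depends on the taxa, the target, and NCBI response times.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an 'hh:mm:ss' string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def seconds_per_sample(run_time: str, n_samples: int) -> float:
    """Average wall-clock seconds spent per sample."""
    return hms_to_seconds(run_time) / n_samples


rate = seconds_per_sample("01:34:47", 570)
print(f"{rate:.1f} s per sample")  # 10.0 s per sample

# Rough estimate for a hypothetical 2000-sample run, assuming linear scaling.
estimate_hours = rate * 2000 / 3600
print(f"~{estimate_hours:.1f} h for 2000 samples")
```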

Future Development

  • Add optional alignment of retrieved sequences
  • Further improve efficiency of record searching and selecting the longest sequence
  • Add support for additional genetic markers beyond the currently supported set
  • Add BOLD query fallback if no 'quality' sequence is found in GenBank

Contributions and guidelines

First off, thanks for taking the time to contribute! ❤️

  • If you have any questions, we assume that you have read the available Documentation. It may also be worth searching for existing Issues that might answer your question(s). If you have found a suitable issue and still need clarification, you can write your question in that issue.
  • If you feel you still need clarification or want to report a possible bug/unexpected behaviour, we recommend opening an Issue and providing as much context as you can about the behaviour you were expecting and the behaviour you're running into.
  • If you want to suggest a novel feature or minor improvements to existing functionality, please make your case for the feature/enhancement by opening an Issue or creating a pull request with your contribution (at which point it will be evaluated as a possible addition). We aim to address any issues as soon as possible.

Authorship & citation

GeneFetch was written by Dan Parsons & Ben Price @ NHMUK (2025).

If you use GeneFetch, please cite our publication: XYZ
