A Python tool for fetching bacterial genome metadata and sequences.

fetchm: Metadata Fetching and Analysis Tool

Overview

fetchm is a command-line tool for bacterial comparative genomics workflows. It starts from an ncbi_dataset.tsv downloaded from the NCBI Genome interface, retrieves linked BioSample metadata, standardizes key fields, summarizes the dataset, generates figures, and can optionally download the filtered genome FASTA files.

The tool is intended primarily for bacterial genomes. Metadata structures differ across organism groups, so non-bacterial datasets may not behave consistently.

Features

  • Fetch Isolation Source, Collection Date, Geographic Location, and Host from NCBI BioSample.
  • Filter records by ANI status and optional CheckM completeness threshold.
  • Standardize common missing-value strings and harmonize collection year and country names.
  • Generate summary tables, harmonization reports, and publication-ready plots.
  • Download genome FASTA files from NCBI FTP after filtering by host, year, country, continent, or subcontinent.
  • Audit an existing sequence directory with --check-only.
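
The standardization step in the feature list can be pictured with a short sketch. This is not fetchm's actual code: the placeholder list, the function names, and the "Country: region" split are all assumptions made for illustration.

```python
import re
from typing import Optional

# Hypothetical placeholder strings; the list fetchm actually uses may differ.
MISSING_TOKENS = {
    "", "missing", "not collected", "not applicable",
    "not provided", "unknown", "na", "n/a", "none",
}


def clean_value(value: Optional[str]) -> Optional[str]:
    """Return None for common missing-value placeholders, else the stripped text."""
    if value is None:
        return None
    text = value.strip()
    return None if text.lower() in MISSING_TOKENS else text


def collection_year(value: Optional[str]) -> Optional[int]:
    """Extract a four-digit year (1800-2099) from a free-form collection date."""
    text = clean_value(value)
    if text is None:
        return None
    match = re.search(r"(?<!\d)(1[89]\d{2}|20\d{2})(?!\d)", text)
    return int(match.group(1)) if match else None


def country_from_location(value: Optional[str]) -> Optional[str]:
    """Keep the part before the first colon, assuming a 'Country: region' layout."""
    text = clean_value(value)
    if text is None:
        return None
    return text.split(":", 1)[0].strip()
```

With these helpers, a collection date such as 2019-05-12 reduces to the year 2019, and a location such as "Bangladesh: Dhaka" reduces to "Bangladesh".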

Installation

Create a fresh environment and install from PyPI:

conda create -n fetchm python=3.9
conda activate fetchm
pip install fetchm

fetchm uses Python dependencies only. No separate wget installation is required for the current release.

NCBI API Key

For faster metadata retrieval, you can provide an NCBI API key.

How to create one:

  1. Sign in to your My NCBI account.
  2. Open Account Settings.
  3. Find API Key Management.
  4. Create an API key.

How fetchm uses request pacing:

  • without an API key: default request delay is 0.34 seconds and default worker count is 3
  • with an API key: default request delay is 0.12 seconds and default worker count is 8
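
As a rough illustration of these defaults, the selection logic might look like the sketch below. The names Pacing and default_pacing are hypothetical and not part of fetchm. Whether the delay applies per worker or across all workers is not stated here, so the sketch only returns the two numbers.

```python
from typing import NamedTuple, Optional


class Pacing(NamedTuple):
    """Hypothetical container for the two pacing settings."""
    delay_seconds: float
    workers: int


def default_pacing(api_key: Optional[str]) -> Pacing:
    """Pick the documented defaults depending on whether an API key is set."""
    if api_key:
        return Pacing(delay_seconds=0.12, workers=8)
    return Pacing(delay_seconds=0.34, workers=3)
```

A delay of 0.34 seconds works out to just under three requests per second, and 0.12 seconds to roughly eight.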

fetchm also keeps a persistent SQLite metadata cache inside each organism output directory so reruns do not need to refetch previously retrieved BioSample records.
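
The idea behind such a cache can be sketched with Python's built-in sqlite3 module. The table name, schema, and function names below are assumptions for illustration. fetchm's real cache layout is not documented here.

```python
import json
import sqlite3
from typing import Callable, Dict


def open_cache(path: str) -> sqlite3.Connection:
    """Open (or create) a cache database with one row per BioSample accession."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS biosample ("
        "accession TEXT PRIMARY KEY, payload TEXT NOT NULL)"
    )
    return conn


def get_metadata(
    conn: sqlite3.Connection,
    accession: str,
    fetch: Callable[[str], Dict[str, str]],
) -> Dict[str, str]:
    """Return cached metadata if present; otherwise call fetch and store the result."""
    row = conn.execute(
        "SELECT payload FROM biosample WHERE accession = ?", (accession,)
    ).fetchone()
    if row is not None:
        return json.loads(row[0])
    record = fetch(accession)  # hypothetical network call supplied by the caller
    conn.execute(
        "INSERT OR REPLACE INTO biosample (accession, payload) VALUES (?, ?)",
        (accession, json.dumps(record)),
    )
    conn.commit()
    return record
```

On a rerun, any accession already present in the table is served from disk, so only new records cost a network request.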

You can pass the key directly:

fetchm metadata --input ncbi_dataset.tsv --outdir results/ --api-key YOUR_NCBI_API_KEY

Or use an environment variable:

export NCBI_API_KEY=YOUR_NCBI_API_KEY
fetchm metadata --input ncbi_dataset.tsv --outdir results/

Optional contact email:

fetchm metadata --input ncbi_dataset.tsv --outdir results/ --api-key YOUR_NCBI_API_KEY --email you@example.com

Optional worker override:

fetchm metadata --input ncbi_dataset.tsv --outdir results/ --api-key YOUR_NCBI_API_KEY --workers 8

Usage

fetchm has three main commands:

fetchm metadata --input ncbi_dataset.tsv --outdir results/
fetchm run --input ncbi_dataset.tsv --outdir results/
fetchm seq --input results/<organism>/metadata_output/ncbi_clean.csv --outdir results/<organism>/sequence

Common examples:

fetchm metadata --input ncbi_dataset.tsv --outdir results/ --ani all
fetchm run --input ncbi_dataset.tsv --outdir results/ --checkm 95
fetchm seq --input ncbi_clean.csv --outdir sequence_output --country Bangladesh
fetchm seq --input ncbi_clean.csv --outdir sequence_output --cont Asia
fetchm seq --input ncbi_clean.csv --outdir sequence_output --check-only

Sequence filters:

fetchm seq \
  --input results/<organism>/metadata_output/ncbi_clean.csv \
  --outdir results/<organism>/sequence \
  --host "Homo sapiens" \
  --year 2018-2024 \
  --country Bangladesh
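
To make the filter semantics concrete, here is a minimal sketch of how host, year-range, and country filters could combine. It is not fetchm's implementation. The column names "Host", "Collection Year", and "Country" are hypothetical stand-ins for whatever ncbi_clean.csv actually uses, and accepting a single year is also an assumption.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def parse_year_range(spec: str) -> Tuple[int, int]:
    """Parse '2018-2024' (or a single year such as '2020') into inclusive bounds."""
    start, _, end = spec.partition("-")
    first = int(start)
    last = int(end) if end else first
    if first > last:
        raise ValueError(f"year range is reversed: {spec}")
    return first, last


def filter_records(
    rows: Iterable[Dict[str, str]],
    host: Optional[str] = None,
    year: Optional[str] = None,
    country: Optional[str] = None,
) -> List[Dict[str, str]]:
    """Keep rows that satisfy every filter that was supplied."""
    bounds = parse_year_range(year) if year else None
    kept = []
    for row in rows:
        if host and (row.get("Host") or "").strip().lower() != host.lower():
            continue
        if country and (row.get("Country") or "").strip().lower() != country.lower():
            continue
        if bounds:
            try:
                value = int(row.get("Collection Year") or "")
            except ValueError:
                continue  # rows without a usable year cannot match a year filter
            if not bounds[0] <= value <= bounds[1]:
                continue
        kept.append(row)
    return kept
```

In this sketch the filters combine with AND, so a record has to match all of the supplied values to be kept.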

Legacy compatibility commands are still available:

fetchM --input ncbi_dataset.tsv --outdir results/
fetchM --input ncbi_dataset.tsv --outdir results/ --seq
fetchM-seq --input ncbi_clean.csv --outdir sequence_output

Demo Files

Two example inputs and two workflow figures are bundled in the repository:

  • test.tsv: quick smoke-test dataset.
  • Vibrio_v1.tsv: the larger dataset used in the manuscript workflow.
  • figures/fetchm_workflow.svg: workflow flowchart for GitHub/documentation.
  • figures/fetchm_workflow.tiff: 600 dpi manuscript-ready workflow figure.

Quick smoke test:

fetchm metadata --input test.tsv --outdir test_output

Input Requirements

Download ncbi_dataset.tsv from the NCBI Genome Datasets interface.

If you are unsure which export options to pick, selecting all available columns in the NCBI table export is the safest route.

Required columns:

Column Name                           Description
Assembly Accession                    Unique identifier for the assembly
Assembly Name                         Name of the genome assembly
Organism Name                         Scientific name of the organism
ANI Check status                      ANI validation status from NCBI
Annotation Name                       Annotation pipeline name
Assembly Stats Total Sequence Length  Total sequence length
Assembly BioProject Accession         Linked BioProject accession
Assembly BioSample Accession          Linked BioSample accession
Annotation Count Gene Total           Total annotated genes
Annotation Count Gene Protein-coding  Protein-coding genes
Annotation Count Gene Pseudogene      Pseudogenes
CheckM completeness                   CheckM completeness value
CheckM contamination                  CheckM contamination value
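
Before starting a long run, it can help to confirm that the export contains every required column. The sketch below is a standalone pre-check and not part of fetchm. The function name missing_columns is hypothetical, and the column names are the ones listed above.

```python
import csv
from typing import Iterable, List

REQUIRED_COLUMNS = [
    "Assembly Accession",
    "Assembly Name",
    "Organism Name",
    "ANI Check status",
    "Annotation Name",
    "Assembly Stats Total Sequence Length",
    "Assembly BioProject Accession",
    "Assembly BioSample Accession",
    "Annotation Count Gene Total",
    "Annotation Count Gene Protein-coding",
    "Annotation Count Gene Pseudogene",
    "CheckM completeness",
    "CheckM contamination",
]


def missing_columns(lines: Iterable[str]) -> List[str]:
    """Read the first tab-separated line and list required columns it lacks."""
    header = next(csv.reader(lines, delimiter="\t"), [])
    present = {name.strip() for name in header}
    return [name for name in REQUIRED_COLUMNS if name not in present]


# Example use with a real file (the path is whatever your export is called):
#     with open("ncbi_dataset.tsv", newline="", encoding="utf-8") as handle:
#         print(missing_columns(handle))
```

An empty list means the header has everything listed in the table.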

Tips:

  • The file must be tab-separated.
  • Keep the original header names unchanged.
  • --checkm is optional. If you do not provide it, no CheckM filtering is applied.
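
The optional nature of the CheckM threshold can be sketched as follows. This is an illustration under assumptions, not fetchm's code: the default ANI value of "OK", the inclusive comparison, and dropping rows with no CheckM value when a threshold is set are all guesses.

```python
from typing import Dict, Iterable, List, Optional


def filter_assemblies(
    rows: Iterable[Dict[str, str]],
    ani: str = "OK",                 # assumed default; "all" disables the ANI filter
    checkm: Optional[float] = None,  # None means no CheckM filtering at all
) -> List[Dict[str, str]]:
    """Filter rows by ANI status and, if requested, by CheckM completeness."""
    kept = []
    for row in rows:
        if ani != "all" and (row.get("ANI Check status") or "").strip() != ani:
            continue
        if checkm is not None:
            try:
                completeness = float(row.get("CheckM completeness") or "")
            except ValueError:
                continue  # no usable completeness value, so the row cannot pass
            if completeness < checkm:
                continue
        kept.append(row)
    return kept
```

Leaving checkm as None skips the completeness test entirely, which mirrors the behavior described in the last tip.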

Output

For each run, fetchm creates an organism-specific result directory containing:

  • metadata_output/ncbi_dataset_updated.tsv
  • metadata_output/ncbi_clean.csv
  • metadata_output/metadata_summary.csv
  • metadata_output/assembly_summary.csv
  • metadata_output/annotation_summary.csv
  • metadata_output/metadata_harmonization_report.csv
  • figures/*.tiff
  • figures/Geographic Location_map.jpg
  • sequence/*.fna when sequence downloading is enabled
  • sequence/failed_accessions.txt after sequence audit or download

The harmonization report gives a quick completeness summary for the standardized metadata fields.
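
A completeness summary of this kind can be approximated in a few lines. The sketch below is only an illustration. The function name and field names are hypothetical, and the actual columns in metadata_harmonization_report.csv may differ.

```python
from typing import Dict, Iterable, List, Optional


def completeness(
    rows: List[Dict[str, Optional[str]]],
    fields: Iterable[str],
) -> Dict[str, float]:
    """Return the percentage of rows with a non-empty value for each field."""
    total = len(rows)
    report = {}
    for field in fields:
        filled = sum(1 for row in rows if (row.get(field) or "").strip())
        report[field] = round(100.0 * filled / total, 1) if total else 0.0
    return report
```

A field that is filled in for half of the records would be reported as 50.0.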

Notes

  • fetchm run already includes sequence downloading.
  • fetchm metadata and fetchm run support --ani, --checkm, --sleep, --api-key, --email, and --workers.
  • fetchm seq supports --host, --year, --country, --cont, --subcont, --retries, --retry-delay, and --check-only.
  • Scatter plots are skipped automatically when the filtered dataset does not contain enough valid points.
  • Runtime depends strongly on dataset size, NCBI responsiveness, and network conditions.
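
For readers wondering how --retries and --retry-delay typically interact, a generic retry loop is sketched below. It is not taken from fetchm. The function name with_retries is hypothetical, and treating retries as the number of extra attempts after the first one is an assumption.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def with_retries(action: Callable[[], T], retries: int = 3, retry_delay: float = 5.0) -> T:
    """Call action, retrying on OSError up to `retries` extra times."""
    last_error: Optional[OSError] = None
    for attempt in range(max(retries, 0) + 1):
        try:
            return action()
        except OSError as error:  # network and file errors derive from OSError
            last_error = error
            if attempt < retries:
                time.sleep(retry_delay)  # wait before the next attempt
    assert last_error is not None
    raise last_error
```

A download that still fails after the final attempt would then be a natural candidate for sequence/failed_accessions.txt.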

License

MIT License.

Author

Tasnimul Arabi Anik
