A Python tool for fetching bacterial genome metadata and sequences.
fetchm: Metadata Fetching and Analysis Tool
Overview
fetchm is a command-line tool for bacterial comparative genomics workflows. It starts from an ncbi_dataset.tsv downloaded from the NCBI Genome interface, retrieves linked BioSample metadata, standardizes key fields, summarizes the dataset, generates figures, and can optionally download the filtered genome FASTA files.
The tool is intended primarily for bacterial genomes. Metadata structures differ across organism groups, so non-bacterial datasets may not behave consistently.
Features
- Fetch Isolation Source, Collection Date, Geographic Location, and Host from NCBI BioSample.
- Filter records by ANI status and optional CheckM completeness threshold.
- Standardize common missing-value strings and harmonize collection year and country names.
- Generate summary tables, harmonization reports, and publication-ready plots.
- Download genome FASTA files from NCBI FTP after filtering by host, year, country, continent, or subcontinent.
- Audit an existing sequence directory with --check-only.
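To make the standardization step more concrete, here is a minimal sketch of what normalizing missing-value strings and extracting a collection year could look like. The function names and the placeholder list are illustrative assumptions, not fetchm's actual internals, and the real tool may handle more cases.

```python
import re
from typing import Optional

# Hypothetical placeholder strings; the list fetchm actually uses may differ.
MISSING_TOKENS = {
    "", "na", "n/a", "none", "unknown", "missing",
    "not collected", "not applicable", "not provided",
}


def normalize_missing(value: Optional[str]) -> Optional[str]:
    """Return None for placeholder strings, otherwise the stripped value."""
    if value is None:
        return None
    cleaned = value.strip()
    return None if cleaned.lower() in MISSING_TOKENS else cleaned


def extract_year(collection_date: Optional[str]) -> Optional[int]:
    """Pull the first standalone four-digit year (1900-2099) from a date string."""
    cleaned = normalize_missing(collection_date)
    if cleaned is None:
        return None
    match = re.search(r"(?<!\d)(?:19|20)\d{2}(?!\d)", cleaned)
    return int(match.group(0)) if match else None


def country_from_location(geo_location: Optional[str]) -> Optional[str]:
    """Take the part before the first colon, e.g. 'Bangladesh: Dhaka'."""
    cleaned = normalize_missing(geo_location)
    if cleaned is None:
        return None
    return cleaned.split(":", 1)[0].strip()
```

Returning None for every placeholder keeps later summaries from counting strings such as "not collected" as real values.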
Installation
Create a fresh environment and install from PyPI:
conda create -n fetchm python=3.9
conda activate fetchm
pip install fetchm
fetchm uses Python dependencies only. No separate wget installation is required for the current release.
NCBI API Key
For faster metadata retrieval, you can provide an NCBI API key.
How to create one:
- Sign in to your My NCBI account.
- Open Account Settings.
- Find API Key Management.
- Create an API key.
Official NCBI references:
- https://www.ncbi.nlm.nih.gov/books/NBK25497/
- https://www.ncbi.nlm.nih.gov/books/NBK53593/
- https://www.ncbi.nlm.nih.gov/datasets/docs/v2/api/api-keys/
How fetchm uses request pacing:
- without an API key: default request delay is 0.34 seconds
- with an API key: default request delay is 0.12 seconds
- without an API key: default worker count is 3
- with an API key: default worker count is 8
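The defaults above can be summarized in a short sketch. This is an illustration of the selection logic only, assuming that an explicit key takes precedence over the NCBI_API_KEY environment variable and that --sleep and --workers override the defaults; `Pacing` and `resolve_pacing` are hypothetical names, not part of fetchm's API.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Pacing:
    delay_seconds: float
    workers: int


def resolve_pacing(
    api_key: Optional[str] = None,
    sleep: Optional[float] = None,
    workers: Optional[int] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Pacing:
    """Pick request delay and worker count from the documented defaults."""
    environment = env if env is not None else os.environ
    # Assumption: a key passed directly wins over the environment variable.
    key = api_key or environment.get("NCBI_API_KEY")
    default_delay, default_workers = (0.12, 8) if key else (0.34, 3)
    return Pacing(
        delay_seconds=default_delay if sleep is None else sleep,
        workers=default_workers if workers is None else workers,
    )
```

For example, `resolve_pacing(env={})` yields the keyless defaults, while `resolve_pacing(api_key="YOUR_NCBI_API_KEY", workers=4)` keeps the shorter delay but caps the worker count at 4.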
fetchm also keeps a persistent SQLite metadata cache inside each organism output directory so reruns do not need to refetch previously retrieved BioSample records.
Sequence downloads also keep a small SQLite cache of resolved assembly directory paths inside the sequence output directory so reruns can skip repeated FTP path discovery.
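The idea behind the metadata cache can be sketched with Python's built-in sqlite3 module. The table layout, the JSON payload, and the function names here are assumptions made for illustration; fetchm's actual cache schema is not documented in this README.

```python
import json
import sqlite3
from typing import Callable, Dict


def open_cache(path: str) -> sqlite3.Connection:
    """Open (or create) a cache database with one row per BioSample accession."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS biosample ("
        "accession TEXT PRIMARY KEY, payload TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def get_metadata(
    conn: sqlite3.Connection,
    accession: str,
    fetch: Callable[[str], Dict[str, str]],
) -> Dict[str, str]:
    """Return the cached record if present; otherwise fetch it and store it."""
    row = conn.execute(
        "SELECT payload FROM biosample WHERE accession = ?", (accession,)
    ).fetchone()
    if row is not None:
        return json.loads(row[0])
    record = fetch(accession)  # hypothetical network call supplied by the caller
    conn.execute(
        "INSERT OR REPLACE INTO biosample (accession, payload) VALUES (?, ?)",
        (accession, json.dumps(record)),
    )
    conn.commit()
    return record
```

Because the database lives on disk, a rerun over the same accessions only calls `fetch` for records that were not stored previously.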
You can pass the key directly:
fetchm metadata --input ncbi_dataset.tsv --outdir results/ --api-key YOUR_NCBI_API_KEY
Or use an environment variable:
export NCBI_API_KEY=YOUR_NCBI_API_KEY
fetchm metadata --input ncbi_dataset.tsv --outdir results/
Optional contact email:
fetchm metadata --input ncbi_dataset.tsv --outdir results/ --api-key YOUR_NCBI_API_KEY --email you@example.com
Optional worker override:
fetchm metadata --input ncbi_dataset.tsv --outdir results/ --api-key YOUR_NCBI_API_KEY --workers 8
Optional sequence download worker override:
fetchm seq --input ncbi_clean.csv --outdir sequence_output --download-workers 4
Usage
fetchm has three main commands:
fetchm metadata --input ncbi_dataset.tsv --outdir results/
fetchm run --input ncbi_dataset.tsv --outdir results/
fetchm seq --input results/<organism>/metadata_output/ncbi_clean.csv --outdir results/<organism>/sequence
Common examples:
fetchm metadata --input ncbi_dataset.tsv --outdir results/ --ani all
fetchm run --input ncbi_dataset.tsv --outdir results/ --checkm 95
fetchm seq --input ncbi_clean.csv --outdir sequence_output --country Bangladesh
fetchm seq --input ncbi_clean.csv --outdir sequence_output --cont Asia
fetchm seq --input ncbi_clean.csv --outdir sequence_output --check-only
Sequence filters:
fetchm seq \
--input results/<organism>/metadata_output/ncbi_clean.csv \
--outdir results/<organism>/sequence \
--host "Homo sapiens" \
--year 2018-2024 \
--country Bangladesh
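For readers who want to preview which rows such a filter would keep before downloading anything, the following sketch applies the same three criteria to rows from ncbi_clean.csv. The column names "Host", "Collection Year", and "Country" are assumptions about the cleaned file, and accepting a single year such as "2020" is also an assumption; check your own ncbi_clean.csv header and the tool's help output before relying on either.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple


def parse_year_range(spec: str) -> Tuple[int, int]:
    """'2018-2024' -> (2018, 2024); a single year '2020' -> (2020, 2020)."""
    start, _, end = spec.partition("-")
    first = int(start)
    last = int(end) if end else first
    if first > last:
        raise ValueError(f"start year {first} is after end year {last}")
    return first, last


def filter_rows(
    rows: Iterable[Dict[str, str]],
    host: Optional[str] = None,
    year: Optional[str] = None,
    country: Optional[str] = None,
) -> Iterator[Dict[str, str]]:
    """Yield rows matching every filter that was provided (case-insensitive)."""
    bounds = parse_year_range(year) if year else None
    for row in rows:
        if host and (row.get("Host") or "").strip().lower() != host.lower():
            continue
        if country and (row.get("Country") or "").strip().lower() != country.lower():
            continue
        if bounds:
            try:
                # int(float(...)) also accepts values written as '2019.0'.
                value = int(float(row.get("Collection Year") or ""))
            except ValueError:
                continue  # rows without a usable year cannot match a year filter
            if not bounds[0] <= value <= bounds[1]:
                continue
        yield row
```

Rows can be supplied by `csv.DictReader` opened on ncbi_clean.csv, for example `filter_rows(reader, host="Homo sapiens", year="2018-2024", country="Bangladesh")`.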
Legacy compatibility commands are still available:
fetchM --input ncbi_dataset.tsv --outdir results/
fetchM --input ncbi_dataset.tsv --outdir results/ --seq
fetchM-seq --input ncbi_clean.csv --outdir sequence_output
Demo Files
Two example inputs and two workflow figures are bundled in the repository:
- test.tsv: quick smoke-test dataset.
- Vibrio_v1.tsv: the larger dataset used in the manuscript workflow.
- figures/fetchm_workflow.svg: workflow flowchart for GitHub/documentation.
- figures/fetchm_workflow.tiff: 600 dpi manuscript-ready workflow figure.
Quick smoke test:
fetchm metadata --input test.tsv --outdir test_output
Input Requirements
Download ncbi_dataset.tsv from the NCBI Genome Datasets interface.
If you are unsure which export options to pick, selecting all available columns in the NCBI table export is the safest route.
Required columns:
| Column Name | Description |
|---|---|
| Assembly Accession | Unique identifier for the assembly |
| Assembly Name | Name of the genome assembly |
| Organism Name | Scientific name of the organism |
| ANI Check status | ANI validation status from NCBI |
| Annotation Name | Annotation pipeline name |
| Assembly Stats Total Sequence Length | Total sequence length |
| Assembly BioProject Accession | Linked BioProject accession |
| Assembly BioSample Accession | Linked BioSample accession |
| Annotation Count Gene Total | Total annotated genes |
| Annotation Count Gene Protein-coding | Protein-coding genes |
| Annotation Count Gene Pseudogene | Pseudogenes |
| CheckM completeness | CheckM completeness value |
| CheckM contamination | CheckM contamination value |
Tips:
- The file must be tab-separated.
- Keep the original header names unchanged.
- --checkm is optional. If you do not provide it, no CheckM filtering is applied.
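Before starting a long run, it can help to confirm that the export really is tab-separated and carries the required headers. The check below is a standalone sketch built from the column list above; `missing_columns` is a hypothetical helper and is not shipped with fetchm.

```python
import csv
from typing import List, TextIO

REQUIRED_COLUMNS = [
    "Assembly Accession",
    "Assembly Name",
    "Organism Name",
    "ANI Check status",
    "Annotation Name",
    "Assembly Stats Total Sequence Length",
    "Assembly BioProject Accession",
    "Assembly BioSample Accession",
    "Annotation Count Gene Total",
    "Annotation Count Gene Protein-coding",
    "Annotation Count Gene Pseudogene",
    "CheckM completeness",
    "CheckM contamination",
]


def missing_columns(handle: TextIO) -> List[str]:
    """Return required headers absent from the first line of a TSV file."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, [])
    present = set(header)
    # Exact matching mirrors the tip to keep the original header names unchanged.
    return [name for name in REQUIRED_COLUMNS if name not in present]
```

A typical use is `with open("ncbi_dataset.tsv", newline="", encoding="utf-8") as fh: print(missing_columns(fh))`. An empty list means all required headers were found, while a full list of 13 names usually points to a file that is comma-separated rather than tab-separated.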
Output
For each run, fetchm creates an organism-specific result directory containing:
- metadata_output/ncbi_dataset_updated.tsv
- metadata_output/ncbi_clean.csv
- metadata_output/metadata_summary.csv
- metadata_output/assembly_summary.csv
- metadata_output/annotation_summary.csv
- metadata_output/metadata_harmonization_report.csv
- figures/*.tiff
- figures/Geographic Location_map.jpg
- sequence/*.fna when sequence downloading is enabled
- sequence/failed_accessions.txt after sequence audit or download
The harmonization report gives a quick completeness summary for the standardized metadata fields.
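The sequence audit that produces failed_accessions.txt can be pictured roughly as follows. This is a sketch under stated assumptions: it treats an accession as present when a non-empty .fna file is named either `<accession>.fna` or `<accession>_<anything>.fna`. The file-naming rule and both function names are hypothetical, and fetchm's own check may use different criteria.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def find_missing(expected: Iterable[str], files: Iterable[Tuple[str, int]]) -> List[str]:
    """Return accessions with no non-empty .fna file among (name, size) pairs."""
    usable = [name for name, size in files if name.endswith(".fna") and size > 0]

    def has_file(accession: str) -> bool:
        # Requiring '_' or '.fna' right after the accession avoids matching
        # version '.1' against a file that belongs to version '.10'.
        return any(
            name == f"{accession}.fna" or name.startswith(f"{accession}_")
            for name in usable
        )

    return [accession for accession in expected if not has_file(accession)]


def audit_directory(expected: Iterable[str], seq_dir: str) -> List[str]:
    """Scan a sequence directory and report accessions that still need a download."""
    files = [(p.name, p.stat().st_size) for p in Path(seq_dir).iterdir() if p.is_file()]
    return find_missing(expected, files)
```

The returned list could then be written one accession per line, which is the kind of content a failed_accessions.txt file would be expected to hold.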
Notes
- fetchm run already includes sequence downloading.
- fetchm metadata and fetchm run support --ani, --checkm, --sleep, --api-key, --email, and --workers.
- fetchm seq supports --host, --year, --country, --cont, --subcont, --retries, --retry-delay, --check-only, and --download-workers.
- Scatter plots are skipped automatically when the filtered dataset does not contain enough valid points.
- Runtime depends strongly on dataset size, NCBI responsiveness, and network conditions.
License
MIT License.
Author
Tasnimul Arabi Anik