A Python tool for fetching bacterial genome metadata and sequences.
fetchm: Metadata Fetching and Analysis Tool
Overview
fetchm is a Python-based tool for fetching and analyzing genomic metadata from NCBI BioSample records. The ncbi_dataset.tsv file downloaded from the NCBI genome database lacks metadata fields such as 'Collection Date', 'Host', 'Geographic Location', and 'Isolation Source'. fetchm takes that file as input, retrieves the missing metadata for each BioSample ID from NCBI, filters the records by quality thresholds, and generates visualizations to help interpret the results. It can also download the sequences of the filtered genomes.
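To illustrate what "fetching the associated metadata for each BioSample ID" can involve, here is a minimal sketch that pulls attribute values out of a BioSample XML record. It is not fetchm's own code. The XML below is a hypothetical, trimmed-down record, and its layout (a `BioSample` element with `Attribute` children carrying an `attribute_name`) is an assumption about the response format rather than something stated in this document.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down BioSample record used only for illustration.
SAMPLE_XML = """
<BioSampleSet>
  <BioSample accession="SAMN00000001">
    <Attributes>
      <Attribute attribute_name="host">Homo sapiens</Attribute>
      <Attribute attribute_name="collection_date">2019-05-14</Attribute>
      <Attribute attribute_name="geo_loc_name">Bangladesh: Dhaka</Attribute>
      <Attribute attribute_name="isolation_source">blood</Attribute>
    </Attributes>
  </BioSample>
</BioSampleSet>
"""


def extract_attributes(xml_text):
    """Return {accession: {attribute_name: value}} for each BioSample element."""
    root = ET.fromstring(xml_text)
    records = {}
    for sample in root.iter("BioSample"):
        attrs = {}
        for attr in sample.iter("Attribute"):
            # Missing text is stored as an empty string rather than None.
            attrs[attr.get("attribute_name")] = (attr.text or "").strip()
        records[sample.get("accession")] = attrs
    return records


metadata = extract_attributes(SAMPLE_XML)
print(metadata["SAMN00000001"]["host"])
```

A real record carries many more attributes and may omit any of the four shown here, so downstream code should treat every field as optional.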
Features
- Fetch metadata from the NCBI BioSample API.
- Filter genomes based on CheckM completeness and ANI check status.
- Generate metadata summaries and annotation statistics.
- Create various visualizations for geographic distribution, collection dates, gene counts, continent, and subcontinent.
- Download genome sequences (optional).
- Download sequences after filtering by host species, year, country, continent, and subcontinent.
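The quality filter in the list above can be pictured with a short sketch. This is an illustration, not fetchm's implementation: the function name `filter_by_quality`, the passing ANI value `"OK"`, and the threshold of 95 are assumptions chosen for the example. The column names come from the required-columns list, and `"all"` mirrors the `--ani all` option.

```python
import csv
import io


def filter_by_quality(rows, min_checkm=95.0, ani_status="OK"):
    """Keep rows meeting the CheckM threshold and the requested ANI status.

    Passing ani_status="all" disables the ANI check.
    """
    kept = []
    for row in rows:
        try:
            completeness = float(row["CheckM completeness"])
        except (KeyError, ValueError):
            continue  # skip rows with a missing or non-numeric score
        if completeness < min_checkm:
            continue
        if ani_status != "all" and row.get("ANI Check status") != ani_status:
            continue
        kept.append(row)
    return kept


# Made-up accessions and scores, for illustration only.
EXAMPLE_TSV = (
    "Assembly Accession\tANI Check status\tCheckM completeness\n"
    "GCF_000000001.1\tOK\t99.1\n"
    "GCF_000000002.1\tInconclusive\t98.7\n"
    "GCF_000000003.1\tOK\t82.4\n"
)

rows = list(csv.DictReader(io.StringIO(EXAMPLE_TSV), delimiter="\t"))
kept = filter_by_quality(rows, min_checkm=95.0)
print([row["Assembly Accession"] for row in kept])
```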
Installation
Install in a New Conda Environment
conda create -n fetchm python=3.9
conda activate fetchm
pip install fetchm
Usage
fetchm has three main modes:
- Generate metadata summaries and ncbi_clean.csv from an NCBI dataset TSV:
fetchm metadata --input ncbi_dataset.tsv --outdir results/
- Run the full workflow: metadata generation plus sequence download:
fetchm run --input ncbi_dataset.tsv --outdir results/
- Download sequences later from an existing ncbi_clean.csv:
fetchm seq --input results/<organism>/metadata_output/ncbi_clean.csv --outdir results/<organism>/sequence
Common examples:
Download all metadata records regardless of ANI status:
fetchm metadata --input ncbi_dataset.tsv --outdir results/ --ani all
Run the full pipeline with a CheckM threshold:
fetchm run --input ncbi_dataset.tsv --outdir results/ --checkm 95
Download only sequences from human isolates collected between 2018 and 2024:
fetchm seq \
--input results/<organism>/metadata_output/ncbi_clean.csv \
--outdir results/<organism>/sequence \
--host "Homo sapiens" \
--year 2018-2024
Download only sequences from a specific country or continent:
fetchm seq --input ncbi_clean.csv --outdir sequence_output --country Bangladesh
fetchm seq --input ncbi_clean.csv --outdir sequence_output --cont Asia
Check download completeness without downloading anything:
fetchm seq --input ncbi_clean.csv --outdir sequence_output --check-only
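For readers who want to preview which records a host and year filter would select, the sketch below applies the same idea to rows read from ncbi_clean.csv. It is an approximation under stated assumptions: the column names `Host` and `Collection Date` are taken from the metadata fields named in the overview, the helper names are hypothetical, and fetchm's own matching rules may differ.

```python
import re


def parse_year_range(spec):
    """Turn '2018-2024' (or a single year such as '2020') into (start, end)."""
    parts = [int(part) for part in spec.split("-")]
    if len(parts) > 2 or parts[0] > parts[-1]:
        raise ValueError(f"invalid year range: {spec!r}")
    return parts[0], parts[-1]


def collection_year(value):
    """Return the leading four-digit year of a date string, or None."""
    match = re.match(r"\s*(\d{4})", value or "")
    return int(match.group(1)) if match else None


def matches(row, host=None, years=None):
    """True if the row satisfies every filter that was supplied."""
    if host is not None:
        if (row.get("Host") or "").strip().lower() != host.lower():
            return False
    if years is not None:
        year = collection_year(row.get("Collection Date"))
        if year is None or not (years[0] <= year <= years[1]):
            return False
    return True


# Made-up rows, for illustration only.
rows = [
    {"Host": "Homo sapiens", "Collection Date": "2019-05-14"},
    {"Host": "Homo sapiens", "Collection Date": "2015"},
    {"Host": "Bos taurus", "Collection Date": "2020-01"},
    {"Host": "homo sapiens", "Collection Date": "missing"},
]
selected = [r for r in rows if matches(r, "Homo sapiens", parse_year_range("2018-2024"))]
print(len(selected))
```

Rows without a parseable year are excluded whenever a year filter is given, which is a deliberate choice in this sketch.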
Important notes:
- `fetchm run` already includes sequence downloading. You do not need to add `--seq` when using `fetchm run`.
- `--seq` is only relevant for the legacy `fetchM` command, where it controls whether sequence downloading happens after metadata generation.
- `fetchm seq` supports metadata-based sequence filters: `--host`, `--year`, `--country`, `--cont`, and `--subcont`.
- Metadata filtering options for `fetchm metadata` and `fetchm run` include `--ani`, `--checkm`, and `--sleep`.
- Sequence retry behavior can be adjusted with `--retries` and `--retry-delay`.
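As a rough picture of what a retry count and a retry delay control, the following sketch wraps an arbitrary action in a fixed-delay retry loop. The function `with_retries` is hypothetical and its semantics are an assumption: here `retries` means the total number of attempts, which may not match how fetchm counts them.

```python
import time


def with_retries(action, retries=3, retry_delay=5.0, sleep=time.sleep):
    """Call action() up to `retries` times, pausing `retry_delay` seconds between tries."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return action()
        except Exception as exc:
            last_error = exc
            if attempt < retries:
                sleep(retry_delay)  # wait before the next attempt
    raise last_error


# Demonstration with an action that fails twice and then succeeds.
calls = {"count": 0}


def flaky_download():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "done"


delays = []
result = with_retries(flaky_download, retries=3, retry_delay=0.5, sleep=delays.append)
print(result, delays)
```

Injecting `sleep` keeps the demonstration instant; in normal use the default `time.sleep` applies.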
Legacy compatibility commands:
fetchM --input ncbi_dataset.tsv --outdir results/
fetchM --input ncbi_dataset.tsv --outdir results/ --seq
fetchM-seq --input ncbi_clean.csv --outdir sequence_output
Test With test.tsv
Run a quick metadata-only smoke test:
fetchm metadata --input test.tsv --outdir test_output
Run the full pipeline, including sequence download:
fetchm run --input test.tsv --outdir test_output
Check downloaded sequence completeness from the generated ncbi_clean.csv:
fetchm seq \
--input test_output/Staphylococcus_haemolyticus/metadata_output/ncbi_clean.csv \
--outdir test_output/Staphylococcus_haemolyticus/sequence \
--check-only
Input
Download the ncbi_dataset.tsv file for your target organism(s) from the NCBI genome database.
Required Columns for ncbi_dataset.tsv in fetchm
Before running fetchm, ensure that your ncbi_dataset.tsv file includes the following columns. These columns are necessary for metadata enrichment, quality filtering, and downstream analysis.
🧬 Required Columns
| Column Name | Description |
|---|---|
| Assembly Accession | Unique identifier for the assembly |
| Assembly Name | Name of the genome assembly |
| Organism Name | Scientific name of the organism |
| ANI Check status | Status of Average Nucleotide Identity (ANI) check |
| Annotation Name | Annotation version or label used |
| Assembly Stats Total Sequence Length | Total length (in base pairs) of all sequences in the assembly |
| Assembly BioProject Accession | Accession ID for the related BioProject |
| Assembly BioSample Accession | Accession ID for the related BioSample |
| Annotation Count Gene Total | Total number of genes annotated |
| Annotation Count Gene Protein-coding | Number of protein-coding genes |
| Annotation Count Gene Pseudogene | Number of pseudogenes |
| CheckM completeness | Completeness score from CheckM (in %) |
| CheckM contamination | Contamination score from CheckM (in %) |
✅ Tips
- The file must be tab-separated (.tsv format).
- Do not change the column headers.
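Before a long run, it can help to confirm that the input file really has the required headers. The sketch below is one way to do that with the standard library. The helper name `missing_columns` is hypothetical, and the check is an independent convenience, not part of fetchm.

```python
import csv
import io

REQUIRED_COLUMNS = [
    "Assembly Accession",
    "Assembly Name",
    "Organism Name",
    "ANI Check status",
    "Annotation Name",
    "Assembly Stats Total Sequence Length",
    "Assembly BioProject Accession",
    "Assembly BioSample Accession",
    "Annotation Count Gene Total",
    "Annotation Count Gene Protein-coding",
    "Annotation Count Gene Pseudogene",
    "CheckM completeness",
    "CheckM contamination",
]


def missing_columns(handle):
    """Return the required column names absent from a tab-separated header row."""
    header = next(csv.reader(handle, delimiter="\t"), [])
    present = {name.strip() for name in header}
    return [name for name in REQUIRED_COLUMNS if name not in present]


# In-memory example; with a real file use open("ncbi_dataset.tsv", newline="").
example = io.StringIO("Assembly Accession\tAssembly Name\tOrganism Name\n")
print(missing_columns(example))
```

An empty list means every required column is present.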
Output
fetchm creates a subdirectory inside the output directory (for example results/), named after the organism in the input file. Inside this subdirectory, the following folders are created:
- Metadata summaries in metadata_output/
  - annotation_summary.csv
  - assembly_summary.csv
  - metadata_summary.csv
  - ncbi_clean.csv
  - ncbi_filtered.csv
  - ncbi_dataset_updated.tsv
- Figures in figures/
  - Annotation Count Gene Protein-coding_distribution.tiff
  - Annotation Count Gene Pseudogene_distribution.tiff
  - Annotation Count Gene Total_distribution.tiff
  - Assembly Stats Total Sequence Length_distribution.tiff
  - Collection Date_bar_plots.tiff
  - Continent_bar_plots.tiff
  - Geographic Location_bar_plots.tiff
  - Geographic Location_map.jpg
  - Host_bar_plots.tiff
  - scatter_plot_gene_protein_coding_vs_collection_date.tiff
  - scatter_plot_gene_total_vs_collection_date.tiff
  - scatter_plot_total_sequence_length_vs_collection_date.tiff
  - Subcontinent_bar_plots.tiff
- Sequences in sequence/ (the downloaded genome sequences, present after fetchm run or fetchm seq, or when --seq is passed to the legacy fetchM command).
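After a run, the expected metadata files can be checked with a few lines of Python. This sketch assumes the layout described above and the underscore-joined organism directory name seen in the test example (Staphylococcus_haemolyticus). The helper `missing_metadata_files` is hypothetical, and the temporary directory stands in for a real output directory.

```python
import tempfile
from pathlib import Path

EXPECTED_METADATA_FILES = [
    "annotation_summary.csv",
    "assembly_summary.csv",
    "metadata_summary.csv",
    "ncbi_clean.csv",
    "ncbi_filtered.csv",
    "ncbi_dataset_updated.tsv",
]


def missing_metadata_files(outdir, organism_dir):
    """List expected files not found under <outdir>/<organism_dir>/metadata_output."""
    base = Path(outdir) / organism_dir / "metadata_output"
    return [name for name in EXPECTED_METADATA_FILES if not (base / name).is_file()]


# Demonstration against a throwaway directory holding only two of the files.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "Staphylococcus_haemolyticus" / "metadata_output"
    base.mkdir(parents=True)
    (base / "ncbi_clean.csv").touch()
    (base / "ncbi_filtered.csv").touch()
    missing = missing_metadata_files(tmp, "Staphylococcus_haemolyticus")

print(missing)
```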
Visualizations
The generated figures fall into four groups: Annotation Distributions, Assembly Statistics, Metadata Summaries, and Scatter Plots.
License
This project is licensed under the MIT License.
Author
Developed by Tasnimul Arabi Anik.
Contributions
Contributions and improvements are welcome! Feel free to submit a pull request or report issues.