A Python tool for fetching metadata for bacterial genomes.

FetchM: Metadata Fetching and Analysis Tool

Overview

FetchM is a Python tool for fetching and analyzing genomic metadata from NCBI BioSample records. The ncbi_dataset.tsv file downloaded from the NCBI genome database lacks metadata fields such as 'Collection Date', 'Host', 'Geographic Location', and 'Isolation Source'. FetchM takes that file as input, retrieves the missing metadata for each BioSample ID from NCBI, filters the genomes by quality thresholds, and generates visualizations to help interpret the results. It can also download the genome sequences that pass the filters.
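To make the metadata step concrete, here is a minimal sketch of pulling the four fields named above out of a BioSample XML record. It is not FetchM's own code: the sample record is invented, and the element and attribute names (`BioSample`, `Attribute`, `harmonized_name`) are assumptions about the shape of NCBI's BioSample XML that should be checked against a real record.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed BioSample record used only for illustration.
SAMPLE_XML = """<BioSampleSet>
  <BioSample accession="SAMN00000001">
    <Attributes>
      <Attribute harmonized_name="host">Homo sapiens</Attribute>
      <Attribute harmonized_name="collection_date">2019-05-14</Attribute>
      <Attribute harmonized_name="geo_loc_name">Bangladesh: Dhaka</Attribute>
      <Attribute harmonized_name="isolation_source">blood</Attribute>
    </Attributes>
  </BioSample>
</BioSampleSet>"""

# Assumed mapping from harmonized attribute names to the column labels
# used in this README.
WANTED = {
    "collection_date": "Collection Date",
    "host": "Host",
    "geo_loc_name": "Geographic Location",
    "isolation_source": "Isolation Source",
}


def extract_metadata(xml_text):
    """Return {accession: {label: value or None}} for each BioSample."""
    root = ET.fromstring(xml_text)
    records = {}
    for sample in root.iter("BioSample"):
        # Start with every field missing, then fill in what the record has.
        fields = {label: None for label in WANTED.values()}
        for attr in sample.iter("Attribute"):
            key = attr.get("harmonized_name")
            if key in WANTED:
                fields[WANTED[key]] = (attr.text or "").strip() or None
        records[sample.get("accession")] = fields
    return records


records = extract_metadata(SAMPLE_XML)
print(records["SAMN00000001"])
```

Fields absent from a record stay `None`, which keeps missing metadata distinguishable from empty strings in later summaries.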

Features

  • Fetch metadata from NCBI BioSample API.
  • Filter genomes based on CheckM completeness and ANI check status.
  • Generate metadata summaries and annotation statistics.
  • Create various visualizations for geographic distribution, collection dates, gene counts, continent, and subcontinent.
  • Download genome sequences (optional).
  • Download sequences after filtering by host species, year, country, continent, and subcontinent.
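The quality filter in the list above can be sketched roughly as follows. This is an illustration, not FetchM's implementation: the column names `CheckM completeness` and `ANI Check status`, the `OK` status value, and the use of `>=` for the threshold are all assumptions, and the sample rows are invented.

```python
import csv
import io

# Hypothetical rows mimicking a few columns of ncbi_dataset.tsv.
SAMPLE_ROWS = [
    ["Assembly Accession", "CheckM completeness", "ANI Check status"],
    ["GCF_000000001.1", "99.2", "OK"],
    ["GCF_000000002.1", "91.0", "OK"],
    ["GCF_000000003.1", "98.7", "Failed"],
    ["GCF_000000004.1", "", "OK"],
]
SAMPLE_TSV = "\n".join("\t".join(row) for row in SAMPLE_ROWS) + "\n"


def filter_by_quality(rows, min_completeness=95.0, ani_ok="OK"):
    """Keep rows that meet the completeness threshold and pass the ANI check."""
    kept = []
    for row in rows:
        try:
            completeness = float(row.get("CheckM completeness"))
        except (TypeError, ValueError):
            # Missing or non-numeric completeness: treat as not passing.
            continue
        if completeness >= min_completeness and row.get("ANI Check status") == ani_ok:
            kept.append(row)
    return kept


reader = csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t")
passed = filter_by_quality(reader)
print([row["Assembly Accession"] for row in passed])
```

Rows without a usable completeness value are dropped here; a different policy (for example, keeping them and flagging them) would be an equally reasonable choice.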

Installation

Option 1: Install via Conda (Recommended)

conda install -c conda-forge fetchM

Option 2: Install in a New Conda Environment (Isolated)

conda create -n fetchM_env -c conda-forge fetchM
conda activate fetchM_env

Option 3: Install via pip

pip install fetchM

Usage

Run FetchM with the following command:

fetchM --input input.tsv --outdir results/

Additional Options:

  • --checkm CHECKM (Minimum CheckM completeness threshold, default: 95)
  • --sleep (Time to wait between requests, default: 0.5s)
  • --seq (Enable sequence download mode)
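For scripted runs, the command and the options above can be assembled from Python. The sketch below only builds the argument list from the flags documented in this README; the helper name `build_fetchm_command` is hypothetical, and the actual call is left commented out because it needs FetchM installed and network access.

```python
import shlex
import subprocess  # only needed if the commented-out call below is enabled


def build_fetchm_command(input_tsv, outdir, checkm=None, sleep=None, seq=False):
    """Build the fetchM argument list; None means 'use the tool's default'."""
    cmd = ["fetchM", "--input", str(input_tsv), "--outdir", str(outdir)]
    if checkm is not None:
        cmd += ["--checkm", str(checkm)]
    if sleep is not None:
        cmd += ["--sleep", str(sleep)]
    if seq:
        cmd.append("--seq")
    return cmd


cmd = build_fetchm_command("ncbi_dataset.tsv", "results/", checkm=98, sleep=1.0, seq=True)
print(shlex.join(cmd))

# To actually run it (requires fetchM on PATH):
# subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems when paths contain spaces.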

Downloading sequences based on different criteria

  • --host HOST [HOST ...] (Filter by host species, e.g., "Homo sapiens" "Bos taurus")
  • --year YEAR [YEAR ...] (Filter by year or year range, e.g., "2015" "2018-2025")
  • --country COUNTRY [COUNTRY ...] (Filter by country, e.g., "Bangladesh" "United States")
  • --cont CONT [CONT ...] (Filter by continent, e.g., "Asia" "Africa")
  • --subcont SUBCONT [SUBCONT ...] (Filter by subcontinent, e.g., "Southern Asia" "Western Africa")
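The --year option accepts both single years and ranges. One way such values could be interpreted is sketched below; the function names are hypothetical, and treating the range end as inclusive is an assumption rather than documented FetchM behavior.

```python
def parse_year_specs(specs):
    """Turn values like "2015" or "2018-2025" into (start, end) tuples."""
    ranges = []
    for spec in specs:
        start, sep, end = spec.strip().partition("-")
        first = int(start)
        last = int(end) if sep else first  # a single year is a one-year range
        if first > last:
            raise ValueError(f"invalid year range: {spec!r}")
        ranges.append((first, last))
    return ranges


def year_matches(year, ranges):
    """True if the year falls inside any range (both ends inclusive)."""
    return any(first <= year <= last for first, last in ranges)


ranges = parse_year_specs(["2015", "2018-2025"])
for year in (2015, 2016, 2020):
    print(year, year_matches(year, ranges))
```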

Input

Download ncbi_dataset.tsv for your target organism from the NCBI genome database and pass it to FetchM with --input.

Output

FetchM creates a subdirectory in the output directory (results/ in the example above), named after the organism in the input file. Inside this subdirectory, the following folders are created:

  • Metadata summaries in metadata_output/
    • annotation_summary.csv
    • assembly_summary.csv
    • metadata_summary.csv
    • ncbi_clean.csv
    • ncbi_filtered.csv
    • ncbi_dataset_updated.tsv
  • Figures in figures/
    • Annotation Count Gene Protein-coding_distribution.tiff
    • Annotation Count Gene Pseudogene_distribution.tiff
    • Annotation Count Gene Total_distribution.tiff
    • Assembly Stats Total Sequence Length_distribution.tiff
    • Collection Date_bar_plots.tiff
    • Continent_bar_plots.tiff
    • Geographic Location_bar_plots.tiff
    • Host_bar_plots.tiff
    • scatter_plot_gene_protein_coding_vs_collection_date.tiff
    • scatter_plot_gene_total_vs_collection_date.tiff
    • scatter_plot_total_sequence_length_vs_collection_date.tiff
    • Subcontinent_bar_plots.tiff
  • Sequences in sequences/ (contains the downloaded genome sequences when --seq is enabled).
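After a run, the layout above can be inspected programmatically. The sketch below walks an output directory and lists what each organism subdirectory contains. The folder names come from this README, but the organism folder name `Escherichia_coli` and the demo files are made up, since the exact naming FetchM uses is not specified here.

```python
import tempfile
from pathlib import Path

SUBFOLDERS = ("metadata_output", "figures", "sequences")


def list_outputs(outdir):
    """Map each organism subdirectory to the files found in its known folders."""
    found = {}
    for organism_dir in sorted(p for p in Path(outdir).iterdir() if p.is_dir()):
        contents = {}
        for name in SUBFOLDERS:
            folder = organism_dir / name
            # sequences/ only exists when --seq was used, so tolerate absence.
            contents[name] = (
                sorted(f.name for f in folder.iterdir() if f.is_file())
                if folder.is_dir()
                else []
            )
        found[organism_dir.name] = contents
    return found


# Demo on a throwaway directory that imitates the documented layout.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "Escherichia_coli"  # hypothetical organism folder name
    (base / "metadata_output").mkdir(parents=True)
    (base / "figures").mkdir()
    (base / "metadata_output" / "ncbi_filtered.csv").write_text("demo\n")
    (base / "figures" / "Host_bar_plots.tiff").write_bytes(b"")
    outputs = list_outputs(tmp)

print(outputs)
```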

Visualizations

Annotation Distributions

  • Annotation Count Gene Protein-coding
  • Annotation Count Gene Pseudogene
  • Annotation Count Gene Total

Assembly Statistics

  • Assembly Sequence Length

Metadata Summaries

  • Collection Date Distribution
  • Geographic Location Distribution
  • Host Distribution
  • Continent Distribution
  • Subcontinent Distribution

Scatter Plots

  • Gene Protein Coding vs Collection Date
  • Gene Total vs Collection Date
  • Sequence Length vs Collection Date

License

This project is licensed under the MIT License.

Author

Developed by Tasnimul Arabi Anik.

Contributions

Contributions and improvements are welcome! Feel free to submit a pull request or report issues.

