Standalone comprehensive genome metadata standardization and sequence download toolkit.

FetchM2

FetchM2 is a standalone command-line toolkit for genome metadata retrieval, comprehensive metadata standardization, audit reporting, and optional sequence download.

It keeps the simple standalone installation model of the original public FetchM, while packaging deterministic rule files and QA concepts developed in FetchM Web.

What FetchM2 Does

  • Reads NCBI Genome Datasets TSV/CSV exports.
  • Optionally fetches linked BioSample metadata from NCBI.
  • Standardizes host, country/geography, collection year, sample type, isolation source, isolation site, environment medium, host disease, and host health state.
  • Adds host TaxID, rank, lineage fields, match method, confidence, and review status.
  • Writes clean metadata tables and audit reports.
  • Downloads genome FASTA files from NCBI with flexible filters.
  • Runs offline on already annotated tables for reproducible tests and local standardization.

Installation

Recommended clean environment:

python -m venv fetchm2-env
source fetchm2-env/bin/activate
pip install fetchm2

For development from source:

git clone https://github.com/Tasnimul-Arabi-Anik/FetchM2.git
cd FetchM2
python -m pip install -e ".[dev]"
pytest

FetchM2 requires only Python dependencies. taxonkit is optional: if it is available, FetchM2 can use it to add lineage fields for less common host TaxIDs. Lineages for common hosts are bundled.

Quick Start

Offline smoke test using the bundled example:

fetchm2 metadata --input examples/offline_metadata.tsv --outdir demo_out --offline

Full BioSample metadata retrieval:

fetchm2 metadata --input ncbi_dataset.tsv --outdir results

With an NCBI API key:

export NCBI_API_KEY=YOUR_NCBI_API_KEY
fetchm2 metadata --input ncbi_dataset.tsv --outdir results --workers 6 --sleep 0.15

All-in-one metadata plus sequence download:

fetchm2 run --input ncbi_dataset.tsv --outdir results --download

Filtered sequence download from a clean table:

fetchm2 seq \
  --input results/metadata_output/fetchm2_clean.csv \
  --outdir results/sequence \
  --host "Homo sapiens" \
  --country Bangladesh \
  --year-from 2018 \
  --year-to 2024

Main Commands

fetchm2 metadata --help
fetchm2 run --help
fetchm2 seq --help
fetchm2 audit --help

Metadata Outputs

FetchM2 writes:

  • metadata_output/fetchm2_clean.csv
  • metadata_output/fetchm2_clean.tsv
  • metadata_output/fetchm2_report.md
  • audit/standardization_summary.csv
  • audit/top_host_review_needed.csv
  • audit/standardization_audit.md

Important standardized fields include:

  • Host_SD, Host_TaxID, Host_Rank, Host_Superkingdom, Host_Phylum, Host_Class, Host_Order, Host_Family, Host_Genus, Host_Species
  • Host_Common_Name, Host_Match_Method, Host_Confidence, Host_Review_Status
  • Sample_Type_SD, Sample_Type_SD_Broad
  • Isolation_Source_SD, Isolation_Source_SD_Broad
  • Isolation_Site_SD
  • Environment_Medium_SD, Environment_Medium_SD_Broad
  • Environment_Broad_Scale_SD, Environment_Local_Scale_SD
  • Host_Disease_SD, Host_Health_State_SD
  • Country, Continent, Subcontinent, Collection_Year
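After a run, the clean table can be summarized with the Python standard library. The sketch below tallies the values of one column, for example Host_Review_Status from the list above. The helper names `load_rows` and `count_column` are illustrative and not part of FetchM2, and the path in the usage note assumes a run with --outdir results.

```python
import csv
from collections import Counter


def load_rows(path):
    """Read a CSV table (such as fetchm2_clean.csv) into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def count_column(rows, column):
    """Tally the values of one column; missing or empty cells count as ''."""
    return Counter((row.get(column) or "").strip() for row in rows)
```

For example, `count_column(load_rows("results/metadata_output/fetchm2_clean.csv"), "Host_Review_Status").most_common()` lists each review status with its row count, which gives a quick view of how much manual review remains.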

Sequence Download Options

FetchM2 supports filtering by:

  • host
  • host rank
  • country
  • continent
  • subcontinent
  • sample type
  • isolation source
  • environment medium
  • collection year range
  • maximum genomes

Use --check-only to audit a sequence output directory without downloading.
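To preview which rows a combination of filters would keep before downloading anything, the clean table can be filtered directly in Python. This is a minimal sketch and not FetchM2's own selection logic: it assumes exact, case-sensitive matches on the Host_SD and Country columns and an inclusive range on Collection_Year, and the function names are hypothetical.

```python
import csv


def load_rows(path):
    """Read a FetchM2 clean CSV into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def parse_year(value):
    """Return the year as an int, or None if the cell is empty or not numeric."""
    try:
        return int(float(value))
    except (TypeError, ValueError, OverflowError):
        return None


def select_rows(rows, host=None, country=None, year_from=None, year_to=None, limit=None):
    """Keep rows that satisfy every filter that was supplied (None means 'any')."""
    selected = []
    for row in rows:
        if limit is not None and len(selected) >= limit:
            break
        if host is not None and row.get("Host_SD") != host:
            continue
        if country is not None and row.get("Country") != country:
            continue
        year = parse_year(row.get("Collection_Year"))
        # Rows without a usable year are dropped once a year bound is given.
        if year_from is not None and (year is None or year < year_from):
            continue
        if year_to is not None and (year is None or year > year_to):
            continue
        selected.append(row)
    return selected
```

Calling `select_rows(load_rows(path), host="Homo sapiens", country="Bangladesh", year_from=2018, year_to=2024)` mirrors the filtered `fetchm2 seq` example in Quick Start, under the matching assumptions stated above.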

API Keys

For NCBI, prefer environment variables:

export NCBI_API_KEY=YOUR_NCBI_API_KEY
export NCBI_EMAIL=you@example.com

Do not place API keys in scripts, notebooks, README files, or Git commits.
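A wrapper script can follow the same rule by reading the key from the environment at run time instead of embedding it. The sketch below is illustrative only: the variable names NCBI_API_KEY and NCBI_EMAIL come from the lines above, while the helper functions are hypothetical and say nothing about how FetchM2 reads these values internally.

```python
import os


def ncbi_credentials(environ=None):
    """Return (api_key, email) from the environment; either may be None."""
    env = os.environ if environ is None else environ
    api_key = env.get("NCBI_API_KEY") or None
    email = env.get("NCBI_EMAIL") or None
    return api_key, email


def mask(secret):
    """Hide all but the last four characters so logs never show the full key."""
    if not secret:
        return "(not set)"
    if len(secret) <= 4:
        return "*" * len(secret)
    return "*" * (len(secret) - 4) + secret[-4:]
```

Printing `mask(api_key)` rather than the key itself lets a script confirm that a key was picked up without writing it to logs or notebooks.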

Design Compared With FetchM and FetchM Web

FetchM2 uses the original FetchM standalone flow as the command-line baseline:

  • metadata
  • run
  • seq
  • SQLite cache
  • NCBI BioSample fetch
  • sequence download from NCBI FTP

FetchM2 adds FetchM Web-style standardized metadata fields and deterministic rule files:

  • host synonyms and negative host rules
  • controlled source/sample/environment categories
  • approved broad vocabulary
  • production-style audit gate
  • richer sequence filtering on standardized fields

FetchM2 intentionally does not use embeddings or AI for production mappings. Embeddings can be used later as a review assistant, but final production rules should remain deterministic and auditable.
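To make "deterministic and auditable" concrete, the following sketch shows one way a synonym table and a negative rule list can drive host standardization. It is an illustration under stated assumptions, not FetchM2's implementation: the rule tables are invented for the example, negative rules are interpreted here as values that must never be mapped to a host, and the method and status labels are hypothetical. Only the output field names are taken from the fields listed earlier.

```python
# Hypothetical rule tables; FetchM2's bundled rule files are not reproduced here.
HOST_SYNONYMS = {
    "human": ("Homo sapiens", 9606),
    "homo sapiens": ("Homo sapiens", 9606),
    "cattle": ("Bos taurus", 9913),
    "chicken": ("Gallus gallus", 9031),
}
NEGATIVE_HOSTS = {"", "missing", "not applicable", "not collected", "unknown"}


def standardize_host(raw):
    """Map a raw host string to standardized fields using fixed lookups only."""
    # Normalize case and whitespace so equal inputs always give equal outputs.
    key = " ".join((raw or "").lower().split())
    if key in NEGATIVE_HOSTS:
        # Assumed label values; the real vocabulary may differ.
        return {"Host_SD": "", "Host_TaxID": None,
                "Host_Match_Method": "negative_rule",
                "Host_Review_Status": "not_a_host"}
    if key in HOST_SYNONYMS:
        name, taxid = HOST_SYNONYMS[key]
        return {"Host_SD": name, "Host_TaxID": taxid,
                "Host_Match_Method": "synonym",
                "Host_Review_Status": "accepted"}
    # Anything not covered by a rule is flagged for review, never guessed.
    return {"Host_SD": "", "Host_TaxID": None,
            "Host_Match_Method": "unmatched",
            "Host_Review_Status": "review_needed"}
```

Because every result traces back to a single table entry, the same input always produces the same output, and each mapping can be reviewed by reading the rule that produced it.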

Testing

Run:

pytest
python -m build
python -m pip install dist/fetchm2-*.whl
fetchm2 metadata --input examples/offline_metadata.tsv --outdir smoke_out --offline
fetchm2 seq --input smoke_out/metadata_output/fetchm2_clean.csv --outdir smoke_seq --country Bangladesh --check-only

License

MIT License.
