Standalone comprehensive genome metadata standardization and sequence download toolkit.
FetchM2
FetchM2 is a standalone command-line toolkit for genome metadata retrieval, comprehensive metadata standardization, audit reporting, and optional sequence download.
It keeps the simple standalone installation model of the original public FetchM, while packaging deterministic rule files and QA concepts developed in FetchM Web.
What FetchM2 Does
- Reads NCBI Genome Datasets TSV/CSV exports.
- Optionally fetches linked BioSample metadata from NCBI.
- Standardizes host, country/geography, collection year, sample type, isolation source, isolation site, environment medium, host disease, and host health state.
- Adds host TaxID, rank, lineage fields, match method, confidence, and review status.
- Writes clean metadata tables and audit reports.
- Downloads genome FASTA files from NCBI with flexible filters.
- Runs offline on already-annotated tables for reproducible tests and local standardization.
Installation
Recommended clean environment:
python -m venv fetchm2-env
source fetchm2-env/bin/activate
pip install fetchm2
For development from source:
git clone https://github.com/Tasnimul-Arabi-Anik/FetchM2.git
cd FetchM2
python -m pip install -e ".[dev]"
pytest
FetchM2 requires only Python dependencies. The external tool taxonkit is optional: if it is available, FetchM2 can use it to add lineage fields for less common host TaxIDs, while lineages for common hosts are bundled.
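To see in advance whether the optional taxonkit enrichment could apply on a given machine, a small check like the one below may help. It is only a sketch: it assumes the executable is named `taxonkit` and is discoverable on `PATH`, and the helper names are hypothetical. How FetchM2 itself locates taxonkit is not documented here.

```python
import shutil
from typing import Optional


def find_taxonkit() -> Optional[str]:
    """Return the full path to a `taxonkit` executable on PATH, or None."""
    return shutil.which("taxonkit")


def describe_lineage_mode(taxonkit_path: Optional[str]) -> str:
    """Summarize which lineage source is available (illustrative wording only)."""
    if taxonkit_path:
        return f"taxonkit found at {taxonkit_path}; extra lineage enrichment is possible"
    return "taxonkit not found; only the bundled common host lineages are available"


if __name__ == "__main__":
    print(describe_lineage_mode(find_taxonkit()))
```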
Quick Start
Offline smoke test using the bundled example:
fetchm2 metadata --input examples/offline_metadata.tsv --outdir demo_out --offline
Full BioSample metadata retrieval:
fetchm2 metadata --input ncbi_dataset.tsv --outdir results
With an NCBI API key:
export NCBI_API_KEY=YOUR_NCBI_API_KEY
fetchm2 metadata --input ncbi_dataset.tsv --outdir results --workers 6 --sleep 0.15
All-in-one metadata plus sequence download:
fetchm2 run --input ncbi_dataset.tsv --outdir results --download
Filtered sequence download from a clean table:
fetchm2 seq \
--input results/metadata_output/fetchm2_clean.csv \
--outdir results/sequence \
--host "Homo sapiens" \
--country Bangladesh \
--year-from 2018 \
--year-to 2024
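Before starting a download, it can be useful to preview how many rows of the clean table would pass the same filters. The sketch below does this with the standard library only. It assumes that `--host` compares against `Host_SD`, `--country` against `Country`, and the year flags against `Collection_Year` as an inclusive range, with case-insensitive exact matching. These semantics are assumptions for illustration and may differ from what `fetchm2 seq` actually does.

```python
import csv
from typing import Dict, Iterable, List, Optional


def parse_year(text: Optional[str]) -> Optional[int]:
    """Convert a year cell such as '2019' or '2019.0' to int; return None if unusable."""
    try:
        return int(float((text or "").strip()))
    except (ValueError, OverflowError):
        return None


def preview_matches(
    rows: Iterable[Dict[str, str]],
    host: Optional[str] = None,
    country: Optional[str] = None,
    year_from: Optional[int] = None,
    year_to: Optional[int] = None,
) -> List[Dict[str, str]]:
    """Return the rows that pass every supplied filter (assumed semantics)."""
    matches = []
    for row in rows:
        if host is not None and (row.get("Host_SD") or "").strip().lower() != host.lower():
            continue
        if country is not None and (row.get("Country") or "").strip().lower() != country.lower():
            continue
        if year_from is not None or year_to is not None:
            year = parse_year(row.get("Collection_Year"))
            # A row without a usable year cannot satisfy a year filter.
            if year is None:
                continue
            if year_from is not None and year < year_from:
                continue
            if year_to is not None and year > year_to:
                continue
        matches.append(row)
    return matches


def load_clean_table(path: str) -> List[Dict[str, str]]:
    """Read a clean CSV table into a list of dictionaries keyed by column name."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

For example, `preview_matches(load_clean_table("results/metadata_output/fetchm2_clean.csv"), host="Homo sapiens", country="Bangladesh", year_from=2018, year_to=2024)` returns the candidate rows without contacting NCBI.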
Main Commands
fetchm2 metadata --help
fetchm2 run --help
fetchm2 seq --help
fetchm2 audit --help
Metadata Outputs
FetchM2 writes:
- metadata_output/fetchm2_clean.csv
- metadata_output/fetchm2_clean.tsv
- metadata_output/fetchm2_report.md
- audit/standardization_summary.csv
- audit/top_host_review_needed.csv
- audit/standardization_audit.md
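After a run, the presence of these files can be confirmed with a short script. This is a minimal sketch that assumes every listed path is relative to the directory passed as `--outdir`; the function name is hypothetical and not part of FetchM2.

```python
from pathlib import Path
from typing import List

# Output paths as listed in this README, assumed to be relative to --outdir.
EXPECTED_OUTPUTS = [
    "metadata_output/fetchm2_clean.csv",
    "metadata_output/fetchm2_clean.tsv",
    "metadata_output/fetchm2_report.md",
    "audit/standardization_summary.csv",
    "audit/top_host_review_needed.csv",
    "audit/standardization_audit.md",
]


def missing_outputs(outdir: str) -> List[str]:
    """Return the expected output paths that do not exist as files under outdir."""
    root = Path(outdir)
    return [rel for rel in EXPECTED_OUTPUTS if not (root / rel).is_file()]
```

Calling `missing_outputs("demo_out")` after the offline smoke test should return an empty list if the assumption about the directory layout holds.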
Important standardized fields include:
- Host_SD, Host_TaxID, Host_Rank, Host_Superkingdom, Host_Phylum, Host_Class, Host_Order, Host_Family, Host_Genus, Host_Species
- Host_Common_Name, Host_Match_Method, Host_Confidence, Host_Review_Status
- Sample_Type_SD, Sample_Type_SD_Broad
- Isolation_Source_SD, Isolation_Source_SD_Broad
- Isolation_Site_SD
- Environment_Medium_SD, Environment_Medium_SD_Broad
- Environment_Broad_Scale_SD, Environment_Local_Scale_SD
- Host_Disease_SD, Host_Health_State_SD
- Country, Continent, Subcontinent, Collection_Year
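One way to use these fields is to tally how many rows carry each host review status before deciding what needs manual attention. The sketch below relies only on the `Host_Review_Status` column name listed above; the set of values that column can take is not specified here, so the code counts whatever it finds. The function names are hypothetical.

```python
import csv
from collections import Counter
from typing import Dict, Iterable


def count_review_status(rows: Iterable[Dict[str, str]]) -> Counter:
    """Count rows per Host_Review_Status value; empty cells are grouped as '(blank)'."""
    counts: Counter = Counter()
    for row in rows:
        status = (row.get("Host_Review_Status") or "").strip()
        counts[status or "(blank)"] += 1
    return counts


def count_review_status_in_file(path: str) -> Counter:
    """Convenience wrapper that reads a clean CSV table from disk."""
    with open(path, newline="", encoding="utf-8") as handle:
        return count_review_status(csv.DictReader(handle))
```

`count_review_status_in_file("results/metadata_output/fetchm2_clean.csv").most_common()` then lists the statuses from most to least frequent.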
Sequence Download Options
FetchM2 supports filtering by:
- host
- host rank
- country
- continent
- subcontinent
- sample type
- isolation source
- environment medium
- collection year range
- maximum genomes
Use --check-only to audit a sequence output directory without downloading.
API Keys
For NCBI, prefer environment variables:
export NCBI_API_KEY=YOUR_NCBI_API_KEY
export NCBI_EMAIL=you@example.com
Do not place API keys in scripts, notebooks, README files, or Git commits.
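Helper scripts written around FetchM2 can follow the same practice by reading the two variables from the environment instead of hard-coding them. The sketch below shows one way to do that; the function names are hypothetical, and it does not describe how FetchM2 reads these variables internally.

```python
import os
from typing import Dict, Mapping, Optional


def read_ncbi_credentials(env: Optional[Mapping[str, str]] = None) -> Dict[str, Optional[str]]:
    """Read NCBI_API_KEY and NCBI_EMAIL, treating empty or blank values as unset."""
    source = os.environ if env is None else env
    api_key = (source.get("NCBI_API_KEY") or "").strip() or None
    email = (source.get("NCBI_EMAIL") or "").strip() or None
    return {"api_key": api_key, "email": email}


def describe_key(api_key: Optional[str]) -> str:
    """Return a log-safe description that never includes the key itself."""
    return "(set)" if api_key else "(not set)"


if __name__ == "__main__":
    creds = read_ncbi_credentials()
    print("NCBI_API_KEY:", describe_key(creds["api_key"]))
    print("NCBI_EMAIL:", creds["email"] or "(not set)")
```

Printing only whether the key is set, rather than the key, keeps it out of terminal logs and notebook output.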
Design Compared With FetchM and FetchM Web
FetchM2 uses the original FetchM standalone flow as the command-line baseline:
- metadata
- run
- seq
- SQLite cache
- NCBI BioSample fetch
- sequence download from NCBI FTP
FetchM2 adds FetchM Web-style standardized metadata fields and deterministic rule files:
- host synonyms and negative host rules
- controlled source/sample/environment categories
- approved broad vocabulary
- production-style audit gate
- richer sequence filtering on standardized fields
FetchM2 intentionally does not use embeddings or AI for production mappings. Embeddings can be used later as a review assistant, but final production rules should remain deterministic and auditable.
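To illustrate what a deterministic, auditable mapping can look like, the sketch below resolves a raw host string using a synonym table and a set of negative rules. Every table entry, function name, and method label in it is hypothetical and chosen for illustration; FetchM2's bundled rule files and their format are not shown in this README.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical rule tables; the real FetchM2 rule files are not reproduced here.
HOST_SYNONYMS: Dict[str, str] = {
    "human": "Homo sapiens",
    "homo sapiens": "Homo sapiens",
    "cattle": "Bos taurus",
    "cow": "Bos taurus",
}
NEGATIVE_HOST_VALUES: Set[str] = {"missing", "not applicable", "unknown"}


def normalize(raw: str) -> str:
    """Lowercase the value and collapse runs of whitespace to single spaces."""
    return " ".join(raw.strip().lower().split())


def map_host(raw: str) -> Tuple[Optional[str], str]:
    """Return (standardized host or None, label of the rule that decided)."""
    key = normalize(raw)
    # Blank values and known non-host placeholders are rejected explicitly.
    if not key or key in NEGATIVE_HOST_VALUES:
        return None, "negative_rule"
    if key in HOST_SYNONYMS:
        return HOST_SYNONYMS[key], "synonym"
    # Anything unrecognized is flagged for review instead of being guessed.
    return None, "review_needed"
```

Because the same input always follows the same rule, each mapping can be traced back to a specific table entry during an audit.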
Testing
Run:
pytest
python -m build
python -m pip install dist/fetchm2-*.whl
fetchm2 metadata --input examples/offline_metadata.tsv --outdir smoke_out --offline
fetchm2 seq --input smoke_out/metadata_output/fetchm2_clean.csv --outdir smoke_seq --country Bangladesh --check-only
License
MIT License.
File details
Details for the file fetchm2-0.1.0.tar.gz.
File metadata
- Download URL: fetchm2-0.1.0.tar.gz
- Upload date:
- Size: 370.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9efa1c20f0a581ef06828bd330c7242c64f771c024fa38d71237c7a6f01fbfbf |
| MD5 | d838f2b912038143a4576a482e9fdb6f |
| BLAKE2b-256 | 7b41d8138da206e4a423d1d3c252ff7ebcdffcf61292b6771b9f536fd562d0ab |
File details
Details for the file fetchm2-0.1.0-py3-none-any.whl.
File metadata
- Download URL: fetchm2-0.1.0-py3-none-any.whl
- Upload date:
- Size: 362.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7acdea10007557aa03895bfaea5fdee483af5def03fd326af4cfcbb295c1348b |
| MD5 | 86b89f477d4a04f4983fe9f8f4ad98e6 |
| BLAKE2b-256 | c4fd44ae7f5e9cbd4702b51ce79bc832edaa731a7b99e4276829002f4f893eef |