Standalone comprehensive genome metadata standardization and sequence download toolkit.

FetchM2

FetchM2 is a comprehensive standalone command-line toolkit for bacterial genome metadata analysis, metadata standardization, audit reporting, and optional genome sequence download.

FetchM2 is designed as the updated successor to the original FetchM standalone tool. It keeps the same practical command-line workflow, but adds many more standardized metadata fields, richer filtering, packaged curation rules, audit outputs, and reproducible test data.

Recommended one-command workflow:

fetchm2 run --input ncbi_dataset.tsv --outdir results --download

Key Features

  • Standalone command-line tool installable with pip or a conda environment.
  • Reads NCBI Genome Datasets TSV/CSV exports.
  • Optionally fetches linked BioSample metadata from NCBI with retry, cache, and fallback lookup support.
  • Supports offline analysis when metadata columns are already present.
  • Applies packaged deterministic standardization rules.
  • Writes clean CSV and TSV metadata outputs.
  • Generates metadata analysis tables and figures automatically.
  • Produces audit summaries and review queues.
  • Downloads genome FASTA files from NCBI.
  • Supports flexible sequence-download filtering by standardized metadata.
  • Includes test.tsv, matching the public FetchM-style test dataset layout.
  • Includes examples/offline_metadata.tsv for fast local smoke testing.

Installation

Option 1: pip

python -m venv fetchm2-env
source fetchm2-env/bin/activate
pip install fetchm2

Verify:

fetchm2 --version

Option 2: conda / mamba environment

Clone the repository and create the environment:

git clone https://github.com/Tasnimul-Arabi-Anik/FetchM2.git
cd FetchM2
mamba env create -f environment.yml
conda activate fetchm2

If you use conda instead of mamba:

conda env create -f environment.yml
conda activate fetchm2

The conda environment includes taxonkit, which can improve host lineage enrichment for less common TaxIDs. FetchM2 still works without taxonkit; common host lineages are bundled.

Option 3: developer install

git clone https://github.com/Tasnimul-Arabi-Anik/FetchM2.git
cd FetchM2
python -m pip install -e ".[dev]"
pytest

Quick Start

Run the bundled standalone smoke test:

fetchm2 metadata --input examples/offline_metadata.tsv --outdir demo_out --offline

Run the FetchM-style test dataset:

fetchm2 metadata --input test.tsv --outdir test_out --offline

test.tsv contains assembly-level NCBI dataset columns and BioSample accessions. In offline mode, FetchM2 analyzes assembly statistics and any metadata already present in the table. To populate host, source, sample, environment, and geography from NCBI BioSample records, run without --offline.

Run metadata retrieval with BioSample enrichment:

fetchm2 metadata --input test.tsv --outdir test_out_live --workers 3 --sleep 0.4

Use an NCBI API key for larger jobs:

export NCBI_API_KEY=YOUR_NCBI_API_KEY
export NCBI_EMAIL=you@example.com
fetchm2 metadata --input ncbi_dataset.tsv --outdir results --workers 6 --sleep 0.15

Run metadata standardization and sequence download in one command:

fetchm2 run --input ncbi_dataset.tsv --outdir results --download

Typical Species/Genus Workflow

  1. Download an NCBI Genome Datasets TSV or CSV for your target species or genus.
  2. Run FetchM2:
fetchm2 run --input ncbi_dataset.tsv --outdir results --download
  3. Review the main outputs:
  • results/metadata_output/fetchm2_clean.csv
  • results/metadata_analysis/metadata_analysis_report.md
  • results/audit/standardization_audit.md
  • results/audit/production_readiness_gate.md
  • results/sequence/
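
As a first review step, the clean table can be summarized with a few lines of Python. The sketch below is only an illustration, not part of FetchM2: `field_coverage` is a hypothetical helper, the sample rows are made up, and it assumes that a missing value is written as an empty cell. The column names Host_SD, Country, and Collection_Year come from the standardized field lists in this README.

```python
import csv
import io

# Standardized column names taken from the field lists in this README.
FIELDS = ["Host_SD", "Country", "Collection_Year"]


def field_coverage(handle, fields=FIELDS):
    """Return {field: fraction of rows with a non-empty value}."""
    reader = csv.DictReader(handle)
    filled = {name: 0 for name in fields}
    total = 0
    for row in reader:
        total += 1
        for name in fields:
            # Assumption: a missing value is an empty (or whitespace-only) cell.
            if (row.get(name) or "").strip():
                filled[name] += 1
    return {name: (filled[name] / total if total else 0.0) for name in fields}


# Tiny in-memory stand-in for results/metadata_output/fetchm2_clean.csv (made-up rows).
sample = io.StringIO(
    "Host_SD,Country,Collection_Year\n"
    "Homo sapiens,Bangladesh,2020\n"
    ",Bangladesh,\n"
)
coverage = field_coverage(sample)
print(coverage)
```

To run it on a real result, pass a file opened with `open("results/metadata_output/fetchm2_clean.csv", newline="")` instead of the in-memory sample.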

For large NCBI retrieval jobs without an API key, use a conservative request delay:

fetchm2 run --input ncbi_dataset.tsv --outdir results --download --workers 3 --sleep 0.4

Metadata Retrieval Workflow

FetchM2 can work in two modes.

Offline mode:

  • Uses metadata columns already present in the input table.
  • Applies standardization rules.
  • Generates audit and metadata analysis outputs.
  • Does not contact NCBI.

Live BioSample mode:

  • Reads BioSample accessions from NCBI dataset exports.
  • Retrieves BioSample records through NCBI E-utilities.
  • Uses direct BioSample XML first, then an esummary fallback when the direct record lacks usable attributes.
  • Tracks raw BioSample attribute names and matched standardized attribute names.
  • Uses a local SQLite cache so repeated runs do not refetch the same BioSample records.
  • Uses request throttling, retry, and backoff behavior for temporary NCBI rate-limit or server errors.
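
The retry and backoff behavior in the last bullet can be pictured with a generic exponential-backoff loop. This is a minimal sketch of the general technique, not FetchM2's actual implementation: `TransientError`, `fetch_with_retry`, and the default delays are hypothetical names and values chosen for illustration.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for a retryable failure (for example HTTP 429 or 5xx)."""


def fetch_with_retry(fetch, retries=3, base_delay=0.5, sleep=time.sleep):
    """Call fetch() until it succeeds or the retry budget is spent.

    Waits base_delay, then 2x, then 4x, ... between attempts.
    """
    for attempt in range(retries + 1):
        try:
            return fetch()
        except TransientError:
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * (2 ** attempt))


# Demo with a fake fetcher that fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("temporary failure")
    return "<BioSample/>"


waits = []  # record the requested delays instead of really sleeping
result = fetch_with_retry(flaky, sleep=waits.append)
print(result, waits)
```

Passing the `sleep` function as a parameter keeps the demo instant and makes the delay schedule easy to inspect.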

Important output columns from retrieval include:

  • BioSample
  • BioSample Taxonomy Name
  • Metadata Fetch Status
  • Metadata Fetch Reason
  • Metadata Fetch Error
  • Metadata Raw Attribute Names
  • Metadata Matched Attribute Names

FetchM2 currently recognizes common BioSample attribute aliases for host, source, sample type, isolation site, collection date, geography, environmental medium, environmental broad and local scale, host disease, and host health state.
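
Alias recognition of this kind is essentially a lookup from raw attribute names to standardized keys, while keeping track of which raw names matched. The sketch below assumes a small hypothetical alias table; the real alias list in FetchM2 is not reproduced here, and `match_attributes` is an illustrative helper rather than the tool's API.

```python
# Hypothetical alias table: normalized raw attribute name -> standardized key.
ALIASES = {
    "host": "host",
    "isolation_source": "isolation_source",
    "collection_date": "collection_date",
    "geo_loc_name": "geography",
    "country": "geography",
    "env_medium": "environment_medium",
    "env_broad_scale": "environment_broad_scale",
    "env_local_scale": "environment_local_scale",
    "host_disease": "host_disease",
    "host_health_state": "host_health_state",
}


def normalize_name(name):
    """Lowercase the name and unify spaces and hyphens to underscores."""
    return name.strip().lower().replace(" ", "_").replace("-", "_")


def match_attributes(raw):
    """Return (matched values, raw attribute names, matched attribute names)."""
    matched = {}
    raw_names, matched_names = [], []
    for name, value in raw.items():
        raw_names.append(name)
        key = ALIASES.get(normalize_name(name))
        if key and key not in matched:  # first alias wins
            matched[key] = value
            matched_names.append(name)
    return matched, raw_names, matched_names


# Made-up BioSample attributes for illustration.
record = {"Host": "Homo sapiens", "geo_loc_name": "Bangladesh: Dhaka", "strain": "K-12"}
matched, raw_names, matched_names = match_attributes(record)
print(matched, raw_names, matched_names)
```

The two name lists play the role of the Metadata Raw Attribute Names and Metadata Matched Attribute Names columns described above.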

Main Commands

fetchm2 metadata --help
fetchm2 run --help
fetchm2 seq --help
fetchm2 audit --help
fetchm2 validate --help
fetchm2 analyze --help

fetchm2 metadata

Reads an NCBI dataset TSV/CSV, optionally fetches BioSample metadata, standardizes fields, and writes clean outputs.

Example:

fetchm2 metadata \
  --input ncbi_dataset.tsv \
  --outdir results \
  --ani OK \
  --checkm 95 \
  --workers 6

Common options:

  • --input: NCBI dataset TSV/CSV.
  • --outdir: output directory.
  • --ani: filter by ANI Check status, for example OK.
  • --checkm: minimum CheckM completeness.
  • --api-key: NCBI API key; can also use NCBI_API_KEY.
  • --email: NCBI email; can also use NCBI_EMAIL.
  • --workers: BioSample fetch worker count.
  • --sleep: shared delay between NCBI requests. Use a longer delay such as 0.4 to 0.5 for larger jobs without an API key.
  • --offline: skip NCBI fetching and standardize existing columns only.
  • --no-analysis: skip automatic metadata_analysis/ table and figure generation.
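
The effect of --ani and --checkm can be reasoned about as a row filter. The sketch below is an assumption-laden illustration, not FetchM2's code: the column headers "ANI Check status" and "CheckM completeness" are assumed to match the input export, `passes_quality` is a hypothetical helper, and dropping rows with a missing completeness value is a design choice of this sketch.

```python
def passes_quality(row, ani=None, checkm=None,
                   ani_col="ANI Check status",        # assumed column header
                   checkm_col="CheckM completeness"):  # assumed column header
    """Return True if the row satisfies the optional ANI and CheckM filters."""
    if ani is not None:
        status = (row.get(ani_col) or "").strip()
        if status.upper() != ani.upper():
            return False
    if checkm is not None:
        try:
            completeness = float(row.get(checkm_col) or "")
        except ValueError:
            return False  # missing or non-numeric completeness fails the threshold
        if completeness < checkm:
            return False
    return True


# Made-up rows for illustration.
rows = [
    {"ANI Check status": "OK", "CheckM completeness": "98.5"},
    {"ANI Check status": "Failed", "CheckM completeness": "99.0"},
    {"ANI Check status": "OK", "CheckM completeness": ""},
]
kept = [row for row in rows if passes_quality(row, ani="OK", checkm=95)]
print(len(kept))
```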

fetchm2 run

Runs metadata analysis and, if requested, sequence download.

fetchm2 run \
  --input ncbi_dataset.tsv \
  --outdir results \
  --ani OK \
  --checkm 95 \
  --download \
  --country Bangladesh \
  --host "Homo sapiens" \
  --year-from 2018 \
  --year-to 2024

fetchm2 seq

Downloads genome FASTA files using the standardized clean metadata table.

fetchm2 seq \
  --input results/metadata_output/fetchm2_clean.csv \
  --outdir results/sequence \
  --host "Homo sapiens" \
  --country Bangladesh \
  --year-from 2018 \
  --year-to 2024

Check expected sequences without downloading:

fetchm2 seq \
  --input results/metadata_output/fetchm2_clean.csv \
  --outdir results/sequence \
  --country Bangladesh \
  --check-only

fetchm2 audit

Audits an existing standardized output:

fetchm2 audit \
  --input results/metadata_output/fetchm2_clean.csv \
  --outdir results/audit_rerun

fetchm2 validate

Runs the same production-readiness checks as audit under a command name that makes the validation step explicit:

fetchm2 validate \
  --input results/metadata_output/fetchm2_clean.csv \
  --outdir results/validation

fetchm2 analyze

Generates metadata analysis outputs from any existing clean metadata CSV.

fetchm2 analyze \
  --input results/metadata_output/fetchm2_clean.csv \
  --outdir results/metadata_analysis_rerun \
  --top-n 30

Metadata Outputs

FetchM2 writes:

  • metadata_output/fetchm2_clean.csv
  • metadata_output/fetchm2_clean.tsv
  • metadata_output/fetchm2_report.md
  • audit/standardization_summary.csv
  • audit/top_host_review_needed.csv
  • audit/standardization_audit.md
  • metadata_analysis/metadata_analysis_report.md
  • metadata_analysis/tables/field_coverage_summary.csv
  • metadata_analysis/tables/top_values_by_field.csv
  • metadata_analysis/tables/numeric_summary.csv
  • metadata_analysis/figures/*.png

Typical output structure:

results/
├── metadata_output/
│   ├── fetchm2_clean.csv
│   ├── fetchm2_clean.tsv
│   └── fetchm2_report.md
├── metadata_analysis/
│   ├── metadata_analysis_report.md
│   ├── tables/
│   └── figures/
├── audit/
│   ├── standardization_summary.csv
│   ├── standardization_audit.md
│   ├── production_readiness_gate.md
│   ├── production_readiness_gate.json
│   ├── top_host_review_needed.csv
│   ├── non_country_values_in_country.csv
│   ├── country_continent_mismatch.csv
│   ├── country_subcontinent_mismatch.csv
│   ├── invalid_collection_years.csv
│   ├── invalid_host_like_sample_type.csv
│   ├── source_like_mapped_hosts.csv
│   ├── source_like_unmapped_hosts_for_review.csv
│   ├── broad_vocabulary_leakage.csv
│   ├── sequence_readiness.csv
│   └── rule_count_summary.csv
└── sequence/
    ├── *.fna
    ├── failed_accessions.txt
    ├── sequence_download_summary.csv
    └── fetchm2_sequence_cache.sqlite3

Standardized Metadata Fields

FetchM2 keeps the original input columns and adds standardized fields.

Host Standardization

The original FetchM provided host-oriented metadata summaries. FetchM2 expands these into detailed host standardization fields:

  • Host_Original
  • Host_Cleaned
  • Host_SD
  • Host_TaxID
  • Host_Rank
  • Host_Superkingdom
  • Host_Phylum
  • Host_Class
  • Host_Order
  • Host_Family
  • Host_Genus
  • Host_Species
  • Host_Common_Name
  • Host_Context_SD
  • Host_Match_Method
  • Host_Confidence
  • Host_Review_Status

Examples:

  • human, human blood, and misspelled variants such as Homosapines can map to Homo sapiens, TaxID 9606.
  • cattle feces can map to Bos taurus, TaxID 9913, while also preserving feces/stool as sample metadata.
  • bacteria culture, DH5a, lab strain terms, missing values, and source/material terms are blocked from becoming host values.
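
The examples above combine three ideas: a synonym lookup, a negative list that blocks non-host terms, and separation of sample words from host words. The sketch below illustrates that combination with a tiny hypothetical rule set; the packaged rules in host_synonyms.csv and host_negative_rules.csv are far larger, and `standardize_host` is not FetchM2's actual function.

```python
# Hypothetical miniature rule set, for illustration only.
HOST_SYNONYMS = {
    "human": ("Homo sapiens", 9606),
    "homo sapiens": ("Homo sapiens", 9606),
    "homosapines": ("Homo sapiens", 9606),  # common misspelling
    "cattle": ("Bos taurus", 9913),
}
NEGATIVE_TERMS = {"bacteria culture", "dh5a", "lab strain", "missing", ""}
SAMPLE_TERMS = {"blood": "blood", "feces": "feces/stool", "stool": "feces/stool"}


def standardize_host(raw):
    """Map a raw host string to standardized host and sample values."""
    text = " ".join(raw.lower().split())
    if text in NEGATIVE_TERMS:
        # Blocked terms never become host values.
        return {"Host_SD": None, "Host_TaxID": None, "Sample_Type_SD": None}
    words = text.split()
    # Keep sample material as separate metadata instead of discarding it.
    sample = next((SAMPLE_TERMS[w] for w in words if w in SAMPLE_TERMS), None)
    host_words = [w for w in words if w not in SAMPLE_TERMS]
    hit = HOST_SYNONYMS.get(" ".join(host_words))
    name, taxid = hit if hit else (None, None)
    return {"Host_SD": name, "Host_TaxID": taxid, "Sample_Type_SD": sample}


print(standardize_host("human blood"))
print(standardize_host("cattle feces"))
print(standardize_host("DH5a"))
```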

Source, Sample, and Environment

FetchM2 standardizes source/sample/environment fields into:

  • Sample_Type_SD
  • Sample_Type_SD_Broad
  • Isolation_Source_SD
  • Isolation_Source_SD_Broad
  • Isolation_Site_SD
  • Environment_Medium_SD
  • Environment_Medium_SD_Broad
  • Environment_Broad_Scale_SD
  • Environment_Local_Scale_SD

Examples:

  • blood -> Sample_Type_SD=blood
  • urine -> Sample_Type_SD=urine
  • feces, faeces, stool -> Sample_Type_SD=feces/stool
  • soil -> Environment_Medium_SD=soil
  • sediment -> Environment_Medium_SD=sediment
  • wastewater, sewage -> Environment_Medium_SD=wastewater/sewage
  • hospital surface -> healthcare/source context
  • rectal swab -> sample type plus anatomical site when available
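
The single-term mappings above amount to a table from raw value to a (field, standardized value) pair. A minimal sketch, assuming exact matching on the lowercased term; `RULES` copies only the examples listed in this section and `standardize_source` is a hypothetical helper, not the packaged rule engine.

```python
# Rules copied from the examples in this section: raw term -> (field, value).
RULES = {
    "blood": ("Sample_Type_SD", "blood"),
    "urine": ("Sample_Type_SD", "urine"),
    "feces": ("Sample_Type_SD", "feces/stool"),
    "faeces": ("Sample_Type_SD", "feces/stool"),
    "stool": ("Sample_Type_SD", "feces/stool"),
    "soil": ("Environment_Medium_SD", "soil"),
    "sediment": ("Environment_Medium_SD", "sediment"),
    "wastewater": ("Environment_Medium_SD", "wastewater/sewage"),
    "sewage": ("Environment_Medium_SD", "wastewater/sewage"),
}


def standardize_source(raw):
    """Return {field: value} for a known term, or {} when no rule applies."""
    rule = RULES.get(raw.strip().lower())
    return {rule[0]: rule[1]} if rule else {}


print(standardize_source("Faeces"))
print(standardize_source("sewage"))
print(standardize_source("unknown material"))
```

Returning an empty dict for unknown terms leaves the value unassigned rather than guessing.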

Geography and Date

FetchM2 standardizes:

  • Country
  • Continent
  • Subcontinent
  • Collection_Year

It also blocks common false positives such as:

  • Hospital as country
  • ground turkey as Turkey
  • Guinea pig as Guinea
  • Norway rat as Norway
  • Aspergillus niger as Niger
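
One simple way to avoid such false positives is to remove known misleading phrases before looking for country names. The sketch below shows that idea with a hypothetical five-country table and a block list built from the examples above; it is not the logic of geography_reviewed_rules.csv or country_mapping.json.

```python
import re

# Hypothetical miniature country table and block list, for illustration only.
COUNTRIES = {
    "turkey": "Turkey",
    "guinea": "Guinea",
    "norway": "Norway",
    "niger": "Niger",
    "bangladesh": "Bangladesh",
}
BLOCKED_PHRASES = ["ground turkey", "guinea pig", "norway rat", "aspergillus niger"]


def detect_country(raw):
    """Return a country name found in raw, ignoring blocked phrases."""
    text = " ".join(raw.lower().split())
    for phrase in BLOCKED_PHRASES:
        text = text.replace(phrase, " ")  # erase the misleading phrase first
    for word in re.findall(r"[a-z]+", text):
        if word in COUNTRIES:
            return COUNTRIES[word]
    return None  # values such as "Hospital" match no country at all


print(detect_country("ground turkey"))
print(detect_country("Turkey"))
print(detect_country("Bangladesh: Dhaka"))
```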

Disease and Health State

FetchM2 includes:

  • Host_Disease_SD
  • Host_Health_State_SD

These are conservative deterministic fields. Disease words are not treated as sample material unless an actual specimen is present.

Sequence Download Features

FetchM2 downloads genome FASTA files from the NCBI genomes FTP structure using Assembly Accession and Assembly Name.
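
For orientation, the publicly documented NCBI genomes layout places each assembly under a path built from the accession prefix, the nine accession digits split into groups of three, and a directory named accession_name. The sketch below builds such a URL; `genome_fasta_url` is a hypothetical helper, the handling of unusual characters in assembly names is an assumption, and FetchM2's own path logic may differ.

```python
import re

BASE = "https://ftp.ncbi.nlm.nih.gov/genomes/all"


def genome_fasta_url(accession, assembly_name):
    """Build the expected genomic FASTA URL for one assembly."""
    match = re.fullmatch(r"(GC[AF])_(\d{9})\.\d+", accession)
    if not match:
        raise ValueError(f"unexpected assembly accession: {accession!r}")
    prefix, digits = match.groups()
    # Assumption: spaces and other unusual characters become underscores
    # in the directory and file names.
    safe_name = re.sub(r"[^A-Za-z0-9._-]", "_", assembly_name.strip())
    stem = f"{accession}_{safe_name}"
    parts = [digits[i:i + 3] for i in range(0, 9, 3)]
    return f"{BASE}/{prefix}/{'/'.join(parts)}/{stem}/{stem}_genomic.fna.gz"


url = genome_fasta_url("GCF_000005845.2", "ASM584v2")
print(url)
```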

Filtering options:

  • --host
  • --host-rank
  • --country
  • --continent
  • --subcontinent
  • --sample-type
  • --isolation-source
  • --environment-medium
  • --year-from
  • --year-to
  • --max-genomes
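
To preview which rows a filter combination would select, the same logic can be approximated on the clean table. This is a rough sketch under stated assumptions: `select_rows` and `parse_year` are hypothetical helpers, matching is case-insensitive and exact, and rows without a parseable Collection_Year are excluded whenever a year bound is given. The real command may behave differently; --check-only remains the authoritative preview.

```python
def parse_year(text):
    """Return the year as an int, or None when it cannot be parsed."""
    try:
        return int(float(text))
    except (TypeError, ValueError, OverflowError):
        return None


def select_rows(rows, host=None, country=None,
                year_from=None, year_to=None, max_genomes=None):
    """Filter rows by standardized host, country, and collection year."""
    selected = []
    for row in rows:
        if max_genomes is not None and len(selected) >= max_genomes:
            break
        if host and (row.get("Host_SD") or "").lower() != host.lower():
            continue
        if country and (row.get("Country") or "").lower() != country.lower():
            continue
        if year_from is not None or year_to is not None:
            year = parse_year(row.get("Collection_Year"))
            if year is None:
                continue  # assumption: unknown years fail a year filter
            if year_from is not None and year < year_from:
                continue
            if year_to is not None and year > year_to:
                continue
        selected.append(row)
    return selected


# Made-up rows for illustration.
rows = [
    {"Assembly Accession": "A1", "Host_SD": "Homo sapiens", "Country": "Bangladesh", "Collection_Year": "2019"},
    {"Assembly Accession": "A2", "Host_SD": "Homo sapiens", "Country": "Bangladesh", "Collection_Year": "2015"},
    {"Assembly Accession": "A3", "Host_SD": "Bos taurus", "Country": "Bangladesh", "Collection_Year": "2020"},
    {"Assembly Accession": "A4", "Host_SD": "Homo sapiens", "Country": "Bangladesh", "Collection_Year": ""},
]
picked = select_rows(rows, host="Homo sapiens", country="Bangladesh",
                     year_from=2018, year_to=2024)
print([row["Assembly Accession"] for row in picked])
```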

Download control:

  • --download-workers
  • --retries
  • --retry-delay
  • --keep-gz
  • --check-only

Outputs:

  • genome FASTA files
  • failed_accessions.txt
  • sequence_download_summary.csv
  • fetchm2_sequence_cache.sqlite3

Test Dataset

FetchM2 includes:

  • test.tsv: FetchM-style NCBI dataset example copied from the public FetchM test dataset.
  • examples/test_ncbi_dataset.tsv: same dataset stored under examples.
  • examples/offline_metadata.tsv: small annotated metadata table for fast offline testing.

Run:

fetchm2 metadata --input test.tsv --outdir test_run --offline
fetchm2 audit --input test_run/metadata_output/fetchm2_clean.csv --outdir test_run/audit_check

For BioSample metadata retrieval:

fetchm2 metadata --input test.tsv --outdir test_run_live --workers 3 --sleep 0.34

Rule Files Packaged With FetchM2

FetchM2 ships deterministic rules in src/fetchm2/data/:

  • host_synonyms.csv
  • host_negative_rules.csv
  • controlled_categories.csv
  • approved_broad_categories.csv
  • geography_reviewed_rules.csv
  • collection_date_reviewed_rules.csv
  • country_mapping.json

These rules let the standalone tool produce richer standardized fields without needing a web database.

API Keys

For NCBI, use environment variables:

export NCBI_API_KEY=YOUR_NCBI_API_KEY
export NCBI_EMAIL=you@example.com

Do not put API keys in scripts, notebooks, README files, Git commits, or issue reports.
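
In your own scripts, the same environment variables can be read at run time so that the key never appears in source code. The sketch below builds a parameter dictionary for an NCBI E-utilities request; `eutils_params` is a hypothetical helper, while `db`, `id`, `api_key`, and `email` are standard E-utilities parameter names.

```python
import os


def eutils_params(db, ids, environ=os.environ):
    """Build E-utilities query parameters, adding credentials when they are set."""
    params = {"db": db, "id": ",".join(ids)}
    api_key = environ.get("NCBI_API_KEY")
    email = environ.get("NCBI_EMAIL")
    if api_key:
        params["api_key"] = api_key
    if email:
        params["email"] = email
    return params


# Demo with a fake environment and a placeholder accession; real code would
# rely on os.environ.
demo = eutils_params(
    "biosample",
    ["SAMN00000001"],
    environ={"NCBI_API_KEY": "example-key", "NCBI_EMAIL": "you@example.com"},
)
print(sorted(demo))  # print only the parameter names, never the key itself
```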

Validation

Run local validation:

pytest
python -m build
python -m twine check dist/*
python -m pip install dist/fetchm2-*.whl
fetchm2 metadata --input examples/offline_metadata.tsv --outdir smoke_out --offline
fetchm2 validate --input smoke_out/metadata_output/fetchm2_clean.csv --outdir smoke_out/validation
fetchm2 seq --input smoke_out/metadata_output/fetchm2_clean.csv --outdir smoke_seq --country Bangladesh --check-only

The validation report is in:

docs/VALIDATION_REPORT.md

More analysis details:

docs/METADATA_ANALYSIS.md
docs/STANDARDIZATION.md
docs/SEQUENCE_DOWNLOAD.md
docs/RELEASE_CHECKLIST.md

License

MIT License.

