Fast bacterial genome typing with MLST, cgMLST, and scheme-free discovery
gmlst
gmlst is a fast Python 3.12 CLI for bacterial genome typing with classical MLST, large cgMLST and wgMLST schemes, and scheme-free discovery workflows. It supports assembled genomes and raw reads, several alignment backends, multiple public data providers, custom local schemes, offline cache reuse, and local MST visualization from one command-line interface.
Features
- 🧬 Broad typing support: run `gmlst typing mlst`, `gmlst typing cgmlst`, and `gmlst typing tgmlst` from the same CLI.
- ⚡ Multiple backends: use BLAST+, KMA, minimap2, MUMmer4, with built-in exact-hash pre-resolution for cgMLST workflows.
- 🧫 FASTA and FASTQ input: type assembled genomes and paired-end raw reads with backend-aware handling.
- 🗂️ Multiple providers: work with PubMLST, Pasteur BIGSdb, Enterobase, cgmlst.org, and local custom schemes.
- 🧠 Smart cgMLST modes: choose `standard`, `chew-fast`, `chew-ultrafast`, `chew-bsr`, or `chew-balanced` depending on speed and evidence needs.
- 🆕 Novel allele workflow: detect novel alleles, extract novel profiles, and build custom laboratory databases.
- 🔍 Scheme-free typing: run `tgmlst` for de novo allele discovery without a preselected public scheme.
- 📦 Rich outputs: export `tsv`, `json`, `pretty`, and GrapeTree-compatible tables.
- 🌐 Local visualization: launch a Flask + Vue web app with `gmlst visual web` to inspect MST results locally.
- 💾 Cache-first operation: downloaded schemes and built indexes are reused for offline or repeated runs.
- 🧵 Batch processing: use sample-level workers and backend threads for high-throughput workflows.
- 🧬 CDS-aware calling: cgMLST workflows can use Pyrodigal for CDS prediction and chewBBACA-compatible classification paths.
Installation
Option 1, pixi, recommended
Pixi installs Python, external bioinformatics tools, and the editable package in one environment.
curl -fsSL https://pixi.sh/install.sh | bash
git clone https://github.com/indexofire/gmlst.git
cd gmlst
pixi install
pixi run gmlst --version
Option 2, pip
Use this if you already manage your own Python and system tools.
python3 -m venv .venv
source .venv/bin/activate
pip install gmlst
# Install external tools separately, for example with conda or mamba
conda install -c bioconda blast minimap2 mummer4 mmseqs2 prodigal kma kmc samtools
Option 3, from source
git clone https://github.com/indexofire/gmlst.git
cd gmlst
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
gmlst --help
External tools managed by pixi
- blast >=2.14
- minimap2 >=2.26
- mummer4 >=4.0
- mmseqs2 >=15
- prodigal >=2.6
- kma >=1.6.8
- kmc >=3.2.4
- samtools >=1.23.1
Python package requirements
- click
- flask
- requests
- rich
- xxhash
- pyyaml
- pyrodigal
Quick Start
1. Browse and download a scheme
# List cached and available schemes
gmlst scheme list
# Restrict to one provider
gmlst scheme list -p pubmlst
# Download a scheme to the local cache
gmlst scheme download -s saureus_1
2. Type one sample
# MLST on an assembled genome
gmlst typing mlst -s saureus_1 sample.fasta
# MLST on paired-end reads
gmlst typing mlst -s saureus_1 -b minimap2 sample_R1.fastq.gz sample_R2.fastq.gz
# cgMLST on an assembly
gmlst typing cgmlst -s vparahaemolyticus_3 --cgmlst-mode chew-fast sample.fna
3. Batch processing
# Write TSV output for many assemblies
gmlst typing mlst -s saureus_1 --max-workers 8 samples/*.fasta -o results.tsv
# Save machine-readable JSON for downstream novel extraction
gmlst typing mlst -s saureus_1 --format json samples/*.fasta -o results.json
4. Understand the output
Default output is TSV, compatible with the familiar tseemann/mlst style.
FILE SCHEME ST arcC aroE glpF gmk pta tpi yqiL
sample1.fasta saureus_1 1 1 1 1 1 1 1 1
sample2.fasta saureus_1 - 1 ~2 3? - 1 1 1
- plain allele number such as `23`, exact known allele match
- `~23`, non-exact high-coverage call, typically a closest or novel-style locus depending on identity
- `15?`, partial locus hit with insufficient coverage
- `-`, locus not found
Use --format pretty for human-readable terminal output and --format json for downstream automation.
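For scripts that only have the TSV to work with, the per-locus markers can be classified with a few string checks. The sketch below is not part of gmlst: `classify_call` and `parse_tsv` are hypothetical helper names, and it assumes tab-separated columns with FILE, SCHEME, and ST first, as in the sample above.

```python
import csv
import io


def classify_call(value: str) -> str:
    """Map one locus cell to a call class (hypothetical helper, not gmlst API)."""
    if value == "-":
        return "missing"      # locus not found
    if value.startswith("~"):
        return "non_exact"    # high-coverage but not an exact allele match
    if value.endswith("?"):
        return "partial"      # hit with insufficient coverage
    if value.isdigit():
        return "exact"        # plain allele number
    return "unknown"          # anything this sketch does not recognise


def parse_tsv(text: str) -> list[dict]:
    """Read gmlst-style TSV text and attach a call class to every locus."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    loci = header[3:]  # assumption: FILE, SCHEME, ST come first
    rows = []
    for fields in reader:
        if not fields:
            continue
        calls = dict(zip(loci, fields[3:]))
        rows.append({
            "file": fields[0],
            "scheme": fields[1],
            "st": fields[2],
            "classes": {locus: classify_call(v) for locus, v in calls.items()},
        })
    return rows


# The two sample rows shown above, written with explicit tabs.
EXAMPLE = (
    "FILE\tSCHEME\tST\tarcC\taroE\tglpF\tgmk\tpta\ttpi\tyqiL\n"
    "sample1.fasta\tsaureus_1\t1\t1\t1\t1\t1\t1\t1\t1\n"
    "sample2.fasta\tsaureus_1\t-\t1\t~2\t3?\t-\t1\t1\t1\n"
)

if __name__ == "__main__":
    for row in parse_tsv(EXAMPLE):
        print(row["file"], row["st"], row["classes"])
```

When richer per-locus metadata is needed, the JSON output is likely the better starting point than re-parsing markers.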
Alignment Backends
| Backend | CLI selectable | FASTA | FASTQ | Best fit | Notes |
|---|---|---|---|---|---|
| `blastn` | Yes | Yes | No | Classical MLST on assemblies | Strong baseline for exact allele calls and targeted review |
| `kma` | Yes | Yes | Yes | FASTQ typing and cgMLST FASTQ routes | Good fit for mapping-based allele calling on reads |
| `minimap2` | Yes | Yes | Yes | Fast assembly typing and flexible read workflows | Used heavily in cgMLST optimization paths |
| `nucmer` | Yes | Yes | No | Sensitive assembly comparison | Useful for distant matches and alternate evidence |
Backend notes
- `typing mlst` and `typing cgmlst` auto-detect common paired FASTQ naming patterns such as `_R1`/`_R2`, `_1`/`_2`, and `.1`/`.2`.
- `typing cgmlst` uses `minimap2` by default for FASTA assemblies.
- For FASTQ cgMLST, the CLI follows a KMA-first policy and treats chew-style cgMLST modes as FASTA-oriented compatibility options.
- `GMLST_MINIMAP2_KMER_ENGINE=python|kmc|auto` controls the minimap2 k-mer support scorer.
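To check in advance how a directory of reads would pair up under the naming patterns listed in the notes above, a small regex is enough. This is a sketch of the general idea, not gmlst's own detection code: `MATE_RE` and `pair_fastq` are hypothetical names, and the accepted extensions (`.fastq`, `.fq`, optionally `.gz`) are an assumption.

```python
import re
from collections import defaultdict

# Mate suffixes from the notes above: _R1/_R2, _1/_2, .1/.2
# Assumption: files end in .fastq or .fq, optionally gzip-compressed.
MATE_RE = re.compile(
    r"^(?P<stem>.+?)(?P<sep>_R|_|\.)(?P<mate>[12])"
    r"(?P<ext>\.(?:fastq|fq)(?:\.gz)?)$"
)


def pair_fastq(names: list[str]) -> tuple[dict[str, tuple[str, str]], list[str]]:
    """Group file names into (mate 1, mate 2) pairs keyed by sample stem.

    Returns the pairs plus every name that could not be paired.
    """
    mates: dict[tuple[str, str, str], dict[str, str]] = defaultdict(dict)
    unpaired: list[str] = []
    for name in names:
        match = MATE_RE.match(name)
        if match is None:
            unpaired.append(name)
            continue
        # Mates must share stem, separator style, and extension.
        key = (match["stem"], match["sep"], match["ext"])
        mates[key][match["mate"]] = name

    pairs: dict[str, tuple[str, str]] = {}
    for (stem, _sep, _ext), found in mates.items():
        if "1" in found and "2" in found:
            pairs[stem] = (found["1"], found["2"])
        else:
            unpaired.extend(found.values())
    return pairs, unpaired


if __name__ == "__main__":
    files = [
        "a_R1.fastq.gz", "a_R2.fastq.gz",
        "b_1.fq", "b_2.fq",
        "c.1.fastq", "c.2.fastq",
        "d_R1.fastq.gz",   # mate 2 is missing
        "assembly.fasta",  # not a FASTQ file
    ]
    pairs, unpaired = pair_fastq(files)
    print(pairs)
    print(unpaired)
```

Files that end up in the unpaired list are worth checking by hand before a batch run.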
Data Providers
| Provider | Source | Typical use |
|---|---|---|
| `pubmlst` | PubMLST REST catalogs | Common public MLST schemes |
| `pasteur` | Pasteur BIGSdb API | BIGSdb-hosted species collections |
| `enterobase` | Enterobase scheme downloads | Large curated scheme sets |
| `cgmlst` | cgmlst.org | cgMLST-focused public schemes |
| `local` | Local cache and custom schemes | Private laboratory databases and exported custom schemes |
Examples:
gmlst scheme list -p pubmlst
gmlst scheme list -p enterobase -t cgmlst
gmlst scheme list -p local
gmlst scheme show -s saureus_1
Novel Data Workflow
Build a local custom scheme from novel calls collected during routine typing.
# 1. Type samples and save JSON
gmlst typing mlst -s saureus_1 --format json *.fasta -o typing_results.json
# 2. Extract novel alleles and novel profiles
gmlst utils extract -i typing_results.json --novel-allele --novel-profile --data-dir novel_data
# 3. Create a local custom scheme
gmlst scheme create -t mlst -s saureus_1 --data-dir novel_data --desc "Lab collection 2024"
# 4. Add more novel data later
gmlst scheme update-custom -s custom_1 --data-dir more_novel_data
# 5. Export for downstream MST work
gmlst scheme export -s custom_1 --format grapetree -o custom_1_grapetree.tsv
TSV fallback is also supported when you only have tabular typing output and the original sample files are available:
gmlst utils extract -i typing_results.tsv -s saureus_1 --novel-allele --novel-profile \
--samples-dir ./samples --data-dir novel_data
cgMLST Modes
gmlst typing cgmlst supports several calling modes for different speed and evidence trade-offs.
| Mode | What it does | Good default |
|---|---|---|
| `standard` | Conservative baseline behavior | Start here if you want predictable generic settings |
| `chew-fast` | Exact-hash plus minimap2 prefilter with targeted rescue | Fast everyday assembly typing |
| `chew-ultrafast` | More aggressive speed profile with bounded second-pass rescue | Large batches where turnaround matters most |
| `chew-bsr` | Adds protein-level exact-hash style resolution on top of `chew-fast` | Cases where protein evidence is useful |
| `chew-balanced` | Hash-first path with targeted `blastn` fallback | Balance speed with stronger low-confidence review |
Examples:
gmlst typing cgmlst -s vparahaemolyticus_3 --cgmlst-mode standard sample.fna
gmlst typing cgmlst -s vparahaemolyticus_3 --cgmlst-mode chew-fast sample.fna
gmlst typing cgmlst -s vparahaemolyticus_3 --cgmlst-mode chew-ultrafast sample.fna
gmlst typing cgmlst -s vparahaemolyticus_3 --cgmlst-mode chew-bsr sample.fna
gmlst typing cgmlst -s vparahaemolyticus_3 --cgmlst-mode chew-balanced sample.fna
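Several of the modes above start with an exact-hash step. In generic form, the idea is to hash every known allele sequence once and then resolve candidate sequences by dictionary lookup before any alignment runs. The sketch below illustrates that general technique only and is not gmlst's implementation. It uses `hashlib.blake2b` from the standard library to stay self-contained (gmlst itself lists xxhash among its dependencies), and all function names and toy sequences are hypothetical.

```python
import hashlib

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def seq_digest(seq: str) -> str:
    """Stable digest of an upper-cased nucleotide sequence."""
    return hashlib.blake2b(seq.upper().encode("ascii"), digest_size=16).hexdigest()


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def build_index(alleles: dict[str, dict[int, str]]) -> dict[str, tuple[str, int]]:
    """Map digest -> (locus, allele_id) for every known allele."""
    index = {}
    for locus, by_id in alleles.items():
        for allele_id, seq in by_id.items():
            index[seq_digest(seq)] = (locus, allele_id)
    return index


def resolve_exact(candidates: list[str], index: dict[str, tuple[str, int]]):
    """Resolve candidates by exact hash on either strand.

    Returns ({locus: allele_id}, [sequences left for an alignment backend]).
    """
    resolved: dict[str, int] = {}
    unresolved: list[str] = []
    for seq in candidates:
        hit = index.get(seq_digest(seq)) or index.get(seq_digest(reverse_complement(seq)))
        if hit is None:
            unresolved.append(seq)
        else:
            locus, allele_id = hit
            resolved[locus] = allele_id
    return resolved, unresolved


if __name__ == "__main__":
    # Toy alleles, not taken from any real scheme.
    alleles = {
        "arcC": {1: "ATGAAATAA", 2: "ATGAAGTAA"},
        "aroE": {1: "ATGCCCTGA"},
    }
    index = build_index(alleles)
    candidates = ["ATGAAGTAA", reverse_complement("ATGCCCTGA"), "ATGTTTTAA"]
    print(resolve_exact(candidates, index))
```

Only the sequences left unresolved would need to go through a slower alignment step, which is where a hash-first path saves time.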
Scheme-free Typing (tgmlst)
Use tgmlst when you want scheme-free allele discovery and optional scheme reuse.
# Run scheme-free typing
gmlst typing tgmlst sample.fna --stats
# Save a discovered scheme for reuse
gmlst typing tgmlst sample.fna --save-scheme tgmlst_scheme.json
# Reuse a previously saved scheme
gmlst typing tgmlst another_sample.fna --load-scheme tgmlst_scheme.json --format json
Useful options include --hash-strategy, --summary-report, --error-report, and --fail-on-error.
Visualization
Launch the local web application to build an MST from cgMLST or exported GrapeTree-style profiles.
gmlst visual web --open-browser
Or bind to a custom address:
gmlst visual web --host 0.0.0.0 --port 8787
The web UI accepts TSV data, builds a minimum spanning tree, and serves a local Flask API with a Vue frontend.
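To make the tree-building step concrete, here is a minimal sketch of a minimum spanning tree over allele profiles, using the number of differing loci as the distance and Prim's algorithm to pick edges. It illustrates the general technique and is not the code behind `gmlst visual web`. The function names, the choice to ignore loci where either sample is missing (`-`), and the toy profiles are all assumptions.

```python
import heapq


def profile_distance(a: list[str], b: list[str]) -> int:
    """Count loci where both profiles have a call and the calls differ."""
    return sum(1 for x, y in zip(a, b) if x != "-" and y != "-" and x != y)


def minimum_spanning_tree(profiles: dict[str, list[str]]) -> list[tuple[str, str, int]]:
    """Return MST edges as (parent, child, distance) using Prim's algorithm."""
    names = sorted(profiles)
    if not names:
        return []
    start = names[0]
    in_tree = {start}
    # Candidate edges leaving the tree, ordered by distance.
    heap = [
        (profile_distance(profiles[start], profiles[n]), start, n)
        for n in names[1:]
    ]
    heapq.heapify(heap)
    edges = []
    while heap and len(in_tree) < len(names):
        dist, parent, child = heapq.heappop(heap)
        if child in in_tree:
            continue  # stale edge, the node was reached by a shorter one
        in_tree.add(child)
        edges.append((parent, child, dist))
        for other in names:
            if other not in in_tree:
                heapq.heappush(
                    heap,
                    (profile_distance(profiles[child], profiles[other]), child, other),
                )
    return edges


if __name__ == "__main__":
    # Toy four-locus profiles.
    profiles = {
        "A": ["1", "1", "1", "1"],
        "B": ["1", "1", "1", "2"],
        "C": ["1", "1", "2", "2"],
        "D": ["3", "3", "3", "3"],
    }
    for parent, child, dist in minimum_spanning_tree(profiles):
        print(parent, child, dist)
```

Dedicated MST tools for allele profiles may handle missing data and tie-breaking differently, so edge choices can differ from this sketch even when the total tree length matches.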
Configuration
Key environment variables:
| Variable | Purpose |
|---|---|
| `GMLST_CACHE_DIR` | Override the default cache root, usually `~/.cache/gmlst` |
| `GMLST_TMPDIR` | Override temporary working directory used during typing and refinement |
| `GMLST_MINIMAP2_KMER_ENGINE` | Choose minimap2 k-mer support engine: `python`, `kmc`, or `auto` |
| `GMLST_PUBMLST_BASE_URL` | Override PubMLST API base URL |
| `GMLST_PASTEUR_BASE_URL` | Override Pasteur BIGSdb API base URL |
| `GMLST_PRIVATE_BIGSDB_URL` | Register a private BIGSdb instance as an extra provider |
| `GMLST_PRIVATE_BIGSDB_NAME` | Name shown for the private BIGSdb provider |
| `GMLST_PRIVATE_BIGSDB_LABEL` | Human-readable label for the private BIGSdb provider |
Example:
export GMLST_CACHE_DIR="$HOME/.cache/gmlst"
export GMLST_TMPDIR="$PWD/.tmp/gmlst"
export GMLST_MINIMAP2_KMER_ENGINE=auto
export GMLST_PUBMLST_BASE_URL="https://rest.pubmlst.org/db"
export GMLST_PASTEUR_BASE_URL="https://bigsdb.pasteur.fr/api/db"
Private BIGSdb example:
export GMLST_PRIVATE_BIGSDB_URL="http://127.0.0.1:9000/api/db"
export GMLST_PRIVATE_BIGSDB_NAME="labdb"
export GMLST_PRIVATE_BIGSDB_LABEL="Lab BIGSdb"
gmlst scheme list -p labdb
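When gmlst is wrapped in a pipeline, it can help to resolve these variables in one place before launching jobs. The sketch below only reads three of the environment variables listed above and validates the k-mer engine value. `resolve_settings` is a hypothetical helper, the `auto` fallback for the k-mer engine is an assumption, and gmlst's own resolution logic may differ.

```python
import os
from pathlib import Path

VALID_KMER_ENGINES = {"python", "kmc", "auto"}


def resolve_settings(env=None) -> dict:
    """Collect gmlst-related settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env

    # Documented usual cache root is ~/.cache/gmlst.
    cache_dir = Path(env.get("GMLST_CACHE_DIR") or Path.home() / ".cache" / "gmlst")

    # Assumption: fall back to "auto" when the variable is unset.
    engine = env.get("GMLST_MINIMAP2_KMER_ENGINE", "auto")
    if engine not in VALID_KMER_ENGINES:
        raise ValueError(
            f"GMLST_MINIMAP2_KMER_ENGINE must be one of {sorted(VALID_KMER_ENGINES)}, "
            f"got {engine!r}"
        )

    tmpdir = env.get("GMLST_TMPDIR")
    return {
        "cache_dir": cache_dir,
        "tmpdir": Path(tmpdir) if tmpdir else None,
        "kmer_engine": engine,
    }


if __name__ == "__main__":
    print(resolve_settings())
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise in tests without touching the real environment.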
Output Format Details
The default TSV format uses compact markers per locus.
| Marker | Meaning |
|---|---|
| `23` | Exact allele call |
| `~23` | Non-exact but high-coverage call, used for closest hits and novel-like loci |
| `15?` | Partial call, coverage below the confident threshold |
| `-` | Missing locus |
JSON output is the best choice when you want structured fields such as per-locus call metadata and novel_sequence extraction data.
Multicopy Loci Notes
- Conflicting multicopy calls are reported with comma notation such as `1,2`.
- When conflicting multicopy loci are present, ST is reported as `-` to avoid overconfident profile assignment.
- Same-allele copy counting such as `1,1` is optional and currently exposed through `--count-same-copy` for `blastn` workflows.
Recommended review pattern:
# Fast first pass
gmlst typing mlst -s vparahaemolyticus_1 *.fna -o pass1.tsv
# Targeted second pass on flagged samples
gmlst typing mlst -s vparahaemolyticus_1 -b blastn --count-same-copy flagged_sample.fna
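Choosing which samples go into the second pass can be scripted from the comma notation described above. The sketch below treats a call as conflicting when its comma-separated copies are not all the same allele, so `1,2` is flagged while `1,1` is not. The helper names and the toy profiles are hypothetical, and the input is assumed to be already parsed into a mapping of sample to locus calls.

```python
def split_copies(call: str) -> list[str]:
    """Split a locus call such as '1,2' into its individual copies."""
    return [part.strip() for part in call.split(",") if part.strip()]


def is_conflicting(call: str) -> bool:
    """True for calls like '1,2'; False for '1' or same-allele copies like '1,1'."""
    return len(set(split_copies(call))) > 1


def samples_to_review(profiles: dict[str, dict[str, str]]) -> dict[str, list[str]]:
    """Return {sample: [loci with conflicting multicopy calls]}."""
    flagged = {}
    for sample, calls in profiles.items():
        loci = [locus for locus, call in calls.items() if is_conflicting(call)]
        if loci:
            flagged[sample] = loci
    return flagged


if __name__ == "__main__":
    # Toy first-pass calls for two loci.
    first_pass = {
        "a.fna": {"dnaE": "3", "gyrB": "4"},
        "b.fna": {"dnaE": "1,2", "gyrB": "4"},
        "c.fna": {"dnaE": "1,1", "gyrB": "4"},
    }
    print(samples_to_review(first_pass))
```

The flagged sample names can then be passed to the targeted `blastn` command shown above.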
Development
Set up the development environment:
pixi install
pixi run install-dev
Common tasks:
pixi run lint
pixi run format-check
pixi run test
pixi run check
Direct Ruff commands also work:
pixi run ruff check .
pixi run ruff format .
See docs/contributing.md for contributor workflow and docs/architecture.md for module boundaries and typing-path contracts.
Documentation Index
- docs/README.md for the full documentation map
- docs/installation.md for installation details
- docs/quickstart.md for a guided first run
- docs/commands.md for the CLI reference
- README_ZH.md for the Chinese root guide
License
Released under the MIT License.
Acknowledgments
- Inspired by tseemann/mlst
- Uses public scheme data from PubMLST