symclatron: symbiont classifier
symclatron classifies microbial genomes into three lifestyle categories:
- Free-living
- Symbiont;Host-associated
- Symbiont;Obligate-intracellular
It accepts protein FASTA directly, or nucleotide FASTA with automatic conversion to proteins before classification.
What symclatron implements
For each genome, symclatron currently performs the following workflow:
- Validates the input FASTA file(s) and checks that genome identifiers derived from filenames are unique.
- Detects whether each input file contains proteins, nucleotide genes/CDS, or nucleotide contigs/assemblies.
- Converts nucleotide input to proteins.
  - Gene/CDS FASTA (.ffn, .fnn) is translated in frame.
  - Contig/assembly FASTA (.fa, .fas, .fasta, .fna) is gene-called and translated with pyrodigal.
- Runs HMM searches against the symclatron feature set and the UNI56 marker set.
- Builds feature matrices for the symcla, symreg, and hostcla XGBoost submodels.
- Computes additional distance-based features relative to the training data.
- Applies the final neural-network model to produce the reported class and confidence score.
- Relabels low-confidence predictions as Unknown in the classification_thresholded column, using the default threshold of 0.725 unless --confidence-threshold is provided.
- Writes final tables, summaries, logs, and optional intermediate files.
The final reported class is produced by the neural-network stage. The hostcla model is still run and its intermediate bitscore table is retained in the output directory.
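The first workflow step, checking that genome identifiers derived from filenames are unique, can be sketched in Python as follows. This is a minimal illustration, not symclatron's actual code: the helper names `genome_id` and `find_duplicate_ids` are hypothetical, and the rule used here (drop an optional `.gz`, then the last suffix) is an assumption about how identifiers might be derived.

```python
from collections import defaultdict
from pathlib import Path


def genome_id(path: str) -> str:
    """Derive an identifier from a filename (assumed rule, for illustration)."""
    name = Path(path).name
    if name.endswith(".gz"):
        name = name[:-3]          # drop the compression suffix first
    return Path(name).stem        # then drop the last remaining suffix


def find_duplicate_ids(paths):
    """Return {identifier: [files]} for identifiers shared by several files."""
    by_id = defaultdict(list)
    for path in paths:
        by_id[genome_id(path)].append(path)
    return {gid: files for gid, files in by_id.items() if len(files) > 1}
```

Running a check like this on a directory listing before classification would flag cases such as `g1.faa` and `g1.fna.gz`, which collapse to the same identifier under the assumed rule.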
Installation
The project metadata currently targets Python 3.12 on Linux and Apple Silicon macOS (osx-arm64, including M1-M5). The recommended install path is pixi; mamba/conda also works.
Option 1: pixi (recommended)
Install pixi:
curl -fsSL https://pixi.sh/install.sh | sh
Then install symclatron and download the data bundle:
pixi global install --pinning-strategy no-pin -c conda-forge -c bioconda -c https://repo.prefix.dev/astrogenomics symclatron
symclatron setup
If symclatron was already installed with default pinning, reset it once so future upgrades
do not get stuck on an older 0.x minor line:
pixi global add --environment symclatron --pinning-strategy no-pin symclatron
pixi global update symclatron
Run the bundled self-test:
symclatron test
Option 2: mamba or conda
mamba create -n symclatron -c conda-forge -c bioconda -c https://repo.prefix.dev/astrogenomics symclatron
mamba run -n symclatron symclatron setup
mamba run -n symclatron symclatron test
pixi global install and the conda-based workflow install both CLI names:
- symclatron
- symcla
For example, symcla classify ... is equivalent to symclatron classify ....
First-time setup
Before classification, download the packaged database and model bundle once:
symclatron setup
Useful setup options:
- --force, -f: remove any existing bundled data and download again
- --data-url: override the default GitHub Release URL
- --data-sha256: verify the downloaded archive against a SHA256 digest
- --quiet, -q: suppress routine progress messages
By default, setup downloads the bundle from the GitHub Release tag db-latest.
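For readers who want to confirm a digest themselves before passing it to --data-sha256, the following Python sketch computes the SHA256 of a local file with the standard library. The function names `sha256_of_file` and `matches_expected` are hypothetical helpers for illustration and say nothing about how symclatron performs its own check.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected_hex: str) -> bool:
    """Compare against an expected digest, ignoring case and stray whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

The resulting hex string is the kind of value that --data-sha256 expects.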
Accepted inputs
--genome-dir can point either to a directory containing one genome per file, or to a single FASTA file.
Supported FASTA input types
| Input type | Typical suffixes | What symclatron does |
|---|---|---|
| Protein FASTA | .faa, .faa.gz, .aa, .aa.fasta, .pep, .pep.fasta, .protein.faa | Uses proteins directly |
| Nucleotide genes/CDS FASTA | .ffn, .ffn.gz, .fnn, .fnn.gz | Translates sequences in frame |
| Nucleotide contig/assembly FASTA | .fa, .fa.gz, .fas, .fas.gz, .fasta, .fasta.gz, .fna, .fna.gz | Predicts genes and proteins with pyrodigal |
Notes:
- Gzipped FASTA files are supported.
- If file extensions are ambiguous, use --input-kind proteins, --input-kind genes, or --input-kind contigs.
- Use --input-ext to restrict which files are picked up from a directory.
- Genome identifiers in the output come from input filenames, so filenames should be unique within a run.
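The suffix table above can be turned into a small pre-flight check that predicts which input kind a filename suggests. The Python sketch below is an assumption-laden illustration: `guess_input_kind` is a hypothetical helper, it looks only at suffixes, and symclatron's own auto-detection may also inspect file contents.

```python
# Suffixes copied from the table of supported FASTA input types.
PROTEIN_SUFFIXES = (".faa", ".aa", ".aa.fasta", ".pep", ".pep.fasta", ".protein.faa")
GENE_SUFFIXES = (".ffn", ".fnn")
CONTIG_SUFFIXES = (".fa", ".fas", ".fasta", ".fna")


def guess_input_kind(filename: str) -> str:
    """Guess 'proteins', 'genes', or 'contigs' from the suffix, else 'unknown'."""
    name = filename.lower()
    if name.endswith(".gz"):
        name = name[:-3]  # gzipped files use the same suffixes underneath
    # Proteins are checked first so ".aa.fasta" and ".pep.fasta" are not
    # mistaken for contig ".fasta" files.
    for kind, suffixes in (
        ("proteins", PROTEIN_SUFFIXES),
        ("genes", GENE_SUFFIXES),
        ("contigs", CONTIG_SUFFIXES),
    ):
        if name.endswith(suffixes):
            return kind
    return "unknown"  # hypothetical label: a cue to pass --input-kind explicitly
```

Files that come back as `unknown` are candidates for an explicit --input-kind value.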
Quick start
Classify protein FASTA
symclatron classify --genome-dir /path/to/proteins --output-dir results
Classify contig FASTA and predict proteins automatically
symclatron classify --genome-dir /path/to/contigs --output-dir results
Force contig mode and only include .fna files
symclatron classify \
--genome-dir /path/to/inputs \
--input-kind contigs \
--input-ext .fna \
--output-dir results
Override the default confidence threshold
symclatron classify \
--genome-dir /path/to/genomes \
--confidence-threshold 0.80 \
--output-dir results
CLI reference
symclatron classify
symclatron classify [OPTIONS]
Options:
- --genome-dir, -i: input directory or single FASTA file
- --input-kind: auto, proteins, genes, or contigs
- --input-ext: limit directory scanning to specific extensions; repeat the flag or pass comma-separated values
- --output-dir, -o: results directory; default is output_Symclatron_<DATETIME>
- --keep-tmp: keep intermediate files instead of removing tmp/
- --threads, -t: number of HMMER threads, from 1 to 32
- --confidence-threshold: threshold for conservative interpretation; defaults to 0.725 and can be overridden
- --quiet, -q: suppress routine console progress output
- --verbose: increase log detail
Examples:
symclatron classify --genome-dir genomes --output-dir results
symclatron classify --genome-dir genomes --threads 8 --keep-tmp --output-dir results
symclatron classify --genome-dir genomes --quiet --output-dir results
symclatron classify --genome-dir genomes --verbose --output-dir results
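When classification is driven from a script, the same invocations can be assembled as an argument list. The Python sketch below uses only flags documented above; the helper `build_classify_command` is hypothetical, and the range check simply mirrors the documented 1 to 32 limit for --threads.

```python
import shlex


def build_classify_command(genome_dir, output_dir, threads=None,
                           confidence_threshold=None, keep_tmp=False):
    """Build an argv list for `symclatron classify` from documented options."""
    if threads is not None and not 1 <= threads <= 32:
        raise ValueError("--threads must be between 1 and 32")
    cmd = ["symclatron", "classify",
           "--genome-dir", str(genome_dir),
           "--output-dir", str(output_dir)]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    if confidence_threshold is not None:
        cmd += ["--confidence-threshold", str(confidence_threshold)]
    if keep_tmp:
        cmd.append("--keep-tmp")
    return cmd


# Printable form of the command, quoted safely for a shell.
command_line = shlex.join(build_classify_command("genomes", "results", threads=8))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` on a machine where symclatron is installed.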
symclatron test
symclatron test [OPTIONS]
This runs the bundled example data installed by symclatron setup.
Options:
- --keep-tmp: keep intermediate files for the test run
- --mode: proteins, contigs, or both (default)
- --output-dir, -o: test output root; default is output_test_Symclatron_<DATETIME>
- --confidence-threshold: threshold for conservative interpretation; defaults to 0.725 and can be overridden
When --mode both is used, results are written under:
- <output-dir>/faa
- <output-dir>/fna
symclatron setup
symclatron setup [OPTIONS]
Options:
- --force, -f: redownload the data bundle even if it already exists
- --quiet, -q: suppress routine setup messages
- --data-url: override the default bundle URL
- --data-sha256: expected SHA256 digest for the bundle
Help and version
symclatron --help
symclatron classify --help
symclatron setup --help
symclatron test --help
symclatron --version
Output files
The main output directory is intentionally kept simple for end users. At the top level you should expect:
- symclatron_results.tsv
- classification_summary.txt
- logs/
- extra_results/
Final result table: symclatron_results.tsv
Columns:
- taxon_oid: genome identifier derived from the input filename
- completeness_UNI56: estimated completeness based on the UNI56 marker set
- classification: final predicted lifestyle class
- confidence: confidence score for the reported class
- passes_confidence_threshold: boolean column showing whether confidence >= applied_threshold
- classification_thresholded: conservative label using the applied threshold; predictions below threshold are reported as Unknown
Exact class labels written by the current implementation are:
- Free-living
- Symbiont;Host-associated
- Symbiont;Obligate-intracellular
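A result table with the columns listed above can be summarized with the Python standard library. The sketch below is illustrative only: `count_classes` is a hypothetical helper, and the example rows are invented placeholders (including how numbers and booleans are formatted), not real symclatron output.

```python
import csv
import io
from collections import Counter

HEADER = ["taxon_oid", "completeness_UNI56", "classification", "confidence",
          "passes_confidence_threshold", "classification_thresholded"]

# Invented placeholder rows; real values and their formatting may differ.
ROWS = [
    ["genomeA", "98.2", "Free-living", "0.95", "True", "Free-living"],
    ["genomeB", "91.0", "Symbiont;Host-associated", "0.60", "False", "Unknown"],
    ["genomeC", "75.4", "Symbiont;Obligate-intracellular", "0.91", "True",
     "Symbiont;Obligate-intracellular"],
]
EXAMPLE_TSV = "\n".join("\t".join(fields) for fields in [HEADER, *ROWS]) + "\n"


def count_classes(tsv_handle, column="classification_thresholded"):
    """Count how many genomes carry each label in the chosen column."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    return Counter(row[column] for row in reader)


counts = count_classes(io.StringIO(EXAMPLE_TSV))
```

For a real run, the same function can be given an open handle to symclatron_results.tsv from the chosen output directory.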
Summary and logs
- classification_summary.txt: counts and summary statistics for the run
- logs/symclatron.log: run log
- logs/resource_usage_*.log: resource-monitoring log
Auxiliary TSV outputs in extra_results/
- extra_results/bitscore_symcla.tsv
- extra_results/bitscore_symreg.tsv
- extra_results/bitscore_hostcla.tsv
- extra_results/shap_symreg.tsv
- extra_results/feature_contribution_symreg.tsv
- extra_results/shap_melt_symreg.tsv
Temporary files
If --keep-tmp is used, the tmp/ directory is kept. It contains renamed FASTA files, merged FASTA files, HMMER tables, model-specific feature tables, and additional intermediate prediction files.
Interpreting the results
- The classification column always reports the highest-probability final class.
- The default conservative threshold is 0.725 for every run.
- Use --confidence-threshold <value> only when you want to override that default.
- Lower-confidence calls are preserved in classification but are relabeled as Unknown in classification_thresholded.
- completeness_UNI56 is provided to help judge how complete the genome appears relative to the marker set used by the workflow.
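The relationship between classification, confidence, and the two threshold-derived columns can be written out as a short Python sketch. It restates the rule described in this README (confidence >= applied threshold, otherwise Unknown); the function name `apply_threshold` is hypothetical and this is not symclatron's source code.

```python
DEFAULT_THRESHOLD = 0.725  # default value stated in this README


def apply_threshold(classification: str, confidence: float,
                    threshold: float = DEFAULT_THRESHOLD):
    """Return (passes_confidence_threshold, classification_thresholded)."""
    passes = confidence >= threshold
    # The original class is kept only when the call meets the threshold.
    return passes, (classification if passes else "Unknown")
```

Under this rule a call with confidence 0.60 keeps its label in classification but becomes Unknown in classification_thresholded, unless the threshold is lowered to 0.60 or below.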
Citation
If you use symclatron in your research, please cite:
Juan C. Villada, Yumary M. Vasquez, Gitta Szabo, Ewan Whittaker-Walker, Miguel F. Romero, Sarina Qin, Neha Varghese, Emiley A. Eloe-Fadrosh, Nikos C. Kyrpides, SymGs data consortium, Axel Visel, Tanja Woyke, Frederik Schulz. A genomic catalog of Earth’s bacterial and archaeal symbionts. bioRxiv 2025.05.29.656868; doi: https://doi.org/10.1101/2025.05.29.656868
Support
- Repository: https://github.com/NeLLi-team/symclatron
- Issues: https://github.com/NeLLi-team/symclatron/issues
- Author: Juan C. Villada (jvillada@lbl.gov)
File details
Details for the file symclatron-0.10.10.tar.gz.
File metadata
- Download URL: symclatron-0.10.10.tar.gz
- Upload date:
- Size: 947.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ff1aceb7db819c9d1977ae1042a0c49cdf5c491064051f5656825828d4cef711 |
| MD5 | eda577e299b1c34402b121cb5daf1f13 |
| BLAKE2b-256 | 1338258acaa0b81c8c142cc46480a0c56a6a600d0ad2be044660ab79ee412701 |
File details
Details for the file symclatron-0.10.10-py3-none-any.whl.
File metadata
- Download URL: symclatron-0.10.10-py3-none-any.whl
- Upload date:
- Size: 32.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d18ece12646308d11fcffd2492e53eca9e6e0928ce49be0e09b85e66e42b771d |
| MD5 | 5a880ec2a5c7007a8968806c8ea614c9 |
| BLAKE2b-256 | b8ba40044ec9545ce1ca8317f764e19b14ab7f0c3e1829c5f9dcfe3e1d4004ed |