Skip to main content

symclatron: symbiont classifier

Project description

symclatron: symbiont classifier

Figure 1

symclatron classifies microbial genomes into three lifestyle categories:

  • Free-living
  • Symbiont;Host-associated
  • Symbiont;Obligate-intracellular

It accepts protein FASTA directly, or nucleotide FASTA with automatic conversion to proteins before classification.

What symclatron implements

For each genome, symclatron currently performs the following workflow:

  1. Validates the input FASTA file(s) and checks that genome identifiers derived from filenames are unique.
  2. Detects whether each input file contains proteins, nucleotide genes/CDS, or nucleotide contigs/assemblies.
  3. Converts nucleotide input to proteins.
    • Gene/CDS FASTA (.ffn, .fnn) is translated in frame.
    • Contig/assembly FASTA (.fa, .fas, .fasta, .fna) is gene-called and translated with pyrodigal.
  4. Runs HMM searches against the symclatron feature set and the UNI56 marker set.
  5. Builds feature matrices for the symcla, symreg, and hostcla XGBoost submodels.
  6. Computes additional distance-based features relative to the training data.
  7. Applies the final neural-network model to produce the reported class and confidence score.
  8. Optionally relabels low-confidence predictions as Unknown when --confidence-threshold is provided.
  9. Writes final tables, summaries, logs, and optional intermediate files.

The final reported class is produced by the neural-network stage. The hostcla model is still run and its intermediate bitscore table is retained in the output directory.

Installation

The project metadata currently targets Python 3.12 on POSIX/Linux systems. The recommended install path is pixi; mamba/conda also works.

Option 1: pixi (recommended)

Install pixi:

curl -fsSL https://pixi.sh/install.sh | sh

Then install symclatron and download the data bundle:

pixi global install -c conda-forge -c bioconda -c https://repo.prefix.dev/astrogenomics symclatron
symclatron setup

Run the bundled self-test:

symclatron test

Option 2: mamba or conda

mamba create -n symclatron -c conda-forge -c bioconda -c https://repo.prefix.dev/astrogenomics symclatron
mamba run -n symclatron symclatron setup
mamba run -n symclatron symclatron test

pixi global install and the conda-based workflow install both CLI names:

  • symclatron
  • symcla

For example, symcla classify ... is equivalent to symclatron classify ....

First-time setup

Before classification, download the packaged database and model bundle once:

symclatron setup

Useful setup options:

  • --force, -f: remove any existing bundled data and download again
  • --data-url: override the default GitHub Release URL
  • --data-sha256: verify the downloaded archive against a SHA256 digest
  • --quiet, -q: suppress routine progress messages

By default, setup downloads the bundle from the GitHub Release tag db-latest.

Accepted inputs

--genome-dir can point either to a directory containing one genome per file, or to a single FASTA file.

Supported FASTA input types

Input type Typical suffixes What symclatron does
Protein FASTA .faa, .faa.gz, .aa, .aa.fasta, .pep, .pep.fasta, .protein.faa Uses proteins directly
Nucleotide genes/CDS FASTA .ffn, .ffn.gz, .fnn, .fnn.gz Translates sequences in frame
Nucleotide contig/assembly FASTA .fa, .fa.gz, .fas, .fas.gz, .fasta, .fasta.gz, .fna, .fna.gz Predicts genes and proteins with pyrodigal

Notes:

  • Gzipped FASTA files are supported.
  • If file extensions are ambiguous, use --input-kind proteins, --input-kind genes, or --input-kind contigs.
  • Use --input-ext to restrict which files are picked up from a directory.
  • Genome identifiers in the output come from input filenames, so filenames should be unique within a run.

Quick start

Classify protein FASTA

symclatron classify --genome-dir /path/to/proteins --output-dir results

Classify contig FASTA and predict proteins automatically

symclatron classify --genome-dir /path/to/contigs --output-dir results

Force contig mode and only include .fna files

symclatron classify \
  --genome-dir /path/to/inputs \
  --input-kind contigs \
  --input-ext .fna \
  --output-dir results

Override the default confidence threshold

symclatron classify \
  --genome-dir /path/to/genomes \
  --confidence-threshold 0.80 \
  --output-dir results

CLI reference

symclatron classify

symclatron classify [OPTIONS]

Options:

  • --genome-dir, -i: input directory or single FASTA file
  • --input-kind: auto, proteins, genes, or contigs
  • --input-ext: limit directory scanning to specific extensions; repeat the flag or pass comma-separated values
  • --output-dir, -o: results directory; default is output_Symclatron_<DATETIME>
  • --keep-tmp: keep intermediate files instead of removing tmp/
  • --threads, -t: number of HMMER threads, from 1 to 32
  • --confidence-threshold: threshold for conservative interpretation; defaults to 0.725 and can be overridden
  • --quiet, -q: suppress routine console progress output
  • --verbose: increase log detail

Examples:

symclatron classify --genome-dir genomes --output-dir results
symclatron classify --genome-dir genomes --threads 8 --keep-tmp --output-dir results
symclatron classify --genome-dir genomes --quiet --output-dir results
symclatron classify --genome-dir genomes --verbose --output-dir results

symclatron test

symclatron test [OPTIONS]

This runs the bundled example data installed by symclatron setup.

Options:

  • --keep-tmp: keep intermediate files for the test run
  • --mode: proteins, contigs, or both (default)
  • --output-dir, -o: test output root; default is output_test_Symclatron_<DATETIME>
  • --confidence-threshold: threshold for conservative interpretation; defaults to 0.725 and can be overridden

When --mode both is used, results are written under:

  • <output-dir>/faa
  • <output-dir>/fna

symclatron setup

symclatron setup [OPTIONS]

Options:

  • --force, -f: redownload the data bundle even if it already exists
  • --quiet, -q: suppress routine setup messages
  • --data-url: override the default bundle URL
  • --data-sha256: expected SHA256 digest for the bundle

Help and version

symclatron --help
symclatron classify --help
symclatron setup --help
symclatron test --help
symclatron --version

Output files

The main output directory is intentionally kept simple for end users. At the top level you should expect:

  • symclatron_results.tsv
  • classification_summary.txt
  • logs/
  • extra_results/

Final result table: symclatron_results.tsv

Columns:

  • taxon_oid: genome identifier derived from the input filename
  • completeness_UNI56: estimated completeness based on the UNI56 marker set
  • classification: final predicted lifestyle class
  • confidence: confidence score for the reported class
  • passes_confidence_threshold: boolean column showing whether confidence >= applied_threshold
  • classification_thresholded: conservative label using the applied threshold; predictions below threshold are reported as Unknown

Exact class labels written by the current implementation are:

  • Free-living
  • Symbiont;Host-associated
  • Symbiont;Obligate-intracellular

Summary and logs

  • classification_summary.txt: counts and summary statistics for the run
  • logs/symclatron.log: run log
  • logs/resource_usage_*.log: resource-monitoring log

Auxiliary TSV outputs in extra_results/

  • extra_results/bitscore_symcla.tsv
  • extra_results/bitscore_symreg.tsv
  • extra_results/bitscore_hostcla.tsv
  • extra_results/shap_symreg.tsv
  • extra_results/feature_contribution_symreg.tsv
  • extra_results/shap_melt_symreg.tsv

Temporary files

If --keep-tmp is used, the tmp/ directory is kept. It contains renamed FASTA files, merged FASTA files, HMMER tables, model-specific feature tables, and additional intermediate prediction files.

Interpreting the results

  • The classification column always reports the highest-probability final class.
  • The default conservative threshold is 0.725 for every run.
  • Use --confidence-threshold <value> only when you want to override that default.
  • Lower-confidence calls are preserved in classification but are relabeled as Unknown in classification_thresholded.
  • completeness_UNI56 is provided to help judge how complete the genome appears relative to the marker set used by the workflow.

Citation

If you use symclatron in your research, please cite:

A genomic catalog of Earth’s bacterial and archaeal symbionts. Juan C. Villada, Yumary M. Vasquez, Gitta Szabo, Ewan Whittaker-Walker, Miguel F. Romero, Sarina Qin, Neha Varghese, Emiley A. Eloe-Fadrosh, Nikos C. Kyrpides, SymGs data consortium, Axel Visel, Tanja Woyke, Frederik Schulz bioRxiv 2025.05.29.656868; doi: https://doi.org/10.1101/2025.05.29.656868

Support

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

symclatron-0.10.8.tar.gz (946.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

symclatron-0.10.8-py3-none-any.whl (32.4 kB view details)

Uploaded Python 3

File details

Details for the file symclatron-0.10.8.tar.gz.

File metadata

  • Download URL: symclatron-0.10.8.tar.gz
  • Upload date:
  • Size: 946.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for symclatron-0.10.8.tar.gz
Algorithm Hash digest
SHA256 d218564793ac0e4455646afd95b8464bd7c947b0625ce0b45be61392c32d28c1
MD5 34cb664f83f3de3ef931f2f500503dc0
BLAKE2b-256 7d67e9b0ed77563987c25de138b289a98efd55446c823e5a80c9699d9537843a

See more details on using hashes here.

File details

Details for the file symclatron-0.10.8-py3-none-any.whl.

File metadata

  • Download URL: symclatron-0.10.8-py3-none-any.whl
  • Upload date:
  • Size: 32.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for symclatron-0.10.8-py3-none-any.whl
Algorithm Hash digest
SHA256 586e8d1213775a4522da3238689d1907a7c1b29fb019c387c3879f82ee29c847
MD5 2fd3a5334838f378c7f3a51208a1e0f2
BLAKE2b-256 91b15ba5c9a7873dbb6781476e11e17464d06cacba2e977d96e96b38718c04ef

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page