Lineage-aware Kraken2 refinement for long-read metagenomics

License: MIT

Perseus: refining Kraken2 taxonomic classifications of long reads and contigs

Perseus is a post-processing framework for refining Kraken2 taxonomic classifications, with a focus on long-read metagenomics (PacBio HiFi, ONT). While Kraken2’s exact k-mer matching enables fast and sensitive classification, it can produce overconfident fine-rank calls when evidence is sparse, conserved, or partially novel. Perseus addresses this limitation by distinguishing trustworthy from spurious taxonomic predictions using structured k-mer evidence already present in the Kraken2 output. It is designed to reduce false-positive fine-rank calls arising from conserved regions, sparse k-mer support, and reference database incompleteness, failure modes that are common in long-read and high-novelty metagenomes.

Perseus assigns confidence probabilities to each Kraken2 classification at every canonical taxonomic rank, enabling informed decisions to confirm assignments, back off to higher, lineage-consistent ranks, or convert predictions to unclassified.
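To make the confirm / back off / unclassify idea concrete, here is a minimal Python sketch of one possible decision policy over per-rank probabilities. It is not Perseus's actual rule, which this page does not spell out. The rank list, the 0.5 threshold, and the function name choose_rank are all illustrative assumptions.

```python
from typing import Dict, Optional, Sequence, Tuple

# Assumed rank order, finest first. The ranks Perseus actually scores may differ.
DEFAULT_RANKS = ("species", "genus", "family", "order", "class", "phylum")


def choose_rank(
    probs: Dict[str, float],
    ranks: Sequence[str] = DEFAULT_RANKS,
    threshold: float = 0.5,  # hypothetical cutoff, not a Perseus default
) -> Optional[Tuple[str, float]]:
    """Return (rank, prob) for the finest rank whose probability clears
    the threshold, or None when no rank does (treat as unclassified)."""
    for rank in ranks:
        p = probs.get(rank)
        if p is not None and p >= threshold:
            return rank, p
    return None


# A weak species call with a strong genus call backs off to genus.
decision = choose_rank({"species": 0.2, "genus": 0.9, "family": 0.95})
```

Walking the ranks from finest to coarsest means the sketch confirms the most specific call the evidence supports and only backs off when it must.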

Perseus is built on a multi-headed 1D convolutional neural network that operates directly on features derived from Kraken2 output. The workflow constructs a lineage-aware feature matrix from a standard Kraken2 output file, then performs inference to produce a Kraken2-compatible output augmented with per-rank confidence probabilities for each assignment. Perseus operates strictly as a downstream confidence filter and does not perform reclassification, alignment, or novel taxon discovery.


Installation

Conda installation (recommended)

Perseus is available through conda. We recommend creating a new environment:

conda create -n perseus -c matnguyen -c conda-forge -c pytorch perseus
conda activate perseus

Getting started

Feature extraction

The extract command computes features from a Kraken2 output file and writes them to a directory of sharded Parquet files.

perseus extract <kraken_file> <output_shards_directory>
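For orientation, the k-mer evidence that extraction starts from is the fifth tab-separated column of standard Kraken2 output: a space-separated list of taxid:count pairs, where A marks ambiguous k-mers, 0 marks unclassified k-mers, and |:| separates the mates of a paired read. The Python sketch below only tallies those pairs per label. It is an illustration of the input format, not Perseus's feature extraction, and the function names are hypothetical.

```python
from collections import Counter


def parse_kmer_hits(field: str) -> Counter:
    """Sum k-mer counts per label from a Kraken2 LCA-mapping string."""
    counts: Counter = Counter()
    for token in field.split():
        if token == "|:|":  # separator between mates of a paired read
            continue
        label, _, n = token.rpartition(":")
        counts[label] += int(n)
    return counts


def parse_kraken_line(line: str) -> dict:
    """Split one standard Kraken2 output line into its five columns."""
    status, seq_id, taxid, length, kmers = line.rstrip("\n").split("\t")[:5]
    return {
        "classified": status == "C",
        "seq_id": seq_id,
        "taxid": taxid,    # kept as text; --use-names changes this column
        "length": length,  # kept as text; paired reads use "len1|len2"
        "kmer_hits": parse_kmer_hits(kmers),
    }


record = parse_kraken_line("C\tread1\t562\t150\t562:13 561:4 A:31 0:1 562:3")
```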

Filtering

The filter command takes the directory of Parquet shards and the original Kraken2 output file, and writes the filtered classifications to the given output path.

perseus filter <shards_directory> <kraken_file> <output_path>
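If you drive Perseus from a Python workflow, the two commands above can be wrapped with subprocess. This is a minimal sketch that uses only the positional arguments shown on this page. The helper names (extract_cmd, filter_cmd, run_pipeline) are hypothetical and not part of Perseus.

```python
import subprocess
from pathlib import Path
from typing import List, Union

PathLike = Union[str, Path]


def extract_cmd(kraken_file: PathLike, shards_dir: PathLike) -> List[str]:
    """Argument list for: perseus extract <kraken_file> <output_shards_directory>"""
    return ["perseus", "extract", str(kraken_file), str(shards_dir)]


def filter_cmd(shards_dir: PathLike, kraken_file: PathLike, output_path: PathLike) -> List[str]:
    """Argument list for: perseus filter <shards_directory> <kraken_file> <output_path>"""
    return ["perseus", "filter", str(shards_dir), str(kraken_file), str(output_path)]


def run_pipeline(kraken_file: PathLike, shards_dir: PathLike, output_path: PathLike) -> None:
    """Run extraction, then filtering; raises CalledProcessError if either step fails."""
    subprocess.run(extract_cmd(kraken_file, shards_dir), check=True)
    subprocess.run(filter_cmd(shards_dir, kraken_file, output_path), check=True)


# Example call (requires perseus on PATH):
# run_pipeline("sample.kraken", "sample_shards", "sample_filtered.txt")
```

Passing arguments as a list rather than a shell string avoids quoting problems with paths that contain spaces.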

The output file will be similar to the Kraken2 output file, but without the string of k-mer matches, and with the following additional columns:

  1. perseus_taxid - the taxonomic ID assigned by Perseus
  2. prob_{rank} - the assignment probability at a canonical {rank}
  3. chosen_rank - the final chosen rank assigned by Perseus
  4. chosen_prob_at_rank - the probability at the final chosen rank
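As an illustration of downstream use, the Python sketch below keeps only rows whose chosen_prob_at_rank clears a cutoff. It assumes the output is tab-separated with a header row naming the columns above. This page does not confirm that layout, so check your own output file first. The read_id column and the 0.9 cutoff are hypothetical.

```python
import csv
import io
from typing import Iterator, TextIO


def confident_rows(handle: TextIO, min_prob: float = 0.9) -> Iterator[dict]:
    """Yield rows whose chosen_prob_at_rank is at least min_prob.

    Assumes a tab-separated file with a header line (an assumption).
    """
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        try:
            prob = float(row["chosen_prob_at_rank"])
        except (TypeError, ValueError):
            continue  # missing or non-numeric probability: skip the row
        if prob >= min_prob:
            yield row


# Tiny in-memory example; the values are made up for illustration.
sample = (
    "read_id\tperseus_taxid\tchosen_rank\tchosen_prob_at_rank\n"
    "r1\t562\tspecies\t0.97\n"
    "r2\t561\tgenus\t0.62\n"
)
kept = list(confident_rows(io.StringIO(sample)))
```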

Testing Data

Test data for Perseus is provided under tests/test_data. The Kraken2 output file is tests/test_data/test_kraken.txt, the shards are in tests/test_data/test_shards, and the expected Perseus output file is tests/test_data/filtered.txt.

Testing the Installation

Quick Example

Run Perseus on the included test data:

perseus extract tests/test_data/test_kraken.txt example_extract
perseus filter example_extract tests/test_data/test_kraken.txt example_filtered.txt

This should produce an output file example_filtered.txt.

Because Perseus uses floating-point operations (PyTorch), small numerical differences may occur across platforms, so a plain diff against the reference file may report differences even when the results agree.

To compare the output with the expected results using a numerical tolerance:

python scripts/compare_outputs.py example_filtered.txt tests/test_data/filtered.txt
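The idea behind a tolerance-based comparison can be sketched in a few lines of Python. This is not the code in scripts/compare_outputs.py. It is a minimal illustration that assumes tab-separated fields, and the tolerances are arbitrary. Integer fields such as taxonomic IDs are compared exactly, and only non-integer numeric fields get a tolerance.

```python
import math


def fields_match(a: str, b: str, rel_tol: float = 1e-5, abs_tol: float = 1e-8) -> bool:
    """Compare two fields: exact for text and integers, tolerant for floats."""
    if a == b:
        return True
    try:
        return int(a) == int(b)  # e.g. taxids must agree exactly
    except ValueError:
        pass
    try:
        return math.isclose(float(a), float(b), rel_tol=rel_tol, abs_tol=abs_tol)
    except ValueError:
        return False  # differing non-numeric text


def lines_match(line_a: str, line_b: str, **tol: float) -> bool:
    """Compare two tab-separated lines field by field."""
    fa = line_a.rstrip("\n").split("\t")
    fb = line_b.rstrip("\n").split("\t")
    return len(fa) == len(fb) and all(fields_match(x, y, **tol) for x, y in zip(fa, fb))
```

Treating integers separately matters because a relative tolerance would otherwise accept two large taxids that differ by one.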

Running the Full Test Suite (optional)

For a full reproducibility check, run the included test suite.

Install the testing dependency:

pip install pytest

Then run:

pytest -q

This runs unit tests and end-to-end pipeline tests used during development.

Citing Perseus

Our preprint can be found here: https://www.biorxiv.org/content/10.64898/2026.03.06.710148v1

Data Generation Scripts

Scripts for generating the inclusion/exclusion simulated data are found here: https://github.com/matnguyen/perseus-scripts

Download files

Source Distribution

perseus_metagenomics-1.0.1.tar.gz (1.7 MB)

  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12
  • SHA256: e8659d48ba9b378b22a7646f676da7da62c8599130ec6b4ef56b2523bc07e88d
  • MD5: 1abd19c27fc9ed5e69dd2688f020c80e
  • BLAKE2b-256: 5348d02f752c50d54721cad1f818e5c3ac4aa577c8228249928e9c8868fe50c5

Built Distribution

perseus_metagenomics-1.0.1-py3-none-any.whl (1.7 MB)

  • Tags: Python 3
  • SHA256: 8067d563dcdc86bdea3348b61063f05162b99d04ab52d440acd2528a4acec47f
  • MD5: 8a490b855a913406b6a2560f158365eb
  • BLAKE2b-256: 87d84cf2dc0489cb3acadcd6fcc90ed1a45a0c6fce78d901d46b93b477ed6e6d
