Lineage-aware Kraken2 refinement for long-read metagenomics


Perseus: refining Kraken2 taxonomic classifications of long reads and contigs

Perseus is a post-processing framework for refining Kraken2 taxonomic classifications, with a focus on long-read metagenomics (PacBio HiFi, ONT). While Kraken2's exact k-mer matching enables fast and sensitive classification, it can produce overconfident fine-rank calls when evidence is sparse, conserved, or partially novel. Perseus addresses this limitation by distinguishing trustworthy from spurious taxonomic predictions using the structured k-mer evidence already present in the Kraken2 output, reducing false-positive fine-rank calls that arise from conserved regions, sparse k-mer support, and reference database incompleteness: failure modes common in long-read and high-novelty metagenomes.

Perseus assigns confidence probabilities to each Kraken2 classification at every canonical taxonomic rank, enabling informed decisions to confirm assignments, back off to higher, lineage-consistent ranks, or convert predictions to unclassified.

Perseus is built on a multi-headed 1D convolutional neural network that operates directly on features derived from Kraken2 output. The workflow constructs a lineage-aware feature matrix from a standard Kraken2 output file, then performs inference to produce a Kraken2-compatible output augmented with per-rank confidence probabilities for each assignment. Perseus operates strictly as a downstream confidence filter and does not perform reclassification, alignment, or novel taxon discovery.
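The confirm/back-off/unclassified decision can be illustrated with a small sketch. This is not the Perseus model itself (which is a learned multi-headed 1D CNN); it only shows how per-rank confidence probabilities support backing off to a higher, lineage-consistent rank. The threshold value is purely illustrative.

```python
# Illustrative sketch only, not the actual Perseus decision rule:
# given per-rank confidence probabilities, confirm the deepest canonical
# rank whose probability clears a threshold, otherwise back off toward
# the root, and return "unclassified" if no rank qualifies.
RANKS = ["species", "genus", "family", "order", "class", "phylum", "superkingdom"]

def choose_rank(probs, threshold=0.5):
    """Return (rank, prob) for the deepest rank with prob >= threshold,
    or ("unclassified", None) if no rank qualifies."""
    for rank in RANKS:  # iterate from the deepest rank upward
        p = probs.get(rank)
        if p is not None and p >= threshold:
            return rank, p
    return "unclassified", None

# Weak species evidence but strong genus evidence: back off to genus.
probs = {"species": 0.21, "genus": 0.93, "family": 0.97}
print(choose_rank(probs))  # ('genus', 0.93)
```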


Installation

Conda installation (recommended)

Perseus is available through conda. We recommend creating a new environment:

conda create -n perseus -c matnguyen -c conda-forge -c pytorch perseus
conda activate perseus

pip installation

Perseus is available on PyPI and can be installed with pip. Because ETE3 and PyTorch can be troublesome to install through pip, we recommend creating a fresh conda or virtual environment first:

conda create -n perseus ete3 pytorch
pip install perseus-metagenomics

Getting started

Setup taxonomy database

Perseus downloads an ETE3 taxonomy database to the given path:

perseus setup <db_path>

Feature extraction

Feature extraction reads a Kraken2 output file and writes a directory of sharded Parquet files containing the features:

perseus extract <kraken_file> <output_shards_directory> <db_path>

Filtering

Filtering takes the directory of sharded Parquet files and the original Kraken2 output file:

perseus filter <shards_directory> <kraken_file> <output_path> <db_path>

The output file resembles the Kraken2 output file, but omits the string of k-mer matches and adds the following columns:

  1. perseus_taxid - the taxonomic ID assigned by Perseus
  2. perseus_taxonomy - the taxonomic name assigned by Perseus
  3. chosen_rank - the final chosen rank assigned by Perseus
  4. chosen_prob_at_rank - the probability at the final chosen rank
  5. prob_{rank} - the assignment probability at a canonical {rank}

Testing Data

We provide some data for testing Perseus under tests/test_data. The Kraken2 output file is tests/test_data/test_kraken.txt, the shards are in tests/test_data/test_shards, and the expected Perseus output file is tests/test_data/filtered.txt.

Testing the Installation

Quick Example

Run Perseus on the included test data:

perseus setup ete3_db
perseus extract tests/test_data/test_kraken.txt example_extract ete3_db
perseus filter example_extract tests/test_data/test_kraken.txt example_filtered.txt ete3_db

This should produce an output file example_filtered.txt.

Because Perseus uses floating-point operations (PyTorch), small numerical differences may occur across platforms. Therefore, the output may not match the reference file exactly with a simple diff.

To compare the output with the expected results using a numerical tolerance:

python scripts/compare_outputs.py example_filtered.txt tests/test_data/filtered.txt
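The bundled scripts/compare_outputs.py is the authoritative check. For illustration, the kind of field-by-field comparison involved can be sketched as follows, with purely illustrative tolerance values: numeric fields are compared within tolerance, everything else exactly.

```python
# Illustrative sketch of a tolerance-based line comparison (use
# scripts/compare_outputs.py for the real check). Fields that parse as
# numbers are compared with math.isclose; all others must match exactly.
import math

def fields_match(a, b, rel_tol=1e-5, abs_tol=1e-8):
    """Compare two tab-separated lines field by field."""
    fa = a.rstrip("\n").split("\t")
    fb = b.rstrip("\n").split("\t")
    if len(fa) != len(fb):
        return False
    for x, y in zip(fa, fb):
        try:
            # Numeric fields: allow small floating-point differences.
            if not math.isclose(float(x), float(y), rel_tol=rel_tol, abs_tol=abs_tol):
                return False
        except ValueError:
            # Non-numeric fields: require exact equality.
            if x != y:
                return False
    return True
```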

Running the Full Test Suite (optional)

For a full reproducibility check, run the included test suite.

Install the testing dependency:

pip install pytest

Then run:

pytest -q

This runs unit tests and end-to-end pipeline tests used during development.

Citing Perseus

Our preprint can be found here: https://www.biorxiv.org/content/10.64898/2026.03.06.710148v1

Data Generation Scripts

Scripts for generating the inclusion/exclusion simulated data are found here: https://github.com/matnguyen/perseus-scripts
