Lineage-aware Kraken2 refinement for long-read metagenomics


Perseus: refining Kraken2 taxonomic classifications of long reads and contigs

Perseus is a post-processing framework for refining Kraken2 taxonomic classifications, with a focus on long-read metagenomics (PacBio HiFi, ONT). While Kraken2's exact k-mer matching enables fast and sensitive classification, it can produce overconfident fine-rank calls when evidence is sparse, conserved, or partially novel. Perseus addresses this limitation by distinguishing trustworthy from spurious taxonomic predictions using the structured k-mer evidence already present in the Kraken2 output, reducing false-positive fine-rank calls that arise from conserved regions, sparse k-mer support, and reference database incompleteness: failure modes common in long-read and high-novelty metagenomes.

Perseus assigns confidence probabilities to each Kraken2 classification at every canonical taxonomic rank, enabling informed decisions to confirm assignments, back off to higher, lineage-consistent ranks, or convert predictions to unclassified.

Perseus is built on a multi-headed 1D convolutional neural network that operates directly on features derived from Kraken2 output. The workflow constructs a lineage-aware feature matrix from a standard Kraken2 output file, then performs inference to produce a Kraken2-compatible output augmented with per-rank confidence probabilities for each assignment. Perseus operates strictly as a downstream confidence filter and does not perform reclassification, alignment, or novel taxon discovery.
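The confirm/back-off/unclassified decision can be illustrated with a small sketch. This is not the Perseus model itself (which is a learned multi-headed 1D CNN); it only shows how per-rank confidence probabilities support backing off to a higher, lineage-consistent rank. The threshold value is purely illustrative.

```python
# Illustrative sketch only, not the actual Perseus decision rule:
# given per-rank confidence probabilities, confirm the deepest canonical
# rank whose probability clears a threshold, otherwise back off toward
# the root, and return "unclassified" if no rank qualifies.
RANKS = ["species", "genus", "family", "order", "class", "phylum", "superkingdom"]

def choose_rank(probs, threshold=0.5):
    """Return (rank, prob) for the deepest rank with prob >= threshold,
    or ("unclassified", None) if no rank qualifies."""
    for rank in RANKS:  # iterate from the deepest rank upward
        p = probs.get(rank)
        if p is not None and p >= threshold:
            return rank, p
    return "unclassified", None

# Weak species evidence but strong genus evidence: back off to genus.
probs = {"species": 0.21, "genus": 0.93, "family": 0.97}
print(choose_rank(probs))  # ('genus', 0.93)
```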


Installation

Conda installation (recommended)

Perseus is available through conda. We recommend creating a new environment:

conda create -n perseus -c matnguyen -c conda-forge -c pytorch perseus
conda activate perseus

pip installation

Perseus is available on PyPI and can be installed with pip. Because ETE3 and PyTorch can be troublesome to install through pip, we recommend creating a fresh conda or virtual environment first:

conda create -n perseus ete3 pytorch
pip install perseus-metagenomics

Getting started

Setup taxonomy database

Perseus downloads an ETE3 taxonomy database to the given path:

perseus setup <db_path>

Feature extraction

Feature extraction reads a Kraken2 output file and writes a directory of sharded Parquet files containing the features:

perseus extract <kraken_file> <output_shards_directory> <db_path>

Filtering

Filtering takes the directory of sharded Parquet files and the original Kraken2 output file:

perseus filter <shards_directory> <kraken_file> <output_path> <db_path>

The output file resembles the Kraken2 output file, but omits the string of k-mer matches and adds the following columns:

  1. perseus_taxid - the taxonomic ID assigned by Perseus
  2. perseus_taxonomy - the taxonomic name assigned by Perseus
  3. chosen_rank - the final chosen rank assigned by Perseus
  4. chosen_prob_at_rank - the probability at the final chosen rank
  5. prob_{rank} - the assignment probability at a canonical {rank}

Testing Data

We provide some data for testing Perseus under tests/test_data. The Kraken2 output file is tests/test_data/test_kraken.txt, the shards are in tests/test_data/test_shards, and the expected Perseus output file is tests/test_data/filtered.txt.

Testing the Installation

Quick Example

Run Perseus on the included test data:

perseus setup ete3_db
perseus extract tests/test_data/test_kraken.txt example_extract ete3_db
perseus filter example_extract tests/test_data/test_kraken.txt example_filtered.txt ete3_db

This should produce an output file example_filtered.txt.

Because Perseus uses floating-point operations (PyTorch), small numerical differences may occur across platforms. Therefore, the output may not match the reference file exactly with a simple diff.

To compare the output with the expected results using a numerical tolerance:

python scripts/compare_outputs.py example_filtered.txt tests/test_data/filtered.txt
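The bundled scripts/compare_outputs.py is the authoritative check. For illustration, the kind of field-by-field comparison involved can be sketched as follows, with purely illustrative tolerance values: numeric fields are compared within tolerance, everything else exactly.

```python
# Illustrative sketch of a tolerance-based line comparison (use
# scripts/compare_outputs.py for the real check). Fields that parse as
# numbers are compared with math.isclose; all others must match exactly.
import math

def fields_match(a, b, rel_tol=1e-5, abs_tol=1e-8):
    """Compare two tab-separated lines field by field."""
    fa = a.rstrip("\n").split("\t")
    fb = b.rstrip("\n").split("\t")
    if len(fa) != len(fb):
        return False
    for x, y in zip(fa, fb):
        try:
            # Numeric fields: allow small floating-point differences.
            if not math.isclose(float(x), float(y), rel_tol=rel_tol, abs_tol=abs_tol):
                return False
        except ValueError:
            # Non-numeric fields: require exact equality.
            if x != y:
                return False
    return True
```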

Running the Full Test Suite (optional)

For a full reproducibility check, run the included test suite.

Install the testing dependency:

pip install pytest

Then run:

pytest -q

This runs unit tests and end-to-end pipeline tests used during development.

Citing Perseus

Our preprint can be found here: https://www.biorxiv.org/content/10.64898/2026.03.06.710148v1

Data Generation Scripts

Scripts for generating the inclusion/exclusion simulated data are found here: https://github.com/matnguyen/perseus-scripts
