Lineage-aware Kraken2 refinement for long-read metagenomics
Perseus: refining Kraken2 taxonomic classifications of long reads and contigs
Perseus is a post-processing framework for refining Kraken2 taxonomic classifications, with a focus on long-read metagenomics (PacBio HiFi, ONT). Kraken2's exact k-mer matching makes classification fast and sensitive, but it can produce overconfident fine-rank calls when evidence is sparse, conserved, or partially novel. Perseus addresses this by using the structured k-mer evidence already present in the Kraken2 output to distinguish trustworthy from spurious taxonomic predictions. The goal is to reduce false-positive fine-rank calls that arise from conserved regions, sparse k-mer support, and reference database incompleteness, failure modes that are common in long-read and high-novelty metagenomes.
Perseus assigns a confidence probability to each Kraken2 classification at every canonical taxonomic rank. These probabilities support an informed decision for each assignment: confirm it, back off to a higher, lineage-consistent rank, or convert it to unclassified.
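To make the confirm / back off / unclassify decision concrete, here is a minimal sketch of one possible rule that works from per-rank probabilities. The rank list, the `choose_rank` function, and the 0.5 threshold are illustrative assumptions, not Perseus's documented behavior.

```python
# Canonical ranks from coarse to fine. The exact rank set is an assumption.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def choose_rank(probs, assigned_rank, threshold=0.5):
    """Pick the finest rank at or above `assigned_rank` whose probability
    meets `threshold`. Returns None when no rank qualifies (unclassified).

    `probs` maps rank name -> probability; missing ranks count as 0.0.
    """
    start = RANKS.index(assigned_rank)
    # Walk from the assigned rank up the lineage toward coarser ranks.
    for i in range(start, -1, -1):
        rank = RANKS[i]
        if probs.get(rank, 0.0) >= threshold:
            return rank
    return None


# Hypothetical probabilities for a single read assigned at species level.
example = {"species": 0.20, "genus": 0.90, "family": 0.95}
print(choose_rank(example, "species"))  # backs off to a coarser rank
```

With this rule, an assignment is confirmed when its own rank passes the threshold, moved up the lineage when only a coarser rank passes, and treated as unclassified when none do.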
Perseus is built on a multi-headed 1D convolutional neural network that operates directly on features derived from Kraken2 output. The workflow constructs a lineage-aware feature matrix from a standard Kraken2 output file, then performs inference to produce a Kraken2-compatible output augmented with per-rank confidence probabilities for each assignment. Perseus operates strictly as a downstream confidence filter and does not perform reclassification, alignment, or novel taxon discovery.
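The raw evidence that such features draw on is the k-mer hit string in the fifth column of standard Kraken2 output, a space-separated list of `taxid:count` pairs. The sketch below only shows how that string can be tallied; it is not Perseus's feature extraction, and the function names and sample line are made up for illustration. It assumes Kraken2 was run without `--use-names`, so the third column is a bare taxid.

```python
from collections import Counter


def parse_kmer_hits(field):
    """Tally k-mer counts per taxon from a Kraken2 hit string.

    Tokens look like "562:13". "0" means no database hit and "A" marks
    k-mers containing an ambiguous nucleotide.
    """
    counts = Counter()
    for token in field.split():
        if token == "|:|":  # separator between mates in paired-read output
            continue
        taxid, _, n = token.rpartition(":")
        counts[taxid] += int(n)
    return counts


def parse_kraken_line(line):
    """Split one standard Kraken2 output line into its five columns."""
    status, seq_id, taxid, length, kmers = line.rstrip("\n").split("\t", 4)
    return {
        "classified": status == "C",
        "seq_id": seq_id,
        "taxid": taxid,
        "length": length,  # kept as text; paired reads use "len1|len2"
        "kmer_counts": parse_kmer_hits(kmers),
    }


# Made-up example line in the standard five-column layout.
sample = "C\tread1\t562\t150\t562:13 561:4 A:31 0:1 562:3"
record = parse_kraken_line(sample)
print(record["kmer_counts"])
```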
Installation
Conda installation (recommended)
Perseus is available through conda. We recommend creating a new environment:
conda create -n perseus -c matnguyen -c conda-forge -c pytorch perseus
conda activate perseus
Getting started
Feature extraction
Perseus performs feature extraction on a Kraken2 output file and writes the features to a directory of sharded Parquet files.
perseus extract <kraken_file> <output_shards_directory>
Filtering
For filtering, Perseus takes the directory of sharded Parquet files together with the original Kraken2 output file.
perseus filter <shards_directory> <kraken_file> <output_path>
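When the two steps are driven from a script, they can be chained with the standard library. This is a minimal sketch that uses only the two subcommands and the positional arguments shown above; the helper names `build_commands` and `run_pipeline` are hypothetical and not part of Perseus.

```python
import subprocess


def build_commands(kraken_file, shards_dir, output_path):
    """Return the extract and filter command lines, in the order to run them."""
    return [
        ["perseus", "extract", kraken_file, shards_dir],
        ["perseus", "filter", shards_dir, kraken_file, output_path],
    ]


def run_pipeline(kraken_file, shards_dir, output_path):
    """Run extraction, then filtering. Stops at the first failing step."""
    for cmd in build_commands(kraken_file, shards_dir, output_path):
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline("sample.kraken", "sample_shards", "sample_filtered.txt")` would run both steps in sequence, assuming `perseus` is on the `PATH` of the active environment. The file names here are placeholders.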
The output file follows the Kraken2 output format, but omits the string of k-mer matches and adds the following columns:
- perseus_taxid - the taxonomic ID assigned by Perseus
- prob_{rank} - the assignment probability at a canonical {rank}
- chosen_rank - the final chosen rank assigned by Perseus
- chosen_prob_at_rank - the probability at the final chosen rank
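As an illustration of how these columns might be used downstream, the sketch below counts assignments per `chosen_rank`, keeping only rows whose `chosen_prob_at_rank` meets a cutoff. It assumes the file is tab-separated with a header row, which should be checked against real output. The `read_id` column, the sample rows, and the 0.9 cutoff are made up for the example.

```python
import csv
import io
from collections import Counter


def count_chosen_ranks(handle, min_prob=0.0):
    """Count rows per chosen_rank among rows with chosen_prob_at_rank >= min_prob.

    Assumes a tab-separated file with a header row (an assumption).
    """
    reader = csv.DictReader(handle, delimiter="\t")
    counts = Counter()
    for row in reader:
        if float(row["chosen_prob_at_rank"]) >= min_prob:
            counts[row["chosen_rank"]] += 1
    return counts


# Made-up rows; only a few columns are shown.
SAMPLE = (
    "read_id\tperseus_taxid\tchosen_rank\tchosen_prob_at_rank\n"
    "r1\t562\tspecies\t0.97\n"
    "r2\t561\tgenus\t0.91\n"
    "r3\t543\tfamily\t0.62\n"
)

counts = count_chosen_ranks(io.StringIO(SAMPLE), min_prob=0.9)
print(dict(counts))
```

For a real file, pass an open file handle (for example `open(path, newline="")`) in place of the `io.StringIO` object.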
Testing Data
We provide test data for Perseus under tests/test_data. The Kraken2 output file is tests/test_data/test_kraken.txt, the shards are in tests/test_data/test_shards, and the expected Perseus output file is tests/test_data/filtered.txt.
Testing the Installation
Quick Example
Run Perseus on the included test data:
perseus extract tests/test_data/test_kraken.txt example_extract
perseus filter example_extract tests/test_data/test_kraken.txt example_filtered.txt
This should produce an output file named example_filtered.txt.
Perseus relies on floating-point operations (PyTorch), so small numerical differences can occur across platforms, and a plain diff against the reference file may report mismatches even when the run is correct.
To compare the output with the expected results using a numerical tolerance instead:
python scripts/compare_outputs.py example_filtered.txt tests/test_data/filtered.txt
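The idea behind a tolerance-based comparison can be sketched as follows. This is not the code in scripts/compare_outputs.py; the function name `rows_match` and the tolerance values are assumptions chosen for illustration.

```python
import math


def rows_match(expected, actual, rel_tol=1e-5, abs_tol=1e-8):
    """Compare two rows field by field.

    Fields that both parse as numbers are compared with a tolerance;
    all other fields must be identical strings.
    """
    if len(expected) != len(actual):
        return False
    for e, a in zip(expected, actual):
        try:
            ef, af = float(e), float(a)
        except ValueError:
            # At least one field is not numeric: require an exact match.
            if e != a:
                return False
            continue
        if not math.isclose(ef, af, rel_tol=rel_tol, abs_tol=abs_tol):
            return False
    return True


# Made-up rows: the probabilities differ only in the seventh decimal place.
print(rows_match(["r1", "562", "0.9700000"], ["r1", "562", "0.9700001"]))
```

A plain diff would flag the two rows above as different, while a tolerance-based check treats them as equivalent.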
Running the Full Test Suite (optional)
For a full reproducibility check, run the included test suite.
Install the testing dependency:
pip install pytest
Then run:
pytest -q
This runs unit tests and end-to-end pipeline tests used during development.
Citing Perseus
Our preprint can be found here: https://www.biorxiv.org/content/10.64898/2026.03.06.710148v1
Data Generation Scripts
Scripts for generating the inclusion/exclusion simulated data are found here: https://github.com/matnguyen/perseus-scripts