
A quick and precise pipeline for detecting phages in sequence assemblies.


                  .                                                         
               ,'/ \`.                                                               
              |\/___\/|                                                     
              \'\   /`/          ██╗ █████╗ ███████╗ ██████╗ ███████╗██████╗
               `.\ /,'           ██║██╔══██╗██╔════╝██╔════╝ ██╔════╝██╔══██╗                   
                  |              ██║███████║█████╗  ██║  ███╗█████╗  ██████╔╝ 
                  |         ██   ██║██╔══██║██╔══╝  ██║   ██║██╔══╝  ██╔══██╗
                 |=|        ╚█████╔╝██║  ██║███████╗╚██████╔╝███████╗██║  ██║
            /\  ,|=|.  /\    ╚════╝ ╚═╝  ╚═╝╚══════╝ ╚═════╝ ╚══════╝╚═╝  ╚═╝
        ,'`.  \/ |=| \/  ,'`.                                                 
      ,'    `.|\ `-' /|,'    `.                                              
    ,'   .-._ \ `---' / _,-.   `.                                            
       ,'    `-`-._,-'-'   `.       
      '  

Jaeger: an accurate and fast deep-learning tool to detect bacteriophage sequences


Jaeger is a tool that utilizes homology-free machine learning to identify phage genome sequences that are hidden within metagenomes. It is capable of detecting both phages and prophages within metagenomic assemblies.


Citing Jaeger


If you use Jaeger in your work, please consider citing its preprint:

To cite the code itself:

  • Jaeger: an accurate and fast deep-learning tool to detect bacteriophage sequences DOI


Installing Jaeger


option 1 : bioconda

The performance of the Jaeger workflow can be significantly increased by utilizing GPUs. To enable GPU support, the CUDA Toolkit and cuDNN library must be accessible to conda.

# create conda environment and install jaeger
mamba create -n jaeger -c bioconda jaeger-bio==1.2

# activate environment
conda activate jaeger

Test the installation with test data

jaeger test

option 2 : Installing from PyPI (recommended)

# create a conda environment and activate  
mamba create -n jaeger -c nvidia -c conda-forge cuda-nvcc "python>=3.11,<=3.12" pip
conda activate jaeger

# OR create a virtual environment using venv
python3 -m venv jaeger
source jaeger/bin/activate    

# to install jaeger with GPU support (quotes keep shells such as zsh from expanding the brackets)
pip install "jaeger-bio[gpu]"

# to install without GPU support
pip install "jaeger-bio[cpu]"

# to install on a Mac (arm)
pip install "jaeger-bio[darwin-arm]"

# test the installation
jaeger test

option 3 : Installing the dev version

# create a conda environment and activate  
mamba create -n jaeger -c nvidia -c conda-forge cuda-nvcc "python>=3.11,<3.12" pip
conda activate jaeger

# OR create a virtual environment using venv
python3 -m venv jaeger
source jaeger/bin/activate    

# install jaeger

# to install with GPU support
pip install --no-cache-dir "jaeger-bio[gpu] @ git+https://github.com/MGXlab/Jaeger@dev"

# to install without GPU support
pip3 install --root-user-action=ignore --no-cache-dir "jaeger-bio[cpu] @ git+https://github.com/MGXlab/Jaeger@dev"

# to install on a Mac(arm)
pip3 install --root-user-action=ignore --no-cache-dir "jaeger-bio[darwin-arm] @ git+https://github.com/MGXlab/Jaeger@dev"

# test the installation
jaeger test

option 4 : Apptainer (singularity)

If you're using Apptainer on a cluster, it's recommended to build the container on your local machine and then transfer it to the cluster.

# get the container def
wget -O jaeger_singularity.def https://raw.githubusercontent.com/Yasas1994/Jaeger/dev/singularity/jaeger_singularity.def
# get the configuration file
wget -O config.json https://raw.githubusercontent.com/Yasas1994/Jaeger/dev/src/jaeger/data/config.json

# to build the container
apptainer build jaeger.sif jaeger_singularity.def

# test container
apptainer run --nv jaeger.sif jaeger --help

# test the installation
apptainer run --nv jaeger.sif jaeger test

# list jaeger models available for download
apptainer run --nv jaeger.sif jaeger download --list
# download jaeger models
apptainer run --nv jaeger.sif jaeger download --model jaeger_57341_1.5M_fragment --path /path/to/save/model --config /path/to/config.json

# run jaeger
apptainer run --nv jaeger.sif jaeger predict --model jaeger_57341_1.5M_fragment --config /path/to/config.json -i /path/to/input.fasta -o /path/to/save/results

Downloading models


Starting from version 1.2.0, users will need to download the new models separately after installing Jaeger. However, for backward compatibility, Jaeger will still include the old model by default.

Use the --list flag to print all models available for download:

jaeger download --list

Then, to download a model and add it to the model path, run:

jaeger download --path /path/to/store/models --model jaeger_38341_1.4M

If you decide to change the model path later, or if you have a directory with newly trained or tuned models, register the path:

jaeger register-models --path /new/model/path

Running Jaeger


CPU/GPU mode

Once the environment is properly set up, using Jaeger is straightforward. The program can accept both compressed and uncompressed .fasta files containing the contigs as input. It will output a table containing the predictions and various statistics calculated during runtime.

jaeger predict -i input_file.fasta -o output_dir --batch 128

To run Jaeger with Apptainer (Singularity):

apptainer run --nv jaeger.sif jaeger predict -i input_file.fasta -o output_dir --batch 128

Selecting the batch parameter

You can control the number of parallel computations using this parameter. By default, it is set to 96. If you run into out-of-memory (OOM) errors, consider setting the --batch option to a lower value. For example, 96 is good enough for a graphics card with 4 GB of memory.
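
One way to act on this advice is to retry with progressively smaller batch sizes. The sketch below only computes a halving schedule and builds the matching command line; batch_schedule and build_command are hypothetical helper names, not part of Jaeger, and the lower bound of 8 is an arbitrary assumption. It uses only the predict, -i, -o and --batch arguments shown above.

```python
def batch_schedule(start=96, minimum=8):
    """Return batch sizes to try, halving from `start` while >= `minimum`."""
    sizes = []
    size = start
    while size >= minimum:
        sizes.append(size)
        size //= 2
    return sizes


def build_command(input_fasta, output_dir, batch):
    """Build the argument list for one `jaeger predict` run (hypothetical helper)."""
    return ["jaeger", "predict", "-i", input_fasta, "-o", output_dir,
            "--batch", str(batch)]


if __name__ == "__main__":
    # Print the commands one would try in order, largest batch first.
    for batch in batch_schedule():
        print(" ".join(build_command("input_file.fasta", "output_dir", batch)))
```

Each command could be passed to subprocess.run in turn, stopping at the first run that succeeds; deciding whether a failure was really caused by memory is left to the user.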


What is in the output?


All predictions are summarized in a table located at output_dir/<input_file>_default.jaeger.tsv

┌───────────────────────────────────┬────────┬────────────┬─────────┬───┬─────────────┬────────────────┬──────────────────┬───────────────┐
│ contig_id                         ┆ length ┆ prediction ┆ entropy ┆ … ┆ Archaea_var ┆ window_summary ┆ terminal_repeats ┆ repeat_length │
╞═══════════════════════════════════╪════════╪════════════╪═════════╪═══╪═════════════╪════════════════╪══════════════════╪═══════════════╡
│ NODE_1109_length_9622_cov_23.163… ┆ 9622   ┆ Phage      ┆ 0.43    ┆ … ┆ 0.143       ┆ 1V1n2V         ┆ null             ┆ null          │
│ NODE_1181_length_9275_cov_26.864… ┆ 9275   ┆ Phage      ┆ 0.327   ┆ … ┆ 0.504       ┆ 4V             ┆ null             ┆ null          │
│ NODE_123_length_36569_cov_24.228… ┆ 36569  ┆ Phage      ┆ 0.503   ┆ … ┆ 1.554       ┆ 9V1n7V         ┆ null             ┆ null          │
│ NODE_149_length_32942_cov_23.754… ┆ 32942  ┆ Phage      ┆ 0.458   ┆ … ┆ 3.229       ┆ 3V1n1n11V      ┆ null             ┆ null          │
│ NODE_231_length_24276_cov_21.832… ┆ 24276  ┆ Phage      ┆ 0.502   ┆ … ┆ 1.467       ┆ 1V1n3V1n5V     ┆ null             ┆ null          │
└───────────────────────────────────┴────────┴────────────┴─────────┴───┴─────────────┴────────────────┴──────────────────┴───────────────┘

This table provides information about various contigs in a metagenomic assembly. Each row represents a single contig, and the columns provide information about the contig's ID, length, the number of windows identified as prokaryotic, viral, eukaryotic, and archaeal, the prediction of the contig (Phage or Non-phage), the score of the contig for each category (bacterial, viral, eukaryotic and archaeal), and a summary of the windows. The table can be used to identify potential phage sequences in the metagenomic assembly based on the prediction column. The score columns can be used to further evaluate the confidence of the prediction and the window summary column can be used to understand the count of windows that contributed to the final prediction.
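
As a minimal sketch of using the prediction column, the snippet below collects the IDs of contigs labelled Phage from a tab-separated table. It assumes the file is plain TSV with a header row containing contig_id and prediction, as in the table above; phage_contigs is a hypothetical helper, and the second sample row is made up for illustration.

```python
import csv
import io


def phage_contigs(tsv_handle):
    """Return the contig IDs whose `prediction` column equals 'Phage'."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    return [row["contig_id"] for row in reader if row["prediction"] == "Phage"]


# Illustrative sample only; a real table has more columns.
SAMPLE = (
    "contig_id\tlength\tprediction\tentropy\twindow_summary\n"
    "contig_a\t9622\tPhage\t0.43\t1V1n2V\n"
    "contig_b\t5000\tNon-phage\t0.61\t2n\n"
)

if __name__ == "__main__":
    print(phage_contigs(io.StringIO(SAMPLE)))
```

For a real run, open the .tsv file in output_dir and pass the file handle instead of the in-memory sample.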


Options


jaeger run --help

## Jaeger 1.1.30 (yet AnothEr phaGe idEntifier) Deep-learning based bacteriophage discovery 
https://github.com/Yasas1994/Jaeger.git
usage: jaeger run  -i INPUT -o OUTPUT

options:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        path to input file
  -o OUTPUT, --output OUTPUT
                        path to output directory
  --fsize [FSIZE]       length of the sliding window (value must be 2^n). default:2048
  --stride [STRIDE]     stride of the sliding window. default:2048 (stride==fsize)
  -m {default,experimental_1,experimental_2}, --model {default,experimental_1,experimental_2}
                        select a deep-learning model to use. default:default
  -p, --prophage        extract and report prophage-like regions. default:False
  -s [SENSITIVITY], --sensitivity [SENSITIVITY]
                        sensitivity of the prophage extraction algorithm (between 0 - 4). default: 1.5
  --lc [LC]             minimum contig length to run prophage extraction algorithm. default: 500000 bp
  --rc [RC]             minium reliability score required to accept predictions. default: 0.2
  --pc [PC]             minium phage score required to accept predictions. default: 3
  --batch [BATCH]       parallel batch size, set to a lower value if your gpu runs out of memory. default:96
  --workers [WORKERS]   number of threads to use. default:4
  --getalllogits        writes window-wise scores to a .npy file
  --getsequences        writes the putative phage sequences to a .fasta file
  --cpu                 ignore available gpus and explicitly run jaeger on cpu. default: False
  --physicalid [PHYSICALID]
                        sets the default gpu device id (for multi-gpu systems). default: 0
  --getalllabels        get predicted labels for Non-Viral contigs. default: False
  -v, --verbose         Verbosity level : -vvv warning, -vv info, -v debug, (default info)

Misc. Options:
  -f, --overwrite       Overwrite existing files



Python Library


Jaeger can be integrated into Python scripts using the jaegeraa Python library as follows. Currently, the predict function accepts five different input types:

  1. Nucleotide sequence -> str
  2. List of nucleotide sequences -> list(str,str,..)
  3. Python file object -> (io.TextIOWrapper)
  4. Python generator object that yields nucleotide sequences as str (types.GeneratorType)
  5. Biopython Seq object

from jaegeraa.api import Predictor

model=Predictor()
predictions=model.predict(input,stride=2048,fragsize=2048,batch=100)

model.predict() returns a dictionary of lists in the following format

{'contig_id': ['seq_0', 'seq_1'],
 'length': [19000, 10503],
 '#num_prok_windows': [0, 0],
 '#num_vir_windows': [9, 0],
 '#num_fun_windows': [0, 5],
 '#num_arch_windows': [0, 0],
 'prediction': ['Phage', 'Non-phage'],
 'bac_score': [-1.9552012549506292, -1.9441368103027343],
 'vir_score': [6.6312947273254395, -3.097817325592041],
 'fun_score': [-5.712721400790745, -0.6870137214660644],
 'arch_score': [-2.4369852013058133, -0.8941479325294495],
 'window_summary': ['9V', '5n']}
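
The window_summary strings look like a run-length encoding: a count followed by a one-letter label, repeated. Under that assumption (the format is not specified here, and the meaning of each letter is not defined in this document), a summary can be split into runs as sketched below; decode_window_summary and total_windows are hypothetical helpers.

```python
import re

# One run = a decimal count followed by a single letter, e.g. "9V" or "1n".
_RUN = re.compile(r"(\d+)([A-Za-z])")


def decode_window_summary(summary):
    """Split a summary such as '9V1n7V' into a list of (count, label) runs."""
    runs = [(int(count), label) for count, label in _RUN.findall(summary)]
    # Reject strings that do not fit the assumed count+letter pattern.
    if "".join(f"{count}{label}" for count, label in runs) != summary:
        raise ValueError(f"unexpected window summary: {summary!r}")
    return runs


def total_windows(summary):
    """Total number of windows described by a summary string."""
    return sum(count for count, _ in decode_window_summary(summary))


if __name__ == "__main__":
    print(decode_window_summary("9V1n7V"))
    print(total_windows("9V1n7V"))
```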
 

This dictionary can be easily converted to a pandas DataFrame using the DataFrame.from_dict() method:

import pandas as pd
df = pd.DataFrame.from_dict(predictions)
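
Input type 4 in the list above is a generator that yields nucleotide sequences as str. A minimal sketch of such a generator for a FASTA file follows; read_fasta_sequences is a hypothetical helper written with the standard library only, and treating a .gz suffix as gzip compression is an assumption of this sketch. The resulting generator would be passed as the first argument to model.predict.

```python
import gzip
import os
import tempfile


def read_fasta_sequences(path):
    """Yield each record's sequence from a FASTA file as a single string."""
    opener = gzip.open if str(path).endswith(".gz") else open
    chunks = []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if it had sequence.
                if chunks:
                    yield "".join(chunks)
                chunks = []
            elif line:
                chunks.append(line)
    if chunks:
        yield "".join(chunks)


if __name__ == "__main__":
    # Write a tiny illustrative FASTA file and read it back.
    with tempfile.TemporaryDirectory() as tmp:
        fasta = os.path.join(tmp, "example.fasta")
        with open(fasta, "w") as out:
            out.write(">seq_0\nACGT\nACGT\n>seq_1\nGGCC\n")
        print(list(read_fasta_sequences(fasta)))
```

Note that headers are discarded in this sketch, so sequences are identified only by their order.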

Notes


  • The program expects the input file to be in .fasta format.
  • The program uses a sliding window approach to scan the input sequences, so the stride argument determines how far the window will move after each scan.
  • The batch argument determines how many sequences will be processed in parallel.
  • The program is compatible with both CPU and GPU. By default, it will run on the GPU, but if the --cpu option is provided, it will use the specified number of threads for inference.
  • The program uses a pre-trained neural network model for phage genome prediction.
  • The --getalllabels option will output predicted labels for Non-Viral contigs, which can be useful for further analysis.
  • It's recommended to use the output of this program in conjunction with other methods for phage genome identification.
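
The sliding-window idea in the notes above can be sketched as follows. This is an illustration, not Jaeger's implementation: it assumes that only full-length windows are kept and that a trailing partial window is dropped. With the defaults (fsize 2048, stride 2048) that assumption gives 9 windows for a 19000 bp sequence and 5 for a 10503 bp sequence, which is consistent with the window counts in the example dictionary above. sliding_windows and count_windows are hypothetical names.

```python
def sliding_windows(sequence, fsize=2048, stride=2048):
    """Yield (start, fragment) for every full-length window of `sequence`."""
    for start in range(0, len(sequence) - fsize + 1, stride):
        yield start, sequence[start:start + fsize]


def count_windows(length, fsize=2048, stride=2048):
    """Number of full windows for a sequence of `length` bases."""
    if length < fsize:
        return 0
    return (length - fsize) // stride + 1


if __name__ == "__main__":
    # Small example: window of 4 bases moved 2 bases at a time.
    for start, fragment in sliding_windows("ACGTACGT", fsize=4, stride=2):
        print(start, fragment)
    print(count_windows(19000), count_windows(10503))
```

A stride smaller than fsize produces overlapping windows, while a stride equal to fsize (the default) tiles the sequence without overlap.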

Predicting prophages with Jaeger


jaeger run -p -i NC_002695.fna -o outdir 

The outdir will contain the following files

|____Escherichia_coli_O157-H7_prophages
| |____plots
| | |____NC_002695_Escherichia_coli_O157-H7_jaeger.pdf
| |____prophages_jaeger.tsv
|____Escherichia_coli_O157-H7_jaeger.log
|____Escherichia_coli_O157-H7_default_jaeger.tsv

Users can find a visualization of the predictions (a PDF file) in the plots directory.




A list of prophage coordinates can be found in prophages_jaeger.tsv

┌─────────────┬────────────┬──────────┬──────────┬───┬──────────┬────────┬────────────┬────────────┐
│ contig_id   ┆ alignment_ ┆ identiti ┆ identity ┆ … ┆ gc%      ┆ reject ┆ attL       ┆ attR       │
│             ┆ length     ┆ es       ┆          ┆   ┆          ┆        ┆            ┆            │
╞═════════════╪════════════╪══════════╪══════════╪═══╪══════════╪════════╪════════════╪════════════╡
│ NC_002695   ┆ 16.0       ┆ 16.0     ┆ 1.0      ┆ … ┆ 0.435049 ┆ false  ┆ GCACCATTTA ┆ GCACCATTTA │
│ Escherichia ┆            ┆          ┆          ┆   ┆          ┆        ┆ AATCAA     ┆ AATCAA     │
│ coli O157-… ┆            ┆          ┆          ┆   ┆          ┆        ┆            ┆            │
│ NC_002695   ┆ 15.0       ┆ 15.0     ┆ 1.0      ┆ … ┆ 0.493497 ┆ false  ┆ GCTTTTTTAT ┆ GCTTTTTTAT │
│ Escherichia ┆            ┆          ┆          ┆   ┆          ┆        ┆ ACTAA      ┆ ACTAA      │
│ coli O157-… ┆            ┆          ┆          ┆   ┆          ┆        ┆            ┆            │
│ NC_002695   ┆ 60.0       ┆ 60.0     ┆ 1.0      ┆ … ┆ 0.511819 ┆ false  ┆ TGGCGGAAGC ┆ TGGCGGAAGC │
│ Escherichia ┆            ┆          ┆          ┆   ┆          ┆        ┆ GCAGAGATTC ┆ GCAGAGATTC │
│ coli O157-… ┆            ┆          ┆          ┆   ┆          ┆        ┆ GAACTCTGGA ┆ GAACTCTGGA │
│             ┆            ┆          ┆          ┆   ┆          ┆        ┆ AC…        ┆ AC…        │
│ NC_002695   ┆ 16.0       ┆ 16.0     ┆ 1.0      ┆ … ┆ 0.499516 ┆ false  ┆ TTCTTTATTA ┆ TTCTTTATTA │
│ Escherichia ┆            ┆          ┆          ┆   ┆          ┆        ┆ CCGGCG     ┆ CCGGCG     │
│ coli O157-… ┆            ┆          ┆          ┆   ┆          ┆        ┆            ┆            │
│ NC_002695   ┆ 14.0       ┆ 14.0     ┆ 1.0      ┆ … ┆ 0.529465 ┆ false  ┆ CGTCATCAAG ┆ CGTCATCAAG │
│ Escherichia ┆            ┆          ┆          ┆   ┆          ┆        ┆ TGCA       ┆ TGCA       │
│ coli O157-… ┆            ┆          ┆          ┆   ┆          ┆        ┆            ┆            │
└─────────────┴────────────┴──────────┴──────────┴───┴──────────┴────────┴────────────┴────────────┘
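
A short sketch of post-processing this table: keep the rows that were not rejected and report their attachment sites. It assumes a tab-separated file whose reject column holds the text false or true, as displayed above; columns hidden behind the … are not used. accepted_att_sites is a hypothetical helper, and the second sample row is made up for illustration.

```python
import csv
import io


def accepted_att_sites(tsv_handle):
    """Return (contig_id, attL, attR) for rows whose `reject` value is false."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    return [
        (row["contig_id"], row["attL"], row["attR"])
        for row in reader
        if row["reject"].strip().lower() == "false"
    ]


# Illustrative sample only; the first row reuses values from the table above.
SAMPLE = (
    "contig_id\tgc%\treject\tattL\tattR\n"
    "NC_002695\t0.435049\tfalse\tGCACCATTTAAATCAA\tGCACCATTTAAATCAA\n"
    "NC_002695\t0.500000\ttrue\tACGTACGT\tACGTACGA\n"
)

if __name__ == "__main__":
    for contig_id, att_l, att_r in accepted_att_sites(io.StringIO(SAMPLE)):
        # Identical attL and attR correspond to an identity of 1.0.
        print(contig_id, att_l, att_r, att_l == att_r)
```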


Visualizing predictions


You can use phage_contig_annotator to annotate and visualize Jaeger predictions.


Acknowledgements


This work was supported by the European Union’s Horizon 2020 research and innovation program, under the Marie Skłodowska-Curie Actions Innovative Training Networks grant agreement no. 955974 (VIROINF), and the European Research Council (ERC) Consolidator grant 865694.

       

The ascii art logo is from https://ascii.co.uk/
