A quick and precise pipeline for detecting phages in sequence assemblies.

These details have not been verified by PyPI

Project links

Homepage

Development Status
- 5 - Production/Stable
License
- OSI Approved :: MIT License
Natural Language
- English
Operating System
- OS Independent
Programming Language
- Python :: 3
Topic
- Scientific/Engineering :: Bio-Informatics

Project description

              .
           ,'/ \`.                 
          |\/___\/|                
          \'\   /`/                 ██╗ █████╗ ███████╗ ██████╗ ███████╗██████╗ 
           `.\ /,'                  ██║██╔══██╗██╔════╝██╔════╝ ██╔════╝██╔══██╗
              |                     ██║███████║█████╗  ██║  ███╗█████╗  ██████╔╝
              |                ██   ██║██╔══██║██╔══╝  ██║   ██║██╔══╝  ██╔══██╗     
             |=|               ╚█████╔╝██║  ██║███████╗╚██████╔╝███████╗██║  ██║
        /\  ,|=|.  /\           ╚════╝ ╚═╝  ╚═╝╚══════╝ ╚═════╝ ╚══════╝╚═╝  ╚═╝ 
    ,'`.  \/ |=| \/  ,'`.
  ,'    `.|\ `-' /|,'    `.
,'   .-._ \ `---' / _,-.   `.
   ,'    `-`-._,-'-'    `.
  '

Jaeger : A quick and precise pipeline for detecting phages in sequence assemblies.

Jaeger is a tool that utilizes homology-free machine learning to identify phage genome sequences that are hidden within metagenomes. It is capable of detecting both phages and prophages within metagenomic assemblies.

Installation

Linux and Mac (x86_64)

option 1 : bioconda

The performance of the Jaeger workflow can be significantly increased by utilizing GPUs. To enable GPU support, the CUDA Toolkit and cuDNN library must be accessible to conda.


# create conda environment and install jaeger
conda create -n jaeger -c conda-forge -c anaconda -c bioconda jaeger

# activate environment
conda activate jaeger

troubleshooting

If you have a GPU on the system, and jaeger fails to detect it, try these steps.

If you are on an HPC cluster, check whether cuda-toolkit is available as a module. (Skip this step if you are trying this out on your PC.)

module avail

angsd/0.937         boost/1.71.0        clang/14.0.4  fastp/0.23.1   gcc/13.2.0     julia/1.9.2         modeller/9.23      proj/7.0.1          structure/2.3.4     vcftools/0.1.16  
autodockvina/1.1.2  boost/1.79.0        clang/17.0.5  fastqc/0.11.9  hdf5/1.12.1    kalign/1.04         mrbayes/3.2.7      r/4.1.1             superlu-dist/8.1.2  
bamutil/1.0.15      bowtie/2.4.2        colmap/3.8    fgsl/1.5.0     hdf5/1.14.0    likwid/5.2.0        openmpi/4.1.1      r/4.3.1             superlu-dist/8.2.0  
baypass/2.2         bwa/0.7.17          cuda/11.4     fsl/6.0.2      hhsuite/3.3.0  likwid/5.2.1        openpmix/3.1.5     samtools/1.12       superlu/4.3         
bcftools/1.15       cdhit/4.8.1         cuda/11.7     gams/36.2.0    I-TASSER/5.1   mathematica/13.2.1  petsc-real/3.18.1  singularity/3.10.0  transdecoder/5.7.0  
bedtools/2.30.0     ceres-solver/2.1.0  cuda/12.0.0   gcc/12.2.0

If so, load it

module load cuda/11.7

Next, check whether the NVIDIA GPU driver is properly configured.

nvidia-smi

The command above returns output like the following if everything is set up properly. You can also read the CUDA version from it; in this example it is 11.7 (needed for step 3).

Mon Apr  8 14:26:43 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01    Driver Version: 515.86.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| N/A   51C    P8     6W /  N/A |   5344MiB /  6144MiB |     27%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2198      G   /usr/lib/xorg/Xorg                 69MiB |
|    0   N/A  N/A   1247272      C   ...a3/envs/jaeger/bin/python     5271MiB |
+-----------------------------------------------------------------------------+

Check whether Jaeger detects the GPU now.
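The installation steps here set up TensorFlow, so one way to check GPU visibility is to ask TensorFlow which GPUs it can see from inside the jaeger environment. The sketch below is a generic TensorFlow check, not a Jaeger command; the helper name visible_gpus is made up for illustration.

```python
def visible_gpus():
    """Return the names of the GPUs TensorFlow can see, or None if TensorFlow is missing."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # An empty list means TensorFlow is installed but found no usable GPU.
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("TensorFlow is not installed in this environment")
    elif gpus:
        print("GPUs visible to TensorFlow:", ", ".join(gpus))
    else:
        print("TensorFlow is installed but sees no GPU")
```

An empty list usually points to a missing or mismatched CUDA Toolkit or cuDNN rather than to a problem in Jaeger itself.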

If that fails you will have to manually configure the conda environment as shown in step 3.

- cuda-toolkit for cuda>=11.1 can be found here https://anaconda.org/nvidia/cuda-toolkit (not recommended)

This example shows the installation process for cuda=11.3.0. To install a different version, change the version number in the "nvidia/label/cuda-11.x.x" label of the second command (the cudatoolkit install).



# create a conda environment
conda create -n jaeger python=3.9 pip

# cudatoolkit and cudnn
conda install -n jaeger -c "nvidia/label/cuda-11.3.0" cudatoolkit=11
conda install -n jaeger -c conda-forge cudnn

# install jaeger
conda install -n jaeger -c conda-forge -c anaconda -c bioconda jaeger

# activate environment
conda activate jaeger

More information on properly setting up TensorFlow can be found here

option 2 : Installing from PyPI (not recommended)

# create a conda environment and activate  
conda create -n jaeger python=3.9 pip
conda activate jaeger

#install jaeger
pip install jaeger-bio

Mac (ARM)

  # create a conda environment
  conda create -c conda-forge -c apple -c bioconda -c defaults -n jaeger python=3.9.2 pip tensorflow=2.6 tensorflow-deps=2.6.0 numpy=1.19.5 tqdm=4.64.0 biopython=1.78

  # install tensorflow
  conda activate jaeger
  pip install tensorflow-macos
  pip install tensorflow-metal

  # install jaeger
  pip install jaeger-bio

Running Jaeger

CPU/GPU mode

Once the environment is properly set up, using Jaeger is straightforward. The program can accept both compressed and uncompressed .fasta files containing the contigs as input. It will output a table containing the predictions and various statistics calculated during runtime.

Jaeger -i input_file.fasta -o output_dir --batch 128

multi-GPU mode

We provide a separate program, Jaeger_parallel, that automatically runs multiple instances of Jaeger on several GPUs to make full use of the available hardware. It accepts a CSV file containing the paths to all input .fasta files; the column holding the file paths must be named 'paths'. All other arguments remain the same as for the 'Jaeger' program.

Jaeger_parallel -i input_file.csv -o output_dir --batch 128
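The input CSV for multi-GPU mode can be generated with a few lines of Python. This is a minimal sketch that assumes a single-column file is enough; only the 'paths' column name comes from the description above, while the function name write_paths_csv and the directory in the usage note are hypothetical.

```python
import csv
from pathlib import Path


def write_paths_csv(fasta_files, csv_path):
    """Write a one-column CSV with the header 'paths', one input file per row."""
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["paths"])  # column name expected by Jaeger_parallel
        for fasta in fasta_files:
            writer.writerow([str(Path(fasta))])
    return csv_path
```

For example, write_paths_csv(sorted(Path("assemblies").glob("*.fasta")), "input_file.csv") would list every .fasta file in a hypothetical assemblies directory.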

Selecting the batch parameter

You can control the number of parallel computations using this parameter. By default it is set to 512 (the --help output of version 1.1.25 lists 96, so check --help for your installed version). If you run into out-of-memory (OOM) errors, set the --batch option to a lower value. For example, 128 is sufficient for a graphics card with 6 GB of memory.

options

Jaeger --help


## Jaeger 1.1.25 (yet AnothEr phaGe idEntifier) Deep-learning based bacteriophage discovery 
https://github.com/Yasas1994/Jaeger.git

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        path to input file
  -o OUTPUT, --output OUTPUT
                        path to output directory
  --fsize [FSIZE]       length of the sliding window (value must be 2^n). default:2048
  --stride [STRIDE]     stride of the sliding window. default:2048 (stride==fsize)
  -m {default,experimental_1,experimental_2}, --model {default,experimental_1,experimental_2}
                        select a deep-learning model to use. default:default
  -p, --prophage        extract and report prophage-like regions. default:False
  -s [SENSITIVITY], --sensitivity [SENSITIVITY]
                        sensitivity of the prophage extraction algorithm (between 0 - 4). default: 1.5
  --lc [LC]             minimum contig length to run prophage extraction algorithm. default: 500000 bp
  --batch [BATCH]       parallel batch size, set to a lower value if your gpu runs out of memory. default:96
  --workers [WORKERS]   number of threads to use. default:4
  --getalllogits        return position-wise logits for each prediction window as a .npy file
  --usecutoffs          use cutoffs to obtain the class prediction
  --cpu                 ignore available gpus and explicitly run jaeger on cpu. default: False
  --virtualgpu          create and run jaeger on a virtualgpu. default: False
  --physicalid [PHYSICALID]
                        sets the default gpu device id (for multi-gpu systems). default:0
  --getalllabels        get predicted labels for Non-Viral contigs. default:False

Misc. Options:
  -v, --verbose         Verbosity level : -v warning, -vv info, -vvv debug, (default info)
  -f, --overwrite       Overwrite existing files
  --progressbar         show progress bar

Python Library

Jaeger can be integrated into Python scripts using the jaegeraa Python library as follows. Currently, the predict function accepts the following input types.

Nucleotide sequence -> str
List of Nucleotide sequences -> list(str,str,..)
python file object -> (io.TextIOWrapper)
python generator object that yields Nucleotide sequences as str (types.GeneratorType)
Biopython Seq object
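Of the input types listed above, the generator is the most convenient for large assemblies because sequences are read lazily. The following is a minimal sketch of such a generator for plain or gzip-compressed FASTA files; the function name fasta_sequences is hypothetical and is not part of the jaegeraa library. It yields only the sequences, so the FASTA headers are not passed along.

```python
import gzip


def fasta_sequences(path):
    """Yield each FASTA record's sequence as a single str, headers dropped."""
    opener = gzip.open if str(path).endswith(".gz") else open
    chunks = []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if it had any sequence.
                if chunks:
                    yield "".join(chunks)
                chunks = []
            elif line:
                chunks.append(line)
    if chunks:
        yield "".join(chunks)
```

Calling fasta_sequences("input_file.fasta") returns a generator object that could then be handed to the predict function.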

from jaegeraa.api import Predictor

# input can be any of the types listed above
model = Predictor()
predictions = model.predict(input, stride=2048, fragsize=2048, batch=100)

model.predict() returns a dictionary of lists in the following format

{'contig_id': ['seq_0', 'seq_1'],
 'length': [19000, 10503],
 '#num_prok_windows': [0, 0],
 '#num_vir_windows': [9, 0],
 '#num_fun_windows': [0, 5],
 '#num_arch_windows': [0, 0],
 'prediction': ['Phage', 'Non-phage'],
 'bac_score': [-1.9552012549506292, -1.9441368103027343],
 'vir_score': [6.6312947273254395, -3.097817325592041],
 'fun_score': [-5.712721400790745, -0.6870137214660644],
 'arch_score': [-2.4369852013058133, -0.8941479325294495],
 'window_summary': ['9V', '5n']}

This dictionary can be easily converted to a pandas DataFrame using the DataFrame.from_dict() method

import pandas as pd
df = pd.DataFrame.from_dict(predictions)
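If pandas is not needed, the dictionary of lists can also be filtered directly. The sketch below picks out the contigs labelled 'Phage' together with their vir_score; the helper name phage_contigs is hypothetical, and the example dictionary reuses values from the output shown above.

```python
def phage_contigs(predictions):
    """Return (contig_id, vir_score) pairs for contigs predicted as 'Phage'."""
    rows = zip(
        predictions["contig_id"],
        predictions["prediction"],
        predictions["vir_score"],
    )
    return [(contig_id, score) for contig_id, label, score in rows if label == "Phage"]


example = {
    "contig_id": ["seq_0", "seq_1"],
    "prediction": ["Phage", "Non-phage"],
    "vir_score": [6.6312947273254395, -3.097817325592041],
}
print(phage_contigs(example))  # [('seq_0', 6.6312947273254395)]
```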

Notes

The program expects the input file to be in .fasta format.
The program uses a sliding window approach to scan the input sequences, so the stride argument determines how far the window will move after each scan.
The batch argument determines how many sequences will be processed in parallel.
The program is compatible with both CPU and GPU. By default, it will run on the GPU, but if the --cpu option is provided, it will use the specified number of threads for inference.
The program uses a pre-trained neural network model for phage genome prediction.
The --getalllabels option will output predicted labels for Non-Viral contigs, which can be useful for further analysis. It's recommended to use the output of this program in conjunction with other methods for phage genome identification.
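The sliding-window note above can be illustrated with a short sketch. It assumes that only full-length windows are scored and that a trailing partial window is dropped; this is an inference, not a description of Jaeger's code, although it agrees with the window counts in the example outputs (a 19000 bp contig with 9 windows and a 10503 bp contig with 5 windows at the default size of 2048). The function name window_starts is hypothetical.

```python
def window_starts(seq_len, fsize=2048, stride=2048):
    """Start offsets of every full-length window over a sequence of seq_len bases."""
    if fsize <= 0 or stride <= 0:
        raise ValueError("fsize and stride must be positive")
    # Sequences shorter than fsize produce no window at all.
    return list(range(0, seq_len - fsize + 1, stride))


print(len(window_starts(19000)))        # 9
print(len(window_starts(10503)))        # 5
print(window_starts(5000, 2048, 1024))  # [0, 1024, 2048]
```

With stride equal to fsize the windows do not overlap; a smaller stride produces overlapping windows and therefore more predictions per contig.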

What is in the output?

contig_id	length	prediction	entropy	realiability_score	host_contam	prophage_contam	#_Bacteria_windows	#_Phage_windows	#_Eukarya_windows	#_Archaea_windows	Bacteria_score	Bacteria_var	Phage_score	Phage_var	Eukarya_score	Eukarya_var	Archaea_score	Archaea_var	window_summary
NODE_94_length_44776_cov_27.159388	44776	Phage	0.385	0.719	False	False	2	19	0	0	0.966	1.27	3.66	1.679	-5.832	2.477	-3.199	1.619	5V1n14V1n
NODE_123_length_36569_cov_24.228077	36569	Phage	0.503	0.695	False	False	1	16	0	0	0.945	0.766	3.453	1.116	-6.02	2.471	-2.795	1.554	9V1n7V
NODE_149_length_32942_cov_23.754006	32942	Phage	0.458	0.758	False	False	1	14	1	0	-0.023	0.602	3.924	3.352	-7.18	5.324	-2.023	3.229	3V2n11V
NODE_231_length_24276_cov_21.832294	24276	Phage	0.502	0.761	False	False	2	9	0	0	1.08	0.978	3.297	1.479	-5.773	1.05	-2.682	1.467	1V1n3V1n5V
NODE_262_length_22786_cov_22.465664	22786	Phage	0.452	0.709	False	False	1	9	0	1	0.383	0.768	3.465	1.919	-6.875	1.275	-1.683	4.078	2V1n6V1n1V

This table provides information about various contigs in a metagenomic assembly. Each row represents a single contig, and the columns provide information about the contig's ID, length, the number of windows identified as prokaryotic, viral, eukaryotic, and archaeal, the prediction of the contig (Phage or Non-phage), the score of the contig for each category (bacterial, viral, eukaryotic and archaeal), and a summary of the windows. The table can be used to identify potential phage sequences in the metagenomic assembly based on the prediction column. The score columns can be used to further evaluate the confidence of the prediction and the window summary column can be used to understand the count of windows that contributed to the final prediction.
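The window_summary column appears to be a run-length encoding of the per-window labels along the contig. In the first example row, 5V1n14V1n adds up to 19 V windows and 2 n windows, matching #_Phage_windows (19) and the remaining window columns (2), which suggests that V marks phage windows and n marks non-phage windows. This reading is inferred from the example rows rather than from a documented format, and the function names below are hypothetical.

```python
import re
from collections import Counter


def decode_window_summary(summary):
    """Split a string such as '5V1n14V1n' into (count, label) runs."""
    runs = re.findall(r"(\d+)([A-Za-z])", summary)
    return [(int(count), label) for count, label in runs]


def window_totals(summary):
    """Total number of windows per label across all runs."""
    totals = Counter()
    for count, label in decode_window_summary(summary):
        totals[label] += count
    return dict(totals)


print(decode_window_summary("9V1n7V"))  # [(9, 'V'), (1, 'n'), (7, 'V')]
print(window_totals("5V1n14V1n"))       # {'V': 19, 'n': 2}
```

The order of the runs shows where along the contig the non-phage windows fall, which the per-class counts alone do not.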

Predicting prophages with Jaeger

Jaeger -p -i NZ_CP033092.fna -o outdir

Visualizing predictions

You can use phage_contig_annotator to annotate and visualize Jaeger predictions.

Acknowledgements

This work was supported by the European Union’s Horizon 2020 research and innovation program, under the Marie Skłodowska-Curie Actions Innovative Training Networks grant agreement no. 955974 (VIROINF), and by the European Research Council (ERC) Consolidator grant 865694.

ASCII art from https://ascii.co.uk/

Release history

1.1.30

Nov 13, 2024

1.1.26

Apr 9, 2024

This version

1.1.25

Apr 8, 2024

1.1.24

Apr 7, 2024

Download files

Source Distribution

jaeger-bio-1.1.25.tar.gz (33.4 MB)

Uploaded Apr 8, 2024 Source

Built Distribution

jaeger_bio-1.1.25-py3-none-any.whl (33.4 MB)

Uploaded Apr 8, 2024 Python 3

File details

Details for the file jaeger-bio-1.1.25.tar.gz.

File metadata

Download URL: jaeger-bio-1.1.25.tar.gz
Upload date: Apr 8, 2024
Size: 33.4 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.0.0 CPython/3.9.18

File hashes

Hashes for jaeger-bio-1.1.25.tar.gz
Algorithm	Hash digest
SHA256	`c5b624e30c0dc976a85c199f9734debb8779c31237db54f4672d8cd6f694e753`
MD5	`f91c86362a3028dd34f2f641da432bcc`
BLAKE2b-256	`62994af4cebace227b020dc804b3cfeeef28221de33068a0f69baa58cddb5eef`

File details

Details for the file jaeger_bio-1.1.25-py3-none-any.whl.

File metadata

Download URL: jaeger_bio-1.1.25-py3-none-any.whl
Upload date: Apr 8, 2024
Size: 33.4 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.0.0 CPython/3.9.18

File hashes

Hashes for jaeger_bio-1.1.25-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d82ac03ed6cb1b0118a4c32a5d5c64571e7221a6cf834da2760b88eb59cf82f8`
MD5	`3ea74df238cf9363813334c197cd5dbe`
BLAKE2b-256	`16cbf44263cf580c8b996a32d861a531dd29a40660093a20b8416b748c26defd`