Protein-content-based bacterial pathogenicity classifier

WSPC

Installation and dependencies

The WSPC package can be installed in one of the following ways:

Using pip install (via the command line):

pip install wspc

Using conda (in a conda environment):

conda install -c zivukelsongroup wspc

Dependencies:

Python >=3.6
Packages: pandas, numpy, scikit-learn, scipy

Command Line

On Windows, make sure that the Python "Scripts\" directory is added to PATH so that the package can be executed as a command.
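
A quick way to confirm that the command is reachable is to look it up on PATH. The sketch below uses only the Python standard library; the helper name `find_command` is hypothetical and not part of WSPC.

```python
import shutil


def find_command(name="wspc"):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)


if __name__ == "__main__":
    location = find_command()
    if location is None:
        print("wspc was not found on PATH; check the Python Scripts directory.")
    else:
        print(f"wspc found at: {location}")
```

If the lookup returns nothing after installation, the Scripts directory is most likely missing from PATH.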

Usage:

usage: wspc [-h] [-m {predict,fit}] -i I [-o OUTPUT] [-l LABELS_PATH] [--model_path MODEL_PATH] [-k K] [-t T]

optional arguments:
  -h, --help            show this help message and exit
  -m {predict,fit}, --mode {predict,fit}
  -i I                  input directory with genome *.txt files or a merged input *.fasta file
  -o OUTPUT, --output OUTPUT
                        output directory, default current directory
  -l LABELS_PATH, --labels_path LABELS_PATH
                        path to *.csv file with labels
  --model_path MODEL_PATH
                        path to a saved model in a *.pkl file. If not provided, saved pre-trained model will be used
  -k K                  parameter for training - selecting k-best features using chi2
  -t T                  parameter for training - clustering threshold

Predict:

You can predict the pathogenicity potential of a group of genomes using a saved model in a *.pkl file. If a model path is not provided, the saved pre-trained model is used. The WSPC pre-trained model can be found at https://github.com/shakedna1/wspc_rep/blob/main/src/wspc/model/WSPC_model.pkl.

wspc -m predict -i path_to_input_genomes

Train:

Train a new model using the fit command.

You can train a new model using the same k (number of best features selected using chi2) and t (clustering threshold) values as WSPC (450 and 0.18, respectively), or using different values of your choice.

wspc -m fit -i path_to_input_genomes -l path_to_labels -k 450 -t 0.18
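
The -k option keeps the k features with the highest chi-squared scores. As a rough illustration of what such a selection does, the sketch below scores binary presence/absence features against class labels with a plain-Python chi-squared statistic and keeps the top k. The toy matrix, the labels, and the helper names `chi2_scores` and `select_k_best` are made up for illustration; this is not WSPC's internal code, which is not shown in this document.

```python
def chi2_scores(X, y):
    """Chi-squared score of each feature column against the class labels.

    For every class, the observed feature total in that class is compared
    with the total expected if the feature were independent of the class.
    """
    n_samples = len(X)
    n_features = len(X[0])
    classes = sorted(set(y))
    scores = []
    for j in range(n_features):
        feature_total = sum(row[j] for row in X)
        score = 0.0
        for c in classes:
            class_prob = sum(1 for label in y if label == c) / n_samples
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            expected = class_prob * feature_total
            if expected > 0:  # a feature that never occurs scores 0 here
                score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


def select_k_best(scores, k):
    """Return the (sorted) column indices of the k highest-scoring features."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy data: rows are genomes, columns are protein families (1 = present).
X = [
    [1, 0, 1],
    [1, 0, 1],
    [0, 1, 1],
    [0, 1, 1],
]
y = [1, 1, 0, 0]  # hypothetical labels: 1 = pathogenic, 0 = non-pathogenic

scores = chi2_scores(X, y)
print(scores)
print(select_k_best(scores, k=2))
```

In this toy example the third family is present in every genome, so it carries no class information and is the one dropped when k=2.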

Reproducing training and prediction on the dataset from the paper

Download and extract the WSPC dataset (WSPC train set and WSPC test set) from https://github.com/shakedna1/wspc_rep/raw/main/Data/train_test_datasets.zip
On Ubuntu:
```
wget https://github.com/shakedna1/wspc_rep/raw/main/Data/train_test_datasets.zip
unzip train_test_datasets.zip
```
Train:
```
wspc -m fit -i train_genomes.fasta -l train_genomes_info.csv -k 450 -t 0.18
```
The file trained_model.pkl will be saved in the current directory (or in the directory provided through the -o argument).

Test:

wspc -m predict -i test_genomes.fasta --model_path trained_model.pkl

The file predictions.csv will contain the predictions.
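
The train and test steps above can also be scripted. The following is a minimal sketch that only assembles the argument lists from the flags documented in the usage message; the helper names `build_wspc_command` and `run` are hypothetical and not part of WSPC, and actually running the commands assumes that wspc is installed and on PATH.

```python
import subprocess


def build_wspc_command(mode, input_path, labels_path=None, model_path=None,
                       output_dir=None, k=None, t=None):
    """Assemble the argument list for the wspc command-line tool."""
    if mode not in ("predict", "fit"):
        raise ValueError("mode must be 'predict' or 'fit'")
    command = ["wspc", "-m", mode, "-i", str(input_path)]
    if labels_path is not None:
        command += ["-l", str(labels_path)]
    if model_path is not None:
        command += ["--model_path", str(model_path)]
    if output_dir is not None:
        command += ["-o", str(output_dir)]
    if k is not None:
        command += ["-k", str(k)]
    if t is not None:
        command += ["-t", str(t)]
    return command


def run(command):
    """Run a command and raise CalledProcessError if it exits with an error."""
    subprocess.run(command, check=True)


fit_cmd = build_wspc_command("fit", "train_genomes.fasta",
                             labels_path="train_genomes_info.csv",
                             k=450, t=0.18)
predict_cmd = build_wspc_command("predict", "test_genomes.fasta",
                                 model_path="trained_model.pkl")

# Print the commands; call run(fit_cmd) and run(predict_cmd) to execute them.
print(" ".join(fit_cmd))
print(" ".join(predict_cmd))
```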

Running WSPC as a python module

Below are detailed examples of running WSPC as a Python module:

1. Train a new model and predict genome pathogenicity using the new model:

Imports:

import wspc

Train a new model:

X_train = wspc.read_genomes(path_to_genomes)
y = wspc.read_labels(path_to_labels, X_train)

model = wspc.fit(X_train, y, k=450, t=0.18)

Predict pathogenicity:

X_test = wspc.read_genomes(path_to_genomes)
predictions = wspc.predict(X_test, model)

2. Predict genome pathogenicity using an existing model:

Imports:

import wspc

Load a pre-trained model:

model_path - path to a saved model in a *.pkl file. If not provided, the saved pre-trained model is used.

model = wspc.load_model(model_path)

Predict pathogenicity:

X_test = wspc.read_genomes(path_to_genomes)
predictions = wspc.predict(X_test, model)

WSPC input:

WSPC handles different types of input:

Input directory with genome *.tab and/or *.txt files:

*.tab file - Public genomes in the PATRIC database are available through a genomes directory. Each genome directory includes a .features.tab file, which provides all genomic features and related information in tab-delimited format, including PGFams information. For an example of a features.tab file, see: https://github.com/shakedna1/wspc_rep/blob/main/Data/Bacpacs/patric_files/1041522.28.PATRIC.features.tab

*.txt file - Output file of the PATRIC annotation service for a new genome. For more details on the file and the annotation service, see the section "Obtain PATRIC Global Protein Families (PGFams) annotations for a newly sequenced genome" below.

Merged input *.fasta file: A merged file in FASTA format that contains a concatenation of the PGFams information, which can be extracted from a *.tab file using the field "pgfam_id" and from a *.txt file using the field "pgfam".

For the exact format of the merged file, see: https://github.com/shakedna1/wspc_rep/blob/main/Data/train_genomes.fasta

Example of the fasta file content:
```
>1346.123
PGF_10048015
PGF_00062045
PGF_00409415
PGF_00766022
PGF_02011026
X
X
X
PGF_07480521
PGF_01162199
PGF_03475877
PGF_00876106
PGF_06473395
PGF_06429692
PGF_00007012
PGF_04788810
```
- 1346.123 is the genome name. The lines below the genome name list the genome's PGFam annotations in sequence. X represents a missing/un-annotated gene.
Note that any protein family annotation IDs can be used, e.g., COGs, eggNOGs, etc.
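
To make the layout concrete, here is a minimal sketch of a reader for this format, written with the standard library only. It assumes one annotation ID per line, as in the example above; the function name `parse_merged_fasta` is hypothetical and is not part of the WSPC API (WSPC itself reads these files via wspc.read_genomes).

```python
def parse_merged_fasta(lines):
    """Map each genome name to its ordered list of annotation IDs."""
    genomes = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            current = line[1:]  # header line: genome name
            genomes[current] = []
        elif current is not None:
            genomes[current].append(line)  # annotation ID or "X"
    return genomes


# A shortened version of the example above.
example = """>1346.123
PGF_10048015
PGF_00062045
X
PGF_07480521
""".splitlines()

genomes = parse_merged_fasta(example)
for name, ids in genomes.items():
    annotated = [i for i in ids if i != "X"]
    print(name, len(ids), "genes,", len(annotated), "annotated")
```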

Obtain PATRIC Global Protein Families (PGFams) annotations for a newly sequenced genome:

PATRIC provides a Global Protein Families (PGFams) annotation service for new genomes. To generate a PGFams annotation file for a newly sequenced genome:

Use PATRIC's Genome Annotation Service: https://patricbrc.org/app/Annotation.

For detailed instructions, follow the PATRIC genome annotation service documentation: https://docs.patricbrc.org/user_guides/services/genome_annotation_service.html
Download the resulting "Taxonomy name + label".txt file (click on view, then download; "Taxonomy name + label" is the genome name).
If you wish to create a merged *.fasta file for a number of genomes, the column "pgfam" is used for PGFam extraction.
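
Building one record of the merged file from an annotation table could look roughly like the sketch below. It assumes that the table is tab-delimited with a header row, that rows appear in genome order, and that an empty "pgfam" value should be written as X; the helper names, the genome name "my_genome", and the sample rows are made up for illustration. For a .features.tab file, the column would be "pgfam_id" instead.

```python
import csv
import io


def pgfam_column(table_file, column="pgfam"):
    """Read one column of a tab-delimited table, writing "X" for empty cells."""
    reader = csv.DictReader(table_file, delimiter="\t")
    return [(row.get(column) or "").strip() or "X" for row in reader]


def to_fasta_record(genome_name, ids):
    """Format a genome name and its annotation IDs as one merged-file record."""
    return ">" + genome_name + "\n" + "\n".join(ids) + "\n"


# Stand-in for an opened annotation file, e.g. open(path, newline="").
sample = io.StringIO(
    "feature_id\tpgfam\n"
    "f1\tPGF_00062045\n"
    "f2\t\n"
    "f3\tPGF_07480521\n"
)

record = to_fasta_record("my_genome", pgfam_column(sample))
print(record)
```

Records for several genomes can then be concatenated into a single *.fasta file.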

Release history

0.0.6 - Aug 27, 2021 (this version)
0.0.5 - Jul 31, 2021
0.0.4 - Jul 31, 2021
0.0.3 - Jul 30, 2021
0.0.2 - Jun 3, 2021
0.0.1 - Jun 3, 2021

Download files

Source Distribution: wspc-0.0.6.tar.gz (181.7 kB), uploaded Aug 27, 2021
Built Distribution: wspc-0.0.6-py3-none-any.whl (188.0 kB), uploaded Aug 27, 2021, Python 3

Hashes for wspc-0.0.6.tar.gz

Algorithm	Hash digest
SHA256	`0358c66ae0d88211faf417d9d0c57696b239a89ffb8eade2119707aafdddd1b4`
MD5	`417db80228a2ce84adee00d0acd60231`
BLAKE2b-256	`1be6b873515917641227b708c9128600e7e72e4a66b4c504a98c4e7b9a2f16f0`

Hashes for wspc-0.0.6-py3-none-any.whl

Algorithm	Hash digest
SHA256	`5c07a65c57c84bc1d51170fe5dd4926b586727aa637df20ab95d54addfe3d7d4`
MD5	`d89076e9be224c4400c47dd41acfb753`
BLAKE2b-256	`cc5c751f3efaeefe23080a1484cf63aa6bf4989575f794126e7de85683da3f99`