Protein-content-based bacterial pathogenicity classifier
WSPC
Installation and dependencies
The WSPC package can be installed using one of the following options:
- Using pip install (via the command line):
pip install wspc
- Using conda (in a conda environment):
conda install -c zivukelsongroup wspc
Dependencies:
- Python >=3.6
- Packages: pandas, numpy, scikit-learn, scipy
Command Line
On Windows: make sure that the Python "Scripts\" directory is added to PATH, so that the package can be executed as a command.
Usage:
usage: wspc [-h] [-m {predict,fit}] -i I [-o OUTPUT] [-l LABELS_PATH] [--model_path MODEL_PATH] [-k K] [-t T]
optional arguments:
-h, --help show this help message and exit
-m {predict,fit}, --mode {predict,fit}
-i I input directory with genome *.txt files or a merged input *.fasta file
-o OUTPUT, --output OUTPUT
output directory, default current directory
-l LABELS_PATH, --labels_path LABELS_PATH
path to *.csv file with labels
--model_path MODEL_PATH
path to a saved model in a *.pkl file. If not provided, saved pre-trained model will be used
-k K parameter for training - selecting k-best features using chi2
-t T parameter for training - clustering threshold
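When the command has to be called from a script (for example, to process several input directories in a batch), the argument list can be assembled programmatically. The following is a minimal sketch that only uses the flags listed in the usage text above; the helper names build_wspc_command and run_wspc are hypothetical and are not part of the wspc package.

```python
import subprocess
from typing import List, Optional


def build_wspc_command(
    mode: str,
    input_path: str,
    output: Optional[str] = None,
    labels_path: Optional[str] = None,
    model_path: Optional[str] = None,
    k: Optional[int] = None,
    t: Optional[float] = None,
) -> List[str]:
    """Assemble the argument list for the `wspc` command (hypothetical helper)."""
    if mode not in ("predict", "fit"):
        raise ValueError("mode must be 'predict' or 'fit'")
    cmd = ["wspc", "-m", mode, "-i", str(input_path)]
    # Optional flags are appended only when a value was given.
    if output is not None:
        cmd += ["-o", str(output)]
    if labels_path is not None:
        cmd += ["-l", str(labels_path)]
    if model_path is not None:
        cmd += ["--model_path", str(model_path)]
    if k is not None:
        cmd += ["-k", str(k)]
    if t is not None:
        cmd += ["-t", str(t)]
    return cmd


def run_wspc(**kwargs) -> subprocess.CompletedProcess:
    """Run wspc and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_wspc_command(**kwargs), check=True)
```

Calling run_wspc requires that the wspc command is installed and available on PATH; build_wspc_command alone has no such requirement and can be used to log or inspect the command first.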
Predict:
You can predict the pathogenicity potential of a group of genomes using a saved model in a *.pkl file. If a model path is not provided, the saved pre-trained model is used. The WSPC pre-trained model can be found at https://github.com/shakedna1/wspc_rep/blob/main/src/wspc/model/WSPC_model.pkl.
wspc -m predict -i path_to_input_genomes
Train:
Train a new model using the fit command.
You can train a new model using the same k (number of best features selected using chi2) and t (clustering threshold) values as WSPC (450 and 0.18, respectively), or using different values of your choice.
wspc -m fit -i path_to_input_genomes -l path_to_labels -k 450 -t 0.18
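To give an intuition for what the k parameter controls, here is a small pure-Python sketch of chi2-based k-best feature selection on a count matrix. It illustrates the general technique only; WSPC's internal feature pipeline is not described in this document, and the function names chi2_scores and select_k_best are hypothetical. In practice, scikit-learn's SelectKBest with the chi2 score function is the usual tool for this.

```python
from typing import List, Sequence


def chi2_scores(X: Sequence[Sequence[float]], y: Sequence[int]) -> List[float]:
    """Chi-squared score per feature for non-negative count features.

    For each class, the observed feature total is compared with the total
    expected if the feature were independent of the class label.
    """
    n_samples = len(X)
    n_features = len(X[0])
    feature_totals = [sum(row[j] for row in X) for j in range(n_features)]
    scores = [0.0] * n_features
    for label in sorted(set(y)):
        rows = [row for row, row_label in zip(X, y) if row_label == label]
        class_prob = len(rows) / n_samples
        for j in range(n_features):
            expected = class_prob * feature_totals[j]
            if expected == 0:
                continue  # feature never occurs; leave its score at 0
            observed = sum(row[j] for row in rows)
            scores[j] += (observed - expected) ** 2 / expected
    return scores


def select_k_best(X: Sequence[Sequence[float]], y: Sequence[int], k: int) -> List[int]:
    """Return the column indices of the k highest-scoring features."""
    scores = chi2_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy data: rows are genomes, columns are protein-family counts.
X = [[1, 0, 1], [1, 0, 1], [0, 1, 1], [0, 1, 1]]
y = [1, 1, 0, 0]
best = select_k_best(X, y, k=2)
```

In this toy example the third column occurs equally in both classes, so it carries no signal and is the one dropped when k=2.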
Reconstruction of Training and Prediction on the dataset from the paper
- Download and extract the WSPC dataset (WSPC train set & WSPC test set) from https://github.com/shakedna1/wspc_rep/raw/main/Data/train_test_datasets.zip
In Ubuntu:
wget https://github.com/shakedna1/wspc_rep/raw/main/Data/train_test_datasets.zip
unzip train_test_datasets.zip
- Train:
wspc -m fit -i train_genomes.fasta -l train_genomes_info.csv -k 450 -t 0.18
The file trained_model.pkl will be saved in the current directory (or in the directory provided through the -o argument).
- Test:
wspc -m predict -i test_genomes.fasta --model_path trained_model.pkl
The file predictions.csv will contain the predictions.
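On systems without wget and unzip, the download and extraction step can be done with the Python standard library instead. This is a minimal sketch; the function names download and extract are hypothetical helpers, and the URL is the dataset link given above.

```python
import urllib.request
import zipfile
from pathlib import Path
from typing import List

DATASET_URL = (
    "https://github.com/shakedna1/wspc_rep/raw/main/Data/train_test_datasets.zip"
)


def download(url: str, dest: str) -> Path:
    """Download `url` to the local file `dest` and return its path."""
    dest_path = Path(dest)
    urllib.request.urlretrieve(url, str(dest_path))
    return dest_path


def extract(zip_path: str, target_dir: str = ".") -> List[str]:
    """Extract a zip archive into `target_dir` and return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()


if __name__ == "__main__":
    archive_path = download(DATASET_URL, "train_test_datasets.zip")
    for name in extract(str(archive_path)):
        print(name)
```

After extraction, the train and test commands above can be run from the directory that contains the extracted files.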
Running WSPC as a python module
Below are detailed examples of running WSPC as a Python module:
1. Train a new model and predict genome pathogenicity using the new model:
Imports:
import wspc
Train a new model:
X_train = wspc.read_genomes(path_to_genomes)
y = wspc.read_labels(path_to_labels, X_train)
model = wspc.fit(X_train, y, k=450, t=0.18)
Predict pathogenicity:
X_test = wspc.read_genomes(path_to_genomes)
predictions = wspc.predict(X_test, model)
2. Predict genome pathogenicity using an existing model:
Imports:
import wspc
Load a pre-trained model:
model_path - path to a saved model in a *.pkl file. If not provided, the saved pre-trained model is used.
model = wspc.load_model(model_path)
Predict pathogenicity:
X_test = wspc.read_genomes(path_to_genomes)
predictions = wspc.predict(X_test, model)
WSPC input:
WSPC handles different types of input:
- Input directory with genome *.tab and/or *.txt files:
*.tab file - Public genomes in the PATRIC database are available through a genomes directory. Each genome directory includes a .features.tab file, which provides all genomic features and related information in tab-delimited format, including PGFams information. For an example features.tab file, see: https://github.com/shakedna1/wspc_rep/blob/main/Data/Bacpacs/patric_files/1041522.28.PATRIC.features.tab
*.txt file - Output file of the PATRIC annotation service for a new genome. For more details on the file and the annotation service, see the section "Obtain PATRIC Global Protein Families (PGFams) annotations for a newly sequenced genome" below.
- Merged input *.fasta file: A merged file in fasta format that contains a concatenation of the PGFams information, which can be extracted from a *.tab file using the field "pgfam_id" and from a *.txt file using the field "pgfam".
For the exact format of the merged file, see: https://github.com/shakedna1/wspc_rep/blob/main/Data/train_genomes.fasta
Example of the fasta file content:
>1346.123
PGF_10048015 PGF_00062045 PGF_00409415 PGF_00766022 PGF_02011026 X X X PGF_07480521 PGF_01162199 PGF_03475877 PGF_00876106 PGF_06473395 PGF_06429692 PGF_00007012 PGF_04788810
- 1346.123 is the genome name; the lines below the genome name represent the genome's sequence of PGFam annotations. X represents a missing/un-annotated gene.
Note that any protein family annotation IDs can be used, e.g., COGs, eggNOGs etc.
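To inspect or validate a merged file before passing it to WSPC, the format described above can be read with a few lines of Python. This is a minimal sketch, not the package's own reader: parse_merged_fasta is a hypothetical helper, and it assumes that annotation IDs are separated by whitespace and may span several lines after the ">" header line.

```python
from typing import Dict, Iterable, List


def parse_merged_fasta(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each genome name to its ordered list of annotation IDs.

    Assumes a '>' header line per genome followed by whitespace-separated
    IDs; IDs on the header line itself are also accepted.
    """
    genomes: Dict[str, List[str]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            if not parts:
                raise ValueError("header line without a genome name")
            current = parts[0]
            genomes[current] = parts[1:]
        elif current is not None:
            genomes[current].extend(line.split())
        else:
            raise ValueError("annotation line found before any header")
    return genomes


sample = [
    ">1346.123",
    "PGF_10048015 PGF_00062045 X X",
    "PGF_07480521",
]
parsed = parse_merged_fasta(sample)
# Fraction of genes in the genome that carry an annotation (not "X").
annotated = sum(1 for pg in parsed["1346.123"] if pg != "X") / len(parsed["1346.123"])
```

To parse a file on disk, pass an open file handle, for example parse_merged_fasta(open("train_genomes.fasta")).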
Obtain PATRIC Global Protein Families (PGFams) annotations for a newly sequenced genome:
PATRIC provides a Global Protein Families (PGFams) annotation service for new genomes. To generate a PGFams annotation file for a newly sequenced genome:
- Use PATRIC's Genome Annotation Service: https://patricbrc.org/app/Annotation.
For detailed instructions, see the PATRIC Genome Annotation Service documentation: https://docs.patricbrc.org/user_guides/services/genome_annotation_service.html
- Download the resulting "Taxonomy name + label".txt file (click on view, then download; "Taxonomy name + label" is the genome name).
- If you wish to create a merged *.fasta file for a number of genomes, use the "pgfam" column to extract the PGFam IDs.
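The last step can be sketched in Python as follows. This is an illustration under stated assumptions, not the package's own tooling: it assumes the annotation *.txt files are tab-delimited with a header row, that genes should be kept in file order, that an empty "pgfam" value is written as X, and that the file name (without extension) is the genome name. The helper names read_pgfams and write_merged_fasta are hypothetical.

```python
import csv
from pathlib import Path
from typing import Iterable, List

MISSING = "X"  # placeholder for a missing/un-annotated gene


def read_pgfams(path: Path, column: str = "pgfam") -> List[str]:
    """Read one annotation file and return its PGFam IDs in file order."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        if reader.fieldnames is None or column not in reader.fieldnames:
            raise ValueError(f"column {column!r} not found in {path}")
        # Empty or absent values become the missing-gene placeholder.
        return [(row[column] or "").strip() or MISSING for row in reader]


def write_merged_fasta(
    annotation_files: Iterable[str], out_path: str, column: str = "pgfam"
) -> None:
    """Write one '>' header plus one annotation line per genome."""
    with open(out_path, "w", encoding="utf-8") as out:
        for path in map(Path, annotation_files):
            pgfams = read_pgfams(path, column)
            out.write(f">{path.stem}\n")  # assumed: file stem = genome name
            out.write(" ".join(pgfams) + "\n")
```

For *.tab files, the same functions can be called with column="pgfam_id"; the genome name may then need to be derived differently, because the stem of a name like 1041522.28.PATRIC.features.tab keeps the ".PATRIC.features" part.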