
Protein-content-based bacterial pathogenicity classifier

WSPC

Installation and dependencies

The WSPC package can be installed in one of the following ways:

  1. Using pip (from the command line):

    pip install wspc

  2. Using conda (in a conda environment):

    conda install -c zivukelsongroup wspc

Dependencies:

  • Python >=3.6
  • Packages: pandas, numpy, scikit-learn, scipy

Command Line

On Windows: make sure that the Python "Scripts\" directory is added to PATH, so that the package can be run as a command.

Usage:

usage: wspc [-h] [-m {predict,fit}] -i I [-o OUTPUT] [-l LABELS_PATH] [--model_path MODEL_PATH] [-k K] [-t T]

optional arguments:
  -h, --help            show this help message and exit
  -m {predict,fit}, --mode {predict,fit}
  -i I                  input directory with genome *.txt files or a merged input *.fasta file
  -o OUTPUT, --output OUTPUT
                        output directory, default current directory
  -l LABELS_PATH, --labels_path LABELS_PATH
                        path to *.csv file with labels
  --model_path MODEL_PATH
                        path to a saved model in a *.pkl file. If not provided, saved pre-trained model will be used
  -k K                  parameter for training - selecting k-best features using chi2
  -t T                  parameter for training - clustering threshold

Predict:

You can predict the pathogenicity potential of a group of genomes using a saved model in a *.pkl file. If a model path is not provided, the saved pre-trained model is used. The WSPC pre-trained model can be found at https://github.com/shakedna1/wspc_rep/blob/main/src/wspc/model/WSPC_model.pkl.

wspc -m predict -i path_to_input_genomes

Train:

Train a new model using the fit command.

You can train a new model using the same k (selecting the k best features using chi2) and t (clustering threshold) values as WSPC (450 and 0.18, respectively), or using different values of your choice.

wspc -m fit -i path_to_input_genomes -l path_to_labels -k 450 -t 0.18
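
The predict and fit invocations above can also be launched from a Python script. The following is a minimal sketch, not part of the WSPC package: the helper names build_wspc_command and run_wspc are hypothetical, and the sketch only assembles the flags listed in the usage text and assumes the wspc command is available on PATH.

```python
import subprocess


def build_wspc_command(mode, input_path, output_dir=None, labels_path=None,
                       model_path=None, k=None, t=None):
    """Assemble a wspc command line from the flags in the usage text.

    Hypothetical helper for illustration; it is not provided by WSPC.
    """
    if mode not in ("predict", "fit"):
        raise ValueError("mode must be 'predict' or 'fit'")
    cmd = ["wspc", "-m", mode, "-i", str(input_path)]
    # Optional flags are added only when a value was supplied.
    if output_dir is not None:
        cmd += ["-o", str(output_dir)]
    if labels_path is not None:
        cmd += ["-l", str(labels_path)]
    if model_path is not None:
        cmd += ["--model_path", str(model_path)]
    if k is not None:
        cmd += ["-k", str(k)]
    if t is not None:
        cmd += ["-t", str(t)]
    return cmd


def run_wspc(**kwargs):
    """Run wspc as a subprocess; raises CalledProcessError on failure."""
    return subprocess.run(build_wspc_command(**kwargs), check=True)


# Building (not running) the training command shown above.
fit_cmd = build_wspc_command("fit", "path_to_input_genomes",
                             labels_path="path_to_labels", k=450, t=0.18)
print(" ".join(fit_cmd))
```

Calling run_wspc(mode="predict", input_path="path_to_input_genomes") would then execute the prediction command, provided the package is installed.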

Reconstruction of Training and Prediction on the dataset from the paper

  1. Download and extract the WSPC dataset (WSPC train set and WSPC test set) from https://github.com/shakedna1/wspc_rep/raw/main/Data/train_test_datasets.zip. On Ubuntu:

       wget https://github.com/shakedna1/wspc_rep/raw/main/Data/train_test_datasets.zip
       unzip train_test_datasets.zip
    
  2. Train:

    wspc -m fit -i train_genomes.fasta -l train_genomes_info.csv -k 450 -t 0.18
    

    The file trained_model.pkl is saved in the current directory (or in the directory given with the -o argument).

  3. Test:

    wspc -m predict -i test_genomes.fasta --model_path trained_model.pkl
    

    The file predictions.csv will contain the predictions.
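
Step 1 above uses wget and unzip on Ubuntu. On platforms where these tools are not available, the same download and extraction can be sketched with the Python standard library. The function name download_and_extract is hypothetical and not part of WSPC; the URL is the one given in step 1.

```python
import urllib.request
import zipfile
from pathlib import Path

# Dataset location as given in step 1.
DATASET_URL = ("https://github.com/shakedna1/wspc_rep/raw/main/Data/"
               "train_test_datasets.zip")


def download_and_extract(url, dest_dir="."):
    """Download a zip archive and extract it into dest_dir.

    Hypothetical helper for illustration. Returns the sorted list of
    member names found in the archive.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Name the local archive after the last path component of the URL.
    archive = dest / url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, str(archive))
    with zipfile.ZipFile(str(archive)) as zf:
        zf.extractall(str(dest))
        return sorted(zf.namelist())
```

Calling download_and_extract(DATASET_URL) places the extracted files in the current directory, after which the Train and Test commands in steps 2 and 3 can be run as written.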

Running WSPC as a python module

Below are detailed examples of running WSPC as a Python module:

1. Train a new model and predict genome pathogenicity using the new model:

Imports:

import wspc

Train a new model:

X_train = wspc.read_genomes(path_to_genomes)
y = wspc.read_labels(path_to_labels, X_train)

model = wspc.fit(X_train, y, k=450, t=0.18)

Predict pathogenicity:

X_test = wspc.read_genomes(path_to_genomes)
predictions = wspc.predict(X_test, model)

2. Predict genome pathogenicity using an existing model:

Imports:

import wspc

Load a pre-trained model:

model_path - path to a saved model in a *.pkl file. If not provided, the saved pre-trained model is used.

model = wspc.load_model(model_path)

Predict pathogenicity:

X_test = wspc.read_genomes(path_to_genomes)
predictions = wspc.predict(X_test, model)

WSPC input:

WSPC handles different types of input:

  1. Input directory with genome *.tab and/or *.txt files:

    *.tab file - Public genomes in the PATRIC database are available through a genomes directory. Each genome directory includes a .features.tab file, which provides all genomic features and related information in tab-delimited format, including PGFams information. For an example features.tab file, see: https://github.com/shakedna1/wspc_rep/blob/main/Data/Bacpacs/patric_files/1041522.28.PATRIC.features.tab

    *.txt file - Output file of the PATRIC annotation service for a new genome. For more details on the file and the annotation service, see the section "Obtain PATRIC Global Protein Families (PGFams) annotations for a newly sequenced genome" below.

  2. Merged input *.fasta file: A merged file in fasta format that contains a concatenation of the PGFams information, which can be extracted from a *.tab file using the field pgfam_id and from a *.txt file using the field "pgfam".

    For the exact format of the merged file, see: https://github.com/shakedna1/wspc_rep/blob/main/Data/train_genomes.fasta

    Example of the fasta file content:

    >1346.123
    PGF_10048015
    PGF_00062045
    PGF_00409415
    PGF_00766022
    PGF_02011026
    X
    X
    X
    PGF_07480521
    PGF_01162199
    PGF_03475877
    PGF_00876106
    PGF_06473395
    PGF_06429692
    PGF_00007012
    PGF_04788810
    
    • 1346.123 - genome name. The lines below the genome name list the genome's PGFam annotations in sequence. X represents a missing or un-annotated gene.

    Note that any protein family annotation IDs can be used, e.g., COGs, eggNOGs etc.
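
To make the merged file layout concrete, here is a minimal sketch of a reader for it. It is not WSPC's own parser (wspc.read_genomes is the package's entry point); read_merged_fasta is a hypothetical name, and the sketch assumes only what the example above shows: a '>' header line per genome followed by one annotation ID (or X) per line.

```python
def read_merged_fasta(lines):
    """Map each genome name to its ordered list of annotation IDs.

    Hypothetical helper for illustration; 'lines' is any iterable of
    text lines, such as an open file or str.splitlines().
    """
    genomes = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A header line starts a new genome record.
            current = line[1:].strip()
            genomes[current] = []
        elif current is None:
            raise ValueError("annotation line found before any '>' header")
        else:
            genomes[current].append(line)
    return genomes


example = """>1346.123
PGF_10048015
PGF_00062045
X
PGF_07480521
"""

genomes = read_merged_fasta(example.splitlines())
# Drop the X placeholders to count annotated genes only.
annotated = [p for p in genomes["1346.123"] if p != "X"]
print(len(genomes["1346.123"]), len(annotated))  # prints: 4 3
```

Because the reader does not inspect the IDs themselves, the same sketch applies when other protein family identifiers are used in place of PGFams.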

Obtain PATRIC Global Protein Families (PGFams) annotations for a newly sequenced genome:

PATRIC provides a Global Protein Families (PGFams) annotation service for new genomes. To generate a PGFams annotation file for a newly sequenced genome:

  1. Use PATRIC's Genome Annotation Service: https://patricbrc.org/app/Annotation.

    For detailed instructions, see the PATRIC Genome Annotation Service documentation: https://docs.patricbrc.org/user_guides/services/genome_annotation_service.html

  2. Download the resulting "Taxonomy name + label".txt file (click on view, then download; "Taxonomy name + label" is the genome name).

  3. If you wish to create a merged *.fasta file for a number of genomes, use the "pgfam" column to extract the PGFam IDs.
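
Step 3 can be sketched as follows. This is an illustration under stated assumptions, not WSPC's own conversion code: it assumes the downloaded *.txt files are tab-delimited with a header row containing a "pgfam" column, writes X for rows with an empty pgfam value, includes every row of the file, and uses the file name (without extension) as the genome name. The function names are hypothetical.

```python
import csv
from pathlib import Path


def pgfams_from_txt(path, column="pgfam", missing="X"):
    """Return the values of the pgfam column, in file order.

    Assumes a tab-delimited file with a header row (an assumption,
    not confirmed by the WSPC documentation).
    """
    with open(str(path), newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        if reader.fieldnames is None or column not in reader.fieldnames:
            raise KeyError("column %r not found in %s" % (column, path))
        # Empty or absent values become the missing-gene placeholder.
        return [(row[column] or "").strip() or missing for row in reader]


def write_merged_fasta(txt_paths, out_path):
    """Concatenate several annotation files into one merged fasta file."""
    with open(str(out_path), "w", encoding="utf-8") as out:
        for path in txt_paths:
            path = Path(path)
            # Genome name taken from the file name: an assumption.
            out.write(">" + path.stem + "\n")
            for pgfam in pgfams_from_txt(path):
                out.write(pgfam + "\n")
```

For example, write_merged_fasta(sorted(Path("annotations").glob("*.txt")), "merged.fasta") would merge all *.txt files in a hypothetical annotations directory; the result should be compared against the train_genomes.fasta example before use.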
