A Python tool for numerically encoding protein sequences based on position-specific scoring matrices (PSSMs)
Tutorial
Generating PSSM profiles
To generate PSSM profiles for protein sequences, the helper function create_pssm_profile can be used. However, before using the function, the steps mentioned below must be followed.
1. Download a BLAST database. For example, the uniref50 database can be downloaded using this link
2. Download the BLAST executables using this link, preferably version 2.9.0. The download includes the psiblast program, which is used to create the PSSM profiles, and the makeblastdb program, which is used in the next step.
3. Make a local BLAST database from the uniref50 FASTA file using the makeblastdb executable. The following command can be used for that purpose:
$ makeblastdb -in uniref50.fasta -dbtype prot -out uniref50
For further information, refer here:
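The database-building step can also be scripted from Python. The following is a minimal sketch, assuming makeblastdb is available on the PATH (or that its full path is passed in); the helper names build_makeblastdb_command and make_blast_db are hypothetical and are not part of pssmpro.

```python
import subprocess


def build_makeblastdb_command(fasta_path, db_prefix, executable="makeblastdb"):
    """Return the argument list for the makeblastdb call used in this tutorial.

    Hypothetical helper, not part of pssmpro.
    """
    return [executable, "-in", fasta_path, "-dbtype", "prot", "-out", db_prefix]


def make_blast_db(fasta_path, db_prefix, executable="makeblastdb"):
    """Run makeblastdb; raises subprocess.CalledProcessError on a non-zero exit."""
    command = build_makeblastdb_command(fasta_path, db_prefix, executable)
    subprocess.run(command, check=True)
```

Calling make_blast_db("uniref50.fasta", "uniref50") would then run the same command as the shell line in step 3. Passing the arguments as a list avoids shell quoting problems when paths contain spaces.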
Once the above steps have been followed and there is an indexed blast database on your local machine, the create_pssm_profile function can be used. It requires the following arguments:
- A comma-separated protein sequence file in which each line contains the name of the protein followed by its sequence, separated by a comma.
- The output directory where the PSSM profiles should be stored.
- The path of the psiblast program executable downloaded in step 2 described above.
- The path prefix of the indexed BLAST database files created in step 3 described above.
- The number of cores to be used while creating the PSSM profiles.
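As an illustration of the expected input file, the sketch below writes name,sequence pairs to a CSV file, one protein per line. It assumes the file has no header row, which the description above does not state explicitly; the helper write_sequence_csv and the example records are made up for illustration and are not part of pssmpro.

```python
import csv


def write_sequence_csv(records, csv_path):
    """Write (name, sequence) pairs to csv_path, one "name,sequence" line each.

    Hypothetical helper, not part of pssmpro. Assumes no header row is needed.
    """
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        for name, sequence in records:
            writer.writerow([name, sequence])


# Made-up records, for illustration only
example_records = [("protein_1", "MKTAYIAKQR"), ("protein_2", "MSDNELLQKA")]
```

Calling write_sequence_csv(example_records, "test_seq.csv") produces a two-line file in which each line holds a protein name and its sequence separated by a comma.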
# Usage example
from pssmpro.features import create_pssm_profile
# The comma separated protein sequence file
protein_sequence_file = "./pssmpro_test_data/test_seq.csv"
# Output directory where the pssm profiles will be stored
output_dir = "./pssmpro_test_data/pssm_profiles/"
# the path to the psiblast program executable downloaded as part of the blast program suite
psiblast_executable_path = "/opt/aci/sw/ncbi-rmblastn/2.9/0_gcc-8.3.1-bxy/bin/psiblast"
# prefix of the indexed blast database files created using makeblastdb
blast_db_prefix = "./pssmpro_test_data/uniref50/uniref50db"
# number of cores to be used while creating the pssm profiles
number_of_cores = 8
create_pssm_profile(protein_sequence_file, output_dir, psiblast_executable_path,
                    blast_db_prefix, number_of_cores)
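Because profile generation depends on several local files, it can help to check them before starting a long run. The sketch below is one possible pre-flight check, not something pssmpro provides: the function name missing_prerequisites is hypothetical, and the index file extensions it looks for (.pin for a single-volume protein database, .pal for a database split into volumes) are an assumption about what makeblastdb wrote on your system.

```python
import os


def missing_prerequisites(sequence_file, psiblast_path, blast_db_prefix):
    """Return a list of problems found; an empty list means all checks passed.

    Hypothetical helper, not part of pssmpro.
    """
    problems = []
    if not os.path.isfile(sequence_file):
        problems.append("sequence file not found: " + sequence_file)
    if not os.path.isfile(psiblast_path):
        problems.append("psiblast executable not found: " + psiblast_path)
    # Assumption: a protein database has <prefix>.pin (single volume)
    # or <prefix>.pal (alias file for a multi-volume database).
    index_found = any(
        os.path.isfile(blast_db_prefix + extension) for extension in (".pin", ".pal")
    )
    if not index_found:
        problems.append("no BLAST protein index found for prefix: " + blast_db_prefix)
    return problems
```

If the returned list is empty, the arguments point at existing files and create_pssm_profile can be called as in the usage example above.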
Generating PSSM features
pssmpro provides 21 feature types that numerically encode protein sequences from their PSSM profiles. They are:
- aac_pssm
- aadp_pssm
- aatp
- ab_pssm
- d_fpssm
- dp_pssm
- dpc_pssm
- edp
- eedp
- k_separated_bigrams_pssm
- medp
- pse_pssm
- pssm_ac
- pssm_cc
- pssm_composition
- rpm_pssm
- rpssm
- s_fpssm
- smoothed_pssm
- tpc_pssm
- tri_gram_pssm
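To give a sense of what such an encoding looks like, the sketch below averages each column of a PSSM over all sequence positions, which is how AAC-PSSM is commonly described in the literature. It is an illustration only: the function name column_means and the toy matrix are made up, and pssmpro's own aac_pssm implementation may apply additional scaling or differ in detail.

```python
def column_means(pssm_rows):
    """Average each PSSM column over all sequence positions.

    pssm_rows holds one row of scores per residue; every row has the same
    length (20 columns for a standard PSSM). Illustrative only, not pssmpro's code.
    """
    if not pssm_rows:
        raise ValueError("pssm_rows must contain at least one row")
    n_rows = len(pssm_rows)
    return [sum(column) / n_rows for column in zip(*pssm_rows)]


# Toy 3-residue matrix with 4 columns instead of 20, to keep the example short
toy_pssm = [
    [1, -2, 0, 3],
    [3, 0, -1, 1],
    [2, -1, 1, -1],
]
print(column_means(toy_pssm))  # [2.0, -1.0, 0.0, 1.0]
```

The result has one value per PSSM column regardless of how many residues the protein has, which is what makes encodings of this kind usable as fixed-length inputs.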
For a detailed description of the features, refer to the Supplementary Documents of the paper (link to be added).
NB: pssmpro is based on POSSUM. The code has been adapted to work with Python version 3 and above.
Other similar modules to encode protein sequences
Other modules that can be used to generate numerical encoding of protein sequences are:
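# Usage example
from pssmpro.features import get_feature, get_all_features
# To create any one of the 21 features, use the "get_feature" function
# Directory containing the pssm profiles created with create_pssm_profile
pssm_dir_path = "./pssmpro_test_data/pssm_profiles/"
# Name of the feature to compute (one of the 21 listed above)
feature_type = "aac_pssm"
# Output directory where the features will be stored
output_dir_path = "./pssmpro_test_data/features/"
get_feature(pssm_dir_path, feature_type, output_dir_path)
# To create all 21 features at once, use the "get_all_features" function
get_all_features(pssm_dir_path, output_dir_path)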
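Since the feature type is passed as a plain string, a small check against the 21 names listed above can catch typos before any computation starts. This is a sketch under that assumption; FEATURE_TYPES and check_feature_type are hypothetical names and are not exported by pssmpro.

```python
# The 21 feature names listed in this tutorial
FEATURE_TYPES = (
    "aac_pssm", "aadp_pssm", "aatp", "ab_pssm", "d_fpssm", "dp_pssm",
    "dpc_pssm", "edp", "eedp", "k_separated_bigrams_pssm", "medp",
    "pse_pssm", "pssm_ac", "pssm_cc", "pssm_composition", "rpm_pssm",
    "rpssm", "s_fpssm", "smoothed_pssm", "tpc_pssm", "tri_gram_pssm",
)


def check_feature_type(feature_type):
    """Return feature_type unchanged if it is a known name, else raise ValueError.

    Hypothetical helper, not part of pssmpro.
    """
    if feature_type not in FEATURE_TYPES:
        raise ValueError(
            "unknown feature type %r; expected one of: %s"
            % (feature_type, ", ".join(FEATURE_TYPES))
        )
    return feature_type
```

A call such as get_feature(pssm_dir_path, check_feature_type("aac_pssm"), output_dir_path) then fails early with a readable message if the name is misspelled.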