A Python tool for numerically encoding protein sequences based on position-specific scoring matrices (PSSMs)

Tutorial

Generating PSSM profiles

Use the helper function create_pssm_profile to generate PSSM profiles for protein sequences. Before calling it, complete the following steps.

  1. Download a BLAST database. For example, the uniref50 database can be downloaded from:

    http://ftp.ebi.ac.uk/pub/databases/uniprot/uniref/uniref50/

  2. Download the BLAST executables, preferably version 2.9.0, from the link below. The download includes the psiblast program, which creates the PSSM profiles, and the makeblastdb program, which is used in the next step.

    https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/

  3. Build a local BLAST database from the uniref50 FASTA file with the makeblastdb executable:

    $ makeblastdb -in uniref50.fasta -dbtype prot -out uniref50
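The same database build can also be scripted from Python. The following is a minimal sketch and not part of pssmpro: the helper names build_makeblastdb_command and run_makeblastdb are hypothetical, and the sketch assumes that makeblastdb is on the PATH (or that its full path is passed in) and that uniref50.fasta has already been downloaded and decompressed.

```python
import shutil
import subprocess


def build_makeblastdb_command(fasta_path, db_prefix, executable="makeblastdb"):
    """Return the makeblastdb argument list for a protein database."""
    return [executable, "-in", fasta_path, "-dbtype", "prot", "-out", db_prefix]


def run_makeblastdb(fasta_path, db_prefix, executable="makeblastdb"):
    """Run makeblastdb; raise if it cannot be found or exits with an error."""
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"{executable} not found")
    command = build_makeblastdb_command(fasta_path, db_prefix, executable)
    # check=True raises CalledProcessError on a non-zero exit status
    subprocess.run(command, check=True)


# Example call (builds the database files with the prefix "uniref50"):
# run_makeblastdb("uniref50.fasta", "uniref50")
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when paths contain spaces.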

For further information, see:

https://quickgrid.blogspot.com/2018/10/Python-Sub-Process-Local-Psi-Blast-PSSM-Generation-from-FASTA-in-Directory-using-Uniref50-Database-in-Pycharm.html

Once these steps are complete and an indexed BLAST database exists on your local machine, the create_pssm_profile function can be used. It takes the following arguments:

  1. A comma-separated protein sequence file in which each line contains the name of a protein followed by its sequence, separated by a comma.

  2. The output directory in which to store the PSSM profiles.

  3. The path to the psiblast executable downloaded in step 2 above.

  4. The prefix of the indexed BLAST database files created with makeblastdb in step 3 above.

  5. The number of cores to use while creating the PSSM profiles.
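The sequence file described in the first argument can be produced with a few lines of Python. This is a minimal sketch that assumes the file has no header row and no quoting, which the description above does not state. The helper write_sequence_file is hypothetical and not part of pssmpro, and the protein names and sequences are made-up placeholders.

```python
def write_sequence_file(records, path):
    """Write (name, sequence) pairs as one 'name,sequence' line each."""
    with open(path, "w") as handle:
        for name, sequence in records:
            # A comma in the name would make the line ambiguous
            if "," in name:
                raise ValueError(f"protein name must not contain a comma: {name!r}")
            handle.write(f"{name},{sequence.strip().upper()}\n")


# Made-up example records, for illustration only
example_records = [
    ("protein_1", "MKTAYIAKQRQISFVKSHFSRQ"),
    ("protein_2", "MSTNPKPQRKTKRNTNRRPQDV"),
]

# Example call:
# write_sequence_file(example_records, "./pssmpro_test_data/test_seq.csv")
```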

# Usage example

from pssmpro.features import create_pssm_profile

# The comma separated protein sequence file
protein_sequence_file = "./pssmpro_test_data/test_seq.csv"
# Output directory where the pssm profiles will be stored
output_dir = "./pssmpro_test_data/pssm_profiles/"
# the path to the psiblast program executable downloaded as part of the blast program suite 
psiblast_executable_path = "/opt/aci/sw/ncbi-rmblastn/2.9/0_gcc-8.3.1-bxy/bin/psiblast"
# prefix of the indexed blast database files created using makeblastdb
blast_db_prefix = "./pssmpro_test_data/uniref50/uniref50db"
# number of cores to be used while creating the pssm profiles
number_of_cores = 8


create_pssm_profile(protein_sequence_file, output_dir, psiblast_executable_path,
                    blast_db_prefix, number_of_cores)
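Because create_pssm_profile depends on an external executable and a local database, it can help to verify both before starting a long run. The check below is a sketch under stated assumptions and not part of pssmpro: check_blast_setup is a hypothetical helper, and it assumes that makeblastdb wrote protein index files ending in .pin next to the given prefix (large databases are split into volumes such as prefix.00.pin).

```python
import glob
import os


def check_blast_setup(psiblast_executable_path, blast_db_prefix):
    """Return a list of problems found; an empty list means the setup looks usable."""
    problems = []
    if not os.path.isfile(psiblast_executable_path):
        problems.append(f"psiblast not found: {psiblast_executable_path}")
    elif not os.access(psiblast_executable_path, os.X_OK):
        problems.append(f"psiblast is not executable: {psiblast_executable_path}")
    # Matches both prefix.pin and multi-volume files such as prefix.00.pin
    if not glob.glob(glob.escape(blast_db_prefix) + "*.pin"):
        problems.append(f"no protein index (*.pin) found for prefix: {blast_db_prefix}")
    return problems
```

Calling check_blast_setup(psiblast_executable_path, blast_db_prefix) with the variables from the usage example and printing any returned messages gives an early warning about a wrong path or prefix.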

Generating PSSM features

pssmpro provides 21 features that numerically encode protein sequences from their PSSM profiles. They are:

  1. aac_pssm
  2. aadp_pssm
  3. aatp
  4. ab_pssm
  5. d_fpssm
  6. dp_pssm
  7. dpc_pssm
  8. edp
  9. eedp
  10. k_separated_bigrams_pssm
  11. medp
  12. pse_pssm
  13. pssm_ac
  14. pssm_cc
  15. pssm_composition
  16. rpm_pssm
  17. rpssm
  18. s_fpssm
  19. smoothed_pssm
  20. tpc_pssm
  21. tri_gram_pssm
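To give a sense of what such an encoding does, the sketch below averages each PSSM column over all sequence positions, which is how the AAC-PSSM feature is usually described in the POSSUM literature. It is an illustration only: column_means is a hypothetical helper, the matrix is a toy example, and the aac_pssm implementation in pssmpro may scale or normalise the values differently.

```python
def column_means(pssm_rows):
    """Average each PSSM column over all sequence positions (rows)."""
    if not pssm_rows:
        raise ValueError("PSSM has no rows")
    length = len(pssm_rows)
    width = len(pssm_rows[0])
    if any(len(row) != width for row in pssm_rows):
        raise ValueError("all PSSM rows must have the same number of columns")
    return [sum(row[j] for row in pssm_rows) / length for j in range(width)]


# Toy matrix with 3 positions and only 4 columns; a real PSSM has 20 columns
toy_pssm = [
    [2, -1, 0, 4],
    [0, -3, 2, 2],
    [4, 1, 1, 0],
]
print(column_means(toy_pssm))  # prints [2.0, -1.0, 1.0, 2.0]
```

The result has one value per column, so sequences of different lengths are mapped to vectors of the same size.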

For a detailed description of the features, refer to the Supplementary Documents of the paper (link to be added).

# Usage example

# get_feature and get_all_features are assumed to be importable from
# pssmpro.features, like create_pssm_profile above
from pssmpro.features import get_feature, get_all_features

# To create any one of the 21 features, use the "get_feature" function
pssm_dir_path = "./pssmpro_test_data/pssm_profiles/"
feature_type = "aac_pssm"
output_dir_path = "./pssmpro_test_data/features/"

get_feature(pssm_dir_path, feature_type, output_dir_path)

# To create all 21 features at once, use the "get_all_features" function
get_all_features(pssm_dir_path, output_dir_path)

NB: pssmpro is based on POSSUM. The code has been adapted to work with Python 3 and above.

Similar modules to encode protein sequences

Other modules that can be used to generate numerical encodings of protein sequences are:

  1. ngrampro
  2. ifeatpro
