A small package to deal with HHSearch files.
hhsearch-python
Author: Tim. D
Current version: 1.12 - Python 3.7
This small package handles the data output by the HHSearch software suite. It was tested with output from HHSearch version 1.5. The idea and initial draft of the project originate from Dr. Schmidt; it was carried out as the final assignment for one of his university modules.
HHsearch is a software suite for detecting remote homologues of proteins and generating high-quality alignments for homology modeling and function prediction.
HH-Suite Github | Quick Guide to HHSearch
Installation
You can install this package with pip:
pip install hhsearch-python
Requirements
For full functionality, you also need the following packages:
pandas==0.23.4
matplotlib==3.0.2
numpy==1.15.4
Pillow==6.0.0
pymol==0.1.0
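Everything except PyMol can easily be installed through pip install. PyMol needs to be installed separately (see the platform documentation below) and also installed through pip so that it can be used in your regular Python environment.
pip installation:
pip install -c schrodinger pymol
Note that -c schrodinger is conda's channel syntax, so if the command above fails, conda install -c schrodinger pymol may be what is needed.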
PyMol Version | Documentation |
---|---|
MAC | https://pymolwiki.org/index.php/MAC_Install |
Windows | https://pymolwiki.org/index.php/Windows_Install |
Linux | https://pymolwiki.org/index.php/Linux_Install |
Wrapper - Jupyter Notebook
A wrapper with a UI for the whole module is provided as a Jupyter Notebook.
Open the notebook from this repo in Jupyter, keep your .hhm and .hhs files in subfolders somewhere in the same folder, and install the module with pip install hhsearch-python. The notebook is largely self-explanatory and exposes almost all options of the module's functions through a UI.
Recommended if you like automation and the simplicity of a UI.
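Before opening the notebook, it can help to check which .hhs and .hhm files actually sit below the working folder. The following is only a minimal sketch using the standard library; find_hhsearch_files and missing_hhm are hypothetical helpers and not part of hhsearch-python. It assumes that a query's .hhm file shares the file stem of its .hhs file, as with d1e0ta1 in the examples further down.

```python
from pathlib import Path


def find_hhsearch_files(root="."):
    """Collect .hhs and .hhm files below root, keyed by file stem."""
    root = Path(root)
    found = {"hhs": {}, "hhm": {}}
    for suffix in found:
        # rglob searches the folder itself and all of its subfolders.
        for path in sorted(root.rglob(f"*.{suffix}")):
            found[suffix][path.stem] = path
    return found


def missing_hhm(found):
    """Return the stems that have an .hhs file but no matching .hhm file."""
    return sorted(set(found["hhs"]) - set(found["hhm"]))


if __name__ == "__main__":
    files = find_hhsearch_files(".")
    print("hhs files:", sorted(files["hhs"]))
    print("hhs files without an hhm counterpart:", missing_hhm(files))
```

An empty second list means every query found has a profile file next to it somewhere in the folder tree.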
Functionalities
Broad information about Query & Hit
This package contains a handful of functions that can be used to generate well-organized (and visualized) output. For all of this to work properly, all required .hhm and .hhs files must be located somewhere in your current working directory.
# Let's first import all functions from the module.
from hhsearch_python import *
hhs_file = "data/hhs/d1e0ta1.hhs" # path to your .hhs file.
# first, we can use extract_HHSearch_data() to extract the whole HHSearch statistics into a pandas.DataFrame.
hhs_hits_statistics = extract_HHSearch_data(hhs_file)
However, we also want general information about the query itself, as well as about selected hits. For that we can use two separate functions: extract_HHSearch_main for the query .hhs file, and get_alignment_term for a selected hit of the previously created pandas.DataFrame.
query_dict = extract_HHSearch_main(hhs_file)
# As an example, this is what the dict() output looks like:
print(query_dict)
>> {'Query': 'Query d1e0ta1 b.58.1.1 (A:70-167) Pyruvate kinase (PK) {Escherichia coli [TaxId: 562]}',
'pdb_id': '1e0t',
'alignment_term': '/1e0t//A/70-167/CA',
'full_term': '/1e0t//A//CA',
'file_name': 'd1e0ta1'
}
# alignment_term is needed for a proper PyMol alignment later down the road,
# as well as full_term, ignoring the specific residues.
# Let's get information about the second hit of the statistics from the .hhs file.
hit_dict = get_alignment_term(hhs_hits_statistics, 2)
print(hit_dict)
>> {'pdb_id': '2vgb',
'alignment_term': '/2vgb//A/160-261/CA',
'full_term': '/2vgb//A//CA',
'file_name': 'd2vgba1'}
# Except for the key "Query", get_alignment_term() returns a dict() with the same structure as extract_HHSearch_main().
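The alignment_term and full_term values follow PyMol's selection macro layout /object/segment/chain/residues/atom-name. As a rough illustration of how such strings are put together, here is a small sketch; build_terms is a hypothetical helper written for this example and is not a function of hhsearch-python.

```python
def build_terms(pdb_id, chain, start, end, atom="CA"):
    """Build PyMol-style selection strings for a residue span and for the whole chain."""
    # Span-restricted selection, e.g. /1e0t//A/70-167/CA
    alignment_term = f"/{pdb_id}//{chain}/{start}-{end}/{atom}"
    # Same selection without the residue range, e.g. /1e0t//A//CA
    full_term = f"/{pdb_id}//{chain}//{atom}"
    return {
        "pdb_id": pdb_id,
        "alignment_term": alignment_term,
        "full_term": full_term,
    }


terms = build_terms("1e0t", "A", 70, 167)
print(terms["alignment_term"])  # /1e0t//A/70-167/CA
print(terms["full_term"])       # /1e0t//A//CA
```

The empty field between the two slashes is the segment identifier, which is left blank in both terms.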
Colorized Alignments - HTML formatted
Having selected the second hit as our target of choice, we now want more information about the alignment itself, so we extract the actual alignment with get_full_alignment. It takes two arguments: the .hhs file of the query and the number of the hit within the .hhs file (the same hit number that is passed to get_alignment_term). Preferably, look at the previously created pandas.DataFrame hhs_hits_statistics and choose a hit of interest from it.
# This also creates an HTML-formatted file in a separate folder - /alignments_highlighted/<query>/<NoX-name>.html
# and stores the same file as alignment.html in a folder called /lastrun/, for your convenience.
alignment_of_interest = get_full_alignment(hhs_file, 2)
The HTML formatted output looks like the example below. As you can see, helices and sheets are colorized.
If you want this formatting applied to the whole .hhs file, use the function highlight_hhs_full(hhs_file) with the path of the desired .hhs file as its argument. It returns the given .hhs file as a colorized, HTML-formatted string and also stores it in a separate folder as /alignments_highlighted/<query-name>_full.html, as well as in the /lastrun folder under the filename hhs_full_colorized.html.
# outputs the whole .hhs file colorized in the above-shown pattern.
full_hhs_colorized = highlight_hhs_full(hhs_file)
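To give an idea of what colorizing helices and sheets in HTML can look like, here is a small sketch of the general technique. It is not the package's implementation: colorize_ss and the chosen colors are hypothetical, and the sketch assumes a secondary-structure string in which H marks helix residues and E marks sheet residues.

```python
from html import escape
from itertools import groupby

# Hypothetical palette; the colors used by the package may differ.
COLORS = {"H": "red", "E": "blue"}


def colorize_ss(ss_line):
    """Wrap runs of helix (H) and sheet (E) characters in colored <span> tags."""
    parts = []
    # groupby yields consecutive runs of the same (upper-cased) character.
    for state, run in groupby(ss_line, key=str.upper):
        text = escape("".join(run))
        color = COLORS.get(state)
        if color:
            parts.append(f'<span style="color:{color}">{text}</span>')
        else:
            parts.append(text)  # coil and everything else stays uncolored
    return "".join(parts)


print(colorize_ss("CCHHHHCEEC"))
# CC<span style="color:red">HHHH</span>C<span style="color:blue">EE</span>C
```

Grouping whole runs instead of single characters keeps the generated HTML short, which matters for long alignments.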
PyMol Alignments - Visualization | Animation
Having alignments organized and colorized is useful, but we also want a more visual representation of the chosen alignment. For that, we can take the information from the previously created dictionaries query_dict and hit_dict and pass it as arguments to the function pymol_alignment(). This function also returns the RMSD of atomic positions in ångström.
# Building up the information from the query.
pdb_1 = query_dict.get("pdb_id")
aln_term_1 = query_dict.get("alignment_term")
full_term_1 = query_dict.get("full_term")
# Building up the needed information from the chosen hit.
pdb_2 = hit_dict.get("pdb_id")
aln_term_2 = hit_dict.get("alignment_term")
full_term_2 = hit_dict.get("full_term")
# Also returns the RMSD values for the alignment.
rmsd = pymol_alignment(pdb_1,
pdb_2,
aln_term_1,
aln_term_2,
full_term_1,
full_term_2)
print(rmsd)
>> (0.8026888370513916, 85, 5, 1.2078778743743896, 98, 160.0, 98)
# In this example, the RMSD is about 0.803 Å over 85 C-alpha atoms.
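The seven values in the printed tuple have the same shape as the return value of PyMol's cmd.align, as described on the PyMOLWiki. The sketch below gives them names under the assumption that pymol_alignment() passes that tuple through unchanged; AlignResult and its field names are introduced here for illustration only.

```python
from collections import namedtuple

# Field order as documented for PyMol's cmd.align on the PyMOLWiki.
# Assumption: pymol_alignment() returns that tuple without modification.
AlignResult = namedtuple(
    "AlignResult",
    [
        "rmsd_after",         # RMSD after refinement
        "atoms_after",        # aligned atoms after refinement
        "refinement_cycles",  # number of refinement cycles
        "rmsd_before",        # RMSD before refinement
        "atoms_before",       # aligned atoms before refinement
        "raw_score",          # raw alignment score
        "residues_aligned",   # number of residues aligned
    ],
)

# The tuple printed in the example above.
result = AlignResult(0.8026888370513916, 85, 5, 1.2078778743743896, 98, 160.0, 98)
print(f"{result.rmsd_after:.3f} Å over {result.atoms_after} atoms "
      f"after {result.refinement_cycles} refinement cycles")
# 0.803 Å over 85 atoms after 5 refinement cycles
```

Read this way, refinement dropped 13 of the 98 atoms and lowered the RMSD from about 1.208 Å to about 0.803 Å.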
This creates two images, a no_zoom.pse file that can be opened with PyMol, and the necessary .cif files of the PDB entries, which are stored in a separate folder called /cif/.
About the pictures: one is zoomed in on the area of aln_term_1, which is /1e0t//A/70-167/CA in our example, showing the area of interest; the other is a non-zoomed picture of /1e0t//A//CA in our example.
These images are stored in the /lastrun/ folder, as well as in the folder /PyMol_img/<pdb_1>/<pdb_1>-<pdb_2>/.
Zoom | No-Zoom |
---|---|
However, pymol_alignment can also output an animated picture instead of just static pictures, with an optional frame multiplier, which needs to be an integer up to 4. This option takes much more time to process but gives a nicer output: each step of the frame multiplier roughly doubles the time needed to create the 360° view of the model. The frames are stored in a subdirectory /animation of the lastrun/ folder, alongside the animated GIF, as well as in the separate folder PyMol_img/<pdb_1>/<pdb_1>-<pdb_2>/animation/<framemultiplier>, while the animated GIF there is stored one level up, in the /animation folder.
# as an example we will create an animated gif with the frame multiplier of 4
pymol_alignment(pdb_1,
pdb_2,
aln_term_1,
aln_term_2,
full_term_1,
full_term_2,
animation = True,
framemultiplier= 4)
Be aware that with each run, the lastrun folder's animation subfolder is cleared, so there is no confusion when one run uses the animation feature and the next one does not.
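Because that subfolder is cleared on every run, you may want to copy its content elsewhere before starting the next one. A minimal sketch with the standard library follows; keep_animation is a hypothetical helper, and it assumes the lastrun folder sits in the current working directory under the name lastrun.

```python
import shutil
from pathlib import Path


def keep_animation(target, lastrun="lastrun"):
    """Copy <lastrun>/animation to target so that a later run cannot clear it."""
    source = Path(lastrun) / "animation"
    if not source.is_dir():
        return None  # nothing to keep, e.g. the last run did not use animation
    target = Path(target)
    # copytree raises FileExistsError if target already exists,
    # which prevents overwriting an earlier backup by accident.
    shutil.copytree(source, target)
    return target
```

A call such as keep_animation("saved_animations/1e0t-2vgb") right after pymol_alignment(..., animation = True) would preserve the frames and the GIF; the target path here is only an example.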
# Example for animation = True, framemultiplier = 4 of our example
Barplots of chosen spans
Finally, we want to create a barplot of the amino acid frequencies within our query, based on the HMMs, as well as in our chosen hit. For that, we first need to extract the frequencies from the .hhm file. This gives us a pandas.DataFrame with all frequencies normalized to one, calculated according to the information in the HH-Suite wiki.
Frequency calculation:
entry = -1000 * log_2(frequency)
frequency = 2^(-entry/1000)
# First we need to set the path of the two hhm files. Luckily, we stored the file_names before.
query_filename = f'data/hhm/{query_dict.get("file_name")}.hhm'
hit_filename = f'data/hhm/{hit_dict.get("file_name")}.hhm'
query_frequencies = read_in_frequencies(query_filename)
hit_frequencies = read_in_frequencies(hit_filename)
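The frequency calculation stated above can be sketched in a few lines. This is not the code behind read_in_frequencies; entry_to_frequency and frequency_to_entry are hypothetical names, and treating "*" as a frequency of zero is an assumption about the .hhm format.

```python
import math


def entry_to_frequency(entry):
    """Convert an .hhm table entry to a frequency: 2^(-entry/1000)."""
    if entry == "*":
        return 0.0  # assumption: "*" marks a frequency of zero
    return 2 ** (-int(entry) / 1000)


def frequency_to_entry(frequency):
    """Inverse direction: -1000 * log_2(frequency), rounded to an integer."""
    if frequency == 0:
        return "*"
    return round(-1000 * math.log2(frequency))


print(entry_to_frequency(1000))  # 0.5
print(entry_to_frequency("*"))   # 0.0
print(frequency_to_entry(0.25))  # 2000
```

Smaller entries therefore mean higher frequencies: an entry of 0 corresponds to a frequency of 1, and every additional 1000 halves it.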
The resulting DataFrame of frequencies looks like this:
Pos | AS | A | C | D | E | F | G | (...) |
---|---|---|---|---|---|---|---|---|
1 | M1 | 0.030019 | 0.000000 | 0.004325 | 0.014670 | 0.037111 | 0.012379 | (...) |
(...) | (...) | (...) | (...) | (...) | (...) | (...) | (...) | (...) |
However, having the frequencies is one thing; we also want to visualize them. For that, one can use the plot_frequencies function. It takes seven arguments in total, of which only one is required: the previously created pandas.DataFrame of frequencies. If desired, the name of the created subfolder barplots/<name> can be changed. I personally recommend using the file names from query_dict and hit_dict, via query_dict.get("file_name") and hit_dict.get("file_name"). The threshold is the minimal frequency an amino acid has to reach to end up in the plot; something around 0.1, which equals 10%, is recommended. Next, we need to set span_start and span_end for our plot. As an example, we pick the 1st residue as the start and the 50th as the end of the span. The filename is the name under which the file is stored in the /lastrun folder. One can also add a title to the plot; however, I personally dislike this option, since it disturbs the cleaner look. Depending on the span you choose, this process can take a while.
plot_frequencies(query_frequencies, # the pd.DataFrames of the frequencies
name = query_dict.get("file_name"), # desired output name.
threshold = 0.1, # 10% threshold
span_start = 1, # span starting @ 1
span_end = 50, # span ending @ 50
filename = "query_barplot.png",
title = False
)
plot_frequencies(hit_frequencies, # the pd.DataFrames of the frequencies
name = hit_dict.get("file_name"), # desired output name.
threshold = 0.1, # 10% threshold
span_start = 1, # span starting @ 1
span_end = 50, # span ending @ 50
filename = "hit_barplot.png",
title = False
)
Query : d1e0ta1, 1-50, min. 10% | Hit : d2vgba1, 1-50, min. 10% |
---|---|
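To make the effect of threshold, span_start and span_end more tangible, here is a plain-Python sketch of the selection step. It is not how plot_frequencies is implemented: select_for_plot is a hypothetical helper, the input values are made up for illustration, and the sketch assumes that both ends of the span are inclusive and that a frequency equal to the threshold is kept.

```python
def select_for_plot(rows, threshold=0.1, span_start=1, span_end=50):
    """Keep, per position inside the span, the amino acids at or above the threshold."""
    selected = {}
    for pos, freqs in rows.items():
        if span_start <= pos <= span_end:  # assumption: inclusive span
            selected[pos] = {aa: f for aa, f in freqs.items() if f >= threshold}
    return selected


# Toy input: position -> {amino acid: frequency}; values are invented for this example.
rows = {
    1: {"A": 0.05, "M": 0.80, "L": 0.15},
    2: {"G": 0.50, "S": 0.45, "T": 0.05},
    60: {"K": 1.00},
}
print(select_for_plot(rows))
# {1: {'M': 0.8, 'L': 0.15}, 2: {'G': 0.5, 'S': 0.45}}
```

Position 60 is dropped because it lies outside the span 1-50, and A and T are dropped because they stay below 10%.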
Contact Information:
Telegram | |
---|---|