Generate macrocomplexes superimposing paired interacting elements.
Project description
Complex Constructor
Generate macrocomplexes superimposing paired interacting elements
protein - protein or protein - DNA
the result is stored in a PDB file
Table of Content
Why building complexes?
The total number of proteins in humans is around 20K. From this quantity we already know the structure of 4K proteins. However, for 6K we have good templates and for another 6K of them, we have reasonably good templates; that means we have more or less 50-70% of the structure of the human proteins covered.
On the other hand, the number of interactions that may occur between all the proteins is still unknown. There are some studies estimating 130-650K interactions, while others estimate more than millions. In this context, 120K interactions are confirmed from experiments and only 7K structures of complexes are known, which is a very small proportion of the total number of possible interactions anyway.
So, the complex-structure coverage, including homologs, is around the 30%, much less than the mentioned coverage of monomers. That means there is still a long way to go into the complex-structure study field.
The 3D structure of interacting proteins is necessary for them to develop their function, it is also fundamental for molecular recognition and contributes to the complexity of protein interaction networks too. However, identifying the structure of interacting proteins, complexes, is not an easy task. In monomers, homology modelling can be used to nearly cover all folds to determine a protein structure but, for interactions it is not enough.
Complex Constructor tries to generate macrocomplex structures. To do so, the superimposing technique is used: it receives a list of pdb files, each of these files contains the structure of an interacting pair and, by superimposing the common elements of different pairs, it builds the final structure. Although there are other methods to generate macrocomplexes, the superimposition is fast and effective. Furthermore, not only protein-protein interacting pairs can be analyzed, but also proteins with DNA, to end up generating a macrocomplex structure of proteins and DNA chains.
The core of Complex Constructor is the constructor
function, which is the responsible of the building process. The pdbs of the interacting pairs are analyzed and classified using the information of the FASTA file. The common chains in different pdb files are identified: they will have more than 99% of identity in their sequences. After this classification, constructor
begins to append elements to the macrocomplex: it starts with a pair, then it looks for another pair with a common element with the starting one and, if after superimposing the common element, the structure has not clashed, it appends the new element. Like this, the macrocomplex have now three elements. The next step will repeat the process looking for another pair, superimposing the common element and checking the possible clashes. The program is described here.
In addition, as a lot of complexes follow a determined stoichiometry, Complex Constructor can also make the construction following it. If a stoichiometry is defined, the final structure will contain the exact number of elements indicated by it.
Finally, a new pdb file with the resulting structure of the macrocomplex is stored in the output folder called outputFolder_model.pdb
.
Installation
Prerequisites:
- Python 3.0
https://www.python.org/download/releases/3.0/
Clone
- Clone this repo to your local machine using
https://github.com/Argonvi/SBI-PYT_Project.git
Setup
- Install and upgrade BioPython
https://biopython.org/wiki/Download
Using Python package management tool
pip
is easy
$ pip install biopython
$ pip install biopython --upgrade
Install from pip
- ComplexConstructor can also be installed with
pip
$ pip install complexconstructor
Options
ComplexConstructor can be run using command-line arguments or using the graphical interface.
Command-line
You can introduce the different arguments via command-line:
Mandatory arguments
To execute Complex Constructor three arguments are required:
-
-fa
--fasta
: FASTA file with the sequences of the proteins or DNA that will conform the complex. -
-pdb
--pdbDir
: directory containing the pdb files with the structure of the pairs that will conform the complex. -
-o
--output
: directory name where the complex results will be stored.
Note that, if the output directory already exists, the results will be overwritten.
Optional arguments
-
-v
--verbose
: show the detailed progression of the building process in a file called 'ComplexConstructor.log'. It will be stored in the output directory. -
-st
--stoichiometry
: File containing a determined stoichiometry of the complex. The information of the stoichiometry must be: the ID of the sequence chain (concordant with the FASTA file ID) followed by the number of times it has to be present in the complex after ' : '.ID_FASTA_file
:stoichiometry
One per line in format .txt. Take a look at some examples here.
Graphical interface
Otherwise, the macrocomplex can also be built using the graphical interface:
$ python3 complexconstructor -gui
In this case just the
-gui
tag is needed!
- To build the macrocomplex fill in the main window requirements. As for running the program via command-line, a FASTA file with the sequences and a directory with the pdb files of the interacting pairs are required. In addition, a name for the folder where the results will be stored is needed, after typing it you should confirm it.
Note that, to select the PDB directory you have to enter in the desired directory and then select it. You can see how it works in the examples.
Furthermore, additional options can be set:
-
In the main window you can specify if you want to create a log file where the process of the execution will be displayed. It will be stored in the output directory.
-
In the top menu, in the 'Options' dropdown, there is the 'Add Stoichiometry' option. You can upload a file with a determined stoichiometry to be applied to the macrocomplex. As said before, the format of this file has to be the ID of the sequence, concordant with the one in the FASTA file, followed by ':' and a number. This number will be the number of times the corresponging sequence will be in the final complex.
Finally, in the top menu you can consult the Complex Constructor 'Help' as well.
Examples
As said before, to generate any macrocomplex structure it is required the FASTA file and a direcory with paired structures in pdb format. All the data needed to execute the following examples is already located in the folder examples
. Tu run them, we need to be in the folder of the package.
$ cd ComplexConstructor
You can make sure you are in the correct folder with ls
command.
$ ls
If you obtain the following directories: assets
, complexconstructor
, examples
, together with other items, you are in the correct place!
Inside the folder
examples
there are all the examples we will describe in this tutorial in the corresponding directories:1gzx
,5fj8
, etc. Each of them contains the required files to run Complex Constructor together with a pdb file of the reference structure,exampleRef.pdb
, in order to check the superimposition between the constructed model and the real structure.
1GZX
Let's begin with the first example, the protein 1GZX. It is a small complex composed by two different aminoacid chains with stoichiometry two, so the final structure has four chains. To perform the construction of T state haemoglobin, stoichiometry 2A2B, we use the .txt file where the stoichiometry is explicited, in this case, 1gzx_st.txt
, already in the examples/1gzx
folder:
1GZXa:2
1GZXb:2
- Command line execution:
$ python3 complexconstructor -fa examples/1gzx/1gzx.fa -pdb examples/1gzx/1gzxDir -o 1GZX -st examples/1gzx/1gzx_st.txt -v
-fa
, mandatory: followed by the location of the FASTA file1gzx.fa
.
This file contain two IDs followed by the corresponding sequence, e.g.
1GZXa
,1GZXb
. Note that, the IDs in1gzx_st.txt
have to be concordant with them.
-pdb
, mandatory: followed by the location of directory with paired structures in pdb1gzxDir
.
In this case inside this folder we should have at least three pdb files, e.g.
1gzx_AB.pdb
,1gzx_AC.pdb
,1gzx_AD.pdb
. If there are redundant pairs they won't be considered.
-o
, mandatory: followed by the name given to the output directory where the results will be stored,1GZX
.
If a directory with the same name already exists in the working folder it will be overwritten.
-
-st
: followed by the stoychiometry information of the complex 2A2B that is in the file1gzx_st.txt
. -
-v
: turn ON the the verbose option. It is always recommended to create a logfile where the process information will be displayed. To deactivate the creation of the logfile, do not add the-v
flag.
- Graphical interface execution:
The resulting structure is stored in the directory 1GZX
, file 1GZX_model.pdb
.
Complex Constructor | Reference structure | Superimposition |
---|---|---|
We can observe that the resulting structure from Complex Constructor fits the reference downloaded from PDB quite well. The RMSD of the second chains of both model and reference, computed with ICM after supeimposing the first chains, is zero.
3KUY
3KUY is a complex composed by a DNA coil and a core made of protein chains. There are four aminoacid chains and one of nuclotides, all with stoichiometry two, which make a total of 10 chains. The procedure to run this example is the same as before. The data to construct the complex is inside the folder examples/3kuy
. Execution with command-line arguments:
$ python3 complexconstructor -fa examples/3kuy/3kuy.fa -pdb examples/3kuy/3kuyDir -o 3KUY -st examples/3kuy/3kuy_st.txt -v
The resulting structure in directory 3KUY
, file 3KUY_model.pdb
.
To execute this example, and the same for all the rest, with the graphical interface repeat the same process as for the previous example with the inputs of the folder
examples/3kuy
.
Complex Constructor | Reference structure | Superimposition |
---|---|---|
The whole complex is correctly constructed.
4R3O
A bigger case is the complex 4R3O, but just made of aminoacid chains. It is also symmetric and composed by 14 chains, stoichimetry two for all of them, a total of 28 chains in the complex. Its input data in examples/4r3o
.
$ python3 complexconstructor -fa examples/4r3o/4r3o.fa -pdb examples/4r3o/4r3oDir -o 4R3O -st examples/4r3o/4r3o_st.txt -v
The resulting structure in directory 4R3O
, file 4R3O_model.pdb
.
Complex Constructor | Reference structure | Superimposition |
---|---|---|
In this case, as well, the constructed model fits the reference structure.
5FJ8
The next complex is composed by aminoacid and nucleotide sequences but, in this case, it is not symmetric. It is composed by 20 chains, all just present once in the structure. The required inputs to the construction are in examples/5fj8
.
$ python3 complexconstructor -fa examples/5fj8/5fj8.fa -pdb examples/5fj8/5fj8Dir -o 5FJ8 -st examples/5fj8/5fj8_st.txt -v
The resulting structure in directory 5FJ8
, file 5FJ8_model.pdb
.
Complex Constructor | Reference structure | Superimposition |
---|---|---|
The model and the reference are superimposed with very little differences between both structures. In this particular case, the aminoacid chain Q, had several aminoacids labelled as 'unknown' in the pdb files and as 'X' in the FASTA sequence. To deal with this, we had to take out this aminoacids from input files, and so, the Q chain is partly constructed in the model. Nevertheless, the rest of the structure is correctly reproduced.
6GMH
Another non-symetric example, but also with DNA sequences, is 6GMH. It has 17 aminoacid chains and 3 DNA chains.
$ python3 complexconstructor -fa examples/6gmh/6gmh.fa -pdb examples/6gmh/6gmhDir -o 6GMH -st examples/6gmh/6gmh_st.txt -v
Resulting structure in directory 6GMH
, file 6GMH_model.pdb
.
Complex Constructor | Reference structure | Superimposition |
---|---|---|
In this case, the pdb strucure had bigger regions labelled as 'unknown'. As those regions are not recognized by Complex Constructor, we removed them from the input pdb pairs and from the FASTA sequences, as in the previous example. That is why the 'Q' chain is not completelly constructed: from the original number of 884 aminoacids, after the deletion of the 'unknowns', 584 remained. But, here as well, the correctly characterized aminoacids, and all the DNA sequences, are constructed in the right way as we can see in the image of the superimposition.
Enterovirus capsid
To make the prove with a complex with unknown structure we have run Complex Constructor with an enterovirus capsid, composed by three different aminoacid chains. To execute this example we provide an stoichiometry file with stoichiometry equal to 18 for all three chains. However, with a total of 54 chains the complex is not totally constructed.
The following image shows the construction of the complex with stoichiometry 32, a total of 96 chains. To perform the construction with so many chains, we had to modify the script: the program was able to generate a total of 54 chains, with letters A-Z and a-z, so the maximum stoichiometry for three chains was 18. In addition to this limitation, ICM and Chimera, allow a maximum of 99999 atoms to be represented, and the complex exceeded this number.
To solve both issues, we made modifications for this particular case to create two different pdb files, like this, we divided the pdb structure in two files ENTV_model_part1.pdb
and ENTV_model_part2.pdb
, to be able to repeat the chain letters in the second file and open both files with ICM or Chimera. These two pdb files are included in the directory examples/entv
.
Nevertheless, as the modifications were just perfomed to build this particular case, the stoichiometry file we have included to run this example, is the one that will work with the 'default' script, this is, stoichiometry equal to 18 for the three chains. Like this, the resulting pdb file will be partly constructed. To run the enterovirus capside 'reduced' example:
$ python3 complexconstructor -fa examples/entv/entv.fa -pdb examples/entv/entvDir -o ENTV -st examples/entv/entv_st.txt -v
The resulting structure is in directory ENTV
, file ENTV_model.pdb
, with the 'reduced' structure.
The 'extended' structure, conatined in ENTV_model_part1.pdb
and ENTV_model_part2.pdb
, is as follows:
Complex Constructor |
---|
We can see that, the structure is not jet completed, but with a third pbd this would be solved. The structure of the capside is correctly visualized.
Performance
- Before adding a new chain to the macrocomplex the number of clashes between the new chain and the previous structure is checked. The function
sequence_clashing
finds how many CA atoms from the new chain are closer than 2 angstroms to any other CA atom of the previous macrocomplex, this is, the number of clashes. If the number of clashes is above 20, the new chain won't be added to the macrocomplex.
Structure of the package
ComplexConstructor
-
main.py
: main module of Comple Constructor. It connects with all the rest of modules. -
utilities.py
: module with all the functions needed to run the construction of macrocomeplexes, asconstructor
orsuperimpositor
. -
argparser.py
: reads and organizes the command-line arguments. -
interface.py
: contains the configuration of the graphical interface and passes the inputs to the main module. -
logProgress.py
: this module creates the log file and its contents when the verbose option is selected. -
helpText.py
: it is a module with the help text to be added to the graphical interface 'Help' option.
Limitations
-
The main limitation of the program is the preparation of the input files. The FASTA file must have each ID just once, the repited sequences have to be erased. Furthermore, if the pairs of the pdb files are not correctly separated the program won't be able to identify the IDs of the FASTA file in the pdb files. Finally, for each ID, the sequences of the FASTA file must be highly similar to the one of the pdb, otherwise they won't be identified as the same.
-
The aminoacids labelled as 'unknown' in the pdb files and as 'X' in the fasta sequences, cannot be correctly identified so they should not be present in the input files. As a consequence, those chains are just partly constructed in the final model.
-
Although the similarity of chains from different pdb files is double checked, only the ones with sequence identity over 99%, the small differences between them may be propagated when building big structures. It could be partly solved if the similarity conditions were harder or with small structure corrections everytime a chain is added.
-
The refinement of the structures obtained would have been also very usefull.
-
We would have also liked to check the Z-scores of the resulting structures through the energy analisys to check the validity of the results.
Team
Paula Gomis Rosa | Arturo González Vilanova | Marta López Balastegui |
---|---|---|
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.