Direct couplings analysis (DCA) for protein and RNA sequences
Project description
About pydca
pydca is a Python implementation of direct coupling analysis (DCA) of residue coevolution for protein and RNA sequence families using the mean-field approximation algorithm. Given multiple sequence alignment (MSA) files in FASTA format, pydca computes the coevolutionary scores of pairs of sites in the alignment. In addition, when an optional file containing a reference sequence is supplied, scores corresponding to pairs of sites of this reference sequence are computed by mapping the reference sequence to the MSA. Furthermore, pydca provides commands to compute the parameters of the energy function, compare and visualize contact map or true positive rates of DCA predicted residue pairs.
Prerequisites
- Python3, version 3.5 or latter.
- python3-tk
Installing
Here we show installation of pydca in a Linux (Ubuntu) machine. A similar procedure is used for other Linux/Unix machines as well as Mac or Windows platforms.
Installing python3-tk
$ sudo apt install python3-tk
Create a virtual environment for Python3
pydca depends on python based libraries, e.g., numpy, scipy, matplotlib, or biopython. To avoid conflicts that can arise between different versions of libraries, its recommended that we install pydca separate from other global installations. Here, we use a separate virtual environment using pipenv. To install pipenv use use pip. Thus, first we need to install pip for Python3.
$ sudo apt install python3-pip
Now we can install pipenv as follows
$ pip3 install pipenv
Once pipenv is installed we can install pydca using pipenv. To demonstrate this, lets create a
directory called demo, you can call it another name. We will create this directory in current user's home directory. Then, cd to this directory and install pydca using pipenv.
$ mkdir demo && cd demo
We can see our current directory's path using pwd command.
$ pwd
$ /home/username/demo
Now we install pydca
$ pipenv install pydca
At this time, pipenv creates a virtual environment for us and installs pydca in this newly created virtual environment. After installation is completed pipenv has created two files in
the directory /home/username/demo. We can see them as
$ ls
Pipfile Pipfile.lock
These files contain metadata about the virtual environment. Now we can activate the virtual environment
$ pipenv shell
Launching subshell in virtual environment…
$ . /home/username/.local/share/virtualenvs/demo-HO4EpenA/bin/activate
(demo) $
The virtual environment is now activated. Lets check that we are executing Python from the virtual environment by starting Python shell, importing sys module and executing sys.executable.
(demo) $ python
Python 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.executable
'/home/username/.local/share/virtualenvs/demo-HO4EpenA/bin/python'
>>>
Lets exit the Python shell. We do this using the exit() command.
>>> exit()
(demo) $
Now we are back in the virtual environment's shell.
Running DCA Computation
When pydca is installed, it provides a main command called mfdca. Executing this command displays help messages.
(demo) $ mfdca
usage: mfdca [-h]
{compute_di,compute_fn,compute_couplings,compute_fields,compute_fi ...
....
In the above, we have truncated the help message.
To compute DCA scores summarized by direct information score, we can execute
(demo) $ mfdca compute_di <biomolecule> <msa_file> --apc --verbose
<biomolecule> takes either protein or rna (case insensitive). The <msa_file> is
a FASTA formated multiple sequence alignment file. The optional argument --apc allows to make average product correction (APC) of DCA scores and --verbose triggers
logging messages to be displayed on the screen as DCA computation progresses. Thus, an exaple of DCA computation for an MSA file alignment.fa containing protein sequences would be:
(demo) $ mfdca compute_di protein alignment.fa --apc --verbose
There are a few subcommand in the pydca, e.g., to compute parameters of the global probability function, frequencies of alignment data, DCA scores summarized by Frobenius norm. One can lookup the help messages corresponding to a particular sub(command). Example:
(demo) $ mfdca compute_di --help
usage: mfdca compute_di [-h] [--verbose] [--apc] [--seqid SEQID]
[--pseudocount PSEUDOCOUNT]
[--refseq_file REFSEQ_FILE] [--output_dir OUTPUT_DIR]
[--force_seq_type]
biomolecule msa_file
...................................................
Finally, the current virtual environment can be deactivated using exit command.
Commonly Used Commands
Information about the commands and subcommands can be obtained using the --help optional argument from the command line. Below is a summary of commonly used commands.
Computing DCA Scores
In pydca DCA scores can be computed from the direct information score or from the Frobenius norm of the couplings using the compute_di or compute_fn subcommands. Example:
$ mfdca compute_di <biomolecule> <msa_file> --verbose
or
$ mfdca compute_fn <biomolecule> <msa_file> --verbose
To compute the average product corrected DCA score, we can use the --apc optional argument. Example:
$ mfdca compute_di <biomolecule> <msa_file> --apc --verbose
We can also set the values of the relative pseudocount and sequence identity. The default values are 0.5 and 0.8, respectively.
$ mfdca compute_di <biomolecule> <msa_file> --pseudocount 0.3 --seqid 0.9 --verbose
Furthermore, we can supply a FASTA formatted file containing a reference sequence so that DCA scores corresponding to residue pairs of that particular sequence are computed. Example:
$ mfdca compute_di <biomolecule> <msa_file> --refseq_file <refseq_file> --verbose
Plotting Contact Maps or True Positive Rates
Residue pairs ranked by DCA score can be visualized and compared with an existing PDB contact map. Example:
$ mfdca plot_contact_map <biomolecule> <pdb_chain_id> <pdb_file> <refseq_file> <dca_file> --linear_dist 5 --contact_dist 9.5 --num_dca_contacts 100 --verbose
In the above the optional argument --linear_dist filters out residue pairs that are not at least 5 residues apart in the sequence, --contact_dist sets the maximum distance (in Angstrom) between two residue pairs to be considered contacts in the PDB structure, and --num_dca_contacts sets the number of top DCA ranked residue pairs to be taken for contact map comparison. Two residues in a PDB structure are considered to be contacts if they have at least a pair of heavy atoms that are less than the distance set by --contact_dist parameter.
A similar command can be used to plot the true positive rates of DCA ranked residues pairs, except that there is no restriction on the number of DCA ranked pairs. Example:
$ mfdca plot_tp_rate <biomolecule> <pdb_chain_id> <pdb_file> <ref_seq_file> --linear_dist 5 --contact_dist 8.5 --verbose
Computing Couplings and Fields
The couplings and fields can be computed in a single step or separately. To compute these parameters in one run we can use the compute_params subcommand. Example:
$ mfdca compute_params <biomolecule> <msa_file> --pseudocount 0.4 --seqid 0.9 --verbose
or to compute the couplings alone
$ mfdca compute_couplings <biomolecule> <msa_file> --pseudocount 0.4 --seqid 0.9 --verbose
and the fields
$ mfdca compute_fields <biomolecule> <msa_file> --pseudocount 0.3 --seqid 0.75 --verbose
Computing Single- and Pair-site Frequencies
By default pydca uses a relative pseudocount of 0.5 thus the frequencies are regularized. To compute non-regularized frequencies we need to set the pseudocount to zero. Example:
$ mfdca compute_fi <biomolecule> <msa_file> --pseudocount 0 --verbose
The compute_fi subcommand computes single-site frequencies. To compute pair-site frequencies, we use the compute_fij subcommand.
$ mfdca compute_fij <biomolecule> <msa_file> --pseudocount 0 --verbrose
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file pydca-0.2.tar.gz.
File metadata
- Download URL: pydca-0.2.tar.gz
- Upload date:
- Size: 53.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.1 CPython/3.6.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
46de674ba8f55e57ad787204131cf174226111ac0f1eb99958f40ec75287817a
|
|
| MD5 |
6dc116876f1a3886dabe382be27d6254
|
|
| BLAKE2b-256 |
23f184f57c282955508dfb1689d5301224f4b0f61f1d5d2334138c4eacda4851
|
File details
Details for the file pydca-0.2-py3-none-any.whl.
File metadata
- Download URL: pydca-0.2-py3-none-any.whl
- Upload date:
- Size: 60.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.1 CPython/3.6.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
24a0d423661162f34f4848495c268055f1fbbb724cd13a22a23ab8ee1d2641ba
|
|
| MD5 |
065d2c902c9b4a1a5240d076af6d359b
|
|
| BLAKE2b-256 |
ba52078fab5f52c7cf1923774cdc1f3f71e5b3d5d532174d8c2024e36e193254
|