Predicting protein-protein interactions and structures from multiple sequence alignments.

🍐 yunta

Predicting pairwise protein-protein interactions and structures from multiple sequence alignments. Now with interspecies interactions and automatic chunking of large sequences!

yunta provides several implementations of protein-protein interaction evaluation. In increasing computational cost:

  • GPU-accelerated direct coupling analysis (DCA) (in TensorFlow and PyTorch)
  • RoseTTAFold-2track via the rf2t-micro package
  • AlphaFold2 for protein-protein structure prediction

yunta has streamlined installation, a command-line interface, a Python API, and some resilience to GPU out-of-memory errors through chunking of long sequences and CPU fallback. It takes as input unpaired multiple-sequence alignments in A3M format (as generated by tools like hhblits), and outputs a matrix of inter-residue contacts.

Rough timings for a pair of ~200 amino-acid proteins (S. cerevisiae DHFR and WW domain-containing protein) on CPU:

  • DCA: 5 seconds
  • RoseTTAFold-2track: 10 seconds
  • AlphaFold2: 1 hour

Note that these times will increase quadratically with the total length of the proteins.
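
As a rough illustration of what quadratic scaling means in practice, here is a minimal sketch that extrapolates a runtime from one of the timings above. The function name is hypothetical, and the reference total length of 400 residues is an assumption (reading "a pair of ~200 amino-acid proteins" as about 400 residues combined). Real timings will also depend on MSA depth and hardware.

```python
def scale_runtime(base_seconds: float, base_length: int, new_length: int) -> float:
    """Extrapolate a runtime, assuming cost grows with the square of total length."""
    return base_seconds * (new_length / base_length) ** 2


# Assumed reference point: DCA took ~5 s for ~400 residues in total.
# Doubling the total length should then cost roughly four times as much.
print(scale_runtime(5.0, 400, 800))  # 20.0
```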

Installation

Obtaining and setting up yunta is easy.

$ pip install yunta

If you want to enable GPU, use

$ pip install yunta[cuda12]

If you want to use a local CUDA installation instead, use

$ pip install yunta[cuda12_local]

The RoseTTAFold-2track and AlphaFold2 models require their trained weights. These are downloaded automatically, but by using yunta you agree that the trained weights for RoseTTAFold are made available for non-commercial use only under the terms of the Rosetta-DL Software license, and that AlphaFold2's pretrained parameters fall under the CC BY 4.0 license.

Credit

yunta is a fork of SpeedPPI, which is itself inspired by FoldDock. This method used AlphaFold2 to evaluate 65,484 protein-protein interactions from the human proteome in Towards a structurally resolved human protein interaction network.

The idea of using DCA, RoseTTAFold-2track, and AlphaFold2 in a cascade of increasingly expensive and specific PPI detection methods has been explored in a series of papers from David Baker's lab.

yunta puts these algorithms in one place with easy installation, a command-line interface, and a Python API. It also enables interspecies co-evolutionary analysis using a built-in interspecies interaction mapping.

Command-line usage

You can always get more help by running

$ yunta --help
usage: yunta [-h] {dca-single,dca-many,rf2t-single,af2-single,af2-many} ...

Screening protein-protein interactions using DCA and AlphaFold2.

options:
  -h, --help            show this help message and exit

Sub-commands:
  {dca-single,dca-many,rf2t-single,af2-single,af2-many}
                        Use these commands to specify the tool you want to use.
    dca-single          Calculate DCA for one protein-protein interaction.
    dca-many            Calculate DCA between two sets of proteins, or all pairs in one set of proteins.
    rf2t-single         Calculate RF-2track contacts for between one protein and a series of others.
    af2-single          Model one protein-protein interaction.
    af2-many            Model all interactions between two sets of proteins, or all pairs in one set of proteins.

Generating multiple-sequence alignments

All the algorithms depend on pre-computed multiple-sequence alignments (MSAs) between a protein of interest and as many other proteins as possible. Computing the MSAs as a separate step speeds up the later calculations, because each alignment is built once and can be reused. You can generate MSAs using a dedicated tool like hhblits, which speeds up the process by using pre-clustered databases like UniClust. We typically use a command like:

hhblits -e 0.01 -v 3 -d /path/to/UniClust-database -i input.fasta -oa3m output-msa.a3m -o /dev/null -cov 60 -n 3 -realign -realign_max 10000

In our experience, this can take 1-40 min depending on the complexity of the query. Check the hhsuite documentation for more details.

Once you have your MSAs, you can use the tools in yunta to calculate contact maps from them and to predict structures of protein complexes with AlphaFold2.
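
Before running anything expensive, it can be useful to check how deep an MSA is. The sketch below is a minimal stand-alone A3M reader, not yunta's own parser. It relies on the A3M convention that lowercase letters mark insertions relative to the query, so removing them leaves rows of equal length. The function name and the toy alignment are made up for illustration.

```python
import io


def read_a3m(handle):
    """Return (names, rows) from A3M text, with lowercase insertions removed."""
    names, rows = [], []
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line.startswith(">"):
            header = line[1:].split()
            names.append(header[0] if header else "")
            rows.append("")
        elif rows:
            # Lowercase characters are insertions; drop them to keep columns aligned.
            rows[-1] += "".join(ch for ch in line if not ch.islower())
    return names, rows


# Toy alignment; with a real file use: open("output-msa.a3m")
example = io.StringIO(
    ">query\nMKV-LA\n>hit1 some description\nMRVaaGLA\n>hit2\nM-VGL-\n"
)
names, rows = read_a3m(example)
print(len(rows), "sequences,", len(rows[0]), "aligned columns")
```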

Calculating contact maps

Given two MSAs, yunta will calculate the contact map using DCA, RF2t, or AlphaFold2, and produce a summary table for each pair provided as input.

Using DCA or RF2t will produce a table like this:

$ yunta dca-single test/inputs/DYR_YEAST.a3m -2 test/inputs/CAPZA_YEAST.a3m -o test/outputs/dca-single.tsv --apc
ID uniprot_id_1 uniprot_id_2 seq_len chain_a_len chain_b_len msa1_depth msa2_depth msa_depth n_eff apc mean median maximum minimum
O13297-D6VTK4 O13297 D6VTK4 980 549 431 14246 1546 670 2 False 0.01830857 0.014683756 0.07428725 2.284808e-06
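
To work with this table in your own scripts, something like the following sketch may be enough. It assumes the output file is tab-separated (as the .tsv extension suggests) and rebuilds the example row above in memory; the helper name is hypothetical.

```python
import csv
import io

# Rebuild the example output above as tab-separated text (assumed delimiter).
header = ("ID uniprot_id_1 uniprot_id_2 seq_len chain_a_len chain_b_len "
          "msa1_depth msa2_depth msa_depth n_eff apc mean median maximum minimum").split()
row = ("O13297-D6VTK4 O13297 D6VTK4 980 549 431 14246 1546 670 2 False "
       "0.01830857 0.014683756 0.07428725 2.284808e-06").split()
sample = "\t".join(header) + "\n" + "\t".join(row) + "\n"

INT_COLUMNS = ("seq_len", "chain_a_len", "chain_b_len",
               "msa1_depth", "msa2_depth", "msa_depth")
FLOAT_COLUMNS = ("mean", "median", "maximum", "minimum")


def load_summary(handle):
    """Read the summary table and convert numeric columns."""
    records = []
    for rec in csv.DictReader(handle, delimiter="\t"):
        for key in INT_COLUMNS:
            rec[key] = int(rec[key])
        for key in FLOAT_COLUMNS:
            rec[key] = float(rec[key])
        records.append(rec)
    return records


# With a real file use: open("test/outputs/dca-single.tsv", newline="")
records = load_summary(io.StringIO(sample))
best = max(records, key=lambda r: r["maximum"])
print(best["ID"], best["maximum"])
```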

If you also give the --plot option, then contact maps for the entire complex and for only the inter-chain contacts will be saved, along with CSV files containing the numerical values as a matrix. For example,

$ yunta dca-single test/inputs/DYR_YEAST.a3m -2 test/inputs/CAPZA_YEAST.a3m -o test/outputs/dca-single.tsv --apc --plot test/outputs/DYR_YEAST-CAPZA_YEAST
0 1 2 ... 420 421
-0.0 0.0009014737 0.0010275221 ... 0.0005961701 -1.9190367e-05
...
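
The relationship between the full-complex map and the inter-chain map can be sketched as a simple slice. This assumes the full matrix is ordered with chain A residues first and chain B residues after them, which is an assumption about the layout rather than something stated above; the function name and the toy matrix are illustrative only.

```python
def inter_chain_block(contacts, chain_a_len):
    """Rows are chain A residues, columns are chain B residues."""
    return [row[chain_a_len:] for row in contacts[:chain_a_len]]


# Toy 5x5 matrix for a complex with chain A of length 2 and chain B of length 3.
full = [
    [0.0, 0.1, 0.2, 0.3, 0.4],
    [0.1, 0.0, 0.5, 0.6, 0.7],
    [0.2, 0.5, 0.0, 0.8, 0.9],
    [0.3, 0.6, 0.8, 0.0, 1.0],
    [0.4, 0.7, 0.9, 1.0, 0.0],
]
block = inter_chain_block(full, chain_a_len=2)
print(block)  # [[0.2, 0.3, 0.4], [0.5, 0.6, 0.7]]
```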

Predicting protein complex structures

yunta can also feed your MSAs into the AlphaFold2 model to predict structures of binary protein complexes.

$ yunta af2-single test/inputs/DYR_YEAST.a3m -2 test/inputs/CAPZA_YEAST.a3m -o test/outputs/af2-single

This will also generate a summary table.

Using --plot will generate the contact maps as with the other commands.

$ yunta af2-single test/inputs/DYR_YEAST.a3m -2 test/inputs/CAPZA_YEAST.a3m -o test/outputs/af2-single --plot test/outputs/af2-single-plot

Command-line tools

You can run 1-vs-many with the *-single commands. For example:

$ yunta dca-single --help
usage: yunta dca-single [-h] [--msa2 [MSA2 ...]] [--list-file] [--interspecies] [--output [OUTPUT]] [--plot PLOT] [--apc] [msa1]

positional arguments:
  msa1                  MSA file. Default: STDIN.

options:
  -h, --help            show this help message and exit
  --msa2 [MSA2 ...], -2 [MSA2 ...]
                        Second MSA file(s). Default: if not provided, all pairwise from msa1.
  --list-file, -l       Treat inputs as plain-text list of MSA files, rather than MSA filenames. Default: treat as MSA filenames.
  --interspecies, -i    Whether the MSAs are from the same species. Default: Not inter-species.
  --output [OUTPUT], -o [OUTPUT]
                        Output filename. Default: STDOUT.
  --plot PLOT, -p PLOT  Directory for saving plots. Default: don't plot.
  --apc, -a             Whether to use APC correction in DCA. Default: don't apply correction.

If one MSA is provided, then homodimeric interactions are probed. For convenience, you can use the --list-file option to provide a single file containing a list of MSA files (one per line).
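
A list file is just plain text with one MSA path per line, so it is easy to generate. The sketch below writes one to a temporary directory and assembles a matching command line. The helper name and the list filename are hypothetical; the --list-file flag is taken from the help text above.

```python
import tempfile
from pathlib import Path


def write_msa_list(msa_paths, list_path):
    """Write one MSA filename per line and return the list file's path."""
    list_path = Path(list_path)
    list_path.write_text("".join(f"{p}\n" for p in msa_paths))
    return list_path


with tempfile.TemporaryDirectory() as tmp:
    list_file = write_msa_list(
        ["test/inputs/DYR_YEAST.a3m", "test/inputs/CAPZA_YEAST.a3m"],
        Path(tmp) / "msas.txt",
    )
    lines = list_file.read_text().splitlines()
    # Arguments you could pass to subprocess.run if yunta is installed.
    command = ["yunta", "dca-single", "--list-file", str(list_file)]

print(lines)
```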

You can run many-vs-many with the *-many commands. For example:

$ yunta af2-many --help
usage: yunta af2-many [-h] [--msa2 [MSA2 ...]] [--list-file] [--interspecies] --output OUTPUT [--params PARAMS] [--recycles RECYCLES] [--plot PLOT] [msa1 ...]

positional arguments:
  msa1                  MSA file(s). Default: "<_io.TextIOWrapper name='<stdin>' mode='r' encoding='utf-8'>".

options:
  -h, --help            show this help message and exit
  --msa2 [MSA2 ...], -2 [MSA2 ...]
                        Second MSA file(s). Default: if not provided, all pairwise from msa1.
  --list-file, -l       Treat inputs as plain-text list of MSA files, rather than MSA filenames. Default: treat as MSA filenames.
  --interspecies, -i    Whether the MSAs are from the same species. Default: Not inter-species.
  --output OUTPUT, -o OUTPUT
                        Output directory. Required.
  --params PARAMS, -w PARAMS
                        Path to AlphaFold2 params file (.npz).
  --recycles RECYCLES, -x RECYCLES
                        Maximum number of recyles through the model. Default: "10".
  --plot PLOT, -p PLOT  Directory for saving plots. Default: don't plot.
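
To get a feel for how many pairs a many-vs-many run involves, the pairing described in the help text can be sketched with itertools. This is an illustration, not yunta's internal code: in particular, including self-pairs (homodimers) in the single-set case is an assumption, and the function name is hypothetical.

```python
from itertools import combinations_with_replacement, product


def enumerate_pairs(set1, set2=None):
    """List the pairs to evaluate for one set, or between two sets."""
    if set2 is None:
        # All pairs within one set, self-pairs included (assumed).
        return list(combinations_with_replacement(set1, 2))
    return list(product(set1, set2))


print(len(enumerate_pairs(["A", "B", "C"])))   # 6
print(enumerate_pairs(["A", "B"], ["X"]))      # [('A', 'X'), ('B', 'X')]
```

With n MSAs in a single set this gives n(n+1)/2 pairs, so the number of AlphaFold2 runs grows quickly with the size of the set.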

Python API

We provide an API for using MSAs in your own programs.

>>> from yunta.structs.msa import MSA
>>> msa = MSA.from_file("my-msa-file.a3m")
>>> msa.neff()
6
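
For intuition about what an effective number of sequences measures, here is a sketch of one common definition: each sequence is down-weighted by the number of sequences (itself included) that share at least 80% identity with it. This is not necessarily how `msa.neff()` is computed in yunta; the function names, the threshold, and the toy rows are assumptions for illustration.

```python
def sequence_identity(a, b):
    """Fraction of aligned columns where two equal-length rows agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def effective_depth(rows, threshold=0.8):
    """Sum of 1 / (number of rows within the identity threshold), over all rows."""
    total = 0.0
    for a in rows:
        neighbours = sum(sequence_identity(a, b) >= threshold for b in rows)
        total += 1.0 / neighbours  # neighbours >= 1 because a row matches itself
    return total


# Three near-identical rows count as roughly one; the fourth is distinct.
rows = ["MKVLA", "MKVLA", "MKVLG", "GHTRW"]
print(effective_depth(rows))  # approximately 2.0
```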

If you prefer, you can also import a PyTorch implementation of DCA (adapted from the TensorFlow implementation of Humphreys et al., Science, 2021).

>>> from yunta.dca_torch import calculate_dca
>>> from yunta.structs.msa import PairedMSA
>>> paired_msa = PairedMSA.from_file("my-msa-file1.a3m", "my-msa-file2.a3m")
>>> calculate_dca(paired_msa.sequence_token_ids, apc=True, gpu=False)
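
The `apc` argument most likely refers to the average product correction commonly applied to coupling scores, which subtracts the background expected from each residue's average score. The sketch below shows the textbook form of that correction on a small matrix in pure Python. It is not taken from yunta's source; the function name is hypothetical, and real implementations may treat the diagonal differently.

```python
def apply_apc(scores):
    """Average product correction: S_ij - (row_mean_i * col_mean_j) / overall_mean."""
    n = len(scores)
    row_means = [sum(row) / n for row in scores]
    col_means = [sum(scores[i][j] for i in range(n)) / n for j in range(n)]
    overall = sum(row_means) / n  # assumed non-zero
    return [
        [scores[i][j] - row_means[i] * col_means[j] / overall for j in range(n)]
        for i in range(n)
    ]


# A matrix that is purely "background" (an outer product) is corrected to zero.
background = [[1.0, 2.0], [2.0, 4.0]]
print(apply_apc(background))
```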

(More documentation coming soon!)

... if you want to scale up

The *-many commands can process multiple candidate protein-protein interactions, but if you want to screen more than a few and have access to an HPC cluster, our nf-ggi Nextflow pipeline will be more efficient and can also generate the MSAs for you.

Issues, problems, suggestions

Add to the issue tracker.

Further help
