bio2Byte software suite to predict protein biophysical properties from their amino-acid sequences
Bio2Byte Tools
This package provides structural predictions for protein sequences, using the predictors developed by the Bio2Byte group.
🧪 List of available predictors
Predictor | Usage |
---|---|
Dynamine | Fast predictor of protein backbone dynamics using only sequence information as input. The version here also predicts side-chain dynamics and secondary structure propensities using the same principle. |
Disomine | Predicts protein disorder with recurrent neural networks not directly from the amino acid sequence, but instead from more generic predictions of key biophysical properties, here protein dynamics, secondary structure and early folding. |
EfoldMine | Predicts from the primary amino acid sequence of a protein, which amino acids are likely involved in early folding events. |
AgMata | Single-sequence based predictor of protein regions that are likely to cause beta-aggregation. |
🔗 Related link: These listed tools and others are described on the Bio2Byte website inside the Tools section.
⚡️Quick start
First of all, download and install the package:
$ pip install b2bTools
Single Sequence predictions
Use this example as an entry point:
import matplotlib.pyplot as plt
from b2bTools import SingleSeq
single_seq = SingleSeq("/path/to/example.fasta")
single_seq.predict(tools=['dynamine', 'agmata'])
predictions = single_seq.get_all_predictions('SEQ001')
backbone_pred = predictions['SEQ001']['backbone']
sidechain_pred = predictions['SEQ001']['sidechain']
agmata_pred = predictions['SEQ001']['agmata']
plt.plot(range(len(backbone_pred)), backbone_pred, label = "Backbone")
plt.plot(range(len(backbone_pred)), sidechain_pred, label = "Sidechain")
plt.plot(range(len(backbone_pred)), agmata_pred, label = "Agmata")
plt.legend()
plt.xlabel('aa_position')
plt.ylabel('pred_values')
plt.show()
There is a live demo available on Google Colab: link
Multiple Sequences Alignment predictions
Use the following example as an entry point. Dynamine always runs by default; the additional tools you can request are 'agmata', 'eFoldMine' and 'disoMine' (the names are case sensitive).
import matplotlib.pyplot as plt
from b2bTools import MultipleSeq
msaSeq = MultipleSeq()
msaSeq.from_aligned_file("/path/to/example.fasta")
predictions = msaSeq.get_all_predictions_msa("SEQ001")
backbone_pred = predictions['backbone']
sidechain_pred = predictions['sidechain']
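plt.plot(range(len(backbone_pred)), backbone_pred, label = "Backbone")
plt.plot(range(len(backbone_pred)), sidechain_pred, label = "Sidechain")
plt.legend()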
plt.xlabel('aa_position')
plt.ylabel('pred_values')
plt.show()
In case you need to run another tool, replace the 4th line with:
msaSeq.from_aligned_file("/path/to/example.fasta", tools=['agmata', 'eFoldMine', 'disoMine'])
There is a live demo available on Google Colab: link
⚙️ First time setup
The following steps are required in order to install the b2bTools package in your local environment:
📦 Pip package installation
From the official documentation:
pip is the package installer for Python. You can use pip to install packages from the Python Package Index and other indexes.
🔗 Related link: Pip official documentation.
$ pip install b2bTools
💡 Relevant idea: Jupyter Notebooks are a good way to try out the package. If you are using Google Colab, install the package directly from pip inside a code block:
!pip install b2bTools
Important notes for MSA analysis
The PyPI repository does not contain a package for t_coffee, which is a dependency for running predictions on MSAs built from BLAST or UniRef ID inputs, among others. As a workaround, install this dependency from conda:
conda install -c bioconda t-coffee
📦 Conda package installation
Conda is an open source package management system and environment management system that runs on Windows, macOS and Linux. Conda quickly installs, runs and updates packages and their dependencies. Conda easily creates, saves, loads and switches between environments on your local computer. It was created for Python programs, but it can package and distribute software for any language.
🔗 Related link: Conda official documentation.
To install this package with conda, run:
$ conda install -c Bio2Byte b2bTools
⚠️ Important note: some Linux users might experience dependency conflicts during the conda installation. Please use the pip installation (described above) if you encounter them.
If you must use conda, use the following command:
$ conda install --override-channels --channel defaults --channel conda-forge --channel Bio2Byte --channel pytorch b2btools
🐳 Docker-way to quick start
Docker is an open platform for developing, shipping, and running applications. Docker enables you to separate your applications from your infrastructure so you can deliver software quickly. With Docker, you can manage your infrastructure in the same ways you manage your applications. By taking advantage of Docker’s methodologies for shipping, testing, and deploying code quickly, you can significantly reduce the delay between writing code and running it in production.
🔗 Related link: Docker official documentation.
Preconditions
For the moment, Windows users can only use this Docker image through the Windows Subsystem for Linux (WSL).
Steps
To exchange files between your host and the container, mount a volume using the -v $(pwd)/swap:/data parameter.
⚠️ Important note: Make sure your input files are inside $(pwd)/swap.
$ docker pull diazadriang/b2b-tools-public
$ docker run -it -v $(pwd)/swap:/data diazadriang/b2b-tools-public -disomine -file /data/input_example.fasta -output /data/result.json -identifier test
⚠️ Important note:
- The output file, result.json, will be stored inside $(pwd)/swap.
- The available parameters after diazadriang/b2b-tools-public are:
Parameter | Purpose | Example |
---|---|---|
-file | Path to the input file | -file /path/to/input/file.fasta |
-output | Path to the output file (a JSON file with the results) | -output /path/to/output/results.json |
-dynamine | Run Dynamine predictor | -dynamine |
-disomine | Run Disomine predictor | -disomine |
-efoldmine | Run EfoldMine predictor | -efoldmine |
-agmata | Run AgMata predictor | -agmata |
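When several inputs have to be processed, the docker run command shown above can be assembled from Python. The following is a minimal sketch and not part of b2bTools: the helper name build_docker_command and its arguments are hypothetical, and it only relies on the flags listed in the table plus the -identifier flag used in the example command. It builds the argument list without running anything.

```python
IMAGE = "diazadriang/b2b-tools-public"
PREDICTOR_FLAGS = ("-dynamine", "-disomine", "-efoldmine", "-agmata")


def build_docker_command(swap_dir, input_name, output_name, predictors, identifier):
    """Return the `docker run` argument list for one job (hypothetical helper)."""
    unknown = [flag for flag in predictors if flag not in PREDICTOR_FLAGS]
    if unknown:
        raise ValueError(f"Unknown predictor flags: {unknown}")
    # The host folder is mounted as /data inside the container.
    volume = f"{swap_dir}:/data"
    return [
        "docker", "run", "-it", "-v", volume, IMAGE,
        *predictors,
        "-file", f"/data/{input_name}",
        "-output", f"/data/{output_name}",
        "-identifier", identifier,
    ]


command = build_docker_command(
    "/home/user/swap", "input_example.fasta", "result.json", ["-disomine"], "test"
)
print(" ".join(command))
```

The resulting list can be passed to subprocess.run(command, check=True). When no terminal is attached (for example in a batch job), the -it option will probably need to be dropped.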
🐍 Package content
🔧 General Tools
Besides the prediction tools, this package includes general bioinformatics tools useful to manipulate files.
Single Sequences files
The class FastaIO provides the following static methods:
- read_fasta_from_file
- read_fasta_from_string
- write_fasta
Usage:
from b2bTools.general.parsers.fasta import FastaIO
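Methods such as get_all_predictions take a sequence identifier (for example 'SEQ001'), so it helps to know which identifiers a FASTA file contains before running the predictors. The sketch below is a stand-alone illustration in plain Python and does not use or reproduce FastaIO. It assumes the identifier is the first word of each header line, which may differ from how b2bTools derives it.

```python
def fasta_identifiers(fasta_text):
    """Return the first word after '>' for every record in a FASTA string."""
    identifiers = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            header = line[1:].strip()
            # Keep only the first whitespace-separated token; an empty header yields "".
            identifiers.append(header.split()[0] if header else "")
    return identifiers


# Hypothetical two-record FASTA content.
example = ">SEQ001 hypothetical protein\nMAKSTILALL\n>SEQ002\nMRRERGRQGD\n"
print(fasta_identifiers(example))
```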
Multiple Sequences Alignments files
The class AlignmentsIO provides the following static methods:
- read_alignments
- read_alignments_fasta
- read_alignments_A3M
- read_alignments_blast
- read_alignments_balibase
- read_alignments_clustal
- read_alignments_psi
- read_alignments_phylip
- read_alignments_stockholm
- write_fasta_from_alignment
- write_fasta_from_seq_alignment_dict
- json_preds_to_csv_singleseq
- json_preds_to_csv_msa
Usage:
from b2bTools.general.parsers.alignments import AlignmentsIO
NEF files
The class NefIO provides the following static methods:
- read_nef_file
- read_nef_file_sequence_shifts
Usage:
from b2bTools.general.parsers.nef import NefIO
NMR-STAR files
The class NMRStarIO provides the following static methods:
- read_nmr_star_project
- read_nmr_star_sequence_shifts
Usage:
from b2bTools.general.parsers.nmr_star import NMRStarIO
🔍 About predictors
Because a predictor can be built on top of others, you will often get more output predictions than the ones you requested:
Predictor | Depends on |
---|---|
Dynamine | None |
EfoldMine | [Dynamine] |
Disomine | [EfoldMine, Dynamine] |
AgMata | [EfoldMine, Dynamine] |
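To make the table concrete, the sketch below works out which predictors end up running for a given request. It is only an illustration of the dependency table, using the lowercase tool names of the SingleSeq interface; the dictionary and function names are hypothetical and say nothing about how the package resolves dependencies internally.

```python
# Dependency table from this section, keyed by the lowercase tool names.
DEPENDS_ON = {
    "dynamine": [],
    "efoldmine": ["dynamine"],
    "disomine": ["efoldmine", "dynamine"],
    "agmata": ["efoldmine", "dynamine"],
}


def predictors_that_run(requested):
    """Return the requested predictors plus everything they depend on, sorted by name."""
    resolved = set()
    pending = list(requested)
    while pending:
        name = pending.pop()
        if name not in DEPENDS_ON:
            raise KeyError(f"Unknown predictor: {name}")
        if name not in resolved:
            resolved.add(name)
            pending.extend(DEPENDS_ON[name])
    return sorted(resolved)


print(predictors_that_run(["agmata"]))  # ['agmata', 'dynamine', 'efoldmine']
```

Requesting only AgMata therefore also yields the Dynamine and EfoldMine outputs.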
🔬 Single Sequence
🧭 Basic flow
This section explains in detail the script shown in the Quick start section.
- Import the SingleSeq class from the b2bTools package:
from b2bTools import SingleSeq
- Instantiate an object by passing the path to the input file in FASTA format:
single_seq = SingleSeq("/path/to/example.fasta")
- Run the predictors you want:
single_seq.predict(tools=['dynamine', 'efoldmine'])
⚠️ Important note: These are all the available options to use inside the tools array parameter:
Predictor | string value |
---|---|
Dynamine | "dynamine" |
EfoldMine | "efoldmine" |
Disomine | "disomine" |
AgMata | "agmata" |
- Get the prediction values after running the selected predictors for a specific sequence identifier:
predictions = single_seq.get_all_predictions('SEQ001')
⚠️ Important note: The method get_all_predictions will return a dictionary with the following structure:
{
"SEQUENCE_ID_000": {
"seq": "the input sequence 0",
"result001": [0.001, 0.002, ..., 0.00],
"result002": [0.001, 0.002, ..., 0.00],
"...": [...],
"resultN": [0.001, 0.002, ..., 0.00]
},
"SEQUENCE_ID_001": {
"seq": "the input sequence 1",
"result001": [0.001, 0.002, ..., 0.00],
"result002": [0.001, 0.002, ..., 0.00],
"...": [...],
"resultN": [0.001, 0.002, ..., 0.00]
},
"...": { ... },
"SEQUENCE_ID_N": {
"seq": "the input sequence N",
"result001": [0.001, 0.002, ..., 0.00],
"result002": [0.001, 0.002, ..., 0.00],
"...": [...],
"resultN": [0.001, 0.002, ..., 0.00]
},
}
To know all the available result keys, please review this table:
Predictor | Output key | Output values (type) | Output values (example) |
---|---|---|---|
None | "seq" | [Char] | ['M', 'A', ..., 'S', 'T'] |
Dynamine | "backbone" | [Float] | [0.6786, 0.71, ..., 0.7219] |
Dynamine | "sidechain" | [Float] | [0.5823, 0.23, ..., 0.1995] |
Dynamine | "helix" | [Float] | [0.0122, 0.84, ..., 0.2345] |
Dynamine | "ppII" | [Float] | [0.0420, 0.69, ..., 0.5566] |
Dynamine | "coil" | [Float] | [0.6666, 0.13, ..., 0.9954] |
Dynamine | "sheet" | [Float] | [0.1992, 0.12, ..., 0.0020] |
EfoldMine | "earlyFolding" | [Float] | [0.1989, 0.08, ..., 0.0031] |
Disomine | "disoMine" | [Float] | [0.1996, 0.12, ..., 0.0019] |
AgMata | "agmata" | [Float] | [0.1954, 0.06, ..., 0.0007] |
- The sequence and its predictions are now ready to use. Here is an example that plots the data:
import matplotlib.pyplot as plt
backbone_pred = predictions['SEQ001']['backbone']
sidechain_pred = predictions['SEQ001']['sidechain']
agmata_pred = predictions['SEQ001']['agmata']
plt.plot(range(len(backbone_pred)), backbone_pred, label = "Backbone")
plt.plot(range(len(backbone_pred)), sidechain_pred, label = "Sidechain")
plt.plot(range(len(backbone_pred)), agmata_pred, label = "Agmata")
plt.legend()
plt.xlabel('aa_position')
plt.ylabel('pred_values')
plt.show()
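Besides plotting, the dictionary returned by get_all_predictions can be reshaped into one record per residue, which is convenient for writing CSV files or building tables. The sketch below assumes only the dictionary layout and output keys described in this section; the function name predictions_to_rows and the sample values are hypothetical.

```python
def predictions_to_rows(predictions, sequence_id, keys):
    """Turn the per-key lists of one sequence into a list of per-residue dictionaries."""
    entry = predictions[sequence_id]
    sequence = entry["seq"]
    # Every prediction list is expected to hold one value per residue.
    for key in keys:
        if len(entry[key]) != len(sequence):
            raise ValueError(
                f"'{key}' has {len(entry[key])} values for {len(sequence)} residues"
            )
    rows = []
    for index, residue in enumerate(sequence):
        row = {"position": index + 1, "residue": residue}
        for key in keys:
            row[key] = entry[key][index]
        rows.append(row)
    return rows


# Hypothetical values shaped like the dictionary described above.
predictions = {
    "SEQ001": {
        "seq": ["M", "A", "K"],
        "backbone": [0.68, 0.71, 0.72],
        "sidechain": [0.58, 0.23, 0.20],
    }
}
rows = predictions_to_rows(predictions, "SEQ001", ["backbone", "sidechain"])
print(rows[0])
```

Positions are numbered from 1 here purely as a presentation choice.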
⌨️ Running as Python module (no Python code involved)
You can use this package directly from your console session, with no Python code involved. Further details on running modules with python -m are available on the official Python documentation site.
$ python -m b2bTools -file ./swap/input_example.fasta -dynamine -disomine -identifier test -output ./swap/result-from-package.json
⚠️ Important note:
- The output file is written to the path given with -output (./swap/result-from-package.json in the example above).
- The available parameters after python -m b2bTools are:
Parameter | Purpose | Example |
---|---|---|
-file | Path to the input file | -file /path/to/input/file.fasta |
-output | Path to the output file (a JSON file with the results) | -output /path/to/output/results.json |
-dynamine | Run Dynamine predictor | -dynamine |
-disomine | Run Disomine predictor | -disomine |
-efoldmine | Run EfoldMine predictor | -efoldmine |
-agmata | Run AgMata predictor | -agmata |
🔬 Multiple Sequences Alignment
If your input data is an MSA file, there are several ways to predict the biophysical features of the sequences.
🧭 Basic flow
⚠️ Important note: These are all the available options for the tools array parameter (Dynamine always runs):
Predictor | string value |
---|---|
EfoldMine | "eFoldMine" |
Disomine | "disoMine" |
AgMata | "agmata" |
The tools array parameter is available for all the input methods of the class MultipleSeq:
from b2bTools import MultipleSeq
# From an aligned file
msaSeq = MultipleSeq()
msaSeq.from_aligned_file("/path/to/example.fasta", tools=['agmata', 'eFoldMine', 'disoMine'])
# From two MSA files
msaSeq = MultipleSeq()
msaSeq.from_two_msa("/path/to/example_a.fasta", "/path/to/example_b.fasta", tools=['agmata', 'eFoldMine', 'disoMine'])
# From a JSON with variations file
msaSeq = MultipleSeq()
msaSeq.from_json("/path/to/example.json", tools=['agmata', 'eFoldMine', 'disoMine'])
# From a sequence performing a BLAST before running the predictions
msaSeq = MultipleSeq()
msaSeq.from_blast("path/to/example.fasta", mut_option="y", mut_position=1, mut_residue="A", tools=['agmata', 'eFoldMine', 'disoMine'])
# From a UniRef ID performing a BLAST before running the predictions
msaSeq = MultipleSeq()
msaSeq.from_uniref("A2R2V4", tools=['agmata', 'eFoldMine', 'disoMine'])
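Because the MSA tool names are case sensitive, a misspelling such as 'efoldmine' is easy to make. The following sketch checks a tools list against the three documented spellings before it is passed to any of the input methods above. The function name check_msa_tools is hypothetical, and the check may be stricter than the package itself: Dynamine is left out because it always runs.

```python
# Spellings documented for the MultipleSeq tools parameter.
MSA_TOOLS = ("agmata", "eFoldMine", "disoMine")


def check_msa_tools(tools):
    """Raise ValueError when a tool name does not match the documented spelling exactly."""
    by_lowercase = {name.lower(): name for name in MSA_TOOLS}
    for tool in tools:
        if tool in MSA_TOOLS:
            continue
        suggestion = by_lowercase.get(tool.lower())
        if suggestion:
            raise ValueError(f"'{tool}' should be written '{suggestion}'")
        raise ValueError(f"'{tool}' is not a known MSA tool")
    return list(tools)


print(check_msa_tools(["agmata", "eFoldMine"]))
```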
To know all the available result keys, please review this table:
Predictor | Output key | Output values (type) | Output values (example) |
---|---|---|---|
Dynamine | "backbone" | [Float] | [0.6786, 0.123, ..., 0.2523] |
Dynamine | "sidechain" | [Float] | [0.1234, 0.532, ..., 0.8764] |
Dynamine | "helix" | [Float] | [0.4321, 0.425, ..., 0.8334] |
Dynamine | "ppII" | [Float] | [0.4577, 0.754, ..., 0.2343] |
Dynamine | "coil" | [Float] | [0.5464, 0.675, ..., 0.6483] |
Dynamine | "sheet" | [Float] | [0.1234, 0.432, ..., 0.8764] |
EfoldMine | "earlyFolding" | [Float] | [0.3245, 0.234, ..., 0.2348] |
Disomine | "disoMine" | [Float] | [0.4576, 0.235, ..., 0.6347] |
AgMata | "agmata" | [Float] | [0.4323, 0.457, ..., 0.2372] |
From an aligned file
import matplotlib.pyplot as plt
from b2bTools import MultipleSeq
msaSeq = MultipleSeq()
msaSeq.from_aligned_file("/path/to/example.fasta")
predictions = msaSeq.get_all_predictions_msa("SEQ001")
backbone_pred = predictions['backbone']
sidechain_pred = predictions['sidechain']
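plt.plot(range(len(backbone_pred)), backbone_pred, label = "Backbone")
plt.plot(range(len(backbone_pred)), sidechain_pred, label = "Sidechain")
plt.legend()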
plt.xlabel('aa_position')
plt.ylabel('pred_values')
plt.show()
From two MSA files
import matplotlib.pyplot as plt
from b2bTools import MultipleSeq
msaSeq = MultipleSeq()
msaSeq.from_two_msa("/path/to/example_a.fasta", "/path/to/example_b.fasta")
predictions = msaSeq.get_all_predictions_msa("SEQ001")
backbone_pred = predictions['backbone']
sidechain_pred = predictions['sidechain']
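plt.plot(range(len(backbone_pred)), backbone_pred, label = "Backbone")
plt.plot(range(len(backbone_pred)), sidechain_pred, label = "Sidechain")
plt.legend()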
plt.xlabel('aa_position')
plt.ylabel('pred_values')
plt.show()
From a JSON with variations file
In this case, we support a JSON format to introduce variants in a sequence. For instance:
{
"metadata": { "name": "target_fasta_file" },
"WT": "MAKSTILALLALVLVAHASAMRRERGRQGDSSSCERQVDRVNLKPCEQHIMQRIMGEQEQYDSYDIRSTRSSDQQQRCCDELNEMENTQRCMCEALQQIMENQCDRLQDRQMVQQFKRELMNLPQQCNFRAPQRCDLDVSGGRC",
"Variants": {
"Var1": ["A3S", "A11G"],
"Var2": ["A2G", "K3_S4insPH", "T5del"]
}
}
Where WT is the wild-type sequence, and the Variants key holds a dictionary of named variants. Each variant is described by an array of changes:
- For example, replacing the A at position 3 with an S is written "A3S".
Regarding the input FASTA file, the metadata key contains the name of the input file; remember that it should be stored in the same directory as the JSON file.
The code snippet is:
import matplotlib.pyplot as plt
from b2bTools import MultipleSeq
msaSeq = MultipleSeq()
msaSeq.from_json("/path/to/example.json")
predictions = msaSeq.get_all_predictions_msa("SEQ001")
backbone_pred = predictions['backbone']
sidechain_pred = predictions['sidechain']
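plt.plot(range(len(backbone_pred)), backbone_pred, label = "Backbone")
plt.plot(range(len(backbone_pred)), sidechain_pred, label = "Sidechain")
plt.legend()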
plt.xlabel('aa_position')
plt.ylabel('pred_values')
plt.show()
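The substitution notation used in the Variants arrays (for example "A3S") can be illustrated with a few lines of plain Python. This is a sketch for understanding the notation only, not the parser used by b2bTools: it assumes positions are counted from 1, handles single-residue substitutions only, and rejects insertions and deletions such as "K3_S4insPH" or "T5del". The function name and the short test sequence are hypothetical.

```python
import re

# One-letter original residue, 1-based position, one-letter replacement.
SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def apply_substitution(sequence, variant):
    """Apply one substitution such as 'A3S' to a sequence and return the new sequence."""
    match = SUBSTITUTION.match(variant)
    if match is None:
        raise ValueError(f"Not a simple substitution: {variant}")
    original, replacement = match.group(1), match.group(3)
    position = int(match.group(2))
    if not 1 <= position <= len(sequence):
        raise ValueError(f"Position {position} is outside the sequence")
    index = position - 1
    # Guard against a variant that does not match the wild-type residue.
    if sequence[index] != original:
        raise ValueError(
            f"Expected {original} at position {position}, found {sequence[index]}"
        )
    return sequence[:index] + replacement + sequence[index + 1:]


print(apply_substitution("MKAST", "A3S"))  # MKSST
```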
From a sequence performing a BLAST before running the predictions
To mutate a residue at one specific position, set mut_option to "y" and provide the mut_position and mut_residue parameters.
import matplotlib.pyplot as plt
from b2bTools import MultipleSeq
msaSeq = MultipleSeq()
msaSeq.from_blast("path/to/example.fasta", mut_option="y", mut_position=1, mut_residue="A")
predictions = msaSeq.get_all_predictions_msa("SEQ001")
backbone_pred = predictions['backbone']
sidechain_pred = predictions['sidechain']
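plt.plot(range(len(backbone_pred)), backbone_pred, label = "Backbone")
plt.plot(range(len(backbone_pred)), sidechain_pred, label = "Sidechain")
plt.legend()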
plt.xlabel('aa_position')
plt.ylabel('pred_values')
plt.show()
From a UniRef ID performing a BLAST before running the predictions
import matplotlib.pyplot as plt
from b2bTools import MultipleSeq
msaSeq = MultipleSeq()
msaSeq.from_uniref("A2R2V4")
predictions = msaSeq.get_all_predictions_msa("SEQ001")
backbone_pred = predictions['backbone']
sidechain_pred = predictions['sidechain']
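plt.plot(range(len(backbone_pred)), backbone_pred, label = "Backbone")
plt.plot(range(len(backbone_pred)), sidechain_pred, label = "Sidechain")
plt.legend()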
plt.xlabel('aa_position')
plt.ylabel('pred_values')
plt.show()
⚠️ Note: the query using the UniRef ID is limited to 25 results to reduce the running time.
📚 Package classes & methods
If you are interested in further details, please read the full documentation on the Bio2Byte website.
To generate the documentation locally, follow the steps described in this section.
Preconditions
You have downloaded the source code of the Bio2Byte Tools in your local environment:
$ git clone git@bitbucket.org:bio2byte/b2btools.git && cd b2btools
Steps
- Run the following command:
$ make generate-docs
- Then open the folder ./wrapped_documentation
💡 Relevant idea: At any moment, you can read the documentation of a method through its __doc__ attribute (e.g. print(SingleSeq.predict.__doc__)).
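As a small illustration of docstring access, the sketch below uses a hypothetical stand-in function rather than the real SingleSeq.predict, so that it runs without the package installed. It shows the __doc__ attribute next to inspect.getdoc from the standard library, which returns the same text with its indentation cleaned up.

```python
import inspect


def predict(tools=None):
    """Run the selected predictors.

    Hypothetical stand-in used only to show docstring access.
    """
    return tools or []


print(predict.__doc__)          # the docstring as stored on the function
print(inspect.getdoc(predict))  # the same text, cleaned up by inspect
```

The built-in help(SingleSeq.predict) call is another option in an interactive session.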
📖 How to cite
If you use this package or data in this package, please cite:
Predictor | Cite | Digital Object Identifier (DOI) |
---|---|---|
Dynamine | Elisa Cilia, Rita Pancsa, Peter Tompa, Tom Lenaerts, and Wim Vranken. From protein sequence to dynamics and disorder with DynaMine. Nature Communications 4:2741 (2013) | https://www.nature.com/articles/ncomms3741 |
Disomine | Gabriele Orlando, Daniele Raimondi, Francesco Codice, Francesco Tabaro, Wim Vranken. Prediction of disordered regions in proteins with recurrent Neural Networks and protein dynamics. bioRxiv 2020.05.25.115253 (2020) | https://www.biorxiv.org/content/10.1101/2020.05.25.115253v1 |
EfoldMine | Raimondi, D., Orlando, G., Pancsa, R. et al. Exploring the Sequence-based Prediction of Folding Initiation Sites in Proteins. Sci Rep 7, 8826 (2017) | https://doi.org/10.1038/s41598-017-08366-3 |
AgMata | Gabriele Orlando, Alexandra Silva, Sandra Macedo-Ribeiro, Daniele Raimondi, Wim Vranken. Accurate prediction of protein beta-aggregation with generalized statistical potentials. Bioinformatics, Volume 36, Issue 7, 1 April 2020, Pages 2076–2081 (2020) | https://academic.oup.com/bioinformatics/article/36/7/2076/5670527 |
📝 Terms of use
- The Bio2Byte group aims to promote open science by providing freely available online services, database and software relating to the life sciences, with focus on proteins. Where we present scientific data generated by others we impose no additional restriction on the use of the contributed data than those provided by the data owner.
- The Bio2Byte group expects attribution (e.g. in publications, services or products) for any of its online services, databases or software in accordance with good scientific practice. The expected attribution will be indicated in 'How to cite' sections (or equivalent).
- The Bio2Byte group is not liable to you or third parties claiming through you, for any loss or damage.
- Any questions or comments concerning these Terms of Use can be addressed to Wim Vranken.
© Wim Vranken, Bio2Byte group, VUB