Versatile pipeline for processing protein structure data for deep learning applications.
ProteinFlow
A data processing pipeline for all your protein design needs.
Installation
Recommended: create a new conda environment and install proteinflow with pip.
conda create --name proteinflow -y
conda activate proteinflow
python -m pip install proteinflow
If you are using Python 3.10 and encountering installation problems, try running python -m pip install prody==2.4.0 before installing proteinflow.
Additional requirements
In most cases, running the commands above is enough. However, if you are planning to generate a new dataset, there are a couple of additional requirements.
First, you will need to install MMseqs2. The recommended way is to run the following command in your conda environment, but there are alternative methods you can see here.
conda install -y -c conda-forge -c bioconda mmseqs2
In addition, proteinflow depends on the rcsbsearch package, and the latest release (v0.2.3) currently does not work correctly. Follow the recommended fix:
python -m pip install "rcsbsearch @ git+https://github.com/sbliven/rcsbsearch@dbdfe3880cc88b0ce57163987db613d579400c8e"
Finally, you can use our docker image as an alternative.
docker run -it -v /path/to/data:/media adaptyvbio/proteinflow bash
Usage
Downloading pre-computed datasets (stable)
Datasets precomputed with a consensus set of parameters can be accessed and downloaded using the proteinflow package. Check the output of proteinflow check_tags for a list of available tags.
proteinflow download --tag 20221110
Running the pipeline
You can also run proteinflow with your own parameters. Check the output of proteinflow check_snapshots for a list of available PDB snapshots (naming rule: yyyymmdd).
For instance, let's generate a dataset with the following description:
- resolution threshold: 5 angstrom,
- PDB snapshot: 20190101,
- structure methods accepted: all (X-ray crystallography, NMR, cryo-EM),
- sequence identity threshold for clustering: 40% sequence similarity,
- maximum length per sequence: 1000 residues,
- minimum length per sequence: 5 residues,
- maximum fraction of missing values at the ends: 10%,
- size of validation subset: 10%.
proteinflow generate --tag new --resolution_thr 5 --pdb_snapshot 20190101 --not_filter_methods --min_seq_id 0.4 --max_length 1000 --min_length 5 --missing_ends_thr 0.1 --valid_split 0.1
See the docs (or proteinflow generate --help) for the full list of parameters and more information.
A registry of all the files that are removed during filtering, along with the reason for their removal, is created automatically for each generate command. The log files are saved (at data/logs by default) and a summary can be accessed by running proteinflow get_summary {log_path}.
Splitting
By default, both proteinflow generate and proteinflow download will also split your data into training, validation and test sets according to MMseqs2 clustering and homomer/heteromer/single-chain proportions. However, you can skip this step with the --skip_splitting flag and then perform it separately with the proteinflow split command.
The following command will perform the splitting with a 10% validation set, a 5% test set and a 50% threshold for sequence identity clusters.
proteinflow split --tag new --valid_split 0.1 --test_split 0.05 --min_seq_id 0.5
Using the data
The output files are pickled nested dictionaries where first-level keys are chain IDs and second-level keys are the following:
- 'crd_bb': a numpy array of shape (L, 4, 3) with backbone atom coordinates (N, C, CA, O),
- 'crd_sc': a numpy array of shape (L, 10, 3) with sidechain atom coordinates (check proteinflow.sidechain_order() for the order of atoms),
- 'msk': a numpy array of shape (L,) where ones correspond to residues with known coordinates and zeros to missing values,
- 'seq': a string of length L with residue types.
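The per-chain layout described above can be sanity-checked without downloading a dataset. Here is a minimal sketch using a synthetic entry (the chain ID "A", the length, and the sequence are made up for illustration):

```python
import numpy as np

L = 7  # hypothetical chain length
entry = {
    "A": {  # first-level key: chain ID
        "crd_bb": np.zeros((L, 4, 3)),   # backbone coordinates (N, C, CA, O)
        "crd_sc": np.zeros((L, 10, 3)),  # sidechain coordinates
        "msk": np.ones(L),               # 1 = known coordinates, 0 = missing
        "seq": "ACDEFGH",                # residue types, length L
    }
}

for chain_id, chain in entry.items():
    L_chain = len(chain["seq"])
    assert chain["crd_bb"].shape == (L_chain, 4, 3)
    assert chain["crd_sc"].shape == (L_chain, 10, 3)
    assert chain["msk"].shape == (L_chain,)
    known_fraction = chain["msk"].mean()  # share of residues with coordinates
```

The msk array makes it easy to compute, for example, the fraction of residues that actually have coordinates before feeding a chain to a model.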
Once your data is ready, you can open the files directly with pickle to access this data.
import pickle
import os

train_folder = "./data/proteinflow_new/training"
for filename in os.listdir(train_folder):
    with open(os.path.join(train_folder, filename), "rb") as f:
        data = pickle.load(f)
    # first-level keys are chain IDs, second-level keys hold the arrays
    for chain_id, chain in data.items():
        crd_bb = chain["crd_bb"]
        seq = chain["seq"]
        ...
Alternatively, you can use our ProteinDataset or ProteinLoader classes for convenient processing. Among other things, they allow for feature extraction, single-chain / homomer / heteromer filtering and randomized sampling from sequence identity clusters.
For example, here is how we can create a data loader that:
- samples a different cluster representative at every epoch,
- extracts dihedral angles, sidechain orientation and secondary structure features,
- only loads pairs of interacting proteins (larger biounits are broken up into pairs),
- has batch size 8.
from proteinflow import ProteinLoader
train_loader = ProteinLoader.from_args(
    "./data/proteinflow_new/training",
    clustering_dict_path="./data/proteinflow_new/splits_dict/train.pickle",
    node_features_type="dihedral+sidechain_orientation+secondary_structure",
    entry_type="pair",
    batch_size=8,
)
for batch in train_loader:
    crd_bb = batch["X"]  # (B, L, 4, 3)
    seq = batch["S"]  # (B, L)
    sse = batch["secondary_structure"]  # (B, L, 3)
    to_predict = batch["masked_res"]  # (B, L), 1 where the residues should be masked, 0 otherwise
    ...
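The masked_res entry marks which positions a model should predict. Here is a minimal sketch of gathering prediction targets from it, with synthetic numpy arrays standing in for the batch tensors (the key names and shapes follow the loop above; the batch size, length, and values are made up):

```python
import numpy as np

B, L = 2, 5  # hypothetical batch size and padded sequence length
batch = {
    "S": np.random.randint(0, 20, size=(B, L)),  # residue type indices
    "masked_res": np.zeros((B, L)),              # 1 where residues should be predicted
}
batch["masked_res"][0, :2] = 1  # pretend the first two residues of sample 0 are masked

# Gather the residue types at positions the model should predict
mask = batch["masked_res"].astype(bool)
targets = batch["S"][mask]  # 1D array, one entry per masked residue
```

In a real training loop the batch entries would be framework tensors rather than numpy arrays, but the same boolean-mask indexing applies.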
See the docs for more details on available parameters and the data format, and this repository for a use case.
ProteinFlow Stable Releases
You can download them with proteinflow download --tag {tag} in the command line or browse them in the interface.
Tag | Date | Snapshot | Size | Min res | Min len | Max len | MMseqs thr | Split (train/val/test) | Missing thr (ends/middle) | Note |
---|---|---|---|---|---|---|---|---|---|---|
paper | 10.11.22 | 20220103 | 24G | 3.5 | 30 | 10'000 | 0.3 | 90/5/5 | 0.3/0.1 | first release, no mmCIF files |
20230102_stable | 27.02.23 | 20230102 | 28G | 3.5 | 30 | 10'000 | 0.3 | 90/5/5 | 0.3/0.1 | v1.1.1 |
License
The proteinflow package and data are released and distributed under the BSD 3-Clause License.
Contributions
This is an open-source project supported by Adaptyv Bio. Contributions, suggestions and bug fixes are welcome.