Versatile pipeline for processing protein structure data for deep learning applications.

ProteinFlow

A data processing pipeline for all your protein design needs.

Installation

Recommended: create a new conda environment and install proteinflow and mmseqs in it. Note that the Python version must be between 3.8 and 3.10.

conda create --name proteinflow -y python=3.9
conda activate proteinflow
conda install -y -c conda-forge -c bioconda mmseqs2
python -m pip install proteinflow

In addition, proteinflow depends on the rcsbsearch package, whose latest release (v0.2.3) is currently not functioning. Follow the recommended fix:

python -m pip install "rcsbsearch @ git+https://github.com/sbliven/rcsbsearch@dbdfe3880cc88b0ce57163987db613d579400c8e"
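Because the supported interpreter range is narrow, it can be worth checking it before installing. The following is a minimal sketch that uses only the standard library; the helper name `python_version_supported` is hypothetical and is not part of proteinflow.

```python
import sys

# Supported range taken from the installation note: Python 3.8 to 3.10 inclusive.
MIN_VERSION = (3, 8)
MAX_VERSION = (3, 10)


def python_version_supported(version=None):
    """Return True if (major, minor) falls within the supported range."""
    if version is None:
        version = sys.version_info
    major_minor = tuple(version[:2])
    return MIN_VERSION <= major_minor <= MAX_VERSION


# Report whether the interpreter running this script is in range.
print("supported:", python_version_supported())
```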

You do not need to install mmseqs or rcsbsearch if you are planning to

Usage

Downloading pre-computed datasets (stable)

Precomputed datasets generated with a consensus set of parameters can be accessed and downloaded using the proteinflow package. Check the output of proteinflow check_tags for a list of available tags.

proteinflow download --tag 20221110

Running the pipeline

You can also run proteinflow with your own parameters. Check the output of proteinflow check_snapshots for a list of available PDB snapshots (naming rule: {year}{month}{day}).
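The naming rule maps directly onto a date format string. As a small sketch, assuming a four-digit year and zero-padded month and day (as in 20190101), snapshot names can be converted to and from dates with the standard library. The function names here are illustrative and not part of proteinflow.

```python
from datetime import date, datetime


def snapshot_to_date(snapshot: str) -> date:
    """Parse a snapshot name such as '20190101' into a date."""
    return datetime.strptime(snapshot, "%Y%m%d").date()


def date_to_snapshot(day: date) -> str:
    """Format a date using the {year}{month}{day} naming rule."""
    return day.strftime("%Y%m%d")


print(snapshot_to_date("20190101"))
print(date_to_snapshot(date(2019, 1, 1)))
```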

For instance, let's generate a dataset with the following description:

resolution threshold: 5 angstrom,
PDB snapshot: 20190101,
structure methods accepted: all (X-ray crystallography, NMR, Cryo-EM),
sequence identity threshold for clustering: 40%,
maximum length per sequence: 1000 residues,
minimum length per sequence: 5 residues,
maximum fraction of missing values at the ends: 10%,
size of validation subset: 10%.

proteinflow generate --tag new --resolution_thr 5 --pdb_snapshot 20190101 --not_filter_methods --min_seq_id 0.4 --max_length 1000 --min_length 5 --missing_ends_thr 0.1 --valid_split 0.1
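When the pipeline is launched from a script, it can be convenient to keep the parameters in a dictionary and assemble the command from it. The sketch below only builds and prints the command; the flag names are copied from the example above, while the helper `build_generate_command` is a hypothetical name introduced for illustration.

```python
import shlex

# Parameters from the example above; keys match the CLI flag names.
params = {
    "tag": "new",
    "resolution_thr": 5,
    "pdb_snapshot": "20190101",
    "min_seq_id": 0.4,
    "max_length": 1000,
    "min_length": 5,
    "missing_ends_thr": 0.1,
    "valid_split": 0.1,
}
switches = ["not_filter_methods"]  # boolean flags that take no value


def build_generate_command(params, switches=()):
    """Assemble the argument list for `proteinflow generate`."""
    args = ["proteinflow", "generate"]
    for name, value in params.items():
        args += [f"--{name}", str(value)]
    args += [f"--{name}" for name in switches]
    return args


command = build_generate_command(params, switches)
print(shlex.join(command))
# To execute it: subprocess.run(command, check=True)
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems if a tag or path ever contains spaces.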

See the docs (or proteinflow generate --help) for the full list of parameters and more information.

A registry of all the files that are removed during filtering, together with the reason for their removal, is created automatically for each generate command. The log files are saved (at data/logs by default) and a summary can be accessed by running proteinflow get_summary {log_path}.

Splitting

By default, both proteinflow generate and proteinflow download will also split your data into training, test and validation according to MMseqs2 clustering and homomer/heteromer/single chain proportions. However, you can skip this step with a --skip_splitting flag and then perform it separately with the proteinflow split command.

The following command will perform the splitting with a 10% validation set, a 5% test set and a 50% threshold for sequence identity clusters.

proteinflow split --tag new --valid_split 0.1 --test_split 0.05 --min_seq_id 0.5
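To illustrate why splitting is done at the cluster level, here is a simplified sketch that assigns whole clusters to subsets so that similar sequences never end up on both sides of a split. It is not ProteinFlow's algorithm: it ignores the homomer/heteromer/single chain proportions, and the layout of `clusters` (cluster name mapped to a list of entry IDs) is an assumption made for the example.

```python
import random


def split_clusters(clusters, valid_split=0.1, test_split=0.05, seed=0):
    """Assign whole clusters to train/valid/test by approximate entry fractions."""
    names = sorted(clusters)
    random.Random(seed).shuffle(names)
    total = sum(len(members) for members in clusters.values())
    targets = {"valid": valid_split * total, "test": test_split * total}
    split = {"train": [], "valid": [], "test": []}
    sizes = {"valid": 0, "test": 0}
    for name in names:
        size = len(clusters[name])
        # Put the cluster in the first held-out subset that still has room.
        for subset in ("valid", "test"):
            if sizes[subset] + size <= targets[subset]:
                split[subset].append(name)
                sizes[subset] += size
                break
        else:
            # Neither held-out subset has room, so the cluster goes to training.
            split["train"].append(name)
    return split


# Toy clustering: 60 clusters with 1 to 3 entries each.
toy_clusters = {f"c{i}": [f"e{i}_{j}" for j in range(1 + i % 3)] for i in range(60)}
toy_split = split_clusters(toy_clusters)
print({subset: len(names) for subset, names in toy_split.items()})
```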

Using the data

The output files are pickled nested dictionaries where first-level keys are chain IDs and second-level keys are the following:

'crd_bb': a numpy array of shape (L, 4, 3) with backbone atom coordinates (N, C, CA, O),
'crd_sc': a numpy array of shape (L, 10, 3) with sidechain atom coordinates (check proteinflow.sidechain_order() for the order of atoms),
'msk': a numpy array of shape (L,) where ones correspond to residues with known coordinates and zeros to missing values,
'seq': a string of length L with residue types.

Once your data is ready, you can open the files directly with pickle to access this data.

import os
import pickle

train_folder = "./data/proteinflow_new/training"
for filename in os.listdir(train_folder):
    with open(os.path.join(train_folder, filename), "rb") as f:
        data = pickle.load(f)
    # First-level keys are chain IDs; the arrays live one level below.
    for chain_id, chain in data.items():
        crd_bb = chain["crd_bb"]
        seq = chain["seq"]
        ...
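The format can be tried out without generating a dataset. The sketch below builds a synthetic single-chain entry with the layout described above (random coordinates, a made-up sequence and chain ID), round-trips it through pickle, and uses 'msk' to keep only residues with known coordinates. It assumes numpy is installed; the dtype of 'msk' in real files is not stated here, so the mask is cast to bool before indexing.

```python
import pickle

import numpy as np

L = 6  # number of residues in the toy chain
mask = np.array([0, 1, 1, 1, 1, 0])  # both ends are missing

# Synthetic entry following the documented layout (values are random).
rng = np.random.default_rng(0)
entry = {
    "A": {
        "crd_bb": rng.normal(size=(L, 4, 3)),
        "crd_sc": rng.normal(size=(L, 10, 3)),
        "msk": mask,
        "seq": "MKTAYI",
    }
}

# Round-trip through pickle, mimicking how an output file would be read.
loaded = pickle.loads(pickle.dumps(entry))

for chain_id, chain in loaded.items():
    known = chain["msk"].astype(bool)
    known_bb = chain["crd_bb"][known]  # backbone coordinates of known residues
    known_seq = "".join(aa for aa, keep in zip(chain["seq"], known) if keep)
    print(chain_id, known_bb.shape, known_seq)
```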

Alternatively, you can use our ProteinDataset or ProteinLoader classes for convenient processing. Among other things, they allow for feature extraction, single chain / homomer / heteromer filtering and randomized sampling from sequence identity clusters.

For example, here is how we can create a data loader that:

samples a different cluster representative at every epoch,
extracts dihedral angles, sidechain orientation and secondary structure features,
only loads pairs of interacting proteins (larger biounits are broken up into pairs),
has batch size 8.

from proteinflow import ProteinLoader
train_loader = ProteinLoader(
    "./data/proteinflow_new/training", 
    clustering_dict_path="./data/proteinflow_new/splits_dict/train.pickle",
    node_features_type="dihedral+sidechain_orientation+secondary_structure",
    entry_type="pair",
    batch_size=8,
)
for batch in train_loader:
    crd_bb = batch["X"]  # (B, L, 4, 3)
    seq = batch["S"]  # (B, L)
    sse = batch["secondary_structure"]  # (B, L, 3)
    to_predict = batch["masked_res"]  # (B, L), 1 where the residue should be masked, 0 otherwise
    ...
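The idea behind sampling a different cluster representative at every epoch can be shown in a few lines of plain Python. This is only an illustration of the concept: the clustering layout (cluster name mapped to a list of entry IDs), the entry IDs and the function `sample_epoch` are all hypothetical, and ProteinLoader's actual implementation may differ.

```python
import random


def sample_epoch(clusters, rng):
    """Pick one random member from every cluster for a single epoch."""
    return [rng.choice(members) for members in clusters.values()]


# Hypothetical clustering: cluster name -> list of entry IDs.
clusters = {
    "cluster_0": ["1abc-1", "2xyz-1", "3def-2"],
    "cluster_1": ["4ghi-1"],
    "cluster_2": ["5jkl-1", "6mno-1"],
}

rng = random.Random(0)
for epoch in range(3):
    # Each epoch sees one entry per cluster, possibly a different one each time.
    print(epoch, sample_epoch(clusters, rng))
```

Each epoch then has as many samples as there are clusters, so large clusters of near-identical sequences do not dominate training.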

See more details on available parameters and the data format in the docs.

ProteinFlow Stable Releases

Tag	Date	Location (S3)	Size	Min res	Min len	Max len	MMseqs thr	Split (train/val/test)	Missing thr (ends/middle)
paper	10.11.22	data split	24G	3.5	30	10000	0.3	90/5/5	0.3/0.1

License

The proteinflow package and data are released and distributed under the BSD 3-Clause License.

Contributions

This is an open source project supported by Adaptyv Bio. Contributions, suggestions and bug fixes are welcome.


proteinflow 1.1.0
