Deep learning ready datasets of 3D protein structures.
ML-ready protein 3D structure datasets
- Fetch clean protein datasets in one line
- Convert proteins to graphs, point clouds, voxels, and surfaces (coming soon)
- Work in your favorite deep learning framework (PyTorch, TensorFlow, PyTorch Geometric, DGL, NetworkX)
proteinshake is a collection of protein structure datasets built from PDB and AlphaFold. After installing, datasets can be passed directly to ML loaders for model training.
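To make the voxel representation mentioned above concrete, here is a minimal sketch, independent of proteinshake, that bins 3D coordinates into an occupancy grid. The function name `voxelize`, the 10 Å voxel size, and the toy coordinates are assumptions for illustration only; the library's own conversion may work differently.

```python
import math

def voxelize(coords, voxel_size=10.0):
    """Map 3D points to a dict {(i, j, k): count} of occupied voxels.

    Hypothetical helper for illustration; not part of proteinshake.
    """
    # Use the minimum corner of the point cloud as the grid origin.
    origin = [min(c[axis] for c in coords) for axis in range(3)]
    grid = {}
    for point in coords:
        # Integer voxel index along each axis.
        index = tuple(
            math.floor((point[axis] - origin[axis]) / voxel_size)
            for axis in range(3)
        )
        grid[index] = grid.get(index, 0) + 1
    return grid

# Toy coordinates (in Angstrom), not real protein data.
toy_coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (12.0, 0.0, 0.0), (25.0, 1.0, 2.0)]
occupancy = voxelize(toy_coords)
print(occupancy)
```

A sparse dictionary is used here instead of a dense array so that the sketch needs nothing beyond the standard library.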
Demo
How to load an AlphaFold dataset as PyTorch Geometric graphs:
>>> from proteinshake.datasets import AlphaFoldDataset
>>> data = AlphaFoldDataset(root='.', organism='escherichia_coli').to_graph(k=5).pyg()
>>> protein_tensor, protein_data = data[0]
>>> protein_tensor
Data(x=[196], edge_index=[2, 0], edge_attr=[0, 1])
>>> protein_data['protein']['ID']
'P0A9H5'
>>> protein_data['protein']['sequence']
'MSDERYQQRQQRVKEKVDARVAQAQDERGIIIVFTGNGKGKTTAAFGTATRAVGHGKKVGVVQFIKGTWPNGERNLLEPHGVEFQVMATGFTWDTQNRESDTAACREVWQHAKRMLADSSLDMVLLDELTYMVAYDYLPLEEVVQALNERPHQQTVIITGRGCHRDILELADTVSELRPVKHAFDAGVKAQIGIDY'
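The `k=5` argument and the two-row `edge_index` field suggest a k-nearest-neighbour graph stored as parallel lists of source and target node indices. The sketch below illustrates that idea in plain Python under this assumption; `knn_edge_index` and the toy coordinates are hypothetical, and this is not the library's implementation.

```python
import math

def knn_edge_index(coords, k):
    """Return [sources, targets] linking each point to its k nearest neighbours.

    Hypothetical helper for illustration; not part of proteinshake.
    """
    sources, targets = [], []
    for i, p in enumerate(coords):
        # Rank every other point by Euclidean distance to point i.
        others = sorted(
            (j for j in range(len(coords)) if j != i),
            key=lambda j: math.dist(p, coords[j]),
        )
        for j in others[:k]:
            sources.append(i)
            targets.append(j)
    return [sources, targets]

# Toy residue positions on a line, not real protein data.
toy_coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0), (7.0, 0.0, 0.0)]
edge_index = knn_edge_index(toy_coords, k=1)
print(edge_index)
```

With n nodes and k neighbours each, the result has shape 2 by n*k, which matches the `[2, E]` layout that PyTorch Geometric uses for edges.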
PDB Datasets
name | num_proteins | avg size (# residues) | property | values | type |
---|---|---|---|---|---|
RCSBDataset | 21989 | 56.8898 | - | - | - |
PfamDataset | 18696 | 59.4297 | Protein Family (Pfam) | 5215 (root) | Categorical, Hierarchical |
GODataset | 19267 | 58.8485 | Gene Ontology (GO) | 101 (root) | Categorical, Hierarchical |
ECDataset | 8150 | 74.9618 | Enzyme Classification (EC) | 2173 | Categorical |
PDBBindRefined | 4642 | 108.806 | Small Mol. Binding Site (residue-level) | 2 | Binary |
TMScoreBenchmark | 200 | 49.458 | TM Score | [0-1] | Real-valued, Pairwise |
AlphaFold Datasets
name | num_proteins | avg size (# residues) | property | values | type |
---|---|---|---|---|---|
SwissProt | 512231 | 79.334 | - | - | - |
arabidopsis_thaliana | 27434 | 66.1312 | - | - | - |
caenorhabditis_elegans | 19694 | 65.0678 | - | - | - |
candida_albicans | 5974 | 62.782 | - | - | - |
danio_rerio | 24664 | 75.2797 | - | - | - |
dictyostelium_discoideum | 12622 | 85.9275 | - | - | - |
drosophila_melanogaster | 13458 | 81.2947 | - | - | - |
escherichia_coli | 4363 | 51.5408 | - | - | - |
glycine_max | 55799 | 58.0664 | - | - | - |
homo_sapiens | 23391 | 105.457 | - | - | - |
methanocaldococcus_jannaschii | 1773 | 46.7467 | - | - | - |
mus_musculus | 21615 | 83.0434 | - | - | - |
oryza_sativa | 43649 | 44.1931 | - | - | - |
rattus_norvegicus | 21272 | 78.1547 | - | - | - |
saccharomyces_cerevisiae | 6040 | 80.0745 | - | - | - |
schizosaccharomyces_pombe | 5128 | 76.2427 | - | - | - |
zea_mays | 39299 | 46.1618 | - | - | - |
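Summary columns like `num_proteins` and `avg size` could be recomputed from per-protein records such as the dictionary returned in the demo. A minimal sketch, assuming each record exposes `['protein']['sequence']` as shown there and that the average size is the mean sequence length; `summarize` and the toy records are hypothetical.

```python
def summarize(records):
    """Return (number of proteins, mean sequence length) for a list of records.

    Hypothetical helper for illustration; not part of proteinshake.
    """
    lengths = [len(r['protein']['sequence']) for r in records]
    return len(lengths), sum(lengths) / len(lengths)

# Toy records mimicking the demo's dictionary layout, not real data.
toy_records = [
    {'protein': {'ID': 'toy1', 'sequence': 'MSDE'}},
    {'protein': {'ID': 'toy2', 'sequence': 'MSDERYQQ'}},
]
num_proteins, avg_size = summarize(toy_records)
print(num_proteins, avg_size)
```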
Installation
$ pip install proteinshake
From source
$ git clone https://github.com/BorgwardtLab/proteinshake
$ cd proteinshake
$ pip install .
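After either install route, one way to confirm that the package is visible to the current Python environment is to query its metadata. This sketch uses only the standard library; `installed_version` is a hypothetical helper name.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

version = installed_version("proteinshake")
print("proteinshake:", version or "not installed")
```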
Usage
See the quickstart guide on our documentation site to get started.
Legal Note
We obtained and modified data from the following sources:
The AlphaFold protein structures were downloaded from the AlphaFold Structure Database, licensed under CC-BY-4.0.
The RCSB protein structures were downloaded from RCSB, licensed under CC0 1.0.