Deep learning ready datasets of 3D protein structures.
proteinshake: the largest repository of ML-ready protein 3D structure datasets
proteinshake is a collection of torch-geometric datasets built from protein structures in the PDB and AlphaFold. Once the package is installed, the datasets can be passed directly to data loaders for model training.
name | num_proteins | avg size (# residues) | property | values | type |
---|---|---|---|---|---|
RCSBDataset | 21989 | 56.8898 | - | - | - |
PfamDataset | 18696 | 59.4297 | Protein Family (Pfam) | 5215 (root) | Categorical, Hierarchical |
GODataset | 19267 | 58.8485 | Gene Ontology (GO) | 101 (root) | Categorical, Hierarchical |
ECDataset | 8150 | 74.9618 | Enzyme Classification (EC) | 2173 | Categorical |
PDBBindRefined | 4642 | 108.806 | Small Mol. Binding Site (residue-level) | 2 | Binary |
TMScoreBenchmark | 200 | 49.458 | TM Score | [0-1] | Real-valued, Pairwise |
AlphaFoldDataset_arabidopsis_thaliana | 27434 | 66.1312 | - | - | - |
AlphaFoldDataset_caenorhabditis_elegans | 19694 | 65.0678 | - | - | - |
AlphaFoldDataset_candida_albicans | 5974 | 62.782 | - | - | - |
AlphaFoldDataset_danio_rerio | 24664 | 75.2797 | - | - | - |
AlphaFoldDataset_dictyostelium_discoideum | 12622 | 85.9275 | - | - | - |
AlphaFoldDataset_drosophila_melanogaster | 13458 | 81.2947 | - | - | - |
AlphaFoldDataset_escherichia_coli | 4363 | 51.5408 | - | - | - |
AlphaFoldDataset_glycine_max | 55799 | 58.0664 | - | - | - |
AlphaFoldDataset_homo_sapiens | 23391 | 105.457 | - | - | - |
AlphaFoldDataset_methanocaldococcus_jannaschii | 1773 | 46.7467 | - | - | - |
AlphaFoldDataset_mus_musculus | 21615 | 83.0434 | - | - | - |
AlphaFoldDataset_oryza_sativa | 43649 | 44.1931 | - | - | - |
AlphaFoldDataset_rattus_norvegicus | 21272 | 78.1547 | - | - | - |
AlphaFoldDataset_saccharomyces_cerevisiae | 6040 | 80.0745 | - | - | - |
AlphaFoldDataset_schizosaccharomyces_pombe | 5128 | 76.2427 | - | - | - |
AlphaFoldDataset_zea_mays | 39299 | 46.1618 | - | - | - |
Installation
$ pip install proteinshake
Note: make sure that the installed versions of torch-scatter and torch-sparse match your hardware and CUDA version. See this page for more info.
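As a rough illustration of the note above, the following sketch builds a pip command for torch-scatter and torch-sparse from the local PyTorch and CUDA versions. The helper names `pyg_wheel_index` and `pip_command` are hypothetical and not part of proteinshake. The wheel index URL pattern is an assumption based on the one used in the PyTorch Geometric installation instructions. Index pages exist only for certain PyTorch releases, so check that the printed URL resolves before relying on it.

```python
from typing import Optional


def pyg_wheel_index(torch_version: str, cuda_version: Optional[str]) -> str:
    """Return the assumed PyG wheel index URL for a torch/CUDA combination.

    torch_version: value of torch.__version__, e.g. "1.13.0+cu117".
    cuda_version:  value of torch.version.cuda, e.g. "11.7", or None for CPU builds.
    """
    # Drop any local build suffix such as "+cu117" from the torch version.
    base = torch_version.split("+")[0]
    # Map "11.7" to "cu117"; CPU-only builds use the tag "cpu".
    tag = "cpu" if cuda_version is None else "cu" + cuda_version.replace(".", "")
    return f"https://data.pyg.org/whl/torch-{base}+{tag}.html"


def pip_command(torch_version: str, cuda_version: Optional[str]) -> str:
    """Build a pip command that installs wheels from the assumed index page."""
    index = pyg_wheel_index(torch_version, cuda_version)
    return f"pip install torch-scatter torch-sparse -f {index}"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed; install it before torch-scatter/torch-sparse.")
    else:
        # torch.version.cuda is None when PyTorch was built without CUDA support.
        print(pip_command(torch.__version__, torch.version.cuda))
```

Run the script in the environment where proteinshake will be installed, then run the printed command before `pip install proteinshake` if the default wheels do not match your setup.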
From source
$ git clone https://github.com/BorgwardtLab/proteinshake
$ cd proteinshake
$ pip install .
Usage
See the quickstart guide on our documentation site to get started.
Licenses
We make our code available under the MIT License. The datasets are distributed under CC-BY-4.0.
We obtained and modified data from the following sources:
The AlphaFold protein structures were downloaded from the AlphaFold Structure Database, licensed under CC-BY-4.0.
The RCSB protein structures were downloaded from RCSB, licensed under CC0 1.0.