Deep-learning-ready datasets of 3D protein structures.
The largest repository of ML-ready protein 3D structure datasets
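This package provides a collection of protein structure datasets built from the RCSB Protein Data Bank (PDB) and the AlphaFold Structure Database. Once the package is installed, the datasets can be passed directly to machine learning data loaders for model training.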
PDB Datasets
name | num_proteins | avg size (# residues) | property | values | type |
---|---|---|---|---|---|
RCSBDataset | 21989 | 56.8898 | - | - | - |
PfamDataset | 18696 | 59.4297 | Protein Family (Pfam) | 5215 (root) | Categorical, Hierarchical |
GODataset | 19267 | 58.8485 | Gene Ontology (GO) | 101 (root) | Categorical, Hierarchical |
ECDataset | 8150 | 74.9618 | Enzyme Classification (EC) | 2173 | Categorical |
PDBBindRefined | 4642 | 108.806 | Small Mol. Binding Site (residue-level) | 2 | Binary |
TMScoreBenchmark | 200 | 49.458 | TM Score | [0-1] | Real-valued, Pairwise |
AlphaFold Datasets
name | num_proteins | avg size (# residues) | property | values | type |
---|---|---|---|---|---|
SwissProt | 512.231 | 79.334 | - | - | - |
arabidopsis_thaliana | 27434 | 66.1312 | - | - | - |
caenorhabditis_elegans | 19694 | 65.0678 | - | - | - |
candida_albicans | 5974 | 62.782 | - | - | - |
danio_rerio | 24664 | 75.2797 | - | - | - |
dictyostelium_discoideum | 12622 | 85.9275 | - | - | - |
drosophila_melanogaster | 13458 | 81.2947 | - | - | - |
escherichia_coli | 4363 | 51.5408 | - | - | - |
glycine_max | 55799 | 58.0664 | - | - | - |
homo_sapiens | 23391 | 105.457 | - | - | - |
methanocaldococcus_jannaschii | 1773 | 46.7467 | - | - | - |
mus_musculus | 21615 | 83.0434 | - | - | - |
oryza_sativa | 43649 | 44.1931 | - | - | - |
rattus_norvegicus | 21272 | 78.1547 | - | - | - |
saccharomyces_cerevisiae | 6040 | 80.0745 | - | - | - |
schizosaccharomyces_pombe | 5128 | 76.2427 | - | - | - |
zea_mays | 39299 | 46.1618 | - | - | - |
Installation
$ pip install proteinshake
From source
$ git clone https://github.com/BorgwardtLab/proteinshake
$ cd proteinshake
$ pip install .
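Either installation route can be checked afterwards with a short script. The following is a minimal sketch using only the Python standard library; the helper name `installed_version` is hypothetical and not part of proteinshake. It assumes the distribution is registered under the name `proteinshake`, as in the `pip install` command above.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str = "proteinshake") -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("proteinshake is not installed in this environment")
    else:
        print(f"proteinshake {version} is installed")
```

If the script reports that the package is missing, the most common cause is that `pip` installed into a different Python environment than the one running the script.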
Usage
See the quickstart guide on our documentation site to get started.
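The quickstart guide documents the actual proteinshake calls, which are not reproduced here. As background, proteins in these datasets vary in length (the tables above list only average sizes), so a loader typically has to bring each batch to a common shape. The sketch below illustrates that general idea in plain Python with made-up coordinates. The function `pad_batch` and the toy data are hypothetical and are not part of proteinshake.

```python
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def pad_batch(
    proteins: Sequence[Sequence[Coord]],
    pad: Coord = (0.0, 0.0, 0.0),
) -> Tuple[List[List[Coord]], List[List[bool]]]:
    """Pad variable-length residue coordinate lists to a common length.

    Returns the padded coordinates and a mask that is True for real
    residues and False for padding.
    """
    if not proteins:
        return [], []
    max_len = max(len(p) for p in proteins)
    coords = [list(p) + [pad] * (max_len - len(p)) for p in proteins]
    mask = [[True] * len(p) + [False] * (max_len - len(p)) for p in proteins]
    return coords, mask


if __name__ == "__main__":
    # Two toy "proteins" with one and two residues (stand-in data).
    batch = [
        [(1.0, 2.0, 3.0)],
        [(0.5, 0.5, 0.5), (1.5, 1.5, 1.5)],
    ]
    coords, mask = pad_batch(batch)
    print(coords)
    print(mask)
```

The mask lets a model ignore the padded positions. Graph-based loaders usually avoid padding altogether by concatenating the proteins in a batch, so this is only one of several possible batching strategies.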
Legal Note
We obtained and modified data from the following sources:
The AlphaFold protein structures were downloaded from the AlphaFold Structure Database, licensed under CC-BY-4.0.
The RCSB protein structures were downloaded from RCSB, licensed under CC0 1.0.