Deep learning ready datasets of 3D protein structures.

ML-ready protein 3D structure datasets

  • Fetch clean protein datasets in one line.
  • Convert proteins to graphs, point clouds, voxels, and surfaces (coming soon).
  • Work in your favorite deep learning framework (PyTorch, TensorFlow, PyTorch Geometric, DGL, NetworkX).
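
To make the representations in the list above concrete, here is a minimal, standard-library sketch of what a point cloud, a graph, and a voxel grid of the same protein could look like. This is not proteinshake's implementation or API: the function names, the toy coordinates, and the `eps` and `size` parameters are all hypothetical and chosen for illustration only.

```python
import math
from itertools import combinations

# Hypothetical toy input: one (x, y, z) coordinate in angstroms per residue.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (3.8, 3.8, 0.0)]

def to_point_cloud(coords):
    """A point cloud is simply the list of residue coordinates."""
    return [tuple(c) for c in coords]

def to_graph(coords, eps):
    """Connect every pair of residues whose distance is at most eps."""
    edges = []
    for i, j in combinations(range(len(coords)), 2):
        if math.dist(coords[i], coords[j]) <= eps:
            edges.append((i, j))
    return edges

def to_voxels(coords, size):
    """Map each occupied grid cell (edge length `size`) to the residues in it."""
    grid = {}
    for idx, (x, y, z) in enumerate(coords):
        cell = (math.floor(x / size), math.floor(y / size), math.floor(z / size))
        grid.setdefault(cell, []).append(idx)
    return grid

points = to_point_cloud(coords)
edges = to_graph(coords, eps=4.0)
voxels = to_voxels(coords, size=5.0)
```

All three views are derived from the same coordinates; they differ only in which spatial relationships they make explicit for the downstream model.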

proteinshake is a collection of protein structure datasets built from PDB and AlphaFold. Once the package is installed, the datasets can be passed directly to the data loaders of the frameworks listed above for model training.
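
As a rough illustration of what a data loader does with a dataset, the sketch below shuffles a list of records and yields them in mini-batches. It deliberately avoids the proteinshake and framework APIs: `iter_batches` and the `(protein_id, n_residues)` records are hypothetical stand-ins, and a real loader would also collate each batch into tensors.

```python
import random

def iter_batches(items, batch_size, shuffle=True, seed=0):
    """Yield lists of at most batch_size items, optionally in shuffled order."""
    order = list(range(len(items)))
    if shuffle:
        # A seeded generator keeps the shuffle reproducible between runs.
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [items[i] for i in order[start:start + batch_size]]

# Hypothetical stand-in for a dataset: (protein_id, n_residues) records.
dataset = [(f"protein_{i}", 50 + i) for i in range(10)]

batches = list(iter_batches(dataset, batch_size=4))
for batch in batches:
    print(len(batch), [protein_id for protein_id, _ in batch])
```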

PDB Datasets

name | num_proteins | avg size (# residues) | property | values | type
--- | --- | --- | --- | --- | ---
RCSBDataset | 21989 | 56.8898 | - | - | -
PfamDataset | 18696 | 59.4297 | Protein Family (Pfam) | 5215 (root) | Categorical, Hierarchical
GODataset | 19267 | 58.8485 | Gene Ontology (GO) | 101 (root) | Categorical, Hierarchical
ECDataset | 8150 | 74.9618 | Enzyme Classification (EC) | 2173 | Categorical
PDBBindRefined | 4642 | 108.806 | Small Mol. Binding Site (residue-level) | 2 | Binary
TMScoreBenchmark | 200 | 49.458 | TM Score | [0-1] | Real-valued, Pairwise

AlphaFold Datasets

name | num_proteins | avg size (# residues) | property | values | type
--- | --- | --- | --- | --- | ---
SwissProt | 512.231 | 79.334 | - | - | -
arabidopsis_thaliana | 27434 | 66.1312 | - | - | -
caenorhabditis_elegans | 19694 | 65.0678 | - | - | -
candida_albicans | 5974 | 62.782 | - | - | -
danio_rerio | 24664 | 75.2797 | - | - | -
dictyostelium_discoideum | 12622 | 85.9275 | - | - | -
drosophila_melanogaster | 13458 | 81.2947 | - | - | -
escherichia_coli | 4363 | 51.5408 | - | - | -
glycine_max | 55799 | 58.0664 | - | - | -
homo_sapiens | 23391 | 105.457 | - | - | -
methanocaldococcus_jannaschii | 1773 | 46.7467 | - | - | -
mus_musculus | 21615 | 83.0434 | - | - | -
oryza_sativa | 43649 | 44.1931 | - | - | -
rattus_norvegicus | 21272 | 78.1547 | - | - | -
saccharomyces_cerevisiae | 6040 | 80.0745 | - | - | -
schizosaccharomyces_pombe | 5128 | 76.2427 | - | - | -
zea_mays | 39299 | 46.1618 | - | - | -

Installation

$ pip install proteinshake

From source

$ git clone https://github.com/BorgwardtLab/proteinshake
$ cd proteinshake
$ pip install .
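
After either install route, a quick way to confirm that the package is visible to the current interpreter is to query its installed version through the standard library. This is a small sketch, not part of proteinshake; the helper name `installed_version` is hypothetical.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

version = installed_version("proteinshake")
print("proteinshake:", version if version else "not installed")
```

If this reports "not installed", the most likely cause is that `pip` installed into a different Python environment than the one running the check.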

Usage

See the quickstart guide on our documentation site to get started.

Legal Note

We obtained and modified data from the following sources:

The AlphaFold protein structures were downloaded from the AlphaFold Structure Database, licensed under CC-BY-4.0.

The RCSB protein structures were downloaded from RCSB, licensed under CC0 1.0.

