
Programmatic access to ProteinGym datasets with visualization tools.


ProteinGymPy


Overview

ProteinGym, introduced by Notin et al. (2023), is a collection of benchmarks for evaluating models that predict the effects of point mutations. The datasets include deep mutational scanning (DMS) assays for 186 proteins, as well as performance scores for several models in both zero-shot and semi-supervised settings.

ProteinGymPy provides analysis-ready data resources from ProteinGym (Notin et al., 2023) and built-in functionality to visualize the data in Python. This package provides access to:

  1. Deep mutational scanning (DMS) scores from 217 assays measuring the impact of all possible amino acid substitutions across 186 proteins, and
  2. Model performance metrics and prediction scores from 79 variant prediction models in a zero-shot setting and 12 models in a semi-supervised setting.

[Figure: ProteinGymPy infographic]

Installation

To install the software, we recommend using uv for Python package management:

uv venv --python=3.13  # create a virtual environment
source .venv/bin/activate
uv pip install ProteinGymPy

Additional packages are required for the Jupyter notebook visualization functions:

uv pip install 'ProteinGymPy[visualization]'

To pull in the latest commits directly from GitHub, use the following:

uv pip install 'ProteinGymPy[visualization] @ git+https://github.com/ccb-hms/ProteinGymPy.git'
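
To confirm that the install succeeded, one option is to query the installed distribution's version from Python. The sketch below uses only the standard library; the helper name `installed_version` is ours for illustration and is not part of ProteinGymPy.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints a version string if the package is installed in the active
# environment, otherwise None.
print(installed_version("ProteinGymPy"))
```

A `None` result usually means the virtual environment is not activated or the install step was run in a different environment.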

Usage

Quick Start

Load DMS substitution data (217 assays):

from proteingympy import get_dms_substitution_data

# Load all DMS assays with UniProt IDs  
dms_data = get_dms_substitution_data()
print(f"Loaded {len(dms_data)} DMS assays")

# Access specific assay
assay_name = list(dms_data.keys())[0]
df = dms_data[assay_name]
print(df.head())
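
Once the assays are loaded, a quick size overview can help when choosing which one to inspect. The sketch below assumes only what the quick start shows, namely that the loader returns a dict-like object mapping assay names to tables that support `len()` (as pandas DataFrames do). The helper `rows_per_assay` and the toy assay names are hypothetical; plain lists stand in for DataFrames so the example runs without downloading anything.

```python
from typing import Mapping, Sized


def rows_per_assay(assays: Mapping[str, Sized]) -> list[tuple[str, int]]:
    """Return (assay name, row count) pairs, largest assay first."""
    counts = [(name, len(table)) for name, table in assays.items()]
    return sorted(counts, key=lambda pair: pair[1], reverse=True)


# Stand-in data: each list plays the role of one assay table.
toy_assays = {
    "ASSAY_A": [0.1, 0.4, -1.2],
    "ASSAY_B": [0.3],
    "ASSAY_C": [1.0, 0.9],
}

for name, n_rows in rows_per_assay(toy_assays):
    print(f"{name}: {n_rows} rows")
```

With real data, passing `dms_data` from the quick start in place of `toy_assays` should work unchanged, provided the return type matches the assumption above.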

Load other datasets:

from proteingympy import (
    get_alphamissense_proteingym_data,
    get_zero_shot_metrics,
    get_supervised_substitution_data
)

# AlphaMissense pathogenicity scores
am_data = get_alphamissense_proteingym_data()

# Zero-shot benchmarking metrics  
benchmarks = get_zero_shot_metrics()
print(f"Available metrics: {list(benchmarks.keys())}")

# Supervised model predictions
supervised_data, summary = get_supervised_substitution_data("random_5")

Available Functions

Function                             Description
get_dms_substitution_data()          Load 217 DMS substitution assays
get_dms_metadata()                   Load DMS assay metadata/reference file
get_alphamissense_proteingym_data()  Load AlphaMissense pathogenicity scores
get_supervised_substitution_data()   Load supervised model predictions for DMS substitutions
get_zero_shot_substitution_data()    Load zero-shot model predictions for DMS substitutions
get_zero_shot_metrics()              Load zero-shot benchmarking metrics
get_supervised_metrics()             Load supervised benchmarking metrics
available_supervised_models()        Get list of available supervised models
available_zero_shot_models()         Get list of available zero-shot models
create_complete_metadata_table()     Generate comprehensive metadata
benchmark_models()                   Benchmark multiple variant effect prediction models
dms_corr_plot()                      Correlate model performance and DMS scores
model_corr_plot()                    Compare two model performance scores
plot_dms_heatmap()                   Visualize DMS scores along a protein as a heatmap
plot_structure()                     Visualize DMS or model scores on 3D protein structure
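
The table above does not list function parameters. One way to look them up locally is to introspect the installed package with the standard library. The helper `public_functions` below is a hypothetical convenience, not part of ProteinGymPy, and the sketch assumes the functions are exposed at the top level of the `proteingympy` module, as the imports in the quick start suggest.

```python
import importlib
import inspect


def public_functions(module_name: str) -> dict[str, str]:
    """Map each public function in a module to its signature string."""
    module = importlib.import_module(module_name)
    return {
        name: str(inspect.signature(obj))
        for name, obj in inspect.getmembers(module, inspect.isfunction)
        if not name.startswith("_")
    }


try:
    for name, signature in public_functions("proteingympy").items():
        print(f"{name}{signature}")
except ImportError:
    # The package is optional for this sketch; report and move on.
    print("proteingympy is not installed in this environment")
```

Calling `help()` on any single function gives its full docstring in the same way.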

Running Examples

Run the full example script (includes data downloads):

source .venv/bin/activate
python examples/proteingym_pipeline_examples.py

This demonstrates all available data loading functions and shows the structure of each dataset.

Running Tests

Run the test suite to verify functionality:

source .venv/bin/activate
python -m pytest tests/ -v

Or run specific test files:

source .venv/bin/activate  
python tests/test_data_pipelines.py
python tests/test_basic.py

The tests include:

  • Unit tests for all data loading functions
  • Integration tests for complete workflows
  • Mock tests that don't require network access
  • Validation of data structure and consistency
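
As an illustration of the mock-test pattern mentioned in the list above, the sketch below tests a loader without touching the network by injecting a fake opener. The function `fetch_json`, the URL, and the payload are all hypothetical and are not taken from the ProteinGymPy test suite.

```python
import io
import json
from unittest import mock
from urllib.request import urlopen


def fetch_json(url, opener=urlopen):
    """Hypothetical loader: download a JSON document and parse it."""
    with opener(url) as response:
        return json.load(response)


def test_fetch_json_without_network():
    # BytesIO is a context manager with .read(), so it can stand in
    # for the HTTP response object.
    fake_response = io.BytesIO(b'{"status": "ok"}')
    fake_opener = mock.Mock(return_value=fake_response)

    result = fetch_json("https://example.invalid/data.json", opener=fake_opener)

    assert result == {"status": "ok"}
    fake_opener.assert_called_once_with("https://example.invalid/data.json")


test_fetch_json_without_network()
```

Passing the opener as a parameter keeps the test independent of module paths; patching with `mock.patch` is an equally common alternative.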

Documentation building

This project uses MkDocs to build its documentation. To build the static docs, clone the repository and install the extra dependencies:

git clone https://github.com/ccb-hms/ProteinGymPy.git  # skip if already cloned
cd ProteinGymPy
uv sync --extra mkdocs
uv run mkdocs build

Citation

Notin, P., A. Kollasch, D. Ritter, L. van Niekerk, S. Paul, H. Spinner, N. Rollins, et al. 2023. “ProteinGym: Large-Scale Benchmarks for Protein Fitness Prediction and Design.” In Advances in Neural Information Processing Systems, edited by A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, 36:64331–79. Curran Associates, Inc.
