Programmatic access to ProteinGym datasets with visualization tools.
ProteinGymPy
Overview
ProteinGymPy provides analysis-ready data resources from ProteinGym (Notin et al., 2023), together with built-in functionality to visualize the data in Python. ProteinGym is a collection of benchmarks for evaluating models that predict the effect of point mutations. This package provides access to:
- Deep mutational scanning (DMS) scores from 217 assays measuring the impact of all possible amino acid substitutions across 186 proteins, and
- Model performance metrics and prediction scores from 79 variant prediction models in a zero-shot setting and 12 models in a semi-supervised setting.
Installation
To install the software, we recommend using uv for Python package management:
uv venv --python=3.13  # create a venv
source .venv/bin/activate
uv pip install ProteinGymPy
Additional packages are required for the Jupyter notebook visualization functions:
uv pip install 'ProteinGymPy[visualization]'
To pull in the latest commits directly from GitHub, use the following:
uv pip install 'ProteinGymPy[visualization] @ git+https://github.com/ccb-hms/ProteinGymPy.git'
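After installing, it can help to confirm which version is active in the current environment. The following is a minimal sketch that uses only the standard library; the distribution name `ProteinGymPy` is taken from the install command above, and the helper name `installed_version` is hypothetical (it is not part of the package).

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


found = installed_version("ProteinGymPy")
print(found if found else "ProteinGymPy is not installed in this environment")
```

Returning `None` instead of raising keeps the check usable in setup scripts that need to branch on whether the package is present.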
Usage
Quick Start
Load DMS substitution data (217 assays):
from proteingympy import get_dms_substitution_data
# Load all DMS assays with UniProt IDs
dms_data = get_dms_substitution_data()
print(f"Loaded {len(dms_data)} DMS assays")
# Access specific assay
assay_name = list(dms_data.keys())[0]
df = dms_data[assay_name]
print(df.head())
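Because the loader returns a mapping from assay name to table, a quick size overview can help when choosing assays to inspect. The sketch below works on any mapping whose values support `len()`, which is assumed to hold for the returned tables; the helper `largest_assays` and the stand-in data are hypothetical, and no real column names are assumed.

```python
from typing import List, Mapping, Sized, Tuple


def largest_assays(assays: Mapping[str, Sized], top: int = 3) -> List[Tuple[str, int]]:
    """Return the `top` assays with the most rows as (name, n_rows) pairs."""
    sizes = [(name, len(table)) for name, table in assays.items()]
    # Sort by row count (descending), then by name so ties have a stable order.
    sizes.sort(key=lambda item: (-item[1], item[0]))
    return sizes[:top]


# Stand-in data: plain lists instead of real assay tables (len() works on both).
demo = {
    "ASSAY_A": [0.1, 0.4, -1.2],
    "ASSAY_B": [0.3],
    "ASSAY_C": [1.0, 0.2],
}
print(largest_assays(demo, top=2))
```

With the package installed, the same helper could be applied to the loaded data, for example `largest_assays(dms_data)`.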
Load other datasets:
from proteingympy import (
    get_alphamissense_proteingym_data,
    get_zero_shot_metrics,
    get_supervised_substitution_data,
)
# AlphaMissense pathogenicity scores
am_data = get_alphamissense_proteingym_data()
# Zero-shot benchmarking metrics
benchmarks = get_zero_shot_metrics()
print(f"Available metrics: {list(benchmarks.keys())}")
# Supervised model predictions
supervised_data, summary = get_supervised_substitution_data("random_5")
Available Functions
| Function | Description |
|---|---|
| `get_dms_substitution_data()` | Load 217 DMS substitution assays |
| `get_dms_metadata()` | Load DMS assay metadata/reference file |
| `get_alphamissense_proteingym_data()` | Load AlphaMissense pathogenicity scores |
| `get_supervised_substitution_data()` | Load supervised model predictions for DMS substitutions |
| `get_zero_shot_substitution_data()` | Load zero-shot model predictions for DMS substitutions |
| `get_zero_shot_metrics()` | Load zero-shot benchmarking metrics |
| `get_supervised_metrics()` | Load supervised benchmarking metrics |
| `available_supervised_models()` | Get list of available supervised models |
| `available_zero_shot_models()` | Get list of available zero-shot models |
| `create_complete_metadata_table()` | Generate comprehensive metadata |
| `benchmark_models()` | Benchmark multiple variant effect prediction models |
| `dms_corr_plot()` | Correlate model performance and DMS scores |
| `model_corr_plot()` | Compare two model performance scores |
| `plot_dms_heatmap()` | Visualize DMS scores along a protein as a heatmap |
| `plot_structure()` | Visualize DMS or model scores on 3D protein structure |
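One common way to relate model predictions to DMS scores is a rank correlation such as Spearman's. Whether `dms_corr_plot()` computes exactly this statistic is an assumption; the sketch below does not call the package and uses made-up numbers purely to illustrate the idea. The function names `average_ranks` and `spearman` are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Made-up scores for four variants (not taken from ProteinGym).
dms_scores = [0.1, 0.5, -1.0, 0.9]
model_scores = [0.2, 0.4, -0.3, 1.5]
print(spearman(dms_scores, model_scores))
```

Because only the ordering matters, a model whose scores are on a different scale than the assay can still reach a high correlation if it ranks variants the same way.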
Running Examples
Run the full example script (includes data downloads):
source .venv/bin/activate
python examples/proteingym_pipeline_examples.py
This demonstrates all available data loading functions and shows the structure of each dataset.
Running Tests
Run the test suite to verify functionality:
source .venv/bin/activate
python -m pytest tests/ -v
Or run specific test files:
source .venv/bin/activate
python tests/test_data_pipelines.py
python tests/test_basic.py
The tests include:
- Unit tests for all data loading functions
- Integration tests for complete workflows
- Mock tests that don't require network access
- Validation of data structure and consistency
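To illustrate how a test can avoid network access, here is a minimal sketch using `unittest.mock` in a pytest-style test function. It does not reproduce the project's actual tests: `fake_pkg`, `download_assays`, and `count_assays` are hypothetical stand-ins for a module with a network-bound loader and the code that depends on it.

```python
import types
from unittest import mock

# Hypothetical stand-in for a module whose loader would normally hit the network.
fake_pkg = types.SimpleNamespace()


def _download_assays():
    raise RuntimeError("network access is not available in this test")


fake_pkg.download_assays = _download_assays


def count_assays():
    """Code under test: relies on the (network-bound) loader."""
    return len(fake_pkg.download_assays())


def test_count_assays_without_network():
    canned = {"ASSAY_A": [0.1, 0.2], "ASSAY_B": [0.3]}
    # Swap in canned data for the duration of the `with` block only.
    with mock.patch.object(fake_pkg, "download_assays", return_value=canned) as loader:
        assert count_assays() == 2
        loader.assert_called_once_with()
```

Once the `with` block exits, the original loader is restored, so the patch cannot leak into other tests.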
Documentation building
This project uses MkDocs to build its documentation. After cloning the repository, static docs can be built by installing the extra dependencies:
# git clone https://github.com/ccb-hms/ProteinGymPy.git
cd ProteinGymPy
uv sync --extra mkdocs
uv run mkdocs build
Citation
Notin, P., A. Kollasch, D. Ritter, L. van Niekerk, S. Paul, H. Spinner, N. Rollins, et al. 2023. “ProteinGym: Large-Scale Benchmarks for Protein Fitness Prediction and Design.” In Advances in Neural Information Processing Systems, edited by A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, 36:64331–79. Curran Associates, Inc.