A visualisation tool for protein embeddings from pLMs
ProtSpace
ProtSpace is a powerful visualization tool for exploring protein embeddings and structures. It allows users to interactively visualize high-dimensional protein language model data in 2D or 3D space, color-code proteins based on various features, and view protein structures when available.
Quick Start with Google Colab
To quickly try out ProtSpace without installing anything on your local machine, you can use our Google Colab notebook. This notebook provides a dummy example that demonstrates the basic functionality of ProtSpace.
Click on the "Open In Colab" button above to open the notebook in Google Colab. You can then run the cells in the notebook to see ProtSpace in action with a sample dataset.
This notebook includes:
- Installation of required dependencies
- Generation of a dummy dataset
- Data preparation using the prepare_json.py script
- Visualization of the data using ProtSpace
It's a great way to get familiar with ProtSpace before setting it up on your local machine or using it with your own data.
Example Outputs
To give you an idea of what ProtSpace can produce, here are some example outputs:
2D Scatter Plot (SVG)
Below is an example of a 2D scatter plot generated by ProtSpace, showing protein embeddings colored by a selected feature:
This SVG image is a static representation of the interactive plot you'll see in the ProtSpace app. In the actual app, you can hover over points to see details, zoom in/out, and pan around the plot.
3D Interactive Plot (HTML)
For 3D projections, ProtSpace generates interactive HTML plots. You can view an example of such here:
Running ProtSpace
ProtSpace uses uv for dependency management and packaging. Make sure you have uv installed on your system. If not, you can install it by following the instructions on the uv website.
Quickly running
uvx protspace
Permanent installation
uv tool install protspace
uv tool update-shell
If you are looking for the latest stable version on GitHub, please use:
uv tool install git+https://github.com/tsenoner/ProtSpace.git
uv tool update-shell
Usage
Preparing Data
Before using the ProtSpace app, you need to prepare your data using the prepare_json.py script. This script takes protein embedding data and feature information as input and generates a JSON file that the ProtSpace app can read.
To prepare your data:
uvx --from protspace protspace-json -H path/to/embeddings.h5 -c path/to/features.csv -o output.json --methods pca3 umap2 tsne2
For more details on the prepare_json.py script, see the Data Preparation Script section below.
Running the ProtSpace App
To run the ProtSpace app:
uv tool run protspace <path/to/output.json> [--pdb_dir path/to/pdb/files] [--port 8050]
- path/to/output.json: Path to the JSON file generated by the prepare_json.py script.
- --pdb_dir (optional): Path to a directory containing PDB files for protein structure visualization.
- --port (optional): Port number to run the server on (default is 8050).
After running the command, open a web browser and navigate to http://localhost:8050 (or the port you specified) to use the ProtSpace app.
Features
- Interactive Visualization: Explore protein embeddings in 2D or 3D space using various dimensionality reduction techniques (PCA, UMAP, t-SNE).
- Feature-based Coloring: Color-code proteins based on different features to identify patterns and relationships.
- Protein Structure Visualization: If PDB files are provided, view 3D structures of selected proteins.
- Search and Highlight: Search for specific proteins and highlight them in the visualization.
- Downloadable Plots: Save high-quality images of the visualizations for use in presentations or publications.
- Responsive Design: The app adapts to different screen sizes and layouts.
Data Preparation Script: prepare_json.py
The prepare_json.py script is used to preprocess protein embedding data and feature information into a format that the ProtSpace app can use.
Usage
uvx --from protspace protspace-json -H <hdf_file> -c <csv_file> -o <output_json> [options]
Arguments
- -H, --hdf: Path to the HDF file containing protein embeddings.
- -c, --csv: Path to the CSV file containing protein features.
- -o, --output: Path to save the output JSON file.
- --methods: Dimensionality reduction techniques to apply. Options: pca2, pca3, umap2, umap3, tsne2, tsne3. (Default: pca3)
- -v, --verbose: Increase output verbosity. Use -v for INFO, -vv for DEBUG.
Additional Parameters
- UMAP parameters:
  - --n_neighbors: UMAP n_neighbors parameter (default: 15)
  - --min_dist: UMAP min_dist parameter (default: 0.1)
  - --metric: UMAP metric parameter (default: euclidean)
- t-SNE parameters:
  - --perplexity: t-SNE perplexity parameter (default: 30)
  - --learning_rate: t-SNE learning_rate parameter (default: 200)
Example
uvx --from protspace protspace-json -H data/3FTx/3FTx_prott5.h5 -c data/3FTx.csv -o data/3FTx.json --methods pca2 pca3 -v
This command will process the embeddings from data/3FTx/3FTx_prott5.h5, combine them with features from data/3FTx.csv, apply PCA (2D) and PCA (3D) dimensionality reduction, and save the result to data/3FTx.json. It will also provide verbose output during processing.
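When the preparation step is driven from a Python script rather than typed by hand, it can help to assemble the command as an argument list first. The following is a minimal sketch that uses only the flags documented above; the helper name build_prepare_command is hypothetical and not part of ProtSpace.

```python
import shlex

# Projection names listed under --methods in this README.
VALID_METHODS = {"pca2", "pca3", "umap2", "umap3", "tsne2", "tsne3"}


def build_prepare_command(hdf_path, csv_path, output_path,
                          methods=("pca3",), verbosity=0):
    """Return the argv list for the documented protspace-json invocation.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    unknown = [m for m in methods if m not in VALID_METHODS]
    if unknown:
        raise ValueError(f"Unsupported methods: {unknown}")
    cmd = [
        "uvx", "--from", "protspace", "protspace-json",
        "-H", str(hdf_path),
        "-c", str(csv_path),
        "-o", str(output_path),
        "--methods", *methods,
    ]
    if verbosity > 0:
        # -v for INFO, -vv for DEBUG, as described under Arguments.
        cmd.append("-" + "v" * verbosity)
    return cmd


if __name__ == "__main__":
    cmd = build_prepare_command(
        "data/3FTx/3FTx_prott5.h5",
        "data/3FTx.csv",
        "data/3FTx.json",
        methods=("pca2", "pca3"),
        verbosity=1,
    )
    # Print the shell-quoted command; pass cmd to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Validating the method names before launching avoids starting a long run that would fail on a mistyped projection name.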
Adding Custom Feature Colors
ProtSpace allows you to customize the colors used for different feature values in your visualizations. You can use the add_feature_colors.py script to add or update feature colors in your ProtSpace JSON file.
Usage of add_feature_colors.py
uvx --from protspace protspace-feature-colors <input_json_file> <output_json_file> --feature_colors <feature_colors_input>
Arguments
- <input_json_file>: Path to the input JSON file (generated by prepare_json.py).
- <output_json_file>: Path to save the updated JSON file.
- --feature_colors: JSON string of feature colors or path to a JSON file containing feature colors.
Feature Colors Input Format
The feature colors can be provided in two ways:
- As a JSON string: '{"feature1": {"value1": "#FF0000", "value2": "#00FF00"}}'
- As a path to a JSON file containing the feature colors: path/to/colors.json
The JSON structure should be:
{
  "feature1": {
    "value1": "#FF0000",
    "value2": "#00FF00"
  },
  "feature2": {
    "valueA": "#0000FF",
    "valueB": "#FFFF00"
  }
}
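A colors file in this structure can also be generated programmatically. The sketch below checks a mapping and writes it to disk; it assumes six-digit hex codes as in the examples above (ProtSpace itself may accept other color formats), and the helper names validate_feature_colors and write_feature_colors are hypothetical.

```python
import json
import re
import tempfile
from pathlib import Path

# Assumption: colors are six-digit hex codes, as in the examples above.
HEX_COLOR = re.compile(r"#[0-9A-Fa-f]{6}")


def validate_feature_colors(colors):
    """Raise ValueError unless colors looks like {feature: {value: "#RRGGBB"}}."""
    if not isinstance(colors, dict):
        raise ValueError("top level must be an object keyed by feature name")
    for feature, mapping in colors.items():
        if not isinstance(mapping, dict):
            raise ValueError(f"{feature!r} must map feature values to colors")
        for value, color in mapping.items():
            if not isinstance(color, str) or not HEX_COLOR.fullmatch(color):
                raise ValueError(f"{feature!r}/{value!r}: invalid color {color!r}")
    return colors


def write_feature_colors(colors, path):
    """Validate the mapping, then write it as an indented JSON file."""
    Path(path).write_text(json.dumps(validate_feature_colors(colors), indent=2))


if __name__ == "__main__":
    colors = {"major_group": {"3FTx": "#FF0000", "PLA2": "#00FF00", "SVMP": "#0000FF"}}
    out = Path(tempfile.gettempdir()) / "colors.json"
    write_feature_colors(colors, out)
    print(out.read_text())
```

The resulting file path can then be passed to --feature_colors in place of an inline JSON string.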
Example
To add custom colors for the "major_group" feature:
uvx --from protspace protspace-feature-colors data/3FTx/3FTx.json data/3FTx/3FTx_colored.json --feature_colors '{"major_group": {"3FTx": "#FF0000", "PLA2": "#00FF00", "SVMP": "#0000FF"}}'
This command reads data/3FTx/3FTx.json, applies the specified colors to the "major_group" feature, and saves the result to data/3FTx/3FTx_colored.json. The input file itself is left unchanged.
Notes
- If a feature or value doesn't exist in the protein data, the script will raise an error.
- You can update colors for multiple features in a single run.
- If you're updating an existing color scheme, only the specified colors will be changed or added; existing colors for other values will be preserved.
After adding custom colors, you can use the updated JSON file with the ProtSpace app to visualize your data with the new color scheme.
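The update behaviour described in the notes (specified colors are changed or added, all others are preserved) can be illustrated on the feature-colors mapping alone. This is a minimal sketch of those semantics, not the code of add_feature_colors.py: the function merge_feature_colors is hypothetical, it does not model the check against the protein data, and it makes no assumption about the internal layout of the ProtSpace JSON file.

```python
import copy


def merge_feature_colors(existing, updates):
    """Return a new mapping where updates override or extend existing colors.

    Both arguments use the documented {feature: {value: color}} structure.
    Values not mentioned in updates keep their existing color.
    """
    merged = copy.deepcopy(existing)
    for feature, mapping in updates.items():
        merged.setdefault(feature, {}).update(mapping)
    return merged


if __name__ == "__main__":
    existing = {"major_group": {"3FTx": "#FF0000", "PLA2": "#00FF00"}}
    updates = {"major_group": {"PLA2": "#123456", "SVMP": "#0000FF"}}
    # 3FTx keeps its color, PLA2 is changed, SVMP is added.
    print(merge_feature_colors(existing, updates))
```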
File Formats
Input Files
- HDF File (Embeddings)
  - Format: HDF5
  - Contents: Protein embeddings, where each key is a protein identifier, and the corresponding value is the embedding vector.
- CSV File (Features)
  - Format: CSV
  - Contents: Protein features, with one row per protein and columns for different features. Must include an 'identifier' column matching the protein IDs in the HDF file.
- PDB Files (Optional)
  - Format: PDB
  - Contents: 3D structure information for proteins. File names should match the protein identifiers (with underscores instead of dots).
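The identifier conventions above can be checked before running the preparation step. The sketch below reads the 'identifier' column from a features CSV and derives the PDB file name each protein would be expected to have. The sample identifiers are made up, the ".pdb" extension is an assumption, and both helper names are hypothetical.

```python
import csv
import io


def read_identifiers(csv_file):
    """Return the values of the required 'identifier' column, in file order."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames is None or "identifier" not in reader.fieldnames:
        raise ValueError("features CSV must contain an 'identifier' column")
    return [row["identifier"] for row in reader]


def expected_pdb_name(identifier, extension=".pdb"):
    """Replace dots with underscores, as described for PDB file names.

    The extension is an assumption; adjust it to match your files.
    """
    return identifier.replace(".", "_") + extension


if __name__ == "__main__":
    # Made-up example rows; a real file would be opened with open(path, newline="").
    sample = io.StringIO("identifier,major_group\nP12345.1,3FTx\nQ67890,PLA2\n")
    for protein_id in read_identifiers(sample):
        print(protein_id, "->", expected_pdb_name(protein_id))
```

Comparing the returned identifiers with the keys of the HDF file is a quick way to spot proteins that would otherwise be missing features or structures.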
Output File
- JSON File
  - Format: JSON
  - Contents: Processed data including protein features and dimensionality-reduced coordinates for each projection method.
We hope you find ProtSpace useful for your protein data visualization needs! For any additional questions or support, please contact us or open an issue on our GitHub repository.