Domain-adaptive protein point-cloud binding-site prediction.
ProtCross
ProtCross is a domain-adaptive protein point-cloud learning framework for binding-site prediction across experimentally solved PDB structures and predicted AlphaFold2 (AF2) structures. The model accepts structures from AlphaFold and can write per-residue binding probabilities to the B-factor column of a new PDB output file.
Published paper (JCIM): Zhong, S., & Jiang, Y. (2026). ProtCross: Bridging the PDB-AlphaFold Gap for Binding Site Prediction with Protein Point Clouds. Journal of Chemical Information and Modeling, 66(7), 3688-3701. https://doi.org/10.1021/acs.jcim.5c03224
The codebase combines:
- residue-level structural geometry (C-alpha coordinates),
- language-model residue embeddings (ESM-C), and
- confidence-aware domain adaptation (pLDDT-weighted DANN)
to improve robustness when transferring from PDB (source domain) to AF2 (target domain).
Quick Start
Use this path when you only want to predict binding sites for one structure.
pip install "protcross[predict]"
protcross setup-assets
protcross predict input.pdb --output input.protcross.pdb
The protcross setup-assets command installs runtime assets into ~/.cache/protcross/assets/v0.1.2 by default:
protcross-0.1.2-binding-moad-final.ckpt # recommended release checkpoint
pca_esmc_128_binding_moad_0.1.2.pkl # matching PCA reducer
esmc_600m_2024_12_v0.pth # ESM-C weights from Hugging Face
PyPI packages ship code only. The checkpoint and PCA file must be attached to the v0.1.2 GitHub release with the exact filenames above, or supplied with custom URLs:
protcross setup-assets \
--checkpoint-url https://example.org/protcross-0.1.2-binding-moad-final.ckpt \
--pca-url https://example.org/pca_esmc_128_binding_moad_0.1.2.pkl
Use the 0.1.2 release checkpoint for practical binding-site prediction and when reporting ProtCross as a benchmark method. If ProtCross is used as a benchmark, report the probability threshold used for the final predictions.
To reproduce the original ProtCross paper results, use the 0.1.1 checkpoint best-epoch=59.ckpt and its matching pca_esmc_128.pkl. The original paper experiments were based on the PDBbind v2020 refined set.
If your system already has ESM-C weights, skip that large download and pass the path at prediction time:
protcross setup-assets --skip-esm
protcross predict input.pdb \
--esm-weights /absolute/path/to/esmc_600m_2024_12_v0.pth \
--output input.protcross.pdb
Table of Contents
- Quick Start
- 1. Project Overview
- 2. Installation
- 3. Usage
- 4. Configuration Guide (Hydra)
- 5. Repository Layout
- 6. Troubleshooting
- 7. Changelog
- 8. License
1. Project Overview
Core capabilities
- Binding-site segmentation on protein point clouds with PointNet++.
- Domain adaptation (DANN-style) via gradient reversal and domain discriminator.
- AF2 confidence-aware weighting based on pLDDT.
- ESM-C embeddings + PCA reduction for residue features.
- Hydra-driven experiment control with easy command-line overrides.
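The two adaptation ideas in the list above (gradient reversal and pLDDT weighting) can be illustrated with a small pure-Python sketch. It is not the ProtCross implementation: the ramp schedule is the one commonly used with DANN-style training, and scaling each target residue's domain loss by pLDDT / 100 is only one plausible weighting, assumed here for illustration. The function names are hypothetical.

```python
import math


def grl_lambda(progress, gamma=10.0):
    """Gradient-reversal strength for training progress in [0, 1].

    Common DANN schedule: starts at 0 and saturates near 1, so the
    adversarial signal is introduced gradually.
    """
    return 2.0 / (1.0 + math.exp(-gamma * progress)) - 1.0


def plddt_weighted_domain_loss(domain_probs, plddt, eps=1e-7):
    """Weighted cross-entropy for target-domain (AF2) residues.

    domain_probs: discriminator probability that each residue is from
                  the target domain (true label is 1 for all of them).
    plddt:        per-residue AF2 confidence on the 0-100 scale.
    Assumption: weight = pLDDT / 100, so low-confidence residues
    contribute less to the domain loss.
    """
    weights = [value / 100.0 for value in plddt]
    losses = [-math.log(min(max(q, eps), 1.0 - eps)) for q in domain_probs]
    total_weight = sum(weights)
    if total_weight == 0.0:
        return 0.0
    return sum(w * l for w, l in zip(weights, losses)) / total_weight
```

With this weighting, a residue with pLDDT 0 has no influence on the domain loss, while a residue with pLDDT 100 counts fully.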
Primary stack
- PyTorch + PyTorch Lightning
- Torch Geometric
- Hydra
- ESM (EvolutionaryScale)
2. Installation
2.1 System Requirements
- Linux (recommended) or WSL2
- Python 3.10
- Conda (Miniconda or Anaconda)
- NVIDIA GPU + CUDA 12.1 (recommended for training and ESM-C preprocessing)
CPU-only runs are possible for debugging/small tests but will be significantly slower.
2.2 Create Environment
For development or full paper reproduction:
conda env create -f environment.yml
conda activate evopoint_da
pip install -e ".[test,esm]"
For the lightweight prediction interface from PyPI:
pip install "protcross[predict]"
protcross setup-assets
The provided environment includes:
- pytorch==2.3.0
- pytorch-cuda==12.1
- Torch Geometric and companion packages
- esm>=3.1.0 for ESM-C APIs
CPU-only notes
If you do not have a CUDA-capable GPU:
- Remove/replace pytorch-cuda=12.1 in environment.yml.
- Install CPU-compatible PyTorch/Torch Geometric wheels.
- Run preprocessing/training with --device cpu or CPU trainer settings.
2.3 Runtime Assets
ProtCross separates code and large runtime assets. PyPI distributions include the Python package and command-line tools, while the pretrained checkpoint, PCA reducer, and ESM-C weights are downloaded after installation.
Recommended setup:
protcross setup-assets
By default this downloads:
- ESM-C 600M weights from https://huggingface.co/EvolutionaryScale/esmc-600m-2024-12
- protcross-0.1.2-binding-moad-final.ckpt from the ProtCross v0.1.2 GitHub release
- pca_esmc_128_binding_moad_0.1.2.pkl from the ProtCross v0.1.2 GitHub release
The default install location is ~/.cache/protcross/assets/v0.1.2. You can override it with PROTCROSS_ASSETS_DIR or --output-dir:
PROTCROSS_ASSETS_DIR=/data/protcross-assets protcross setup-assets
protcross setup-assets --output-dir /data/protcross-assets
After setup, prediction can discover assets automatically:
protcross predict input.pdb --output input.protcross.pdb
For source checkouts or custom releases, explicit paths are still supported:
protcross predict input.pdb \
--checkpoint checkpoints/protcross-0.1.2-binding-moad-final.ckpt \
--esm-weights /absolute/path/to/esmc_600m_2024_12_v0.pth \
--pca data/pca_esmc_128_binding_moad_0.1.2.pkl \
--output input.protcross.pdb
What is ESM-C?
ESM-C is EvolutionaryScale's protein language model family for extracting residue-level sequence representations. In ProtCross, ESM-C embeddings are used as per-residue features.
Recommended checkpoint for this project:
- ESM-C 600M (2024-12): https://huggingface.co/EvolutionaryScale/esmc-600m-2024-12
The ESM-C model repository uses a custom non-commercial license. Review the model terms before downloading or redistributing derived assets.
Manual ESM-C download fallback
If the automatic downloader is unavailable in your environment, you can download the model weights from Hugging Face in either of the following ways.
Option A - Git LFS clone
# 1) Install Git LFS once (if needed)
git lfs install
# 2) Clone the model repository
git clone https://huggingface.co/EvolutionaryScale/esmc-600m-2024-12
Option B - Hugging Face CLI
# 1) Install CLI
pip install -U "huggingface_hub[cli]"
# 2) Download repository files to a local directory
huggingface-cli download EvolutionaryScale/esmc-600m-2024-12 \
--local-dir ./esmc-600m-2024-12
After downloading, locate data/weights/esmc_600m_2024_12_v0.pth and pass its absolute path to --model_name (preprocessing) or --esm-weights (prediction; --esm_weights in the legacy script).
Example:
python scripts/preprocess_esm.py \
--data_dir data/raw_pdb \
--output_dir data/processed_pdb \
--fit_pca \
--model_name /absolute/path/to/esmc_600m_2024_12_v0.pth \
--pca_model_path pca_esmc_128.pkl
Important details:
- --model_name is treated as a local file path in current code.
- The script truncates sequences to length 1022 for ESM-C context compatibility.
2.4 Verify Installation
python -c "import torch; import torch_geometric; import pytorch_lightning; import hydra; import esm; print('OK')"
pytest -q
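If the one-line import check fails, it reports only the first missing package. A minimal sketch that lists every missing module at once, using only the standard library, could look like this; the module list mirrors the command above, and the helper name is hypothetical.

```python
import importlib.util

# Modules named in the verification command above.
REQUIRED_MODULES = ["torch", "torch_geometric", "pytorch_lightning", "hydra", "esm"]


def find_missing(modules):
    """Return the module names that cannot be located in this environment."""
    missing = []
    for name in modules:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # A broken or partially installed package counts as missing.
            spec = None
        if spec is None:
            missing.append(name)
    return missing
```

Calling find_missing(REQUIRED_MODULES) returns an empty list when every dependency is importable.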
3. Usage
This section is split into two workflows:
- Apply ProtCross: use the released checkpoint for inference.
- Reproduce ProtCross: rebuild datasets/features and retrain/evaluate the model.
For most users, the 0.1.2 release checkpoint is the recommended model. It was trained as a release model for external generalization evaluation using Binding MOAD-derived source structures and matched AF2 target structures. If you use ProtCross as a benchmark, use the release checkpoint and report the threshold used for evaluation.
For reproducing the original ProtCross paper, use the 0.1.1 checkpoint best-epoch=59.ckpt and matching pca_esmc_128.pkl; those assets correspond to the original PDBbind v2020 refined-set experiments.
3.1 Apply ProtCross (Inference with Existing Model)
3.1.1 Single-structure Prediction
You can directly run inference on one PDB structure and write per-residue probabilities to the B-factor column of a new PDB file.
The recommended 0.1.2 path for PyPI users is:
protcross setup-assets
protcross predict examples/6fhu.pdb --output examples/6fhu.pred.pdb
Add a score table or a custom threshold when needed:
protcross predict examples/6fhu.pdb \
--output examples/6fhu.pred.pdb \
--scores-tsv examples/6fhu.scores.tsv \
--threshold 0.5
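Because the probabilities are stored in the B-factor column, the output file can also be read back with a few lines of Python. The sketch below relies only on the fixed-column PDB ATOM record layout (B-factor in columns 61-66) and takes one value per residue from its C-alpha atom. It assumes the written values are on the same 0-1 scale as --threshold; check a real output file before relying on that. The function names are hypothetical and not part of the ProtCross API.

```python
def read_residue_scores(pdb_lines):
    """Map (chain, residue number, residue name) to the C-alpha B-factor value.

    Insertion codes and alternate locations are ignored in this sketch.
    """
    scores = {}
    for line in pdb_lines:
        if not line.startswith("ATOM") or len(line) < 66:
            continue
        if line[12:16].strip() != "CA":
            continue
        key = (line[21], int(line[22:26]), line[17:20].strip())
        scores[key] = float(line[60:66])
    return scores


def residues_above(scores, threshold=0.5):
    """Return residue keys whose score is at or above the threshold, sorted."""
    return sorted(key for key, value in scores.items() if value >= threshold)
```

For example, passing open("examples/6fhu.pred.pdb") to read_residue_scores and then the result to residues_above gives a list of candidate binding-site residues at the chosen cutoff.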
You can also keep model assets in an explicit directory:
protcross predict examples/6fhu.pdb \
--assets-dir /path/to/protcross-assets \
--output examples/6fhu.pred.pdb
The asset directory should contain protcross-0.1.2-binding-moad-final.ckpt, esmc_600m_2024_12_v0.pth, and pca_esmc_128_binding_moad_0.1.2.pkl. Older v0.1.1 asset directories with best-epoch=59.ckpt and pca_esmc_128.pkl are still accepted for paper reproduction. Alternatively, set PROTCROSS_CHECKPOINT, PROTCROSS_ESM_WEIGHTS, and PROTCROSS_PCA.
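A quick way to confirm that a custom asset directory is complete before running prediction is to check for the three release filenames listed above. This is a minimal standard-library sketch; the helper name is hypothetical, and ProtCross may perform its own discovery differently.

```python
from pathlib import Path

# Filenames of the 0.1.2 release assets, as listed above.
RELEASE_ASSETS = (
    "protcross-0.1.2-binding-moad-final.ckpt",
    "esmc_600m_2024_12_v0.pth",
    "pca_esmc_128_binding_moad_0.1.2.pkl",
)


def missing_assets(assets_dir, expected=RELEASE_ASSETS):
    """Return the expected asset filenames that are not present in assets_dir."""
    root = Path(assets_dir).expanduser()
    return [name for name in expected if not (root / name).is_file()]
```

An empty list from missing_assets("/path/to/protcross-assets") means all three files are in place.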
The standalone entry point is equivalent:
protcross-predict examples/6fhu.pdb --output examples/6fhu.pred.pdb
The legacy 0.1.0 command is still supported:
python run_Predict_ProtCross.py \
--pdb_file examples/6fhu.pdb \
--ckpt_path checkpoints/protcross-0.1.2-binding-moad-final.ckpt \
--esm_weights /absolute/path/to/esmc_600m_2024_12_v0.pth \
--pca_path data/pca_esmc_128_binding_moad_0.1.2.pkl \
--output_pdb examples/6fhu.pred.pdb
Python API:
from evopoint_da.inference import ProtCrossPredictor, predict_pdb

# One-off prediction with automatically discovered assets
result = predict_pdb("examples/6fhu.pdb", output_pdb="examples/6fhu.pred.pdb")
print(result.format_summary())

# Predictor object built from the default asset directory
predictor = ProtCrossPredictor.from_default_assets()
result = predictor.predict("examples/6fhu.pdb", output_pdb="examples/6fhu.pred.pdb")

# One-off prediction with explicit asset paths
result = predict_pdb(
    "examples/6fhu.pdb",
    ckpt_path="checkpoints/protcross-0.1.2-binding-moad-final.ckpt",
    esm_weights="/absolute/path/to/esmc_600m_2024_12_v0.pth",
    pca_path="data/pca_esmc_128_binding_moad_0.1.2.pkl",
    output_pdb="examples/6fhu.pred.pdb",
)
print(result.format_summary())
3.1.2 Batch Prediction (Multiple Structures)
protcross-predict processes one structure per run. For batch inference, iterate over PDB files in a shell loop:
protcross setup-assets
mkdir -p batch_outputs
for pdb in /path/to/pdb_dir/*.pdb; do
  base="$(basename "${pdb}" .pdb)"
  protcross-predict "${pdb}" \
    --output "batch_outputs/${base}_pred.pdb" \
    --threshold 0.5
done
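The same loop can be driven from Python when you want to collect failures instead of stopping at the first error. The sketch below only builds and runs the protcross-predict command shown above with its --output and --threshold flags; the function names are hypothetical, and the assumption that a non-zero exit code signals a failed prediction is not confirmed by the source.

```python
import subprocess
from pathlib import Path


def build_commands(pdb_dir, out_dir, threshold=0.5):
    """Build one protcross-predict argument list per .pdb file in pdb_dir."""
    commands = []
    for pdb in sorted(Path(pdb_dir).glob("*.pdb")):
        output = Path(out_dir) / f"{pdb.stem}_pred.pdb"
        commands.append([
            "protcross-predict", str(pdb),
            "--output", str(output),
            "--threshold", str(threshold),
        ])
    return commands


def run_batch(pdb_dir, out_dir, threshold=0.5):
    """Run every command and return the input paths whose run failed."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    failed = []
    for command in build_commands(pdb_dir, out_dir, threshold):
        # Assumption: a non-zero exit code means the prediction failed.
        if subprocess.run(command).returncode != 0:
            failed.append(command[1])
    return failed
```

Each run still loads the model from scratch, so this is no faster than the shell loop; it only makes error handling easier.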
3.2 Reproduce ProtCross (Training Pipeline)
3.2.1 Data Preparation
Expected layout:
data/
|--- raw_pdb/ # input PDB/CIF structures (source)
|--- raw_af2/ # input AF2 PDB structures (target)
|--- processed_pdb/ # generated .pt files for source
`--- processed_af2/ # generated .pt files for target
Optional AF2 retrieval helper:
protcross-download-af2 \
--raw-pdb-dir data/raw_pdb \
--output-dir data/raw_af2 \
--mapping-file pdb_uniprot_mapping.json
3.2.2 Preprocess Source (PDB) with PCA Fit
protcross-preprocess \
--data_dir data/raw_pdb \
--output_dir data/processed_pdb \
--fit_pca \
--model_name ~/.cache/protcross/assets/v0.1.2/esmc_600m_2024_12_v0.pth \
--pca_model_path pca_esmc_128.pkl \
--pca_dim 128
3.2.3 Preprocess Target (AF2) with Shared PCA
protcross-preprocess \
--data_dir data/raw_af2 \
--output_dir data/processed_af2 \
--model_name ~/.cache/protcross/assets/v0.1.2/esmc_600m_2024_12_v0.pth \
--pca_model_path pca_esmc_128.pkl \
--is_af2
3.2.4 Map Labels from PDB to AF2
protcross-map-labels \
--processed-pdb-dir data/processed_pdb \
--processed-af2-dir data/processed_af2 \
--raw-pdb-dir data/raw_pdb \
--raw-af2-dir data/raw_af2 \
--mapping-file pdb_uniprot_mapping.json
3.2.5 Train
Default training:
python train.py
Equivalent installed CLI:
protcross-train
Common Hydra overrides:
# Disable domain adaptation
python train.py model.use_da=False
# Disable ESM features
python train.py model.use_esm=False
# Short debugging run
python train.py trainer.max_epochs=5
# Custom data directories
python train.py \
data.data_dir_pdb=/abs/path/to/processed_pdb \
data.data_dir_af2=/abs/path/to/processed_af2
3.2.6 Evaluate / Test
python test_adaptive.py ckpt_path=checkpoints/protcross-0.1.2-binding-moad-final.ckpt
Additional analysis scripts are available (e.g., scripts/eval_run.py) for task-specific reporting.
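When reporting benchmark numbers at a stated probability threshold, residue-level precision, recall, and F1 can be computed directly from predicted probabilities and binary labels. The sketch below is a generic illustration in plain Python, not the project's evaluation code, and the metrics used in the paper may differ.

```python
def threshold_metrics(probs, labels, threshold=0.5):
    """Residue-level precision, recall, and F1 at a fixed probability cutoff."""
    tp = fp = fn = 0
    for prob, label in zip(probs, labels):
        predicted = prob >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and label:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    # The threshold is returned with the metrics so it is always reported together.
    return {"threshold": threshold, "precision": precision, "recall": recall, "f1": f1}
```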
3.2.7 Run Multi-seed Benchmark
python run_multiseed_benchmark.py
4. Configuration Guide (Hydra)
Main configuration files:
- configs/train.yaml: global defaults and run-level settings.
- configs/data/protein_seg.yaml: data module paths and loading parameters.
- configs/model/da_module.yaml: architecture and adaptation hyperparameters.
- configs/trainer/default.yaml: PyTorch Lightning trainer options.
Hydra override syntax:
python train.py key1=value1 key2=value2
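Each dotted key in an override addresses a nested entry in the composed configuration. The following toy function shows that mapping on a plain dictionary. It is a deliberate simplification, not Hydra's actual override parser, and the default values in the usage example are hypothetical.

```python
def parse_scalar(raw):
    """Convert an override value to bool, int, or float when possible."""
    lowered = raw.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


def apply_overrides(config, overrides):
    """Apply key=value overrides with dotted keys to a nested dict in place."""
    for item in overrides:
        key, _, raw = item.partition("=")
        *parents, leaf = key.split(".")
        node = config
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = parse_scalar(raw)
    return config


# Hypothetical defaults, for illustration only.
defaults = {"model": {"use_da": True, "use_esm": True}, "trainer": {"max_epochs": 100}}
updated = apply_overrides(defaults, ["model.use_da=False", "trainer.max_epochs=5"])
```

Here updated["model"]["use_da"] becomes False and updated["trainer"]["max_epochs"] becomes 5, while model.use_esm keeps its default.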
Tip: keep all experiment commands in shell scripts to ensure reproducibility.
5. Repository Layout
ProtCross/
|--- configs/
| |--- data/protein_seg.yaml
| |--- model/da_module.yaml
| |--- trainer/default.yaml
| `--- train.yaml
|--- data/
| |--- raw_pdb/
| |--- raw_af2/
| |--- processed_pdb/
| `--- processed_af2/
|--- scripts/
| |--- preprocess_esm.py # compatibility wrapper
| |--- get_af2.py # compatibility wrapper
| |--- map_labels.py # compatibility wrapper
| `--- ... # one-off analysis and plotting utilities
|--- src/evopoint_da/
| |--- cli/ # installed command entry points
| |--- data/
| |--- experiments/ # reproduction benchmark workflows
| |--- evaluation/
| |--- inference/ # lightweight predictor API
| `--- models/
|--- train.py
|--- test_adaptive.py
|--- run_multiseed_benchmark.py
|--- run_Predict_ProtCross.py
`--- environment.yml
6. Troubleshooting
- FileNotFoundError for ESM-C weights
  - Run protcross setup-assets, or ensure --esm-weights / --model_name points to an existing local .pth checkpoint file.
- protcross setup-assets cannot find GitHub release assets
  - Attach protcross-0.1.2-binding-moad-final.ckpt and pca_esmc_128_binding_moad_0.1.2.pkl to the v0.1.2 GitHub release, or pass --checkpoint-url and --pca-url.
- Torch Geometric install issues
  - Verify that your torch version and wheel index URL match the environment (torch 2.3.0 + cu121).
- OOM during preprocessing/training
  - Reduce batch size, use shorter runs, or switch to a smaller subset first.
7. Changelog
0.1.2
Release checkpoint update for external prediction and benchmark use.
- Adds the 0.1.2 Binding MOAD-trained release checkpoint and matching PCA reducer as the default runtime assets. This release checkpoint was trained on Binding MOAD-derived labels from 41,409 PDB source structures spanning 20,387 unique ligand IDs, with 8,953 matched AF2 target structures.
- Filters common crystallization additives, salts, ions, and caps from default ligand-adjacent residue labeling.
0.1.1
Engineering-focused reproducibility release.
- Adds installable package metadata and console commands: protcross-predict, protcross-preprocess, and protcross-train.
- Adds a unified protcross CLI plus protcross-setup-assets, protcross-download-af2, and protcross-map-labels.
- Adds one-command runtime asset setup for the ESM-C weights, released checkpoint, and PCA reducer.
- Splits prediction into a lightweight API (evopoint_da.inference) that accepts one PDB/mmCIF and writes binding probabilities to the B-factor column.
- Keeps 0.1.0 entry points (run_Predict_ProtCross.py, train.py, scripts/preprocess_esm.py, scripts/get_af2.py, scripts/map_labels.py) as compatibility wrappers.
- Splits shared structure parsing, ESM-C feature extraction, PCA reduction, AF2 downloads, label mapping, metrics, experiments, and domain weighting into reusable modules.
- Fixes the standard training DataModule so target-domain AF2 batches can be supplied to domain adaptation without requiring positive labels.
- Adds pytest smoke/unit tests, including checkpoint CPU forward on a published processed sample.
0.1.0
Initial public release of ProtCross.
- Provides the core ProtCross framework for domain-adaptive binding-site prediction across PDB and AlphaFold2 protein structures.
- Includes PointNet++-based protein point-cloud segmentation, ESM-C residue embeddings, PCA feature reduction, pLDDT-aware target weighting, and DANN-style domain adaptation.
- Documents the full training and evaluation workflow, including preprocessing, label mapping, Hydra configuration, testing, and multi-seed benchmarking.
8. License
This project is licensed under the MIT License. See LICENSE for details.
File details
Details for the file protcross-0.1.2.tar.gz.
File metadata
- Download URL: protcross-0.1.2.tar.gz
- Upload date:
- Size: 51.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.19
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b9ea87b297fd21e54ebb236587b6998c15b58e55848ddf7a020485f42cc1eed8 |
| MD5 | f69b4e7aa6aa0cb3aac8f48bdb34a2f0 |
| BLAKE2b-256 | 86df581c53fd4c2067de07246993ea937badef5e8255055580818df6677d855d |
File details
Details for the file protcross-0.1.2-py3-none-any.whl.
File metadata
- Download URL: protcross-0.1.2-py3-none-any.whl
- Upload date:
- Size: 54.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.19
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5a53005835955dd9c63857c413b26f5bf14783ed17153c18fea623ab8931998a |
| MD5 | cf7f42b3c370bec445e854c263d9da9f |
| BLAKE2b-256 | 597fec8dde785a77a8cb8adbf04400942b6424e0f4c3d3e250e5b3b2b6d7c4de |
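A downloaded distribution file can be checked against the SHA256 digests listed for each file with Python's standard hashlib module. This is a minimal sketch; the helper names are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, matches_digest("protcross-0.1.2.tar.gz", "<published SHA256>") should return True for an intact download of the source distribution.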