Protein Structure Tokenization via Geometric Byte Pair Encoding (GeoBPE)
Protein Geometric Byte Pair Encoding
This repo contains our implementation of Protein Structure Tokenization via Geometric Byte Pair Encoding (ICLR 2026).
Overview
Protein structure is central to biological function, and enabling multimodal protein models requires joint reasoning over sequence, structure, and function. We introduce GeoBPE, a geometry-grounded protein structure tokenizer (PST) that transforms continuous, noisy, multi-scale backbone conformations into discrete "sentences" of geometry while enforcing global constraints.
Analogous to byte-pair encoding, GeoBPE generates a hierarchical vocabulary of geometric primitives by iteratively (i) clustering Geo-Pair occurrences with k-medoids to yield a resolution-controllable vocabulary; (ii) quantizing each Geo-Pair to its closest medoid prototype; and (iii) reducing drift through differentiable inverse kinematics that optimizes boundary glue angles under an SE(3) end-frame loss.
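Steps (i) and (ii) can be illustrated with a toy sketch. This is not the GeoBPE implementation: the function names (`farthest_point_init`, `k_medoids`, `quantize`) are hypothetical, the 2D tuples stand in for real Geo-Pair descriptors, and plain Euclidean distance is an assumed metric. The BPE-style merge loop and the inverse-kinematics step (iii) are not modeled here.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length descriptor vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def farthest_point_init(points, k):
    """Deterministic seeding: start at index 0, then repeatedly add the farthest point."""
    medoids = [0]
    while len(medoids) < k:
        nxt = max(range(len(points)),
                  key=lambda i: min(dist(points[i], points[m]) for m in medoids))
        medoids.append(nxt)
    return medoids

def k_medoids(points, k, iters=20):
    """Toy alternating k-medoids; returns indices of the medoid points."""
    medoids = farthest_point_init(points, k)
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            nearest = min(medoids, key=lambda m: dist(p, points[m]))
            clusters[nearest].append(i)
        # Update step: the new medoid is the member with the smallest
        # total distance to the other members of its cluster.
        new_medoids = []
        for m, members in clusters.items():
            if not members:  # keep the old medoid if its cluster emptied out
                new_medoids.append(m)
                continue
            best = min(members,
                       key=lambda c: sum(dist(points[c], points[j]) for j in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids

def quantize(points, medoids):
    """Map each point to a token id: the position of its closest medoid in `medoids`."""
    return [min(range(len(medoids)), key=lambda t: dist(p, points[medoids[t]]))
            for p in points]

# Hypothetical 2D descriptors forming two well-separated groups.
points = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05),
          (3.0, 3.1), (3.1, 3.0), (3.05, 3.05)]

medoids = k_medoids(points, k=2)                      # vocabulary of 2 prototypes
tokens = quantize(points, medoids)                    # discrete token per descriptor
reconstructed = [points[medoids[t]] for t in tokens]  # snap each point to its prototype

print(medoids)  # [2, 5]
print(tokens)   # [0, 0, 0, 1, 1, 1]
```

In this sketch, raising `k` yields a finer vocabulary with smaller quantization error, which is the sense in which the vocabulary resolution is controllable.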
Run GeoBPE
GeoBPE supports two sub-commands: encode and induce. Run geobpe --help for a description.
Run geobpe encode --help and geobpe induce --help to see detailed arguments.
We include the following resources to make it easy to use GeoBPE:
- GeoBPE API and Usage Guidelines Doc -- descriptions, intuitions, and guidelines on how to effectively and efficiently use GeoBPE
- Experiment Logs -- a collection of past experiments with varied hyperparameter settings; quickly look up settings and performance to save iteration time.
Citation
If you use GeoBPE in your research, please cite our paper:
@inproceedings{sun2025protein,
  title={Protein Structure Tokenization via Geometric Byte Pair Encoding},
  author={Sun, Michael and Yuan, Weize and Liu, Gang and Matusik, Wojciech and Zitnik, Marinka},
  booktitle={International Conference on Learning Representations},
  year={2026},
  url={https://arxiv.org/abs/2511.11758}
}
Contact
Please contact msun415@mit.edu if you have any questions.
File details
Details for the file geobpe-0.1.0.tar.gz.
File metadata
- Download URL: geobpe-0.1.0.tar.gz
- Upload date:
- Size: 83.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.19
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d17ed4f30e3b495358fd3fdfdb6c83fcca5276657aeb005d312e5bfdbb5f37c1 |
| MD5 | 6c50dda067dd769db3573fec391d2984 |
| BLAKE2b-256 | a64746a3b70195af725b73d5258116a84115682c8e22f72a0b9ad2956327da34 |
File details
Details for the file geobpe-0.1.0-py3-none-any.whl.
File metadata
- Download URL: geobpe-0.1.0-py3-none-any.whl
- Upload date:
- Size: 89.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.19
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e9bb199d15dbd593b1fd03bab739f1cff5a12b99041d437fcd3707978f60bc8f |
| MD5 | 54b85cb874afeae451abc7b17b838776 |
| BLAKE2b-256 | 3c7cbfe86fb6b0cf3ace21fef2dc1f98a00b98860bca8bca836e560e0d86ed28 |