
Protein Structure Tokenization via Geometric Byte Pair Encoding (GeoBPE)


Protein Geometric Byte Pair Encoding


GeoBPE

This repo contains our implementation of Protein Structure Tokenization via Geometric Byte Pair Encoding (ICLR 2026).

Overview

Protein structure is central to biological function, and enabling multimodal protein models requires joint reasoning over sequence, structure, and function. We introduce GeoBPE, a geometry-grounded protein structure tokenizer (PST) that transforms continuous, noisy, multi-scale backbone conformations into discrete "sentences" of geometry while enforcing global constraints.

Analogous to byte-pair encoding, GeoBPE generates a hierarchical vocabulary of geometric primitives by iteratively (i) clustering Geo-Pair occurrences with k-medoids to yield a resolution-controllable vocabulary; (ii) quantizing each Geo-Pair to its closest medoid prototype; and (iii) reducing drift through differentiable inverse kinematics that optimizes boundary glue angles under an SE(3) end-frame loss.
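Steps (i) and (ii) can be illustrated with a minimal sketch. This is not the GeoBPE implementation: the 2D toy points stand in for Geo-Pair descriptors, Euclidean distance stands in for whatever geometric distance the method uses, and the names `k_medoids` and `quantize` are hypothetical helpers rather than part of the `geobpe` package. Step (iii), the inverse-kinematics glue optimization, is not covered here.

```python
import numpy as np


def k_medoids(dist, k, n_iter=50):
    """Alternating k-medoids on a precomputed (n, n) distance matrix.

    Returns the indices of the k medoids. Initialization is greedy
    farthest-point, which is an assumption made for determinism.
    """
    medoids = [0]
    while len(medoids) < k:
        nearest = dist[:, medoids].min(axis=1)
        medoids.append(int(np.argmax(nearest)))
    medoids = np.array(medoids)

    for _ in range(n_iter):
        # Assign every point to its closest medoid.
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue  # keep the old medoid for an empty cluster
            # New medoid: the member with the smallest total distance
            # to the other members of its cluster.
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids


def quantize(features, prototypes):
    """Map each feature row to the index of its closest prototype."""
    d = np.linalg.norm(features[:, None, :] - prototypes[None, :, :], axis=-1)
    return np.argmin(d, axis=1)


# Toy stand-in for Geo-Pair descriptors: three well-separated 2D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
features = np.concatenate(
    [c + 0.1 * rng.standard_normal((20, 2)) for c in centers]
)
dist = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)

medoid_idx = k_medoids(dist, k=3)                 # step (i): vocabulary of prototypes
tokens = quantize(features, features[medoid_idx])  # step (ii): discrete token per item
```

In this sketch `k` plays the role of the resolution control: a larger `k` yields more prototypes and a smaller quantization error per item.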


Run GeoBPE

GeoBPE supports two sub-commands: encode and induce. Run geobpe --help for a description.

Run geobpe encode --help and geobpe induce --help to see detailed arguments.
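For scripted environments, the three help commands above can be invoked from Python. This is a small convenience sketch: `show_help` and `HELP_COMMANDS` are hypothetical names, and only the `geobpe --help`, `geobpe encode --help`, and `geobpe induce --help` invocations named above are assumed to exist.

```python
import shutil
import subprocess

# The three help invocations named in this README.
HELP_COMMANDS = [
    ["geobpe", "--help"],
    ["geobpe", "encode", "--help"],
    ["geobpe", "induce", "--help"],
]


def show_help(commands=HELP_COMMANDS):
    """Run each help command if geobpe is on PATH; return the commands run."""
    if shutil.which("geobpe") is None:
        print("geobpe not found on PATH; install the package first")
        return []
    ran = []
    for cmd in commands:
        # check=False: print whatever the CLI reports without raising.
        subprocess.run(cmd, check=False)
        ran.append(cmd)
    return ran


if __name__ == "__main__":
    show_help()
```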

We include the following resources to make it easy to use GeoBPE:

  • GeoBPE API and Usage Guidelines Doc -- descriptions, intuitions, and guidelines on how to effectively and efficiently use GeoBPE
  • Experiment Logs -- a collection of past experiments with varying hyperparameter settings; quickly look up settings and performance to save iteration time.

Citation

If you use GeoBPE in your research, please cite our paper:

@inproceedings{sun2025protein,
  title={Protein Structure Tokenization via Geometric Byte Pair Encoding},
  author={Sun, Michael and Yuan, Weize and Liu, Gang and Matusik, Wojciech and Zitnik, Marinka},
  booktitle={International Conference on Learning Representations},
  year={2026},
  url={https://arxiv.org/abs/2511.11758}
}

Contact

Please contact msun415@mit.edu if you have any questions.
