TEDBench: Large-Scale Protein Fold Classification Benchmark and MiAE Pretraining

TEDBench

TEDBench is a large-scale, non-redundant benchmark for protein fold classification, together with MiAE (Masked Invariant Autoencoders), a self-supervised pretraining framework for protein structure representations.

Paper: Protein Fold Classification at Scale: Benchmarking and Pretraining
Dexiong Chen, Andrei Manolache, Mathias Niepert, Karsten Borgwardt (ICML 2026 spotlight)


Overview

TEDBench is built from the Encyclopedia of Domains (TED) annotations projected onto the Foldseek-clustered AlphaFold Database.

| Split | Structures |
| --- | --- |
| Train | 369,740 |
| Val | 46,217 |
| Test | 46,218 |
| External test (CATH 4.4 experimental) | 27,638 |

All structures are classified into 965 CATH topology (T-level) classes.
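As a quick sanity check on the numbers above, the following sketch computes the share of each internal split and the accuracy of a uniform random guess over 965 classes. The uniform-guess figure ignores class imbalance, so it is only a rough reference point and not a baseline reported by the authors.

```python
# Split sizes copied from the table above (external CATH 4.4 set excluded).
splits = {"train": 369_740, "val": 46_217, "test": 46_218}

total = sum(splits.values())
fractions = {name: n / total for name, n in splits.items()}

# Accuracy of guessing uniformly among the 965 topology classes.
# Assumption: this ignores class imbalance entirely.
num_classes = 965
uniform_chance = 1.0 / num_classes

for name, frac in fractions.items():
    print(f"{name:5s} {splits[name]:>8,d}  {frac:.1%}")
print(f"uniform-guess accuracy: {uniform_chance:.3%}")
```

The three internal splits come out at roughly 80/10/10.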

MiAE is an SE(3)-invariant masked autoencoder: it masks up to 90% of backbone frames, encodes only the visible residues with a geometric encoder, and reconstructs the full backbone structure with a lightweight decoder.
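The masking step can be illustrated with a minimal sketch. It assumes residues are masked uniformly at random at a fixed ratio, which may differ from the sampling scheme actually used in MiAE; the function name `split_visible_masked` and its parameters are hypothetical and not part of the tedbench package.

```python
import random


def split_visible_masked(num_residues, mask_ratio=0.9, seed=0):
    """Randomly partition residue indices into visible and masked lists.

    Assumption: uniform random masking at a fixed ratio. Only the visible
    indices would be passed to the encoder; the decoder would be asked to
    reconstruct the backbone at all positions.
    """
    if num_residues < 1:
        raise ValueError("num_residues must be at least 1")
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    # Keep at least one residue visible so the encoder has some input.
    num_visible = max(1, round(num_residues * (1.0 - mask_ratio)))
    visible = sorted(rng.sample(range(num_residues), num_visible))
    visible_set = set(visible)
    masked = [i for i in range(num_residues) if i not in visible_set]
    return visible, masked


visible, masked = split_visible_masked(num_residues=200, mask_ratio=0.9)
print(len(visible), len(masked))
```

With a 90% ratio the encoder sees only about a tenth of the residues, which is where the compute savings of processing visible residues only would come from.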


Installation

From PyPI (recommended):

pip install TEDBench

From source (for training, baselines, or development):

# 1. Create and activate environment
micromamba create -n tedbench python=3.10 -y
micromamba activate tedbench

# 2. Install dependencies
uv pip install -r requirements.txt

# 3. Install the tedbench package (editable)
uv pip install -e .

Datasets

Datasets are available from two sources:

| Dataset | HuggingFace | Direct download |
| --- | --- | --- |
| TEDBench (AFDB + CATH labels) | TEDBench/ted | MPCDF datashare |
| AFDB pretraining corpus | TEDBench/afdb | MPCDF datashare |
| CATH 4.4 experimental test set | TEDBench/cath | MPCDF datashare |

The HuggingFace repos require no local setup; the MPCDF archives are auto-downloaded and cached the first time a local dataset class is instantiated (default roots: ./datasets/ted/ and ./datasets/cath/).

Each sample contains: coords [L, 3, 3] (backbone N/Cα/C, float32), plddt [L], residue_index [L], seq_ids [L], sequence, and label (integer CATH topology index).
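To make the sample layout concrete, here is a small sketch that checks the per-residue fields for consistent length and reads a Cα-Cα distance out of `coords`. The helpers `check_sample` and `ca_distance` are hypothetical, the toy values are made up for illustration, and the check assumes that `sequence` has one character per residue.

```python
import math


def ca_distance(coords, i, j):
    """Euclidean distance between the Cα atoms of residues i and j.

    `coords` is a nested [L][3][3] sequence; atom index 1 is Cα,
    following the N/Cα/C order stated above.
    """
    return math.dist(coords[i][1], coords[j][1])


def check_sample(sample):
    """Return L, or raise ValueError if per-residue fields disagree on it."""
    length = len(sample["coords"])
    # Assumption: `sequence` holds one character per residue.
    for key in ("plddt", "residue_index", "seq_ids", "sequence"):
        if len(sample[key]) != length:
            raise ValueError(f"{key} has length {len(sample[key])}, expected {length}")
    for residue in sample["coords"]:
        if len(residue) != 3 or any(len(atom) != 3 for atom in residue):
            raise ValueError("each residue needs 3 atoms with 3 coordinates")
    return length


# Toy two-residue sample with made-up values, not real TEDBench data.
toy = {
    "coords": [
        [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [2.0, 1.4, 0.0]],
        [[3.3, 1.5, 0.0], [4.5, 0.0, 0.0], [5.0, 1.4, 0.0]],
    ],
    "plddt": [90.1, 88.4],
    "residue_index": [1, 2],
    "seq_ids": [0, 5],
    "sequence": "AG",
    "label": 0,
}
num_residues = check_sample(toy)
distance = ca_distance(toy["coords"], 0, 1)
print(num_residues, distance)
```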

Load directly with datasets

from datasets import load_dataset
import torch

# TEDBench — train / val / test with CATH labels
ted = load_dataset("TEDBench/ted")
sample = ted["train"][0]
coords    = torch.tensor(sample["coords"])   # [L, 3, 3]
label     = sample["label"]                  # int index
cath_code = ted["train"].features["label"].int2str(label)  # e.g. "3.40.50.300"

# CATH 4.4 external test set
cath = load_dataset("TEDBench/cath", split="test")

# AFDB pretraining corpus
afdb = load_dataset("TEDBench/afdb", split="train")
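The dotted string returned by `int2str` follows the CATH hierarchy (class, architecture, topology, homologous superfamily). The sketch below splits such a code into named levels and extracts the three-level topology prefix. Both helpers are hypothetical and not part of tedbench; they accept codes with fewer than four components, since how many levels the dataset stores in its label names is not confirmed here.

```python
def parse_cath_code(code):
    """Split a dotted CATH code into its named hierarchy levels."""
    level_names = ("class", "architecture", "topology", "homologous_superfamily")
    parts = code.split(".")
    if not 1 <= len(parts) <= len(level_names) or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a CATH code: {code!r}")
    return dict(zip(level_names, (int(p) for p in parts)))


def topology_prefix(code):
    """Return the first three levels (C.A.T) of a CATH code."""
    parts = code.split(".")
    if len(parts) < 3:
        raise ValueError(f"code has no topology level: {code!r}")
    return ".".join(parts[:3])


levels = parse_cath_code("3.40.50.300")
print(levels)
print(topology_prefix("3.40.50.300"))
```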

Use with LightningStructureDataset

From HuggingFace (dataset_name="hf_ted" / "hf_cath4.4" / "hf_afdb"):

from tedbench.data import LightningStructureDataset

dm = LightningStructureDataset(
    root="TEDBench/ted",   # HF repo ID
    dataset_name="hf_ted",
    batch_size=32,
    num_workers=4,
)
dm.setup("fit")
for coords, res_idx, seq_ids, chain, label in dm.train_dataloader():
    ...

Auto-download from MPCDF (dataset_name="ted" / "cath4.4" / "afdb_stream"): the archive is fetched from the MPCDF datashare and cached under root on first use, so no manual download is needed:

dm = LightningStructureDataset(
    root="./datasets/ted",   # local cache directory
    dataset_name="ted",
    batch_size=32,
    num_workers=4,
)
dm.setup("fit")
for coords, res_idx, seq_ids, chain, label in dm.train_dataloader():
    ...

Pass datamodule=hf_ted (or datamodule=hf_cath_test, datamodule=hf_afdbfs) to any training script to use HuggingFace; omit it (or use the default config) for the auto-downloading local variant.


Pretrained Models

All models are available on HuggingFace and can be loaded with a single call:

from tedbench.utils.io import load_from_hf

model = load_from_hf("TEDBench/miae-b")  # pretrained MiAE-B
model.eval()

Pretrained MiAE (feature extractor / fine-tuning starting point)

| Model | HF repo | Params |
| --- | --- | --- |
| MiAE-S | TEDBench/miae-s | 29 M |
| MiAE-B | TEDBench/miae-b | 102 M |
| MiAE-B+seq | TEDBench/miae-b-seq | 102 M |
| MiAE-L | TEDBench/miae-l | 339 M |

Fine-tuned on TEDBench (fold classifier)

| Model | HF repo | TEDBench test acc (%) | CATH 4.4 test acc (%) |
| --- | --- | --- | --- |
| MiAE-S (ft) | TEDBench/miae-s-ft | 72.28 | 76.08 |
| MiAE-B (ft) | TEDBench/miae-b-ft | 73.71 | 75.72 |
| MiAE-B+seq (ft) | TEDBench/miae-b-seq-ft | 74.56 | 77.34 |
| MiAE-L (ft) | TEDBench/miae-l-ft | 73.47 | 76.46 |

Trained from scratch on TEDBench (no pretraining)

| Model | HF repo |
| --- | --- |
| MiAE-S (sc) | TEDBench/miae-s-sc |
| MiAE-B (sc) | TEDBench/miae-b-sc |
| MiAE-B+seq (sc) | TEDBench/miae-b-seq-sc |
| MiAE-L (sc) | TEDBench/miae-l-sc |

Evaluation

Evaluate any model from the HuggingFace Hub without any local data setup:

# Test fine-tuned MiAE-B on TEDBench test split
python main_test_ted.py \
    pretrained_model_path=TEDBench/miae-b-ft

# Test on the CATH 4.4 external experimental test set
python main_test_ted.py \
    datamodule=hf_cath_test \
    pretrained_model_path=TEDBench/miae-b-ft

# Test supervised-from-scratch MiAE-B
python main_test_ted.py \
    pretrained_model_path=TEDBench/miae-b-sc

# Linear probing with pretrained MiAE-B
python main_linprobe_ted.py \
    pretrained_model_path=TEDBench/miae-b
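To evaluate several checkpoints in one go, the commands above can be generated programmatically. The sketch below only builds and prints the command lines, using the script name, the `datamodule=hf_cath_test` override, and the repo IDs listed in this document; the helper `build_eval_commands` is hypothetical.

```python
import shlex

# Fine-tuned checkpoints from the table in "Pretrained Models".
FINE_TUNED = [
    "TEDBench/miae-s-ft",
    "TEDBench/miae-b-ft",
    "TEDBench/miae-b-seq-ft",
    "TEDBench/miae-l-ft",
]


def build_eval_commands(repos=FINE_TUNED):
    """Return argv lists for the TEDBench test split and the CATH 4.4 test set."""
    commands = []
    for repo in repos:
        model_arg = f"pretrained_model_path={repo}"
        # Default datamodule: TEDBench test split.
        commands.append(["python", "main_test_ted.py", model_arg])
        # External CATH 4.4 experimental test set.
        commands.append(["python", "main_test_ted.py", "datamodule=hf_cath_test", model_arg])
    return commands


for argv in build_eval_commands():
    print(shlex.join(argv))
```

Each argv list could then be passed to `subprocess.run(argv, check=True)` from the repository root to actually launch the evaluation.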

Model Variants

| Name | Params | Layers | Hidden dim | Attn heads |
| --- | --- | --- | --- | --- |
| miae_s | 29 M | 6 | 512 | 8 |
| miae_b | 102 M | 12 | 768 | 12 |
| miae_l | 339 M | 24 | 1,024 | 16 |

Pass model.name=<variant> to any training script to select a size. Add model.use_seq_input=true to enable the +seq variant (structure + sequence).
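The variant table can be captured in a small lookup, shown here as a sketch. It derives the per-head dimension under the assumption that the hidden dimension is split evenly across attention heads, as in standard multi-head attention, and assembles the command-line overrides mentioned above. The names `VARIANTS`, `head_dim`, and `training_overrides` are hypothetical and not part of tedbench.

```python
# Values copied from the "Model Variants" table.
VARIANTS = {
    "miae_s": {"layers": 6, "hidden_dim": 512, "heads": 8},
    "miae_b": {"layers": 12, "hidden_dim": 768, "heads": 12},
    "miae_l": {"layers": 24, "hidden_dim": 1024, "heads": 16},
}


def head_dim(name):
    """Per-head width, assuming hidden_dim is divided evenly across heads."""
    cfg = VARIANTS[name]
    if cfg["hidden_dim"] % cfg["heads"] != 0:
        raise ValueError(f"{name}: hidden_dim is not divisible by heads")
    return cfg["hidden_dim"] // cfg["heads"]


def training_overrides(name, use_seq_input=False):
    """Build the key=value overrides for selecting a variant."""
    if name not in VARIANTS:
        raise KeyError(f"unknown variant: {name}")
    overrides = [f"model.name={name}"]
    if use_seq_input:
        overrides.append("model.use_seq_input=true")
    return overrides


for name in VARIANTS:
    print(name, head_dim(name))
print(" ".join(training_overrides("miae_b", use_seq_input=True)))
```

Under that assumption all three sizes use the same per-head width, so the variants differ only in depth, hidden dimension, and head count.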


Training and Reproducing Paper Results

See TRAINING.md for full pretraining, fine-tuning, linear probing, and baseline reproduction commands with hyperparameter tables.

The baselines/ directory contains scripts for ESM2, SaProt, and ProteinMPNN baselines. See TRAINING.md for usage.


Citation

@inproceedings{chen2026tedbench,
  title={Protein Fold Classification at Scale: Benchmarking and Pretraining},
  author={Chen, Dexiong and Manolache, Andrei and Niepert, Mathias and Borgwardt, Karsten},
  booktitle={Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year={2026}
}

License

BSD-3-Clause
