TEDBench: Large-Scale Protein Fold Classification Benchmark and MiAE Pretraining
TEDBench
TEDBench is a large-scale, non-redundant benchmark for protein fold classification, together with MiAE (Masked Invariant Autoencoders), a self-supervised pretraining framework for protein structure representations.
Paper: Protein Fold Classification at Scale: Benchmarking and Pretraining
Dexiong Chen, Andrei Manolache, Mathias Niepert, Karsten Borgwardt (ICML 2026 spotlight)
Overview
TEDBench is built from the Encyclopedia of Domains (TED) annotations projected onto the Foldseek-clustered AlphaFold Database.
| Split | Structures |
|---|---|
| Train | 369,740 |
| Val | 46,217 |
| Test | 46,218 |
| External test (CATH 4.4 experimental) | 27,638 |
All structures are classified into 965 CATH topology (T-level) classes.
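As a quick sanity check on the table above, the three TEDBench splits can be turned into proportions. The sketch below uses only the counts and the class count stated in this README; it shows that the splits follow a roughly 80/10/10 ratio and gives the average number of training structures per class.

```python
# Split sizes copied from the table above (the external CATH 4.4 set is separate).
splits = {"train": 369_740, "val": 46_217, "test": 46_218}
num_classes = 965  # CATH topology (T-level) classes

total = sum(splits.values())
fractions = {name: count / total for name, count in splits.items()}

for name, count in splits.items():
    print(f"{name:>5}: {count:>7,} structures ({fractions[name]:.1%})")
print(f"total: {total:,} structures")

# Average class size in the training split; real class sizes may be far from uniform.
train_per_class = splits["train"] / num_classes
print(f"mean training structures per class: {train_per_class:.1f}")
```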
MiAE is an SE(3)-invariant masked autoencoder that masks up to 90% of backbone frames, processes only the visible residues with a geometric encoder, and reconstructs the full backbone structure with a lightweight decoder.
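The two ideas in that sentence, high-ratio masking and SE(3) invariance, can be illustrated with a small NumPy sketch. This is not the MiAE implementation: uniform random masking is an assumption (the README does not state the masking strategy), the helper names are hypothetical, and the pairwise Cα distance matrix is used only as a familiar example of a quantity that does not change under rotation and translation.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_visible_indices(num_residues, mask_ratio, rng):
    """Hypothetical helper: sorted indices of residues left visible.

    Assumes uniform random masking, which the README does not confirm.
    """
    num_visible = max(1, round(num_residues * (1.0 - mask_ratio)))
    return np.sort(rng.choice(num_residues, size=num_visible, replace=False))

def ca_distance_matrix(coords):
    """Pairwise C-alpha distances for coords of shape [L, 3, 3] (N, CA, C)."""
    ca = coords[:, 1, :]
    return np.linalg.norm(ca[:, None, :] - ca[None, :, :], axis=-1)

# Toy backbone with 100 residues (random numbers, not a real protein).
coords = rng.normal(size=(100, 3, 3)).astype(np.float32)

# Keep 10% of the residues visible, i.e. mask 90%.
visible = random_visible_indices(len(coords), mask_ratio=0.9, rng=rng)
visible_coords = coords[visible]  # only these would reach the encoder

# Build a random rigid motion: a proper rotation (det = +1) plus a translation.
rotation, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(rotation) < 0:
    rotation[:, 0] *= -1
translation = rng.normal(size=3)
moved = coords @ rotation.T + translation

# Distances are unchanged by the rigid motion (up to floating-point error).
invariant = np.allclose(
    ca_distance_matrix(coords), ca_distance_matrix(moved), atol=1e-4
)
print(visible_coords.shape, invariant)
```

Because the encoder only sees `visible_coords`, its cost scales with the number of visible residues rather than with the full length, which is the usual motivation for high masking ratios in masked autoencoders.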
Installation
From PyPI (recommended):
pip install tedbench
For running ESM2 / SaProt baselines, add the baselines extra:
pip install "tedbench[baselines]"
From source (for training, baselines, or development):
# 1. Create and activate environment
micromamba create -n tedbench python=3.12 -y
micromamba activate tedbench
# 2. Install dependencies
uv pip install -r requirements.txt
# 3. Install the tedbench package (editable)
uv pip install -e .
Datasets
Datasets are available from two sources:
| Dataset | HuggingFace | Direct download |
|---|---|---|
| TEDBench (AFDB + CATH labels) | TEDBench/ted | MPCDF datashare |
| AFDB pretraining corpus | TEDBench/afdb | MPCDF datashare |
| CATH 4.4 experimental test set | TEDBench/cath | MPCDF datashare |
The HuggingFace repos require no local setup; the MPCDF archives are auto-downloaded and cached the first time a local dataset class is instantiated (default roots: ./datasets/ted/ and ./datasets/cath/).
Each sample contains: coords [L, 3, 3] (backbone N/Cα/C, float32), plddt [L], residue_index [L], seq_ids [L], sequence, and label (integer CATH topology index).
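The field layout above can be checked programmatically before a sample is fed to a model. The following is a minimal sketch: `check_sample` is a hypothetical helper, not part of the tedbench package, and it only verifies the shapes that the README states explicitly (it makes no assumption about the length of `sequence`). The toy sample is synthetic.

```python
import numpy as np

def check_sample(sample):
    """Hypothetical validator: return the length L of a TEDBench-style sample."""
    coords = np.asarray(sample["coords"], dtype=np.float32)
    if coords.ndim != 3 or coords.shape[1:] != (3, 3):
        raise ValueError(f"coords must have shape [L, 3, 3], got {coords.shape}")
    length = coords.shape[0]
    # Per-residue fields documented with shape [L].
    for key in ("plddt", "residue_index", "seq_ids"):
        if len(sample[key]) != length:
            raise ValueError(f"{key} has length {len(sample[key])}, expected {length}")
    return length

# Synthetic five-residue sample with the documented keys.
toy_sample = {
    "coords": np.zeros((5, 3, 3), dtype=np.float32).tolist(),
    "plddt": [90.0] * 5,
    "residue_index": list(range(5)),
    "seq_ids": [0] * 5,
    "sequence": "ACDEF",
    "label": 0,
}

num_residues = check_sample(toy_sample)
# Index 1 along the atom axis is C-alpha, following the N/CA/C order above.
ca_trace = np.asarray(toy_sample["coords"], dtype=np.float32)[:, 1, :]
print(num_residues, ca_trace.shape)
```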
Load directly with datasets
from datasets import load_dataset
import torch
# TEDBench — train / val / test with CATH labels
ted = load_dataset("TEDBench/ted")
sample = ted["train"][0]
coords = torch.tensor(sample["coords"]) # [L, 3, 3]
label = sample["label"] # int index
cath_code = ted["train"].features["label"].int2str(label) # e.g. "3.40.50.300"
# CATH 4.4 external test set
cath = load_dataset("TEDBench/cath", split="test")
# AFDB pretraining corpus
afdb = load_dataset("TEDBench/afdb", split="train")
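With 965 classes, it is often useful to look at how many structures each class has before training. The sketch below counts labels with the standard library; `class_counts` is a hypothetical helper, and the toy label list stands in for a real label column such as `ted["train"]["label"]`.

```python
from collections import Counter

def class_counts(labels, num_classes):
    """Hypothetical helper: number of samples per integer class index."""
    counts = Counter(labels)
    return [counts.get(index, 0) for index in range(num_classes)]

# Toy labels; with the real data one could pass the label column instead.
toy_labels = [0, 2, 2, 1, 2]
counts = class_counts(toy_labels, num_classes=4)
largest_class = max(range(len(counts)), key=counts.__getitem__)
empty_classes = [index for index, count in enumerate(counts) if count == 0]
print(counts, largest_class, empty_classes)
```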
Use with LightningStructureDataset
From HuggingFace (dataset_name="hf_ted" / "hf_cath4.4" / "hf_afdb"):
from tedbench.data import LightningStructureDataset
dm = LightningStructureDataset(
    root="TEDBench/ted",  # HF repo ID
    dataset_name="hf_ted",
    batch_size=32,
    num_workers=4,
)
dm.setup("fit")
for batch in dm.train_dataloader():
    print(batch.keys())
    # dict_keys(['coords', 'residue_index', 'seq_ids', 'protein_chain', 'mask', 'label'])
Auto-download from MPCDF (dataset_name="ted" / "cath4.4" / "afdb_stream"): the archive is fetched from the MPCDF datashare and cached under root on first use — no manual download needed:
dm = LightningStructureDataset(
    root="./datasets/ted",  # local cache directory
    dataset_name="ted",
    batch_size=32,
    num_workers=4,
)
dm.setup("fit")
for batch in dm.train_dataloader():
    print(batch.keys())
    # dict_keys(['coords', 'residue_index', 'seq_ids', 'protein_chain', 'mask', 'label'])
Pass datamodule=hf_ted (or datamodule=hf_cath_test, datamodule=hf_afdbfs) to any
training script to use HuggingFace; omit it (or use the default config) for the
auto-downloading local variant.
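The batches printed above contain a `mask` key next to `coords`, which is what one would expect when proteins of different lengths are padded to a common length. The sketch below shows one common way to build such a batch. It is an illustration only: `pad_batch` is a hypothetical function, and the convention that `True` marks real residues is an assumption, not something the README states about the tedbench collate function.

```python
import numpy as np

def pad_batch(coords_list):
    """Hypothetical collate: pad [L_i, 3, 3] arrays to [B, L_max, 3, 3].

    Returns the padded coordinates and a boolean mask of shape [B, L_max]
    where True marks real residues (assumed convention).
    """
    max_len = max(coords.shape[0] for coords in coords_list)
    batch = np.zeros((len(coords_list), max_len, 3, 3), dtype=np.float32)
    mask = np.zeros((len(coords_list), max_len), dtype=bool)
    for row, coords in enumerate(coords_list):
        length = coords.shape[0]
        batch[row, :length] = coords
        mask[row, :length] = True
    return batch, mask

# Two toy proteins with 2 and 4 residues.
batch, mask = pad_batch([np.ones((2, 3, 3)), np.ones((4, 3, 3))])
print(batch.shape, mask.sum(axis=1))
```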
Pretrained Models
All models are available on HuggingFace and can be loaded with a single call:
import tedbench
model = tedbench.load_model("miae-b") # pretrained MiAE-B (short name)
model = tedbench.load_model("miae-b-ft") # fine-tuned on TEDBench
# List all available models
for m in tedbench.list_models():
    print(m["name"], m["type"], m["params"])
Pretrained MiAE (feature extractor / fine-tuning starting point)
| Model | HF repo | Params |
|---|---|---|
| MiAE-S | TEDBench/miae-s | 29 M |
| MiAE-B | TEDBench/miae-b | 102 M |
| MiAE-B+seq | TEDBench/miae-b-seq | 102 M |
| MiAE-L | TEDBench/miae-l | 339 M |
Fine-tuned on TEDBench (fold classifier)
| Model | HF repo | TEDBench test acc (%) | CATH 4.4 test acc (%) |
|---|---|---|---|
| MiAE-S (ft) | TEDBench/miae-s-ft | 72.28 | 76.08 |
| MiAE-B (ft) | TEDBench/miae-b-ft | 73.71 | 75.72 |
| MiAE-B+seq (ft) | TEDBench/miae-b-seq-ft | 74.56 | 77.34 |
| MiAE-L (ft) | TEDBench/miae-l-ft | 73.47 | 76.46 |
Trained from scratch on TEDBench (no pretraining)
| Model | HF repo |
|---|---|
| MiAE-S (sc) | TEDBench/miae-s-sc |
| MiAE-B (sc) | TEDBench/miae-b-sc |
| MiAE-B+seq (sc) | TEDBench/miae-b-seq-sc |
| MiAE-L (sc) | TEDBench/miae-l-sc |
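The repository names in the three tables follow a regular pattern: `miae-<size>`, an optional `-seq`, and an optional `-ft` or `-sc` suffix. The sketch below encodes that pattern; `hf_repo_id` is a hypothetical convenience function inferred from the tables, not part of the tedbench API.

```python
def hf_repo_id(size, use_seq=False, variant="pretrained"):
    """Hypothetical helper: build a HuggingFace repo ID from the naming pattern.

    size:    "s", "b" or "l"
    variant: "pretrained", "finetuned" (-ft) or "scratch" (-sc)
    """
    suffixes = {"pretrained": "", "finetuned": "-ft", "scratch": "-sc"}
    if size not in ("s", "b", "l"):
        raise ValueError(f"unknown size: {size!r}")
    if variant not in suffixes:
        raise ValueError(f"unknown variant: {variant!r}")
    if use_seq and size != "b":
        # The tables above list a +seq model only for the B size.
        raise ValueError("the +seq variant is only listed for size 'b'")
    seq_part = "-seq" if use_seq else ""
    return f"TEDBench/miae-{size}{seq_part}{suffixes[variant]}"

print(hf_repo_id("b"))
print(hf_repo_id("b", use_seq=True, variant="finetuned"))
print(hf_repo_id("l", variant="scratch"))
```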
Evaluation
Evaluate any model from the HuggingFace Hub without any local data setup:
# Test fine-tuned MiAE-B on TEDBench test split
python main_test_ted.py \
datamodule=hf_ted \
pretrained_model_path=TEDBench/miae-b-ft
# Test on the CATH 4.4 external experimental test set
python main_test_ted.py \
datamodule=hf_cath_test \
pretrained_model_path=TEDBench/miae-b-ft
# Test fine-tuned MiAE-B+seq on TEDBench test split
python main_test_ted.py \
datamodule=hf_ted \
+model.use_seq_input=true \
pretrained_model_path=TEDBench/miae-b-seq-ft
# Test supervised-from-scratch MiAE-B
python main_test_ted.py \
pretrained_model_path=TEDBench/miae-b-sc
# Linear probing with pretrained MiAE-B
python main_linprobe_ted.py \
pretrained_model_path=TEDBench/miae-b
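To evaluate several checkpoints in one go, the commands above can be generated from Python. The sketch below only assembles the argument lists shown in this section and prints them; `build_eval_command` is a hypothetical helper, and detecting the +seq models from `-seq` in the repo name is an assumption based on the naming in the model tables.

```python
import shlex

def build_eval_command(model_repo, datamodule=None, script="main_test_ted.py"):
    """Hypothetical helper: argument list for one evaluation run."""
    command = ["python", script]
    if datamodule is not None:
        command.append(f"datamodule={datamodule}")
    if "-seq" in model_repo:
        # The +seq example above passes this extra override.
        command.append("+model.use_seq_input=true")
    command.append(f"pretrained_model_path={model_repo}")
    return command

finetuned_models = [
    "TEDBench/miae-s-ft",
    "TEDBench/miae-b-ft",
    "TEDBench/miae-b-seq-ft",
    "TEDBench/miae-l-ft",
]

commands = [
    build_eval_command(repo, datamodule)
    for repo in finetuned_models
    for datamodule in ("hf_ted", "hf_cath_test")
]
for command in commands:
    print(shlex.join(command))
```

Each list can be passed to `subprocess.run(command, check=True)` from the repository root to execute the run.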
Model Variants
| Name | Params | Layers | Hidden dim | Attn heads |
|---|---|---|---|---|
| miae_s | 29 M | 6 | 512 | 8 |
| miae_b | 102 M | 12 | 768 | 12 |
| miae_l | 339 M | 24 | 1024 | 16 |
Pass model.name=<variant> to any training script to select a size.
Add model.use_seq_input=true to enable the +seq variant (structure + sequence).
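The variant table and the two overrides can be captured in a small lookup structure. The sketch below is illustrative: the `MiAEVariant` class and its methods are hypothetical, and only the numbers and override names come from this README. It also makes one property of the table explicit: every variant uses the same per-head dimension (hidden dim divided by attention heads).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MiAEVariant:
    """Hypothetical record for one row of the variant table."""
    name: str
    params_millions: int
    layers: int
    hidden_dim: int
    attn_heads: int

    @property
    def head_dim(self):
        # Width of a single attention head.
        return self.hidden_dim // self.attn_heads

    def overrides(self, use_seq=False):
        # Command-line overrides as described in the text above.
        args = [f"model.name={self.name}"]
        if use_seq:
            args.append("model.use_seq_input=true")
        return args

VARIANTS = {
    variant.name: variant
    for variant in (
        MiAEVariant("miae_s", 29, 6, 512, 8),
        MiAEVariant("miae_b", 102, 12, 768, 12),
        MiAEVariant("miae_l", 339, 24, 1024, 16),
    )
}

for variant in VARIANTS.values():
    print(variant.name, variant.head_dim, variant.overrides())
```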
Training and Reproducing Paper Results
See TRAINING.md for full pretraining, fine-tuning, linear probing, and baseline reproduction commands with hyperparameter tables.
The baselines/ directory contains scripts for the ESM2, SaProt, and ProteinMPNN baselines; their usage is also covered in TRAINING.md.
Citation
@inproceedings{chen2026tedbench,
title={Protein Fold Classification at Scale: Benchmarking and Pretraining},
author={Chen, Dexiong and Manolache, Andrei and Niepert, Mathias and Borgwardt, Karsten},
booktitle={Proceedings of the 43rd International Conference on Machine Learning (ICML)},
year={2026}
}
License
BSD-3-Clause