Skip to main content

TEDBench: Large-Scale Protein Fold Classification Benchmark and MiAE Pretraining

Project description

TedBench logo

TEDBench

arXiv  Hugging Face 

TEDBench is a large-scale, non-redundant benchmark for protein fold classification, together with MiAE (Masked Invariant Autoencoders), a self-supervised pretraining framework for protein structure representations.

Paper: Protein Fold Classification at Scale: Benchmarking and Pretraining
Dexiong Chen, Andrei Manolache, Mathias Niepert, Karsten Borgwardt (ICML 2026 spotlight)


Overview

TEDBench is built from the Encyclopedia of Domains (TED) annotations projected onto the Foldseek-clustered AlphaFold Database.

Split Structures
Train 369,740
Val 46,217
Test 46,218
External test (CATH 4.4 experimental) 27,638

All structures are classified into 965 CATH topology (T-level) classes.

MiAE is an SE(3)-invariant masked autoencoder that masks up to 90 % of backbone frames, processes only the visible residues with a geometric encoder, and reconstructs the full backbone structure with a lightweight decoder.


Installation

From PyPI (recommended):

pip install tedbench

For running ESM2 / SaProt baselines, add the baselines extra:

pip install "tedbench[baselines]"

From source (for training, baselines, or development):

# 1. Create and activate environment
micromamba create -n tedbench python=3.12 -y
micromamba activate tedbench

# 2. Install dependencies
uv pip install -r requirements.txt

# 3. Install the tedbench package (editable)
uv pip install -e .

Datasets

Datasets are available from two sources:

Dataset HuggingFace Direct download
TEDBench (AFDB + CATH labels) TEDBench/ted MPCDF datashare
AFDB pretraining corpus TEDBench/afdb MPCDF datashare
CATH 4.4 experimental test set TEDBench/cath MPCDF datashare

The HuggingFace repos require no local setup; the MPCDF archives are auto-downloaded and cached the first time a local dataset class is instantiated (default roots: ./datasets/ted/ and ./datasets/cath/).

Each sample contains: coords [L, 3, 3] (backbone N/Cα/C, float32), plddt [L], residue_index [L], seq_ids [L], sequence, and label (integer CATH topology index).

Load directly with datasets

from datasets import load_dataset
import torch

# TEDBench — train / val / test with CATH labels
ted = load_dataset("TEDBench/ted")
sample = ted["train"][0]
coords    = torch.tensor(sample["coords"])   # [L, 3, 3]
label     = sample["label"]                  # int index
cath_code = ted["train"].features["label"].int2str(label)  # e.g. "3.40.50.300"

# CATH 4.4 external test set
cath = load_dataset("TEDBench/cath", split="test")

# AFDB pretraining corpus
afdb = load_dataset("TEDBench/afdb", split="train")

Use with LightningStructureDataset

From HuggingFace (dataset_name="hf_ted" / "hf_cath4.4" / "hf_afdb"):

from tedbench.data import LightningStructureDataset

dm = LightningStructureDataset(
    root="TEDBench/ted",   # HF repo ID
    dataset_name="hf_ted",
    batch_size=32,
    num_workers=4,
)
dm.setup("fit")
for batch in dm.train_dataloader():
    print(batch.keys()) 
    # dict_keys(['coords', 'residue_index', 'seq_ids', 'protein_chain', 'mask', 'label'])

Auto-download from MPCDF (dataset_name="ted" / "cath4.4" / "afdb_stream"): the archive is fetched from the MPCDF datashare and cached under root on first use — no manual download needed:

dm = LightningStructureDataset(
    root="./datasets/ted",   # local cache directory
    dataset_name="ted",
    batch_size=32,
    num_workers=4,
)
dm.setup("fit")
for batch in dm.train_dataloader():
    print(batch.keys()) 
    # dict_keys(['coords', 'residue_index', 'seq_ids', 'protein_chain', 'mask', 'label'])

Pass datamodule=hf_ted (or datamodule=hf_cath_test, datamodule=hf_afdbfs) to any training script to use HuggingFace; omit it (or use the default config) for the auto-downloading local variant.


Pretrained Models

All models are available on HuggingFace and can be loaded with a single call:

import tedbench

model = tedbench.load_model("miae-b")     # pretrained MiAE-B (short name)
model = tedbench.load_model("miae-b-ft")  # fine-tuned on TEDBench

# List all available models
for m in tedbench.list_models():
    print(m["name"], m["type"], m["params"])

Pretrained MiAE (feature extractor / fine-tuning starting point)

Model HF repo Params
MiAE-S TEDBench/miae-s 29 M
MiAE-B TEDBench/miae-b 102 M
MiAE-B+seq TEDBench/miae-b-seq 102 M
MiAE-L TEDBench/miae-l 339 M

Fine-tuned on TEDBench (fold classifier)

Model HF repo TEDBench test acc CATH 4.4 test acc
MiAE-S (ft) TEDBench/miae-s-ft 72.28 76.08
MiAE-B (ft) TEDBench/miae-b-ft 73.71 75.72
MiAE-B+seq (ft) TEDBench/miae-b-seq-ft 74.56 77.34
MiAE-L (ft) TEDBench/miae-l-ft 73.47 76.46

Trained from scratch on TEDBench (no pretraining)

Model HF repo
MiAE-S (sc) TEDBench/miae-s-sc
MiAE-B (sc) TEDBench/miae-b-sc
MiAE-B+seq (sc) TEDBench/miae-b-seq-sc
MiAE-L (sc) TEDBench/miae-l-sc

Evaluation

Evaluate any model from the HuggingFace Hub without any local data setup:

# Test fine-tuned MiAE-B on TEDBench test split
python main_test_ted.py \
    datamodule=hf_ted \
    pretrained_model_path=TEDBench/miae-b-ft

# Test on the CATH 4.4 external experimental test set
python main_test_ted.py \
    datamodule=hf_cath_test \
    pretrained_model_path=TEDBench/miae-b-ft

# Test fine-tuned MiAE-B+seq on TEDBench test split
python main_test_ted.py \
    datamodule=hf_ted \
    +model.use_seq_input=true \
    pretrained_model_path=TEDBench/miae-b-seq-ft

# Test supervised-from-scratch MiAE-B
python main_test_ted.py \
    pretrained_model_path=TEDBench/miae-b-sc

# Linear probing with pretrained MiAE-B
python main_linprobe_ted.py \
    pretrained_model_path=TEDBench/miae-b

Model Variants

Name Params Layers Hidden dim Attn heads
miae_s 29 M 6 512 8
miae_b 102 M 12 768 12
miae_l 339 M 24 1 024 16

Pass model.name=<variant> to any training script to select a size. Add model.use_seq_input=true to enable the +seq variant (structure + sequence).


Training and Reproducing Paper Results

See TRAINING.md for full pretraining, fine-tuning, linear probing, and baseline reproduction commands with hyperparameter tables.

The baselines/ directory contains scripts for ESM2, SaProt, and ProteinMPNN baselines. See TRAINING.md for usage.


Citation

@inproceedings{chen2026tedbench,
  title={Protein Fold Classification at Scale: Benchmarking and Pretraining},
  author={Chen, Dexiong and Manolache, Andrei and Niepert, Mathias and Borgwardt, Karsten},
  booktitle={Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year={2026}
}

License

BSD-3-Clause

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tedbench-0.2.0.tar.gz (704.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

tedbench-0.2.0-py3-none-any.whl (93.2 kB view details)

Uploaded Python 3

File details

Details for the file tedbench-0.2.0.tar.gz.

File metadata

  • Download URL: tedbench-0.2.0.tar.gz
  • Upload date:
  • Size: 704.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for tedbench-0.2.0.tar.gz
Algorithm Hash digest
SHA256 30589696ba0f2f1b74782d0e8269614feba4c7458124c56f405d6c7fdaf61094
MD5 710cded8849f3f21d252e707845b3bd4
BLAKE2b-256 0153b2cd294c91344b3082d64f687ac21680f033d35f23de5dd632e9e2283072

See more details on using hashes here.

Provenance

The following attestation bundles were made for tedbench-0.2.0.tar.gz:

Publisher: publish.yml on BorgwardtLab/TEDBench

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file tedbench-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: tedbench-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 93.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for tedbench-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 5c738bf3899b6bcfdd1c59104a4253cd4adc89128dfdae9902ac8a2a9e0c6a31
MD5 5c3e47431809ae94d81460244cf0e201
BLAKE2b-256 b731f155bb7a9e6a2cf66bc15b53ac97ee77d6003dd51b5f92e2dda687e3ad2d

See more details on using hashes here.

Provenance

The following attestation bundles were made for tedbench-0.2.0-py3-none-any.whl:

Publisher: publish.yml on BorgwardtLab/TEDBench

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page