TEDBench: Large-Scale Protein Fold Classification Benchmark and MiAE Pretraining

TEDBench

TEDBench is a large-scale, non-redundant benchmark for protein fold classification, together with MiAE (Masked Invariant Autoencoders), a self-supervised pretraining framework for protein structure representations.

Paper: Protein Fold Classification at Scale: Benchmarking and Pretraining
Dexiong Chen, Andrei Manolache, Mathias Niepert, Karsten Borgwardt (ICML 2026 spotlight)

Overview

TEDBench is built from the Encyclopedia of Domains (TED) annotations projected onto the Foldseek-clustered AlphaFold Database.

Split	Structures
Train	369,740
Val	46,217
Test	46,218
External test (CATH 4.4 experimental)	27,638

All structures are classified into 965 CATH topology (T-level) classes.
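As a quick sanity check on the table above, the three AFDB-derived splits can be summed and expressed as fractions, which come out at roughly 80/10/10. The sketch below only redoes that arithmetic; the mean number of training structures per class is a plain average and says nothing about how balanced the 965 classes actually are.

```python
# Split sizes and class count copied from the overview above.
splits = {"train": 369_740, "val": 46_217, "test": 46_218}
num_classes = 965

total = sum(splits.values())
fractions = {name: count / total for name, count in splits.items()}

print(f"total AFDB-derived structures: {total:,}")
for name, frac in fractions.items():
    print(f"{name:>5}: {frac:.1%}")

# Plain average only; the real per-class counts are not given here.
mean_train_per_class = splits["train"] / num_classes
print(f"mean training structures per class: {mean_train_per_class:.1f}")
```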

MiAE is an SE(3)-invariant masked autoencoder that masks up to 90% of backbone frames, processes only the visible residues with a geometric encoder, and reconstructs the full backbone structure with a lightweight decoder.
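To make the masking step concrete, here is a minimal sketch of choosing which residues stay visible to the encoder. It assumes uniform random masking at a fixed ratio; the description above does not specify the actual sampling strategy, and `sample_visible_indices` is a hypothetical helper, not part of the tedbench API.

```python
import random
from typing import List, Optional


def sample_visible_indices(
    num_residues: int, mask_ratio: float = 0.9, seed: Optional[int] = None
) -> List[int]:
    """Return the sorted indices of residues left visible after masking.

    Assumption: residues are masked uniformly at random.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    # Keep at least one residue so the encoder never receives an empty input.
    num_visible = max(1, round(num_residues * (1.0 - mask_ratio)))
    return sorted(rng.sample(range(num_residues), num_visible))


num_residues = 200
visible = sample_visible_indices(num_residues, mask_ratio=0.9, seed=0)
masked = sorted(set(range(num_residues)) - set(visible))
print(len(visible), len(masked))
```

With a 90% ratio, the encoder in this sketch would see only 20 of 200 residues, which is where the compute saving of processing visible residues only comes from.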

Installation

From PyPI (recommended):

pip install tedbench

For running ESM2 / SaProt baselines, add the baselines extra:

pip install "tedbench[baselines]"

From source (for training, baselines, or development):

# 1. Create and activate environment
micromamba create -n tedbench python=3.12 -y
micromamba activate tedbench

# 2. Install dependencies
uv pip install -r requirements.txt

# 3. Install the tedbench package (editable)
uv pip install -e .

Datasets

Datasets are available from two sources:

Dataset	HuggingFace	Direct download
TEDBench (AFDB + CATH labels)	`TEDBench/ted`	MPCDF datashare
AFDB pretraining corpus	`TEDBench/afdb`	MPCDF datashare
CATH 4.4 experimental test set	`TEDBench/cath`	MPCDF datashare

The HuggingFace repos require no local setup; the MPCDF archives are auto-downloaded and cached the first time a local dataset class is instantiated (default roots: ./datasets/ted/ and ./datasets/cath/).

Each sample contains: coords [L, 3, 3] (backbone N/Cα/C, float32), plddt [L], residue_index [L], seq_ids [L], sequence, and label (integer CATH topology index).
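The field layout above can be turned into a small consistency check. This is a sketch on plain Python lists: `check_sample` is a hypothetical helper, the toy sample is made up, and the check that `sequence` has one character per residue is an assumption not stated above.

```python
def check_sample(sample: dict) -> int:
    """Check a sample against the field layout above and return its length L."""
    coords = sample["coords"]
    length = len(coords)
    for residue in coords:
        # Each residue holds three backbone atoms (N, CA, C) with x, y, z each.
        if len(residue) != 3 or any(len(atom) != 3 for atom in residue):
            raise ValueError("coords must have shape [L, 3, 3]")
    for key in ("plddt", "residue_index", "seq_ids"):
        if len(sample[key]) != length:
            raise ValueError(f"{key} must have length L={length}")
    # Assumption: `sequence` is a string with one character per residue.
    if len(sample["sequence"]) != length:
        raise ValueError("sequence must have length L")
    if not isinstance(sample["label"], int):
        raise ValueError("label must be an integer class index")
    return length


# Made-up two-residue sample; every value here is illustrative, not real data.
toy_sample = {
    "coords": [
        [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [2.0, 1.4, 0.0]],
        [[3.3, 1.6, 0.0], [4.0, 2.9, 0.0], [5.5, 2.8, 0.0]],
    ],
    "plddt": [90.0, 85.0],
    "residue_index": [1, 2],
    "seq_ids": [0, 1],
    "sequence": "MK",
    "label": 0,
}
print(check_sample(toy_sample))
```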

Load directly with `datasets`

from datasets import load_dataset
import torch

# TEDBench — train / val / test with CATH labels
ted = load_dataset("TEDBench/ted")
sample = ted["train"][0]
coords    = torch.tensor(sample["coords"])   # [L, 3, 3]
label     = sample["label"]                  # int index
cath_code = ted["train"].features["label"].int2str(label)  # e.g. "3.40.50.300"

# CATH 4.4 external test set
cath = load_dataset("TEDBench/cath", split="test")

# AFDB pretraining corpus
afdb = load_dataset("TEDBench/afdb", split="train")
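Once a sample is loaded, the `coords` field can be used without any tensor library. The sketch below extracts the Cα trace and the distances between consecutive Cα atoms, relying on the N/Cα/C atom order given earlier. The helper names are hypothetical and the toy coordinates are made up; coordinates are assumed to be in ångström, which the text above does not state.

```python
import math


def ca_trace(coords):
    """Return the C-alpha position of every residue from [L, 3, 3] coords."""
    # Atom order along the second axis is N, CA, C, so CA sits at index 1.
    return [residue[1] for residue in coords]


def consecutive_ca_distances(coords):
    """Return the L-1 distances between neighbouring C-alpha atoms."""
    ca = ca_trace(coords)
    return [math.dist(a, b) for a, b in zip(ca, ca[1:])]


# Made-up three-residue backbone; only the CA positions matter here.
toy_coords = [
    [[-1.0, 0.5, 0.0], [0.0, 0.0, 0.0], [1.2, 0.6, 0.0]],
    [[2.6, 0.3, 0.0], [3.8, 0.0, 0.0], [4.9, 0.8, 0.0]],
    [[4.2, 2.5, 0.0], [3.8, 3.8, 0.0], [5.0, 4.5, 0.0]],
]
distances = consecutive_ca_distances(toy_coords)
print([round(d, 2) for d in distances])
```

The same call should work on a real sample, for example `consecutive_ca_distances(sample["coords"])`, since `datasets` returns nested Python lists by default.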

Use with `LightningStructureDataset`

From HuggingFace (dataset_name="hf_ted" / "hf_cath4.4" / "hf_afdb"):

from tedbench.data import LightningStructureDataset

dm = LightningStructureDataset(
    root="TEDBench/ted",   # HF repo ID
    dataset_name="hf_ted",
    batch_size=32,
    num_workers=4,
)
dm.setup("fit")
for batch in dm.train_dataloader():
    print(batch.keys())
    # dict_keys(['coords', 'residue_index', 'seq_ids', 'protein_chain', 'mask', 'label'])
    break  # inspect only the first batch

Auto-download from MPCDF (dataset_name="ted" / "cath4.4" / "afdb_stream"): the archive is fetched from the MPCDF datashare and cached under root on first use, so no manual download is needed:

from tedbench.data import LightningStructureDataset

dm = LightningStructureDataset(
    root="./datasets/ted",   # local cache directory
    dataset_name="ted",
    batch_size=32,
    num_workers=4,
)
dm.setup("fit")
for batch in dm.train_dataloader():
    print(batch.keys())
    # dict_keys(['coords', 'residue_index', 'seq_ids', 'protein_chain', 'mask', 'label'])
    break  # inspect only the first batch

Pass datamodule=hf_ted (or datamodule=hf_cath_test or datamodule=hf_afdbfs) to any training script to load data from HuggingFace; omit the override (that is, use the default config) for the auto-downloading local variant.

Pretrained Models

All models are available on HuggingFace and can be loaded with a single call:

import tedbench

model = tedbench.load_model("miae-b")     # pretrained MiAE-B (short name)
model = tedbench.load_model("miae-b-ft")  # fine-tuned on TEDBench

# List all available models
for m in tedbench.list_models():
    print(m["name"], m["type"], m["params"])

Pretrained MiAE (feature extractor / fine-tuning starting point)

Model	HF repo	Params
MiAE-S	`TEDBench/miae-s`	29 M
MiAE-B	`TEDBench/miae-b`	102 M
MiAE-B+seq	`TEDBench/miae-b-seq`	102 M
MiAE-L	`TEDBench/miae-l`	339 M

Fine-tuned on TEDBench (fold classifier)

Model	HF repo	TEDBench test acc (%)	CATH 4.4 test acc (%)
MiAE-S (ft)	`TEDBench/miae-s-ft`	72.28	76.08
MiAE-B (ft)	`TEDBench/miae-b-ft`	73.71	75.72
MiAE-B+seq (ft)	`TEDBench/miae-b-seq-ft`	74.56	77.34
MiAE-L (ft)	`TEDBench/miae-l-ft`	73.47	76.46
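The table can be compared programmatically. The sketch below copies the accuracies from the table, finds the best model on each test set, and computes how much the +seq variant adds over plain MiAE-B; it is only a reading aid for the numbers above, not an additional result.

```python
# (TEDBench test acc, CATH 4.4 test acc), copied from the table above.
results = {
    "MiAE-S (ft)": (72.28, 76.08),
    "MiAE-B (ft)": (73.71, 75.72),
    "MiAE-B+seq (ft)": (74.56, 77.34),
    "MiAE-L (ft)": (73.47, 76.46),
}

best_ted = max(results, key=lambda name: results[name][0])
best_cath = max(results, key=lambda name: results[name][1])

# Gain from adding sequence input to MiAE-B, in accuracy points.
seq_gain_ted = results["MiAE-B+seq (ft)"][0] - results["MiAE-B (ft)"][0]
seq_gain_cath = results["MiAE-B+seq (ft)"][1] - results["MiAE-B (ft)"][1]

print(best_ted, best_cath)
print(f"{seq_gain_ted:.2f} {seq_gain_cath:.2f}")
```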

Trained from scratch on TEDBench (no pretraining)

Model	HF repo
MiAE-S (sc)	`TEDBench/miae-s-sc`
MiAE-B (sc)	`TEDBench/miae-b-sc`
MiAE-B+seq (sc)	`TEDBench/miae-b-seq-sc`
MiAE-L (sc)	`TEDBench/miae-l-sc`
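The repo IDs in the three tables follow a regular naming pattern: a size letter, an optional `-seq`, and an optional `-ft` or `-sc` suffix. A small helper can build them; `repo_id` is a hypothetical convenience function inferred from the tables, not part of the tedbench package.

```python
def repo_id(size: str, seq: bool = False, variant: str = "pretrained") -> str:
    """Build a HuggingFace repo ID following the pattern in the tables above."""
    suffixes = {"pretrained": "", "ft": "-ft", "sc": "-sc"}
    if size not in ("s", "b", "l"):
        raise ValueError("size must be one of 's', 'b', 'l'")
    if variant not in suffixes:
        raise ValueError("variant must be 'pretrained', 'ft' or 'sc'")
    if seq and size != "b":
        # The tables list a +seq model only for the B size.
        raise ValueError("only MiAE-B has a +seq variant in the tables")
    seq_part = "-seq" if seq else ""
    return f"TEDBench/miae-{size}{seq_part}{suffixes[variant]}"


print(repo_id("b"))
print(repo_id("b", seq=True, variant="ft"))
print(repo_id("l", variant="sc"))
```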

Evaluation

Evaluate any model from the HuggingFace Hub without any local data setup:

# Test fine-tuned MiAE-B on TEDBench test split
python main_test_ted.py \
    datamodule=hf_ted \
    pretrained_model_path=TEDBench/miae-b-ft

# Test on the CATH 4.4 external experimental test set
python main_test_ted.py \
    datamodule=hf_cath_test \
    pretrained_model_path=TEDBench/miae-b-ft

# Test fine-tuned MiAE-B+seq on TEDBench test split
python main_test_ted.py \
    datamodule=hf_ted \
    +model.use_seq_input=true \
    pretrained_model_path=TEDBench/miae-b-seq-ft

# Test supervised-from-scratch MiAE-B
python main_test_ted.py \
    pretrained_model_path=TEDBench/miae-b-sc

# Linear probing with pretrained MiAE-B
python main_linprobe_ted.py \
    pretrained_model_path=TEDBench/miae-b
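When several models or test sets need to be evaluated in a row, the commands above can be assembled from Python. The sketch below only builds and prints the argument lists using the overrides shown above; `build_eval_command` is a hypothetical helper, and actually running the commands (for example with `subprocess.run`) assumes a source checkout that contains the scripts.

```python
import shlex


def build_eval_command(model_repo, datamodule=None, use_seq_input=False,
                       script="main_test_ted.py"):
    """Return the argument list for one evaluation run."""
    cmd = ["python", script]
    if datamodule is not None:
        cmd.append(f"datamodule={datamodule}")
    if use_seq_input:
        # Same override as in the MiAE-B+seq example above.
        cmd.append("+model.use_seq_input=true")
    cmd.append(f"pretrained_model_path={model_repo}")
    return cmd


# Evaluate one fine-tuned model on both test sets.
for datamodule in ("hf_ted", "hf_cath_test"):
    print(shlex.join(build_eval_command("TEDBench/miae-b-ft", datamodule)))
```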

Model Variants

Name	Params	Layers	Hidden dim	Attn heads
`miae_s`	29 M	6	512	8
`miae_b`	102 M	12	768	12
`miae_l`	339 M	24	1,024	16

Pass model.name=<variant> to any training script to select a size. Add model.use_seq_input=true to enable the +seq variant (structure + sequence).
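One detail that follows from the variants table is the per-head width. Assuming standard multi-head attention, where the hidden dimension is split evenly across heads (an assumption, since the attention layout is not described here), all three sizes end up with the same head dimension:

```python
# Layers, hidden dim and attention heads copied from the table above.
variants = {
    "miae_s": {"layers": 6, "hidden_dim": 512, "heads": 8},
    "miae_b": {"layers": 12, "hidden_dim": 768, "heads": 12},
    "miae_l": {"layers": 24, "hidden_dim": 1024, "heads": 16},
}

head_dims = {}
for name, cfg in variants.items():
    # The hidden dim must divide evenly across heads for this split to work.
    assert cfg["hidden_dim"] % cfg["heads"] == 0
    head_dims[name] = cfg["hidden_dim"] // cfg["heads"]

print(head_dims)
```

Under that assumption the variants differ in depth and width but keep a head dimension of 64.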

Training and Reproducing Paper Results

See TRAINING.md for full pretraining, fine-tuning, linear probing, and baseline reproduction commands with hyperparameter tables.

The baselines/ directory contains scripts for ESM2, SaProt, and ProteinMPNN baselines. See TRAINING.md for usage.

Citation

@inproceedings{chen2026tedbench,
  title={Protein Fold Classification at Scale: Benchmarking and Pretraining},
  author={Chen, Dexiong and Manolache, Andrei and Niepert, Mathias and Borgwardt, Karsten},
  booktitle={Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year={2026}
}

License

BSD-3-Clause

Release history

Version	Released
0.2.0 (this version)	May 19, 2026
0.1.0	May 13, 2026

Download files

Source Distribution

tedbench-0.2.0.tar.gz (704.3 kB)

Uploaded May 19, 2026 Source

Built Distribution

tedbench-0.2.0-py3-none-any.whl (93.2 kB)

Uploaded May 19, 2026 Python 3

File details

Details for the file tedbench-0.2.0.tar.gz.

File metadata

Download URL: tedbench-0.2.0.tar.gz
Upload date: May 19, 2026
Size: 704.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for tedbench-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`30589696ba0f2f1b74782d0e8269614feba4c7458124c56f405d6c7fdaf61094`
MD5	`710cded8849f3f21d252e707845b3bd4`
BLAKE2b-256	`0153b2cd294c91344b3082d64f687ac21680f033d35f23de5dd632e9e2283072`

Provenance

The following attestation bundles were made for tedbench-0.2.0.tar.gz:

Publisher: publish.yml on BorgwardtLab/TEDBench

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: tedbench-0.2.0.tar.gz
- Subject digest: 30589696ba0f2f1b74782d0e8269614feba4c7458124c56f405d6c7fdaf61094
- Sigstore transparency entry: 1572380231
- Sigstore integration time: May 19, 2026
Source repository:
- Permalink: BorgwardtLab/TEDBench@1139e7f3b1b2307391e3d15ed6cfc2e7104581e2
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/BorgwardtLab
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@1139e7f3b1b2307391e3d15ed6cfc2e7104581e2
- Trigger Event: push

File details

Details for the file tedbench-0.2.0-py3-none-any.whl.

File metadata

Download URL: tedbench-0.2.0-py3-none-any.whl
Upload date: May 19, 2026
Size: 93.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for tedbench-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5c738bf3899b6bcfdd1c59104a4253cd4adc89128dfdae9902ac8a2a9e0c6a31`
MD5	`5c3e47431809ae94d81460244cf0e201`
BLAKE2b-256	`b731f155bb7a9e6a2cf66bc15b53ac97ee77d6003dd51b5f92e2dda687e3ad2d`

Provenance

The following attestation bundles were made for tedbench-0.2.0-py3-none-any.whl:

Publisher: publish.yml on BorgwardtLab/TEDBench

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: tedbench-0.2.0-py3-none-any.whl
- Subject digest: 5c738bf3899b6bcfdd1c59104a4253cd4adc89128dfdae9902ac8a2a9e0c6a31
- Sigstore transparency entry: 1572380246
- Sigstore integration time: May 19, 2026
Source repository:
- Permalink: BorgwardtLab/TEDBench@1139e7f3b1b2307391e3d15ed6cfc2e7104581e2
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/BorgwardtLab
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@1139e7f3b1b2307391e3d15ed6cfc2e7104581e2
- Trigger Event: push
