open protein language model
OPLM
Open Protein Language Model — an encoder-only protein language model with a HuggingFace-native, ESM-C-style API. Load a pretrained checkpoint, hand it a list of sequences, and get back per-residue logits and embeddings.
Status: Pre-alpha (v0.0.1). The model, tokenizer, inference API, and HuggingFace integration are stable; pretrained checkpoints and benchmark results are still landing.
Highlights
- ESM-C-style inference API: `model.logits(sequences, LogitsConfig(...))` returns a structured `LogitsOutput` with `sequence_logits`, `embeddings`, `hidden_states`, and `attentions`. If you've used ESM-C, you already know it.
- HuggingFace-native: every model is a `PreTrainedModel`. Use `OplmForMaskedLM.from_pretrained(...)` or the `transformers` `Auto*` classes, load from the Hub, and `save_pretrained` / `push_to_hub` like any other model.
- ESM-C-compatible tokenizer: a 33-token `OplmTokenizerFast` with the same vocabulary and special tokens as ESM-C.
- Fast attention: built on PyTorch's `scaled_dot_product_attention` (a fused FlashAttention / memory-efficient kernel on CUDA, no separate `flash-attn` dependency), with a manual softmax path for returning attention weights.
- Five sizes: from a 5M-parameter ablation model up to 13B parameters.
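To illustrate the "Fast attention" point, here is a minimal sketch in plain PyTorch showing why two attention paths are useful: the fused `scaled_dot_product_attention` call does not hand back the attention matrix, while an explicit softmax does. The `manual_attention` helper and the toy shapes are illustrative assumptions, not OPLM's internal code.

```python
import math

import torch
import torch.nn.functional as F


def manual_attention(q, k, v):
    """Explicit softmax attention that also returns the attention weights."""
    scale = 1.0 / math.sqrt(q.size(-1))
    scores = (q @ k.transpose(-2, -1)) * scale  # (B, heads, T, T)
    weights = scores.softmax(dim=-1)            # each row sums to 1
    return weights @ v, weights


torch.manual_seed(0)
# Toy shapes: batch 2, 4 heads, 10 tokens, head dim 16.
q, k, v = (torch.randn(2, 4, 10, 16) for _ in range(3))

fused = F.scaled_dot_product_attention(q, k, v)  # fused path, output only
manual, weights = manual_attention(q, k, v)      # same output plus weights

max_diff = (fused - manual).abs().max().item()   # tiny floating-point difference
```

Both paths compute the same output up to floating-point error, so the explicit path is only needed when the attention weights themselves are requested.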
Installation
Requirements: Python ≥ 3.11 and PyTorch ≥ 2.10.
pip install oplm
Or from source:
git clone https://github.com/briney/oplm.git
cd oplm
pip install -e .
Inference needs no extras. The optional groups are for contributors:
pip install "oplm[train]" # distributed training (Accelerate, W&B, datasets)
pip install "oplm[dev]" # tests, linting, type checking
Quick start
Per-residue logits and embeddings
The primary entry point is OplmForMaskedLM. from_pretrained loads the weights
and attaches the matching tokenizer, so .logits() takes raw sequences directly:
import torch
from oplm import OplmForMaskedLM, LogitsConfig

model = OplmForMaskedLM.from_pretrained("brineylab/oplm-base").eval()

sequences = [
    "MSHHWGYGKHNGPEHWHKDFPIAKGERQSPVDIDTHTAKYDPSLKPLSVSYDQATSLRIL",
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVK",
]

with torch.no_grad():
    out = model.logits(
        sequences,
        LogitsConfig(sequence=True, return_embeddings=True),
    )

out.sequence_logits  # (B, T, 33) per-residue amino-acid logits
out.embeddings       # (B, T, hidden_size) per-residue embeddings
B is the batch size and T is the padded sequence length (each sequence is
wrapped with <cls> … <eos>).
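As a rough illustration of where T comes from, the sketch below computes the padded length and an attention mask for a batch. It assumes one token per residue, padding on the right, and padding to the longest sequence in the batch; the helper names are hypothetical and this is not the tokenizer's actual code.

```python
def wrapped_lengths(sequences):
    """Token count per sequence once <cls> and <eos> are added."""
    return [len(seq) + 2 for seq in sequences]


def attention_mask(sequences):
    """1 for real tokens (including <cls>/<eos>), 0 for padding."""
    lengths = wrapped_lengths(sequences)
    padded = max(lengths)
    return [[1] * n + [0] * (padded - n) for n in lengths]


sequences = ["MKTAYIAKQR", "MSHHW"]  # 10 and 5 residues
mask = attention_mask(sequences)
T = len(mask[0])

print(T)        # 12: the longest sequence (10) plus <cls> and <eos>
print(mask[1])  # 7 ones (5 residues + 2 specials) followed by 5 zeros
```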
Per-protein embeddings
For a single fixed-size vector per sequence, run the backbone (OplmModel) and
mask-aware mean-pool over the residue dimension:
import torch
from oplm import OplmModel
from oplm.model import mean_pool

model = OplmModel.from_pretrained("brineylab/oplm-base").eval()

# `sequences` is the list[str] defined in the previous example.
batch = model.tokenize(sequences)  # BatchEncoding on the model's device

with torch.no_grad():
    out = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
    )

per_residue = out.last_hidden_state                            # (B, T, hidden_size)
per_protein = mean_pool(per_residue, batch["attention_mask"])  # (B, hidden_size)
oplm.model also exports cls_pool if you prefer the <cls> representation.
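For readers who want to see what mask-aware pooling does, here is a minimal sketch in plain PyTorch. The functions `mean_pool_sketch` and `cls_pool_sketch` are illustrative stand-ins, not the library's `mean_pool` / `cls_pool`; in particular, whether the real helper counts the `<cls>` and `<eos>` positions is an assumption this sketch does not settle (it averages every position the mask marks as real).

```python
import torch


def mean_pool_sketch(hidden, attention_mask):
    """Average over the token dimension, ignoring padded positions."""
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)  # (B, T, 1)
    summed = (hidden * mask).sum(dim=1)                   # (B, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)               # (B, 1), avoids divide-by-zero
    return summed / counts


def cls_pool_sketch(hidden):
    """Take the first position, where <cls> sits."""
    return hidden[:, 0]


# Two toy sequences, T = 3, hidden size 2. The last position of row 0 is padding.
hidden = torch.tensor([
    [[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]],
    [[2.0, 4.0], [4.0, 6.0], [6.0, 8.0]],
])
mask = torch.tensor([[1, 1, 0], [1, 1, 1]])

pooled = mean_pool_sketch(hidden, mask)
print(pooled)  # row 0 is [2, 2] (padding ignored), row 1 is [4, 6]
```

A plain `hidden.mean(dim=1)` would let the padded value of 100 leak into row 0, which is why the mask matters for batches of unequal length.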
Using the transformers Auto* API
Importing oplm registers the config, models, and tokenizer with transformers,
so the standard Auto* classes work too:
import torch
import oplm  # registers OPLM with transformers' Auto* classes
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("brineylab/oplm-base")
model = AutoModelForMaskedLM.from_pretrained("brineylab/oplm-base").eval()

# `sequences` is the list[str] defined in the first example.
batch = tokenizer(sequences, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(**batch).logits  # (B, T, 33)
AutoModel returns the bare encoder, and AutoModelForSequenceClassification /
AutoModelForTokenClassification return the corresponding fine-tuning heads.
Each model repo also bundles its modeling code, so consumers who don't have
oplm installed can load it with trust_remote_code=True:
model = AutoModelForMaskedLM.from_pretrained("brineylab/oplm-base", trust_remote_code=True)
Coming from ESM-C?
OPLM mirrors the ESM-C inference surface so existing pipelines port with minimal changes:
- `LogitsConfig(sequence=..., return_embeddings=...)` is the same knob object.
- `model.logits(...)` returns a `LogitsOutput`; read `output.embeddings` exactly as you would with ESM-C.
- The tokenizer vocabulary (33 tokens, `<cls>` / `<pad>` / `<eos>` / `<mask>` specials) matches ESM-C.
The main difference: OPLM's .logits() and .tokenize() accept a list[str] of
sequences directly, rather than pre-encoded protein tensors. Per-residue logits
live in output.sequence_logits.
Pretrained models
Checkpoints are published on the HuggingFace Hub under the brineylab org and
selectable on the command line via --preset.
| Preset / Hub id | Parameters | Layers | Hidden | Heads |
|---|---|---|---|---|
| brineylab/oplm-small | 5.2M | 6 | 256 | 4 |
| brineylab/oplm-medium | 85.6M | 12 | 768 | 12 |
| brineylab/oplm-base | 309.5M | 24 | 1024 | 16 |
| brineylab/oplm-large | 2.5B | 32 | 2560 | 32 |
| brineylab/oplm-xlarge | 12.7B | 40 | 5120 | 40 |
All sizes share the 33-token tokenizer and a 1024-position context window.
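When choosing a size, a back-of-envelope calculation from the table can help. The sketch below derives the per-head dimension (hidden size divided by heads) and the memory needed just to hold the weights. The bytes-per-parameter figures are the standard widths for fp32 and bf16; activations, optimizer state, and framework overhead are deliberately ignored, so treat the numbers as lower bounds.

```python
# (parameters, hidden size, attention heads), copied from the table above.
PRESETS = {
    "small": (5.2e6, 256, 4),
    "medium": (85.6e6, 768, 12),
    "base": (309.5e6, 1024, 16),
    "large": (2.5e9, 2560, 32),
    "xlarge": (12.7e9, 5120, 40),
}
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2}


def head_dim(preset):
    """Per-head dimension: hidden size split evenly across heads."""
    _, hidden, heads = PRESETS[preset]
    return hidden // heads


def weight_gb(preset, dtype):
    """Gigabytes (1e9 bytes) needed to store the weights alone."""
    n_params, _, _ = PRESETS[preset]
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for name in PRESETS:
    print(f"{name:>7}: head_dim={head_dim(name):3d}  "
          f"fp32={weight_gb(name, 'fp32'):7.2f} GB  "
          f"bf16={weight_gb(name, 'bf16'):7.2f} GB")
```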
Command line
oplm ships a small CLI; oplm --help lists every command.
Encode sequences to a file
oplm encode MKWVTFISLLLLFSSAYS MLPGLALLLLAAWTARA \
--model brineylab/oplm-base \
--output embeddings.pt
--model accepts a Hub id, a local HuggingFace export directory, or a training
checkpoint directory. The saved tensor holds the per-residue embeddings, with
shape (num_sequences, T, hidden_size).
Inspect a model
oplm info --preset base
──────────────── OPLM Model Info ────────────────
Architecture
Parameters 309.5M (309,507,105)
Hidden size 1024
Layers 24
Attention heads 16
Head dim 64
Intermediate size 2816
FFN activation swiglu
...
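The fields in this output are enough for a rough sanity check of the parameter count. The sketch below is an estimate under stated assumptions, not the library's own counting: it assumes four bias-free hidden-by-hidden attention projections per layer, a three-matrix SwiGLU feed-forward, and untied input and output embeddings over the 33-token vocabulary, and it ignores norms, biases, and the rest of the MLM head.

```python
hidden, layers, intermediate, vocab = 1024, 24, 2816, 33

attention = 4 * hidden * hidden     # Q, K, V and output projections
swiglu = 3 * hidden * intermediate  # gate, up and down projections
embeddings = 2 * vocab * hidden     # untied input and output embeddings

estimate = layers * (attention + swiglu) + embeddings
reported = 309_507_105              # from `oplm info --preset base`

print(f"{estimate:,}")                            # 308,348,928
print(f"{(reported - estimate) / reported:.2%}")  # 0.37%
```

The estimate lands within half a percent of the reported total; the remainder is plausibly the terms left out above.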
Training
OPLM trains with HuggingFace Accelerate (FSDP, mixed precision, gradient checkpointing, optional Muon optimizer) over parquet sequence datasets, with a built-in eval harness for MLM metrics and structure-based contact prediction.
oplm train --preset base --config configs/my_run.yaml
See docs/TRAIN.md for the full training guide.
Configuration
Models and runs are configured through a layered system
(defaults → preset → YAML → CLI overrides) built on a HuggingFace
PretrainedConfig. Architecture toggles, optimizer/scheduler settings, and the
dataset schema are all set here.
See docs/CONFIG.md for the field-by-field reference.
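The precedence order can be pictured with a small standard-library sketch. The field names and values here are hypothetical, chosen only to show how a later layer overrides an earlier one; the real fields are listed in docs/CONFIG.md, and OPLM's own merging code may work differently.

```python
from collections import ChainMap

# Hypothetical layers, lowest priority first.
defaults = {"hidden_size": 768, "num_layers": 12, "learning_rate": 1e-4}
preset = {"hidden_size": 1024, "num_layers": 24}
yaml_file = {"learning_rate": 3e-4}
cli_overrides = {"num_layers": 20}

# ChainMap looks keys up left to right, so the highest-priority layer goes first.
resolved = dict(ChainMap(cli_overrides, yaml_file, preset, defaults))

print(resolved["hidden_size"])    # 1024, from the preset
print(resolved["num_layers"])     # 20, the CLI override wins over the preset
print(resolved["learning_rate"])  # 0.0003, from the YAML file
```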
Architecture
OPLM is a pre-norm, bidirectional encoder transformer with RoPE and QK-norm: LayerNorm by default (RMSNorm available), SwiGLU feed-forward, untied input/output embeddings, standard multi-head attention, depth-stable residual scaling, and a BERT-style MLM head. A curated set of independently togglable research features (Canon depthwise convolutions, partial-RoPE/NoPE, sandwich / hybrid / post-SDPA norm) layers on top, each off by default.
For the complete specification, see docs/MODEL_ARCHITECTURE.md.
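As a small illustration of two of the pieces named above, here is a sketch of a SwiGLU feed-forward wrapped in a pre-norm residual connection, written in plain PyTorch. The class names, the bias-free linear layers, and the toy dimensions are assumptions for illustration; residual scaling and the other features are omitted, and docs/MODEL_ARCHITECTURE.md remains the reference for what OPLM actually does.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLU(nn.Module):
    """Gated feed-forward: down(silu(gate(x)) * up(x))."""

    def __init__(self, hidden_size, intermediate_size):
        super().__init__()
        self.gate = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))


class PreNormFFN(nn.Module):
    """Pre-norm residual: normalize first, then add the sublayer output."""

    def __init__(self, hidden_size, intermediate_size):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)
        self.ffn = SwiGLU(hidden_size, intermediate_size)

    def forward(self, x):
        return x + self.ffn(self.norm(x))


torch.manual_seed(0)
block = PreNormFFN(hidden_size=64, intermediate_size=176)
x = torch.randn(2, 10, 64)  # (batch, tokens, hidden)
y = block(x)                # same shape as the input

ffn_params = sum(p.numel() for p in block.ffn.parameters())
print(y.shape, ffn_params)  # three weight matrices: 3 * 64 * 176 parameters
```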
Development
pip install -e ".[dev]"
pytest # run tests
pytest -m "not slow" # skip slow tests
pytest --cov=oplm # with coverage
ruff check src/ # lint
ruff format src/ # format
ty check src/ # type check (Astral's `ty`, not mypy)
Contributor and agent instructions live in AGENTS.md.
License
MIT — see LICENSE.