open protein language model

OPLM

Open Protein Language Model — an encoder-only protein language model with a HuggingFace-native, ESM-C-style API. Load a pretrained checkpoint, hand it a list of sequences, and get back per-residue logits and embeddings.

Status: Pre-alpha (v0.0.1). The model, tokenizer, inference API, and HuggingFace integration are stable; pretrained checkpoints and benchmark results are still landing.


Highlights

  • ESM-C-style inference API — model.logits(sequences, LogitsConfig(...)) returns a structured LogitsOutput with sequence_logits, embeddings, hidden_states, and attentions. If you've used ESM-C, you already know it.
  • HuggingFace-native — every model is a PreTrainedModel. Use OplmForMaskedLM.from_pretrained(...) or the transformers Auto* classes, load from the Hub, and save_pretrained / push_to_hub like any other model.
  • ESM-C-compatible tokenizer — a 33-token OplmTokenizerFast with the same vocabulary and special tokens as ESM-C.
  • Fast attention — built on PyTorch's scaled_dot_product_attention (a fused FlashAttention / memory-efficient kernel on CUDA, no separate flash-attn dependency), with a manual softmax path for returning attention weights.
  • Five sizes — from a 5M-parameter ablation model up to 13B parameters.
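
Fused attention kernels compute the attention output without materializing the full T × T weight matrix, which is why returning attention weights needs a separate softmax path. The sketch below shows that manual path for a single head on plain Python lists. It is an illustration of the technique, not OPLM's implementation, and `attention_with_weights` is a hypothetical name.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_with_weights(q, k, v):
    """Single-head scaled dot-product attention that also returns the weights.

    q and k are (T, d) nested lists and v is (T, d_v). Hypothetical helper for
    illustration only; a fused kernel would return just the output.
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    weights = []
    for q_vec in q:
        # One row of scaled query-key dot products, normalized with softmax.
        scores = [scale * sum(a * b for a, b in zip(q_vec, k_vec)) for k_vec in k]
        weights.append(softmax(scores))
    # Each output row is the weighted sum of the value vectors.
    output = [
        [sum(w * v_vec[d] for w, v_vec in zip(row, v)) for d in range(len(v[0]))]
        for row in weights
    ]
    return output, weights


q = k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
output, weights = attention_with_weights(q, k, v)
print(weights)  # (T, T) matrix; each row sums to 1
```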

Installation

Requirements: Python ≥ 3.11 and PyTorch ≥ 2.10.

pip install oplm

Or from source:

git clone https://github.com/briney/oplm.git
cd oplm
pip install -e .

Inference needs no extras. The optional groups are for contributors:

pip install "oplm[train]"   # distributed training (Accelerate, W&B, datasets)
pip install "oplm[dev]"     # tests, linting, type checking

Quick start

Per-residue logits and embeddings

The primary entry point is OplmForMaskedLM. from_pretrained loads the weights and attaches the matching tokenizer, so .logits() takes raw sequences directly:

import torch
from oplm import OplmForMaskedLM, LogitsConfig

model = OplmForMaskedLM.from_pretrained("brineylab/oplm-base").eval()

sequences = [
    "MSHHWGYGKHNGPEHWHKDFPIAKGERQSPVDIDTHTAKYDPSLKPLSVSYDQATSLRIL",
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVK",
]

with torch.no_grad():
    out = model.logits(
        sequences,
        LogitsConfig(sequence=True, return_embeddings=True),
    )

out.sequence_logits   # (B, T, 33) per-residue amino-acid logits
out.embeddings        # (B, T, hidden_size) per-residue embeddings

B is the batch size and T is the padded sequence length (each sequence is wrapped with <cls> … <eos>).
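
To make the (B, T) layout concrete, here is a minimal sketch of the wrapping and padding described above. It does not call OPLM: token strings stand in for token ids, `wrap_and_pad` is a hypothetical helper, and right-padding to the longest sequence in the batch is an assumption.

```python
def wrap_and_pad(sequences: list[str]) -> tuple[list[list[str]], list[list[int]]]:
    """Return (tokens, attention_mask), each with shape (B, T).

    Hypothetical helper: assumes right-padding to the longest sequence.
    """
    # T is the longest sequence plus the two special tokens.
    padded_len = max(len(seq) for seq in sequences) + 2
    tokens, mask = [], []
    for seq in sequences:
        row = ["<cls>", *seq, "<eos>"]
        n_real = len(row)
        n_pad = padded_len - n_real
        tokens.append(row + ["<pad>"] * n_pad)
        mask.append([1] * n_real + [0] * n_pad)  # 1 = real token, 0 = padding
    return tokens, mask


tokens, mask = wrap_and_pad(["MKT", "MKTAYIAK"])
print(len(tokens), len(tokens[0]))  # 2 10
print(tokens[0])
print(mask[0])                      # [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
```

For the two 60-residue sequences in the example above, the same arithmetic gives T = 62.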

Per-protein embeddings

For a single fixed-size vector per sequence, run the backbone (OplmModel) and mask-aware mean-pool over the residue dimension:

import torch
from oplm import OplmModel
from oplm.model import mean_pool

model = OplmModel.from_pretrained("brineylab/oplm-base").eval()

# `sequences` is the list[str] defined in the previous example
batch = model.tokenize(sequences)            # BatchEncoding on the model's device
with torch.no_grad():
    out = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
    )

per_residue = out.last_hidden_state          # (B, T, hidden_size)
per_protein = mean_pool(per_residue, batch["attention_mask"])  # (B, hidden_size)

oplm.model also exports cls_pool if you prefer the <cls> representation.
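
Mask-aware pooling matters because padded positions would otherwise drag the average toward whatever the model emits for `<pad>`. The sketch below shows the idea on plain Python lists. `masked_mean_pool` and `first_token_pool` are illustrative names rather than OPLM's functions, and it assumes the mask marks every non-padding position, special tokens included.

```python
def masked_mean_pool(hidden, mask):
    """Average a (B, T, H) nested list over T, using only positions where mask == 1."""
    pooled = []
    for rows, row_mask in zip(hidden, mask):
        kept = [vec for vec, m in zip(rows, row_mask) if m]
        count = max(len(kept), 1)  # guard against an all-padding row
        width = len(rows[0])
        pooled.append([sum(vec[h] for vec in kept) / count for h in range(width)])
    return pooled


def first_token_pool(hidden):
    """Take the vector at position 0, where <cls> is assumed to sit."""
    return [rows[0] for rows in hidden]


hidden = [
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last position is padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
]
mask = [[1, 1, 0], [1, 1, 1]]

print(masked_mean_pool(hidden, mask))  # [[2.0, 3.0], [2.0, 2.0]]
print(first_token_pool(hidden))        # [[1.0, 2.0], [1.0, 1.0]]
```

Note that the padded vector `[100.0, 100.0]` has no effect on the first result.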

Using the transformers Auto* API

Importing oplm registers the config, models, and tokenizer with transformers, so the standard Auto* classes work too:

import torch
import oplm  # registers OPLM with transformers' Auto* classes
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("brineylab/oplm-base")
model = AutoModelForMaskedLM.from_pretrained("brineylab/oplm-base").eval()

# `sequences` is the list[str] defined in the first example
batch = tokenizer(sequences, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**batch).logits   # (B, T, 33)

AutoModel returns the bare encoder, and AutoModelForSequenceClassification / AutoModelForTokenClassification return the corresponding fine-tuning heads.

Each model repo also bundles its modeling code, so consumers who don't have oplm installed can load it with trust_remote_code=True:

model = AutoModelForMaskedLM.from_pretrained("brineylab/oplm-base", trust_remote_code=True)

Coming from ESM-C?

OPLM mirrors the ESM-C inference surface so existing pipelines port with minimal changes:

  • LogitsConfig(sequence=..., return_embeddings=...) is the same knob object.
  • model.logits(...) returns a LogitsOutput; read output.embeddings exactly as you would with ESM-C.
  • The tokenizer vocabulary (33 tokens, <cls>/<pad>/<eos>/<mask> specials) matches ESM-C.

The main difference: OPLM's .logits() and .tokenize() accept a list[str] of sequences directly, rather than pre-encoded protein tensors. Per-residue logits live in output.sequence_logits.


Pretrained models

Checkpoints are published on the HuggingFace Hub under the brineylab org and are selectable on the command line via --preset, using the size suffix of the Hub id (for example, --preset base).

| Preset / Hub id | Parameters | Layers | Hidden | Heads |
|---|---|---|---|---|
| brineylab/oplm-small | 5.2M | 6 | 256 | 4 |
| brineylab/oplm-medium | 85.6M | 12 | 768 | 12 |
| brineylab/oplm-base | 309.5M | 24 | 1024 | 16 |
| brineylab/oplm-large | 2.5B | 32 | 2560 | 32 |
| brineylab/oplm-xlarge | 12.7B | 40 | 5120 | 40 |

All sizes share the 33-token tokenizer and a 1024-position context window.
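
If the two special tokens count toward the 1024 positions (an assumption; the text above does not say how over-length input is handled), the longest sequence that fits is 1022 residues. A small pre-flight check along those lines, with hypothetical helper names:

```python
MAX_POSITIONS = 1024    # context window stated above
NUM_SPECIAL_TOKENS = 2  # <cls> and <eos>, assumed to count toward the window


def max_residues(max_positions: int = MAX_POSITIONS) -> int:
    """Longest raw sequence that fits once <cls> and <eos> are added."""
    return max_positions - NUM_SPECIAL_TOKENS


def split_by_length(sequences: list[str]) -> tuple[list[str], list[str]]:
    """Partition sequences into (fits, too_long) before sending them to a model."""
    limit = max_residues()
    fits = [seq for seq in sequences if len(seq) <= limit]
    too_long = [seq for seq in sequences if len(seq) > limit]
    return fits, too_long


fits, too_long = split_by_length(["A" * 1022, "A" * 1023])
print(max_residues(), len(fits), len(too_long))  # 1022 1 1
```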


Command line

oplm ships a small CLI; oplm --help lists every command.

Encode sequences to a file

oplm encode MKWVTFISLLLLFSSAYS MLPGLALLLLAAWTARA \
  --model brineylab/oplm-base \
  --output embeddings.pt

--model accepts a Hub id, a local HuggingFace export directory, or a training checkpoint directory. The saved tensor holds the per-residue embeddings with shape (num_sequences, T, hidden_size).
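
Because the saved block is padded to T, downstream code usually needs to drop the special-token and padding rows for each sequence. The sketch below assumes `<cls>` sits at position 0 and the residues follow immediately. `residue_embeddings` is a hypothetical helper, shown on nested lists, though the same slice works on a tensor.

```python
def residue_embeddings(padded_rows, sequence: str):
    """Drop <cls>, <eos>, and padding from one (T, hidden_size) block."""
    start = 1  # position 0 is assumed to hold <cls>
    return padded_rows[start:start + len(sequence)]


# Toy block with hidden_size 1: <cls>, M, K, T, <eos>, <pad>, <pad>
rows = [[0.0], [0.1], [0.2], [0.3], [9.9], [0.0], [0.0]]
print(residue_embeddings(rows, "MKT"))  # [[0.1], [0.2], [0.3]]
```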

Inspect a model

oplm info --preset base
──────────────── OPLM Model Info ────────────────
                Architecture
 Parameters        309.5M (309,507,105)
 Hidden size       1024
 Layers            24
 Attention heads   16
 Head dim          64
 Intermediate size 2816
 FFN activation    swiglu
 ...
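
As a rough sanity check on the numbers above, the large weight matrices alone account for almost all of the reported parameter count. The estimate below assumes four hidden × hidden attention projections per layer (consistent with 16 heads × head dim 64 = 1024), three SwiGLU matrices, untied embeddings over the 33-token vocabulary, and no biases. The remainder presumably comes from norms, the MLM head, and similar small tensors that this sketch does not model.

```python
def estimate_matmul_params(hidden: int, layers: int, intermediate: int, vocab: int = 33) -> int:
    """Count only the large weight matrices (an approximation, not OPLM's own tally)."""
    attention = 4 * hidden * hidden           # Q, K, V, and output projections
    feed_forward = 3 * hidden * intermediate  # SwiGLU gate, up, and down projections
    embeddings = 2 * vocab * hidden           # untied input and output embeddings
    return layers * (attention + feed_forward) + embeddings


reported = 309_507_105  # from the `oplm info --preset base` output above
estimate = estimate_matmul_params(hidden=1024, layers=24, intermediate=2816)
print(f"{estimate:,}")  # 308,348,928
print(f"gap: {reported - estimate:,} parameters")
```

The estimate lands within 1% of the reported total.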

Training

OPLM trains with HuggingFace Accelerate (FSDP, mixed precision, gradient checkpointing, optional Muon optimizer) over parquet sequence datasets, with a built-in eval harness for MLM metrics and structure-based contact prediction.

oplm train --preset base --config configs/my_run.yaml

See docs/TRAIN.md for the full training guide.


Configuration

Models and runs are configured through a layered system (defaults → preset → YAML → CLI overrides) built on a HuggingFace PretrainedConfig. Architecture toggles, optimizer/scheduler settings, and the dataset schema are all set here.
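
The layering order can be pictured as a sequence of dictionary updates in which later layers win. This is only a sketch of the precedence rule: `resolve_config` and the field names are hypothetical, the values are illustrative, and a real nested config would need a deep merge rather than the flat one shown here.

```python
def resolve_config(defaults: dict, preset: dict, yaml_values: dict, cli_overrides: dict) -> dict:
    """Merge layers in order of increasing priority; later layers override earlier ones."""
    resolved: dict = {}
    for layer in (defaults, preset, yaml_values, cli_overrides):
        resolved.update(layer)
    return resolved


# Hypothetical field names and values, for illustration only.
defaults = {"hidden_size": 256, "num_layers": 6, "learning_rate": 1e-4}
preset = {"hidden_size": 1024, "num_layers": 24}
yaml_values = {"learning_rate": 3e-4}
cli_overrides = {"num_layers": 12}

config = resolve_config(defaults, preset, yaml_values, cli_overrides)
print(config)  # {'hidden_size': 1024, 'num_layers': 12, 'learning_rate': 0.0003}
```

Here the preset overrides the default sizes, the YAML file sets the learning rate, and the CLI override has the final say on `num_layers`.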

See docs/CONFIG.md for the field-by-field reference.


Architecture

OPLM is a pre-norm, bidirectional encoder transformer with RoPE and QK-norm: LayerNorm by default (RMSNorm available), SwiGLU feed-forward, untied input/output embeddings, standard multi-head attention, depth-stable residual scaling, and a BERT-style MLM head. A curated set of independently togglable research features (Canon depthwise convolutions, partial-RoPE/NoPE, sandwich / hybrid / post-SDPA norm) layers on top, each off by default.

For the complete specification, see docs/MODEL_ARCHITECTURE.md.


Development

pip install -e ".[dev]"

pytest                  # run tests
pytest -m "not slow"    # skip slow tests
pytest --cov=oplm       # with coverage

ruff check src/         # lint
ruff format src/        # format
ty check src/           # type check (Astral's `ty`, not mypy)

Contributor and agent instructions live in AGENTS.md.


License

MIT — see LICENSE.
