OPLM

Open Protein Language Model — an encoder-only protein language model with a HuggingFace-native, ESM-C-style API. Load a pretrained checkpoint, hand it a list of sequences, and get back per-residue logits and embeddings.

Status: Pre-alpha (v0.0.1). The model, tokenizer, inference API, and HuggingFace integration are stable; pretrained checkpoints and benchmark results are still landing.


Highlights

  • ESM-C-style inference API: model.logits(sequences, LogitsConfig(...)) returns a structured LogitsOutput with sequence_logits, embeddings, hidden_states, and attentions. If you've used ESM-C, you already know it.
  • HuggingFace-native: every model is a PreTrainedModel. Use OplmForMaskedLM.from_pretrained(...) or the transformers Auto* classes, load from the Hub, and save_pretrained / push_to_hub like any other model.
  • ESM-C-compatible tokenizer: a 33-token OplmTokenizerFast with the same vocabulary and special tokens as ESM-C.
  • Fast attention: built on PyTorch's scaled_dot_product_attention (a fused FlashAttention / memory-efficient kernel on CUDA, no separate flash-attn dependency), with a manual softmax path for returning attention weights.
  • Eight sizes: from a 50M-parameter ablation model up to 12.5B parameters.

Installation

Requirements: Python ≥ 3.11 and PyTorch ≥ 2.10.

pip install oplm

Or from source:

git clone https://github.com/briney/oplm.git
cd oplm
pip install -e .

Inference needs no extras. The optional groups are for contributors:

pip install "oplm[train]"   # distributed training (Accelerate, W&B, datasets)
pip install "oplm[dev]"     # tests, linting, type checking

Quick start

Per-residue logits and embeddings

The primary entry point is OplmForMaskedLM. Its from_pretrained method loads the weights and attaches the matching tokenizer, so .logits() takes raw sequences directly:

import torch
from oplm import OplmForMaskedLM, LogitsConfig

model = OplmForMaskedLM.from_pretrained("brineylab/oplm-170M").eval()

sequences = [
    "MSHHWGYGKHNGPEHWHKDFPIAKGERQSPVDIDTHTAKYDPSLKPLSVSYDQATSLRIL",
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVK",
]

with torch.no_grad():
    out = model.logits(
        sequences,
        LogitsConfig(sequence=True, return_embeddings=True),
    )

out.sequence_logits   # (B, T, 33) per-residue amino-acid logits
out.embeddings        # (B, T, hidden_size) per-residue embeddings

B is the batch size and T is the padded sequence length (each sequence is wrapped with <cls> … <eos>).
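
Because every sequence is wrapped in <cls> … <eos> and the batch is padded to a common length, position t along the T axis is not residue t. The sketch below shows the index bookkeeping on plain lists. It assumes right-padding (padding placed after <eos>), which this README does not state, and residue_slice is a hypothetical helper, not part of oplm.

```python
def residue_slice(sequence: str) -> slice:
    """Positions of the real residues in one padded row.

    Assumes the layout <cls> r1 ... rN <eos> <pad>..., i.e. right-padding.
    """
    return slice(1, len(sequence) + 1)


sequences = ["MKTAYIAK", "MSHHW"]

# Padded length T: the longest sequence plus the <cls> and <eos> tokens.
T = max(len(s) for s in sequences) + 2

# Stand-in for the T axis of a model output: one label per position.
rows = []
for s in sequences:
    row = ["<cls>", *s, "<eos>"]
    row += ["<pad>"] * (T - len(row))
    rows.append(row)

for s, row in zip(sequences, rows):
    residues = row[residue_slice(s)]
    print(len(row), "".join(residues))
```

Under the same assumption, the slice applies along the T axis of a real output, for example out.embeddings[i, residue_slice(sequences[i])].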

Per-protein embeddings

For a single fixed-size vector per sequence, run the backbone (OplmModel) and mask-aware mean-pool over the residue dimension:

import torch
from oplm import OplmModel
from oplm.model import mean_pool

model = OplmModel.from_pretrained("brineylab/oplm-170M").eval()

batch = model.tokenize(sequences)            # BatchEncoding on the model's device
with torch.no_grad():
    out = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
    )

per_residue = out.last_hidden_state          # (B, T, hidden_size)
per_protein = mean_pool(per_residue, batch["attention_mask"])  # (B, hidden_size)

oplm.model also exports cls_pool if you prefer the <cls> representation.
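
To make "mask-aware" concrete: padded positions are left out of both the sum and the count, so short sequences are not diluted by padding. The following is a plain-Python sketch of that computation for a single sequence. It is not oplm's implementation (which works on tensors), and masked_mean is a hypothetical name.

```python
def masked_mean(hidden: list[list[float]], mask: list[int]) -> list[float]:
    """Average the rows of `hidden` whose mask entry is 1.

    hidden: T rows of hidden_size floats (one sequence).
    mask:   T entries, 1 for real tokens and 0 for padding.
    """
    kept = [row for row, m in zip(hidden, mask) if m]
    if not kept:
        raise ValueError("mask selects no positions")
    # zip(*kept) iterates over hidden dimensions instead of positions.
    return [sum(col) / len(kept) for col in zip(*kept)]


# One sequence, T = 4, hidden_size = 2; the last position is padding.
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [100.0, 100.0]]
mask = [1, 1, 1, 0]

print(masked_mean(hidden, mask))  # [3.0, 4.0]
```

Whether special tokens such as <cls> and <eos> contribute to the average depends on the mask that is passed in.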

Using the transformers Auto* API

Importing oplm registers the config, models, and tokenizer with transformers, so the standard Auto* classes work too:

import torch
import oplm  # registers OPLM with transformers' Auto* classes
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("brineylab/oplm-170M")
model = AutoModelForMaskedLM.from_pretrained("brineylab/oplm-170M").eval()

batch = tokenizer(sequences, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**batch).logits   # (B, T, 33)

AutoModel returns the bare encoder, and AutoModelForSequenceClassification / AutoModelForTokenClassification return the corresponding fine-tuning heads.

Each model repo also bundles its modeling code, so consumers who don't have oplm installed can load it with trust_remote_code=True:

model = AutoModelForMaskedLM.from_pretrained("brineylab/oplm-170M", trust_remote_code=True)

Coming from ESM-C?

OPLM mirrors the ESM-C inference surface so existing pipelines port with minimal changes:

  • LogitsConfig(sequence=..., return_embeddings=...) is the same knob object.
  • model.logits(...) returns a LogitsOutput; read output.embeddings exactly as you would with ESM-C.
  • The tokenizer vocabulary (33 tokens, <cls>/<pad>/<eos>/<mask> specials) matches ESM-C.

The main difference: OPLM's .logits() and .tokenize() accept a list[str] of sequences directly, rather than pre-encoded protein tensors. Per-residue logits live in output.sequence_logits.


Model sizes

OPLM ships eight size presets, selected on the command line with --preset <name>. A preset is an architecture recipe (it sets the core dimensions only); it does not download weights. To load pretrained weights, pass a Hub id or a local checkpoint directory to --model (see below).

Preset   Parameters   Layers   Hidden   Heads
50M      ~50M         16       512      8
170M     ~170M        24       768      12
400M     ~400M        32       1024     16
800M     ~800M        40       1280     16
1B       ~1.6B        50       1600     25
3B       ~3.3B        64       2048     32
6B       ~6B          80       2560     40
12B      ~12.5B       100      3200     50

All sizes share the 33-token tokenizer and a 1024-position context window.


Command line

oplm ships a small CLI; oplm --help lists every command.

Encode sequences to a file

oplm encode MKWVTFISLLLLFSSAYS MLPGLALLLLAAWTARA \
  --model brineylab/oplm-170M \
  --output embeddings.pt

--model accepts a Hub id, a local HuggingFace export directory, or a training checkpoint directory. The saved tensor holds the per-residue embeddings, with shape (num_sequences, T, hidden_size).

Inspect a model

oplm info --preset 400M
──────────────── OPLM Model Info ────────────────
                Architecture
 Parameters        412.3M (412,302,369)
 Hidden size       1024
 Layers            32
 Attention heads   16
 Head dim          64
 Intermediate size 2816
 FFN activation    swiglu
 ...
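
The reported total can be roughly cross-checked from the fields shown. The sketch below counts only the per-layer attention and SwiGLU weight matrices (4·H² and 3·H·I) and ignores embeddings, norms, biases, and the MLM head. This is a common back-of-the-envelope formula, assumed here rather than taken from OPLM's code, and approx_block_params is a hypothetical helper.

```python
def approx_block_params(hidden: int, intermediate: int, layers: int) -> int:
    """Rough transformer-block parameter count (weight matrices only)."""
    attention = 4 * hidden * hidden        # Q, K, V and output projections
    swiglu = 3 * hidden * intermediate     # gate, up and down projections
    return layers * (attention + swiglu)


# Values from the `oplm info --preset 400M` output shown in this section.
estimate = approx_block_params(hidden=1024, intermediate=2816, layers=32)
reported = 412_302_369

print(f"{estimate:,}")  # 411,041,792
print(f"{estimate / reported:.1%} of the reported total")
```

The small remainder is what the estimate deliberately leaves out.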

Training

OPLM trains with HuggingFace Accelerate (FSDP, mixed precision, gradient checkpointing, optional Muon optimizer) over parquet sequence datasets, with a built-in eval harness for MLM metrics and structure-based contact prediction.

oplm train --preset 400M --config configs/my_run.yaml

See docs/TRAIN.md for the full training guide.


Configuration

Models and runs are configured through a layered system (defaults → preset → YAML → CLI overrides) built on a HuggingFace PretrainedConfig. Architecture toggles, optimizer/scheduler settings, and the dataset schema are all set here.

See docs/CONFIG.md for the field-by-field reference.
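
The precedence order can be pictured as successive dictionary merges in which later layers win. This is an illustrative sketch only: the field names (num_layers, hidden_size, num_heads, learning_rate), the default values, and the resolve helper are all hypothetical, and the real schema is the one documented in docs/CONFIG.md.

```python
from collections import ChainMap

# Hypothetical values for each layer, lowest precedence first.
defaults = {"num_layers": 12, "hidden_size": 768, "num_heads": 12, "learning_rate": 1e-4}
preset_400m = {"num_layers": 32, "hidden_size": 1024, "num_heads": 16}  # from the sizes table
yaml_file = {"learning_rate": 3e-4}
cli_overrides = {"num_layers": 30}


def resolve(*layers: dict) -> dict:
    """Merge config layers given in lowest-to-highest precedence order."""
    # ChainMap looks keys up left to right, so the highest precedence goes first.
    return dict(ChainMap(*reversed(layers)))


config = resolve(defaults, preset_400m, yaml_file, cli_overrides)
print(config["num_layers"], config["hidden_size"], config["learning_rate"])
```

Here the CLI override sets num_layers, the preset supplies hidden_size and num_heads, and the YAML file supplies learning_rate.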


Architecture

OPLM is a pre-norm, bidirectional encoder transformer with RoPE and QK-norm: LayerNorm by default (RMSNorm available), SwiGLU feed-forward, untied input/output embeddings, standard multi-head attention, depth-stable residual scaling, and a BERT-style MLM head. A curated set of independently togglable research features (Canon depthwise convolutions, partial-RoPE/NoPE, sandwich / hybrid / post-SDPA norm) layers on top, each off by default.

For the complete specification, see docs/MODEL_ARCHITECTURE.md.


Development

pip install -e ".[dev]"

pytest                  # run tests
pytest -m "not slow"    # skip slow tests
pytest --cov=oplm       # with coverage

ruff check src/         # lint
ruff format src/        # format
ty check src/           # type check (Astral's `ty`, not mypy)

Contributor and agent instructions live in AGENTS.md.


License

MIT — see LICENSE.
