open protein language model

OPLM

Open Protein Language Model — an encoder-only protein language model with a HuggingFace-native, ESM-C-style API. Load a pretrained checkpoint, hand it a list of sequences, and get back per-residue logits and embeddings.

Status: Pre-alpha (v0.0.1). The model, tokenizer, inference API, and HuggingFace integration are stable; pretrained checkpoints and benchmark results are still landing.

Highlights

ESM-C-style inference API — model.logits(sequences, LogitsConfig(...)) returns a structured LogitsOutput with sequence_logits, embeddings, hidden_states, and attentions. If you've used ESM-C, you already know it.
HuggingFace-native — every model is a PreTrainedModel. Use OplmForMaskedLM.from_pretrained(...) or the transformers Auto* classes, load from the Hub, and save_pretrained / push_to_hub like any other model.
ESM-C-compatible tokenizer — a 33-token OplmTokenizerFast with the same vocabulary and special tokens as ESM-C.
Fast attention — built on PyTorch's scaled_dot_product_attention (a fused FlashAttention / memory-efficient kernel on CUDA, no separate flash-attn dependency), with a manual softmax path for returning attention weights.
Five sizes — from a 5M-parameter ablation model up to 13B parameters.

Installation

Requirements: Python ≥ 3.11 and PyTorch ≥ 2.10.

pip install oplm

Or from source:

git clone https://github.com/briney/oplm.git
cd oplm
pip install -e .

Inference needs no extras. The optional groups are for contributors:

pip install "oplm[train]"   # distributed training (Accelerate, W&B, datasets)
pip install "oplm[dev]"     # tests, linting, type checking

Quick start

Per-residue logits and embeddings

The primary entry point is OplmForMaskedLM. from_pretrained loads the weights and attaches the matching tokenizer, so .logits() takes raw sequences directly:

import torch
from oplm import OplmForMaskedLM, LogitsConfig

model = OplmForMaskedLM.from_pretrained("brineylab/oplm-base").eval()

sequences = [
    "MSHHWGYGKHNGPEHWHKDFPIAKGERQSPVDIDTHTAKYDPSLKPLSVSYDQATSLRIL",
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVK",
]

with torch.no_grad():
    out = model.logits(
        sequences,
        LogitsConfig(sequence=True, return_embeddings=True),
    )

out.sequence_logits   # (B, T, 33) per-residue amino-acid logits
out.embeddings        # (B, T, hidden_size) per-residue embeddings

B is the batch size and T is the padded sequence length (each sequence is wrapped with <cls> … <eos>).
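To make the shape concrete, here is a minimal sketch of where T comes from, written in plain Python rather than with the real tokenizer. The helper name `wrap_and_pad` is hypothetical, and right-side padding with `<pad>` is an assumption; only the `<cls>` … `<eos>` wrapping comes from the description above.

```python
def wrap_and_pad(sequences, cls="<cls>", eos="<eos>", pad="<pad>"):
    """Wrap each sequence in <cls>/<eos>, then right-pad to a common length T.

    Hypothetical helper for illustration; OplmTokenizerFast does the real work.
    """
    wrapped = [[cls, *seq, eos] for seq in sequences]
    T = max(len(row) for row in wrapped)
    tokens = [row + [pad] * (T - len(row)) for row in wrapped]
    # 1 marks a real token (including <cls>/<eos>), 0 marks padding.
    mask = [[1] * len(row) + [0] * (T - len(row)) for row in wrapped]
    return tokens, mask


tokens, mask = wrap_and_pad(["MKT", "MKTAY"])
print(len(tokens), len(tokens[0]))  # 2 7: B=2, T=5 residues + 2 special tokens
print(mask[0])                      # [1, 1, 1, 1, 1, 0, 0]
```

The shorter sequence is padded up to the longest one in the batch, which is why T depends on the batch and not on any single sequence.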

Per-protein embeddings

For a single fixed-size vector per sequence, run the backbone (OplmModel) and mask-aware mean-pool over the residue dimension:

import torch
from oplm import OplmModel
from oplm.model import mean_pool

model = OplmModel.from_pretrained("brineylab/oplm-base").eval()

# `sequences` is the list[str] of protein sequences from the previous example.
batch = model.tokenize(sequences)            # BatchEncoding on the model's device
with torch.no_grad():
    out = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
    )

per_residue = out.last_hidden_state          # (B, T, hidden_size)
per_protein = mean_pool(per_residue, batch["attention_mask"])  # (B, hidden_size)

oplm.model also exports cls_pool if you prefer the <cls> representation.
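The idea behind mask-aware pooling can be sketched in plain Python for a single sequence. This illustrates the arithmetic only; it is not the implementation of `oplm.model.mean_pool` or `cls_pool`. The names `masked_mean` and `first_token` are hypothetical, and whether the library's pooling counts the `<cls>`/`<eos>` positions is not stated here, so the sketch simply averages every position the mask marks as real.

```python
def masked_mean(hidden, mask):
    """Average the rows of `hidden` (T x H) whose mask entry is 1."""
    kept = [row for row, keep in zip(hidden, mask) if keep]
    if not kept:
        raise ValueError("attention mask selects no positions")
    # zip(*kept) walks the hidden dimensions; each column has one value per kept row.
    return [sum(column) / len(kept) for column in zip(*kept)]


def first_token(hidden):
    """Return the first row, i.e. the <cls> position when sequences start with <cls>."""
    return list(hidden[0])


hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]  # last row stands in for padding
mask = [1, 1, 0]
print(masked_mean(hidden, mask))  # [2.0, 3.0]
print(first_token(hidden))        # [1.0, 2.0]
```

An unmasked mean over all three rows would be pulled toward the padding row (roughly [34.7, 35.3]), which is why the mask matters for batches of unequal length.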

Using the `transformers` Auto* API

Importing oplm registers the config, models, and tokenizer with transformers, so the standard Auto* classes work too:

import torch

import oplm  # imported for its side effect: registers OPLM with transformers' Auto* classes
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("brineylab/oplm-base")
model = AutoModelForMaskedLM.from_pretrained("brineylab/oplm-base").eval()

# `sequences` is the list[str] of protein sequences from the quick-start example above.
batch = tokenizer(sequences, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**batch).logits   # (B, T, 33)
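The logits are unnormalized scores, one per vocabulary entry at each position. A numerically stable softmax turns one position's row into probabilities. The sketch below uses plain Python and a made-up four-entry row instead of the real 33-token vocabulary.

```python
import math


def softmax(row):
    """Convert one position's logits into probabilities that sum to 1."""
    peak = max(row)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [value / total for value in exps]


logits_row = [2.0, 0.5, -1.0, 0.5]  # hypothetical scores for four tokens
probs = softmax(logits_row)
best = max(range(len(probs)), key=probs.__getitem__)
print(best)  # 0, the index of the largest logit
```

On the real tensor, the equivalent is `logits.softmax(dim=-1)`.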

AutoModel returns the bare encoder, and AutoModelForSequenceClassification / AutoModelForTokenClassification return the corresponding fine-tuning heads.

Each model repo also bundles its modeling code, so consumers who don't have oplm installed can load it with trust_remote_code=True:

model = AutoModelForMaskedLM.from_pretrained("brineylab/oplm-base", trust_remote_code=True)

Coming from ESM-C?

OPLM mirrors the ESM-C inference surface so existing pipelines port with minimal changes:

LogitsConfig(sequence=..., return_embeddings=...) is the same knob object.
model.logits(...) returns a LogitsOutput; read output.embeddings exactly as you would with ESM-C.
The tokenizer vocabulary (33 tokens, <cls>/<pad>/<eos>/<mask> specials) matches ESM-C.

The main difference: OPLM's .logits() and .tokenize() accept a list[str] of sequences directly, rather than pre-encoded protein tensors. Per-residue logits live in output.sequence_logits.

Pretrained models

Checkpoints are published on the HuggingFace Hub under the brineylab org. The same sizes are selectable on the command line via --preset, using the short name (for example, --preset base).

Preset / Hub id	Parameters	Layers	Hidden	Heads
`brineylab/oplm-small`	5.2M	6	256	4
`brineylab/oplm-medium`	85.6M	12	768	12
`brineylab/oplm-base`	309.5M	24	1024	16
`brineylab/oplm-large`	2.5B	32	2560	32
`brineylab/oplm-xlarge`	12.7B	40	5120	40

All sizes share the 33-token tokenizer and a 1024-position context window.
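The per-head dimension is not listed in the table, but it follows from the columns if it equals hidden size divided by head count. That assumption matches the Head dim of 64 that `oplm info --preset base` reports. A quick check over the table under that assumption:

```python
# (hidden size, attention heads) copied from the table above.
PRESETS = {
    "small": (256, 4),
    "medium": (768, 12),
    "base": (1024, 16),
    "large": (2560, 32),
    "xlarge": (5120, 40),
}


def head_dim(hidden, heads):
    """Per-head width, assuming the hidden size splits evenly across heads."""
    if hidden % heads:
        raise ValueError(f"hidden size {hidden} is not divisible by {heads} heads")
    return hidden // heads


head_dims = {name: head_dim(*shape) for name, shape in PRESETS.items()}
print(head_dims)
# {'small': 64, 'medium': 64, 'base': 64, 'large': 80, 'xlarge': 128}
```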

Command line

oplm ships a small CLI; oplm --help lists every command.

Encode sequences to a file

oplm encode MKWVTFISLLLLFSSAYS MLPGLALLLLAAWTARA \
  --model brineylab/oplm-base \
  --output embeddings.pt

--model accepts a Hub id, a local HuggingFace export directory, or a training checkpoint directory. The saved tensor holds the per-residue embeddings, with shape (num_sequences, T, hidden_size).

Inspect a model

oplm info --preset base

──────────────── OPLM Model Info ────────────────
                Architecture
 Parameters        309.5M (309,507,105)
 Hidden size       1024
 Layers            24
 Attention heads   16
 Head dim          64
 Intermediate size 2816
 FFN activation    swiglu
 ...

Training

OPLM trains with HuggingFace Accelerate (FSDP, mixed precision, gradient checkpointing, optional Muon optimizer) over parquet sequence datasets, with a built-in eval harness for MLM metrics and structure-based contact prediction.

oplm train --preset base --config configs/my_run.yaml

See docs/TRAIN.md for the full training guide.

Configuration

Models and runs are configured through a layered system (defaults → preset → YAML → CLI overrides) built on a HuggingFace PretrainedConfig. Architecture toggles, optimizer/scheduler settings, and the dataset schema are all set here.
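The precedence order can be illustrated with plain dictionaries, where each later layer overrides the earlier ones. This is a sketch of the layering idea only. The function `resolve_config` and the field names (`num_layers`, `hidden_size`, `learning_rate`, `norm_type`) are hypothetical, not OPLM's actual schema.

```python
def resolve_config(defaults, preset, yaml_values, cli_overrides):
    """Merge config layers; later layers win (defaults -> preset -> YAML -> CLI)."""
    resolved = {}
    for layer in (defaults, preset, yaml_values, cli_overrides):
        resolved.update(layer)
    return resolved


config = resolve_config(
    defaults={"num_layers": 12, "hidden_size": 768, "learning_rate": 1e-4, "norm_type": "layernorm"},
    preset={"num_layers": 24, "hidden_size": 1024},  # what a "base"-like preset might set
    yaml_values={"learning_rate": 3e-4},              # as if loaded from a run's YAML file
    cli_overrides={"norm_type": "rmsnorm"},           # as if parsed from command-line flags
)
print(config)
```

Each key takes its value from the last layer that sets it, so a CLI override always beats the YAML file, which beats the preset, which beats the defaults.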

See docs/CONFIG.md for the field-by-field reference.

Architecture

OPLM is a pre-norm, bidirectional encoder transformer with RoPE and QK-norm: LayerNorm by default (RMSNorm available), SwiGLU feed-forward, untied input/output embeddings, standard multi-head attention, depth-stable residual scaling, and a BERT-style MLM head. A curated set of independently togglable research features (Canon depthwise convolutions, partial-RoPE/NoPE, sandwich / hybrid / post-SDPA norm) layers on top, each off by default.

For the complete specification, see docs/MODEL_ARCHITECTURE.md.
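As one illustration of the components named above, rotary position embeddings (RoPE) rotate pairs of query/key features by an angle proportional to the token position, so the query-key dot product depends only on the relative offset between two positions. The sketch below is plain Python with the common base of 10000 and adjacent-pair rotation. Both choices are assumptions made for illustration, not a statement of OPLM's exact RoPE layout.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate adjacent feature pairs of `vec` by position-dependent angles."""
    dim = len(vec)
    if dim % 2:
        raise ValueError("RoPE needs an even number of features")
    rotated = []
    for i in range(0, dim, 2):
        angle = position * base ** (-i / dim)  # earlier pairs rotate faster
        cos, sin = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        rotated += [x * cos - y * sin, x * sin + y * cos]
    return rotated


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]
near = dot(rope(q, 3), rope(k, 1))    # offset of 2
far = dot(rope(q, 10), rope(k, 8))    # same offset, different absolute positions
print(math.isclose(near, far, abs_tol=1e-9))  # True
```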

Development

pip install -e ".[dev]"

pytest                  # run tests
pytest -m "not slow"    # skip slow tests
pytest --cov=oplm       # with coverage

ruff check src/         # lint
ruff format src/        # format
ty check src/           # type check (Astral's `ty`, not mypy)

Contributor and agent instructions live in AGENTS.md.

License

MIT — see LICENSE.

Project details

Maintainers

bryanbriney

Release history

0.1.6

Jun 25, 2026

0.1.5

Jun 16, 2026

0.1.4

Jun 13, 2026

0.1.3

Jun 11, 2026

0.1.2

Jun 10, 2026

This version

0.1.1

Jun 9, 2026

0.1.0

Jun 8, 2026

0.0.1

Mar 26, 2026

Download files

Source Distribution

oplm-0.1.1.tar.gz (16.7 MB)

Uploaded Jun 9, 2026 Source

Built Distribution

oplm-0.1.1-py3-none-any.whl (120.8 kB)

Uploaded Jun 9, 2026 Python 3

File details

Details for the file oplm-0.1.1.tar.gz.

File metadata

Download URL: oplm-0.1.1.tar.gz
Upload date: Jun 9, 2026
Size: 16.7 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for oplm-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`1ae8a8e00e9021fe44d0759900ebcadcedb98003fa146053a3bed13a9f2e2de6`
MD5	`4313a1c3d09bf48e6155e763832a8cd2`
BLAKE2b-256	`d64f6d042554be9f640507be854926b70a427baad6759149bc26056e90655b7a`
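To check a downloaded file against the published digest, one option is Python's standard `hashlib`. The sketch below assumes the source distribution was saved as `oplm-0.1.1.tar.gz` in the current directory; the function name `sha256_of` is illustrative.

```python
import hashlib
from pathlib import Path

# SHA256 digest published above for oplm-0.1.1.tar.gz.
EXPECTED_SHA256 = "1ae8a8e00e9021fe44d0759900ebcadcedb98003fa146053a3bed13a9f2e2de6"


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


archive = Path("oplm-0.1.1.tar.gz")  # assumed download location
if archive.exists():
    print("match" if sha256_of(archive) == EXPECTED_SHA256 else "MISMATCH")
else:
    print(f"{archive} not found; download it first")
```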

Provenance

The following attestation bundles were made for oplm-0.1.1.tar.gz:

Publisher: python-publish.yaml on briney/oplm

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: oplm-0.1.1.tar.gz
- Subject digest: 1ae8a8e00e9021fe44d0759900ebcadcedb98003fa146053a3bed13a9f2e2de6
- Sigstore transparency entry: 1763489015
- Sigstore integration time: Jun 9, 2026
Source repository:
- Permalink: briney/oplm@d946356e2087f115a0888b79b14963da496e562d
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/briney
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yaml@d946356e2087f115a0888b79b14963da496e562d
- Trigger Event: release

File details

Details for the file oplm-0.1.1-py3-none-any.whl.

File metadata

Download URL: oplm-0.1.1-py3-none-any.whl
Upload date: Jun 9, 2026
Size: 120.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for oplm-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`dc14e2ab11880344c3b7e6f7761acd9af288e6ac37ef9b250e9d7ac7ce798244`
MD5	`58326ba6b2b00b15fa9f5f5c70c88381`
BLAKE2b-256	`7c3234949ed2e423f93b1baffc89d2e584a4335bd3c340ca52294fd957eb2d1a`

Provenance

The following attestation bundles were made for oplm-0.1.1-py3-none-any.whl:

Publisher: python-publish.yaml on briney/oplm

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: oplm-0.1.1-py3-none-any.whl
- Subject digest: dc14e2ab11880344c3b7e6f7761acd9af288e6ac37ef9b250e9d7ac7ce798244
- Sigstore transparency entry: 1763489162
- Sigstore integration time: Jun 9, 2026
Source repository:
- Permalink: briney/oplm@d946356e2087f115a0888b79b14963da496e562d
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/briney
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yaml@d946356e2087f115a0888b79b14963da496e562d
- Trigger Event: release
