
ProtHash

A protein language model that outputs amino acid sequence embeddings for use in clustering, classification, locality-sensitive hashing, and more. Distilled from the ESMC family of models, which have a deep comprehension of protein structure, ProtHash produces contextual embeddings that align in vector space according to the sequences' atomic structure. Trained to mimic its ESMC teacher, ProtHash achieves near-perfect similarity to the ESMC embeddings at a greatly reduced computational cost.
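The exact distillation objective isn't documented here, but a minimal sketch of what "trained to mimic its teacher" can look like, assuming a cosine-similarity loss between student and teacher embeddings (random tensors stand in for both models' outputs):

```python
import torch
import torch.nn.functional as F

# Toy teacher/student embeddings: a batch of 4 sequences, 512 dimensions
# (the dimensionality of the pretrained ProtHash models). In real training
# the teacher vectors would come from ESMC and the student vectors from
# ProtHash; here both are random stand-ins.
teacher = torch.randn(4, 512)
student = torch.randn(4, 512, requires_grad=True)

# One common embedding-distillation objective: minimize 1 - cos(student,
# teacher), i.e. pull the student toward the teacher's direction in vector
# space. This is an illustration, not ProtHash's published loss.
loss = (1.0 - F.cosine_similarity(student, teacher, dim=-1)).mean()
loss.backward()

print(round(loss.item(), 4))
```

Because cosine similarity ignores vector magnitude, a loss like this teaches the student to match the teacher's embedding directions, which is what similarity search and clustering depend on.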

Key Features

  • Structurally relevant embeddings: Structurally similar proteins lie near one another in the embedding space, enabling downstream tasks such as clustering, classification, and locality-sensitive hashing based on atomic structure.

  • Blazing fast and efficient: ProtHash uses only 3% of ESMC's parameters while achieving near-perfect cosine similarity between the two embedding spaces.

  • Long context: With a context window of 2048 amino acid tokens, you can embed even long protein sequences.
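The locality-sensitive hashing use case can be sketched with random-hyperplane hashing over embedding vectors: similar embeddings produce bit signatures with a small Hamming distance. The hashing scheme below is our illustration, not part of the ProtHash API, and random vectors stand in for real model outputs:

```python
import torch

torch.manual_seed(0)

# Random-hyperplane LSH: each hyperplane contributes one sign bit, so
# nearby vectors (by cosine similarity) tend to agree on most bits.
dim, n_bits = 512, 16
planes = torch.randn(n_bits, dim)

def lsh_signature(embedding: torch.Tensor) -> int:
    """Pack the signs of the hyperplane projections into an integer."""
    bits = (planes @ embedding > 0).int()
    return int("".join(str(b.item()) for b in bits), 2)

a = torch.randn(dim)
b = a + 0.01 * torch.randn(dim)   # slight perturbation of a
c = torch.randn(dim)              # unrelated vector

def hamming(x: int, y: int) -> int:
    return bin(x ^ y).count("1")

# Hamming distance between signatures approximates angular distance
# between the underlying vectors.
print(hamming(lsh_signature(a), lsh_signature(b)),
      hamming(lsh_signature(a), lsh_signature(c)))
```

Bucketing sequences by signature (or by signature prefix) then turns nearest-neighbor search over a large protein collection into a cheap hash lookup.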

Pretrained Models

| Name | Embedding Dimensionality | Attention Heads (Q/KV) | Encoder Layers | Total Params |
|---|---|---|---|---|
| andrewdalpino/ProtHash-512-Tiny | 512 | 16/4 | 4 | 13M |
| andrewdalpino/ProtHash-512 | 512 | 16/4 | 10 | 13M |

Pretrained Example

First, you'll need the prothash and esm packages installed into your environment. We recommend using a virtual environment such as Python's venv module to prevent version conflicts with any system packages.

pip install prothash esm

Then, load the weights from Hugging Face Hub, tokenize a protein sequence, and pass it to the model. ProtHash adopts the ESM tokenizer as its amino acid tokenization scheme. The output is an embedding vector that can be used in downstream tasks such as comparison with other protein sequence embeddings, clustering, and near-duplicate detection.

import torch

from esm.tokenization import EsmSequenceTokenizer
from prothash.model import ProtHash

model_name = "andrewdalpino/ProtHash-512-Tiny"

# Load the ESM tokenizer and the pretrained ProtHash weights.
tokenizer = EsmSequenceTokenizer()
model = ProtHash.from_pretrained(model_name)
model.eval()

sequence = input("Enter a sequence: ")

# Truncate to the model's 2048-token context window.
out = tokenizer(sequence, max_length=2048, truncation=True)
tokens = out["input_ids"]

# Add a batch dimension and embed without tracking gradients.
x = torch.tensor(tokens, dtype=torch.int64).unsqueeze(0)

with torch.no_grad():
    y_embed = model.embed(x)

print(y_embed)
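Once you have embeddings, comparing two of them typically comes down to cosine similarity, which is the natural metric here given that ProtHash was distilled against its teacher under cosine similarity. A self-contained sketch with random tensors standing in for two model outputs:

```python
import torch
import torch.nn.functional as F

# Two 512-dimensional sequence embeddings (batch size 1). In practice
# these would be outputs of model.embed(); random vectors stand in here.
emb_a = torch.randn(1, 512)
emb_b = torch.randn(1, 512)

# Cosine similarity ranges over [-1, 1]; higher values indicate more
# structurally similar sequences.
similarity = F.cosine_similarity(emb_a, emb_b, dim=-1)
print(similarity.item())
```

The same metric feeds directly into clustering (e.g. agglomerative clustering on a cosine-distance matrix) and near-duplicate detection (thresholding the similarity score).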

References

  • The UniProt Consortium, UniProt: the Universal Protein Knowledgebase in 2025, Nucleic Acids Research, 2025, 53, D609–D617.
  • T. Hayes, et al. Simulating 500 million years of evolution with a language model, 2024.
  • B. Zhang, et al. Root Mean Square Layer Normalization. 33rd Conference on Neural Information Processing Systems, NeurIPS 2019.
  • J. Ainslie, et al. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints, Google Research, 2023.
