Protein sequence annotation with language models

PSALM

This package contains code and pre-trained weights for Protein Sequence Annotation with Language Models (PSALM) from our 2024 preprint.

NOTE: PSALM-1 does not cover all Pfam families. It was trained and benchmarked on highly curated datasets with strict sequence similarity guarantees between train and test data. PSALM-1b (trained on all families in Pfam 35.0) is coming soon.

Abstract

Protein function inference relies on annotating protein domains via sequence similarity, often modeled through profile Hidden Markov Models (profile HMMs), which capture evolutionary diversity within related domains. However, profile HMMs make strong simplifying independence assumptions when modeling residues in a sequence. Here, we introduce PSALM (Protein Sequence Annotation with Language Models), a hierarchical approach that relaxes these assumptions and uses representations of protein sequences learned by protein language models to enable high-sensitivity, high-specificity residue-level protein sequence annotation. We validate PSALM's performance on a curated set of "ground truth" annotations determined by a profile HMM-based method and highlight PSALM as a promising alternative for protein sequence annotation.

Usage

PSALM requires Python>=3.10 and PyTorch>=2.2.0. Start a fresh conda environment to use PSALM:

conda create -n "psalm" python=3.10
conda activate psalm
pip install torch protein-sequence-annotation notebook ipykernel
python -m ipykernel install --user

Alternatively, install only PSALM from the protein-sequence-annotation PyPI package:

pip install protein-sequence-annotation
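To confirm that an environment satisfies the requirements stated above (Python>=3.10 and PyTorch>=2.2.0), a small check along the following lines can be run after installation. This is a minimal sketch and not part of PSALM: the helper names parse_version, meets_minimum, and check_environment are hypothetical, and the version parsing is deliberately simple (it reads only the leading numeric parts of a version string).

```python
import sys
from importlib import metadata


def parse_version(text):
    """Turn a string such as '2.2.0+cu121' into (2, 2, 0).

    Local build tags after '+' and non-numeric suffixes are ignored,
    and short versions are padded with zeros to three components.
    """
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def check_environment():
    """Report whether Python and PyTorch meet the stated minimums."""
    report = {"python": sys.version_info[:2] >= (3, 10)}
    try:
        report["torch"] = meets_minimum(metadata.version("torch"), "2.2.0")
    except metadata.PackageNotFoundError:
        report["torch"] = False  # PyTorch is not installed in this environment
    return report


if __name__ == "__main__":
    print(check_environment())
```

A False entry in the report means the corresponding requirement is not met in the active environment.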

After the pip install, you can load and use a pretrained model as follows:

import torch
from psalm import psalm

# Load PSALM clan and fam models
PSALM = psalm(clan_model_name="ProteinSequenceAnnotation/PSALM-1-clan",
              fam_model_name="ProteinSequenceAnnotation/PSALM-1-family",
              device='cpu')  # 'cpu' by default; replace with 'cuda' or 'mps' as needed

# Prepare data (use PSALM.read_fasta(fasta_file_path) to get data directly from a FASTA file)
data = [
    ("Human Beta Globin", "MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH"),
    ("Flavohemoprotein", "MLDAQTIATVKATIPLLVETGPKLTAHFYDRMFTHNPELKEIFNMSNQRNGDQREALFNAIAAYASNIENLPALLPAVEKIAQKHTSFQIKPEQYNIVGEHLLATLDEMFSPGQEVLDAWGKAYGVLANVFINREAEIYNENASKAGGWEGTRDFRIVAKTPRSALITSFELEPVDGGAVAEYRPGQYLGVWLKPEGFPHQEIRQYSLTRKPDGKGYRIAVKREEGGQVSNWLHNHANVGDVVKLVAPAGDFFMAVADDTPVTLISAGVGQTPMLAMLDTLAKAGHTAQVNWFHAAENGDVHAFADEVKELGQSLPRFTAHTWYRQPSEADRAKGQFDSEGLMDLSKLEGAFSDPTMQFYLCGPVGFMQFTAKQLVDLGVKQENIHYECFGPHKVL")
]

# Visualize PSALM annotations (optionally pass save_path, e.g. PSALM.annotate(data, save_path="save_folder"))
PSALM.annotate(data)
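The device argument and the data list used above can also be prepared programmatically. The sketch below is an illustration rather than PSALM's own code: pick_device and fasta_to_records are hypothetical helpers, and fasta_to_records is a plain-Python stand-in that only mirrors the (name, sequence) tuple layout shown above. It is not the implementation behind PSALM.read_fasta. The example record reuses the first 60 residues of the Human Beta Globin sequence from the snippet above.

```python
def pick_device():
    """Return 'cuda', 'mps', or 'cpu' depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        return "cpu"  # without PyTorch, fall back to the default device name
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


def fasta_to_records(text):
    """Parse FASTA-formatted text into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


example_fasta = """>Human Beta Globin
MVHLTPEEKSAVTALWGKVNVDEVGGEALG
RLLVVYPWTQRFFESFGDLSTPDAVMGNPK
"""

if __name__ == "__main__":
    print(pick_device())
    print(fasta_to_records(example_fasta))
```

The string returned by pick_device can be passed as the device argument when loading the models, and the list returned by fasta_to_records has the same shape as the data list passed to PSALM.annotate.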

Cite

If you find PSALM useful in your research, please cite the following paper:

@article {sarkarkrishnan2024psalm,
	author = {Sarkar, Arpan and Krishnan, Kumaresh and Eddy, Sean R},
	title = {Protein Sequence Domain Annotation using Language Models},
	year = {2024},
	URL = {https://www.biorxiv.org/content/early/2024/06/05/2024.06.04.596712},
	journal = {bioRxiv}
}
