Protein sequence annotation with language models

PSALM

This package contains code and pre-trained weights for Protein Sequence Annotation with Language Models (PSALM) from our 2024 preprint.

Abstract

Protein function inference relies on annotating protein domains via sequence similarity, often modeled through profile Hidden Markov Models (profile HMMs), which capture evolutionary diversity within related domains. However, profile HMMs make strong simplifying independence assumptions when modeling residues in a sequence. Here, we introduce PSALM (Protein Sequence Annotation with Language Models), a hierarchical approach that relaxes these assumptions and uses representations of protein sequences learned by protein language models to enable high-sensitivity, high-specificity residue-level protein sequence annotation. We validate PSALM's performance on a curated set of "ground truth" annotations determined by a profile HMM-based method and highlight PSALM as a promising alternative for protein sequence annotation.

Usage

PSALM requires Python>=3.10 and PyTorch>=2.2.0. Start a fresh conda environment to use PSALM:

conda create -n "psalm" python=3.10
conda activate psalm
pip install torch protein-sequence-annotation notebook ipykernel
python -m ipykernel install --user

Alternatively, install PSALM on its own from the protein-sequence-annotation PyPI package:

pip install protein-sequence-annotation
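The version requirements stated above (Python>=3.10, PyTorch>=2.2.0) can be checked from inside the environment. The following is a minimal sketch, not part of PSALM: the helper names `parse_version`, `meets_minimum`, and `check_environment` are hypothetical, and the comparison looks only at the leading numeric components of a version string (local suffixes such as `+cu121` are ignored).

```python
import sys
from importlib import metadata


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple of ints."""
    parts = []
    # Drop any local version suffix such as "+cu121" before splitting on dots.
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. "dev2024"
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if `installed` is at least `minimum` (numeric components only)."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that "2.2" and "2.2.0" compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment():
    """Report whether this interpreter and the installed torch meet the stated minimums."""
    try:
        torch_version = metadata.version("torch")
    except metadata.PackageNotFoundError:
        torch_version = None  # torch is not installed in this environment
    return {
        "python_ok": sys.version_info >= (3, 10),
        "torch_version": torch_version,
        "torch_ok": torch_version is not None and meets_minimum(torch_version, "2.2.0"),
    }


if __name__ == "__main__":
    print(check_environment())
```

If `torch_ok` is reported as false, reinstalling torch inside the conda environment is the likely fix.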

After the pip install, you can load and use a pretrained model as follows:

import torch
from psalm import psalm

# Load the PSALM clan and family models
PSALM = psalm(clan_model_name="ProteinSequenceAnnotation/PSALM-1-clan",
              fam_model_name="ProteinSequenceAnnotation/PSALM-1-family",
              device="cpu")  # 'cpu' by default; replace with 'cuda' or 'mps' as needed

# Prepare data (use PSALM.read_fasta(fasta_file_path) to get data directly from a FASTA file)
data = [
    ("Human Beta Globin", "MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH"),
    ("Flavohemoprotein", "MLDAQTIATVKATIPLLVETGPKLTAHFYDRMFTHNPELKEIFNMSNQRNGDQREALFNAIAAYASNIENLPALLPAVEKIAQKHTSFQIKPEQYNIVGEHLLATLDEMFSPGQEVLDAWGKAYGVLANVFINREAEIYNENASKAGGWEGTRDFRIVAKTPRSALITSFELEPVDGGAVAEYRPGQYLGVWLKPEGFPHQEIRQYSLTRKPDGKGYRIAVKREEGGQVSNWLHNHANVGDVVKLVAPAGDFFMAVADDTPVTLISAGVGQTPMLAMLDTLAKAGHTAQVNWFHAAENGDVHAFADEVKELGQSLPRFTAHTWYRQPSEADRAKGQFDSEGLMDLSKLEGAFSDPTMQFYLCGPVGFMQFTAKQLVDLGVKQENIHYECFGPHKVL")
]

# Visualize PSALM annotations (to save output, add the optional save_path argument: PSALM.annotate(data, save_path="save_folder"))
PSALM.annotate(data)
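The `data` list above is a list of (name, sequence) tuples. As an illustration of that layout only, here is a minimal sketch that builds such a list from FASTA-formatted text in plain Python. The helper `fasta_to_pairs` is hypothetical and is not PSALM's reader; whether `PSALM.read_fasta` returns exactly the same structure is an assumption, so prefer `PSALM.read_fasta` when working with real files. The example record is a truncated prefix of the beta globin sequence shown above, used purely for illustration.

```python
def fasta_to_pairs(lines):
    """Convert FASTA-formatted lines into a list of (name, sequence) tuples."""
    pairs = []
    name = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A header line starts a new record; store the previous one first.
            if name is not None:
                pairs.append((name, "".join(chunks)))
            name = line[1:].strip()
            chunks = []
        elif name is not None:
            # Sequence lines may be wrapped, so collect and join them.
            chunks.append(line)
    if name is not None:
        pairs.append((name, "".join(chunks)))
    return pairs


# Hypothetical two-line-wrapped record (truncated sequence, illustration only).
example = """>Human Beta Globin (truncated for illustration)
MVHLTPEEKSAVTALWGKVNV
DEVGGEALGRLLVVYPWTQRF
""".splitlines()

data = fasta_to_pairs(example)
print(data)
```

A list built this way has the same shape as the hand-written `data` list, so under the assumption above it could be passed to `PSALM.annotate` in the same manner.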

Cite

If you find PSALM useful in your research, please cite the following paper:

@article{sarkarkrishnan2024psalm,
    author = {Sarkar, Arpan and Krishnan, Kumaresh and Eddy, Sean R},
    title = {Protein Sequence Domain Annotation using Language Models},
    year = {2024},
    URL = {https://www.biorxiv.org/content/early/2024/06/05/2024.06.04.596712},
    journal = {bioRxiv}
}
