
Protein sequence annotation with language models


PSALM

This package contains code and pre-trained weights for Protein Sequence Annotation with Language Models (PSALM) from our 2024 preprint.

Abstract

Protein function inference relies on annotating protein domains via sequence similarity, often modeled through profile Hidden Markov Models (profile HMMs), which capture evolutionary diversity within related domains. However, profile HMMs make strong simplifying independence assumptions when modeling residues in a sequence. Here, we introduce PSALM (Protein Sequence Annotation with Language Models), a hierarchical approach that relaxes these assumptions and uses representations of protein sequences learned by protein language models to enable high-sensitivity, high-specificity residue-level protein sequence annotation. We validate PSALM's performance on a curated set of "ground truth" annotations determined by a profile HMM-based method and highlight PSALM as a promising alternative for protein sequence annotation.

Usage

PSALM requires Python>=3.10 and PyTorch>=2.2.0. Start a fresh conda environment to use PSALM:

conda create -n "psalm" python=3.10
conda activate psalm
pip install torch protein-sequence-annotation notebook ipykernel
python -m ipykernel install --user

Alternatively, install only PSALM from the protein-sequence-annotation PyPI package:

pip install protein-sequence-annotation
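
To confirm that an environment meets the stated requirements (Python>=3.10 and PyTorch>=2.2.0), a small check like the one below may help. This is a minimal sketch, not part of PSALM: the helper names `version_tuple`, `meets_requirements`, and `check_environment` are hypothetical, and the version parsing is deliberately simple (it keeps only the leading numeric part of each component).

```python
import re
import sys
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Turn a version string such as '2.2.0+cu121' into (2, 2, 0).

    Hypothetical helper: local build tags after '+' are dropped, only the
    leading digits of each component are kept, and short versions are
    padded with zeros to three components.
    """
    parts = []
    for piece in version.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts[:3])


def meets_requirements(python_version: tuple, torch_version: str) -> bool:
    """Check the minimums stated above: Python>=3.10 and PyTorch>=2.2.0."""
    return tuple(python_version[:2]) >= (3, 10) and version_tuple(torch_version) >= (2, 2, 0)


def check_environment() -> bool:
    """Return True if the running interpreter and installed torch are new enough."""
    try:
        torch_version = metadata.version("torch")
    except metadata.PackageNotFoundError:
        return False  # torch is not installed in this environment
    return meets_requirements(sys.version_info[:2], torch_version)
```

Calling `check_environment()` inside the activated conda environment returns `False` if torch is missing or either version is too old.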

After installation, you can load and use a pretrained model as follows:

import torch
from psalm import psalm

# Load PSALM clan and family models
# device defaults to "cpu"; replace with "cuda" or "mps" as needed
PSALM = psalm(clan_model_name="ProteinSequenceAnnotation/PSALM-1-clan",
              fam_model_name="ProteinSequenceAnnotation/PSALM-1-family",
              device="cpu")

# Prepare data (use PSALM.read_fasta(fasta_file_path) to get data directly from a FASTA file)
data = [
    ("Human Beta Globin", "MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH"),
    ("Flavohemoprotein", "MLDAQTIATVKATIPLLVETGPKLTAHFYDRMFTHNPELKEIFNMSNQRNGDQREALFNAIAAYASNIENLPALLPAVEKIAQKHTSFQIKPEQYNIVGEHLLATLDEMFSPGQEVLDAWGKAYGVLANVFINREAEIYNENASKAGGWEGTRDFRIVAKTPRSALITSFELEPVDGGAVAEYRPGQYLGVWLKPEGFPHQEIRQYSLTRKPDGKGYRIAVKREEGGQVSNWLHNHANVGDVVKLVAPAGDFFMAVADDTPVTLISAGVGQTPMLAMLDTLAKAGHTAQVNWFHAAENGDVHAFADEVKELGQSLPRFTAHTWYRQPSEADRAKGQFDSEGLMDLSKLEGAFSDPTMQFYLCGPVGFMQFTAKQLVDLGVKQENIHYECFGPHKVL")
]

# Visualize PSALM annotations (optionally pass save_path, e.g. PSALM.annotate(data, save_path="save_folder"))
PSALM.annotate(data)
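
The `data` list above is a list of `(name, sequence)` tuples. To illustrate that format, here is a minimal sketch of a FASTA parser that produces the same structure. It is not the implementation of `PSALM.read_fasta`, and `parse_fasta` is a hypothetical name; in particular, keeping the full header line as the name is an assumption, and `PSALM.read_fasta` may treat headers differently.

```python
from typing import Iterable, List, Tuple


def parse_fasta(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Collect (name, sequence) tuples from FASTA-formatted lines.

    Hypothetical helper: the name is the full header text after '>',
    and multi-line sequences are joined into a single string.
    """
    records: List[Tuple[str, str]] = []
    name = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name = line[1:].strip()
            chunks = []
        elif name is not None:
            chunks.append(line)  # sequence text before any header is ignored
    if name is not None:
        records.append((name, "".join(chunks)))
    return records
```

An open file handle can be passed directly, for example `with open(fasta_file_path) as handle: data = parse_fasta(handle)`.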

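The `device` argument accepts 'cpu', 'cuda', or 'mps'. One way to choose among them automatically is sketched below, using PyTorch's standard availability checks. The helper names `pick_device` and `detect_device` are hypothetical, and the preference order (cuda, then mps, then cpu) is an assumption rather than a PSALM recommendation.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer 'cuda', then 'mps', and fall back to 'cpu' (assumed ordering)."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Query PyTorch for available backends; return 'cpu' if torch is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # torch.backends.mps may be absent on older PyTorch builds
    mps = getattr(torch.backends, "mps", None)
    mps_available = bool(mps is not None and mps.is_available())
    return pick_device(torch.cuda.is_available(), mps_available)
```

The result can then be passed when loading the models, for example `device=detect_device()`.
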
Cite

If you find PSALM useful in your research, please cite the following paper:

@article{sarkarkrishnan2024psalm,
    author = {Sarkar, Arpan and Krishnan, Kumaresh and Eddy, Sean R},
    title = {Protein Sequence Domain Annotation using Language Models},
    year = {2024},
    URL = {https://www.biorxiv.org/content/early/2024/06/05/2024.06.04.596712},
    journal = {bioRxiv}
}
