Protein sequence domain annotation with PSALM.
Project description
┌──────────────────────────────────────────────────────────────────────────────┐
│ │
│ │
│ ██████╗ ███████╗ █████╗ ██╗ ███╗ ███╗ │
│ ██╔══██╗██╔════╝██╔══██╗██║ ████╗ ████║ │
│ ██████╔╝███████╗███████║██║ ██╔████╔██║ │
│ ██╔═══╝ ╚════██║██╔══██║██║ ██║╚██╔╝██║ │
│ ██║ ███████║██║ ██║███████╗██║ ╚═╝ ██║ │
│ ╚═╝ ╚══════╝╚═╝ ╚═╝╚══════╝╚═╝ ╚═╝ │
│ Protein Sequence Annotation using a Language Model │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
Persistent session mode (load model once, scan many times):
psalm -d auto
# inside shell:
# scan -f path/to/seqs.fa
# scan --sort -f path/to/seqs.fa -c 4 --to-tsv hits.tsv
# scan -s "MSTNPKPQR..."
# quit
Quick usage:
psalm-scan -f path/to/your_sequence.fasta
CLI behavior notes:
- Default model: ProteinSequenceAnnotation/PSALM-2
- Default device: auto (cuda -> mps -> cpu)
- FASTA scans use fast batched scanning by default
- --serial restores the legacy serial FASTA behavior
- --sort remains opt-in
- -c/--cpu-workers is the number of fast-mode CPU decode helper processes
  - default behavior is equivalent to -c 0
  - if the interactive shell already has warmed workers, later default fast scans reuse that pool
- --max-batch-size controls the fast-mode embedding batch budget in tokens/amino acids
- --max-queue-size controls the fast-mode decode queue in sequences (default: 128)
- -q/--quiet suppresses scan result output only; startup/status still prints
- --to-tsv and --to-txt work for single or multi-sequence FASTA
- -v/--verbose enables detailed alignment and model tables; verbose FASTA scans use the serial path
- without -v, PSALM prints the compact HITS report
- -T keeps domains with Score >= threshold (default: 0.5)
- -E keeps domains with E-value <= threshold (default: 0.1)
- -Z sets the dataset size for E-value scaling
  - if omitted for -s: Z = 1
  - if omitted for -f: Z = #sequences in FASTA
- --to-tsv is the supported machine-readable output format
Common shell usage:
psalm
scan --sort -f path/to/seqs.fa --to-tsv hits.tsv
Useful output modes:
# compact terminal report + TSV
scan -f path/to/seqs.fa --to-tsv hits.tsv
# quiet mode: TSV output only
scan -q --sort -f path/to/seqs.fa --to-tsv hits.tsv
# verbose per-domain output
scan -v -f path/to/seqs.fa
For the full option set, run psalm --help, psalm-scan --help, or scan --help.
Installation
Create a fresh Python 3.10 environment, install PyTorch for your hardware, then install PSALM.
conda create -n psalm python=3.10 -y
conda activate psalm
python -m pip install --upgrade pip
# 1) Install PyTorch for your hardware
# Apple Silicon (MPS):
python -m pip install torch
# CPU-only (Linux/Windows):
# python -m pip install torch
# NVIDIA CUDA 12.1:
# python -m pip install --index-url https://download.pytorch.org/whl/cu121 \
# torch
# 2) Install PSALM
python -m pip install protein-sequence-annotation==2.1.11
If you are unsure which PyTorch command matches your GPU/driver, use the official selector: https://pytorch.org/get-started/locally/
Intel Mac (x86_64) tested path:
conda create -n psalm python=3.10 -y
conda activate psalm
conda install -y -c conda-forge "llvmlite=0.44.*" "numba=0.61.*"
conda install -y -c conda-forge "pytorch=2.5" torchvision torchaudio
python -m pip install protein-sequence-annotation==2.1.11
Optional: run without activating conda manually:
conda run -n psalm psalm-scan -f path/to/seqs.fa
Python API
from psalm.psalm_model import PSALM
psalm = PSALM(model_name="ProteinSequenceAnnotation/PSALM-2")
# Scan FASTA
results = psalm.scan(fasta="path/to/your_sequence.fasta")
print(results)
# Scan sequence string
results = psalm.scan(sequence="MSTNPKPQR...AA")
Output options:
- to_tsv="results.tsv" writes: Sequence, E-value, Score, Pfam, Start, Stop, Model, Len Frac, Status
- to_txt="results.txt" saves console-style output
- For multi-sequence FASTA, TSV rows are combined with the query id in the Sequence column
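As a sketch of downstream use, the TSV written by to_tsv can be loaded and filtered with pandas (already a core dependency). The rows and threshold below are illustrative only, mirroring the documented default cutoffs:

```python
import io

import pandas as pd

# Illustrative TSV with the documented columns; in practice, read the
# file written via to_tsv="results.tsv".
tsv = io.StringIO(
    "Sequence\tE-value\tScore\tPfam\tStart\tStop\tModel\tLen Frac\tStatus\n"
    "seq1\t0.01\t0.92\tPF00069\t10\t250\tPSALM-2\t0.85\tpass\n"
    "seq1\t0.5\t0.30\tPF07714\t300\t320\tPSALM-2\t0.10\tfail\n"
)
hits = pd.read_csv(tsv, sep="\t")

# Mirror the CLI defaults: keep Score >= 0.5 and E-value <= 0.1
kept = hits[(hits["Score"] >= 0.5) & (hits["E-value"] <= 0.1)]
print(kept["Pfam"].tolist())  # -> ['PF00069']
```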
Scripts overview
The core workflow is:
- scripts/data/augment_fasta.py → slice sequences and generate augmented FASTA + domain dict
- scripts/data/data_processing.py → tokenize, label, batch, and shard datasets
- scripts/train/train_psalm.py → train/evaluate the PSALM model on shards
scripts/data/augment_fasta.py
Splits long sequences into domain-preserving slices and optionally emits shuffled and negative variants. Produces a new FASTA and a new domain dict with aligned IDs.
Key inputs
- --fasta, --domain-dict
- --output-fasta, --output-dict
Common flags
- --max-length: slice length threshold
- --negative-prob: target fraction of negatives (approximate)
- --include-domain-slices, --shuffle-only, --no-shuffle, --domain-slices-only
- --large-data with --p-shuffled, --domain-counts-tsv, --domain-slice-frac
- --seed, --verbose
scripts/data/data_processing.py
Tokenizes sequences, generates per-token labels from the domain dict and label mapping, batches by token budget, and saves shards.
Config handling
- This script is CLI-only; it does not read config.yaml.
Required args
- --fasta, --domain-dict, --output-dir, --ignore-label
- --model-name, --max-length, --max-tokens-per-batch
- --label-mapping-dict
Optional args
- --chunk-size, --tmp-dir, --shard-size, --seed, --keep-tmp
Notes
- ID normalization uses the FASTA header segment between > and the first space.
- --ignore-label must match the training --ignore-label.
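The ID normalization rule above (the header segment between > and the first space) can be sketched in plain Python; normalize_id is a hypothetical helper written for illustration, not part of the package:

```python
def normalize_id(header: str) -> str:
    """Return the FASTA ID: the segment between '>' and the first space."""
    return header.lstrip(">").split(None, 1)[0]

# Headers with descriptions and bare headers normalize the same way.
print(normalize_id(">sp|P12345|EXAMPLE some description text"))  # sp|P12345|EXAMPLE
print(normalize_id(">seq42"))  # seq42
```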
scripts/train/train_psalm.py
Trains or evaluates PSALM on preprocessed shard datasets.
Config handling
- Training always uses a YAML config.
- If --config is provided without a value, the script looks for psalm/config.yaml.
- If --config is not provided, the script still looks for psalm/config.yaml.
Required args
- --val-dir, --ignore-label
- --train-dir if training.total_steps > 0 in config
Optional args
- --label-mapping-dict to override config model.label_mapping_path
Checkpoint loading
- Supports model.safetensors or pytorch_model.bin within a checkpoint directory, or a direct path to a .safetensors/.bin file.
Logging
- report_to=["wandb"] is enabled by default.
scripts/train/train_cbm.py
Trains the CatBoost scoring model used by scan() (saved as score.cbm).
Required args
- --pos, --neg: Pickle or JSON files containing a list of 7-tuples: (pfam, start, stop, bit_score, len_ratio, bias, status), or scan() output dicts containing 8-tuples with cbm_score.
Example
python scripts/train/train_cbm.py \
--pos path/to/positives.pkl \
--neg path/to/negatives.pkl \
--outdir cbm_outputs \
--model-out score.cbm
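A minimal sketch of preparing a --pos/--neg pickle, assuming the 7-tuple layout documented above; the feature values here are made up for illustration only:

```python
import pickle

# (pfam, start, stop, bit_score, len_ratio, bias, status), per the
# documented 7-tuple layout; values below are illustrative only.
positives = [
    ("PF00069", 10, 250, 85.2, 0.85, 1.3, 1),
    ("PF07714", 30, 290, 72.9, 0.91, 0.8, 1),
]

with open("positives.pkl", "wb") as fh:
    pickle.dump(positives, fh)
```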
Config format
The scripts expect a YAML config with these sections:
model
model_name, max_batch_size, output_size, freeze_esm, use_fa, pretrained_checkpoint_path, label_mapping_path
training
gradient_accumulation_steps, learning_rate, optimizer, gradient_clipping, lr_scheduler, eval_strategy, eval_steps, total_steps, warmup_steps, logging_steps, save_steps, output_dir, mixed_precision, dataloader_num_workers, dataloader_prefetch_factor, dataloader_pin_memory, seed
data
chunk_size, default_tmp_dir, default_shard_size
psalm/config.yaml is provided as a template with null values. Populate it
before use, or pass all required values via CLI without --config.
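A populated config might look like the following sketch. Every value here is a placeholder chosen for illustration (not a recommended or default setting); only the section and key names come from the format above:

```yaml
model:
  model_name: ProteinSequenceAnnotation/esm2_t33_650M_PFS90_leaky
  max_batch_size: 8196
  output_size: 1000            # placeholder; set to your label-space size
  freeze_esm: true
  use_fa: false
  pretrained_checkpoint_path: null
  label_mapping_path: labels.pkl

training:
  gradient_accumulation_steps: 1
  learning_rate: 1.0e-4
  optimizer: adamw             # placeholder
  gradient_clipping: 1.0
  lr_scheduler: linear
  eval_strategy: steps
  eval_steps: 500
  total_steps: 10000
  warmup_steps: 500
  logging_steps: 50
  save_steps: 1000
  output_dir: runs/psalm
  mixed_precision: bf16
  dataloader_num_workers: 4
  dataloader_prefetch_factor: 2
  dataloader_pin_memory: true
  seed: 42

data:
  chunk_size: 1000
  default_tmp_dir: tmp
  default_shard_size: 100
```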
Training CLI examples
python scripts/data/augment_fasta.py \
--fasta input.fa \
--domain-dict domains.pkl \
--output-fasta augmented.fa \
--output-dict augmented.pkl
python scripts/data/data_processing.py \
--fasta augmented.fa \
--domain-dict augmented.pkl \
--label-mapping-dict labels.pkl \
--output-dir data/shards \
--model-name ProteinSequenceAnnotation/esm2_t33_650M_PFS90_leaky \
--max-length 4096 \
--max-tokens-per-batch 8196 \
--ignore-label -100
python scripts/train/train_psalm.py \
--config psalm/config.yaml \
--train-dir data/shards/train \
--val-dir data/shards/val \
--ignore-label -100
Dependencies
- PyYAML is required for config loading.
- faesm is required only if use_fa: true in config.
- Core inference runtime uses torch, transformers, biopython, pandas, numba, and catboost.
File details
Details for the file protein_sequence_annotation-2.1.11.tar.gz.
File metadata
- Download URL: protein_sequence_annotation-2.1.11.tar.gz
- Upload date:
- Size: 1.7 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.13
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 0032d92b4382ed87e673c9107ecf97e7ae7e4116a2ac5c6caa962490d8c7f381 |
| MD5 | 9fed417b0027ad1a990f239138efab08 |
| BLAKE2b-256 | 4e01d75502fffcb5382bb6d16ef88328d83efbd6d0045105bd95e73e36c7aa54 |
File details
Details for the file protein_sequence_annotation-2.1.11-py3-none-any.whl.
File metadata
- Download URL: protein_sequence_annotation-2.1.11-py3-none-any.whl
- Upload date:
- Size: 1.7 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.13
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 0c819d1ce3a6c9964965383c8111c9fe38f36404b8b61295f755ff9746fd67f6 |
| MD5 | 118035a57e116309207ea719b4ee1363 |
| BLAKE2b-256 | 1ea4afb9568e31c447c3b341b4a4928d8fddd6bc01108a8841d0751c01b154b6 |