Protein sequence domain annotation with PSALM.


┌──────────────────────────────────────────────────────────────────────────────┐
│                                                                              │
│                                                                              │
│                 ██████╗ ███████╗ █████╗ ██╗     ███╗   ███╗                  │
│                 ██╔══██╗██╔════╝██╔══██╗██║     ████╗ ████║                  │
│                 ██████╔╝███████╗███████║██║     ██╔████╔██║                  │
│                 ██╔═══╝ ╚════██║██╔══██║██║     ██║╚██╔╝██║                  │
│                 ██║     ███████║██║  ██║███████╗██║ ╚═╝ ██║                  │
│                 ╚═╝     ╚══════╝╚═╝  ╚═╝╚══════╝╚═╝     ╚═╝                  │
│              Protein Sequence Annotation using a Language Model              │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘

Persistent session mode (load model once, scan many times):

psalm -d auto
# inside shell:
#   scan -f path/to/seqs.fa
#   scan --sort -f path/to/seqs.fa -c 4 --to-tsv hits.tsv
#   scan -s "MSTNPKPQR..."
#   quit

Quick usage:

psalm-scan -f path/to/your_sequence.fasta

CLI behavior notes:

  • Default model: ProteinSequenceAnnotation/PSALM-2
  • Default device: auto (cuda -> mps -> cpu)
  • FASTA scans use fast batched scanning by default
  • --serial restores the legacy serial FASTA behavior
  • --sort remains opt-in
  • -c/--cpu-workers is the number of fast-mode CPU decode helper processes
    • default behavior is equivalent to -c 0
    • if the interactive shell already has warmed workers, later default fast scans reuse that pool
  • --max-batch-size controls the fast-mode embedding batch budget in tokens/amino acids
  • --max-queue-size controls the fast-mode decode queue in sequences
    • default: 128
  • -q/--quiet suppresses scan result output only; startup/status still prints
  • --to-tsv and --to-txt work for single or multi-sequence FASTA
  • -v/--verbose enables detailed alignment and model tables
    • verbose FASTA scans use the serial path
    • without -v, PSALM prints the compact HITS report
  • -T keeps domains with Score >= threshold (default: 0.5)
  • -E keeps domains with E-value <= threshold (default: 0.1)
  • -Z sets dataset size for E-value scaling
    • if omitted for -s: Z=1
    • if omitted for -f: Z=#sequences in FASTA
  • --to-tsv is the supported machine-readable output format
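The interplay of -T, -E, and -Z above can be sketched as a simple filter. The helper name and the linear E-value scaling are illustrative assumptions, not PSALM internals; linear scaling with dataset size is the usual convention in sequence-search statistics:

```python
# Sketch of the -T / -E / -Z filtering logic (hypothetical helper).
# Assumption: a hit's E-value scales linearly with the dataset size Z.

def keep_hit(score, p_value, z=1, score_threshold=0.5, evalue_threshold=0.1):
    """Return True if a hit passes both the -T and -E cutoffs."""
    e_value = p_value * z
    return score >= score_threshold and e_value <= evalue_threshold

# A confident hit in a single-sequence scan (-s implies Z=1) passes:
print(keep_hit(score=0.9, p_value=0.01))         # True
# The same hit scanned against a 100-sequence FASTA (Z=100) fails on E-value:
print(keep_hit(score=0.9, p_value=0.01, z=100))  # False
```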

Common shell usage:

psalm

Fast shell usage with workers pre-warmed at startup:

psalm -c 4
# inside shell
scan --sort -f path/to/seqs.fa --to-tsv hits.tsv

Fast shell usage without pre-warming:

psalm -d auto
# inside shell
scan --sort -f path/to/seqs.fa -c 4 --to-tsv hits.tsv

Useful output modes:

# compact terminal report + TSV
scan -f path/to/seqs.fa --to-tsv hits.tsv

# fast mode with TSV only
scan -q --sort -f path/to/seqs.fa --to-tsv hits.tsv

# verbose per-domain output
scan -v -f path/to/seqs.fa

# explicit legacy serial FASTA mode
scan --serial -f path/to/seqs.fa

For the full option set, run psalm --help, psalm-scan --help, or scan --help.

Installation

Create a fresh Python 3.10 environment, install PyTorch for your hardware, then install PSALM.

conda create -n psalm python=3.10 -y
conda activate psalm
python -m pip install --upgrade pip

# 1) Install PyTorch for your hardware
# Apple Silicon (MPS):
python -m pip install torch

# CPU-only (Linux/Windows):
# python -m pip install torch

# NVIDIA CUDA 12.1:
# python -m pip install --index-url https://download.pytorch.org/whl/cu121 \
#   torch

# 2) Install PSALM
python -m pip install --index-url https://test.pypi.org/simple/ \
  --extra-index-url https://pypi.org/simple \
  protein-sequence-annotation==2.1.9

If you are unsure which PyTorch command matches your GPU/driver, use the official selector: https://pytorch.org/get-started/locally/

Intel Mac (x86_64) tested path:

conda create -n psalm python=3.10 -y
conda activate psalm

conda install -y -c conda-forge "llvmlite=0.44.*" "numba=0.61.*"
conda install -y -c conda-forge "pytorch=2.5" torchvision torchaudio

python -m pip install --index-url https://test.pypi.org/simple/ \
  --extra-index-url https://pypi.org/simple \
  protein-sequence-annotation==2.1.9

Optional: run without activating conda manually:

conda run -n psalm psalm-scan -f path/to/seqs.fa

Python API

from psalm.psalm_model import PSALM

psalm = PSALM(model_name="ProteinSequenceAnnotation/PSALM-2")

# Scan FASTA
results = psalm.scan(fasta="path/to/your_sequence.fasta")
print(results)

# Scan sequence string
results = psalm.scan(sequence="MSTNPKPQR...AA")

Output options:

  • to_tsv="results.tsv" writes: Sequence,E-value,Score,Pfam,Start,Stop,Model,Len Frac,Status
  • to_txt="results.txt" saves console-style output
  • For multi-sequence FASTA, rows from all queries are concatenated into one TSV, with each query's ID recorded in the Sequence column
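A file written with to_tsv (or --to-tsv) can then be consumed with the standard library. A minimal sketch, assuming the nine columns listed above and tab-separated fields; the row values here are made up:

```python
import csv
import io

# Parse a --to-tsv style file; shown on an inline example rather than a real scan.
tsv_text = (
    "Sequence\tE-value\tScore\tPfam\tStart\tStop\tModel\tLen Frac\tStatus\n"
    "query1\t0.001\t0.92\tPF00069\t10\t250\tPkinase\t0.95\tOK\n"
    "query2\t0.30\t0.41\tPF00001\t5\t120\t7tm_1\t0.60\tOK\n"
)

rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
# Re-apply the default -T cutoff to the parsed rows:
strong = [r for r in rows if float(r["Score"]) >= 0.5]
print([r["Pfam"] for r in strong])  # ['PF00069']
```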

Scripts overview

The core workflow is:

  1. scripts/data/augment_fasta.py → slice sequences and generate augmented FASTA + domain dict
  2. scripts/data/data_processing.py → tokenize, label, batch, and shard datasets
  3. scripts/train/train_psalm.py → train/evaluate the PSALM model on shards

scripts/data/augment_fasta.py

Splits long sequences into domain-preserving slices and optionally emits shuffled and negative variants. Produces a new FASTA and a new domain dict with aligned IDs.
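The slicing idea can be sketched as follows. This is a conceptual illustration, not the augment_fasta.py implementation; the function name and the greedy cut-placement strategy are assumptions:

```python
# Conceptual sketch of domain-preserving slicing: cut at the first position
# >= max_length that does not fall inside any annotated domain interval.

def slice_points(seq_len, domains, max_length):
    """domains: list of (start, stop) 0-based half-open intervals."""
    cuts, pos = [], 0
    while seq_len - pos > max_length:
        cut = pos + max_length
        # push the cut past any domain it would otherwise split
        for start, stop in sorted(domains):
            if start < cut < stop:
                cut = stop
        cuts.append(cut)
        pos = cut
    return cuts

# A domain spanning 380..450 forces the first cut out to 450:
print(slice_points(1000, [(380, 450)], 400))  # [450, 850]
```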

Key inputs

  • --fasta, --domain-dict
  • --output-fasta, --output-dict

Common flags

  • --max-length: slice length threshold
  • --negative-prob: target fraction of negatives (approximate)
  • --include-domain-slices, --shuffle-only, --no-shuffle, --domain-slices-only
  • --large-data with --p-shuffled, --domain-counts-tsv, --domain-slice-frac
  • --seed, --verbose

scripts/data/data_processing.py

Tokenizes sequences, generates per-token labels from the domain dict and label mapping, batches by token budget, and saves shards.
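The batch-by-token-budget step can be sketched greedily (hypothetical helper, not the script's actual code):

```python
# Greedily pack sequences into batches until adding the next one would
# exceed the token budget. An oversized single sequence still gets its
# own batch rather than being dropped.

def batch_by_token_budget(lengths, max_tokens_per_batch):
    batches, current, used = [], [], 0
    for i, n in enumerate(lengths):
        if current and used + n > max_tokens_per_batch:
            batches.append(current)
            current, used = [], 0
        current.append(i)
        used += n
    if current:
        batches.append(current)
    return batches

print(batch_by_token_budget([300, 500, 250, 700], 1000))  # [[0, 1], [2, 3]]
```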

Config handling

  • This script is CLI-only; it does not read config.yaml.

Required args

  • --fasta, --domain-dict, --output-dir, --ignore-label
  • --model-name, --max-length, --max-tokens-per-batch
  • --label-mapping-dict

Optional args

  • --chunk-size, --tmp-dir, --shard-size, --seed, --keep-tmp

Notes

  • ID normalization uses the FASTA header segment between > and the first space.
  • --ignore-label must match the training --ignore-label.
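The stated ID rule, as a one-line sketch:

```python
# Take the FASTA header text between '>' and the first whitespace.
def normalize_id(header):
    return header.lstrip(">").split()[0]

print(normalize_id(">sp|P12345|EXAMPLE some description"))  # sp|P12345|EXAMPLE
```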

scripts/train/train_psalm.py

Trains or evaluates PSALM on preprocessed shard datasets.

Config handling

  • Training always uses a YAML config.
  • Whether --config is omitted or given without a value, the script falls back to psalm/config.yaml.

Required args

  • --val-dir, --ignore-label
  • --train-dir if training.total_steps > 0 in config

Optional args

  • --label-mapping-dict to override config model.label_mapping_path

Checkpoint loading

  • Supports model.safetensors or pytorch_model.bin within a checkpoint directory, or a direct path to a .safetensors/.bin file.
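The stated lookup order can be sketched like this (hypothetical helper; the real script may differ in details such as error handling):

```python
import os

# Resolve a checkpoint argument: a directory containing model.safetensors
# or pytorch_model.bin, or a direct path to a .safetensors/.bin file.

def resolve_checkpoint(path):
    if os.path.isdir(path):
        for name in ("model.safetensors", "pytorch_model.bin"):
            candidate = os.path.join(path, name)
            if os.path.isfile(candidate):
                return candidate
        raise FileNotFoundError(f"no checkpoint file in {path}")
    if path.endswith((".safetensors", ".bin")):
        return path
    raise ValueError(f"unrecognized checkpoint path: {path}")

print(resolve_checkpoint("weights/model.safetensors"))
```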

Logging

  • report_to=["wandb"] is enabled by default.

scripts/train/train_cbm.py

Trains the CatBoost scoring model used by scan() (saved as score.cbm).

Required args

  • --pos, --neg: Pickle or JSON files, each containing a list of 7-tuples (pfam, start, stop, bit_score, len_ratio, bias, status), or scan() output dicts with 8-tuples that additionally include cbm_score.

Example

python scripts/train/train_cbm.py \
  --pos path/to/positives.pkl \
  --neg path/to/negatives.pkl \
  --outdir cbm_outputs \
  --model-out score.cbm
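Preparing the --pos/--neg inputs can be sketched as a pickled list of the documented 7-tuples (all values here are invented examples):

```python
import pickle

# Invented example rows in the documented 7-tuple layout:
# (pfam, start, stop, bit_score, len_ratio, bias, status)
positives = [
    ("PF00069", 10, 250, 135.2, 0.95, 1.4, "OK"),
    ("PF07714", 30, 290, 98.7, 0.88, 0.9, "OK"),
]

with open("positives.pkl", "wb") as fh:
    pickle.dump(positives, fh)
```

An analogous negatives file then feeds --neg.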

Config format

The scripts expect a YAML config with these sections:

model

  • model_name
  • max_batch_size
  • output_size
  • freeze_esm
  • use_fa
  • pretrained_checkpoint_path
  • label_mapping_path

training

  • gradient_accumulation_steps, learning_rate, optimizer, gradient_clipping
  • lr_scheduler, eval_strategy, eval_steps, total_steps, warmup_steps
  • logging_steps, save_steps, output_dir
  • mixed_precision, dataloader_num_workers, dataloader_prefetch_factor, dataloader_pin_memory, seed

data

  • chunk_size, default_tmp_dir, default_shard_size

psalm/config.yaml is provided as a template with null values. Populate it before use, or pass all required values via CLI without --config.
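A minimal sketch of the expected shape, with illustrative placeholder values (none of these are recommended defaults, and some training keys from the lists above are omitted for brevity):

```yaml
model:
  model_name: ProteinSequenceAnnotation/esm2_t33_650M_PFS90_leaky
  max_batch_size: 8196
  output_size: null            # number of output labels; set before use
  freeze_esm: true
  use_fa: false                # requires faesm if true
  pretrained_checkpoint_path: null
  label_mapping_path: labels.pkl
training:
  learning_rate: 1.0e-4
  total_steps: 0               # 0 = evaluation only; --train-dir not required
  output_dir: runs/psalm
  seed: 42
data:
  chunk_size: 1000
  default_tmp_dir: tmp
  default_shard_size: 512
```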

Training CLI examples

python scripts/data/augment_fasta.py \
  --fasta input.fa \
  --domain-dict domains.pkl \
  --output-fasta augmented.fa \
  --output-dict augmented.pkl
python scripts/data/data_processing.py \
  --fasta augmented.fa \
  --domain-dict augmented.pkl \
  --label-mapping-dict labels.pkl \
  --output-dir data/shards \
  --model-name ProteinSequenceAnnotation/esm2_t33_650M_PFS90_leaky \
  --max-length 4096 \
  --max-tokens-per-batch 8196 \
  --ignore-label -100
python scripts/train/train_psalm.py \
  --config psalm/config.yaml \
  --train-dir data/shards/train \
  --val-dir data/shards/val \
  --ignore-label -100

Dependencies

  • PyYAML is required for config loading.
  • faesm is required only if use_fa: true in config.
  • Core inference runtime uses torch, transformers, biopython, pandas, numba, and catboost.
