
PalmSite: RdRP catalytic center predictor

PalmSite is a fast command-line tool that predicts the RNA-dependent RNA polymerase (RdRP) catalytic center from protein FASTA and outputs GFF3. As of v0.2.0, PalmSite can also optionally output per-residue attention weights and span parameters in JSON.


Highlights

  • One command from FASTA → GFF3:

    palmsite <fasta ...>
    
  • New: optional JSON output of residue-wise attention and span details:

    palmsite --attn-json details.json <fasta>
    
  • High precision and recall AUC (internal benchmarks):

    Backbone (ESM-C)   Positives vs. Negatives   Positives vs. Rest
    6b                 0.9998                    0.9848
    600m               0.9992                    0.9687
    300m               0.9991                    0.9755

  • Detects distant homologs (e.g., HSRV RdRP in Urayama et al., 2024).

Installation

conda create -n palmsite python=3.11
conda activate palmsite
pip install palmsite

Quickstart

# Basic (default backbone: 600m, local)
palmsite -o hsrv_rdrp-domain.gff examples/hsrv_proteins.fasta

# Or write to stdout
palmsite examples/hsrv_proteins.fasta > hsrv_rdrp-domain.gff

# Quiet mode
palmsite -q examples/sars-cov-2_proteins.fasta

# Increase reporting threshold
palmsite -p 0.9 examples/zikavirus_proteins.fasta

# Use 6B (Forge)
palmsite -b 6b -k <FORGE_TOKEN> examples/turnip-mosaic-virus_proteins.fasta

Notes:

  • -b/--backbone selects the ESM-C embedding model: 300m, 600m (local), or 6b (Forge).
  • For 6b, set -k <token> or export ESM_FORGE_TOKEN.
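
The invocations above can also be scripted from Python. The following is a minimal sketch that only assembles an argument list from the flags documented in this README; `build_palmsite_cmd` is a hypothetical helper, not part of PalmSite, and it assumes the `palmsite` executable is on your PATH.

```python
from typing import List, Optional


def build_palmsite_cmd(
    fastas: List[str],
    gff_out: Optional[str] = None,
    min_p: Optional[float] = None,
    backbone: Optional[str] = None,
    attn_json: Optional[str] = None,
) -> List[str]:
    """Assemble a palmsite command line from documented flags (hypothetical helper)."""
    cmd = ["palmsite"]
    if gff_out is not None:
        cmd += ["-o", gff_out]          # write GFF3 to a file instead of stdout
    if min_p is not None:
        cmd += ["-p", str(min_p)]       # minimum probability for reporting
    if backbone is not None:
        cmd += ["-b", backbone]         # 300m, 600m, or 6b
    if attn_json is not None:
        cmd += ["--attn-json", attn_json]
    return cmd + list(fastas)


cmd = build_palmsite_cmd(
    ["examples/zikavirus_proteins.fasta"], gff_out="zika.gff", min_p=0.9
)
print(" ".join(cmd))
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```

For `-b 6b`, exporting ESM_FORGE_TOKEN before the call keeps the token out of the argument list.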

NEW: Attention JSON output

PalmSite now supports optional per-residue attention-weight output in JSON format:

palmsite \
  -o result.gff \
  --attn-json attention_details.json \
  examples/myproteins.fasta

Each entry corresponds to one embedded chunk and includes:

{
  "chunk_id": {
    "L": <length>,
    "orig_start": <absolute_start>,
    "orig_len": <protein_length>,
    "mu": <anchor_mu>,
    "sigma": <anchor_sigma>,
    "mu_attn": <gaussian_mu>,
    "sigma_attn": <gaussian_sigma>,
    "S_norm": <span_start_norm>,
    "E_norm": <span_end_norm>,
    "S_idx": <span_start_index>,
    "E_idx": <span_end_index>,
    "P": <probability>,
    "w": [... per-residue attention weights ...],
    "abs_pos": [... absolute positions ...]
  }
}
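
As a rough illustration of how this file might be consumed, the sketch below reports the most strongly attended residue for each chunk. It assumes the file is a single JSON object keyed by chunk id, and that `w` and `abs_pos` are parallel lists of equal length, as the layout above suggests. The helper names are hypothetical, and whether `abs_pos` is 0- or 1-based is not stated here.

```python
import json
from typing import Any, Dict


def summarize_chunk(entry: Dict[str, Any]) -> Dict[str, Any]:
    """Return the probability and the absolute position of the largest attention weight."""
    weights = entry["w"]
    positions = entry["abs_pos"]
    if len(weights) != len(positions):
        raise ValueError("w and abs_pos differ in length")
    # Index of the maximum weight within this chunk.
    peak = max(range(len(weights)), key=weights.__getitem__)
    return {"P": entry["P"], "peak_pos": positions[peak], "peak_weight": weights[peak]}


def summarize_file(path: str) -> Dict[str, Dict[str, Any]]:
    """Load an --attn-json file and summarize every chunk in it."""
    with open(path) as handle:
        data = json.load(handle)
    return {chunk_id: summarize_chunk(entry) for chunk_id, entry in data.items()}


# Toy entry with made-up values, only to show the shape of the result.
toy = {"P": 0.97, "w": [0.1, 0.6, 0.3], "abs_pos": [10, 11, 12]}
print(summarize_chunk(toy))
```

With a real run you would call `summarize_file("attention_details.json")` instead of using the toy entry.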

Command-line usage

Usage: palmsite [OPTIONS] [FASTAS]...

PalmSite — RdRP catalytic center predictor.
Usage: palmsite -p 0.5 [-o result.gff] [--attn-json details.json] <fasta ...>

Options

  --version                       Show version and exit
  -o, --gff-out PATH              Write GFF3; default: stdout
  -p, --min-p FLOAT               Minimum probability for GFF [default: 0.5]
  -b, --backbone [300m|600m|6b]   Embedding backbone (local or Forge)
  -m, --model-id TEXT             HF model repo for PalmSite weights (default: ryota-sugimoto/palmsite)
  -d, --device [auto|cpu|cuda]    Device for local models (ignored for 6b)
  -k, --token TEXT                Forge token for 6B (or set ESM_FORGE_TOKEN)
  -t, --tmp-dir PATH              Temp directory (default: auto-created)
  -q, --quiet                     Suppress logs
  -v, --verbose                   Debug logs (overrides quiet)
  --keep-tmp                      Keep temp files (sanitized FASTA + per-batch embeddings)
  --attn-json PATH                Write per-residue attention JSON (can be large)
  --micro-batch-seqs INTEGER      Micro-batch size in number of sequences
  --micro-batch-tokens INTEGER    Micro-batch size cap in ~tokens (sum(len(seq)+2))
  FASTAS...                       One or more FASTA files

What PalmSite does

1. Sanitize & merge FASTA

Cleans up unusual characters (replacing them with X), drops sequences that needed too many corrections, and writes a single clean, merged FASTA. (src: sanitize.py)
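
The idea can be sketched as follows. This is not the code in sanitize.py: the allowed alphabet, the upper-casing, the 10% correction threshold, and both function names are assumptions made for illustration.

```python
from typing import Iterable, Iterator, Tuple

# Assumption: the 20 standard residues plus X count as "usual" characters.
ALLOWED = set("ACDEFGHIKLMNPQRSTVWYX")


def sanitize_sequence(seq: str) -> Tuple[str, int]:
    """Upper-case a sequence and replace every character outside ALLOWED with X."""
    cleaned = []
    corrections = 0
    for ch in seq.upper():
        if ch in ALLOWED:
            cleaned.append(ch)
        else:
            cleaned.append("X")
            corrections += 1
    return "".join(cleaned), corrections


def sanitize_records(
    records: Iterable[Tuple[str, str]], max_fraction: float = 0.1
) -> Iterator[Tuple[str, str]]:
    """Yield cleaned (name, sequence) pairs, skipping those with too many corrections."""
    for name, seq in records:
        if not seq:
            continue  # nothing to embed
        cleaned, corrections = sanitize_sequence(seq)
        if corrections / len(seq) <= max_fraction:
            yield name, cleaned


records = [("ok", "MKTAYIAKQR"), ("noisy", "M**********")]
print(list(sanitize_records(records)))
```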

2. Embed sequences

The embedding engine (_embed_impl.py) generates an HDF5 file containing token-wise ESM-C embeddings:

  • 300m / 600m — local Hugging Face models
  • 6B — via ESM Forge API

Streaming micro-batches (v0.2.0+): the CLI runs embedding and prediction in small micro-batches, emitting GFF3 rows incrementally and deleting each temporary embedding HDF5 right after it is consumed (unless you pass --keep-tmp). This avoids large peak disk usage for big FASTA inputs.

Tune with:

  • --micro-batch-tokens (default: ~80k for local backbones, ~120k for 6b)
  • --micro-batch-seqs (optional hard cap on number of sequences per batch)
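
To make the two limits concrete, here is a minimal sketch of greedy micro-batching under a token budget. It is an illustration, not PalmSite's actual batching code: `micro_batches` is a hypothetical function, and the per-sequence cost of `len(seq) + 2` simply mirrors the estimate quoted in the `--micro-batch-tokens` help text.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Record = Tuple[str, str]  # (sequence id, amino-acid sequence)


def micro_batches(
    records: Iterable[Record],
    max_tokens: int,
    max_seqs: Optional[int] = None,
) -> Iterator[List[Record]]:
    """Group records greedily so each batch stays within the token and sequence caps."""
    batch: List[Record] = []
    tokens = 0
    for name, seq in records:
        cost = len(seq) + 2  # same estimate as sum(len(seq)+2) in the option help
        over_tokens = tokens + cost > max_tokens
        over_count = max_seqs is not None and len(batch) >= max_seqs
        # Close the current batch before adding a record that would exceed a cap.
        # A single sequence larger than max_tokens still gets a batch of its own.
        if batch and (over_tokens or over_count):
            yield batch
            batch, tokens = [], 0
        batch.append((name, seq))
        tokens += cost
    if batch:
        yield batch


records = [("a", "A" * 8), ("b", "B" * 8), ("c", "C" * 3)]
for batch in micro_batches(records, max_tokens=20):
    print([name for name, _ in batch])
```

Because batches are produced lazily, each one can be embedded, scored, and discarded before the next is built, which is the property that keeps peak disk usage low.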

3. Predict RdRP domains

Prediction code lives in:

  • _predict_impl.py (full engine with CSV, GFF3, HDF5 export, and JSON export)
  • infer_simple.py (minimal GFF3 generator, now with JSON support)

Outputs include:

  • GFF3 spans
  • (New) JSON with attention maps

Output files

1. GFF3 (default)

Contains one feature per protein:

Attribute               Meaning
P                       RdRP probability
sigma                   attention span width
Chunk / ChunkOrWindow   source chunk or window
SpanSource              kSigma or HPD
AttnMass                HPD mass used (if enabled)
AttnEntropy             attention entropy
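
These attributes can be read back with a few lines of Python. The sketch below relies only on the general GFF3 layout (nine tab-separated columns, with `key=value` pairs joined by `;` in the last one). The example line is made up: its source, feature type, coordinates, and values are placeholders rather than real PalmSite output, and the function names are hypothetical.

```python
from typing import Any, Dict
from urllib.parse import unquote


def parse_gff3_attributes(column9: str) -> Dict[str, str]:
    """Split a GFF3 attributes column into a dict, decoding percent-escapes."""
    attrs: Dict[str, str] = {}
    for field in column9.strip().split(";"):
        if not field:
            continue
        key, _, value = field.partition("=")
        attrs[unquote(key)] = unquote(value)
    return attrs


def parse_gff3_line(line: str) -> Dict[str, Any]:
    """Parse one GFF3 feature line into its sequence id, coordinates, and attributes."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    return {
        "seqid": cols[0],
        "start": int(cols[3]),
        "end": int(cols[4]),
        "attributes": parse_gff3_attributes(cols[8]),
    }


# Made-up feature line for illustration only.
example = "prot1\tPalmSite\tregion\t120\t310\t.\t.\t.\tP=0.98;SpanSource=kSigma"
feature = parse_gff3_line(example)
print(feature["seqid"], feature["start"], feature["end"], float(feature["attributes"]["P"]))
```

When reading a whole file, skip lines that start with `#` before passing them to `parse_gff3_line`.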

Environment variables

  • ESM_FORGE_TOKEN — token for Forge when using -b 6b
  • PALMSITE_MODEL_ID — override default HF repo
  • PALMSITE_MODEL_REV — optional model revision

Version: 0.2.0


Citation

(Coming soon.)
