PalmSite: RdRP catalytic center predictor
PalmSite is a fast command-line tool that predicts the RNA-dependent RNA polymerase (RdRP) catalytic center from protein FASTA and outputs GFF3. As of v0.2.0, PalmSite can also optionally output per-residue attention weights and span parameters in JSON.
Highlights
- One command from FASTA → GFF3:
  palmsite <fasta ...>
- New: optional JSON output of residue-wise attention and span details:
  palmsite --attn-json details.json <fasta>
- High precision and recall AUC (internal benchmarks):
| Backbone (ESM-C) | Positives vs. Negatives | Positives vs. Rest |
|---|---|---|
| 6b | 0.9998 | 0.9848 |
| 600m | 0.9992 | 0.9687 |
| 300m | 0.9991 | 0.9755 |
- Detects distant homologs (e.g., HSRV RdRP in Urayama et al., 2024).
Installation
conda create -n palmsite python=3.11
conda activate palmsite
pip install palmsite
Quickstart
# Basic (default backbone: 600m, local)
palmsite -o hsrv_rdrp-domain.gff examples/hsrv_proteins.fasta
# Or write to stdout
palmsite examples/hsrv_proteins.fasta > hsrv_rdrp-domain.gff
# Quiet mode
palmsite -q examples/sars-cov-2_proteins.fasta
# Increase reporting threshold
palmsite -p 0.9 examples/zikavirus_proteins.fasta
# Use 6B (Forge)
palmsite -b 6b -k <FORGE_TOKEN> examples/turnip-mosaic-virus_proteins.fasta
Notes:
- -b/--backbone selects the ESM-C embedding model: 300m, 600m (local), or 6b (Forge).
- For 6b, set -k <token> or export ESM_FORGE_TOKEN.
NEW: Attention JSON output
PalmSite now supports optional per-residue attention-weight output in JSON format:
palmsite \
-o result.gff \
--attn-json attention_details.json \
examples/myproteins.fasta
Each entry corresponds to one embedded chunk and includes:
{
  "chunk_id": {
    "L": <length>,
    "orig_start": <absolute_start>,
    "orig_len": <protein_length>,
    "mu": <anchor_mu>,
    "sigma": <anchor_sigma>,
    "mu_attn": <gaussian_mu>,
    "sigma_attn": <gaussian_sigma>,
    "S_norm": <span_start_norm>,
    "E_norm": <span_end_norm>,
    "S_idx": <span_start_index>,
    "E_idx": <span_end_index>,
    "P": <probability>,
    "w": [... per-residue attention weights ...],
    "abs_pos": [... absolute positions ...]
  }
}
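As an illustration of how this layout might be consumed, the sketch below finds the highest-attention residue in each chunk. It assumes the top-level JSON value is an object keyed by chunk ID and that "w" and "abs_pos" have the same length; the helper name peak_attention and the example values are invented for this sketch and are not part of PalmSite.

```python
import json

def peak_attention(details):
    """Map each chunk ID to (absolute_position, weight) of its top-weighted residue."""
    peaks = {}
    for chunk_id, rec in details.items():
        weights = rec["w"]
        positions = rec["abs_pos"]
        if not weights:
            continue  # skip chunks that carry no attention weights
        # Index of the largest attention weight in this chunk.
        i = max(range(len(weights)), key=weights.__getitem__)
        peaks[chunk_id] = (positions[i], weights[i])
    return peaks

# Hand-made record in the documented layout (all values are invented).
example = {
    "protA|chunk0": {
        "L": 4, "orig_start": 10, "orig_len": 100, "P": 0.97,
        "w": [0.1, 0.6, 0.2, 0.1],
        "abs_pos": [10, 11, 12, 13],
    }
}
print(peak_attention(example))

# For a real run, load the file written by --attn-json instead:
# with open("attention_details.json") as fh:
#     details = json.load(fh)
```

Whether "abs_pos" is 0-based or 1-based is not stated here, so check it against a known sequence before reporting coordinates.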
Command-line usage
Usage: palmsite [OPTIONS] [FASTAS]...
PalmSite — RdRP catalytic center predictor.
Usage: palmsite -p 0.5 [-o result.gff] [--attn-json details.json] <fasta ...>
Options
--version Show version and exit
-o, --gff-out PATH Write GFF3; default: stdout
-p, --min-p FLOAT Minimum probability for GFF [default: 0.5]
-b, --backbone [300m|600m|6b] Embedding backbone (local or Forge)
-m, --model-id TEXT HF model repo for PalmSite weights (default: ryota-sugimoto/palmsite)
-d, --device [auto|cpu|cuda] Device for local models (ignored for 6b)
-k, --token TEXT Forge token for 6B (or set ESM_FORGE_TOKEN)
-t, --tmp-dir PATH Temp directory (default: auto-created)
-q, --quiet Suppress logs
-v, --verbose Debug logs (overrides quiet)
--keep-tmp Keep temp files (sanitized FASTA + per-batch embeddings)
--attn-json PATH Write per-residue attention JSON (can be large)
--micro-batch-seqs INTEGER Micro-batch size in number of sequences
--micro-batch-tokens INTEGER Micro-batch size cap in ~tokens (sum(len(seq)+2))
FASTAS... One or more FASTA files
What PalmSite does
1. Sanitize & merge FASTA
Replaces unusual characters with X, drops sequences that need too many corrections, and writes a clean merged FASTA.
(src: sanitize.py)
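The sanitizing step can be pictured with the following minimal sketch. It is not the code in sanitize.py: the accepted alphabet (20 standard amino acids plus X), the 10% replacement cutoff, and the function name sanitize are all assumptions chosen for illustration.

```python
# Assumed alphabet: the 20 standard amino acids plus the placeholder X.
VALID = set("ACDEFGHIKLMNPQRSTVWYX")

def sanitize(seq, max_fraction_replaced=0.1):
    """Return a cleaned sequence, or None if too many residues had to be replaced.

    The cutoff is a hypothetical value, not PalmSite's actual setting.
    """
    seq = seq.upper()
    if not seq:
        return None
    # Replace every character outside the assumed alphabet with X.
    cleaned = "".join(c if c in VALID else "X" for c in seq)
    n_replaced = sum(1 for a, b in zip(seq, cleaned) if a != b)
    if n_replaced / len(seq) > max_fraction_replaced:
        return None  # drop sequences that needed too many corrections
    return cleaned

print(sanitize("MKTAYIAKQR?"))  # one replacement out of 11 residues is kept
print(sanitize("MKT?LV"))       # one replacement out of 6 residues is dropped
```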
2. Embed sequences
The embedding engine (_embed_impl.py) generates an HDF5 file containing token-wise ESM-C embeddings:
- 300m / 600m — local Hugging Face models
- 6B — via ESM Forge API
Streaming micro-batches (v0.2.0+): the CLI runs embedding and prediction in small micro-batches, emitting GFF3 rows incrementally and deleting each temporary embedding HDF5 right after it is consumed (unless you pass --keep-tmp). This avoids large peak disk usage for big FASTA inputs.
Tune with:
- --micro-batch-tokens (default: ~80k for local backbones, ~120k for 6b)
- --micro-batch-seqs (optional hard cap on number of sequences per batch)
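To make the two caps concrete, here is a rough sketch of token-capped batching that uses the len(seq)+2 cost given in the option help. The generator name micro_batches and the choice to let an oversized sequence form its own batch are assumptions; the actual CLI may group sequences differently.

```python
def micro_batches(seqs, max_tokens=80_000, max_seqs=None):
    """Yield lists of sequences whose summed cost (len(seq) + 2) stays within max_tokens.

    A single sequence that exceeds max_tokens is still emitted, alone in its batch.
    """
    batch, used = [], 0
    for seq in seqs:
        cost = len(seq) + 2  # per-sequence token estimate from the option help
        over_tokens = used + cost > max_tokens
        over_seqs = max_seqs is not None and len(batch) >= max_seqs
        if batch and (over_tokens or over_seqs):
            yield batch
            batch, used = [], 0
        batch.append(seq)
        used += cost
    if batch:
        yield batch

# Three sequences of cost 5 each with a 10-token cap: two fit, the third spills over.
print(list(micro_batches(["AAA", "CCC", "GGG"], max_tokens=10)))
```

Because batches are yielded one at a time, a caller can embed, predict, and discard each batch before the next one is built, which is the property the streaming design relies on.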
3. Predict RdRP domains
Prediction code lives in:
- _predict_impl.py (full engine with CSV, GFF3, HDF5 export, and JSON export)
- infer_simple.py (minimal GFF3 generator, now with JSON support)
Outputs include:
- GFF3 spans
- (New) JSON with attention maps
Output files
1. GFF3 (default)
Contains one feature per protein:
| Attribute | Meaning |
|---|---|
| P | RdRP probability |
| sigma | attention span width |
| Chunk / ChunkOrWindow | source chunk or window |
| SpanSource | kSigma or HPD |
| AttnMass | HPD mass used (if enabled) |
| AttnEntropy | attention entropy |
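The attributes above sit in the ninth GFF3 column as semicolon-separated key=value pairs, so they can be read with a few lines of Python. The sketch below assumes standard GFF3 layout; the example row, including its source and type columns, is invented and may not match what PalmSite actually writes.

```python
def parse_gff3_line(line):
    """Parse one GFF3 feature row into a dict; return None for comments or malformed rows."""
    if line.startswith("#"):
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return None
    # Column 9 holds key=value pairs separated by semicolons.
    attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
    return {
        "seqid": cols[0],
        "start": int(cols[3]),
        "end": int(cols[4]),
        "attrs": attrs,
    }

# Invented example row; only the attribute names come from the table above.
example = "protA\tPalmSite\tprotein_domain\t120\t380\t.\t.\t.\tP=0.98;sigma=12.5;SpanSource=kSigma"
rec = parse_gff3_line(example)
print(rec["seqid"], rec["start"], rec["end"], float(rec["attrs"]["P"]))
```

Filtering on float(rec["attrs"]["P"]) after the fact gives the same kind of cut as -p/--min-p, without rerunning the prediction.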
Environment variables
- ESM_FORGE_TOKEN: token for Forge when using -b 6b
- PALMSITE_MODEL_ID: override default HF repo
- PALMSITE_MODEL_REV: optional model revision
Version: 0.2.0
Citation
(Coming soon.)