
PalmSite: RdRP catalytic center predictor

PalmSite is a simple, fast command-line tool that predicts the RNA-dependent RNA polymerase (RdRP) catalytic center from protein FASTA and outputs GFF3.

Highlights

  • One command from FASTA → GFF3: palmsite <fasta ...>
  • High precision and recall (internal benchmarks):
    Backbone (ESM-C)   Positives vs. Negatives   Positives vs. Rest
    6b                 0.9998                    0.9848
    600m               0.9992                    0.9687
    300m               0.9991                    0.9755
  • Detects distant homologs (e.g., HSRV RdRP).
  • Clear progress logging and fast batched embedding (local 300m/600m or Forge 6B).

Installation

pip install palmsite

Quickstart

# Basic (local 600m is the default backbone)
palmsite examples/zikavirus_proteins.fasta

# Quiet mode (only errors)
palmsite -q examples/zikavirus_proteins.fasta

# Raise the reporting threshold
palmsite -p 0.9 examples/zikavirus_proteins.fasta

# Use 6B (Forge); requires a token
palmsite -b 6b -k <FORGE_TOKEN> examples/zikavirus_proteins.fasta
# or export ESM_FORGE_TOKEN and omit -k
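
The same invocations can be scripted from Python. The following is a minimal sketch, not part of PalmSite: `build_palmsite_cmd` and `run_palmsite` are hypothetical helper names, and the only flags used are those listed in the CLI help (`-p`, `-b`, `-o`, `-q`). For `-b 6b` it assumes the token is supplied through `ESM_FORGE_TOKEN` rather than passed on the command line.

```python
import shutil
import subprocess
from typing import List, Optional, Sequence


def build_palmsite_cmd(
    fastas: Sequence[str],
    min_p: float = 0.5,
    backbone: str = "600m",
    gff_out: Optional[str] = None,
    quiet: bool = False,
) -> List[str]:
    """Assemble a palmsite argv list (hypothetical helper, not PalmSite API)."""
    if backbone not in {"300m", "600m", "6b"}:
        raise ValueError(f"unknown backbone: {backbone!r}")
    cmd = ["palmsite", "-p", str(min_p), "-b", backbone]
    if gff_out is not None:
        cmd += ["-o", gff_out]  # without -o, GFF3 goes to stdout
    if quiet:
        cmd.append("-q")
    return cmd + list(fastas)


def run_palmsite(cmd: Sequence[str]) -> str:
    """Run the command and return its stdout (the GFF3 text when -o is omitted)."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("palmsite is not on PATH; install it with pip first")
    done = subprocess.run(list(cmd), check=True, capture_output=True, text=True)
    return done.stdout


if __name__ == "__main__":
    # Only prints the command line; call run_palmsite(cmd) to execute it.
    cmd = build_palmsite_cmd(["examples/zikavirus_proteins.fasta"], min_p=0.9)
    print(" ".join(cmd))
```

Building the argument list separately from running it keeps the command easy to log and to test without PalmSite installed.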

Notes:

  • -b/--backbone chooses the embedding model: 300m, 600m (local), or 6b (Forge/cloud).

Command-line usage

Usage: palmsite [OPTIONS] [FASTAS]...

  PalmSite — RdRP catalytic center predictor. Usage: palmsite -p 0.5 [-o
  result.gff] [options] <fasta ...>

Options:
  --version                      Show the version and exit.
  -o, --gff-out TEXT             Write GFF3 to this path; default: stdout if
                                 omitted
  -p, --min-p FLOAT              Minimum probability to include a feature in
                                 GFF  [default: 0.5]
  -b, --backbone [300m|600m|6b]  Embedding backbone & size: '300m' (fast,
                                 local), '600m' (balanced, local), '6b'
                                 (highest quality via ESM Forge; requires
                                 --token or ESM_FORGE_TOKEN).  [default: 600m]
  -m, --model-id TEXT            Hugging Face model repo (default via
                                 PALMSITE_MODEL_ID env or palmsite/<backbone>)
  -d, --device [auto|cpu|cuda]   Device for local ESM-C (ignored for 6B Forge)
                                 [default: auto]
  -k, --token TEXT               Forge token (required for 6B if not set in
                                 ESM_FORGE_TOKEN)
  -t, --tmp-dir TEXT             Optional working directory for temp files
  -q, --quiet                    Reduce non-error logs
  -v, --verbose                  Verbose logs (DEBUG level; overrides -q)
  --keep-tmp                     Keep temporary files (sanitized FASTA &
                                 embeddings.h5) for debugging

About -b/--backbone

  • 300m – fast local ESM-C. Good for CPU/GPU prototyping.
  • 600m – balanced local ESM-C. Better quality; still lightweight.
  • 6b – highest quality via ESM Forge (cloud); requires -k <token> or ESM_FORGE_TOKEN.

What PalmSite does

  1. Sanitize & merge FASTA
    Replaces unusual residues with X, drops sequences that needed too many replacements, and writes one merged FASTA.
  2. Embed sequences
    • Launches the embedding engine (batched, token-aware micro-batching; visible progress/ETA).
    • Backends:
      • Local ESM-C (300m/600m) via Hugging Face.
      • Forge (6B) via the ESM SDK (ESM3ForgeInferenceClient).
  3. Predict → GFF3
    Loads the checkpoint from Hugging Face, computes RdRP probabilities and spans, aggregates per protein, and writes GFF3.
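
Steps 1 and 2 can be illustrated with a rough sketch. This is not PalmSite's code: the function names, the 10% replacement cutoff, the 4096-token budget, and the greedy padded-cost batching rule are all assumptions chosen for illustration. The actual thresholds and batching strategy live in sanitize.py and _embed_impl.py.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

STANDARD = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids
MAX_FIX_FRACTION = 0.1  # hypothetical cutoff, not PalmSite's actual value


def sanitize(seq: str, max_fix_fraction: float = MAX_FIX_FRACTION) -> Optional[str]:
    """Replace non-standard residues with X; return None if too many were replaced."""
    seq = seq.strip().upper()
    if not seq:
        return None
    # An X already present in the input is kept and not counted as a fix.
    fixes = sum(1 for c in seq if c not in STANDARD and c != "X")
    cleaned = "".join(c if c in STANDARD else "X" for c in seq)
    return cleaned if fixes / len(seq) <= max_fix_fraction else None


def micro_batches(
    records: Iterable[Tuple[str, str]], max_tokens: int = 4096
) -> Iterator[List[Tuple[str, str]]]:
    """Greedily group (id, sequence) pairs under a padded token budget.

    Padded cost is (longest sequence in batch) * (batch size). A single
    sequence longer than max_tokens still gets a batch of its own.
    """
    batch: List[Tuple[str, str]] = []
    longest = 0
    # Sorting by length keeps padding waste low within each batch.
    for rec in sorted(records, key=lambda r: len(r[1])):
        new_longest = max(longest, len(rec[1]))
        if batch and new_longest * (len(batch) + 1) > max_tokens:
            yield batch
            batch, new_longest = [], len(rec[1])
        batch.append(rec)
        longest = new_longest
    if batch:
        yield batch
```

Budgeting by padded size rather than by sequence count is one common way to keep memory use roughly constant across batches of very different lengths.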

Output

  • GFF3 (stdout or -o): one feature per protein (catalytic center span). Attributes include P, sigma, original length, and the chunk used.
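
For downstream filtering, the GFF3 output can be read with a few lines of Python. The sketch below is a generic GFF3 reader (nine tab-separated columns, `key=value` attribute pairs separated by `;`) and is not shipped with PalmSite. The attribute key `P` in the usage comment is an assumption based on the description above; check a real output file for the exact key names.

```python
from typing import Dict, Iterator, NamedTuple
from urllib.parse import unquote


class Feature(NamedTuple):
    seqid: str
    start: int  # 1-based, inclusive (GFF3 convention)
    end: int    # 1-based, inclusive
    attributes: Dict[str, str]


def parse_gff3(text: str) -> Iterator[Feature]:
    """Yield one Feature per data line, skipping comments and directives."""
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        attrs: Dict[str, str] = {}
        for pair in cols[8].split(";"):
            if "=" in pair:
                key, value = pair.split("=", 1)
                # GFF3 percent-encodes reserved characters in attributes.
                attrs[unquote(key.strip())] = unquote(value.strip())
        yield Feature(unquote(cols[0]), int(cols[3]), int(cols[4]), attrs)


# Usage (assumes a probability attribute named "P"):
# confident = [f for f in parse_gff3(open("result.gff").read())
#              if float(f.attributes.get("P", "0")) >= 0.9]
```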

Environment variables

  • ESM_FORGE_TOKEN — Forge API token for -b 6b (alternative to -k).
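
The token requirement can be sketched as a small check. This is an illustration only: `resolve_forge_token` is a hypothetical name, and giving the `-k` value precedence over the environment variable is an assumption, since the precedence is not documented here.

```python
import os
from typing import Mapping, Optional


def resolve_forge_token(
    backbone: str,
    cli_token: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return the Forge token for '6b', or None for the local backbones."""
    if backbone != "6b":
        return None  # 300m and 600m run locally and need no token
    # Assumed precedence: explicit -k value first, then ESM_FORGE_TOKEN.
    token = cli_token or env.get("ESM_FORGE_TOKEN")
    if not token:
        raise ValueError("backbone '6b' needs -k/--token or ESM_FORGE_TOKEN")
    return token
```

Passing `env` as a parameter makes the check easy to exercise without touching the real process environment.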

Project structure (user-side)

  • cli.py — top-level command: sanitize → embed → predict.
  • embed_shim.py — launches the embedding engine in a subprocess.
  • _embed_impl.py — embedding engine (batching, progress, HDF5 writer, Forge/local backends).
  • infer_simple.py — simple driver to produce GFF from embeddings.
  • _predict_impl.py — full predictor (model, dataset, collate).
  • hf.py — Hugging Face weight resolution.
  • sanitize.py — FASTA cleaner/merger.
  • __init__.py — version. Current: 0.1.0.

Version

0.1.0


