
ESM3-based Transformer Attention Protein classifier for binary protein sequence classification


ETAP — ESM3-based Transformer Attention Protein classifier

Binary protein sequence classifier built on ESM3 per-residue embeddings with a learned attention pooling layer.
Designed for any study requiring positive/negative classification of protein sequences.
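
The attention pooling idea can be illustrated in plain Python. This is only a sketch of the general technique, not ETAP's implementation: the function name `attention_pool`, the scoring vector `w`, and the toy embeddings are assumptions made for illustration, and the real model presumably uses learned neural-network layers rather than a fixed vector.

```python
import math

def attention_pool(embeddings, w):
    """Collapse per-residue embeddings (L x D) into a single D-dim vector.

    Hypothetical sketch: each residue gets a scalar score (dot product
    with a scoring vector w), the scores are softmax-normalised into
    attention weights, and the output is the weighted sum of embeddings.
    """
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for x in embeddings]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(embeddings[0])
    pooled = [sum(a * x[d] for a, x in zip(weights, embeddings))
              for d in range(dim)]
    return pooled, weights

# Toy input: 3 residues with 2-dimensional embeddings (made-up numbers).
toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_pool(toy, w=[0.5, 0.5])
```

The per-residue weights sum to 1, which is what makes them interpretable as a distribution over sequence positions.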

Installation

pip install etap-clf

Or from source:

pip install "git+https://github.com/Sitgttish/summer26.git#subdirectory=eta_package"

ESM3 is a gated model. Before first use:

  1. Accept the license at https://huggingface.co/EvolutionaryScale/esm3-sm-open-v1
  2. Get a token at https://huggingface.co/settings/tokens
  3. Pass it via --hf-token or set HF_TOKEN in your environment.
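
Step 3 can be pictured as a small lookup. The sketch below is an assumption about the behaviour, not ETAP's code: the helper name `resolve_hf_token` is hypothetical, and the precedence (explicit token first, then the HF_TOKEN environment variable) is a guess at what a CLI like this would typically do.

```python
import os

def resolve_hf_token(cli_token=None, env=None):
    """Return the token passed explicitly, else the HF_TOKEN variable.

    Hypothetical helper: the precedence shown here is an assumption.
    """
    env = os.environ if env is None else env
    token = cli_token or env.get("HF_TOKEN")
    if not token:
        raise RuntimeError(
            "No HuggingFace token found: pass --hf-token or set HF_TOKEN."
        )
    return token
```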

Usage

Training

eta --train positive.fasta negative.fasta ./model_output/

Outputs saved to ./model_output/:

  • best_model.pth — model checkpoint
  • training_history.csv — epoch-level loss and val-AUC
  • test_metrics.csv — final accuracy, AUC, avg-precision, sensitivity, specificity
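
For reference, accuracy, sensitivity, and specificity follow from the confusion-matrix counts. The sketch below uses the standard definitions; the function name `binary_metrics` and the toy labels are illustrative assumptions, and how ETAP computes or names these values internally is not specified here.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity (recall on positives), specificity (recall on negatives)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else float("nan"),
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }

# Made-up example: 3 positives and 2 negatives.
metrics = binary_metrics([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
```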

Optional flags:

--epochs 30            Maximum training epochs (default: 30)
--patience 10          Early stopping patience on val-AUC (default: 10)
--batch-size 64        ETA training batch size (default: 64)
--embed-batch-size 16  ESM3 embedding batch size; reduce if you run out of memory (default: 16)
--lr 3e-4              Learning rate (default: 3e-4)
--proj-dim 256         Hidden dimension (default: 256)
--num-layers 4         Transformer encoder layers (default: 4)
--cache-dir ./cache    Reuse ESM3 embedding cache across runs
--hf-token TOKEN       HuggingFace token (alternatively, set HF_TOKEN)
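
The interplay between --epochs and --patience can be sketched as follows. This is a generic early-stopping loop under stated assumptions, not ETAP's training code: the function name is hypothetical, and it assumes training stops once val-AUC has failed to strictly improve for `patience` consecutive epochs.

```python
def best_epoch_with_patience(val_aucs, patience=10):
    """Return (best_epoch, last_epoch_run), both 1-indexed.

    Assumption: stop once `patience` epochs have passed without a
    strict improvement in validation AUC.
    """
    best_auc = float("-inf")
    best_epoch = 0
    for epoch, auc in enumerate(val_aucs, start=1):
        if auc > best_auc:
            best_auc, best_epoch = auc, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch  # stopped early
    return best_epoch, len(val_aucs)  # ran through all epochs

# Made-up val-AUC curve that peaks at epoch 2.
result = best_epoch_with_patience([0.70, 0.75, 0.74, 0.73, 0.72], patience=2)
```

Under this scheme the checkpoint worth keeping is the one from the best epoch, not the last one run.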

Inference

eta --eval model_output/best_model.pth new_sequences.fasta ./results.csv

Output CSV columns: header, gene, prob_positive, predicted_label
If FASTA headers contain |label=1 or |label=0, full metrics are reported automatically.
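
The results file can be post-processed with the standard library. The sketch below relies only on the column names listed above; the helper name `high_confidence_hits`, the 0.9 threshold, and the sample rows are made up for illustration.

```python
import csv
import io

def high_confidence_hits(csv_file, threshold=0.9):
    """Return (gene, prob_positive) pairs at or above the threshold."""
    reader = csv.DictReader(csv_file)
    hits = []
    for row in reader:
        prob = float(row["prob_positive"])
        if prob >= threshold:
            hits.append((row["gene"], prob))
    return hits

# In-memory stand-in for ./results.csv (invented rows).
sample = io.StringIO(
    "header,gene,prob_positive,predicted_label\n"
    "sp|P12345|GENE_HUMAN,GENE,0.97,1\n"
    "SEQ2,SEQ2,0.12,0\n"
)
hits = high_confidence_hits(sample)
```

With a real file, pass `open("results.csv", newline="")` in place of the `io.StringIO` object.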

Attention analysis (optional)

eta --eval best_model.pth sequences.fasta ./results.csv --gene-analyze

Saves the following plots to ./analysis/:

  1. attn_aa_analysis.png — mean attention per amino acid type (+ class enrichment if labels present)
  2. attn_motifs.png — top high-attention 5-mer motifs
  3. attn_gene_heatmap.png — gene-level attention heatmap
  4. attn_position.png — positional attention profile (N→C terminus)
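
As a rough illustration of what the first plot summarises, per-residue attention weights can be averaged by amino acid type. This is a sketch under assumptions, not the package's analysis code: the function name and the input layout (one list of attention weights per sequence, aligned with its residues) are hypothetical.

```python
from collections import defaultdict

def mean_attention_by_residue(sequences, attentions):
    """Average attention weight for each amino acid letter.

    Assumes attentions[i][j] is the weight on residue j of sequences[i].
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for seq, attn in zip(sequences, attentions):
        if len(seq) != len(attn):
            raise ValueError("sequence and attention lengths differ")
        for aa, weight in zip(seq, attn):
            totals[aa] += weight
            counts[aa] += 1
    return {aa: totals[aa] / counts[aa] for aa in totals}

# Made-up example: one 3-residue sequence.
means = mean_attention_by_residue(["ACA"], [[0.2, 0.6, 0.2]])
```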

FASTA header conventions

Gene name is extracted automatically:

  • UniProt format >sp|P12345|GENE_HUMAN → gene = GENE
  • Generic format: the first token before a space, |, or _

Labels (for metric reporting in eval):

>SEQID|label=1    ferroptosis-positive
>SEQID|label=0    negative control
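
The header conventions above can be sketched as a small parser. This is an illustrative reading of the rules, not the package's own code: the function name `parse_header` is hypothetical, and treating `tr|` (TrEMBL) headers the same way as `sp|` headers is an assumption.

```python
import re

def parse_header(header):
    """Return (gene, label) from a FASTA header; label is None if absent."""
    header = header.lstrip(">").strip()

    # Optional label suffix: |label=1 or |label=0
    match = re.search(r"\|label=([01])\b", header)
    label = int(match.group(1)) if match else None

    fields = header.split("|")
    if len(fields) >= 3 and fields[0] in ("sp", "tr"):  # "tr" is an assumption
        # UniProt style: the third field looks like GENE_HUMAN
        gene = fields[2].split("_")[0]
    else:
        # Generic style: first token before a space, "|", or "_"
        gene = re.split(r"[ |_]", header, maxsplit=1)[0]
    return gene, label
```

For example, `parse_header(">sp|P12345|GENE_HUMAN")` yields the gene `GENE` with no label, while `parse_header(">SEQID|label=1")` yields `SEQID` with label 1.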

Python API

from eta import ETA, run_training, run_eval

# Training
ckpt_path, metrics = run_training(
    pos_fasta='positive.fasta',
    neg_fasta='negative.fasta',
    output_dir='./output/',
    hf_token='hf_...',
)

# Inference
results = run_eval(
    model_path='./output/best_model.pth',
    sequences_fasta='new_seqs.fasta',
    output_path='./results.csv',
    gene_analyze=True,
)

