ESM3-based Transformer Attention Protein classifier for binary protein sequence classification

ETAP — ESM3-based Transformer Attention Protein classifier

Binary protein sequence classifier built on ESM3 per-residue embeddings with a learned attention pooling layer.
Designed for any study requiring positive/negative classification of protein sequences.
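To make "attention pooling" concrete, here is a minimal pure-Python sketch of one common form: each residue embedding gets a scalar score from a scoring vector, the scores are softmax-normalized, and the embeddings are averaged with those weights. This is an illustration only. The function name `attention_pool` and the single `query` vector are assumptions, and ETAP's actual layer (which also involves Transformer encoder layers) may be built differently.

```python
import math
from typing import List, Sequence, Tuple


def attention_pool(
    embeddings: Sequence[Sequence[float]],
    query: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Pool per-residue embeddings (L x D) into a single vector of size D.

    `query` stands in for a learned scoring vector (hypothetical here);
    a real model would train it by backpropagation.
    """
    if not embeddings:
        raise ValueError("need at least one residue embedding")
    # One scalar score per residue: dot product with the scoring vector.
    scores = [sum(q * x for q, x in zip(query, row)) for row in embeddings]
    # Numerically stable softmax over residues.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the residue embeddings.
    dim = len(embeddings[0])
    pooled = [
        sum(w * row[d] for w, row in zip(weights, embeddings))
        for d in range(dim)
    ]
    return pooled, weights
```

The returned weights sum to 1 over the residues of a sequence, which is what makes per-residue attention profiles comparable across positions.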

Installation

pip install etap-clf

Or from source:

pip install "git+https://github.com/Sitgttish/summer26.git#subdirectory=eta_package"

ESM3 is a gated model. Before first use:

  1. Accept the license at https://huggingface.co/EvolutionaryScale/esm3-sm-open-v1
  2. Get a token at https://huggingface.co/settings/tokens
  3. Pass it via --hf-token or set HF_TOKEN in your environment.
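When driving ETAP from a script, step 3 can be handled with a small helper like the one below. It is a sketch, not part of the package: `resolve_hf_token` is a hypothetical name, and preferring an explicit token over the HF_TOKEN environment variable is an assumed precedence.

```python
import os
from typing import Optional


def resolve_hf_token(explicit: Optional[str] = None) -> str:
    """Return an explicitly passed token, else the HF_TOKEN environment variable."""
    token = explicit or os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError(
            "No Hugging Face token found: pass one explicitly or set HF_TOKEN."
        )
    return token
```

The result can then be supplied wherever a token is expected, for example as the hf_token argument shown in the Python API section.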

Usage

Training

etap --train positive.fasta negative.fasta ./model_output/

Outputs saved to ./model_output/:

  • best_model.pth — model checkpoint
  • training_history.csv — epoch-level loss and val-AUC
  • test_metrics.csv — final accuracy, AUC, avg-precision, sensitivity, specificity
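For reference, three of the metrics in test_metrics.csv have simple textbook definitions, sketched below from a confusion matrix. This is not ETAP's own code, and `binary_metrics` is a hypothetical helper. AUC and average precision are left out because they need predicted probabilities rather than hard labels.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, sensitivity and specificity from 0/1 labels and 0/1 predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num: int, den: int) -> float:
        # Undefined ratios (e.g. no positives at all) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),  # true positive rate
        "specificity": ratio(tn, tn + fp),  # true negative rate
    }
```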

Optional flags:

--epochs 30            Max training epochs (default: 30)
--patience 10          Early stopping patience on val-AUC (default: 10)
--batch-size 64        ETA training batch size (default: 64)
--embed-batch-size 16  ESM3 embedding batch size; reduce if you run out of memory (default: 16)
--lr 3e-4              Learning rate (default: 3e-4)
--proj-dim 256         Hidden dimension (default: 256)
--num-layers 4         Transformer encoder layers (default: 4)
--cache-dir ./cache    Reuse ESM3 embedding cache across runs
--hf-token TOKEN       Hugging Face token
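The interaction between --epochs and --patience can be illustrated with a small sketch of early stopping on validation AUC. It assumes that any strict improvement resets the patience counter; ETAP's training loop may apply a different rule (for example a minimum improvement threshold). The name `early_stopping_epoch` is hypothetical.

```python
from typing import Iterable, Tuple


def early_stopping_epoch(val_aucs: Iterable[float], patience: int = 10) -> Tuple[int, float]:
    """Return (best_epoch, best_auc), with epochs counted from 1.

    Iteration stops once `patience` consecutive epochs fail to improve
    on the best validation AUC seen so far.
    """
    best_auc = float("-inf")
    best_epoch = 0
    stale = 0  # epochs since the last improvement
    for epoch, auc in enumerate(val_aucs, start=1):
        if auc > best_auc:
            best_auc, best_epoch, stale = auc, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_epoch, best_auc
```

Under this rule, a low patience can stop training before a later improvement is reached, so a larger value trades extra training time for a better chance of finding the best epoch.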

Inference

etap --eval model_output/best_model.pth new_sequences.fasta ./results.csv

Output CSV columns: header, gene, prob_positive, predicted_label
If FASTA headers contain |label=1 or |label=0, full metrics are reported automatically.
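The results file can be post-processed with the standard library. The sketch below assumes the CSV has a header row with exactly the column names listed above and that prob_positive parses as a float; `top_hits` and its `min_prob` threshold are hypothetical and not part of ETAP.

```python
import csv
from typing import Dict, List


def top_hits(csv_path: str, min_prob: float = 0.9) -> List[Dict[str, str]]:
    """Return rows with prob_positive >= min_prob, highest probability first."""
    with open(csv_path, newline="") as handle:
        rows = [
            row
            for row in csv.DictReader(handle)
            if float(row["prob_positive"]) >= min_prob
        ]
    return sorted(rows, key=lambda row: float(row["prob_positive"]), reverse=True)
```

Each returned row is a dictionary keyed by column name, so `row["gene"]` gives the gene extracted from the FASTA header.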

Attention analysis (optional)

etap --eval best_model.pth sequences.fasta ./results.csv --gene-analyze

Saves the following plots to ./analysis/:

  1. attn_aa_analysis.png — mean attention per amino acid type (plus class enrichment if labels are present)
  2. attn_motifs.png — top high-attention 5-mer motifs
  3. attn_gene_heatmap.png — gene-level attention heatmap
  4. attn_position.png — positional attention profile (N→C terminus)
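As an illustration of what the first two plots summarize, the sketch below computes mean attention per amino acid type and ranks 5-mers by their mean attention for a single sequence. How ETAP actually aggregates attention across sequences and scores motifs is not documented here, so both functions and their scoring rules are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def mean_attention_per_residue_type(sequence: str, weights: Sequence[float]) -> Dict[str, float]:
    """Average the per-residue attention weight for each amino acid letter."""
    if len(sequence) != len(weights):
        raise ValueError("sequence and weights must have the same length")
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for aa, w in zip(sequence, weights):
        totals[aa] += w
        counts[aa] += 1
    return {aa: totals[aa] / counts[aa] for aa in totals}


def top_kmers(sequence: str, weights: Sequence[float], k: int = 5, n: int = 3) -> List[Tuple[str, float]]:
    """Rank all k-mers of one sequence by mean attention over their window."""
    if len(sequence) != len(weights):
        raise ValueError("sequence and weights must have the same length")
    scored = [
        (sequence[i:i + k], sum(weights[i:i + k]) / k)
        for i in range(len(sequence) - k + 1)
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:n]
```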

FASTA header conventions

Gene name is extracted automatically:

  • UniProt format: >sp|P12345|GENE_HUMAN → gene = GENE
  • Generic format: first token before a space, |, or _
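The two rules above can be sketched with regular expressions as follows. This is an approximation for checking your own headers, not ETAP's parser: `extract_gene` is a hypothetical name, and accepting the `tr|` (TrEMBL) prefix alongside `sp|` is an added assumption.

```python
import re

# UniProt-style header: db|ACCESSION|GENE_SPECIES (the "tr" prefix is an assumption).
_UNIPROT = re.compile(r"^(?:sp|tr)\|[^|]+\|([^_\s|]+)_")


def extract_gene(header: str) -> str:
    """Extract a gene name from a FASTA header line (with or without '>')."""
    header = header.lstrip(">").strip()
    match = _UNIPROT.match(header)
    if match:
        return match.group(1)
    # Generic rule: first token before a space, '|' or '_'.
    return re.split(r"[ |_]", header, maxsplit=1)[0]
```

For example, `extract_gene(">sp|P12345|GENE_HUMAN")` returns `"GENE"`, and a header such as `>SEQID|label=1` falls through to the generic rule and yields `"SEQID"`.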

Labels (for metric reporting in eval):

>SEQID|label=1    positive class (e.g. ferroptosis-positive)
>SEQID|label=0    negative control
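To check that your headers carry labels in this form before running evaluation, a small parser along these lines can help. It is a sketch under the convention shown above; `parse_label` is a hypothetical helper and ETAP's own parsing may be stricter or looser.

```python
import re
from typing import Optional

# Matches "|label=0" or "|label=1" not followed by another digit.
_LABEL = re.compile(r"\|label=([01])(?![0-9])")


def parse_label(header: str) -> Optional[int]:
    """Return 0 or 1 if the header carries a |label= tag, else None."""
    match = _LABEL.search(header)
    return int(match.group(1)) if match else None
```

Headers without a tag return None, which corresponds to the case where only predictions, and no metrics, can be reported.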

Python API

from etap import ETA, run_training, run_eval

# Training
ckpt_path, metrics = run_training(
    pos_fasta='positive.fasta',
    neg_fasta='negative.fasta',
    output_dir='./output/',
    hf_token='hf_...',
)

# Inference
results = run_eval(
    model_path='./output/best_model.pth',
    sequences_fasta='new_seqs.fasta',
    output_path='./results.csv',
    gene_analyze=True,
)
