ESM3-based Transformer Attention Protein classifier for binary protein sequence classification
ETAP — ESM3-based Transformer Attention Protein classifier
Binary protein sequence classifier built on ESM3 per-residue embeddings with a learned attention pooling layer.
Designed for any study requiring positive/negative classification of protein sequences.
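To make "learned attention pooling" concrete, here is a minimal pure-Python sketch: each residue embedding gets a scalar score, the scores are softmax-normalised over the sequence, and the pooled vector is the weighted sum of the residue embeddings. The function name `attention_pool` and the single scoring vector `w` are illustrative assumptions. ETAP's actual layer may differ (for example, an MLP scorer or multiple heads) and would normally be written with a tensor library rather than plain lists.

```python
import math

def attention_pool(embeddings, w):
    """Pool per-residue embeddings (L x D list of lists) into one D-vector.

    `w` is a hypothetical learned scoring vector of length D.
    Returns (pooled_vector, attention_weights).
    """
    # One scalar score per residue: dot product with the scoring vector.
    scores = [sum(x * wi for x, wi in zip(row, w)) for row in embeddings]
    # Numerically stable softmax over the sequence dimension.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the residue embeddings.
    dim = len(embeddings[0])
    pooled = [
        sum(a * row[d] for a, row in zip(weights, embeddings))
        for d in range(dim)
    ]
    return pooled, weights

# Toy example: 3 residues with 2-dimensional embeddings.
# An all-zero scoring vector gives uniform attention, i.e. mean pooling.
pooled, weights = attention_pool(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], w=[0.0, 0.0]
)
```

The attention weights returned here are the kind of per-residue values that an attention analysis can later inspect.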
Installation
pip install etap-clf
Or from source:
pip install "git+https://github.com/Sitgttish/summer26.git#subdirectory=eta_package"
ESM3 is a gated model. Before first use:
- Accept the license at https://huggingface.co/EvolutionaryScale/esm3-sm-open-v1
- Get a token at https://huggingface.co/settings/tokens
- Pass it via --hf-token or set HF_TOKEN in your environment.
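If you call the package from your own scripts, the two ways of supplying a token can be combined as in the sketch below. The helper name `resolve_hf_token` is hypothetical, and the precedence shown (explicit argument first, then the HF_TOKEN environment variable) is an assumption, not documented ETAP behaviour.

```python
import os

def resolve_hf_token(cli_token=None, env=None):
    """Return an explicitly passed token if given, else the HF_TOKEN variable.

    `env` defaults to os.environ; it is a parameter only to make testing easy.
    """
    env = os.environ if env is None else env
    token = cli_token or env.get("HF_TOKEN")
    if not token:
        raise RuntimeError(
            "No Hugging Face token found: pass --hf-token or set HF_TOKEN."
        )
    return token
```

Failing early with a clear message is usually friendlier than letting the gated model download fail later.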
Usage
Training
etap --train positive.fasta negative.fasta ./model_output/
Outputs saved to ./model_output/:
- best_model.pth: model checkpoint
- training_history.csv: epoch-level loss and val-AUC
- test_metrics.csv: final accuracy, AUC, average precision, sensitivity, specificity
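As a rough illustration of how the training history could be inspected, the sketch below finds the epoch with the highest validation AUC. The column names `epoch` and `val_auc` are assumptions (check the header of your own training_history.csv), and the example data is made up.

```python
import csv
import io

def best_epoch(csv_text, metric="val_auc"):
    """Return (epoch, value) for the row with the highest `metric`.

    Column names are assumed; adjust them to match the real file.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    best = max(rows, key=lambda r: float(r[metric]))
    return int(best["epoch"]), float(best[metric])

# Made-up example data, not real training output.
example = "epoch,loss,val_auc\n1,0.69,0.71\n2,0.55,0.83\n3,0.50,0.80\n"
epoch, auc = best_epoch(example)
```

For a real run, pass the file contents instead, e.g. `best_epoch(open("./model_output/training_history.csv").read())`.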
Optional flags:
--epochs 30             Max training epochs (default: 30)
--patience 10           Early stopping patience on val-AUC (default: 10)
--batch-size 64         ETA training batch size (default: 64)
--embed-batch-size 16   ESM3 embedding batch size; reduce if OOM (default: 16)
--lr 3e-4               Learning rate (default: 3e-4)
--proj-dim 256          Hidden dimension (default: 256)
--num-layers 4          Transformer encoder layers (default: 4)
--cache-dir ./cache     Reuse ESM3 embedding cache across runs
--hf-token TOKEN        Hugging Face token
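The interaction between --epochs and --patience can be sketched as follows. This is a generic early-stopping rule (stop after `patience` consecutive epochs without a new best val-AUC), written under the assumption that ETAP follows the common convention; the function name and the exact tie-breaking are illustrative.

```python
def early_stopping_epoch(val_aucs, patience=10):
    """Return the 1-based epoch at which training would stop.

    Training stops once `patience` consecutive epochs pass without a new
    best val-AUC, or when the supplied epochs run out.
    """
    best = float("-inf")
    since_best = 0
    for epoch, auc in enumerate(val_aucs, start=1):
        if auc > best:
            best, since_best = auc, 0  # new best: reset the counter
        else:
            since_best += 1
            if since_best >= patience:
                return epoch
    return len(val_aucs)

# Made-up val-AUC curve: the best so far is at epoch 2, so with patience=2
# training stops after epochs 3 and 4 both fail to improve on it.
stop = early_stopping_epoch(
    [0.70, 0.80, 0.79, 0.78, 0.81, 0.80, 0.80], patience=2
)
```

Note that a small patience can stop before a later improvement (epoch 5 in the toy curve), which is the usual trade-off when lowering this value.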
Inference
etap --eval model_output/best_model.pth new_sequences.fasta ./results.csv
Output CSV columns: header, gene, prob_positive, predicted_label
If FASTA headers contain |label=1 or |label=0, full metrics are reported automatically.
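For reference, the sensitivity, specificity, and accuracy figures can be derived from true and predicted labels as in this sketch. It uses the standard definitions; the function name is hypothetical and this is not taken from ETAP's source.

```python
def binary_metrics(y_true, y_pred):
    """Return (sensitivity, specificity, accuracy) for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Sensitivity: fraction of true positives that were recovered.
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: fraction of true negatives that were recovered.
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    acc = (tp + tn) / len(pairs) if pairs else float("nan")
    return sens, spec, acc

# Toy labels: 3 positives (2 found), 2 negatives (1 found).
sens, spec, acc = binary_metrics([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
```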
Attention analysis (optional)
etap --eval best_model.pth sequences.fasta ./results.csv --gene-analyze
Saves five plots to ./analysis/:
- attn_aa_analysis.png: mean attention per amino acid type (plus class enrichment if labels are present)
- attn_motifs.png: top high-attention 5-mer motifs
- attn_gene_heatmap.png: gene-level attention heatmap
- attn_position.png: positional attention profile (N→C terminus)
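One plausible way to rank 5-mer motifs by attention is sketched below: slide a window over each sequence, average the per-residue attention inside the window, and then average across all occurrences of the same k-mer. This scoring scheme and the name `top_kmers` are assumptions for illustration; the package may rank motifs differently.

```python
from collections import defaultdict

def top_kmers(sequences, attentions, k=5, top_n=3):
    """Rank k-mers by mean attention across all their occurrences.

    `attentions[i][j]` is the attention weight of residue j in sequence i.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for seq, attn in zip(sequences, attentions):
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            totals[kmer] += sum(attn[i:i + k]) / k  # mean over the window
            counts[kmer] += 1
    means = {kmer: totals[kmer] / counts[kmer] for kmer in totals}
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Toy input: one 6-residue sequence with made-up attention weights.
top = top_kmers(["ACDEFG"], [[0.1, 0.1, 0.2, 0.2, 0.2, 0.2]])
```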
FASTA header conventions
Gene name is extracted automatically:
- UniProt format: >sp|P12345|GENE_HUMAN → gene = GENE
- Generic: first token before a space, |, or _
Labels (for metric reporting in eval):
>SEQID|label=1 ferroptosis-positive
>SEQID|label=0 negative control
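The header conventions above can be sketched as a small parser. This is an illustrative reimplementation of the rules as described, not ETAP's own code; the function name `parse_header` is hypothetical, and accepting the `tr|` prefix alongside `sp|` is an assumption.

```python
import re

def parse_header(header):
    """Return (gene, label) from a FASTA header; label is None if absent."""
    header = header.lstrip(">").strip()
    # Optional label annotation such as "|label=1".
    m = re.search(r"\|label=([01])\b", header)
    label = int(m.group(1)) if m else None
    # UniProt style: db|ACCESSION|GENE_SPECIES
    u = re.match(r"(?:sp|tr)\|[^|]+\|([^_|\s]+)_", header)
    if u:
        return u.group(1), label
    # Generic: first token before a space, "|" or "_".
    gene = re.split(r"[ |_]", header, maxsplit=1)[0]
    return gene, label

uniprot = parse_header(">sp|P12345|GENE_HUMAN")
positive = parse_header(">SEQID|label=1 ferroptosis-positive")
negative = parse_header(">SEQID|label=0 negative control")
```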
Python API
from etap import ETA, run_training, run_eval
# Training
ckpt_path, metrics = run_training(
    pos_fasta='positive.fasta',
    neg_fasta='negative.fasta',
    output_dir='./output/',
    hf_token='hf_...',
)
# Inference
results = run_eval(
    model_path='./output/best_model.pth',
    sequences_fasta='new_seqs.fasta',
    output_path='./results.csv',
    gene_analyze=True,
)
File details
Details for the file etap_clf-0.1.1.tar.gz.
File metadata
- Download URL: etap_clf-0.1.1.tar.gz
- Upload date:
- Size: 16.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 585158aa4710524f4de94ca4830b2da3c6cc1a77911c3199e8f68f23dc971429 |
| MD5 | 6338d6c1da9efa5a1817b29f85cdc712 |
| BLAKE2b-256 | 269cd842fe20aeac2c65a66bf2cb7097f7f703cc9769636c7344fa09cf88c5ec |
File details
Details for the file etap_clf-0.1.1-py3-none-any.whl.
File metadata
- Download URL: etap_clf-0.1.1-py3-none-any.whl
- Upload date:
- Size: 18.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8d44b4dce8b0590358d90975d2a1e8785b9798d00152964c086faa2f518a2bf6 |
| MD5 | be2ffba05a6eb9bf70ab368f329a6a9a |
| BLAKE2b-256 | b51c2bc58e501d61f9a2bf58fe910a8826f54f592acff2133427b6235c7ff0dc |