A package for training SCARSE on small sets of peptide sequences and then using the trained model to predict properties of unseen peptides, which makes SCARSE well suited for AI-driven peptide engineering.
SCARSE: Small-sample Classification And Regression Solution for low-resource peptide Engineering
Abstract
Reliable estimation of downstream performance in low-data peptide machine learning is critical for guiding early-stage AI-driven peptide engineering, yet it is often unclear how to assess whether a model will be effective in iterative discovery settings. Here, we show that the cross-validation (CV) R² score can serve as a simple and robust proxy for predicting active learning workflow performance, enabling early-stage evaluation of model suitability for sequential peptide optimization. To support this, we introduce SCARSE, a machine learning framework combining ESM-2 protein language model embeddings with Gaussian process regression and extremely randomized trees classification, designed for low-resource peptide property prediction (20–500 training samples). We benchmark SCARSE across 23 peptide and small-protein datasets covering substitution and indel variants, antimicrobial peptides, cell-penetrating peptides, and toxic/non-toxic peptides. The protein language model approach significantly outperforms a hand-engineered descriptor baseline on substitution and indel tasks, while the two approaches achieve comparable performance on shorter, non-mutant peptide datasets where simpler descriptors capture enough of the signal. In simulated active learning workflows, SCARSE consistently outperforms baseline and random sampling strategies. Notably, we demonstrate that CV R² computed from as few as 50 labeled peptides can be sufficient to estimate final active learning endpoint performance, providing a practical, data-efficient criterion for deciding whether a given dataset combined with SCARSE is suitable for iterative peptide discovery. SCARSE is released as a pip package and is available via HuggingFace Spaces to facilitate integration into peptide engineering workflows.
How SCARSE works
SCARSE is designed for peptide property prediction in low-data regimes by combining protein language model embeddings with classical machine learning methods.
The workflow consists of the following steps:
- Input data
  - A CSV file containing peptide sequences and one or more target variables.
  - The sequence column (seq_col) should contain amino acid sequences.
  - The target column(s) (score_col) contain regression values or class labels.
- Sequence embedding
  - Sequences are converted into numerical representations using the ESM-2 protein language model.
- Model selection
  - Depending on the task:
    - Regression → Gaussian Process Regression
    - Classification → Extremely Randomized Trees
  - These models are chosen for robustness in small-sample settings.
- Hyperparameter optimization
  - Models are tuned using cross-validation and Optuna-based optimization.
  - The number of folds and optimization trials can be controlled by the user.
- Training output
  - Cross-validation performance metrics are returned.
  - A trained model environment is stored internally and reused for prediction.
- Prediction
  - New sequences are embedded using the same pipeline.
  - The trained model generates predictions for each target variable.
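The embed, cross-validate, and score loop described in the steps above can be illustrated with the standard library alone. This is a minimal sketch and not SCARSE's implementation: it swaps ESM-2 embeddings for a 20-dimensional amino-acid composition vector and swaps Gaussian process regression for a nearest-neighbour average, and it omits Optuna tuning entirely. The function names (`embed`, `knn_predict`, `r2_score`, `cross_val_r2`) are hypothetical and exist only to show how sequences are embedded, split into folds, and scored with R².

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def embed(sequence):
    """Stand-in embedding: fraction of each standard amino acid (NOT ESM-2)."""
    return [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]


def knn_predict(train_x, train_y, query, k=3):
    """Stand-in regressor (NOT a Gaussian process): mean of the k nearest targets."""
    nearest = sorted(
        zip(train_x, train_y), key=lambda pair: math.dist(pair[0], query)
    )[:k]
    return sum(y for _, y in nearest) / len(nearest)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def cross_val_r2(sequences, scores, n_folds=5, seed=0):
    """Pool out-of-fold predictions across n_folds and return a single R² value."""
    features = [embed(s) for s in sequences]
    order = list(range(len(sequences)))
    random.Random(seed).shuffle(order)
    # Deal the shuffled indices into n_folds roughly equal folds.
    folds = [order[i::n_folds] for i in range(n_folds)]
    predictions = [0.0] * len(sequences)
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in order if i not in held]
        train_x = [features[i] for i in train_idx]
        train_y = [scores[i] for i in train_idx]
        for i in held_out:
            predictions[i] = knn_predict(train_x, train_y, features[i])
    return r2_score(scores, predictions)
```

Pooling the out-of-fold predictions before computing R² is one common convention; averaging per-fold R² values is another, and the text above does not say which one SCARSE reports.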
Notes and best practices
- Call order matters
  You must run scarse.train() before calling scarse.pred(), as the trained model is stored internally.
- Data quality is important
  - Ensure there are no missing or empty sequences.
  - Use consistent formatting (standard amino acid codes: ACDEFGHIKLMNPQRSTVWY).
- Small datasets are supported
  SCARSE has been evaluated on datasets as small as ~20 samples, but performance generally improves with more data.
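The data-quality checks above can be run before training. The sketch below is a hypothetical helper, not part of the SCARSE API: `find_invalid_rows` reads CSV text and reports rows whose sequence is empty or contains characters outside the 20 standard amino acid codes. It treats lowercase letters as non-standard, which is an assumption; whether SCARSE accepts lowercase input is not stated here.

```python
import csv
import io

STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def find_invalid_rows(csv_text, seq_col="sequence"):
    """Return (row_number, reason) pairs for missing or non-standard sequences."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Row 1 is the header, so the first data row is reported as row 2.
    for row_number, row in enumerate(reader, start=2):
        sequence = (row.get(seq_col) or "").strip()
        if not sequence:
            problems.append((row_number, "empty sequence"))
            continue
        unknown = sorted(set(sequence) - STANDARD_AMINO_ACIDS)
        if unknown:
            problems.append(
                (row_number, "non-standard characters: " + "".join(unknown))
            )
    return problems
```

An empty result means every row passed both checks; any reported row can be fixed or dropped before the file is handed to scarse.train().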
Tested for Python version
- Python version == 3.12.10
Setup
pip install scarse
Usage
For training on a regression problem:
import scarse
scarse.train(data_path="../app/train.csv",
             classification=False,
             seq_col="sequence",
             score_col=["score"])
For training on a classification problem:
import scarse
scarse.train(data_path="../app/train.csv",
             classification=True,
             seq_col="sequence",
             score_col=["classes"])
For predicting after the model has been trained:
df_pred = scarse.pred(data_path="../app/test.csv", seq_col="sequence")
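The calls above can be combined into one script that also prepares the input file. This is a sketch under stated assumptions: `write_peptide_csv` and `train_then_predict` are hypothetical helpers, and the structure of the value returned by scarse.train() is not documented here, so it is passed through unchanged. Only the keyword arguments shown in the usage examples are used.

```python
import csv


def write_peptide_csv(path, rows, columns=("sequence", "score")):
    """Write (sequence, target) rows in the CSV layout expected as input."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(columns)
        writer.writerows(rows)


def train_then_predict(train_csv, test_csv):
    """Train first, then predict: scarse.pred() reuses the stored model."""
    # Imported lazily so the CSV helper above works without SCARSE installed.
    import scarse

    # Cross-validation metrics are returned by train(); their exact structure
    # is an unknown here, so `metrics` is treated as opaque.
    metrics = scarse.train(
        data_path=train_csv,
        classification=False,
        seq_col="sequence",
        score_col=["score"],
    )
    predictions = scarse.pred(data_path=test_csv, seq_col="sequence")
    return metrics, predictions
```

Keeping both calls inside one function makes the required order explicit, since prediction depends on the model stored by the preceding training call.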
Tutorials
See the following tutorial, structured as a Python notebook:
Correlation with active learning end-point performance
Below we illustrate the relation between the CV R² score and end-point active learning performance.
The y-axis displays how many times better SCARSE-guided active learning performs than random sampling, measured by the accumulation of the top 10% of peptides.
By comparing the CV R² score of your data with the figure below that corresponds to your dataset size, you can get an indication of how suitable your data, combined with SCARSE, is for active learning peptide engineering.
Note that this can only be used as a guide for evaluating performance on regression problems.
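The "times better than random" quantity on the y-axis can be read as an enrichment factor. The exact computation behind the figures is not given here, so the sketch below is one plausible definition and an assumption: it counts how many of the selected peptides fall in the top 10% of the pool by score and divides by the number expected from the same number of random picks. The function name and arguments are hypothetical.

```python
def top_fraction_enrichment(pool_scores, selected_indices, top_fraction=0.10):
    """Ratio of top-fraction hits among selected peptides to the random expectation."""
    n_pool = len(pool_scores)
    n_top = max(1, round(n_pool * top_fraction))
    # Indices of the highest-scoring peptides in the whole pool.
    ranked = sorted(range(n_pool), key=lambda i: pool_scores[i], reverse=True)
    top_set = set(ranked[:n_top])
    selected = set(selected_indices)
    hits = len(top_set & selected)
    # Uniform random sampling finds top peptides in proportion to their share.
    expected_random = len(selected) * n_top / n_pool
    return hits / expected_random
```

A value of 1.0 means the selection did no better than random sampling, while larger values mean proportionally more top peptides were found for the same labeling budget.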
Citation
Coming soon!
Download files
File details
Details for the file scarse-1.0.0.tar.gz.
File metadata
- Download URL: scarse-1.0.0.tar.gz
- Upload date:
- Size: 18.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c91ad8a9d03753563bbde580775bcaa72b5cbcd531e8b755f30e939e2f0c54eb |
| MD5 | f746a87cdf5d1d0ac0e364002f46011e |
| BLAKE2b-256 | f9bfbad11723dfba3460592a37c7445777b81f7a37219261e1ac1c2f912401f2 |
File details
Details for the file scarse-1.0.0-py3-none-any.whl.
File metadata
- Download URL: scarse-1.0.0-py3-none-any.whl
- Upload date:
- Size: 15.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 56e6b7c623c7d8930f2034eb55ed26e7d2f432c96ff74366dae7d4bbb07ae279 |
| MD5 | f2ef22d7663dc4d41c39a6fcf308aa6c |
| BLAKE2b-256 | e6ee9d63a1d35a6eeea14d4bcc0617230dd59198417ca3877d95de402b6af390 |