A package for training SCARSE models on small sets of peptide sequences. The trained models can then be used to predict properties of unseen peptides, which makes SCARSE well suited to AI-driven peptide engineering.

Project description

SCARSE: Small-sample Classification And Regression Solution for low-resource peptide Engineering

[Figure: workflow diagram]

Abstract

Reliable estimation of downstream performance in low-data peptide machine learning is critical for guiding early-stage AI-driven peptide engineering, yet it is often unclear how to assess whether a model will be effective in iterative discovery settings. Here, we show that the cross-validation R² score can serve as a simple and robust proxy for predicting active learning workflow performance, enabling early-stage evaluation of model suitability for sequential peptide optimization. To support this, we introduce SCARSE, a machine learning framework combining ESM-2 protein language model embeddings with Gaussian process regression and extremely randomized trees classification, designed for low-resource peptide property prediction (20–500 training samples). We benchmark SCARSE across 23 peptide and small-protein datasets covering substitution and indel variants, antimicrobial peptides, cell-penetrating peptides, and toxic/non-toxic peptides. The protein language model approach significantly outperforms a hand-engineered descriptor baseline on substitution and indel tasks, while the two approaches achieve comparable performance on shorter peptide non-mutant datasets where simpler descriptors capture enough of the signal. In simulated active learning workflows, SCARSE consistently outperforms baseline and random sampling strategies, and notably we demonstrate that CV R² computed from as few as 50 labeled peptides can be sufficient to estimate final active learning endpoint performance, providing a practical, data-efficient criterion for deciding whether a given dataset combined with SCARSE is suitable for iterative peptide discovery. SCARSE is released as a pip package and is available via HuggingFace Spaces to facilitate integration into peptide engineering workflows.

How SCARSE works

SCARSE is designed for peptide property prediction in low-data regimes by combining protein language model embeddings with classical machine learning methods.

The workflow consists of the following steps:

1. Input data
   - A CSV file containing peptide sequences and one or more target variables.
   - The sequence column (seq_col) should contain amino acid sequences.
   - The target column(s) (score_col) contain regression values or class labels.
2. Sequence embedding
   - Sequences are converted into numerical representations using the ESM-2 protein language model.
3. Model selection
   - Depending on the task:
     - Regression → Gaussian Process Regression
     - Classification → Extremely Randomized Trees
   - These models are chosen for robustness in small-sample settings.
4. Hyperparameter optimization
   - Models are tuned using cross-validation and Optuna-based optimization.
   - The number of folds and optimization trials can be controlled by the user.
5. Training output
   - Cross-validation performance metrics are returned.
   - A trained model environment is stored internally and reused for prediction.
6. Prediction
   - New sequences are embedded using the same pipeline.
   - The trained model generates predictions for each target variable.
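
The model-selection and cross-validation steps above can be illustrated with a minimal scikit-learn sketch. This is not SCARSE's internal code. To stay lightweight, it replaces the ESM-2 embeddings with a simple amino acid composition vector, uses synthetic random sequences and toy targets, and skips the Optuna tuning step. The helper `composition_features`, the kernel choice, and all hyperparameter values are assumptions made for illustration only.

```python
import random

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_features(sequences):
    """Stand-in for ESM-2 embeddings: per-sequence amino acid frequencies."""
    return np.array(
        [[seq.count(aa) / len(seq) for aa in AMINO_ACIDS] for seq in sequences]
    )


# Synthetic toy data (not real peptides): 60 random 12-mers.
rng = random.Random(0)
sequences = ["".join(rng.choices(AMINO_ACIDS, k=12)) for _ in range(60)]
X = composition_features(sequences)

# Toy regression target: fraction of K/R residues plus a little noise.
noise = np.random.default_rng(0).normal(scale=0.02, size=len(sequences))
y_reg = X[:, AMINO_ACIDS.index("K")] + X[:, AMINO_ACIDS.index("R")] + noise
# Toy class labels: above or below the median target value.
y_cls = (y_reg > np.median(y_reg)).astype(int)

# Regression branch: Gaussian process regression scored by 5-fold CV R².
gpr = GaussianProcessRegressor(
    kernel=RBF() + WhiteKernel(), normalize_y=True, random_state=0
)
r2_scores = cross_val_score(
    gpr, X, y_reg,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="r2",
)

# Classification branch: extremely randomized trees scored by 5-fold CV accuracy.
etc = ExtraTreesClassifier(n_estimators=200, random_state=0)
acc_scores = cross_val_score(
    etc, X, y_cls,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
)

print(f"CV R2: {r2_scores.mean():.2f}, CV accuracy: {acc_scores.mean():.2f}")
```

In SCARSE itself the features come from ESM-2 and the hyperparameters are tuned with Optuna, so scores from this sketch say nothing about the package's actual performance.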

Notes and best practices

Call order matters
- You must run scarse.train() before calling scarse.pred(), because the trained model is stored internally.
Data quality is important
- Ensure there are no missing or empty sequences.
- Use consistent formatting: the standard one-letter amino acid codes (ACDEFGHIKLMNPQRSTVWY).
Small datasets are supported
- SCARSE has been evaluated on datasets as small as ~20 samples, but performance generally improves with more data.
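
The data-quality points above can be checked before training with a few lines of plain Python. This is a sketch, not part of the scarse package: the function name `find_sequence_problems` is hypothetical, and treating lowercase letters as invalid is an assumption, since the package's own handling of lowercase input is not documented here.

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def find_sequence_problems(sequences):
    """Return (index, reason) pairs for entries that fail basic checks."""
    problems = []
    for i, seq in enumerate(sequences):
        # Catch None, empty strings, and whitespace-only entries.
        if seq is None or not str(seq).strip():
            problems.append((i, "missing or empty sequence"))
            continue
        # Anything outside the 20 standard one-letter codes is reported.
        unexpected = sorted(set(str(seq).strip()) - STANDARD_AMINO_ACIDS)
        if unexpected:
            problems.append((i, "non-standard characters: " + "".join(unexpected)))
    return problems


# Made-up example entries, chosen to trigger each check.
example = ["ACDEFGHIK", "", "acdk", "KLMNXPQ", None, "WYVTS"]
for index, reason in find_sequence_problems(example):
    print(index, reason)
```

Running a check like this on the sequence column before calling scarse.train() makes it easier to locate the offending rows.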

Tested for Python version

Python version == 3.12.10

Setup

pip install scarse

Usage

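For training on a regression problem:

import scarse

# Train on continuous target values (Gaussian process regression).
scarse.train(data_path="../app/train.csv",  # CSV with sequences and targets
             classification=False,          # regression task
             seq_col="sequence",            # column holding the peptide sequences
             score_col=["score"])           # list of target column(s)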

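For training on a classification problem:

import scarse

# Train on class labels (extremely randomized trees).
scarse.train(data_path="../app/train.csv",  # CSV with sequences and labels
             classification=True,           # classification task
             seq_col="sequence",            # column holding the peptide sequences
             score_col=["classes"])         # list of label column(s)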

For predicting after a model has been trained:

# Run scarse.train() first in the same session; pred() reuses the stored model.
df_pred = scarse.pred(data_path="../app/test.csv", seq_col="sequence")
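
The calls above expect a CSV whose column names match seq_col and score_col. A minimal sketch of writing such a file with the standard library is shown here. The sequences and scores are made-up placeholders that only illustrate the layout, the temporary file location is an arbitrary choice, and three rows are far fewer than a real training set would need.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical placeholder rows: made-up sequences and scores, for layout only.
rows = [
    {"sequence": "ACDEFGHIKL", "score": 0.12},
    {"sequence": "KLMNPQRSTV", "score": 0.87},
    {"sequence": "WYACDKLMRR", "score": 0.45},
]

csv_path = Path(tempfile.mkdtemp()) / "train.csv"
with csv_path.open("w", newline="") as handle:
    writer = csv.DictWriter(handle, fieldnames=["sequence", "score"])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back to confirm the header matches seq_col and score_col.
with csv_path.open(newline="") as handle:
    reader = csv.DictReader(handle)
    header = reader.fieldnames
    n_rows = sum(1 for _ in reader)

print(header, n_rows)
# With scarse installed, training would then follow the usage shown earlier:
# scarse.train(data_path=str(csv_path), classification=False,
#              seq_col="sequence", score_col=["score"])
```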

Tutorials

See the following tutorial, structured as a Python notebook:

tutorial.ipynb

Correlation with active learning end-point performance

Below we illustrate the relation between the CV R² score and end-point active learning performance.
The y-axis shows how many times better SCARSE-guided active learning performs than random sampling, measured by the accumulation of the top 10% of peptides.
By comparing the CV R² score of your data with the figure below that corresponds to your dataset size, you can get an indication of how suitable your data, combined with SCARSE, is for active learning peptide engineering.
Note that this guide applies only to regression problems.

[Figure: CV R² score versus end-point active learning performance]

Citation

Coming soon!

Release history

This version

1.0.0

Jun 26, 2026

Download files

Source Distribution

scarse-1.0.0.tar.gz (18.2 kB view details)

Uploaded Jun 26, 2026 Source

Built Distribution

scarse-1.0.0-py3-none-any.whl (15.7 kB view details)

Uploaded Jun 26, 2026 Python 3

File details

Details for the file scarse-1.0.0.tar.gz.

File metadata

Download URL: scarse-1.0.0.tar.gz
Upload date: Jun 26, 2026
Size: 18.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for scarse-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`c91ad8a9d03753563bbde580775bcaa72b5cbcd531e8b755f30e939e2f0c54eb`
MD5	`f746a87cdf5d1d0ac0e364002f46011e`
BLAKE2b-256	`f9bfbad11723dfba3460592a37c7445777b81f7a37219261e1ac1c2f912401f2`

File details

Details for the file scarse-1.0.0-py3-none-any.whl.

File metadata

Download URL: scarse-1.0.0-py3-none-any.whl
Upload date: Jun 26, 2026
Size: 15.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for scarse-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`56e6b7c623c7d8930f2034eb55ed26e7d2f432c96ff74366dae7d4bbb07ae279`
MD5	`f2ef22d7663dc4d41c39a6fcf308aa6c`
BLAKE2b-256	`e6ee9d63a1d35a6eeea14d4bcc0617230dd59198417ca3877d95de402b6af390`
