Metrics of immunological foreignness for candidate T-cell epitopes
WEIRDO
Widely Estimated Immunological Recognition and Detection of Outliers
A Python library for computing peptide foreignness scores: it predicts whether a peptide sequence is more likely to come from a pathogen (bacteria, virus) or from self (human, mammalian).
Overview
WEIRDO trains a multi-layer perceptron (MLP) on k-mer presence data from SwissProt to predict organism category membership. Given any peptide, it outputs:
- Category probabilities: likelihood of appearing in human, bacteria, viruses, mammals, etc.
- Foreignness score: max(pathogens) / (max(pathogens) + max(self))
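The formula above can be sketched in a few lines of plain Python. This is an illustration, not the library's implementation: the function name, the probability values, and the fallback of 0.5 for a zero denominator are all assumptions.

```python
def foreignness(probs, pathogen_categories, self_categories):
    """max(pathogens) / (max(pathogens) + max(self)) for one peptide."""
    max_pathogen = max(probs[c] for c in pathogen_categories)
    max_self = max(probs[c] for c in self_categories)
    denom = max_pathogen + max_self
    # Assumption: return a neutral 0.5 when both maxima are zero
    return max_pathogen / denom if denom > 0 else 0.5


# Illustrative category probabilities for a single peptide
example = {"human": 0.15, "mammals": 0.18, "viruses": 0.73, "bacteria": 0.21}
score = foreignness(example, ["bacteria", "viruses"], ["human", "mammals"])
print(round(score, 3))  # 0.73 / (0.73 + 0.18)
```

Because each input probability lies in [0, 1], the score also lies in [0, 1], with higher values indicating a more pathogen-like peptide.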
Quick Start
from weirdo.scorers import SwissProtReference, MLPScorer
# Load reference data (SwissProt 8-mers with organism labels)
ref = SwissProtReference().load()
# Define organism categories
categories = [
    'archaea', 'bacteria', 'fungi', 'human', 'invertebrates',
    'mammals', 'plants', 'rodents', 'vertebrates', 'viruses'
]
# Get training data: each 8-mer labeled with organism presence
peptides, labels = ref.get_training_data(
    target_categories=categories,
    multi_label=True,
    max_samples=200000  # Optional: sample for faster training
)
# Train the MLP
scorer = MLPScorer(k=8, hidden_layer_sizes=(256, 128, 64))
scorer.train(peptides, labels, target_categories=categories, epochs=200)
# Score new peptides (any length)
df = scorer.predict_dataframe(['MTMDKSEL', 'SIINFEKL', 'NLVPMVATV'])
print(df)
Output:
peptide human viruses bacteria mammals ... foreignness
MTMDKSEL 0.82 0.12 0.08 0.79 ... 0.127
SIINFEKL 0.15 0.73 0.21 0.18 ... 0.802
NLVPMVATV 0.31 0.68 0.15 0.35 ... 0.660
Installation
pip install weirdo
Download reference data (~2.5 GB compressed / ~7.5 GB uncompressed) for training:
weirdo data download
Training Data
WEIRDO uses pre-computed 8-mer data from SwissProt (~100M unique k-mers):
| Category | Description |
|---|---|
| human | Homo sapiens proteins |
| rodents | Mouse, rat proteins |
| mammals | Other mammals (dog, cow, primates, etc.) |
| vertebrates | Fish, birds, reptiles, amphibians |
| invertebrates | Insects, worms, mollusks |
| bacteria | Bacterial proteins |
| viruses | Viral proteins |
| archaea | Archaeal proteins |
| fungi | Fungal proteins |
| plants | Plant proteins |
Each 8-mer has True/False labels for each category, indicating whether it appears in proteins from that organism group.
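A minimal sketch of how such multi-label presence flags could be derived from labeled sequences. The function name `kmer_labels` and the toy sequences are hypothetical; the real reference data is pre-computed and this is not how WEIRDO necessarily builds it.

```python
from collections import defaultdict


def kmer_labels(proteins, categories, k=8):
    """Map each k-mer to a tuple of True/False flags, one per category."""
    seen = defaultdict(set)
    for sequence, category in proteins:
        # Slide a window of width k across the sequence
        for i in range(len(sequence) - k + 1):
            seen[sequence[i:i + k]].add(category)
    return {
        kmer: tuple(c in cats for c in categories)
        for kmer, cats in seen.items()
    }


# Toy strings for illustration only, not real SwissProt entries
toy_proteins = [
    ("MTMDKSELVQ", "human"),
    ("TMDKSELVQK", "mammals"),
    ("SIINFEKLTE", "vertebrates"),
]
toy_categories = ["human", "mammals", "vertebrates"]
labels = kmer_labels(toy_proteins, toy_categories)
print(labels["TMDKSELV"])  # shared by the human and mammals toy sequences
```

A k-mer found in several organism groups gets several True flags, which is why the training task is multi-label rather than single-class.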
Feature Extraction
The MLP uses 592 features extracted from each peptide:
Amino Acid Properties (48 features)
12 physicochemical properties × 4 statistics (mean, std, min, max):
- Hydropathy, hydrophilicity
- Mass, volume
- Polarity, pK side chain
- Accessible surface area (folded/unfolded)
- Local flexibility, refractivity
- Solvent exposed area, % exposed residues
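As a sketch of the "property × statistic" pattern, the snippet below summarizes one property over a peptide. It assumes the Kyte-Doolittle hydropathy scale and a population standard deviation; the scale and the std convention WEIRDO actually uses may differ, and `property_stats` is a hypothetical name.

```python
import statistics

# Kyte-Doolittle hydropathy index (assumed scale for this illustration)
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def property_stats(peptide, scale):
    """Mean, std, min, and max of one per-residue property."""
    # Residues missing from the scale (e.g. 'X') are skipped
    values = [scale[aa] for aa in peptide if aa in scale]
    return {
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),
        "min": min(values),
        "max": max(values),
    }


stats = property_stats("SIINFEKL", KYTE_DOOLITTLE)
print(stats)
```

Repeating this for 12 property scales yields the 12 × 4 = 48 features listed above.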
Structural Features (27 features)
- Secondary structure propensities (12): helix, sheet, turn × 4 stats
- Category fractions (9): positive/negative charged, hydrophobic, aromatic, aliphatic, polar, tiny, small, cysteine
- Charge features (4): net charge, charge transitions, max cluster, R/(R+K) ratio
- Disorder features (2): disorder/order promoting fractions
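The four charge features could be computed roughly as follows. The exact definitions are not given here, so this sketch assumes K/R count as +1, D/E as -1, a "transition" is any adjacent pair with differing charge, and a "cluster" is a run of consecutive charged residues. All names are hypothetical.

```python
from itertools import groupby

POSITIVE = set("KR")
NEGATIVE = set("DE")


def charge_features(peptide):
    # +1 for K/R, -1 for D/E, 0 otherwise (histidine treated as neutral here)
    charges = [1 if aa in POSITIVE else -1 if aa in NEGATIVE else 0
               for aa in peptide]
    # Assumed definition: adjacent positions whose charge value differs
    transitions = sum(a != b for a, b in zip(charges, charges[1:]))
    # Assumed definition: longest run of consecutive charged residues
    runs = [len(list(group))
            for charged, group in groupby(c != 0 for c in charges)
            if charged]
    n_r, n_k = peptide.count("R"), peptide.count("K")
    return {
        "net_charge": sum(charges),
        "charge_transitions": transitions,
        "max_cluster": max(runs, default=0),
        # Assumption: ratio is 0.0 when the peptide has neither R nor K
        "r_over_r_plus_k": n_r / (n_r + n_k) if (n_r + n_k) else 0.0,
    }


feats = charge_features("MTMDKSEL")
print(feats)
```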
Composition Features (420 features)
- Amino acid frequencies (20): fraction of each amino acid
- Dipeptide frequencies (400): fraction of each amino acid pair
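The count of 420 follows from 20 single-residue frequencies plus 20 × 20 ordered pairs. A minimal sketch, assuming alphabetical ordering of residues and overlapping adjacent pairs (the library's ordering and normalization may differ):

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def composition_features(peptide):
    """Return 20 amino acid fractions followed by 400 dipeptide fractions."""
    aa_counts = Counter(peptide)
    n = len(peptide)
    aa_freq = [aa_counts[aa] / n if n else 0.0 for aa in AMINO_ACIDS]

    # Overlapping adjacent pairs: a peptide of length n has n - 1 of them
    pair_counts = Counter(peptide[i:i + 2] for i in range(n - 1))
    n_pairs = max(n - 1, 0)
    dipep_freq = [pair_counts[dp] / n_pairs if n_pairs else 0.0
                  for dp in DIPEPTIDES]
    return aa_freq + dipep_freq


vector = composition_features("SIINFEKL")
print(len(vector))  # 20 + 400
```

For an 8-mer only 7 of the 400 dipeptide slots can be non-zero, so this block of features is very sparse.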
Sequence Statistics (12 features)
- Length, log-length, sqrt-length
- Unknown fraction, unique AA fraction
- Max run length, repeat fraction
- Entropy/complexity (entropy, effective AAs, Gini, top-2/maximum frequency)
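A few of these statistics can be sketched as below for a non-empty peptide. The choices here are assumptions: Shannon entropy in bits, "effective AAs" as 2 raised to the entropy, and natural log for log-length. The function name is hypothetical.

```python
import math
from collections import Counter
from itertools import groupby


def sequence_stats(peptide):
    """Length, diversity, and complexity statistics for a non-empty peptide."""
    n = len(peptide)
    freqs = [count / n for count in Counter(peptide).values()]
    entropy = -sum(p * math.log2(p) for p in freqs)
    return {
        "length": n,
        "log_length": math.log(n),
        "sqrt_length": math.sqrt(n),
        "unique_fraction": len(freqs) / n,
        # Longest stretch of one repeated residue
        "max_run": max(len(list(group)) for _, group in groupby(peptide)),
        "entropy": entropy,
        # Assumption: effective number of amino acids = 2 ** entropy
        "effective_aas": 2 ** entropy,
        "max_frequency": max(freqs),
    }


seq_stats = sequence_stats("SIINFEKL")
print(seq_stats)
```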
Reduced Alphabet Frequencies (80 features)
- Composition across common reduced alphabets (Murphy, GBMR, SDM, etc.)
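To illustrate the idea of reduced-alphabet composition, the sketch below collapses the 20 amino acids into four groups and reports each group's fraction. The grouping is a hypothetical one chosen for illustration; it is not the Murphy, GBMR, or SDM alphabet.

```python
from collections import Counter

# Hypothetical 4-group alphabet, for illustration only
GROUPS = {
    "hydrophobic": "AVLIMFWYC",
    "polar": "STNQGPH",
    "positive": "KR",
    "negative": "DE",
}
# Invert to a per-residue lookup: 'A' -> 'hydrophobic', etc.
RESIDUE_TO_GROUP = {aa: name for name, members in GROUPS.items() for aa in members}


def reduced_frequencies(peptide):
    """Fraction of residues falling in each reduced-alphabet group."""
    counts = Counter(RESIDUE_TO_GROUP[aa] for aa in peptide if aa in RESIDUE_TO_GROUP)
    total = sum(counts.values())
    return {name: (counts[name] / total if total else 0.0) for name in GROUPS}


reduced = reduced_frequencies("MTMDKSEL")
print(reduced)
```

Concatenating the fractions from several such alphabets gives a fixed-length block; the 80 features above presumably come from the combined group counts of the alphabets the library uses.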
Dipeptide Summary (5 features)
- Entropy, Gini, max/top2 frequency, homodipeptide fraction
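The five dipeptide summary values could look like the sketch below. It assumes "Gini" means Gini impurity (1 minus the sum of squared frequencies) and "top2" means the combined frequency of the two most common dipeptides; the library may define either differently.

```python
import math
from collections import Counter

SUMMARY_NAMES = ("entropy", "gini", "max_freq", "top2_freq", "homodipeptide_fraction")


def dipeptide_summary(peptide):
    """Summarize the distribution of overlapping dipeptides in a peptide."""
    pairs = [peptide[i:i + 2] for i in range(len(peptide) - 1)]
    if not pairs:
        return {name: 0.0 for name in SUMMARY_NAMES}
    total = len(pairs)
    freqs = sorted((c / total for c in Counter(pairs).values()), reverse=True)
    return {
        "entropy": -sum(p * math.log2(p) for p in freqs),
        # Assumption: Gini impurity rather than the Gini coefficient
        "gini": 1.0 - sum(p * p for p in freqs),
        "max_freq": freqs[0],
        "top2_freq": sum(freqs[:2]),
        # Pairs made of the same residue twice, e.g. 'II'
        "homodipeptide_fraction": sum(a == b for a, b in pairs) / total,
    }


summary = dipeptide_summary("SIINFEKL")
print(summary)
```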
API Reference
Training
from weirdo.scorers import SwissProtReference, MLPScorer
# Load reference
ref = SwissProtReference().load()
# Get training data
peptides, labels = ref.get_training_data(
    target_categories=['human', 'viruses', 'bacteria', 'mammals'],
    multi_label=True,
    max_samples=100000  # Optional: limit for memory
)
# Train
scorer = MLPScorer(
    k=8,
    hidden_layer_sizes=(256, 128, 64),
    activation='relu',
    alpha=0.0001,  # L2 regularization
)
scorer.train(
    peptides, labels,
    target_categories=['human', 'viruses', 'bacteria', 'mammals'],
    epochs=200,
    learning_rate=0.001
)
Prediction
# Category probabilities (sigmoid-activated)
probs = scorer.predict_proba(['MTMDKSEL'])
# Shape: (1, n_categories)
# Foreignness score
foreign = scorer.foreignness(
    ['MTMDKSEL'],
    pathogen_categories=['bacteria', 'viruses'],
    self_categories=['human', 'mammals', 'rodents']
)
# Returns: max(pathogens) / (max(pathogens) + max(self))
# Full DataFrame output (handles variable-length peptides)
df = scorer.predict_dataframe(['MTMDKSEL', 'SIINFEKL', 'NLVPMVATV'])
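Since the category probabilities are sigmoid-activated, each category is scored independently and a row of probabilities does not need to sum to 1 (unlike a softmax output). A small sketch with hypothetical logit values, independent of the library:

```python
import math


def sigmoid(x):
    """Logistic function mapping a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical raw network outputs for one peptide
logits = {"human": 1.5, "viruses": -2.0, "bacteria": -2.4, "mammals": 1.3}
category_probs = {name: sigmoid(z) for name, z in logits.items()}

for name, p in category_probs.items():
    print(f"{name}: {p:.3f}")
print("sum:", round(sum(category_probs.values()), 3))
```

This matches the multi-label training data, where an 8-mer can be present in both human and other mammalian proteins at once.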
Feature Extraction
# Extract features as DataFrame
df = scorer.features_dataframe(['MTMDKSEL', 'SIINFEKL'])
# Shape: (2, 593) - 592 features + peptide column
# Feature names
names = scorer.get_feature_names()
# ['hydropathy_mean', 'hydropathy_std', ..., 'dipep_YY']
Model Persistence
from weirdo import save_model, load_model, list_models
# Save trained model
save_model(scorer, 'my-foreignness-model')
# List saved models
for model in list_models():
    print(f"{model.name}: {model.scorer_type}")
# Load model
scorer = load_model('my-foreignness-model')
CLI
# Data management
weirdo data download # Download SwissProt reference
weirdo data list # Show data status
# Model management
weirdo models list # List trained models
weirdo models train --data train.csv --name my-model
weirdo models info my-model # Show model details
weirdo models available # List built-in downloadable pretrained models
weirdo models download NAME # Download pretrained weights by name
weirdo models download --url https://.../model.tar.gz --save-as my-model
# Scoring
weirdo score --model my-model MTMDKSEL SIINFEKL
Long-Run Training on Modal
Use scripts/train_modal_long_run.py to run long remote training and export
weights as a .tar.gz archive:
# One-time setup: seed full SwissProt CSV into Modal data volume
modal volume put weirdo-data-cache data/swissprot-8mers.csv downloads/swissprot-8mers.csv --force
modal run scripts/train_modal_long_run.py \
  --model-name swissprot-mlp-modal \
  --epochs 1000 \
  --max-samples 2000000 \
  --output-archive ./swissprot-mlp-modal.tar.gz
The script:
- trains MLPScorer remotely
- saves model files to a Modal volume
- packages weights into MODEL_NAME.tar.gz
- can return archive bytes to your local machine (--output-archive)
- reads SwissProt from --swissprot-path (default: /root/.weirdo/downloads/swissprot-8mers.csv)
- treats --max-samples 0 (default) as "use all available rows"
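The packaging step can be pictured with Python's standard tarfile module. This is only a sketch: the function name, the file name `weights.bin`, and the archive layout (one top-level directory named after the model) are assumptions, not the script's actual behavior.

```python
import tarfile
import tempfile
from pathlib import Path


def package_model(model_dir, archive_path):
    """Pack a model directory into a gzip-compressed tar archive."""
    model_dir = Path(model_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps paths inside the archive relative to the model name
        tar.add(model_dir, arcname=model_dir.name)
    return Path(archive_path)


# Demonstration with a throwaway directory and a placeholder weights file
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "my-model"
    demo_dir.mkdir()
    (demo_dir / "weights.bin").write_bytes(b"\x00" * 16)
    archive = package_model(demo_dir, Path(tmp) / "my-model.tar.gz")
    with tarfile.open(archive, "r:gz") as tar:
        names = sorted(tar.getnames())
print(names)
```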
Distributing Model Weights
Recommended flow:
- Train and export MODEL_NAME.tar.gz (Modal script above).
- Upload archive to a GitHub Release asset (or other HTTPS hosting).
- Download after install via CLI:
# Direct URL (no code change needed)
weirdo models download --url https://github.com/ORG/REPO/releases/download/vX.Y.Z/MODEL_NAME.tar.gz --save-as MODEL_NAME
If you want named built-in downloads (weirdo models download MODEL_NAME),
add an entry to PRETRAINED_MODELS in weirdo/model_manager.py and release a
new package version.
Architecture
weirdo/
├── scorers/
│ ├── mlp.py # MLPScorer with feature extraction
│ ├── swissprot.py # SwissProtReference (training data)
│ ├── config.py # Presets and configuration
│ ├── registry.py # Scorer registry
│ └── trainable.py # TrainableScorer base class
├── model_manager.py # Save/load trained models
├── amino_acid_properties.py # 12 AA property dictionaries
└── api.py # High-level functions
Citation
@software{weirdo,
title = {WEIRDO: Widely Estimated Immunological Recognition and Detection of Outliers},
author = {PIRL-UNC},
url = {https://github.com/pirl-unc/weirdo}
}
License
Apache License 2.0. See LICENSE for details.