A library for two-stage machine learning models with staggered feature arrival
StagecoachML
StagecoachML is a tiny library for building two-stage models when your features arrive in two batches at different times.
Think:
- Ad serving and recommendation: first score on user + context, then refine on creative/item + real-time signals.
- Per-customer privacy: a shared non-sensitive trunk, plus a per-customer head that uses private fields inside their own environment.
- Latency-sensitive inference: run a fast stage-1 model early in the request, and only run the heavier stage-2 model when needed.
StagecoachML encodes that pattern directly in the model interface instead of leaving it buried in infra and notebooks.
When should you use StagecoachML?
Use StagecoachML when:
- You can’t wait for all features before you have to start making decisions.
- Some features live in a different silo (e.g. customer’s infra) and must never hit the central model.
- You want to tune and evaluate the whole two-stage system as a single estimator (train/test/CV), while still being able to:
  - get stage-1 scores from early features, and
  - get refined scores once late features arrive.
If you have all your features at once and a single model is fine, this library is probably overkill. But if you live with staggered features, StagecoachML keeps the logic honest.
Core idea
A StagecoachML model splits features into two groups:
- Early features: available at stage 1 (e.g. user, context).
- Late features: only available at stage 2 (e.g. ad/creative/item, customer-side data).
You choose:
- a stage-1 estimator that sees only early features, and
- a stage-2 estimator that sees late features plus (optionally) the stage-1 prediction, and either:
  - learns to predict the residual y − ŷ₁, or
  - learns the final target directly.
At inference time you can:
- call predict_stage1(...)/predict_stage1_proba(...) when you only have early features; and
- call predict(...)/predict_proba(...) later when you have both.
Under the hood, you still train and cross-validate it like any other sklearn estimator.
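The residual pattern above can be sketched with plain scikit-learn, independent of StagecoachML's API. The synthetic data and coefficients below are purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_early = rng.normal(size=(200, 3))  # features available at stage 1
X_late = rng.normal(size=(200, 3))   # features that only arrive at stage 2
y = (X_early @ np.array([1.0, -2.0, 0.5])
     + X_late @ np.array([0.5, 1.5, -1.0])
     + rng.normal(scale=0.1, size=200))

# Stage 1: fit on early features only.
stage1 = LinearRegression().fit(X_early, y)
y1 = stage1.predict(X_early)

# Stage 2: fit on late features plus the stage-1 prediction, against the residual.
X2 = np.column_stack([X_late, y1])
stage2 = LinearRegression().fit(X2, y - y1)

# Final prediction adds the learned residual back onto the stage-1 score.
y_hat = y1 + stage2.predict(X2)
```

Stage 1 alone gives a usable early score; adding the stage-2 residual recovers the signal carried by the late features.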
Try it Online
Click the badge above to try StagecoachML directly in your browser, with interactive examples powered by Pyodide. It runs instantly with zero installation.
Installation
StagecoachML is a pure Python package that depends on NumPy, pandas, and scikit-learn.
```shell
pip install stagecoachml
```
Or install from source:
```shell
git clone https://github.com/finite-sample/stagecoachml.git
cd stagecoachml
pip install -e .
```
Import the estimators:
```python
from stagecoachml import StagecoachRegressor, StagecoachClassifier
```
Quick start
Regression example (diabetes dataset)
```python
from stagecoachml import StagecoachRegressor
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# Load data as a DataFrame
diabetes = load_diabetes(as_frame=True)
X = diabetes.frame.drop(columns=["target"])
y = diabetes.frame["target"]

# Split columns into "early" and "late" features
features = list(X.columns)
mid = len(features) // 2
early_features = features[:mid]  # pretend these arrive early
late_features = features[mid:]   # pretend these arrive later

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Stage-1: fast global model on early features
stage1 = LinearRegression()
# Stage-2: more flexible model on late features + stage-1 prediction
stage2 = RandomForestRegressor(n_estimators=200, random_state=0)

model = StagecoachRegressor(
    stage1_estimator=stage1,
    stage2_estimator=stage2,
    early_features=early_features,
    late_features=late_features,
    residual=True,
    use_stage1_pred_as_feature=True,
    inner_cv=None,  # set >1 to cross-fit stage-1 preds if you care
)

# Hyper-parameter search over both stages
param_grid = {
    "stage1_estimator__fit_intercept": [True, False],
    "stage2_estimator__max_depth": [None, 5, 10],
}
grid = GridSearchCV(model, param_grid, cv=5)
grid.fit(X_train, y_train)
best = grid.best_estimator_

print("Stage-1 test R²: ", r2_score(y_test, best.predict_stage1(X_test)))
print("Final test R²:   ", r2_score(y_test, best.predict(X_test)))
```
Classification example (breast cancer dataset)
```python
from stagecoachml import StagecoachClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer(as_frame=True)
X = data.data
y = data.target

features = list(X.columns)
mid = len(features) // 2
early = features[:mid]
late = features[mid:]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

stage1_clf = LogisticRegression(max_iter=1000)
stage2_clf = RandomForestClassifier(n_estimators=200, random_state=2)

model = StagecoachClassifier(
    stage1_estimator=stage1_clf,
    stage2_estimator=stage2_clf,
    early_features=early,
    late_features=late,
    use_stage1_pred_as_feature=True,
)
model.fit(X_train, y_train)

def metrics(y_true, y_pred):
    return accuracy_score(y_true, y_pred), f1_score(y_true, y_pred)

# Provisional scores from early features only
stage1_test_proba = model.predict_stage1_proba(X_test)
stage1_acc, stage1_f1 = metrics(y_test, (stage1_test_proba >= 0.5).astype(int))

# Final scores with all features
final_acc, final_f1 = metrics(y_test, model.predict(X_test))

print("Stage-1 test accuracy/F1:", f"{stage1_acc:.3f}/{stage1_f1:.3f}")
print("Final test accuracy/F1:  ", f"{final_acc:.3f}/{final_f1:.3f}")
```
API overview
StagecoachRegressor
```python
StagecoachRegressor(
    stage1_estimator,
    stage2_estimator,
    early_features,
    late_features,
    residual=True,
    use_stage1_pred_as_feature=True,
    inner_cv=None,
    random_state=None,
)
```
Key points:
- stage1_estimator: any sklearn regressor (RandomForestRegressor, LinearRegression, etc.).
- stage2_estimator: another regressor for the late features (often more flexible).
- early_features / late_features: column names defining feature arrival.
- residual=True: stage 2 learns y − ŷ₁ and we add it back at prediction time.
- use_stage1_pred_as_feature=True: the stage-1 prediction becomes an extra input to stage 2.
- inner_cv: optional K-fold cross-fitting that generates out-of-fold stage-1 predictions for stage-2 training.
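What inner_cv's cross-fitting means can be sketched with scikit-learn's cross_val_predict. This is an illustration of the idea, not the library's internals, and the synthetic data is made up:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)
X_early = rng.normal(size=(100, 4))
X_late = rng.normal(size=(100, 2))
y = X_early.sum(axis=1) + X_late.sum(axis=1) + rng.normal(scale=0.1, size=100)

# Out-of-fold stage-1 predictions: each row is scored by a model that never
# saw that row during fitting, so stage 2 cannot exploit stage-1 overfitting.
y1_oof = cross_val_predict(LinearRegression(), X_early, y, cv=5)

# Stage 2 learns the residual against the honest (out-of-fold) stage-1 score.
stage2 = LinearRegression().fit(np.column_stack([X_late, y1_oof]), y - y1_oof)
```

With flexible stage-1 models, training stage 2 on in-sample stage-1 predictions would teach it to trust scores that look better than they are at inference time; cross-fitting removes that optimism.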
Methods:
- fit(X, y)
- predict_stage1(X) – early-only predictions.
- predict(X) – final predictions.
StagecoachClassifier
```python
StagecoachClassifier(
    stage1_estimator,
    stage2_estimator,
    early_features,
    late_features,
    use_stage1_pred_as_feature=True,
    inner_cv=None,
    random_state=None,
)
```
- The stage-1 classifier must implement predict_proba or decision_function.
- The stage-2 classifier must implement predict_proba.
- predict_stage1_proba(X) returns a provisional probability for the positive class using early features only.
- predict_proba(X)/predict(X) use both stages.
Business-level use cases
1. Ad serving & recommendation
- Stage 1 (trunk): user, session, page/context features. Run for every candidate to do rough scoring / candidate pruning.
- Stage 2 (head): ad/creative/item-side features (embeddings, textual features, sponsorship info), plus stage-1 scores. Run only on the smaller candidate set.
This lets you:
- keep the expensive features and models off the critical path where possible,
- cross-validate the whole two-stage scoring process as one estimator, and
- reason explicitly about which features are actually available at each stage.
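The stage-1 pruning step can be sketched independently of any model library; prune_then_rescore, the toy scorers, and k are all illustrative names:

```python
import numpy as np

def prune_then_rescore(candidates, score_stage1, score_stage2, k):
    """Return (kept_indices, final_scores) for the top-k stage-1 candidates."""
    s1 = score_stage1(candidates)
    kept = np.argsort(s1)[::-1][:k]          # indices of the k best stage-1 scores
    final = score_stage2(candidates[kept])   # the expensive scorer sees k rows only
    return kept, final

rng = np.random.default_rng(2)
cands = rng.normal(size=(1000, 8))
kept, final = prune_then_rescore(
    cands,
    score_stage1=lambda X: X[:, 0],        # toy cheap score: one feature
    score_stage2=lambda X: X.sum(axis=1),  # toy expensive score: all features
    k=50,
)
```

The expensive stage-2 scorer runs on 50 rows instead of 1000; the cheap trunk decides which 50.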
2. Per-customer models with private fields
- Shared trunk: trained on non-sensitive features across all customers.
- Per-customer head (stage 2): trained only on that customer’s private fields (GDP data, custom risk scores, internal labels) inside their environment.
You can:
- ship the trunk once,
- let each customer fit their own stage-2 model locally,
- still evaluate how “global trunk + local head” behaves on held-out data.
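Assuming the trunk and head are plain sklearn regressors, the trunk-plus-local-head flow might look like this sketch; all names, the residual-head choice, and the synthetic data are illustrative, not the library's API:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(3)
coef_shared = np.arange(1.0, 6.0)

# Shared trunk trained once on pooled, non-sensitive features.
X_shared = rng.normal(size=(300, 5))
y_all = X_shared @ coef_shared + rng.normal(scale=0.1, size=300)
trunk = LinearRegression().fit(X_shared, y_all)

def fit_local_head(trunk, X_shared_cust, X_private_cust, y_cust):
    """Runs inside the customer's environment: fit a head on private
    fields against the trunk's residual, so private data never leaves."""
    resid = y_cust - trunk.predict(X_shared_cust)
    return Ridge().fit(X_private_cust, resid)

# One customer's local data (toy): private fields explain part of the target.
Xs = rng.normal(size=(80, 5))
Xp = rng.normal(size=(80, 2))
yc = Xs @ coef_shared + Xp @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=80)

head = fit_local_head(trunk, Xs, Xp, yc)
y_pred = trunk.predict(Xs) + head.predict(Xp)
```

Only the trunk's parameters cross the boundary; the head, and the private columns it reads, stay with the customer.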
3. Latency and staged inference
If your system has a front-door budget (say ~10 ms) and a back-end budget per selected candidate, StagecoachML gives you a clean way to:
- do rough scoring at T₁ using a small, cheap stage-1 model;
- hydrate more features or call heavier services; and
- refine scores at T₂ with stage-2.
Because the whole pipeline is an sklearn estimator, you don’t have to guess whether this staging actually helps: you can compare two-stage vs single-stage models on the same train/test splits.
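One common gating rule, invoking stage 2 only when the stage-1 probability is uncertain, can be sketched as follows; the 0.35/0.65 band, staged_scores, and the toy stage-2 scorer are all illustrative:

```python
import numpy as np

def staged_scores(p1, X_late, stage2_scorer, low=0.35, high=0.65):
    """Keep confident stage-1 probabilities; rescore only the uncertain band."""
    p = p1.copy()
    uncertain = (p1 > low) & (p1 < high)
    if uncertain.any():
        p[uncertain] = stage2_scorer(X_late[uncertain])  # heavy model on a subset
    return p, uncertain

rng = np.random.default_rng(4)
p1 = rng.uniform(size=20)           # stage-1 probabilities at T1
X_late = rng.normal(size=(20, 3))   # late features hydrated before T2
final, mask = staged_scores(
    p1, X_late, stage2_scorer=lambda X: np.clip(X[:, 0], 0.0, 1.0)
)
```

Confident requests never pay the stage-2 cost; only the uncertain band waits for late features.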
Examples
The examples/ directory contains runnable scripts:
- examples/regression_example.py uses the diabetes dataset, splits features into early/late, trains a StagecoachRegressor, and compares it to a one-stage baseline.
- examples/classification_example.py uses the breast cancer dataset, trains a StagecoachClassifier, and compares provisional vs final predictions against a one-stage logistic baseline.
Run them with:
```shell
python -m examples.regression_example
python -m examples.classification_example
```
Design notes & non-goals
- Treat Stagecoach* as one model for train/validation/test; don’t hand-tune the stages in isolation and then try to glue them together.
- inner_cv is an optional extra for robustness, not a replacement for normal cross-validation.
- This library is not a general DAG/workflow engine. If you want full pipeline orchestration (scheduling, retries, monitoring, etc.), you probably want Airflow/Prefect/etc. StagecoachML is about one very specific modeling pattern: staged feature arrival.