StagecoachML

A library for two-stage machine learning models with staggered feature arrival

StagecoachML is a tiny library for building two-stage models when your features arrive in two batches at different times.

Think:

  • Ad serving and recommendation:
    first score on user + context, then refine on creative/item + real-time signals.
  • Per-customer privacy:
    a shared non-sensitive trunk, plus a per-customer head that uses private fields inside their own environment.
  • Latency-sensitive inference:
    run a fast stage-1 model early in the request, and only run the heavier stage-2 model when needed.

StagecoachML encodes that pattern directly in the model interface instead of leaving it buried in infra and notebooks.


When should you use StagecoachML?

Use StagecoachML when:

  • You can’t wait for all features to arrive before you start making decisions.
  • Some features live in a different silo (e.g. customer’s infra) and must never hit the central model.
  • You want to tune and evaluate the whole two-stage system as a single estimator (train/test/CV), while still being able to:
    • get stage-1 scores from early features, and
    • get refined scores once late features arrive.

If you have all your features at once and a single model is fine, this library is probably overkill. But if you live with staggered features, StagecoachML keeps the logic honest.


Core idea

A StagecoachML model splits features into two groups:

  • Early features: available at stage 1 (e.g. user, context).
  • Late features: only available at stage 2 (e.g. ad/creative/item, customer-side data).

You choose:

  • a stage-1 estimator that sees only early features, and
  • a stage-2 estimator that sees late features plus (optionally) the stage-1 prediction, and either:
    • learns to predict the residual y − ŷ₁, or
    • learns the final target directly.

At inference time you can:

  • call predict_stage1(...) / predict_stage1_proba(...) when you only have early features; and
  • call predict(...) / predict_proba(...) later when you have both.

From the outside, you still train and cross-validate it like any other sklearn estimator.
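The mechanics can be sketched with plain scikit-learn; the column names, models, and synthetic data below are illustrative, not the library's internals:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=["e1", "e2", "l1", "l2"])
y = X["e1"] + 2 * X["l1"] + rng.normal(scale=0.1, size=200)

early, late = ["e1", "e2"], ["l1", "l2"]

# Stage 1: fit on early features only.
stage1 = LinearRegression().fit(X[early], y)
yhat1 = stage1.predict(X[early])

# Stage 2: fit on late features + stage-1 prediction, targeting the residual y - yhat1.
Z = X[late].assign(stage1_pred=yhat1)
stage2 = RandomForestRegressor(n_estimators=50, random_state=0).fit(Z, y - yhat1)

# Final prediction adds the learned residual back onto the stage-1 score.
yhat = yhat1 + stage2.predict(Z)
```

With residual learning, calling only stage 1 still gives a usable provisional score; stage 2 only ever adds a correction on top.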


Installation

StagecoachML is a pure Python package that depends on NumPy, pandas, and scikit-learn.

pip install stagecoachml

Or install from source:

git clone https://github.com/finite-sample/stagecoachml.git
cd stagecoachml
pip install -e .

Import the estimators:

from stagecoachml import StagecoachRegressor, StagecoachClassifier

Quick start

Regression example (diabetes dataset)

from stagecoachml import StagecoachRegressor
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# Load data as a DataFrame
diabetes = load_diabetes(as_frame=True)
X = diabetes.frame.drop(columns=["target"])
y = diabetes.frame["target"]

# Split columns into "early" and "late" features
features = list(X.columns)
mid = len(features) // 2
early_features = features[:mid]   # pretend these arrive early
late_features  = features[mid:]   # pretend these arrive later

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Stage-1: fast global model on early features
stage1 = LinearRegression()

# Stage-2: more flexible model on late features + stage-1 prediction
stage2 = RandomForestRegressor(n_estimators=200, random_state=0)

model = StagecoachRegressor(
    stage1_estimator=stage1,
    stage2_estimator=stage2,
    early_features=early_features,
    late_features=late_features,
    residual=True,
    use_stage1_pred_as_feature=True,
    inner_cv=None,            # set to an int >= 2 to cross-fit stage-1 predictions
)

# Hyper-parameter search over both stages
param_grid = {
    "stage1_estimator__fit_intercept": [True, False],
    "stage2_estimator__max_depth": [None, 5, 10],
}
grid = GridSearchCV(model, param_grid, cv=5)
grid.fit(X_train, y_train)

best = grid.best_estimator_

print("Stage-1 test R²: ", r2_score(y_test, best.predict_stage1(X_test)))
print("Final   test R²: ", r2_score(y_test, best.predict(X_test)))

Classification example (breast cancer dataset)

from stagecoachml import StagecoachClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer(as_frame=True)
X = data.data
y = data.target

features = list(X.columns)
mid = len(features) // 2
early = features[:mid]
late  = features[mid:]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

stage1_clf = LogisticRegression(max_iter=1000)
stage2_clf = RandomForestClassifier(n_estimators=200, random_state=2)

model = StagecoachClassifier(
    stage1_estimator=stage1_clf,
    stage2_estimator=stage2_clf,
    early_features=early,
    late_features=late,
    use_stage1_pred_as_feature=True,
)

model.fit(X_train, y_train)

def metrics(y_true, y_pred):
    return accuracy_score(y_true, y_pred), f1_score(y_true, y_pred)

# Provisional scores from early features only
stage1_test_proba = model.predict_stage1_proba(X_test)
stage1_acc, stage1_f1 = metrics(y_test, (stage1_test_proba >= 0.5).astype(int))

# Final scores with all features
final_acc, final_f1 = metrics(y_test, model.predict(X_test))

print("Stage-1  test accuracy/F1:", f"{stage1_acc:.3f}/{stage1_f1:.3f}")
print("Final    test accuracy/F1:", f"{final_acc:.3f}/{final_f1:.3f}")

API overview

StagecoachRegressor

StagecoachRegressor(
    stage1_estimator,
    stage2_estimator,
    early_features,
    late_features,
    residual=True,
    use_stage1_pred_as_feature=True,
    inner_cv=None,
    random_state=None,
)

Key points:

  • stage1_estimator: any sklearn regressor (RandomForestRegressor, LinearRegression, etc.).
  • stage2_estimator: another regressor for the late features (often more flexible).
  • early_features / late_features: column names defining feature arrival.
  • residual=True: stage 2 learns y − ŷ₁ and we add it back at prediction time.
  • use_stage1_pred_as_feature=True: stage-1 prediction becomes an extra input to stage 2.
  • inner_cv: optional K-fold cross-fitting to generate out-of-fold stage-1 predictions for stage-2 training.

Methods:

  • fit(X, y)
  • predict_stage1(X) – early-only predictions.
  • predict(X) – final predictions.
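What inner_cv does conceptually can be sketched with cross_val_predict: stage 2 trains against out-of-fold stage-1 predictions, so it never sees stage-1 scores that were fit on the same rows. This is a hand-rolled illustration of the idea, not the library's code:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(300, 4)), columns=["e1", "e2", "l1", "l2"])
y = X["e1"] - X["l2"] + rng.normal(scale=0.1, size=300)

early, late = ["e1", "e2"], ["l1", "l2"]

# Out-of-fold stage-1 predictions: each row is scored by a model
# trained on the other folds, so the scores are "honest".
oof_pred = cross_val_predict(LinearRegression(), X[early], y, cv=5)

# Stage 2 learns the residual relative to the honest scores.
Z = X[late].assign(stage1_pred=oof_pred)
stage2 = Ridge().fit(Z, y - oof_pred)

# At inference time, stage 1 is refit on all rows and combined as usual.
stage1 = LinearRegression().fit(X[early], y)
yhat1 = stage1.predict(X[early])
final = yhat1 + stage2.predict(X[late].assign(stage1_pred=yhat1))
```

Without cross-fitting, stage 2 can learn to over-trust in-sample stage-1 scores that look better than they will at inference time.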

StagecoachClassifier

StagecoachClassifier(
    stage1_estimator,
    stage2_estimator,
    early_features,
    late_features,
    use_stage1_pred_as_feature=True,
    inner_cv=None,
    random_state=None,
)
  • Stage-1 classifier must implement predict_proba or decision_function.
  • Stage-2 classifier must implement predict_proba.
  • predict_stage1_proba(X) returns a provisional probability for the positive class using early features only.
  • predict_proba(X) / predict(X) use both stages.
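If a stage-1 classifier exposes only decision_function (e.g. LinearSVC), its margin has to be mapped into (0, 1) before it can act as a provisional probability. Whether the library uses exactly this mapping is not stated here; the logistic sigmoid below is one common, illustrative choice:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, random_state=0)
clf = LinearSVC(max_iter=5000).fit(X, y)  # has decision_function, no predict_proba

margin = clf.decision_function(X)            # signed distance from the hyperplane
proba_like = 1.0 / (1.0 + np.exp(-margin))   # logistic squash into (0, 1)
```

For calibrated probabilities rather than a monotone squash, sklearn's CalibratedClassifierCV is the standard wrapper.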

Business-level use cases

1. Ad serving & recommendation

  • Stage 1 (trunk): user, session, page/context features. Run for every candidate to do rough scoring / candidate pruning.
  • Stage 2 (head): ad/creative/item-side features (embeddings, textual features, sponsorship info), plus stage-1 scores. Run only on the smaller candidate set.

This lets you:

  • keep the expensive features and models off the critical path where possible,
  • cross-validate the whole two-stage scoring process as one estimator, and
  • reason explicitly about which features are actually available at each stage.
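A hand-rolled version of that serving loop, with illustrative candidate counts, feature names, and models (not the library's code):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(2)
n = 1000
X = pd.DataFrame(rng.normal(size=(n, 4)), columns=["user1", "user2", "item1", "item2"])
y = (X["user1"] + X["item1"] + rng.normal(scale=0.5, size=n) > 0).astype(int)

# Trunk on user/context features; head on item features + trunk score.
trunk = LogisticRegression().fit(X[["user1", "user2"]], y)
trunk_scores = trunk.predict_proba(X[["user1", "user2"]])[:, 1]
head = GradientBoostingClassifier(random_state=0).fit(
    X[["item1", "item2"]].assign(trunk_score=trunk_scores), y
)

# Serving: stage 1 scores every candidate cheaply...
candidates = X.iloc[:200]
rough = trunk.predict_proba(candidates[["user1", "user2"]])[:, 1]

# ...stage 2 re-scores only the top-k survivors.
top_k = np.argsort(rough)[-20:]
shortlist = candidates.iloc[top_k]
refined = head.predict_proba(
    shortlist[["item1", "item2"]].assign(trunk_score=rough[top_k])
)[:, 1]
```

The expensive model touches 20 rows instead of 200; the pruning ratio is the knob you trade against recall.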

2. Per-customer models with private fields

  • Shared trunk: trained on non-sensitive features across all customers.
  • Per-customer head (stage 2): trained only on that customer’s private fields (GDPR-protected data, custom risk scores, internal labels) inside their environment.

You can:

  • ship the trunk once,
  • let each customer fit their own stage-2 model locally,
  • still evaluate how “global trunk + local head” behaves on held-out data.
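The pattern can be sketched with plain sklearn, with nothing private leaving the customer's side; the feature names and models are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(3)

# Pooled, non-sensitive data: fit the shared trunk once, centrally.
shared = pd.DataFrame(rng.normal(size=(500, 2)), columns=["s1", "s2"])
y_shared = shared["s1"] + rng.normal(scale=0.2, size=500)
trunk = LinearRegression().fit(shared, y_shared)

# Inside one customer's environment: private1 never leaves.
cust = pd.DataFrame(rng.normal(size=(100, 3)), columns=["s1", "s2", "private1"])
y_cust = cust["s1"] + 0.5 * cust["private1"] + rng.normal(scale=0.2, size=100)

# Local head learns the residual the trunk misses on this customer.
trunk_pred = trunk.predict(cust[["s1", "s2"]])
head = Ridge().fit(cust[["private1"]].assign(trunk_pred=trunk_pred), y_cust - trunk_pred)

local_pred = trunk_pred + head.predict(cust[["private1"]].assign(trunk_pred=trunk_pred))
```

Only the trunk's coefficients cross the boundary; the head, and the private columns it uses, stay local.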

3. Latency and staged inference

If your system has a front-door budget (say ~10 ms) and a back-end budget per selected candidate, StagecoachML gives you a clean way to:

  • do rough scoring at T₁ using a small, cheap stage-1 model;
  • hydrate more features or call heavier services; and
  • refine scores at T₂ with stage-2.

Because the whole pipeline is an sklearn estimator, you don’t have to guess whether this staging actually helps: you can compare two-stage vs single-stage models on the same train/test splits.
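Because everything speaks the sklearn estimator API, that comparison is a routine cross_val_score call on shared folds. The sketch below uses two plain baselines (early-only vs all-features) to show the shape of the check; substituting a Stagecoach* estimator works the same way:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_diabetes(return_X_y=True, as_frame=True)
early = list(X.columns)[:5]           # pretend the first half arrives early

cv = KFold(n_splits=5, shuffle=True, random_state=0)  # identical folds for both models

early_only = cross_val_score(LinearRegression(), X[early], y, cv=cv)
all_feats = cross_val_score(LinearRegression(), X, y, cv=cv)

print(f"early-only   mean R²: {early_only.mean():.3f}")
print(f"all-features mean R²: {all_feats.mean():.3f}")
```

Passing the same cv object to both calls guarantees the scores are computed on identical splits.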


Examples

The examples/ directory contains runnable scripts:

  • examples/regression_example.py Uses the diabetes dataset, splits features into early/late, trains a StagecoachRegressor, and compares it to a one-stage baseline.

  • examples/classification_example.py Uses the breast cancer dataset, trains a StagecoachClassifier, and compares provisional vs final predictions and a one-stage logistic baseline.

Run them with:

python -m examples.regression_example
python -m examples.classification_example

Design notes & non-goals

  • Treat Stagecoach* as one model for train/validation/test; don’t hand-tune stages in isolation and then try to glue them together.
  • inner_cv is an optional extra for robustness, not a replacement for normal cross-validation.
  • This library is not a general DAG/workflow engine. If you want full pipeline orchestration (scheduling, retries, monitoring, etc.), you probably want Airflow/Prefect/etc. StagecoachML is about one very specific modeling pattern: staged feature arrival.
