StagecoachML

A library for two-stage machine learning models with staggered feature arrival

StagecoachML is a tiny library for building two-stage models when your features arrive in two batches at different times.

Think:

  • Ad serving and recommendation:
    first score on user + context, then refine on creative/item + real-time signals.
  • Per-customer privacy:
    a shared non-sensitive trunk, plus a per-customer head that uses private fields inside their own environment.
  • Latency-sensitive inference:
    run a fast stage-1 model early in the request, and only run the heavier stage-2 model when needed.

StagecoachML encodes that pattern directly in the model interface instead of leaving it buried in infra and notebooks.


When should you use StagecoachML?

Use StagecoachML when:

  • You need to start making decisions before all of the features have arrived.
  • Some features live in a different silo (e.g. customer’s infra) and must never hit the central model.
  • You want to tune and evaluate the whole two-stage system as a single estimator (train/test/CV), while still being able to:
    • get stage-1 scores from early features, and
    • get refined scores once late features arrive.

If you have all your features at once and a single model is fine, this library is probably overkill. But if you live with staggered features, StagecoachML keeps the logic honest.


Core idea

A StagecoachML model splits features into two groups:

  • Early features: available at stage 1 (e.g. user, context).
  • Late features: only available at stage 2 (e.g. ad/creative/item, customer-side data).

You choose:

  • a stage-1 estimator that sees only early features, and
  • a stage-2 estimator that sees late features plus (optionally) the stage-1 prediction, and either:
    • learns to predict the residual y − ŷ₁, or
    • learns the final target directly.

At inference time you can:

  • call predict_stage1(...) / predict_stage1_proba(...) when you only have early features; and
  • call predict(...) / predict_proba(...) later when you have both.

Under the hood, you still train and cross-validate it like any other sklearn estimator.
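
The fitting recipe described above can be sketched in plain scikit-learn. This is an illustrative sketch of residual two-stage training with made-up column names, not the library's internal code:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "user_age": rng.normal(size=n),    # early
    "ctx_hour": rng.normal(size=n),    # early
    "item_score": rng.normal(size=n),  # late
    "rt_signal": rng.normal(size=n),   # late
})
y = X["user_age"] + 2 * X["item_score"] + rng.normal(scale=0.1, size=n)

early, late = ["user_age", "ctx_hour"], ["item_score", "rt_signal"]

# Stage 1: fit on early features only
stage1 = LinearRegression().fit(X[early], y)
yhat1 = stage1.predict(X[early])

# Stage 2: fit on late features (+ stage-1 prediction) against the residual
Z = X[late].assign(stage1_pred=yhat1)
stage2 = RandomForestRegressor(n_estimators=50, random_state=0).fit(Z, y - yhat1)

# Final prediction adds the residual correction back
final = yhat1 + stage2.predict(Z)
```

With residual=True, StagecoachRegressor wraps roughly this bookkeeping behind a single fit/predict interface.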


Try it online

The Try in Browser link on the project page opens interactive StagecoachML examples that run directly in your browser via Pyodide, with no installation required.

Installation

StagecoachML is a pure Python package that depends on NumPy, pandas, and scikit-learn.

pip install stagecoachml

Or install from source:

git clone https://github.com/finite-sample/stagecoachml.git
cd stagecoachml
pip install -e .

Import the estimators:

from stagecoachml import StagecoachRegressor, StagecoachClassifier

Quick start

Regression example (diabetes dataset)

from stagecoachml import StagecoachRegressor
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# Load data as a DataFrame
diabetes = load_diabetes(as_frame=True)
X = diabetes.frame.drop(columns=["target"])
y = diabetes.frame["target"]

# Split columns into "early" and "late" features
features = list(X.columns)
mid = len(features) // 2
early_features = features[:mid]   # pretend these arrive early
late_features  = features[mid:]   # pretend these arrive later

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Stage-1: fast global model on early features
stage1 = LinearRegression()

# Stage-2: more flexible model on late features + stage-1 prediction
stage2 = RandomForestRegressor(n_estimators=200, random_state=0)

model = StagecoachRegressor(
    stage1_estimator=stage1,
    stage2_estimator=stage2,
    early_features=early_features,
    late_features=late_features,
    residual=True,
    use_stage1_pred_as_feature=True,
    inner_cv=None,            # set to an int (e.g. 5) to cross-fit stage-1 predictions
)

# Hyper-parameter search over both stages
param_grid = {
    "stage1_estimator__fit_intercept": [True, False],
    "stage2_estimator__max_depth": [None, 5, 10],
}
grid = GridSearchCV(model, param_grid, cv=5)
grid.fit(X_train, y_train)

best = grid.best_estimator_

print("Stage-1 test R²: ", r2_score(y_test, best.predict_stage1(X_test)))
print("Final   test R²: ", r2_score(y_test, best.predict(X_test)))

Classification example (breast cancer dataset)

from stagecoachml import StagecoachClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer(as_frame=True)
X = data.data
y = data.target

features = list(X.columns)
mid = len(features) // 2
early = features[:mid]
late  = features[mid:]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

stage1_clf = LogisticRegression(max_iter=1000)
stage2_clf = RandomForestClassifier(n_estimators=200, random_state=2)

model = StagecoachClassifier(
    stage1_estimator=stage1_clf,
    stage2_estimator=stage2_clf,
    early_features=early,
    late_features=late,
    use_stage1_pred_as_feature=True,
)

model.fit(X_train, y_train)

def metrics(y_true, y_pred):
    return accuracy_score(y_true, y_pred), f1_score(y_true, y_pred)

# Provisional scores from early features only
stage1_test_proba = model.predict_stage1_proba(X_test)
stage1_acc, stage1_f1 = metrics(y_test, (stage1_test_proba >= 0.5).astype(int))

# Final scores with all features
final_acc, final_f1 = metrics(y_test, model.predict(X_test))

print("Stage-1  test accuracy/F1:", f"{stage1_acc:.3f}/{stage1_f1:.3f}")
print("Final    test accuracy/F1:", f"{final_acc:.3f}/{final_f1:.3f}")

API overview

StagecoachRegressor

StagecoachRegressor(
    stage1_estimator,
    stage2_estimator,
    early_features,
    late_features,
    residual=True,
    use_stage1_pred_as_feature=True,
    inner_cv=None,
    random_state=None,
)

Key points:

  • stage1_estimator: any sklearn regressor (RandomForestRegressor, LinearRegression, etc.).
  • stage2_estimator: another regressor for the late features (often more flexible).
  • early_features / late_features: column names defining feature arrival.
  • residual=True: stage 2 learns y − ŷ₁ and we add it back at prediction time.
  • use_stage1_pred_as_feature=True: stage-1 prediction becomes an extra input to stage 2.
  • inner_cv: optional K-fold cross-fitting to generate out-of-fold stage-1 predictions for stage-2 training.
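
What inner_cv buys you can be illustrated with scikit-learn's cross_val_predict: stage 2 trains against out-of-fold stage-1 predictions rather than in-sample ones, so stage-1 overfitting does not leak into the stage-2 target. A sketch of the idea, not the library's implementation:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

X, y = load_diabetes(return_X_y=True)
early, late = X[:, :5], X[:, 5:]  # arbitrary early/late split for illustration

# Out-of-fold stage-1 predictions: each sample is scored by a model
# that never saw it during fitting (5-fold cross-fitting)
oof_pred = cross_val_predict(LinearRegression(), early, y, cv=5)

# Stage 2 would then train on late features + oof_pred against y - oof_pred
stage2_X = np.column_stack([late, oof_pred])
residual_target = y - oof_pred
```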

Methods:

  • fit(X, y)
  • predict_stage1(X) – early-only predictions.
  • predict(X) – final predictions.

StagecoachClassifier

StagecoachClassifier(
    stage1_estimator,
    stage2_estimator,
    early_features,
    late_features,
    use_stage1_pred_as_feature=True,
    inner_cv=None,
    random_state=None,
)
  • Stage-1 classifier must implement predict_proba or decision_function.
  • Stage-2 classifier must implement predict_proba.
  • predict_stage1_proba(X) returns a provisional probability for the positive class using early features only.
  • predict_proba(X) / predict(X) use both stages.
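
For a stage-1 classifier that only implements decision_function, the margin score can still serve as the extra stage-2 input. A sketch with plain scikit-learn stand-ins, not the library's code:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
early, late = X[:, :15], X[:, 15:]  # arbitrary early/late split

# Stage 1 exposes only decision_function, not predict_proba
early_s = StandardScaler().fit_transform(early)
svm = LinearSVC(dual=False).fit(early_s, y)
margin = svm.decision_function(early_s)  # signed margin, not a probability

# Stage 2 consumes late features plus the stage-1 margin as an extra column
stage2 = RandomForestClassifier(n_estimators=50, random_state=0).fit(
    np.column_stack([late, margin]), y
)
proba = stage2.predict_proba(np.column_stack([late, margin]))[:, 1]
```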

Business-level use cases

1. Ad serving & recommendation

  • Stage 1 (trunk): user, session, page/context features. Run for every candidate to do rough scoring / candidate pruning.
  • Stage 2 (head): ad/creative/item-side features (embeddings, textual features, sponsorship info), plus stage-1 scores. Run only on the smaller candidate set.

This lets you:

  • keep the expensive features and models off the critical path where possible,
  • cross-validate the whole two-stage scoring process as one estimator, and
  • reason explicitly about which features are actually available at each stage.
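
The pruning step itself is ordinary top-k selection. A hypothetical sketch, with random placeholder scores standing in for stage-1 scoring and stage-2 refinement:

```python
import numpy as np

rng = np.random.default_rng(0)
n_candidates, k = 10_000, 100

# Stage 1: cheap scores for every candidate (placeholder values standing in
# for predict_stage1 on early features)
cheap_scores = rng.random(n_candidates)

# Keep only the top-k candidates for expensive stage-2 scoring
top_k = np.argpartition(cheap_scores, -k)[-k:]

# Stage 2: refined scores for the pruned set only (placeholder for
# predict on early + late features)
refined = cheap_scores[top_k] + 0.1 * rng.standard_normal(k)

# Final ranking over the k survivors, best first
ranking = top_k[np.argsort(refined)[::-1]]
```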

2. Per-customer models with private fields

  • Shared trunk: trained on non-sensitive features across all customers.
  • Per-customer head (stage 2): trained only on that customer’s private fields (e.g. GDPR-restricted data, custom risk scores, internal labels) inside their environment.

You can:

  • ship the trunk once,
  • let each customer fit their own stage-2 model locally,
  • still evaluate how “global trunk + local head” behaves on held-out data.
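
The trunk-plus-local-head split can be sketched with plain scikit-learn models on synthetic data; everything after the second comment stays inside the customer's environment (an illustration of the pattern, not the library's API):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
coefs = np.array([1.0, -0.5, 0.3, 0.0])

# Shared, non-sensitive features and a global trunk trained once
X_shared = rng.normal(size=(1000, 4))
y_global = X_shared @ coefs + rng.normal(scale=0.1, size=1000)
trunk = LinearRegression().fit(X_shared, y_global)

# Inside one customer's environment: private fields never leave it.
# The local head learns the residual of the shipped trunk.
X_cust_shared = rng.normal(size=(200, 4))
X_cust_private = rng.normal(size=(200, 2))
y_cust = X_cust_shared @ coefs + 2.0 * X_cust_private[:, 0]

trunk_pred = trunk.predict(X_cust_shared)          # shared features only
head = Ridge().fit(X_cust_private, y_cust - trunk_pred)

final = trunk_pred + head.predict(X_cust_private)  # refined, fully local score
```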

3. Latency and staged inference

If your system has a front-door budget (say ~10 ms) and a back-end budget per selected candidate, StagecoachML gives you a clean way to:

  • do rough scoring at T₁ using a small, cheap stage-1 model;
  • hydrate more features or call heavier services; and
  • refine scores at T₂ with stage-2.

Because the whole pipeline is an sklearn estimator, you don’t have to guess whether this staging actually helps: you can compare two-stage vs single-stage models on the same train/test splits.
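
One such comparison, sketched with plain scikit-learn models as stand-ins (the split of the diabetes features into an "early" half is arbitrary):

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Single-stage baseline: one model on all features at once
baseline = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
baseline_r2 = r2_score(y_te, baseline.predict(X_te))

# "Stage-1 only" counterpart: the same model class restricted to early
# features, simulating what fits inside the front-door latency budget
stage1_only = RandomForestRegressor(n_estimators=100, random_state=0).fit(
    X_tr[:, :5], y_tr
)
stage1_r2 = r2_score(y_te, stage1_only.predict(X_te[:, :5]))

# The gap between the two scores tells you what the late features are worth
# on this split, before committing to a staged architecture.
```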


Examples

The examples/ directory contains runnable scripts:

  • examples/regression_example.py Uses the diabetes dataset, splits features into early/late, trains a StagecoachRegressor, and compares it to a one-stage baseline.

  • examples/classification_example.py Uses the breast cancer dataset, trains a StagecoachClassifier, and compares provisional vs final predictions and a one-stage logistic baseline.

Run them with:

python -m examples.regression_example
python -m examples.classification_example

Design notes & non-goals

  • Treat Stagecoach* as one model for train/validation/test; don’t hand-tune the stages in isolation and then try to glue them together afterwards.
  • inner_cv is an optional extra for robustness, not a replacement for normal cross-validation.
  • This library is not a general DAG/workflow engine. If you want full pipeline orchestration (scheduling, retries, monitoring, etc.), you probably want Airflow/Prefect/etc. StagecoachML is about one very specific modeling pattern: staged feature arrival.
