Task-first ML baselines. Run the simplest thing that could work.

stepzero

Before reaching for XGBoost or a neural net, run stepzero. It fits the simplest sensible model for your task, compares a few alternatives, and tells you whether your baseline is good enough or what to try next.

import stepzero as sz

# X holds the feature columns and y the class labels for your own dataset
result = sz.classification(X, y)
print(result)
# ClassificationResult(best='logistic', accuracy=0.960, headroom='low')

print(result.headroom)
# [low] Score of 0.96 with low variance (±0.012). The simple baseline is already
# performing well. Trying a gradient boosted tree is unlikely to offer a meaningful improvement.

Install

pip install stepzero

Requirements: Python 3.10+, numpy, pandas, scikit-learn, scipy.


Tasks

Classification

result = sz.classification(X, y)

result.best_model          # fitted sklearn Pipeline — call .predict(X_new) directly
result.best_model_name     # "logistic" | "tree" | "naive_bayes"
result.scores              # [ModelScore(name, score, metric), ...]
result.feature_importance  # pd.Series sorted by importance
result.headroom            # HeadroomSignal(level, reason)
  • Methods: logistic regression, decision tree, naive bayes
  • Metric: accuracy (5-fold stratified CV)
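
For readers who want to see what such a comparison involves, here is a rough sketch in plain scikit-learn. It is not stepzero's implementation: the iris dataset, the preprocessing steps, the tree depth, and the helper name compare_classifiers are all assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def compare_classifiers(X, y, seed=0):
    """Hypothetical helper: (mean, std) of 5-fold stratified CV accuracy per model."""
    candidates = {
        # Linear model: impute, then scale, then fit
        "logistic": make_pipeline(
            SimpleImputer(), StandardScaler(), LogisticRegression(max_iter=1000)
        ),
        # Trees do not need scaling; the depth limit is an assumed default
        "tree": make_pipeline(
            SimpleImputer(), DecisionTreeClassifier(max_depth=5, random_state=seed)
        ),
        "naive_bayes": make_pipeline(SimpleImputer(), GaussianNB()),
    }
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    results = {}
    for name, model in candidates.items():
        fold_scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
        results[name] = (float(fold_scores.mean()), float(fold_scores.std()))
    return results


X, y = load_iris(return_X_y=True)
results = compare_classifiers(X, y)
best_name = max(results, key=lambda name: results[name][0])

for name, (mean, std) in results.items():
    print(f"{name}: {mean:.3f} ± {std:.3f}")
print("best:", best_name)
```

The spread across folds matters as much as the mean: a baseline that scores well with low variance is harder to beat than one whose score swings between folds.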

Regression

result = sz.regression(X, y)

result.best_model_name     # "ridge" | "tree"
result.feature_importance  # normalized importances as pd.Series
result.headroom
  • Methods: ridge, decision tree
  • Metric: RMSE (5-fold CV)
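
A comparable sketch for regression, again in plain scikit-learn and not taken from stepzero's source. The synthetic dataset, the ridge penalty, and the tree depth are assumed values.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic, mostly linear data (an assumption chosen to keep the example small)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

candidates = {
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "tree": DecisionTreeRegressor(max_depth=5, random_state=0),
}
cv = KFold(n_splits=5, shuffle=True, random_state=0)

rmse = {}
for name, model in candidates.items():
    # scikit-learn scorers are "higher is better", so RMSE comes back negated
    neg_rmse = cross_val_score(
        model, X, y, cv=cv, scoring="neg_root_mean_squared_error"
    )
    rmse[name] = float(-neg_rmse.mean())

best_name = min(rmse, key=rmse.get)  # lower RMSE wins
print(rmse, "best:", best_name)
```

On data generated from a linear relationship, the ridge model should come out ahead of the tree; on real data the ranking is an empirical question, which is the point of running both.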

Forecasting

result = sz.forecasting(series, horizon=12)

result.forecast        # pd.Series with future timestamps as index
result.best_model_name # "seasonal_naive" | "linear_trend"
result.scores          # MAE per model
result.headroom
  • Methods: seasonal naive, linear trend
  • Parameters: horizon, freq (optional — inferred from DatetimeIndex), cv_splits
  • Metric: MAE (time-series CV)
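
The two forecasting baselines and a time-series CV loop can be sketched with the standard library alone. This is an illustration under assumptions, not stepzero's code: the expanding-window split scheme, the season length of 4, and all function names are hypothetical.

```python
def seasonal_naive(history, horizon, season_length):
    """Repeat the last observed season forward."""
    last_season = history[-season_length:]
    return [last_season[i % season_length] for i in range(horizon)]


def linear_trend(history, horizon):
    """Fit a least-squares line to the history and extrapolate it."""
    n = len(history)
    x_mean = (n - 1) / 2
    y_mean = sum(history) / n
    slope = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(history)) / sum(
        (x - x_mean) ** 2 for x in range(n)
    )
    intercept = y_mean - slope * x_mean
    return [intercept + slope * (n + step) for step in range(horizon)]


def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rolling_origin_mae(series, forecaster, horizon, n_splits):
    """Train on a growing prefix, test on the next `horizon` points, average the MAE."""
    errors = []
    for split in range(n_splits, 0, -1):
        cutoff = len(series) - split * horizon
        train, test = series[:cutoff], series[cutoff:cutoff + horizon]
        errors.append(mae(test, forecaster(train, horizon)))
    return sum(errors) / len(errors)


series = [10, 20, 30, 20] * 6  # toy series with a period of 4
horizon = 4
forecasters = {
    "seasonal_naive": lambda train, h: seasonal_naive(train, h, season_length=4),
    "linear_trend": linear_trend,
}
scores = {
    name: rolling_origin_mae(series, forecaster, horizon, n_splits=3)
    for name, forecaster in forecasters.items()
}
best_name = min(scores, key=scores.get)
forecast = forecasters[best_name](series, horizon)
print(scores, best_name, forecast)
```

Unlike ordinary k-fold CV, every test window here lies strictly after its training data, so no future values leak into the fit.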

Anomaly Detection

result = sz.anomaly_detection(series)

result.anomalies   # pd.Series[bool], same index as input
result.scores      # raw anomaly scores
result.method      # "zscore" | "iqr"
result.threshold   # auto-determined threshold
result.headroom
  • Methods: z-score, IQR
  • Parameters: threshold (optional — auto-set to flag ~5% of points), method
  • Metric: inter-method agreement
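
Both detectors and the agreement metric are simple enough to sketch with the standard library. The thresholds below (3 standard deviations, 1.5 × IQR) are conventional values assumed for illustration; stepzero sets its threshold automatically, and its exact rule is not shown here. The function names are hypothetical.

```python
import statistics


def zscore_flags(values, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [False] * len(values)
    return [abs(v - mean) / std > threshold for v in values]


def iqr_flags(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v < low or v > high for v in values]


def agreement(flags_a, flags_b):
    """Fraction of points on which the two methods give the same verdict."""
    return sum(a == b for a, b in zip(flags_a, flags_b)) / len(flags_a)


values = [10.0, 10.5, 9.5, 10.2, 9.8] * 4 + [30.0]  # one obvious spike at the end
z_flags = zscore_flags(values)
q_flags = iqr_flags(values)
print("z-score anomalies:", sum(z_flags))
print("IQR anomalies:", sum(q_flags))
print("agreement:", agreement(z_flags, q_flags))
```

When the two methods disagree on many points, that is a hint that the notion of "anomaly" in the data is not clear-cut and a more careful approach may be worthwhile.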

Text Classification

result = sz.text_classification(texts, labels)

result.best_model_name        # "tfidf_logistic" | "tfidf_naive_bayes"
result.top_features_per_class # {"class_0": ["word1", ...], ...}
result.headroom
  • Methods: TF-IDF + logistic regression, TF-IDF + naive bayes
  • Metric: accuracy (5-fold stratified CV)

Clustering

result = sz.clustering(X, k_range=(2, 10))

result.best_k    # selected number of clusters
result.labels    # cluster assignment per sample (np.ndarray)
result.centers   # cluster centroids in original feature space
result.scores    # silhouette score per k tried
result.headroom
  • Methods: k-means
  • Parameters: k_range
  • Metric: silhouette score
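
Selecting k by silhouette score can be sketched as follows in scikit-learn. This is an illustration, not stepzero's code: the synthetic blobs are made up, and treating k_range as inclusive on both ends is an assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated blobs (synthetic data chosen for the example)
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

k_range = (2, 6)  # assumed inclusive on both ends
scores = {}
for k in range(k_range[0], k_range[1] + 1):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = float(silhouette_score(X, labels_k))

best_k = max(scores, key=scores.get)  # higher silhouette is better
final = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(X)
print("best k:", best_k)
print("centers:", final.cluster_centers_.round(1))
```

A silhouette score that stays low for every k tried suggests the data has no clear cluster structure at all.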

The headroom signal

Every result has a .headroom attribute:

result.headroom.level   # "low" | "medium" | "high"
result.headroom.reason  # actionable explanation + what to try next
print(result.headroom)
# [medium] CV accuracy of 0.81 ± 0.04. A 19% gap to ceiling remains.
# A gradient boosted tree (e.g., XGBoost or LightGBM) is a natural next step.
  • low means that the simple model is already doing well; complexity buys little
  • medium means that meaningful headroom remains; a tuned model may help
  • high means that the baseline is underperforming; a more complex model is likely worth it
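
The three levels can be pictured as a simple mapping from a CV score to a label and a reason. The sketch below is purely hypothetical: the cutoffs (0.90 and 0.70), the class name HeadroomSketch, and the wording of the reasons are assumptions chosen only to agree with the two example outputs shown in this README. stepzero's actual rule may use different inputs and thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeadroomSketch:
    """Illustrative (level, reason) pair; not stepzero's own class."""
    level: str
    reason: str

    def __str__(self):
        return f"[{self.level}] {self.reason}"


def headroom_from_accuracy(mean, std, low_cutoff=0.90, medium_cutoff=0.70):
    """Map a CV accuracy to a headroom level using assumed cutoffs."""
    gap = 1.0 - mean  # distance to a perfect score
    summary = f"CV accuracy of {mean:.2f} ± {std:.2f}, {gap:.0%} below the ceiling."
    if mean >= low_cutoff:
        return HeadroomSketch("low", summary + " The simple baseline is already strong.")
    if mean >= medium_cutoff:
        return HeadroomSketch("medium", summary + " A tuned or boosted model may help.")
    return HeadroomSketch("high", summary + " A more complex model is likely worth trying.")


for mean, std in [(0.96, 0.012), (0.81, 0.04), (0.55, 0.05)]:
    print(headroom_from_accuracy(mean, std))
```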

Design philosophy

  • Task-first, not model-first. You describe the problem; stepzero picks the approach.
  • Opinionated defaults. Automatic scaling for linear models, missing-value imputation, and a sensible evaluation scheme for each task.
  • No false modesty. The models are genuinely simple — logistic regression, decision trees, seasonal naive. No AutoML hidden underneath.
  • Ready to deploy. result.best_model is a fitted sklearn Pipeline. Call .predict() on new data immediately.
  • Minimal footprint. Only numpy, pandas, scikit-learn, and scipy. No optional heavy dependencies required for core functionality.
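
The practical benefit of getting a fitted Pipeline back is that preprocessing travels with the model. The sketch below builds such a pipeline by hand in scikit-learn to show the idea; the step names, the median imputation strategy, and the toy data are assumptions and not necessarily what stepzero constructs.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy training data with missing values and features on very different scales
X_train = np.array([
    [1.0, 200.0], [2.0, 210.0], [3.0, np.nan],
    [8.0, 900.0], [9.0, np.nan], [10.0, 950.0],
])
y_train = np.array([0, 0, 0, 1, 1, 1])

model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill gaps with training medians
    ("scale", StandardScaler()),                   # put features on a common scale
    ("clf", LogisticRegression()),
])
model.fit(X_train, y_train)

# New rows can contain gaps too; the pipeline applies the same imputation
# and scaling it learned during fit before predicting
X_new = np.array([[2.0, np.nan], [np.nan, 940.0]])
predictions = model.predict(X_new)
print(predictions)
```

Because the imputer and scaler are fitted inside the pipeline, there is no separate preprocessing code to keep in sync between training and serving.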

When to use stepzero

  • ✅ Starting a new ML project and want a defensible baseline in 5 minutes
  • ✅ Proving (or disproving) that a simple model is good enough
  • ✅ Teaching or demonstrating ML without the XGBoost-first bias
  • ✅ Kaggle competitions — establish your baseline before tuning

Contributing

Contributions are welcome. Please read CONTRIBUTING.md for the workflow.

In short: branch from develop and open a PR targeting develop. Every PR runs the test suite automatically on Python 3.10–3.12.

Reporting issues

Open an issue on GitHub. Include your Python version, stepzero version, and a minimal reproducible example.


License

MIT
