Unified AutoML + Dataset Diagnosis + Drift Detection for production ML
AutoGuard ML
AutoML + Dataset Diagnosis + Drift Detection, all in one package.
Most ML tools split responsibilities across multiple libraries. You use one tool for training, another for monitoring, another for explainability.
AutoGuard ML puts everything in one clean pipeline:
from autoguard import AutoGuard
ag = AutoGuard(target="churn")
ag.diagnose(df) # catch data problems before they ruin your model
df_clean = ag.auto_fix(df) # auto-clean the dataset
ag.fit(df_clean) # AutoML picks and tunes the best model
ag.explain() # SHAP feature importance
ag.report() # HTML report you can share
ag.monitor(new_df) # detect drift in production
That's the whole pipeline. One object. Six methods.
Table of Contents
- What it does
- Install
- 5-Minute Quickstart
- Feature Guide
- CLI Reference
- Config System
- Plugin System
- REST API
- Project Structure
- Running Tests
- FAQ
- Contributing
What it does
| Module | What it catches or solves |
|---|---|
| Dataset Doctor | Missing values, class imbalance, outliers, feature correlation, data leakage, skewed distributions |
| Auto-Fix | Fills missing values, encodes categoricals, caps outliers, normalizes skewed columns |
| AutoML Engine | Tries RandomForest, XGBoost, LightGBM, LogisticRegression, and Ridge; tunes hyperparameters with Optuna; picks the winner via cross-validation |
| Explainability | SHAP global importance and per-prediction local explanations |
| Report Generator | Self-contained dark-theme HTML report with leaderboard, risk score, and issue list |
| Drift Monitor | KS test + PSI for numeric features, Chi-squared + PSI for categorical features, severity scores, alert logging |
| REST API | FastAPI server with /predict, /predict/proba, /monitor, /model/info |
| CLI | autoguard train, diagnose, fix, monitor, explain, report, serve, init |
Install
Basic install (training + monitoring):
pip install autoguard-ml
With REST API support:
pip install autoguard-ml[api]
Everything:
pip install autoguard-ml[all]
From source:
git clone https://github.com/autoguard/autoguard-ml
cd autoguard-ml
pip install -e ".[dev]"
Requirements: Python 3.10+, numpy, pandas, scikit-learn, xgboost, lightgbm, optuna, shap, scipy, matplotlib, seaborn, rich, click
5-Minute Quickstart
Get some data
import pandas as pd
df = pd.read_csv("your_data.csv")
No data yet? Generate sample data:
python examples/generate_sample_data.py
# creates: examples/data/train.csv and new_data.csv
Run the full pipeline
from autoguard import AutoGuard
# Initialize with your target column
ag = AutoGuard(target="churn")
# Step 1: Diagnose data quality
ag.diagnose(df)
╭──────────────────── Dataset Doctor ─────────────────────╮
│ Rows: 5,000   Cols: 15   Target: churn                  │
│ Risk Score: 58.0 / 100 (HIGH)                           │
╰─────────────────────────────────────────────────────────╯
HIGH      35% missing values in column 'age'
HIGH      Severe class imbalance: 'churn' = 6.2%
CRITICAL  Possible leakage in 'customer_id' (r=0.99)
MEDIUM    Highly correlated: income ↔ salary (0.97)
# Step 2: Auto-clean
df_clean = ag.auto_fix(df)
# Step 3: Train (AutoML picks the best model)
ag.fit(df_clean)
AutoML Leaderboard
Rank  Model               CV Score
1     XGBoost             0.92341
2     RandomForest        0.90187
3     LightGBM            0.89923
4     LogisticRegression  0.83410
Best model: XGBoost
# Step 4: Explain with SHAP
ag.explain()
# Step 5: Generate HTML report
ag.report()
# Step 6: Save
ag.save("model.pkl")
# Step 7: Monitor production data for drift
ag.monitor(new_df)
╭─────────────── Drift Monitor ───────────────╮
│ Severity: 72.3/100 (HIGH) | 3 features      │
╰─────────────────────────────────────────────╯
monthly_fee   PSI=0.31  HIGH
city          PSI=0.29  MODERATE
usage_score   PSI=0.14  MODERATE
Feature Guide
1. Dataset Doctor
Runs before you train. Catches problems that would silently destroy model accuracy.
report = ag.diagnose(df)
# Summary
report["risk_score"] # 58.0 (0 to 100)
report["risk_level"] # "high" (low / medium / high / critical)
report["issues"] # list of all warnings with severity
# Individual check results
report["missing_values"]["columns_with_missing"] # {col: missing_ratio}
report["missing_values"]["high_missing_columns"] # columns above threshold
report["class_imbalance"]["is_imbalanced"] # True or False
report["class_imbalance"]["class_distribution"] # {class: ratio}
report["correlations"]["high_correlation_pairs"] # [{a, b, correlation}]
report["leakage_risk"]["suspicious_columns"] # [{column, corr_with_target}]
report["outliers"]["flagged_columns"] # {col: {n_outliers, pct}}
report["skewness"]["skewed_columns"] # {col: skew_value}
report["constant_columns"]["constant_columns"] # [col, ...]
Generated plots saved to autoguard_output/plots/:
| File | What it shows |
|---|---|
| missing_values.png | Bar chart of missing ratios per column |
| correlation_matrix.png | Heatmap of feature correlations |
| target_distribution.png | Class balance or value histogram |
| skewness.png | Absolute skewness by feature |
Disable plots:
from autoguard import AutoGuard, AutoGuardConfig
cfg = AutoGuardConfig()
cfg.data.generate_plots = False
ag = AutoGuard(target="label", config=cfg)
2. Auto-Fix (Data Cleaning)
Automatically preprocesses raw data so it is ready for training.
df_clean = ag.auto_fix(df)
What it does, in order:
| Step | What happens |
|---|---|
| Drop constants | Removes columns with only one unique value |
| Fill missing numeric | Fills with column median |
| Fill missing categorical | Fills with column mode |
| Cap outliers | Clips values beyond 3 × IQR |
| Log-transform skewed | Applies log1p to highly skewed numeric columns |
| Encode categoricals | Label-encodes (<=15 unique values), one-hot encodes (<=50 unique values), or frequency-encodes (more than 50) |
The target column is never modified.
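The cleaning steps above can be sketched in plain Python. This is an illustration under stated assumptions, not AutoGuard's cleaner: all four helper names are hypothetical, and the capping rule assumes "beyond 3 × IQR" means more than 3 × IQR below the first quartile or above the third quartile, which is one common reading.

```python
import math
import statistics


def fill_missing_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def cap_outliers_iqr(values, k=3.0):
    """Clip values lying more than k * IQR outside the first/third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]


def log1p_transform(values):
    """Compress a right-skewed, non-negative column with log(1 + x)."""
    return [math.log1p(v) for v in values]


def choose_encoding(n_unique):
    """Pick an encoding from the cardinality cut-offs listed in the table."""
    if n_unique <= 15:
        return "label"
    if n_unique <= 50:
        return "one_hot"
    return "frequency"


# Made-up column with one missing value and one extreme value.
column = [1, 2, 3, None, 5, 6, 7, 8, 1000]
filled = fill_missing_with_median(column)
capped = cap_outliers_iqr(filled)
print(capped)
print(choose_encoding(12), choose_encoding(40), choose_encoding(500))
```

Filling before capping matters in this sketch because the quartiles cannot be computed while the column still contains missing entries.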
# Apply the same fitted cleaner to new production data
df_new_clean = ag.auto_fix(new_df)
3. AutoML Engine
Tries multiple models, tunes hyperparameters with Optuna, cross-validates, picks the winner.
ag.fit(df_clean)
# Force problem type if needed
ag.fit(df_clean, problem_type="classification")
ag.fit(df_clean, problem_type="regression")
# After fitting
ag.best_model_name # "xgboost"
ag.best_model # the fitted sklearn-compatible estimator
ag.leaderboard # pd.DataFrame with all model scores
ag.problem_type # "classification" or "regression"
ag.feature_cols # list of feature column names used
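The "cross-validate, then pick the winner" step can be pictured as a ranking over per-fold scores. A minimal sketch, assuming made-up fold scores and a hypothetical `build_leaderboard` helper; it is not the engine's code, and the real `ag.leaderboard` is a pandas DataFrame rather than a list of dicts.

```python
from statistics import mean


def build_leaderboard(fold_scores):
    """Rank models by mean cross-validation score, best first."""
    rows = [{"model": name, "cv_score": mean(scores)}
            for name, scores in fold_scores.items()]
    # scikit-learn scorers follow "higher is better", which is why error
    # metrics such as neg_root_mean_squared_error are negated.
    rows.sort(key=lambda row: row["cv_score"], reverse=True)
    for rank, row in enumerate(rows, start=1):
        row["rank"] = rank
    return rows


# Illustrative numbers only, not measured results.
fold_scores = {
    "random_forest": [0.90, 0.91, 0.89],
    "xgboost": [0.92, 0.93, 0.91],
    "logistic_regression": [0.83, 0.84, 0.82],
}
leaderboard = build_leaderboard(fold_scores)
print(leaderboard[0]["model"])
```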
Default models tried:
| Model | Classification | Regression |
|---|---|---|
| RandomForest | Yes | Yes |
| XGBoost | Yes | Yes |
| LightGBM | Yes | Yes |
| LogisticRegression | Yes | No |
| Ridge | Yes (RidgeClassifier) | Yes |
Default scoring metrics:
| Problem | Metric |
|---|---|
| Classification | f1_weighted |
| Regression | neg_root_mean_squared_error |
Change via config:
automl:
  scoring_classification: roc_auc
  scoring_regression: neg_mean_absolute_error
Class imbalance: SMOTE is applied automatically. To disable it, set cfg.automl.handle_imbalance = False
4. Explainability (SHAP)
# Uses validation slice from training automatically
ag.explain()
# Custom data and specific sample
ag.explain(X=df.drop(columns=["label"]).head(300), sample_index=5)
Explainer auto-selection:
| Model type | SHAP explainer |
|---|---|
| RandomForest, XGBoost, LightGBM | TreeExplainer (fast) |
| LogisticRegression, Ridge, Lasso | LinearExplainer |
| Everything else | KernelExplainer (slower, model-agnostic) |
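The selection rule in the table amounts to a lookup with a model-agnostic fallback. A small sketch, assuming a hypothetical `pick_explainer` helper that takes the model family name as a string; how AutoGuard actually inspects the fitted estimator is not documented here. The returned names correspond to the `shap.TreeExplainer`, `shap.LinearExplainer`, and `shap.KernelExplainer` classes.

```python
TREE_MODELS = {"RandomForest", "XGBoost", "LightGBM"}
LINEAR_MODELS = {"LogisticRegression", "Ridge", "Lasso"}


def pick_explainer(model_family):
    """Map a model family name to the SHAP explainer class name from the table."""
    if model_family in TREE_MODELS:
        return "TreeExplainer"
    if model_family in LINEAR_MODELS:
        return "LinearExplainer"
    # Anything else falls back to the slower, model-agnostic explainer.
    return "KernelExplainer"


print(pick_explainer("XGBoost"))
print(pick_explainer("ExtraTrees"))
```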
Output files saved to autoguard_output/explain/:
| File | What it shows |
|---|---|
| shap_global_importance.png | Bar chart: mean absolute SHAP per feature |
| shap_summary_plot.png | Beeswarm: feature value vs impact |
| shap_local_0.png | Waterfall for one specific prediction |
Get importance as a Series:
importance = ag._explainer.get_feature_importance(X)
print(importance.head(10))
# monthly_fee 0.1823
# tenure_months 0.1541
# usage_score 0.1203
5. Report Generator
# HTML report (default, opens in browser)
ag.report()
ag.report(output_path="results/my_report.html")
# JSON format
ag.report(format="json", output_path="report.json")
# The method also returns the report dict
data = ag.report()
data["best_model"] # "xgboost"
data["leaderboard"] # list of dicts
data["diagnosis"] # full diagnosis report
The HTML report is self-contained: it needs no internet connection and no external CSS. Open it in any browser and share it with teammates or stakeholders.
6. Drift Monitor
Detects when production data drifts away from the training distribution.
report = ag.monitor(new_df)
report["overall_drift_severity"] # 72.3 (0 to 100)
report["drift_level"] # "high" (none / low / moderate / high / critical)
report["n_features_drifted"] # 3
report["drifted_features"] # ["monthly_fee", "city", "usage_score"]
# Per-feature detail
f = report["features"]["monthly_fee"]
f["psi"] # 0.31 (Population Stability Index)
f["ks_pvalue"] # 0.0003
f["drifted"] # True
f["severity"] # "high"
f["ref_mean"] # 65.2 (training mean)
f["cur_mean"] # 95.8 (current mean)
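The `ks_pvalue` field comes from a two-sample Kolmogorov-Smirnov test, which compares the empirical distributions of the training and current values. The statistic behind that test is the largest gap between the two empirical CDFs. The sketch below computes only that statistic in plain Python; `ks_statistic` is a hypothetical helper, not part of AutoGuard. In practice a library routine such as `scipy.stats.ks_2samp` returns both the statistic and the p-value.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest gap between the two empirical CDFs (two-sample KS statistic D)."""
    ref = sorted(reference)
    cur = sorted(current)
    d = 0.0
    # Both CDFs are step functions that only change at observed values,
    # so evaluating the gap at every observed value is sufficient.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        d = max(d, abs(cdf_ref - cdf_cur))
    return d


print(ks_statistic([1, 2, 3, 4], [1, 2, 3, 4]))
print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
```

A value of 0 means the samples have identical empirical distributions, and a value of 1 means they do not overlap at all.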
PSI severity thresholds:
| PSI Value | Meaning | Recommended action |
|---|---|---|
| Less than 0.10 | No significant drift | All good |
| 0.10 to 0.20 | Moderate drift | Investigate |
| Greater than 0.20 | Severe drift | Consider retraining |
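For reference, PSI compares the share of values falling into each bin of the training data with the share in the current data, and sums (current - reference) × ln(current / reference) over the bins. The following is a minimal sketch in plain Python, assuming ten equal-width bins built from the reference sample; the binning strategy AutoGuard uses is not stated in this README, and `psi` and `psi_level` are hypothetical helper names.

```python
import math


def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant column

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            # Clamp so out-of-range production values land in the edge bins.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or a division by zero.
        return [max(c / len(values), eps) for c in counts]

    ref = bin_fractions(reference)
    cur = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def psi_level(value, warning=0.10, alert=0.20):
    """Translate a PSI value into the levels from the table above."""
    if value < warning:
        return "no significant drift"
    if value <= alert:
        return "moderate drift"
    return "severe drift"


# Made-up samples: the second one is the first shifted upward.
reference = [float(v) for v in range(1000)]
shifted = [v + 500.0 for v in reference]
print(psi_level(psi(reference, reference)))
print(psi_level(psi(reference, shifted)))
```

The `eps` floor is a common workaround for empty bins; without it a bin that is populated in one sample and empty in the other would make the sum undefined.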
Continuous / streaming monitoring:
for batch in incoming_data_stream:
    result = ag.monitor(batch, save_report=True)
    if result["overall_drift_severity"] > 50:
        print("High drift: retraining needed")
        trigger_retraining_pipeline()
7. Save and Load
# Save the entire fitted AutoGuard instance
ag.save("model.pkl")
ag.save("models/production_v2.pkl")
# Load it back anywhere; it comes back fully fitted
ag2 = AutoGuard.load("model.pkl")
preds = ag2.predict(new_df)
ag2.monitor(stream_df)
ag2.explain()
8. Predict
# Classification
preds = ag.predict(X) # array of class labels
proba = ag.predict_proba(X) # array of shape (n_samples, n_classes)
# Regression
preds = ag.predict(X) # array of continuous values
CLI Reference
train
autoguard train data.csv --target churn
autoguard train data.csv --target churn --output my_model.pkl
autoguard train data.csv --target churn --config config.yaml
autoguard train data.csv --target price --problem-type regression
autoguard train data.csv --target churn --no-diagnose --no-fix
autoguard train data.csv --target churn --report
| Flag | Default | Description |
|---|---|---|
| --target / -t | required | Target column name |
| --output / -o | model.pkl | Where to save the model |
| --config / -c | none | YAML config file path |
| --problem-type | auto | Force classification or regression |
| --no-diagnose | off | Skip the diagnosis step |
| --no-fix | off | Skip auto-cleaning |
| --report | off | Generate HTML report after training |
diagnose
autoguard diagnose data.csv --target churn
autoguard diagnose data.csv --target churn --output diag.json
autoguard diagnose data.csv --target churn --no-plots
fix
autoguard fix data.csv --target churn
autoguard fix data.csv --target churn --output clean.csv
monitor
autoguard monitor new_data.csv --model model.pkl
autoguard monitor new_data.csv --model model.pkl --output drift_report.json
explain
autoguard explain --model model.pkl --data data.csv
autoguard explain --model model.pkl --data data.csv --sample-index 5
report
autoguard report --model model.pkl
autoguard report --model model.pkl --output results/report.html
autoguard report --model model.pkl --format json
serve
autoguard serve model.pkl
autoguard serve model.pkl --host 127.0.0.1 --port 9000
init
autoguard init # creates autoguard_config.yaml
autoguard init --output project/config.yaml
Config System (YAML)
Drive everything from a YAML file instead of passing arguments in code.
# autoguard_config.yaml
target: churn
verbose: true
output_dir: autoguard_output
data:
  missing_threshold: 0.30 # flag columns with >30% missing
  correlation_threshold: 0.95 # flag feature pairs above this
  outlier_method: iqr # iqr or zscore
  outlier_threshold: 3.0
  skewness_threshold: 1.0
  imbalance_ratio_threshold: 0.10
  leakage_correlation_threshold: 0.98
  generate_plots: true
automl:
  models:
    - random_forest
    - xgboost
    - lightgbm
    - logistic_regression
  n_trials: 50 # Optuna trials per model; more trials = better results, but slower
  cv_folds: 5
  scoring_classification: f1_weighted
  scoring_regression: neg_root_mean_squared_error
  timeout_per_model: 120 # seconds per model
  handle_imbalance: true
explain:
  max_samples: 300
  plot_top_n_features: 15
drift:
  ks_pvalue_threshold: 0.05
  psi_threshold_warning: 0.10
  psi_threshold_alert: 0.20
  alert_email: null
Use in Python:
ag = AutoGuard(config_path="autoguard_config.yaml")
ag.fit(df)
Use in CLI:
autoguard train data.csv --config autoguard_config.yaml
Programmatic config:
from autoguard import AutoGuard, AutoGuardConfig
from autoguard.core.config import AutoMLConfig, DriftConfig
cfg = AutoGuardConfig(
    automl=AutoMLConfig(
        n_trials=100,
        models=["xgboost", "lightgbm"],
        scoring_classification="roc_auc",
    ),
    drift=DriftConfig(
        psi_threshold_alert=0.15,
    ),
)
ag = AutoGuard(target="label", config=cfg)
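To illustrate how a nested config like this can fall back to defaults for anything left unspecified, here is a sketch using plain dataclasses. The class names (`Settings`, `AutoMLSettings`, `DriftSettings`) and the `from_dict` helper are hypothetical stand-ins, not the classes in autoguard.core.config, and the default values are copied from the example YAML above rather than confirmed library defaults. The input dict stands in for a parsed YAML file.

```python
from dataclasses import dataclass, field, fields, is_dataclass


@dataclass
class DriftSettings:
    ks_pvalue_threshold: float = 0.05
    psi_threshold_warning: float = 0.10
    psi_threshold_alert: float = 0.20


@dataclass
class AutoMLSettings:
    n_trials: int = 50
    cv_folds: int = 5
    scoring_classification: str = "f1_weighted"


@dataclass
class Settings:
    target: str = ""
    automl: AutoMLSettings = field(default_factory=AutoMLSettings)
    drift: DriftSettings = field(default_factory=DriftSettings)


def from_dict(cls, data):
    """Build a (possibly nested) dataclass, keeping defaults for missing keys."""
    kwargs = {}
    for f in fields(cls):
        if f.name not in data:
            continue  # the dataclass default applies
        value = data[f.name]
        # Recurse into nested sections such as "automl" or "drift".
        if is_dataclass(f.type) and isinstance(value, dict):
            value = from_dict(f.type, value)
        kwargs[f.name] = value
    return cls(**kwargs)


settings = from_dict(Settings, {
    "target": "churn",
    "drift": {"psi_threshold_alert": 0.15},
})
print(settings)
```

Only `psi_threshold_alert` is overridden here; every other field keeps the value declared on its dataclass.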
Plugin System (Custom Models)
Add any sklearn-compatible model to the AutoML search:
from autoguard.automl.registry import ModelRegistry
@ModelRegistry.register("extra_trees")
def build_extra_trees(trial, problem_type):
    from sklearn.ensemble import ExtraTreesClassifier, ExtraTreesRegressor
    params = {
        "n_estimators": trial.suggest_int("et_n", 50, 300),
        "max_depth": trial.suggest_int("et_depth", 3, 15),
        "random_state": 42,
        "n_jobs": -1,
    }
    if problem_type == "classification":
        return ExtraTreesClassifier(**params)
    return ExtraTreesRegressor(**params)
# Now include it in the model search
from autoguard import AutoGuard, AutoGuardConfig
from autoguard.core.config import AutoMLConfig
cfg = AutoGuardConfig(
    automl=AutoMLConfig(models=["xgboost", "random_forest", "extra_trees"])
)
ag = AutoGuard(target="label", config=cfg)
ag.fit(df)
See all registered models:
from autoguard.automl.registry import ModelRegistry
print(ModelRegistry.available())
# ['random_forest', 'xgboost', 'lightgbm', 'logistic_regression', 'ridge', 'extra_trees']
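For readers curious how a decorator-based registry of this shape can work, here is a minimal sketch. `SketchRegistry` and its `build` method are hypothetical and written for illustration; they are not the ModelRegistry source, and only `register` and `available` mirror names used in this README.

```python
class SketchRegistry:
    """Hypothetical stand-in for a decorator-based model registry."""

    _builders = {}

    @classmethod
    def register(cls, name):
        def decorator(builder):
            cls._builders[name] = builder
            return builder  # leave the decorated function usable as-is
        return decorator

    @classmethod
    def available(cls):
        return list(cls._builders)

    @classmethod
    def build(cls, name, trial, problem_type):
        try:
            builder = cls._builders[name]
        except KeyError:
            raise ValueError(f"unknown model: {name!r}") from None
        return builder(trial, problem_type)


@SketchRegistry.register("constant")
def build_constant(trial, problem_type):
    # A real builder would return an estimator; a dict keeps the sketch
    # free of third-party dependencies.
    return {"model": "constant", "problem_type": problem_type}


print(SketchRegistry.available())
print(SketchRegistry.build("constant", trial=None, problem_type="regression"))
```

Registering builder functions instead of model instances lets the search create a fresh model for every trial, with hyperparameters drawn from that trial.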
REST API
Start the server
pip install autoguard-ml[api]
# Train and save a model first
autoguard train data.csv --target churn --output model.pkl
# Start the API
autoguard serve model.pkl
# Swagger UI: http://localhost:8000/docs
Endpoints
| Method | Path | Description |
|---|---|---|
| GET | /health | Server and model status |
| GET | /model/info | Model name, leaderboard, feature list |
| GET | /diagnosis | Last dataset diagnosis report |
| POST | /predict | Batch inference; returns class labels |
| POST | /predict/proba | Probabilistic inference; returns probabilities |
| POST | /monitor | Drift detection on incoming data batch |
Example requests
Health check:
curl http://localhost:8000/health
{
  "status": "ok",
  "model_loaded": true,
  "best_model": "xgboost",
  "problem_type": "classification",
  "version": "0.1.0"
}
Predict:
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "data": [
      {"age": 35, "tenure_months": 12, "monthly_fee": 75, "city": "NYC"},
      {"age": 60, "tenure_months": 2, "monthly_fee": 120, "city": "LA"}
    ]
  }'
{
  "predictions": [0, 1],
  "model_name": "xgboost",
  "problem_type": "classification",
  "n_samples": 2
}
Predict with probabilities:
curl -X POST http://localhost:8000/predict/proba \
  -H "Content-Type: application/json" \
  -d '{"data": [{"age": 35, "tenure_months": 12, "monthly_fee": 75}]}'
{
  "probabilities": [[0.82, 0.18]],
  "classes": ["0", "1"],
  "model_name": "xgboost",
  "n_samples": 1
}
Drift detection:
curl -X POST http://localhost:8000/monitor \
  -H "Content-Type: application/json" \
  -d '{"data": [{"age": 65, "tenure_months": 3, "monthly_fee": 110}]}'
{
  "overall_drift_severity": 72.3,
  "drift_level": "high",
  "n_features_drifted": 3,
  "drifted_features": ["age", "monthly_fee", "tenure_months"],
  "timestamp": "2024-11-15T10:23:41Z"
}
Python client:
import requests
BASE = "http://localhost:8000"
# Predict
r = requests.post(f"{BASE}/predict", json={"data": X_new.to_dict(orient="records")})
predictions = r.json()["predictions"]
# Monitor for drift
r = requests.post(f"{BASE}/monitor", json={"data": stream_batch.to_dict(orient="records")})
print(r.json()["drift_level"])
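If adding the requests dependency is not an option, the same call can be made with the standard library. This is a sketch under assumptions: `build_predict_request` and `send` are hypothetical helper names, and the payload shape (`{"data": [...]}`) is the one shown in the curl examples above.

```python
import json
import urllib.request


def build_predict_request(rows, base_url="http://localhost:8000"):
    """Prepare a POST /predict request carrying the rows as a JSON body."""
    body = json.dumps({"data": rows}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=10):
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_predict_request([{"age": 35, "tenure_months": 12, "monthly_fee": 75}])
print(request.full_url)
# With the server running: predictions = send(request)["predictions"]
```

Building the request separately from sending it keeps the payload easy to inspect or log before any network call is made.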
Project Structure
autoguard-ml/
│
├── autoguard/                      main package
│   ├── __init__.py                 public API: AutoGuard, AutoGuardConfig
│   │
│   ├── core/
│   │   ├── guard.py                AutoGuard class, the main entry point
│   │   ├── config.py               YAML-driven configuration dataclasses
│   │   ├── exceptions.py           custom exception hierarchy
│   │   └── logging.py              Rich console + JSON file logging
│   │
│   ├── data/
│   │   ├── doctor.py               DatasetDoctor: 7 quality checks + plots
│   │   └── cleaner.py              AutoCleaner: fill, encode, normalise
│   │
│   ├── automl/
│   │   ├── engine.py               AutoMLEngine: HPO + CV + leaderboard
│   │   └── registry.py             ModelRegistry: plugin system + built-ins
│   │
│   ├── explain/
│   │   └── shap_explainer.py       ShapExplainer: Tree/Linear/Kernel + plots
│   │
│   ├── drift/
│   │   └── detector.py             DriftDetector: KS + PSI + Chi2 + alerts
│   │
│   ├── api/
│   │   └── server.py               FastAPI REST server
│   │
│   ├── cli/
│   │   └── main.py                 Click CLI: 7 commands
│   │
│   └── utils/
│       └── report.py               HTMLReportGenerator
│
├── tests/
│   ├── unit/
│   │   ├── test_doctor.py          DatasetDoctor tests
│   │   ├── test_cleaner.py         AutoCleaner tests
│   │   ├── test_drift.py           DriftDetector tests
│   │   └── test_config.py          config tests
│   └── integration/
│       └── test_pipeline.py        full end-to-end pipeline tests
│
├── examples/
│   ├── quickstart.py               complete working demo script
│   ├── generate_sample_data.py     creates train.csv and new_data.csv
│   └── config_example.yaml         annotated config file
│
├── pyproject.toml                  pip packaging config
├── README.md                       this file
└── LICENSE                         MIT
Running Tests
pip install autoguard-ml[dev]
# All tests
pytest
# Unit tests only (fast, no ML training)
pytest tests/unit/
# Integration tests (train real models, take 30-60 seconds)
pytest tests/integration/
# Verbose with coverage report
pytest -v --cov=autoguard --cov-report=term-missing
# One specific file
pytest tests/unit/test_drift.py -v
FAQ
Do I have to call auto_fix before fit?
No. fit runs its own internal preprocessing. auto_fix is optional: use it if you want to inspect or save the cleaned data before training.
Can I use my own pre-trained model?
Yes. Assign it directly:
ag = AutoGuard(target="label")
ag.best_model = my_sklearn_model
ag.best_model_name = "my_model"
ag._problem_type = "classification"
ag._feature_cols = X.columns.tolist()
ag._train_df_raw = train_df
ag._is_fitted = True
How do I get better model accuracy?
Increase n_trials in config (default is 30):
cfg.automl.n_trials = 100 # more trials = better model, slower
What if all models fail during AutoML?
An AutoMLError is raised with logs showing what went wrong. Common causes: too few samples for cross-validation, all-null columns, or mismatched data types. Run diagnose first to catch these.
Does it work for regression?
Yes. Problem type is auto-detected from the target column. Force it explicitly with ag.fit(df, problem_type="regression").
How do I silence the console output?
cfg = AutoGuardConfig(verbose=False)
ag = AutoGuard(target="label", config=cfg)
Where do all the output files go?
Everything goes to autoguard_output/ by default. Change it:
cfg = AutoGuardConfig(output_dir="my_project/outputs")
How do I trigger alerts when drift is detected?
Check the returned dict and add your own logic:
result = ag.monitor(batch)
if result["overall_drift_severity"] > 50:
    send_slack_alert(f"Drift severity: {result['overall_drift_severity']:.1f}")
    trigger_retraining_job()
Contributing
Contributions are welcome. Please:
- Fork the repository
- Create a feature branch: git checkout -b feature/my-feature
- Add tests for new functionality
- Run pytest and ensure everything passes
- Submit a pull request
Development setup:
git clone https://github.com/autoguard/autoguard-ml
cd autoguard-ml
pip install -e ".[dev]"
pytest
License
MIT © AutoGuard Contributors. See LICENSE.
Acknowledgements
Built on top of: scikit-learn · XGBoost · LightGBM · Optuna · SHAP · FastAPI · Rich · Click · scipy · matplotlib · seaborn