ds-toolkit
Post-collection data science lifecycle toolkit.
From raw DataFrame to evaluated, tracked, and reported model — in composable, Jupyter-native Python.
What is ds-toolkit?
ds-toolkit is an opinionated, production-ready library that wraps the messy middle of data science work — everything after you have data and before you have a deployed model. It gives you:
- One-call profiling and validation before you touch a single row
- CV-safe preprocessing that cannot leak across fold boundaries by design
- Auto-selecting encoders and scalers that make sensible choices without configuration
- Multi-model CV harness that ranks every estimator in one call
- Optuna-powered tuning with pre-built search spaces for every major model
- SHAP explainability that auto-picks TreeExplainer or KernelExplainer
- MLflow experiment tracking as a context manager — zero boilerplate
- Model cards generated from your result objects in two lines
Every module is sklearn-compatible (fit / transform / fit_transform), returns typed result objects with a .display() method that renders inline in Jupyter, and mutates nothing.
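That contract can be sketched in miniature. The class name and fields below are illustrative only — not the library's actual types — but they show the shape every stage returns: an immutable, typed result with a `.display()` method:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)  # frozen: result objects are immutable, per the no-mutation rule
class ProfileResult:
    summary: dict
    warnings: list = field(default_factory=list)

    def display(self) -> None:
        # In Jupyter the real objects render rich HTML; plain text is this sketch's fallback.
        for key, value in self.summary.items():
            print(f"{key}: {value}")
        for warning in self.warnings:
            print(f"warning: {warning}")

ProfileResult(summary={"rows": 1000, "columns": 12}).display()
```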
Architecture
ds_toolkit/
├── core/ # Stage 1–2: profiling, validation, cleaning
├── features/ # Stage 3: encoding, engineering, selection
├── models/ # Stage 4: registry, CV, tuning, ensembles
├── eval/ # Stage 5: metrics, SHAP, plots, error analysis
├── infra/ # Stage 6: experiment logging, config, serialisation
└── reporting/ # Stage 7: notebook output, HTML export, model cards
Installation
Core (no optional deps):
pip install dstoolkit-adnan
With boosting libraries:
pip install "dstoolkit-adnan[boosting]" # XGBoost + LightGBM + CatBoost
With tuning + tracking:
pip install "dstoolkit-adnan[tune,track]" # Optuna + MLflow
With SHAP explanations:
pip install "dstoolkit-adnan[explain]" # shap
Everything:
pip install "dstoolkit-adnan[all]"
Development install (editable):
git clone https://github.com/ShadowGodd1/ds-toolkit.git
cd ds-toolkit
pip install -e ".[dev]"
Quick Start — Full Pipeline
import pandas as pd
from ds_toolkit.core import DataProfiler, SchemaValidator, MissingHandler, OutlierDetector, TypeCaster
from ds_toolkit.features import EncoderFactory, DatetimeDecomposer, FeatureSelector, Scaler
from ds_toolkit.models import ModelRegistry, CVHarness, TunerOptuna
from ds_toolkit.eval import MetricsReport, ExplainerSHAP, DiagnosticPlotter, ErrorAnalyser
from ds_toolkit.infra import ExperimentLogger, ConfigManager, PipelineSerialiser
from ds_toolkit.reporting import NotebookReporter, generate_model_card
df = pd.read_csv("data/my_dataset.csv")
target_col = "label"
# ── Stage 1: Understand ──────────────────────────────────────────────────
profile = DataProfiler().profile(df)
profile.display() # renders inline in Jupyter
schema = {
"age": {"nullable": False, "min": 0, "max": 120},
"email": {"regex": r".+@.+\..+"},
}
validation = SchemaValidator().check(df, schema)
validation.display()
# ── Stage 2: Clean ───────────────────────────────────────────────────────
X = df.drop(columns=[target_col])
y = df[target_col]
X = TypeCaster().cast(X)
X, outlier_report = OutlierDetector(method="iqr", action="cap").detect(X)
handler = MissingHandler(strategy="median")
X = handler.fit_transform(X)
# ── Stage 3: Features ────────────────────────────────────────────────────
X = DatetimeDecomposer().decompose(X)
encoder = EncoderFactory(task="clf")
X = encoder.fit_transform(X, y)
scaler = Scaler(method="standard")
X = scaler.fit_transform(X)
selector = FeatureSelector(method="rfecv", task="clf")
X = selector.fit_transform(X, y)
# ── Stage 4: Train ───────────────────────────────────────────────────────
models = ModelRegistry.get(task="clf")
harness = CVHarness(task="clf", n_splits=5, scoring="roc_auc")
cv_results = harness.run(models, X, y)
cv_results.display()
best_name, best_model = cv_results.best_model
# Optional: tune the best model
tuner = TunerOptuna(task="clf", n_trials=100)
tune_result = tuner.tune(best_model, X, y)
best_model.set_params(**tune_result.best_params)
best_model.fit(X, y)
# ── Stage 5: Evaluate ────────────────────────────────────────────────────
# For brevity this evaluates on the training data; hold out a test split in real use
y_pred = best_model.predict(X)
y_proba = best_model.predict_proba(X)
metrics = MetricsReport(task="clf").report(y, y_pred, y_proba)
metrics.display()
shap_result = ExplainerSHAP(top_n=10).explain(best_model, X)
shap_result.display()
diag = DiagnosticPlotter().diagnostics(best_model, X, y)
diag.display()
errors = ErrorAnalyser(n_worst=0.1).analyse(best_model, X, y)
errors.display()
# ── Stage 6: Track ───────────────────────────────────────────────────────
logger = ExperimentLogger(tracking_uri="./mlruns")
with logger.run("my_experiment", params={"model": best_name}) as run:
    logger.log_metrics(metrics.metrics_df["value"].to_dict())
    logger.log_model(best_model, name=best_name)
    logger.log_shap(shap_result)
serialiser = PipelineSerialiser(output_dir="./models")
save_result = serialiser.save(best_model, name=best_name)
# ── Stage 7: Report ──────────────────────────────────────────────────────
NotebookReporter().display(cv_results, metrics, shap_result)
card = generate_model_card(
best_model,
cv_results=cv_results,
eval_results=metrics,
shap_result=shap_result,
error_report=errors,
experiment_info={"run_id": run.run_id},
)
card.display()
print(card.to_md()) # export as Markdown
Stage Reference
Stage 1 — Data Understanding & Validation
DataProfiler
One-call dataset summary: shape, dtypes, memory, missing%, cardinality, skew, kurtosis, outlier flag.
from ds_toolkit.core import DataProfiler
profiler = DataProfiler(
cardinality_threshold=50, # columns with ≤N unique values → categorical
outlier_method="iqr", # 'iqr' | 'zscore' | 'both'
missing_threshold=0.05, # warn if missing% exceeds this
)
result = profiler.profile(df)
result.display() # Jupyter inline
result.summary_df # pd.DataFrame — one row per column
result.warnings # list[str]
SchemaValidator
Pydantic-backed schema enforcement. Raises in strict mode (strict=True) or returns a violations report.
from ds_toolkit.core import SchemaValidator
schema = {
"age": {"dtype": "numeric", "nullable": False, "min": 0, "max": 120},
"email": {"regex": r".+@.+\..+"},
"status": {"allowed": ["active", "inactive"]},
"id": {"unique": True, "nullable": False},
}
result = SchemaValidator(strict=False).check(df, schema)
result.passed # bool
result.violations_df # pd.DataFrame — [column, check, detail]
DistributionReport
Auto-generates histograms, KDE plots, QQ plots, box plots, and correlation heatmap. Exports self-contained HTML.
from ds_toolkit.core import DistributionReport
result = DistributionReport().run(df, output_dir="reports/")
result.html_path # Path to saved HTML
result.display() # inline in Jupyter
Stage 2 — Data Cleaning & Preprocessing
MissingHandler
Per-column imputation — CV-safe (fit on train only).
from ds_toolkit.core import MissingHandler
handler = MissingHandler(
strategy="median", # global fallback
col_strategies={"city": "mode", # per-column overrides
"note": "constant"},
fill_values={"note": "unknown"},
knn_neighbors=5,
)
X_train_clean = handler.fit_transform(X_train)
X_val_clean = handler.transform(X_val) # uses train statistics
Supported strategies: mean, median, mode, constant, knn, mice, none.
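The CV-safety contract above — learn statistics on train, reuse them on val/test — can be shown with a hand-rolled median imputer (a toy sketch, not the library's implementation):

```python
import statistics

class MedianImputer:
    """Minimal sketch: learn per-column medians on train, reuse them downstream."""

    def fit(self, columns: dict) -> "MedianImputer":
        self.medians_ = {
            name: statistics.median(v for v in values if v is not None)
            for name, values in columns.items()
        }
        return self

    def transform(self, columns: dict) -> dict:
        # Fill with *train* medians — validation rows never influence the statistic
        return {
            name: [self.medians_[name] if v is None else v for v in values]
            for name, values in columns.items()
        }

imputer = MedianImputer().fit({"age": [20, None, 40, 60]})
print(imputer.transform({"age": [None, 80]}))  # → {'age': [40, 80]}
```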
OutlierDetector
from ds_toolkit.core import OutlierDetector
detector = OutlierDetector(
method="iqr", # 'iqr' | 'zscore' | 'isoforest' | 'lof'
action="cap", # 'flag' | 'cap' | 'drop'
col_actions={"revenue": "drop"}, # per-column action override
iqr_factor=1.5,
)
result_df, report = detector.detect(df)
TypeCaster
from ds_toolkit.core import TypeCaster
caster = TypeCaster(
cardinality_threshold=50, # object cols with ≤N unique → category
downcast_numerics=True, # int64 → smallest safe int
parse_dates=True, # detect and parse date strings
)
df_typed = caster.cast(df)
caster.change_log # list of {column, from, to}
Deduplicator
from ds_toolkit.core import Deduplicator
dedup = Deduplicator(
keys=["patient_id", "visit_date"], # exact dedup keys
fuzzy_cols=["full_name"], # fuzzy dedup columns (requires rapidfuzz)
fuzzy_threshold=90,
)
df_clean = dedup.clean(df)
dedup.report() # pd.DataFrame — rows removed
Stage 3 — Feature Engineering
EncoderFactory
Auto-selects encoding by cardinality and task type.
| Condition | Strategy |
|---|---|
| Column has ordered metadata | OrdinalEncoder |
| Cardinality ≤ ohe_threshold (default 15) | OneHotEncoder |
| Cardinality > threshold + target available | TargetEncoder (smoothed, CV-safe) |
| Cardinality > threshold + no target | HashingEncoder |
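The selection rule in the table above can be sketched as a plain function (a hypothetical helper, not the library's internals):

```python
def pick_encoder(cardinality: int, has_order: bool, has_target: bool,
                 ohe_threshold: int = 15) -> str:
    """Mirror the dispatch table: ordered -> ordinal, low-cardinality -> one-hot,
    high-cardinality -> target encoding if y is available, else hashing."""
    if has_order:
        return "OrdinalEncoder"
    if cardinality <= ohe_threshold:
        return "OneHotEncoder"
    return "TargetEncoder" if has_target else "HashingEncoder"

print(pick_encoder(4, has_order=False, has_target=True))     # → OneHotEncoder
print(pick_encoder(500, has_order=False, has_target=False))  # → HashingEncoder
```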
from ds_toolkit.features import EncoderFactory
enc = EncoderFactory(
task="clf",
ohe_threshold=15,
ordered_cols={"size": ["S", "M", "L", "XL"]},
)
X_train_enc = enc.fit_transform(X_train, y_train)
X_val_enc = enc.transform(X_val)
enc.encoding_map # dict: column → strategy used
DatetimeDecomposer
from ds_toolkit.features import DatetimeDecomposer
dt = DatetimeDecomposer(
cols=["created_at"], # None = auto-detect all datetime cols
cyclical=True, # add sin/cos encodings for month, dow, hour
add_holidays=True, # requires: pip install holidays
country_code="KE", # ISO country code for holiday calendar
)
df_expanded = dt.decompose(df)
# Adds: created_at_year, _month, _day, _day_of_week, _is_weekend,
# _month_sin, _month_cos, _dow_sin, _dow_cos, ...
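The cyclical sin/cos encoding maps a periodic field onto the unit circle, so December (12) and January (1) end up adjacent rather than 11 units apart. A minimal version:

```python
import math

def cyclical_encode(value: int, period: int) -> tuple:
    """Map a periodic value (month, day-of-week, hour) to (sin, cos) coordinates."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

dec = cyclical_encode(12, period=12)  # same point on the circle as month 0
jan = cyclical_encode(1, period=12)
# Euclidean distance Dec→Jan is small, unlike the raw gap |12 - 1| = 11
print(math.dist(dec, jan))
```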
InteractionBuilder
from ds_toolkit.features import InteractionBuilder
builder = InteractionBuilder(
cols=["age", "income", "score"],
include_types=["product", "ratio"], # 'polynomial' | 'product' | 'ratio'
prune_interactions=True, # drop near-zero-variance interactions
top_k=20, # optional: RF-based top-k selection
)
X_train_int = builder.fit_transform(X_train, y_train)
X_val_int = builder.transform(X_val)
builder.selected_features_ # list of surviving feature names
FeatureSelector
Multi-stage pipeline: variance → correlation → RFECV → SHAP (each stage toggleable).
from ds_toolkit.features import FeatureSelector
selector = FeatureSelector(
method="rfecv", # 'variance' | 'correlation' | 'rfecv' | 'shap'
task="clf",
correlation_threshold=0.95,
cv_folds=5,
)
X_train_sel = selector.fit_transform(X_train, y_train)
X_val_sel = selector.transform(X_val)
selector.selected_features_ # list of kept features
selector.report() # pd.DataFrame — [feature, stage, reason]
Scaler
from ds_toolkit.features import Scaler
scaler = Scaler(
method="standard", # 'standard' | 'minmax' | 'robust'
exclude_cols=["id", "flag"], # never scale these
)
X_train_sc = scaler.fit_transform(X_train)
X_val_sc = scaler.transform(X_val)
scaler.scaling_stats_ # pd.DataFrame — center/scale per column
Stage 4 — Model Training & Selection
ModelRegistry
from ds_toolkit.models import ModelRegistry
models = ModelRegistry.get(task="clf") # all available
models = ModelRegistry.get(task="clf",
include=["lr", "rf", "xgboost"]) # only these
models = ModelRegistry.get(task="clf",
exclude=["mlp"]) # all except these
Built-in keys: lr, rf, gbm, et, mlp, xgboost, lightgbm, catboost
CVHarness
from ds_toolkit.models import CVHarness
harness = CVHarness(
task="clf",
n_splits=5,
scoring="roc_auc",
verbose=True,
)
cv_results = harness.run(models, X_train, y_train)
cv_results.summary_df # ranked by mean_score
cv_results.best_model # (name, fitted estimator)
cv_results.display() # inline table in Jupyter
CV strategy is auto-selected:
| Condition | Strategy |
|---|---|
| task='clf', balanced | StratifiedKFold(n_splits=5) |
| task='clf', imbalanced | StratifiedKFold + class_weight='balanced' |
| task='reg' | KFold(n_splits=5, shuffle=True) |
| task='ts' | TimeSeriesSplit(n_splits=5) |
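The auto-selection above can be approximated with a small dispatch function (a hypothetical helper — in particular the 20% imbalance cutoff is an assumption, not the library's documented heuristic):

```python
from collections import Counter

def pick_cv_strategy(task: str, y=None, imbalance_cutoff: float = 0.2) -> str:
    """Return a description of the CV splitter the table above prescribes."""
    if task == "ts":
        return "TimeSeriesSplit(n_splits=5)"
    if task == "reg":
        return "KFold(n_splits=5, shuffle=True)"
    # classification: compare the minority-class share against the cutoff
    counts = Counter(y)
    minority_share = min(counts.values()) / len(y)
    if minority_share < imbalance_cutoff:
        return "StratifiedKFold + class_weight='balanced'"
    return "StratifiedKFold(n_splits=5)"

print(pick_cv_strategy("clf", y=[0] * 95 + [1] * 5))  # → StratifiedKFold + class_weight='balanced'
```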
TunerOptuna
from ds_toolkit.models import TunerOptuna
from sklearn.ensemble import RandomForestClassifier
tuner = TunerOptuna(
task="clf",
n_trials=100,
cv_folds=5,
scoring="roc_auc",
)
result = tuner.tune(RandomForestClassifier(), X_train, y_train)
result.best_params # dict — apply with model.set_params(**result.best_params)
result.best_score
result.study # optuna.Study for further analysis
Pre-built search spaces: LogisticRegression, Ridge, RandomForest, ExtraTrees, GradientBoosting, XGBoost, LightGBM.
EnsembleBuilder
from ds_toolkit.models import EnsembleBuilder
builder = EnsembleBuilder(
task="clf",
method="stack", # 'stack' | 'vote' | 'blend'
meta_learner="lr", # 'lr' | 'ridge' | any sklearn estimator
cv_folds=5,
)
ensemble = builder.build(models, X_train, y_train)
preds = ensemble.predict(X_val)
proba = ensemble.predict_proba(X_val)
Stage 5 — Evaluation & Diagnostics
MetricsReport
from ds_toolkit.eval import MetricsReport
result = MetricsReport(task="clf").report(y_true, y_pred, y_proba=y_proba)
result.metrics_df # pd.DataFrame — metric → value
result.display()
| Task | Primary Metrics | Secondary Metrics |
|---|---|---|
| Binary clf | ROC-AUC, F1, Precision, Recall | Log-loss, MCC, PR-AUC |
| Multi-class clf | Macro F1, Accuracy | Per-class P/R/F1 |
| Regression | RMSE, MAE, R² | MAPE, Adj. R², Max error |
ExplainerSHAP
from ds_toolkit.eval import ExplainerSHAP
result = ExplainerSHAP(top_n=10).explain(model, X)
result.display() # summary plot inline
result.values # raw SHAP values (n_samples × n_features)
result.figures # dict: 'summary', 'bar', 'dependence_<col>'
Auto-selects TreeExplainer for tree-based models, KernelExplainer for all others.
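That dispatch can be approximated by inspecting the model's class name (a sketch — the class list is illustrative, and the real check likely consults shap's own supported-model registry):

```python
TREE_MODEL_NAMES = {  # assumed set of tree-based estimator class names
    "DecisionTreeClassifier", "RandomForestClassifier", "ExtraTreesClassifier",
    "GradientBoostingClassifier", "XGBClassifier", "LGBMClassifier", "CatBoostClassifier",
}

def pick_explainer(model) -> str:
    """TreeExplainer for tree ensembles (fast, exact); KernelExplainer otherwise."""
    return "TreeExplainer" if type(model).__name__ in TREE_MODEL_NAMES else "KernelExplainer"

class RandomForestClassifier:  # stand-in so the sketch runs without sklearn
    pass

print(pick_explainer(RandomForestClassifier()))  # → TreeExplainer
```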
DiagnosticPlotter
from ds_toolkit.eval import DiagnosticPlotter
result = DiagnosticPlotter().diagnostics(model, X, y)
result.display()
result.figures # dict of matplotlib figures
Classification: confusion matrix (raw + normalised), ROC curve, PR curve, calibration plot
Regression: residuals vs fitted, Q-Q plot, scale-location, Cook's distance
ErrorAnalyser
from ds_toolkit.eval import ErrorAnalyser
result = ErrorAnalyser(n_worst=0.1).analyse(model, X, y)
result.segments_df # feature distribution shift: worst vs rest
result.worst_df # the n_worst mis-predicted rows
result.display()
Stage 6 — Experiment Tracking & Reproducibility
ExperimentLogger
from ds_toolkit.infra import ExperimentLogger
logger = ExperimentLogger(tracking_uri="./mlruns")
with logger.run("my_experiment", params={"model": "rf", "n_estimators": 200}) as run:
    model.fit(X_train, y_train)
    logger.log_metrics({"roc_auc": 0.91, "f1": 0.87})
    logger.log_model(model, name="random_forest")
    logger.log_shap(shap_result)
print(run.run_id)
print(run.artifact_uri)
Auto-logged per run: params, metrics, model artifact, SHAP plot, requirements.txt snapshot, git commit hash.
ConfigManager
from ds_toolkit.infra import ConfigManager
# config/experiment.yaml:
# model:
# n_estimators: 200
# task: clf
# data:
# target_col: ${TARGET_COL} # resolved from env var
cfg = ConfigManager.load(
"config/experiment.yaml",
required=["data.target_col", "model.task"],
)
cfg.model.n_estimators # 200
cfg.data.target_col # value from $TARGET_COL
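The ${VAR} resolution shown in the YAML comment can be done with a small regex pass over loaded values (a sketch, not the library's code):

```python
import os
import re

_ENV_PATTERN = re.compile(r"\$\{(\w+)\}")

def resolve_env(value: str) -> str:
    """Replace every ${NAME} in a config value with os.environ['NAME']."""
    def _sub(match):
        name = match.group(1)
        if name not in os.environ:
            raise KeyError(f"config references undefined env var: {name}")
        return os.environ[name]
    return _ENV_PATTERN.sub(_sub, value)

os.environ["TARGET_COL"] = "label"
print(resolve_env("target=${TARGET_COL}"))  # → target=label
```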
PipelineSerialiser
from ds_toolkit.infra import PipelineSerialiser
serial = PipelineSerialiser(output_dir="./models")
# Save with SHA-256 checksum + metadata sidecar
result = serial.save(
pipeline,
name="rf_v1",
metadata={"roc_auc": 0.91, "trained_on": "2024-01-15"},
)
print(result.path) # ./models/rf_v1_20240115_143022.pkl
print(result.checksum) # SHA-256 hex
# Load — raises ChecksumError if file was tampered
model = serial.load(result.path)
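The checksum guard can be sketched with hashlib and pickle (ChecksumError and the sidecar layout here are illustrative; the library's on-disk format may differ):

```python
import hashlib
import pickle
import tempfile
from pathlib import Path

class ChecksumError(Exception):
    """Raised when an artifact's bytes no longer match the recorded digest."""

def save_with_checksum(obj, path: Path) -> str:
    data = pickle.dumps(obj)
    path.write_bytes(data)
    digest = hashlib.sha256(data).hexdigest()
    path.with_suffix(".sha256").write_text(digest)  # metadata sidecar
    return digest

def load_verified(path: Path):
    data = path.read_bytes()
    expected = path.with_suffix(".sha256").read_text()
    if hashlib.sha256(data).hexdigest() != expected:
        raise ChecksumError(f"{path} was modified after saving")
    return pickle.loads(data)

with tempfile.TemporaryDirectory() as tmp:
    p = Path(tmp) / "model.pkl"
    save_with_checksum({"n_estimators": 200}, p)
    print(load_verified(p))  # → {'n_estimators': 200}
```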
Stage 7 — Reporting & Notebook Output
NotebookReporter
from ds_toolkit.reporting import NotebookReporter
NotebookReporter().display(
cv_results=cv_results,
eval_results=metrics,
shap_result=shap_result,
title="Patient Readmission Model — v1",
)
HTMLExporter
from ds_toolkit.reporting import HTMLExporter
result = HTMLExporter().export(
output_path="reports/experiment_v1.html",
cv_results=cv_results,
eval_results=metrics,
shap_result=shap_result,
diagnostic_result=diag,
title="Experiment Report",
)
# Self-contained HTML — no external deps, safe to email
ModelCard
from ds_toolkit.reporting import generate_model_card
card = generate_model_card(
model,
cv_results=cv_results,
eval_results=metrics,
shap_result=shap_result,
error_report=errors,
experiment_info={"run_id": run.run_id, "git_hash": "a1b2c3d"},
)
card.display() # inline in Jupyter
card.to_md() # Markdown string
card.to_html() # HTML string
Design Principles
- No side effects. Every module accepts a DataFrame or model and returns a new object. Nothing is mutated in place.
- CV-safety by default. Anything that learns statistics from the data (TargetEncoder, MissingHandler, Scaler, FeatureSelector) has a fit/transform split. Fit on train. Transform on val/test.
- Jupyter-native. Every result object has a .display() method that renders rich HTML inline. Nothing requires a separate report step.
- Stack-agnostic. XGBoost, LightGBM, CatBoost, and all sklearn estimators are first-class citizens across every stage.
- Optional dependencies stay optional. shap, optuna, mlflow, rapidfuzz, and the boosting libraries are never imported at the top level. They are imported at call time and fail with a clear install message.
Running Tests
# All 209 tests
pytest
# Specific stage
pytest tests/test_core/
pytest tests/test_features/
pytest tests/test_models/
pytest tests/test_eval/
# With coverage
pytest --cov=ds_toolkit --cov-report=html
Contributing
Contributions are welcome. See CONTRIBUTING.md for guidelines.
Quick contribution flow:
git clone https://github.com/ShadowGodd1/ds-toolkit.git
cd ds-toolkit
pip install -e ".[dev]"
git checkout -b feature/my-feature
# make changes
pytest
git push origin feature/my-feature
# open a Pull Request
Changelog
See CHANGELOG.md.
License
MIT — see LICENSE.
Author
Adnan Mohamud
CEO & Founder, PataDoc — The Partner in Health in Your Hand
github.com/ShadowGodd1