
dspipeline

A stateful, chainable data-science pipeline for the full tabular ML workflow — from raw data to model-ready arrays — in a single class.

Installation

pip install dspipeline

Quick start

import pandas as pd
from dspipeline import DataSciencePipeline

df  = pd.read_csv("your_dataset.csv")
dsp = DataSciencePipeline(df, target_col="Churn", task_type="classification")

# One-liner: diagnostics → cleaning → preprocessing
dsp.run_diagnostics().run_cleaning().run_preprocessing()

# Leakproof split
X_train, X_test, y_train, y_test, preprocessor = dsp.split(test_size=0.2)

# Fit your model on processed arrays
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier()
model.fit(dsp.results["split"]["X_train_processed"], y_train)
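The split step keeps preprocessing leakage-free by fitting transforms on the training rows only. A minimal, self-contained sketch of that general pattern with scikit-learn (the toy columns here are illustrative, not part of dspipeline's API):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame standing in for your dataset (illustrative columns).
df = pd.DataFrame({
    "tenure": [1, 24, 60, 3, 48, 12],
    "plan": ["basic", "pro", "pro", "basic", "basic", "pro"],
    "Churn": [1, 0, 0, 1, 0, 1],
})

X, y = df.drop(columns=["Churn"]), df["Churn"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["tenure"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
])

# Fit on the training split only, then transform both sides:
# no test-set statistics leak into the fitted scaler/encoder.
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)
```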

What it covers

| Phase             | Methods |
|-------------------|---------|
| Diagnostics       | profile_missing, detect_structural, detect_dimensional, detect_categorical, detect_predictive, detect_anomaly_scan, detect_leakage |
| Cleaning          | format_structure, drop_duplicates, standardize_text, impute_numeric, impute_categorical |
| Anomaly handling  | handle_outliers |
| Transformation    | transform_shape |
| Encoding          | encode |
| Feature selection | select_features, vif_optimize |
| Split             | split |
| EDA               | analyze_distribution, evaluate_distribution, analyze_relationship, test_hypothesis |
| Time-series       | enforce_stationarity, analyze_autocorrelation |
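To illustrate the kind of report a diagnostics pass produces, here is a hand-rolled missing-value profile in plain pandas. This is a sketch of the idea behind profile_missing, not its actual implementation:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 31, np.nan, 40],
    "city": ["NY", "LA", None, "NY", "SF"],
    "score": [0.5, 0.7, 0.2, 0.9, 0.1],
})

# Per-column missing count and percentage, worst columns first;
# fully populated columns are dropped from the report.
missing = df.isna().sum()
report = (
    pd.DataFrame({"n_missing": missing, "pct_missing": missing / len(df) * 100})
    .query("n_missing > 0")
    .sort_values("pct_missing", ascending=False)
)
```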

Method chaining

Every mutating method returns self, so you can chain calls:

(dsp
  .profile_missing()
  .drop_duplicates(subset=["user_id"], sort_by="updated_at")
  .impute_numeric(strategy="knn")
  .impute_categorical(strategy="mode")
  .handle_outliers(method="iqr", action="clip", threshold=1.5)
  .transform_shape(scale_method="robust")
  .encode(nominal_cols=["color"], ordinal_maps={"size": ["S", "M", "L"]})
  .select_features(multi_corr_threshold=0.85)
  .vif_optimize(threshold=5.0)
)
X_train, X_test, y_train, y_test, preprocessor = dsp.split(stratify=True)
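The handle_outliers(method="iqr", action="clip", threshold=1.5) call in the chain corresponds to the classic Tukey fence: values outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR] are pulled back to the fence. A standalone sketch of that rule (not dspipeline's internals):

```python
import pandas as pd

def iqr_clip(s: pd.Series, threshold: float = 1.5) -> pd.Series:
    """Clip values to the Tukey fences [Q1 - t*IQR, Q3 + t*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - threshold * iqr, upper=q3 + threshold * iqr)

s = pd.Series([1, 2, 3, 4, 5, 100])  # 100 sits far outside the fences
clipped = iqr_clip(s)
```

Clipping (rather than dropping) preserves the row count, which matters when the frame must stay aligned with a target column.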

State inspection

dsp.summary()           # prints shape history for every step
dsp.results.keys()      # all stored reports and artefacts
dsp.history             # list of {method, shape_before, shape_after}
df_snapshot = dsp.snapshot()   # copy of current working DataFrame
dsp.reset()             # restore to original raw DataFrame
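The history attribute follows a common audit-trail pattern: record the frame's shape before and after each mutating step, and return self to keep the chain going. A minimal illustration of the pattern itself (dspipeline's actual records may carry more fields):

```python
import pandas as pd

class HistoryDemo:
    """Tiny stand-in showing the shape-audit pattern, not dspipeline itself."""

    def __init__(self, df: pd.DataFrame):
        self.df = df
        self.history = []

    def _record(self, method: str, shape_before: tuple) -> None:
        self.history.append({
            "method": method,
            "shape_before": shape_before,
            "shape_after": self.df.shape,
        })

    def drop_duplicates(self):
        before = self.df.shape
        self.df = self.df.drop_duplicates()
        self._record("drop_duplicates", before)
        return self  # returning self is what enables chaining

demo = HistoryDemo(pd.DataFrame({"a": [1, 1, 2]})).drop_duplicates()
```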

Individual functions

Every function is also importable directly:

from dspipeline import (
    advanced_missing_profiler,
    detect_anomalies,
    handle_numerical_missing,
    advanced_knn_impute,
    transform_data_shape,
    encode_categorical_data,
    optimize_vif,
    setup_leakproof_environment,
    analyze_distribution,
    test_hypothesis,
    enforce_stationarity,
    analyze_autocorrelation,
)
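For instance, KNN-based imputation (the technique advanced_knn_impute presumably wraps) is available directly in scikit-learn; a self-contained sketch of the underlying idea:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "x1": [1.0, 2.0, np.nan, 4.0, 5.0],
    "x2": [2.0, 4.0, 6.0, 8.0, 10.0],
})

# Each missing value is filled with the mean of its k nearest rows,
# where distance is measured on the features that are present.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```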

Time-series example

dsp = DataSciencePipeline(df, target_col="demand")

# Check and fix stationarity
series, d, report = dsp.enforce_stationarity("revenue", seasonal_period=12)
print(f"Applied d={d} differencing steps")

# ACF / PACF + ARIMA order hints
acf_report = dsp.analyze_autocorrelation("revenue", lags=40)
print(f"Suggested ARIMA({acf_report['arima_hint_p']}, {d}, {acf_report['arima_hint_q']})")

Requirements

  • Python ≥ 3.9
  • pandas, numpy, scikit-learn, scipy, statsmodels, matplotlib, seaborn

License

MIT
