
Project description

dspipeline

A stateful, chainable data-science pipeline for the full tabular ML workflow — from raw data to model-ready arrays — in a single class.

Installation

pip install dspipeline

Quick start

import pandas as pd
from dspipeline import DataSciencePipeline

df  = pd.read_csv("your_dataset.csv")
dsp = DataSciencePipeline(df, target_col="Churn", task_type="classification")

# One-liner: diagnostics → cleaning → preprocessing
dsp.run_diagnostics().run_cleaning().run_preprocessing()

# Leakproof split
X_train, X_test, y_train, y_test, preprocessor = dsp.split(test_size=0.2)

# Fit your model on processed arrays
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier()
model.fit(dsp.results["split"]["X_train_processed"], y_train)

What it covers

  • Diagnostics: profile_missing, detect_structural, detect_dimensional, detect_categorical, detect_predictive, detect_anomaly_scan, detect_leakage
  • Cleaning: format_structure, drop_duplicates, standardize_text, impute_numeric, impute_categorical
  • Anomaly handling: handle_outliers
  • Transformation: transform_shape
  • Encoding: encode
  • Feature selection: select_features, vif_optimize
  • Split: split
  • EDA: analyze_distribution, evaluate_distribution, analyze_relationship, test_hypothesis
  • Time-series: enforce_stationarity, analyze_autocorrelation
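As a rough illustration of the kind of operations the Cleaning and Anomaly-handling phases perform, here is a sketch written directly against pandas and scikit-learn. The data and parameters are invented for the example; this is not dspipeline's internal implementation.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 100.0],
                   "b": [2.0, np.nan, 3.0, 4.0]})

# KNN imputation, cf. impute_numeric(strategy="knn"):
# the missing b is filled with the mean b of the two nearest rows.
imputed = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(df),
                       columns=df.columns)

# IQR clipping, cf. handle_outliers(method="iqr", action="clip", threshold=1.5):
# values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are clipped per column,
# so the outlier a=100 is pulled back to the upper fence.
q1, q3 = imputed.quantile(0.25), imputed.quantile(0.75)
iqr = q3 - q1
clipped = imputed.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr, axis=1)
```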

Method chaining

Every mutating method returns self, so you can chain calls:

(dsp
  .profile_missing()
  .drop_duplicates(subset=["user_id"], sort_by="updated_at")
  .impute_numeric(strategy="knn")
  .impute_categorical(strategy="mode")
  .handle_outliers(method="iqr", action="clip", threshold=1.5)
  .transform_shape(scale_method="robust")
  .encode(nominal_cols=["color"], ordinal_maps={"size": ["S", "M", "L"]})
  .select_features(multi_corr_threshold=0.85)
  .vif_optimize(threshold=5.0)
)
X_train, X_test, y_train, y_test, preprocessor = dsp.split(stratify=True)
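"Leakproof" here generally means that every statistic used in preprocessing (scaler means, encoder categories, imputation values) is fitted on the training fold only and merely applied to the test fold. A sketch of that idea using scikit-learn directly, on toy data; dspipeline's split() presumably does something equivalent internally:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age": [25, 32, 47, 51, 23, 38],
    "color": ["red", "blue", "red", "green", "blue", "red"],
    "churn": [0, 1, 0, 1, 0, 1],
})
X, y = df.drop(columns="churn"), df["churn"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y
)

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["color"]),
])
X_train_processed = preprocessor.fit_transform(X_train)  # statistics fit on train only
X_test_processed = preprocessor.transform(X_test)        # applied, never refitted
```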

State inspection

dsp.summary()           # prints shape history for every step
dsp.results.keys()      # all stored reports and artefacts
dsp.history             # list of {method, shape_before, shape_after}
df_snapshot = dsp.snapshot()   # copy of current working DataFrame
dsp.reset()             # restore to original raw DataFrame
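One way such shape bookkeeping can be implemented is with a decorator that records the DataFrame shape around each mutating method. The following is a hypothetical sketch of the pattern, not dspipeline's actual source:

```python
import functools
import pandas as pd

def tracked(method):
    """Record shape_before/shape_after for a mutating pipeline method."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        before = self.df.shape
        result = method(self, *args, **kwargs)
        self.history.append({"method": method.__name__,
                             "shape_before": before,
                             "shape_after": self.df.shape})
        return result  # mutating methods return self, enabling chaining
    return wrapper

class MiniPipeline:
    def __init__(self, df):
        self._raw = df.copy()   # kept so reset() can restore the original
        self.df = df.copy()
        self.history = []

    @tracked
    def drop_duplicates(self):
        self.df = self.df.drop_duplicates()
        return self

    def reset(self):
        self.df = self._raw.copy()
        self.history.clear()
        return self

p = MiniPipeline(pd.DataFrame({"x": [1, 1, 2]}))
p.drop_duplicates()
print(p.history)
# [{'method': 'drop_duplicates', 'shape_before': (3, 1), 'shape_after': (2, 1)}]
```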

Individual functions

Every function is also importable directly:

from dspipeline import (
    advanced_missing_profiler,
    detect_anomalies,
    handle_numerical_missing,
    advanced_knn_impute,
    transform_data_shape,
    encode_categorical_data,
    optimize_vif,
    setup_leakproof_environment,
    analyze_distribution,
    test_hypothesis,
    enforce_stationarity,
    analyze_autocorrelation,
)
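As an illustration of the idea behind optimize_vif / vif_optimize (iteratively dropping the feature with the highest variance inflation factor until every VIF falls below the threshold), here is a self-contained numpy sketch; the helper names and data are invented for the example:

```python
import numpy as np

def vif(X: np.ndarray, j: int) -> float:
    """VIF of column j: 1 / (1 - R^2) from regressing X[:, j] on the rest."""
    y = X[:, j]
    A = np.column_stack([np.delete(X, j, axis=1), np.ones(len(X))])  # intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    r2 = 1 - (y - A @ coef).var() / y.var()
    return 1.0 / max(1 - r2, 1e-12)

def prune_by_vif(X, names, threshold=5.0):
    X, names = X.copy(), list(names)
    while X.shape[1] > 1:
        vifs = [vif(X, j) for j in range(X.shape[1])]
        worst = int(np.argmax(vifs))
        if vifs[worst] <= threshold:
            break                      # all remaining features acceptable
        X = np.delete(X, worst, axis=1)
        del names[worst]
    return X, names

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([a, b, a + 0.01 * rng.normal(size=200)])  # 3rd col nearly duplicates 1st
_, kept = prune_by_vif(X, ["a", "b", "a_copy"], threshold=5.0)
# one of the collinear pair (a / a_copy) is dropped; b survives
```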

Time-series example

dsp = DataSciencePipeline(df, target_col="demand")

# Check and fix stationarity
series, d, report = dsp.enforce_stationarity("revenue", seasonal_period=12)
print(f"Applied d={d} differencing steps")

# ACF / PACF + ARIMA order hints
acf_report = dsp.analyze_autocorrelation("revenue", lags=40)
print(f"Suggested ARIMA({acf_report['arima_hint_p']}, {d}, {acf_report['arima_hint_q']})")

Requirements

  • Python ≥ 3.9
  • pandas, numpy, scikit-learn, scipy, statsmodels, matplotlib, seaborn

License

MIT

Download files

Download the file for your platform.

Source Distribution

dshandler-0.1.0.tar.gz (5.4 kB)


Built Distribution


dshandler-0.1.0-py3-none-any.whl (4.5 kB)


File details

Details for the file dshandler-0.1.0.tar.gz.

File metadata

  • Download URL: dshandler-0.1.0.tar.gz
  • Upload date:
  • Size: 5.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for dshandler-0.1.0.tar.gz
  • SHA256: 8626015f88691e8595fbd494865565b75709eb5d698d7c84a7d19df1cfcc46cf
  • MD5: 461f95b297559eef9b37104b4e4f6fbc
  • BLAKE2b-256: 7db2325695fa8bc22c80aae4d6725679741260087632a3afe0542c26f7d17930


File details

Details for the file dshandler-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: dshandler-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 4.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for dshandler-0.1.0-py3-none-any.whl
  • SHA256: 04b936bd2e33bd0c974bf8ff2d9d6efad2c1e3c73c045d4f16608075414f3f53
  • MD5: f830e93f2c56ce6ddb53e4d822707a66
  • BLAKE2b-256: 23963ebc4a1f8ff8a55f1d7a65a8e4e894f4a73e996382622a9ccc058c88d9f2

