
dspipeline

A stateful, chainable data-science pipeline for the full tabular ML workflow — from raw data to model-ready arrays — in a single class.

Installation

pip install dspipeline

Quick start

import pandas as pd
from dspipeline import DataSciencePipeline

df  = pd.read_csv("your_dataset.csv")
dsp = DataSciencePipeline(df, target_col="Churn", task_type="classification")

# One-liner: diagnostics → cleaning → preprocessing
dsp.run_diagnostics().run_cleaning().run_preprocessing()

# Leakproof split
X_train, X_test, y_train, y_test, preprocessor = dsp.split(test_size=0.2)

# Fit your model on processed arrays
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier()
model.fit(dsp.results["split"]["X_train_processed"], y_train)
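The "leakproof" part presumably means the preprocessor is fit on the training fold only, so no test-set statistics leak into the transform. A minimal sketch of that pattern with plain scikit-learn — the names below are illustrative and independent of dspipeline:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame(
    {"x1": range(10), "x2": [v * 2.0 for v in range(10)], "y": [0, 1] * 5}
)
X_train, X_test, y_train, y_test = train_test_split(
    df[["x1", "x2"]], df["y"], test_size=0.2, stratify=df["y"], random_state=0
)

scaler = StandardScaler().fit(X_train)  # statistics come from the training fold only
X_train_p = scaler.transform(X_train)
X_test_p = scaler.transform(X_test)     # the test set never influences the fit
```

The key point is the order of operations: split first, then fit any learned transform on the training portion alone.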

What it covers

Phase               Methods
Diagnostics         profile_missing, detect_structural, detect_dimensional, detect_categorical, detect_predictive, detect_anomaly_scan, detect_leakage
Cleaning            format_structure, drop_duplicates, standardize_text, impute_numeric, impute_categorical
Anomaly handling    handle_outliers
Transformation      transform_shape
Encoding            encode
Feature selection   select_features, vif_optimize
Split               split
EDA                 analyze_distribution, evaluate_distribution, analyze_relationship, test_hypothesis
Time-series         enforce_stationarity, analyze_autocorrelation
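As one illustration of what a diagnostic like detect_leakage might look for, the toy snippet below flags features that correlate almost perfectly with the target — a common leakage signature. This is a conceptual sketch, not dspipeline's actual implementation:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "signal": rng.normal(size=200),   # genuinely predictive feature
    "noise": rng.normal(size=200),    # unrelated feature
})
df["target"] = 2 * df["signal"] + rng.normal(size=200)
df["leaked"] = df["target"]           # a direct copy of the target: textbook leakage

# Flag features whose absolute correlation with the target is suspiciously high
corr = df.drop(columns="target").corrwith(df["target"]).abs()
leaks = corr[corr > 0.99].index.tolist()
```

A strong-but-honest predictor like "signal" stays below the 0.99 cutoff, while the copied column is caught.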

Method chaining

Every mutating method returns self, so you can chain calls:

(dsp
  .profile_missing()
  .drop_duplicates(subset=["user_id"], sort_by="updated_at")
  .impute_numeric(strategy="knn")
  .impute_categorical(strategy="mode")
  .handle_outliers(method="iqr", action="clip", threshold=1.5)
  .transform_shape(scale_method="robust")
  .encode(nominal_cols=["color"], ordinal_maps={"size": ["S", "M", "L"]})
  .select_features(multi_corr_threshold=0.85)
  .vif_optimize(threshold=5.0)
)
X_train, X_test, y_train, y_test, preprocessor = dsp.split(stratify=True)

State inspection

dsp.summary()           # prints shape history for every step
dsp.results.keys()      # all stored reports and artefacts
dsp.history             # list of {method, shape_before, shape_after}
df_snapshot = dsp.snapshot()   # copy of current working DataFrame
dsp.reset()             # restore to original raw DataFrame
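The return-self chaining and shape history above can be illustrated with a toy class — again a sketch of the pattern, not dspipeline's actual internals:

```python
import pandas as pd

class MiniPipeline:
    """Toy illustration of the return-self / shape-history pattern."""

    def __init__(self, df):
        self._raw = df.copy()   # kept so reset() can restore the original
        self.df = df.copy()
        self.history = []

    def _record(self, method, before):
        self.history.append(
            {"method": method, "shape_before": before, "shape_after": self.df.shape}
        )

    def drop_duplicates(self):
        before = self.df.shape
        self.df = self.df.drop_duplicates()
        self._record("drop_duplicates", before)
        return self  # returning self is what makes chaining work

    def reset(self):
        self.df = self._raw.copy()
        self.history.clear()
        return self

df = pd.DataFrame({"user_id": [1, 1, 2]})
dsp_toy = MiniPipeline(df).drop_duplicates()
```

Each mutating method records a shape transition and hands the instance back, which is all the chaining syntax requires.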

Individual functions

Every function is also importable directly:

from dspipeline import (
    advanced_missing_profiler,
    detect_anomalies,
    handle_numerical_missing,
    advanced_knn_impute,
    transform_data_shape,
    encode_categorical_data,
    optimize_vif,
    setup_leakproof_environment,
    analyze_distribution,
    test_hypothesis,
    enforce_stationarity,
    analyze_autocorrelation,
)

Time-series example

dsp = DataSciencePipeline(df, target_col="demand")

# Check and fix stationarity
series, d, report = dsp.enforce_stationarity("revenue", seasonal_period=12)
print(f"Applied d={d} differencing steps")

# ACF / PACF + ARIMA order hints
acf_report = dsp.analyze_autocorrelation("revenue", lags=40)
print(f"Suggested ARIMA({acf_report['arima_hint_p']}, {d}, {acf_report['arima_hint_q']})")

Requirements

  • Python ≥ 3.9
  • pandas, numpy, scikit-learn, scipy, statsmodels, matplotlib, seaborn

License

MIT
