dspipeline
A stateful, chainable data-science pipeline for the full tabular ML workflow — from raw data to model-ready arrays — in a single class.
Installation
pip install dspipeline
Quick start
import pandas as pd
from dspipeline import DataSciencePipeline
df = pd.read_csv("your_dataset.csv")
dsp = DataSciencePipeline(df, target_col="Churn", task_type="classification")
# One-liner: diagnostics → cleaning → preprocessing
dsp.run_diagnostics().run_cleaning().run_preprocessing()
# Leakproof split
X_train, X_test, y_train, y_test, preprocessor = dsp.split(test_size=0.2)
# Fit your model on processed arrays
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier()
model.fit(dsp.results["split"]["X_train_processed"], y_train)
What it covers
| Phase | Methods |
|---|---|
| Diagnostics | profile_missing, detect_structural, detect_dimensional, detect_categorical, detect_predictive, detect_anomaly_scan, detect_leakage |
| Cleaning | format_structure, drop_duplicates, standardize_text, impute_numeric, impute_categorical |
| Anomaly handling | handle_outliers |
| Transformation | transform_shape |
| Encoding | encode |
| Feature selection | select_features, vif_optimize |
| Split | split |
| EDA | analyze_distribution, evaluate_distribution, analyze_relationship, test_hypothesis |
| Time-series | enforce_stationarity, analyze_autocorrelation |
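As a rough illustration of what a leakage check like detect_leakage might look for (the heuristic below is an assumption for illustration, not the library's actual implementation): a feature that is an almost deterministic function of the target is a classic leakage red flag.

```python
import numpy as np
import pandas as pd

def flag_leaky_features(df: pd.DataFrame, target_col: str, threshold: float = 0.95) -> list:
    """Flag numeric columns whose absolute correlation with the target
    exceeds `threshold` -- a common symptom of target leakage."""
    target = df[target_col]
    leaky = []
    for col in df.select_dtypes(include=np.number).columns:
        if col == target_col:
            continue
        corr = df[col].corr(target)
        if pd.notna(corr) and abs(corr) >= threshold:
            leaky.append(col)
    return leaky

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
frame = pd.DataFrame({
    "age": rng.normal(40, 10, size=200),
    "refund_amount": y * 100 + rng.normal(0, 1, size=200),  # leaks the label
    "Churn": y,
})
print(flag_leaky_features(frame, "Churn"))  # → ['refund_amount']
```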
Method chaining
Every mutating method returns self, so you can chain calls:
(dsp
.profile_missing()
.drop_duplicates(subset=["user_id"], sort_by="updated_at")
.impute_numeric(strategy="knn")
.impute_categorical(strategy="mode")
.handle_outliers(method="iqr", action="clip", threshold=1.5)
.transform_shape(scale_method="robust")
.encode(nominal_cols=["color"], ordinal_maps={"size": ["S", "M", "L"]})
.select_features(multi_corr_threshold=0.85)
.vif_optimize(threshold=5.0)
)
X_train, X_test, y_train, y_test, preprocessor = dsp.split(stratify=True)
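For reference, handle_outliers(method="iqr", action="clip", threshold=1.5) presumably clips values to the Tukey fences. A minimal pandas equivalent (an assumption about the semantics, not the package's code):

```python
import pandas as pd

def iqr_clip(s: pd.Series, threshold: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - t*IQR, Q3 + t*IQR] (Tukey fences)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - threshold * iqr, upper=q3 + threshold * iqr)

s = pd.Series([1, 2, 3, 4, 100])
print(iqr_clip(s).tolist())  # the 100 is pulled down to the upper fence (7)
```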
State inspection
dsp.summary() # prints shape history for every step
dsp.results.keys() # all stored reports and artefacts
dsp.history # list of {method, shape_before, shape_after}
df_snapshot = dsp.snapshot() # copy of current working DataFrame
dsp.reset() # restore to original raw DataFrame
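The chaining-plus-history pattern above is easy to replicate in your own classes; here is a toy sketch of the idea (a hypothetical class, not the library's internals):

```python
import pandas as pd

class MiniPipeline:
    """Toy chainable pipeline: each step returns self and records shapes."""

    def __init__(self, df: pd.DataFrame):
        self._raw = df.copy()
        self.df = df.copy()
        self.history = []

    def _record(self, method, before):
        self.history.append({"method": method,
                             "shape_before": before,
                             "shape_after": self.df.shape})

    def drop_duplicates(self, **kwargs):
        before = self.df.shape
        self.df = self.df.drop_duplicates(**kwargs)
        self._record("drop_duplicates", before)
        return self  # returning self is what enables chaining

    def snapshot(self):
        return self.df.copy()

    def reset(self):
        self.df = self._raw.copy()
        return self

frame = pd.DataFrame({"user_id": [1, 1, 2], "x": [0.1, 0.1, 0.2]})
pipe = MiniPipeline(frame).drop_duplicates(subset=["user_id"])
print(pipe.history)
```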
Individual functions
Every function is also importable directly:
from dspipeline import (
advanced_missing_profiler,
detect_anomalies,
handle_numerical_missing,
advanced_knn_impute,
transform_data_shape,
encode_categorical_data,
optimize_vif,
setup_leakproof_environment,
analyze_distribution,
test_hypothesis,
enforce_stationarity,
analyze_autocorrelation,
)
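advanced_knn_impute presumably builds on a KNN imputer; scikit-learn's KNNImputer (scikit-learn is a listed dependency) does the core job. Shown as an illustration of the technique, not the package's actual code:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

frame = pd.DataFrame({
    "age":    [25.0, 30.0, np.nan, 40.0],
    "income": [50.0, 60.0, 65.0, 80.0],
})

# Each missing value is filled from the 2 nearest rows (nan-aware distance)
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(frame), columns=frame.columns)
print(filled["age"].isna().sum())  # no missing values remain
```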
Time-series example
dsp = DataSciencePipeline(df, target_col="demand")
# Check and fix stationarity
series, d, report = dsp.enforce_stationarity("revenue", seasonal_period=12)
print(f"Applied d={d} differencing steps")
# ACF / PACF + ARIMA order hints
acf_report = dsp.analyze_autocorrelation("revenue", lags=40)
print(f"Suggested ARIMA({acf_report['arima_hint_p']}, {d}, {acf_report['arima_hint_q']})")
Requirements
- Python ≥ 3.9
- pandas, numpy, scikit-learn, scipy, statsmodels, matplotlib, seaborn
License
MIT