A library that provides helper functions for data science preprocessing and exploratory data analysis.
jan883-eda
A collection of utility functions for data analysis, preprocessing, model evaluation, and clustering in Python. Designed to streamline the workflow of data scientists and machine learning practitioners.
Installation
Install the package via pip:
pip install jan883-eda
For local development from this repository:
uv sync
uv run python -c "import jan883_eda; print(jan883_eda.__all__)"
Usage
Below are examples demonstrating how to use some of the key functions in the package. These examples assume you have a DataFrame (your_dataframe) or feature matrix (X) and target vector (y) ready.
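As a minimal sketch of how the placeholder objects used below might be prepared, the following builds a small synthetic DataFrame and derives X, y, and a train/test split from it. The column names (age, income, churned) and the 80/20 split are illustrative assumptions, not part of the package.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the column names are illustrative only.
rng = np.random.default_rng(0)
your_dataframe = pd.DataFrame(
    {
        "age": rng.integers(18, 70, size=100),
        "income": rng.normal(50_000, 12_000, size=100),
        "churned": rng.integers(0, 2, size=100),
    }
)

# Feature matrix and target vector referred to throughout the examples.
X = your_dataframe.drop(columns="churned")
y = your_dataframe["churned"]

# Hold out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
```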
Exploratory Data Analysis (EDA)
- Inspect DataFrame:
from jan883_eda import inspect_df
inspect_df(your_dataframe)
This displays the head, shape, description, NaN values, and duplicates of the DataFrame.
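For orientation, the checks listed above can be approximated with plain pandas calls. This is only a sketch of the same kind of inspection on made-up data, not the implementation of inspect_df.

```python
import numpy as np
import pandas as pd

# Small frame with one missing value and one duplicated row (illustrative data).
df = pd.DataFrame({"a": [1.0, 2.0, 2.0, np.nan], "b": ["x", "y", "y", "z"]})

print(df.head())      # first rows
print(df.shape)       # (rows, columns)
print(df.describe())  # summary statistics for numeric columns

nan_counts = df.isna().sum()                 # missing values per column
duplicate_rows = int(df.duplicated().sum())  # rows that repeat an earlier row
print(nan_counts)
print(duplicate_rows)
```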
- Column Summary:
from jan883_eda import column_summary
summary = column_summary(your_dataframe)
print(summary)
- Data Quality Report:
from jan883_eda import data_quality_report
quality = data_quality_report(your_dataframe)
print(quality)
Data Preprocessing
- Update Column Names:
from jan883_eda import update_column_names
updated_df = update_column_names(your_dataframe)
- Label Encoding:
from jan883_eda import label_encode_column
encoded_df = label_encode_column(your_dataframe, 'column_name')
- Train-Test Safe Preprocessor:
from jan883_eda import fit_transform_preprocessor
preprocessor, X_train_ready, X_test_ready = fit_transform_preprocessor(X_train, X_test)
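The sketch below illustrates one common reading of "train-test safe": transformation statistics are learned from the training split only and then reused on the test split, so no test information leaks into preprocessing. It uses scikit-learn's StandardScaler as a stand-in; which transformers fit_transform_preprocessor actually applies is not stated here.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()
X_train_ready = scaler.fit_transform(X_train)  # learn mean/std from train only
X_test_ready = scaler.transform(X_test)        # reuse the train statistics
```

Calling fit_transform on the test split instead would recompute the statistics from test data, which is the leakage this pattern avoids.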
Model Evaluation
- Evaluate Classification Model:
from jan883_eda import evaluate_classification_model
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
evaluate_classification_model(model, X, y)
- Test Multiple Regression Models:
from jan883_eda import best_regression_models
results = best_regression_models(X, y)
print(results)
- Cross-Validated Model Comparison:
from jan883_eda import compare_classifiers_cv, compare_regressors_cv
classification_results = compare_classifiers_cv(X, y, scoring="f1_weighted")
regression_results = compare_regressors_cv(X, y, scoring="r2")
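To show what a cross-validated comparison involves, here is a minimal sketch using scikit-learn's cross_val_score directly with the same scoring string. The two candidate models and the synthetic dataset are assumptions chosen for illustration; the set of models compare_classifiers_cv evaluates may differ.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical candidate set, for illustration only.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

# Mean 5-fold score per model, using the weighted F1 scorer.
cv_results = {
    name: cross_val_score(model, X, y, cv=5, scoring="f1_weighted").mean()
    for name, model in candidates.items()
}

for name, score in sorted(cv_results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```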
Diagnostics
- Classification and Regression Diagnostics:
from jan883_eda import (
class_balance_report,
classification_metrics_table,
plot_confusion_matrix,
regression_metrics,
plot_regression_diagnostics,
)
balance = class_balance_report(y)
metrics = classification_metrics_table(y_test, y_pred)
plot_confusion_matrix(y_test, y_pred)
regression_summary = regression_metrics(y_test, y_pred)
residuals = plot_regression_diagnostics(y_test, y_pred)
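As a reference for reading regression diagnostics, the sketch below computes three widely used metrics (MAE, RMSE, R²) by hand with NumPy. Which metrics regression_metrics reports is not specified in this README, so treat the selection as an assumption.

```python
import numpy as np

y_test = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.5])

residuals = y_test - y_pred

mae = np.mean(np.abs(residuals))         # mean absolute error
rmse = np.sqrt(np.mean(residuals ** 2))  # root mean squared error
# R^2: one minus residual sum of squares over total sum of squares.
r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y_test - y_test.mean()) ** 2)

print(f"MAE={mae:.3f} RMSE={rmse:.3f} R2={r2:.3f}")
```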
Feature Selection
- Feature Ranking and Pruning:
from jan883_eda import (
low_variance_features,
correlation_prune,
mutual_information_ranking,
permutation_importance_table,
)
low_variance = low_variance_features(X)
correlated = correlation_prune(X, threshold=0.9)
mi_scores = mutual_information_ranking(X, y, problem_type="classification")
importance = permutation_importance_table(fitted_model, X_test, y_test)
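The idea behind correlation-based pruning can be sketched with pandas: compute absolute pairwise correlations, look at each pair once, and drop one feature from every pair above the threshold. This is an illustrative approach on synthetic data; the rule correlation_prune uses to choose which feature of a pair to drop may differ.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = pd.DataFrame(
    {
        "a": a,
        # Nearly collinear with "a" by construction.
        "a_copy": a * 2.0 + rng.normal(scale=0.01, size=200),
        "b": rng.normal(size=200),
    }
)

threshold = 0.9
corr = X.corr().abs()
# Keep only the upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# Drop the later column of any pair whose correlation exceeds the threshold.
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
X_pruned = X.drop(columns=to_drop)
```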
Clustering
- Evaluate and Profile Clusters:
from jan883_eda import evaluate_kmeans_clusters, cluster_profile, pca_cluster_projection
k_scores = evaluate_kmeans_clusters(X_scaled, k_range=range(2, 10))
profiles = cluster_profile(your_dataframe, labels)
projection = pca_cluster_projection(X_scaled, labels)
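One way to score a range of k values is the silhouette coefficient. The sketch below does this with scikit-learn on synthetic blobs with three well-separated centers; the choice of metric is an assumption, since the scores returned by evaluate_kmeans_clusters are not described here.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Three clearly separated synthetic clusters (illustrative data).
X_raw, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.6,
    random_state=0,
)
X_scaled = StandardScaler().fit_transform(X_raw)

k_scores = {}
for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    k_scores[k] = silhouette_score(X_scaled, labels)  # higher is better

best_k = max(k_scores, key=k_scores.get)
print(best_k)
```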
Time Series
- Analyze Stationarity:
from jan883_eda import analyze_stationarity
stationary_series = your_time_series.diff().dropna()
analyze_stationarity(stationary_series, alpha=0.05, lags=15)
This runs an Augmented Dickey-Fuller test, prints a plain-English stationarity interpretation, and plots ACF/PACF charts to help inspect autoregressive and moving-average structure.
- Forecasting Helpers:
from jan883_eda import (
stationarity_report,
plot_rolling_statistics,
seasonal_decomposition_plot,
make_lag_features,
time_series_train_test_split,
forecast_metrics,
)
report = stationarity_report(your_time_series)
rolling = plot_rolling_statistics(your_time_series, window=12)
decomposition = seasonal_decomposition_plot(your_time_series, period=12)
lagged = make_lag_features(your_time_series, lags=(1, 2, 3), rolling_windows=(7, 14))
train_ts, test_ts = time_series_train_test_split(lagged, test_size=0.2)
scores = forecast_metrics(y_true, y_pred)
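The lag-feature and time-ordered split steps can be sketched with pandas alone. The column names (lag_1, rolling_mean_7) are hypothetical and the output layout of make_lag_features may differ; the points illustrated are that rolling features should only see past values and that the split preserves chronological order.

```python
import pandas as pd

series = pd.Series(
    range(1, 31),
    index=pd.date_range("2024-01-01", periods=30, freq="D"),
    name="y",
)

lagged = series.to_frame()
for lag in (1, 2, 3):
    lagged[f"lag_{lag}"] = series.shift(lag)
# Shift before rolling so each window only contains past observations.
lagged["rolling_mean_7"] = series.shift(1).rolling(7).mean()
lagged = lagged.dropna()

# Chronological split: the last 20% of rows form the test set, no shuffling.
split_at = int(len(lagged) * (1 - 0.2))
train_ts, test_ts = lagged.iloc[:split_at], lagged.iloc[split_at:]
```

A shuffled split would let the model train on observations that come after the ones it is tested on, which inflates forecast scores.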
Drift and Pipelines
- Train-Test Drift and Production Pipelines:
from jan883_eda import (
compare_train_test_distributions,
build_model_pipeline,
validate_prediction_columns,
save_pipeline,
load_pipeline,
)
drift = compare_train_test_distributions(X_train, X_test)
pipeline = build_model_pipeline(X_train, estimator)
pipeline.fit(X_train, y_train)
validated = validate_prediction_columns(new_data, X_train.columns)
save_pipeline(pipeline, "model.joblib")
loaded_pipeline = load_pipeline("model.joblib")
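The save/load round trip can be sketched with a scikit-learn Pipeline and joblib. This assumes, based only on the .joblib file name above, that persistence works along these lines; the internals of save_pipeline and load_pipeline are not documented here, and the scaler-plus-logistic-regression pipeline is an illustrative choice.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 3))
y_train = (X_train[:, 0] > 0).astype(int)

# Preprocessing and estimator travel together as one object.
pipeline = Pipeline([("scale", StandardScaler()), ("model", LogisticRegression())])
pipeline.fit(X_train, y_train)

# Round-trip through disk; a temporary directory keeps the sketch side-effect free.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(pipeline, path)
    loaded_pipeline = joblib.load(path)

same_predictions = np.array_equal(
    pipeline.predict(X_train), loaded_pipeline.predict(X_train)
)
print(same_predictions)
```

Only load joblib or pickle files from sources you trust, since loading can execute arbitrary code.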
Functions Overview
The package provides a variety of functions grouped by their purpose:
- EDA Functions: inspect_df, column_summary, univariate_analysis, and more.
- Data Quality: data_quality_report, duplicate_summary.
- Data Preprocessing: update_column_names, label_encode_column, one_hot_encode_column, build_preprocessor, fit_transform_preprocessor, and more.
- Model Evaluation: evaluate_classification_model, evaluate_regression_model, best_classification_models, best_regression_models, compare_classifiers_cv, compare_regressors_cv, and more.
- Diagnostics: class_balance_report, classification_metrics_table, plot_confusion_matrix, regression_metrics, plot_regression_diagnostics, and more.
- Feature Selection: low_variance_features, correlation_prune, mutual_information_ranking, permutation_importance_table.
- Clustering Analysis: plot_elbow_method, plot_intercluster_distance, plot_silhouette_visualizer, evaluate_kmeans_clusters, cluster_profile, and more.
- Time Series: analyze_stationarity, stationarity_report, make_lag_features, forecast_metrics, and more.
- Drift and Pipelines: compare_train_test_distributions, population_stability_index, build_model_pipeline, save_pipeline, load_pipeline.
For a complete list of functions and their detailed documentation, refer to the docstrings within the source code.
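Since population_stability_index is listed above without a usage example, the following sketch shows the standard PSI formula, the sum over bins of (actual − expected) × ln(actual / expected), with bin edges taken from the reference sample's quantiles. The function name psi, the 10-bin default, and the clipping constant are assumptions for illustration and may not match the package's implementation.

```python
import numpy as np


def psi(expected, actual, bins=10, eps=1e-6):
    """Population stability index between two 1-D samples (illustrative sketch)."""
    # Bin edges come from the expected (reference) sample's quantiles.
    edges = np.quantile(expected, np.linspace(0.0, 1.0, bins + 1))
    # Use interior edges only, so out-of-range values fall into the end bins.
    interior = edges[1:-1]
    e_counts = np.bincount(np.searchsorted(interior, expected), minlength=bins)
    a_counts = np.bincount(np.searchsorted(interior, actual), minlength=bins)
    # Clip fractions to avoid log(0) and division by zero for empty bins.
    e_frac = np.clip(e_counts / len(expected), eps, None)
    a_frac = np.clip(a_counts / len(actual), eps, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))


rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, size=5000)
same_dist = rng.normal(0.0, 1.0, size=5000)  # drawn from the same distribution
shifted = rng.normal(1.0, 1.0, size=5000)    # mean shifted by one standard deviation

psi_same = psi(train_feature, same_dist)
psi_shifted = psi(train_feature, shifted)
print(f"same: {psi_same:.4f}, shifted: {psi_shifted:.4f}")
```

An often-quoted rule of thumb treats values below roughly 0.1 as stable and values above roughly 0.25 as a substantial shift, though suitable thresholds depend on the application.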
Requirements
The following dependencies are required to use the package:
- Python >= 3.12
- pandas >= 2.2.3
- numpy >= 2.2.4
- matplotlib >= 3.10.1
- seaborn >= 0.13.2
- scikit-learn >= 1.6.1
- setuptools >= 69
- statsmodels >= 0.14.4
- yellowbrick >= 1.5
- imbalanced-learn >= 0.13.0
- xgboost >= 3.0.0
pip installs the listed libraries automatically when you install the package; Python 3.12 or later must already be available.
License
This package is distributed under the MIT License.
Contact
For questions, bug reports, or contributions, use the project repository where this package is maintained.
File details
Details for the file jan883_eda-0.2.2.tar.gz.
File metadata
- Download URL: jan883_eda-0.2.2.tar.gz
- Upload date:
- Size: 67.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.7 (publish subcommand, macOS)
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 714912edd16970ccd1ac5562578ddbf8c72bde14cb530675a84768b2b7099111 |
| MD5 | 21e080a57ee25a3f8273084432b3f665 |
| BLAKE2b-256 | 53698ee8a2c3b73597ab012baf0d62de587cec89091da0c632591823fc29dbe2 |
File details
Details for the file jan883_eda-0.2.2-py3-none-any.whl.
File metadata
- Download URL: jan883_eda-0.2.2-py3-none-any.whl
- Upload date:
- Size: 38.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.7 (publish subcommand, macOS)
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f893782c5025766d5a7693dd26d132b422d09a7275d41fdcadbf09dd53eacc7a |
| MD5 | 441cfb59db37f71a41a16daac3d9aadb |
| BLAKE2b-256 | 59515aa5e5cd4e507604ed97bf9f70622e3ff748e9266c099362ca806ff65e9f |