Library that provides helper functions for data science preprocessing and exploratory data analysis.

jan883-eda

A collection of utility functions for data analysis, preprocessing, model evaluation, and clustering in Python. Designed to streamline the workflow of data scientists and machine learning practitioners.

Installation

Install the package via pip:

pip install jan883-eda

For local development from this repository:

uv sync
uv run python -c "import jan883_eda; print(jan883_eda.__all__)"

Usage

Below are examples demonstrating how to use some of the key functions in the package. These examples assume you have a DataFrame (your_dataframe) or feature matrix (X) and target vector (y) ready.
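
The placeholder names used in the examples (your_dataframe, X, y, and the train/test splits) are not created by the package. A minimal sketch of one way to prepare them, assuming scikit-learn's synthetic data generator as a stand-in for real data; the column names are hypothetical:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset; swap in your own loading code.
features, target = make_classification(
    n_samples=200, n_features=5, n_informative=3, random_state=42
)

# Hypothetical column names, chosen only for illustration.
feature_names = [f"feature_{i}" for i in range(features.shape[1])]
your_dataframe = pd.DataFrame(features, columns=feature_names)
your_dataframe["target"] = target

# Feature matrix and target vector used by the modelling examples.
X = your_dataframe.drop(columns="target")
y = your_dataframe["target"]

# Hold out 20% of the rows, preserving the class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```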

Exploratory Data Analysis (EDA)

  • Inspect DataFrame:
from jan883_eda import inspect_df

inspect_df(your_dataframe)

This displays the head, shape, description, NaN values, and duplicates of the DataFrame.
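
For orientation, the same pieces of information can be obtained with plain pandas calls. This is a sketch of roughly equivalent checks on a tiny made-up frame, not the implementation of inspect_df, whose exact output format may differ:

```python
import numpy as np
import pandas as pd

# Small illustrative frame with one missing value and one duplicated row.
df = pd.DataFrame(
    {
        "age": [25, 32, np.nan, 32],
        "city": ["Berlin", "Paris", "Rome", "Paris"],
    }
)

print(df.head())       # first rows
print(df.shape)        # (rows, columns)
print(df.describe())   # summary statistics for numeric columns

nan_counts = df.isna().sum()                 # missing values per column
duplicate_rows = int(df.duplicated().sum())  # rows that repeat an earlier row
print(nan_counts)
print(duplicate_rows)
```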

  • Column Summary:
from jan883_eda import column_summary

summary = column_summary(your_dataframe)
print(summary)

  • Data Quality Report:
from jan883_eda import data_quality_report

quality = data_quality_report(your_dataframe)
print(quality)

Data Preprocessing

  • Update Column Names:
from jan883_eda import update_column_names

updated_df = update_column_names(your_dataframe)

  • Label Encoding:
from jan883_eda import label_encode_column

encoded_df = label_encode_column(your_dataframe, 'column_name')

  • Train-Test Safe Preprocessor:
from jan883_eda import fit_transform_preprocessor

preprocessor, X_train_ready, X_test_ready = fit_transform_preprocessor(X_train, X_test)
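
"Train-test safe" is usually taken to mean that the transformer learns its statistics from the training split only and is then applied unchanged to the test split, so no test-set information leaks into preprocessing. The package's internals are not shown in this document; the following is a minimal sketch of that general pattern using scikit-learn directly, with hypothetical column names and data:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical splits: the test set contains a category unseen in training.
X_train = pd.DataFrame(
    {"income": [30.0, 50.0, 70.0], "region": ["north", "south", "north"]}
)
X_test = pd.DataFrame({"income": [50.0], "region": ["west"]})

preprocessor = ColumnTransformer(
    [
        ("numeric", StandardScaler(), ["income"]),
        ("categorical", OneHotEncoder(handle_unknown="ignore"), ["region"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Fit on the training split only, then reuse the fitted transformer.
X_train_ready = preprocessor.fit_transform(X_train)
X_test_ready = preprocessor.transform(X_test)
print(X_test_ready)
```

The unseen category "west" is encoded as all zeros instead of raising an error, because the encoder was told to ignore unknown values.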

Model Evaluation

  • Evaluate Classification Model:
from jan883_eda import evaluate_classification_model
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
evaluate_classification_model(model, X, y)

  • Test Multiple Regression Models:
from jan883_eda import best_regression_models

results = best_regression_models(X, y)
print(results)

  • Cross-Validated Model Comparison:
from jan883_eda import compare_classifiers_cv, compare_regressors_cv

classification_results = compare_classifiers_cv(X, y, scoring="f1_weighted")
regression_results = compare_regressors_cv(X, y, scoring="r2")
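
To make the idea of a cross-validated comparison concrete, here is a sketch that scores two candidate classifiers with scikit-learn's cross_val_score and collects the results in a table. The candidate models, the synthetic data, and the result column names are assumptions for illustration; the models and output layout used by compare_classifiers_cv may differ:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Hypothetical candidate set; any scikit-learn estimator can be added.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

rows = []
for name, model in candidates.items():
    # Five-fold cross-validation with the same metric as the example above.
    scores = cross_val_score(model, X, y, cv=5, scoring="f1_weighted")
    rows.append(
        {"model": name, "mean_score": scores.mean(), "std_score": scores.std()}
    )

results = pd.DataFrame(rows).sort_values("mean_score", ascending=False)
print(results)
```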

Diagnostics

  • Classification and Regression Diagnostics:
from jan883_eda import (
    class_balance_report,
    classification_metrics_table,
    plot_confusion_matrix,
    regression_metrics,
    plot_regression_diagnostics,
)

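# Classification diagnostics: y, y_test, and y_pred hold class labels.
balance = class_balance_report(y)
metrics = classification_metrics_table(y_test, y_pred)
plot_confusion_matrix(y_test, y_pred)

# Regression diagnostics: y_test and y_pred hold continuous values from a regression task.
regression_summary = regression_metrics(y_test, y_pred)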
residuals = plot_regression_diagnostics(y_test, y_pred)
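
As a reference point for the regression side, the standard error metrics can be computed by hand from y_test and y_pred. This sketch uses the textbook definitions of MAE, RMSE, and R² on made-up values; which metrics regression_metrics actually reports is not stated here:

```python
import numpy as np

# Made-up true values and predictions for a regression task.
y_test = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

residuals = y_test - y_pred

mae = np.mean(np.abs(residuals))        # mean absolute error
rmse = np.sqrt(np.mean(residuals**2))   # root mean squared error
# R^2: one minus residual sum of squares over total sum of squares.
r2 = 1.0 - np.sum(residuals**2) / np.sum((y_test - y_test.mean()) ** 2)

print(f"MAE={mae:.3f} RMSE={rmse:.3f} R2={r2:.3f}")
```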

Feature Selection

  • Feature Ranking and Pruning:
from jan883_eda import (
    low_variance_features,
    correlation_prune,
    mutual_information_ranking,
    permutation_importance_table,
)

low_variance = low_variance_features(X)
correlated = correlation_prune(X, threshold=0.9)
mi_scores = mutual_information_ranking(X, y, problem_type="classification")
importance = permutation_importance_table(fitted_model, X_test, y_test)
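
Correlation-based pruning generally means dropping one feature from every pair whose absolute correlation exceeds the threshold. The sketch below shows one common way to do that with pandas on synthetic data; it is an illustration of the technique, and the return value of correlation_prune (a column list, a pruned frame, or something else) is not assumed:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=100)
X = pd.DataFrame(
    {
        "a": base,
        "b": base * 2.0 + rng.normal(scale=0.01, size=100),  # near-copy of "a"
        "c": rng.normal(size=100),                           # independent
    }
)

threshold = 0.9
corr = X.corr().abs()

# Keep only the upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later column of every pair above the threshold.
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
X_pruned = X.drop(columns=to_drop)
print(to_drop, list(X_pruned.columns))
```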

Clustering

  • Evaluate and Profile Clusters:
from jan883_eda import evaluate_kmeans_clusters, cluster_profile, pca_cluster_projection

k_scores = evaluate_kmeans_clusters(X_scaled, k_range=range(2, 10))
profiles = cluster_profile(your_dataframe, labels)
projection = pca_cluster_projection(X_scaled, labels)
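
The inputs X_scaled and labels have to come from somewhere. A minimal sketch, assuming standard scaling, k-means, and the silhouette score as the selection criterion (the criteria used by evaluate_kmeans_clusters are not documented here), on synthetic blobs with three well-separated centers:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data with three clearly separated groups.
raw, _ = make_blobs(
    n_samples=300,
    centers=[(-5, -5), (0, 5), (5, -5)],
    cluster_std=0.5,
    random_state=7,
)
X_scaled = StandardScaler().fit_transform(raw)

# Score each candidate k by its silhouette (higher is better).
k_scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=7)
    k_scores[k] = silhouette_score(X_scaled, model.fit_predict(X_scaled))

best_k = max(k_scores, key=k_scores.get)
labels = KMeans(n_clusters=best_k, n_init=10, random_state=7).fit_predict(X_scaled)
print(best_k, k_scores)
```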

Time Series

  • Analyze Stationarity:
from jan883_eda import analyze_stationarity

stationary_series = your_time_series.diff().dropna()
analyze_stationarity(stationary_series, alpha=0.05, lags=15)

This runs an Augmented Dickey-Fuller test, prints a plain-English stationarity interpretation, and plots ACF/PACF charts to help inspect autoregressive and moving-average structure.

  • Forecasting Helpers:
from jan883_eda import (
    stationarity_report,
    plot_rolling_statistics,
    seasonal_decomposition_plot,
    make_lag_features,
    time_series_train_test_split,
    forecast_metrics,
)

report = stationarity_report(your_time_series)
rolling = plot_rolling_statistics(your_time_series, window=12)
decomposition = seasonal_decomposition_plot(your_time_series, period=12)
lagged = make_lag_features(your_time_series, lags=(1, 2, 3), rolling_windows=(7, 14))
train_ts, test_ts = time_series_train_test_split(lagged, test_size=0.2)
scores = forecast_metrics(y_true, y_pred)
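
Lag features and a chronological split can be sketched with pandas alone. This is an illustration of the general technique on a toy series, with hypothetical column names; the columns produced by make_lag_features and the splitting rule in time_series_train_test_split may differ:

```python
import pandas as pd

# Toy daily series with values 0.0 to 9.0.
your_time_series = pd.Series(
    range(10),
    index=pd.date_range("2024-01-01", periods=10, freq="D"),
    name="value",
    dtype="float64",
)

lagged = pd.DataFrame({"value": your_time_series})
for lag in (1, 2):
    lagged[f"lag_{lag}"] = your_time_series.shift(lag)

# Shift before rolling so each window only sees past observations.
lagged["rolling_mean_3"] = your_time_series.shift(1).rolling(window=3).mean()

# Rows without a full history contain NaN and are dropped.
lagged = lagged.dropna()

# Chronological split: the last 20% of rows form the test set, no shuffling.
split_at = int(len(lagged) * (1 - 0.2))
train_ts, test_ts = lagged.iloc[:split_at], lagged.iloc[split_at:]
print(len(train_ts), len(test_ts))
```

Keeping the split in time order matters because shuffling would let the model train on observations that come after the ones it is tested on.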

Drift and Pipelines

  • Train-Test Drift and Production Pipelines:
from jan883_eda import (
    compare_train_test_distributions,
    build_model_pipeline,
    validate_prediction_columns,
    save_pipeline,
    load_pipeline,
)

drift = compare_train_test_distributions(X_train, X_test)
pipeline = build_model_pipeline(X_train, estimator)
pipeline.fit(X_train, y_train)
validated = validate_prediction_columns(new_data, X_train.columns)
save_pipeline(pipeline, "model.joblib")
loaded_pipeline = load_pipeline("model.joblib")
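
One widely used drift measure is the Population Stability Index (PSI), which compares how a feature's values fall into bins in two samples. The function below, psi_sketch, is a hypothetical helper written for this illustration using the textbook formula with quantile bins from the reference sample; it is not the package's population_stability_index, whose binning and signature may differ:

```python
import numpy as np


def psi_sketch(expected, actual, bins=10, eps=1e-6):
    """Textbook PSI between a reference sample and a new sample."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)

    # Interior bin edges come from quantiles of the reference sample.
    inner_edges = np.quantile(expected, np.linspace(0.0, 1.0, bins + 1))[1:-1]

    # Assign every value to a bin index in [0, bins - 1] and count.
    expected_counts = np.bincount(
        np.searchsorted(inner_edges, expected, side="right"), minlength=bins
    )
    actual_counts = np.bincount(
        np.searchsorted(inner_edges, actual, side="right"), minlength=bins
    )

    # Clip shares away from zero to avoid log(0) for empty bins.
    expected_share = np.clip(expected_counts / expected.size, eps, None)
    actual_share = np.clip(actual_counts / actual.size, eps, None)

    return float(
        np.sum((actual_share - expected_share) * np.log(actual_share / expected_share))
    )


rng = np.random.default_rng(0)
train_feature = rng.normal(size=1000)
shifted_feature = rng.normal(loc=2.0, size=1000)  # simulated drift

psi_same = psi_sketch(train_feature, train_feature)
psi_shifted = psi_sketch(train_feature, shifted_feature)
print(psi_same, psi_shifted)
```

A commonly cited rule of thumb treats values below 0.1 as little change and values above 0.25 as a significant shift, though the thresholds are conventions and not guarantees.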

Functions Overview

The package provides a variety of functions grouped by their purpose:

  • EDA Functions: inspect_df, column_summary, univariate_analysis, and more.
  • Data Quality: data_quality_report, duplicate_summary.
  • Data Preprocessing: update_column_names, label_encode_column, one_hot_encode_column, build_preprocessor, fit_transform_preprocessor, and more.
  • Model Evaluation: evaluate_classification_model, evaluate_regression_model, best_classification_models, best_regression_models, compare_classifiers_cv, compare_regressors_cv, and more.
  • Diagnostics: class_balance_report, classification_metrics_table, plot_confusion_matrix, regression_metrics, plot_regression_diagnostics, and more.
  • Feature Selection: low_variance_features, correlation_prune, mutual_information_ranking, permutation_importance_table.
  • Clustering Analysis: plot_elbow_method, plot_intercluster_distance, plot_silhouette_visualizer, evaluate_kmeans_clusters, cluster_profile, and more.
  • Time Series: analyze_stationarity, stationarity_report, make_lag_features, forecast_metrics, and more.
  • Drift and Pipelines: compare_train_test_distributions, population_stability_index, build_model_pipeline, save_pipeline, load_pipeline.

For a complete list of functions and their detailed documentation, refer to the docstrings within the source code.

Requirements

The following dependencies are required to use the package:

  • Python >= 3.12
  • pandas >= 2.2.3
  • numpy >= 2.2.4
  • matplotlib >= 3.10.1
  • seaborn >= 0.13.2
  • scikit-learn >= 1.6.1
  • setuptools >= 69
  • statsmodels >= 0.14.4
  • yellowbrick >= 1.5
  • imbalanced-learn >= 0.13.0
  • xgboost >= 3.0.0

pip installs the listed libraries automatically when you install the package; Python 3.12 or later must already be available.
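
To see which versions are present in the current environment, the standard library's importlib.metadata can be queried. This is a small sketch using the distribution names from the list above; installed_versions is a hypothetical helper, not part of the package:

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as listed in the requirements above.
REQUIRED = [
    "pandas",
    "numpy",
    "matplotlib",
    "seaborn",
    "scikit-learn",
    "setuptools",
    "statsmodels",
    "yellowbrick",
    "imbalanced-learn",
    "xgboost",
]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, installed in installed_versions(REQUIRED).items():
    print(f"{name}: {installed or 'not installed'}")
```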

License

This package is distributed under the MIT License.

Contact

For questions, bug reports, or contributions, use the project repository where this package is maintained.

