ypds_helpers

A Python toolkit for data analysis, visualization, and machine learning preprocessing

ypds_helpers is a collection of utility functions and tools designed to streamline common data science workflows. It provides convenient helpers for data exploration, correlation analysis, visualization, and machine learning model preprocessing pipelines.

Features | Installation | Usage | API Reference

Features

  • Data Handling: Quick data exploration with comprehensive statistics and type-based column selection
  • Correlation Analysis: Advanced correlation detection using Phik (φk) correlation coefficient for mixed data types
  • Visualization: Ready-to-use plotting functions for distributions, categorical data, and model residuals
  • ML Preprocessing: Pre-built pipelines for numerical and categorical data preprocessing
  • Model Evaluation: Grid search utilities with automatic result tracking and comparison
  • Jupyter Integration: Automatic detection and proper display in both Jupyter notebooks and regular Python environments
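Environment detection of this kind is commonly done by asking IPython which shell is running. The sketch below shows one way to do it; it is not taken from the package, and `in_jupyter` and `show` are hypothetical names used only for illustration.

```python
def in_jupyter() -> bool:
    """Best-effort check for a Jupyter kernel; False in a plain interpreter."""
    try:
        from IPython import get_ipython
    except ImportError:
        return False  # IPython is not installed, so this cannot be a notebook
    shell = get_ipython()
    # ZMQInteractiveShell is the shell class used by Jupyter kernels.
    return shell is not None and shell.__class__.__name__ == "ZMQInteractiveShell"


def show(obj) -> None:
    """Render richly in a notebook, fall back to print() elsewhere."""
    if in_jupyter():
        from IPython.display import display
        display(obj)
    else:
        print(obj)


show("hello from ypds_helpers users")
```

Checking the shell class name rather than only the presence of IPython keeps the terminal IPython REPL on the plain `print()` path.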

Installation

pip install ypds-helpers

The package requires Python 3.13 or earlier. Installing it with pip also installs the following dependencies:

  • pandas
  • numpy
  • scikit-learn
  • seaborn
  • matplotlib
  • phik

Usage

Quick Data Exploration

import pandas as pd
from ypds_helpers.data_handling import show_df, get_num_cols, get_cat_cols

# Load your data
df = pd.read_csv('data.csv')

# Get comprehensive overview
show_df(df, n=10)  # Shows first 10 rows, statistics for numerical and categorical columns

# Get column lists by type
numerical_cols = get_num_cols(df, exclude_cols=['id'])
categorical_cols = get_cat_cols(df, exclude_cols=['target'])

Correlation Analysis

from ypds_helpers.data_handling import highest_corrs

# Find strongest correlations using Phik (works with mixed data types)
top_correlations = highest_corrs(
    df, 
    cols=['age', 'income', 'category', 'score'],
    interval_cols=['age', 'income', 'score'],
    num=15
)

Data Visualization

from ypds_helpers.plotting import plot_numeric, plot_cats, show_residues

# Visualize numerical features with histograms and boxplots
plot_numeric(
    df, 
    num_cols=['age', 'income', 'score'],
    hue='category',  # Split by category
    normalize=True,
    kde=True
)

# Visualize categorical distributions
plot_cats(
    df,
    cat_cols=['region', 'product_type'],
    hue='customer_segment',
    max_cats=10  # Group smaller categories
)

# Analyze model residuals
# (y_true and y_pred are assumed to be the true targets and the predictions of a fitted model)
show_residues(y_true, y_pred, title='Model Performance')

Machine Learning Preprocessing

from ypds_helpers.models import (
    make_num_processor,
    make_ord_processor,
    make_typo_corrector,
    grid_search
)
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor

# Create preprocessing pipelines
num_pipeline = make_num_processor(min_val=0, max_val=100)
cat_pipeline = make_ord_processor(categories=['low', 'medium', 'high'])

# Combine transformers
preprocessor = ColumnTransformer([
    ('num', num_pipeline, numerical_cols),
    ('cat', cat_pipeline, categorical_cols)
])

# Create full pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor())
])

# Grid search with automatic result tracking
# (X_train and y_train are assumed to be your training split)
param_grid = {
    'model__n_estimators': [100, 200],
    'model__max_depth': [10, 20, None]
}

grid_search(
    pipeline=pipeline,
    grid=param_grid,
    X_train=X_train,
    y_train=y_train,
    model='random_forest',
    scoring='neg_mean_squared_error',
    cv_method=5,
    n_jobs=-1
)

API Reference

Data Handling Module

show_df(df, n=5)

Display comprehensive information about a DataFrame including head, statistics, and info.

Parameters:

  • df: DataFrame to analyze
  • n: Number of rows to display (default: 5)

get_num_cols(df, exclude_cols=None)

Returns list of numerical column names.

get_cat_cols(df, exclude_cols=None)

Returns list of categorical column names.
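For readers who want to see what type-based column selection looks like in plain pandas, here is a rough equivalent. It is a sketch under assumptions, not the package's code: it treats every numeric dtype as numerical and every other non-datetime column as categorical, and `num_cols_sketch` and `cat_cols_sketch` are hypothetical names.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype


def num_cols_sketch(df, exclude_cols=None):
    """List columns with a numeric dtype, skipping any in exclude_cols."""
    exclude = set(exclude_cols or [])
    return [c for c in df.columns if c not in exclude and is_numeric_dtype(df[c])]


def cat_cols_sketch(df, exclude_cols=None):
    """List columns that are neither numeric nor datetime, skipping exclude_cols."""
    exclude = set(exclude_cols or [])
    return [
        c for c in df.columns
        if c not in exclude
        and not is_numeric_dtype(df[c])
        and not is_datetime64_any_dtype(df[c])
    ]


df = pd.DataFrame({
    "id": [1, 2],
    "age": [30, 41],
    "region": ["north", "south"],
    "target": ["yes", "no"],
})
print(num_cols_sketch(df, exclude_cols=["id"]))      # ['age']
print(cat_cols_sketch(df, exclude_cols=["target"]))  # ['region']
```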

print_unique_cat_vals(dfs, exclude=None)

Print unique values for categorical features across one or multiple DataFrames.

highest_corrs(df, cols=None, interval_cols=None, num=10)

Calculate and return the highest correlations using Phik coefficient.

Parameters:

  • df: DataFrame with data
  • cols: Columns to analyze (default: all columns)
  • interval_cols: Numerical columns for interval correlation
  • num: Number of top correlations to return (default: 10)
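The ranking step, picking the strongest pairs out of a correlation matrix, can be sketched with pandas alone. The example below is an illustration, not the package's implementation: it uses Pearson correlation from `DataFrame.corr()` as a stand-in so that it runs without phik, and `top_pairs` is a hypothetical name. A Phik matrix could be passed to the same function.

```python
from itertools import combinations

import pandas as pd


def top_pairs(corr: pd.DataFrame, num: int = 10) -> pd.Series:
    """Return the `num` column pairs with the largest absolute correlation."""
    # combinations() visits each unordered pair once and never pairs a column with itself.
    pairs = {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}
    ranked = pd.Series(pairs).sort_values(key=lambda s: s.abs(), ascending=False)
    return ranked.head(num)


# Small made-up data set, used only to exercise the function.
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "c": [4, 3, 2, 1],
    "d": [1, 0, 1, 0],
})
result = top_pairs(df.corr(), num=3)
print(result)
```

Sorting by absolute value keeps strong negative relationships in the ranking. Phik values are non-negative, so the distinction only matters for signed coefficients such as Pearson's.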

Plotting Module

plot_numeric(df, num_cols=None, title='', hue=None, normalize=False, kde=True, ncols=2, scale=2.5, **kwargs)

Create histograms and boxplots for numerical features.

plot_cats(df, cat_cols=None, hue=None, title='', ncols=2, max_cats=10, max_cats_alias='all_other', **kwargs)

Create bar charts for categorical feature distributions.
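The `max_cats` and `max_cats_alias` parameters suggest that infrequent categories are merged into a single bucket before plotting. A minimal sketch of that idea follows. It assumes the `max_cats` most frequent values are kept and everything else is relabelled; the package may count the bucket differently, and `lump_rare` is a hypothetical name.

```python
import pandas as pd


def lump_rare(values: pd.Series, max_cats: int = 10, alias: str = "all_other") -> pd.Series:
    """Keep the `max_cats` most frequent values and relabel the rest as `alias`."""
    top = values.value_counts().index[:max_cats]  # value_counts() sorts by frequency
    return values.where(values.isin(top), alias)


s = pd.Series(["a", "a", "a", "b", "b", "c", "d"])
print(lump_rare(s, max_cats=2).tolist())
# ['a', 'a', 'a', 'b', 'b', 'all_other', 'all_other']
```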

show_residues(y_true, y_pred, title='', **kwargs)

Plot residual distribution and scatter plot for model evaluation.

Models Module

make_num_processor(min_val, max_val)

Create a preprocessing pipeline for numerical data with sanitization, imputation, and scaling.
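To make the three stages concrete, here is how a comparable pipeline could be assembled from standard scikit-learn parts. This is a sketch under assumptions, not the package's code: "sanitization" is taken to mean replacing values outside `[min_val, max_val]` with NaN, imputation uses the median, scaling uses `StandardScaler`, and `num_processor_sketch` is a hypothetical name.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def _mask_out_of_range(X, min_val, max_val):
    """Replace values outside [min_val, max_val] with NaN."""
    X = np.asarray(X, dtype=float)
    return np.where((X < min_val) | (X > max_val), np.nan, X)


def num_processor_sketch(min_val, max_val):
    """Sanitize, impute, then scale numerical columns."""
    return Pipeline([
        ("sanitize", FunctionTransformer(
            _mask_out_of_range,
            kw_args={"min_val": min_val, "max_val": max_val},
        )),
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])


# -5 and 250 fall outside [0, 100], so they are masked and then imputed.
X = np.array([[10.0], [20.0], [-5.0], [30.0], [250.0]])
Xt = num_processor_sketch(0, 100).fit_transform(X)
print(Xt.ravel())
```

Masking before imputing means implausible values are treated like missing ones instead of distorting the scaler's mean and variance.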

make_ord_processor(categories)

Create a preprocessing pipeline for ordinal categorical data with typo correction, encoding, and imputation.

make_typo_corrector(correct_vals)

Create a transformer that corrects single-character typos using Hamming distance.
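The matching rule behind single-character typo correction can be shown in plain Python. The sketch below is illustrative only and is not the package's transformer: `hamming` and `correct_typo` are hypothetical names, and it assumes a value is corrected only when exactly one known value differs from it by a single substituted character.

```python
from typing import Optional, Sequence


def hamming(a: str, b: str) -> Optional[int]:
    """Return the Hamming distance, or None when the lengths differ."""
    if len(a) != len(b):
        return None  # Hamming distance is defined only for equal-length strings
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


def correct_typo(value: str, correct_vals: Sequence[str]) -> str:
    """Map `value` to a known value if it differs by one substituted character."""
    if value in correct_vals:
        return value
    matches = [c for c in correct_vals if hamming(value, c) == 1]
    # Correct only when the match is unambiguous; otherwise leave the value alone.
    return matches[0] if len(matches) == 1 else value


levels = ["low", "medium", "high"]
print(correct_typo("hifh", levels))     # high
print(correct_typo("unknown", levels))  # unknown
```

Because Hamming distance compares strings position by position, it catches substitutions but not inserted or dropped characters; those would need an edit distance such as Levenshtein.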

grid_search(pipeline, grid, X_train, y_train, model, scoring='roc_auc', cv_method=None, n_jobs=-1)

Perform grid search with automatic result tracking and model comparison.

Parameters:

  • pipeline: Scikit-learn pipeline
  • grid: Parameter grid (list or dict)
  • X_train, y_train: Training data
  • model: Model name for tracking
  • scoring: Scoring metric (default: 'roc_auc')
  • cv_method: Cross-validation method (default: None, which uses 5-fold cross-validation)
  • n_jobs: Number of parallel jobs (default: -1 for all cores)
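These parameters map closely onto scikit-learn's `GridSearchCV`. The sketch below calls `GridSearchCV` directly as a stand-in, on the assumption that `grid_search` wraps something similar; the result tracking and comparison are not reproduced. The synthetic data set and the `LogisticRegression` model are illustrative choices only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data, so that 'roc_auc' is a valid metric.
X_train, y_train = make_classification(n_samples=200, n_features=5, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
# Keys use the '<step name>__<parameter>' convention to reach into the pipeline.
grid = {"model__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, grid, scoring="roc_auc", cv=5, n_jobs=1)
search.fit(X_train, y_train)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Note that the default `scoring='roc_auc'` suits binary classifiers. For a regressor, as in the usage example, a regression metric such as `'neg_mean_squared_error'` has to be passed explicitly.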

show_search_result(search, n_results=10)

Display formatted grid search results.

evaluate_params(grid)

Display maximum metric values for each hyperparameter.
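One way to read "maximum metric value per hyperparameter" is a group-by over a table of search results. The sketch below works on a hand-made table shaped like scikit-learn's `cv_results_`; the scores are invented for illustration, `max_score_per_value` is a hypothetical name, and the package's own output format may differ.

```python
import pandas as pd

# Invented scores laid out like GridSearchCV.cv_results_ ('param_*' columns plus a score).
results = pd.DataFrame({
    "param_model__n_estimators": [100, 100, 200, 200],
    "param_model__max_depth": [10, 20, 10, 20],
    "mean_test_score": [0.81, 0.84, 0.83, 0.86],
})


def max_score_per_value(results, score_col="mean_test_score"):
    """For each hyperparameter, return the best score reached by each of its values."""
    param_cols = [c for c in results.columns if c.startswith("param_")]
    return {col: results.groupby(col)[score_col].max() for col in param_cols}


best = max_score_per_value(results)
for name, scores in best.items():
    print(name)
    print(scores)
```

A view like this shows at a glance which values of a hyperparameter never reach a competitive score and can be dropped from the next grid.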

Examples

Check out the examples directory for complete working examples including:

  • Data exploration workflows
  • Feature engineering pipelines
  • Model training and evaluation
  • Visualization galleries

Development Status

This package is currently in Beta (Development Status: 4 - Beta). The API may change in future releases.

License

ypds_helpers is distributed under the terms of the MIT license.
