ypds_helpers

A Python toolkit for data analysis, visualization, and machine learning preprocessing

ypds_helpers is a collection of utility functions and tools designed to streamline common data science workflows. It provides convenient helpers for data exploration, correlation analysis, visualization, and machine learning model preprocessing pipelines.

Features • Installation • Usage • API Reference

Features

Data Handling: Quick data exploration with comprehensive statistics and type-based column selection
Correlation Analysis: Advanced correlation detection using Phik (φk) correlation coefficient for mixed data types
Visualization: Ready-to-use plotting functions for distributions, categorical data, and model residuals
ML Preprocessing: Pre-built pipelines for numerical and categorical data preprocessing
Model Evaluation: Grid search utilities with automatic result tracking and comparison
Jupyter Integration: Automatic detection and proper display in both Jupyter notebooks and regular Python environments
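The notebook detection mentioned in the last feature can be implemented in several ways, and this README does not show how the package does it. A minimal sketch, assuming the common heuristic of inspecting the active IPython shell class (the function name `in_jupyter` is hypothetical and not part of ypds_helpers):

```python
def in_jupyter() -> bool:
    """Return True when running inside a Jupyter kernel (heuristic)."""
    try:
        # IPython is optional; a plain Python install may not have it.
        from IPython import get_ipython
    except ImportError:
        return False
    shell = get_ipython()  # None when no IPython shell is active
    # Jupyter kernels use ZMQInteractiveShell; terminal IPython uses
    # TerminalInteractiveShell, which is treated as a regular environment here.
    return shell is not None and shell.__class__.__name__ == "ZMQInteractiveShell"


# Choose a rich or plain-text display path depending on the environment.
print("notebook" if in_jupyter() else "plain Python")
```

A helper like this lets a display function call `IPython.display.display` in notebooks and fall back to `print` elsewhere.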

Installation

pip install ypds-helpers

The package requires Python ≤3.13 and automatically installs the following dependencies:

pandas
numpy
scikit-learn
seaborn
matplotlib
phik

Usage

Quick Data Exploration

import pandas as pd
from ypds_helpers.data_handling import show_df, get_num_cols, get_cat_cols

# Load your data
df = pd.read_csv('data.csv')

# Get comprehensive overview
show_df(df, n=10)  # Shows first 10 rows, statistics for numerical and categorical columns

# Get column lists by type
numerical_cols = get_num_cols(df, exclude_cols=['id'])
categorical_cols = get_cat_cols(df, exclude_cols=['target'])

Correlation Analysis

from ypds_helpers.data_handling import highest_corrs

# Find strongest correlations using Phik (works with mixed data types)
top_correlations = highest_corrs(
    df, 
    cols=['age', 'income', 'category', 'score'],
    interval_cols=['age', 'income', 'score'],
    num=15
)

Data Visualization

from ypds_helpers.plotting import plot_numeric, plot_cats, show_residues

# Visualize numerical features with histograms and boxplots
plot_numeric(
    df, 
    num_cols=['age', 'income', 'score'],
    hue='category',  # Split by category
    normalize=True,
    kde=True
)

# Visualize categorical distributions
plot_cats(
    df,
    cat_cols=['region', 'product_type'],
    hue='customer_segment',
    max_cats=10  # Group smaller categories
)

# Analyze model residuals
# y_true and y_pred are the observed and predicted target values (array-like)
show_residues(y_true, y_pred, title='Model Performance')

Machine Learning Preprocessing

from ypds_helpers.models import (
    make_num_processor,
    make_ord_processor,
    make_typo_corrector,
    grid_search
)
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor

# Create preprocessing pipelines
num_pipeline = make_num_processor(min_val=0, max_val=100)
cat_pipeline = make_ord_processor(categories=['low', 'medium', 'high'])

# Combine transformers
# numerical_cols and categorical_cols come from the Quick Data Exploration example
preprocessor = ColumnTransformer([
    ('num', num_pipeline, numerical_cols),
    ('cat', cat_pipeline, categorical_cols)
])

# Create full pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor())
])

# Grid search with automatic result tracking
param_grid = {
    'model__n_estimators': [100, 200],
    'model__max_depth': [10, 20, None]
}

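# X_train and y_train are the training features and target, prepared beforehand
# (for example with sklearn.model_selection.train_test_split)
grid_search(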
    pipeline=pipeline,
    grid=param_grid,
    X_train=X_train,
    y_train=y_train,
    model='random_forest',
    scoring='neg_mean_squared_error',
    cv_method=5,
    n_jobs=-1
)

API Reference

Data Handling Module

`show_df(df, n=5)`

Display comprehensive information about a DataFrame including head, statistics, and info.

Parameters:

df: DataFrame to analyze
n: Number of rows to display (default: 5)

`get_num_cols(df, exclude_cols=None)`

Returns list of numerical column names.

`get_cat_cols(df, exclude_cols=None)`

Returns list of categorical column names.

`print_unique_cat_vals(dfs, exclude=None)`

Print unique values for categorical features across one or multiple DataFrames.

`highest_corrs(df, cols=None, interval_cols=None, num=10)`

Calculate and return the highest correlations using Phik coefficient.

Parameters:

df: DataFrame with data
cols: Columns to analyze (default: all columns)
interval_cols: Columns to treat as interval (continuous numerical) variables; Phik bins these before computing the coefficient
num: Number of top correlations to return (default: 10)
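To illustrate what "highest correlations" means in practice, here is a minimal sketch of the ranking step only, assuming the coefficients are already available as a symmetric matrix. The helper name `top_pairs`, the nested-dict layout, and the coefficient values are all made up for illustration; the package's own return format is not documented here.

```python
from itertools import combinations


def top_pairs(corr, num=10):
    """Return the `num` strongest off-diagonal pairs from a symmetric matrix.

    `corr` is a nested dict: corr[a][b] is the coefficient between a and b.
    """
    pairs = [
        (a, b, corr[a][b])
        for a, b in combinations(sorted(corr), 2)  # each unordered pair once
    ]
    # Phik values lie in [0, 1]; abs() keeps the sketch usable for signed
    # coefficients such as Pearson as well.
    pairs.sort(key=lambda p: abs(p[2]), reverse=True)
    return pairs[:num]


# Made-up coefficients, for illustration only.
corr = {
    "age": {"age": 1.0, "income": 0.62, "score": 0.15},
    "income": {"age": 0.62, "income": 1.0, "score": 0.48},
    "score": {"age": 0.15, "income": 0.48, "score": 1.0},
}
print(top_pairs(corr, num=2))
# [('age', 'income', 0.62), ('income', 'score', 0.48)]
```

Iterating over unordered pairs skips the diagonal and avoids reporting each pair twice.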

Plotting Module

`plot_numeric(df, num_cols=None, title='', hue=None, normalize=False, kde=True, ncols=2, scale=2.5, **kwargs)`

Create histograms and boxplots for numerical features.

`plot_cats(df, cat_cols=None, hue=None, title='', ncols=2, max_cats=10, max_cats_alias='all_other', **kwargs)`

Create bar charts for categorical feature distributions.
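The `max_cats` and `max_cats_alias` parameters suggest that infrequent categories are merged under a single label before plotting. A minimal sketch of that idea, assuming the `max_cats` most frequent categories are kept as-is and everything else is relabeled (the helper `group_rare` is hypothetical, and the package may count the alias bucket differently):

```python
from collections import Counter


def group_rare(values, max_cats=10, alias="all_other"):
    """Keep the `max_cats` most frequent categories and relabel the rest."""
    counts = Counter(values)
    keep = {cat for cat, _ in counts.most_common(max_cats)}
    return [v if v in keep else alias for v in values]


regions = ["north", "north", "south", "south", "south", "east", "west"]
print(group_rare(regions, max_cats=2))
```

Here "east" and "west" each occur once, so both end up under `all_other`, and the chart shows three bars instead of four.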

`show_residues(y_true, y_pred, title='', **kwargs)`

Plot residual distribution and scatter plot for model evaluation.
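For readers unfamiliar with the term, a residual is the gap between an observed and a predicted value. A minimal sketch of the quantity being plotted, assuming the usual observed-minus-predicted sign convention (the helper `residuals` is hypothetical, and the README does not state which convention the package uses):

```python
from statistics import mean


def residuals(y_true, y_pred):
    """Return observed-minus-predicted residuals, one per sample."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return [t - p for t, p in zip(y_true, y_pred)]


y_true = [3.0, 5.0, 7.5, 10.0]
y_pred = [2.5, 5.5, 7.0, 11.0]
res = residuals(y_true, y_pred)
print(res)        # [0.5, -0.5, 0.5, -1.0]
print(mean(res))  # -0.125
```

A residual distribution centered near zero, with no visible pattern in the scatter plot against predictions, is what a well-behaved regression model typically shows.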

Models Module

`make_num_processor(min_val, max_val)`

Create a preprocessing pipeline for numerical data with sanitization, imputation, and scaling.
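The three steps named above can be sketched in plain Python. This is an illustration under stated assumptions, not the package's pipeline: it assumes sanitization means treating values outside `[min_val, max_val]` as missing, imputation uses the median, and scaling is min-max onto `[0, 1]`. The README confirms none of these choices, and `process_numeric` is a hypothetical name.

```python
from statistics import median


def process_numeric(values, min_val, max_val):
    """Sanitize, impute, and scale a list of numbers (None marks missing)."""
    # Sanitization (assumed): out-of-range values become missing.
    clean = [
        v if v is not None and min_val <= v <= max_val else None
        for v in values
    ]
    observed = [v for v in clean if v is not None]
    if not observed:
        raise ValueError("no valid values left to impute from")
    # Imputation (assumed): fill missing entries with the median.
    fill = median(observed)
    imputed = [fill if v is None else v for v in clean]
    # Scaling (assumed): map [min_val, max_val] onto [0, 1].
    span = max_val - min_val
    return [(v - min_val) / span for v in imputed]


print(process_numeric([10, 250, None, 50, -3], min_val=0, max_val=100))
# [0.1, 0.3, 0.3, 0.5, 0.3]
```

In a real scikit-learn pipeline the imputation and scaling statistics are learned on the training split only, which this one-shot sketch does not model.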

`make_ord_processor(categories)`

Create a preprocessing pipeline for ordinal categorical data with typo correction, encoding, and imputation.

`make_typo_corrector(correct_vals)`

Create a transformer that corrects single-character typos using Hamming distance.
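Hamming distance counts the positions at which two equal-length strings differ, so a distance of 1 means exactly one substituted character. A minimal sketch of correcting such typos against a list of valid values follows; the helper names are hypothetical, and the handling of ambiguous matches is an assumption rather than the package's documented behavior.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def correct_typo(value, correct_vals):
    """Map `value` to the unique valid value one substitution away, if any."""
    if value in correct_vals:
        return value
    matches = [
        c for c in correct_vals
        if len(c) == len(value) and hamming(value, c) == 1
    ]
    # Assumption: leave the value untouched when nothing matches or when
    # more than one valid value is equally close.
    return matches[0] if len(matches) == 1 else value


valid = ["low", "medium", "high"]
print([correct_typo(v, valid) for v in ["lov", "medium", "hihg", "med"]])
# ['low', 'medium', 'hihg', 'med']
```

Because Hamming distance is defined only for strings of equal length, this approach catches substitutions but not insertions, deletions, or transpositions: "hihg" differs from "high" in two positions and is left unchanged.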

`grid_search(pipeline, grid, X_train, y_train, model, scoring='roc_auc', cv_method=None, n_jobs=-1)`

Perform grid search with automatic result tracking and model comparison.

Parameters:

pipeline: Scikit-learn pipeline
grid: Parameter grid (list or dict)
X_train, y_train: Training data
model: Model name for tracking
scoring: Scoring metric (default: 'roc_auc')
cv_method: Cross-validation method (default: None, which uses 5-fold cross-validation)
n_jobs: Number of parallel jobs (default: -1 for all cores)
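To get a feel for how much work a grid implies, the sketch below expands a dict-style grid into its individual candidates. It assumes `grid_search` performs an exhaustive search in the manner of scikit-learn's `GridSearchCV`, which the README suggests but does not state; `expand_grid` is a hypothetical helper.

```python
from itertools import product


def expand_grid(grid):
    """Yield one parameter dict per combination in a dict-style grid."""
    keys = sorted(grid)
    for combo in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, combo))


param_grid = {
    "model__n_estimators": [100, 200],
    "model__max_depth": [10, 20, None],
}
candidates = list(expand_grid(param_grid))
n_splits = 5  # 5-fold cross-validation

print(len(candidates))             # 6
print(len(candidates) * n_splits)  # 30
```

With the grid from the usage example and 5 folds, an exhaustive search fits the pipeline 30 times before any final refit, which is worth keeping in mind when adding more values to the grid.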

`show_search_result(search, n_results=10)`

Display formatted grid search results.

`evaluate_params(grid)`

Display maximum metric values for each hyperparameter.
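One way to read "maximum metric values for each hyperparameter" is: for every value a hyperparameter took, report the best score among all candidates that used it. A minimal sketch of that interpretation, which is an assumption about the function's behavior; the helper name, the input layout, and the scores are made up for illustration.

```python
from collections import defaultdict


def best_score_per_value(results):
    """Map each parameter name to {value: best score seen with that value}.

    `results` is a list of (params_dict, score) tuples; higher is better.
    """
    best = defaultdict(dict)
    for params, score in results:
        for name, value in params.items():
            prev = best[name].get(value)
            if prev is None or score > prev:
                best[name][value] = score
    return {name: dict(vals) for name, vals in best.items()}


# Made-up search results, for illustration only.
results = [
    ({"max_depth": 10, "n_estimators": 100}, 0.81),
    ({"max_depth": 10, "n_estimators": 200}, 0.84),
    ({"max_depth": 20, "n_estimators": 100}, 0.79),
    ({"max_depth": 20, "n_estimators": 200}, 0.83),
]
print(best_score_per_value(results))
```

A summary like this shows at a glance which values of each hyperparameter appear in the strongest candidates, here `max_depth=10` and `n_estimators=200`.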

Examples

Check out the examples directory for complete working examples including:

Data exploration workflows
Feature engineering pipelines
Model training and evaluation
Visualization galleries

Development Status

This package is currently in Beta (Development Status: 4 - Beta). The API may change in future releases.

License

ypds_helpers is distributed under the terms of the MIT license.

Release history

This version

0.0.3

Nov 8, 2025

0.0.1

Oct 2, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ypds_helpers-0.0.3.tar.gz (12.8 kB view details)

Uploaded Nov 8, 2025 Source

Built Distribution

ypds_helpers-0.0.3-py2.py3-none-any.whl (13.9 kB view details)

Uploaded Nov 8, 2025 Python 2, Python 3

File details

Details for the file ypds_helpers-0.0.3.tar.gz.

File metadata

Download URL: ypds_helpers-0.0.3.tar.gz
Upload date: Nov 8, 2025
Size: 12.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: python-httpx/0.28.1

File hashes

Hashes for ypds_helpers-0.0.3.tar.gz
Algorithm	Hash digest
SHA256	`de4f8ca919d690b2cd1a8c9fc67e2068e1fffc3aa9323a7edbc3801bdabb7c68`
MD5	`c66ff5fe62f1e40ef94dd8ca744a4879`
BLAKE2b-256	`4f88c4ec86987a25c0db4caf75cf6a57cb61e5180c3eb6088689fc22f107d9bb`

See more details on using hashes here.

File details

Details for the file ypds_helpers-0.0.3-py2.py3-none-any.whl.

File metadata

Download URL: ypds_helpers-0.0.3-py2.py3-none-any.whl
Upload date: Nov 8, 2025
Size: 13.9 kB
Tags: Python 2, Python 3
Uploaded using Trusted Publishing? No
Uploaded via: python-httpx/0.28.1

File hashes

Hashes for ypds_helpers-0.0.3-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`8ef86ffedf3f4b2ed26a82e0a2ecc7985342228433a92c57b5925581168acaab`
MD5	`72eb8b9bfe5f52127d907e7547d34abf`
BLAKE2b-256	`b1f00ef715dfd4b175b617531b62f884eeccfa69a41243549ff15d36c9750d7b`

See more details on using hashes here.
