ypds_helpers
A Python toolkit for data analysis, visualization, and machine learning preprocessing
ypds_helpers is a collection of utility functions and tools designed to streamline common data science workflows. It provides convenient helpers for data exploration, correlation analysis, visualization, and machine learning model preprocessing pipelines.
Features • Installation • Usage • API Reference
Features
- Data Handling: Quick data exploration with comprehensive statistics and type-based column selection
- Correlation Analysis: Advanced correlation detection using Phik (φk) correlation coefficient for mixed data types
- Visualization: Ready-to-use plotting functions for distributions, categorical data, and model residuals
- ML Preprocessing: Pre-built pipelines for numerical and categorical data preprocessing
- Model Evaluation: Grid search utilities with automatic result tracking and comparison
- Jupyter Integration: Automatic detection and proper display in both Jupyter notebooks and regular Python environments
Installation
```bash
pip install ypds-helpers
```
The package requires Python ≤3.13 and automatically installs the following dependencies:
- pandas
- numpy
- scikit-learn
- seaborn
- matplotlib
- phik
Usage
Quick Data Exploration
```python
import pandas as pd
from ypds_helpers.data_handling import show_df, get_num_cols, get_cat_cols

# Load your data
df = pd.read_csv('data.csv')

# Get a comprehensive overview
show_df(df, n=10)  # Shows first 10 rows plus statistics for numerical and categorical columns

# Get column lists by type
numerical_cols = get_num_cols(df, exclude_cols=['id'])
categorical_cols = get_cat_cols(df, exclude_cols=['target'])
```
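Under the hood, type-based column selection can be approximated with pandas' `select_dtypes`. The helper below (`select_cols_by_type`, a name invented for this sketch) is a minimal illustration of the idea, not the package's actual implementation:

```python
import pandas as pd

def select_cols_by_type(df, numeric=True, exclude_cols=None):
    """Return column names by dtype, minus an optional exclusion list."""
    exclude = set(exclude_cols or [])
    dtypes = "number" if numeric else ["object", "category"]
    cols = df.select_dtypes(include=dtypes).columns
    return [c for c in cols if c not in exclude]

df = pd.DataFrame({
    "id": [1, 2, 3],
    "income": [40.0, 52.5, 61.2],
    "region": ["north", "south", "north"],
})

print(select_cols_by_type(df, numeric=True, exclude_cols=["id"]))  # ['income']
print(select_cols_by_type(df, numeric=False))                      # ['region']
```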
Correlation Analysis
```python
from ypds_helpers.data_handling import highest_corrs

# Find the strongest correlations using Phik (works with mixed data types)
top_correlations = highest_corrs(
    df,
    cols=['age', 'income', 'category', 'score'],
    interval_cols=['age', 'income', 'score'],
    num=15,
)
```
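For intuition, the "top-N strongest pairs" idea can be sketched with plain Pearson correlations. The `top_corrs` function below is a hypothetical stand-in, not a reproduction of `highest_corrs` (φk additionally handles categorical and interval columns):

```python
import pandas as pd

def top_corrs(df, num=10):
    """Return the `num` strongest pairwise correlations by absolute value."""
    corr = df.corr(numeric_only=True).abs()
    pairs = corr.stack()
    # Keep each pair once and drop the self-correlation diagonal.
    pairs = pairs[pairs.index.get_level_values(0) < pairs.index.get_level_values(1)]
    return pairs.sort_values(ascending=False).head(num)

df = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "income": [25, 34, 48, 60],
    "score": [9, 7, 4, 2],
})
print(top_corrs(df, num=2))
```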
Data Visualization
```python
from ypds_helpers.plotting import plot_numeric, plot_cats, show_residues

# Visualize numerical features with histograms and boxplots
plot_numeric(
    df,
    num_cols=['age', 'income', 'score'],
    hue='category',  # Split by category
    normalize=True,
    kde=True,
)

# Visualize categorical distributions
plot_cats(
    df,
    cat_cols=['region', 'product_type'],
    hue='customer_segment',
    max_cats=10,  # Group smaller categories
)

# Analyze model residuals
show_residues(y_true, y_pred, title='Model Performance')
```
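The residuals that `show_residues` plots can also be summarized numerically. The `residual_summary` helper below is an illustrative sketch, not part of the package:

```python
import numpy as np

def residual_summary(y_true, y_pred):
    """Summarize residuals (y_true - y_pred): mean, std, and RMSE."""
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return {
        "mean": float(residuals.mean()),              # bias: should be near 0
        "std": float(residuals.std()),                # spread of the errors
        "rmse": float(np.sqrt((residuals ** 2).mean())),
    }

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 6.5, 9.5]
print(residual_summary(y_true, y_pred))  # mean 0.0, std 0.5, rmse 0.5
```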
Machine Learning Preprocessing
```python
from ypds_helpers.models import (
    make_num_processor,
    make_ord_processor,
    make_typo_corrector,
    grid_search,
)
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor

# Create preprocessing pipelines
num_pipeline = make_num_processor(min_val=0, max_val=100)
cat_pipeline = make_ord_processor(categories=['low', 'medium', 'high'])

# Combine transformers
preprocessor = ColumnTransformer([
    ('num', num_pipeline, numerical_cols),
    ('cat', cat_pipeline, categorical_cols),
])

# Create the full pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor()),
])

# Grid search with automatic result tracking
param_grid = {
    'model__n_estimators': [100, 200],
    'model__max_depth': [10, 20, None],
}

grid_search(
    pipeline=pipeline,
    grid=param_grid,
    X_train=X_train,
    y_train=y_train,
    model='random_forest',
    scoring='neg_mean_squared_error',
    cv_method=5,
    n_jobs=-1,
)
```
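One plausible way such result tracking could work (the package's actual storage format is not documented here, so everything below is an assumption) is to wrap scikit-learn's `GridSearchCV` and append each search's best score and parameters to a shared list keyed by model name:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Hypothetical tracker; the real package may store results differently.
results = []

def tracked_grid_search(estimator, grid, X, y, model, scoring, cv=5, n_jobs=-1):
    """Run GridSearchCV and log the best result under a model name."""
    search = GridSearchCV(estimator, grid, scoring=scoring, cv=cv, n_jobs=n_jobs)
    search.fit(X, y)
    results.append({
        "model": model,
        "best_score": search.best_score_,
        "best_params": search.best_params_,
    })
    return search

# Toy demo on a perfectly linear dataset
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 2 * X.ravel() + 1
tracked_grid_search(Ridge(), {"alpha": [0.1, 1.0]}, X, y,
                    model="ridge", scoring="neg_mean_squared_error", cv=2)
print(pd.DataFrame(results))
```

Accumulating searches in one table makes side-by-side comparison of models a single `DataFrame` sort away.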
API Reference
Data Handling Module
show_df(df, n=5)
Display comprehensive information about a DataFrame including head, statistics, and info.
Parameters:
- df: DataFrame to analyze
- n: Number of rows to display (default: 5)
get_num_cols(df, exclude_cols=None)
Returns list of numerical column names.
get_cat_cols(df, exclude_cols=None)
Returns list of categorical column names.
print_unique_cat_vals(dfs, exclude=None)
Print unique values for categorical features across one or multiple DataFrames.
highest_corrs(df, cols=None, interval_cols=None, num=10)
Calculate and return the highest correlations using Phik coefficient.
Parameters:
- df: DataFrame with data
- cols: Columns to analyze (default: all columns)
- interval_cols: Numerical columns for interval correlation
- num: Number of top correlations to return (default: 10)
Plotting Module
plot_numeric(df, num_cols=None, title='', hue=None, normalize=False, kde=True, ncols=2, scale=2.5, **kwargs)
Create histograms and boxplots for numerical features.
plot_cats(df, cat_cols=None, hue=None, title='', ncols=2, max_cats=10, max_cats_alias='all_other', **kwargs)
Create bar charts for categorical feature distributions.
show_residues(y_true, y_pred, title='', **kwargs)
Plot residual distribution and scatter plot for model evaluation.
Models Module
make_num_processor(min_val, max_val)
Create a preprocessing pipeline for numerical data with sanitization, imputation, and scaling.
make_ord_processor(categories)
Create a preprocessing pipeline for ordinal categorical data with typo correction, encoding, and imputation.
make_typo_corrector(correct_vals)
Create a transformer that corrects single-character typos using Hamming distance.
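The single-character correction described for make_typo_corrector can be sketched in a few lines. `make_corrector` below is a hypothetical standalone version for illustration, not the package's transformer (which would implement the scikit-learn transformer API):

```python
def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def make_corrector(correct_vals):
    """Map a value to a known category if it differs by exactly one character."""
    def correct(value):
        for target in correct_vals:
            if len(value) == len(target) and hamming(value, target) == 1:
                return target
        return value  # already correct, or too far from any known value
    return correct

fix = make_corrector(["low", "medium", "high"])
print(fix("hogh"))    # 'high' (one character off)
print(fix("medium"))  # 'medium' (unchanged: already a known value)
```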
grid_search(pipeline, grid, X_train, y_train, model, scoring='roc_auc', cv_method=None, n_jobs=-1)
Perform grid search with automatic result tracking and model comparison.
Parameters:
- pipeline: Scikit-learn pipeline
- grid: Parameter grid (list or dict)
- X_train, y_train: Training data
- model: Model name for tracking
- scoring: Scoring metric (default: 'roc_auc')
- cv_method: Cross-validation method (default: 5-fold)
- n_jobs: Number of parallel jobs (default: -1 for all cores)
show_search_result(search, n_results=10)
Display formatted grid search results.
evaluate_params(grid)
Display maximum metric values for each hyperparameter.
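The idea behind evaluate_params (the best mean test score achieved at each value of each hyperparameter) can be sketched from a `GridSearchCV.cv_results_`-style table. The toy data below is invented for illustration:

```python
import pandas as pd

# A toy stand-in for GridSearchCV.cv_results_ (real results have more fields).
cv_results = pd.DataFrame({
    "param_model__n_estimators": [100, 100, 200, 200],
    "param_model__max_depth": [10, 20, 10, 20],
    "mean_test_score": [0.81, 0.84, 0.83, 0.86],
})

# For each hyperparameter, show the best mean score achieved at each value.
for col in [c for c in cv_results.columns if c.startswith("param_")]:
    print(cv_results.groupby(col)["mean_test_score"].max(), end="\n\n")
```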
Examples
Check out the examples directory for complete working examples including:
- Data exploration workflows
- Feature engineering pipelines
- Model training and evaluation
- Visualization galleries
Development Status
This package is currently in Beta (Development Status: 4 - Beta). The API may change in future releases.
License
ypds_helpers is distributed under the terms of the MIT license.