Packaged functions for Machine Learning and Data Science tasks.
# 🐍 VKPyKit

A comprehensive Python toolkit for Machine Learning and Data Science workflows
Features • Installation • Quick Start • Documentation • Contributing
## 📖 Overview
VKPyKit is a production-ready Python package designed to streamline common Machine Learning and Data Science tasks. Built on top of industry-standard libraries like scikit-learn, pandas, matplotlib, seaborn, TensorFlow, and Keras, it provides convenient wrapper functions and utilities for:
- VKPy Utilities: Core utility functions for reproducible ML experiments including seed management
- Exploratory Data Analysis (EDA): Comprehensive visualization and statistical analysis tools
- Decision Trees (DT): Model training, evaluation, hyperparameter tuning, pruning, and tree visualization
- Linear Regression (LR): Regression model performance assessment with multiple metrics
- Machine Learning Models (MLM): General classification model performance evaluation and visualization
Instead of repeatedly writing the same boilerplate code across projects, VKPyKit packages these commonly-used functions into a reusable, well-tested library.
## ✨ Features

### 🛠️ VKPy Utilities
- Seed Management: Set random seeds across all major ML libraries (NumPy, TensorFlow, Keras, PyTorch)
- Reproducibility: Ensure consistent results across multiple runs of your experiments
- Multi-Library Support: Single function call to set seeds for all commonly used ML frameworks
- CUDA Support: Automatic configuration for GPU-based PyTorch experiments
### 📊 Exploratory Data Analysis (EDA)
- Stacked Bar Plots: Visualize categorical distributions with respect to target variables
- Labeled Bar Plots: Bar charts with percentage or count annotations
- Distribution Analysis: Combined histogram and boxplot visualizations
- Outlier Detection: Automated boxplot generation for outlier identification
- Correlation Heatmaps: Visualize feature correlations
- Pair Plots: Comprehensive pairwise relationship visualization
- Target Distribution: Analyze feature distributions across target classes
- Pivot Tables: Generate comprehensive pivot tables with multiple statistics
- Data Overview: Quick statistical summary and data quality assessment
- Image Grid Display: Plot a random sample of images from a dataset with labels
### 🌲 Decision Trees (DT)
- Model Performance Metrics: Comprehensive classification performance reporting
- Confusion Matrices: Visual confusion matrix generation with customization
- Tree Visualization: Render decision tree structure with optional text rules and feature importance
- Hyperparameter Tuning: Automated grid search for optimal decision tree parameters (returns model or full results dict)
- Pre-Pruning: Grid search with automated visualization and train/test evaluation
- Post-Pruning: Cost-complexity pruning path analysis with F1-score optimization
- Feature Importance: Analyze and visualize feature contributions
### 📈 Linear Regression (LR)
- Performance Evaluation: R², Adjusted R², RMSE, MAE, and MAPE regression metrics
- MAPE Score: Mean Absolute Percentage Error utility
- Adjusted R²: Penalized R² accounting for number of predictors
### 🤖 Machine Learning Models (MLM)
- Model Performance Metrics: Comprehensive classification performance reporting for any sklearn classifier
- Confusion Matrices: Visual confusion matrix generation with percentages and optional binary threshold conversion
- Model Evaluation: Accuracy, Precision, Recall, and F1-Score metrics; supports argmax for multi-class outputs
- Feature Importance Visualization: Plot and rank features by their importance scores
- Training History Tracking: Visualize Keras/TensorFlow model training metrics over epochs
- End-to-End Model Execution: Complete training, validation, and reporting pipeline for neural networks
- Universal Compatibility: Works with any scikit-learn classification model and TensorFlow/Keras models
## 🚀 Installation

### Using pip (Recommended)

```bash
pip install VKPyKit
```

### From Source

```bash
git clone https://github.com/assignarc/VKPyKit.git
cd VKPyKit
pip install -e .
```

### Requirements

- Python >= 3.9
- Dependencies: `numpy`, `pandas`, `scikit-learn`, `matplotlib`, `seaborn`, `openpyxl`, `plotly`, `tensorflow`, `keras`

All dependencies are installed automatically with the package.
## 🎯 Quick Start

```python
from VKPyKit.VKPy import *
from VKPyKit.EDA import *
from VKPyKit.DT import *
from VKPyKit.LR import *
from VKPyKit.MLM import *

# Set seeds for reproducibility across all ML libraries
VKPy.setseed(42)

# Quick EDA visualization
EDA.histogram_boxplot_all(
    data=df,
    figsize=(15, 10),
    bins=10,
    kde=True
)

# Train and evaluate a Decision Tree
DT.model_performance_classification(
    model=my_dt_classifier,
    predictors=X_test,
    expected=y_test,
    printall=True,
    title='Customer Churn Model'
)

# Evaluate any classification model
MLM.model_performance_classification(
    model=my_classifier,
    predictors=X_test,
    expected=y_test,
    printall=True,
    title='My Classification Model'
)

# Plot feature importance
MLM.plot_feature_importance(
    model=my_model,
    features=feature_names,
    numberoftopfeatures=10
)

# Evaluate a regression model
LR.model_performance_regression(
    model=my_lr_model,
    predictors=X_test,
    target=y_test
)
```
## 📚 Documentation

### VKPy Utilities

#### Seed Management for Reproducibility

Ensure consistent results across multiple runs of your ML experiments by setting random seeds for all major libraries:

```python
from VKPyKit.VKPy import *

# Set seed for reproducibility across NumPy, TensorFlow, Keras, and PyTorch
VKPy.setseed(42)

# Now all random operations will be reproducible. This affects:
# - NumPy random operations
# - TensorFlow/Keras model initialization and training
# - PyTorch model initialization and training (including CUDA operations)
# - Python's built-in random module
```
Benefits:
- ✅ Reproducible experiments across different runs
- ✅ Consistent model initialization weights
- ✅ Reliable train-test splits
- ✅ Easier debugging and model comparison
- ✅ GPU operations (CUDA) are also deterministic
### Exploratory Data Analysis (EDA)

#### Histogram with Boxplot

Visualize the distribution of all numerical features in your dataset:

```python
from VKPyKit.EDA import *
import pandas as pd

# Load your data
df = pd.read_csv('your_data.csv')

# Generate histogram and boxplot for all numerical columns
EDA.histogram_boxplot_all(
    data=df,
    figsize=(15, 10),
    bins=10,
    kde=True
)

# Generate histogram and boxplot for a single feature
EDA.histogram_boxplot(
    data=df,
    feature='age',
    figsize=(12, 7),
    kde=True,
    bins=20
)
```
#### Stacked Bar Plots

Visualize categorical variable distributions against a target:

```python
# Single stacked bar plot
EDA.barplot_stacked(
    data=df,
    predictor='category_column',
    target='target_column'
)

# Multiple stacked bar plots
EDA.barplot_stacked_all(
    data=df,
    predictors=['cat_col1', 'cat_col2', 'cat_col3'],
    target='target_column'
)
```
#### Labeled Bar Plots

```python
# Single labeled bar plot (with optional percentage display)
EDA.barplot_labeled(
    data=df,
    feature='category_column',
    percentages=True,
    category_levels=10  # show top 10 levels only
)

# Multiple labeled bar plots for a list of predictors
EDA.barplot_labeled_all(
    data=df,
    predictors=['cat_col1', 'cat_col2'],
    target='target_column'
)
```
#### Distribution Analysis for Target Variable

```python
# Analyze how a feature distributes across target classes
EDA.distribution_plot_for_target(
    data=df,
    predictor='numerical_feature',
    target='target_column',
    figsize=(12, 10)
)

# Analyze multiple features
EDA.distribution_plot_for_target_all(
    data=df,
    predictors=['feature1', 'feature2', 'feature3'],
    target='target_column',
    figsize=(12, 10)
)
```
#### Correlation Analysis

```python
# Generate correlation heatmap
EDA.heatmap_all(
    data=df,
    features=['feature1', 'feature2', 'feature3']  # Optional: specify features
)

# Generate pairplot for feature relationships
EDA.pairplot_all(
    data=df,
    features=['feature1', 'feature2', 'feature3'],
    hues=['target_column'],
    min_unique_values_for_pairplot=4,
    diagonal_plot_kind='auto'
)
```
#### Outlier Detection

```python
# Visualize outliers across all numerical features
EDA.boxplot_outliers(data=df)

# Boxplot for a dependent variable against multiple categories
EDA.boxplot_dependent_category(
    data=df,
    dependent='price',
    independent=['brand', 'category'],
    figsize=(12, 5)
)
```
#### Pivot Tables and Statistical Analysis

```python
# Generate comprehensive pivot tables with multiple statistics
EDA.pivot_table_all(
    data=df,
    predictors=['category1', 'category2'],
    target='numerical_target',
    stats=['mean', 'median', 'count', 'std'],
    figsize=(12, 10),
    chart_type='bar',  # 'bar', 'line', or None
    printall=True
)
```
#### Quick Data Overview

```python
# Get a comprehensive statistical summary and data quality check
EDA.overview(
    data=df,
    printall=True
)
# Displays: shape, data types, missing values, duplicates, basic statistics, and memory usage
```
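As a rough, hypothetical approximation of the information `EDA.overview` reports, a plain-pandas version might collect something like the following (the helper name `data_overview` is illustrative):

```python
import pandas as pd


def data_overview(df: pd.DataFrame) -> dict:
    """Collect the kind of summary an overview helper prints (a sketch, not VKPyKit's code)."""
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing": {k: int(v) for k, v in df.isna().sum().items()},
        "duplicates": int(df.duplicated().sum()),
        "memory_bytes": int(df.memory_usage(deep=True).sum()),
        "describe": df.describe(include="all"),
    }


df = pd.DataFrame({"a": [1, 2, 2, None], "b": ["x", "y", "y", "y"]})
summary = data_overview(df)
print(summary["shape"])    # (4, 2)
print(summary["missing"])  # {'a': 1, 'b': 0}
```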
#### Image Grid Display

```python
# Plot a sample grid of images with their labels
EDA.plot_images(
    images=image_array,  # numpy array of images
    labels=labels_df,    # DataFrame with 'Label' column
    rows=3,
    cols=4
)
```
### Decision Trees (DT)

#### Model Performance Evaluation

```python
from VKPyKit.DT import *
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Prepare your data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a model
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Evaluate performance
DT.model_performance_classification(
    model=model,
    predictors=X_test,
    expected=y_test,
    printall=True,
    title='Decision Tree Classifier Performance'
)
```
#### Confusion Matrix Visualization

```python
# Plot confusion matrix
DT.plot_confusion_matrix(
    model=model,
    predictors=X_test,
    expected=y_test,
    title='Confusion Matrix - Decision Tree'
)
```
#### Tree Visualization

```python
# Visualize the tree structure with optional text rules and feature importance
DT.visualize_decision_tree(
    model=model,
    features=X_train.columns.tolist(),
    classes=['No', 'Yes'],
    figsize=(20, 10),
    showtext=True,       # print text rules
    showimportance=True  # plot feature importance
)
```
#### Hyperparameter Tuning

```python
# Returns the best DecisionTreeClassifier model
best_model = DT.tune_decision_tree(
    X_train=X_train,
    y_train=y_train,
    X_test=X_test,
    y_test=y_test,
    max_depth_v=(2, 11, 2),  # (start, end, step)
    max_leaf_nodes_v=(10, 51, 10),
    min_samples_split_v=(10, 51, 10),
    printall=True,
    sortresultby=['F1Difference'],
    sortbyAscending=False
)

# Returns a full results dictionary (scores df, tuned model scores df, and best model)
results = DT.tune_decision_tree_results(
    X_train=X_train,
    y_train=y_train,
    X_test=X_test,
    y_test=y_test,
    max_depth_v=(2, 11, 2),
    max_leaf_nodes_v=(10, 51, 10),
    min_samples_split_v=(10, 51, 10),
    printall=True,
    sortresultby=['F1Difference'],
    sortbyAscending=False,
    metrictooptimize='F1Difference'  # 'Accuracy', 'Recall', 'Precision', 'F1', 'F1Difference', 'RecallDifference'
)
print(results['scores'])              # All combinations
print(results['tuned_model_scores'])  # Best combination
best_model = results['model']
```
#### Pre-Pruning

```python
# Run grid search, visualize the best tree, and evaluate on train/test
prepruning_results = DT.prepruning_nodes_samples_split(
    X_train=X_train,
    y_train=y_train,
    X_test=X_test,
    y_test=y_test,
    max_depth_v=(2, 9, 2),
    max_leaf_nodes_v=(50, 250, 50),
    min_samples_split_v=(10, 70, 10),
    printall=True,
    sortresultby='F1',
    sortbyAscending=False
)
model = prepruning_results['model']
train_perf = prepruning_results['prepruning_train_perf']
test_perf = prepruning_results['prepruning_test_perf']
```
#### Post-Pruning (Cost-Complexity)

```python
# Analyze the cost-complexity path and select the best alpha
postpruning_results = DT.postpruning_cost_complexity(
    X_train=X_train,
    y_train=y_train,
    X_test=X_test,
    y_test=y_test,
    printall=True,
    figsize=(10, 6)
)
model = postpruning_results['model']
train_perf = postpruning_results['postpruning_train_perf']
test_perf = postpruning_results['postpruning_test_perf']
```
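For background, post-pruning of this kind builds on scikit-learn's `cost_complexity_pruning_path`. A minimal standalone sketch, selecting the alpha with the best held-out F1 score (the synthetic data and variable names are illustrative, not VKPyKit's internals):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data for demonstration
X, y = make_classification(n_samples=400, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Effective alphas along the minimal cost-complexity pruning path
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_f1 = 0.0, -1.0
for alpha in path.ccp_alphas[:-1]:  # the last alpha prunes the tree to a single node
    tree = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    f1 = f1_score(y_test, tree.predict(X_test))
    if f1 > best_f1:
        best_alpha, best_f1 = alpha, f1

# Refit the pruned tree with the winning alpha
pruned = DecisionTreeClassifier(random_state=42, ccp_alpha=best_alpha).fit(X_train, y_train)
print(f"best alpha={best_alpha:.5f}, test F1={best_f1:.3f}")
```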
### Linear Regression (LR)

#### Evaluate a Regression Model

```python
from VKPyKit.LR import *

# Compute comprehensive regression metrics (RMSE, MAE, MAPE, R², Adj R²)
perf_df = LR.model_performance_regression(
    model=my_lr_model,
    predictors=X_test,
    target=y_test
)
print(perf_df)
# Returns a DataFrame with: RMSE, MAE, MAPE, R-squared, Adj R-squared

# Utility: MAPE score
mape = LR.mape_score(targets=y_test, predictions=y_pred)

# Utility: Adjusted R² score
adj_r2 = LR.adj_r2_score(predictors=X_test, targets=y_test, predictions=y_pred)
```
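For reference, the standard formulas behind these metrics can be cross-checked in plain NumPy. The helper names `mape` and `adj_r2` below are illustrative; VKPyKit's exact signatures and conventions may differ:

```python
import numpy as np


def mape(targets, predictions):
    """Mean Absolute Percentage Error, in percent."""
    t, p = np.asarray(targets, float), np.asarray(predictions, float)
    return np.mean(np.abs((t - p) / t)) * 100


def adj_r2(n_samples, n_predictors, targets, predictions):
    """Adjusted R²: penalizes R² for the number of predictors k."""
    t, p = np.asarray(targets, float), np.asarray(predictions, float)
    ss_res = np.sum((t - p) ** 2)                 # residual sum of squares
    ss_tot = np.sum((t - t.mean()) ** 2)          # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])
print(mape(y_true, y_pred))          # mean of (10, 5, 3.33, 2.5)% ≈ 5.21
print(adj_r2(4, 1, y_true, y_pred))  # R² = 0.992 penalized to 0.988
```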
### Machine Learning Models (MLM)

#### Evaluate Any Classification Model

The MLM module works with any scikit-learn classifier (Random Forest, SVM, Logistic Regression, etc.):

```python
from VKPyKit.MLM import *
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Prepare your data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train any classifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate performance with comprehensive metrics
MLM.model_performance_classification(
    model=model,
    predictors=X_test,
    expected=y_test,
    threshold=0.5,           # threshold for binary classification
    score_average='binary',  # 'binary', 'macro', 'weighted', etc.
    printall=True,
    title='Random Forest Classifier',
    idmax=False              # set True for multi-class models using argmax
)
```
#### Confusion Matrix for Any Classifier

```python
# Plot confusion matrix with percentages
MLM.plot_confusion_matrix(
    model=model,
    predictors=X_test,
    expected=y_test,
    convert_pred_to_binary=False,  # set True to threshold continuous predictions
    threshold=0.5,
    title='Random Forest - Confusion Matrix'
)
```
#### Feature Importance Visualization

```python
# Plot feature importance for tree-based models
MLM.plot_feature_importance(
    model=model,
    features=X_train.columns.tolist(),
    figsize=(10, 6),
    numberoftopfeatures=15,    # Show top 15 features
    title='Random Forest Feature Importance',
    ignoreZeroImportance=True  # Hide features with zero importance
)
```
#### Training History for Neural Networks

```python
from tensorflow import keras

# Train a Keras model
model = keras.Sequential([...])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
history = model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=50)

# Plot all metrics in training history
MLM.model_history_plot(
    history=history,
    title='Neural Network Training'
)

# Plot a specific metric only
MLM.model_history_plot(
    history=history,
    plot_metric='accuracy',
    title='Model Accuracy Over Epochs'
)
```
#### End-to-End Model Execution

```python
import tensorflow as tf
from tensorflow import keras

# Prepare data dictionary
data = {
    "X_train": X_train,
    "y_train": y_train,
    "X_val": X_val,
    "y_val": y_val
}

# Define your model
model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(32, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])

# Execute the complete training and evaluation pipeline
results = MLM.execute_model(
    model_in=model,
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    model_name='Binary Classifier',
    data=data,
    class_weights={0: 1, 1: 2},  # Handle class imbalance
    loss_type=keras.losses.BinaryCrossentropy(),
    optmization_metrics=[keras.metrics.BinaryAccuracy(), keras.metrics.Recall()],
    threshold=0.5,
    target_names=['Class 0', 'Class 1'],
    epochs=50,
    batch_size=32,
    verbose=1,
    print_model_summary=True,
    activation='relu'
)

# The results DataFrame contains comprehensive metrics
print(results[['ModelName', 'Validation_Accuracy', 'Validation_F1Score']])
```
## 🛠️ API Reference

### VKPy Class

| Method | Description |
|---|---|
| `setseed(seed)` | Set random seeds across NumPy, TensorFlow, Keras, and PyTorch |
### EDA Class

| Method | Description |
|---|---|
| `histogram_boxplot()` | Combined histogram and boxplot for a single feature |
| `histogram_boxplot_all()` | Combined histogram and boxplot for all numerical features |
| `barplot_stacked()` | Stacked bar chart for a single categorical variable |
| `barplot_stacked_all()` | Multiple stacked bar charts for a list of predictors |
| `barplot_labeled()` | Bar plot with count/percentage labels for a single feature |
| `barplot_labeled_all()` | Labeled bar plots for a list of categorical predictors |
| `distribution_plot_for_target()` | Distribution analysis across target classes (single) |
| `distribution_plot_for_target_all()` | Distribution analysis across target classes (multiple) |
| `boxplot_outliers()` | Outlier detection boxplots for all numerical features |
| `boxplot_dependent_category()` | Boxplot for a dependent variable vs. category features |
| `heatmap_all()` | Correlation heatmap |
| `pairplot_all()` | Pairwise feature relationship plots |
| `pivot_table_all()` | Generate pivot tables with multiple statistics |
| `overview()` | Quick statistical summary and data quality check |
| `plot_images()` | Display a random sample grid of images with labels |
### DT Class

| Method | Description |
|---|---|
| `model_performance_classification()` | Comprehensive performance metrics (Accuracy, Recall, Precision, F1) |
| `plot_confusion_matrix()` | Visualize confusion matrix with counts and percentages |
| `visualize_decision_tree()` | Render tree structure; optionally show text rules and feature importance |
| `tune_decision_tree()` | Grid search tuning; returns the best DecisionTreeClassifier model |
| `tune_decision_tree_results()` | Grid search tuning; returns a dict of scores, tuned scores, and model |
| `prepruning_nodes_samples_split()` | Pre-pruning via grid search with full train/test evaluation |
| `postpruning_cost_complexity()` | Post-pruning via cost-complexity path; selects best alpha by test F1 |
| `plot_feature_importance()` | Visualize feature importance scores |
### LR Class

| Method | Description |
|---|---|
| `model_performance_regression()` | Compute RMSE, MAE, MAPE, R², and Adjusted R² metrics |
| `mape_score()` | Compute Mean Absolute Percentage Error |
| `adj_r2_score()` | Compute Adjusted R² given predictors, targets, and predictions |
### MLM Class

| Method | Description |
|---|---|
| `model_performance_classification()` | Comprehensive performance metrics for any classifier; supports threshold and argmax |
| `plot_confusion_matrix()` | Visualize confusion matrix with percentages; supports binary threshold conversion |
| `plot_feature_importance()` | Plot and display feature importance rankings with optional filtering |
| `model_history_plot()` | Visualize Keras/TensorFlow training history (loss, accuracy, custom metrics) |
| `execute_model()` | Complete end-to-end training, validation, and evaluation pipeline for Keras |
## 🧪 Testing

VKPyKit includes a comprehensive test suite to ensure code quality and reliability. The test suite covers all major modules:

- EDA Module Tests (`tests/test_EDA.py`): Tests for all exploratory data analysis functions
- DT Module Tests (`tests/test_DT.py`): Tests for decision tree utilities
- LR Module Tests (`tests/test_LR.py`): Tests for linear regression functions
- MLM Module Tests (`tests/test_MLM.py`): Tests for machine learning model evaluation
### Running Tests

```bash
# Install test dependencies
pip install VKPyKit[test]

# Run all tests
pytest

# Run tests with coverage report
pytest --cov=VKPyKit
```

The test suite uses synthetic data generated via `conftest.py` to ensure reproducible and reliable testing.
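A hypothetical `conftest.py` fixture of the kind described might look like this (the fixture name, columns, and seed are illustrative, not the project's actual fixtures):

```python
import numpy as np
import pandas as pd
import pytest


def make_synthetic_df(n=100, seed=42):
    """Build a small, deterministic DataFrame so tests are reproducible."""
    rng = np.random.default_rng(seed)  # fixed seed -> identical data every run
    return pd.DataFrame({
        "age": rng.integers(18, 70, size=n),
        "income": rng.normal(50_000, 12_000, size=n).round(2),
        "segment": rng.choice(["A", "B", "C"], size=n),
        "churn": rng.integers(0, 2, size=n),
    })


@pytest.fixture
def synthetic_df():
    return make_synthetic_df()


def test_shapes(synthetic_df):
    assert synthetic_df.shape == (100, 4)
    assert synthetic_df["churn"].isin([0, 1]).all()
```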
## 🤝 Contributing

Contributions are welcome! If you have additional utility functions or improvements, please contribute to the project.

### How to Contribute

- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
### Reporting Issues

Found a bug or have a feature request? Please open an issue on GitHub Issues.

## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

## 👤 Author

Vishal Khapre

- GitHub: @assignarc
- PyPI: VKPyKit

## 🌟 Acknowledgments

Built with scikit-learn, pandas, matplotlib, seaborn, TensorFlow, and Keras.

If you find VKPyKit useful, please consider giving it a ⭐ on GitHub!