A comprehensive data science toolkit with 221+ functions for ML workflows

🚀 DSKit - A Unified Wrapper Library for Data Science & ML

DSKit is a comprehensive, community-driven, open-source Python library that wraps complex Data Science and ML operations in intuitive one-line commands.

Cleaning, EDA, plotting, preprocessing, modeling, evaluation, and explainability usually take hundreds of lines of code. DSKit reduces each of these steps to a short, readable, reusable call that is intended to be production-ready.

The goal is to provide a complete end-to-end Data Science ecosystem in one place, built from wrapper-style functions and classes that cover everything from basic data manipulation to advanced AutoML.


🎯 Project Objective

To create a Python library that lets users perform complete Data Science workflows with minimal code:

from dskit import DSKit

# Complete ML Pipeline in 4 lines!
kit = DSKit.load("data.csv")
kit.comprehensive_eda(target_col="target").clean().engineer_features()
kit.train_advanced("xgboost").auto_tune().evaluate()
kit.explain()  # Generate SHAP explanations

The library remains:

  • Simple: One-line commands for complex operations
  • Comprehensive: 221 functions covering the entire ML pipeline
  • Extensible: Modular design for easy customization
  • Beginner-friendly: Intuitive API with smart defaults
  • Expert-ready: Advanced features and customization options
  • Production-ready: Robust error handling and optimization

📦 Installation

From PyPI (Recommended)

# Note: the release files for this project are named ak_dskit-1.0.2, which
# suggests the distribution is published on PyPI as ak-dskit. If
# `pip install dskit` resolves to a different project, use ak-dskit instead.

# Basic installation
pip install dskit

# Full installation with all optional dependencies
# (quote the extras so shells such as zsh do not expand the brackets)
pip install "dskit[full]"

# Install specific feature sets
pip install "dskit[visualization]"  # Plotly support
pip install "dskit[nlp]"            # NLP utilities
pip install "dskit[automl]"         # AutoML algorithms

# Development installation
pip install "dskit[dev]"

From Source

git clone https://github.com/Programmers-Paradise/imputeKit.git
cd imputeKit
pip install -e .

Verify Installation

# Test the package
python test_package.py

# Check CLI
dskit --help

📦 Core Modules

DSKit includes comprehensive modules for:

📁 Data I/O

  • Multi-format loading (CSV, Excel, JSON, Parquet)
  • Batch folder processing
  • Smart data type detection

🧹 Data Cleaning

  • Auto-detect and fix data types
  • Smart missing value imputation
  • Outlier detection and removal
  • Column name standardization
  • Text preprocessing and NLP utilities
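
As an illustration of what three of these cleaning steps involve, here is a minimal standard-library sketch. It is not DSKit's implementation; the helper names (standardize_column_name, impute_median, iqr_outliers) and the 1.5 × IQR rule are assumptions chosen for the example.

```python
import re
import statistics


def standardize_column_name(name):
    """Lower-case a column name and replace runs of non-alphanumerics with '_'."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_").lower()


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


print(standardize_column_name("  Total Sales ($) "))  # total_sales
print(impute_median([25, None, 31, 40, None]))        # [25, 31, 31, 40, 31]
print(iqr_outliers([10, 12, 11, 13, 12, 11, 95]))     # [95]
```

Median imputation and the IQR rule are common defaults because both are robust to the extreme values they are meant to handle.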

📊 Exploratory Data Analysis

  • Comprehensive EDA reports
  • Data health scoring
  • Interactive visualizations
  • Statistical summaries
  • Correlation analysis
  • Missing data patterns
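
Two of the quantities an EDA report typically contains, the missing-value rate of a column and the Pearson correlation between two columns, can be sketched with the standard library alone. The functions and the toy rows below are hypothetical and only illustrate the calculations; they are not taken from DSKit.

```python
import math


def missing_rate(rows, column):
    """Fraction of rows in which `column` is None."""
    return sum(row[column] is None for row in rows) / len(rows)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rows = [
    {"age": 25, "income": 30_000},
    {"age": 35, "income": None},
    {"age": 45, "income": 50_000},
    {"age": 55, "income": 60_000},
]
print(missing_rate(rows, "income"))  # 0.25

# Correlation is computed on complete rows only.
complete = [row for row in rows if row["income"] is not None]
corr = pearson([row["age"] for row in complete],
               [row["income"] for row in complete])
print(round(corr, 3))
```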

🔧 Feature Engineering

  • Polynomial and interaction features
  • Date/time feature extraction
  • Binning and discretization
  • Target encoding
  • Dimensionality reduction (PCA)
  • Text feature extraction
  • Sentiment analysis
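
To make two of these transformations concrete, the sketch below shows smoothed target encoding and equal-width binning in plain Python. The function names, the smoothing formula, and the sample data are assumptions for illustration; DSKit's own behaviour may differ.

```python
from collections import defaultdict


def target_encode(categories, targets, smoothing=1.0):
    """Map each category to a smoothed mean of the target.

    encoded = (sum_of_targets + smoothing * global_mean) / (count + smoothing)
    """
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    return {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in sums
    }


def equal_width_bins(values, n_bins):
    """Assign each value a bin index in [0, n_bins - 1] using equal-width bins.

    Assumes the values are not all identical (the bin width must be non-zero).
    """
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    return [min(int((v - low) / width), n_bins - 1) for v in values]


encoding = target_encode(["A", "A", "B", "B", "B"], [1, 0, 1, 1, 1])
print({cat: round(value, 2) for cat, value in encoding.items()})
print(equal_width_bins([18, 25, 40, 62, 78], n_bins=3))  # [0, 0, 1, 2, 2]
```

Target encoding should be fitted on training data only; computing it on the full dataset leaks the target into the features.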

🤖 Machine Learning

  • 15+ algorithms (including XGBoost, LightGBM, CatBoost)
  • AutoML capabilities
  • Hyperparameter optimization
  • Cross-validation
  • Ensemble methods
  • Imbalanced data handling
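
Cross-validation splits the data into k folds and holds out each fold once while training on the rest. A minimal sketch of the index bookkeeping, using only the standard library, is shown here; k_fold_indices is a hypothetical helper, not a DSKit function.

```python
import random


def k_fold_indices(n_samples, k, seed=42):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
# 6 4
# 7 3
# 7 3
```

Every sample appears in exactly one test fold, so the averaged score uses each observation once for evaluation.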

📈 Visualization

  • Static plots (matplotlib/seaborn)
  • Interactive plots (plotly)
  • Model performance charts
  • Feature importance plots
  • Advanced correlation heatmaps

🧠 Model Explainability

  • SHAP integration
  • Feature importance analysis
  • Model performance metrics
  • Error analysis
  • Learning curves

📐 Hyperplane Analysis

  • Algorithm-specific hyperplane visualization
  • SVM margins and support vectors
  • Logistic regression probability contours
  • Perceptron misclassification highlighting
  • LDA class centers and projections
  • Linear regression residual analysis
  • Multi-algorithm comparison tools
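
The geometry behind these plots is compact. For a linear model with weights w and bias b, the hyperplane is the set of points where w · x + b = 0, the SVM margin has width 2 / ||w||, and logistic regression turns the same decision value into a probability with the sigmoid function. The sketch below illustrates these formulas with hypothetical weights; it does not call DSKit.

```python
import math


def decision_value(w, b, x):
    """Value of w . x + b; its sign gives the side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def signed_distance(w, b, x):
    """Signed Euclidean distance from point x to the hyperplane w . x + b = 0."""
    return decision_value(w, b, x) / math.sqrt(sum(wi * wi for wi in w))


def margin_width(w):
    """Distance between the SVM margin lines w . x + b = +1 and w . x + b = -1."""
    return 2 / math.sqrt(sum(wi * wi for wi in w))


def logistic_probability(w, b, x):
    """Logistic regression probability of the positive class at x."""
    return 1 / (1 + math.exp(-decision_value(w, b, x)))


w, b = [3.0, 4.0], -5.0                          # hypothetical fitted parameters
print(signed_distance(w, b, [3.0, 4.0]))         # 4.0
print(signed_distance(w, b, [0.0, 0.0]))         # -1.0
print(margin_width(w))                           # 0.4
print(logistic_probability(w, b, [1.0, 0.5]))    # 0.5 (point on the hyperplane)
```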

🎯 AutoML Features

  • Automated preprocessing pipelines
  • Model comparison and selection
  • Hyperparameter tuning (Grid, Random, Bayesian, Optuna)
  • Automated feature selection
  • Pipeline optimization

🚀 Quick Start

Basic Usage

from dskit import DSKit

# Load and explore data
kit = DSKit.load("your_data.csv")
kit.data_health_check()  # Get data quality score
kit.comprehensive_eda(target_col="target")  # Full EDA report

# Clean and preprocess
kit.clean()  # Auto-clean: fix types, handle missing, normalize columns
kit.engineer_features()  # Create polynomial, date, and text features

# Train and evaluate models
kit.train_test_auto(target="your_target")
kit.compare_models("your_target")  # Compare multiple algorithms
kit.train_advanced("xgboost").auto_tune()  # Train with hyperparameter tuning
kit.evaluate().explain()  # Evaluate and generate SHAP explanations

Advanced Features

# Advanced text processing
kit.sentiment_analysis(["text_column"])
kit.extract_text_features(["text_column"])
kit.generate_wordcloud("text_column")

# Feature engineering
kit.create_polynomial_features(degree=3)
kit.create_date_features(["date_column"])
kit.apply_pca(variance_threshold=0.95)

# AutoML
kit.auto_tune(method="optuna", max_evals=100)
best_models = kit.compare_models("target", task="classification")

# Advanced visualizations
kit.plot_feature_importance(top_n=20)
kit.plot_learning_curves()
kit.plot_validation_curves()

# Algorithm-specific hyperplane visualization
# (svm_model, lr_model, and perceptron_model are already-fitted models;
#  X and y are the features and labels they were trained on)
import dskit

dskit.plot_svm_hyperplane(svm_model, X, y)  # SVM with margins
dskit.plot_logistic_hyperplane(lr_model, X, y)  # Probability contours
dskit.plot_perceptron_hyperplane(perceptron_model, X, y)  # Misclassified points

# Compare multiple algorithm hyperplanes
models = {'SVM': svm_model, 'LR': lr_model, 'Perceptron': perceptron_model}
dskit.compare_algorithm_hyperplanes(models, X, y)

📚 Complete Feature Documentation

🧩 IMPLEMENTED FEATURES (All Tasks Complete)

Each task below is numbered and written in simple language, with enough theory that any contributor, including new ones, can understand exactly what to build.


📖 Examples & Tutorials

Complete ML Pipeline Example

import pandas as pd
from dskit import DSKit

# 1. Load and explore
kit = DSKit.load("customer_data.csv")
health_score = kit.data_health_check()  # Data quality score out of 100, e.g. 85.3

# 2. Comprehensive EDA
kit.comprehensive_eda(target_col="churn", sample_size=1000)
kit.generate_profile_report("eda_report.html")  # Automated EDA report

# 3. Advanced text processing (if text columns exist)
kit.advanced_text_clean(["feedback"])
kit.sentiment_analysis(["feedback"])
kit.extract_text_features(["feedback"])

# 4. Feature engineering
kit.create_date_features(["registration_date"])
kit.create_polynomial_features(degree=2, interaction_only=True)
kit.create_binning_features(["age", "income"], n_bins=5)

# 5. Preprocessing
kit.clean()  # Auto-clean pipeline
kit.handle_imbalanced_data(method="smote")  # Handle class imbalance

# 6. Model training and optimization
X_train, X_test, y_train, y_test = kit.train_test_auto("churn")
comparison = kit.compare_models("churn")  # Compare 10+ algorithms
kit.train_advanced("xgboost").auto_tune(method="optuna", max_evals=50)

# 7. Evaluation and explainability
kit.evaluate().explain()  # Comprehensive evaluation + SHAP
kit.plot_feature_importance()
kit.cross_validate(cv=5)

NLP Pipeline Example

# Text analysis workflow
from dskit import DSKit

kit = DSKit.load("reviews.csv")
kit.text_stats(["review_text"])  # Basic text statistics
kit.advanced_text_clean(["review_text"], remove_urls=True, expand_contractions=True)
kit.sentiment_analysis(["review_text"])  # Add sentiment scores
kit.generate_wordcloud("review_text", max_words=100)
kit.extract_keywords("review_text", top_n=20)

Time Series Feature Engineering

# Date/time feature extraction (kit is a DSKit instance created with DSKit.load)
kit.create_date_features(["transaction_date"])
# Creates: year, month, day, weekday, quarter, is_weekend columns

kit.create_aggregation_features("customer_id", ["amount"], ["mean", "std", "count"])
# Creates aggregated features grouped by customer
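
For readers who want to see what such columns contain, here is a standard-library sketch of the same two ideas: deriving calendar fields from a date and computing per-group statistics. The helpers date_features and aggregate_by_key are hypothetical, and returning a standard deviation of 0.0 for single-row groups is a choice made for this example.

```python
import statistics
from collections import defaultdict
from datetime import date


def date_features(d):
    """Derive calendar features from a datetime.date."""
    return {
        "year": d.year,
        "month": d.month,
        "day": d.day,
        "weekday": d.weekday(),             # Monday = 0 ... Sunday = 6
        "quarter": (d.month - 1) // 3 + 1,
        "is_weekend": d.weekday() >= 5,
    }


def aggregate_by_key(rows, key, value):
    """Group rows by `key` and compute mean, sample std, and count of `value`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {
        group: {
            "mean": statistics.mean(vals),
            "std": statistics.stdev(vals) if len(vals) > 1 else 0.0,
            "count": len(vals),
        }
        for group, vals in groups.items()
    }


features = date_features(date(2024, 3, 16))  # a Saturday
print(features)

transactions = [
    {"customer_id": "c1", "amount": 10.0},
    {"customer_id": "c1", "amount": 30.0},
    {"customer_id": "c2", "amount": 5.0},
]
stats = aggregate_by_key(transactions, "customer_id", "amount")
print(stats["c1"]["mean"], stats["c1"]["count"])  # 20.0 2
```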

🎯 AutoML Capabilities

DSKit includes comprehensive AutoML features:

  • Automated Preprocessing: Smart data cleaning and feature engineering
  • Model Selection: Automatic algorithm comparison and selection
  • Hyperparameter Optimization: Grid, Random, Bayesian, and Optuna-based tuning
  • Feature Selection: Univariate, RFE, and embedded methods
  • Ensemble Methods: Voting classifiers and advanced ensembles
  • Performance Optimization: Cross-validation and learning curve analysis

📊 Supported Algorithms

Classification & Regression

  • Traditional: Random Forest, Gradient Boosting, SVM, KNN, Naive Bayes
  • Advanced: XGBoost, LightGBM, CatBoost, Neural Networks
  • Ensemble: Voting Classifiers, Stacking, Bagging
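
Voting ensembles combine the outputs of several models. Hard voting takes the majority label per sample, while soft voting averages predicted probabilities. The sketch below shows both on hand-written predictions; the function names and numbers are illustrative assumptions, not DSKit output.

```python
from collections import Counter


def hard_vote(predictions_per_model):
    """Majority label per sample; ties go to the label seen first."""
    return [
        Counter(sample_preds).most_common(1)[0][0]
        for sample_preds in zip(*predictions_per_model)
    ]


def soft_vote(probabilities_per_model):
    """Average each sample's positive-class probability across models."""
    return [sum(probs) / len(probs) for probs in zip(*probabilities_per_model)]


# One inner list per model, one entry per sample.
labels = [
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
]
print(hard_vote(labels))  # [1, 0, 1, 1]

probabilities = [
    [0.9, 0.2],
    [0.6, 0.4],
    [0.3, 0.6],
]
averaged = soft_vote(probabilities)
print([round(p, 2) for p in averaged])  # [0.6, 0.4]
```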

Preprocessing

  • Scaling: Standard, MinMax, Robust, Quantile
  • Encoding: Label, One-Hot, Target, Binary
  • Imputation: Mean, Median, Mode, KNN, Iterative
  • Feature Selection: SelectKBest, RFE, RFECV, Embedded
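
The scaling options differ in which statistics they use, which matters when a feature contains outliers. The sketch below implements standard, min-max, and robust scaling with the standard library to show the difference; it is an illustration of the formulas, not the code DSKit runs.

```python
import statistics


def standard_scale(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def min_max_scale(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def robust_scale(values):
    """Subtract the median and divide by the interquartile range."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]


values = [1, 2, 3, 4, 100]  # 100 is an outlier
print([round(v, 3) for v in min_max_scale(values)])
print(robust_scale(values))  # [-1.0, -0.5, 0.0, 0.5, 48.5]
```

With min-max scaling the outlier squeezes the other four values close to 0, whereas robust scaling keeps them evenly spaced because the median and interquartile range ignore the extreme value.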

🔧 Configuration

DSKit supports flexible configuration:

# Global configuration
from dskit.config import set_config
set_config({
    'visualization_backend': 'plotly',  # or 'matplotlib'
    'auto_save_plots': True,
    'default_test_size': 0.2,
    'random_state': 42,
    'n_jobs': -1
})

# Method-specific parameters
kit.auto_tune(method="optuna", max_evals=100, timeout=3600)
kit.comprehensive_eda(sample_size=5000, include_correlations=True)

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

Development Setup

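# Replace your-username with your GitHub account
git clone https://github.com/your-username/dskit.git
cd dskit
# Quote the extras so shells such as zsh do not expand the brackets
pip install -e ".[dev,full]"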
pre-commit install

Running Tests

pytest tests/ --cov=dskit --cov-report=html

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


🙏 Acknowledgments

  • Built on top of excellent libraries: pandas, scikit-learn, matplotlib, seaborn, plotly
  • Inspired by the need for simplified data science workflows
  • Community-driven development with contributions from data scientists worldwide

DSKit - Making Data Science Simple, Comprehensive, and Accessible! 🚀
