Classifier Toolkit

This is a new project.

Table of Contents

Installation
Usage
Modules Overview
Development & CI/CD
Future Work

Installation

This library is published on PyPI. To install it, run pip install classifier-toolkit.

Usage

This library automates binary classification tasks in the finance domain, specifically for default and fraud labeling. It includes several packages designed to address the main steps in any machine learning/data science task:

EDA: accessible via EDA_Toolkit. Provides EDA and feature engineering functionality with all necessary visualizations.
Feature Reduction: filter-style pre-selection pipeline (expert rules, low variance, drift, predictive power, counter-intuitive direction, high correlation).
Feature Selection: wrapper and embedded methods (RFE, Boruta, Sequential, Bayesian, ElasticNet, MetaSelector).
Model fitting and hyperparameter tuning: To be implemented.
Evaluation and reporting: To be implemented.

Package architecture diagrams will be added here in the future. For now, please consult the docstrings of the specific methods in the relevant modules.

Note: This library does not include data wrangling steps (although it does include feature engineering). Data wrangling is an intermediate step between EDA and feature engineering, where users should fix any data quality issues. Conducting the EDA first is therefore crucial for catching such issues before moving on to feature engineering and the subsequent steps.

Modules Overview

EDA Toolkit: This module includes classes and methods for performing comprehensive exploratory data analysis. It provides automated warnings for data quality issues, univariate and bivariate analysis, and various data visualizations to help understand the dataset.
Univariate Analysis: This class focuses on the analysis of individual variables. It includes methods for calculating statistical measures, visualizing distributions, and assessing the relationship between each variable and the target through techniques like Cramér's V and Information Value. This helps in understanding the significance and distribution of each feature independently.
Bivariate Analysis: This class deals with the analysis of pairs of variables to understand their relationship. It includes functionality for generating correlation heatmaps, performing ANOVA tests between numerical and categorical variables, and computing pairwise Cramér's V for categorical features. This helps identify patterns and correlations between pairs of variables, which is crucial for feature selection and engineering.
Feature Engineering: This module assists in transforming features, handling missing values, encoding categorical variables, and more. It aims to enhance the dataset's quality for better model performance.
Visualizations: This module offers a wide range of plotting capabilities to visually analyze data distributions, relationships, and other crucial aspects of the dataset.
Automated Warnings: A utility to automatically check the dataset for common issues such as missing or duplicate values, outliers, and more, providing warnings to guide data cleaning efforts.
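As an illustration of the Cramér's V statistic mentioned above, here is a minimal sketch in plain Python. It is not the library's implementation: the function name cramers_v is hypothetical, and the sketch uses the uncorrected formula V = sqrt(chi2 / (n * (min(rows, cols) - 1))) without any bias correction.

```python
from collections import Counter
from math import sqrt


def cramers_v(x, y):
    """Uncorrected Cramér's V between two equal-length sequences of labels.

    Hypothetical helper for illustration; not part of classifier_toolkit.
    """
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    joint = Counter(zip(x, y))  # observed count of each (x, y) cell
    row = Counter(x)            # marginal counts of x
    col = Counter(y)            # marginal counts of y
    k = min(len(row), len(col)) - 1
    if k == 0:
        return 0.0  # a constant variable carries no association
    chi2 = 0.0
    for r, r_count in row.items():
        for c, c_count in col.items():
            expected = r_count * c_count / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    return sqrt(chi2 / (n * k))


segment = ["a", "a", "b", "b"]
perfectly_aligned = [1, 1, 0, 0]
unrelated = [1, 0, 1, 0]

v_aligned = cramers_v(segment, perfectly_aligned)  # 1.0
v_unrelated = cramers_v(segment, unrelated)        # 0.0
```

Values near 0 indicate little association between the two categorical variables, and values near 1 indicate that one nearly determines the other.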

Feature Reduction: Filter-style pipeline that reduces the feature set before model-based selection. Six sequential steps, each independently configurable:

Expert Rules — drop features by hand-coded list.
Low Variance — drop near-constant numerical and categorical features.
Drift — drop features whose distribution has shifted (PSI, KS, JS and more).
Predictive Power — drop features with weak univariate Gini / PR-AUC.
Counter-Intuitive Direction — drop features violating a business prior.
High Correlation — drop pairwise-redundant features (Pearson/Spearman/Kendall; Cramér's V).
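To make the drift step listed above more concrete, the following is a minimal sketch of the Population Stability Index (PSI) in plain Python. It is an illustration under stated assumptions, not the library's code: the function name psi, the equal-width binning on the reference range, and the eps floor for empty bins are all choices made for this example (many implementations bin by quantiles instead).

```python
from math import log


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index of `actual` relative to `expected`.

    Hypothetical helper for illustration; not part of classifier_toolkit.
    Bins are equal-width over the range of the reference sample.
    """
    lo, hi = min(expected), max(expected)
    if hi == lo:
        raise ValueError("reference sample is constant")
    width = (hi - lo) / n_bins

    def shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp out-of-range values into edge bins
            counts[idx] += 1
        # Floor each share at eps so the logarithm is always defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * log(ai / ei) for ei, ai in zip(e, a))


reference = [float(v) for v in range(100)]
shifted = [v + 30.0 for v in reference]

psi_same = psi(reference, reference)   # 0.0: identical distributions
psi_shifted = psi(reference, shifted)  # large: several bins empty out
```

A commonly quoted rule of thumb treats PSI above roughly 0.25 as a significant shift, but the appropriate cut-off depends on the use case.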

The FeatureReducer orchestrates all six steps in one sklearn-compatible fit / transform call.

from classifier_toolkit.feature_reduction import FeatureReducer

reducer = FeatureReducer(gini_threshold=0.01, numeric_correlation_threshold=0.85)
reducer.fit(X_train, y=y_train, X_val=X_val)
X_filtered = reducer.transform(X_train)

reducer.summary()                          # print pipeline summary
reducer.export_summary("summary.xlsx", format="excel")
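The fit / transform contract used above can be illustrated with a toy low-variance step. This is a sketch only: the class LowVarianceFilter, its threshold parameter, and the dict-of-lists input are hypothetical and are not taken from the library. The point it shows is that the decision about which columns to keep is learned once in fit and then applied unchanged to any later dataset in transform.

```python
from statistics import pvariance


class LowVarianceFilter:
    """Hypothetical filter step; not part of classifier_toolkit."""

    def __init__(self, threshold=0.0):
        self.threshold = threshold

    def fit(self, X, y=None):
        # X maps column name -> list of numeric values.
        # Keep only columns whose population variance exceeds the threshold.
        self.kept_ = [
            name for name, values in X.items()
            if pvariance(values) > self.threshold
        ]
        return self  # returning self allows fit(...).transform(...) chaining

    def transform(self, X):
        # Apply the decision learned in fit without re-examining the data.
        return {name: X[name] for name in self.kept_}


X_train_toy = {"income": [1.0, 2.0, 3.0], "flag": [1.0, 1.0, 1.0]}
X_new_toy = {"income": [5.0, 5.0], "flag": [0.0, 1.0]}

step = LowVarianceFilter().fit(X_train_toy)
filtered_new = step.transform(X_new_toy)  # only "income" survives
```

Note that "flag" is dropped from the new data even though it varies there, because the selection was fixed on the training data.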

Standalone helpers are also available for ad-hoc analysis outside the pipeline:

from classifier_toolkit.feature_reduction import (
    calculate_feature_predictive_metrics,  # Gini + PR-AUC for every feature
    CorrelationAnalyser,                   # inspect all correlated pairs
    DriftAnalyser,                         # inspect drift stats per feature
)
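For readers unfamiliar with the Gini metric referenced in the comment above, here is a minimal sketch of how a univariate Gini can be derived from ROC AUC as Gini = 2 * AUC - 1. The function name gini is hypothetical and this is not how calculate_feature_predictive_metrics is implemented; the pairwise comparison is quadratic in the sample size and is used only because it keeps the definition visible.

```python
def gini(y_true, scores):
    """Gini coefficient (2 * ROC AUC - 1) of a feature against a binary target.

    Hypothetical helper for illustration; not part of classifier_toolkit.
    """
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # AUC is the probability that a random positive outranks a random
    # negative, with ties counted as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))
    return 2 * auc - 1


gini_perfect = gini([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])   # 1.0
gini_constant = gini([0, 1, 0, 1], [1.0, 1.0, 1.0, 1.0])  # 0.0
gini_inverted = gini([1, 1, 0, 0], [0.1, 0.2, 0.8, 0.9])  # -1.0
```

A negative value means the feature ranks the classes in the opposite direction, so a filter built on this idea would plausibly compare the absolute value against its threshold; that detail is an assumption, not something the source states.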

Logs are silent by default. To enable them:

import logging

logging.basicConfig(level=logging.INFO)
logging.getLogger("classifier_toolkit.feature_reduction").setLevel(logging.INFO)

Feature Selection: This module provides various feature selection techniques:
- Embedded Methods: Includes ElasticNet for regularization-based feature selection.
- Wrapper Methods:
  - Recursive Feature Elimination (RFE) with support for various ensemble methods (Random Forest, XGBoost, LightGBM, CatBoost).
  - Sequential Feature Selection (forward, backward, floating, and bidirectional).
- Meta Selector: Combines multiple feature selection methods to provide a robust selection.
- Utility Functions: Includes scoring functions and plotting utilities for feature importance visualization.
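The forward variant of Sequential Feature Selection listed above can be sketched as a greedy loop. This is a generic illustration of the technique, not the library's selector: forward_select, score_fn, and the toy scoring function are hypothetical, and a real wrapper method would typically score each candidate subset with a cross-validated model metric.

```python
def forward_select(features, score_fn, max_features):
    """Greedy forward selection: add the feature that most improves score_fn.

    Hypothetical helper for illustration; not part of classifier_toolkit.
    """
    selected = []
    best_score = float("-inf")
    remaining = list(features)
    while remaining and len(selected) < max_features:
        # Score every subset formed by adding one remaining feature.
        scored = [(score_fn(selected + [f]), f) for f in remaining]
        score, feature = max(scored, key=lambda pair: pair[0])
        if score <= best_score:
            break  # no candidate improves the score, so stop early
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score


# Toy scoring function standing in for a cross-validated model score.
GAINS = {"age": 0.5, "income": 0.25, "income_copy": 0.25, "noise": 0.0}


def toy_score(subset):
    score = sum(GAINS[f] for f in subset)
    if "income" in subset and "income_copy" in subset:
        score -= 0.25  # the duplicate column adds nothing new
    return score


chosen, chosen_score = forward_select(list(GAINS), toy_score, max_features=4)
# chosen == ["age", "income"], chosen_score == 0.75
```

The loop stops after two features because neither the redundant copy nor the noise column raises the score any further.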

Development & CI/CD

This project uses modern tooling for fast and efficient development workflows:

Dependency Management

UV: Lightning-fast Python package installer and resolver (replacing Poetry)
Install UV: curl -LsSf https://astral.sh/uv/install.sh | sh
Install dependencies: uv sync --group dev --group lint --group test

CI/CD Pipeline

Our CI/CD pipeline is optimized for speed and efficiency:

Parallel Test Execution: Tests are split into two groups (eda and feature_selection) that run simultaneously, reducing test time by ~50%
Shared Caching: Both parallel jobs share the same dependency cache (~1.7GB), avoiding duplicate downloads
Smart Test Reruns: Failed tests run first (pytest --lf --ff) for faster feedback on fixes
Master Protection: Build tests run only on the master branch and on PRs targeting master, saving CI resources on feature branches
Automatic Linting: Code quality checks (Ruff, SQLFluff) run on every push

Pipeline Jobs:

dependencies - Installs and caches project dependencies
lint - Runs code quality checks (Ruff, SQLFluff)
test - Executes tests in parallel with shared cache
build_test - Builds and validates package (master/PRs only)

Local Development:

# Run linting
uv run make lint

# Run tests
uv run make test

# Build package
uv build

Future Work

The next planned improvements and additions to the library include:

Adding model fitting and hyperparameter tuning functionalities.
Developing comprehensive evaluation and reporting tools to assist with model assessment.
Expanding documentation to include architecture diagrams and detailed usage examples.

Project details

Release history

0.2.3 (this version): Jun 30, 2026
0.2.2: Aug 13, 2025
0.2.1 (yanked, reason: outdated): Jan 6, 2025
0.2.0 (yanked, reason: outdated): Nov 13, 2024
0.1.4 (yanked, reason: outdated): Sep 19, 2024
0.1.0 (yanked, reason: outdated): Sep 13, 2024

Download files

Source Distribution

classifier_toolkit-0.2.3.tar.gz
Upload date: Jun 30, 2026
Size: 363.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.25 (uv publish, CI, Ubuntu 22.04)

Hashes for classifier_toolkit-0.2.3.tar.gz
Algorithm	Hash digest
SHA256	`cb3153b83224de11d5280a202e6df0e5c3ebb7dec6e8350adfde80db7ea8a7bf`
MD5	`811bc6c1976a47b416630f9d73c4e46a`
BLAKE2b-256	`d09f7be158c4fcaed13d4362c47b669932ed4f001c58be32735dcb51f6ab91bd`

Built Distribution

classifier_toolkit-0.2.3-py3-none-any.whl
Upload date: Jun 30, 2026
Size: 138.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.25 (uv publish, CI, Ubuntu 22.04)

Hashes for classifier_toolkit-0.2.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ce2e3e978a6a7a091b44231a3562d4a9cbdd3ba06f1167885003432e81644f9d`
MD5	`9856f6dac380a898918e2d58b397648d`
BLAKE2b-256	`744f9cebbb0cac81d6c02d0d1c2f0680733ea4526cef2c7a816d81e1f16109dc`