Skip to main content

A comprehensive binning and discretization library for machine learning

Project description

PyPI Version Python Versions Build Status Code Coverage License Documentation Status Monthly Downloads GitHub Stars Code Style - Ruff Type Checking - MyPy

A modern, type-safe Python library for data binning and discretization with comprehensive error handling, sklearn compatibility, and DataFrame support.

🚀 Key Features

Multiple Binning Methods
  • EqualWidthBinning - Equal-width intervals across data range

  • EqualFrequencyBinning - Equal-frequency (quantile-based) bins

  • KMeansBinning - K-means clustering-based discretization

  • GaussianMixtureBinning - Gaussian mixture model clustering-based binning

  • DBSCANBinning - Density-based clustering for natural groupings

  • EqualWidthMinimumWeightBinning - Weight-constrained equal-width binning

  • TreeBinning - Decision tree-based supervised binning for classification and regression

  • Chi2Binning - Chi-square statistic-based supervised binning for optimal class separation

  • IsotonicBinning - Isotonic regression-based supervised binning for monotonic relationships

  • ManualIntervalBinning - Custom interval boundary specification

  • ManualFlexibleBinning - Mixed interval and singleton bin definitions

  • SingletonBinning - Creates one bin per unique numeric value

🔧 Framework Integration
  • Pandas DataFrames - Native support with column name preservation

  • Polars DataFrames - High-performance columnar data support (optional)

  • NumPy Arrays - Efficient numerical array processing

  • Scikit-learn Pipelines - Full transformer compatibility

Modern Code Quality
  • Type Safety - 100% mypy compliance with comprehensive type annotations

  • Code Quality - 100% ruff compliance with modern Python syntax

  • Error Handling - Comprehensive validation with helpful error messages and suggestions

  • Test Coverage - 100% code coverage with 841 comprehensive tests

  • Documentation - Extensive examples and API documentation

📦 Installation

pip install binlearn

🔥 Quick Start

import numpy as np
import pandas as pd
from binlearn import EqualWidthBinning, TreeBinning, SingletonBinning, Chi2Binning

# Create sample data
data = pd.DataFrame({
    'age': np.random.normal(35, 10, 1000),
    'income': np.random.lognormal(10, 0.5, 1000),
    'score': np.random.uniform(0, 100, 1000)
})

# Equal-width binning with DataFrame preservation
binner = EqualWidthBinning(n_bins=5, preserve_dataframe=True)
data_binned = binner.fit_transform(data)

print(f"Original shape: {data.shape}")
print(f"Binned shape: {data_binned.shape}")
print(f"Bin edges for age: {binner.bin_edges_['age']}")

# SingletonBinning for numeric discrete values
numeric_discrete_data = pd.DataFrame({
    'category_id': [1, 2, 1, 3, 2, 1],
    'rating': [1, 2, 1, 3, 2, 1]
})

singleton_binner = SingletonBinning(preserve_dataframe=True)
numeric_binned = singleton_binner.fit_transform(numeric_discrete_data)
print(f"Numeric discrete binning: {numeric_binned.shape}")

🎯 Supervised Binning Example

from binlearn import TreeBinning
import numpy as np
from sklearn.datasets import make_classification

# Create classification dataset
X, y = make_classification(n_samples=1000, n_features=4, n_classes=2, random_state=42)

# Method 1: Using guidance_columns (binlearn style)
# Combine features and target into single dataset
X_with_target = np.column_stack([X, y])

sup_binner1 = TreeBinning(
    guidance_columns=[4],  # Use the target column to guide binning
    task_type='classification',
    tree_params={'max_depth': 3, 'min_samples_leaf': 20}
)
X_binned1 = sup_binner1.fit_transform(X_with_target)

# Method 2: Using X and y parameters (sklearn style)
# Pass features and target separately like sklearn
sup_binner2 = TreeBinning(
    task_type='classification',
    tree_params={'max_depth': 3, 'min_samples_leaf': 20}
)
sup_binner2.fit(X, y)  # y is automatically used as guidance
X_binned2 = sup_binner2.transform(X)

print(f"Method 1 - Input shape: {X_with_target.shape}, Output shape: {X_binned1.shape}")
print(f"Method 2 - Input shape: {X.shape}, Output shape: {X_binned2.shape}")
print(f"Both methods create same bins: {np.array_equal(X_binned1, X_binned2)}")

🛠️ Scikit-learn Integration

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from binlearn import EqualFrequencyBinning

# Use the same classification dataset from previous example
X, y = make_classification(n_samples=1000, n_features=4, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create ML pipeline with binning preprocessing
pipeline = Pipeline([
    ('binning', EqualFrequencyBinning(n_bins=5)),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Train and evaluate
pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)
print(f"Pipeline accuracy: {accuracy:.3f}")

📚 Available Methods

Interval-based Methods (Unsupervised):

  • EqualWidthBinning - Creates bins of equal width across the data range

  • EqualFrequencyBinning - Creates bins with approximately equal number of samples

  • KMeansBinning - Uses K-means clustering to determine bin boundaries

  • GaussianMixtureBinning - Uses Gaussian mixture models for probabilistic clustering

  • DBSCANBinning - Uses density-based clustering for natural groupings

  • EqualWidthMinimumWeightBinning - Equal-width bins with weight constraints

  • ManualIntervalBinning - Specify custom interval boundaries

Supervised Methods:

  • TreeBinning - Decision tree-based binning optimized for target variables (classification and regression)

  • Chi2Binning - Chi-square statistic-based binning for optimal feature-target association

  • IsotonicBinning - Isotonic regression-based binning for monotonic relationships

Flexible Methods:

  • ManualFlexibleBinning - Define mixed interval and singleton bins

  • SingletonBinning - Creates one bin per unique numeric value

⚙️ Requirements

Python Versions: 3.10, 3.11, 3.12, 3.13

Core Dependencies:
  • NumPy >= 1.21.0

  • SciPy >= 1.7.0

  • Scikit-learn >= 1.0.0

  • kmeans1d >= 0.3.0

Optional Dependencies:
  • Pandas >= 1.3.0 (for DataFrame support)

  • Polars >= 0.15.0 (for Polars DataFrame support)

Development Dependencies:
  • pytest >= 6.0 (for testing)

  • ruff >= 0.1.0 (for linting and formatting)

  • mypy >= 1.0.0 (for type checking)

🧪 Development Setup

# Clone repository
git clone https://github.com/TheDAALab/binlearn.git
cd binlearn

# Install in development mode with all dependencies
pip install -e ".[tests,dev,pandas,polars]"

# Run all tests
pytest

# Run code quality checks
ruff check binlearn/
mypy binlearn/ --ignore-missing-imports

# Build documentation
cd docs && make html

🏆 Code Quality Standards

  • 100% Test Coverage - Comprehensive test suite with 841 tests

  • 100% Type Safety - Complete mypy compliance with modern type annotations

  • 100% Code Quality - Full ruff compliance with modern Python standards

  • Comprehensive Documentation - Detailed API docs and examples

  • Modern Python - Uses latest Python features and best practices

  • Robust Error Handling - Helpful error messages with actionable suggestions

🤝 Contributing

We welcome contributions! Here’s how to get started:

  1. Fork the repository on GitHub

  2. Create a feature branch: git checkout -b feature/your-feature

  3. Make your changes and add tests

  4. Ensure all quality checks pass:

    pytest                                    # Run tests
    ruff check binlearn/                      # Check code quality
    mypy binlearn/ --ignore-missing-imports   # Check types
  5. Submit a pull request

Areas for Contribution:
  • 🐛 Bug reports and fixes

  • ✨ New binning algorithms

  • 📚 Documentation improvements

  • 🧪 Additional test cases

  • 🎯 Performance optimizations

📄 License

This project is licensed under the MIT License. See the LICENSE file for details.

Developed by TheDAALab

A modern, type-safe binning framework for Python data science workflows.

Powered by Python Built with NumPy Compatible with Pandas Integrates with Scikit-learn Development Status Contributors

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

binlearn-1.0.1.tar.gz (6.8 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

binlearn-1.0.1-py3-none-any.whl (135.9 kB view details)

Uploaded Python 3

File details

Details for the file binlearn-1.0.1.tar.gz.

File metadata

  • Download URL: binlearn-1.0.1.tar.gz
  • Upload date:
  • Size: 6.8 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for binlearn-1.0.1.tar.gz
Algorithm Hash digest
SHA256 e567b9246ab6d4d0bd034d68922ba7ef570ddd8de112edb5481c23462bf306a6
MD5 4b3b64d7ec1b7456b19602a97fe775a7
BLAKE2b-256 aa215f33d8efbc91f92d45df95980cc862593971d30174d65fdcd63c110fdacb

See more details on using hashes here.

Provenance

The following attestation bundles were made for binlearn-1.0.1.tar.gz:

Publisher: release.yml on TheDAALab/binlearn

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file binlearn-1.0.1-py3-none-any.whl.

File metadata

  • Download URL: binlearn-1.0.1-py3-none-any.whl
  • Upload date:
  • Size: 135.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for binlearn-1.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 8ed94b73f2bb34a54f7a7932134d080367013ee1048a71d05edb15a850458a04
MD5 8773c9d41cf4167fc6e6e9bbefa96a2e
BLAKE2b-256 02e8381631160bbb763f015ff879deac08fae2f1620b1269bda383526fc0b318

See more details on using hashes here.

Provenance

The following attestation bundles were made for binlearn-1.0.1-py3-none-any.whl:

Publisher: release.yml on TheDAALab/binlearn

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page