A comprehensive binning and discretization library for machine learning
A modern, type-safe Python library for data binning and discretization with comprehensive error handling, sklearn compatibility, and DataFrame support.
🚀 Key Features
- ✨ Multiple Binning Methods
EqualWidthBinning - Equal-width intervals across data range
EqualFrequencyBinning - Equal-frequency (quantile-based) bins
KMeansBinning - K-means clustering-based discretization
GaussianMixtureBinning - Gaussian mixture model clustering-based binning
DBSCANBinning - Density-based clustering for natural groupings
EqualWidthMinimumWeightBinning - Weight-constrained equal-width binning
TreeBinning - Decision tree-based supervised binning for classification and regression
Chi2Binning - Chi-square statistic-based supervised binning for optimal class separation
IsotonicBinning - Isotonic regression-based supervised binning for monotonic relationships
ManualIntervalBinning - Custom interval boundary specification
ManualFlexibleBinning - Mixed interval and singleton bin definitions
SingletonBinning - Creates one bin per unique numeric value
- 🔧 Framework Integration
Pandas DataFrames - Native support with column name preservation
Polars DataFrames - High-performance columnar data support (optional)
NumPy Arrays - Efficient numerical array processing
Scikit-learn Pipelines - Full transformer compatibility
- ⚡ Modern Code Quality
Type Safety - 100% mypy compliance with comprehensive type annotations
Code Quality - 100% ruff compliance with modern Python syntax
Error Handling - Comprehensive validation with helpful error messages and suggestions
Test Coverage - 100% code coverage with 841 comprehensive tests
Documentation - Extensive examples and API documentation
📦 Installation
pip install binlearn
🔥 Quick Start
import numpy as np
import pandas as pd
from binlearn import EqualWidthBinning, TreeBinning, SingletonBinning, Chi2Binning
# Create sample data
data = pd.DataFrame({
    'age': np.random.normal(35, 10, 1000),
    'income': np.random.lognormal(10, 0.5, 1000),
    'score': np.random.uniform(0, 100, 1000),
})
# Equal-width binning with DataFrame preservation
binner = EqualWidthBinning(n_bins=5, preserve_dataframe=True)
data_binned = binner.fit_transform(data)
print(f"Original shape: {data.shape}")
print(f"Binned shape: {data_binned.shape}")
print(f"Bin edges for age: {binner.bin_edges_['age']}")
# SingletonBinning for numeric discrete values
numeric_discrete_data = pd.DataFrame({
    'category_id': [1, 2, 1, 3, 2, 1],
    'rating': [1, 2, 1, 3, 2, 1],
})
singleton_binner = SingletonBinning(preserve_dataframe=True)
numeric_binned = singleton_binner.fit_transform(numeric_discrete_data)
print(f"Numeric discrete binning: {numeric_binned.shape}")
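The idea behind one bin per unique value can be sketched with plain NumPy. This is only an illustration of the concept, not binlearn's implementation; the helper name `singleton_bins` is hypothetical, and the output format of `SingletonBinning` itself may differ.

```python
import numpy as np


def singleton_bins(values):
    """Return the sorted unique values and, per sample, the index of its value.

    Hypothetical helper for illustration; not part of binlearn.
    """
    uniques, codes = np.unique(np.asarray(values), return_inverse=True)
    return uniques, codes


uniques, codes = singleton_bins([1, 2, 1, 3, 2, 1])
print(uniques)  # the distinct values, sorted
print(codes)    # one bin index per input sample
```

Each distinct value gets its own bin, so the number of bins equals the number of unique values seen during fitting.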
🎯 Supervised Binning Example
from binlearn import TreeBinning
import numpy as np
from sklearn.datasets import make_classification
# Create classification dataset
X, y = make_classification(n_samples=1000, n_features=4, n_classes=2, random_state=42)
# Method 1: Using guidance_columns (binlearn style)
# Combine features and target into single dataset
X_with_target = np.column_stack([X, y])
sup_binner1 = TreeBinning(
    guidance_columns=[4],  # Use the target column to guide binning
    task_type='classification',
    tree_params={'max_depth': 3, 'min_samples_leaf': 20},
)
X_binned1 = sup_binner1.fit_transform(X_with_target)
# Method 2: Using X and y parameters (sklearn style)
# Pass features and target separately like sklearn
sup_binner2 = TreeBinning(
    task_type='classification',
    tree_params={'max_depth': 3, 'min_samples_leaf': 20},
)
sup_binner2.fit(X, y) # y is automatically used as guidance
X_binned2 = sup_binner2.transform(X)
print(f"Method 1 - Input shape: {X_with_target.shape}, Output shape: {X_binned1.shape}")
print(f"Method 2 - Input shape: {X.shape}, Output shape: {X_binned2.shape}")
print(f"Both methods create same bins: {np.array_equal(X_binned1, X_binned2)}")
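To give a feel for what decision tree-based binning means, the sketch below fits a shallow scikit-learn tree on a single feature and reuses its split thresholds as bin edges. It is a rough illustration under that assumption and may not match how `TreeBinning` computes its bins internally; the function name `tree_bin_edges` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def tree_bin_edges(x, y, max_depth=3, min_samples_leaf=20):
    """Derive bin edges for one feature from decision-tree split thresholds.

    Hypothetical helper for illustration; not part of binlearn.
    """
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    tree = DecisionTreeClassifier(
        max_depth=max_depth, min_samples_leaf=min_samples_leaf, random_state=0
    )
    tree.fit(x, y)
    # Leaf nodes have a negative feature index; internal nodes hold real thresholds.
    is_split = tree.tree_.feature >= 0
    thresholds = np.unique(tree.tree_.threshold[is_split])
    return np.concatenate(([x.min()], thresholds, [x.max()]))


X, y = make_classification(n_samples=1000, n_features=4, n_classes=2, random_state=42)
edges = tree_bin_edges(X[:, 0], y)
# right=True mirrors the tree's "x <= threshold goes left" convention.
bins = np.digitize(X[:, 0], edges[1:-1], right=True)
print(edges)
print(np.bincount(bins))
```

Because the tree chooses thresholds that separate the classes, the resulting bins tend to follow the target rather than the raw distribution of the feature.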
🛠️ Scikit-learn Integration
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from binlearn import EqualFrequencyBinning
# Recreate the classification dataset from the previous example
X, y = make_classification(n_samples=1000, n_features=4, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create ML pipeline with binning preprocessing
pipeline = Pipeline([
    ('binning', EqualFrequencyBinning(n_bins=5)),
    ('classifier', RandomForestClassifier(random_state=42)),
])
# Train and evaluate
pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)
print(f"Pipeline accuracy: {accuracy:.3f}")
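For readers wondering what pipeline compatibility requires, the sketch below shows the minimal scikit-learn transformer contract (`fit` returning `self`, plus `transform`) using a toy equal-width binner. The class `ToyEqualWidthBinner` and its `edges_` attribute are hypothetical and exist only for this example; binlearn's own classes are more complete.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class ToyEqualWidthBinner(TransformerMixin, BaseEstimator):
    """Hypothetical minimal transformer; not binlearn's implementation."""

    def __init__(self, n_bins=5):
        self.n_bins = n_bins

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # One array of n_bins + 1 equally spaced edges per column.
        self.edges_ = [
            np.linspace(col.min(), col.max(), self.n_bins + 1) for col in X.T
        ]
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # Digitizing against interior edges yields indices 0 .. n_bins - 1.
        cols = [
            np.digitize(col, edges[1:-1]) for col, edges in zip(X.T, self.edges_)
        ]
        return np.column_stack(cols)


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

pipe = Pipeline([
    ("binning", ToyEqualWidthBinner(n_bins=4)),
    ("clf", LogisticRegression()),
])
pipe.fit(X_demo, y_demo)
binned = pipe.named_steps["binning"].transform(X_demo)
print(binned.shape, binned.min(), binned.max())
```

Keeping constructor arguments as plain attributes is what lets scikit-learn clone the step and tune `n_bins` in a grid search.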
📚 Available Methods
Interval-based Methods (Unsupervised):
EqualWidthBinning - Creates bins of equal width across the data range
EqualFrequencyBinning - Creates bins with approximately equal number of samples
KMeansBinning - Uses K-means clustering to determine bin boundaries
GaussianMixtureBinning - Uses Gaussian mixture models for probabilistic clustering
DBSCANBinning - Uses density-based clustering for natural groupings
EqualWidthMinimumWeightBinning - Equal-width bins with weight constraints
ManualIntervalBinning - Specify custom interval boundaries
Supervised Methods:
TreeBinning - Decision tree-based binning optimized for target variables (classification and regression)
Chi2Binning - Chi-square statistic-based binning for optimal feature-target association
IsotonicBinning - Isotonic regression-based binning for monotonic relationships
Flexible Methods:
ManualFlexibleBinning - Define mixed interval and singleton bins
SingletonBinning - Creates one bin per unique numeric value
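The difference between equal-width and equal-frequency binning is easiest to see on skewed data. The NumPy-only sketch below computes both sets of edges by hand; it illustrates the two definitions and is not binlearn's code, and the variable names are chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.lognormal(mean=10, sigma=0.5, size=1000)  # right-skewed sample
n_bins = 5

# Equal width: edges evenly spaced between the minimum and maximum.
width_edges = np.linspace(x.min(), x.max(), n_bins + 1)
# Equal frequency: edges placed at evenly spaced quantiles.
freq_edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))

# Count samples per bin by digitizing against the interior edges.
width_counts = np.bincount(np.digitize(x, width_edges[1:-1]), minlength=n_bins)
freq_counts = np.bincount(np.digitize(x, freq_edges[1:-1]), minlength=n_bins)

print("equal-width counts:    ", width_counts)
print("equal-frequency counts:", freq_counts)
```

On a skewed sample like this one, equal-width bins put most points in the lowest bins, while equal-frequency bins hold roughly the same number of samples each.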
⚙️ Requirements
Python Versions: 3.10, 3.11, 3.12, 3.13
- Core Dependencies:
NumPy >= 1.21.0
SciPy >= 1.7.0
Scikit-learn >= 1.0.0
kmeans1d >= 0.3.0
- Optional Dependencies:
Pandas >= 1.3.0 (for DataFrame support)
Polars >= 0.15.0 (for Polars DataFrame support)
- Development Dependencies:
pytest >= 6.0 (for testing)
ruff >= 0.1.0 (for linting and formatting)
mypy >= 1.0.0 (for type checking)
🧪 Development Setup
# Clone repository
git clone https://github.com/TheDAALab/binlearn.git
cd binlearn
# Install in development mode with all dependencies
pip install -e ".[tests,dev,pandas,polars]"
# Run all tests
pytest
# Run code quality checks
ruff check binlearn/
mypy binlearn/ --ignore-missing-imports
# Build documentation
cd docs && make html
🏆 Code Quality Standards
✅ 100% Test Coverage - Comprehensive test suite with 841 tests
✅ 100% Type Safety - Complete mypy compliance with modern type annotations
✅ 100% Code Quality - Full ruff compliance with modern Python standards
✅ Comprehensive Documentation - Detailed API docs and examples
✅ Modern Python - Uses latest Python features and best practices
✅ Robust Error Handling - Helpful error messages with actionable suggestions
🤝 Contributing
We welcome contributions! Here’s how to get started:
Fork the repository on GitHub
Create a feature branch: git checkout -b feature/your-feature
Make your changes and add tests
Ensure all quality checks pass:
pytest  # Run tests
ruff check binlearn/  # Check code quality
mypy binlearn/ --ignore-missing-imports  # Check types
Submit a pull request
- Areas for Contribution:
🐛 Bug reports and fixes
✨ New binning algorithms
📚 Documentation improvements
🧪 Additional test cases
🎯 Performance optimizations
🔗 Links
GitHub Repository: https://github.com/TheDAALab/binlearn
Issue Tracker: https://github.com/TheDAALab/binlearn/issues
Documentation: https://binlearn.readthedocs.io/
📄 License
This project is licensed under the MIT License. See the LICENSE file for details.
Developed by TheDAALab
A modern, type-safe binning framework for Python data science workflows.