KSFeatureSelector is a Python package that selects the most discriminatory features in binary and multi-class classification problems using the two-sample Kolmogorov-Smirnov (K-S) test. For multi-class targets, it lets you choose the comparison strategy and how the per-comparison p-values are aggregated.
Features
- Uses the K-S test to rank features by their ability to separate classes.
- Handles target variables with more than two categories (up to 10 classes internally).
- Flexible Comparison Strategies:
  - pairwise: Performs K-S tests between every unique pair of classes.
  - one-vs-rest: Compares each class against all other classes combined.
- Multiple P-Value Aggregation Methods:
  - fisher: Uses Fisher’s combined probability test (default, generally recommended).
  - min: Takes the minimum p-value from all comparisons for a feature.
  - max: Takes the maximum p-value from all comparisons for a feature.
- Scikit-learn Style API: Offers a class-based interface (KSFeatureSelector with fit, transform) for integration into machine learning pipelines.
- Convenience Function: Provides a simple select_ks_features wrapper for quick, one-off feature selection.
- Validation & Warnings: Validates inputs and emits a UserWarning for data quality problems, such as categories with too few observations or insufficient samples for K-S tests.
- Pure Python: Built on pandas, scipy, and numpy, with no compiled extensions of its own.
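The two comparison strategies differ only in which pairs of samples are handed to the K-S test. The following is a minimal standard-library sketch of that idea, not the package's actual implementation: the helper names (ks_statistic, split_by_class, pairwise_comparisons, one_vs_rest_comparisons) and the toy data are assumptions made for illustration, and only the K-S statistic is computed, not its p-value.

```python
from bisect import bisect_right
from itertools import combinations


def ks_statistic(a, b):
    """Two-sample K-S statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    # The gap can only change at observed values, so checking those is enough.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def split_by_class(values, labels):
    """Group one feature's values by class label (hypothetical helper)."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    return groups


def pairwise_comparisons(groups):
    """One comparison for every unique pair of classes."""
    return [(groups[c1], groups[c2]) for c1, c2 in combinations(sorted(groups), 2)]


def one_vs_rest_comparisons(groups):
    """One comparison per class: that class against all other classes pooled."""
    comparisons = []
    for c in sorted(groups):
        rest = [v for other, vals in groups.items() if other != c for v in vals]
        comparisons.append((groups[c], rest))
    return comparisons


# Toy feature with three well-separated classes; "b" sits between "a" and "c".
values = [0.1, 0.2, 0.3, 1.1, 1.2, 1.3, 2.1, 2.2, 2.3]
labels = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]
groups = split_by_class(values, labels)

pairwise_stats = [ks_statistic(x, y) for x, y in pairwise_comparisons(groups)]
ovr_stats = [ks_statistic(x, y) for x, y in one_vs_rest_comparisons(groups)]
print(pairwise_stats)  # [1.0, 1.0, 1.0]
print(ovr_stats)       # [1.0, 0.5, 1.0]
```

In this toy data every pair of classes is perfectly separated, yet the one-vs-rest statistic for the middle class drops to 0.5 because the pooled "rest" surrounds it on both sides. In practice, scipy.stats.ks_2samp returns both the statistic and a p-value for each comparison.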
Installation
pip install ksfeatureselector
For a local editable installation, run this from the root of a source checkout:
pip install -e .
Usage
from ksfeatureselector import select_ks_features

# Assumes df (a pandas DataFrame), x_cols (a list of feature column names),
# and y_var (the name of the target column) are already defined.

# Select features at the 0.01 p-value threshold using 'one-vs-rest'
# comparison and 'min' p-value aggregation
significant_features = select_ks_features(
    df, x_cols, y_var,
    top_p=0.01,
    aggregation_method='one-vs-rest',
    p_value_aggregation_method='min'
)
print(f"Significant features (one-vs-rest, min p-value <= 0.01): {significant_features}")

# Select the top 3 features using 'pairwise' comparison
# and 'max' p-value aggregation
top_3_features_max_agg = select_ks_features(
    df, x_cols, y_var,
    top_n=3,
    aggregation_method='pairwise',
    p_value_aggregation_method='max'
)
print(f"Top 3 features (pairwise, max p-value): {top_3_features_max_agg}")
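The p_value_aggregation_method values used above ('min' and 'max'), together with the default 'fisher', each reduce a feature's per-comparison p-values to a single score. The sketch below shows one way these three reductions could be written with the standard library only. It is an illustration under assumptions, not the package's code: the function names are hypothetical, and p-values are assumed to lie in (0, 1].

```python
import math


def fisher_combine(p_values):
    """Fisher's combined probability test for k p-values, each in (0, 1]."""
    k = len(p_values)
    # Half of Fisher's statistic X = -2 * sum(ln p_i). Under the null
    # hypothesis, X follows a chi-square distribution with 2k degrees of freedom.
    half_stat = -sum(math.log(p) for p in p_values)
    # Closed-form chi-square survival function, valid for even degrees of freedom.
    return math.exp(-half_stat) * sum(
        half_stat**i / math.factorial(i) for i in range(k)
    )


def aggregate(p_values, method="fisher"):
    """Reduce per-comparison p-values to one value (hypothetical helper)."""
    if method == "fisher":
        return fisher_combine(p_values)
    if method == "min":
        return min(p_values)
    if method == "max":
        return max(p_values)
    raise ValueError(f"unknown aggregation method: {method!r}")


per_comparison = [0.01, 0.20, 0.03]
print(aggregate(per_comparison, "min"))  # 0.01
print(aggregate(per_comparison, "max"))  # 0.2
print(aggregate(per_comparison, "fisher"))
```

'min' flags a feature if any single comparison separates classes, 'max' requires every comparison to do so, and Fisher's method pools the evidence. Fisher's test assumes independent p-values, which pairwise comparisons that share samples do not strictly satisfy, so the combined value is best read as a ranking score.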
Arguments
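df (pd.DataFrame): The input DataFrame containing the feature columns and the target column.
x_cols (List[str]): A list of column names in df representing the features you want to evaluate.
y_var (str): The name of the column in df representing the target variable (binary or multi-class).
top_p (float, optional): If provided, only features with a K-S test p-value (aggregated across comparisons for multi-class targets) less than top_p will be selected.
top_n (int, optional): If provided, the top n features with the lowest p-values will be selected.
aggregation_method (str, optional): The multi-class comparison strategy, either 'pairwise' or 'one-vs-rest'.
p_value_aggregation_method (str, optional): How p-values from multiple comparisons are combined for each feature: 'fisher' (default), 'min', or 'max'.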
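To make the top_p and top_n semantics concrete, here is a small sketch that applies them to a mapping of feature names to p-values. The helper select_by_p_value and the sample p-values are hypothetical, and the behavior when both arguments are given (threshold first, then truncate) is an assumption; the package may handle that case differently.

```python
def select_by_p_value(p_values, top_p=None, top_n=None):
    """Pick features from a {feature: p_value} mapping (hypothetical helper)."""
    # Rank features from most to least significant (lowest p-value first).
    ranked = sorted(p_values, key=p_values.get)
    if top_p is not None:
        # Keep only features whose p-value is strictly below the threshold.
        ranked = [feature for feature in ranked if p_values[feature] < top_p]
    if top_n is not None:
        # Keep the n features with the lowest p-values.
        ranked = ranked[:top_n]
    return ranked


p_values = {"age": 0.001, "income": 0.2, "height": 0.03}
print(select_by_p_value(p_values, top_p=0.05))  # ['age', 'height']
print(select_by_p_value(p_values, top_n=1))     # ['age']
```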
License
MIT License