Scikit-learn compatible feature selection for binary and multi-class classification using the Kolmogorov-Smirnov (K-S) test, with pairwise / one-vs-rest comparison and Fisher / min / max p-value aggregation.
KSFeatureSelector is a scikit-learn compatible feature selector that ranks features by how well they separate the classes of a binary or multi-class target, using the two-sample Kolmogorov-Smirnov (K-S) test. It subclasses scikit-learn’s SelectorMixin, passes check_estimator, and plugs directly into Pipeline and GridSearchCV.
Features
Ranks features by their K-S test p-value (a lower p-value indicates a more discriminative feature).
Handles binary and multi-class targets (2 to 10 classes).
Two class-comparison strategies for multi-class targets:
pairwise: K-S test between every pair of classes.
one-vs-rest: each class against the rest.
Three p-value aggregation methods: fisher (default), min, max.
Select features by a count (top_n) or a p-value threshold (top_p).
Full scikit-learn API: fit, transform, get_support, get_feature_names_out, inverse_transform.
A select_ks_features convenience function for quick one-off selection.
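The ranking idea described in the list above can be sketched with SciPy alone. This is a minimal illustration, not the package's actual implementation: the function name ks_feature_p_values and its argument names are hypothetical, and clipping p-values before Fisher's method is an assumption made here to avoid taking the logarithm of zero.

```python
from itertools import combinations

import numpy as np
from scipy.stats import combine_pvalues, ks_2samp


def ks_feature_p_values(X, y, strategy="pairwise", aggregation="fisher"):
    """Illustrative sketch: one aggregated K-S p-value per column of X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)

    # Build the pairs of row masks that will be compared with the K-S test.
    if strategy == "pairwise":
        splits = [(y == a, y == b) for a, b in combinations(classes, 2)]
    elif strategy == "one-vs-rest":
        splits = [(y == c, y != c) for c in classes]
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")

    aggregated = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        # One two-sample K-S p-value per class comparison for feature j.
        p = np.array([ks_2samp(X[a, j], X[b, j]).pvalue for a, b in splits])
        if aggregation == "fisher":
            # Assumption: clip so Fisher's statistic never sees log(0).
            clipped = np.clip(p, 1e-300, 1.0)
            aggregated[j] = combine_pvalues(clipped, method="fisher")[1]
        elif aggregation == "min":
            aggregated[j] = p.min()
        elif aggregation == "max":
            aggregated[j] = p.max()
        else:
            raise ValueError(f"unknown aggregation: {aggregation!r}")
    return aggregated


rng = np.random.RandomState(0)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)  # only feature 0 determines the label
p_demo = ks_feature_p_values(X_demo, y_demo)
print(np.argsort(p_demo))  # feature indices, most discriminative first
```

With a binary target the pairwise strategy runs a single test per feature, so all three aggregation methods return that test's p-value unchanged. In this sketch, one-vs-rest on a binary target would run the same comparison twice, which is one reason the two strategies are described for multi-class targets.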
Installation
pip install ksfeatureselector
Usage
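import numpy as np
from ksfeatureselector import KSFeatureSelector

# Synthetic binary problem: only the first two features drive the label.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Keep the two features with the smallest K-S p-values.
selector = KSFeatureSelector(top_n=2).fit(X, y)
X_reduced = selector.transform(X)  # shape (200, 2)
print(selector.get_support())  # boolean mask over the 5 input features
print(selector.get_feature_p_values())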
In a scikit-learn pipeline:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
pipe = Pipeline([
    ("ks", KSFeatureSelector(top_p=0.05)),
    ("clf", LogisticRegression()),
])
pipe.fit(X, y)
Convenience function for DataFrames:
from ksfeatureselector import select_ks_features
# df is assumed to be a pandas DataFrame with columns "f1", "f2", "f3", and "target".
selected = select_ks_features(
    df, x_cols=["f1", "f2", "f3"], y_var="target",
    top_p=0.01,
    aggregation_method="one-vs-rest",
    p_value_aggregation_method="min",
)
Parameters
top_n (int, optional): keep this many top-ranked features.
top_p (float in [0, 1], optional): keep features whose aggregated p-value is at most this value.
aggregation_method ({"pairwise", "one-vs-rest"}): class comparison strategy for multi-class targets.
p_value_aggregation_method ({"fisher", "min", "max"}): how the p-values from the individual class comparisons are combined into a single p-value per feature; defaults to "fisher".
top_n and top_p are mutually exclusive. If neither is set, all features are kept (ranked by p-value).
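The interaction between top_n and top_p can be sketched as a small mask-building function. This is a hedged illustration of the rules stated above, not the package's code: the name selection_mask is hypothetical, and the choice to raise ValueError when both parameters are given, as well as the stable tie-breaking, are assumptions.

```python
import numpy as np


def selection_mask(p_values, top_n=None, top_p=None):
    """Illustrative sketch: boolean mask of kept features from p-values."""
    p_values = np.asarray(p_values, dtype=float)
    if top_n is not None and top_p is not None:
        # Assumption: mutually exclusive options are rejected with ValueError.
        raise ValueError("top_n and top_p are mutually exclusive")

    if top_n is not None:
        # Keep the top_n features with the smallest p-values.
        order = np.argsort(p_values, kind="stable")
        mask = np.zeros(p_values.shape[0], dtype=bool)
        mask[order[:top_n]] = True
        return mask
    if top_p is not None:
        # Keep every feature whose p-value is at most the threshold.
        return p_values <= top_p
    # Neither option set: keep all features.
    return np.ones(p_values.shape[0], dtype=bool)


p_example = [0.2, 0.001, 0.04, 0.6]
print(selection_mask(p_example, top_n=2))
print(selection_mask(p_example, top_p=0.05))
```

Note that top_n always keeps a fixed number of features, while top_p may keep none at all if no p-value falls at or below the threshold.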
License
BSD 3-Clause License.