A Python library for feature selection in tabular datasets
dataclr: The feature selection library
dataclr is a Python library for feature selection, enabling data scientists and ML engineers to identify optimal features from tabular datasets. By combining filter and wrapper methods, it achieves state-of-the-art results, enhancing model performance and simplifying feature engineering.
Features
- Comprehensive Methods:
  - Filter Methods: Statistical and data-driven approaches like ANOVA, MutualInformation, and VarianceThreshold.

    | Method | Regression | Classification |
    |---|---|---|
    | ANOVA | Yes | Yes |
    | Chi2 | No | Yes |
    | CumulativeDistributionFunction | Yes | Yes |
    | CohensD | No | Yes |
    | CramersV | No | Yes |
    | DistanceCorrelation | Yes | Yes |
    | Entropy | Yes | Yes |
    | KendallCorrelation | Yes | Yes |
    | Kurtosis | Yes | Yes |
    | LinearCorrelation | Yes | Yes |
    | MaximalInformationCoefficient | Yes | Yes |
    | MeanAbsoluteDeviation | Yes | Yes |
    | mRMR | Yes | Yes |
    | MutualInformation | Yes | Yes |
    | Skewness | Yes | Yes |
    | SpearmanCorrelation | Yes | Yes |
    | VarianceThreshold | Yes | Yes |
    | VarianceInflationFactor | Yes | Yes |
    | ZScore | Yes | Yes |

  - Wrapper Methods: Model-based iterative methods like BorutaMethod, ShapMethod, and OptunaMethod.

    | Method | Regression | Classification |
    |---|---|---|
    | BorutaMethod | Yes | Yes |
    | HyperoptMethod | Yes | Yes |
    | OptunaMethod | Yes | Yes |
    | ShapMethod | Yes | Yes |

- Flexible and Scalable:
  - Supports both regression and classification tasks.
  - Handles high-dimensional datasets efficiently.
- Interpretable Results:
  - Provides ranked feature lists with detailed importance scores.
  - Shows used methods along with their parameters.
- Seamless Integration:
  - Works with popular Python libraries like pandas and scikit-learn.
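To make the filter/wrapper distinction concrete, here is a minimal sketch that uses only scikit-learn. It does not call dataclr's classes, and dataclr's own implementations may work differently. A filter scores each feature from the data alone, while a wrapper repeatedly fits a model to decide which features to keep. The synthetic dataset and every parameter value below are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic data: 8 features, 3 of them informative (illustrative values).
X_arr, y_arr = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

# Filter: ANOVA F-scores are computed from the data alone, with no model involved.
f_scores, _ = f_classif(X_demo, y_arr)
filter_ranking = pd.Series(f_scores, index=X_demo.columns).sort_values(ascending=False)

# Wrapper: recursive feature elimination refits the model and drops
# the weakest feature each round until 3 remain.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3).fit(X_demo, y_arr)
wrapper_selected = list(X_demo.columns[rfe.support_])

print(filter_ranking.head(3))
print(wrapper_selected)
```

Filters are cheap because they never train a model; wrappers cost more but account for how the model actually uses the features together.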
Installation
Install dataclr using pip:
pip install dataclr
Getting Started
1. Load Your Dataset
Prepare your dataset as pandas DataFrames or Series and preprocess it (e.g., encode categorical features and normalize numerical values):
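import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Example dataset
X = pd.DataFrame({...}) # Replace with your feature matrix
y = pd.Series([...]) # Replace with your target variable
# Preprocessing
X_encoded = pd.get_dummies(X) # Encode categorical features
scaler = StandardScaler()
# Keep the original index so the rows stay aligned with y
X_normalized = pd.DataFrame(scaler.fit_transform(X_encoded), columns=X_encoded.columns, index=X_encoded.index)
# Split into the train and test sets used in the following steps
X_train, X_test, y_train, y_test = train_test_split(X_normalized, y, random_state=42)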
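As a concrete illustration of this step, the sketch below fills the placeholders with a tiny made-up dataset. The column names, values, and split ratio are all invented for the example. It also splits before scaling and fits the scaler on the training rows only, which is one common way to keep test data from influencing the preprocessing.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data, invented for illustration only.
X_toy = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29, 60, 44],
    "income": [30.0, 52.5, 81.0, 90.5, 61.0, 35.5, 99.0, 70.0],
    "city": ["A", "B", "A", "C", "B", "A", "C", "B"],
})
y_toy = pd.Series([0, 0, 1, 1, 0, 0, 1, 1], name="target")

# One-hot encode the categorical column; numeric columns pass through unchanged.
X_toy_encoded = pd.get_dummies(X_toy, columns=["city"], dtype=float)

# Split first so the scaler never sees the test rows.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy_encoded, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

# Fit on the training split, then apply the same transform to both splits.
toy_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = pd.DataFrame(toy_scaler.transform(X_tr), columns=X_tr.columns, index=X_tr.index)
X_te_scaled = pd.DataFrame(toy_scaler.transform(X_te), columns=X_te.columns, index=X_te.index)

print(X_tr_scaled.shape, X_te_scaled.shape)
```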
2. Use FeatureSelector
The FeatureSelector is a high-level API that combines multiple methods to select the best feature subsets:
from sklearn.ensemble import RandomForestClassifier
from dataclr.feature_selection import FeatureSelector
# Define a scikit-learn model
my_model = RandomForestClassifier(n_estimators=100, random_state=42)
# Initialize the FeatureSelector
selector = FeatureSelector(
model=my_model,
metric="accuracy",
X_train=X_train,
X_test=X_test,
y_train=y_train,
y_test=y_test,
)
# Perform feature selection
selected_features = selector.select_features(n_results=5)
print(selected_features)
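One way to sanity-check a selected subset is to refit the model on just those columns and compare the score with the all-features baseline. The sketch below uses only scikit-learn and synthetic data. `score_columns` is a hypothetical helper, and `chosen` is a placeholder for column names you would take from the selector's output; the exact structure of that output is not assumed here.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def score_columns(model, columns, X_train, X_test, y_train, y_test):
    """Hypothetical helper: fit a fresh copy of `model` on `columns`, return test accuracy."""
    fitted = clone(model).fit(X_train[columns], y_train)
    return accuracy_score(y_test, fitted.predict(X_test[columns]))


# Synthetic stand-in data (illustrative values only).
X_arr, y_arr = make_classification(
    n_samples=400, n_features=10, n_informative=4, random_state=42
)
X_all = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(10)])
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_all, y_arr, random_state=42
)

demo_model = RandomForestClassifier(n_estimators=100, random_state=42)
chosen = ["f0", "f1", "f2", "f3"]  # placeholder for a subset taken from the selector

splits = (X_train_demo, X_test_demo, y_train_demo, y_test_demo)
baseline_acc = score_columns(demo_model, list(X_all.columns), *splits)
subset_acc = score_columns(demo_model, chosen, *splits)
print(f"all features: {baseline_acc:.3f}, subset: {subset_acc:.3f}")
```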
3. Use Individual Methods
For granular control, you can use individual feature selection methods:
from sklearn.linear_model import LogisticRegression
from dataclr.methods import MutualInformation
# Define a scikit-learn model
my_model = LogisticRegression(solver="liblinear", max_iter=1000)
# Initialize a method
method = MutualInformation(model=my_model, metric="accuracy")
# Fit and transform
results = method.fit_transform(X_train, X_test, y_train, y_test)
print(results)
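For intuition about what a mutual-information filter measures, the following sketch computes per-feature scores with scikit-learn's `mutual_info_classif`. It is a stand-in for illustration, not dataclr's implementation, and the synthetic data and parameters are assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# With shuffle=False the informative features come first (f0 and f1 here).
X_arr, y_arr = make_classification(
    n_samples=500, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=1,
)
X_mi = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

# Estimate the mutual information between each feature and the target.
mi = mutual_info_classif(X_mi, y_arr, random_state=1)
mi_scores = pd.Series(mi, index=X_mi.columns).sort_values(ascending=False)
print(mi_scores)
```

A higher score means the feature carries more information about the target; a score near zero suggests the feature is close to independent of it.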
Benchmarks
Because the algorithm returns multiple candidate results, the benchmarks report results that balance feature count against performance. Results that reach the best performance remain available when performance matters more than a small feature set.
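The trade-off between feature count and performance can be illustrated with a simple rule: among the candidates, take the smallest subset whose score is within a tolerance of the best one. This rule, the `pick_balanced` helper, and the numbers below are hypothetical; they are not necessarily how the benchmark results were chosen.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    n_features: int
    score: float


def pick_balanced(candidates, tolerance=0.01):
    """Return the smallest candidate whose score is within `tolerance` of the best score."""
    best = max(c.score for c in candidates)
    close_enough = [c for c in candidates if c.score >= best - tolerance]
    # Prefer fewer features; break ties by the higher score.
    return min(close_enough, key=lambda c: (c.n_features, -c.score))


# Made-up candidates for illustration.
candidates = [
    Candidate(n_features=40, score=0.912),
    Candidate(n_features=12, score=0.905),
    Candidate(n_features=5, score=0.871),
]

balanced = pick_balanced(candidates)                      # the 12-feature candidate
best_only = pick_balanced(candidates, tolerance=0.0)      # the 40-feature candidate
print(balanced, best_only)
```

Setting the tolerance to zero recovers the best-scoring candidate, while a larger tolerance trades a little performance for a smaller feature set.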
Documentation
Explore the full documentation for detailed usage instructions, API references, and examples.