A Python library for feature selection in tabular datasets

dataclr

dataclr is a Python library for feature selection that helps machine learning engineers and data scientists identify the most useful features in tabular datasets. It combines filter, wrapper, and embedded methods to improve model performance and streamline feature engineering.

Features

  • Comprehensive Methods:

    • Filter Methods: Statistical and data-driven approaches like ANOVA, MutualInformation, and VarianceThreshold.

      Method                          Regression  Classification
      ANOVA                           Yes         Yes
      Chi2                            No          Yes
      CumulativeDistributionFunction  Yes         Yes
      CohensD                         No          Yes
      CramersV                        No          Yes
      DistanceCorrelation             Yes         Yes
      Entropy                         Yes         Yes
      KendallCorrelation              Yes         Yes
      Kurtosis                        Yes         Yes
      LinearCorrelation               Yes         Yes
      MaximalInformationCoefficient   Yes         Yes
      MeanAbsoluteDeviation           Yes         Yes
      mRMR                            Yes         Yes
      MutualInformation               Yes         Yes
      Skewness                        Yes         Yes
      SpearmanCorrelation             Yes         Yes
      VarianceThreshold               Yes         Yes
      VarianceInflationFactor         Yes         Yes
      ZScore                          Yes         Yes
    • Wrapper Methods: Model-based iterative methods like BorutaMethod, ShapMethod, and OptunaMethod.

      Method          Regression  Classification
      BorutaMethod    Yes         Yes
      HyperoptMethod  Yes         Yes
      OptunaMethod    Yes         Yes
      ShapMethod      Yes         Yes
  • Flexible and Scalable:

    • Supports both regression and classification tasks.
    • Handles high-dimensional datasets efficiently.
  • Interpretable Results:

    • Provides ranked feature lists with detailed importance scores.
    • Supports visualization and reporting.
  • Seamless Integration:

    • Works with popular Python libraries like pandas, scikit-learn, and statsmodels.
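
To make the filter/wrapper distinction concrete, here is a minimal sketch that uses scikit-learn directly rather than dataclr's own classes; it is not how dataclr implements these methods. The synthetic data, the column names f0..f5, and the choice of LogisticRegression are all assumptions made for illustration. A filter scores each feature from the data alone, while a wrapper repeatedly fits a model to decide which features to keep.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic classification data (illustrative only)
X_arr, y = make_classification(
    n_samples=200, n_features=6, n_informative=2, n_redundant=0, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

# Filter: score every feature against the target with an ANOVA F-test.
# No model is trained; each feature is judged on its own.
f_scores, _ = f_classif(X, y)
filter_ranking = pd.Series(f_scores, index=X.columns).sort_values(ascending=False)

# Wrapper: greedily add the feature that most improves a model's
# cross-validated score until two features have been chosen.
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction="forward",
    cv=3,
)
sfs.fit(X, y)
wrapper_choice = list(X.columns[sfs.get_support()])

print(filter_ranking)
print(wrapper_choice)
```

Filters are cheap and model-agnostic; wrappers cost more because every candidate subset requires fitting the model, but they account for how features interact inside that model.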

Installation

Install dataclr using pip:

pip install dataclr

Getting Started

1. Load Your Dataset

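Prepare your dataset as pandas DataFrames or Series, preprocess it (e.g., encode categorical features and normalize numerical values), and split it into the training and test sets used in the next steps:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Example dataset
X = pd.DataFrame({...})  # Replace with your feature matrix
y = pd.Series([...])     # Replace with your target variable

# Encode categorical features
X_encoded = pd.get_dummies(X)

# Split before scaling so the scaler is fitted on training data only
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.2, random_state=42
)

# Normalize numerical values, keeping column names and the row index
scaler = StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)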
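
As a worked illustration of this preparation step, the sketch below runs the same kind of pipeline on a tiny DataFrame. The column names and values are invented for the example, and splitting before scaling is one reasonable choice (it keeps test-set statistics out of the scaler), not a requirement stated by dataclr.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data, invented for illustration
X = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "income": [40_000, 52_000, 88_000, 91_000, 61_000, 45_000],
    "color": ["red", "blue", "red", "blue", "blue", "red"],
})
y = pd.Series([0, 0, 1, 1, 1, 0])

# One-hot encode: "color" becomes the columns color_blue and color_red
X_encoded = pd.get_dummies(X)

# Hold out two rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=2, random_state=0
)

# Fit the scaler on the training rows only, then apply it to both splits
scaler = StandardScaler().fit(X_train)
X_train_scaled = pd.DataFrame(
    scaler.transform(X_train), columns=X_train.columns, index=X_train.index
)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test), columns=X_test.columns, index=X_test.index
)

print(X_train_scaled.round(2))
```

Passing index= when rebuilding the DataFrames keeps the rows aligned with y_train and y_test, which still carry their original index labels after the split.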

2. Use FeatureSelector

The FeatureSelector is a high-level API that combines multiple methods to select the best feature subsets:

from dataclr.feature_selection import FeatureSelector

# Initialize the FeatureSelector
selector = FeatureSelector(
    model=my_model,  # Replace with your model
    metric="accuracy",
    X_train=X_train,
    X_test=X_test,
    y_train=y_train,
    y_test=y_test,
)

# Perform feature selection
selected_features = selector.select_features(n_results=5)
print(selected_features)
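
One way to sanity-check any selected subset is to retrain on just those columns and compare the held-out score with the all-features baseline. The sketch below uses scikit-learn only. The synthetic data, the holdout_accuracy helper, and the candidate_features list are hypothetical; the list is a hand-written stand-in for a selection result, since the structure of what select_features returns is not described here.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data with named columns (illustrative only)
X_arr, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)


def holdout_accuracy(columns):
    """Fit a fresh model on the given columns and score it on the test split."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[columns], y_train)
    return accuracy_score(y_test, model.predict(X_test[columns]))


# Hypothetical subset standing in for a feature-selection result
candidate_features = ["f0", "f1", "f2"]

baseline = holdout_accuracy(list(X.columns))
subset = holdout_accuracy(candidate_features)
print(f"all features: {baseline:.3f}, subset: {subset:.3f}")
```

A subset that matches or beats the baseline with fewer columns is usually worth keeping; a large drop suggests the selection discarded something the model needed.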

3. Use Individual Methods

For granular control, you can use individual feature selection methods:

from dataclr.methods import MutualInformation

# Initialize a method
method = MutualInformation(model=my_model, metric="accuracy")

# Fit and transform
results = method.fit_transform(X_train, X_test, y_train, y_test)
print(results)
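
For intuition about what a mutual-information filter ranks, the sketch below computes the scores with scikit-learn's mutual_info_classif. It is an independent illustration, not dataclr's implementation, and the synthetic data and column names are assumptions. Mutual information measures how much knowing a feature reduces uncertainty about the target, so it can pick up non-linear dependence that a linear correlation would miss.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Synthetic classification data (illustrative only)
X_arr, y = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])

# Estimate mutual information between each feature and the class label.
# The estimator is randomized, so fix random_state for repeatable scores.
mi = mutual_info_classif(X, y, random_state=0)

# Rank features from most to least informative
ranking = pd.Series(mi, index=X.columns).sort_values(ascending=False)
print(ranking)
```

Scores are non-negative and have no fixed upper bound, so they are best read as a relative ordering rather than as absolute importance values.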

Documentation

Explore the full documentation for detailed usage instructions, API references, and examples.
