
Python package to help you in variable selection.


What is it?

Feature selection is a crucial problem in many machine learning tasks. Usually the considered variables are cheap to collect and store, but in some situations acquiring feature values can be problematic. For example, when predicting the occurrence of a disease, we may consider the results of diagnostic tests that can be very expensive. Existing feature selection methods usually ignore the costs associated with the considered features. The goal of cost-sensitive feature selection is to select a subset of features that predicts the target variable (e.g. the occurrence of a disease) successfully within an assumed budget.

The main purpose of this package is to provide filter methods of feature selection based on information theory and to propose new variants of these methods considering feature costs.
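To make the idea concrete, here is a minimal pure-Python sketch of a cost-sensitive filter. It is not bcselector's implementation: the function names, the toy data, the scoring rule (mutual information with the target divided by cost raised to the power r) and the rule of skipping features that no longer fit in the budget are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) in nats for two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (count / n) * math.log(count * n / (px[x] * py[y]))
    return mi


def rank_by_cost_scaled_relevance(columns, target, costs, r=1.0):
    """Order feature indices by I(X_k;Y) / cost_k**r, best first (hypothetical rule)."""
    scores = [
        mutual_information(col, target) / (cost ** r)
        for col, cost in zip(columns, costs)
    ]
    return sorted(range(len(columns)), key=lambda k: scores[k], reverse=True)


def select_within_budget(order, costs, budget):
    """Walk the ranking and keep every feature that still fits in the budget."""
    selected, spent = [], 0.0
    for k in order:
        if spent + costs[k] <= budget:
            selected.append(k)
            spent += costs[k]
    return selected, spent


# Toy data: feature 0 copies the target but is expensive,
# feature 1 is noisier but cheap, feature 2 is irrelevant.
target = [0, 0, 0, 0, 1, 1, 1, 1]
columns = [
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1, 1, 0],
    [0, 1, 0, 1, 0, 1, 0, 1],
]
costs = [10.0, 1.0, 1.0]

order_cost_aware = rank_by_cost_scaled_relevance(columns, target, costs, r=1.0)
order_no_cost = rank_by_cost_scaled_relevance(columns, target, costs, r=0.0)
selected, spent = select_within_budget(order_cost_aware, costs, budget=5.0)
```

With r=0 the cost term disappears and the ranking reduces to plain relevance, so the expensive feature comes first. With r=1 the cheap, slightly noisier feature overtakes it.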

Installation

bcselector can be installed from PyPI (https://pypi.org/project/bcselector):

pip install bcselector

Quickstart

First, we need a dataset with a classification target variable and a cost assigned to each feature. A good sample dataset is hepatitis from the UCI repository [1].

Let's say the dataset is already loaded into Python. We need to create a selector object and call its fit method with the proper arguments:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

from bcselector.variable_selection import FractionVariableSelector
from bcselector.datasets import load_sample

# Arguments for feature selection
# r - cost scaling parameter,
# beta - kwarg for j_criterion_func,
# model - model that is fitted on data.
r = 1
beta = 0.5
model = LogisticRegression(max_iter=1000)

# Data
X, y, costs = load_sample()

# Feature selection
fvs = FractionVariableSelector()
fvs.fit(data=X, target_variable=y, costs=costs, r=r, j_criterion_func='cife', beta=beta)

Now we can obtain the feature selection results by calling a simple getter:

fvs.get_cost_results()

Or we can score and plot our results with any sklearn model and classification metric:

fvs.score(model=model, scoring_function=roc_auc_score)
fvs.plot_scores(compare_no_cost_method=True, model=model, annotate=True)

This results in a BC-plot:

https://raw.githubusercontent.com/Kaketo/bcselector/master/docs/img/bc_plot.png

The OX axis shows the accumulated cost, and the OY axis shows the test-set score of the currently selected set of features:

  • The blue line shows the order of features selected by the cost-sensitive method.

  • The red line shows the order of features selected by the no-cost method.

  • The blue vertical line marks the maximum available budget (a user parameter).

The small numbers above or below the curves are the indexes of the selected features. For example, we can see that the first variable selected by the cost-sensitive method is in the 14th column of the dataset X.
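The horizontal axis of such a plot can be reproduced by hand. The sketch below computes the accumulated cost along a selection order and cuts the order at a budget. The helper names and the numbers are hypothetical and are not part of the bcselector API.

```python
from itertools import accumulate


def accumulated_costs(order, costs):
    """Running total of feature costs along a selection order (the plot's x-axis)."""
    return list(accumulate(costs[k] for k in order))


def features_within_budget(order, costs, budget):
    """Longest prefix of the selection order whose accumulated cost fits the budget."""
    totals = accumulated_costs(order, costs)
    # Costs are assumed non-negative, so the running totals never decrease
    # and counting the affordable totals gives the prefix length.
    n_affordable = sum(1 for total in totals if total <= budget)
    return order[:n_affordable]


# Hypothetical costs for four features and an order in which they were selected.
costs = [4.0, 1.0, 2.5, 0.5]
order = [3, 1, 2, 0]

totals = accumulated_costs(order, costs)
affordable = features_within_budget(order, costs, budget=4.0)
```

Here the running totals are 0.5, 1.5, 4.0 and 8.0, so a budget of 4.0 admits the first three selected features. In the plot this corresponds to the points lying on or to the left of the vertical budget line.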

Bibliography

  • [1] Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Citations

TBD
