Supervised discretization with Rust

Discrust

The discrust package provides a supervised discretization algorithm. Under the hood it implements a decision tree, using information value to find the optimal splits, and provides several different methods to constrain the final discretization scheme.

The package draws heavily from the ivpy package, both in the algorithm and the parameter controls. Why make another package? This package serves as a proof of concept of building a Python package using Rust and pyo3. Additionally, the goal is for this package to better align with the scikit-learn API, and possibly to be used in other Rust-based credit score building tools.

Usage

The package has a single user-facing class, Discretizer, that can be instantiated with the following arguments.

  • min_obs (Optional[float], optional): Minimum number of observations required in a bin. Defaults to 5.
  • max_bins (Optional[int], optional): Maximum number of bins to split the variable into. Defaults to 10.
  • min_iv (Optional[float], optional): Minimum information value required to make a split. Defaults to 0.001.
  • min_pos (Optional[float], optional): Minimum number of records with a value of one that should be present in a split. Defaults to 5.
  • mono (Optional[int], optional): The monotonicity required between the binned variable and the binary performance outcome. A value of -1 will result in a negative correlation between the binned x variable and the y variable, while a value of 1 will result in a positive correlation between the binned x variable and the y variable. A value of 0 will bin x with no monotonicity constraint. If None is specified, the monotonicity will be determined by the monotonicity of the first split. Defaults to None.
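
To make the mono values concrete, here is a minimal sketch of what the constraint means for the finished bins. It is not the package's implementation: satisfies_mono and the per-bin event rates are hypothetical names introduced for illustration, and the mono=None case (where the direction is taken from the first split during fitting) is deliberately not modeled.

```python
def satisfies_mono(event_rates, mono):
    """Check per-bin event rates (ordered by x) against a mono setting.

    Hypothetical helper, for illustration only:
      mono ==  1 -> event rate must not decrease from one bin to the next
      mono == -1 -> event rate must not increase from one bin to the next
      mono ==  0 -> no constraint
    """
    if mono not in (-1, 0, 1):
        raise ValueError("this sketch only handles mono in {-1, 0, 1}")
    pairs = list(zip(event_rates, event_rates[1:]))
    if mono == 0:
        return True
    if mono == 1:
        return all(later >= earlier for earlier, later in pairs)
    return all(later <= earlier for earlier, later in pairs)


# Illustrative event rates (share of ones) for three bins ordered by x.
rates = [0.10, 0.20, 0.50]
print(satisfies_mono(rates, 1))   # rising rates fit a positive relationship
print(satisfies_mono(rates, -1))  # but not a negative one
```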

The fit method can be called on data and accepts the following parameters.

  • x (ArrayLike): An arraylike numeric field that will be discretized based on the values of y, and the constraints the Discretizer was initialized with.
  • y (ArrayLike): An arraylike binary field.
  • sample_weight (Optional[ArrayLike], optional): Optional sample weight array to be used when calculating the optimal breaks. Defaults to None.
  • exception_values (Optional[List[float]], optional): Optional list specifying exception values. These values are held out of the binning process. Their respective weight of evidence and summary information can be found in the exception_values_ attribute once the discretizer has been fit.

np.nan may be included in the list of exception values. If there are np.nan values present in the x variable and np.nan is not listed as an exception value, an error will be raised. An error will also be raised if np.nan is found in the y or sample_weight arrays.
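
The rules above can be sketched as a small pre-fit check. This is an illustration under assumptions, not the package's code: check_missing is a hypothetical helper, and the exception type the package actually raises is not stated here, so ValueError is only a stand-in.

```python
import math


def check_missing(x, y, exception_values=None, sample_weight=None):
    """Hypothetical pre-fit check mirroring the documented NaN rules."""
    exception_values = exception_values or []
    # NaN in x is tolerated only when NaN is listed as an exception value.
    nan_allowed = any(math.isnan(v) for v in exception_values)
    if not nan_allowed and any(math.isnan(v) for v in x):
        raise ValueError("x contains NaN, but NaN is not an exception value")
    # NaN is never tolerated in y or in the sample weights.
    if any(math.isnan(v) for v in y):
        raise ValueError("y contains NaN")
    if sample_weight is not None and any(math.isnan(v) for v in sample_weight):
        raise ValueError("sample_weight contains NaN")


nan = float("nan")
check_missing([22.0, nan, 38.0], [0, 1, 1], exception_values=[nan])  # passes
try:
    check_missing([22.0, nan, 38.0], [0, 1, 1])  # NaN not declared
except ValueError as err:
    print("rejected:", err)
```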

This method will fit the decision tree and find the optimal split values for the feature given the constraints. After being fit the discretizer will have a splits_ attribute with the optimal split values.

import seaborn as sns

df = sns.load_dataset("titanic")

from discrust import Discretizer

ds = Discretizer(min_obs=5, max_bins=10, min_iv=0.001, min_pos=1.0, mono=None)
ds.fit(df["fare"], df["survived"])
ds.splits_
# [-inf, 6.95, 7.125, 7.7292, 10.4625, 15.1, 50.4958, 52.0, 73.5, 79.65, inf]
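
As a rough illustration of how a splits_ list like the one above partitions a feature, the sketch below assigns values to bins with the standard library's bisect module. Whether the package treats bin edges as left-closed or right-closed is not stated here, so the left-closed convention is an assumption, and bin_index is a hypothetical helper rather than part of discrust.

```python
from bisect import bisect_right

# The split values shown above for the fare variable.
splits = [float("-inf"), 6.95, 7.125, 7.7292, 10.4625, 15.1,
          50.4958, 52.0, 73.5, 79.65, float("inf")]


def bin_index(value, splits):
    """Return i such that splits[i] <= value < splits[i + 1].

    Assumes left-closed bins; the result is clamped so that +inf
    falls in the last bin.
    """
    return min(bisect_right(splits, value) - 1, len(splits) - 2)


for fare in (5.0, 7.0, 100.0):
    i = bin_index(fare, splits)
    print(fare, "-> bin", i, f"[{splits[i]}, {splits[i + 1]})")
```

Eleven split values define ten bins, which agrees with max_bins=10 in the call above.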

Here we show what the results are if exception values are also specified. These exception values will be held out when calculating the bins.

import numpy as np

ds = Discretizer(min_obs=5, max_bins=10, min_iv=0.001, min_pos=1.0, mono=None)
ds.fit(df["age"], df["survived"], exception_values=[np.nan, 1.0])
ds.exception_values_
# {'vals_': [nan, 1.0],
#  'totals_ct_': [177.0, 7.0],
#  'iv_': [0.03054206173541801, 0.015253257689460616],
#  'ones_ct_': [52.0, 5.0],
#  'woe_': [-0.40378231427394834, 1.3895784363210804],
#  'zero_ct_': [125.0, 2.0]}

The exception_values_ dictionary has the following keys.

  • vals_: The exception values passed to the Discretizer.
  • totals_ct_: The total number of each respective exception value present in the x variable used for fitting.
  • ones_ct_: Total count of the positive class for each exception value.
  • zero_ct_: Total count of zeros for each respective value.
  • woe_: The weight of evidence for each respective exception value.
  • iv_: The information value for each respective exception value.
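
The woe_ and iv_ figures in the example output are consistent with the usual definitions, where the weight of evidence is the log of the ratio between a group's share of all ones and its share of all zeros. The sketch below recomputes them from the counts shown above. It assumes those standard formulas and uses the Titanic class totals (342 survivors, 549 non-survivors); woe_iv is a hypothetical helper, not a discrust function.

```python
import math


def woe_iv(ones, zeros, total_ones, total_zeros):
    """Weight of evidence and information value for one group of records."""
    dist_ones = ones / total_ones     # group's share of all positive records
    dist_zeros = zeros / total_zeros  # group's share of all negative records
    woe = math.log(dist_ones / dist_zeros)
    iv = (dist_ones - dist_zeros) * woe
    return woe, iv


# Class totals for the Titanic data used in the example.
TOTAL_ONES, TOTAL_ZEROS = 342, 549

# Counts taken from ones_ct_ and zero_ct_ in the output above.
woe_nan, iv_nan = woe_iv(52, 125, TOTAL_ONES, TOTAL_ZEROS)  # age is missing
woe_one, iv_one = woe_iv(5, 2, TOTAL_ONES, TOTAL_ZEROS)     # age == 1.0

print(round(woe_nan, 4), round(iv_nan, 4))  # -0.4038 0.0305
print(round(woe_one, 4), round(iv_one, 4))  # 1.3896 0.0153
```

Note that the class totals cover every record, including the held-out exception values, which is what makes the recomputed figures line up with the reported ones.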

The predict method discretizes the feature and then performs weight of evidence substitution on each binned level. This method takes the following argument.

  • x (ArrayLike): An arraylike numeric field.

For exception values found in x, the calculated weight of evidence for that exception value will be returned. If the exception value was never present in the x variable when the Discretizer was fit, then the returned weight of evidence will be zero for the exception value.

ds.predict(df["fare"])[0:5]
# array([-0.84846814, 0.78344263, -0.787529, 0.78344263, -0.787529])
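
Conceptually, the substitution performed by predict can be sketched as follows. This is a simplified illustration, not the package's implementation: woe_transform is a hypothetical helper, the splits and weight of evidence values are toy numbers chosen for readability, and left-closed bins are assumed.

```python
import math
from bisect import bisect_right


def woe_transform(values, splits, bin_woe, exception_woe, nan_woe=0.0):
    """Replace each value with the WoE of its bin or of its exception value."""
    out = []
    for v in values:
        if math.isnan(v):
            # NaN cannot be used as a dictionary key, so it is handled apart.
            out.append(nan_woe)
        elif v in exception_woe:
            out.append(exception_woe[v])
        else:
            i = min(bisect_right(splits, v) - 1, len(bin_woe) - 1)
            out.append(bin_woe[i])
    return out


# Toy values, for illustration only.
splits = [float("-inf"), 10.0, 50.0, float("inf")]
bin_woe = [-0.8, 0.1, 0.9]    # one WoE per bin
exception_woe = {-1.0: 0.0}   # exception value never seen in fitting -> 0

result = woe_transform([5.0, 10.0, 75.0, -1.0, float("nan")],
                       splits, bin_woe, exception_woe, nan_woe=-0.4)
print(result)  # [-0.8, 0.1, 0.9, 0.0, -0.4]
```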

Installation

From PyPI

For Windows users, the package can be installed directly from PyPI with the following command.

python -m pip install discrust

Building from Source

The package can be built from source. It uses the maturin tool as a build backend, which requires Python and a working Rust compiler to be installed, see here for details. If these two requirements are met, you can clone this repository and run the following command in the repository's root directory.

python -m pip install . -v

This should invoke the maturin tool, which will handle the building of the Rust code and installation of the package. Alternatively, if you simply want to build a wheel, you can run the following command after installing maturin.

maturin build --release

I have had some problems building packages with maturin directly in a conda environment. This is actually a bug on Anaconda's side that will hopefully be resolved. If this gives you any problems, it's usually easiest to build a wheel inside of a venv and then install the wheel.
