Supervised discretization with Rust

Discrust

The discrust package provides a supervised discretization algorithm. Under the hood it implements a decision tree, using information value to find the optimal splits, and provides several different methods to constrain the final discretization scheme.

The package draws heavily from the ivpy package, both in the algorithm and the parameter controls. Why make another package? This package serves as a proof of concept of building a Python package using Rust and pyo3. Additionally, the goal is for this package to align better with the scikit-learn API, and possibly to be used in other Rust-based credit score building tools.

Usage

The package has a single user-facing class, Discretizer, which can be instantiated with the following arguments.

  • min_obs (Optional[float], optional): Minimum number of observations required in a bin. Defaults to 5.
  • max_bins (Optional[int], optional): Maximum number of bins to split the variable into. Defaults to 10.
  • min_iv (Optional[float], optional): Minimum information value required to make a split. Defaults to 0.001.
  • min_pos (Optional[float], optional): Minimum number of records with a value of one that should be present in a split. Defaults to 5.
  • mono (Optional[int], optional): The monotonicity required between the binned variable and the binary performance outcome. A value of -1 will result in a negative correlation between the binned x variable and the y variable, while a value of 1 will result in a positive correlation between them. A value of 0 will bin x with no monotonicity constraint. If None is specified, the monotonicity will be determined by the monotonicity of the first split. Defaults to None.
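
To make the roles of min_obs, min_pos and min_iv more concrete, here is a minimal pure-Python sketch of a single split search. It is not the package's Rust implementation: woe_iv and best_split are hypothetical helpers, the weight of evidence and information value formulas are the common textbook ones, min_obs and min_pos are assumed to apply to each side of the split, and sample weights, max_bins and mono are ignored.

```python
import math


def woe_iv(ones, zeros, total_ones, total_zeros):
    """Weight of evidence and information value for one bin (assumed definitions)."""
    pct_ones = ones / total_ones
    pct_zeros = zeros / total_zeros
    woe = math.log(pct_ones / pct_zeros)
    return woe, (pct_ones - pct_zeros) * woe


def best_split(x, y, min_obs=5, min_pos=5, min_iv=0.001):
    """Return (split_value, iv) for the best single split, or None if none qualifies."""
    pairs = sorted(zip(x, y))
    n = len(pairs)
    total_ones = sum(label for _, label in pairs)
    total_zeros = n - total_ones
    best = None
    left_ones = 0
    for i in range(1, n):
        left_ones += pairs[i - 1][1]
        # Only split between distinct x values.
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        left_zeros = i - left_ones
        right_ones = total_ones - left_ones
        right_zeros = total_zeros - left_zeros
        # Enforce the per-side observation and positive-count constraints.
        if min(i, n - i) < min_obs or min(left_ones, right_ones) < min_pos:
            continue
        # WoE is undefined when a side has no ones or no zeros; skip those.
        if 0 in (left_ones, right_ones, left_zeros, right_zeros):
            continue
        iv = (woe_iv(left_ones, left_zeros, total_ones, total_zeros)[1]
              + woe_iv(right_ones, right_zeros, total_ones, total_zeros)[1])
        if iv >= min_iv and (best is None or iv > best[1]):
            # The split value is the upper edge of the left (right-closed) bin.
            best = (pairs[i - 1][0], iv)
    return best


result = best_split([1, 2, 3, 4, 5, 6, 7, 8], [0, 0, 0, 1, 0, 1, 1, 1],
                    min_obs=2, min_pos=1)
print(result)  # split value 4, information value equal to log(3)
```

A decision tree would apply a search like this recursively to each side until no candidate split satisfies the constraints or max_bins is reached.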

The fit method can be called on data and accepts the following parameters.

  • x (ArrayLike): An arraylike numeric field that will be discretized based on the values of y, and the constraints the Discretizer was initialized with.
  • y (ArrayLike): An arraylike binary field.
  • sample_weight (Optional[ArrayLike], optional): Optional sample weight array to be used when calculating the optimal breaks. Defaults to None.
  • exception_values (Optional[List[float]], optional): Optional list specifying exception values. These values are held out of the binning process. Their respective weight of evidence and summary information can be found in the exception_values_ attribute once the discretizer has been fit. Defaults to None.

A np.nan value may be present in the list of possible exception values. If there are np.nan values present in the x variable, and np.nan is not listed as a possible exception value, an error will be raised. Additionally, an error will be raised if np.nan is found to be in y or the sample_weight arrays.
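
The NaN rules above can be summarised in a short sketch. validate_inputs is a hypothetical helper written for illustration and is not part of discrust; the exception type and the messages are assumptions.

```python
def _has_nan(values):
    # NaN is the only value that is not equal to itself.
    return any(v != v for v in values)


def validate_inputs(x, y, sample_weight=None, exception_values=None):
    """Raise ValueError in the situations where the text says fit raises an error."""
    exception_values = exception_values or []
    if _has_nan(x) and not _has_nan(exception_values):
        raise ValueError("x contains NaN, but NaN is not listed as an exception value")
    if _has_nan(y):
        raise ValueError("y contains NaN")
    if sample_weight is not None and _has_nan(sample_weight):
        raise ValueError("sample_weight contains NaN")


nan = float("nan")
validate_inputs([1.0, nan, 3.0], [0, 1, 0], exception_values=[nan])  # passes

try:
    validate_inputs([1.0, nan, 3.0], [0, 1, 0])
except ValueError as err:
    print(err)
```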

This method will fit the decision tree and find the optimal split values for the feature given the constraints. After being fit the discretizer will have a splits_ attribute with the optimal split values.

import seaborn as sns

df = sns.load_dataset("titanic")

from discrust import Discretizer

ds = Discretizer(min_obs=5, max_bins=10, min_iv=0.001, min_pos=1.0, mono=None)
ds.fit(df["fare"], df["survived"])
ds.splits_
# [-inf, 6.95, 7.125, 7.7292, 10.4625, 15.1, 50.4958, 52.0, 73.5, 79.65, inf]

Here we show what the results are if exception values are also specified. These exception values will be held out when calculating the bins.

import numpy as np

ds = Discretizer(min_obs=5, max_bins=10, min_iv=0.001, min_pos=1.0, mono=None)
ds.fit(df["age"], df["survived"], exception_values=[np.nan, 1.0])
ds.exception_values_
# {'vals_': [nan, 1.0],
#  'totals_ct_': [177.0, 7.0],
#  'iv_': [0.03054206173541801, 0.015253257689460616],
#  'ones_ct_': [52.0, 5.0],
#  'woe_': [-0.40378231427394834, 1.3895784363210804],
#  'zero_ct_': [125.0, 2.0]}

The exception_values_ dictionary has the following keys.

  • vals_: The exception values passed to the Discretizer.
  • totals_ct_: The total number of each respective exception value present in the x variable used for fitting.
  • ones_ct_: Total count of the positive class for each exception value.
  • zero_ct_: Total count of zeros for each respective value.
  • woe_: The weight of evidence for each respective exception value.
  • iv_: The information value for each respective exception value.
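
The woe_ and iv_ figures in the example output are consistent with the common textbook definitions, where the weight of evidence is the log of a bin's share of all ones divided by its share of all zeros. The sketch below recomputes them from the counts shown. The dataset totals (342 ones and 549 zeros in the Titanic survived column) are not stated above and are an assumption, as is the helper name woe_and_iv.

```python
import math

# Assumed totals for the Titanic "survived" column (not taken from the text above).
TOTAL_ONES, TOTAL_ZEROS = 342, 549


def woe_and_iv(ones_ct, zero_ct, total_ones=TOTAL_ONES, total_zeros=TOTAL_ZEROS):
    """Weight of evidence and information value for one bin or exception value."""
    pct_ones = ones_ct / total_ones
    pct_zeros = zero_ct / total_zeros
    woe = math.log(pct_ones / pct_zeros)
    iv = (pct_ones - pct_zeros) * woe
    return woe, iv


nan_woe, nan_iv = woe_and_iv(52.0, 125.0)  # counts for the NaN exception value
one_woe, one_iv = woe_and_iv(5.0, 2.0)     # counts for the exception value 1.0
print(round(nan_woe, 4), round(nan_iv, 4))  # roughly -0.4038 and 0.0305
print(round(one_woe, 4), round(one_iv, 4))  # roughly 1.3896 and 0.0153
```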

The predict method can be called and will discretize the feature, and then perform either weight of evidence substitution on each binned level, or return the bin index. This method takes the following arguments.

  • x (ArrayLike): An arraylike numeric field.

  • prediction_type (str, optional): A string specifying which prediction type should be returned. The string specified must be one of "woe" or "index". Defaults to "woe".

    • If "woe" is supplied, weight of evidence substitution will be performed for each value, and the weight of evidence of the bin the value falls in will be returned. For exception values found in x, the calculated weight of evidence for that exception value will be returned. If the exception value was never present in the x variable when the Discretizer was fit, then the returned weight of evidence will be zero for that exception value.
    • If "index" is specified, each value will be converted to the relevant bin index. These bins will be created from the splits_ attribute and will be zero indexed. Any exception values will be encoded from -1 to -N, where N is the number of exception values present in the exception_values_ attribute. The order of the exception values will match the order of the vals_ key in this attribute.

ds.predict(df["fare"])[0:5]
# array([-0.84846814, 0.78344263, -0.787529, 0.78344263, -0.787529])

Setting prediction_type to "index" is equivalent to using the pandas cut function with the splits_ attribute of the Discretizer object as the bins.

import pandas as pd

ds = Discretizer(min_obs=5, max_bins=5, min_iv=0.001, min_pos=1.0, mono=None)
ds.fit(df["fare"], df["survived"])
pd.cut(df["fare"], bins=ds.splits_).value_counts().sort_index()
# (-inf, 6.95]        26
# (6.95, 7.125]       16
# (7.125, 10.462]    297
# (10.462, 73.5]     455
# (73.5, inf]         97
# Name: fare, dtype: int64

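# pd.value_counts is deprecated in recent pandas releases, so wrap the result in a Series.
pd.Series(ds.predict(df["fare"], prediction_type="index")).value_counts().sort_index()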
# 0     26
# 1     16
# 2    297
# 3    455
# 4     97
# dtype: int64

One of the main benefits of using the predict method over the pandas cut function directly is the built-in support for exception values.

ds = Discretizer(min_obs=5, max_bins=4, min_iv=0.001, min_pos=1.0, mono=None)
ds.fit(df["age"], df["survived"], exception_values=[np.nan, 1.0])

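# pd.value_counts is deprecated in recent pandas releases, so wrap the result in a Series.
pd.Series(ds.predict(df["age"], prediction_type="index")).value_counts().sort_index()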
# -2      7
# -1    177
#  0      6
#  1     34
#  2    654
#  3     13
# dtype: int64

ds.exception_values_["vals_"]
# [nan, 1.0]
ds.exception_values_["totals_ct_"]
# [177.0, 7.0]
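
The "index" encoding described above can be sketched in a few lines of pure Python. to_index is a hypothetical helper, not the package's implementation, and the splits used here are illustrative values rather than the output of a real fit. It assumes right-closed bins, which is what equivalence with the default behaviour of the pandas cut function implies.

```python
from bisect import bisect_left


def to_index(value, splits, exception_vals):
    """Map one value to a bin index; exception values map to -1, -2, ..."""
    value_is_nan = value != value  # NaN is the only value not equal to itself
    for position, exc in enumerate(exception_vals):
        if (value_is_nan and exc != exc) or value == exc:
            return -(position + 1)
    # Right-closed bins (a, b]: a value equal to an edge falls in the lower bin.
    return bisect_left(splits, value) - 1


inf = float("inf")
nan = float("nan")
splits = [-inf, 6.95, 7.125, 10.4625, 73.5, inf]  # illustrative splits
exception_vals = [nan, 1.0]

indexes = [to_index(v, splits, exception_vals) for v in [5.0, 7.0, 100.0, nan, 1.0]]
print(indexes)  # [0, 1, 4, -1, -2]
```

The exception values take their codes from their position in the list, which is why NaN maps to -1 and 1.0 maps to -2 in the example output above.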

Installation

From PyPI

For Windows users, the package can be installed directly from PyPI with the following command.

python -m pip install discrust

Building from Source

The package can be built from source; it uses the maturin tool as a build backend. This requires Python and a working Rust compiler to be installed (see here for details). If both requirements are met, you can clone this repository and run the following command in the repository's root directory.

python -m pip install . -v

This should invoke the maturin tool, which will handle building the Rust code and installing the package. Alternatively, if you simply want to build a wheel, you can run the following command after installing maturin.

maturin build --release

I have had some problems building packages with maturin directly in a conda environment; this is actually a bug on Anaconda's side that will hopefully be resolved. If this gives you any problems, it is usually easiest to build a wheel inside a venv and then install the wheel.
