Supervised discretization with Rust

Discrust

The discrust package provides a supervised discretization algorithm. Under the hood it implements a decision tree, using information value to find the optimal splits, and provides several different methods to constrain the final discretization scheme. This algorithm identifies the optimal way to split a continuous variable into discrete bins, while maximizing the predictive value of those bins with respect to some binary dependent variable.
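
The idea of choosing a split by information value can be sketched in a few lines of pure Python. This is a minimal illustration and not the Rust implementation: the function names information_value and best_split are hypothetical, only a single split is searched, and bins that contain only one class are given an information value of zero as a simplifying assumption.

```python
import math


def information_value(ones, zeros, total_ones, total_zeros):
    """Information value contributed by one bin.

    Assumption for this sketch: a bin missing either class contributes 0.0.
    """
    if ones == 0 or zeros == 0:
        return 0.0
    dist_ones = ones / total_ones
    dist_zeros = zeros / total_zeros
    return (dist_ones - dist_zeros) * math.log(dist_ones / dist_zeros)


def best_split(x, y):
    """Return (threshold, iv) for the single split 'x <= threshold' with the highest IV."""
    pairs = sorted(zip(x, y))
    total_ones = sum(y)
    total_zeros = len(y) - total_ones
    best_threshold, best_iv = None, 0.0
    left_ones = left_zeros = 0
    for i in range(len(pairs) - 1):
        value, label = pairs[i]
        left_ones += label
        left_zeros += 1 - label
        # Only split between distinct x values.
        if value == pairs[i + 1][0]:
            continue
        iv = information_value(left_ones, left_zeros, total_ones, total_zeros)
        iv += information_value(
            total_ones - left_ones, total_zeros - left_zeros, total_ones, total_zeros
        )
        if iv > best_iv:
            best_threshold, best_iv = value, iv
    return best_threshold, best_iv


threshold, iv = best_split([1, 2, 3, 4, 5, 6, 7, 8], [0, 0, 0, 1, 0, 1, 1, 1])
print(threshold, iv)
```

A decision tree would apply a search like this recursively to each side of the chosen split, stopping when a constraint such as the minimum information value or the maximum number of bins is reached.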

The Rust code for the algorithm itself is in the crates/discrust_core directory. The code for the Python bindings is in the src directory.

Usage

The package has a single user-facing class, Discretizer, that can be instantiated with the following arguments.

  • min_obs (Optional[float], optional): Minimum number of observations required in a bin. Defaults to 5.
  • max_bins (Optional[int], optional): Maximum number of bins to split the variable into. Defaults to 10.
  • min_iv (Optional[float], optional): Minimum information value required to make a split. Defaults to 0.001.
  • min_pos (Optional[float], optional): Minimum number of records with a value of one that should be present in a split. Defaults to 5.
  • mono (Optional[int], optional): The monotonicity required between the binned variable and the binary performance outcome. A value of -1 will result in a negative correlation between the binned x variable and the y variable, while a value of 1 will result in a positive correlation. A value of 0 will bin x with no monotonicity constraint. If None is specified, the monotonicity will be determined by the monotonicity of the first split. Defaults to None.
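
To make the mono options concrete, the sketch below checks whether a sequence of per-bin weight of evidence values, ordered by bin, respects a given constraint. The helper satisfies_mono is hypothetical and is not part of the package. Using weight of evidence as the measure of direction is an assumption, and the None case is approximated here by the direction of the first two bins, whereas the package decides it from the first split while building the tree.

```python
def satisfies_mono(woe_by_bin, mono):
    """Return True if the ordered per-bin WoE values respect the `mono` setting.

    mono=1: values never decrease; mono=-1: values never increase;
    mono=0: no constraint; mono=None: direction taken from the first two bins
    (an approximation made for this sketch only).
    """
    if mono == 0 or len(woe_by_bin) < 2:
        return True
    if mono is None:
        mono = 1 if woe_by_bin[1] >= woe_by_bin[0] else -1
    pairs = list(zip(woe_by_bin, woe_by_bin[1:]))
    if mono == 1:
        return all(b >= a for a, b in pairs)
    return all(b <= a for a, b in pairs)


print(satisfies_mono([-0.8, -0.1, 0.5], mono=1))   # increasing sequence
print(satisfies_mono([-0.8, 0.5, -0.1], mono=1))   # direction changes
print(satisfies_mono([-0.8, 0.5, -0.1], mono=0))   # no constraint applied
```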

The fit method can be called on data and accepts the following parameters.

  • x (ArrayLike): An arraylike numeric field that will be discretized based on the values of y, and the constraints the Discretizer was initialized with.
  • y (ArrayLike): An arraylike binary field.
  • sample_weight (Optional[ArrayLike], optional): Optional sample weight array to be used when calculating the optimal breaks. Defaults to None.
  • exception_values (Optional[List[float]], optional): Optional list of exception values. These values are held out of the binning process. Their respective weight of evidence and summary information can be found in the exception_values_ attribute once the discretizer has been fit.

A np.nan value may be present in the list of possible exception values. If there are np.nan values present in the x variable, and np.nan is not listed as a possible exception value, an error will be raised. Additionally, an error will be raised if np.nan is found to be in y or the sample_weight arrays.
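
The rules in the paragraph above can be expressed as a small validation routine. This is only a sketch of the described behavior: validate_inputs is a hypothetical helper, and the use of ValueError and the message texts are assumptions, since the text only says that an error is raised.

```python
import math


def _is_nan(value):
    """True only for float NaN values."""
    return isinstance(value, float) and math.isnan(value)


def validate_inputs(x, y, sample_weight=None, exception_values=None):
    """Raise ValueError when NaN appears where the rules above do not allow it."""
    exception_values = exception_values or []
    nan_is_exception = any(_is_nan(v) for v in exception_values)
    # NaN in x is acceptable only when NaN is listed as an exception value.
    if not nan_is_exception and any(_is_nan(v) for v in x):
        raise ValueError("x contains NaN, but NaN is not listed in exception_values")
    # NaN is never acceptable in y or sample_weight.
    if any(_is_nan(v) for v in y):
        raise ValueError("y contains NaN")
    if sample_weight is not None and any(_is_nan(v) for v in sample_weight):
        raise ValueError("sample_weight contains NaN")


nan = float("nan")
validate_inputs([22.0, nan, 38.0], [0, 1, 1], exception_values=[nan])  # passes
```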

This method will fit the decision tree and find the optimal split values for the feature given the constraints. After being fit, the discretizer will have a splits_ attribute with the optimal split values.

import numpy as np
import seaborn as sns

df = sns.load_dataset("titanic")

from discrust import Discretizer

ds = Discretizer(min_obs=5, max_bins=10, min_iv=0.001, min_pos=1.0, mono=None)
ds.fit(df["fare"], df["survived"])
ds.splits_
# [-inf, 6.95, 7.125, 7.7292, 10.4625, 15.1, 50.4958, 52.0, 73.5, 79.65, inf]

Here we show what the results are if exception values are also specified. These exception values will be held out when calculating the bins.

ds = Discretizer(min_obs=5, max_bins=10, min_iv=0.001, min_pos=1.0, mono=None)
ds.fit(df["age"], df["survived"], exception_values=[np.nan, 1.0])
ds.exception_values_
# {'vals_': [nan, 1.0],
#  'totals_ct_': [177.0, 7.0],
#  'iv_': [0.03054206173541801, 0.015253257689460616],
#  'ones_ct_': [52.0, 5.0],
#  'woe_': [-0.40378231427394834, 1.3895784363210804],
#  'zero_ct_': [125.0, 2.0]}

The exception_values_ dictionary has the following keys.

  • vals_: The exception values passed to the Discretizer.
  • totals_ct_: The total number of each respective exception value present in the x variable used for fitting.
  • ones_ct_: Total count of the positive class for each exception value.
  • zero_ct_: Total count of zeros for each respective value.
  • woe_: The weight of evidence for each respective exception value.
  • iv_: The information value for each respective exception value.
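
As a worked example of how these keys relate, the sketch below recomputes weight of evidence and information value from the ones_ct_ and zero_ct_ counts shown above, using the common definitions WoE = ln(share of ones / share of zeros) and IV = (share of ones - share of zeros) * WoE. The helper woe_and_iv is hypothetical, and the class totals of 342 ones and 549 zeros for the titanic data are an assumption that is not stated in the text. Under those assumptions the results agree with the woe_ and iv_ values printed above to about four decimal places.

```python
import math


def woe_and_iv(ones, zeros, total_ones, total_zeros):
    """Weight of evidence and information value for one group of records."""
    dist_ones = ones / total_ones      # share of all positive records in this group
    dist_zeros = zeros / total_zeros   # share of all negative records in this group
    woe = math.log(dist_ones / dist_zeros)
    iv = (dist_ones - dist_zeros) * woe
    return woe, iv


# Assumed class totals for the titanic "survived" column (not given in the text above).
TOTAL_ONES, TOTAL_ZEROS = 342, 549

# Counts taken from the exception_values_ output: nan, then 1.0.
woe_nan, iv_nan = woe_and_iv(52, 125, TOTAL_ONES, TOTAL_ZEROS)
woe_one, iv_one = woe_and_iv(5, 2, TOTAL_ONES, TOTAL_ZEROS)
print(woe_nan, iv_nan)
print(woe_one, iv_one)
```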

The predict method can be called and will discretize the feature, and then perform either weight of evidence substitution on each binned level, or return the bin index. This method takes the following arguments.

  • x (ArrayLike): An arraylike numeric field.

  • prediction_type (str, optional): A string specifying which prediction type should be returned. The string specified must be one of "woe" or "index". Defaults to "woe".

    • If "woe" is supplied, weight of evidence substitution will be performed for each value, and the weight of evidence of the bin the value falls in will be returned. For exception values found in x, the calculated weight of evidence for that exception value will be returned. If the exception value was never present in the x variable when the Discretizer was fit, then the returned weight of evidence will be zero for the exception value.
    • If "index" is specified, each value will be converted to the relevant bin index. These bins will be created from the splits_ attribute and will be zero indexed. Any exception values will be encoded starting with -1 to -N, where N is the number of exception values present in the exception_values_ attribute. The order of the exception values will be equivalent to the vals_ key in this attribute.

ds.predict(df["fare"])[0:5]
# array([-0.84846814, 0.78344263, -0.787529, 0.78344263, -0.787529])

Setting prediction_type to "index" is equivalent to using the pandas cut function with the splits_ attribute of the Discretizer object as the bins.

import pandas as pd

ds = Discretizer(min_obs=5, max_bins=5, min_iv=0.001, min_pos=1.0, mono=None)
ds.fit(df["fare"], df["survived"])
pd.cut(df["fare"], bins=ds.splits_).value_counts().sort_index()
# (-inf, 6.95]        26
# (6.95, 7.125]       16
# (7.125, 10.462]    297
# (10.462, 73.5]     455
# (73.5, inf]         97
# Name: fare, dtype: int64

pd.Series(ds.predict(df["fare"], prediction_type="index")).value_counts().sort_index()
# 0     26
# 1     16
# 2    297
# 3    455
# 4     97
# dtype: int64

One of the main benefits of using the predict method over the pandas cut function directly is the built-in support for exception values.

ds = Discretizer(min_obs=5, max_bins=4, min_iv=0.001, min_pos=1.0, mono=None)
ds.fit(df["age"], df["survived"], exception_values=[np.nan, 1.0])

pd.Series(ds.predict(df["age"], prediction_type="index")).value_counts().sort_index()
# -2      7
# -1    177
#  0      6
#  1     34
#  2    654
#  3     13
# dtype: int64

ds.exception_values_["vals_"]
# [nan, 1.0]
ds.exception_values_["totals_ct_"]
# [177.0, 7.0]
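
The index encoding described above can be sketched in pure Python. This is an illustration under stated assumptions, not the package's code: to_index is a hypothetical helper, the bins are treated as right-closed intervals in the way pandas cut does by default, and the split values in the usage lines are illustrative only. Exception values are matched first and encoded as -1, -2, and so on, following their order in vals_.

```python
import math
from bisect import bisect_left


def _is_nan(value):
    return isinstance(value, float) and math.isnan(value)


def _same_value(a, b):
    """Equality that also treats NaN as equal to NaN."""
    return a == b or (_is_nan(a) and _is_nan(b))


def to_index(x, splits, exception_vals):
    """Map each value to a zero-based bin index, or to -1..-N for exception values."""
    result = []
    for value in x:
        for position, exception in enumerate(exception_vals):
            if _same_value(value, exception):
                result.append(-(position + 1))
                break
        else:
            # Right-closed bins (a, b]: a value equal to a split falls in the lower bin.
            result.append(bisect_left(splits, value) - 1)
    return result


splits = [-math.inf, 6.95, 7.125, 10.4625, 73.5, math.inf]  # illustrative split values
print(to_index([6.95, 7.0, 80.0, float("nan"), 1.0], splits, [float("nan"), 1.0]))
```

With nan listed first and 1.0 second, nan is encoded as -1 and 1.0 as -2, which matches the counts of 177 and 7 shown in the output above.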

Installation

From PyPI

For Windows users, the package can be installed directly from PyPI with the following command.

python -m pip install discrust

Building from Source

The package can be built from source. It uses the maturin tool as a build backend. This tool requires that you have Python and a working Rust compiler installed, see here for details. If these two requirements are met, you can clone this repository and run the following command in the repository's root directory.

python -m pip install . -v

This should invoke the maturin tool, which will handle building the Rust code and installing the package. Alternatively, if you simply want to build a wheel, you can run the following command after installing maturin.

maturin build --release

I have had some problems building packages with maturin directly in a conda environment. This is actually a bug on Anaconda's side that will hopefully be resolved. If this gives you any problems, it is usually easiest to build a wheel inside a venv and then install the wheel.

Acknowledgments

The package draws heavily from the ivpy package, both in the algorithm and the parameter controls. Why make another package? This package serves as a proof of concept for building a Python package using Rust and pyo3, and offers cleaner methods for dealing with exception values. Additionally, the goal is for this package to better align with the scikit-learn API, and possibly be used in other Rust-based credit score building tools.
