
A small package for automatic feature binning.

Project description

autoBinning: a feature binning tool

Installation

pip install autoBinning

Basic tools (simpleMethods)

from autoBinning.utils.simpleMethods import *
my_list = [1,1,2,2,2,2,3,3,4,5,6,7,8,9,10,10,20,20,20,20,30,30,40,50,60,70,80,90,100]
my_list_y = [1,1,2,2,2,2,1,1,1,2,2,2,1,1]  # example labels (not used by simpleMethods)
t = simpleMethods(my_list)
# equal-frequency bins: roughly the same number of samples per bin
t.equalSize(3)
print(t.bins) # [  1.           5.33333333  20.         100.        ]
# equal-width bins
t.equalValue(4)
print(t.bins) # [  1.    25.75  50.5   75.25 100.  ]
# bins based on numpy.histogram
t.equalHist(4)
print(t.bins) # [  1.    25.75  50.5   75.25 100.  ]
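Each method stores the resulting edges in t.bins as a plain numpy array, so the bins can be applied to raw values without the library. A minimal sketch using only numpy (not an autoBinning API; the edges below are the equalValue(4) output above):

import numpy as np

edges = np.array([1.0, 25.75, 50.5, 75.25, 100.0])  # t.bins from equalValue(4)
values = np.array([3, 27, 60, 99])
# interior edges only, so indices run 0..len(edges)-2 with [low, high) bins
bin_idx = np.digitize(values, edges[1:-1])
print(bin_idx)  # [0 1 2 3]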

Label-based supervised automatic binning

Forward iterative approach (forward method)

# load data
import pandas as pd
df = pd.read_csv('credit_old.csv')
df = df[['Age','target']]
df = df.dropna()

Splitting by maximum WOE difference

After first building a set of candidate bins that is as fine-grained as possible, find the initial cut point where the WOE difference between the bin above and the bin below is largest; this also fixes the WOE trend. Then iteratively pick the next cut point with the largest WOE difference that follows the same trend, until the WOE difference falls below a threshold (minv) or the number of bins (cut points) reaches the requested count (num_split).
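For reference, the per-bin WOE being compared is the usual credit-scoring weight of evidence. An illustrative sketch (not the library's internal _cal_woe; the sign convention may differ):

import numpy as np

def woe_of_bin(y_bin, y_all, eps=1e-9):
    # share of all bad (y=1) and all good (y=0) samples that fall into this bin
    bad_share = (y_bin == 1).sum() / max((y_all == 1).sum(), 1)
    good_share = (y_bin == 0).sum() / max((y_all == 0).sum(), 1)
    # WOE = ln(bad share / good share); eps guards against log(0)
    return np.log((bad_share + eps) / (good_share + eps))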

from autoBinning.utils.forwardSplit import *
t = forwardSplit(df['Age'], df['target'])
t.fit(sby='woe',minv=0.01,init_split=20)
print(t.bins) # [16. 25. 29. 33. 36. 38. 40. 42. 44. 46. 48. 50. 52. 54. 55. 58. 60. 63. 72. 94.]
t = forwardSplit(df['Age'], df['target'])
t.fit(sby='woe',num_split=4,init_split=20)
print(t.bins) # [16. 42. 44. 48. 50. 94.]
print("bin\twoe")
for i in range(len(t.bins)-1):
    v = t.value[(t.x < t.bins[i+1]) & (t.x >= t.bins[i])]
    woe = t._cal_woe(v)
    print((t.bins[i], t.bins[i+1]),woe)

bin	woe
(16.0, 25.0) 0.11373232830301286
(25.0, 42.0) 0.07217546872710079
(42.0, 50.0) 0.04972042405868509
(50.0, 72.0) -0.07172614369435065
(72.0, 94.0) -0.13778318584223453
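The fitted edges can be applied back to the feature with plain pandas; a sketch (not an autoBinning API):

import pandas as pd

# pd.cut builds right-closed intervals by default, while the loop above uses
# [lower, upper); pick whichever convention your scoring code expects.
age_bin = pd.cut(df['Age'], bins=t.bins, include_lowest=True)
print(age_bin.value_counts().sort_index())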


Splitting by maximum IV

Similar to maximum-WOE splitting: after building a set of candidate bins that is as fine-grained as possible, find the cut point with the largest IV and record the WOE trend, then iteratively pick the next cut point with the largest IV whose WOE trend is consistent, until the number of bins (cut points) meets the requirement.
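The IV score of a candidate binning can be written in terms of the same good/bad shares as the WOE above; an illustrative sketch (autoBinning's own scoring may differ in detail):

import numpy as np

def iv_of_bins(y_bins, y_all, eps=1e-9):
    # y_bins: list of label arrays, one per bin; y_all: all labels
    bad_total = max((y_all == 1).sum(), 1)
    good_total = max((y_all == 0).sum(), 1)
    iv = 0.0
    for y_bin in y_bins:
        bad_share = (y_bin == 1).sum() / bad_total
        good_share = (y_bin == 0).sum() / good_total
        woe = np.log((bad_share + eps) / (good_share + eps))
        iv += (bad_share - good_share) * woe  # per-bin IV contribution
    return iv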

from autoBinning.utils.forwardSplit import *
# sby='woeiv' takes the WOE trend into account; sby='iv' does not
t = forwardSplit(df['Age'], df['target'])
t.fit(sby='iv',minv=0.1,init_split=20)
print(t.bins) # [16. 25. 29. 33. 36. 38. 40. 42. 44. 46. 48. 50. 58. 60. 63. 94.]
t = forwardSplit(df['Age'], df['target'])
t.fit(sby='iv',num_split=4,init_split=20)
print(t.bins) # [16. 25. 33. 36. 38. 94.]
t.fit(sby='woeiv',num_split=4,init_split=20)
print(t.bins) # [16. 25. 33. 36. 38. 94.]

print("bin\twoe")
for i in range(len(t.bins)-1):
    v = t.value[(t.x < t.bins[i+1]) & (t.x >= t.bins[i])]
    woe = t._cal_woe(v)
    print((t.bins[i], t.bins[i+1]),woe)

bin	woe
(16.0, 25.0) 0.11373232830301286
(25.0, 33.0) 0.06679187564362839
(33.0, 40.0) 0.06638021747875023
(40.0, 50.0) 0.05894173616389541
(50.0, 94.0) -0.07934608583946329

# categorical example (assumes the data also contains a categorical 'Branch' column)
t = forwardSplit(df['Branch'], df['target'], missing=-1, categorical=True)
t.fit(sby='woeiv', minv=0, init_split=0, num_split=4)
print(t.bins) # [['B19'], ['B15'], ['B14'], ['B16'], ['B7', 'B18', 'B2', 'B9', 'B5', 'B6', 'B1', 'B17', 'B4', 'B10', 'B8', 'B3', 'B12', 'B13', 'B11']]
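For categorical features t.bins is a list of category groups, as printed above. A sketch (not an autoBinning API) of turning those groups into an encoder for new data:

# hypothetical encoding step: map each category label to its group index
group_of = {cat: g for g, group in enumerate(t.bins) for cat in group}
branch_group = df['Branch'].map(group_of)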

Backward iterative approach (backward method)

Merging bins by maximum IV

Iteratively remove one cut point per step, choosing the cut point whose removal leaves the largest overall IV.
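In pseudocode terms the loop looks roughly like this (a sketch, not autoBinning's implementation; total_iv stands for any function that scores a complete binning):

def backward_merge(cuts, total_iv, num_split):
    # drop one interior cut point per step, keeping the removal that
    # leaves the highest overall IV, until only num_split bins remain
    cuts = list(cuts)
    while len(cuts) - 1 > num_split:
        best = max(range(1, len(cuts) - 1),
                   key=lambda i: total_iv(cuts[:i] + cuts[i + 1:]))
        cuts.pop(best)
    return cuts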

from autoBinning.utils.backwardSplit import *
t = backwardSplit(df['Age'], df['target'])
t.fit(sby='iv',num_split=5)
print(t.bins) # [16.  17.5 18.5 85.5 95. ]

Merging bins based on the chi-square test (ChiMerge)

1. Initialize by splitting the input range into sub-intervals that are as fine-grained as possible (ideally one per distinct value).

2. For every pair of adjacent sub-intervals, compute the chi-square statistic of their good/bad counts.

3. Merge the pair with the lowest chi-square value into a single bin.

4. Repeat steps 2 and 3 until the number of bins (cut points) reaches the predefined threshold.
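A sketch of step 2, using scipy on the 2x2 good/bad table of two adjacent bins (illustrative only; autoBinning may compute the statistic differently):

import numpy as np
from scipy.stats import chi2_contingency

def chi2_of_adjacent(y_left, y_right):
    # rows: the two adjacent bins; columns: good (y=0) and bad (y=1) counts
    # assumes neither bin is empty and both classes occur overall
    table = np.array([[(y_left == 0).sum(), (y_left == 1).sum()],
                      [(y_right == 0).sum(), (y_right == 1).sum()]])
    stat, p, dof, expected = chi2_contingency(table, correction=False)
    return stat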

from autoBinning.utils.backwardSplit import *
t = backwardSplit(df['Age'], df['target'])
t.fit(sby='chi',num_split=7)
print(t.bins) # [16.  72.5 73.5 87.5 89.5 90.5 95. ]

Backward equal-frequency binning based on Spearman correlation

from autoBinning.utils.backwardSplit import *
t = backwardSplit(df['Age'], df['target'])
t.fit_by_spearman(min_v=5, init_split=20)
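The classic form of this idea keeps shrinking the number of equal-frequency buckets until the bucket-level target rate is monotonic in the feature. A sketch under the assumption that min_v is the smallest bucket count to fall back to (not autoBinning's actual code):

import pandas as pd
from scipy.stats import spearmanr

def monotonic_qcut(x, y, init_split=20, min_v=5):
    # reduce the bucket count until mean(x) and mean(y) per bucket are
    # perfectly rank-correlated, i.e. the target rate is monotonic in x
    for n in range(init_split, min_v - 1, -1):
        buckets = pd.qcut(x, n, duplicates='drop')
        g = pd.DataFrame({'x': x, 'y': y}).groupby(buckets, observed=True)
        r, _ = spearmanr(g['x'].mean(), g['y'].mean())
        if abs(r) == 1:
            break
    return buckets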

