
A customizable automated modeling toolkit for risk control

Project description

Automated Modeling

It is recommended to finish data cleaning and preprocessing before running automated modeling.

Two model architectures are currently supported: scorecard and XGB. To add another architecture, simply inherit from the AutomatedModeling class and implement the corresponding interface.

Scorecard

Scorecard automated modeling workflow:

  1. Split the data into train, validation, and oot sets
  2. Pre-screen the raw features by information value (IV) and correlation to determine the candidate model variables
  3. Apply monotonic binning to the candidate variables and WOE-transform them according to the binning results
  4. Remove the less stable variables from the candidates
  5. Remove the low-IV variables from the stable ones
  6. Run stepwise regression on the remaining variables to determine the final model variables
  7. Fit the scorecard on the final model variables
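As an illustration of steps 2-3, the IV and WOE of a single binned feature can be computed as follows. This is a minimal self-contained sketch, not the toolkit's implementation; the bin counts are made up, and the WOE sign convention (ln of %good over %bad) is one common choice:

```python
import math
from typing import List, Tuple

def woe_iv(bins: List[Tuple[int, int]]) -> Tuple[List[float], float]:
    """Compute per-bin WOE and total IV from (good_count, bad_count) pairs."""
    total_good = sum(g for g, b in bins)
    total_bad = sum(b for g, b in bins)
    woes, iv = [], 0.0
    for good, bad in bins:
        good_pct = good / total_good
        bad_pct = bad / total_bad
        woe = math.log(good_pct / bad_pct)  # WOE = ln(%good / %bad)
        iv += (good_pct - bad_pct) * woe    # IV accumulates over bins
        woes.append(woe)
    return woes, iv

# three bins of (good, bad) counts for one hypothetical feature
woes, iv = woe_iv([(200, 10), (150, 30), (50, 60)])
```

A feature whose IV falls below the screening threshold (0.02 by default in this toolkit) would be dropped at step 2.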

Example call for scorecard automated modeling:

"""
注意,`target`和`time`都是`data`中字段,其中`target`列中的值必须是0-1变量,`time`列中的值必须是 datetime.date 对象
run 方法中的 split_time 参数也必须传入一个 datetime.date 对象
"""
model: AutomatedScoreCard = AutomatedScoreCard(data=data, target=target, time="time", not_features=not_features) # 传入数据以实例化一个 AutomatedScoreCard 对象
model.run(split_time=critical_date, split_validation_set=False) # 调用对象的 run 方法
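The datetime.date requirement above can be met with a plain-stdlib conversion, for example (the record layout and date format here are illustrative assumptions):

```python
from datetime import date, datetime

# raw records with string dates, as they might come out of a CSV export
rows = [{"time": "2023-01-15", "target": 0}, {"time": "2023-06-30", "target": 1}]

# convert each string into a datetime.date, as the toolkit requires
for row in rows:
    row["time"] = datetime.strptime(row["time"], "%Y-%m-%d").date()

critical_date = date(2023, 6, 1)  # candidate split_time between train and oot
```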

How do I exercise my own judgment?

Skip the direct call to the run method, execute the code inside run by hand, and reshape it to your needs:

def run(self, split_time: date, split_validation_set: bool = False, validation_pct: float = 0.2,
            empty: float = 0.9, origin_iv: float = 0.02, corr: float = 0.7,
            max_n_bins: int = 5,
            psi_threshold: float = 0.1, psi_original: bool = True,
            woe_iv: float = 0.02, eliminate_low_iv_oot: bool = True,
            estimator: str = "ols", direction: str = "both", criterion: str = "aic",
            model_score: str = "model_score",
            n_bins: int = 50
            ) -> None:
        """Main entry point for automated modeling.

        Args:
            split_time (date): Cutoff date separating train from oot.
            split_validation_set (bool, optional): Whether to randomly split a validation set out of train. Defaults to False.
            validation_pct (float, optional): Fraction of the training set used for validation. Defaults to 0.2.
            empty (float, optional): Missing-value ratio threshold for the initial screening. Defaults to 0.9.
            origin_iv (float, optional): IV threshold for the initial screening. Defaults to 0.02.
            corr (float, optional): Correlation coefficient threshold for the initial screening. Defaults to 0.7.
            max_n_bins (int, optional): Maximum number of bins for monotonic binning of candidate variables. Defaults to 5.
            psi_threshold (float, optional): PSI threshold for removing unstable variables. Defaults to 0.1.
            psi_original (bool, optional): Whether to compute PSI on the raw data (True) or the discretized data (False). Defaults to True.
            woe_iv (float, optional): IV threshold for the post-discretization IV screening. Defaults to 0.02.
            eliminate_low_iv_oot (bool, optional): Whether to also drop variables whose post-discretization IV is low on the oot set. Defaults to True.
            estimator (str, optional): Model used for fitting, one of ["ols" | "lr" | "lasso" | "ridge"]. Defaults to "ols".
            direction (str, optional): Stepwise regression direction, one of ["forward" | "backward" | "both"]. Defaults to "both".
            criterion (str, optional): Selection criterion, one of ["aic" | "bic" | "ks" | "auc"]. Defaults to "aic".
            model_score (str, optional): Column name for the model score. Defaults to "model_score".
            n_bins (int, optional): Number of bins used when computing the model score PSI. Defaults to 50.
        """
        self.split_train_oot(split_time=split_time, split_validation_set=split_validation_set, validation_pct=validation_pct)  # split train, validation and oot sets
        print(f"Train set: {self.train.shape[0]} samples, bad_rate {self.train[self.target].mean()}; validation set: {self.validation.shape[0]} samples, bad_rate {self.validation[self.target].mean()}; oot set: {self.oot.shape[0]} samples, bad_rate {self.oot[self.target].mean()}")
        self.initial_screening(empty=empty, iv=origin_iv, corr=corr)  # initial IV and correlation screening on the raw features
        print(f"After IV and correlation screening on the raw data, {len(self.latent_features)} variables remain: {self.latent_features}")
        self.monotonic_trend_binning(max_n_bins=max_n_bins)  # monotonic binning of the candidate variables, followed by WOE transformation
        stable_features: List[str] = self.eliminate_unstable_features(selected_features=self.latent_features, psi_threshold=psi_threshold, psi_original=psi_original)  # drop the less stable candidate variables
        print(f"After removing unstable variables, {len(stable_features)} candidates remain: {stable_features}")
        # from the stable variables, further drop those with low IV
        selected_features: List[str] = self.eliminate_low_iv_features(selected_features=stable_features, iv=woe_iv, train_validation_oot=1)
        if eliminate_low_iv_oot:  # optionally also drop variables with low post-discretization IV on the oot set
            selected_features = self.eliminate_low_iv_features(selected_features=selected_features, iv=woe_iv, train_validation_oot=0)
        print(f"After dropping low-IV variables from the stable ones, {len(selected_features)} remain: {selected_features}")
        used_features: List[str] = self.stepwise_after_woe_transformer(selected_features=selected_features, estimator=estimator, direction=direction, criterion=criterion)  # stepwise regression to determine the final model variables
        print(f"Stepwise regression selected {len(used_features)} model variables: {used_features}")
        self.fit(used_features=used_features, model_score=model_score)  # fit the scorecard
        evaluation: Dict[str, float] = self.evaluate(n_bins=n_bins)  # model evaluation metrics
        print(f"Train KS: {evaluation['train_ks']}, AUC: {evaluation['train_auc']}; validation KS: {evaluation['validation_ks']}, AUC: {evaluation['validation_auc']}; oot KS: {evaluation['oot_ks']}, AUC: {evaluation['oot_auc']}; model score PSI: {evaluation['model_psi']}")
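The stepwise selection under an AIC criterion that run performs can be sketched on synthetic data. This is a self-contained illustration of forward stepwise OLS, not the toolkit's stepwise_after_woe_transformer; all names and data are made up:

```python
import numpy as np

def aic_ols(X: np.ndarray, y: np.ndarray) -> float:
    """AIC of an OLS fit with intercept: n*ln(RSS/n) + 2k."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    rss = float(np.sum((y - Xd @ beta) ** 2))
    return len(y) * np.log(rss / len(y)) + 2 * Xd.shape[1]

def forward_stepwise(X: np.ndarray, y: np.ndarray) -> list:
    """Greedily add the column that lowers AIC the most; stop when none does."""
    remaining, chosen = list(range(X.shape[1])), []
    best = np.inf
    while remaining:
        score, j = min((aic_ols(X[:, chosen + [j]], y), j) for j in remaining)
        if score >= best:
            break
        best = score
        chosen.append(j)
        remaining.remove(j)
    return chosen

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 2 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.1, size=200)  # only columns 0 and 2 carry signal
selected = forward_stepwise(X, y)
```

The "backward" and "both" directions work analogously, removing (or removing and re-adding) variables while the criterion improves.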

Use the adjust_binning_rules method of the AutomatedScoreCard object to bin manually. Note that manual binning is only allowed on the candidate model variables (the latent_features attribute of the AutomatedScoreCard object). If you want to intervene directly in the final model variables, just modify the used_features argument passed to the fit method.

XGB

XGB automated modeling workflow:

  1. Split the data into train, validation, and oot sets
  2. Remove the less stable variables
  3. Remove the low-IV variables
  4. Feed the remaining variables to XGB for fitting
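The stability screen in step 2 (also used by the scorecard flow) compares a feature's binned distribution between train and oot via PSI. A minimal sketch with made-up bin counts:

```python
import math
from typing import List

def psi(expected_counts: List[int], actual_counts: List[int]) -> float:
    """PSI = sum over bins of (p_expected - p_actual) * ln(p_expected / p_actual)."""
    te, ta = sum(expected_counts), sum(actual_counts)
    value = 0.0
    for e, a in zip(expected_counts, actual_counts):
        pe, pa = e / te, a / ta
        value += (pe - pa) * math.log(pe / pa)
    return value

stable = psi([100, 200, 300], [110, 190, 300])   # similar distributions -> small PSI
shifted = psi([100, 200, 300], [300, 200, 100])  # reversed distribution -> large PSI
```

A feature whose PSI exceeds the threshold (0.1 by default in this toolkit, matching the common rule of thumb) would be dropped as unstable.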

Example call for XGB automated modeling:

"""
注意,`target`和`time`都是`data`中字段,其中`target`列中的值必须是0-1变量,`time`列中的值必须是 datetime.date 对象
run 方法中的 split_time 参数也必须传入一个 datetime.date 对象
"""
model: AutomatedXGBoost = AutomatedXGBoost(data=data, target=target, time="time", not_features=not_features) # 传入数据以实例化一个 AutomatedScoreCard 对象
model.run(split_time=critical_date, split_validation_set=False) # 调用对象的 run 方法

You can inspect the default xgb training parameters through the static attribute params of AutomatedXGBoost.

PS: AutomatedXGBoost uses the native xgboost API, not the scikit-learn interface.


