KMDS Featurization Service and Orchestration Pipeline
KMDS Featurization
This repository provides a configurable, stage-based featurization engine for SBA modeling workflows.
The design goals are simple:
- keep stage logic understandable and composable
- keep orchestration/configuration centralized
- keep modeling flow leakage-safe (fit on train only, reuse on val/active)
What This Produces
The pipeline writes two CSV outputs:
- featurized_data.csv: consolidated engineered dataset (modeled + active partitions)
- model_ready_numeric_data.csv: numeric model-ready export from the final stage output
Additional diagnostic artifact:
- feature_selection_knee_curve.png: ranked feature-importance knee plot saved in the featurization output directory
For the current SBA flow, the model-ready dataset is:
- numeric/bool only
- train-fitted feature-selected
- schema-aligned across train/val/active
- persisted with index=False (no index artifact column)
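The numeric/bool and index=False properties above can be illustrated with a minimal pandas sketch. The column names and the in-memory buffer are hypothetical, chosen only for illustration; the actual export logic lives in the pipeline's final stage and may differ.

```python
import io

import pandas as pd

# Hypothetical final-stage output, indexed by record_id.
final = pd.DataFrame(
    {
        "loan_amount": [1000.0, 2500.0],
        "is_franchise": [True, False],
        "state": ["CA", "NY"],  # non-numeric, so it is dropped below
    },
    index=pd.Index([101, 102], name="record_id"),
)

# Keep numeric and boolean columns only.
model_ready = final.select_dtypes(include=["number", "bool"])

# index=False: the index is not written, so no index artifact column appears.
buffer = io.StringIO()
model_ready.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

Writing to a file path instead of a buffer works the same way; the buffer just keeps the sketch free of side effects.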
Core Concepts
- Anchor index: record_id
- Stage contract: method(context, stage_cfg) -> DataFrame
- Waterfall behavior: each stage can shrink survivor rows by index
- Horizontal feature assembly: stage outputs are concatenated by index
- Controlled index expansion: only stages marked allow_new_indices may re-introduce rows
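The concepts above can be sketched as a tiny runner. This is an illustrative sketch, not the package's orchestration code: the shape of `context` (a dict holding a `frame`), the `run_stages` helper, and the two example stages are all assumptions. Only the `method(context, stage_cfg) -> DataFrame` contract, the `record_id` index, and the `allow_new_indices` flag come from the description above.

```python
import pandas as pd


def run_stages(base, stages):
    """Run (method, stage_cfg) pairs and assemble their outputs by index."""
    survivors = base.index
    outputs = []
    for method, stage_cfg in stages:
        context = {"frame": base.loc[survivors]}  # hypothetical context shape
        out = method(context, stage_cfg)
        if not stage_cfg.get("allow_new_indices", False):
            # Controlled expansion: drop any index the stage tried to add.
            out = out.loc[out.index.intersection(survivors)]
        # Waterfall: the stage output defines the new survivor set.
        survivors = out.index
        outputs.append(out)
    # Horizontal assembly: align every stage output on the survivors.
    return pd.concat([o.reindex(survivors) for o in outputs], axis=1)


def amount_features(context, stage_cfg):
    """Hypothetical stage: derive a feature, keep all rows."""
    frame = context["frame"]
    return frame[["amount"]].assign(amount_k=frame["amount"] / 1000)


def keep_large(context, stage_cfg):
    """Hypothetical stage: shrink the survivor rows by a threshold."""
    frame = context["frame"]
    kept = frame[frame["amount"] >= stage_cfg["min_amount"]]
    return kept[["term"]]


base = pd.DataFrame(
    {"amount": [500, 5000, 9000], "term": [12, 60, 84]},
    index=pd.Index([1, 2, 3], name="record_id"),
)
assembled = run_stages(
    base, [(amount_features, {}), (keep_large, {"min_amount": 1000})]
)
```

Here the second stage drops record 1, so the assembled frame contains only records 2 and 3, with columns from both stages aligned on `record_id`.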
Pipeline Layout (Current Hybrid Design)
Front section (feature assembly):
- record_id_definition
- borrower_geo_coding
- prepare_categorical_data
- prepare_numerical_data
- merge_categorical_and_numerical
- merge_with_borrower_geo
Merge stage design:
- package component: src/tabular/merge_ops.py
- user wrappers: featurization_scripts/featurization.py
- merge key: record_id index
Leakage-safe modeling section (stages 7-14):
- low_count_featurization_of_cat_vars
- hierarchical_low_count_var_encoding
- loan_status_recoding
- filter_modeling_universe
- stratified_train_val_split
- target_encode_categorical_vars
- harmonize_and_project_feature_space
- merge_modeled_and_active_partitions
Current encoding rule:
- if both raw and rarity-corrected categorical variants exist (x and x_rcs), only x_rcs is target-encoded
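The rule above can be expressed as a small column filter. This is a sketch under assumptions: the helper name `columns_to_target_encode` is hypothetical, and the handling of columns that exist in only one variant (they are kept) is a guess, since the rule only covers the case where both variants exist.

```python
def columns_to_target_encode(categorical_columns, suffix="_rcs"):
    """Return the categorical columns to encode, preferring x_rcs over x."""
    available = set(categorical_columns)
    selected = []
    for col in categorical_columns:
        if not col.endswith(suffix) and col + suffix in available:
            continue  # raw variant is superseded by its rarity-corrected twin
        selected.append(col)
    return selected


# Hypothetical column names for illustration.
cols = ["state", "state_rcs", "naics", "bank_rcs"]
selected = columns_to_target_encode(cols)
# selected == ["state_rcs", "naics", "bank_rcs"]
```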
Tree-Based Feature Selection
Feature selection runs in harmonize_and_project_feature_space using train rows only.
Supported selector modes:
- threshold
- tree_ensemble
Supported tree models:
- gbm
- random_forest
- xgboost (optional dependency)
All selector choices are config-driven via featurizer_config.yaml and surfaced through PathCoordinator (no stage-level hardcoded constants).
Feature-count tuning for kneedle mode:
- FEATURE_SELECTION_TOP_K_MODE: kneedle
- FEATURE_SELECTION_TOP_K_MIN_RATIO: conservative default floor, e.g. 0.5
- FEATURE_SELECTION_MIN_FEATURE_COUNT: hard floor for retained features
- FEATURE_SELECTION_TARGET_FEATURE_COUNT: explicit count override when the curve is too aggressive
- FEATURE_SELECTION_REQUIRE_KNEEDLE: fail loudly if the knee cannot be determined
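One way these settings could interact is sketched below. Everything here is an assumption made for illustration: the knee detector is a simplified farthest-from-chord heuristic standing in for Kneedle, the function names are hypothetical, and the precedence (explicit target count first, then knee, then floors) and the fallback when no knee is found are guesses, not the package's documented behavior. The keyword arguments loosely mirror the config keys above.

```python
import math


def knee_index(importances):
    """Index of the point farthest below the chord joining the curve ends.

    A simplified stand-in for Kneedle, not the library algorithm itself.
    Expects importances sorted in descending order. Returns None when the
    curve is too short or never sags below the chord.
    """
    n = len(importances)
    if n < 3 or importances[0] == importances[-1]:
        return None
    first, last = importances[0], importances[-1]
    best_i, best_gap = None, 0.0
    for i, value in enumerate(importances):
        chord = first + (last - first) * i / (n - 1)
        gap = chord - value  # positive where the curve sags below the chord
        if gap > best_gap:
            best_i, best_gap = i, gap
    return best_i


def choose_top_k(importances, min_ratio=0.5, min_count=1,
                 target_count=None, require_knee=False):
    """Pick how many ranked features to keep (precedence is an assumption)."""
    n = len(importances)
    if target_count is not None:
        return min(target_count, n)  # explicit override wins
    knee = knee_index(importances)
    if knee is None:
        if require_knee:
            raise ValueError("knee could not be determined")
        return n  # assumed fallback: keep everything
    k = knee + 1  # keep features up to and including the knee
    floor = max(math.ceil(min_ratio * n), min_count)
    return min(max(k, floor), n)


# Hypothetical ranked importances, sorted in descending order.
ranked = [0.40, 0.25, 0.12, 0.08, 0.06, 0.04, 0.03, 0.02]
top_k = choose_top_k(ranked)
```

With these numbers the knee falls at the third feature, but the 0.5 ratio floor raises the retained count to 4 of 8, which shows why a conservative floor matters when the curve is steep.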
Repository Organization
- src/featurization/core: orchestration, configuration bootstrap, path resolution
- src/featurization/transforms: reusable transformation primitives
- src/tabular: reusable tabular feature modules (encoding, splitting, feature space)
- src/tabular/merge_ops.py: reusable index-aligned tabular merge helper
- tests: package-level smoke and behavior checks
- documents: architecture and configuration references
Package Component Buckets
The tabular package modules are intentionally split into two modeling buckets, plus a separate assembly bucket:
- Row-selection components:
  - src/tabular/modeling_filter.py
  - src/tabular/train_val_split.py
  - Purpose: decide which records participate in training and how records are partitioned.
- Column-selection components:
  - src/tabular/feature_space.py
  - src/tabular/target_encoding.py
  - src/tabular/low_count_cat_var_encoding.py
  - src/tabular/hierarchical_low_count_var_encoding.py
  - Purpose: decide which feature columns are engineered, selected, encoded, and projected.
- Assembly components:
  - src/tabular/merge_ops.py
  - Purpose: index-aligned horizontal composition of prepared payloads.
CLI
Initialize config:
featurization-cli init \
  --working-dir /path/to/workspace \
  --metadata-file sba_loans_metadata_table.csv \
  --data-file sba_loans_user_cleaned.csv
Run pipeline:
featurization-cli run --working-dir /path/to/workspace
Run smoke test in this repo:
pytest -q tests/test_sba_pipeline.py
How To Extend Safely
- Add reusable logic in src/tabular first whenever possible.
- Keep stage wrappers in workspace featurization_scripts/featurization.py thin and explicit.
- Add new tunables to:
  - featurizer_config.yaml
  - src/featurization/core/path_coordinator.py
  - src/featurization/core/featurization_init.py
- Preserve leakage rules:
  - fit artifacts on train only
  - transform val/active using train-fitted artifacts
- Validate with tests after each change.
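The leakage rules in the list above follow a fit-then-apply pattern, sketched here with a plain mean target encoder. This is a simplification for illustration only: the helper names, the column names, and the unsmoothed mean with a global fallback are assumptions, and the package's encoder in src/tabular/target_encoding.py may work differently.

```python
import pandas as pd


def fit_target_means(train, column, target):
    """Learn per-category target means and a global fallback from train only."""
    return {
        "means": train.groupby(column)[target].mean(),
        "fallback": train[target].mean(),
    }


def apply_target_means(frame, column, artifact):
    """Encode any partition with the train-fitted artifact."""
    encoded = frame[column].map(artifact["means"])
    # Categories unseen in train fall back to the train-wide mean.
    return encoded.fillna(artifact["fallback"])


# Hypothetical partitions; val never contributes to the fitted artifact.
train = pd.DataFrame(
    {"state_rcs": ["CA", "CA", "NY", "NY"], "default": [1, 0, 0, 0]}
)
val = pd.DataFrame({"state_rcs": ["CA", "TX"]})

artifact = fit_target_means(train, "state_rcs", "default")
val_encoded = apply_target_means(val, "state_rcs", artifact)
```

The same `artifact` would be reused for the active partition, so neither val nor active rows influence the fitted values.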
Recommended Read Order
- documents/sba_pipeline_featurization.md
- documents/config_blueprint.md
- documents/path_coordinator_function.md
- src/featurization/core/sequential_pipeline_runner.py
- src/tabular/feature_space.py