
KMDS Featurization Service and Orchestration Pipeline


KMDS Featurization

This repository provides a configurable, stage-based featurization engine for SBA modeling workflows.

The design goals are simple:

  • keep stage logic understandable and composable
  • keep orchestration/configuration centralized
  • keep modeling flow leakage-safe (fit on train only, reuse on val/active)

What This Produces

The pipeline writes two CSV outputs:

  • featurized_data.csv: consolidated engineered dataset (modeled + active partitions)
  • model_ready_numeric_data.csv: numeric model-ready export from the final stage output

Additional diagnostic artifact:

  • feature_selection_knee_curve.png: ranked feature-importance knee plot saved in the featurization output directory

For the current SBA flow, the model-ready dataset is:

  • numeric/bool only
  • feature-selected with a selector fitted on train rows only
  • schema-aligned across train/val/active
  • persisted with index=False (no index artifact column)
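
The four properties above can be illustrated with a small pandas sketch. This is not the package's export code: the helper name to_model_ready, the example column names, and the choice to fill train-only columns with 0 are all assumptions made for illustration.

```python
import io

import pandas as pd


def to_model_ready(frame: pd.DataFrame, train_columns: list[str]) -> pd.DataFrame:
    """Keep numeric/bool columns, then align to a train-fitted column list."""
    numeric = frame.select_dtypes(include=["number", "bool"])
    # reindex drops columns the train selection did not keep and adds
    # missing ones; filling them with 0 is an assumption of this sketch.
    return numeric.reindex(columns=train_columns, fill_value=0)


# Hypothetical validation partition, anchored on record_id.
val = pd.DataFrame(
    {"loan_amount": [1000.0, 2500.0], "is_urban": [True, False], "state": ["CA", "TX"]},
    index=pd.Index([11, 12], name="record_id"),
)
# Hypothetical feature list selected on train rows only.
train_columns = ["loan_amount", "is_urban", "term_months"]

aligned = to_model_ready(val, train_columns)

buffer = io.StringIO()
aligned.to_csv(buffer, index=False)  # index=False: no index artifact column
header = buffer.getvalue().splitlines()[0]
```

Applying the same train_columns list to every partition is what keeps train, val, and active exports schema-aligned.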

Core Concepts

  • Anchor index: record_id
  • Stage contract: method(context, stage_cfg) -> DataFrame
  • Waterfall behavior: each stage can shrink survivor rows by index
  • Horizontal feature assembly: stage outputs are concatenated by index
  • Controlled index expansion: only stages marked allow_new_indices may re-introduce rows
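
A minimal sketch of how these concepts could fit together, assuming a dictionary-based context and a simple runner loop. The run_stages function, the context keys, and the stage bodies are hypothetical; only the method(context, stage_cfg) -> DataFrame contract, the record_id anchor, and the allow_new_indices flag come from this README.

```python
import pandas as pd


def run_stages(raw, stages):
    """Run stages in order and assemble their outputs side by side."""
    context = {"raw": raw, "survivors": raw.index}
    pieces = []
    for name, method, stage_cfg in stages:
        out = method(context, stage_cfg)  # stage contract: returns a DataFrame
        new_rows = out.index.difference(context["survivors"])
        if stage_cfg.get("allow_new_indices", False):
            # Controlled expansion: this stage may re-introduce rows.
            context["survivors"] = out.index
        elif len(new_rows) > 0:
            raise ValueError(f"stage {name!r} introduced unexpected record_id values")
        else:
            # Waterfall: survivors can only shrink.
            context["survivors"] = context["survivors"].intersection(out.index)
        pieces.append(out)
    survivors = context["survivors"]
    # Horizontal assembly: concatenate stage outputs on the shared index.
    return pd.concat([piece.reindex(survivors) for piece in pieces], axis=1)


def categorical_stage(context, stage_cfg):
    return context["raw"].loc[context["survivors"], ["state"]]


def numeric_stage(context, stage_cfg):
    rows = context["raw"].loc[context["survivors"]]
    return rows.loc[rows["amount"] >= stage_cfg["min_amount"], ["amount"]]


raw = pd.DataFrame(
    {"amount": [50, 500, 900], "state": ["CA", "TX", "NY"]},
    index=pd.Index([1, 2, 3], name="record_id"),
)
result = run_stages(
    raw,
    [
        ("prepare_categorical_data", categorical_stage, {}),
        ("prepare_numerical_data", numeric_stage, {"min_amount": 100}),
    ],
)
```

Here the numeric stage drops record_id 1, so the assembled frame keeps only records 2 and 3 even though the categorical stage returned all three.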

Pipeline Layout (Current Hybrid Design)

Front section (feature assembly):

  1. record_id_definition
  2. borrower_geo_coding
  3. prepare_categorical_data
  4. prepare_numerical_data
  5. merge_categorical_and_numerical
  6. merge_with_borrower_geo

Merge stage design:

  • package component: src/tabular/merge_ops.py
  • user wrappers: featurization_scripts/featurization.py
  • merge key: record_id index

Leakage-safe modeling section:

  7. low_count_featurization_of_cat_vars
  8. hierarchical_low_count_var_encoding
  9. loan_status_recoding
  10. filter_modeling_universe
  11. stratified_train_val_split
  12. target_encode_categorical_vars
  13. harmonize_and_project_feature_space
  14. merge_modeled_and_active_partitions

Current encoding rule:

  • if both raw and rarity-corrected categorical variants exist (x and x_rcs), only x_rcs is target-encoded
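
The rule can be sketched as a small column filter. The function name is hypothetical, and the sketch assumes that a column with no counterpart (raw only, or _rcs only) is still encoded, which the rule above does not state explicitly.

```python
def select_target_encoding_columns(categorical_columns, suffix="_rcs"):
    """Drop a raw column x whenever its rarity-corrected variant x_rcs exists."""
    columns = list(categorical_columns)
    present = set(columns)
    selected = []
    for col in columns:
        if not col.endswith(suffix) and col + suffix in present:
            continue  # raw variant is skipped; only x_rcs is encoded
        selected.append(col)
    return selected


# Hypothetical column names for illustration.
to_encode = select_target_encoding_columns(["state", "state_rcs", "naics", "bank_rcs"])
# to_encode == ["state_rcs", "naics", "bank_rcs"]
```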

Tree-Based Feature Selection

Feature selection runs in harmonize_and_project_feature_space using train rows only.

Supported selector modes:

  • threshold
  • tree_ensemble

Supported tree models:

  • gbm
  • random_forest
  • xgboost (optional dependency)

All selector choices are config-driven via featurizer_config.yaml and surfaced through PathCoordinator (no stage-level hardcoded constants).
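
As an illustration of config-driven selection, the sketch below validates selector settings read from a config mapping. The key names FEATURE_SELECTION_MODE and FEATURE_SELECTION_TREE_MODEL, the defaults, and the resolve_selector helper are assumptions of this sketch; the actual keys live in featurizer_config.yaml and may differ.

```python
import importlib.util

SUPPORTED_MODES = {"threshold", "tree_ensemble"}
SUPPORTED_TREE_MODELS = {"gbm", "random_forest", "xgboost"}


def resolve_selector(config):
    """Validate selector settings from a config mapping (hypothetical keys)."""
    mode = config.get("FEATURE_SELECTION_MODE", "tree_ensemble")
    if mode not in SUPPORTED_MODES:
        raise ValueError(f"unsupported selector mode: {mode!r}")
    if mode == "threshold":
        return {"mode": mode, "tree_model": None}

    model = config.get("FEATURE_SELECTION_TREE_MODEL", "random_forest")
    if model not in SUPPORTED_TREE_MODELS:
        raise ValueError(f"unsupported tree model: {model!r}")
    # xgboost is an optional dependency, so check for it before using it.
    if model == "xgboost" and importlib.util.find_spec("xgboost") is None:
        raise ImportError("xgboost is not installed; choose gbm or random_forest")
    return {"mode": mode, "tree_model": model}


selector = resolve_selector(
    {"FEATURE_SELECTION_MODE": "tree_ensemble", "FEATURE_SELECTION_TREE_MODEL": "gbm"}
)
```

Validating in one place keeps stage code free of hardcoded constants, which is the intent described above.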

Feature-count tuning for kneedle mode:

  • FEATURE_SELECTION_TOP_K_MODE: kneedle
  • FEATURE_SELECTION_TOP_K_MIN_RATIO: conservative default floor, e.g. 0.5
  • FEATURE_SELECTION_MIN_FEATURE_COUNT: hard floor for retained features
  • FEATURE_SELECTION_TARGET_FEATURE_COUNT: explicit count override when the curve is too aggressive
  • FEATURE_SELECTION_REQUIRE_KNEEDLE: fail loudly if the knee cannot be determined
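
One plausible way these tunables could interact is sketched below. The knee finder is a simplified stand-in (largest gap below the chord joining the first and last ranked importances), not the Kneedle algorithm itself, and the precedence of the settings is an assumption rather than the package's documented behavior.

```python
import math


def find_knee(ranked):
    """Index of the point farthest below the first-to-last chord, or None."""
    n = len(ranked)
    if n < 3 or ranked[0] == ranked[-1]:
        return None  # too few points or a flat curve: no knee
    first, last = ranked[0], ranked[-1]
    best_idx, best_gap = None, 0.0
    for i, value in enumerate(ranked):
        chord = first + (last - first) * i / (n - 1)
        gap = chord - value  # a convex, decreasing curve sits below the chord
        if gap > best_gap:
            best_idx, best_gap = i, gap
    return best_idx


def choose_feature_count(importances, min_ratio=0.5, min_feature_count=1,
                         target_feature_count=None, require_kneedle=False):
    """Pick how many top-ranked features to keep (assumed precedence)."""
    ranked = sorted(importances, reverse=True)
    n = len(ranked)
    if target_feature_count is not None:
        return min(target_feature_count, n)  # explicit override wins
    knee = find_knee(ranked)
    if knee is None:
        if require_kneedle:
            raise ValueError("knee could not be determined")
        knee_count = n  # fall back to keeping everything
    else:
        knee_count = knee + 1
    floor = max(math.ceil(min_ratio * n), min_feature_count)
    return min(max(knee_count, floor), n)


importances = [0.40, 0.25, 0.12, 0.08, 0.06, 0.04, 0.03, 0.02]
kept = choose_feature_count(importances)  # knee suggests 3, ratio floor lifts to 4
```

With eight features the knee alone would keep three, which is why a ratio floor or an explicit target count is useful when the curve is too aggressive.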

Repository Organization

  • src/featurization/core: orchestration, configuration bootstrap, path resolution
  • src/featurization/transforms: reusable transformation primitives
  • src/tabular: reusable tabular feature modules (encoding, splitting, feature space)
  • src/tabular/merge_ops.py: reusable index-aligned tabular merge helper
  • tests: package-level smoke and behavior checks
  • documents: architecture and configuration references

Package Component Buckets

The tabular package modules are intentionally split into two modeling buckets (row selection and column selection), plus an assembly bucket:

  • Row-selection components:

    • src/tabular/modeling_filter.py
    • src/tabular/train_val_split.py
    • Purpose: decide which records participate in training and how records are partitioned.
  • Column-selection components:

    • src/tabular/feature_space.py
    • src/tabular/target_encoding.py
    • src/tabular/low_count_cat_var_encoding.py
    • src/tabular/hierarchical_low_count_var_encoding.py
    • Purpose: decide which feature columns are engineered, selected, encoded, and projected.
  • Assembly components:

    • src/tabular/merge_ops.py
    • Purpose: index-aligned horizontal composition of prepared payloads.

CLI

Initialize config:

featurization-cli init \
  --working-dir /path/to/workspace \
  --metadata-file sba_loans_metadata_table.csv \
  --data-file sba_loans_user_cleaned.csv

Run pipeline:

featurization-cli run --working-dir /path/to/workspace

Run smoke test in this repo:

pytest -q tests/test_sba_pipeline.py

How To Extend Safely

  1. Add reusable logic in src/tabular first whenever possible.
  2. Keep stage wrappers in workspace featurization_scripts/featurization.py thin and explicit.
  3. Add new tunables to:
    • featurizer_config.yaml
    • src/featurization/core/path_coordinator.py
    • src/featurization/core/featurization_init.py
  4. Preserve leakage rules:
    • fit artifacts on train only
    • transform val/active using train-fitted artifacts
  5. Validate with tests after each change.
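
The leakage rules in step 4 can be sketched as a fit/apply pair. This is a plain mean target encoder written for illustration, not the logic in src/tabular/target_encoding.py; the function names, the _te suffix, and the global-mean fallback for unseen categories are assumptions.

```python
import pandas as pd


def fit_target_means(train, column, target):
    """Fit the encoding artifact on train rows only."""
    return {
        "means": train.groupby(column)[target].mean(),
        "fallback": float(train[target].mean()),  # used for unseen categories
    }


def apply_target_means(frame, column, artifact):
    """Transform any partition with the train-fitted artifact."""
    encoded = frame[column].map(artifact["means"])
    return encoded.fillna(artifact["fallback"]).rename(f"{column}_te")


# Hypothetical partitions anchored on record_id.
train = pd.DataFrame(
    {"state": ["CA", "CA", "TX", "TX"], "default": [1, 0, 1, 1]},
    index=pd.Index([1, 2, 3, 4], name="record_id"),
)
val = pd.DataFrame(
    {"state": ["CA", "NY"]},
    index=pd.Index([5, 6], name="record_id"),
)

artifact = fit_target_means(train, "state", "default")
val_encoded = apply_target_means(val, "state", artifact)
```

The val partition never contributes to the artifact: CA maps to the train mean of 0.5, and the unseen NY falls back to the train-wide mean of 0.75.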

Recommended Read Order

  1. documents/sba_pipeline_featurization.md
  2. documents/config_blueprint.md
  3. documents/path_coordinator_function.md
  4. src/featurization/core/sequential_pipeline_runner.py
  5. src/tabular/feature_space.py
