
KMDS Featurization Service and Orchestration Pipeline


KMDS Featurization

This repository provides a configurable, stage-based featurization engine for KMDS modeling workflows.

The design goals are simple:

  • keep stage logic understandable and composable
  • keep orchestration/configuration centralized
  • keep modeling flow leakage-safe (fit on train only, reuse on val/active)

What This Produces

The pipeline writes two CSV outputs:

  • featurized_data.csv: consolidated engineered dataset (modeled + active partitions)
  • model_ready_numeric_data.csv: numeric model-ready export from the final stage output

Additional diagnostic artifact:

  • feature_selection_knee_curve.png: ranked feature-importance knee plot saved in the featurization output directory

For the current SBA flow, the model-ready dataset is:

  • numeric/bool only
  • restricted to features selected on train rows only
  • schema-aligned across train/val/active
  • persisted with index=False (no index artifact column)
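
The properties above can be illustrated with a short pandas sketch. The helper name to_model_ready, the column names, and the choice to fill missing columns with 0 are assumptions made for illustration; the package's actual export logic may differ.

```python
import io

import pandas as pd


def to_model_ready(df: pd.DataFrame, selected: list[str]) -> pd.DataFrame:
    """Keep numeric/bool columns, then project onto the train-selected schema."""
    numeric = df.select_dtypes(include=["number", "bool"])
    # reindex enforces one column order for every partition; a column missing
    # from a partition is filled with 0 (an assumption of this sketch).
    return numeric.reindex(columns=selected, fill_value=0)


frame = pd.DataFrame(
    {"loan_amount": [100.0, 250.0], "is_urban": [True, False], "state": ["CA", "NY"]},
    index=pd.Index([11, 12], name="record_id"),
)
ready = to_model_ready(frame, selected=["loan_amount", "is_urban"])

buffer = io.StringIO()
ready.to_csv(buffer, index=False)  # index=False: record_id is not written out
csv_text = buffer.getvalue()
```

Applying the same selected list to the train, val, and active partitions is what keeps their schemas aligned.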

Core Concepts

  • Anchor index: record_id
  • Stage contract: method(context, stage_cfg) -> DataFrame
  • Waterfall behavior: each stage can shrink survivor rows by index
  • Horizontal feature assembly: stage outputs are concatenated by index
  • Controlled index expansion: only stages marked allow_new_indices may re-introduce rows
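
A minimal sketch of how these concepts could fit together, assuming pandas. The stage functions, the context dictionary layout, and the check_indices guard are hypothetical names used only for illustration; the real runner lives in src/featurization/core.

```python
import pandas as pd


def prepare_numeric(context: dict, stage_cfg: dict) -> pd.DataFrame:
    """Hypothetical stage: drops rows with a null amount (shrinks survivors)."""
    raw = context["raw"]
    return raw.loc[raw["amount"].notna(), ["amount"]]


def prepare_categorical(context: dict, stage_cfg: dict) -> pd.DataFrame:
    """Hypothetical stage: passes the categorical column through unchanged."""
    return context["raw"][["sector"]]


def check_indices(survivors: pd.Index, output: pd.DataFrame, stage_cfg: dict) -> None:
    """Reject re-introduced rows unless the stage opts in (assumed config key)."""
    new_rows = output.index.difference(survivors)
    if len(new_rows) > 0 and not stage_cfg.get("allow_new_indices", False):
        raise ValueError(f"stage introduced {len(new_rows)} new record_id values")


raw = pd.DataFrame(
    {"amount": [10.0, None, 30.0], "sector": ["a", "b", "c"]},
    index=pd.Index([1, 2, 3], name="record_id"),
)
context = {"raw": raw}

outputs = [stage(context, {}) for stage in (prepare_numeric, prepare_categorical)]
# join="inner" keeps only record_ids present in every stage output, so a row
# dropped by any stage stays dropped in the assembled frame (waterfall).
assembled = pd.concat(outputs, axis=1, join="inner")
```

Here record_id 2 is removed by the numeric stage and therefore absent from the assembled frame, even though the categorical stage still returned it.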

Pipeline Layout (Current Hybrid Design)

Front section (feature assembly):

  1. record_id_definition
  2. borrower_geo_coding
  3. prepare_categorical_data
  4. prepare_numerical_data
  5. merge_categorical_and_numerical
  6. merge_with_borrower_geo

Merge stage design:

  • package component: src/tabular/merge_ops.py
  • user wrappers: featurization_scripts/featurization.py
  • merge key: record_id index

Leakage-safe modeling section:

  7. low_count_featurization_of_cat_vars
  8. hierarchical_low_count_var_encoding
  9. loan_status_recoding
  10. filter_modeling_universe
  11. stratified_train_val_split
  12. target_encode_categorical_vars
  13. harmonize_and_project_feature_space
  14. merge_modeled_and_active_partitions

Current encoding rule:

  • if both raw and rarity-corrected categorical variants exist (x and x_rcs), only x_rcs is target-encoded
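
The rule can be expressed as a small column filter. This is a sketch, not the package's code: the function name is hypothetical, and it assumes that a column without an _rcs twin is still encoded as-is, which the rule above does not state explicitly.

```python
def columns_to_target_encode(categorical_columns: list[str], suffix: str = "_rcs") -> list[str]:
    """Prefer the rarity-corrected variant when both x and x_rcs are present."""
    present = set(categorical_columns)
    chosen = []
    for col in categorical_columns:
        # Skip a raw column when its rarity-corrected twin is also present.
        if not col.endswith(suffix) and col + suffix in present:
            continue
        chosen.append(col)
    return chosen


selected = columns_to_target_encode(["state", "state_rcs", "sector", "bank_rcs"])
```

In this example state is skipped in favor of state_rcs, while sector and bank_rcs are kept because neither has a competing variant.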

Tree-Based Feature Selection

Feature selection runs in harmonize_and_project_feature_space using train rows only.

Supported selector modes:

  • threshold
  • tree_ensemble

Supported tree models:

  • gbm
  • random_forest
  • xgboost (optional dependency)

All selector choices are config-driven via featurizer_config.yaml and surfaced through PathCoordinator (no stage-level hardcoded constants).

Feature-count tuning for kneedle mode:

  • FEATURE_SELECTION_TOP_K_MODE: kneedle
  • FEATURE_SELECTION_TOP_K_MIN_RATIO: conservative default floor, e.g. 0.5
  • FEATURE_SELECTION_MIN_FEATURE_COUNT: hard floor for retained features
  • FEATURE_SELECTION_TARGET_FEATURE_COUNT: explicit count override when the curve is too aggressive
  • FEATURE_SELECTION_REQUIRE_KNEEDLE: fail loudly if the knee cannot be determined
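
One plausible way these settings could interact is sketched below. Everything here is an assumption made for illustration: the knee heuristic is a simplified stand-in (largest gap between the cumulative-importance curve and the diagonal) rather than the package's kneedle implementation, and the precedence of override, ratio floor, and hard floor is a guess at the intended semantics.

```python
import math
from typing import Optional


def knee_count(importances: list[float]) -> Optional[int]:
    """Feature count at the knee of the cumulative-importance curve, or None."""
    ranked = sorted(importances, reverse=True)
    total, n = sum(ranked), len(ranked)
    if n < 3 or total <= 0:
        return None
    best_k, best_gap, running = None, 0.0, 0.0
    for k, value in enumerate(ranked, start=1):
        running += value
        gap = running / total - k / n  # height of the curve above the diagonal
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k  # None when the curve never rises above the diagonal


def resolve_feature_count(importances: list[float], cfg: dict) -> int:
    """Combine knee, floors, and override (assumed precedence, see lead-in)."""
    n = len(importances)
    target = cfg.get("FEATURE_SELECTION_TARGET_FEATURE_COUNT")
    if target is not None:
        return min(int(target), n)  # explicit override wins
    knee = knee_count(importances)
    if knee is None:
        if cfg.get("FEATURE_SELECTION_REQUIRE_KNEEDLE", False):
            raise RuntimeError("knee could not be determined")
        knee = n  # fall back to keeping every feature
    ratio_floor = math.ceil(cfg.get("FEATURE_SELECTION_TOP_K_MIN_RATIO", 0.0) * n)
    hard_floor = cfg.get("FEATURE_SELECTION_MIN_FEATURE_COUNT", 1)
    return min(n, max(knee, ratio_floor, hard_floor))


scores = [0.5, 0.2, 0.1, 0.05, 0.05, 0.04, 0.03, 0.03]
kept = resolve_feature_count(
    scores,
    {"FEATURE_SELECTION_TOP_K_MIN_RATIO": 0.5, "FEATURE_SELECTION_MIN_FEATURE_COUNT": 3},
)
```

With these scores the raw knee falls after the second feature, so the 0.5 ratio floor is what lifts the retained count to half of the eight candidates.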

Repository Organization

  • src/featurization/core: orchestration, configuration bootstrap, path resolution
  • src/featurization/transforms: reusable transformation primitives
  • src/tabular: reusable tabular feature modules (encoding, splitting, feature space)
  • src/tabular/merge_ops.py: reusable index-aligned tabular merge helper
  • tests: package-level smoke and behavior checks
  • documents: architecture and configuration references

Package Component Buckets

The tabular package modules are intentionally split into two modeling buckets, plus a separate assembly bucket:

  • Row-selection components:

    • src/tabular/modeling_filter.py
    • src/tabular/train_val_split.py
    • Purpose: decide which records participate in training and how records are partitioned.
  • Column-selection components:

    • src/tabular/feature_space.py
    • src/tabular/target_encoding.py
    • src/tabular/low_count_cat_var_encoding.py
    • src/tabular/hierarchical_low_count_var_encoding.py
    • Purpose: decide which feature columns are engineered, selected, encoded, and projected.
  • Assembly components:

    • src/tabular/merge_ops.py
    • Purpose: index-aligned horizontal composition of prepared payloads.

CLI

Initialize config:

featurization-cli init \
  --working-dir /path/to/workspace \
  --metadata-file sba_loans_metadata_table.csv \
  --data-file sba_loans_user_cleaned.csv

Run pipeline:

featurization-cli run --working-dir /path/to/workspace

Run the smoke test in this repo:

pytest -q tests/test_sba_pipeline.py

How To Extend Safely

  1. Add reusable logic in src/tabular first whenever possible.
  2. Keep stage wrappers in the workspace's featurization_scripts/featurization.py thin and explicit.
  3. Add new tunables to:
    • featurizer_config.yaml
    • src/featurization/core/path_coordinator.py
    • src/featurization/core/featurization_init.py
  4. Preserve leakage rules:
    • fit artifacts on train only
    • transform val/active using train-fitted artifacts
  5. Validate with tests after each change.
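
The leakage rules in step 4 can be illustrated with a dependency-free target-encoding sketch. The function names are hypothetical and the encoder is a plain unsmoothed category mean, chosen for brevity; the encoder in src/tabular/target_encoding.py may use a different formula.

```python
from collections import defaultdict


def fit_target_means(categories: list[str], targets: list[float]) -> tuple[dict, float]:
    """Fit per-category target means and a global fallback on train rows only."""
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    global_mean = sum(targets) / len(targets)
    return {cat: sums[cat] / counts[cat] for cat in sums}, global_mean


def transform(categories: list[str], means: dict, global_mean: float) -> list[float]:
    """Apply train-fitted means; unseen categories get the train global mean."""
    return [means.get(cat, global_mean) for cat in categories]


# Fit on train only.
train_cat = ["a", "a", "b", "b"]
train_y = [1, 0, 1, 1]
means, prior = fit_target_means(train_cat, train_y)

# Transform val/active with the train-fitted artifacts; no val targets are used.
val_encoded = transform(["a", "c"], means, prior)
```

Because transform never sees validation or active targets, the category "c", which is absent from train, falls back to the train-level mean instead of leaking information from later partitions.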

Recommended Read Order

  1. documents/sba_pipeline_featurization.md
  2. documents/config_blueprint.md
  3. documents/path_coordinator_function.md
  4. src/featurization/core/sequential_pipeline_runner.py
  5. src/tabular/feature_space.py
