KMDS Featurization Service and Orchestration Pipeline

KMDS Featurization

This repository provides a configurable, stage-based featurization engine for KMDS modeling workflows.

The design goal is simple:

  • keep stage logic understandable and composable
  • keep orchestration/configuration centralized
  • keep modeling flow leakage-safe (fit on train only, reuse on val/active)

What This Produces

The pipeline writes two CSV outputs:

  • featurized_data.csv: consolidated engineered dataset (modeled + active partitions)
  • model_ready_numeric_data.csv: numeric model-ready export from the final stage output

Additional diagnostic artifact:

  • feature_selection_knee_curve.png: ranked feature-importance knee plot saved in the featurization output directory

For the current SBA flow, the model-ready dataset is:

  • numeric/bool only
  • train-fitted feature-selected
  • schema-aligned across train/val/active
  • persisted with index=False (no index artifact column)
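
The four properties above can be sketched with pandas. This is a minimal illustration, not the package's implementation: the helper name to_model_ready, the sample columns, and the choice to fill a missing selected column with 0 are all assumptions made for the example.

```python
import io

import pandas as pd


def to_model_ready(df: pd.DataFrame, selected: list) -> pd.DataFrame:
    """Hypothetical helper: keep numeric/bool columns, then project onto
    the column list that was selected on the train partition."""
    numeric = df.select_dtypes(include=["number", "bool"])
    # reindex enforces the same columns in the same order for every partition;
    # filling absent columns with 0 is an assumption of this sketch.
    return numeric.reindex(columns=selected, fill_value=0).astype(float)


train = pd.DataFrame(
    {"amount": [10.0, 20.0], "is_new": [True, False], "state": ["CA", "NY"]},
    index=pd.Index([1, 2], name="record_id"),
)
active = pd.DataFrame(
    {"amount": [30.0], "state": ["TX"]},
    index=pd.Index([3], name="record_id"),
)

selected_columns = ["amount", "is_new"]  # stands in for the train-fitted selection
train_ready = to_model_ready(train, selected_columns)
active_ready = to_model_ready(active, selected_columns)

# index=False keeps record_id out of the file, so no index artifact column appears.
buffer = io.StringIO()
pd.concat([train_ready, active_ready]).to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

The string column state is dropped by the dtype filter, and both partitions end up with the identical column list before export.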

Core Concepts

  • Anchor index: record_id
  • Stage contract: method(context, stage_cfg) -> DataFrame
  • Waterfall behavior: each stage can shrink survivor rows by index
  • Horizontal feature assembly: stage outputs are concatenated by index
  • Controlled index expansion: only stages marked allow_new_indices may re-introduce rows
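
One way to read these concepts together is the small runner below. It is a sketch under stated assumptions: run_stages, load_amounts, and keep_large are hypothetical names, and the exact survivor rules in the real runner (src/featurization/core/sequential_pipeline_runner.py) may differ.

```python
import pandas as pd


def run_stages(context: dict, stages: list) -> pd.DataFrame:
    """Run (stage, stage_cfg) pairs in order and assemble their outputs."""
    survivors = None  # record_id values still alive in the waterfall
    outputs = []
    for stage, stage_cfg in stages:
        out = stage(context, stage_cfg)  # stage contract: returns a DataFrame
        if survivors is None:
            survivors = out.index
        elif stage_cfg.get("allow_new_indices", False):
            # Controlled expansion: this stage may bring rows back.
            survivors = survivors.union(out.index)
        else:
            # Waterfall: a stage can only shrink the survivor set.
            survivors = survivors.intersection(out.index)
        outputs.append(out)
    # Horizontal assembly: align every stage output on the surviving index.
    return pd.concat([o.reindex(survivors) for o in outputs], axis=1)


def load_amounts(context: dict, stage_cfg: dict) -> pd.DataFrame:
    return context["raw"][["amount"]]


def keep_large(context: dict, stage_cfg: dict) -> pd.DataFrame:
    raw = context["raw"]
    kept = raw[raw["amount"] >= stage_cfg["min_amount"]]
    return pd.DataFrame({"is_large": True}, index=kept.index)


raw = pd.DataFrame(
    {"amount": [5.0, 50.0, 500.0]},
    index=pd.Index([101, 102, 103], name="record_id"),
)
result = run_stages(
    {"raw": raw},
    [(load_amounts, {}), (keep_large, {"min_amount": 50.0})],
)
```

Here record 101 is dropped by the second stage, so the assembled frame keeps only records 102 and 103 with both feature columns.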

Pipeline Layout (Current Hybrid Design)

Front section (feature assembly):

  1. record_id_definition
  2. entity_coding
  3. prepare_categorical_data
  4. prepare_numerical_data
  5. merge_categorical_and_numerical
  6. merge_with_entity_coding

Merge stage design:

  • package component: src/tabular/merge_ops.py
  • user wrappers: featurization_scripts/featurization.py
  • merge key: record_id index
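
The API of src/tabular/merge_ops.py is not shown here, so the following is only a sketch of what an index-aligned merge on record_id could look like. The function name merge_on_index and its checks are assumptions for illustration.

```python
import pandas as pd


def merge_on_index(left: pd.DataFrame, right: pd.DataFrame, how: str = "inner") -> pd.DataFrame:
    """Hypothetical helper: join two payloads on their shared index."""
    for name, frame in (("left", left), ("right", right)):
        if not frame.index.is_unique:
            raise ValueError(f"{name} payload has duplicate index values")
    overlap = left.columns.intersection(right.columns)
    if len(overlap) > 0:
        raise ValueError(f"overlapping columns: {list(overlap)}")
    # DataFrame.join aligns on the index, which is record_id in this pipeline.
    return left.join(right, how=how)


categorical = pd.DataFrame(
    {"state_rcs": ["CA", "NY", "other"]},
    index=pd.Index([1, 2, 3], name="record_id"),
)
numerical = pd.DataFrame(
    {"amount": [20.0, 30.0, 40.0]},
    index=pd.Index([2, 3, 4], name="record_id"),
)
merged = merge_on_index(categorical, numerical)
```

With an inner join, only records present in both payloads (2 and 3) survive, which matches the waterfall behavior described under Core Concepts.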

Leakage-safe modeling section:

  7. low_count_featurization_of_cat_vars
  8. hierarchical_low_count_var_encoding
  9. target_status_recoding
  10. filter_modeling_universe
  11. stratified_train_val_split
  12. target_encode_categorical_vars
  13. harmonize_and_project_feature_space
  14. merge_modeled_and_active_partitions
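
The split stage is what makes train-only fitting possible in the stages that follow it. A standard-library sketch of a stratified split is shown below; the function name, the val_fraction and seed parameters, and the rounding rule are assumptions, and the real logic lives in src/tabular/train_val_split.py.

```python
import random
from collections import defaultdict


def stratified_split(labels: dict, val_fraction: float, seed: int) -> tuple:
    """Split record_ids into train/val while preserving class proportions.

    labels maps record_id -> class label.
    """
    by_class = defaultdict(list)
    for record_id, label in labels.items():
        by_class[label].append(record_id)

    rng = random.Random(seed)  # seeded for reproducible partitions
    train_ids, val_ids = [], []
    for label in sorted(by_class):
        ids = sorted(by_class[label])
        rng.shuffle(ids)
        n_val = round(len(ids) * val_fraction)
        val_ids.extend(ids[:n_val])
        train_ids.extend(ids[n_val:])
    return sorted(train_ids), sorted(val_ids)


# 20 records, 5 of them positive (every fourth record_id).
labels = {record_id: int(record_id % 4 == 0) for record_id in range(1, 21)}
train_ids, val_ids = stratified_split(labels, val_fraction=0.2, seed=7)
```

Each class contributes the same fraction to validation, so the 25% positive rate is kept in both partitions.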

Current encoding rule:

  • if both raw and rarity-corrected categorical variants exist (x and x_rcs), only x_rcs is target-encoded
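
The rule can be expressed as a small column filter. This is a sketch: the function name is hypothetical, and encoding a raw column that has no _rcs variant is an assumption, since the rule above only covers the case where both variants exist.

```python
def columns_to_target_encode(categorical_columns: list, suffix: str = "_rcs") -> list:
    """Drop a raw column x whenever its rarity-corrected variant x_rcs exists."""
    present = set(categorical_columns)
    selected = []
    for column in categorical_columns:
        is_raw = not column.endswith(suffix)
        if is_raw and f"{column}{suffix}" in present:
            continue  # superseded by the rarity-corrected variant
        selected.append(column)
    return selected


encode_columns = columns_to_target_encode(
    ["state", "state_rcs", "industry", "bank_rcs"]
)
# encode_columns == ["state_rcs", "industry", "bank_rcs"]
```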

Tree-Based Feature Selection

Feature selection runs in harmonize_and_project_feature_space using train rows only.

Supported selector modes:

  • threshold
  • tree_ensemble

Supported tree models:

  • gbm
  • random_forest
  • xgboost (optional dependency)

All selector choices are config-driven via featurizer_config.yaml and surfaced through PathCoordinator (no stage-level hardcoded constants).
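
A config-driven selector can be validated once, up front, so stages never carry their own constants. The sketch below assumes a hypothetical SelectorConfig dataclass; it does not reproduce the actual keys in featurizer_config.yaml or the PathCoordinator interface.

```python
import importlib.util
from dataclasses import dataclass
from typing import Optional

SELECTOR_MODES = {"threshold", "tree_ensemble"}
TREE_MODELS = {"gbm", "random_forest", "xgboost"}


@dataclass(frozen=True)
class SelectorConfig:
    """Hypothetical container for selector settings read from config."""

    mode: str
    tree_model: Optional[str] = None

    def __post_init__(self) -> None:
        if self.mode not in SELECTOR_MODES:
            raise ValueError(f"unsupported selector mode: {self.mode!r}")
        if self.mode == "tree_ensemble":
            if self.tree_model not in TREE_MODELS:
                raise ValueError(f"unsupported tree model: {self.tree_model!r}")
            # xgboost is optional, so check that it is importable before use.
            if self.tree_model == "xgboost" and importlib.util.find_spec("xgboost") is None:
                raise ImportError("tree model 'xgboost' requires the xgboost package")


selector = SelectorConfig(mode="tree_ensemble", tree_model="random_forest")
```

Failing at construction time keeps a misconfigured run from reaching the feature-selection stage.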

Feature-count tuning for kneedle mode:

  • FEATURE_SELECTION_TOP_K_MODE: kneedle
  • FEATURE_SELECTION_TOP_K_MIN_RATIO: conservative default floor, e.g. 0.5
  • FEATURE_SELECTION_MIN_FEATURE_COUNT: hard floor for retained features
  • FEATURE_SELECTION_TARGET_FEATURE_COUNT: explicit count override when the curve is too aggressive
  • FEATURE_SELECTION_REQUIRE_KNEEDLE: fail loudly if the knee cannot be determined
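
How these tunables might interact is sketched below. The knee finder is a simplified stand-in (largest gap below the chord of the normalized, descending importance curve), not the package's kneedle implementation, and the precedence of override, floors, and failure mode is an assumption of this example.

```python
import math
from typing import Optional


def find_knee(ordered: list) -> Optional[int]:
    """Return how many features to keep at the knee, or None if there is none.

    ordered must be importances sorted in descending order.
    """
    n = len(ordered)
    if n < 3:
        return None
    span = ordered[0] - ordered[-1]
    if span <= 0:
        return None  # flat curve: no knee
    best_index, best_gap = None, 1e-9
    for i, value in enumerate(ordered):
        x = i / (n - 1)
        y = (value - ordered[-1]) / span
        gap = (1.0 - x) - y  # distance below the chord from (0, 1) to (1, 0)
        if gap > best_gap:
            best_index, best_gap = i, gap
    return None if best_index is None else best_index + 1


def resolve_feature_count(importances, *, min_ratio, min_count,
                          target_count=None, require_knee=False) -> int:
    n = len(importances)
    if target_count is not None:  # explicit override bypasses the curve
        return max(1, min(n, target_count))
    knee = find_knee(sorted(importances, reverse=True))
    if knee is None:
        if require_knee:
            raise ValueError("knee could not be determined")
        knee = n  # assumed fallback: keep everything
    floor = max(min_count, math.ceil(min_ratio * n))
    return min(n, max(knee, floor))


importances = [0.40, 0.25, 0.15, 0.08, 0.05, 0.03, 0.02, 0.02]
kept = resolve_feature_count(importances, min_ratio=0.25, min_count=1)  # knee at 4
kept_with_floor = resolve_feature_count(importances, min_ratio=0.5, min_count=6)  # floor wins: 6
```

The floors only ever raise the retained count, which is why an aggressive knee cannot prune below the configured minimum.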

Repository Organization

  • src/featurization/core: orchestration, configuration bootstrap, path resolution
  • src/featurization/transforms: reusable transformation primitives
  • src/tabular: reusable tabular feature modules (encoding, splitting, feature space)
  • src/tabular/merge_ops.py: reusable index-aligned tabular merge helper
  • tests: package-level smoke and behavior checks
  • documents: architecture and configuration references

Package Component Buckets

The tabular package modules are intentionally split into two modeling buckets (row selection and column selection), plus a separate assembly bucket:

  • Row-selection components:

    • src/tabular/modeling_filter.py
    • src/tabular/train_val_split.py
    • Purpose: decide which records participate in training and how records are partitioned.
  • Column-selection components:

    • src/tabular/feature_space.py
    • src/tabular/target_encoding.py
    • src/tabular/low_count_cat_var_encoding.py
    • src/tabular/hierarchical_low_count_var_encoding.py
    • Purpose: decide which feature columns are engineered, selected, encoded, and projected.
  • Assembly components:

    • src/tabular/merge_ops.py
    • Purpose: index-aligned horizontal composition of prepared payloads.

CLI

Initialize config:

featurization-cli init \
  --working-dir /path/to/workspace \
  --metadata-file sba_loans_metadata_table.csv \
  --data-file sba_loans_user_cleaned.csv

Run pipeline:

featurization-cli run --working-dir /path/to/workspace

Run smoke test in this repo:

pytest -q tests/test_sba_pipeline.py

How To Extend Safely

  1. Add reusable logic in src/tabular first whenever possible.
  2. Keep stage wrappers in workspace featurization_scripts/featurization.py thin and explicit.
  3. Add new tunables to:
    • featurizer_config.yaml
    • src/featurization/core/path_coordinator.py
    • src/featurization/core/featurization_init.py
  4. Preserve leakage rules:
    • fit artifacts on train only
    • transform val/active using train-fitted artifacts
  5. Validate with tests after each change.
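
The leakage rule in step 4 amounts to a fit/apply split. The sketch below uses plain, unsmoothed mean target encoding to show it; the function names and the global-mean fallback for unseen categories are assumptions, and the package's encoder in src/tabular/target_encoding.py may work differently.

```python
from collections import defaultdict


def fit_target_encoder(categories: list, targets: list) -> dict:
    """Fit on train rows only; the returned artifact is reused everywhere else."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    return {
        "means": {category: sums[category] / counts[category] for category in sums},
        "global_mean": sum(targets) / len(targets),
    }


def apply_target_encoder(artifact: dict, categories: list) -> list:
    """Transform any partition without looking at its targets."""
    fallback = artifact["global_mean"]  # used for categories unseen in train
    return [artifact["means"].get(category, fallback) for category in categories]


train_categories = ["CA", "CA", "NY", "NY"]
train_targets = [1, 0, 1, 1]
artifact = fit_target_encoder(train_categories, train_targets)

val_encoded = apply_target_encoder(artifact, ["NY", "TX"])
# val_encoded == [1.0, 0.75]
```

Because apply_target_encoder never receives targets, validation and active rows cannot leak label information into the features.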

Recommended Read Order

  1. documents/sba_pipeline_featurization.md
  2. documents/config_blueprint.md
  3. documents/path_coordinator_function.md
  4. src/featurization/core/sequential_pipeline_runner.py
  5. src/tabular/feature_space.py
