KMDS Featurization Service and Orchestration Pipeline

KMDS Featurization

This repository provides a configurable, stage-based featurization package for KMDS project datasets.

It is designed to be dataset-agnostic at the package level:

  • stage orchestration is generic and configuration-driven
  • reusable feature logic lives in package modules
  • modeling flow remains leakage-safe (fit on train only, reuse on val/active)

SBA-specific file names and stage examples in this repo are reference defaults, not a package constraint.

What This Produces

The pipeline writes two CSV outputs:

  • featurized_data.csv: consolidated engineered dataset
  • model_ready_numeric_data.csv: numeric model-ready export from the final stage output

Additional diagnostic artifact:

  • feature_selection_knee_curve.png: ranked feature-importance knee plot in the featurization output directory

Model-ready output behavior:

  • numeric and bool columns only
  • train-fitted feature-selected schema
  • aligned schema across modeled and active partitions
  • persisted with index=False to avoid index artifact columns
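
The behavior above can be sketched with plain pandas. This is only an illustration under stated assumptions: the helper name to_model_ready, the sample columns, and the choice to fill missing columns with 0 are hypothetical and not taken from the package.

```python
import io

import pandas as pd


def to_model_ready(df: pd.DataFrame, schema: list[str]) -> pd.DataFrame:
    """Keep numeric/bool columns, then align them to a train-fitted schema."""
    numeric = df.select_dtypes(include=["number", "bool"])
    # Assumption: columns absent from this partition are filled with 0,
    # and columns outside the schema are dropped.
    return numeric.reindex(columns=schema, fill_value=0)


train = pd.DataFrame(
    {"amount": [10.0, 20.0], "is_urban": [True, False], "state": ["CA", "NY"]},
    index=pd.Index([1, 2], name="record_id"),
)
active = pd.DataFrame(
    {"amount": [30.0], "extra_flag": [1]},
    index=pd.Index([3], name="record_id"),
)

# The schema is derived from train rows only (no real feature selection here).
schema = list(train.select_dtypes(include=["number", "bool"]).columns)

train_ready = to_model_ready(train, schema)
active_ready = to_model_ready(active, schema)

# index=False keeps record_id out of the CSV, so no index artifact column appears.
buffer = io.StringIO()
train_ready.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

Both partitions end up with identical column order, which is what lets a model trained on one be applied to the other without remapping.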

Core Runtime Contract

  • Anchor index: record_id
  • Stage contract: method(context, stage_cfg) -> DataFrame
  • Waterfall behavior: each stage can shrink the set of surviving rows, identified by index
  • Horizontal assembly: stage outputs are concatenated by index
  • Controlled expansion: only stages with allow_new_indices may intentionally re-introduce rows
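
A minimal sketch of how this contract could fit together, assuming a dict-based context and a list of (method, stage_cfg) pairs. The runner name run_stages, the two example stages, and the context layout are hypothetical; the actual runner in sequential_pipeline_runner.py may differ.

```python
import pandas as pd


def filter_stage(context: dict, stage_cfg: dict) -> pd.DataFrame:
    """Row-selection stage: keeps rows at or above a configured threshold."""
    base = context["base"]
    return base.loc[base["amount"] >= stage_cfg["min_amount"], ["amount"]]


def ratio_stage(context: dict, stage_cfg: dict) -> pd.DataFrame:
    """Column-engineering stage: computes one feature for every base row."""
    base = context["base"]
    return (base["amount"] / base["term"]).to_frame(stage_cfg["name"])


def run_stages(context: dict, stages: list) -> pd.DataFrame:
    """Apply stages in order and assemble their outputs by index."""
    survivors = context["base"].index
    outputs = []
    for method, stage_cfg in stages:
        out = method(context, stage_cfg)
        if stage_cfg.get("allow_new_indices", False):
            # Controlled expansion: this stage may re-introduce rows.
            survivors = survivors.union(out.index)
        else:
            # Waterfall: only rows present in every stage so far survive.
            survivors = survivors.intersection(out.index)
        outputs.append(out)
    # Horizontal assembly: concatenate by index, restricted to survivors.
    return pd.concat([o.reindex(survivors) for o in outputs], axis=1)


base = pd.DataFrame(
    {"amount": [100.0, 50.0, 200.0], "term": [10, 5, 40]},
    index=pd.Index([1, 2, 3], name="record_id"),
)
stages = [
    (filter_stage, {"min_amount": 100.0}),
    (ratio_stage, {"name": "amount_per_term"}),
]
result = run_stages({"base": base}, stages)
```

Here record 2 is dropped by the first stage, so it is absent from the assembled output even though the second stage computed a value for it.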

Package Architecture

Core orchestration:

  • src/featurization/core/sequential_pipeline_runner.py
  • src/featurization/core/path_coordinator.py
  • src/featurization/core/featurization_init.py

Reusable tabular modules:

  • src/tabular/modeling_filter.py
  • src/tabular/train_val_split.py
  • src/tabular/target_encoding.py
  • src/tabular/feature_space.py
  • src/tabular/low_count_cat_var_encoding.py
  • src/tabular/hierarchical_low_count_var_encoding.py
  • src/tabular/merge_ops.py

Design split:

  • Row-selection components decide participation and partitioning
  • Column-selection components decide engineering, encoding, and projection
  • Assembly components perform index-aligned merges

Feature Selection

Feature selection runs in harmonize_and_project_feature_space on train rows only.

Supported selector modes:

  • threshold
  • tree_ensemble

Supported tree models:

  • gbm
  • random_forest
  • xgboost (optional dependency)

All selector behavior is configuration-driven through featurizer_config.yaml.

Key Kneedle (knee-detection) controls:

  • FEATURE_SELECTION_TOP_K_MODE
  • FEATURE_SELECTION_TOP_K_MIN_RATIO
  • FEATURE_SELECTION_MIN_FEATURE_COUNT
  • FEATURE_SELECTION_TARGET_FEATURE_COUNT
  • FEATURE_SELECTION_REQUIRE_KNEEDLE
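
To make the idea of a knee-based cutoff concrete, here is a simplified sketch. It is not the package's implementation and does not use the config keys above: it applies a basic "farthest below the chord" heuristic (the intuition behind Kneedle) to ranked importances, and the parameters min_ratio and min_count are hypothetical stand-ins for the kind of floor such controls might impose.

```python
import math


def knee_index(values: list[float]) -> int:
    """Index of the point farthest below the chord from first to last value.

    `values` must already be sorted in descending order.
    """
    n = len(values)
    if n < 3:
        return n - 1  # too few points to define a knee; keep everything
    first, last = values[0], values[-1]
    best_i, best_gap = 0, 0.0
    for i, v in enumerate(values):
        chord = first + (last - first) * i / (n - 1)
        gap = chord - v
        if gap > best_gap:
            best_i, best_gap = i, gap
    return best_i


def select_top_k(
    importances: dict[str, float], min_ratio: float = 0.0, min_count: int = 1
) -> list[str]:
    """Keep features up to and including the knee, subject to a minimum floor."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    values = [importances[name] for name in ranked]
    k = knee_index(values) + 1
    floor = max(min_count, math.ceil(min_ratio * len(ranked)))
    return ranked[: min(len(ranked), max(k, floor))]


importances = {"a": 0.50, "b": 0.30, "c": 0.08, "d": 0.05, "e": 0.04, "f": 0.03}
selected = select_top_k(importances)
print(selected)  # ['a', 'b', 'c']
```

Keeping the knee point itself is a design choice in this sketch; a stricter variant would cut just before it.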

CLI

Initialize a workspace config:

featurization-cli init \
  --working-dir /path/to/workspace \
  --metadata-file your_metadata.csv \
  --data-file your_cleaned_dataset.csv

Run the pipeline:

featurization-cli run --working-dir /path/to/workspace

Run package smoke tests:

pytest -q tests/test_sba_pipeline.py

How To Extend Safely

  1. Put reusable transformations in src/tabular first.
  2. Keep workspace stage wrappers thin and explicit.
  3. Add new tunables in all three locations:
    • featurizer_config.yaml
    • src/featurization/core/path_coordinator.py
    • src/featurization/core/featurization_init.py
  4. Preserve leakage-safe modeling flow:
    • fit artifacts on train only
    • transform val/active with train-fitted artifacts
  5. Validate with package tests and workspace integration runs.
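
The leakage-safe rule in step 4 can be illustrated with a small fit/transform sketch. The MeanTargetEncoder class and the sample data are hypothetical and do not reflect the API of src/tabular/target_encoding.py; the point is only that statistics are learned from train rows and then reused unchanged elsewhere.

```python
from collections import defaultdict


class MeanTargetEncoder:
    """Hypothetical encoder: maps each category to its mean target on train."""

    def fit(self, categories, targets):
        sums, counts = defaultdict(float), defaultdict(int)
        for cat, y in zip(categories, targets):
            sums[cat] += y
            counts[cat] += 1
        self.global_mean_ = sum(targets) / len(targets)
        self.means_ = {cat: sums[cat] / counts[cat] for cat in sums}
        return self

    def transform(self, categories):
        # Categories unseen during fit fall back to the train-wide mean.
        return [self.means_.get(cat, self.global_mean_) for cat in categories]


train_cats, train_y = ["retail", "retail", "food", "food"], [1, 0, 1, 1]
val_cats = ["food", "tech"]

encoder = MeanTargetEncoder().fit(train_cats, train_y)  # fit on train only
train_encoded = encoder.transform(train_cats)
val_encoded = encoder.transform(val_cats)  # reuse train-fitted artifact
```

The validation targets never enter fit, so the encoded validation values cannot leak label information from that partition.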

Recommended Read Order

  1. documents/sba_pipeline_featurization.md
  2. documents/config_blueprint.md
  3. documents/path_coordinator_function.md
  4. src/featurization/core/sequential_pipeline_runner.py
  5. src/tabular/feature_space.py
