Skip to main content

KMDS Featurization Service and Orchestration Pipeline

Project description

KMDS Featurization

This repository provides a configurable, stage-based featurization package for KMDS project datasets.

It is designed to be dataset-agnostic at the package level:

  • stage orchestration is generic and configuration-driven
  • reusable feature logic lives in package modules
  • modeling flow remains leakage-safe (fit on train only, reuse on val/active)

SBA-specific file names and stage examples in this repo are reference defaults, not a package constraint.

What This Produces

The pipeline writes two CSV outputs:

  • featurized_data.csv: consolidated engineered dataset
  • model_ready_numeric_data.csv: numeric model-ready export from the final stage output

Additional diagnostic artifact:

  • feature_selection_knee_curve.png: ranked feature-importance knee plot in the featurization output directory

Model-ready output behavior:

  • numeric and bool columns only
  • train-fitted feature-selected schema
  • aligned schema across modeled and active partitions
  • persisted with index=False to avoid index artifact columns

Core Runtime Contract

  • Anchor index: record_id
  • Stage contract: method(context, stage_cfg) -> DataFrame
  • Waterfall behavior: each stage can reduce the survivor universe by index
  • Horizontal assembly: stage outputs are concatenated by index
  • Controlled expansion: only stages with allow_new_indices may intentionally re-introduce rows

Package Architecture

Core orchestration:

  • src/featurization/core/sequential_pipeline_runner.py
  • src/featurization/core/path_coordinator.py
  • src/featurization/core/featurization_init.py

Reusable tabular modules:

  • src/tabular/modeling_filter.py
  • src/tabular/train_val_split.py
  • src/tabular/target_encoding.py
  • src/tabular/feature_space.py
  • src/tabular/low_count_cat_var_encoding.py
  • src/tabular/hierarchical_low_count_var_encoding.py
  • src/tabular/merge_ops.py

Design split:

  • Row-selection components decide participation and partitioning
  • Column-selection components decide engineering, encoding, and projection
  • Assembly components perform index-aligned merges

Feature Selection

Feature selection runs in harmonize_and_project_feature_space on train rows only.

Supported selector modes:

  • threshold
  • tree_ensemble

Supported tree models:

  • gbm
  • random_forest
  • xgboost (optional dependency)

All selector behavior is configuration-driven through featurizer_config.yaml.

Key kneedle controls:

  • FEATURE_SELECTION_TOP_K_MODE
  • FEATURE_SELECTION_TOP_K_MIN_RATIO
  • FEATURE_SELECTION_MIN_FEATURE_COUNT
  • FEATURE_SELECTION_TARGET_FEATURE_COUNT
  • FEATURE_SELECTION_REQUIRE_KNEEDLE

CLI

Initialize a workspace config:

featurization-cli init
--working-dir /path/to/workspace
--metadata-file your_metadata.csv
--data-file your_cleaned_dataset.csv

Create a provisional starter config:

featurization-cli bootstrap
--working-dir /path/to/workspace
--metadata-file your_metadata.csv
--data-file your_cleaned_dataset.csv

Use --overwrite to replace an existing provisional config file if needed:

featurization-cli bootstrap
--working-dir /path/to/workspace
--metadata-file your_metadata.csv
--data-file your_cleaned_dataset.csv
--overwrite

Run the pipeline:

featurization-cli run --working-dir /path/to/workspace

Run package smoke tests:

pytest -q tests/test_sba_pipeline.py

How To Extend Safely

  1. Put reusable transformations in src/tabular first.
  2. Keep workspace stage wrappers thin and explicit.
  3. Add new tunables in all three locations:
    • featurizer_config.yaml
    • src/featurization/core/path_coordinator.py
    • src/featurization/core/featurization_init.py
  4. Preserve leakage-safe modeling flow:
    • fit artifacts on train only
    • transform val/active with train-fitted artifacts
  5. Validate with package tests and workspace integration runs.

Recommended Read Order

  1. documents/sba_pipeline_featurization.md
  2. documents/config_blueprint.md
  3. documents/path_coordinator_function.md
  4. src/featurization/core/sequential_pipeline_runner.py
  5. src/tabular/feature_space.py

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

kmds_featurization-0.2.0.tar.gz (37.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

kmds_featurization-0.2.0-py3-none-any.whl (35.3 kB view details)

Uploaded Python 3

File details

Details for the file kmds_featurization-0.2.0.tar.gz.

File metadata

  • Download URL: kmds_featurization-0.2.0.tar.gz
  • Upload date:
  • Size: 37.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.26 {"installer":{"name":"uv","version":"0.9.26","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for kmds_featurization-0.2.0.tar.gz
Algorithm Hash digest
SHA256 01b6a8f9219092e4315f3f0be61c05997bc17e42143b463459fd1b84bc81f649
MD5 7681abcfe1f229404f169514184f8d9c
BLAKE2b-256 008fb7bf71b846524872fd3b90046173028c2a49b730f5873a33fe8d4d6a2a4e

See more details on using hashes here.

File details

Details for the file kmds_featurization-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: kmds_featurization-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 35.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.26 {"installer":{"name":"uv","version":"0.9.26","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for kmds_featurization-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 e5bace049af0b19a3baa733fdc00b3ce37aeb71c1e14c39accb85606b1ea5e5c
MD5 29b6dffd6af23fa1a93a08a85df7ed18
BLAKE2b-256 243bed8f2f53273bd78624d4aef2724357e78b4921d880fcf4961e8fba6736ae

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page