
KMDS Featurization Service and Orchestration Pipeline

KMDS Featurization

This repository provides a configurable, stage-based featurization package for KMDS project datasets.

It is designed to be dataset-agnostic at the package level:

  • stage orchestration is generic and configuration-driven
  • reusable feature logic lives in package modules
  • modeling flow remains leakage-safe (artifacts are fit on train only, then reused on validation and active rows)

SBA-specific file names and stage examples in this repo are reference defaults, not a package constraint.

What This Produces

The pipeline writes two CSV outputs:

  • featurized_data.csv: consolidated engineered dataset
  • model_ready_numeric_data.csv: numeric model-ready export from the final stage output

Additional diagnostic artifact:

  • feature_selection_knee_curve.png: ranked feature-importance knee plot in the featurization output directory

Model-ready output behavior:

  • numeric and bool columns only
  • train-fitted feature-selected schema
  • aligned schema across modeled and active partitions
  • persisted with index=False to avoid index artifact columns
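
The export rules above can be sketched with pandas. This is a minimal illustration, not the package's implementation: the function name to_model_ready, the example column names, and the choice to fill schema columns missing from a partition with 0 are all assumptions.

```python
import io

import pandas as pd


def to_model_ready(df: pd.DataFrame, train_columns: list[str]) -> pd.DataFrame:
    """Keep numeric/bool columns and align them to a train-fitted schema."""
    numeric = df.select_dtypes(include=["number", "bool"])
    # reindex drops columns outside the schema and adds any missing ones.
    # Filling missing columns with 0 is an assumption made for this sketch.
    return numeric.reindex(columns=train_columns, fill_value=0)


# Hypothetical schema, as if selected on train rows only.
train_columns = ["loan_amount", "is_urban"]

# Hypothetical active partition with a non-numeric and an unselected column.
active = pd.DataFrame(
    {"loan_amount": [1000.0, 2500.0], "state": ["CA", "NY"], "extra_score": [0.1, 0.2]},
    index=pd.Index([101, 102], name="record_id"),
)

model_ready = to_model_ready(active, train_columns)

# index=False keeps the record_id index out of the CSV columns.
buffer = io.StringIO()
model_ready.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

Because the schema comes from train rows, every partition ends up with the same columns in the same order, regardless of which raw columns it carries.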

Core Runtime Contract

  • Anchor index: record_id
  • Stage contract: method(context, stage_cfg) -> DataFrame
  • Waterfall behavior: each stage can reduce the survivor universe by index
  • Horizontal assembly: stage outputs are concatenated by index
  • Controlled expansion: only stages with allow_new_indices may intentionally re-introduce rows
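
As a rough sketch of this contract, the loop below runs two stages, narrows the survivor index after each one, and concatenates the outputs by index. The names run_stages, keep_modeled_rows, add_ratio, the dict-shaped context, and the min_amount key are hypothetical; only record_id, allow_new_indices, and the method(context, stage_cfg) -> DataFrame shape come from the contract above.

```python
import pandas as pd


def keep_modeled_rows(context: dict, stage_cfg: dict) -> pd.DataFrame:
    """Hypothetical row-selection stage: keep rows at or above a minimum amount."""
    df = context["data"]
    return df.loc[df["amount"] >= stage_cfg["min_amount"], ["amount"]]


def add_ratio(context: dict, stage_cfg: dict) -> pd.DataFrame:
    """Hypothetical column stage: one engineered column for every input row."""
    df = context["data"]
    return pd.DataFrame({"amount_ratio": df["amount"] / df["amount"].max()}, index=df.index)


def run_stages(context: dict, stages: list) -> pd.DataFrame:
    survivors = context["data"].index
    outputs = []
    for method, stage_cfg in stages:
        out = method(context, stage_cfg)
        # Unless the stage opts in, it may not bring back rows dropped earlier.
        if not stage_cfg.get("allow_new_indices", False):
            out = out.loc[out.index.intersection(survivors)]
        survivors = out.index  # waterfall: later stages see this universe
        outputs.append(out)
    # Horizontal assembly: align every stage output to the final survivors.
    aligned = [out.loc[out.index.intersection(survivors)] for out in outputs]
    return pd.concat(aligned, axis=1)


data = pd.DataFrame(
    {"amount": [50.0, 150.0, 300.0]},
    index=pd.Index([1, 2, 3], name="record_id"),
)
result = run_stages(
    {"data": data},
    [(keep_modeled_rows, {"min_amount": 100}), (add_ratio, {})],
)
```

Here the first stage drops record 1, so the second stage's row for record 1 is discarded before assembly even though that stage computed it.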

Package Architecture

Core orchestration:

  • src/featurization/core/sequential_pipeline_runner.py
  • src/featurization/core/path_coordinator.py
  • src/featurization/core/featurization_init.py

Reusable tabular modules:

  • src/tabular/modeling_filter.py
  • src/tabular/train_val_split.py
  • src/tabular/target_encoding.py
  • src/tabular/feature_space.py
  • src/tabular/low_count_cat_var_encoding.py
  • src/tabular/hierarchical_low_count_var_encoding.py
  • src/tabular/merge_ops.py

Design split:

  • Row-selection components decide participation and partitioning
  • Column-selection components decide engineering, encoding, and projection
  • Assembly components perform index-aligned merges

Feature Selection

Feature selection runs in harmonize_and_project_feature_space on train rows only.

Supported selector modes:

  • threshold
  • tree_ensemble

Supported tree models:

  • gbm
  • random_forest
  • xgboost (optional dependency)

All selector behavior is configuration-driven through featurizer_config.yaml.

Key Kneedle (knee-detection) controls:

  • FEATURE_SELECTION_TOP_K_MODE
  • FEATURE_SELECTION_TOP_K_MIN_RATIO
  • FEATURE_SELECTION_MIN_FEATURE_COUNT
  • FEATURE_SELECTION_TARGET_FEATURE_COUNT
  • FEATURE_SELECTION_REQUIRE_KNEEDLE
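
To illustrate the general idea behind knee-based top-k selection, here is a simplified sketch: rank importances, normalize the curve, and cut at the point that falls furthest below the straight line joining the first and last points. It omits the smoothing and sensitivity handling of the full Kneedle algorithm and is not the package's selector. The min_feature_count floor is only loosely modeled on the config key of the same name; how the package interprets these keys is not stated here.

```python
def knee_index(sorted_importances: list[float]) -> int:
    """Index of the knee in a descending importance curve (simplified)."""
    n = len(sorted_importances)
    if n < 3:
        return n - 1
    hi, lo = sorted_importances[0], sorted_importances[-1]
    if hi == lo:  # flat curve: no knee, keep everything
        return n - 1
    best_i, best_gap = 0, float("-inf")
    for i, value in enumerate(sorted_importances):
        x = i / (n - 1)                 # rank position scaled to [0, 1]
        y = (value - lo) / (hi - lo)    # importance scaled to [0, 1]
        gap = (1.0 - x) - y             # how far the curve sits below the chord
        if gap > best_gap:
            best_i, best_gap = i, gap
    return best_i


def select_top_features(importances: dict[str, float], min_feature_count: int = 1) -> list[str]:
    """Keep features up to and including the knee, with a minimum count floor."""
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    values = [v for _, v in ranked]
    keep = max(knee_index(values) + 1, min_feature_count)
    return [name for name, _ in ranked[:keep]]


# Made-up importances for illustration only.
importances = {"a": 0.50, "b": 0.30, "c": 0.05, "d": 0.04, "e": 0.03, "f": 0.02}
selected = select_top_features(importances)
selected_with_floor = select_top_features(importances, min_feature_count=5)
```

Including the knee point itself in the kept set is a design choice of this sketch; a stricter variant would cut just before it.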

CLI

Initialize a workspace config:

featurization-cli init \
  --working-dir /path/to/workspace \
  --metadata-file your_metadata.csv \
  --data-file your_cleaned_dataset.csv

Run the pipeline:

featurization-cli run --working-dir /path/to/workspace

Run package smoke tests:

pytest -q tests/test_sba_pipeline.py

How To Extend Safely

  1. Put reusable transformations in src/tabular first.
  2. Keep workspace stage wrappers thin and explicit.
  3. Add new tunables in all three locations:
    • featurizer_config.yaml
    • src/featurization/core/path_coordinator.py
    • src/featurization/core/featurization_init.py
  4. Preserve leakage-safe modeling flow:
    • fit artifacts on train only
    • transform val/active with train-fitted artifacts
  5. Validate with package tests and workspace integration runs.
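
The leakage-safe rule in step 4 can be illustrated with a small mean target encoder. This is a sketch under assumptions, not the code in src/tabular/target_encoding.py: the function names, the artifact layout, and the fallback to the train global mean for unseen categories are all hypothetical.

```python
from collections import defaultdict


def fit_target_encoding(categories: list[str], targets: list[float]) -> dict:
    """Fit per-category target means on train rows only."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    return {
        "means": {c: sums[c] / counts[c] for c in sums},
        "global_mean": sum(targets) / len(targets),
    }


def transform_target_encoding(categories: list[str], artifact: dict) -> list[float]:
    """Encode any partition using the train-fitted artifact; never refit here."""
    means = artifact["means"]
    # Categories unseen in train fall back to the train global mean (assumption).
    return [means.get(c, artifact["global_mean"]) for c in categories]


# Made-up train and validation rows for illustration.
train_categories = ["retail", "retail", "tech", "tech"]
train_targets = [1.0, 0.0, 1.0, 1.0]
artifact = fit_target_encoding(train_categories, train_targets)

val_encoded = transform_target_encoding(["tech", "farm"], artifact)
```

Validation and active targets never enter fit_target_encoding, so their labels cannot influence the encoded values.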

Recommended Read Order

  1. documents/sba_pipeline_featurization.md
  2. documents/config_blueprint.md
  3. documents/path_coordinator_function.md
  4. src/featurization/core/sequential_pipeline_runner.py
  5. src/tabular/feature_space.py

Release Files

Version 0.1.5 is published as:

  • kmds_featurization-0.1.5.tar.gz: source distribution (28.0 kB)
  • kmds_featurization-0.1.5-py3-none-any.whl: built distribution for Python 3 (27.6 kB)
