KMDS Featurization Service and Orchestration Pipeline
KMDS Featurization
This repository provides a configurable, stage-based featurization package for KMDS project datasets.
It is designed to be dataset-agnostic at the package level:
- stage orchestration is generic and configuration-driven
- reusable feature logic lives in package modules
- modeling flow remains leakage-safe (artifacts are fit on train rows only and reused on the val and active partitions)
SBA-specific file names and stage examples in this repo are reference defaults, not a package constraint.
What This Produces
The pipeline writes two CSV outputs:
- featurized_data.csv: consolidated engineered dataset
- model_ready_numeric_data.csv: numeric model-ready export from the final stage output
Additional diagnostic artifact:
- feature_selection_knee_curve.png: ranked feature-importance knee plot in the featurization output directory
Model-ready output behavior:
- numeric and bool columns only
- train-fitted feature-selected schema
- aligned schema across modeled and active partitions
- persisted with index=False to avoid index artifact columns
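The model-ready behavior above can be illustrated with a small pandas sketch. It is not the package's export code: the helper name `to_model_ready`, the sample columns, and the choice to fill a missing column with 0 are assumptions made for illustration only.

```python
import io

import pandas as pd


def to_model_ready(df: pd.DataFrame, train_columns: list[str]) -> pd.DataFrame:
    """Keep numeric/bool columns and align them to a train-fitted schema."""
    numeric = df.select_dtypes(include=["number", "bool"])
    # reindex enforces one column order; a column absent from this partition
    # is filled with 0 (an assumption of this sketch, not a package rule).
    return numeric.reindex(columns=train_columns, fill_value=0)


modeled = pd.DataFrame(
    {"amount": [10.0, 20.0], "is_urban": [True, False], "state": ["CA", "NY"]},
    index=pd.Index([101, 102], name="record_id"),
)
active = pd.DataFrame(
    {"amount": [30.0], "state": ["TX"]},
    index=pd.Index([201], name="record_id"),
)

schema = ["amount", "is_urban"]  # hypothetical train-fitted, feature-selected schema
modeled_ready = to_model_ready(modeled, schema)
active_ready = to_model_ready(active, schema)

buffer = io.StringIO()
modeled_ready.to_csv(buffer, index=False)  # index=False keeps record_id out of the file
csv_header = buffer.getvalue().splitlines()[0]
```

Both partitions end up with the same two columns in the same order, and the written header contains no index column.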
Core Runtime Contract
- Anchor index: record_id
- Stage contract: method(context, stage_cfg) -> DataFrame
- Waterfall behavior: each stage can reduce the survivor universe by index
- Horizontal assembly: stage outputs are concatenated by index
- Controlled expansion: only stages with allow_new_indices may intentionally re-introduce rows
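A minimal sketch of how this contract might fit together, assuming pandas DataFrames indexed by record_id. The function `run_stages`, the context keys `base` and `survivors`, and the two example stages are hypothetical and do not reproduce sequential_pipeline_runner.py.

```python
import pandas as pd


def run_stages(base: pd.DataFrame, stages: list) -> pd.DataFrame:
    """Run (method, stage_cfg) pairs and assemble their outputs by index."""
    survivors = base.index
    outputs = []
    for method, stage_cfg in stages:
        context = {"base": base, "survivors": survivors}
        result = method(context, stage_cfg)
        if not stage_cfg.get("allow_new_indices", False):
            # A stage may drop rows, but must not introduce unknown ones.
            unexpected = result.index.difference(survivors)
            if len(unexpected) > 0:
                raise ValueError(f"stage added indices: {list(unexpected)}")
        survivors = result.index  # waterfall: later stages see only survivors
        outputs.append(result)
    # Horizontal assembly: align every stage output to the final survivors.
    return pd.concat([out.reindex(survivors) for out in outputs], axis=1)


def filter_modeling_rows(context, stage_cfg):
    frame = context["base"].loc[context["survivors"]]
    return frame[frame["amount"] >= stage_cfg["min_amount"]][["amount"]]


def add_doubled_amount(context, stage_cfg):
    frame = context["base"].loc[context["survivors"]]
    return pd.DataFrame({"amount_x2": frame["amount"] * 2})


base = pd.DataFrame(
    {"amount": [5, 50, 500]}, index=pd.Index([1, 2, 3], name="record_id")
)
featurized = run_stages(
    base,
    [(filter_modeling_rows, {"min_amount": 10}), (add_doubled_amount, {})],
)
```

Here the first stage narrows the survivor universe to records 2 and 3, and the second stage's column is joined to the first stage's output on the same index.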
Package Architecture
Core orchestration:
- src/featurization/core/sequential_pipeline_runner.py
- src/featurization/core/path_coordinator.py
- src/featurization/core/featurization_init.py
Reusable tabular modules:
- src/tabular/modeling_filter.py
- src/tabular/train_val_split.py
- src/tabular/target_encoding.py
- src/tabular/feature_space.py
- src/tabular/low_count_cat_var_encoding.py
- src/tabular/hierarchical_low_count_var_encoding.py
- src/tabular/merge_ops.py
Design split:
- Row-selection components decide participation and partitioning
- Column-selection components decide engineering, encoding, and projection
- Assembly components perform index-aligned merges
Feature Selection
Feature selection runs in harmonize_and_project_feature_space on train rows only.
Supported selector modes:
- threshold
- tree_ensemble
Supported tree models:
- gbm
- random_forest
- xgboost (optional dependency)
All selector behavior is configuration-driven through featurizer_config.yaml.
Key Kneedle (knee-detection) controls:
- FEATURE_SELECTION_TOP_K_MODE
- FEATURE_SELECTION_TOP_K_MIN_RATIO
- FEATURE_SELECTION_MIN_FEATURE_COUNT
- FEATURE_SELECTION_TARGET_FEATURE_COUNT
- FEATURE_SELECTION_REQUIRE_KNEEDLE
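To give a feel for what a knee-based cutoff with a minimum feature count does, here is a simplified chord-distance heuristic in the spirit of Kneedle. It is an illustrative stand-in, not the package's selector and not the `kneed` library; `min_feature_count` only loosely mirrors FEATURE_SELECTION_MIN_FEATURE_COUNT, and the scores are made up.

```python
def knee_index(importances: list[float]) -> int:
    """Return the rank position of the knee in a descending importance curve."""
    ranked = sorted(importances, reverse=True)
    n = len(ranked)
    if n < 3 or ranked[0] == ranked[-1]:
        return n - 1  # too short or flat: no knee, keep everything
    span = ranked[0] - ranked[-1]
    best_i, best_gap = 0, float("-inf")
    for i, value in enumerate(ranked):
        x = i / (n - 1)                    # normalized rank in [0, 1]
        y = (value - ranked[-1]) / span    # normalized importance in [0, 1]
        gap = (1.0 - x) - y                # distance below the chord y = 1 - x
        if gap > best_gap:
            best_i, best_gap = i, gap
    return best_i


def select_top_features(scores: dict[str, float], min_feature_count: int = 1) -> list[str]:
    """Keep features up to the knee, but never fewer than min_feature_count."""
    ranked_names = sorted(scores, key=scores.get, reverse=True)
    keep = max(knee_index(list(scores.values())) + 1, min_feature_count)
    return ranked_names[:keep]


scores = {"a": 0.50, "b": 0.30, "c": 0.05, "d": 0.04, "e": 0.03, "f": 0.02}
selected = select_top_features(scores)
selected_with_floor = select_top_features(scores, min_feature_count=4)
```

The knee lands where the curve flattens, and the floor overrides the knee when the knee would keep too few features.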
CLI
Initialize a workspace config:
featurization-cli init \
  --working-dir /path/to/workspace \
  --metadata-file your_metadata.csv \
  --data-file your_cleaned_dataset.csv
Run the pipeline:
featurization-cli run --working-dir /path/to/workspace
Run package smoke tests:
pytest -q tests/test_sba_pipeline.py
How To Extend Safely
- Put reusable transformations in src/tabular first.
- Keep workspace stage wrappers thin and explicit.
- Add new tunables in all three locations:
  - featurizer_config.yaml
  - src/featurization/core/path_coordinator.py
  - src/featurization/core/featurization_init.py
- Preserve leakage-safe modeling flow:
  - fit artifacts on train only
  - transform val/active with train-fitted artifacts
- Validate with package tests and workspace integration runs.
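The leakage-safe rule in the list above (fit on train, then transform val/active with the fitted artifact) can be sketched with a simple mean target encoder. The function names, column names, and the global-mean fallback for unseen categories are assumptions for illustration; src/tabular/target_encoding.py may work differently.

```python
import pandas as pd


def fit_target_encoder(train: pd.DataFrame, column: str, target: str) -> dict:
    """Learn per-category target means from train rows only."""
    return {
        "mapping": train.groupby(column)[target].mean().to_dict(),
        "default": float(train[target].mean()),  # fallback for unseen categories
    }


def apply_target_encoder(frame: pd.DataFrame, column: str, encoder: dict) -> pd.Series:
    """Transform any partition using the train-fitted artifact; never refit here."""
    return frame[column].map(encoder["mapping"]).fillna(encoder["default"])


train = pd.DataFrame(
    {"state": ["CA", "CA", "NY", "NY"], "defaulted": [1, 0, 1, 1]},
    index=pd.Index([1, 2, 3, 4], name="record_id"),
)
val = pd.DataFrame(
    {"state": ["NY", "TX"]}, index=pd.Index([5, 6], name="record_id")
)

encoder = fit_target_encoder(train, "state", "defaulted")
val_encoded = apply_target_encoder(val, "state", encoder)
```

The val partition never contributes to the learned means, and a category unseen in train ("TX" here) falls back to the train-wide mean.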
Recommended Read Order
- documents/sba_pipeline_featurization.md
- documents/config_blueprint.md
- documents/path_coordinator_function.md
- src/featurization/core/sequential_pipeline_runner.py
- src/tabular/feature_space.py
File details
Details for the file kmds_featurization-0.1.5.tar.gz.
File metadata
- Download URL: kmds_featurization-0.1.5.tar.gz
- Upload date:
- Size: 28.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.26
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1ab83383c6eadc62617eb2a7f3bda65372c56c1fad1447e9c2d3c71d11b0c00a |
| MD5 | c70448e6552998777e8740ad62a4a1c3 |
| BLAKE2b-256 | 8f99000519ca02f7bb59a3c08524c7f461d468b0e8c4bdb08a388ac889b65b2e |
File details
Details for the file kmds_featurization-0.1.5-py3-none-any.whl.
File metadata
- Download URL: kmds_featurization-0.1.5-py3-none-any.whl
- Upload date:
- Size: 27.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.26
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 672324ed747e5aa897caa92af688bf6f67e9892385c88463a5c07a7de1fea391 |
| MD5 | d795ab6a71915791f41fb159b7ab0201 |
| BLAKE2b-256 | 5991662207816a4f400561da8ab1d76360f9cb0bc140cb3ceb90aa3c9be455fa |