KMDS Featurization Service and Orchestration Pipeline

KMDS Featurization

This repository provides a configurable, stage-based featurization engine for KMDS modeling workflows.

The design goals are simple:

keep stage logic understandable and composable
keep orchestration/configuration centralized
keep modeling flow leakage-safe (fit on train only, reuse on val/active)

What This Produces

The pipeline writes two CSV outputs:

featurized_data.csv: consolidated engineered dataset (modeled + active partitions)
model_ready_numeric_data.csv: numeric model-ready export from the final stage output

Additional diagnostic artifact:

feature_selection_knee_curve.png: ranked feature-importance knee plot saved in the featurization output directory

For the current SBA flow, the model-ready dataset is:

numeric/bool only
train-fitted feature-selected
schema-aligned across train/val/active
persisted with index=False (no index artifact column)
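
The numeric/bool filter and the index-free export can be sketched with pandas as follows. This is a minimal illustration, not the package's actual export code: the helper name `to_model_ready` and the column names are hypothetical.

```python
import io

import pandas as pd


def to_model_ready(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only numeric and boolean columns (hypothetical helper)."""
    return df.select_dtypes(include=["number", "bool"])


# Toy frame indexed by record_id; the column names are made up for illustration.
frame = pd.DataFrame(
    {
        "loan_amount": [1000.0, 2500.0],
        "is_franchise": [True, False],
        "state": ["CA", "TX"],  # object column, dropped by the filter
    },
    index=pd.Index([101, 102], name="record_id"),
)

model_ready = to_model_ready(frame)

# index=False keeps record_id out of the CSV, so no index artifact column appears.
buffer = io.StringIO()
model_ready.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

Writing to an in-memory buffer keeps the sketch free of side effects; a real run would pass a file path instead.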

Core Concepts

Anchor index: record_id
Stage contract: method(context, stage_cfg) -> DataFrame
Waterfall behavior: each stage can shrink survivor rows by index
Horizontal feature assembly: stage outputs are concatenated by index
Controlled index expansion: only stages marked allow_new_indices may re-introduce rows
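
To make these concepts concrete, here is a small sketch of a stage runner. It is not the package's orchestrator: `run_stages`, the dict-shaped `context`, and the two toy stages are assumptions made for illustration. Only the stage contract `method(context, stage_cfg) -> DataFrame` and the `allow_new_indices` flag come from the description above.

```python
from typing import Callable, List, Tuple

import pandas as pd

# Stage contract from the text: method(context, stage_cfg) -> DataFrame.
Stage = Callable[[dict, dict], pd.DataFrame]


def run_stages(base: pd.DataFrame, stages: List[Tuple[Stage, dict]]) -> pd.DataFrame:
    """Hypothetical runner: waterfall the survivor index, then concatenate by index."""
    context = {"base": base}  # assumption: the context is a plain dict
    survivors = base.index
    outputs = []
    for method, stage_cfg in stages:
        out = method(context, stage_cfg)
        if stage_cfg.get("allow_new_indices", False):
            # Controlled expansion: this stage may re-introduce rows.
            survivors = survivors.union(out.index)
        else:
            # Waterfall: rows missing from this stage's output stop surviving.
            survivors = survivors.intersection(out.index)
        outputs.append(out)
    # Horizontal assembly: align every stage output on the survivor index.
    return pd.concat([out.reindex(survivors) for out in outputs], axis=1)


def keep_amount(context: dict, stage_cfg: dict) -> pd.DataFrame:
    return context["base"][["amount"]]


def positive_amount_doubled(context: dict, stage_cfg: dict) -> pd.DataFrame:
    base = context["base"]
    kept = base[base["amount"] > 0]  # drops record 3
    return (kept[["amount"]] * 2).rename(columns={"amount": "amount_doubled"})


base = pd.DataFrame(
    {"amount": [10.0, 20.0, -5.0, 40.0]},
    index=pd.Index([1, 2, 3, 4], name="record_id"),
)
result = run_stages(base, [(keep_amount, {}), (positive_amount_doubled, {})])
```

In this toy run the second stage drops record 3, so the assembled frame keeps records 1, 2, and 4 with one column from each stage.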

Pipeline Layout (Current Hybrid Design)

Front section (feature assembly):

1. record_id_definition
2. entity_coding
3. prepare_categorical_data
4. prepare_numerical_data
5. merge_categorical_and_numerical
6. merge_with_entity_coding

Merge stage design:

package component: src/tabular/merge_ops.py
user wrappers: featurization_scripts/featurization.py
merge key: record_id index
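
An index-aligned merge on record_id might look like the sketch below. The function `merge_on_index` and its checks are hypothetical; the real helper in src/tabular/merge_ops.py may have a different name, signature, and validation rules.

```python
import pandas as pd


def merge_on_index(left: pd.DataFrame, right: pd.DataFrame, how: str = "inner") -> pd.DataFrame:
    """Hypothetical index-aligned merge; both frames must share the same index name."""
    if left.index.name != right.index.name:
        raise ValueError("index names differ; expected a shared record_id index")
    overlap = left.columns.intersection(right.columns)
    if len(overlap) > 0:
        raise ValueError(f"overlapping columns: {list(overlap)}")
    # DataFrame.join aligns on the index, so no key column is needed.
    return left.join(right, how=how)


categorical = pd.DataFrame(
    {"sector": ["retail", "tech", "farm"]},
    index=pd.Index([1, 2, 3], name="record_id"),
)
numerical = pd.DataFrame(
    {"amount": [20.0, 30.0, 40.0]},
    index=pd.Index([2, 3, 4], name="record_id"),
)

# With an inner join only records present in both payloads survive (2 and 3).
merged = merge_on_index(categorical, numerical)
```

Rejecting overlapping column names up front is a design choice in this sketch: it avoids silently suffixed duplicates in the assembled frame.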

Leakage-safe modeling section:

7. low_count_featurization_of_cat_vars
8. hierarchical_low_count_var_encoding
9. target_status_recoding
10. filter_modeling_universe
11. stratified_train_val_split
12. target_encode_categorical_vars
13. harmonize_and_project_feature_space
14. merge_modeled_and_active_partitions

Current encoding rule:

if both raw and rarity-corrected categorical variants exist (x and x_rcs), only x_rcs is target-encoded
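
The rule above amounts to a column-selection step, sketched here. The function name and the example column names are hypothetical; only the `_rcs` suffix convention comes from the text.

```python
from typing import Iterable, List

RCS_SUFFIX = "_rcs"


def columns_to_target_encode(categorical_columns: Iterable[str]) -> List[str]:
    """Drop a raw column x when its rarity-corrected variant x_rcs is also present."""
    cols = list(categorical_columns)
    present = set(cols)
    selected = []
    for col in cols:
        if not col.endswith(RCS_SUFFIX) and col + RCS_SUFFIX in present:
            continue  # raw variant is superseded by x_rcs
        selected.append(col)
    return selected


# "state" has an _rcs variant, so only "state_rcs" is kept;
# "sector" has no variant and stays as-is.
chosen = columns_to_target_encode(["state", "state_rcs", "sector", "bank_rcs"])
```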

Tree-Based Feature Selection

Feature selection runs in harmonize_and_project_feature_space using train rows only.

Supported selector modes:

threshold
tree_ensemble

Supported tree models:

gbm
random_forest
xgboost (optional dependency)

All selector choices are config-driven via featurizer_config.yaml and surfaced through PathCoordinator (no stage-level hardcoded constants).

Feature-count tuning for kneedle mode:

FEATURE_SELECTION_TOP_K_MODE: kneedle
FEATURE_SELECTION_TOP_K_MIN_RATIO: conservative default floor, e.g. 0.5
FEATURE_SELECTION_MIN_FEATURE_COUNT: hard floor for retained features
FEATURE_SELECTION_TARGET_FEATURE_COUNT: explicit count override when the curve is too aggressive
FEATURE_SELECTION_REQUIRE_KNEEDLE: fail loudly if the knee cannot be determined
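
The following sketch shows one way these tunables could interact. It rests on several assumptions: the knee is found with a simplified chord-distance heuristic rather than the actual Kneedle algorithm, the precedence between the floors and the override is a guess, and the function and parameter names are hypothetical stand-ins for the config keys above.

```python
import math
from typing import List, Optional


def knee_index(importances_desc: List[float]) -> Optional[int]:
    """Index of the point farthest below the chord from first to last point.

    Simplified stand-in for a knee detector; returns None when no knee exists.
    """
    n = len(importances_desc)
    if n < 3:
        return None
    first, last = importances_desc[0], importances_desc[-1]
    best_i, best_gap = None, 0.0
    for i, value in enumerate(importances_desc):
        chord = first + (last - first) * i / (n - 1)
        gap = chord - value
        if gap > best_gap:
            best_i, best_gap = i, gap
    return best_i


def retained_feature_count(
    importances_desc: List[float],
    min_ratio: float = 0.5,             # stands in for FEATURE_SELECTION_TOP_K_MIN_RATIO
    min_count: int = 1,                 # stands in for FEATURE_SELECTION_MIN_FEATURE_COUNT
    target_count: Optional[int] = None, # stands in for FEATURE_SELECTION_TARGET_FEATURE_COUNT
    require_knee: bool = False,         # stands in for FEATURE_SELECTION_REQUIRE_KNEEDLE
) -> int:
    n = len(importances_desc)
    if target_count is not None:
        # Assumed precedence: an explicit count overrides the curve entirely.
        return max(1, min(n, target_count))
    knee = knee_index(importances_desc)
    if knee is None:
        if require_knee:
            raise ValueError("knee could not be determined")
        k = n  # assumed fallback: keep every feature
    else:
        k = knee + 1  # keep features up to and including the knee
    floor = max(min_count, math.ceil(min_ratio * n))
    return min(n, max(k, floor))


importances = [0.40, 0.30, 0.15, 0.05, 0.04, 0.03, 0.02, 0.01]
default_count = retained_feature_count(importances)
wider_count = retained_feature_count(importances, min_ratio=0.75)
override_count = retained_feature_count(importances, target_count=3)
```

For this toy curve the knee falls at the fourth feature, so the default keeps 4 features, a ratio floor of 0.75 raises that to 6, and the explicit override forces 3.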

Repository Organization

src/featurization/core: orchestration, configuration bootstrap, path resolution
src/featurization/transforms: reusable transformation primitives
src/tabular: reusable tabular feature modules (encoding, splitting, feature space)
src/tabular/merge_ops.py: reusable index-aligned tabular merge helper
tests: package-level smoke and behavior checks
documents: architecture and configuration references

Package Component Buckets

The tabular package modules are intentionally split into two modeling buckets, plus a separate assembly bucket:

Row-selection components:
- src/tabular/modeling_filter.py
- src/tabular/train_val_split.py
- Purpose: decide which records participate in training and how records are partitioned.

Column-selection components:
- src/tabular/feature_space.py
- src/tabular/target_encoding.py
- src/tabular/low_count_cat_var_encoding.py
- src/tabular/hierarchical_low_count_var_encoding.py
- Purpose: decide which feature columns are engineered, selected, encoded, and projected.

Assembly components:
- src/tabular/merge_ops.py
- Purpose: index-aligned horizontal composition of prepared payloads.

CLI

Initialize config:

featurization-cli init \
  --working-dir /path/to/workspace \
  --metadata-file sba_loans_metadata_table.csv \
  --data-file sba_loans_user_cleaned.csv

Run pipeline:

featurization-cli run --working-dir /path/to/workspace

Run smoke test in this repo:

pytest -q tests/test_sba_pipeline.py

How To Extend Safely

Add reusable logic in src/tabular first whenever possible.
Keep stage wrappers in workspace featurization_scripts/featurization.py thin and explicit.
Add new tunables to:
- featurizer_config.yaml
- src/featurization/core/path_coordinator.py
- src/featurization/core/featurization_init.py
Preserve leakage rules:
- fit artifacts on train only
- transform val/active using train-fitted artifacts
Validate with tests after each change.
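
As an illustration of the leakage rules, the sketch below fits a plain mean target encoding on train rows and reuses the fitted artifact on validation rows. It is an assumption-laden example, not the package's encoder: the function names, the column names, and the unsmoothed mean with a global-mean fallback are all chosen for brevity.

```python
import pandas as pd


def fit_target_means(train: pd.DataFrame, column: str, target: str) -> dict:
    """Fit the artifact on train rows only (hypothetical helper)."""
    return {
        "means": train.groupby(column)[target].mean(),
        "fallback": float(train[target].mean()),  # used for unseen categories
    }


def apply_target_means(frame: pd.DataFrame, column: str, fitted: dict) -> pd.Series:
    """Transform any partition with the train-fitted artifact; never refit here."""
    return frame[column].map(fitted["means"]).fillna(fitted["fallback"])


train = pd.DataFrame(
    {"sector": ["retail", "retail", "tech", "tech"], "defaulted": [1, 0, 1, 1]}
)
val = pd.DataFrame({"sector": ["retail", "farm"]})  # "farm" never appears in train

fitted = fit_target_means(train, "sector", "defaulted")
val_encoded = apply_target_means(val, "sector", fitted)
```

The validation frame carries no target column at all, which makes it impossible for the transform step to peek at validation labels.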

Release history

0.2.0: Jun 15, 2026
0.1.5: Jun 11, 2026
0.1.4: Jun 10, 2026
0.1.2: Jun 10, 2026 (this version)
0.1.1: Jun 10, 2026
0.1.0: Jun 10, 2026
