
A feature selection toolkit for ML practitioners: visualize relationships, rank feature importance, and understand feature contributions. Choose wisely.


megumi



megumi is a Python package for feature selection in machine learning workflows.

Inspired by Jujutsu Kaisen's Megumi Fushiguro, this package helps you work strategically: feature selection is not about using every feature available, but about choosing the right ones.


What it does

Feature selection is one of the most impactful steps in building a machine learning model, and also one of the easiest to rush. megumi gives you the tools to do it properly, across three areas:

1. Visual exploration: Understand your features before modelling. See how each feature relates to the target, spot distributions that separate classes, catch missing-value patterns, and identify multicollinearity, each with a single function call.

2. Importance scoring: Go beyond intuition. Use machine learning-based methods to quantify which features actually carry predictive power.

3. Contribution analysis: Quantify how much a set of new features improves your model. Test candidate features against your current ones across any metric, including custom business metrics, and get statistical significance out of the box.


Modules

gyokuken - Visual feature analysis

Named after Megumi's shikigami Gyokuken (玉犬, the Divine Dogs), used for tracking and sensing. This module helps you track and sense the true nature of your features.

| Function | Description |
| --- | --- |
| `plot_bivariate` | Mean target rate per feature bucket vs. observation counts. Adapts automatically to continuous and categorical features. Missing values get their own bucket. |
| `plot_distribution` | Feature distribution grouped by target. Supports histogram, KDE, violin, and box plot. |
| `plot_correlation` | Lower-triangle correlation heatmap. Supports Pearson, Spearman, and Kendall. |
| `plot_missing` | Horizontal bar chart of missing-value percentages, sorted by severity. |

Usage example:

```python
from megumi.gyokuken import plot_bivariate, plot_distribution, plot_correlation, plot_missing

plot_bivariate(df, feature="age", target="default")
plot_distribution(df, feature="income", target="default", kind="violin")
plot_correlation(df)
plot_missing(df)
```

All visualisations adapt to the target type automatically: binary classification, multiclass classification, or regression.
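For intuition, the statistic behind a bivariate plot of this kind boils down to a groupby aggregation. A rough pandas equivalent on synthetic data (a sketch of the idea, not megumi's actual implementation):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=500).astype(float),
    "default": rng.integers(0, 2, size=500),
})

# Bucket the continuous feature into quantile bins
buckets = pd.qcut(df["age"], q=5, duplicates="drop")

# Mean target rate and observation count per bucket -- the two series
# a bivariate plot draws against each other
stats = df.groupby(buckets, observed=True)["default"].agg(
    mean_target="mean",
    count="size",
)
print(stats)
```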

bansho - SHAP-based feature importance scoring

Named after Megumi's shikigami Banshō (万象, Max Elephant), a heavy, water-releasing shikigami. This module uses machine learning and SHAP values to reveal which features carry real predictive power.

Two synthetic random features (RANDOM_1, RANDOM_2) are introduced as baselines before fitting a vanilla machine learning model. Every input feature is ranked by its mean absolute SHAP value and labelled in relation to those baselines:

| Label | Meaning |
| --- | --- |
| `predictive` | Mean \|SHAP\| beats both random features: a genuinely informative feature. |
| `marginal` | Mean \|SHAP\| beats one random feature: weak signal, use with caution. |
| `noise` | Mean \|SHAP\| beats neither random feature: no detectable predictive power. |

| Function | Description |
| --- | --- |
| `score_features` | Fit a vanilla model, compute SHAP values, and return a ranked DataFrame of features labelled by predictive power. |
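The random-baseline idea can be sketched in a few lines of sklearn. This toy version uses impurity-based `feature_importances_` as a stand-in for mean |SHAP| (to avoid the shap dependency), and all data and column names are synthetic:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 800
df = pd.DataFrame({
    "income": rng.normal(size=n),   # informative feature
    "zip": rng.normal(size=n),      # pure noise
})
df["default"] = (df["income"] + 0.5 * rng.normal(size=n) > 0).astype(int)

# Inject two synthetic random baselines, mirroring bansho's RANDOM_1 / RANDOM_2
df["RANDOM_1"] = rng.normal(size=n)
df["RANDOM_2"] = rng.normal(size=n)

features = ["income", "zip", "RANDOM_1", "RANDOM_2"]
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(df[features], df["default"])

importances = dict(zip(features, model.feature_importances_))
baselines = [importances["RANDOM_1"], importances["RANDOM_2"]]

def label(score):
    # Rank each feature against the two random baselines
    beats = sum(score > b for b in baselines)
    return {2: "predictive", 1: "marginal", 0: "noise"}[beats]

print({f: label(importances[f]) for f in ["income", "zip"]})
```

A feature that cannot outscore a column of pure noise has, by construction, no usable signal for the model.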

Usage example:

```python
from sklearn.model_selection import train_test_split
from megumi.bansho import score_features

df_train, df_val = train_test_split(df, test_size=0.2, random_state=42)

result = score_features(df_train, features=["age", "income", "zip"], target="default",
                        df_val=df_val, random_state=42)
# returns:
#      feature predictive_power
# 0     income       predictive
# 1        age         marginal
# 2        zip            noise
```

Passing df_val is recommended: the forest is fitted on the training set and SHAP values are computed on the held-out set, producing more conservative importance estimates. If omitted, SHAP is computed on the training set directly.
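Because the result is a plain DataFrame, the ranking slots directly into ordinary pandas filtering. A minimal sketch, mocking the output columns shown above:

```python
import pandas as pd

# Hypothetical score_features output, matching the columns shown above
result = pd.DataFrame({
    "feature": ["income", "age", "zip"],
    "predictive_power": ["predictive", "marginal", "noise"],
})

# Keep only features that beat both random baselines
keep = result.loc[result["predictive_power"] == "predictive", "feature"].tolist()
print(keep)  # ['income']
```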

Supports binary classification and regression targets. The strategy parameter is reserved for future model types (e.g. "linear"); currently only "tree" (random forest) is available.


nue - Feature contribution analysis

Named after Megumi's shikigami Nue (鵺), a chimeric creature used to survey and strike from above. This module answers the question: "If I add these features, how much improvement do I get?"

Three random forests are fitted per cross-validation fold: one on the base features alone, one on base + candidate features (real values), and one on base + candidate features with the new columns row-permuted (the null model). Significance is measured by a paired t-test between the real augmented model and the null model. This isolates genuine signal from the random-forest diversification effect, which can inflate performance under a fixed max_features budget even when the added columns are pure noise. Both sklearn metrics and user-defined callables are supported, including business metrics that depend on extra columns in the dataset.
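The permutation-null comparison can be sketched with sklearn and scipy. This toy version uses synthetic data, a single metric (ROC AUC), and a globally permuted candidate column; all names are illustrative, not megumi's internals:

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_rel
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(42)
n = 1000
df = pd.DataFrame({"base": rng.normal(size=n), "cand": rng.normal(size=n)})
df["y"] = (0.5 * df["base"] + 2.0 * df["cand"] + rng.normal(size=n) > 0).astype(int)

# Null copy: the candidate column row-permuted, destroying its signal
df_null = df.copy()
df_null["cand"] = rng.permutation(df["cand"].to_numpy())

def fold_auc(data, tr, te):
    # Fit on the training rows of this fold, score AUC on the held-out rows
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(data.iloc[tr][["base", "cand"]], data.iloc[tr]["y"])
    proba = model.predict_proba(data.iloc[te][["base", "cand"]])[:, 1]
    return roc_auc_score(data.iloc[te]["y"], proba)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
aug, null = [], []
for tr, te in cv.split(df, df["y"]):
    aug.append(fold_auc(df, tr, te))        # base + real candidate
    null.append(fold_auc(df_null, tr, te))  # base + permuted candidate

# Paired t-test: same folds, so score differences pair up naturally
t_stat, p_value = ttest_rel(aug, null)
```

Comparing against the permuted model rather than the base model is the key design choice: both augmented models have the same number of columns, so any width-related inflation cancels out.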

| Function | Description |
| --- | --- |
| `evaluate_contribution` | Compare model metrics before and after adding candidate features and report whether the improvement is statistically significant. |

Built-in metrics: "roc_auc", "recall", "precision", "f1", "accuracy" (classification); "rmse", "mae", "r2" (regression).

Usage example:

```python
from megumi.nue import evaluate_contribution

result = evaluate_contribution(
    df,
    base_features=["age", "income", "credit_score"],
    new_features=["vendor_A", "vendor_B"],
    target="default",
    metrics=["roc_auc", "recall"],
    random_state=42,
)
# returns:
#     metric  base_score  augmented_score   delta  pct_change  p_value  significant
# 0  roc_auc      0.7421           0.8103  0.0682        9.19     0.003         True
# 1   recall      0.6830           0.7512  0.0682        9.98     0.021         True
```

Custom metric callables that accept a third argument receive the full test-fold DataFrame, enabling business metrics such as loss avoided (the loan amounts of true defaulters the model flags in time):

```python
def loss_avoided(y_true, y_pred_proba, df_fold, threshold=0.5):
    # Total loan amount of actual defaulters that the model flagged
    flagged = y_pred_proba >= threshold
    return df_fold.loc[y_true.astype(bool) & flagged, "loan_amount"].sum()
```

```python
result = evaluate_contribution(
    df,
    base_features=["age", "income"],
    new_features=["vendor_score"],
    target="default",
    metrics=[loss_avoided],
    random_state=42,
)
```

Supports binary classification and regression targets.
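Acting on the result is again plain pandas. One possible decision rule, using a mock of the output columns shown above (hypothetical values, not real scores):

```python
import pandas as pd

# Hypothetical evaluate_contribution output, matching the columns shown above
result = pd.DataFrame({
    "metric": ["roc_auc", "recall"],
    "delta": [0.0682, 0.0682],
    "p_value": [0.003, 0.021],
    "significant": [True, True],
})

# Adopt the candidate features only if every metric improves significantly
adopt = bool(result["significant"].all() and (result["delta"] > 0).all())
print(adopt)  # True
```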


Installation

```bash
pip install megumi
```

Or, to set up a development environment using conda:

```bash
conda env create -f environment.yml
conda activate megumi-dev
```

Status

megumi is under active development. All three modules are available: gyokuken for visual feature exploration, bansho for SHAP-based importance scoring, and nue for feature contribution analysis. Contributions and feedback are welcome.


"With this treasure, I summon..."

amazing feature selection?
