Fast engineered feature discovery from LightGBM paths

Features Goldmine

Are you stuck building your ML pipeline? Are you searching for creative ideas for new features? Looking for a quick, easy, and performant way to do feature engineering?

We would like to introduce features_goldmine, a Python package built exactly for that problem. It runs multiple feature engineering strategies on your raw tabular data, creates new candidate features, filters weak ideas, and proposes only the features that are worth checking.

The main goal of features_goldmine is simple: improve ML pipeline accuracy with minimal code changes.

California Housing Example

Use your existing train/test split, generate golden features in one line, and retrain your model.

import pandas as pd  # needed for pd.concat below
from features_goldmine import GoldenFeatures

# 1) baseline model on X_train / X_test
# 2) add golden features

gf = GoldenFeatures(verbose=1, selectivity="balanced")
X_train_gold = gf.fit_transform(X_train, y_train)
X_test_gold = gf.transform(X_test)

X_train_aug = pd.concat([X_train, X_train_gold], axis=1)
X_test_aug = pd.concat([X_test, X_test_gold], axis=1)

# 3) train your model on augmented data and compare metric

Real run from this repo:

uv run python examples/california_housing.py
Baseline RMSE (single split): 0.450530
Golden   RMSE (single split): 0.447588
Delta RMSE (golden - baseline): -0.002942
RMSE improvement vs baseline: +0.65%
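
The last two lines of that output follow directly from the two RMSE values. Here is a minimal sketch of the arithmetic; the helper name `compare_metric` is illustrative and not part of the package:

```python
def compare_metric(baseline: float, golden: float) -> tuple[float, float]:
    """Return (delta, improvement_pct) for a lower-is-better metric."""
    delta = golden - baseline  # negative means the golden run is better
    improvement_pct = -delta / baseline * 100.0
    return delta, improvement_pct


delta, pct = compare_metric(0.450530, 0.447588)
print(f"Delta RMSE (golden - baseline): {delta:.6f}")
print(f"RMSE improvement vs baseline: {pct:+.2f}%")
```

The same formula applies to any lower-is-better metric, such as the LogLoss reported in the Iris example.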

Selected examples of created features:

AveOccup_div_MedInc
MedInc_div_AveOccup
ctx_raw_Longitude_z_k15
HouseAge_mul_MedInc
Latitude_mul_Longitude
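
The names above appear to follow a `<left>_<op>_<right>` pattern. The sketch below shows how such columns could be computed, assuming that `mul`, `div`, `sub`, and `absdiff` are plain elementwise operations. The `OPS` table, the helper `pairwise_feature`, and the NaN-on-zero-division behaviour are our assumptions, not the package's documented implementation:

```python
import math

# Assumed mapping from the operator token in a feature name to a formula.
OPS = {
    "mul": lambda a, b: a * b,
    "div": lambda a, b: a / b if b != 0 else math.nan,  # assumption: NaN on zero
    "sub": lambda a, b: a - b,
    "absdiff": lambda a, b: abs(a - b),
}


def pairwise_feature(columns, left, op, right):
    """Build one engineered column named '<left>_<op>_<right>'."""
    name = f"{left}_{op}_{right}"
    values = [OPS[op](a, b) for a, b in zip(columns[left], columns[right])]
    return name, values


# Two made-up rows that reuse California Housing column names.
raw = {"MedInc": [8.0, 2.0], "AveOccup": [2.0, 4.0], "HouseAge": [41.0, 21.0]}

name, values = pairwise_feature(raw, "AveOccup", "div", "MedInc")
print(name, values)  # AveOccup_div_MedInc [0.25, 2.0]
```
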

Full California Housing training output

uv run python examples/california_housing.py
Dataset: california_housing
Rows=20640, Features=8
Mode: single train/test split (test_size=0.25), selectivity=balanced, gf_verbose=1
[GoldenFeatures] fit: validating inputs
[GoldenFeatures] fit: task=regression, rows=15480, total_features=8, numeric_features=8, categorical_features=0, selectivity=relaxed, max_selected_features=None, enabled_strategies=['categorical_frequency', 'categorical_group_deviation', 'categorical_hash_cross', 'categorical_oof_target', 'categorical_prototypes', 'context_knn', 'grouped_row_stats', 'path', 'projection_ica', 'projection_pca', 'residual_numeric']
[GoldenFeatures] stage1: training fast LightGBM on raw features (3 repeats)
[GoldenFeatures] stage1: repeat=1/3, seed=42, paths=1223
[GoldenFeatures] stage1: repeat=2/3, seed=143, paths=1247
[GoldenFeatures] stage1: repeat=3/3, seed=244, paths=1245
[GoldenFeatures] stage1: trained with 8 raw features
[GoldenFeatures] stage2: extracted total 3715 paths across repeats
[GoldenFeatures] stage3: ranking feature interactions
[GoldenFeatures] stage3: ranked 28 interaction pairs
[GoldenFeatures] stage4: building candidate engineered features (enabled=['categorical_frequency', 'categorical_group_deviation', 'categorical_hash_cross', 'categorical_oof_target', 'categorical_prototypes', 'context_knn', 'grouped_row_stats', 'path', 'projection_ica', 'projection_pca', 'residual_numeric'])
[GoldenFeatures] stage4: generated 160 path candidates + 3 projection candidates + 3 ica candidates + 20 grouped-stats candidates + 30 context-knn candidates + 50 residual candidates = 266 total
[GoldenFeatures] stage5: quick filtering candidates
[GoldenFeatures] stage5: kept 159 candidates, rejected 107
[GoldenFeatures] stage6: survival competition with repeated LightGBM
[GoldenFeatures] stage6: 16 candidates survived
[GoldenFeatures] stage7: redundancy pruning
[GoldenFeatures] stage7: final survivors after pruning = 15
[GoldenFeatures] fit: completed (candidates=266, after_filter=159, survivors=16, final=15)
[GoldenFeatures] transform: generating 15 golden features
[GoldenFeatures] transform: generating 15 golden features
[Split] created=15 features: ['AveOccup_div_MedInc', 'MedInc_div_AveOccup', 'MedInc_div_AveRooms', 'Latitude_div_Longitude', 'ctx_raw_Longitude_z_k15', 'HouseAge_mul_MedInc', 'grpstat_002_mean', 'AveOccup_absdiff_MedInc', 'AveRooms_div_MedInc', 'AveRooms_sub_MedInc', 'Latitude_mul_Longitude', 'AveBedrms_mul_MedInc', 'AveBedrms_sub_Longitude', 'Latitude_sub_Longitude', 'AveOccup_sub_MedInc']
Baseline RMSE (single split): 0.450530
Golden   RMSE (single split): 0.447588
Delta RMSE (golden - baseline): -0.002942
RMSE improvement vs baseline: +0.65%

Iris Multiclass Example

uv run python examples/iris_multiclass.py
Baseline LogLoss (single split): 0.801166
Golden   LogLoss (single split): 0.279341
Delta LogLoss (golden - baseline): -0.521825
LogLoss improvement vs baseline: +65.13%

Selected examples of created features:

petal_length_cm_mul_petal_width_cm
rule_petal_length_cm_gt_2p450_and_petal_width_cm_le_1p550_002
sepal_length_cm_div_petal_width_cm
pca_comp_001
petal_width_cm_mul_sepal_length_cm
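
The `rule_...` feature name reads like two tree split conditions joined by "and". A minimal sketch of that reading, assuming `2p450` stands for the threshold 2.450, `gt`/`le` mean `>`/`<=`, and the feature is a 0/1 indicator (none of which is confirmed by the package documentation):

```python
def rule_feature(petal_length_cm: float, petal_width_cm: float) -> int:
    """1 if the row satisfies both split conditions, else 0 (assumed encoding)."""
    return int(petal_length_cm > 2.450 and petal_width_cm <= 1.550)


# Made-up measurements: (petal_length_cm, petal_width_cm)
rows = [(1.4, 0.2), (4.5, 1.5), (5.8, 2.2)]
print([rule_feature(pl, pw) for pl, pw in rows])  # [0, 1, 0]
```
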

Full Iris Multiclass training output

uv run python examples/iris_multiclass.py
Dataset: iris
Rows=150, Features=4, Classes=3
Mode: single train/test split (test_size=0.25), selectivity=balanced, gf_verbose=1
[GoldenFeatures] fit: validating inputs
[GoldenFeatures] fit: task=multiclass, rows=112, total_features=4, numeric_features=4, categorical_features=0, selectivity=balanced, max_selected_features=None, enabled_strategies=['categorical_frequency', 'categorical_group_deviation', 'categorical_hash_cross', 'categorical_oof_target', 'categorical_prototypes', 'context_knn', 'grouped_row_stats', 'path', 'projection_ica', 'projection_pca', 'residual_numeric']
[GoldenFeatures] stage1: training fast LightGBM on raw features (3 repeats)
[GoldenFeatures] stage1: repeat=1/3, seed=42, paths=208
[GoldenFeatures] stage1: repeat=2/3, seed=143, paths=206
[GoldenFeatures] stage1: repeat=3/3, seed=244, paths=208
[GoldenFeatures] stage1: trained with 4 raw features
[GoldenFeatures] stage2: extracted total 622 paths across repeats
[GoldenFeatures] stage3: ranking feature interactions
[GoldenFeatures] stage3: ranked 5 interaction pairs
[GoldenFeatures] stage4: building candidate engineered features (enabled=['categorical_frequency', 'categorical_group_deviation', 'categorical_hash_cross', 'categorical_oof_target', 'categorical_prototypes', 'context_knn', 'grouped_row_stats', 'path', 'projection_ica', 'projection_pca', 'residual_numeric'])
[GoldenFeatures] stage4: generated 45 path candidates + 3 projection candidates + 12 grouped-stats candidates = 60 total
[GoldenFeatures] stage5: quick filtering candidates
[GoldenFeatures] stage5: kept 38 candidates, rejected 22
[GoldenFeatures] stage6: survival competition with repeated LightGBM
[GoldenFeatures] stage6: 9 candidates survived
[GoldenFeatures] stage7: redundancy pruning
[GoldenFeatures] stage7: final survivors after pruning = 9
[GoldenFeatures] fit: completed (candidates=60, after_filter=38, survivors=9, final=9)
[GoldenFeatures] transform: generating 9 golden features
[GoldenFeatures] transform: generating 9 golden features
[Split] created=9 features: ['petal_length_cm_mul_petal_width_cm', 'rule_petal_length_cm_gt_2p450_and_petal_width_cm_le_1p550_002', 'sepal_length_cm_div_petal_width_cm', 'pca_comp_001', 'petal_length_cm_absdiff_sepal_width_cm', 'petal_width_cm_mul_sepal_length_cm', 'petal_width_cm_div_sepal_width_cm', 'grpstat_003_std', 'petal_length_cm_absdiff_sepal_length_cm']
Baseline LogLoss (single split): 0.801166
Golden   LogLoss (single split): 0.279341
Delta LogLoss (golden - baseline): -0.521825
LogLoss improvement vs baseline: +65.13%

Other Examples

These example scripts are included in this repository:

uv run python examples/breast_cancer_binary.py
uv run python examples/credit_scoring.py
uv run python examples/house_prices_rmse.py

How It Works

features_goldmine starts with your raw tabular data: a pandas DataFrame X and target y.

First, the package generates many candidate features using several strategies. Some strategies look for interactions discovered by LightGBM tree paths. Others create numeric transformations, projection features, row-group statistics, categorical encodings, categorical-numeric deviation features, and context-style features.
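
As one concrete illustration, row-group statistics can be sketched in a few lines. How the package chooses column groups is not described here, so the group below is hand-picked, and the use of the population standard deviation is an assumption. Only the `grpstat_NNN_<stat>` naming is taken from the logs above:

```python
import statistics


def grouped_row_stats(rows, group, prefix):
    """For each row, summarise the values of the columns listed in `group`."""
    out = []
    for row in rows:
        values = [row[col] for col in group]
        out.append({
            f"{prefix}_mean": statistics.fmean(values),
            f"{prefix}_std": statistics.pstdev(values),  # assumption: population std
            f"{prefix}_min": min(values),
            f"{prefix}_max": max(values),
        })
    return out


# One made-up Iris-like row and a hand-picked group of two columns.
rows = [{"sepal_length_cm": 5.0, "sepal_width_cm": 3.0, "petal_length_cm": 1.0}]
stats = grouped_row_stats(rows, ["sepal_length_cm", "sepal_width_cm"], "grpstat_001")
print(stats[0])
```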

Next, features_goldmine runs a fast initial filtering step. This removes candidates that are obviously not useful: constant columns, near-constant columns, invalid values, too many missing values, duplicates, and features that are too similar to their parent columns.
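
The checks listed above can be sketched with the standard library alone. This is an illustration of the idea, not the package's code: the function names, the order of the checks, and every threshold (`max_missing`, `max_dominant`, `max_parent_corr`) are assumptions.

```python
import math
from collections import Counter


def pearson(a, b):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b) if var_a > 0 and var_b > 0 else 0.0


def quick_filter(candidates, raw, parents,
                 max_missing=0.2, max_dominant=0.99, max_parent_corr=0.98):
    """Split candidates into kept columns and a name -> reason map of rejects."""
    kept, rejected, seen = {}, {}, set()
    for name, values in candidates.items():
        # Rows where the candidate has a usable (non-missing) value.
        idx = [i for i, v in enumerate(values) if v is not None and not math.isnan(v)]
        present = [values[i] for i in idx]
        if any(math.isinf(v) for v in present):
            rejected[name] = "invalid values"
        elif 1 - len(present) / len(values) > max_missing:
            rejected[name] = "too many missing"
        elif Counter(present).most_common(1)[0][1] / len(present) >= max_dominant:
            rejected[name] = "constant or near-constant"
        elif tuple(values) in seen:
            rejected[name] = "duplicate"
        elif any(abs(pearson(present, [raw[p][i] for i in idx])) > max_parent_corr
                 for p in parents[name]):
            rejected[name] = "too similar to parent"
        else:
            kept[name] = values
            seen.add(tuple(values))
    return kept, rejected


raw = {"a": [1.0, 2.0, 3.0, 4.0, 5.0], "b": [2.0, 1.0, 4.0, 3.0, 6.0]}
candidates = {
    "a_mul_b": [2.0, 2.0, 12.0, 12.0, 30.0],
    "a_mul_b_copy": [2.0, 2.0, 12.0, 12.0, 30.0],
    "const": [1.0, 1.0, 1.0, 1.0, 1.0],
    "a_times_2": [2.0, 4.0, 6.0, 8.0, 10.0],
    "mostly_missing": [None, None, None, 1.0, 2.0],
}
parents = {"a_mul_b": ("a", "b"), "a_mul_b_copy": ("a", "b"), "const": ("a",),
           "a_times_2": ("a",), "mostly_missing": ("b",)}

kept, rejected = quick_filter(candidates, raw, parents)
print(sorted(kept))
print(rejected)
```

Only `a_mul_b` survives in this toy input; each of the other four candidates trips a different check.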

Finally, it trains several small LightGBM models and lets the candidates compete against the raw features. Candidate features are selected based on repeated feature importance: features that consistently receive useful gain and rank highly across runs are kept. Redundant survivors are pruned, and the final output is a clean DataFrame containing only the selected engineered features.
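
The "repeated feature importance" idea can be illustrated without training anything. In the sketch below the gain values are hard-coded; in a real run they would come from each fitted LightGBM model. The rule itself (a candidate must land in the top `top_k` features in at least `min_hits` runs) is our simplification, and the package's actual criterion may differ:

```python
from collections import defaultdict


def stable_survivors(runs, candidate_names, top_k=2, min_hits=2):
    """Keep candidates that rank in the top_k by gain in at least min_hits runs."""
    hits = defaultdict(int)
    for gains in runs:
        # Raw and engineered features compete in the same ranking.
        ranked = sorted(gains, key=gains.get, reverse=True)[:top_k]
        for name in ranked:
            if name in candidate_names and gains[name] > 0:
                hits[name] += 1
    return [name for name in candidate_names if hits[name] >= min_hits]


# Made-up gain importances from three runs with different seeds.
runs = [
    {"MedInc": 50, "Latitude": 30, "MedInc_div_AveOccup": 40, "AveRooms_sub_MedInc": 5},
    {"MedInc": 48, "Latitude": 28, "MedInc_div_AveOccup": 45, "AveRooms_sub_MedInc": 46},
    {"MedInc": 52, "Latitude": 31, "MedInc_div_AveOccup": 38, "AveRooms_sub_MedInc": 2},
]
candidates = ["MedInc_div_AveOccup", "AveRooms_sub_MedInc"]

survivors = stable_survivors(runs, candidates)
print(survivors)  # ['MedInc_div_AveOccup']
```

`AveRooms_sub_MedInc` ranks highly in one run only, so it is treated as unstable and dropped.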

In short:

raw X, y
  -> generate many candidate features
  -> remove obviously bad candidates
  -> train several small LightGBM models
  -> keep candidates with strong, stable importance
  -> return final golden features

Performance

We tested features_goldmine by comparing two models:

  • LGBM_Baseline: a simple LightGBM model trained on raw data.
  • LGBM_GoldenFeatures: the same LightGBM parameters, trained on raw data plus golden features.

The comparison was run on TabArena Lite, across 51 datasets. Lower metric_error is better.

LGBM_GoldenFeatures vs LGBM_Baseline
Better: 27 datasets
Worse : 24 datasets
Win rate: 52.9%
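
For reference, the win rate is the share of datasets where the golden model has the lower error. A small sketch with made-up error pairs (the helper `win_rate` is illustrative, and excluding ties from the denominator is our assumption):

```python
def win_rate(errors):
    """errors: list of (baseline_error, golden_error) pairs, lower is better."""
    better = sum(golden < baseline for baseline, golden in errors)
    worse = sum(golden > baseline for baseline, golden in errors)
    return better, worse, better / (better + worse)


# Three made-up datasets, not TabArena results.
pairs = [(0.45, 0.44), (0.30, 0.31), (0.20, 0.18)]
better, worse, rate = win_rate(pairs)
print(better, worse, f"{rate:.1%}")

# The reported figure: 27 better, 24 worse.
print(f"{27 / (27 + 24):.1%}")  # 52.9%
```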

The practical takeaway: golden features helped on slightly more than half of the datasets (27 of 51), so a gain is not guaranteed. Feature engineering is data-dependent, so features_goldmine is designed to make it fast and easy to check whether engineered features improve your pipeline.

Simple API

from features_goldmine import GoldenFeatures

gf = GoldenFeatures()
X_gold = gf.fit_transform(X, y)

Constructor arguments:

gf = GoldenFeatures(
    random_state=42,
    verbose=0,
    selectivity="balanced",
    max_selected_features=None,
    include_strategies=None,
    exclude_strategies=None,
)
  • random_state
    • Controls randomness for repeatable results.
    • Use the same value if you want the same generated features across runs.
  • verbose
    • Set to 1 to print detailed logs for every stage.
    • Keep 0 for quiet mode.
  • selectivity
    • Controls how strict the feature survival test is.
    • Options: relaxed, balanced, strict.
    • balanced is the default and a good starting point.
  • max_selected_features
    • Limits the final number of selected golden features.
    • Example: max_selected_features=3 keeps only the top 3.
  • include_strategies
    • Optional list of strategies to use.
    • If None, all built-in strategies are enabled.
  • exclude_strategies
    • Optional list of strategies to disable.
    • Useful when you want to turn off a specific family, for example categorical_frequency.

The core API consists of three methods:

  • fit(X, y)
    • What it does: learns which engineered features are useful from your training data.
    • Use it when: you want to fit once, then transform multiple datasets later.
    • Returns: the same GoldenFeatures object (self).
  • transform(X)
    • What it does: creates the selected engineered features for new data.
    • Use it when: you already called fit (or loaded a fitted model) and now want features for validation/test/production data.
    • Returns: a DataFrame with only engineered features.
  • fit_transform(X, y)
    • What it does: fit + transform in one call.
    • Use it when: you just want engineered features for your training split quickly.
    • Returns: a DataFrame with only engineered features.

Beginner rule of thumb:

  • training split: fit_transform
  • validation/test split: transform

Example:

gf = GoldenFeatures()

# train split
X_train_gold = gf.fit_transform(X_train, y_train)

# validation/test split (same fitted feature logic)
X_valid_gold = gf.transform(X_valid)
X_test_gold = gf.transform(X_test)

Other useful methods:

  • save(path)
    • What it does: saves a fitted GoldenFeatures object to disk.
    • Use it when: you want to reuse the exact same feature logic later.
  • GoldenFeatures.load(path)
    • What it does: loads a previously saved fitted object.
    • Use it when: you want consistent features in another script or production job.
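
The save/load pattern can be shown with a stand-in object. `StubTransformer` below is not the real `GoldenFeatures` class; it only mirrors the `save(path)` / `load(path)` shape described above so the example runs without a fitted model. Its JSON file format is an assumption made for the stub; the package's own on-disk format is not documented here.

```python
import json
import os
import tempfile


class StubTransformer:
    """Stand-in exposing save(path) and a load(path) classmethod (not the real class)."""

    def __init__(self, names):
        self.selected_feature_names_ = list(names)

    def save(self, path):
        with open(path, "w", encoding="utf-8") as fh:
            json.dump(self.selected_feature_names_, fh)

    @classmethod
    def load(cls, path):
        with open(path, encoding="utf-8") as fh:
            return cls(json.load(fh))


def save_and_reload(fitted, path):
    """Persist a fitted object, then load it back through its own class."""
    fitted.save(path)
    return type(fitted).load(path)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "golden_features.json")
    original = StubTransformer(["AveOccup_div_MedInc", "HouseAge_mul_MedInc"])
    restored = save_and_reload(original, path)

print(restored.selected_feature_names_ == original.selected_feature_names_)  # True
```

In practice the save call would live in the training script and the load call in the scoring job, so both use the same fitted feature logic.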

Useful attributes:

  • selected_feature_names_
  • golden_features_
  • report_

Strategy Control and Available Strategies

The defaults work well for first use. If needed, choose strategies explicitly:

gf = GoldenFeatures(
    include_strategies=["path", "context_knn", "categorical_group_deviation"],
    exclude_strategies=["categorical_frequency"],
)

Current strategy keys:

  • path
    • finds feature interactions that tree models actually used, then creates formulas like multiply/divide/subtract and simple split-based rules.
    • Usually helps when: non-linear numeric interactions matter.
  • projection_pca
    • creates compact summary features (principal components) from many numeric columns.
    • Usually helps when: numeric features are correlated and you want cleaner combined signals.
  • projection_ica
    • creates independent numeric components that can reveal hidden patterns different from PCA.
    • Usually helps when: signals are mixed and not well captured by simple linear combinations.
  • grouped_row_stats
    • computes row-level stats (mean/std/min/max) over related feature groups.
    • Usually helps when: relative scale inside a group of columns matters.
  • context_knn
    • compares each row to nearby rows in numeric space (local context), generating deviation-style features.
    • Usually helps when: local neighborhood behavior is informative.
  • residual_numeric
    • for regression, creates numeric interactions that correlate with baseline model residual errors.
    • Usually helps when: baseline model leaves structured numeric errors.
  • categorical_frequency
    • encodes how common each category value is in the data.
    • Usually helps when: rare vs common category values carry signal.
  • categorical_oof_target
    • leak-safe target encoding using out-of-fold averages per category.
    • Usually helps when: category values strongly relate to target.
  • categorical_group_deviation
    • compares a numeric value to what is typical for its category (for example value - category_mean).
    • Usually helps when: "higher/lower than typical for this category" matters.
  • categorical_prototypes
    • measures distance between a row and category-specific numeric prototypes.
    • Usually helps when: each category has a characteristic numeric profile.
  • categorical_hash_cross
    • builds compact crossed-category signals via hashed category pairs.
    • Usually helps when: interactions between two categorical columns matter but full one-hot crosses would explode.
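
To make "leak-safe" concrete for categorical_oof_target, here is a minimal out-of-fold target encoding sketch. It is not the package's implementation: the helper name, the fold assignment by row index, and the global-mean fallback for unseen categories are all assumptions chosen to keep the example short.

```python
from collections import defaultdict


def oof_target_encode(categories, targets, n_folds=3):
    """Encode each row with its category's target mean computed on the other folds."""
    n = len(categories)
    global_mean = sum(targets) / n
    encoded = [0.0] * n
    for fold in range(n_folds):
        sums, counts = defaultdict(float), defaultdict(int)
        # Statistics come only from rows outside the current fold.
        for i in range(n):
            if i % n_folds != fold:
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
        # Rows inside the fold never see their own target value.
        for i in range(n):
            if i % n_folds == fold:
                cat = categories[i]
                encoded[i] = sums[cat] / counts[cat] if counts[cat] else global_mean
    return encoded


categories = ["a", "a", "a", "b", "b", "c"]
targets = [1, 0, 1, 0, 0, 1]
print(oof_target_encode(categories, targets))
# [0.5, 1.0, 0.5, 0.0, 0.0, 0.5]
```

Note the second row: its own target is 0, yet its encoding is 1.0 because only the other two "a" rows contribute. The single "c" row has no other rows to learn from and falls back to the global mean.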

Install

# from PyPI (recommended); use either pip or uv
pip install features_goldmine
uv add features_goldmine

# local editable install (development); use either uv or pip
uv pip install -e .
pip install -e .

Useful Tricks

You can experiment with different selection settings. In practice, it is often worth trying a few variants because the best setting depends on your data.

If you want to increase the number of proposed golden features, try relaxed:

gf = GoldenFeatures(selectivity="relaxed", verbose=1)

If you want a smaller, more conservative set of features, try strict:

gf = GoldenFeatures(selectivity="strict", verbose=1)

If you want only the top few golden features, use max_selected_features. For example, keep only the top 3:

gf = GoldenFeatures(max_selected_features=3, verbose=1)

License

Apache 2.0
