AutoFE - Playground: Automatic Feature Engineering & Selection for Kaggle Playground Competitions

🧪 AutoFE-PG

Automatic Feature Engineering & Selection for Kaggle Playground Competitions

Python 3.8+ License: MIT CI

AutoFE-PG is a production-ready library that automatically generates, evaluates, and selects engineered features to boost your tabular ML models, with zero target leakage.

Designed specifically for Kaggle Playground competitions where synthetic data is common, it includes specialized strategies for domain alignment, Bayesian priors from external data, dual-representation features, and cross-dataset density analysis.


✨ Key Features

| Feature | Description |
|---|---|
| Auto column detection | Automatically identifies categorical vs. numerical columns |
| 25+ feature strategies | Target encoding, domain alignment, Bayesian priors, dual representation, cross-dataset frequency, count encoding, digit extraction, arithmetic interactions, group statistics, and more |
| Zero target leakage | All target-dependent features use strict out-of-fold encoding |
| Greedy forward selection | Adds features one-by-one, keeping only those that improve CV score |
| Optional backward pruning | Removes redundant features after forward selection |
| Original data integration | Snap synthetic values to real clinical grids and inject historical priors |
| GPU acceleration | Automatically uses XGBoost GPU if available |
| Time budget | Set a wall-clock limit; the search stops gracefully |
| Sampling support | Evaluate on a subsample for faster iteration |
| Custom XGBoost params | Pass your own hyperparameters |
| Score variance tracking | Reports mean ± std across folds |
| Classification & regression | Supports both tasks with auto-detection |
| Detailed reports | Auto-generated .txt report with full selection history |

🚀 Quick Start

Installation

pip install autofepg

Or install dependencies directly:

pip install -r requirements.txt

Minimal Example

import pandas as pd
from autofepg import select_features

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

X_train = train.drop(columns=["id", "target"])
y_train = train["target"]
X_test = test.drop(columns=["id"])

result = select_features(
    X_train, y_train, X_test,
    task="classification",
    time_budget=3600,
)

X_train_new = result["X_train"]
X_test_new = result["X_test"]

print(f"Baseline AUC: {result['base_score']:.6f}")
print(f"Best AUC:     {result['best_score']:.6f}")
print(f"Features added: {len(result['selected_features'])}")

With Original Data (Domain Alignment + Bayesian Priors)

When working with Kaggle Playground competitions where synthetic data is generated from a real dataset, you can pass the original data to unlock powerful de-noising and prior-injection strategies:

import pandas as pd
from autofepg import select_features

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
original = pd.read_csv("original.csv")

X_train = train.drop(columns=["id", "target"])
y_train = train["target"]
X_test = test.drop(columns=["id"])

X_original = original.drop(columns=["target"])
y_original = original["target"]

result = select_features(
    X_train, y_train, X_test,
    task="classification",
    time_budget=3600,
    original_df=X_original,
    original_target=y_original,
)

X_train_new = result["X_train"]
X_test_new = result["X_test"]

print(f"Baseline AUC: {result['base_score']:.6f}")
print(f"Best AUC:     {result['best_score']:.6f}")
print(f"Features added: {len(result['selected_features'])}")

Using the Class API

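from autofepg import AutoFE
import pandas as pd

# Load the competition data and the original (real) dataset
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
original = pd.read_csv("original.csv")

X_train = train.drop(columns=["id", "target"])
y_train = train["target"]
X_test = test.drop(columns=["id"])
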
autofe = AutoFE(
    task="classification",
    n_folds=5,
    time_budget=1800,
    improvement_threshold=0.0001,
    backward_selection=True,
    sample=10000,
    original_df=original.drop(columns=["target"]),
    original_target=original["target"],
    xgb_params={
        "n_estimators": 1000,
        "max_depth": 8,
        "learning_rate": 0.05,
    },
)

X_train_new, X_test_new = autofe.fit_select(
    X_train, y_train, X_test,
    aux_target_cols=["employment_status", "debt_to_income_ratio"],
)

# Inspect results
print(autofe.get_selected_feature_names())
history_df = autofe.get_history()
details_df = autofe.get_selection_details()

📖 How It Works

1. Feature Generation

AutoFE-PG generates candidates from a hardcoded priority sequence ordered by expected impact:

| Priority | Strategy | Description | Leakage-free? |
|---|---|---|---|
| 1 | Domain Alignment | Snap synthetic values to nearest real-data grid point; expose residual | ✅ No target |
| 2 | Bayesian Priors | Inject P(target \| value) from original dataset as external knowledge | ✅ No train target |
| 3 | Target Encoding (single) | OOF mean-target per category | ✅ OOF |
| 4 | Count Encoding (single) | Value counts per category | ✅ No target |
| 5 | Dual Representation | Continuous + label-encoded copy of each numerical column | ✅ No target |
| 6 | Target Encoding on pairs | OOF TE on column pair interactions | ✅ OOF |
| 7 | Count Encoding on pairs | Value counts on column pair interactions | ✅ No target |
| 8 | Frequency Encoding | Normalized value counts | ✅ No target |
| 9 | Cross-Dataset Frequency & Rarity | How common/rare a value is across train+test+original | ✅ No target |
| 10 | Missing Indicators | Binary NaN flags | ✅ No target |
| 11 | TE with auxiliary targets | OOF TE using a different column as target | ✅ OOF |
| 12 | Unary transforms | log1p, sqrt, square, reciprocal | ✅ No target |
| 13 | Arithmetic interactions | add, sub, mul, div between numerical pairs | ✅ No target |
| 14 | Polynomial features | Square and cross-product terms | ✅ No target |
| 15 | Pairwise label interactions | Label-encoded column pairs | ✅ No target |
| 16 | TE/CE on digits | Target/count encoding on extracted digits | ✅ OOF / No target |
| 17 | Digit × Category TE | Digit-category interaction with OOF TE | ✅ OOF |
| 18 | Quantile binning | Equal-frequency bins | ✅ No target |
| 19 | Raw digit extraction | i-th digit of numerical values | ✅ No target |
| 20 | Digit interactions | Within-feature and cross-feature digit combos | ✅ No target |
| 21 | Rounding features | Round to various decimal places / magnitudes | ✅ No target |
| 22 | Num-to-Cat conversion | Equal-width binning | ✅ No target |
| 23 | Group statistics & deviations | Mean, std, min, max, median by group; diff/ratio to group | ✅ No target |

2. Greedy Forward Selection

Each candidate is evaluated by adding it to the current feature set and running XGBoost K-fold CV. A feature is kept only if it improves the score beyond the configured threshold.
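
The loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's code: `greedy_forward_select` and `toy_score` are hypothetical names, and `score_fn` stands in for the XGBoost K-fold CV mean that AutoFE-PG computes.

```python
def greedy_forward_select(candidates, score_fn, base_features,
                          threshold=1e-7, maximize=True):
    """Try candidates in priority order; keep each one that beats the threshold."""
    selected = list(base_features)
    best = score_fn(selected)
    history = []
    for name in candidates:
        trial = selected + [name]
        score = score_fn(trial)
        gain = (score - best) if maximize else (best - score)
        kept = gain > threshold
        history.append((name, score, kept))
        if kept:
            selected, best = trial, score
    return selected, best, history


# Toy scorer (assumption): each feature adds a fixed amount to a 0.70 baseline.
GAINS = {"te_city": 0.02, "count_city": 0.0, "log_income": 0.01}


def toy_score(features):
    return 0.70 + sum(GAINS.get(f, 0.0) for f in features)


selected, best, history = greedy_forward_select(
    ["te_city", "count_city", "log_income"], toy_score, base_features=["age"]
)
print(selected, round(best, 4))
```

Because candidates are tried in a fixed order, a feature rejected early is not revisited in this sketch, even if it would help in combination with a later one.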

3. Optional Backward Pruning

After forward selection, features are tested for removal. If removing a feature improves (or maintains) the score, it is permanently dropped.
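
A minimal sketch of that pruning pass, assuming a single sweep over the selected features. The names `backward_prune` and `protected` are hypothetical; whether AutoFE-PG shields the original input columns from removal is not stated here, so `protected` is an assumption. As above, `score_fn` stands in for the CV score.

```python
def backward_prune(selected, protected, score_fn, maximize=True):
    """Drop each feature whose removal leaves the score equal or better."""
    kept = list(selected)
    best = score_fn(kept)
    for name in list(selected):
        if name in protected:
            continue
        trial = [f for f in kept if f != name]
        score = score_fn(trial)
        no_worse = score >= best if maximize else score <= best
        if no_worse:
            kept, best = trial, score
    return kept, best


def toy_score(features):
    # Assumption: the two city encodings are redundant with each other.
    has_city = "te_city" in features or "te_city_copy" in features
    return 0.70 + 0.02 * has_city + 0.01 * ("log_income" in features)


kept, best = backward_prune(
    ["age", "te_city", "te_city_copy", "log_income"],
    protected={"age"},
    score_fn=toy_score,
)
print(kept, round(best, 4))
```

Only one of the two redundant encodings survives, because removing the first leaves the score unchanged while removing the second would lower it.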


🧬 Synthetic Data Strategies

AutoFE-PG includes four strategies specifically designed for Kaggle Playground competitions where the training data is synthetically generated from a real-world dataset.

Domain Alignment (De-noising)

The synthetic generation process often introduces "fuzzy" values that wouldn't exist in a real clinical setting. Domain Alignment forces every continuous value in the synthetic set to its nearest neighbor in the original dataset, effectively "snapping" the data back to its true clinical grid. The residual (distance to the snap point) is also exposed as a feature, since it encodes how much the synthetic process perturbed the value.

import pandas as pd
from autofepg.generators import DomainAlignmentFeature

original = pd.read_csv("original.csv")

# Reference values from the original dataset
ref_vals = original["blood_pressure"].dropna().unique()
gen = DomainAlignmentFeature("blood_pressure", reference_values=ref_vals)
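
To make the snapping step concrete, here is a dependency-free sketch of nearest-grid-point alignment. It is not the code inside `DomainAlignmentFeature`: the function name `snap_to_grid` is hypothetical, and returning a signed residual (value minus snap point) is an assumption.

```python
from bisect import bisect_left


def snap_to_grid(values, reference_values):
    """Snap each value to its nearest reference value; also return residuals."""
    grid = sorted(set(reference_values))  # assumed non-empty
    snapped, residuals = [], []
    for v in values:
        pos = bisect_left(grid, v)
        # Only the grid points immediately left and right of v can be nearest.
        neighbours = grid[max(pos - 1, 0):pos + 1]
        nearest = min(neighbours, key=lambda g: abs(g - v))
        snapped.append(nearest)
        residuals.append(v - nearest)
    return snapped, residuals


snapped, residuals = snap_to_grid([118.7, 121.2, 95.0, 140.0], [110, 120, 130])
print(snapped)
```

Values outside the reference range snap to the closest end of the grid, and their large residuals flag them as far from anything seen in the original data.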

Bayesian-Style Priors (External Mapping)

Instead of letting the model learn strictly from the training data, Bayesian Priors import external knowledge from the original dataset. By calculating P(target | value) in the original file and injecting those probabilities as features, the model starts with a "hint" about which values are clinically dangerous. Because the priors are computed without the training target, they introduce no leakage from it.

import pandas as pd
from autofepg.generators import BayesianPriorFeature

original = pd.read_csv("original.csv")

# Prior map computed from the original data only: P(heart_disease | cholesterol)
prior_map = original.groupby("cholesterol")["heart_disease"].mean().to_dict()
gen = BayesianPriorFeature("cholesterol", prior_map=prior_map)
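
The mapping itself is simple enough to sketch without pandas. The helper names `build_prior_map` and `apply_prior` are hypothetical, and using the original dataset's overall target rate as the default for unseen values is an assumption; how `BayesianPriorFeature` treats unseen values is not specified here.

```python
from collections import defaultdict


def build_prior_map(original_values, original_targets):
    """Estimate P(target | value) from the original dataset only."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, t in zip(original_values, original_targets):
        sums[v] += t
        counts[v] += 1
    return {v: sums[v] / counts[v] for v in counts}


def apply_prior(values, prior_map, default):
    """Look up the prior for each value; unseen values get the default."""
    return [prior_map.get(v, default) for v in values]


orig_chol = [200, 200, 240, 240, 240]
orig_target = [0, 1, 1, 1, 0]
prior_map = build_prior_map(orig_chol, orig_target)
overall_rate = sum(orig_target) / len(orig_target)

# Training rows are only looked up in the map; their labels are never read.
print(apply_prior([200, 240, 300], prior_map, default=overall_rate))
```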

Dimensionality Expansion (Dual Representation)

AutoFE-PG uses a "dual-representation" strategy for numerical features, keeping two copies of each column:

  • Continuous copy: Treated as a number to capture linear or threshold trends
  • Categorical copy: Treated as a discrete label-encoded value to allow the tree to create very specific, non-linear splits on exact values

from autofepg.generators import DualRepresentationFeature

gen = DualRepresentationFeature("age")
# Produces: dual__age_cont (float) + dual__age_cat (int label)
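
As a rough illustration of the two copies, the sketch below label-encodes each distinct value by its sorted rank. The function name `dual_representation` and the rank-based codes are assumptions; the encoding used by `DualRepresentationFeature`, and how it handles test values that never appear in train, may differ.

```python
def dual_representation(values):
    """Return a float copy and an integer label-encoded copy of a column."""
    # Assumption: codes are assigned in ascending order of the distinct values.
    codes = {v: i for i, v in enumerate(sorted(set(values)))}
    continuous = [float(v) for v in values]
    categorical = [codes[v] for v in values]
    return continuous, categorical


cont, cat = dual_representation([35, 52, 35, 41])
print(cont)
print(cat)
```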

Frequency and Density Analysis

Cross-dataset frequency analysis calculates the rarity of values across the entire data ecosystem (train, test, and original). This helps the model identify if a specific data point is an outlier or part of a common cluster, a strong signal in synthetic datasets where certain "modes" are over-represented.

import pandas as pd
from autofepg.generators import CrossDatasetFrequencyFeature, ValueRarityFeature

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
original = pd.read_csv("original.csv")

# Combine counts across all datasets
combined = pd.concat([train["age"], test["age"], original["age"]])
eco_counts = combined.value_counts()
eco_total = len(combined)

freq_gen = CrossDatasetFrequencyFeature("age", eco_counts, eco_total)
rare_gen = ValueRarityFeature("age", eco_counts, eco_total)
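
The two quantities can be sketched with the standard library alone. The changelog describes rarity as a log-inverse-frequency score; the exact formula below, including the add-one smoothing for values never seen, is an assumption, and the helper names are hypothetical.

```python
import math
from collections import Counter


def ecosystem_counts(*columns):
    """Count every value across all supplied columns (train, test, original)."""
    counts = Counter()
    for col in columns:
        counts.update(col)
    return counts, sum(counts.values())


def frequency(value, counts, total):
    """Share of all rows in the ecosystem that carry this value."""
    return counts.get(value, 0) / total


def rarity(value, counts, total):
    """Negative log of the add-one smoothed frequency; larger means rarer."""
    return -math.log((counts.get(value, 0) + 1) / (total + 1))


counts, total = ecosystem_counts([30, 30, 40], [30, 50], [30, 40])
for age in (30, 50, 99):
    print(age, round(frequency(age, counts, total), 3), round(rarity(age, counts, total), 3))
```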

Note: When you pass original_df, original_target, and X_test to AutoFE or select_features, all four strategies are automatically generated and evaluated. No manual setup required.


โš™๏ธ Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| task | str | "auto" | "classification", "regression", or "auto" |
| n_folds | int | 5 | Number of CV folds |
| time_budget | float | None | Max seconds (wall clock) |
| improvement_threshold | float | 1e-7 | Min score delta to keep a feature |
| sample | int | None | Subsample rows for faster CV |
| backward_selection | bool | False | Run backward pruning after forward |
| max_pair_cols | int | 20 | Max columns for pairwise features |
| max_digit_positions | int | 4 | Max digit positions to extract |
| xgb_params | dict | None | Custom XGBoost hyperparameters |
| metric_fn | callable | None | Custom metric (y_true, y_pred) -> float |
| metric_direction | str | None | "maximize" or "minimize" |
| random_state | int | 42 | Random seed |
| verbose | bool | True | Print progress |
| original_df | DataFrame | None | Original (real) dataset features for domain alignment & priors |
| original_target | Series | None | Original dataset target for Bayesian prior computation |
| report_path | str | "autofepg_report.txt" | Path for detailed selection report |
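
As an example of the `metric_fn` and `metric_direction` pair, here is a small RMSE metric with the `(y_true, y_pred) -> float` shape listed above. The function itself is an illustration, and the commented call shows only how it would plausibly be passed. What `y_pred` contains for classification tasks (labels or probabilities) is not stated here, so this sketch sticks to regression.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error; lower is better."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


print(round(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]), 4))

# Hypothetical usage, with parameter names taken from the table above:
# result = select_features(
#     X_train, y_train, X_test,
#     task="regression",
#     metric_fn=rmse,
#     metric_direction="minimize",
# )
```

Because RMSE is an error measure, it needs `metric_direction="minimize"` so that a lower value counts as an improvement during selection.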

📊 Output

The select_features() function returns a dictionary:

{
    "X_train": pd.DataFrame,          # Augmented training data
    "X_test": pd.DataFrame,           # Augmented test data (if provided)
    "autofe": AutoFE,                 # Fitted AutoFE object
    "history": pd.DataFrame,          # Full selection history
    "selected_features": List[str],   # Names of kept features
    "selection_details": pd.DataFrame, # Per-feature improvement details
    "base_score": float,              # Baseline CV mean
    "base_score_std": float,          # Baseline CV std
    "best_score": float,              # Final CV mean
    "best_score_std": float,          # Final CV std
}
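
A small helper can turn that dictionary into a one-line summary. `summarize` is a hypothetical function written for this example; it reads only keys listed above and is exercised here with a hand-made stand-in dictionary rather than a real run.

```python
def summarize(result):
    """Format baseline vs. final CV scores from a select_features() result."""
    gain = result["best_score"] - result["base_score"]
    return (
        f"baseline {result['base_score']:.4f} ± {result['base_score_std']:.4f}, "
        f"final {result['best_score']:.4f} ± {result['best_score_std']:.4f}, "
        f"gain {gain:+.4f} from {len(result['selected_features'])} features"
    )


# Stand-in values for illustration only, not output from a real run.
fake_result = {
    "base_score": 0.9000,
    "base_score_std": 0.0020,
    "best_score": 0.9100,
    "best_score_std": 0.0015,
    "selected_features": ["te__city", "count__city"],
}
print(summarize(fake_result))
```

Comparing the gain with the fold standard deviations gives a quick sense of whether an improvement is larger than fold-to-fold noise.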

🧪 Running Tests

pytest tests/ -v

๐Ÿ“ Project Structure

autofepg/
├── autofepg/
│   ├── __init__.py          # Public API & exports
│   ├── utils.py             # GPU detection, task inference, metrics
│   ├── generators.py        # All feature generator classes (25+)
│   ├── builder.py           # FeatureCandidateBuilder
│   ├── engine.py            # XGBoost CV engine
│   └── core.py              # AutoFE class + select_features()
├── tests/
│   ├── __init__.py
│   └── test_autofepg.py     # Unit and integration tests
├── examples/
│   ├── example_classification.py
│   ├── example_regression.py
│   └── example_with_original.py
├── .github/
│   └── workflows/
│       └── ci.yml
├── .gitignore
├── LICENSE
├── README.md
├── CHANGELOG.md
├── CONTRIBUTING.md
├── Makefile
├── pyproject.toml
├── setup.py
└── requirements.txt

📋 Generator Reference

Original Strategies

| Generator | Class | Target used? |
|---|---|---|
| Target Encoding | TargetEncoding | ✅ OOF |
| Count Encoding | CountEncoding | ❌ |
| Frequency Encoding | FrequencyEncoding | ❌ |
| Pair Interaction | PairInteraction | ❌ |
| TE on Pairs | TargetEncodingOnPair | ✅ OOF |
| CE on Pairs | CountEncodingOnPair | ❌ |
| Digit Extraction | DigitFeature | ❌ |
| Digit Interaction | DigitInteraction | ❌ |
| TE on Digits | TargetEncodingOnDigit | ✅ OOF |
| CE on Digits | CountEncodingOnDigit | ❌ |
| Digit × Cat TE | DigitBasePairTE | ✅ OOF |
| Rounding | RoundFeature | ❌ |
| Quantile Binning | QuantileBinFeature | ❌ |
| Num-to-Cat | NumToCat | ❌ |
| TE with Aux Target | TargetEncodingAuxTarget | ✅ OOF (aux) |
| Arithmetic Interaction | ArithmeticInteraction | ❌ |
| Missing Indicator | MissingIndicator | ❌ |
| Group Statistics | GroupStatFeature | ❌ |
| Group Deviation | GroupDeviationFeature | ❌ |
| Unary Transform | UnaryTransform | ❌ |
| Polynomial Feature | PolynomialFeature | ❌ |

Synthetic Data Strategies (NEW in v0.2.0)

| Generator | Class | Requires | Target used? |
|---|---|---|---|
| Domain Alignment | DomainAlignmentFeature | original_df | ❌ |
| Bayesian Prior | BayesianPriorFeature | original_df + original_target | ❌ (external only) |
| Dual Representation | DualRepresentationFeature | none | ❌ |
| Cross-Dataset Frequency | CrossDatasetFrequencyFeature | original_df or X_test | ❌ |
| Value Rarity | ValueRarityFeature | original_df or X_test | ❌ |

๐Ÿ“ Changelog

v0.2.0

  • Domain Alignment: Snap synthetic values to nearest real-data grid point with residual feature
  • Bayesian Priors: Inject external P(target|value) from original dataset
  • Dual Representation: Continuous + categorical copy of numerical features
  • Cross-Dataset Frequency: Value frequency across train+test+original ecosystem
  • Value Rarity: Log-inverse-frequency score for outlier detection
  • Added original_df and original_target parameters to AutoFE and select_features
  • Report now includes original data status
  • Version bump to 0.2.0

v0.1.3

  • Initial public release
  • 20+ feature generation strategies
  • Greedy forward selection with optional backward pruning
  • GPU acceleration support
  • Detailed text report generation

📄 License

MIT License. See LICENSE for details.
