AutoFE - Playground: Automatic Feature Engineering & Selection for Kaggle Playground Competitions

🧪 AutoFE-PG

Automatic Feature Engineering & Selection for Kaggle Playground Competitions

Requires Python 3.8+. License: MIT.

AutoFE-PG is a production-ready library that automatically generates, evaluates, and selects engineered features to boost your tabular ML models, with zero target leakage.


✨ Key Features

| Feature | Description |
| --- | --- |
| Auto column detection | Automatically identifies categorical vs. numerical columns |
| 20+ feature strategies | Target encoding, count encoding, digit extraction, arithmetic interactions, group statistics, and more |
| Zero target leakage | All target-dependent features use strict out-of-fold encoding |
| Greedy forward selection | Adds features one by one, keeping only those that improve the CV score |
| Optional backward pruning | Removes redundant features after forward selection |
| GPU acceleration | Automatically uses XGBoost GPU if available |
| Time budget | Set a wall-clock limit; the search stops gracefully |
| Sampling support | Evaluate on a subsample for faster iteration |
| Custom XGBoost params | Pass your own hyperparameters |
| Score variance tracking | Reports mean ± std across folds |
| Classification & regression | Supports both tasks with auto-detection |

🚀 Quick Start

Installation

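Install the released package from PyPI:

pip install autofepg

Or, from a clone of the repository, install in editable mode:

pip install -e .

Or install only the dependencies: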

pip install -r requirements.txt

Minimal Example

import pandas as pd
from autofepg import select_features

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

X_train = train.drop(columns=["id", "target"])
y_train = train["target"]
X_test = test.drop(columns=["id"])

result = select_features(
    X_train, y_train, X_test,
    task="classification",
    time_budget=3600,
)

X_train_new = result["X_train"]
X_test_new = result["X_test"]

print(f"Baseline AUC: {result['base_score']:.6f}")
print(f"Best AUC:     {result['best_score']:.6f}")
print(f"Features added: {len(result['selected_features'])}")

Using the Class API

from autofepg import AutoFE

autofe = AutoFE(
    task="classification",
    n_folds=5,
    time_budget=1800,
    improvement_threshold=0.0001,
    backward_selection=True,
    sample=10000,
    xgb_params={
        "n_estimators": 1000,
        "max_depth": 8,
        "learning_rate": 0.05,
    },
)

X_train_new, X_test_new = autofe.fit_select(
    X_train, y_train, X_test,
    aux_target_cols=["employment_status", "debt_to_income_ratio"],
)

# Inspect results
print(autofe.get_selected_feature_names())
history_df = autofe.get_history()

📖 How It Works

1. Feature Generation

AutoFE-PG generates candidates from a hardcoded priority sequence ordered by expected impact:

| Priority | Strategy | Leakage-free? |
| --- | --- | --- |
| 1 | Target Encoding (single columns) | ✅ OOF |
| 2 | Count Encoding (single columns) | ✅ No target |
| 3 | Target Encoding on pairs | ✅ OOF |
| 4 | Count Encoding on pairs | ✅ No target |
| 5 | Frequency Encoding | ✅ No target |
| 6 | Missing Indicators | ✅ No target |
| 7 | TE with auxiliary targets | ✅ OOF |
| 8 | Unary transforms (log, sqrt, etc.) | ✅ No target |
| 9 | Arithmetic interactions | ✅ No target |
| 10 | Polynomial features | ✅ No target |
| 11 | Pairwise label-encoded interactions | ✅ No target |
| 12 | TE/CE on digit features | ✅ OOF / No target |
| 13 | Digit × Category TE | ✅ OOF |
| 14 | Quantile binning | ✅ No target |
| 15 | Raw digit extraction | ✅ No target |
| 16 | Digit interactions | ✅ No target |
| 17 | Rounding features | ✅ No target |
| 18 | Num-to-Cat conversion | ✅ No target |
| 19 | Group statistics & deviations | ✅ No target |
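The "OOF" entries in the table refer to out-of-fold encoding: each row's encoded value is computed only from rows in other folds, so a row never sees its own target. The following is a minimal standard-library sketch of that idea, not AutoFE-PG's actual implementation; the function name `oof_target_encode` and its parameters are hypothetical, and the fallback to the training-fold mean for unseen categories is an assumption.

```python
import random
from collections import defaultdict


def oof_target_encode(categories, targets, n_folds=5, seed=42):
    """Return one out-of-fold target-mean encoding per row (hypothetical helper)."""
    n = len(categories)
    order = list(range(n))
    random.Random(seed).shuffle(order)

    # Assign rows to folds round-robin after shuffling.
    fold_of = [0] * n
    for position, row in enumerate(order):
        fold_of[row] = position % n_folds

    encoded = [0.0] * n
    for fold in range(n_folds):
        # Accumulate per-category target sums and counts from rows OUTSIDE this fold.
        sums = defaultdict(float)
        counts = defaultdict(int)
        train_sum, train_count = 0.0, 0
        for row in range(n):
            if fold_of[row] != fold:
                sums[categories[row]] += targets[row]
                counts[categories[row]] += 1
                train_sum += targets[row]
                train_count += 1
        # Assumed fallback for categories unseen in the training folds.
        fallback = train_sum / train_count if train_count else 0.0

        # Encode only the held-out rows, using statistics that exclude them.
        for row in range(n):
            if fold_of[row] == fold:
                cat = categories[row]
                encoded[row] = sums[cat] / counts[cat] if counts[cat] else fallback
    return encoded


cities = ["a", "a", "b", "b", "z", "a"]
labels = [0, 0, 0, 0, 1, 0]
print(oof_target_encode(cities, labels, n_folds=3))
```

In this toy data the category "z" appears once with target 1. Because its own row is held out when it is encoded, its encoding comes from the other rows only and carries no trace of its own label.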

2. Greedy Forward Selection

Each candidate is evaluated by adding it to the current feature set and running XGBoost K-fold CV. A feature is kept only if it improves the score beyond the configured threshold.
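The loop described above can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: `greedy_forward_select` and `toy_cv_score` are hypothetical names, the score is assumed to be "higher is better", and the toy scorer stands in for the XGBoost K-fold CV run.

```python
import time


def greedy_forward_select(candidates, score_fn, improvement_threshold=1e-7,
                          time_budget=None):
    """Keep each candidate only if it raises score_fn by more than the threshold."""
    start = time.monotonic()
    selected = []
    best = score_fn(selected)  # baseline score with no engineered features
    history = []
    for name in candidates:
        if time_budget is not None and time.monotonic() - start > time_budget:
            break  # stop gracefully once the wall-clock budget is spent
        score = score_fn(selected + [name])
        kept = score - best > improvement_threshold
        history.append((name, score, kept))
        if kept:
            selected.append(name)
            best = score
    return selected, best, history


def toy_cv_score(features):
    """Stand-in for a CV run: each feature adds a fixed, made-up gain."""
    gains = {"te_city": 0.02, "count_city": 0.0, "log_income": 0.01}
    return 0.80 + sum(gains.get(f, 0.0) for f in features)


selected, best, history = greedy_forward_select(
    ["te_city", "count_city", "log_income"], toy_cv_score
)
print(selected)  # ['te_city', 'log_income']
```

Because candidates are tried in a fixed priority order and never revisited, the result depends on that order; this is the usual trade-off of greedy selection against an exhaustive search.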

3. Optional Backward Pruning

After forward selection, features are tested for removal. If removing a feature improves (or maintains) the score, it is permanently dropped.
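A minimal sketch of this pruning pass, again assuming a "higher is better" score; `backward_prune` and `toy_cv_score` are hypothetical names and the toy scorer replaces the real CV evaluation.

```python
def backward_prune(selected, score_fn):
    """Drop each feature whose removal improves or maintains the score."""
    kept = list(selected)
    best = score_fn(kept)
    for name in list(kept):
        trial = [f for f in kept if f != name]
        score = score_fn(trial)
        if score >= best:  # removal does not hurt, so drop permanently
            kept = trial
            best = score
    return kept, best


def toy_cv_score(features):
    """Stand-in for a CV run: 'income' and 'income_copy' are redundant."""
    score = 0.80
    if "income" in features or "income_copy" in features:
        score += 0.02
    if "age_bin" in features:
        score += 0.01
    return score


kept, best = backward_prune(["income", "income_copy", "age_bin"], toy_cv_score)
print(kept)  # ['income_copy', 'age_bin']
```

Here the first of the two redundant features is removed because the score is unchanged without it, after which the second can no longer be dropped without a loss.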


โš™๏ธ Configuration

Parameter Type Default Description
task str "auto" "classification", "regression", or "auto"
n_folds int 5 Number of CV folds
time_budget float None Max seconds (wall clock)
improvement_threshold float 1e-7 Min score delta to keep a feature
sample int None Subsample rows for faster CV
backward_selection bool False Run backward pruning after forward
max_pair_cols int 20 Max columns for pairwise features
max_digit_positions int 4 Max digit positions to extract
xgb_params dict None Custom XGBoost hyperparameters
metric_fn callable None Custom metric (y_true, y_pred) -> float
metric_direction str None "maximize" or "minimize"
random_state int 42 Random seed
verbose bool True Print progress
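As an illustration of the `metric_fn` and `metric_direction` parameters, a custom metric only needs the `(y_true, y_pred) -> float` signature listed in the table. The sketch below defines RMSE with the standard library. How AutoFE-PG calls the function internally (for example, whether `y_pred` holds class labels or probabilities for classification) is not stated here, so treat the keyword arguments as an assumption to verify against the library.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error; lower is better."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Keyword arguments as they might be passed to AutoFE(...); parameter names
# are taken from the configuration table, the combination is an assumption.
custom_metric_kwargs = {
    "task": "regression",
    "metric_fn": rmse,
    "metric_direction": "minimize",  # RMSE improves as it decreases
}

print(rmse([0.0, 0.0], [3.0, 3.0]))  # 3.0
```

Setting `metric_direction` alongside `metric_fn` matters because the selection loop needs to know whether a larger or a smaller value counts as an improvement.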

📊 Output

The select_features() function returns a dictionary:

{
    "X_train": pd.DataFrame,          # Augmented training data
    "X_test": pd.DataFrame,           # Augmented test data (if provided)
    "autofe": AutoFE,                 # Fitted AutoFE object
    "history": pd.DataFrame,          # Full selection history
    "selected_features": List[str],   # Names of kept features
    "base_score": float,              # Baseline CV mean
    "base_score_std": float,          # Baseline CV std
    "best_score": float,              # Final CV mean
    "best_score_std": float,          # Final CV std
}

🧪 Running Tests

pytest tests/ -v

๐Ÿ“ Project Structure

autofepg/
โ”œโ”€โ”€ autofepg/
โ”‚   โ”œโ”€โ”€ __init__.py          # Public API
โ”‚   โ”œโ”€โ”€ utils.py             # GPU detection, task inference, metrics
โ”‚   โ”œโ”€โ”€ generators.py        # All feature generator classes
โ”‚   โ”œโ”€โ”€ builder.py           # FeatureCandidateBuilder
โ”‚   โ”œโ”€โ”€ engine.py            # XGBoost CV engine
โ”‚   โ””โ”€โ”€ core.py              # AutoFE class + select_features()
โ”œโ”€โ”€ tests/
โ”‚   โ”œโ”€โ”€ __init__.py
โ”‚   โ””โ”€โ”€ test_autofepg.py     # Unit and integration tests
โ”œโ”€โ”€ examples/
โ”‚   โ”œโ”€โ”€ example_classification.py
โ”‚   โ””โ”€โ”€ example_regression.py
โ”œโ”€โ”€ .github/
โ”‚   โ””โ”€โ”€ workflows/
โ”‚       โ””โ”€โ”€ ci.yml
โ”œโ”€โ”€ .gitignore
โ”œโ”€โ”€ LICENSE
โ”œโ”€โ”€ README.md
โ”œโ”€โ”€ CHANGELOG.md
โ”œโ”€โ”€ CONTRIBUTING.md
โ”œโ”€โ”€ Makefile
โ”œโ”€โ”€ pyproject.toml
โ”œโ”€โ”€ setup.py
โ””โ”€โ”€ requirements.txt
