
AutoFE - Playground: Automatic Feature Engineering & Selection for Kaggle Playground Competitions


🧪 AutoFE-PG

Automatic Feature Engineering & Selection for Kaggle Playground Competitions

Python 3.8+ | License: MIT

AutoFE-PG is a production-ready library that automatically generates, evaluates, and selects engineered features to boost your tabular ML models, with zero target leakage.


✨ Key Features

| Feature | Description |
|---|---|
| Auto column detection | Automatically identifies categorical vs. numerical columns |
| 20+ feature strategies | Target encoding, count encoding, digit extraction, arithmetic interactions, group statistics, and more |
| Zero target leakage | All target-dependent features use strict out-of-fold encoding |
| Greedy forward selection | Adds features one-by-one, keeping only those that improve CV score |
| Optional backward pruning | Removes redundant features after forward selection |
| GPU acceleration | Automatically uses XGBoost GPU if available |
| Time budget | Set a wall-clock limit; the search stops gracefully |
| Sampling support | Evaluate on a subsample for faster iteration |
| Custom XGBoost params | Pass your own hyperparameters |
| Score variance tracking | Reports mean ± std across folds |
| Classification & regression | Supports both tasks with auto-detection |
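
To make two of the target-free strategies named above concrete, here is a minimal pandas sketch of count encoding and digit extraction. The helper names `count_encode` and `digit_at` and the toy columns are assumptions made for illustration; they are not taken from the AutoFE-PG source, and the library's own generators may behave differently.

```python
import pandas as pd


def count_encode(train_col: pd.Series, test_col: pd.Series):
    """Replace each category with how often it occurs in the training column."""
    counts = train_col.value_counts()
    train_encoded = train_col.map(counts)
    # Categories that never appear in training get a count of 0.
    test_encoded = test_col.map(counts).fillna(0).astype(int)
    return train_encoded, test_encoded


def digit_at(col: pd.Series, position: int) -> pd.Series:
    """Extract the decimal digit at 10**position from the integer part."""
    return (col.abs().astype("int64") // 10 ** position) % 10


# Toy data (hypothetical column names).
train = pd.DataFrame({"city": ["a", "b", "a", "c", "a"],
                      "price": [1234, 507, 42, 9, 860]})
test = pd.DataFrame({"city": ["a", "d"], "price": [77, 305]})

train_ce, test_ce = count_encode(train["city"], test["city"])
tens_digit = digit_at(train["price"], 1)
```

Neither helper reads the target, which is why features of this kind cannot leak it.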

🚀 Quick Start

Installation

pip install autofepg

Or install dependencies directly:

pip install -r requirements.txt

Minimal Example

import pandas as pd
from autofepg import select_features

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

X_train = train.drop(columns=["id", "target"])
y_train = train["target"]
X_test = test.drop(columns=["id"])

result = select_features(
    X_train, y_train, X_test,
    task="classification",
    time_budget=3600,
)

X_train_new = result["X_train"]
X_test_new = result["X_test"]

print(f"Baseline AUC: {result['base_score']:.6f}")
print(f"Best AUC:     {result['best_score']:.6f}")
print(f"Features added: {len(result['selected_features'])}")

Using the Class API

from autofepg import AutoFE

autofe = AutoFE(
    task="classification",
    n_folds=5,
    time_budget=1800,
    improvement_threshold=0.0001,
    backward_selection=True,
    sample=10000,
    xgb_params={
        "n_estimators": 1000,
        "max_depth": 8,
        "learning_rate": 0.05,
    },
)

X_train_new, X_test_new = autofe.fit_select(
    X_train, y_train, X_test,
    aux_target_cols=["employment_status", "debt_to_income_ratio"],
)

# Inspect results
print(autofe.get_selected_feature_names())
history_df = autofe.get_history()

📖 How It Works

1. Feature Generation

AutoFE-PG generates candidates from a hardcoded priority sequence ordered by expected impact:

| Priority | Strategy | Leakage-free? |
|---|---|---|
| 1 | Target Encoding (single columns) | ✅ OOF |
| 2 | Count Encoding (single columns) | ✅ No target |
| 3 | Target Encoding on pairs | ✅ OOF |
| 4 | Count Encoding on pairs | ✅ No target |
| 5 | Frequency Encoding | ✅ No target |
| 6 | Missing Indicators | ✅ No target |
| 7 | TE with auxiliary targets | ✅ OOF |
| 8 | Unary transforms (log, sqrt, etc.) | ✅ No target |
| 9 | Arithmetic interactions | ✅ No target |
| 10 | Polynomial features | ✅ No target |
| 11 | Pairwise label-encoded interactions | ✅ No target |
| 12 | TE/CE on digit features | ✅ OOF / No target |
| 13 | Digit × Category TE | ✅ OOF |
| 14 | Quantile binning | ✅ No target |
| 15 | Raw digit extraction | ✅ No target |
| 16 | Digit interactions | ✅ No target |
| 17 | Rounding features | ✅ No target |
| 18 | Num-to-Cat conversion | ✅ No target |
| 19 | Group statistics & deviations | ✅ No target |
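
The "OOF" entries in the table refer to out-of-fold encoding: each row is encoded using target statistics computed only from rows in other folds, so a row never sees its own label. The sketch below shows the general idea with pandas and NumPy. The function name `oof_target_encode`, the fold assignment, and the fallback to the fold-wise global mean are assumptions for illustration, not the library's actual implementation.

```python
import numpy as np
import pandas as pd


def oof_target_encode(cat: pd.Series, y: pd.Series,
                      n_folds: int = 5, seed: int = 42) -> pd.Series:
    """Encode each row with its category's target mean from the other folds."""
    rng = np.random.default_rng(seed)
    # Random, balanced fold id for every row.
    fold_of_row = rng.permutation(len(cat)) % n_folds
    encoded = pd.Series(np.nan, index=cat.index, dtype="float64")
    for fold in range(n_folds):
        held_out = fold_of_row == fold
        # Statistics come only from rows that are NOT in the held-out fold.
        means = y[~held_out].groupby(cat[~held_out]).mean()
        prior = y[~held_out].mean()  # fallback for categories unseen elsewhere
        encoded.loc[held_out] = (
            cat[held_out].map(means).fillna(prior).to_numpy()
        )
    return encoded


# Toy data: "z" occurs once, with target 1. A naive (leaky) target encoding
# would map that row to 1.0; the out-of-fold version cannot see its label.
cat = pd.Series(["a"] * 9 + ["z"])
y = pd.Series([0] * 9 + [1])
encoded = oof_target_encode(cat, y)
```

In this toy case the single "z" row falls back to the mean of the other folds, which is exactly the behavior that prevents the model from memorizing rare categories.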

2. Greedy Forward Selection

Each candidate is evaluated by adding it to the current feature set and running XGBoost K-fold CV. A feature is kept only if it improves the score beyond the configured threshold.
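
The loop described above can be sketched in plain Python. This is a simplified illustration, not the library's code: `greedy_forward_select` and `toy_score` are hypothetical names, the scoring callback stands in for the XGBoost K-fold CV run, and the sketch assumes a metric where higher is better.

```python
import time
from typing import Callable, List, Optional, Sequence, Tuple


def greedy_forward_select(
    candidates: Sequence[str],
    score_fn: Callable[[List[str]], float],
    threshold: float = 1e-7,
    time_budget: Optional[float] = None,
) -> Tuple[List[str], float]:
    """Try candidates in order; keep each one only if it beats the threshold."""
    start = time.monotonic()
    selected: List[str] = []
    best = score_fn(selected)  # baseline score without engineered features
    for name in candidates:
        # Stop cleanly once the wall-clock budget is used up.
        if time_budget is not None and time.monotonic() - start > time_budget:
            break
        trial = score_fn(selected + [name])
        if trial - best > threshold:
            selected.append(name)
            best = trial
    return selected, best


# Stand-in for cross-validation: a fixed gain per feature (made-up numbers).
GAINS = {"te_city": 0.02, "ce_city": 0.0, "log_price": 0.01}


def toy_score(features: List[str]) -> float:
    return 0.70 + sum(GAINS.get(f, 0.0) for f in features)


selected, best = greedy_forward_select(
    ["te_city", "ce_city", "log_price"], toy_score
)
```

Because candidates are tried in a fixed order, the result depends on that order: a feature rejected early is not reconsidered in this sketch.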

3. Optional Backward Pruning

After forward selection, features are tested for removal. If removing a feature improves (or maintains) the score, it is permanently dropped.
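
A minimal sketch of such a pruning pass follows, again with a scoring callback standing in for cross-validation and assuming higher scores are better. The names `backward_prune` and `toy_score` are hypothetical, and the library may order or batch its removal tests differently.

```python
from typing import Callable, List, Sequence, Tuple


def backward_prune(
    selected: Sequence[str],
    score_fn: Callable[[List[str]], float],
) -> Tuple[List[str], float]:
    """Drop every feature whose removal keeps or improves the score."""
    kept = list(selected)
    best = score_fn(kept)
    for name in list(kept):  # iterate over a copy; `kept` shrinks in the loop
        without = [f for f in kept if f != name]
        trial = score_fn(without)
        if trial >= best:  # equal score counts: the feature is redundant
            kept = without
            best = trial
    return kept, best


def toy_score(features: List[str]) -> float:
    # Two encodings of the same signal: having either one is enough.
    city_signal = 0.02 if {"te_city", "te_city_x_zip"} & set(features) else 0.0
    price_signal = 0.01 if "log_price" in features else 0.0
    return 0.70 + city_signal + price_signal


kept, best = backward_prune(["te_city", "te_city_x_zip", "log_price"], toy_score)
```

Here the first of the two redundant city features is dropped, after which the second becomes necessary and is kept.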


⚙️ Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| task | str | "auto" | "classification", "regression", or "auto" |
| n_folds | int | 5 | Number of CV folds |
| time_budget | float | None | Max seconds (wall clock) |
| improvement_threshold | float | 1e-7 | Min score delta to keep a feature |
| sample | int | None | Subsample rows for faster CV |
| backward_selection | bool | False | Run backward pruning after forward |
| max_pair_cols | int | 20 | Max columns for pairwise features |
| max_digit_positions | int | 4 | Max digit positions to extract |
| xgb_params | dict | None | Custom XGBoost hyperparameters |
| metric_fn | callable | None | Custom metric (y_true, y_pred) -> float |
| metric_direction | str | None | "maximize" or "minimize" |
| random_state | int | 42 | Random seed |
| verbose | bool | True | Print progress |
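
As an illustration of the `metric_fn` signature `(y_true, y_pred) -> float`, here is a sketch of a custom RMSLE metric written with NumPy. The function `rmsle` is an example of my own, and the commented-out call only shows how it would presumably be passed, going by the parameter table above; it has not been checked against the library.

```python
import numpy as np


def rmsle(y_true, y_pred) -> float:
    """Root mean squared logarithmic error (lower is better)."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip negative predictions so log1p stays defined.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0.0, None)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


# Assumed usage, inferred from the parameter table (not verified):
# autofe = AutoFE(task="regression", metric_fn=rmsle, metric_direction="minimize")
```

Since RMSLE is an error, `metric_direction` would need to be "minimize" so that a lower value counts as an improvement.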

📊 Output

The select_features() function returns a dictionary:

{
    "X_train": pd.DataFrame,          # Augmented training data
    "X_test": pd.DataFrame,           # Augmented test data (if provided)
    "autofe": AutoFE,                 # Fitted AutoFE object
    "history": pd.DataFrame,          # Full selection history
    "selected_features": List[str],   # Names of kept features
    "base_score": float,              # Baseline CV mean
    "base_score_std": float,          # Baseline CV std
    "best_score": float,              # Final CV mean
    "best_score_std": float,          # Final CV std
}
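
Because the dictionary carries both means and standard deviations, one quick sanity check is whether the reported gain is larger than the fold-to-fold spread. The helper below is a hypothetical sketch that reads only the keys listed above; the numbers in `example_result` are made up for illustration and are not real results.

```python
def summarize(result: dict, higher_is_better: bool = True) -> dict:
    """Compare the CV gain against the fold-to-fold standard deviation."""
    gain = result["best_score"] - result["base_score"]
    if not higher_is_better:
        gain = -gain  # for error metrics, a drop in score is the gain
    noise = max(result["base_score_std"], result["best_score_std"])
    return {
        "gain": gain,
        "noise": noise,
        "gain_exceeds_noise": gain > noise,
        "n_features_added": len(result["selected_features"]),
    }


# Placeholder values shaped like the dictionary above (not real output).
example_result = {
    "selected_features": ["te_city", "log_price"],
    "base_score": 0.910,
    "base_score_std": 0.002,
    "best_score": 0.915,
    "best_score_std": 0.003,
}
summary = summarize(example_result)
```

This is a rough heuristic rather than a statistical test; a gain smaller than the spread may still be real, but it deserves a second look.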

🧪 Running Tests

pytest tests/ -v

📁 Project Structure

autofepg/
├── autofepg/
│   ├── __init__.py          # Public API
│   ├── utils.py             # GPU detection, task inference, metrics
│   ├── generators.py        # All feature generator classes
│   ├── builder.py           # FeatureCandidateBuilder
│   ├── engine.py            # XGBoost CV engine
│   └── core.py              # AutoFE class + select_features()
├── tests/
│   ├── __init__.py
│   └── test_autofepg.py     # Unit and integration tests
├── examples/
│   ├── example_classification.py
│   └── example_regression.py
├── .github/
│   └── workflows/
│       └── ci.yml
├── .gitignore
├── LICENSE
├── README.md
├── CHANGELOG.md
├── CONTRIBUTING.md
├── Makefile
├── pyproject.toml
├── setup.py
└── requirements.txt
