
AutoFE - Playground: Automatic Feature Engineering & Selection for Kaggle Playground Competitions


🧪 AutoFE-PG

Automatic Feature Engineering & Selection for Kaggle Playground Competitions

Python 3.8+ | License: MIT

AutoFE-PG is a production-ready library that automatically generates, evaluates, and selects engineered features to boost your tabular ML models, with zero target leakage.


✨ Key Features

| Feature | Description |
| --- | --- |
| Auto column detection | Automatically identifies categorical vs. numerical columns |
| 20+ feature strategies | Target encoding, count encoding, digit extraction, arithmetic interactions, group statistics, and more |
| Zero target leakage | All target-dependent features use strict out-of-fold encoding |
| Greedy forward selection | Adds features one-by-one, keeping only those that improve CV score |
| Optional backward pruning | Removes redundant features after forward selection |
| GPU acceleration | Automatically uses XGBoost GPU if available |
| Time budget | Set a wall-clock limit; the search stops gracefully |
| Sampling support | Evaluate on a subsample for faster iteration |
| Custom XGBoost params | Pass your own hyperparameters |
| Score variance tracking | Reports mean ± std across folds |
| Classification & regression | Supports both tasks with auto-detection |
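
As a rough illustration of the auto column detection listed above, the sketch below splits columns by dtype and cardinality. The function name `split_columns` and the `max_cat_cardinality` threshold are hypothetical, chosen for this example; the heuristics AutoFE-PG actually applies may differ.

```python
import pandas as pd


def split_columns(df: pd.DataFrame, max_cat_cardinality: int = 20):
    """Hypothetical heuristic: non-numeric, boolean, and low-cardinality
    integer columns count as categorical; everything else as numerical."""
    cat_cols, num_cols = [], []
    for col in df.columns:
        s = df[col]
        if not pd.api.types.is_numeric_dtype(s) or pd.api.types.is_bool_dtype(s):
            cat_cols.append(col)
        elif pd.api.types.is_integer_dtype(s) and s.nunique() <= max_cat_cardinality:
            cat_cols.append(col)
        else:
            num_cols.append(col)
    return cat_cols, num_cols


df = pd.DataFrame({
    "city": ["a", "b", "a", "c"],          # strings -> categorical
    "rating": [1, 2, 1, 3],                # few distinct integers -> categorical
    "income": [10.5, 20.1, 15.3, 40.0],    # floats -> numerical
})
cat_cols, num_cols = split_columns(df, max_cat_cardinality=3)
print(cat_cols, num_cols)
```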

🚀 Quick Start

Installation

pip install autofepg

Or install dependencies directly:

pip install -r requirements.txt

Minimal Example

import pandas as pd
from autofepg import select_features

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

X_train = train.drop(columns=["id", "target"])
y_train = train["target"]
X_test = test.drop(columns=["id"])

result = select_features(
    X_train, y_train, X_test,
    task="classification",
    time_budget=3600,
)

X_train_new = result["X_train"]
X_test_new = result["X_test"]

print(f"Baseline AUC: {result['base_score']:.6f}")
print(f"Best AUC:     {result['best_score']:.6f}")
print(f"Features added: {len(result['selected_features'])}")

Using the Class API

from autofepg import AutoFE

autofe = AutoFE(
    task="classification",
    n_folds=5,
    time_budget=1800,
    improvement_threshold=0.0001,
    backward_selection=True,
    sample=10000,
    xgb_params={
        "n_estimators": 1000,
        "max_depth": 8,
        "learning_rate": 0.05,
    },
)

X_train_new, X_test_new = autofe.fit_select(
    X_train, y_train, X_test,
    aux_target_cols=["employment_status", "debt_to_income_ratio"],
)

# Inspect results
print(autofe.get_selected_feature_names())
history_df = autofe.get_history()

📖 How It Works

1. Feature Generation

AutoFE-PG generates candidates from a hardcoded priority sequence ordered by expected impact:

| Priority | Strategy | Leakage-free? |
| --- | --- | --- |
| 1 | Target Encoding (single columns) | ✅ OOF |
| 2 | Count Encoding (single columns) | ✅ No target |
| 3 | Target Encoding on pairs | ✅ OOF |
| 4 | Count Encoding on pairs | ✅ No target |
| 5 | Frequency Encoding | ✅ No target |
| 6 | Missing Indicators | ✅ No target |
| 7 | TE with auxiliary targets | ✅ OOF |
| 8 | Unary transforms (log, sqrt, etc.) | ✅ No target |
| 9 | Arithmetic interactions | ✅ No target |
| 10 | Polynomial features | ✅ No target |
| 11 | Pairwise label-encoded interactions | ✅ No target |
| 12 | TE/CE on digit features | ✅ OOF / No target |
| 13 | Digit × Category TE | ✅ OOF |
| 14 | Quantile binning | ✅ No target |
| 15 | Raw digit extraction | ✅ No target |
| 16 | Digit interactions | ✅ No target |
| 17 | Rounding features | ✅ No target |
| 18 | Num-to-Cat conversion | ✅ No target |
| 19 | Group statistics & deviations | ✅ No target |
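
To make the "OOF" label in the table concrete, here is a minimal sketch of out-of-fold target encoding. It is not AutoFE-PG's own code: the function name `oof_target_encode`, the fold assignment, and the fallback to the training-part mean for unseen categories are all assumptions made for this example.

```python
import numpy as np
import pandas as pd


def oof_target_encode(cat: pd.Series, y: pd.Series, n_folds: int = 5, seed: int = 42) -> pd.Series:
    """Encode each row with the target mean of its category, computed
    only from rows that belong to the other folds."""
    rng = np.random.default_rng(seed)
    fold_ids = rng.permutation(len(cat)) % n_folds  # random, near-equal folds
    encoded = pd.Series(np.nan, index=cat.index, dtype=float)
    for fold in range(n_folds):
        held_out = fold_ids == fold
        # Category means come from the training part of this split only.
        means = y[~held_out].groupby(cat[~held_out]).mean()
        fallback = y[~held_out].mean()  # used for categories unseen in training
        encoded.loc[held_out] = cat[held_out].map(means).fillna(fallback)
    return encoded


# "z" occurs once with target 1; every other row has target 0.
cat = pd.Series(["a", "a", "b", "b", "a", "b", "z"])
y = pd.Series([0, 0, 0, 0, 0, 0, 1])
encoded = oof_target_encode(cat, y, n_folds=3)
print(encoded.iloc[-1])  # 0.0, whereas in-sample encoding would give 1.0
```

Because the single "z" row never contributes to its own encoding, its target value cannot leak into the feature.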

2. Greedy Forward Selection

Each candidate is evaluated by adding it to the current feature set and running XGBoost K-fold CV. A feature is kept only if it improves the score beyond the configured threshold.
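
The loop described above can be sketched as follows. This is an illustration under stated assumptions rather than the library's implementation: `greedy_forward_select` and `toy_cv_score` are hypothetical names, the scorer stands in for XGBoost K-fold CV, and a higher score is assumed to be better.

```python
import time
from typing import Callable, List, Optional, Sequence


def greedy_forward_select(
    base_features: Sequence[str],
    candidates: Sequence[str],
    cv_score: Callable[[List[str]], float],
    improvement_threshold: float = 1e-7,
    time_budget: Optional[float] = None,
) -> List[str]:
    """Keep a candidate only if it raises the CV score by more than the threshold."""
    start = time.monotonic()
    current = list(base_features)
    best = cv_score(current)
    selected: List[str] = []
    for cand in candidates:  # candidates arrive in priority order
        if time_budget is not None and time.monotonic() - start > time_budget:
            break  # stop gracefully and keep what was selected so far
        score = cv_score(current + [cand])
        if score - best > improvement_threshold:
            current.append(cand)
            selected.append(cand)
            best = score
    return selected


# Toy scorer standing in for K-fold CV: each feature adds a fixed gain.
GAINS = {"te_city": 0.02, "ce_city": 0.0, "log_income": 0.01}


def toy_cv_score(features: List[str]) -> float:
    return 0.70 + sum(GAINS.get(f, 0.0) for f in features)


selected = greedy_forward_select(["city", "income"], list(GAINS), toy_cv_score)
print(selected)  # ['te_city', 'log_income']
```

Each candidate costs one full CV run, which is why the priority ordering and the time budget matter in practice.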

3. Optional Backward Pruning

After forward selection, features are tested for removal. If removing a feature improves (or maintains) the score, it is permanently dropped.
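
A minimal sketch of this pruning pass, again with a toy scorer in place of XGBoost CV. The name `backward_prune`, the order in which features are tested, and the higher-is-better assumption are illustrative choices, not details confirmed by the library.

```python
from typing import Callable, List, Sequence


def backward_prune(
    base_features: Sequence[str],
    selected: Sequence[str],
    cv_score: Callable[[List[str]], float],
) -> List[str]:
    """Drop a selected feature when the score without it is at least as good."""
    kept = list(selected)
    best = cv_score(list(base_features) + kept)
    for feat in list(selected):
        trial = [f for f in kept if f != feat]
        score = cv_score(list(base_features) + trial)
        if score >= best:  # removal improves or maintains the score
            kept = trial
            best = score
    return kept


def toy_cv_score(features: List[str]) -> float:
    # The two city encodings are redundant: having either one gives the same gain.
    city_gain = 0.02 if {"te_city", "te_city_zip"} & set(features) else 0.0
    income_gain = 0.01 if "log_income" in features else 0.0
    return 0.70 + city_gain + income_gain


kept = backward_prune(["city"], ["te_city", "te_city_zip", "log_income"], toy_cv_score)
print(kept)  # ['te_city_zip', 'log_income']
```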


โš™๏ธ Configuration

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| task | str | "auto" | "classification", "regression", or "auto" |
| n_folds | int | 5 | Number of CV folds |
| time_budget | float | None | Max seconds (wall clock) |
| improvement_threshold | float | 1e-7 | Min score delta to keep a feature |
| sample | int | None | Subsample rows for faster CV |
| backward_selection | bool | False | Run backward pruning after forward |
| max_pair_cols | int | 20 | Max columns for pairwise features |
| max_digit_positions | int | 4 | Max digit positions to extract |
| xgb_params | dict | None | Custom XGBoost hyperparameters |
| metric_fn | callable | None | Custom metric (y_true, y_pred) -> float |
| metric_direction | str | None | "maximize" or "minimize" |
| random_state | int | 42 | Random seed |
| verbose | bool | True | Print progress |
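
As an example of the `metric_fn` signature in the table, the sketch below defines a root mean squared log error. The function `rmsle` is an illustration, not part of AutoFE-PG, and the clipping of negative predictions is an assumption of this example. The table does not say what form `y_pred` takes for classification (labels or probabilities), so the example sticks to regression.

```python
import numpy as np


def rmsle(y_true, y_pred) -> float:
    """Root mean squared log error; lower is better."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip negative predictions so log1p stays defined (a choice made for this sketch).
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0.0, None)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


print(rmsle([1.0, 3.0], [1.0, 3.0]))  # 0.0 for a perfect prediction

# Passing it to AutoFE with the parameters from the table (not executed here):
# autofe = AutoFE(task="regression", metric_fn=rmsle, metric_direction="minimize")
```

Since a lower RMSLE is better, `metric_direction` would be set to "minimize" alongside the function.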

📊 Output

The select_features() function returns a dictionary:

{
    "X_train": pd.DataFrame,          # Augmented training data
    "X_test": pd.DataFrame,           # Augmented test data (if provided)
    "autofe": AutoFE,                 # Fitted AutoFE object
    "history": pd.DataFrame,          # Full selection history
    "selected_features": List[str],   # Names of kept features
    "base_score": float,              # Baseline CV mean
    "base_score_std": float,          # Baseline CV std
    "best_score": float,              # Final CV mean
    "best_score_std": float,          # Final CV std
}

🧪 Running Tests

pytest tests/ -v

๐Ÿ“ Project Structure

autofepg/
├── autofepg/
│   ├── __init__.py          # Public API
│   ├── utils.py             # GPU detection, task inference, metrics
│   ├── generators.py        # All feature generator classes
│   ├── builder.py           # FeatureCandidateBuilder
│   ├── engine.py            # XGBoost CV engine
│   └── core.py              # AutoFE class + select_features()
├── tests/
│   ├── __init__.py
│   └── test_autofepg.py     # Unit and integration tests
├── examples/
│   ├── example_classification.py
│   └── example_regression.py
├── .github/
│   └── workflows/
│       └── ci.yml
├── .gitignore
├── LICENSE
├── README.md
├── CHANGELOG.md
├── CONTRIBUTING.md
├── Makefile
├── pyproject.toml
├── setup.py
└── requirements.txt
