
AutoFE - Playground: Automatic Feature Engineering & Selection for Kaggle Playground Competitions


🧪 AutoFE-PG

Automatic Feature Engineering & Selection for Kaggle Playground Competitions

Python 3.8+ | License: MIT | Version 0.3.0

AutoFE-PG automatically generates, evaluates, and selects engineered features to boost your tabular ML models. Every target-dependent feature is computed out-of-fold (OOF), so no target information leaks into the features.

Version 0.3.0 is a complete refactor focused on general-purpose strategies intended to work across tabular competitions. It features advanced binning, digit-based features, cyclical encoding, Weight of Evidence (WoE), and Genetic Programming interactions.


✨ Key Features

Feature | Description
--- | ---
Genetic Programming | Generates complex non-linear interactions using gplearn
Digit-Based Logic | Extracts integer and decimal positions; creates digit-cross-category interactions
Target Representation | OOF Target Aggregation (mean, std, skew), WoE, and Entropy features
Cyclical Encoding | Sine/Cosine transformations for periodic numerical features
Advanced Binning | Both Quantile (qcut) and Equal-width (cut) discretization
External Signal Injection | Inject historical Priors, WoE, and Entropy from original datasets
Zero Target Leakage | All target-dependent features use strict out-of-fold (OOF) strategies
Greedy Selection | Forward selection keeps only features that improve CV score
GPU Acceleration | Built-in support for XGBoost GPU engines
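To make the greedy selection idea concrete, here is a minimal sketch of forward selection with an improvement threshold. It is not AutoFE-PG's internal code: `greedy_forward_select`, `score_fn`, and the toy scorer are hypothetical names used only for illustration, and `score_fn` stands in for whatever cross-validated metric is being maximized.

```python
from typing import Callable, Iterable, List, Tuple


def greedy_forward_select(
    candidates: Iterable[str],
    score_fn: Callable[[List[str]], float],
    improvement_threshold: float = 1e-7,
) -> Tuple[List[str], float]:
    """Keep a candidate only if adding it raises the score by more than the threshold."""
    selected: List[str] = []
    best_score = score_fn(selected)  # baseline score with no engineered features
    for name in candidates:
        trial_score = score_fn(selected + [name])
        if trial_score - best_score > improvement_threshold:
            selected.append(name)
            best_score = trial_score
    return selected, best_score


# Toy scorer standing in for a cross-validated metric (higher is better).
toy_gain = {"a": 0.010, "b": 0.0, "c": 0.005}


def toy_score(features: List[str]) -> float:
    return 0.80 + sum(toy_gain[f] for f in features)


selected, best = greedy_forward_select(["a", "b", "c"], toy_score)
print(selected)  # "b" adds nothing, so it is dropped
```

Because each candidate is scored against the features already kept, the result depends on the order in which candidates are tried.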

🚀 Quick Start

Installation

pip install autofepg
# Optional: for Genetic Programming features
pip install gplearn

Basic Usage

import pandas as pd
from autofepg import select_features

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

X_train = train.drop(columns=["id", "target"])
y_train = train["target"]
X_test = test.drop(columns=["id"])

result = select_features(
    X_train, y_train, X_test,
    task="classification",
    time_budget=3600  # 1 hour limit
)

X_train_new = result["X_train"]
X_test_new = result["X_test"]

print(f"Features added: {len(result['selected_features'])}")
print(f"CV Improvement: {result['base_score']:.6f} -> {result['best_score']:.6f}")

Injecting Historical Signals (Original Data)

Kaggle Playground competitions are often synthesized from a real-world "original" dataset. If you have access to that dataset, you can pass it in and inject its signals without leakage:

result = select_features(
    X_train, y_train, X_test,
    original_df=original_df,
    original_target=original_target,
    task="classification"
)

📖 Feature Strategies (v0.3.0)

1. Digits & Discretization

  • Digit Extraction: Integer positions (units, tens, etc.) and decimal positions.
  • Digit Interactions: Column-wise and cross-column interactions between digits.
  • Binning: Discretize continuous variables via Quantile (qcut) or Equal-width (cut) bins.
  • Rounding: Rounding to various decimal places or magnitudes to find structural splits.
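The digit, binning, and rounding ideas above can be sketched with plain pandas. This is an illustration under stated assumptions, not the library's implementation: `integer_digit` and `decimal_digit` are hypothetical helpers, the input is assumed to contain no missing values, and the `price` column is made up.

```python
import numpy as np
import pandas as pd


def integer_digit(values: pd.Series, position: int) -> pd.Series:
    """Digit at 10**position of the integer part (0 = units, 1 = tens)."""
    whole = np.floor(values.abs()).astype("int64")
    return (whole // 10**position) % 10


def decimal_digit(values: pd.Series, position: int) -> pd.Series:
    """Digit at the given decimal place (1 = tenths, 2 = hundredths)."""
    # Round after scaling so float noise (e.g. 707.9999999) does not shift the digit.
    scaled = (values.abs() * 10**position).round(6)
    return np.floor(scaled).astype("int64") % 10


price = pd.Series([123.45, 7.08, 56.9, 1200.0], name="price")

units = integer_digit(price, 0)
tenths = decimal_digit(price, 1)

# Quantile bins hold roughly equal row counts; equal-width bins split the value range evenly.
quantile_bin = pd.qcut(price, q=2, labels=False)
width_bin = pd.cut(price, bins=2, labels=False)

# Rounding to a magnitude: negative decimals round to the left of the decimal point.
rounded_hundreds = price.round(-2)
```

On skewed columns the two binning modes behave very differently: here the single large value sits alone in the upper equal-width bin, while the quantile bins stay balanced.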

2. Specialized Encoding

  • Cyclical Encoding: Sin/Cos transforms for periodic data.
  • Target Encoding (OOF): Out-of-fold mean target per category.
  • Weight of Evidence (WoE): OOF WoE scores for binary classification.
  • Entropy: OOF target entropy per value group.
  • OOF Aggregation: Mean, Std, and Skew of the target grouped by feature values.
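Two of the encodings above are easy to sketch by hand: cyclical encoding and out-of-fold target encoding. The code below is a rough illustration, not AutoFE-PG's implementation. `cyclical_encode` and `oof_target_mean` are hypothetical names, the random fold assignment is an assumption, and falling back to the training-fold mean for unseen categories is one reasonable choice among several.

```python
import numpy as np
import pandas as pd


def cyclical_encode(values: pd.Series, period: float) -> pd.DataFrame:
    """Map a periodic value onto the unit circle so the ends of the cycle meet."""
    angle = 2.0 * np.pi * values / period
    return pd.DataFrame(
        {f"{values.name}_sin": np.sin(angle), f"{values.name}_cos": np.cos(angle)}
    )


def oof_target_mean(
    feature: pd.Series, target: pd.Series, n_folds: int = 5, seed: int = 0
) -> pd.Series:
    """Mean target per category, computed for each row from the other folds only."""
    rng = np.random.default_rng(seed)
    fold_id = rng.permutation(len(feature)) % n_folds  # random, near-equal folds
    encoded = pd.Series(np.nan, index=feature.index, dtype="float64")
    for fold in range(n_folds):
        held_out = fold_id == fold
        # Statistics come only from rows outside the held-out fold.
        means = target[~held_out].groupby(feature[~held_out]).mean()
        fallback = target[~held_out].mean()  # used for categories unseen in training folds
        encoded.loc[held_out] = (
            feature[held_out].map(means).fillna(fallback).to_numpy()
        )
    return encoded


hours = pd.Series([0, 6, 12, 18], name="hour")
hour_features = cyclical_encode(hours, period=24)

city = pd.Series(list("aabbaabbab"), name="city")
y = pd.Series([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])
city_te = oof_target_mean(city, y, n_folds=5)
```

The same fold loop extends to the other OOF statistics listed above (std, skew, WoE, entropy) by swapping the `.mean()` aggregation.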

3. Non-Linear Interactions

  • Genetic Programming: Evolves mathematical expressions using the base features (requires gplearn).
  • Pair Interactions: Label encoding of categorical column pairs (bigrams).
  • Numerical Products: NaN-safe products of numerical column pairs.
  • Digit × Category: Target encoding on the interaction of a column's digit and another category.

4. External Data Signals

  • Bayesian Priors: Historical P(target|value) from the original dataset.
  • External WoE: WoE scores pre-computed from the original dataset.
  • External Entropy: Group purity/impurity derived from the original dataset.
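The three external signals can be sketched for a binary target as a lookup table built once from the original dataset and then mapped onto the competition data. This is an assumption-laden illustration, not the library's code: `external_signals` is a hypothetical helper, the additive smoothing constant is arbitrary, and WoE is taken here as the log of the event share over the non-event share (the opposite sign convention is also in common use).

```python
import numpy as np
import pandas as pd


def external_signals(
    orig_feature: pd.Series, orig_target: pd.Series, smoothing: float = 0.5
) -> pd.DataFrame:
    """Per-value prior, WoE, and entropy of a binary target in the original data."""
    stats = orig_target.groupby(orig_feature).agg(events="sum", total="count")
    stats["non_events"] = stats["total"] - stats["events"]

    # Prior: empirical P(target = 1 | value).
    stats["prior"] = stats["events"] / stats["total"]

    # WoE: log ratio of the value's share of events to its share of non-events.
    n_values = len(stats)
    event_share = (stats["events"] + smoothing) / (
        stats["events"].sum() + smoothing * n_values
    )
    non_event_share = (stats["non_events"] + smoothing) / (
        stats["non_events"].sum() + smoothing * n_values
    )
    stats["woe"] = np.log(event_share / non_event_share)

    # Binary entropy in bits; clipping avoids log(0) for pure groups.
    p = stats["prior"].clip(1e-12, 1 - 1e-12)
    stats["entropy"] = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    return stats[["prior", "woe", "entropy"]]


orig_x = pd.Series(["a", "a", "a", "a", "b", "b", "b", "b"], name="grade")
orig_y = pd.Series([1, 1, 0, 0, 1, 0, 0, 0])
table = external_signals(orig_x, orig_y)

# Map the external statistics onto competition rows; unseen values become NaN.
train_x = pd.Series(["a", "b", "c"], name="grade")
train_prior = train_x.map(table["prior"])
```

Since the statistics are computed on a separate dataset, the competition target never enters them, which is why no out-of-fold scheme is needed in this sketch.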

⚙️ Configuration

Parameter | Default | Description
--- | --- | ---
task | "auto" | "classification", "regression", or "auto"
n_folds | 5 | Number of CV folds for evaluation
time_budget | None | Max wall-clock seconds for the search
improvement_threshold | 1e-7 | Min score delta to keep a feature
sample | None | Rows to sample for evaluation (speeds up search)
gp_generations | 5 | Evolution steps for Genetic Programming
gp_n_components | 5 | Max GP features to potentially keep
original_df | None | External dataset for Priors/WoE/Entropy
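A possible way to combine these settings is to keep them in one dictionary and unpack it into the call. The sketch assumes every parameter in the table is accepted by `select_features` as a keyword argument (the examples earlier only show `task`, `time_budget`, and `original_df` used this way). The values are illustrative, and `SEARCH_CONFIG` and `run_search` are hypothetical names.

```python
# Parameter names come from the configuration table; the values are illustrative.
SEARCH_CONFIG = {
    "task": "classification",
    "n_folds": 5,
    "time_budget": 1800,            # stop the search after 30 minutes
    "improvement_threshold": 1e-5,  # stricter than the 1e-7 default
    "sample": 50_000,               # evaluate candidates on a row subsample
    "gp_generations": 5,
    "gp_n_components": 5,
}


def run_search(X_train, y_train, X_test):
    """Run feature selection with the settings above (assumes they are keyword arguments)."""
    # Imported lazily so the config can be inspected without autofepg installed.
    from autofepg import select_features

    return select_features(X_train, y_train, X_test, **SEARCH_CONFIG)
```

A higher `improvement_threshold` and a smaller `sample` both trade some thoroughness for a faster search, so they are worth tuning together with `time_budget`.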

📝 Changelog

v0.3.0 (Current)

  • Refactoring: Removed competition-specific features (Domain Alignment, Dataset Frequency, Rarity).
  • New Features: Cyclical Features, OOF/External WoE, OOF/External Entropy, Genetic Programming (gplearn).
  • Enhanced Digits: Added Decimal Digit extraction.
  • Enhanced Aggregation: Added Skewness support to OOF Target Aggregation.
  • Simplified API: Decoupled from specific dataset patterns; focused on universal engineering.

v0.2.0

  • Added original dataset support (Domain Alignment, Bayesian Priors).
  • Introduced Cross-Dataset Frequency and Rarity features.

📄 License

MIT License — Copyright (c) 2026 Thomas Tschinkel.
