AutoFE - Playground: Automatic Feature Engineering & Selection for Kaggle Playground Competitions
🧪 AutoFE-PG
Automatic Feature Engineering & Selection for Kaggle Playground Competitions
AutoFE-PG is a library that automatically generates, evaluates, and selects engineered features for tabular ML models. Every target-dependent feature is computed out-of-fold, so the target does not leak into the features.
Version 0.3.0 is a complete refactoring that replaces competition-specific features with general-purpose strategies for tabular competitions: advanced binning, digit-based features, cyclical encoding, Weight of Evidence, and Genetic Programming interactions.
✨ Key Features
| Feature | Description |
|---|---|
| Genetic Programming | Generates complex non-linear interactions using gplearn |
| Digit-Based Logic | Extracts integer and decimal positions; creates digit-cross-category interactions |
| Target Representation | OOF Target Aggregation (mean, std, skew), WoE, and Entropy features |
| Cyclical Encoding | Sine/Cosine transformations for periodic numerical features |
| Advanced Binning | Both Quantile (qcut) and Equal-width (cut) discretization |
| External Signal Injection | Inject historical Priors, WoE, and Entropy from original datasets |
| Zero Target Leakage | All target-dependent features use strict out-of-fold (OOF) strategies |
| Greedy Selection | Forward selection keeps only features that improve CV score |
| GPU Acceleration | Built-in support for XGBoost GPU engines |
🚀 Quick Start
Installation
pip install autofepg
# Optional: for Genetic Programming features
pip install gplearn
Basic Usage
import pandas as pd
from autofepg import select_features
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
X_train = train.drop(columns=["id", "target"])
y_train = train["target"]
X_test = test.drop(columns=["id"])
result = select_features(
    X_train, y_train, X_test,
    task="classification",
    time_budget=3600,  # 1 hour limit
)
X_train_new = result["X_train"]
X_test_new = result["X_test"]
print(f"Features added: {len(result['selected_features'])}")
print(f"CV Improvement: {result['base_score']:.6f} -> {result['best_score']:.6f}")
Injecting Historical Signals (Original Data)
If you have access to the original "real world" dataset that the synthetic competition data was generated from (common in Kaggle Playground competitions), you can pass it in and inject its signals without leakage:
result = select_features(
    X_train, y_train, X_test,
    original_df=original_df,
    original_target=original_target,
    task="classification",
)
📖 Feature Strategies (v0.3.0)
1. Digits & Discretization
- Digit Extraction: Integer positions (units, tens, etc.) and decimal positions.
- Digit Interactions: Column-wise and cross-column interactions between digits.
- Binning: Discretize continuous variables via Quantile (qcut) or Equal-width (cut) bins.
- Rounding: Rounding to various decimal places or magnitudes to find structural splits.
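The digit, binning, and rounding ideas above can be sketched with plain pandas. This is an illustration only, not AutoFE-PG's internal code: the helper names `integer_digit` and `decimal_digit` and the `price` column are hypothetical, and the sketch assumes the column has no missing values.

```python
import numpy as np
import pandas as pd


def integer_digit(values: pd.Series, position: int) -> pd.Series:
    """Digit at an integer position (0 = units, 1 = tens, ...). Hypothetical helper."""
    whole = np.floor(values.abs()).astype("int64")
    return (whole // 10 ** position) % 10


def decimal_digit(values: pd.Series, position: int) -> pd.Series:
    """Digit at a decimal position (1 = tenths, 2 = hundredths, ...). Hypothetical helper."""
    # Round after scaling so floating-point noise (e.g. 9900.999999) does not shift the digit.
    scaled = (values.abs() * 10 ** position).round(6)
    return np.floor(scaled).astype("int64") % 10


df = pd.DataFrame({"price": [123.45, 7.08, 990.10, 56.72]})
df["price_units"] = integer_digit(df["price"], 0)
df["price_tens"] = integer_digit(df["price"], 1)
df["price_tenths"] = decimal_digit(df["price"], 1)

# Quantile bins hold roughly equal row counts; equal-width bins span equal value ranges.
df["price_qbin"] = pd.qcut(df["price"], q=2, labels=False, duplicates="drop")
df["price_wbin"] = pd.cut(df["price"], bins=2, labels=False)

# Rounding to a magnitude (nearest 10) and to one decimal place.
df["price_round10"] = (df["price"] / 10).round() * 10
df["price_round1"] = df["price"].round(1)

print(df)
```

On skewed data the two binning schemes disagree: here the single large value gets an equal-width bin to itself, while the quantile bins split the rows evenly.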
2. Specialized Encoding
- Cyclical Encoding: Sin/Cos transforms for periodic data.
- Target Encoding (OOF): Out-of-fold mean target per category.
- Weight of Evidence (WoE): OOF WoE scores for binary classification.
- Entropy: OOF target entropy per value group.
- OOF Aggregation: Mean, Std, and Skew of the target grouped by feature values.
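As a rough sketch of two of the encodings listed above, the following shows a sine/cosine transform and an out-of-fold mean target encoding. It is not the library's implementation: `cyclical_encode` and `oof_target_mean` are hypothetical names, the fold assignment is a simple random split written with NumPy, and unseen categories are assumed to fall back to the training-fold mean.

```python
import numpy as np
import pandas as pd


def cyclical_encode(values: pd.Series, period: float) -> pd.DataFrame:
    """Map a periodic value onto the unit circle so both ends of the period meet."""
    angle = 2.0 * np.pi * values / period
    return pd.DataFrame(
        {f"{values.name}_sin": np.sin(angle), f"{values.name}_cos": np.cos(angle)}
    )


def oof_target_mean(cat: pd.Series, y: pd.Series, n_folds: int = 5, seed: int = 0) -> pd.Series:
    """Out-of-fold mean target per category.

    Each row is encoded from the other folds only, so a row's own target
    never contributes to its encoded value.
    """
    rng = np.random.default_rng(seed)
    fold_of_row = rng.permutation(len(cat)) % n_folds
    encoded = pd.Series(np.nan, index=cat.index, dtype="float64")
    for fold in range(n_folds):
        held_out = fold_of_row == fold
        # Category means computed on the training part of this split only.
        means = y[~held_out].groupby(cat[~held_out]).mean()
        prior = y[~held_out].mean()
        # Categories unseen in the training part fall back to the prior (an assumption).
        encoded[held_out] = cat[held_out].map(means).fillna(prior).to_numpy()
    return encoded


hours = pd.Series([0, 6, 12, 18, 23], name="hour")
hour_features = cyclical_encode(hours, period=24)
print(hour_features)

cat = pd.Series(list("aabbabab"))
y = pd.Series([1, 0, 1, 1, 0, 1, 1, 0])
print(oof_target_mean(cat, y, n_folds=4))
```

The same fold loop would carry over to the other OOF statistics (std, skew, WoE, entropy) by swapping the per-category aggregate.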
3. Non-Linear Interactions
- Genetic Programming: Evolves mathematical expressions using the base features (requires gplearn).
- Pair Interactions: Categorical label-encoding of bigrams.
- Numerical Products: NaN-safe products of bigram numerical features.
- Digit × Category: Target encoding on the interaction of a column's digit and another category.
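A pair interaction can be illustrated as label-encoding the combination of two columns. The sketch below is one plausible way to do it, not necessarily how AutoFE-PG does: `pair_codes` and the `city`/`plan` columns are hypothetical, and the codes are fitted on train and test together so the same pair always receives the same integer.

```python
import pandas as pd


def pair_codes(train: pd.DataFrame, test: pd.DataFrame, col_a: str, col_b: str):
    """Label-encode the combination of two columns consistently across train and test."""

    def joined(df: pd.DataFrame) -> pd.Series:
        # Cast to string so a missing value becomes its own "nan" level.
        return df[col_a].astype(str) + "|" + df[col_b].astype(str)

    combined = pd.concat([joined(train), joined(test)], ignore_index=True)
    # factorize assigns integer codes in order of first appearance.
    codes, _ = pd.factorize(combined)
    return codes[: len(train)], codes[len(train):]


train = pd.DataFrame({"city": ["A", "A", "B"], "plan": ["x", "y", "x"]})
test = pd.DataFrame({"city": ["B", "A"], "plan": ["x", "x"]})
train_codes, test_codes = pair_codes(train, test, "city", "plan")
print(train_codes, test_codes)
```

The resulting integer column can then be fed to a tree model directly or passed through an out-of-fold target encoder, which is the pattern the Digit × Category strategy describes.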
4. External Data Signals
- Bayesian Priors: Historical P(target|value) from the original dataset.
- External WoE: WoE scores pre-computed from the original dataset.
- External Entropy: Group purity/impurity derived from the original dataset.
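To make the three external signals concrete, here is a minimal sketch for a binary target. It assumes the statistics are computed on the original dataset only and then looked up for competition rows, which is why no fold splitting appears. The function name `external_signals`, the additive smoothing, and the sign convention (WoE as the log of event share over non-event share) are assumptions of this sketch, not confirmed details of the library.

```python
import numpy as np
import pandas as pd


def external_signals(orig_values: pd.Series, orig_target: pd.Series, smoothing: float = 1.0) -> pd.DataFrame:
    """Per-value prior, WoE, and entropy from the original dataset (binary target)."""
    stats = pd.DataFrame({"value": orig_values.to_numpy(), "target": orig_target.to_numpy()})
    grouped = stats.groupby("value")["target"].agg(events="sum", total="count")
    non_events = grouped["total"] - grouped["events"]
    n_levels = len(grouped)

    # Additive smoothing keeps the log finite when a level has no events or no non-events.
    event_share = (grouped["events"] + smoothing) / (grouped["events"].sum() + smoothing * n_levels)
    non_event_share = (non_events + smoothing) / (non_events.sum() + smoothing * n_levels)

    prior = grouped["events"] / grouped["total"]  # P(target = 1 | value)
    p = prior.clip(1e-12, 1 - 1e-12)  # avoid log(0) for pure groups
    entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

    return pd.DataFrame(
        {"prior": prior, "woe": np.log(event_share / non_event_share), "entropy": entropy}
    )


original = pd.DataFrame(
    {"color": ["r", "r", "r", "g", "g", "b"], "target": [1, 1, 0, 0, 0, 1]}
)
table = external_signals(original["color"], original["target"])

# Look the signals up for competition rows; values absent from the original data stay NaN.
train = pd.DataFrame({"color": ["g", "r", "z"]})
enriched = train.join(table, on="color")
print(enriched)
```

Entropy is near 0 for pure groups and reaches 1 bit when a group is split evenly between the two classes.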
⚙️ Configuration
| Parameter | Default | Description |
|---|---|---|
| task | "auto" | "classification", "regression", or "auto" |
| n_folds | 5 | Number of CV folds for evaluation |
| time_budget | None | Max wall-clock seconds for the search |
| improvement_threshold | 1e-7 | Min score delta to keep a feature |
| sample | None | Rows to sample for evaluation (speeds up search) |
| gp_generations | 5 | Evolution steps for Genetic Programming |
| gp_n_components | 5 | Max GP features to potentially keep |
| original_df | None | External dataset for Priors/WoE/Entropy |
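How n_folds, time_budget, and improvement_threshold might interact in a greedy forward search can be sketched as follows. This is a simplified illustration, not the library's search loop: `cv_r2` is a stand-in scorer based on ordinary least squares (not the model AutoFE-PG actually trains), and `greedy_forward_selection` is a hypothetical name.

```python
import time

import numpy as np


def cv_r2(X: np.ndarray, y: np.ndarray, n_folds: int = 5) -> float:
    """Mean out-of-fold R^2 of a least-squares fit (stand-in scorer)."""
    folds = np.arange(len(y)) % n_folds
    scores = []
    for fold in range(n_folds):
        held_out = folds == fold
        A_train = np.column_stack([np.ones((~held_out).sum()), X[~held_out]])
        A_test = np.column_stack([np.ones(held_out.sum()), X[held_out]])
        coef, *_ = np.linalg.lstsq(A_train, y[~held_out], rcond=None)
        residual = y[held_out] - A_test @ coef
        scores.append(1.0 - np.mean(residual ** 2) / np.var(y[held_out]))
    return float(np.mean(scores))


def greedy_forward_selection(base, candidates, y, n_folds=5,
                             improvement_threshold=1e-7, time_budget=None):
    """Keep a candidate only if it raises the CV score by more than the threshold."""
    deadline = None if time_budget is None else time.monotonic() + time_budget
    current = base
    base_score = best_score = cv_r2(current, y, n_folds)
    selected = []
    for name, column in candidates.items():
        if deadline is not None and time.monotonic() > deadline:
            break  # wall-clock budget exhausted; stop trying candidates
        trial = np.column_stack([current, column])
        score = cv_r2(trial, y, n_folds)
        if score - best_score > improvement_threshold:
            current, best_score = trial, score
            selected.append(name)
    return selected, base_score, best_score


rng = np.random.default_rng(0)
n = 400
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = x1 + x1 * x2 + 0.1 * rng.normal(size=n)

candidates = {"x1_times_x2": x1 * x2, "random_noise": rng.normal(size=n)}
selected, base_score, best_score = greedy_forward_selection(
    np.column_stack([x1, x2]), candidates, y, time_budget=60
)
print(selected, round(base_score, 3), round(best_score, 3))
```

Because each candidate is judged against the current best feature set, the order in which candidates are tried can change which ones survive; that is an inherent property of greedy selection.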
📝 Changelog
v0.3.0 (Current)
- Refactoring: Removed competition-specific features (Domain Alignment, Dataset Frequency, Rarity).
- New Features: Cyclical Features, OOF/External WoE, OOF/External Entropy, Genetic Programming (gplearn).
- Enhanced Digits: Added Decimal Digit extraction.
- Enhanced Aggregation: Added Skewness support to OOF Target Aggregation.
- Simplified API: Decoupled from specific dataset patterns; focused on universal engineering.
v0.2.0
- Added original dataset support (Domain Alignment, Bayesian Priors).
- Introduced Cross-Dataset Frequency and Rarity features.
📄 License
MIT License — Copyright (c) 2026 Thomas Tschinkel.