AutoFE - Playground: Automatic Feature Engineering & Selection for Kaggle Playground Competitions
🧪 AutoFE-PG
Automatic Feature Engineering & Selection for Kaggle Playground Competitions
AutoFE-PG is a production-ready library that automatically generates, evaluates, and selects engineered features to boost your tabular ML models, with zero target leakage.
Designed specifically for Kaggle Playground competitions where synthetic data is common, it includes specialized strategies for domain alignment, Bayesian priors from external data, dual-representation features, and cross-dataset density analysis.
✨ Key Features
| Feature | Description |
|---|---|
| Auto column detection | Automatically identifies categorical vs. numerical columns |
| 25+ feature strategies | Target encoding, domain alignment, Bayesian priors, dual representation, cross-dataset frequency, count encoding, digit extraction, arithmetic interactions, group statistics, and more |
| Zero target leakage | All target-dependent features use strict out-of-fold encoding |
| Greedy forward selection | Adds features one-by-one, keeping only those that improve CV score |
| Optional backward pruning | Removes redundant features after forward selection |
| Original data integration | Snap synthetic values to real clinical grids and inject historical priors |
| GPU acceleration | Automatically uses XGBoost GPU if available |
| Time budget | Set a wall-clock limit; the search stops gracefully |
| Sampling support | Evaluate on a subsample for faster iteration |
| Custom XGBoost params | Pass your own hyperparameters |
| Score variance tracking | Reports mean ± std across folds |
| Classification & regression | Supports both tasks with auto-detection |
| Detailed reports | Auto-generated .txt report with full selection history |
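Two of the simpler strategies named in the table, count encoding and digit extraction, can be illustrated with plain pandas. This is a minimal sketch of the general techniques, not AutoFE-PG's own implementation; the helper names `count_encode` and `extract_digit` are hypothetical, and the sketch assumes the numerical column has no missing values.

```python
import numpy as np
import pandas as pd


def count_encode(train_col: pd.Series, test_col: pd.Series) -> tuple[pd.Series, pd.Series]:
    """Replace each value with how often it occurs in the training column."""
    counts = train_col.value_counts()
    # Values never seen in train map to NaN, which becomes a count of 0.
    return (
        train_col.map(counts).fillna(0).astype(int),
        test_col.map(counts).fillna(0).astype(int),
    )


def extract_digit(col: pd.Series, position: int) -> pd.Series:
    """Digit at 10**position of the integer part (0 = ones, 1 = tens, ...)."""
    integer_part = np.floor(col.abs()).astype("int64")  # assumes no NaN values
    return (integer_part // 10**position) % 10


train = pd.DataFrame({"city": ["a", "b", "a", "c"], "price": [123.4, 56.0, 7.9, 840.0]})
test = pd.DataFrame({"city": ["a", "d"], "price": [19.0, 305.5]})

train_ce, test_ce = count_encode(train["city"], test["city"])
tens_digit = extract_digit(train["price"], position=1)
```

Neither helper reads the target, which is why both strategies are leakage-free by construction.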
Quick Start
Installation
pip install autofepg
Or install dependencies directly:
pip install -r requirements.txt
Minimal Example
import pandas as pd
from autofepg import select_features
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
X_train = train.drop(columns=["id", "target"])
y_train = train["target"]
X_test = test.drop(columns=["id"])
result = select_features(
    X_train, y_train, X_test,
    task="classification",
    time_budget=3600,
)
X_train_new = result["X_train"]
X_test_new = result["X_test"]
print(f"Baseline AUC: {result['base_score']:.6f}")
print(f"Best AUC: {result['best_score']:.6f}")
print(f"Features added: {len(result['selected_features'])}")
With Original Data (Domain Alignment + Bayesian Priors)
When working with Kaggle Playground competitions where synthetic data is generated from a real dataset, you can pass the original data to unlock powerful de-noising and prior-injection strategies:
import pandas as pd
from autofepg import select_features
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
original = pd.read_csv("original.csv")
X_train = train.drop(columns=["id", "target"])
y_train = train["target"]
X_test = test.drop(columns=["id"])
X_original = original.drop(columns=["target"])
y_original = original["target"]
result = select_features(
    X_train, y_train, X_test,
    task="classification",
    time_budget=3600,
    original_df=X_original,
    original_target=y_original,
)
X_train_new = result["X_train"]
X_test_new = result["X_test"]
print(f"Baseline AUC: {result['base_score']:.6f}")
print(f"Best AUC: {result['best_score']:.6f}")
print(f"Features added: {len(result['selected_features'])}")
Using the Class API
from autofepg import AutoFE
import pandas as pd

# X_train, y_train, and X_test are prepared as in the examples above.
original = pd.read_csv("original.csv")

autofe = AutoFE(
    task="classification",
    n_folds=5,
    time_budget=1800,
    improvement_threshold=0.0001,
    backward_selection=True,
    sample=10000,
    original_df=original.drop(columns=["target"]),
    original_target=original["target"],
    xgb_params={
        "n_estimators": 1000,
        "max_depth": 8,
        "learning_rate": 0.05,
    },
)

X_train_new, X_test_new = autofe.fit_select(
    X_train, y_train, X_test,
    aux_target_cols=["employment_status", "debt_to_income_ratio"],
)

# Inspect results
print(autofe.get_selected_feature_names())
history_df = autofe.get_history()
details_df = autofe.get_selection_details()
How It Works
1. Feature Generation
AutoFE-PG generates candidates from a hardcoded priority sequence ordered by expected impact:
| Priority | Strategy | Description | Leakage-free? |
|---|---|---|---|
| 1 | Domain Alignment | Snap synthetic values to nearest real-data grid point; expose residual | Yes (no target) |
| 2 | Bayesian Priors | Inject P(target \| value) from original dataset as external knowledge | Yes (no train target) |
| 3 | Target Encoding (single) | OOF mean-target per category | Yes (OOF) |
| 4 | Count Encoding (single) | Value counts per category | Yes (no target) |
| 5 | Dual Representation | Continuous + label-encoded copy of each numerical column | Yes (no target) |
| 6 | Target Encoding on pairs | OOF TE on column pair interactions | Yes (OOF) |
| 7 | Count Encoding on pairs | Value counts on column pair interactions | Yes (no target) |
| 8 | Frequency Encoding | Normalized value counts | Yes (no target) |
| 9 | Cross-Dataset Frequency & Rarity | How common/rare a value is across train+test+original | Yes (no target) |
| 10 | Missing Indicators | Binary NaN flags | Yes (no target) |
| 11 | TE with auxiliary targets | OOF TE using a different column as target | Yes (OOF) |
| 12 | Unary transforms | log1p, sqrt, square, reciprocal | Yes (no target) |
| 13 | Arithmetic interactions | add, sub, mul, div between numerical pairs | Yes (no target) |
| 14 | Polynomial features | Square and cross-product terms | Yes (no target) |
| 15 | Pairwise label interactions | Label-encoded column pairs | Yes (no target) |
| 16 | TE/CE on digits | Target/count encoding on extracted digits | Yes (OOF / no target) |
| 17 | Digit × Category TE | Digit-category interaction with OOF TE | Yes (OOF) |
| 18 | Quantile binning | Equal-frequency bins | Yes (no target) |
| 19 | Raw digit extraction | i-th digit of numerical values | Yes (no target) |
| 20 | Digit interactions | Within-feature and cross-feature digit combos | Yes (no target) |
| 21 | Rounding features | Round to various decimal places / magnitudes | Yes (no target) |
| 22 | Num-to-Cat conversion | Equal-width binning | Yes (no target) |
| 23 | Group statistics & deviations | Mean, std, min, max, median by group; diff/ratio to group | Yes (no target) |
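Several rows in the table rely on out-of-fold (OOF) target encoding. The sketch below shows the general idea with NumPy and pandas only; it is an illustration under stated assumptions, not the library's code. The function name `oof_target_encode`, the random fold assignment, and the fallback to the fitting folds' mean for unseen categories are all choices made for this example.

```python
import numpy as np
import pandas as pd


def oof_target_encode(cat: pd.Series, y: pd.Series, n_folds: int = 5, seed: int = 42) -> pd.Series:
    """Encode each row with its category's target mean, computed on the other folds only."""
    rng = np.random.default_rng(seed)
    fold_of_row = rng.permutation(len(cat)) % n_folds  # random, balanced fold labels
    encoded = pd.Series(np.nan, index=cat.index, dtype=float)

    for fold in range(n_folds):
        held_out = fold_of_row == fold
        fit_cat, fit_y = cat[~held_out], y[~held_out]
        # Category means are fitted without the held-out rows...
        fit_means = fit_y.groupby(fit_cat).mean()
        # ...and categories missing from the fitting folds fall back to their mean target.
        fold_prior = fit_y.mean()
        encoded.loc[held_out] = cat[held_out].map(fit_means).fillna(fold_prior).to_numpy()
    return encoded


cat = pd.Series(["a", "a", "a", "a", "z"])
y = pd.Series([0, 0, 0, 0, 1])

encoded = oof_target_encode(cat, y, n_folds=5)
naive = cat.map(y.groupby(cat).mean())  # in-sample encoding: each row sees its own target
```

In this toy data the single `"z"` row has target 1. The naive encoding hands the model that exact value, while the OOF version encodes the row from the other folds only, so the row's own target never reaches its feature.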
2. Greedy Forward Selection
Each candidate is evaluated by adding it to the current feature set and running XGBoost K-fold CV. A feature is kept only if it improves the score beyond the configured threshold.
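The loop described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the library's engine: `greedy_forward_select` and `toy_cv_score` are hypothetical names, the score is assumed to be maximized, and the toy scorer stands in for XGBoost K-fold CV.

```python
from typing import Callable, Sequence


def greedy_forward_select(
    base: Sequence[str],
    candidates: Sequence[str],
    score_fn: Callable[[list[str]], float],
    threshold: float = 1e-7,
) -> tuple[list[str], float]:
    """Keep a candidate only if it lifts the score by more than `threshold`."""
    selected = list(base)
    best = score_fn(selected)
    for name in candidates:  # each candidate is tried once, in priority order
        trial_score = score_fn(selected + [name])
        if trial_score - best > threshold:
            selected.append(name)
            best = trial_score
    return selected, best


# Toy stand-in for K-fold CV: every feature has a fixed, additive contribution.
# A real score_fn would train a model on the listed columns and return the CV mean.
GAIN = {"age": 0.70, "te__city": 0.05, "noise": 0.0, "ce__city": 0.01}


def toy_cv_score(features: list[str]) -> float:
    return sum(GAIN[f] for f in features)


selected, best = greedy_forward_select(
    ["age"], ["te__city", "noise", "ce__city"], toy_cv_score
)
```

Because each candidate is scored against the current set rather than the baseline, the order of candidates matters, which is why the generation step sorts strategies by expected impact.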
3. Optional Backward Pruning
After forward selection, features are tested for removal. If removing a feature improves (or maintains) the score, it is permanently dropped.
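A minimal sketch of this pruning pass follows, again with a toy scorer in place of real cross-validation. The names `backward_prune` and `protected` are hypothetical, and treating the original columns as protected from removal is an assumption of this example rather than documented behavior.

```python
from typing import Callable


def backward_prune(
    selected: list[str],
    protected: set[str],
    score_fn: Callable[[list[str]], float],
) -> tuple[list[str], float]:
    """Drop any non-protected feature whose removal does not lower the score."""
    kept = list(selected)
    best = score_fn(kept)
    for name in list(kept):  # iterate over a copy, since `kept` shrinks
        if name in protected:
            continue
        trial = [f for f in kept if f != name]
        trial_score = score_fn(trial)
        if trial_score >= best:  # removal improves or maintains the score
            kept, best = trial, trial_score
    return kept, best


def toy_cv_score(features: list[str]) -> float:
    score = 0.70 if "age" in features else 0.0
    # The two city encodings carry the same signal, so one of them is redundant.
    if "te__city" in features or "freq__city" in features:
        score += 0.05
    return score


kept, best = backward_prune(["age", "te__city", "freq__city"], {"age"}, toy_cv_score)
```

Here the first city encoding is dropped because the second one covers the same signal; the second is then kept because removing it would lower the score.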
🧬 Synthetic Data Strategies
AutoFE-PG includes four strategies specifically designed for Kaggle Playground competitions where the training data is synthetically generated from a real-world dataset.
Domain Alignment (De-noising)
The synthetic generation process often introduces "fuzzy" values that wouldn't exist in a real clinical setting. Domain Alignment forces every continuous value in the synthetic set to its nearest neighbor in the original dataset, effectively "snapping" the data back to its true clinical grid. The residual (distance to the snap point) is also exposed as a feature, since it encodes how much the synthetic process perturbed the value.
from autofepg.generators import DomainAlignmentFeature
import numpy as np
# Reference values from original dataset
ref_vals = original["blood_pressure"].dropna().unique()
gen = DomainAlignmentFeature("blood_pressure", reference_values=ref_vals)
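To make the snapping step concrete, here is a standalone NumPy sketch of nearest-reference-value alignment with its residual. It illustrates the idea only and is not taken from `DomainAlignmentFeature`; the function name `snap_to_reference` is hypothetical, and the sketch assumes the inputs contain no missing values.

```python
import numpy as np


def snap_to_reference(values: np.ndarray, reference: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Return (snapped, residual), where snapped is the nearest reference value."""
    grid = np.unique(reference)  # sorted, de-duplicated reference grid
    # Index of the first grid point >= value, clipped so idx - 1 and idx are both valid.
    idx = np.clip(np.searchsorted(grid, values), 1, len(grid) - 1)
    left, right = grid[idx - 1], grid[idx]
    # Pick whichever neighbour is closer; ties go to the lower grid point.
    snapped = np.where(values - left <= right - values, left, right)
    return snapped, values - snapped


reference = np.array([110.0, 120.0, 130.0, 140.0])  # values seen in the original data
synthetic = np.array([118.7, 131.2, 104.0, 150.5])  # noisy synthetic values

snapped, residual = snap_to_reference(synthetic, reference)
```

Values outside the reference range snap to the nearest end of the grid, and their residual grows accordingly, which can itself flag rows the generator pushed far from real data.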
Bayesian-Style Priors (External Mapping)
Instead of letting the model learn strictly from the training data, Bayesian Priors import external knowledge from the original dataset. By calculating P(target | value) in the original file and injecting those probabilities as features, the model starts with a "hint" about which values are clinically dangerous. This uses no information from the training target, so it introduces no leakage.
from autofepg.generators import BayesianPriorFeature
# Pre-computed from original data
prior_map = original.groupby("cholesterol")["heart_disease"].mean().to_dict()
gen = BayesianPriorFeature("cholesterol", prior_map=prior_map)
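The mapping itself can be reproduced with plain pandas, as in the sketch below. The toy data, the output column name `prior__cholesterol`, and the fallback to the original dataset's overall target rate for unseen values are assumptions made for this example; how the library handles unseen values is not stated here.

```python
import pandas as pd

# Toy stand-ins for the original (real) data and the synthetic training data.
original = pd.DataFrame(
    {
        "cholesterol": [180, 180, 240, 240, 240, 300],
        "heart_disease": [0, 0, 0, 1, 1, 1],
    }
)
train = pd.DataFrame({"cholesterol": [180, 240, 300, 275]})

# P(target | value), estimated from the original data only.
prior_map = original.groupby("cholesterol")["heart_disease"].mean()
# Assumed fallback for values that never occur in the original data.
global_prior = original["heart_disease"].mean()

train["prior__cholesterol"] = train["cholesterol"].map(prior_map).fillna(global_prior)
```

The training target is never read, so the feature can be computed once for train and test alike without any fold logic.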
Dimensionality Expansion (Dual Representation)
AutoFE-PG uses a "dual-representation" strategy for numerical features:
- Continuous copy: Treated as a number to capture linear or threshold trends
- Categorical copy: Treated as a discrete label-encoded value to allow the tree to create very specific, non-linear splits on exact values
from autofepg.generators import DualRepresentationFeature
gen = DualRepresentationFeature("age")
# Produces: dual__age_cont (float) + dual__age_cat (int label)
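The two copies can be built by hand with `pd.factorize`, as sketched below. The output column names follow the comment above; fitting the label mapping on train and test together is an assumption of this sketch, chosen so that a value shared by both sets receives the same integer code.

```python
import pandas as pd

train = pd.DataFrame({"age": [34.0, 51.0, 34.0, 29.0]})
test = pd.DataFrame({"age": [51.0, 62.0]})

# One label mapping for both sets; sort=True makes codes follow numeric order.
combined = pd.concat([train["age"], test["age"]], ignore_index=True)
codes, uniques = pd.factorize(combined, sort=True)

# Continuous copy: the value itself, kept as a float.
train["dual__age_cont"] = train["age"].astype(float)
test["dual__age_cont"] = test["age"].astype(float)

# Categorical copy: an integer label per distinct value.
train["dual__age_cat"] = codes[: len(train)]
test["dual__age_cat"] = codes[len(train):]
```

The categorical copy becomes useful once it is passed to an encoder or declared as a categorical column, because that lets the model treat each exact value as its own level.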
Frequency and Density Analysis
Cross-dataset frequency analysis calculates the rarity of values across the entire data ecosystem (train, test, and original). This helps the model identify if a specific data point is an outlier or part of a common cluster, which is a strong signal in synthetic datasets where certain "modes" are over-represented.
from autofepg.generators import CrossDatasetFrequencyFeature, ValueRarityFeature
import pandas as pd
# Combine counts across all datasets
combined = pd.concat([train["age"], test["age"], original["age"]])
eco_counts = combined.value_counts()
eco_total = len(combined)
freq_gen = CrossDatasetFrequencyFeature("age", eco_counts, eco_total)
rare_gen = ValueRarityFeature("age", eco_counts, eco_total)
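For intuition, the two features can be computed directly from the combined counts. The sketch below is illustrative only: the changelog describes rarity as a log-inverse-frequency score, but the exact formula used by `ValueRarityFeature` is not given, so the add-one smoothing here is an assumption.

```python
import numpy as np
import pandas as pd

train = pd.Series([30, 30, 45, 60], name="age")
test = pd.Series([30, 45], name="age")
original = pd.Series([30, 45], name="age")

# Counts over the whole ecosystem: train + test + original.
combined = pd.concat([train, test, original], ignore_index=True)
eco_counts = combined.value_counts()
eco_total = len(combined)

train_counts = train.map(eco_counts).fillna(0)

# Frequency: share of all rows that carry this value.
frequency = train_counts / eco_total
# Rarity: log of the inverse frequency; add-one smoothing (assumed) keeps it finite.
rarity = np.log((eco_total + 1) / (train_counts + 1))
```

Common values get a high frequency and a low rarity score; the value 60, which appears once in eight rows, gets the highest rarity.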
Note: When you pass `original_df`, `original_target`, and `X_test` to `AutoFE` or `select_features`, all four strategies are automatically generated and evaluated. No manual setup required.
⚙️ Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| `task` | str | `"auto"` | `"classification"`, `"regression"`, or `"auto"` |
| `n_folds` | int | `5` | Number of CV folds |
| `time_budget` | float | `None` | Max seconds (wall clock) |
| `improvement_threshold` | float | `1e-7` | Min score delta to keep a feature |
| `sample` | int | `None` | Subsample rows for faster CV |
| `backward_selection` | bool | `False` | Run backward pruning after forward |
| `max_pair_cols` | int | `20` | Max columns for pairwise features |
| `max_digit_positions` | int | `4` | Max digit positions to extract |
| `xgb_params` | dict | `None` | Custom XGBoost hyperparameters |
| `metric_fn` | callable | `None` | Custom metric `(y_true, y_pred) -> float` |
| `metric_direction` | str | `None` | `"maximize"` or `"minimize"` |
| `random_state` | int | `42` | Random seed |
| `verbose` | bool | `True` | Print progress |
| `original_df` | DataFrame | `None` | Original (real) dataset features for domain alignment & priors |
| `original_target` | Series | `None` | Original dataset target for Bayesian prior computation |
| `report_path` | str | `"autofepg_report.txt"` | Path for detailed selection report |
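As an example of the `metric_fn` signature `(y_true, y_pred) -> float`, the sketch below defines an RMSE metric with NumPy. The function is ordinary NumPy code; the commented call at the end is an assumed usage based only on the parameter table and is not executed here.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root mean squared error. Lower is better, so pair it with "minimize"."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Quick check on three predictions: errors are 1, 0, and -2.
score = rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])

# Assumed usage, following the parameter table (not executed in this sketch):
# result = select_features(
#     X_train, y_train, X_test,
#     task="regression",
#     metric_fn=rmse,
#     metric_direction="minimize",
# )
```

Setting `metric_direction` explicitly matters for error metrics like this one, since a feature should be kept when the score goes down rather than up.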
Output
The select_features() function returns a dictionary:
{
    "X_train": pd.DataFrame,            # Augmented training data
    "X_test": pd.DataFrame,             # Augmented test data (if provided)
    "autofe": AutoFE,                   # Fitted AutoFE object
    "history": pd.DataFrame,            # Full selection history
    "selected_features": List[str],     # Names of kept features
    "selection_details": pd.DataFrame,  # Per-feature improvement details
    "base_score": float,                # Baseline CV mean
    "base_score_std": float,            # Baseline CV std
    "best_score": float,                # Final CV mean
    "best_score_std": float,            # Final CV std
}
🧪 Running Tests
pytest tests/ -v
Project Structure
autofepg/
├── autofepg/
│   ├── __init__.py          # Public API & exports
│   ├── utils.py             # GPU detection, task inference, metrics
│   ├── generators.py        # All feature generator classes (25+)
│   ├── builder.py           # FeatureCandidateBuilder
│   ├── engine.py            # XGBoost CV engine
│   └── core.py              # AutoFE class + select_features()
├── tests/
│   ├── __init__.py
│   └── test_autofepg.py     # Unit and integration tests
├── examples/
│   ├── example_classification.py
│   ├── example_regression.py
│   └── example_with_original.py
├── .github/
│   └── workflows/
│       └── ci.yml
├── .gitignore
├── LICENSE
├── README.md
├── CHANGELOG.md
├── CONTRIBUTING.md
├── Makefile
├── pyproject.toml
├── setup.py
└── requirements.txt
Generator Reference
Original Strategies
| Generator | Class | Target used? |
|---|---|---|
| Target Encoding | `TargetEncoding` | Yes (OOF) |
| Count Encoding | `CountEncoding` | No |
| Frequency Encoding | `FrequencyEncoding` | No |
| Pair Interaction | `PairInteraction` | No |
| TE on Pairs | `TargetEncodingOnPair` | Yes (OOF) |
| CE on Pairs | `CountEncodingOnPair` | No |
| Digit Extraction | `DigitFeature` | No |
| Digit Interaction | `DigitInteraction` | No |
| TE on Digits | `TargetEncodingOnDigit` | Yes (OOF) |
| CE on Digits | `CountEncodingOnDigit` | No |
| Digit × Cat TE | `DigitBasePairTE` | Yes (OOF) |
| Rounding | `RoundFeature` | No |
| Quantile Binning | `QuantileBinFeature` | No |
| Num-to-Cat | `NumToCat` | No |
| TE with Aux Target | `TargetEncodingAuxTarget` | Yes (OOF, aux) |
| Arithmetic Interaction | `ArithmeticInteraction` | No |
| Missing Indicator | `MissingIndicator` | No |
| Group Statistics | `GroupStatFeature` | No |
| Group Deviation | `GroupDeviationFeature` | No |
| Unary Transform | `UnaryTransform` | No |
| Polynomial Feature | `PolynomialFeature` | No |
Synthetic Data Strategies (NEW in v0.2.0)
| Generator | Class | Requires | Target used? |
|---|---|---|---|
| Domain Alignment | `DomainAlignmentFeature` | `original_df` | No |
| Bayesian Prior | `BayesianPriorFeature` | `original_df` + `original_target` | External target only |
| Dual Representation | `DualRepresentationFeature` | None | No |
| Cross-Dataset Frequency | `CrossDatasetFrequencyFeature` | `original_df` or `X_test` | No |
| Value Rarity | `ValueRarityFeature` | `original_df` or `X_test` | No |
Changelog
v0.2.0
- Domain Alignment: Snap synthetic values to nearest real-data grid point with residual feature
- Bayesian Priors: Inject external P(target|value) from original dataset
- Dual Representation: Continuous + categorical copy of numerical features
- Cross-Dataset Frequency: Value frequency across train+test+original ecosystem
- Value Rarity: Log-inverse-frequency score for outlier detection
- Added `original_df` and `original_target` parameters to `AutoFE` and `select_features`
- Report now includes original data status
- Version bump to 0.2.0
v0.1.3
- Initial public release
- 20+ feature generation strategies
- Greedy forward selection with optional backward pruning
- GPU acceleration support
- Detailed text report generation
License
MIT License. See LICENSE for details.