Intelligent automatic feature engineering for tabular ML.

AutoFeature

What is AutoFeature?

AutoFeature is a scikit-learn compatible library that automates the most impactful parts of tabular feature engineering:

Component | What it does
AutoFeatureEngineer | Detects and generates useful interaction features (products, ratios, differences) using importance-guided search
TargetAwareSelector | Selects features by mutual information with the target, not just variance
CyclicalEncoder | Encodes periodic variables (hour, month, day) with sin/cos to preserve cyclical structure
SmartCategoricalEncoder | Automatically picks the right encoding per column: label / one-hot / target encoding
LeakageDetector | Warns about features that suspiciously correlate with the target
AutoFeaturePipeline | Runs everything end-to-end in one call

Installation

pip install sufyaan-autofeature

Requires Python ≥ 3.8, scikit-learn ≥ 1.0, pandas ≥ 1.3, numpy ≥ 1.21.

Quickstart

Full Pipeline (recommended)

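import pandas as pd
from autofeature import AutoFeaturePipeline

# X_train / X_test are pandas DataFrames and y_train is the target,
# loaded and split beforehand (not shown here).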

pipeline = AutoFeaturePipeline(
    cyclical_columns={"hour": 24, "month": 12},
    max_interaction_features=15,
    k=20,                  # keep top 20 features
    task="classification",
    verbose=True,
)

X_train_out = pipeline.fit_transform(X_train, y_train)
X_test_out  = pipeline.transform(X_test)

print(pipeline.get_summary())

Individual Components

from autofeature import (
    AutoFeatureEngineer,
    TargetAwareSelector,
    CyclicalEncoder,
    SmartCategoricalEncoder,
    LeakageDetector,
)

# 1. Detect leakage (fit on train, then drop the same columns from both sets)
ld = LeakageDetector()
ld.fit(X_train, y_train)
X_train = ld.remove_leaky(X_train)
X_test  = ld.remove_leaky(X_test)

# 2. Encode categoricals automatically
enc = SmartCategoricalEncoder()
X_train = enc.fit_transform(X_train, y_train)
X_test  = enc.transform(X_test)

# 3. Encode cyclical columns
cyc = CyclicalEncoder(columns={"hour": 24, "day_of_week": 7})
X_train = cyc.fit_transform(X_train)
X_test  = cyc.transform(X_test)

# 4. Generate interaction features
afe = AutoFeatureEngineer(max_interaction_features=20)
X_train = afe.fit_transform(X_train, y_train)
X_test  = afe.transform(X_test)

# See what interactions were selected
print(afe.get_interaction_report())

# 5. Select top features by target mutual information
sel = TargetAwareSelector(k=15)
X_train = sel.fit_transform(X_train, y_train)
X_test  = sel.transform(X_test)

print(sel.get_feature_scores())

API Reference

AutoFeatureEngineer

AutoFeatureEngineer(
    max_interaction_features=20,   # max interactions to add
    interaction_types=["product", "ratio", "difference"],
    interaction_threshold=0.01,    # minimum importance gain
    n_estimators=50,               # trees in internal evaluator
    task="auto",                   # "classification" | "regression" | "auto"
    random_state=42,
    verbose=False,
)

Methods: fit(X, y), transform(X), fit_transform(X, y), get_interaction_report()
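
To make the idea of importance-guided interaction search concrete, here is a minimal sketch, not the library's actual implementation. It builds every pairwise product, ratio, and difference, then ranks them by the feature importances of a random forest. The helper names (make_interactions, rank_interactions), the column-naming scheme, and the choice of RandomForestRegressor as the evaluator are all assumptions made for illustration.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def make_interactions(X, interaction_types=("product", "ratio", "difference")):
    """Build pairwise interaction columns (hypothetical naming scheme)."""
    out = {}
    for a, b in combinations(X.columns, 2):
        if "product" in interaction_types:
            out[f"{a}_x_{b}"] = X[a] * X[b]
        if "ratio" in interaction_types:
            # Avoid division by zero: a zero denominator becomes NaN.
            out[f"{a}_div_{b}"] = X[a] / X[b].replace(0, np.nan)
        if "difference" in interaction_types:
            out[f"{a}_minus_{b}"] = X[a] - X[b]
    return pd.DataFrame(out, index=X.index)


def rank_interactions(X, y, max_interaction_features=20,
                      n_estimators=50, random_state=42):
    """Rank candidate interactions by random-forest importance (assumed evaluator)."""
    candidates = make_interactions(X).fillna(0.0)
    full = pd.concat([X, candidates], axis=1)
    model = RandomForestRegressor(n_estimators=n_estimators,
                                  random_state=random_state)
    model.fit(full, y)
    importances = pd.Series(model.feature_importances_, index=full.columns)
    ranked = importances[candidates.columns].sort_values(ascending=False)
    return ranked.head(max_interaction_features)


# Toy data where the target is exactly the product of two columns.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.uniform(-1, 1, size=(300, 3)), columns=["a", "b", "c"])
y_demo = X_demo["a"] * X_demo["b"]

ranked = rank_interactions(X_demo, y_demo, max_interaction_features=3)
print(ranked)
```

On this toy data the product of a and b should rank first, because it reproduces the target exactly.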

TargetAwareSelector

TargetAwareSelector(
    k=10,             # number of features to keep, or "all"
    task="auto",
    threshold=None,   # MI threshold (overrides k if set)
    random_state=42,
)

Methods: fit(X, y), transform(X), fit_transform(X, y), get_feature_scores()
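
As a rough illustration of selecting features by mutual information with the target, the sketch below scores each column with scikit-learn's mutual_info_classif and keeps the top k. The helper select_top_k_by_mi is hypothetical, and TargetAwareSelector may compute or threshold its scores differently.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif


def select_top_k_by_mi(X, y, k=10, random_state=42):
    """Keep the k columns with the highest estimated mutual information."""
    scores = pd.Series(
        mutual_info_classif(X, y, random_state=random_state),
        index=X.columns,
    ).sort_values(ascending=False)
    kept = list(scores.index[:k])
    return X[kept], scores


# Toy data: one informative column and two pure-noise columns.
rng = np.random.default_rng(0)
signal = rng.normal(size=500)
X_demo = pd.DataFrame({
    "signal": signal,
    "noise_a": rng.normal(size=500),
    "noise_b": rng.normal(size=500),
})
y_demo = (signal > 0).astype(int)

X_selected, mi_scores = select_top_k_by_mi(X_demo, y_demo, k=1)
print(mi_scores)
```

A variance-based filter could not tell these three columns apart, since all have roughly unit variance; the mutual information score separates them because only one carries information about the target.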

CyclicalEncoder

CyclicalEncoder(
    columns={"hour": 24, "month": 12},  # column → period mapping
    drop_original=True,
)

Produces {col}_sin and {col}_cos columns.
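
The standard sin/cos transform maps a value x with period p to sin(2πx/p) and cos(2πx/p). A minimal sketch in plain pandas follows; the helper encode_cyclical is hypothetical and only mirrors the documented column names and the drop_original option.

```python
import numpy as np
import pandas as pd


def encode_cyclical(df, columns, drop_original=True):
    """Add {col}_sin and {col}_cos for each column -> period entry."""
    out = df.copy()
    for col, period in columns.items():
        angle = 2.0 * np.pi * out[col] / period
        out[f"{col}_sin"] = np.sin(angle)
        out[f"{col}_cos"] = np.cos(angle)
        if drop_original:
            out = out.drop(columns=col)
    return out


hours = pd.DataFrame({"hour": [0, 6, 12, 23]})
encoded = encode_cyclical(hours, {"hour": 24})


def distance(i, j):
    """Euclidean distance between two rows in (sin, cos) space."""
    return float(np.linalg.norm(encoded.iloc[i].to_numpy() - encoded.iloc[j].to_numpy()))


# Hour 23 ends up close to hour 0, while hour 12 is the farthest point.
print(distance(3, 0), distance(2, 0))
```

In the raw column, hour 23 and hour 0 are 23 units apart; after encoding they are neighbours on the unit circle, which is the cyclical structure the encoder is meant to preserve.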

SmartCategoricalEncoder

SmartCategoricalEncoder(
    max_onehot_cardinality=10,   # >10 unique values → target encoding
    smoothing=1.0,               # regularisation for target encoding
    task="auto",
    handle_unknown="mean",       # or "zero"
)
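
For readers unfamiliar with smoothed target encoding, here is a sketch of one common formulation: each category's target mean is shrunk toward the global mean, with the smoothing value acting as a pseudo-count. The exact formula used by SmartCategoricalEncoder is not documented here, so this is an assumption; the helper names are hypothetical. Unseen categories fall back to the global mean, which is one plausible reading of handle_unknown="mean".

```python
import pandas as pd


def fit_target_encoding(categories, target, smoothing=1.0):
    """Learn a smoothed per-category mean from training data only."""
    global_mean = target.mean()
    stats = target.groupby(categories).agg(["mean", "count"])
    encoding = (stats["count"] * stats["mean"] + smoothing * global_mean) / (
        stats["count"] + smoothing
    )
    return encoding, global_mean


def apply_target_encoding(categories, encoding, fallback):
    """Map categories to learned values; unseen ones get the fallback."""
    return categories.map(encoding).fillna(fallback)


train_city = pd.Series(["a", "a", "a", "b"])
train_y = pd.Series([1, 1, 0, 1])
encoding, global_mean = fit_target_encoding(train_city, train_y, smoothing=1.0)

test_city = pd.Series(["a", "b", "zzz"])  # "zzz" never appeared in training
test_encoded = apply_target_encoding(test_city, encoding, global_mean)
print(test_encoded.tolist())  # [0.6875, 0.875, 0.75]
```

Category "b" has a single training row with target 1, but its encoded value is 0.875 rather than 1.0, which shows how smoothing reduces the influence of rare categories.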

LeakageDetector

LeakageDetector(
    correlation_threshold=0.95,
    name_patterns=["label", "target", "outcome"],
    verbose=True,
)

Methods: fit(X, y), remove_leaky(X), get_report()
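
The two documented parameters suggest two checks: a correlation threshold and column-name patterns. The sketch below shows how such checks could work; it is an assumption about the approach, not the library's code. The helper find_suspicious_features and the use of Pearson correlation are illustrative choices.

```python
import numpy as np
import pandas as pd


def find_suspicious_features(X, y, correlation_threshold=0.95,
                             name_patterns=("label", "target", "outcome")):
    """Return {column: [reasons]} for columns that look like target leakage."""
    report = {}
    numeric_cols = set(X.select_dtypes(include="number").columns)
    for col in X.columns:
        reasons = []
        if any(pattern in col.lower() for pattern in name_patterns):
            reasons.append("name matches a leakage pattern")
        if col in numeric_cols:
            corr = abs(X[col].corr(y))  # Pearson correlation with the target
            if corr >= correlation_threshold:
                reasons.append(f"|corr| = {corr:.3f}")
        if reasons:
            report[col] = reasons
    return report


rng = np.random.default_rng(0)
y_demo = pd.Series(rng.normal(size=200))
X_demo = pd.DataFrame({
    "age": rng.normal(size=200),                  # unrelated to the target
    "y_copy": y_demo * 2.0 + 1.0,                 # linear function of the target
    "target_flag": rng.integers(0, 2, size=200),  # suspicious by name only
})

report = find_suspicious_features(X_demo, y_demo)
print(report)
```

A flagged column is only a warning: a highly correlated feature can be legitimate, so the report is worth reviewing before anything is dropped.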

AutoFeaturePipeline

AutoFeaturePipeline(
    cyclical_columns=None,
    max_interaction_features=20,
    k=20,
    task="auto",
    detect_leakage=True,
    remove_leaky=False,
    random_state=42,
    verbose=False,
)

Methods: fit(X, y), transform(X), fit_transform(X, y), get_summary()

Why AutoFeature?

  • Target-aware: selections and interactions are evaluated against the actual prediction target, not generic statistics
  • Scikit-learn compatible: works with Pipeline, GridSearchCV, and any estimator
  • Production-safe: every transformer is fit on training data only and then applied unchanged to test data, so the transform step cannot leak test-set information
  • Interpretable: every decision (which interaction, which encoding, which feature) is inspectable

Contributing

Pull requests are welcome. For major changes, please open an issue first.

git clone https://github.com/yourusername/autofeature
cd autofeature
pip install -e ".[dev]"
pytest

License

MIT
