Feature Engine Pro

A professional, enterprise-grade feature selection and engineering pipeline.
Feature Engine Pro is an advanced, deterministic Python library for automated feature engineering and mathematically rigorous feature selection.

In real-world machine learning environments, datasets frequently contain hundreds or thousands of columns. Navigating this high dimensionality manually is slow, error-prone, and biased. Feature Engine Pro solves this by providing a multi-stage, Scikit-Learn-compatible mathematical funnel that autonomously keeps only the features that measurably improve model performance.

Crucially, this library resolves the "black box" problem of automated data pipelines by generating a comprehensive HTML Audit Report, detailing the exact mathematical reasoning behind every feature kept or dropped.

Installation

(Note: the package is currently in a pre-release development phase.)

Install from PyPI:

pip install feature-engine-pro

Note: The library will automatically handle browser dependencies (Playwright/Chromium) the first time you generate a PDF report.

Core Philosophy

  1. Deterministic and Mathematical: Relies entirely on robust statistical techniques (Variance, Pearson/Spearman correlation, Information Theory, Recursive Feature Elimination) rather than non-deterministic or costly LLM-based agent swarms.
  2. Transparent "Audit Trail": Never wonder why a feature disappeared. The Engine logs every action and compiles a visual report.
  3. Scikit-Learn Native: Designed to slot perfectly into existing sklearn.pipeline.Pipeline architectures, with distinct fit() and transform() phases to prevent data leakage and full GridSearchCV compatibility.
  4. End-to-End Execution: Automatically handles missing values, encodes complex text/categorical variables, extracts temporal features, and reduces dimensionality in a single execution.

Pipeline Architecture

Feature Engine Pro processes high-dimensional data through a sequence of modular stages:

Stage 1: Automated Feature Engineering

  • Datetime Expansion: Detects temporal columns and extracts granular numerical representations (year, month, day, day-of-week, weekend flags).
  • Group Aggregation: Autonomously detects ID-based columns and engineers aggregated statistics (mean, sum) to capture group-level behavior.
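The datetime-expansion idea can be illustrated with plain pandas. This is a sketch of the general technique, not the library's internal code; the column names are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"signup_date": pd.to_datetime(["2024-01-06", "2024-03-15"])})

# Expand one temporal column into granular numeric features
df["signup_year"] = df["signup_date"].dt.year
df["signup_month"] = df["signup_date"].dt.month
df["signup_day"] = df["signup_date"].dt.day
df["signup_dayofweek"] = df["signup_date"].dt.dayofweek          # Monday = 0
df["signup_is_weekend"] = (df["signup_date"].dt.dayofweek >= 5).astype(int)
```

The original datetime column would then typically be dropped, leaving only numeric features for the downstream selection stages.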

Stage 2: Data Pre-Processing & Encoding

  • Secure Imputation: Learns missing value distributions (mean, median) during .fit() and safely applies them during .transform().
  • Target Encoding: Converts high-cardinality categorical string columns into continuous numerical data by mapping them against the target variable.
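Mean target encoding, the general idea behind the categorical handling above, can be sketched as follows. This is illustrative only (not the library's implementation); note how the mapping is learned on training data alone, mirroring the fit/transform split that prevents leakage:

```python
import pandas as pd

train = pd.DataFrame({"city": ["NY", "NY", "LA", "LA", "SF"],
                      "target": [1, 0, 1, 1, 0]})

# Learn per-category target means on the training set only (the .fit step)
encoding = train.groupby("city")["target"].mean()
global_mean = train["target"].mean()

# Apply the learned mapping (the .transform step); categories unseen
# during fitting fall back to the global target mean
test = pd.DataFrame({"city": ["NY", "SF", "Chicago"]})
test["city_encoded"] = test["city"].map(encoding).fillna(global_mean)
```

Here "NY" maps to 0.5 (one positive, one negative in training), "SF" to 0.0, and the unseen "Chicago" to the global mean of 0.6.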

Stage 3: The Mathematical Selection Funnel

  • Variance Filter: Eliminates zero-variance constants and low-variance features that carry no signal.
  • Collinearity Filter: Identifies heavily correlated feature pairs. It evaluates both features against the target variable and intelligently drops the redundant feature providing the least predictive power.
  • Mutual Information: Applies Information Theory to identify and preserve features with complex, non-linear dependencies on the target.
  • Recursive Feature Elimination (RFE): Uses tree-based ensemble estimators (Random Forest) and feature importance ranking to iteratively prune the weakest remaining columns.
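Each stage of the funnel has a counterpart in standard scikit-learn and NumPy tooling. The following is an illustrative sketch of the same four-stage idea, not the library's internal implementation: the thresholds are arbitrary, and the simplified collinearity step keeps the lower-index column of a correlated pair rather than comparing target correlations as Feature Engine Pro does:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, VarianceThreshold, mutual_info_classif

X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=42)

# Stage: drop near-constant columns
X = VarianceThreshold(threshold=0.01).fit_transform(X)

# Stage: drop one column of each heavily correlated pair
corr = np.abs(np.corrcoef(X, rowvar=False))
upper = np.triu(corr, k=1)
X = X[:, ~np.any(upper > 0.85, axis=0)]

# Stage: keep columns with non-trivial mutual information with the target
mi = mutual_info_classif(X, y, random_state=42)
X = X[:, mi > 0.01]

# Stage: recursively prune to the strongest features with a tree ensemble
rfe = RFE(RandomForestClassifier(random_state=42), n_features_to_select=5)
X_final = rfe.fit_transform(X, y)
```

Running the filters in this order is deliberate: the cheap variance and correlation passes shrink the matrix before the more expensive mutual-information and RFE stages run.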


Quick Start Guide

The entire framework can be instantiated and run with a few lines of code.

import pandas as pd
from feature_engine_pro.engine import FeatureEngine
from sklearn.model_selection import train_test_split

# 1. Load Data
df = pd.read_csv("high_dimensional_data.csv")
X = df.drop(columns=["target"])
y = df["target"]

# 2. Split Data (Crucial for preventing data leakage)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Initialize Feature Engine
engine = FeatureEngine(
    target_column="target",
    problem_type="classification",
    variance_threshold=0.01,
    correlation_threshold=0.85,
    mi_threshold=0.01,
    rfe_n_features=25
)

# 4. Fit the pipeline to training data
engine.fit(X_train, y_train)

# 5. Transform both train and test sets
X_train_clean = engine.transform(X_train)
X_test_clean = engine.transform(X_test)

# 6. Generate the Audit Report
engine.generate_report(filepath="feature_audit_report.html")

Advanced Usage: GridSearchCV

Because FeatureEngine inherits from BaseEstimator and TransformerMixin, it natively supports hyperparameter tuning to find the optimal mathematical thresholds for your specific dataset.

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

pipeline = Pipeline([
    ('feature_engine', FeatureEngine(problem_type='classification')),
    ('classifier', GradientBoostingClassifier())
])

param_grid = {
    'feature_engine__correlation_threshold': [0.75, 0.85, 0.95],
    'feature_engine__mi_threshold': [0.01, 0.05],
    'classifier__learning_rate': [0.01, 0.1]
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)

The Audit Report

Calling .generate_report("report.html") produces a standalone HTML document containing:

  • A summary count of features kept vs. dropped.
  • A visual Bar Chart Funnel illustrating the reduction at each pipeline stage.
  • A pre-filtering Correlation Heatmap to visualize dataset collinearity.
  • A comprehensive Tabular Audit Trail detailing the exact mathematical reason a specific column was eliminated (e.g., "[CorrelationSelector] Dropped: Correlated 0.92 with feature_X. Kept feature_X because it has higher correlation to target.").

Contributing

Contributions that refine the mathematics, expand the suite of transformers, or improve computational efficiency on massive datasets are welcome. Please ensure all pull requests maintain Scikit-Learn compatibility and do not introduce data leakage.
