
A professional, enterprise-grade feature selection and engineering pipeline.

Project description

Feature Engine Pro

Feature Engine Pro is an advanced, fully deterministic Python library for automated feature engineering and mathematically rigorous feature selection.

In real-world machine learning environments, datasets frequently contain hundreds or thousands of columns. Navigating this high dimensionality manually is prone to error and bias. Feature Engine Pro solves this by providing a multi-stage, Scikit-Learn compatible mathematical funnel that autonomously selects only the features that positively impact model performance.

Crucially, this library resolves the "black box" problem of automated data pipelines by generating a comprehensive HTML Audit Report, detailing the exact mathematical reasoning behind every feature kept or dropped.

Installation

Install the package from PyPI:

pip install feature-engine-pro

(Note: the package is currently in a pre-release development phase.)

Note: The library will automatically handle browser dependencies (Playwright/Chromium) the first time you generate a PDF report.

Core Philosophy

  1. Deterministic and Mathematical: Relies entirely on robust statistical techniques (Variance, Pearson/Spearman correlation, Information Theory, Recursive Feature Elimination) rather than non-deterministic or costly LLM-based agent swarms.
  2. Transparent "Audit Trail": Never wonder why a feature disappeared. The Engine logs every action and compiles a visual report.
  3. Scikit-Learn Native: Designed to slot perfectly into existing sklearn.pipeline.Pipeline architectures, complete with fit(), transform(), and GridSearchCV compatibility to prevent data leakage.
  4. End-to-End Execution: Automatically handles missing values, encodes complex text/categorical variables, extracts temporal features, and reduces dimensionality in a single execution.

Pipeline Architecture

Feature Engine Pro processes high-dimensional data through a sequence of modular stages:

Stage 1: Automated Feature Engineering

  • Datetime Expansion: Detects temporal columns and extracts granular numerical representations (year, month, day, day-of-week, weekend flags).
  • Group Aggregation: Autonomously detects ID-based columns and engineers aggregated statistics (mean, sum) to capture group-level behavior.
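Both steps can be approximated in plain pandas. The column names below (signup_year, spend_mean, etc.) are illustrative, not the library's actual output schema:

```python
import pandas as pd

# Toy frame with a temporal column and an ID-based column.
df = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "signup": pd.to_datetime(["2023-01-07", "2023-03-15",
                              "2023-06-03", "2023-11-20"]),
    "spend": [10.0, 20.0, 5.0, 15.0],
})

# Datetime expansion: granular numeric parts plus a weekend flag.
df["signup_year"] = df["signup"].dt.year
df["signup_month"] = df["signup"].dt.month
df["signup_dow"] = df["signup"].dt.dayofweek
df["signup_is_weekend"] = (df["signup_dow"] >= 5).astype(int)

# Group aggregation: per-ID mean and sum to capture group-level behavior.
agg = (df.groupby("user_id")["spend"]
         .agg(["mean", "sum"])
         .add_prefix("spend_")
         .reset_index())
df = df.merge(agg, on="user_id", how="left")
```

The engineered columns are purely numeric, so they flow directly into the downstream selection stages.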

Stage 2: Data Pre-Processing & Encoding

  • Secure Imputation: Learns missing value distributions (mean, median) during .fit() and safely applies them during .transform().
  • Target Encoding: Converts high-cardinality categorical string columns into continuous numerical data by mapping them against the target variable.
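A rough sketch of the fit/transform separation and a simple mean-target encoding, using plain pandas. The library's internal statistics and column handling may differ; the point is that statistics are learned on training data only and then reused unchanged:

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"income": [40_000.0, np.nan, 60_000.0, 50_000.0]})
test = pd.DataFrame({"income": [np.nan, 55_000.0]})

# fit(): learn the imputation statistic from the training split only.
fill_value = train["income"].median()

# transform(): apply the learned value to both splits, never re-fitting on test.
train_filled = train["income"].fillna(fill_value)
test_filled = test["income"].fillna(fill_value)

# Target encoding: map each category to the mean target observed in training.
cats = pd.DataFrame({"city": ["a", "a", "b", "b"]})
y = pd.Series([1, 0, 1, 1])
encoding = y.groupby(cats["city"]).mean()
encoded = cats["city"].map(encoding)
```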

Stage 3: The Mathematical Selection Funnel

  • Variance Filter: Eliminates zero-variance constants and low-variance features that carry no signal.
  • Collinearity Filter: Identifies heavily correlated feature pairs. It evaluates both features against the target variable and intelligently drops the redundant feature providing the least predictive power.
  • Mutual Information: Applies Information Theory to identify and preserve features with complex, non-linear dependencies on the target.
  • Recursive Feature Elimination (RFE): Uses tree-based ensemble estimators (Random Forest) and feature importance ranking to iteratively prune the weakest remaining columns.
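The funnel can be approximated with scikit-learn primitives. The thresholds below mirror the Quick Start defaults, and the pairwise drop logic is a simplified stand-in for the library's own:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, VarianceThreshold, mutual_info_classif

rng = np.random.default_rng(42)
n = 200
signal = rng.normal(size=n)
X = pd.DataFrame({
    "constant": np.zeros(n),                               # zero variance
    "signal": signal,
    "redundant": signal + rng.normal(scale=0.01, size=n),  # near-duplicate
    "noise": rng.normal(size=n),                           # no relation to y
})
y = pd.Series((signal > 0).astype(int))

# Stage 1: variance filter drops constants and near-constants.
vt = VarianceThreshold(threshold=0.01)
vt.fit(X)
cols = [c for c, keep in zip(X.columns, vt.get_support()) if keep]

# Stage 2: collinearity filter. Within each highly correlated pair,
# drop the feature with the weaker correlation to the target.
corr = X[cols].corr().abs()
dropped = set()
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        if a in dropped or b in dropped:
            continue
        if corr.loc[a, b] > 0.85:
            dropped.add(a if abs(X[a].corr(y)) < abs(X[b].corr(y)) else b)
cols = [c for c in cols if c not in dropped]

# Stage 3: mutual information keeps features with (possibly non-linear)
# dependence on the target.
mi = mutual_info_classif(X[cols], y, random_state=0)
cols = [c for c, score in zip(cols, mi) if score > 0.01]

# Stage 4: RFE with a tree ensemble iteratively prunes the weakest columns.
rfe = RFE(RandomForestClassifier(n_estimators=50, random_state=0),
          n_features_to_select=1)
rfe.fit(X[cols], y)
selected = [c for c, keep in zip(cols, rfe.support_) if keep]
```

In this toy run the constant and noise columns fall out early, one of the two near-duplicate columns is pruned by the collinearity filter, and RFE retains the informative survivor.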


Quick Start Guide

The entire framework can be instantiated and run with a few lines of code.

import pandas as pd
from feature_engine_pro.engine import FeatureEngine
from sklearn.model_selection import train_test_split

# 1. Load Data
df = pd.read_csv("high_dimensional_data.csv")
X = df.drop(columns=["target"])
y = df["target"]

# 2. Split Data (Crucial for preventing data leakage)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Initialize Feature Engine
engine = FeatureEngine(
    target_column="target",
    problem_type="classification",
    variance_threshold=0.01,
    correlation_threshold=0.85,
    mi_threshold=0.01,
    rfe_n_features=25
)

# 4. Fit the pipeline to training data
engine.fit(X_train, y_train)

# 5. Transform both train and test sets
X_train_clean = engine.transform(X_train)
X_test_clean = engine.transform(X_test)

# 6. Generate the Audit Report
engine.generate_report(filepath="feature_audit_report.html")

Advanced Usage: GridSearchCV

Because FeatureEngine inherits from BaseEstimator and TransformerMixin, it natively supports hyperparameter tuning to find the optimal mathematical thresholds for your specific dataset.

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

pipeline = Pipeline([
    ('feature_engine', FeatureEngine(problem_type='classification')),
    ('classifier', GradientBoostingClassifier())
])

param_grid = {
    'feature_engine__correlation_threshold': [0.75, 0.85, 0.95],
    'feature_engine__mi_threshold': [0.01, 0.05],
    'classifier__learning_rate': [0.01, 0.1]
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)

The Audit Report

Calling .generate_report("report.html") produces a standalone HTML document containing:

  • A summary count of features kept vs. dropped.
  • A visual Bar Chart Funnel illustrating the reduction at each pipeline stage.
  • A pre-filtering Correlation Heatmap to visualize dataset collinearity.
  • A comprehensive Tabular Audit Trail detailing the exact mathematical reason a specific column was eliminated (e.g., "[CorrelationSelector] Dropped: Correlated 0.92 with feature_X. Kept feature_X because it has higher correlation to target.").

Contributing

Contributions to mathematical optimization, expanding the suite of transformers, or improving computational efficiency for massive datasets are welcome. Please ensure all pull requests maintain Scikit-Learn compatibility and do not introduce data leakage.

Download files

Source Distribution

feature_engine_pro-0.1.0.tar.gz (24.7 kB)

Built Distribution

feature_engine_pro-0.1.0-py3-none-any.whl (25.9 kB)

File details

Details for the file feature_engine_pro-0.1.0.tar.gz.

File metadata

  • Size: 24.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.0

File hashes

Hashes for feature_engine_pro-0.1.0.tar.gz
Algorithm Hash digest
SHA256 a9c413a5cd2573da0a1ee34982c090d5b3d6ca92d2d9906213889699b2f63266
MD5 4e2edbee3c0c45b8aa2224e993cb3542
BLAKE2b-256 9315266b3acfdf4a219e6c6a7fadc594d4af715f6e671351f8ceade8e664eaf1

File details

Details for the file feature_engine_pro-0.1.0-py3-none-any.whl.


File hashes

Hashes for feature_engine_pro-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 77363af0fd6270aa25f02492a4ef9b99b0bb7b86716b603c4ace7c70224370ab
MD5 057cde4ba42a0362344d7c4ad98bc296
BLAKE2b-256 20184922fedc479f601262a41b37adf714f02c66589f7fe821dcdbaf9f0fd5c3

