
A professional, enterprise-grade feature selection and engineering pipeline.

Project description

Feature Engine Pro

Feature Engine Pro is an advanced, fully deterministic Python library for automated feature engineering and mathematically rigorous feature selection.

In real-world machine learning environments, datasets frequently contain hundreds or thousands of columns. Navigating this high dimensionality manually is prone to error and bias. Feature Engine Pro solves this by providing a multi-stage, Scikit-Learn compatible mathematical funnel that autonomously selects only the features that positively impact model performance.

Crucially, this library resolves the "black box" problem of automated data pipelines by generating a comprehensive HTML Audit Report, detailing the exact mathematical reasoning behind every feature kept or dropped.

Installation

Install the package from PyPI:

pip install feature-engine-pro

(Note: the package is currently in a pre-release development phase.)

Note: The library will automatically handle browser dependencies (Playwright/Chromium) the first time you generate a PDF report.

Core Philosophy

  1. Deterministic and Mathematical: Relies entirely on robust statistical techniques (Variance, Pearson/Spearman correlation, Information Theory, Recursive Feature Elimination) rather than non-deterministic or costly LLM-based agent swarms.
  2. Transparent "Audit Trail": Never wonder why a feature disappeared. The Engine logs every action and compiles a visual report.
  3. Scikit-Learn Native: Designed to slot perfectly into existing sklearn.pipeline.Pipeline architectures, complete with fit(), transform(), and GridSearchCV compatibility to prevent data leakage.
  4. End-to-End Execution: Automatically handles missing values, encodes complex text/categorical variables, extracts temporal features, and reduces dimensionality in a single execution.

Pipeline Architecture

Feature Engine Pro processes high-dimensional data through a sequence of modular stages:

Stage 1: Automated Feature Engineering

  • Datetime Expansion: Detects temporal columns and extracts granular numerical representations (year, month, day, day-of-week, weekend flags).
  • Group Aggregation: Autonomously detects ID-based columns and engineers aggregated statistics (mean, sum) to capture group-level behavior.
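Both steps can be approximated in plain pandas. The column names below (signup_year, spend_mean, etc.) are illustrative, not the library's actual output schema:

```python
import pandas as pd

# Toy frame with a temporal column and an ID-based column.
df = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "signup": pd.to_datetime(["2023-01-07", "2023-03-15",
                              "2023-06-03", "2023-11-20"]),
    "spend": [10.0, 20.0, 5.0, 15.0],
})

# Datetime expansion: granular numeric parts plus a weekend flag.
df["signup_year"] = df["signup"].dt.year
df["signup_month"] = df["signup"].dt.month
df["signup_dow"] = df["signup"].dt.dayofweek
df["signup_is_weekend"] = (df["signup_dow"] >= 5).astype(int)

# Group aggregation: per-ID mean and sum to capture group-level behavior.
agg = (df.groupby("user_id")["spend"]
         .agg(["mean", "sum"])
         .add_prefix("spend_")
         .reset_index())
df = df.merge(agg, on="user_id", how="left")
```

The engineered columns are purely numeric, so they flow directly into the downstream selection stages.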

Stage 2: Data Pre-Processing & Encoding

  • Secure Imputation: Learns missing value distributions (mean, median) during .fit() and safely applies them during .transform().
  • Target Encoding: Converts high-cardinality categorical string columns into continuous numerical data by mapping them against the target variable.
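A rough sketch of the fit/transform separation and a simple mean-target encoding, using plain pandas. The library's internal statistics and column handling may differ; the point is that statistics are learned on training data only and then reused unchanged:

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"income": [40_000.0, np.nan, 60_000.0, 50_000.0]})
test = pd.DataFrame({"income": [np.nan, 55_000.0]})

# fit(): learn the imputation statistic from the training split only.
fill_value = train["income"].median()

# transform(): apply the learned value to both splits, never re-fitting on test.
train_filled = train["income"].fillna(fill_value)
test_filled = test["income"].fillna(fill_value)

# Target encoding: map each category to the mean target observed in training.
cats = pd.DataFrame({"city": ["a", "a", "b", "b"]})
y = pd.Series([1, 0, 1, 1])
encoding = y.groupby(cats["city"]).mean()
encoded = cats["city"].map(encoding)
```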

Stage 3: The Mathematical Selection Funnel

  • Variance Filter: Eliminates zero-variance constants and low-variance features that carry no signal.
  • Collinearity Filter: Identifies heavily correlated feature pairs. It evaluates both features against the target variable and intelligently drops the redundant feature providing the least predictive power.
  • Mutual Information: Applies Information Theory to identify and preserve features with complex, non-linear dependencies on the target.
  • Recursive Feature Elimination (RFE): Uses tree-based ensemble estimators (Random Forest) and feature importance ranking to iteratively prune the weakest remaining columns.
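The funnel can be approximated with scikit-learn primitives. The thresholds below mirror the Quick Start defaults, and the pairwise drop logic is a simplified stand-in for the library's own:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, VarianceThreshold, mutual_info_classif

rng = np.random.default_rng(42)
n = 200
signal = rng.normal(size=n)
X = pd.DataFrame({
    "constant": np.zeros(n),                               # zero variance
    "signal": signal,
    "redundant": signal + rng.normal(scale=0.01, size=n),  # near-duplicate
    "noise": rng.normal(size=n),                           # no relation to y
})
y = pd.Series((signal > 0).astype(int))

# Stage 1: variance filter drops constants and near-constants.
vt = VarianceThreshold(threshold=0.01)
vt.fit(X)
cols = [c for c, keep in zip(X.columns, vt.get_support()) if keep]

# Stage 2: collinearity filter. Within each highly correlated pair,
# drop the feature with the weaker correlation to the target.
corr = X[cols].corr().abs()
dropped = set()
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        if a in dropped or b in dropped:
            continue
        if corr.loc[a, b] > 0.85:
            dropped.add(a if abs(X[a].corr(y)) < abs(X[b].corr(y)) else b)
cols = [c for c in cols if c not in dropped]

# Stage 3: mutual information keeps features with (possibly non-linear)
# dependence on the target.
mi = mutual_info_classif(X[cols], y, random_state=0)
cols = [c for c, score in zip(cols, mi) if score > 0.01]

# Stage 4: RFE with a tree ensemble iteratively prunes the weakest columns.
rfe = RFE(RandomForestClassifier(n_estimators=50, random_state=0),
          n_features_to_select=1)
rfe.fit(X[cols], y)
selected = [c for c, keep in zip(cols, rfe.support_) if keep]
```

In this toy run the constant and noise columns fall out early, one of the two near-duplicate columns is pruned by the collinearity filter, and RFE retains the informative survivor.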


Quick Start Guide

The entire framework can be instantiated and run with a few lines of code.

import pandas as pd
from feature_engine_pro.engine import FeatureEngine
from sklearn.model_selection import train_test_split

# 1. Load Data
df = pd.read_csv("high_dimensional_data.csv")
X = df.drop(columns=["target"])
y = df["target"]

# 2. Split Data (Crucial for preventing data leakage)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Initialize Feature Engine
engine = FeatureEngine(
    target_column="target",
    problem_type="classification",
    variance_threshold=0.01,
    correlation_threshold=0.85,
    mi_threshold=0.01,
    rfe_n_features=25
)

# 4. Fit the pipeline to training data
engine.fit(X_train, y_train)

# 5. Transform both train and test sets
X_train_clean = engine.transform(X_train)
X_test_clean = engine.transform(X_test)

# 6. Generate the Audit Report
engine.generate_report(filepath="feature_audit_report.html")

Advanced Usage: GridSearchCV

Because FeatureEngine inherits from BaseEstimator and TransformerMixin, it natively supports hyperparameter tuning to find the optimal mathematical thresholds for your specific dataset.

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

pipeline = Pipeline([
    ('feature_engine', FeatureEngine(problem_type='classification')),
    ('classifier', GradientBoostingClassifier())
])

param_grid = {
    'feature_engine__correlation_threshold': [0.75, 0.85, 0.95],
    'feature_engine__mi_threshold': [0.01, 0.05],
    'classifier__learning_rate': [0.01, 0.1]
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)

The Audit Report

Calling .generate_report("report.html") produces a standalone HTML document containing:

  • A summary count of features kept vs. dropped.
  • A visual Bar Chart Funnel illustrating the reduction at each pipeline stage.
  • A pre-filtering Correlation Heatmap to visualize dataset collinearity.
  • A comprehensive Tabular Audit Trail detailing the exact mathematical reason a specific column was eliminated (e.g., "[CorrelationSelector] Dropped: Correlated 0.92 with feature_X. Kept feature_X because it has higher correlation to target.").

Contributing

Contributions to mathematical optimization, expanding the suite of transformers, or improving computational efficiency for massive datasets are welcome. Please ensure all pull requests maintain Scikit-Learn compatibility and do not introduce data leakage.

Download files

Source Distribution

feature_engine_pro-0.1.0.tar.gz (24.7 kB)

Built Distribution

feature_engine_pro-0.1.0-py3-none-any.whl (25.9 kB)

File details

Details for the file feature_engine_pro-0.1.0.tar.gz.

File metadata

  • Size: 24.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.0

File hashes

Hashes for feature_engine_pro-0.1.0.tar.gz
Algorithm Hash digest
SHA256 a9c413a5cd2573da0a1ee34982c090d5b3d6ca92d2d9906213889699b2f63266
MD5 4e2edbee3c0c45b8aa2224e993cb3542
BLAKE2b-256 9315266b3acfdf4a219e6c6a7fadc594d4af715f6e671351f8ceade8e664eaf1

File details

Details for the file feature_engine_pro-0.1.0-py3-none-any.whl.


File hashes

Hashes for feature_engine_pro-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 77363af0fd6270aa25f02492a4ef9b99b0bb7b86716b603c4ace7c70224370ab
MD5 057cde4ba42a0362344d7c4ad98bc296
BLAKE2b-256 20184922fedc479f601262a41b37adf714f02c66589f7fe821dcdbaf9f0fd5c3

