Feature Engine Pro
A professional, enterprise-grade feature selection and engineering pipeline.
Feature Engine Pro is an advanced, deterministically driven Python library for automated feature engineering and mathematically rigorous feature selection.
In real-world machine learning environments, datasets frequently contain hundreds or thousands of columns. Navigating this high dimensionality manually is prone to error and bias. Feature Engine Pro solves this by providing a multi-stage, Scikit-Learn compatible mathematical funnel that autonomously selects only the features that positively impact model performance.
Crucially, this library resolves the "black box" problem of automated data pipelines by generating a comprehensive HTML Audit Report, detailing the exact mathematical reasoning behind every feature kept or dropped.
Installation
Note: the package is currently in a pre-release development phase.
Install from PyPI:
pip install feature-engine-pro
Or install from a local clone of the repository:
pip install .
Note: The library automatically handles browser dependencies (Playwright/Chromium) the first time you generate a PDF report.
Core Philosophy
- Deterministic and Mathematical: Relies entirely on robust statistical techniques (Variance, Pearson/Spearman correlation, Information Theory, Recursive Feature Elimination) rather than non-deterministic or costly LLM-based agent swarms.
- Transparent "Audit Trail": Never wonder why a feature disappeared. The Engine logs every action and compiles a visual report.
- Scikit-Learn Native: Designed to slot into existing sklearn.pipeline.Pipeline architectures, complete with fit(), transform(), and GridSearchCV compatibility to prevent data leakage.
- End-to-End Execution: Automatically handles missing values, encodes complex text/categorical variables, extracts temporal features, and reduces dimensionality in a single execution.
Pipeline Architecture
Feature Engine Pro processes high-dimensional data through a sequence of modular stages:
Stage 1: Automated Feature Engineering
- Datetime Expansion: Detects temporal columns and extracts granular numerical representations (year, month, day, day-of-week, weekend flags).
- Group Aggregation: Autonomously detects ID-based columns and engineers aggregated statistics (mean, sum) to capture group-level behavior.
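The datetime expansion above can be sketched in plain pandas. This is an illustrative approximation only: the function name, column names, and the exact set of extracted features are assumptions, not the library's actual API or output.

```python
import pandas as pd

# Hypothetical sketch of Stage 1 datetime expansion: one temporal column
# becomes several granular numerical features, and the original is dropped.
def expand_datetime(df: pd.DataFrame, col: str) -> pd.DataFrame:
    out = df.copy()
    ts = pd.to_datetime(out[col])
    out[f"{col}_year"] = ts.dt.year
    out[f"{col}_month"] = ts.dt.month
    out[f"{col}_day"] = ts.dt.day
    out[f"{col}_dayofweek"] = ts.dt.dayofweek          # Monday = 0
    out[f"{col}_is_weekend"] = (ts.dt.dayofweek >= 5).astype(int)
    return out.drop(columns=[col])

df = pd.DataFrame({"signup_date": ["2024-01-06", "2024-01-08"]})
expanded = expand_datetime(df, "signup_date")  # Saturday vs. Monday
```

A tree model cannot use a raw timestamp directly, but it can split on the derived day-of-week or weekend flag, which is why this expansion helps downstream stages.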
Stage 2: Data Pre-Processing & Encoding
- Secure Imputation: Learns missing value distributions (mean, median) during .fit() and safely applies them during .transform().
- Target Encoding: Converts high-cardinality categorical string columns into continuous numerical data by mapping them against the target variable.
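The leakage-safe fit/transform split described above can be illustrated with a minimal target encoder. This is a sketch, not the library's implementation: the class name and the unsmoothed mean mapping are assumptions made for clarity.

```python
import pandas as pd

# Minimal target encoder: category means are learned only in fit() (on
# training data), then reused in transform(), so test rows never leak
# their own targets into the encoding.
class SimpleTargetEncoder:
    def fit(self, X: pd.Series, y: pd.Series):
        self.global_mean_ = y.mean()
        self.mapping_ = y.groupby(X).mean()  # per-category target mean
        return self

    def transform(self, X: pd.Series) -> pd.Series:
        # Categories unseen during fit fall back to the global training mean.
        return X.map(self.mapping_).fillna(self.global_mean_)

X = pd.Series(["a", "a", "b", "b"])
y = pd.Series([1, 0, 1, 1])
enc = SimpleTargetEncoder().fit(X, y)
encoded = enc.transform(pd.Series(["a", "b", "c"]))  # "c" was never seen
```

Production target encoders typically add smoothing or cross-fitting on top of this scheme; the point here is only the strict separation of what is learned in fit() from what is applied in transform().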
Stage 3: The Mathematical Selection Funnel
- Variance Filter: Eliminates zero-variance constants and low-variance features that carry no signal.
- Collinearity Filter: Identifies heavily correlated feature pairs. It evaluates both features against the target variable and intelligently drops the redundant feature providing the least predictive power.
- Mutual Information: Applies Information Theory to identify and preserve features with complex, non-linear dependencies on the target.
- Recursive Feature Elimination (RFE): Uses tree-based ensemble estimators (Random Forest) and feature importance ranking to iteratively prune the weakest remaining columns.
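The funnel's variance and RFE stages can be approximated with stock scikit-learn components. The thresholds, synthetic data, and two-stage chaining below are illustrative assumptions that mirror the description, not the library's internal code; a mutual-information filter would slot between the two stages in the same fashion.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, VarianceThreshold

# Synthetic high-dimensional data plus one deliberately constant column.
X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           random_state=42)
X = np.hstack([X, np.zeros((200, 1))])  # zero-variance column to be filtered

# Variance filter: drops the constant column, which carries no signal.
vt = VarianceThreshold(threshold=0.01)
X_var = vt.fit_transform(X)

# RFE with a tree ensemble: iteratively prunes the weakest columns by
# feature importance until only the strongest 10 remain.
rfe = RFE(RandomForestClassifier(n_estimators=50, random_state=42),
          n_features_to_select=10)
X_final = rfe.fit_transform(X_var, y)
```

Chaining cheap filters (variance, correlation, mutual information) before the expensive RFE stage is what keeps the funnel tractable on datasets with thousands of columns.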
Quick Start Guide
The entire framework can be instantiated and run with a few lines of code.
import pandas as pd
from feature_engine_pro.engine import FeatureEngine
from sklearn.model_selection import train_test_split
# 1. Load Data
df = pd.read_csv("high_dimensional_data.csv")
X = df.drop(columns=["target"])
y = df["target"]
# 2. Split Data (Crucial for preventing data leakage)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 3. Initialize Feature Engine
engine = FeatureEngine(
target_column="target",
problem_type="classification",
variance_threshold=0.01,
correlation_threshold=0.85,
mi_threshold=0.01,
rfe_n_features=25
)
# 4. Fit the pipeline to training data
engine.fit(X_train, y_train)
# 5. Transform both train and test sets
X_train_clean = engine.transform(X_train)
X_test_clean = engine.transform(X_test)
# 6. Generate the Audit Report
engine.generate_report(filepath="feature_audit_report.html")
Advanced Usage: GridSearchCV
Because FeatureEngine inherits from BaseEstimator and TransformerMixin, it natively supports hyperparameter tuning to find the optimal mathematical thresholds for your specific dataset.
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
pipeline = Pipeline([
('feature_engine', FeatureEngine(problem_type='classification')),
('classifier', GradientBoostingClassifier())
])
param_grid = {
'feature_engine__correlation_threshold': [0.75, 0.85, 0.95],
'feature_engine__mi_threshold': [0.01, 0.05],
'classifier__learning_rate': [0.01, 0.1]
}
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)
The Audit Report
Calling .generate_report("report.html") produces a standalone HTML document containing:
- A summary count of features kept vs. dropped.
- A visual Bar Chart Funnel illustrating the reduction at each pipeline stage.
- A pre-filtering Correlation Heatmap to visualize dataset collinearity.
- A comprehensive Tabular Audit Trail detailing the exact mathematical reason a specific column was eliminated (e.g., "[CorrelationSelector] Dropped: Correlated 0.92 with feature_X. Kept feature_X because it has higher correlation to target.").
Contributing
Contributions to mathematical optimization, expanding the suite of transformers, or improving computational efficiency for massive datasets are welcome. Please ensure all pull requests maintain Scikit-Learn compatibility and do not introduce data leakage.
Project details
Download files
Source Distribution: feature_engine_pro-0.1.0.tar.gz
Built Distribution: feature_engine_pro-0.1.0-py3-none-any.whl
File details
Details for the file feature_engine_pro-0.1.0.tar.gz.
File metadata
- Download URL: feature_engine_pro-0.1.0.tar.gz
- Upload date:
- Size: 24.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a9c413a5cd2573da0a1ee34982c090d5b3d6ca92d2d9906213889699b2f63266 |
| MD5 | 4e2edbee3c0c45b8aa2224e993cb3542 |
| BLAKE2b-256 | 9315266b3acfdf4a219e6c6a7fadc594d4af715f6e671351f8ceade8e664eaf1 |
File details
Details for the file feature_engine_pro-0.1.0-py3-none-any.whl.
File metadata
- Download URL: feature_engine_pro-0.1.0-py3-none-any.whl
- Upload date:
- Size: 25.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 77363af0fd6270aa25f02492a4ef9b99b0bb7b86716b603c4ace7c70224370ab |
| MD5 | 057cde4ba42a0362344d7c4ad98bc296 |
| BLAKE2b-256 | 20184922fedc479f601262a41b37adf714f02c66589f7fe821dcdbaf9f0fd5c3 |