A Versatile Toolkit for Automated Feature Engineering in Machine Learning
Project description
Beaver FE
A Versatile Toolkit for Automated Feature Engineering in Machine Learning
Beaver FE is a Python library that streamlines feature engineering for machine learning. It provides robust tools for preprocessing tasks such as scaling, normalization, feature creation (e.g., binning, mathematical operations), and encoding. It improves data quality and boosts model performance with minimal manual effort.
📌 Table of Contents
- Getting Started
- Usage Examples
- Benchmark Results
- Core API
- Available Transformations
- Contributing
- License
🚀 Getting Started
Install Beaver FE using pip:
pip install beaverfe
📖 Usage Examples
🤖 Automated Feature Engineering
Automatically optimize feature transformations using a given model and metric:
from beaverfe import auto_feature_pipeline, BeaverPipeline
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()
transformations = auto_feature_pipeline(x, y, model, scoring="accuracy", direction="maximize")
bfe = BeaverPipeline(transformations)
x_train = bfe.fit_transform(x_train, y_train)
x_test = bfe.transform(x_test, y_test)
🔧 Manual Transformations
from beaverfe import BeaverPipeline
from beaverfe.transformations import (
MathematicalOperations,
NumericalBinning,
OutliersHandler,
ScaleTransformation,
)
# Define transformations
transformations = [
OutliersHandler(
transformation_options={
"sepal length (cm)": ("median", "iqr"),
"sepal width (cm)": ("cap", "zscore"),
},
thresholds={
"sepal length (cm)": 1.5,
"sepal width (cm)": 2.5,
},
),
ScaleTransformation(
transformation_options={
"sepal length (cm)": "min_max",
"sepal width (cm)": "robust",
},
quantile_range={
"sepal width (cm)": (25.0, 75.0),
},
),
NumericalBinning(
transformation_options={
"sepal length (cm)": ("uniform", 5),
}
),
MathematicalOperations(
operations_options=[
("sepal length (cm)", "sepal width (cm)", "add"),
]
),
]
bfe = BeaverPipeline(transformations)
x_train = bfe.fit_transform(x_train, y_train)
x_test = bfe.transform(x_test, y_test)
💾 Saving and Loading Transformations
Save your pipeline for reuse across sessions:
import pickle
from beaverfe import BeaverPipeline
bfe = BeaverPipeline(transformations)
# Save pipeline parameters
with open("beaverfe_transformations.pkl", "wb") as f:
pickle.dump(bfe.get_params(), f)
# Load pipeline parameters
with open("beaverfe_transformations.pkl", "rb") as f:
params = pickle.load(f)
bfe.set_params(**params)
📊 Benchmark Results
Beaver FE was evaluated on several datasets and models to assess its impact on model performance. The table below compares baseline accuracy versus accuracy after applying Beaver FE transformations:
| Dataset | Model | Baseline | BeaverFE | Improvement |
|---|---|---|---|---|
| adult | ||||
| LDA | 0.848 | 0.905 | +6.72% | |
| LogisticRegression | 0.822 | 0.900 | +9.49% | |
| XGBoost | 0.921 | 0.923 | +0.22% | |
| bank | ||||
| LDA | 0.874 | 0.911 | +4.23% | |
| LogisticRegression | 0.854 | 0.909 | +6.44% | |
| XGBoost | 0.927 | 0.929 | +0.22% | |
| credit | ||||
| LDA | 0.717 | 0.761 | +6.14% | |
| LogisticRegression | 0.696 | 0.747 | +7.33% | |
| XGBoost | 0.760 | 0.757 | -0.39% |
🚨 Note: A slight decrease was observed in one case. This shows that although Beaver FE generally improves performance, results may vary depending on the model and dataset.
🧩 Core API
auto_feature_pipeline
Automatically finds and applies optimal transformations to improve model performance.
from beaverfe import auto_feature_pipeline
Parameters:
X(np.ndarray): Feature matrix.y(np.ndarray): Target variable.model: A machine learning model implementing afitmethod.scoring(str): Evaluation metric (e.g.,"accuracy","f1","roc_auc").direction(str, optional): Optimization direction:"maximize"or"minimize". Default is"maximize".cv(intor callable, optional): Cross-validation strategy (e.g., number of folds or a custom splitter). Default isNone.groups(np.ndarray, optional): Group labels for cross-validation. Useful for grouped CV.verbose(bool, optional): Whether to display progress logs. Default isTrue.
Transformation Flags:
Each step of the pipeline can be selectively enabled or disabled.
-
preprocessing(bool, default=True): Applies initial cleaning steps, including:- Missing value indicators and imputation
- Outlier detection and handling
- Extraction of datetime features
-
feature_generation(bool, default=True): Applies feature creation techniques such as:- Spline transformations
- Binning of numeric features
- Arithmetic operations
- Categorical encodings
- Cyclical date transformations
-
normalization(bool, default=True): Transforms feature distributions using:- Non-linear transformations
- Quantile transformations
- Normalization/scaling
-
dimensionality_reduction(bool, default=True): Reduces feature space through:- Feature selection (based on performance)
- Projection-based dimensionality reduction (e.g., PCA)
Execution Order:
Transformations are applied in the following order:
- Preprocessing (missing values, outliers, datetime)
- Feature Generation (splines, binning, math ops, encodings)
- Normalization (non-linear transforms, quantiles, scaling)
- Dimensionality Reduction (column selection, PCA)
Returns:
List[dict]: A list of transformation configurations that can be passed toBeaverPipeline.
BeaverPipeline
A wrapper to apply a sequence of transformations.
from beaverfe import BeaverPipeline
Constructor Parameters:
transformations(list, optional): List of transformation objects or dictionaries.
Public Methods:
-
fit(X, y=None)
Fits each transformation in the pipeline to the dataset.- Returns:
self
- Returns:
-
transform(X, y=None)
Applies each fitted transformation in sequence.- Returns: Transformed feature matrix (
np.ndarrayorpd.DataFrame)
- Returns: Transformed feature matrix (
-
fit_transform(X, y=None)
Combinesfitandtransformfor each transformation.- Returns: Transformed feature matrix.
-
get_params(deep=True)
Retrieves the parameters of the pipeline (mainly the transformations).- Returns: Dictionary of parameters.
-
set_params(**params)
Sets or updates the pipeline parameters.- Returns:
self
- Returns:
🔍 Available Transformations
Grouped by feature type or transformation category:
📌 Missing Values & Outliers
Missing Values Indicator
Adds binary flags for missing values.
- Parameters:
features: List of column names to check for missing values. If None, all columns are considered.
from beaverfe.transformations import MissingValuesIndicator
MissingValuesIndicator(
features=[
'sepal width (cm)',
'petal length (cm)',
]
)
Missing Values Handler
Fills missing values.
- Parameters:
transformation_options: Dictionary that specifies the handling strategy for each column. Options:fill_0,mean,median,most_frequent,knn.n_neighbors: Number of neighbors for K-Nearest Neighbors imputation (used withknn).
from beaverfe.transformations import MissingValuesHandler
MissingValuesHandler(
transformation_options={
'sepal width (cm)': 'knn',
'petal length (cm)': 'mean',
'petal width (cm)': 'most_frequent',
},
n_neighbors= {
'sepal width (cm)': 5,
}
)
Handle Outliers
Detects and mitigates outliers using methods like iqr, zscore, lof, or iforest.
- Parameters:
transformation_options: Dictionary specifying the handling strategy. The strategy is a tuple where the first element is the action (capormedian) and the second is the method (iqr,zscore,lof,iforest).thresholds: Dictionary with thresholds foriqrandzscoremethods.lof_params: Dictionary specifying parameters for the LOF method.iforest_params: Dictionary specifying parameters for Isolation Forest.
from beaverfe.transformations import OutliersHandler
OutliersHandler(
transformation_options={
'sepal length (cm)': ('median', 'iqr'),
'sepal width (cm)': ('cap', 'zscore'),
'petal length (cm)': ('median', 'lof'),
'petal width (cm)': ('median', 'iforest'),
},
thresholds={
'sepal length (cm)': 1.5,
'sepal width (cm)': 2.5,
},
lof_params={
'petal length (cm)': {
'n_neighbors': 20,
}
},
iforest_params={
'petal width (cm)': {
'contamination': 0.1,
}
}
)
📌 Data Distribution & Scaling
Non-Linear Transformation
Applies logarithmic, exponential, or Yeo-Johnson transformations.
- Parameters:
transformation_options: A dictionary specifying the transformation to be applied for each column. Options include:log,exponential, andyeo_johnson.
from beaverfe.transformations import NonLinearTransformation
NonLinearTransformation(
transformation_options={
"sepal length (cm)": "log",
"sepal width (cm)": "exponential",
"petal length (cm)": "yeo_johnson",
}
)
Quantile Transformations
Transforms data to follow a normal or uniform distribution.
- Parameters:
transformation_options: Dictionary specifying the transformation type. Options:uniform,normal.
from beaverfe.transformations import QuantileTransformation
QuantileTransformation(
transformation_options={
'sepal length (cm)': 'uniform',
'sepal width (cm)': 'normal',
}
)
Scale Transformations
Scales numerical data using different scaling methods.
- Parameters:
transformation_options: Dictionary specifying the scaling method for each column. Options:min_max,standard,robust,max_abs.quantile_range: Dictionary specifying the quantile ranges for robust scaling.
from beaverfe.transformations import ScaleTransformation
ScaleTransformation(
transformation_options={
'sepal length (cm)': 'min_max',
'sepal width (cm)': 'standard',
'petal length (cm)': 'robust',
'petal width (cm)': 'max_abs',
},
quantile_range={
"petal length (cm)": (25.0, 75.0),
},
)
Normalization
Normalizes data using L1 or L2 norms.
- Parameters:
transformation_options: Dictionary specifying the normalization type. Options:l1,l2.
from beaverfe.transformations import Normalization
Normalization(
transformation_options={
'sepal length (cm)': 'l1',
'sepal width (cm)': 'l2',
}
)
📌 Numerical Features
Spline Transformations
Applies Spline transformation to numerical features.
- Parameters:
transformation_options: Dictionary specifying the spline transformation settings for each column. Options include different numbers of knots and degrees.
from beaverfe.transformations import SplineTransformation
SplineTransformation(
transformation_options={
'sepal length (cm)': {'degree': 3, 'n_knots': 3},
'sepal width (cm)': {'degree': 3, 'n_knots': 5},
}
)
Numerical Binning
Bins numerical columns into categories. You can now specify the column, the binning method, and the number of bins in a tuple.
- Parameters:
transformation_options: Dictionary specifying the binning method and number of bins for each column. Options for binning methods areuniform,quantileorkmeans.
from beaverfe.transformations import NumericalBinning
NumericalBinning(
transformation_options={
"sepal length (cm)": ("uniform", 5),
"sepal width (cm)": ("quantile", 6),
"petal length (cm)": ("kmeans", 7),
}
)
Mathematical Operations
Performs mathematical operations between columns.
-
Parameters:
operations_options: List of tuples specifying the columns and the operation.
-
Options:
add: Adds the values of two columns.subtract: Subtracts the values of two columns.multiply: Multiplies the values of two columns.divide: Divides the values of two columns.modulus: Computes the modulus of two columns.hypotenuse: Computes the hypotenuse of two columns.mean: Calculates the mean of two columns.
from beaverfe.transformations import MathematicalOperations
MathematicalOperations(
operations_options=[
('sepal length (cm)', 'sepal width (cm)', 'add'),
('petal length (cm)', 'petal width (cm)', 'subtract'),
('sepal length (cm)', 'petal length (cm)', 'multiply'),
('sepal width (cm)', 'petal width (cm)', 'divide'),
('sepal length (cm)', 'petal width (cm)', 'modulus'),
('sepal length (cm)', 'sepal width (cm)', 'hypotenuse'),
('petal length (cm)', 'petal width (cm)', 'mean'),
]
)
📌 Categorical Features
Categorical Encoding
Encodes categorical variables using various methods.
-
Parameters:
encodings_options: Dictionary specifying the encoding method for each column.ordinal_orders: Specifies the order for ordinal encoding.
-
Encodings:
backward_diff: Uses backward difference coding to compare each category to the previous one.basen: Encodes categorical features using a base-N representation.binary: Converts categorical variables into binary representations.catboost: Implements the CatBoost encoding, which is a target-based encoding method.count: Replaces categories with the count of occurrences in the dataset.dummy: Applies dummy coding, similar to one-hot encoding but with one less category to avoid collinearity.glmm: Uses Generalized Linear Mixed Models to encode categorical variables.gray: Converts categories into Gray code, a binary numeral system where two successive values differ in only one bit.hashing: Uses a hashing trick to encode categorical features into a fixed number of dimensions.helmert: Compares each level of a categorical variable to the mean of subsequent levels.james_stein: Applies James-Stein shrinkage estimation for target encoding.label: Assigns each category a unique integer label.loo: Uses leave-one-out target encoding to replace categories with the mean target value, excluding the current row.m_estimate: A variant of target encoding that applies an m-estimate to regularize values.onehot: Converts categorical variables into binary vectors where each category is represented by a separate column.ordinal: Replaces categories with ordinal values based on their ordering.polynomial: Applies polynomial contrast coding to categorical variables.quantile: Maps categorical variables to quantiles based on their distribution.rankhot: Encodes categories based on their ranking, similar to one-hot but considering order.sum: Uses sum coding to compare each level to the overall mean.target: Encodes categories using the mean of the target variable for each category.woe: Applies Weight of Evidence (WoE) encoding, useful in logistic regression by transforming categorical data into log odds.
from beaverfe.transformations import CategoricalEncoding
CategoricalEncoding(
transformation_options={
'Sex': 'label',
'Size': 'ordinal',
},
ordinal_orders={
"Size": ["small", "medium", "large"]
}
)
📌 Periodic Features
Date Time Transforms
Extracts time-based features like day, month, hour, etc.
- Parameters:
features: List of columns to extract date/time features from. If None, all datetime columns are considered.
from beaverfe.transformations import DateTimeTransformer
DateTimeTransformer(
features=["date"]
)
Cyclical Features Transforms
Encodes cyclical values using sine and cosine representations.
- Parameters:
transformation_options: Dictionary specifying the period for each cyclical column.
from beaverfe.transformations import CyclicalFeaturesTransformer
CyclicalFeaturesTransformer(
transformation_options={
"date_minute": 60,
"date_hour": 24,
}
)
📌 Features Reduction
Column Selection
Selects a subset of columns for further transformation.
- Parameters:
features: List of column names to select.
from beaverfe.transformations import ColumnSelection
ColumnSelection(
features=[
"sepal length (cm)",
"sepal width (cm)",
]
)
Dimensionality Reduction
Reduces the dimensionality of the dataset using various techniques, such as PCA, Factor Analysis, ICA, LDA, and others.
-
Parameters:
features: List of column names to apply the dimensionality reduction. If None, all columns are considered.method: The dimensionality reduction method to apply.n_components: Number of dimensions to reduce the data to.
-
Methods:
pca: Principal Component Analysis.factor_analysis: Factor Analysis.ica: Independent Component Analysis.kernel_pca: Kernel PCA.lda: Linear Discriminant Analysis.truncated_svd: Truncated Singular Value Decomposition.isomap: Isomap Embedding.lle: Locally Linear Embedding.
-
Notes: For
lda, the y target variable is required, as it uses class labels for discriminant analysis.
from beaverfe.transformations import DimensionalityReduction
DimensionalityReduction(
method="pca",
n_components=3
)
🛠️ Contributing
We welcome contributions! Please submit pull requests, open issues, or share suggestions to improve Beaver FE.
📄 License
Beaver FE is open-source software distributed under the MIT License.
🚀 Power up your ML workflows with intelligent, flexible feature engineering — with just a few lines of code. Try Beaver FE today!
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file beaverfe-0.3.0.tar.gz.
File metadata
- Download URL: beaverfe-0.3.0.tar.gz
- Upload date:
- Size: 4.9 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.5.30
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
2e32182cbc008d7aa79b6f044dc050c4b0c4c4af789d59c718c96cc40adb5c87
|
|
| MD5 |
eee142dbcc0059bec619ae0b9cdac522
|
|
| BLAKE2b-256 |
60538ee367c97d832bca25e75591eea561e1de9b5cc98746d59bea9cfafbf493
|
File details
Details for the file beaverfe-0.3.0-py3-none-any.whl.
File metadata
- Download URL: beaverfe-0.3.0-py3-none-any.whl
- Upload date:
- Size: 55.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.5.30
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
df35d627985527051cf3f91eda0d9e9a5c98d4c64c09877b226933b16e434e38
|
|
| MD5 |
a9b77a3d96158e15d1ad7f3e8ed4c287
|
|
| BLAKE2b-256 |
2de3d8dd1bff1fb7dc5fd6a409c1e8492d07584447f1a2f60e4636f3ba41ac59
|