Feature Engine
Feature engineering package with Scikit-learn's fit/transform functionality
Feature-engine is a Python library with multiple transformers to engineer features for use in machine learning models. Feature-engine's transformers follow Scikit-learn's conventions, with fit() and transform() methods that first learn the transformation parameters from the data and then transform it.
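The fit/transform contract can be illustrated with a minimal, self-contained imputer. This is a hypothetical sketch of the convention, not Feature-engine's own implementation (Feature-engine's transformers operate on pandas DataFrames):

```python
from statistics import median

class MedianImputer:
    """Toy transformer following the Scikit-learn convention:
    fit() learns a parameter from the data (here, the median of
    the non-missing values), transform() applies it."""

    def fit(self, values):
        # learn the median, ignoring missing entries
        self.median_ = median(v for v in values if v is not None)
        return self

    def transform(self, values):
        # replace missing entries with the learned median
        return [self.median_ if v is None else v for v in values]

imputer = MedianImputer().fit([1, 2, None, 4])
print(imputer.transform([None, 10]))  # [2, 10]
```

Because the parameter is learned once during fit() and reused in transform(), the same statistic computed on the training data is applied consistently to any new data.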
Feature-engine is featured in the following resources:
Blogs about Feature-engine:
- Feature-engine: A new open source Python package for feature engineering
- Open Source Python libraries for Feature Engineering: Comparisons and Walkthroughs
Documentation
- Documentation: http://feature-engine.readthedocs.io
- Home page: https://www.trainindata.com/feature-engine
Feature-engine's current transformers include functionality for:
- Missing data imputation
- Categorical variable encoding
- Outlier removal
- Discretisation
- Numerical variable transformation
Imputing Methods
- MeanMedianImputer
- RandomSampleImputer
- EndTailImputer
- AddNaNBinaryImputer
- CategoricalVariableImputer
- FrequentCategoryImputer
- ArbitraryNumberImputer
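The idea behind end-of-tail imputation (as in EndTailImputer) can be sketched with plain Python: missing values are replaced with a value from the far end of the variable's distribution, for example the mean plus a multiple of the standard deviation. This is a hypothetical illustration, not Feature-engine's implementation:

```python
from statistics import mean, stdev

def end_tail_fill(values, factor=3):
    """Sketch of end-of-tail imputation: replace missing values with
    mean + factor * std, pushing them into the distribution's tail."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) + factor * stdev(observed)
    return [fill if v is None else v for v in values]

print(end_tail_fill([1, 2, 3, None]))  # [1, 2, 3, 5.0]
```

Placing imputed values in the tail flags them as "different" to the model, which can be useful when data is not missing at random.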
Encoding Methods
- CountFrequencyCategoricalEncoder
- OrdinalCategoricalEncoder
- MeanCategoricalEncoder
- WoERatioCategoricalEncoder
- OneHotCategoricalEncoder
- RareLabelCategoricalEncoder
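Count encoding (the idea behind CountFrequencyCategoricalEncoder) replaces each category with how often it appears in the training data. A hypothetical sketch of the technique, not Feature-engine's implementation:

```python
from collections import Counter

def count_encode(categories):
    """Sketch of count encoding: replace each category with the
    number of times it appears in the training data."""
    counts = Counter(categories)
    return [counts[c] for c in categories]

print(count_encode(['A', 'B', 'A', 'C', 'A']))  # [3, 1, 3, 1, 3]
```

In practice the counts are learned during fit() on the training set and then reused to encode unseen data.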
Outlier Handling Methods
- Winsorizer
- ArbitraryOutlierCapper
- OutlierTrimmer
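Capping (winsorization), the idea behind the Winsorizer and ArbitraryOutlierCapper, clips extreme values to chosen bounds instead of removing the rows. In a minimal sketch the bounds are passed in directly; Feature-engine's Winsorizer instead learns them from the data during fit() (e.g. from percentiles or IQR-based rules):

```python
def winsorize(values, lower, upper):
    """Sketch of capping (winsorization): values beyond the given
    bounds are clipped to the bounds rather than removed."""
    return [min(max(v, lower), upper) for v in values]

print(winsorize([-10, 2, 5, 99], lower=0, upper=10))  # [0, 2, 5, 10]
```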
Discretisation Methods
- EqualFrequencyDiscretiser
- EqualWidthDiscretiser
- DecisionTreeDiscretiser
- UserInputDiscretiser
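Equal-width discretisation (as in EqualWidthDiscretiser) splits a variable's range into intervals of equal size and maps each value to its interval index. A hypothetical sketch of the technique, not Feature-engine's implementation:

```python
def equal_width_bins(values, n_bins):
    """Sketch of equal-width discretisation: split the value range
    into n_bins intervals of equal size and return each value's bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # the maximum value falls exactly on the upper edge, so clamp
    # it into the last bin
    return [min(int((v - lo) // width), n_bins - 1) for v in values]

print(equal_width_bins([0, 2.5, 5, 7.5, 10], n_bins=4))  # [0, 1, 2, 3, 3]
```

Equal-frequency discretisation works analogously but places the interval boundaries at quantiles, so each bin holds roughly the same number of observations.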
Variable Transformation Methods
- LogTransformer
- ReciprocalTransformer
- PowerTransformer
- BoxCoxTransformer
- YeoJohnsonTransformer
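The Box-Cox transform underlying the BoxCoxTransformer is defined as (x**λ − 1) / λ for λ ≠ 0 and log(x) for λ = 0, and is only valid for strictly positive values. A minimal single-value sketch of the formula (Feature-engine applies it column-wise and estimates λ from the data):

```python
import math

def box_cox(x, lam):
    """Sketch of the Box-Cox transform for one positive value:
    (x**lam - 1) / lam when lam != 0, and log(x) when lam == 0."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    return math.log(x) if lam == 0 else (x ** lam - 1) / lam

print(box_cox(10, 0))  # natural log of 10, about 2.3026
print(box_cox(10, 1))  # 9.0
```

The Yeo-Johnson transform extends this idea to zero and negative values, which is why YeoJohnsonTransformer exists alongside BoxCoxTransformer.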
Scikit-learn Wrapper:
- SklearnTransformerWrapper
Installing
pip install feature_engine
or
git clone https://github.com/solegalli/feature_engine.git
Usage
>>> from feature_engine.categorical_encoders import RareLabelCategoricalEncoder
>>> import pandas as pd
>>> data = {'var_A': ['A'] * 10 + ['B'] * 10 + ['C'] * 2 + ['D'] * 1}
>>> data = pd.DataFrame(data)
>>> data['var_A'].value_counts()
Out[1]:
A 10
B 10
C 2
D 1
Name: var_A, dtype: int64
>>> rare_encoder = RareLabelCategoricalEncoder(tol=0.10, n_categories=3)
>>> data_encoded = rare_encoder.fit_transform(data)
>>> data_encoded['var_A'].value_counts()
Out[2]:
A 10
B 10
Rare 3
Name: var_A, dtype: int64
See more usage examples in the jupyter notebooks in the example folder of this repository, or in the documentation: http://feature-engine.readthedocs.io
Contributing
Local Setup Steps
- Clone the repo and cd into it
- Run pip install tox
- Run tox
- If the tests pass, your local setup is complete
Opening Pull Requests
PRs are welcome! Please make sure the CI tests pass on your branch.
License
BSD 3-Clause
Authors
- Soledad Galli - Initial work - Feature Engineering for Machine Learning, Online Course.
References
Much of the engineering and encoding functionality is inspired by this series of articles from the 2009 KDD competition.
To learn more about the rationale, functionality, pros and cons of each imputer, encoder and transformer, refer to the Feature Engineering for Machine Learning, Online Course.
For a summary of the methods, check this presentation and this article.
To be notified of the latest releases, sign up at trainindata.
Hashes for feature_engine-0.5.2-py2.py3-none-any.whl

Algorithm | Hash digest
---|---
SHA256 | 49e7eaf9d12893d3d1bf843b2c38271b21ccdeed06b71ade508d3160aa93f135
MD5 | a2117f178470ede2ec5b9a0765cbde1d
BLAKE2b-256 | 793367a2d0c0e91f786b33b6cd29f5c217343ccda63f1304e1f6cac069e05f40