Intelligent imputation using tree-based and machine learning algorithms
PyImpuyte
PyImpuyte is a Python 3.7+ package that simplifies the task of imputing missing values in datasets. PyImpuyte was built with a strong customer-centric focus and leverages scikit-learn. It brings together various imputation strategies and harnesses machine learning algorithms to improve data coverage.
PyImpuyte gives the user exactly what they want: hassle-free deployment of machine learning algorithms. Simply ingest your data, set your target, pass in a feature matrix and select your chosen imputation strategy. You then have machine-generated imputed values appended to your dataframe.
To learn more about how to use PyImpuyte, check out our docs for a step-by-step guide.
Contents
- Motivation
- Installation
- Quick Start
- Contribute
- Conferences and Meet-ups
- Citation
- Developers and Maintainers
- Acknowledgements
- Copyright
Motivation
Incomplete data are common and can degrade statistical inference. The PyImpuyte team therefore set out to develop a Python package that simplifies the task of imputing missing values in Australian Government national statistical assets and other micro-data sets.
The development of PyImpuyte is motivated by helping micro-data practitioners select and implement advanced imputation methods. PyImpuyte adds a tool to the toolkit of practitioners seeking to preserve their data and fight the information loss that arises from dropping observations with missing values.
Main Features
- Interfaces with scikit-learn to provide a customer-centric and efficient way to perform imputation using machine learning algorithms.
- Support for numerous imputation strategies and performance metrics, as specified below:
Imputation Strategies
Univariate | Generalised Linear Models | Bagging and Boosted Trees | Neural Nets |
---|---|---|---|
Mean | Linear Regressions | Bagging Regressor | Multi-layer Perceptron |
Median | Lasso | Extra Trees Regressor | |
Mode | Ridge | Extreme Gradient Boosting | |
| | Random Forest Regressor | |
| | XGBoost, LightGBM, CatBoost | |
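The two broad families in the table can be illustrated with a minimal, dependency-free sketch. This is not PyImpuyte's implementation: the function names and the sample values are hypothetical, and a hand-written single-feature least-squares fit stands in for a scikit-learn estimator such as a linear regression. The idea is the same for any model-based strategy: fit on the rows where the target is observed, then predict the rows where it is missing.

```python
import statistics


def impute_univariate(values, strategy="mean"):
    """Fill None entries with the mean, median or mode of the observed values."""
    observed = [v for v in values if v is not None]
    summarise = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }[strategy]
    fill = summarise(observed)
    return [fill if v is None else v for v in values]


def impute_by_regression(feature, target):
    """Fit target ~ feature on complete rows, then predict the missing targets.

    A single-feature ordinary least-squares fit, used here only as a stand-in
    for a scikit-learn regressor.
    """
    pairs = [(x, y) for x, y in zip(feature, target) if y is not None]
    mean_x = statistics.mean([x for x, _ in pairs])
    mean_y = statistics.mean([y for _, y in pairs])
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    variance = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    # Observed targets are kept; only the gaps are replaced by predictions.
    return [intercept + slope * x if y is None else y
            for x, y in zip(feature, target)]


# Hypothetical data: one feature column and a target with a single gap.
wages = [100.0, 200.0, 300.0, 400.0]
fte = [2.0, None, 6.0, 8.0]

print(impute_univariate(fte, "median"))
print(impute_by_regression(wages, fte))
```

The univariate strategies ignore the feature matrix entirely, which is why model-based strategies tend to be preferred when the features carry information about the target.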
Performance Metrics
Simple error |
Percentage error |
Naive forecasting |
Relative Error |
Bounded Relative Error |
Geometric mean |
Mean Squared Error |
Normalized Root Mean Squared Error |
Mean Error |
Mean Absolute Error |
Geometric Mean Absolute Error |
Median Absolute Error |
Mean Percentage Error |
Mean Absolute Percentage Error |
Median Absolute Percentage Error |
Symmetric Mean Absolute Percentage Error |
Symmetric Median Absolute Percentage Error |
Mean Arctangent Absolute Percentage Error |
Mean Absolute Scaled Error |
Normalized Absolute Error |
Normalized Absolute Percentage Error |
Root Mean Squared Percentage Error |
Root Median Squared Percentage Error |
Root Mean Squared Scaled Error |
Integral Normalized Root Squared Error |
Root Relative Squared Error |
Mean Relative Error |
Median Relative Absolute Error |
Geometric Mean Relative Absolute Error |
Mean Bounded Relative Absolute Error |
Unscaled Mean Bounded Relative Absolute Error |
Mean Directional Accuracy |
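As a rough illustration of how a few of the listed metrics are commonly defined, the sketch below compares imputed values with known values that were held out. The function names and sample numbers are hypothetical, and PyImpuyte's own implementations may follow different conventions (for example, percentage errors are returned here as fractions, and the symmetric variant uses the 2|a - p| / (|a| + |p|) form).

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average squared difference; penalises large errors more heavily."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_percentage_error(actual, predicted):
    """Average absolute error relative to the actual value (as a fraction).

    Undefined when any actual value is zero.
    """
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def symmetric_mean_absolute_percentage_error(actual, predicted):
    """Average of 2|a - p| / (|a| + |p|), one of several sMAPE conventions."""
    return sum(
        2 * abs(a - p) / (abs(a) + abs(p)) for a, p in zip(actual, predicted)
    ) / len(actual)


# Hypothetical held-out values and the values imputed for them.
actual = [10.0, 20.0, 40.0]
imputed = [12.0, 18.0, 40.0]

for metric in (
    mean_absolute_error,
    mean_squared_error,
    mean_absolute_percentage_error,
    symmetric_mean_absolute_percentage_error,
):
    print(metric.__name__, round(metric(actual, imputed), 4))
```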
Versions and Dependencies
- Python 3.7+
- Dependencies:
  - missingno >= 0.4.1
  - numpy >= 1.15.4
  - pandas >= 0.20.3
  - scikit-learn >= 0.20.2
  - xgboost >= 0.83
Installation
There are two ways to install the PyImpuyte package:
- Install PyImpuyte from PyPI (recommended):
    pip install PyImpuyte==1.3.5
- Install PyImpuyte from the Bitbucket source:
    git clone https://bitbucket.csiro.au/scm/dde/pyimpuyte.git
    cd pyimpuyte
    python setup.py install
Quick Start
To start imputing missing values with PyImpuyte, a config.json file must be passed. The following configuration can be used as a starting point:
{
    "pyimpuyte": {
        "input": [
            "data/synth_data_test.csv"
        ],
        "feature_list": ["TURNOVER", "WAGES", "SALES"],
        "target": "FTE",
        "skip_columns": null,
        "nrows": 1000,
        "drop_duplicates": true,
        "output": "out/synth_data_test.csv",
        "evaluation": "out/evaluation.csv"
    }
}
For more information about how to configure PyImpuyte, see our suggested template.
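Before running a job, it can help to sanity-check a config of the shape shown above. The sketch below is not part of PyImpuyte: the `parse_config` helper and the choice of which keys count as required are assumptions made for illustration, and only the key names themselves come from the example config.

```python
import json

# Assumption: these keys are treated as mandatory for this sketch only.
REQUIRED_KEYS = {"input", "feature_list", "target", "output"}


def parse_config(text):
    """Parse JSON text and return its "pyimpuyte" section after basic checks."""
    section = json.loads(text)["pyimpuyte"]
    missing = REQUIRED_KEYS - set(section)
    if missing:
        raise KeyError(f"config is missing keys: {sorted(missing)}")
    if section["target"] in section["feature_list"]:
        raise ValueError("target must not also be listed as a feature")
    return section


example = """
{
    "pyimpuyte": {
        "input": ["data/synth_data_test.csv"],
        "feature_list": ["TURNOVER", "WAGES", "SALES"],
        "target": "FTE",
        "skip_columns": null,
        "nrows": 1000,
        "drop_duplicates": true,
        "output": "out/synth_data_test.csv",
        "evaluation": "out/evaluation.csv"
    }
}
"""

config = parse_config(example)
print(config["target"], config["feature_list"])
```

Note that JSON `null` and `true` become Python `None` and `True` after parsing, so `config["skip_columns"]` is `None` here.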
Contribute
We welcome all kinds of contributions that improve the performance of the currently published package. See the Contribution Guide for more details.
Conferences and Meet-ups
- We presented our research at the 2019 Australasian Joint Conference on Artificial Intelligence, which led to the development of PyImpuyte.
- We will be presenting at the next Canberra Data Scientists Meet-up on 28 July 2020.
Citation
Please cite our work in your publications if it helps your research.
- Conference Paper - Chapter 18 of AI2019: Advances in Artificial Intelligence.
@inbook{inbook,
author = {Suresh, Marcus and Taib, Ronnie and Zhao, Yanchang and Jin, Warren},
year = {2019},
month = {11},
pages = {215-227},
title = {Sharpening the BLADE: Missing Data Imputation Using Supervised Machine Learning},
isbn = {978-3-030-35287-5},
doi = {10.1007/978-3-030-35288-2_18}
}
- Python Package - PyImpuyte.
@misc{Suresh2020_PyImpuyte,
title={PyImpuyte},
author={Suresh, Marcus et al.},
year={2020},
howpublished={\url{https://bitbucket.csiro.au/projects/DDE/repos/pyimpuyte}},
}
Developers and Maintainers
- The developers began work to bring PyImpuyte into production in October 2019. PyImpuyte is actively maintained and there will be incremental improvements scheduled on a regular basis. The lead developers and maintainers are:
  - Marcus Suresh, Bitbucket: sur033 and GitHub: marcus-suresh
  - Ronnie Taib, GitHub: rtaib
- See the Developers page to get in touch with the PyImpuyte team.
Acknowledgements
- This research was funded by the Australian Government through the Department of Industry, Science, Energy and Resources (DISER) and the Data Integration Partnership for Australia (DIPA).
- The developers would like to extend their gratitude to Dr. Abrie Swanepoel (Branch Manager) and Dr. Tala Talgasawatta (Director) from DISER for their ongoing support of PyImpuyte.
Copyright
PyImpuyte is distributed under the MIT license. See LICENSE for details.