Intelligent imputation using tree-based and machine learning algorithms

PyImpuyte

PyImpuyte is a Python 3.7+ package that simplifies the task of imputing missing values in datasets.

PyImpuyte was built with a strong customer-centric focus and leverages scikit-learn. It brings together various imputation strategies and harnesses machine learning algorithms to improve data coverage.

PyImpuyte aims for hassle-free deployment of machine learning algorithms. Ingest your data, set your target, pass in a feature matrix and select an imputation strategy. The machine-generated imputed values are then appended to your dataframe.
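The idea behind this workflow can be sketched in plain Python. This is a minimal illustration of supervised imputation, not PyImpuyte's API: the helper names `fit_line` and `impute_target` are hypothetical, the data are made up, and a one-feature least-squares line stands in for the machine learning models the package wraps.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def impute_target(rows, feature, target):
    """Return copies of rows where a missing target (None) is predicted from feature."""
    # Train only on rows where the target was observed.
    observed = [r for r in rows if r[target] is not None]
    slope, intercept = fit_line(
        [r[feature] for r in observed],
        [r[target] for r in observed],
    )
    imputed = []
    for r in rows:
        new_row = dict(r)  # leave the input rows untouched
        if new_row[target] is None:
            new_row[target] = slope * new_row[feature] + intercept
        imputed.append(new_row)
    return imputed


# Hypothetical data: FTE is missing in the last row.
rows = [
    {"WAGES": 100.0, "FTE": 2.0},
    {"WAGES": 200.0, "FTE": 4.0},
    {"WAGES": 300.0, "FTE": 6.0},
    {"WAGES": 250.0, "FTE": None},
]
result = impute_target(rows, "WAGES", "FTE")
```

The model is fitted on the complete rows only and then used to predict the target for the incomplete rows, which is the same train-then-fill pattern that any supervised imputation strategy follows.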

To learn more about how to use PyImpuyte, check out our docs for a step-by-step guide.

Motivation

Incomplete data are common and can degrade statistical inference. The PyImpuyte team therefore set out to develop a Python package that simplifies the task of imputing missing values in Australian Government national statistical assets and other micro-data sets.

The development of PyImpuyte is motivated by helping micro-data practitioners select and implement advanced imputation methods. PyImpuyte adds a tool to the toolkit of practitioners seeking to preserve their data and fight the information loss that arises from dropping observations with missing values.

Main Features

  • Interfaces with scikit-learn to provide a customer-centric and efficient way to perform imputation using machine learning algorithms.
  • Support for numerous imputation strategies and performance metrics, as specified below:

Imputation Strategies

  • Univariate: Mean, Median, Mode
  • Generalised Linear Models: Linear Regressions, Lasso, Ridge
  • Bagging and Boosted Trees: Bagging Regressor, Extra Trees Regressor, Extreme Gradient Boosting, Random Forest Regressor, XGBoost, LightGBM, CatBoost
  • Neural Nets: Multi-layer Perceptron
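
As a point of reference, the univariate strategies (mean, median, mode) fill every missing value in a column with a single summary statistic of the observed values. A minimal sketch using only the standard library follows; the function name `impute_univariate` and the sample values are hypothetical and are not taken from PyImpuyte.

```python
from statistics import mean, median, mode

# Map each strategy name to the statistic used as the fill value.
STRATEGIES = {"mean": mean, "median": median, "mode": mode}


def impute_univariate(values, strategy="mean"):
    """Replace None entries with a summary statistic of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = STRATEGIES[strategy](observed)
    return [fill if v is None else v for v in values]


fte = [1, 2, 2, None, 7]  # hypothetical column with one missing value
filled_mean = impute_univariate(fte, "mean")
filled_median = impute_univariate(fte, "median")
filled_mode = impute_univariate(fte, "mode")
```

Univariate strategies ignore the other columns entirely, which is why the model-based strategies listed above can recover more information when the features are correlated with the target.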

Performance Metrics

Simple error
Percentage error
Naive forecasting
Relative Error
Bounded Relative Error
Geometric mean
Mean Squared Error
Normalized Root Mean Squared Error
Mean Error
Mean Absolute Error
Geometric Mean Absolute Error
Median Absolute Error
Mean Percentage Error
Mean Absolute Percentage Error
Median Absolute Percentage Error
Symmetric Mean Absolute Percentage Error
Symmetric Median Absolute Percentage Error
Mean Arctangent Absolute Percentage Error
Mean Absolute Scaled Error
Normalized Absolute Error
Normalized Absolute Percentage Error
Root Mean Squared Percentage Error
Root Median Squared Percentage Error
Root Mean Squared Scaled Error
Integral Normalized Root Squared Error
Root Relative Squared Error
Mean Relative Error
Median Relative Absolute Error
Geometric Mean Relative Absolute Error
Mean Bounded Relative Absolute Error
Unscaled Mean Bounded Relative Absolute Error
Mean Directional Accuracy
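
Three of the listed metrics can be sketched from their standard definitions. The function names below are hypothetical rather than PyImpuyte's own, and the percentage error is returned as a fraction here, which is an assumption since conventions differ between libraries.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_percentage_error(actual, predicted):
    """Average absolute error relative to the actual value, as a fraction.

    Undefined when any actual value is zero (raises ZeroDivisionError).
    """
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical held-out values and their imputed counterparts.
actual = [2.0, 4.0, 8.0]
predicted = [1.0, 4.0, 10.0]

mse = mean_squared_error(actual, predicted)
mae = mean_absolute_error(actual, predicted)
mape = mean_absolute_percentage_error(actual, predicted)
```

Metrics like these are typically computed by masking values that are actually known, imputing them, and comparing the imputed values with the originals.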

Versions and Dependencies

  • Python 3.7+
  • Dependencies:
    • missingno >= 0.4.1
    • numpy >= 1.15.4
    • pandas >= 0.20.3
    • scikit-learn >= 0.20.2
    • xgboost >= 0.83

Installation

There are two ways to install the PyImpuyte package:

  • Install PyImpuyte from PyPI (recommended):
pip install PyImpuyte==1.3.5
  • Install PyImpuyte from the Bitbucket source:
git clone https://bitbucket.csiro.au/scm/dde/pyimpuyte.git
cd pyimpuyte
python setup.py install

Quick Start

To start imputing missing values with PyImpuyte, you must supply a config.json file. The following configuration can be used as a starting point:

{
    "pyimpuyte": {
        "input": [
            "data/synth_data_test.csv"
        ],
        "feature_list": ["TURNOVER", "WAGES", "SALES"],
        "target": "FTE",
        "skip_columns": null,
        "nrows": 1000,
        "drop_duplicates": true,
        "output": "out/synth_data_test.csv",
        "evaluation": "out/evaluation.csv"
    }
}

For more information about how to configure PyImpuyte, see our suggested template.
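
Before running a job, it can help to check the configuration for obvious mistakes. The sketch below parses the configuration shown above with the standard `json` module. The `load_config` helper and the choice of required keys are assumptions made for illustration; they are not part of PyImpuyte.

```python
import json

# Assumption: these keys are treated as mandatory for this sketch only.
REQUIRED_KEYS = {"input", "feature_list", "target", "output"}


def load_config(text):
    """Parse a config document and return the settings under "pyimpuyte"."""
    settings = json.loads(text)["pyimpuyte"]
    missing = REQUIRED_KEYS - settings.keys()
    if missing:
        raise ValueError(f"missing config keys: {sorted(missing)}")
    if settings["target"] in settings["feature_list"]:
        raise ValueError("target must not also appear in feature_list")
    return settings


example = """
{
    "pyimpuyte": {
        "input": ["data/synth_data_test.csv"],
        "feature_list": ["TURNOVER", "WAGES", "SALES"],
        "target": "FTE",
        "skip_columns": null,
        "nrows": 1000,
        "drop_duplicates": true,
        "output": "out/synth_data_test.csv",
        "evaluation": "out/evaluation.csv"
    }
}
"""
settings = load_config(example)
```

Note that JSON `null` and `true` become Python `None` and `True` after parsing, so `settings["skip_columns"]` is `None` in this example.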

Contribute

We welcome all kinds of contributions that improve the performance of the currently published package. See the Contribution Guide for more details.

Citation

Please cite our work in your publications if it helps your research.

@inbook{inbook,
  author = {Suresh, Marcus and Taib, Ronnie and Zhao, Yanchang and Jin, Warren},
  year = {2019},
  month = {11},
  pages = {215-227},
  title = {Sharpening the BLADE: Missing Data Imputation Using Supervised Machine Learning},
  isbn = {978-3-030-35287-5},
  doi = {10.1007/978-3-030-35288-2_18}
}
@misc{Suresh2020_PyImpuyte,
  title={PyImpuyte},
  author={Suresh, Marcus et al.},
  year={2020},
  howpublished={\url{https://bitbucket.csiro.au/projects/DDE/repos/pyimpuyte}},
}

Developers and Maintainers

  • The developers began work to bring PyImpuyte into production in October 2019. PyImpuyte is actively maintained, with incremental improvements scheduled on a regular basis. The lead developers and maintainers are:

  • See the Developers page to get in touch with the PyImpuyte team.

Copyright

PyImpuyte is distributed under the MIT license. See LICENSE for details.
