Intelligent imputation using tree-based and machine learning algorithms

Project description

PyImpuyte

PyImpuyte is a Python 3.7+ package that simplifies the task of imputing missing values in datasets.

PyImpuyte was built with a strong customer-centric focus and leverages scikit-learn. It brings together various imputation strategies and harnesses machine learning algorithms to improve data coverage.

PyImpuyte aims to make deploying machine learning algorithms for imputation hassle-free. Ingest your data, set your target, pass in a feature matrix and select an imputation strategy. The machine-generated imputed values are then appended to your dataframe.
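The workflow described above can be illustrated with plain pandas and scikit-learn. This is a minimal sketch of the general technique (fit a supervised model on rows where the target is observed, then predict the rows where it is missing), not PyImpuyte's own API. The function name `impute_target`, the `_imputed` column suffix and the synthetic values are assumptions made for illustration; the column names are borrowed from the Quick Start configuration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def impute_target(df, features, target, model=None):
    """Return a copy of df with an extra '<target>_imputed' column.

    Hypothetical helper for illustration only; it assumes the feature
    columns themselves contain no missing values.
    """
    if model is None:
        model = RandomForestRegressor(n_estimators=50, random_state=0)
    observed = df[target].notna()
    # Train only on rows where the target is known.
    model.fit(df.loc[observed, features], df.loc[observed, target])
    out = df.copy()
    out[target + "_imputed"] = out[target]
    if (~observed).any():
        # Fill only the rows where the target was missing.
        out.loc[~observed, target + "_imputed"] = model.predict(
            df.loc[~observed, features]
        )
    return out


# Tiny synthetic frame with two missing FTE values.
df = pd.DataFrame({
    "TURNOVER": [100.0, 200.0, 300.0, 400.0, 500.0, 600.0],
    "WAGES": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "SALES": [90.0, 180.0, 270.0, 360.0, 450.0, 540.0],
    "FTE": [1.0, 2.0, np.nan, 4.0, 5.0, np.nan],
})
result = impute_target(df, ["TURNOVER", "WAGES", "SALES"], "FTE")
print(result[["FTE", "FTE_imputed"]])
```

Keeping the original target column untouched and writing the predictions to a separate column makes it easy to see which values were observed and which were generated by the model.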

To learn more about how to use PyImpuyte, check out our docs for a step-by-step guide.

Motivation

Incomplete data are common and can degrade statistical inference. As such, the PyImpuyte team set out to develop a Python package that simplifies the task of imputing missing values in Australian Government national statistical assets and other micro-data sets.

The development of PyImpuyte is motivated by helping micro-data practitioners select and implement advanced imputation methods. PyImpuyte adds a tool to the toolkit of practitioners seeking to preserve their data and fight the information loss that arises from dropping observations with missing values.

Main Features

  • Interfaces with scikit-learn to provide a customer-centric and efficient way to perform imputation using machine learning algorithms.
  • Support for numerous imputation strategies and performance metrics, as specified below:

Imputation Strategies

  • Univariate: Mean, Median, Mode
  • Generalised Linear Models: Linear Regressions, Lasso, Ridge
  • Bagging and Boosted Trees: Bagging Regressor, Extra Trees Regressor, Extreme Gradient Boosting, Random Forest Regressor, XGBoost, LightGBM, CatBoost
  • Neural Nets: Multi-layer Perceptron
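
For orientation, the univariate strategies listed above (mean, median, mode) correspond to what scikit-learn's `SimpleImputer` offers through its `strategy` parameter. The sketch below uses scikit-learn directly and is not PyImpuyte's interface; the sample values are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One column with a single missing entry (row index 3).
values = np.array([[1.0], [2.0], [2.0], [np.nan], [10.0]])

filled = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy)
    # Record the value that replaces the missing entry.
    filled[strategy] = float(imputer.fit_transform(values)[3, 0])

print(filled)
# → {'mean': 3.75, 'median': 2.0, 'most_frequent': 2.0}
```

The outlier (10.0) pulls the mean well above the median and mode, which is one reason model-based strategies are often preferred for skewed micro-data.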

Performance Metrics

Simple error
Percentage error
Naive forecasting
Relative Error
Bounded Relative Error
Geometric mean
Mean Squared Error
Normalized Root Mean Squared Error
Mean Error
Mean Absolute Error
Geometric Mean Absolute Error
Median Absolute Error
Mean Percentage Error
Mean Absolute Percentage Error
Median Absolute Percentage Error
Symmetric Mean Absolute Percentage Error
Symmetric Median Absolute Percentage Error
Mean Arctangent Absolute Percentage Error
Mean Absolute Scaled Error
Normalized Absolute Error
Normalized Absolute Percentage Error
Root Mean Squared Percentage Error
Root Median Squared Percentage Error
Root Mean Squared Scaled Error
Integral Normalized Root Squared Error
Root Relative Squared Error
Mean Relative Error
Median Relative Absolute Error
Geometric Mean Relative Absolute Error
Mean Bounded Relative Absolute Error
Unscaled Mean Bounded Relative Absolute Error
Mean Directional Accuracy
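
A few of the metrics named above can be sketched with NumPy as follows. These are common textbook definitions, written here as an assumption about what the names refer to; PyImpuyte's own implementations may differ, particularly for the symmetric variant, which has several definitions in the literature. The function names and sample arrays are illustrative only.

```python
import numpy as np


def mean_absolute_error(actual, predicted):
    return float(np.mean(np.abs(actual - predicted)))


def mean_squared_error(actual, predicted):
    return float(np.mean((actual - predicted) ** 2))


def mean_absolute_percentage_error(actual, predicted):
    # Undefined when any actual value is zero.
    return float(np.mean(np.abs((actual - predicted) / actual)))


def symmetric_mape(actual, predicted):
    # One common variant: mean of 2|a - p| / (|a| + |p|).
    return float(
        np.mean(2.0 * np.abs(actual - predicted) / (np.abs(actual) + np.abs(predicted)))
    )


actual = np.array([2.0, 4.0, 5.0, 10.0])
predicted = np.array([1.0, 4.0, 6.0, 8.0])

for metric in (
    mean_absolute_error,
    mean_squared_error,
    mean_absolute_percentage_error,
    symmetric_mape,
):
    print(metric.__name__, metric(actual, predicted))
```

In an imputation setting, `actual` would typically be observed values that were deliberately masked and `predicted` the values the model imputed for them.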

Versions and Dependencies

  • Python 3.7+
  • Dependencies:
    • missingno >= 0.4.1
    • numpy >= 1.15.4
    • pandas >= 0.20.3
    • scikit-learn >= 0.20.2
    • xgboost >= 0.83

Installation

There are two ways to install the PyImpuyte package:

  • Install PyImpuyte from PyPI (recommended):
pip install PyImpuyte==1.3.5
  • Install PyImpuyte from the Bitbucket source:
git clone https://bitbucket.csiro.au/scm/dde/pyimpuyte.git
cd pyimpuyte
python setup.py install

Quick Start

To start imputing missing values with PyImpuyte, pass it a config.json file. The following configuration can be used as a starting point:

{
    "pyimpuyte": {
        "input": [
            "data/synth_data_test.csv"
        ],
        "feature_list": ["TURNOVER", "WAGES", "SALES"],
        "target": "FTE",
        "skip_columns": null,
        "nrows": 1000,
        "drop_duplicates": true,
        "output": "out/synth_data_test.csv",
        "evaluation": "out/evaluation.csv"
    }
}

For more information about how to configure PyImpuyte, see our suggested template.
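
Before running an imputation, it can help to check that a configuration file is well formed. The sketch below parses the example configuration with Python's standard `json` module and performs two basic checks. The helper `load_config` and the choice of which keys to treat as required are assumptions for illustration, not part of PyImpuyte.

```python
import json

# Assumption: these keys are treated as mandatory for the purposes of this sketch.
REQUIRED_KEYS = {"input", "feature_list", "target", "output"}

CONFIG_TEXT = """
{
    "pyimpuyte": {
        "input": ["data/synth_data_test.csv"],
        "feature_list": ["TURNOVER", "WAGES", "SALES"],
        "target": "FTE",
        "skip_columns": null,
        "nrows": 1000,
        "drop_duplicates": true,
        "output": "out/synth_data_test.csv",
        "evaluation": "out/evaluation.csv"
    }
}
"""


def load_config(text):
    """Parse the JSON text and return the 'pyimpuyte' section (hypothetical helper)."""
    config = json.loads(text)["pyimpuyte"]
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError("missing config keys: " + ", ".join(sorted(missing)))
    if config["target"] in config["feature_list"]:
        raise ValueError("target must not also appear in feature_list")
    return config


config = load_config(CONFIG_TEXT)
print(config["target"], config["feature_list"])
```

Note that JSON `null` and `true` arrive in Python as `None` and `True`, so `skip_columns` and `drop_duplicates` can be tested directly.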

Contribute

We welcome all kinds of contributions that improve the performance of the currently published package. See the Contribution Guide for more details.

Citation

Please cite our work in your publications if it helps your research.

@inbook{inbook,
  author = {Suresh, Marcus and Taib, Ronnie and Zhao, Yanchang and Jin, Warren},
  year = {2019},
  month = {11},
  pages = {215-227},
  title = {Sharpening the BLADE: Missing Data Imputation Using Supervised Machine Learning},
  isbn = {978-3-030-35287-5},
  doi = {10.1007/978-3-030-35288-2_18}
}
@misc{Suresh2020_PyImpuyte,
  title={PyImpuyte},
  author={Suresh, Marcus et al.},
  year={2020},
  howpublished={\url{https://bitbucket.csiro.au/projects/DDE/repos/pyimpuyte}},
}

Developers and Maintainers

  • The developers began work to bring PyImpuyte into production in October 2019. PyImpuyte is actively maintained and there will be incremental improvements scheduled on a regular basis. The lead developers and maintainers are:

  • See the Developers page to get in touch with the PyImpuyte team.

Copyright

PyImpuyte is distributed under the MIT license. See LICENSE for details.
