Intelligent imputation using tree-based and machine learning algorithms

Development Status
- 3 - Alpha
Intended Audience
- Developers
- Science/Research
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
Topic
- Scientific/Engineering
- Software Development

Project description

PyImpuyte

PyImpuyte is a Python 3.7+ package that simplifies the task of imputing missing values in datasets.

PyImpuyte was built with a strong customer-centric focus and leverages scikit-learn. It brings together various imputation strategies and harnesses machine learning algorithms to improve data coverage.

PyImpuyte gives the user exactly what they want: hassle-free deployment of machine learning algorithms. Simply ingest your data, set your target, pass in a feature matrix and select your chosen imputation strategy. The machine-generated imputed values are then appended to your dataframe.

To learn more about how to use PyImpuyte, check out our docs for a step-by-step guide.

Motivation
Installation
Quick Start
Contribute
Conferences and Meet-ups
Citation
Developers and Maintainers
Acknowledgements
Copyright

Motivation

Incomplete data are common and can degrade statistical inference. The PyImpuyte team therefore set out to develop a Python package that simplifies the task of imputing missing values in Australian Government national statistical assets and other micro-data sets.

The development of PyImpuyte is motivated by helping micro-data practitioners select and implement advanced imputation methods. PyImpuyte gives practitioners an additional tool for preserving their data and limiting the information loss that arises from dropping observations with missing values.

Main Features

Interfaces with scikit-learn to provide a customer-centric and efficient way to perform imputation using machine learning algorithms.
Support for numerous imputation strategies and performance metrics, as specified below:

Imputation Strategies

Univariate	Generalised Linear Models	Bagging and Boosted Trees	Neural Nets
Mean	Linear Regressions	Bagging Regressor	Multi-layer Perceptron
Median	Lasso	Extra Trees Regressor
Mode	Ridge	Extreme Gradient Boosting
		Random Forest Regressor
		XGBoost, LightGBM, CatBoost
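The general idea behind the model-based strategies in the table can be sketched with scikit-learn directly. This is an illustration only, not PyImpuyte's own API, which this page does not show: the function name `impute_target`, the `_imputed` column suffix and the synthetic data are all assumptions made for the example. A regressor is fitted on the rows where the target is observed and then used to predict the target for the rows where it is missing.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def impute_target(df, features, target, model=None):
    """Hypothetical helper: return a copy of df with a model-imputed target column."""
    if model is None:
        model = RandomForestRegressor(n_estimators=50, random_state=0)
    observed = df[target].notna()
    # Train only on rows where the target is known.
    model.fit(df.loc[observed, features], df.loc[observed, target])
    out = df.copy()
    out[target + "_imputed"] = out[target]
    if (~observed).any():
        # Fill the gaps with the model's predictions; observed values are kept.
        out.loc[~observed, target + "_imputed"] = model.predict(
            df.loc[~observed, features]
        )
    return out


# Synthetic data, invented for this sketch.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "TURNOVER": rng.uniform(100, 1000, 200),
    "WAGES": rng.uniform(50, 500, 200),
    "SALES": rng.uniform(100, 900, 200),
})
demo["FTE"] = demo["WAGES"] / 50 + rng.normal(0, 0.2, 200)
demo.loc[demo.index[::10], "FTE"] = np.nan  # blank out every tenth value

imputed = impute_target(demo, ["TURNOVER", "WAGES", "SALES"], "FTE")
print(imputed[["FTE", "FTE_imputed"]].head())
```

Any scikit-learn regressor (for example `Lasso`, `Ridge` or `ExtraTreesRegressor`) can be passed as `model` to try a different strategy from the table.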

Performance Metrics


Simple error
Percentage error
Naive forecasting
Relative Error
Bounded Relative Error
Geometric mean
Mean Squared Error
Normalized Root Mean Squared Error
Mean Error
Mean Absolute Error
Geometric Mean Absolute Error
Median Absolute Error
Mean Percentage Error
Mean Absolute Percentage Error
Median Absolute Percentage Error
Symmetric Mean Absolute Percentage Error
Symmetric Median Absolute Percentage Error
Mean Arctangent Absolute Percentage Error
Mean Absolute Scaled Error
Normalized Absolute Error
Normalized Absolute Percentage Error
Root Mean Squared Percentage Error
Root Median Squared Percentage Error
Root Mean Squared Scaled Error
Integral Normalized Root Squared Error
Root Relative Squared Error
Mean Relative Error
Median Relative Absolute Error
Geometric Mean Relative Absolute Error
Mean Bounded Relative Absolute Error
Unscaled Mean Bounded Relative Absolute Error
Mean Directional Accuracy
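To make a few of these metrics concrete, here is a minimal NumPy sketch of four of them. The function names are chosen for this example and are not taken from PyImpuyte. Definitions also vary between sources: in particular, the normalised RMSE below divides by the range of the actual values, which is one common convention and an assumption here.

```python
import numpy as np


def mean_absolute_error(actual, predicted):
    return float(np.mean(np.abs(actual - predicted)))


def root_mean_squared_error(actual, predicted):
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


def normalized_rmse(actual, predicted):
    # Assumption: normalise by the range of the actual values.
    spread = float(actual.max() - actual.min())
    return root_mean_squared_error(actual, predicted) / spread


def mean_absolute_percentage_error(actual, predicted):
    # Expressed as a fraction; undefined when any actual value is zero.
    return float(np.mean(np.abs((actual - predicted) / actual)))


# Toy values: "actual" are held-out true values, "predicted" are imputed ones.
actual = np.array([10.0, 20.0, 30.0, 40.0])
predicted = np.array([12.0, 18.0, 33.0, 37.0])

mae = mean_absolute_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
nrmse = normalized_rmse(actual, predicted)
mape = mean_absolute_percentage_error(actual, predicted)
print(mae, rmse, nrmse, mape)
```

In an imputation setting, metrics like these are typically computed by masking values that are actually known, imputing them, and comparing the imputed values with the originals.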

Versions and Dependencies

Python 3.7+
Dependencies:
- missingno >= 0.4.1
- numpy >= 1.15.4
- pandas >= 0.20.3
- scikit-learn >= 0.20.2
- xgboost >= 0.83

Installation

There are two ways to install the PyImpuyte package:

Install PyImpuyte from PyPI (recommended):

pip install PyImpuyte==1.3.5

Install PyImpuyte from the Bitbucket source:

git clone https://bitbucket.csiro.au/scm/dde/pyimpuyte.git
cd pyimpuyte
python setup.py install

Quick Start

To start imputing missing values with PyImpuyte, a config.json file must be supplied. The following example configuration can be used:

{
    "pyimpuyte": {
        "input": [
            "data/synth_data_test.csv"
        ],
        "feature_list": ["TURNOVER", "WAGES", "SALES"],
        "target": "FTE",
        "skip_columns": null,
        "nrows": 1000,
        "drop_duplicates": true,
        "output": "out/synth_data_test.csv",
        "evaluation": "out/evaluation.csv"
    }
}

For more information about how to configure PyImpuyte, see our suggested template.
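Before running anything, it can help to sanity-check a configuration of this shape. The sketch below uses only the Python standard library and is not part of PyImpuyte: the helper name `parse_config` and the choice of which keys to treat as required are assumptions made for illustration.

```python
import json

# The example configuration from above, inlined so the sketch is self-contained.
CONFIG_TEXT = """
{
    "pyimpuyte": {
        "input": ["data/synth_data_test.csv"],
        "feature_list": ["TURNOVER", "WAGES", "SALES"],
        "target": "FTE",
        "skip_columns": null,
        "nrows": 1000,
        "drop_duplicates": true,
        "output": "out/synth_data_test.csv",
        "evaluation": "out/evaluation.csv"
    }
}
"""

# Assumption: these keys are treated as mandatory for this check.
REQUIRED_KEYS = {"input", "feature_list", "target", "output"}


def parse_config(text):
    """Hypothetical helper: parse the JSON and run basic consistency checks."""
    settings = json.loads(text)["pyimpuyte"]
    missing = REQUIRED_KEYS - settings.keys()
    if missing:
        raise KeyError(f"config is missing keys: {sorted(missing)}")
    if settings["target"] in settings["feature_list"]:
        raise ValueError("target must not also appear in feature_list")
    return settings


settings = parse_config(CONFIG_TEXT)
print(settings["target"], settings["feature_list"])
```

To check a file on disk instead, read its contents (for example with `pathlib.Path("config.json").read_text()`) and pass the text to the same function.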

Contribute

We welcome all kinds of contributions that improve the performance of the currently published package. See the Contribution Guide for more details.

Conferences and Meet-ups

We presented our research, which led to the development of PyImpuyte, at the 2019 Australasian Joint Conference on Artificial Intelligence.
We will be presenting at the next Canberra Data Scientists Meet-up on 28 July 2020.

Citation

Please cite our work in your publications if it helps your research.

Conference Paper - Chapter 18 of AI2019: Advances in Artificial Intelligence.

@inbook{inbook,
  author = {Suresh, Marcus and Taib, Ronnie and Zhao, Yanchang and Jin, Warren},
  year = {2019},
  month = {11},
  pages = {215-227},
  title = {Sharpening the BLADE: Missing Data Imputation Using Supervised Machine Learning},
  isbn = {978-3-030-35287-5},
  doi = {10.1007/978-3-030-35288-2_18}
}

Python Package - PyImpuyte.

@misc{Suresh2020_PyImpuyte,
  title={PyImpuyte},
  author={Suresh, Marcus et al.},
  year={2020},
  howpublished={\url{https://bitbucket.csiro.au/projects/DDE/repos/pyimpuyte}},
}

Developers and Maintainers

The developers began work to bring PyImpuyte into production in October 2019. PyImpuyte is actively maintained, with incremental improvements scheduled on a regular basis. The lead developers and maintainers are:
- Marcus Suresh, Bitbucket: sur033 and GitHub: marcus-suresh
- Ronnie Taib, GitHub: rtaib
See the Developers page to get in touch with the PyImpuyte team.

Acknowledgements

This research was funded by the Australian Government through the Department of Industry, Science, Energy and Resources (DISER) and the Data Integration Partnership for Australia (DIPA).
The developers would like to extend their gratitude to Dr. Abrie Swanepoel (Branch Manager) and Dr. Tala Talgasawatta (Director) from DISER for their ongoing support of PyImpuyte.

Copyright

PyImpuyte is distributed under the MIT license. See LICENSE for details.

Release history

1.3.5 (this version), Mar 17, 2020
1.3.4, Mar 17, 2020

Download files

Source Distribution

PyImpuyte-1.3.5.tar.gz (5.3 kB)

Uploaded Mar 17, 2020 Source

Built Distribution

PyImpuyte-1.3.5-py3-none-any.whl (5.5 kB)

Uploaded Mar 17, 2020 Python 3

File details

Details for the file PyImpuyte-1.3.5.tar.gz.

File metadata

Download URL: PyImpuyte-1.3.5.tar.gz
Upload date: Mar 17, 2020
Size: 5.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.7.4

File hashes

Hashes for PyImpuyte-1.3.5.tar.gz
Algorithm	Hash digest
SHA256	`63d6bae2581868123a5207488798cdbe7cec2dc302a27df1116e72f4145edd8e`
MD5	`c8b4c55483d18133d9d694f266858e2a`
BLAKE2b-256	`b01f4f2bff3c9e0a781de864549cb2945a164601d887254bde76923b6147ca3c`

File details

Details for the file PyImpuyte-1.3.5-py3-none-any.whl.

File metadata

Download URL: PyImpuyte-1.3.5-py3-none-any.whl
Upload date: Mar 17, 2020
Size: 5.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.7.4

File hashes

Hashes for PyImpuyte-1.3.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d331ab30141acadf25d59882a7919b73ba45b64cb7005d8d16fc0f2441669da1`
MD5	`14c8d9edce844ae91d5d53cf3cb332f2`
BLAKE2b-256	`f9e6b175caa493b73af34db734da600ffcc935d263cec039e09b6303202a8d0c`
