A package for synthetic data generation for imputation using single and multiple imputation methods.

ML-Impute

A python package for synthetic data generation using single and multiple imputation.

ML-Impute is a library for generating synthetic data to impute null values, notably with the ability to handle mixed datatypes. The package is based on the research of Audigier, Husson, and Josse and their method of iterative factor analysis for single imputation.
The goals of this package are to:
(a) provide an open-source Python implementation of this method for the first time, and
(b) provide an efficient parallelization of the algorithm when extending it to both single and multiple imputation.

Note: I am currently a university student and may not have the time to release updates and changes as quickly as some other packages do. In the spirit of open-source code, please feel free to submit pull requests or open a new issue if you have bug fixes or improvements. Thank you for your understanding and for your contributions.

Table of Contents
Installation
Usage
Example
License

Installation

ML-Impute is currently available on PyPI.

Unix/Mac OS/Windows

pip install ml-impute

Usage

Currently, ML-Impute can handle both single and multiple imputation.

For a demonstration of both methods, see the Example section.

The following subsections give an overview of each method along with usage information.

After installing the package via pip, import the generator module and instantiate a Generator object:

from mpute import generator

gen = generator.Generator()

Generator.generate(self, dataframe, encode_cols, exclude_cols, max_iter, tol, explained_var, method, n_versions, noise, engine)

Parameter	Description
dataframe	(*required*) Pandas dataframe object
encode_cols	(optional, default=[]) Categorical columns to be encoded. By default, ml-impute encodes all columns with object or category dtypes. However, many datasets contain numerical categorical data (e.g. Likert scales, classification types) that should also be encoded; list those columns here.
exclude_cols	(optional, default=[]) Categorical columns to be excluded from encoding and/or imputation. On occasion, datasets contain unique non-ordinal data (such as unique IDs) that, if encoded, lead to large increases in memory usage and runtime. These columns should be excluded.
max_iter	(optional, default=1000) The maximum number of imputation iterations before exit.
tol	(optional, default=1e-4) Tolerance bound for convergence. If the relative error in Frobenius norm falls below tol before max_iter is reached, iteration stops.
explained_var	(optional, default=0.95) Fraction of the total variance kept when reconstructing the dataframe after performing Singular Value Decomposition.
method	(optional, default="single") Selects single or multiple imputation. Possible values: ["single", "multiple"]
n_versions	(optional, default=20) For multiple imputation, the number of generated dataframes. For single imputation, n_versions=1.
noise	(optional, default="gaussian") For multiple imputation, the type of noise added to each generated dataset to create variation. Gaussian noise is centered at 0 with a standard deviation of 0.1. For single imputation, noise=None.
engine	(optional, default="default") For either single or multiple imputation, the engine used to compute the SVD. Possible values: ["default", "dask"]. "default" uses the JAX numpy library for efficient SVD calculation and multiprocessing, and is recommended for speed. "dask" creates a dask distributed scheduler which is used to compute the SVD. Because the method is iterative, "dask" is recommended only when working with very large datasets.
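
To make the interplay of max_iter, tol, and explained_var more concrete, here is a minimal sketch of an iterative SVD imputation loop on a purely numeric array. It is an illustration under stated assumptions, not ML-Impute's actual implementation: the function name iterative_svd_impute is hypothetical, the mixed-datatype encoding step is omitted, every column is assumed to have at least one observed value, and the explained variance is assumed to be measured with squared singular values.

```python
import numpy as np


def iterative_svd_impute(X, max_iter=1000, tol=1e-4, explained_var=0.95):
    """Fill NaNs in a numeric 2-D array by repeated low-rank SVD reconstruction.

    Hypothetical helper for illustration only (not part of ML-Impute).
    Returns the filled array and the number of iterations performed.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Start from column means so the first SVD sees no NaNs.
    col_means = np.nanmean(X, axis=0)
    filled = np.where(missing, col_means, X)

    iteration = 0
    for iteration in range(1, max_iter + 1):
        # Centre the columns, then decompose.
        mu = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mu, full_matrices=False)

        # Smallest rank whose squared singular values reach the requested
        # share of total variance (assumed definition of explained_var).
        total = np.sum(s**2)
        share = np.cumsum(s**2) / total if total > 0 else np.ones_like(s)
        k = min(int(np.searchsorted(share, explained_var)) + 1, len(s))

        # Rank-k reconstruction, with the column means added back.
        low_rank = (U[:, :k] * s[:k]) @ Vt[:k] + mu

        # Only the originally missing cells are overwritten.
        updated = np.where(missing, low_rank, filled)

        # Relative change in Frobenius norm between successive iterates.
        denom = np.linalg.norm(filled)
        rel_err = np.linalg.norm(updated - filled) / denom if denom > 0 else 0.0
        filled = updated
        if rel_err < tol:
            break

    return filled, iteration


# Small demo: the third column is three times the first, with one gap.
X_demo = np.array(
    [
        [1.0, 2.0, 3.0],
        [2.0, 4.0, 6.0],
        [3.0, 6.0, np.nan],
        [4.0, 8.0, 12.0],
    ]
)
filled, n_iter = iterative_svd_impute(X_demo)
print(filled[2, 2], n_iter)
```

In this sketch a smaller tol or a larger explained_var generally means more iterations or a higher-rank reconstruction, which is the trade-off the parameters above control.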

Method	Return Value
"single"	imputed_df: a copy of the dataframe argument with synthetic data imputed for all null values
"multiple"	df_dict: a dictionary containing each of the n_versions generated datasets, each with different synthetic data. Keys are the integers 0 through n_versions - 1; values are the corresponding dataframes.
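
As a hedged illustration of working with the dictionary returned by method="multiple", the sketch below builds a small stand-in dictionary by hand and pools a column mean across versions. The dataframes, the "age" column, and its values are made up for the example; only the shape of the dictionary (integer keys, dataframe values) follows the table above.

```python
import pandas as pd

# Stand-in for the returned dictionary: integer keys 0..n_versions-1,
# one dataframe per key. Values are invented for illustration.
df_dict = {
    0: pd.DataFrame({"age": [22.0, 30.1, 41.0]}),
    1: pd.DataFrame({"age": [22.0, 28.7, 41.0]}),
    2: pd.DataFrame({"age": [22.0, 31.3, 41.0]}),
}

# Stack all versions into one frame indexed by (version, row).
stacked = pd.concat(df_dict, names=["version", "row"])

# Estimate the quantity of interest within each version...
per_version_mean = stacked.groupby(level="version")["age"].mean()

# ...then pool: the average of the per-version estimates, and the
# between-version variance as a measure of imputation uncertainty.
pooled_mean = per_version_mean.mean()
between_var = per_version_mean.var(ddof=1)

print(per_version_mean)
print(pooled_mean, between_var)
```

Computing the estimate per version first and pooling afterwards keeps the variation between imputed datasets visible, which is the main reason to generate several versions rather than one.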

Single Imputation

Single imputation works with the following line:

imputed_df = gen.generate(dataframe)

Multiple Imputation

Multiple imputation is as simple as the following:

imputed_dfs = gen.generate(dataframe, method="multiple")
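
To give a feel for how Gaussian noise can create variation between versions, here is a small standalone sketch in plain Python. It is not the package's code: the helper make_noisy_versions is hypothetical, plain lists stand in for a dataframe column, and applying noise only to the imputed cells is an assumption. The mean of 0 and standard deviation of 0.1 follow the noise parameter description.

```python
import random


def make_noisy_versions(values, imputed_mask, n_versions=20, sigma=0.1, seed=0):
    """Return {version: values} where imputed cells get N(0, sigma) noise.

    Hypothetical helper for illustration; observed cells are left untouched.
    """
    rng = random.Random(seed)  # seeded so the demo is reproducible
    versions = {}
    for version in range(n_versions):
        versions[version] = [
            v + rng.gauss(0.0, sigma) if was_imputed else v
            for v, was_imputed in zip(values, imputed_mask)
        ]
    return versions


# One column with a single imputed entry (the middle one).
column = [22.0, 29.5, 41.0]
mask = [False, True, False]
noisy = make_noisy_versions(column, mask, n_versions=3)
for version, vals in noisy.items():
    print(version, vals)
```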

Example

For the following example, we will use the Titanic dataset from OpenML, fetched with sklearn.datasets.fetch_openml.

Build the titanic dataset and create a Generator object as follows:

import pandas as pd
from mpute import generator
from sklearn import datasets

titanic, target = datasets.fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
titanic['survived'] = target

gen = generator.Generator()

Single Imputation

imputed_df = gen.generate(titanic, exclude_cols=['name', 'cabin', 'ticket'])

Note: 'name', 'cabin', and 'ticket' are excluded because they mainly contain unique identifiers. They are therefore unnecessary for imputation and, if encoded, would significantly increase memory usage.
It is possible to replace the cabin column with two columns such as 'deck' and 'position', as these may be determinants of survival. However, this preprocessing would have to be done before calling generate.
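
The cabin preprocessing mentioned above could be sketched as follows. This is only one possible reading: the helper split_cabin is hypothetical, the deck is taken to be the first letter of the cabin string, and the cabin number is used as a stand-in for 'position'.

```python
import re


def split_cabin(cabin):
    """Split a cabin string such as "C85" into (deck, number).

    Hypothetical helper. Missing or non-string values give (None, None);
    entries listing several cabins use the first letter and first number.
    """
    if not isinstance(cabin, str) or not cabin.strip():
        return None, None
    text = cabin.strip()
    deck = text[0].upper() if text[0].isalpha() else None
    match = re.search(r"\d+", text)
    number = int(match.group()) if match else None
    return deck, number


for raw in ["C85", "B57 B59 B63 B66", "D", None]:
    print(raw, "->", split_cabin(raw))
```

Applying such a function to each row and storing the two results as new columns, then excluding the original 'cabin' column, would give the generator a low-cardinality 'deck' column to encode instead of hundreds of unique cabin strings.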

Multiple Imputation

Multiple imputation is as simple as the following:

imputed_dfs = gen.generate(titanic, method="multiple")

That's all there is to it. Happy using!

License

ML-Impute is published under the MIT License. Please see the LICENSE file for more information.

Release history

Version	Release date
0.0.7 (this version)	Feb 22, 2023
0.0.6	Feb 22, 2023
0.0.5	Feb 22, 2023
0.0.4	Feb 22, 2023

Download files

Source Distribution

ml_impute-0.0.7.tar.gz (11.3 kB)

Uploaded Feb 22, 2023 Source

Built Distribution

ml_impute-0.0.7-py3-none-any.whl (13.0 kB)

Uploaded Feb 22, 2023 Python 3

File details

Details for the file ml_impute-0.0.7.tar.gz.

File metadata

Download URL: ml_impute-0.0.7.tar.gz
Upload date: Feb 22, 2023
Size: 11.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.8.9

File hashes

Hashes for ml_impute-0.0.7.tar.gz
Algorithm	Hash digest
SHA256	`a75835858573f0b3bd43f7d04e852ccb71210b13244b270eb36cedaa13538e1f`
MD5	`eb4e059786194fa9d6a32336935b5582`
BLAKE2b-256	`9f831541e336b2ee323d50fac5cf64e8982f3300746afce700ef1a2debf38786`

File details

Details for the file ml_impute-0.0.7-py3-none-any.whl.

File metadata

Download URL: ml_impute-0.0.7-py3-none-any.whl
Upload date: Feb 22, 2023
Size: 13.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.8.9

File hashes

Hashes for ml_impute-0.0.7-py3-none-any.whl
Algorithm	Hash digest
SHA256	`85a7a70431c6123e94c3a6f2982418113c78d030999d61e314b6b51e799a09e9`
MD5	`6eb3833707ee7075658194745a3ddb25`
BLAKE2b-256	`d350981e67b2da8a9766a0c1580a5454a5d44af94e5ecad52b2ef65a1a6d8e39`
