mdatagen: A Python Library for the Generation of Artificial Missing Data

Overview

This package addresses a gap in machine learning research: the artificial generation of missing data. Santos et al. (2019) surveyed various strategies for both univariate and multivariate scenarios, but the Python community still lacks implementations of these strategies. Our Python library missing-data-generator (mdatagen) provides a comprehensive set of implementations of missing data mechanisms, covering Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR), so that users can simulate a range of real-world scenarios with absent observations. The library is designed for easy integration with existing Python-based data analysis workflows, including well-established modules such as scikit-learn and popular libraries for missing data visualization such as missingno, which makes it accessible to researchers.
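
To make the three mechanisms concrete, here is a minimal sketch in plain NumPy and pandas. It does not use the mdatagen API: the helper names (`mcar_mask`, `lowest_values_mask`), the toy columns `age` and `pressure`, and the "remove the lowest values" rule are assumptions chosen for illustration only. Under MCAR the missingness ignores the data, under MAR it depends on another observed feature, and under MNAR it depends on the values that go missing.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy data: two correlated numeric features (hypothetical, for illustration)
n = 200
age = rng.normal(50, 10, size=n)
pressure = 0.8 * age + rng.normal(0, 5, size=n)
df = pd.DataFrame({"age": age, "pressure": pressure})


def mcar_mask(n_rows, rate, rng):
    """MCAR: every row has the same chance of being selected."""
    n_miss = int(round(rate * n_rows))
    mask = np.zeros(n_rows, dtype=bool)
    mask[rng.choice(n_rows, size=n_miss, replace=False)] = True
    return mask


def lowest_values_mask(values, rate):
    """Select the rows holding the lowest `rate` fraction of `values`."""
    n_miss = int(round(rate * len(values)))
    mask = np.zeros(len(values), dtype=bool)
    mask[np.argsort(values)[:n_miss]] = True
    return mask


rate = 0.25

# MCAR: missingness in "pressure" ignores all values
mcar = df.copy()
mcar.loc[mcar_mask(n, rate, rng), "pressure"] = np.nan

# MAR: missingness in "pressure" depends on the observed feature "age"
mar = df.copy()
mar.loc[lowest_values_mask(df["age"].to_numpy(), rate), "pressure"] = np.nan

# MNAR: missingness in "pressure" depends on "pressure" itself
mnar = df.copy()
mnar.loc[lowest_values_mask(df["pressure"].to_numpy(), rate), "pressure"] = np.nan

for name, data in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(name, int(data["pressure"].isna().sum()))
```

All three frames end up with the same number of missing cells; what differs is which rows lose their value, and that is what distinguishes the mechanisms.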

This Python package is a collaboration between researchers at the Aeronautics Institute of Technology (Brazil) and the University of Coimbra (Portugal).

User Guide

Please refer to the univariate docs or multivariate docs for more implementation details.

Installation

Install the package with pip:

pip install mdatagen

API Usage

API usage is described in each of the following sections.

Code examples

To get started, here are basic usage examples of the MAR mechanism in both univariate and multivariate scenarios. We also illustrate how to use the histogram plot and how to evaluate imputation quality.

MAR univariate

import pandas as pd
from sklearn.datasets import load_iris

from mdatagen.univariate.uMAR import uMAR

# Load the data
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, 
                      columns=iris.feature_names)

X = iris_df.copy()   # Features
y = iris.target      # Label values

generator = uMAR(X=X, 
                  y=y, 
                  missing_rate=50, 
                  x_miss='sepal length (cm)',
                  x_obs='petal length (cm)')

# Generate missing data under the univariate MAR mechanism
generate_data = generator.rank()
print(generate_data.isna().sum())

MAR multivariate

import pandas as pd
from sklearn.datasets import load_iris

from mdatagen.multivariate.mMAR import mMAR

# Load the data
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, 
                      columns=iris.feature_names)

X = iris_df.copy()   # Features
y = iris.target      # Label values

generator = mMAR(X=X, 
                  y=y)

# Generate missing data under the multivariate MAR mechanism
generate_data = generator.correlated(missing_rate=25)
print(generate_data.isna().sum())
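
A quick sanity check after generating missing data is to compare the realized missing rate against the requested one. The following is a minimal sketch that assumes only pandas and NumPy. The helper `missing_rate_report` and the toy frame are hypothetical stand-ins, not part of mdatagen; in practice you would pass the DataFrame returned by a generator.

```python
import numpy as np
import pandas as pd


def missing_rate_report(data):
    """Return the percentage of missing cells per column and overall."""
    per_column = data.isna().mean() * 100
    overall = data.isna().to_numpy().mean() * 100
    return per_column, overall


# Toy frame standing in for a generated dataset (hypothetical values)
toy = pd.DataFrame({
    "x1": [1.0, np.nan, 3.0, np.nan],
    "x2": [np.nan, 2.0, 3.0, 4.0],
    "target": [0, 1, 0, 1],
})

per_column, overall = missing_rate_report(toy)
print(per_column.round(1))
print(round(overall, 1))
```

Note that the overall figure counts every column, including a label column if one is present, so it can be lower than the rate requested for the features alone.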

Histogram plot

import pandas as pd
from sklearn.datasets import load_iris

from mdatagen.univariate.uMCAR import uMCAR
from mdatagen.plots.plot import PlotMissingData

# Load the data
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, 
                        columns=iris.feature_names)

X = iris_df.copy()   # Features
y = iris.target      # Label values

# Create an instance with a 25% missing rate
# under the MCAR mechanism
generator = uMCAR(X=X, 
                  y=y, 
                  missing_rate=25, 
                  x_miss='petal length (cm)')

# Generate the missing data under the MCAR mechanism
generate_data = generator.random()


miss_plot = PlotMissingData(data_missing=generate_data,
                            data_original=X
                            )
miss_plot.visualize_miss("histogram",
                         col_missing="petal length (cm)",
                         save=True,
                         path_save_fig = "MCAR_iris.png")

Imputation Quality Evaluation: Mean Squared Error (MSE)

import pandas as pd
from sklearn.datasets import load_iris

from mdatagen.univariate.uMCAR import uMCAR
from mdatagen.metrics.metrics import EvaluateImputation

# Load the data
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, 
                        columns=iris.feature_names)

X = iris_df.copy()   # Features
y = iris.target      # Label values

# Create an instance with a 25% missing rate
# under the MCAR mechanism
generator = uMCAR(X=X, 
                  y=y, 
                  missing_rate=25, 
                  x_miss='petal length (cm)')

# Generate the missing data under the MCAR mechanism
generate_data = generator.random()

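# Drop the "target" column so the columns match X,
# then impute the missing values with zeros (a naive baseline)
fill_zero = generate_data.drop("target", axis=1).fillna(0)

# Calculate the metric: MSE
eval_metric = EvaluateImputation(
    data_imputed=fill_zero,
    data_original=X,
    metric="mean_squared_error")
print(eval_metric.show())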
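
For intuition about what a mean squared error score measures in this setting, the sketch below computes it by hand on a small hypothetical frame. It is not the implementation behind `EvaluateImputation`. The helper `imputation_mse` is an assumption introduced for illustration, and it averages squared differences only over the cells that were masked, which is one common convention (the library may average over all cells instead).

```python
import numpy as np
import pandas as pd


def imputation_mse(original, with_missing, imputed):
    """MSE between original and imputed values, restricted to masked cells."""
    mask = with_missing.isna().to_numpy()
    diff = original.to_numpy()[mask] - imputed.to_numpy()[mask]
    return float(np.mean(diff ** 2))


# Small hypothetical dataset with two masked cells
original = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                         "b": [10.0, 20.0, 30.0, 40.0]})

with_missing = original.copy()
with_missing.loc[1, "a"] = np.nan
with_missing.loc[3, "b"] = np.nan

# Two simple imputers: fill with zeros, or with each column's observed mean
zero_filled = with_missing.fillna(0)
mean_filled = with_missing.fillna(with_missing.mean())

mse_zero = imputation_mse(original, with_missing, zero_filled)
mse_mean = imputation_mse(original, with_missing, mean_filled)
print(mse_zero, mse_mean)
```

On this toy data, mean imputation scores a lower error than zero imputation, which illustrates how the metric can be used to rank imputation strategies against a known ground truth.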

Contributions

Contributions are welcome! Feel free to open issues, submit pull requests, or provide feedback.

Citation

If you use mdatagen in your research, please cite the original paper.

BibTeX entry:

@ARTICLE{Santos2019,
  author={Santos, Miriam Seoane and Pereira, Ricardo Cardoso and Costa, Adriana Fonseca and Soares, Jastin Pompeu and Santos, João and Abreu, Pedro Henriques},
  journal={IEEE Access}, 
  title={Generating Synthetic Missing Data: A Review by Missing Mechanism}, 
  year={2019},
  volume={7},
  number={},
  pages={11651-11667},
  doi={10.1109/ACCESS.2019.2891360}}

Acknowledgements

The authors gratefully acknowledge the Brazilian funding agency FAPESP (Fundação de Amparo à Pesquisa do Estado de São Paulo) under grants 2022/10553-6, 2023/13688-2, and 2021/06870-3. Moreover, this research was supported in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001, and by the Portuguese Recovery and Resilience Plan (PRR) through project C645008882-00000055 Center for Responsible AI.
