Skip to main content

A Python library for optimal data imputation.

Project description

GitHubActions ReadTheDocs License PythonVersion PyPi Release Commits Codecov

https://raw.githubusercontent.com/Quantmetry/qolmat/main/docs/images/logo.png

Qolmat - The Tool for Data Imputation

Qolmat provides a convenient way to estimate optimal data imputation techniques by leveraging scikit-learn-compatible algorithms. Users can compare various methods based on different evaluation metrics.

🔗 Requirements

Python 3.8+

🛠 Installation

Qolmat can be installed in different ways:

$ pip install qolmat  # installation via `pip`
$ pip install qolmat[pytorch] # if you need ImputerDiffusion relying on pytorch
$ pip install git+https://github.com/Quantmetry/qolmat  # or directly from the github repository

⚡️ Quickstart

Let us start with a basic imputation problem. We generate one-dimensional noisy time series with missing values. With just these few lines of code, you can see how easy it is to

  • impute missing values with one particular imputer;

  • benchmark multiple imputation methods with different metrics.

import numpy as np
import pandas as pd

from qolmat.benchmark import comparator, missing_patterns
from qolmat.imputations import imputers
from qolmat.utils import data

# load and prepare csv data

df_data = data.get_data("Beijing")
columns = ["TEMP", "PRES", "WSPM"]
df_data = df_data[columns]
df_with_nan = data.add_holes(df_data, ratio_masked=0.2, mean_size=120)

# impute and compare
imputer_median = imputers.ImputerSimple(groups=("station",))
imputer_interpol = imputers.ImputerInterpolation(method="linear", groups=("station",))
imputer_var1 = imputers.ImputerEM(model="VAR", groups=("station",), method="mle", max_iter_em=50, n_iter_ou=15, dt=1e-3, p=1)
dict_imputers = {
      "median": imputer_median,
      "interpolation": imputer_interpol,
      "VAR(1) process": imputer_var1
  }
generator_holes = missing_patterns.EmpiricalHoleGenerator(n_splits=4, ratio_masked=0.1)
comparison = comparator.Comparator(
      dict_imputers,
      generator_holes = generator_holes,
      metrics = ["mae", "wmape", "kl_columnwise", "frechet"],
  )
results = comparison.compare(df_with_nan)
results.style.highlight_min(color="lightsteelblue", axis=1)
https://raw.githubusercontent.com/Quantmetry/qolmat/main/docs/images/readme_tabular_comparison.png

📘 Documentation

The full documentation can be found on this link.

How does Qolmat work ?

Qolmat allows model selection for scikit-learn compatible imputation algorithms, by performing three steps pictured below:

  1. For each of the K folds, Qolmat artificially masks a set of observed values using a default or user specified hole generator.

  2. For each fold and each compared imputation method, Qolmat fills both the missing and the masked values, then computes each of the default or user specified performance metrics.

  3. For each compared imputer, Qolmat pools the computed metrics from the K folds into a single value.

This is very similar in spirit to the cross_val_score function for scikit-learn.

https://raw.githubusercontent.com/Quantmetry/qolmat/main/docs/images/schema_qolmat.png

Imputation methods

The following table contains the available imputation methods. We distinguish single imputation methods (aiming for pointwise accuracy, mostly deterministic) from multiple imputation methods (aiming for distribution similarity, mostly stochastic). For further details regarding the distinction between single and multiple imputation, you can refer to the Imputation article on Wikipedia.

Method

Description

Tabular or Time series

Single or Multiple

mean

Imputes the missing values using the mean along each column

tabular

single

median

Imputes the missing values using the median along each column

tabular

single

LOCF

Imputes missing entries by carrying the last observation forward for each columns

time series

single

shuffle

Imputes missing entries with the random value of each column

tabular

multiple

interpolation

Imputes missing using some interpolation strategies supported by pd.Series.interpolate

time series

single

impute on residuals

The series are de-seasonalised, residuals are imputed via linear interpolation, then residuals are re-seasonalised

time series

single

MICE

Multiple Imputation by Chained Equation

tabular

both

RPCA

Robust Principal Component Analysis

both

single

SoftImpute

Iterative method for matrix completion that uses nuclear-norm regularization

tabular

single

KNN

K-nearest kneighbors

tabular

single

EM sampler

Imputes missing values via EM algorithm

both

both

MLP

Imputer based Multi-Layers Perceptron Model

both

both

Autoencoder

Imputer based Autoencoder Model with Variationel method

both

both

TabDDPM

Imputer based on Denoising Diffusion Probabilistic Models

both

both

📝 Contributing

You are welcome to propose and contribute new ideas. We encourage you to open an issue so that we can align on the work to be done. It is generally a good idea to have a quick discussion before opening a pull request that is potentially out-of-scope. For more information on the contribution process, please go here.

🤝 Affiliation

Qolmat has been developed by Quantmetry.

Quantmetry

🔍 References

[1] Candès, Emmanuel J., et al. “Robust principal component analysis?.” Journal of the ACM (JACM) 58.3 (2011): 1-37, (pdf)

[2] Wang, Xuehui, et al. “An improved robust principal component analysis model for anomalies detection of subway passenger flow.” Journal of advanced transportation 2018 (2018). (pdf)

[3] Chen, Yuxin, et al. “Bridging convex and nonconvex optimization in robust PCA: Noise, outliers, and missing data.” Annals of statistics, 49(5), 2948 (2021), (pdf)

[4] Shahid, Nauman, et al. “Fast robust PCA on graphs.” IEEE Journal of Selected Topics in Signal Processing 10.4 (2016): 740-756. (pdf)

[5] Jiashi Feng, et al. “Online robust pca via stochastic optimization.“ Advances in neural information processing systems, 26, 2013. (pdf)

[6] García, S., Luengo, J., & Herrera, F. “Data preprocessing in data mining”. 2015. (pdf)

[7] Botterman, HL., Roussel, J., Morzadec, T., Jabbari, A., Brunel, N. “Robust PCA for Anomaly Detection and Data Imputation in Seasonal Time Series” (2022) in International Conference on Machine Learning, Optimization, and Data Science. Cham: Springer Nature Switzerland, (pdf)

📝 License

Qolmat is free and open-source software licensed under the BSD 3-Clause license.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

qolmat-0.2.0.tar.gz (18.1 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

qolmat-0.2.0-py3-none-any.whl (16.4 MB view details)

Uploaded Python 3

File details

Details for the file qolmat-0.2.0.tar.gz.

File metadata

  • Download URL: qolmat-0.2.0.tar.gz
  • Upload date:
  • Size: 18.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for qolmat-0.2.0.tar.gz
Algorithm Hash digest
SHA256 ed3176955c748ea412c7968ea427ba95b4be8e73e0a31b16a8789e7f2411a30e
MD5 f606487f6cc60f9e39aa7ae93b9d198f
BLAKE2b-256 8b77b0b26f9e5d8d86712f3e141699e0c74ae271dddaf2542e5d27776f0db941

See more details on using hashes here.

Provenance

The following attestation bundles were made for qolmat-0.2.0.tar.gz:

Publisher: publish.yml on scikit-learn-contrib/qolmat

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file qolmat-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: qolmat-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 16.4 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for qolmat-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 d898be8de5cf56891f65f931b1b1663a40f6515496a0fd4523fed1559a6c7347
MD5 51930bb8fba446423f32664adde174fe
BLAKE2b-256 af5481554f55ddbb178a52205d6673dcf98886f7d5b573c2bc46cea1446c2fed

See more details on using hashes here.

Provenance

The following attestation bundles were made for qolmat-0.2.0-py3-none-any.whl:

Publisher: publish.yml on scikit-learn-contrib/qolmat

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page