
Tools to impute

Project description

Authors: hlbotterman@quantmetry.com, jroussel@quantmetry.com, tmorzadec@quantmetry.com, rhajou@quantmetry.com, fdakhli@quantmetry.com

License: new BSD
Project URLs:
  • Bug Tracker: https://github.com/Quantmetry/qolmat
  • Documentation: https://qolmat.readthedocs.io/en/latest/
  • Source Code: https://github.com/Quantmetry/qolmat
Requires-Python: >=3.8
Provides-Extra: tests, docs, pytorch
Description-Content-Type: text/x-rst
License files: LICENSE, AUTHORS.rst
Classifiers: Intended Audience :: Science/Research; Intended Audience :: Developers; License :: OSI Approved; Topic :: Software Development; Topic :: Scientific/Engineering; Operating System :: Microsoft :: Windows; Operating System :: POSIX; Operating System :: Unix; Operating System :: MacOS; Programming Language :: Python :: 3.8; Programming Language :: Python :: 3.9; Programming Language :: Python :: 3.10


https://raw.githubusercontent.com/Quantmetry/qolmat/main/docs/images/logo.png

Qolmat - The Tool for Data Imputation

Qolmat provides a convenient way to estimate optimal data imputation techniques by leveraging scikit-learn-compatible algorithms. Users can compare various methods based on different evaluation metrics.

🔗 Requirements

Python 3.8+

🛠 Installation

Install via pip:

$ pip install qolmat

If you need to use pytorch (the optional extra declared by the package metadata above), you can install it with the following pip command:

$ pip install qolmat[pytorch]

To install directly from the GitHub repository:

$ pip install git+https://github.com/Quantmetry/qolmat
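
To check that the installation succeeded, importing the package in a Python session is enough. This is a minimal sanity check, not part of the official instructions:

import qolmat  # succeeds only if the installation worked
print(qolmat.__file__)  # location of the installed package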

⚡️ Quickstart

Let us start with a basic imputation problem. Here, we generate one-dimensional noisy time series.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(42)
t = np.linspace(0, 1, 1000)
y = np.cos(2 * np.pi * t * 10) + np.random.randn(1000) / 2
df = pd.DataFrame({'y': y}, index=pd.Series(t, name='index'))

For this demonstration, let us create artificial holes in our dataset.

from qolmat.utils.data import add_holes
plt.rcParams.update({'font.size': 18})

ratio_masked = 0.1
mean_size = 20
df_with_nan = add_holes(df, ratio_masked=ratio_masked, mean_size=mean_size)
is_na = df_with_nan['y'].isna()

plt.figure(figsize=(25, 4))
plt.plot(df_with_nan['y'], '.')
plt.plot(df.loc[is_na, 'y'], '.')
plt.grid()
plt.xlim(0, 1)

plt.legend(['Data', 'Missing data'])
plt.savefig('readme1.png')
plt.show()
https://raw.githubusercontent.com/Quantmetry/qolmat/main/docs/images/readme1.png
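
As a quick sanity check (not part of the original snippet), the fraction of values actually masked by add_holes can be compared with the requested ratio_masked:

# Share of values masked by add_holes; should be close to ratio_masked (0.1).
print(f"Fraction of missing values: {is_na.mean():.3f}")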

To impute missing data, several methods can be imported with from qolmat.imputations import imputers. Building a dictionary of imputers will enable us to benchmark the various imputation methods against one another.

from sklearn.linear_model import LinearRegression
from qolmat.imputations import imputers

imputer_mean = imputers.ImputerMean()
imputer_median = imputers.ImputerMedian()
imputer_mode = imputers.ImputerMode()
imputer_locf = imputers.ImputerLOCF()
imputer_nocb = imputers.ImputerNOCB()
imputer_interpol = imputers.ImputerInterpolation(method="linear")
imputer_spline = imputers.ImputerInterpolation(method="spline", order=2)
imputer_shuffle = imputers.ImputerShuffle()
imputer_residuals = imputers.ImputerResiduals(period=10, model_tsa="additive", extrapolate_trend="freq", method_interpolation="linear")
imputer_rpca = imputers.ImputerRPCA(columnwise=True, period=10, max_iter=200, tau=2, lam=0.3)
imputer_rpca_opti = imputers.ImputerRPCA(columnwise=True, period=10, max_iter=100)
imputer_ou = imputers.ImputerEM(model="multinormal", method="sample", max_iter_em=34, n_iter_ou=15, dt=1e-3)
imputer_tsou = imputers.ImputerEM(model="VAR1", method="sample", max_iter_em=34, n_iter_ou=15, dt=1e-3)
imputer_tsmle = imputers.ImputerEM(model="VAR1", method="mle", max_iter_em=34, n_iter_ou=15, dt=1e-3)
imputer_knn = imputers.ImputerKNN(k=10)
imputer_mice = imputers.ImputerMICE(estimator=LinearRegression(), sample_posterior=False, max_iter=100, missing_values=np.nan)
imputer_regressor = imputers.ImputerRegressor(estimator=LinearRegression())

dict_imputers = {
    "mean": imputer_mean,
    "median": imputer_median,
    "mode": imputer_mode,
    "interpolation": imputer_interpol,
    "spline": imputer_spline,
    "shuffle": imputer_shuffle,
    "residuals": imputer_residuals,
    "OU": imputer_ou,
    "TSOU": imputer_tsou,
    "TSMLE": imputer_tsmle,
    "RPCA": imputer_rpca,
    "RPCA_opti": imputer_rpca_opti,
    "locf": imputer_locf,
    "nocb": imputer_nocb,
    "knn": imputer_knn,
    "ols": imputer_regressor,
    "mice_ols": imputer_mice,
}
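
Each imputer can also be used on its own, outside the benchmark, through its scikit-learn-like fit_transform method. A minimal sketch using the median imputer defined above:

# Apply a single imputer directly: fit_transform returns a DataFrame
# with the missing entries of df_with_nan filled in.
df_imputed_median = imputer_median.fit_transform(df_with_nan)
print(df_imputed_median['y'].isna().sum())  # 0 remaining missing values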

It is possible to define a search space for an imputer's hyperparameters with three pieces of information per parameter: min, max and type. The benchmark uses this dictionary to determine the optimal parameters for data imputation. Here, we call this dictionary dict_config_opti.

dict_config_opti = {
    "RPCA_opti": {
        "tau": {"min": 0.5, "max": 5, "type": "Real"},
        "lam": {"min": 0.1, "max": 1, "type": "Real"},
    }
}

Then, using the Comparator from qolmat.benchmark, we can compare the different imputation methods. The evaluation does not rely on knowledge of the true missing values; instead, it masks additional observed values and scores each imputer on them. For more details on how the imputers and the comparator work, please refer to the documentation.

from qolmat.benchmark import comparator, missing_patterns

generator_holes = missing_patterns.EmpiricalHoleGenerator(n_splits=4, ratio_masked=0.1)

comparison = comparator.Comparator(
    dict_imputers,
    ['y'],
    generator_holes=generator_holes,
    metrics=["mae", "wmape", "KL_columnwise", "ks_test", "energy"],
    n_calls_opt=10,
    dict_config_opti=dict_config_opti,
)
results = comparison.compare(df_with_nan)

We can observe the benchmark results.
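
A minimal way to inspect them, assuming results is a pandas DataFrame of metric values with one column per imputer (the exact layout may differ between Qolmat versions):

# Inspect the benchmark scores returned by Comparator.compare.
print(results)

# Optionally keep a copy of the scores for later analysis (hypothetical file name).
results.to_csv("benchmark_results.csv")

Here the TSMLE imputer is retained as the best candidate; below, it is fitted on the data with holes and the imputed values at the masked positions are plotted against the original series.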

dfs_imputed = imputer_tsmle.fit_transform(df_with_nan)

plt.figure(figsize=(25, 5))

plt.plot(df.loc[~is_na, 'y'], '.')
plt.plot(df.loc[is_na, 'y'], '.')
plt.plot(dfs_imputed.loc[is_na, 'y'], '.')

plt.grid()
plt.xlim(0, 1)
plt.legend(['Data','Missing data', 'Imputed data'])
plt.savefig('readme3.png')
plt.show()
https://raw.githubusercontent.com/Quantmetry/qolmat/main/docs/images/readme2.png

Finally, we keep the best imputer, TSMLE, and plot the full imputed series together with the original data and the data with holes.

dfs_imputed = imputer_tsmle.fit_transform(df_with_nan)

plt.figure(figsize=(25, 5))
plt.plot(df['y'], '.g')            # full original series (green)
plt.plot(dfs_imputed['y'], '.r')   # imputed series (red)
plt.plot(df_with_nan['y'], '.b')   # observed series with holes (blue)
plt.show()
https://raw.githubusercontent.com/Quantmetry/qolmat/main/docs/images/readme3.png
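
Because the ground truth is known at the masked positions in this synthetic example, the imputation error can also be computed directly with pandas (a quick sketch, independent of the benchmark above):

# Mean absolute error of the TSMLE imputation on the artificially masked points.
mae_tsmle = (df.loc[is_na, 'y'] - dfs_imputed.loc[is_na, 'y']).abs().mean()
print(f"MAE on masked values: {mae_tsmle:.3f}")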

📘 Documentation

The full documentation can be found at https://qolmat.readthedocs.io/en/latest/.

📝 Contributing

You are welcome to propose and contribute new ideas. We encourage you to open an issue so that we can align on the work to be done; it is generally a good idea to have a quick discussion before opening a pull request that is potentially out of scope. For more information on the contribution process, please refer to the contribution guidelines in the repository.

🤝 Affiliation

Qolmat has been developed by Quantmetry.


🔍 References

Qolmat's imputation methods build mainly on the robust principal component analysis (RPCA) and data preprocessing literature listed below.

[1] Candès, Emmanuel J., et al. “Robust principal component analysis?.” Journal of the ACM (JACM) 58.3 (2011): 1-37, (pdf)

[2] Wang, Xuehui, et al. “An improved robust principal component analysis model for anomalies detection of subway passenger flow.” Journal of advanced transportation 2018 (2018). (pdf)

[3] Chen, Yuxin, et al. “Bridging convex and nonconvex optimization in robust PCA: Noise, outliers, and missing data.” arXiv preprint arXiv:2001.05484 (2020), (pdf)

[4] Shahid, Nauman, et al. “Fast robust PCA on graphs.” IEEE Journal of Selected Topics in Signal Processing 10.4 (2016): 740-756. (pdf)

[5] Feng, Jiashi, et al. “Online robust PCA via stochastic optimization.” Advances in Neural Information Processing Systems 26 (2013). (pdf)

[6] García, S., Luengo, J., & Herrera, F. “Data preprocessing in data mining”. 2015. (pdf)

📝 License

Qolmat is free and open-source software licensed under the BSD 3-Clause license.

Project details


Download files

Download the file for your platform.

Source Distribution

qolmat-0.0.15.tar.gz (65.5 kB)

Uploaded Source

Built Distribution

qolmat-0.0.15-py3-none-any.whl (75.4 kB)

Uploaded Python 3

File details

Details for the file qolmat-0.0.15.tar.gz.

File metadata

  • Download URL: qolmat-0.0.15.tar.gz
  • Upload date:
  • Size: 65.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.17

File hashes

Hashes for qolmat-0.0.15.tar.gz:
  • SHA256: ee070fc311fde5ce6b976618726522f5e0d9502ea86228fb7a85e31473c6e3f7
  • MD5: 85c91ca0ea7bf97f82b92b8d5a136918
  • BLAKE2b-256: 9d7208d023434b9cf792cd5b2abad67e6cc63a146da6591c1267077f268659cf

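If you want to verify a downloaded archive against the SHA256 digest above, Python's standard library is enough. A minimal sketch; the local file path is assumed:

import hashlib

# Compare the SHA256 digest of the downloaded archive with the published value
# (file assumed to be in the current directory).
expected = "ee070fc311fde5ce6b976618726522f5e0d9502ea86228fb7a85e31473c6e3f7"
with open("qolmat-0.0.15.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()
print(digest == expected)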

File details

Details for the file qolmat-0.0.15-py3-none-any.whl.

File metadata

  • Download URL: qolmat-0.0.15-py3-none-any.whl
  • Upload date:
  • Size: 75.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.17

File hashes

Hashes for qolmat-0.0.15-py3-none-any.whl:
  • SHA256: 4d00459f0f4eb021d13dd80610967cecd2d623bfb3d79c571324d09b0f0d6255
  • MD5: c94bd4d2da14acd7888ae93e360c686f
  • BLAKE2b-256: 17e4ecd6702da7adc8f6c540507c897aa8b96c6d32613351103e8ed5c5c414d1

