Tools to impute

Project description

License: new BSD
Project-URL: Bug Tracker, https://github.com/Quantmetry/qolmat
Project-URL: Documentation, https://qolmat.readthedocs.io/en/latest/
Project-URL: Source Code, https://github.com/Quantmetry/qolmat
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved
Classifier: Topic :: Software Development
Classifier: Topic :: Scientific/Engineering
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: POSIX
Classifier: Operating System :: Unix
Classifier: Operating System :: MacOS
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Requires-Python: >=3.8
Description-Content-Type: text/x-rst
Provides-Extra: tests
Provides-Extra: docs
Provides-Extra: pytorch
License-File: LICENSE
License-File: AUTHORS.rst

Qolmat - The Tool for Data Imputation

Qolmat provides a convenient way to estimate optimal data imputation techniques by leveraging scikit-learn-compatible algorithms. Users can compare various methods based on different evaluation metrics.

Python 3.8+

🛠 Installation

Install via pip:

$ pip install qolmat

If you need the optional PyTorch dependencies (declared as the pytorch extra in the package metadata), you can install them with the following pip command:

$ pip install qolmat[pytorch]

To install directly from the GitHub repository:

$ pip install git+https://github.com/Quantmetry/qolmat
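
After installing, it can be useful to confirm which version ended up in the environment. The sketch below uses only the standard library; the helper name installed_version is hypothetical and is not part of qolmat.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string if qolmat is installed, otherwise None.
print("qolmat version:", installed_version("qolmat"))
```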

⚡️ Quickstart

Let us start with a basic imputation problem. Here, we generate a one-dimensional noisy time series.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(42)
t = np.linspace(0,1,1000)
y = np.cos(2*np.pi*t*10)+np.random.randn(1000)/2
df = pd.DataFrame({'y': y}, index=pd.Series(t, name='index'))

For this demonstration, let us create artificial holes in our dataset.

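from qolmat.utils.data import add_holes
plt.rcParams.update({'font.size': 18})

# Fraction of values to mask (illustrative value, an assumption).
ratio_masked = 0.2
# Average length of a hole, in samples.
mean_size = 20
# add_holes is assumed to take the masking ratio and the mean hole size.
df_with_nan = add_holes(df, ratio_masked=ratio_masked, mean_size=mean_size)
is_na = df_with_nan['y'].isna()

plt.figure(figsize=(25,4))
plt.plot(df_with_nan['y'],'.')
plt.plot(df.loc[is_na, 'y'],'.')
plt.grid()
plt.xlim(0,1)

plt.legend(['Data', 'Missing data'])
plt.show()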
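
To make the notion of a "hole" concrete, here is a minimal sketch that creates contiguous gaps by hand with NumPy and pandas, independently of qolmat. The helper punch_holes and its parameters are hypothetical, and drawing the gap lengths from a geometric distribution with mean mean_size is an assumption about what a mean hole size could represent, not a description of qolmat's internals.

```python
import numpy as np
import pandas as pd


def punch_holes(series: pd.Series, n_holes: int, mean_size: int, seed: int = 0) -> pd.Series:
    """Return a copy of `series` with `n_holes` contiguous runs set to NaN.

    Hypothetical helper: run lengths follow a geometric distribution with
    mean `mean_size`; start positions are drawn uniformly.
    """
    rng = np.random.default_rng(seed)
    out = series.copy()
    for _ in range(n_holes):
        length = int(rng.geometric(1 / mean_size))  # always >= 1
        start = int(rng.integers(0, len(out)))      # position of the first NaN
        out.iloc[start:start + length] = np.nan     # slice is clipped at the end
    return out


t = np.linspace(0, 1, 1000)
demo = pd.Series(np.cos(2 * np.pi * 10 * t), index=t)
demo_with_nan = punch_holes(demo, n_holes=10, mean_size=20)
print(f"{demo_with_nan.isna().mean():.1%} of the values are missing")
```

Holes may overlap, so the final share of missing values varies from one seed to another.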

To impute missing data, there are several methods that can be imported with from qolmat.imputations import imputers. The creation of an imputation dictionary will enable us to benchmark the various imputations.

from sklearn.linear_model import LinearRegression
from qolmat.imputations import imputers

imputer_mean = imputers.ImputerMean()
imputer_median = imputers.ImputerMedian()
imputer_mode = imputers.ImputerMode()
imputer_locf = imputers.ImputerLOCF()
imputer_nocb = imputers.ImputerNOCB()
imputer_interpol = imputers.ImputerInterpolation(method="linear")
imputer_spline = imputers.ImputerInterpolation(method="spline", order=2)
imputer_shuffle = imputers.ImputerShuffle()
imputer_residuals = imputers.ImputerResiduals(period=10, model_tsa="additive", extrapolate_trend="freq", method_interpolation="linear")
imputer_rpca = imputers.ImputerRPCA(columnwise=True, period=10, max_iter=200, tau=2, lam=.3)
imputer_rpca_opti = imputers.ImputerRPCA(columnwise=True, period=10, max_iter=100)
imputer_ou = imputers.ImputerEM(model="multinormal", method="sample", max_iter_em=34, n_iter_ou=15, dt=1e-3)
imputer_tsou = imputers.ImputerEM(model="VAR1", method="sample", max_iter_em=34, n_iter_ou=15, dt=1e-3)
imputer_tsmle = imputers.ImputerEM(model="VAR1", method="mle", max_iter_em=34, n_iter_ou=15, dt=1e-3)
imputer_knn = imputers.ImputerKNN(k=10)
imputer_mice = imputers.ImputerMICE(estimator=LinearRegression(), sample_posterior=False, max_iter=100, missing_values=np.nan)
imputer_regressor = imputers.ImputerRegressor(estimator=LinearRegression())

dict_imputers = {
    "mean": imputer_mean,
    "median": imputer_median,
    "mode": imputer_mode,
    "interpolation": imputer_interpol,
    "spline": imputer_spline,
    "shuffle": imputer_shuffle,
    "residuals": imputer_residuals,
    "OU": imputer_ou,
    "TSOU": imputer_tsou,
    "TSMLE": imputer_tsmle,
    "RPCA": imputer_rpca,
    "RPCA_opti": imputer_rpca_opti,
    "locf": imputer_locf,
    "nocb": imputer_nocb,
    "knn": imputer_knn,
    "ols": imputer_regressor,
    "mice_ols": imputer_mice,
}
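
For readers unfamiliar with the simplest strategies in this list, the sketch below shows what mean, median, LOCF, NOCB and linear interpolation do on a tiny series. It uses plain pandas rather than qolmat, and the correspondence between these pandas calls and the qolmat imputers of the same name is an assumption based on the names alone.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 6.0])

filled = pd.DataFrame({
    "original": s,
    "mean": s.fillna(s.mean()),      # constant fill with the mean of observed values
    "median": s.fillna(s.median()),  # constant fill with the median of observed values
    "locf": s.ffill(),               # last observation carried forward
    "nocb": s.bfill(),               # next observation carried backward
    "linear": s.interpolate(method="linear"),  # straight line between neighbours
})
print(filled)
```

Constant fills ignore the ordering of the data, whereas LOCF, NOCB and interpolation use neighbouring observations, which is why they tend to suit time series better.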

It is possible to define a parameter dictionary for an imputer, giving three pieces of information for each hyperparameter: min, max and type. The aim of the dictionary is to delimit the search for the optimal parameters for data imputation. Here, we call this dictionary dict_config_opti.

dict_config_opti = {
    "RPCA_opti": {
        "tau": {"min": .5, "max": 5, "type": "Real"},
        "lam": {"min": .1, "max": 1, "type": "Real"},
    }
}
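
The min, max and type entries describe a search interval for each hyperparameter. How qolmat consumes them internally is not shown here; the sketch below only illustrates the idea with a plain random search from the standard library. The helper sample_params and the toy_loss function are hypothetical stand-ins, not part of qolmat.

```python
import random

dict_config_opti = {
    "RPCA_opti": {
        "tau": {"min": 0.5, "max": 5, "type": "Real"},
        "lam": {"min": 0.1, "max": 1, "type": "Real"},
    }
}


def sample_params(space: dict, rng: random.Random) -> dict:
    """Draw one candidate per hyperparameter, uniformly within [min, max]."""
    params = {}
    for name, spec in space.items():
        if spec["type"] != "Real":
            raise ValueError(f"Unsupported type in this sketch: {spec['type']}")
        params[name] = rng.uniform(spec["min"], spec["max"])
    return params


def toy_loss(params: dict) -> float:
    """Stand-in for an imputation error; minimal at tau=2, lam=0.3."""
    return (params["tau"] - 2) ** 2 + (params["lam"] - 0.3) ** 2


rng = random.Random(0)
# Ten draws, loosely mirroring a small optimisation budget.
candidates = [sample_params(dict_config_opti["RPCA_opti"], rng) for _ in range(10)]
best = min(candidates, key=toy_loss)
print(best)
```

In a real benchmark the loss would be an imputation error measured on masked data, and a smarter search strategy than uniform sampling could be used.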

Then, with the Comparator class available through from qolmat.benchmark import comparator, we can compare the different imputation methods. The comparison does not require knowing the true values of the missing data; it relies on data masking instead: observed values are hidden, imputed, and compared with their true values. For more details on how imputers and the comparator work, please see the documentation.

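from qolmat.benchmark import comparator, missing_patterns

# Hole generator used to mask known values during the benchmark. The class
# and its arguments are an assumption; adapt them to your installed version.
generator_holes = missing_patterns.EmpiricalHoleGenerator(n_splits=4, ratio_masked=0.1)

comparison = comparator.Comparator(
    dict_imputers,
    ['y'],
    generator_holes=generator_holes,
    metrics=["mae", "wmape", "KL_columnwise", "ks_test", "energy"],
    n_calls_opt=10,
    dict_config_opti=dict_config_opti,
)
results = comparison.compare(df_with_nan)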
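
The masking principle can be illustrated without qolmat: hide some observed values, impute them, and score each method only on the hidden positions. The sketch below is a simplified stand-in under stated assumptions: a mean fill and a linear interpolation play the role of the imputers, the mean absolute error plays the role of the metrics, and all variable names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 200)
observed = pd.Series(np.cos(2 * np.pi * 3 * t), index=t)

# Hide 20% of the known values; the hidden values serve as ground truth.
# The first and last points stay visible so interpolation is always defined.
mask = pd.Series(False, index=observed.index)
hidden_positions = rng.choice(np.arange(1, len(observed) - 1), size=40, replace=False)
mask.iloc[hidden_positions] = True
masked = observed.mask(mask)

candidates = {
    "mean": lambda s: s.fillna(s.mean()),
    "interpolation": lambda s: s.interpolate(method="linear"),
}

# Score each candidate only on the positions that were hidden.
scores = {}
for name, impute in candidates.items():
    imputed = impute(masked)
    scores[name] = float((imputed[mask] - observed[mask]).abs().mean())

print(scores)
```

On a smooth signal like this one, interpolation gives a lower error than the mean fill, which is the kind of ranking a benchmark is meant to reveal.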

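We can observe the benchmark results. As an illustration, we plot the values imputed by the TSMLE imputer alongside the observed data and the data that were removed.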

dfs_imputed = imputer_tsmle.fit_transform(df_with_nan)

plt.figure(figsize=(25,5))

plt.plot(df.loc[~is_na, 'y'],'.')
plt.plot(df.loc[is_na, 'y'],'.')
plt.plot(dfs_imputed.loc[is_na, 'y'],'.')

plt.grid()
plt.xlim(0,1)
plt.legend(['Data','Missing data', 'Imputed data'])
plt.show()

Finally, we keep the best imputer, TSMLE, and plot its imputation together with the original and the masked data.

dfs_imputed = imputer_tsmle.fit_transform(df_with_nan)

plt.figure(figsize=(25,5))
plt.plot(df['y'],'.g')
plt.plot(dfs_imputed['y'],'.r')
plt.plot(df_with_nan['y'],'.b')
plt.show()

📘 Documentation

The full documentation can be found at https://qolmat.readthedocs.io/en/latest/.

📝 Contributing

You are welcome to propose and contribute new ideas. We encourage you to open an issue so that we can align on the work to be done. It is generally a good idea to have a quick discussion before opening a pull request that is potentially out of scope. For more information on the contribution process, please see the project repository: https://github.com/Quantmetry/qolmat.

🤝 Affiliation

Qolmat has been developed by Quantmetry.

🔍 References

Qolmat methods belong to the field of data imputation. The references below mainly cover robust principal component analysis (RPCA) and data preprocessing.

[1] Candès, Emmanuel J., et al. “Robust principal component analysis?” Journal of the ACM (JACM) 58.3 (2011): 1-37.

[2] Wang, Xuehui, et al. “An improved robust principal component analysis model for anomalies detection of subway passenger flow.” Journal of Advanced Transportation 2018 (2018).

[3] Chen, Yuxin, et al. “Bridging convex and nonconvex optimization in robust PCA: Noise, outliers, and missing data.” arXiv preprint arXiv:2001.05484 (2020).

[4] Shahid, Nauman, et al. “Fast robust PCA on graphs.” IEEE Journal of Selected Topics in Signal Processing 10.4 (2016): 740-756.

[5] Feng, Jiashi, et al. “Online robust PCA via stochastic optimization.” Advances in Neural Information Processing Systems 26 (2013).

[6] García, S., Luengo, J., & Herrera, F. “Data preprocessing in data mining.” 2015.

Qolmat is free and open-source software licensed under the BSD 3-Clause license.

Project details

Source Distribution

qolmat-0.0.15.tar.gz (65.5 kB)

Built Distribution

qolmat-0.0.15-py3-none-any.whl (75.4 kB)