qolmat·PyPI

Tools to impute

Project description

hlbotterman@quantmetry.com, jroussel@quantmetry.com, tmorzadec@quantmetry.com, rhajou@quantmetry.com, fdakhli@quantmetry.com

License: new BSD
Project-URL: Bug Tracker, https://github.com/Quantmetry/qolmat
Project-URL: Documentation, https://qolmat.readthedocs.io/en/latest/
Project-URL: Source Code, https://github.com/Quantmetry/qolmat
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved
Classifier: Topic :: Software Development
Classifier: Topic :: Scientific/Engineering
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: POSIX
Classifier: Operating System :: Unix
Classifier: Operating System :: MacOS
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Requires-Python: >=3.8
Description-Content-Type: text/x-rst
Provides-Extra: tests
Provides-Extra: docs
Provides-Extra: pytorch
License-File: LICENSE
License-File: AUTHORS.rst

Qolmat - The Tool for Data Imputation

Qolmat provides a convenient way to estimate optimal data imputation techniques by leveraging scikit-learn-compatible algorithms. Users can compare various methods based on different evaluation metrics.

🔗 Requirements

Python 3.8+

🛠 Installation

Install via pip:

$ pip install qolmat

If you need the optional PyTorch dependencies, you can install them through the pytorch extra declared in the package metadata:

$ pip install qolmat[pytorch]

To install directly from the GitHub repository:

$ pip install git+https://github.com/Quantmetry/qolmat
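To confirm which release ended up in the environment, the installed version can be read back from the package metadata. The sketch below uses only the standard library; the helper name installed_version is made up for illustration and is not part of qolmat.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("qolmat")
    print(found if found is not None else "qolmat is not installed")
```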

⚡️ Quickstart

Let us start with a basic imputation problem. Here, we generate a one-dimensional noisy time series.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(42)
t = np.linspace(0,1,1000)
y = np.cos(2*np.pi*t*10)+np.random.randn(1000)/2
df = pd.DataFrame({'y': y}, index=pd.Series(t, name='index'))

For this demonstration, let us create artificial holes in our dataset.

from qolmat.utils.data import add_holes
plt.rcParams.update({'font.size': 18})

ratio_masked = 0.1
mean_size = 20
df_with_nan = add_holes(df, ratio_masked=ratio_masked, mean_size=mean_size)
is_na = df_with_nan['y'].isna()

plt.figure(figsize=(25,4))
plt.plot(df_with_nan['y'],'.')
plt.plot(df.loc[is_na, 'y'],'.')
plt.grid()
plt.xlim(0,1)

plt.legend(['Data', 'Missing data'])
plt.savefig('readme1.png')
plt.show()

https://raw.githubusercontent.com/Quantmetry/qolmat/main/docs/images/readme1.png
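To make the idea of artificial holes concrete, here is a minimal sketch of block-wise masking written with NumPy and pandas only. It is not qolmat's add_holes: the helper name mask_random_blocks, the geometric block lengths and the stopping rule are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def mask_random_blocks(df, ratio_masked=0.1, mean_size=20, seed=0):
    """Return a copy of `df` with contiguous row blocks set to NaN.

    Hypothetical helper, not qolmat's `add_holes`: block lengths are drawn
    from a geometric distribution with mean `mean_size`, and blocks are
    added until at least `ratio_masked` of the rows are masked.
    """
    rng = np.random.default_rng(seed)
    n_rows = len(df)
    n_target = min(n_rows, int(round(ratio_masked * n_rows)))
    mask = np.zeros(n_rows, dtype=bool)
    while mask.sum() < n_target:
        size = rng.geometric(1.0 / mean_size)  # block length, at least 1
        start = rng.integers(0, n_rows)        # block start, end is clipped
        mask[start:start + size] = True
    df_masked = df.copy()
    df_masked.loc[mask, :] = np.nan
    return df_masked


t = np.linspace(0, 1, 1000)
df_demo = pd.DataFrame(
    {"y": np.cos(2 * np.pi * t * 10)}, index=pd.Series(t, name="index")
)
df_demo_nan = mask_random_blocks(df_demo, ratio_masked=0.1, mean_size=20)
print(df_demo_nan["y"].isna().mean())  # fraction of masked rows
```

Because the last block can overshoot, the masked fraction is at least ratio_masked rather than exactly equal to it.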

Several imputation methods are available through from qolmat.imputations import imputers. Collecting them in a dictionary lets us benchmark them against one another.

from sklearn.linear_model import LinearRegression
from qolmat.imputations import imputers

imputer_mean = imputers.ImputerMean()
imputer_median = imputers.ImputerMedian()
imputer_mode = imputers.ImputerMode()
imputer_locf = imputers.ImputerLOCF()
imputer_nocb = imputers.ImputerNOCB()
imputer_interpol = imputers.ImputerInterpolation(method="linear")
imputer_spline = imputers.ImputerInterpolation(method="spline", order=2)
imputer_shuffle = imputers.ImputerShuffle()
imputer_residuals = imputers.ImputerResiduals(period=10, model_tsa="additive", extrapolate_trend="freq", method_interpolation="linear")
imputer_rpca = imputers.ImputerRPCA(columnwise=True, period=10, max_iter=200, tau=2, lam=.3)
imputer_rpca_opti = imputers.ImputerRPCA(columnwise=True, period = 10, max_iter=100)
imputer_ou = imputers.ImputerEM(model="multinormal", method="sample", max_iter_em=34, n_iter_ou=15, dt=1e-3)
imputer_tsou = imputers.ImputerEM(model="VAR1", method="sample", max_iter_em=34, n_iter_ou=15, dt=1e-3)
imputer_tsmle = imputers.ImputerEM(model="VAR1", method="mle", max_iter_em=34, n_iter_ou=15, dt=1e-3)
imputer_knn = imputers.ImputerKNN(k=10)
imputer_mice = imputers.ImputerMICE(estimator=LinearRegression(), sample_posterior=False, max_iter=100, missing_values=np.nan)
imputer_regressor = imputers.ImputerRegressor(estimator=LinearRegression())

dict_imputers = {
    "mean": imputer_mean,
    "median": imputer_median,
    "mode": imputer_mode,
    "interpolation": imputer_interpol,
    "spline": imputer_spline,
    "shuffle": imputer_shuffle,
    "residuals": imputer_residuals,
    "OU": imputer_ou,
    "TSOU": imputer_tsou,
    "TSMLE": imputer_tsmle,
    "RPCA": imputer_rpca,
    "RPCA_opti": imputer_rpca_opti,
    "locf": imputer_locf,
    "nocb": imputer_nocb,
    "knn": imputer_knn,
    "ols": imputer_regressor,
    "mice_ols": imputer_mice,
}
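For intuition about the simplest entries in this dictionary, their behaviour can be approximated with plain pandas operations. This is a sketch of the underlying ideas on a made-up toy series, not of qolmat's implementation, whose handling of edge cases may differ.

```python
import numpy as np
import pandas as pd

# Toy series with an interior gap and a trailing gap.
s = pd.Series([1.0, np.nan, np.nan, 4.0, 6.0, np.nan])

imputed = pd.DataFrame({
    "mean": s.fillna(s.mean()),      # constant: mean of the observed values
    "median": s.fillna(s.median()),  # constant: median of the observed values
    "locf": s.ffill(),               # last observation carried forward
    "nocb": s.bfill(),               # next observation carried backward
    "interpolation": s.interpolate(method="linear"),
})
print(imputed)
```

In this pandas sketch, bfill alone leaves the trailing gap unfilled because no later observation exists, so a fallback rule is needed at the edges.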

For any imputer, a search space can be defined as a dictionary that gives three pieces of information for each hyperparameter: min, max and type. This dictionary is used to search for the hyperparameter values that give the best imputation. Here, we call it dict_config_opti.

dict_config_opti = {
    "RPCA_opti": {
        "tau": {"min": .5, "max": 5, "type":"Real"},
        "lam": {"min": .1, "max": 1, "type":"Real"},
    }
}
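How such a min/max/type specification can drive a search is sketched below with a plain random search from the standard library. This is a stand-in for illustration only: the optimiser qolmat actually uses is not described here, the function names are hypothetical, the "Integer" branch is an assumption (only "Real" appears above), and the toy objective replaces a real imputation error.

```python
import random


def sample_params(space, rng):
    """Draw one value per hyperparameter from a {"min", "max", "type"} spec."""
    params = {}
    for name, spec in space.items():
        if spec["type"] == "Real":
            params[name] = rng.uniform(spec["min"], spec["max"])
        elif spec["type"] == "Integer":  # assumed type, not shown in the text
            params[name] = rng.randint(spec["min"], spec["max"])
        else:
            raise ValueError(f"unsupported type: {spec['type']}")
    return params


def random_search(space, objective, n_calls=10, seed=0):
    """Return the sampled parameters with the lowest objective value."""
    rng = random.Random(seed)
    candidates = [sample_params(space, rng) for _ in range(n_calls)]
    return min(candidates, key=objective)


space = {
    "tau": {"min": 0.5, "max": 5, "type": "Real"},
    "lam": {"min": 0.1, "max": 1, "type": "Real"},
}


def toy_objective(params):
    # Stand-in for an imputation error measured on masked data.
    return (params["tau"] - 2) ** 2 + (params["lam"] - 0.3) ** 2


best = random_search(space, toy_objective, n_calls=10)
print(best)
```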

Then, with the Comparator class from the qolmat.benchmark.comparator module, we can compare the different imputation methods. The comparison does not require knowing the true values behind the missing entries; instead, it relies on data masking: observed values are hidden and each imputer is scored on how well it recovers them. For more details on how the imputers and the comparator work, please see the documentation.

from qolmat.benchmark import comparator, missing_patterns

generator_holes = missing_patterns.EmpiricalHoleGenerator(n_splits=4, ratio_masked=0.1)

comparison = comparator.Comparator(
    dict_imputers,
    ['y'],
    generator_holes = generator_holes,
    metrics = ["mae", "wmape", "KL_columnwise", "ks_test", "energy"],
    n_calls_opt = 10,
    dict_config_opti = dict_config_opti,
)
results = comparison.compare(df_with_nan)
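The masking principle behind this comparison can be sketched without qolmat. In the example below, observed values are hidden at random, imputed, and scored against their known values. The function names are hypothetical, single points are masked instead of using qolmat's hole generators, and MAE and WMAPE follow their usual textbook definitions, which may differ in detail from qolmat's metrics.

```python
import numpy as np
import pandas as pd


def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))


def wmape(y_true, y_pred):
    """Weighted MAPE: total absolute error over total absolute truth."""
    return float(np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true)))


def score_by_masking(series, impute, ratio_masked=0.1, n_splits=4, seed=0):
    """Average MAE/WMAPE of `impute` on observed values hidden at random."""
    rng = np.random.default_rng(seed)
    observed = np.flatnonzero(series.notna().to_numpy())
    n_hidden = max(1, int(ratio_masked * len(observed)))
    scores = []
    for _ in range(n_splits):
        hidden = rng.choice(observed, size=n_hidden, replace=False)
        corrupted = series.copy()
        corrupted.iloc[hidden] = np.nan      # hide values we know
        imputed = impute(corrupted)
        y_true = series.iloc[hidden].to_numpy()
        y_pred = imputed.iloc[hidden].to_numpy()
        scores.append({"mae": mae(y_true, y_pred),
                       "wmape": wmape(y_true, y_pred)})
    return pd.DataFrame(scores).mean()


t = np.linspace(0, 1, 200)
s = pd.Series(np.cos(2 * np.pi * t * 3) + 2.0)
s.iloc[50:60] = np.nan  # genuinely missing values, never used for scoring

print(score_by_masking(s, lambda x: x.fillna(x.mean())))
print(score_by_masking(s, lambda x: x.interpolate().bfill()))
```

Only the artificially hidden positions enter the scores, which is why the true values of the genuinely missing entries are never needed.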

We can observe the benchmark results.

dfs_imputed = imputer_tsmle.fit_transform(df_with_nan)

plt.figure(figsize=(25,5))

plt.plot(df.loc[~is_na, 'y'],'.')
plt.plot(df.loc[is_na, 'y'],'.')
plt.plot(dfs_imputed.loc[is_na, 'y'],'.')

plt.grid()
plt.xlim(0,1)
plt.legend(['Data','Missing data', 'Imputed data'])
plt.savefig('readme3.png')
plt.show()

https://raw.githubusercontent.com/Quantmetry/qolmat/main/docs/images/readme2.png

Finally, we keep the best imputer, here TSMLE, and plot the result.

dfs_imputed = imputer_tsmle.fit_transform(df_with_nan)

plt.figure(figsize=(25,5))
plt.plot(df['y'],'.g')
plt.plot(dfs_imputed['y'],'.r')
plt.plot(df_with_nan['y'],'.b')
plt.show()

https://raw.githubusercontent.com/Quantmetry/qolmat/main/docs/images/readme3.png

📘 Documentation

The full documentation is available at https://qolmat.readthedocs.io/en/latest/.

📝 Contributing

You are welcome to propose and contribute new ideas. We encourage you to open an issue so that we can align on the work to be done. It is generally a good idea to have a quick discussion before opening a pull request that is potentially out of scope. For more information on the contribution process, please see the project repository at https://github.com/Quantmetry/qolmat.

🤝 Affiliation

Qolmat has been developed by Quantmetry.

🔍 References

Qolmat methods belong to the field of data imputation.

[1] Candès, Emmanuel J., et al. “Robust principal component analysis?” Journal of the ACM (JACM) 58.3 (2011): 1-37.

[2] Wang, Xuehui, et al. “An improved robust principal component analysis model for anomalies detection of subway passenger flow.” Journal of Advanced Transportation 2018 (2018).

[3] Chen, Yuxin, et al. “Bridging convex and nonconvex optimization in robust PCA: Noise, outliers, and missing data.” arXiv preprint arXiv:2001.05484 (2020).

[4] Shahid, Nauman, et al. “Fast robust PCA on graphs.” IEEE Journal of Selected Topics in Signal Processing 10.4 (2016): 740-756.

[5] Feng, Jiashi, et al. “Online robust PCA via stochastic optimization.” Advances in Neural Information Processing Systems 26 (2013).

[6] García, S., Luengo, J., & Herrera, F. “Data preprocessing in data mining.” (2015).

📝 License

Qolmat is free and open-source software licensed under the BSD 3-Clause license.

Project details

Release history

Version	Release date
0.1.8	Jun 13, 2024
0.1.7	Jun 13, 2024
0.1.6	Apr 17, 2024
0.1.5	Apr 17, 2024
0.1.4	Apr 15, 2024
0.1.3	Mar 8, 2024
0.1.2	Feb 28, 2024
0.1.1	Nov 6, 2023
0.1.0	Oct 12, 2023
0.0.19	Oct 12, 2023
0.0.15	Aug 3, 2023 (this version)
0.0.14	Jun 14, 2023
0.0.13	Jun 7, 2023
0.0.12	May 31, 2023
0.0.11	May 26, 2023
0.0.10	Mar 10, 2023
0.0.9	Mar 8, 2023
0.0.8	Mar 8, 2023
0.0.7	Mar 8, 2023
0.0.5	Mar 3, 2023
0.0.4	Mar 3, 2023
0.0.3	Feb 27, 2023
0.0.2	Feb 24, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

qolmat-0.0.15.tar.gz (65.5 kB)

Uploaded Aug 3, 2023 Source

Built Distribution

qolmat-0.0.15-py3-none-any.whl (75.4 kB)

Uploaded Aug 3, 2023 Python 3

File details

Details for the file qolmat-0.0.15.tar.gz.

File metadata

Download URL: qolmat-0.0.15.tar.gz
Upload date: Aug 3, 2023
Size: 65.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.9.17

File hashes

Hashes for qolmat-0.0.15.tar.gz
Algorithm	Hash digest
SHA256	`ee070fc311fde5ce6b976618726522f5e0d9502ea86228fb7a85e31473c6e3f7`
MD5	`85c91ca0ea7bf97f82b92b8d5a136918`
BLAKE2b-256	`9d7208d023434b9cf792cd5b2abad67e6cc63a146da6591c1267077f268659cf`
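A published digest can be compared against a downloaded copy of the file with a few lines of standard-library Python. This is a minimal sketch: the helper names sha256_of_file and matches_published_hash are made up for illustration, and the local file path assumes the archive was saved to the current directory.

```python
import hashlib
import os


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare the file's digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    archive = "qolmat-0.0.15.tar.gz"  # assumed download location
    expected = "ee070fc311fde5ce6b976618726522f5e0d9502ea86228fb7a85e31473c6e3f7"
    if os.path.exists(archive):
        print(matches_published_hash(archive, expected))
    else:
        print(f"{archive} not found in the current directory")
```

Reading in chunks keeps memory use constant regardless of the archive size.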


File details

Details for the file qolmat-0.0.15-py3-none-any.whl.

File metadata

Download URL: qolmat-0.0.15-py3-none-any.whl
Upload date: Aug 3, 2023
Size: 75.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.9.17

File hashes

Hashes for qolmat-0.0.15-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4d00459f0f4eb021d13dd80610967cecd2d623bfb3d79c571324d09b0f0d6255`
MD5	`c94bd4d2da14acd7888ae93e360c686f`
BLAKE2b-256	`17e4ecd6702da7adc8f6c540507c897aa8b96c6d32613351103e8ed5c5c414d1`
