Tools for data imputation
Maintainers: hlbotterman@quantmetry.com, jroussel@quantmetry.com, tmorzadec@quantmetry.com, rhajou@quantmetry.com
License: new BSD
Requires Python: >=3.8
RPCA for anomaly detection and data imputation
What is robust principal component analysis?
Robust Principal Component Analysis (RPCA) is a modification of the statistical procedure of principal component analysis (PCA) which makes it possible to work with grossly corrupted observations.
Suppose we are given a large data matrix \(\mathbf{D}\), and know that it may be decomposed as
\[
\mathbf{D} = \mathbf{X}^* + \mathbf{A}^*,
\]
where \(\mathbf{X}^*\) has low rank and \(\mathbf{A}^*\) is sparse. We know neither the low-dimensional column and row spaces of \(\mathbf{X}^*\), nor even their dimension. Similarly, we do not know the locations, magnitudes or even the number of the non-zero entries of \(\mathbf{A}^*\). Can the low-rank and sparse parts be recovered both accurately and efficiently?
Of course, for the separation problem to make sense, the low-rank part cannot be sparse and, analogously, the sparse part cannot be low-rank. See [1] for more details.
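As an illustration of this setup, a matrix of the assumed form can be generated synthetically; the rank, corruption level and noise scale below are arbitrary choices for the example:

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, r = 100, 80, 3                                              # matrix size and true rank
    X_star = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))        # low-rank part
    A_star = np.zeros((m, n))
    corrupted = rng.random((m, n)) < 0.05                             # 5% of entries are corrupted
    A_star[corrupted] = rng.normal(scale=10.0, size=corrupted.sum())  # sparse, large errors
    D = X_star + A_star                                               # observed matrix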
Formally, the problem is expressed as
\[
\min_{\mathbf{X}, \mathbf{A}} \ \operatorname{rank}(\mathbf{X}) + \lambda \Vert \mathbf{A} \Vert_0 \quad \text{s.t.} \quad \mathbf{D} = \mathbf{X} + \mathbf{A}.
\]
Unfortunately, this optimization problem is NP-hard due to its nonconvexity and discontinuity. A widely used relaxation therefore replaces rank(\(\mathbf{X}\)) by its convex envelope, the nuclear norm \(\Vert \mathbf{X} \Vert_*\), and the \(\ell_0\) penalty by the \(\ell_1\)-norm, which models sparse noise well and admits efficient solutions. The problem becomes
\[
\min_{\mathbf{X}, \mathbf{A}} \ \Vert \mathbf{X} \Vert_* + \lambda \Vert \mathbf{A} \Vert_1 \quad \text{s.t.} \quad \mathbf{D} = \mathbf{X} + \mathbf{A}.
\]
Theoretically, this relaxation is guaranteed to work even if the rank of \(\mathbf{X}^*\) grows almost linearly in the dimension of the matrix, and the errors in \(\mathbf{A}^*\) affect up to a constant fraction of all entries. Algorithmically, the problem can be solved by efficient and scalable algorithms, at a cost not much higher than that of classical PCA. Empirically, numerous simulations and experiments suggest this works under surprisingly broad conditions for many types of real data.
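To make the relaxation concrete, here is a minimal NumPy sketch of the alternating-directions scheme described in [1]; it is only an illustration and does not reproduce the exact parameter choices or interfaces of the classes in this repository:

    import numpy as np

    def soft_threshold(M, tau):
        # elementwise shrinkage operator
        return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

    def svd_threshold(M, tau):
        # shrinkage applied to the singular values
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        return U @ np.diag(soft_threshold(s, tau)) @ Vt

    def pcp(D, lam=None, mu=None, max_iter=500, tol=1e-7):
        # principal component pursuit via alternating directions, following [1]
        m, n = D.shape
        lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
        mu = mu if mu is not None else 0.25 * m * n / np.abs(D).sum()
        X = np.zeros_like(D)
        A = np.zeros_like(D)
        Y = np.zeros_like(D)                                 # dual variable
        norm_D = np.linalg.norm(D, "fro")
        for _ in range(max_iter):
            X = svd_threshold(D - A + Y / mu, 1.0 / mu)      # low-rank update
            A = soft_threshold(D - X + Y / mu, lam / mu)     # sparse update
            residual = D - X - A
            Y = Y + mu * residual                            # dual update
            if np.linalg.norm(residual, "fro") <= tol * norm_D:
                break
        return X, A

On the synthetic matrix D built above, pcp(D) should recover X_star and A_star up to small numerical error.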
Some examples of real-life applications are background modelling from video surveillance, face recognition, and speech recognition. Here we focus on anomaly detection in time series.
What’s in this repo?
Some classes are implemented:
RPCA class, based on [1] (p. 29).
GraphRPCA class, based on [4].
TemporalRPCA class, based on [2] and [3]. The optimisation problem is the following:
\[
\min_{\mathbf{X}, \mathbf{A}} \ \Vert P_{\Omega}(\mathbf{D} - \mathbf{X} - \mathbf{A}) \Vert_F^2 + \lambda_1 \Vert \mathbf{X} \Vert_* + \lambda_2 \Vert \mathbf{A} \Vert_1 + \sum_k \eta_k \Vert \mathbf{X}\mathbf{H}_k \Vert_p
\]
where \(\Vert \mathbf{X}\mathbf{H}_k \Vert_p\) is either \(\Vert \mathbf{X}\mathbf{H}_k \Vert_1\) or \(\Vert \mathbf{X}\mathbf{H}_k \Vert_F^2\).
The operator \(P_{\Omega}\) is the projection operator such that \(P_{\Omega}(\mathbf{M})\) is the projection of \(\mathbf{M}\) onto the set of observed entries \(\Omega\). This makes it possible to handle missing values. Each of these classes accepts either a time series or a matrix as input. If a time series is passed, a pre-processing step is applied first.
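For intuition, in [2] the matrices \(\mathbf{H}_k\) encode temporal structure, for instance differences between columns separated by a seasonal lag \(T_k\), while \(P_{\Omega}\) simply zeroes out unobserved entries. Below is a minimal NumPy sketch under that lag-difference assumption; the actual construction used by the classes in this repository may differ:

    import numpy as np

    def lag_difference_matrix(n_cols, lag):
        # H_k such that (X @ H_k)[:, j] = X[:, j + lag] - X[:, j]
        H = np.zeros((n_cols, n_cols - lag))
        for j in range(n_cols - lag):
            H[j + lag, j] = 1.0
            H[j, j] = -1.0
        return H

    def proj_omega(M, observed):
        # P_Omega: keep observed entries, set missing ones to zero
        return np.where(observed, M, 0.0)

    # a 1 x 28 series with one missing value and a weekly (lag 7) structure
    D = np.arange(28, dtype=float).reshape(1, 28)
    D[0, 3] = np.nan                              # missing observation
    observed = ~np.isnan(D)
    H_week = lag_difference_matrix(D.shape[1], lag=7)
    print(proj_omega(D, observed) @ H_week)       # lag-7 differences on the masked matrix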
See the examples folder for a first overview of the implemented classes.
Installation
Install directly from the GitLab repository:
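A typical command would look like the sketch below; the repository path is a placeholder, to be replaced by the actual GitLab URL of the project:

    # placeholder path: substitute the project's actual GitLab group and name
    pip install git+https://gitlab.com/<group>/<project>.git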
Contributing
Feel free to open an issue or contact us at pnom@quantmetry.com.
References
[1] Candès, Emmanuel J., et al. "Robust principal component analysis?" Journal of the ACM (JACM) 58.3 (2011): 1-37.
[2] Wang, Xuehui, et al. "An improved robust principal component analysis model for anomalies detection of subway passenger flow." Journal of Advanced Transportation 2018 (2018).
[3] Chen, Yuxin, et al. "Bridging convex and nonconvex optimization in robust PCA: Noise, outliers, and missing data." arXiv preprint arXiv:2001.05484 (2020).
[4] Shahid, Nauman, et al. "Fast robust PCA on graphs." IEEE Journal of Selected Topics in Signal Processing 10.4 (2016): 740-756.