Tools to impute

Project description

hlbotterman@quantmetry.com, jroussel@quantmetry.com, tmorzadec@quantmetry.com, rhajou@quantmetry.com, fdakhli@quantmetry.com

License: new BSD
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved
Classifier: Topic :: Software Development
Classifier: Topic :: Scientific/Engineering
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: POSIX
Classifier: Operating System :: Unix
Classifier: Operating System :: MacOS
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Requires-Python: >=3.8
Description-Content-Type: text/x-rst
Provides-Extra: tests
Provides-Extra: docs

RPCA for anomaly detection and data imputation

What is robust principal component analysis?

Robust Principal Component Analysis (RPCA) is a modification of the statistical procedure of principal component analysis (PCA) that can handle grossly corrupted observations.

Suppose we are given a large data matrix \(\mathbf{D}\), and know that it may be decomposed as

\begin{equation*} \mathbf{D} = \mathbf{X}^* + \mathbf{A}^* \end{equation*}

where \(\mathbf{X}^*\) is low-rank and \(\mathbf{A}^*\) is sparse. We do not know the low-dimensional column and row spaces of \(\mathbf{X}^*\), nor even their dimension. Similarly, we do not know the location, the magnitude or even the number of the non-zero entries of \(\mathbf{A}^*\). Can the low-rank and sparse parts be recovered both accurately and efficiently?

Of course, for the separation problem to make sense, the low-rank part cannot be sparse and analogously, the sparse part cannot be low-rank. See here for more details.
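To make the setting concrete, the following minimal sketch builds a synthetic matrix of this form with NumPy. The sizes, the 5% corruption rate and the names `X_true`, `A_true` and `corrupted` are illustrative assumptions, not part of this package.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 50, 40, 3  # illustrative sizes: n x m matrix of rank r

# Low-rank part: product of two thin Gaussian factors, so its rank is r
X_true = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))

# Sparse part: about 5% of the entries carry a corruption, the rest are zero
A_true = np.zeros((n, m))
corrupted = rng.random((n, m)) < 0.05
A_true[corrupted] = rng.uniform(-10, 10, size=corrupted.sum())

# Observed matrix: only D is available to the analyst
D = X_true + A_true

rank_X = np.linalg.matrix_rank(X_true)
fraction_corrupted = np.count_nonzero(A_true) / A_true.size
print(f"rank of X: {rank_X}, fraction of corrupted entries: {fraction_corrupted:.3f}")
```

The goal of RPCA is to recover `X_true` and `A_true` from `D` alone.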

Formally, the problem is expressed as

\begin{equation*} \begin{aligned} & \text{minimise} \quad \text{rank} (\mathbf{X}) + \lambda \Vert \mathbf{A} \Vert_0 \\ & \text{s.t.} \quad \mathbf{D} = \mathbf{X} + \mathbf{A} \end{aligned} \end{equation*}

Unfortunately, this optimisation problem is NP-hard because its objective is nonconvex and discontinuous. A widely used approach is to replace rank(\(\mathbf{X}\)) with its convex envelope, the nuclear norm \(\Vert \mathbf{X} \Vert_*\), and the \(\ell_0\) penalty with the \(\ell_1\)-norm, which models sparse noise well and can be minimised efficiently. The problem then becomes

\begin{equation*} \begin{aligned} & \text{minimise} \quad \Vert \mathbf{X} \Vert_* + \lambda \Vert \mathbf{A} \Vert_1 \\ & \text{s.t.} \quad \mathbf{D} = \mathbf{X} + \mathbf{A} \end{aligned} \end{equation*}

Theoretically, this is guaranteed to work even if the rank of \(\mathbf{X}^*\) grows almost linearly in the dimension of the matrix and the errors in \(\mathbf{A}^*\) affect up to a constant fraction of all entries. Algorithmically, the above problem can be solved by efficient and scalable algorithms, at a cost not much higher than that of classical PCA. Empirically, a number of simulations and experiments suggest that this works under surprisingly broad conditions for many types of real data.
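One standard way to solve the relaxed problem is an alternating-direction (augmented Lagrangian) scheme that alternates singular value thresholding on \(\mathbf{X}\) with entrywise soft thresholding on \(\mathbf{A}\). The sketch below is a minimal NumPy illustration of that scheme and is not the implementation shipped in this package. The function names, the default \(\lambda = 1/\sqrt{\max(n, m)}\), the step parameter `mu` and the stopping rule are assumptions taken from common practice.

```python
import numpy as np


def soft_threshold(M, tau):
    """Entrywise shrinkage: proximal operator of tau times the l1-norm."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)


def singular_value_threshold(M, tau):
    """Shrink the singular values of M: proximal operator of tau times the nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * soft_threshold(s, tau)) @ Vt


def pcp_admm(D, lam=None, mu=None, tol=1e-7, max_iter=1000):
    """Approximately solve: minimise ||X||_* + lam * ||A||_1  s.t.  D = X + A."""
    n, m = D.shape
    lam = 1.0 / np.sqrt(max(n, m)) if lam is None else lam   # assumed default
    mu = n * m / (4.0 * np.abs(D).sum()) if mu is None else mu  # assumed default
    A = np.zeros_like(D)
    Y = np.zeros_like(D)  # Lagrange multiplier for the constraint D = X + A
    for _ in range(max_iter):
        X = singular_value_threshold(D - A + Y / mu, 1.0 / mu)
        A = soft_threshold(D - X + Y / mu, lam / mu)
        residual = D - X - A
        Y = Y + mu * residual
        # Stop once the constraint is satisfied up to a relative tolerance
        if np.linalg.norm(residual) <= tol * np.linalg.norm(D):
            break
    return X, A


# Synthetic check: rank-2 matrix with about 5% of its entries corrupted
rng = np.random.default_rng(0)
n, m, r = 80, 80, 2
X_true = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))
A_true = np.zeros((n, m))
mask = rng.random((n, m)) < 0.05
A_true[mask] = rng.uniform(-10, 10, size=mask.sum())

X_hat, A_hat = pcp_admm(X_true + A_true)
rel_error = np.linalg.norm(X_hat - X_true) / np.linalg.norm(X_true)
print(f"relative error on the low-rank part: {rel_error:.2e}")
```

Each iteration costs one SVD of a matrix the size of \(\mathbf{D}\), which is why the overall cost stays comparable to that of classical PCA.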

Some examples of real-life applications are background modelling from video surveillance, face recognition and speech recognition. Here we focus on anomaly detection in time series.

What’s in this repo?

The following classes are implemented:

RPCA class based on RPCA p.29.

\begin{equation*} \begin{aligned} & \text{minimise} \quad \Vert \mathbf{X} \Vert_* + \lambda \Vert \mathbf{A} \Vert_1 \\ & \text{s.t.} \quad \mathbf{D} = \mathbf{X} + \mathbf{A} \end{aligned} \end{equation*}

GraphRPCA class based on GraphRPCA.

\begin{equation*} \begin{aligned} & \text{minimise} \quad \Vert \mathbf{A} \Vert_1 + \gamma_1 \text{tr}(\mathbf{X} \mathbf{\mathcal{L}_1} \mathbf{X}^T) + \gamma_2 \text{tr}(\mathbf{X}^T \mathbf{\mathcal{L}_2} \mathbf{X}) \\ & \text{s.t.} \quad \mathbf{D} = \mathbf{X} + \mathbf{A} \end{aligned} \end{equation*}
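The two trace terms penalise variation of \(\mathbf{X}\) along the edges of a graph on its columns (\(\mathcal{L}_1\)) and of a graph on its rows (\(\mathcal{L}_2\)). The sketch below evaluates this objective with NumPy. The helper names and the choice of a simple path graph, which links each row or column to its neighbour, are assumptions made for illustration. How the package actually builds its graphs is not stated here.

```python
import numpy as np


def chain_laplacian(size):
    """Laplacian (degree minus adjacency) of an unweighted path graph on `size` nodes."""
    W = np.zeros((size, size))
    idx = np.arange(size - 1)
    W[idx, idx + 1] = 1.0
    W[idx + 1, idx] = 1.0
    return np.diag(W.sum(axis=1)) - W


def graph_rpca_objective(X, A, L1, L2, gamma1, gamma2):
    """||A||_1 + gamma1 * tr(X L1 X^T) + gamma2 * tr(X^T L2 X)."""
    smooth_cols = np.trace(X @ L1 @ X.T)  # L1 acts on the columns of X
    smooth_rows = np.trace(X.T @ L2 @ X)  # L2 acts on the rows of X
    return np.abs(A).sum() + gamma1 * smooth_cols + gamma2 * smooth_rows


rng = np.random.default_rng(0)
X = rng.standard_normal((6, 8))
A = np.zeros_like(X)
L1 = chain_laplacian(X.shape[1])  # 8 x 8, graph between columns
L2 = chain_laplacian(X.shape[0])  # 6 x 6, graph between rows

# For a path graph the trace term is the sum of squared neighbour differences
col_term = np.trace(X @ L1 @ X.T)
col_check = np.sum(np.diff(X, axis=1) ** 2)
print(col_term, col_check)
```

A matrix that is constant along every edge of both graphs makes both trace terms vanish, so the objective then reduces to \(\Vert \mathbf{A} \Vert_1\).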

TemporalRPCA class based on Link 1 and Link 2. The optimisation problem is the following:

\begin{equation*} \text{minimise} \quad \Vert P_{\Omega}(\mathbf{X}+\mathbf{A}-\mathbf{D}) \Vert_F^2 + \lambda_1 \Vert \mathbf{X} \Vert_* + \lambda_2 \Vert \mathbf{A} \Vert_1 + \sum_{k=1}^K \eta_k \Vert \mathbf{X}\mathbf{H}_k \Vert_p \end{equation*}

where \(\Vert \mathbf{X}\mathbf{H}_k \Vert_p\) is either \(\Vert \mathbf{X}\mathbf{H}_k \Vert_1\) or \(\Vert \mathbf{X}\mathbf{H}_k \Vert_F^2\).
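As an illustration, the sketch below evaluates this objective with NumPy. The form of \(\mathbf{H}_k\) is not specified above, so the code assumes a lag-\(k\) differencing matrix, for which column \(j\) of \(\mathbf{X}\mathbf{H}_k\) is the difference between columns \(j+k\) and \(j\) of \(\mathbf{X}\). The function names, the boolean `observed` mask standing for \(\Omega\) and the parameter names are hypothetical.

```python
import numpy as np


def lag_matrix(n_cols, k):
    """Assumed H_k of shape (n_cols, n_cols - k): (X @ H_k)[:, j] = X[:, j + k] - X[:, j]."""
    H = np.zeros((n_cols, n_cols - k))
    j = np.arange(n_cols - k)
    H[j, j] = -1.0
    H[j + k, j] = 1.0
    return H


def temporal_rpca_objective(D, X, A, observed, lam1, lam2, lags, etas, penalty="fro"):
    """Data fit on observed entries + nuclear norm + l1 norm + temporal penalties."""
    # Entries outside the observed set do not contribute, even if D is NaN there
    residual = np.where(observed, X + A - D, 0.0)
    value = np.sum(residual ** 2)
    value += lam1 * np.linalg.norm(X, "nuc")
    value += lam2 * np.abs(A).sum()
    for k, eta in zip(lags, etas):
        XH = X @ lag_matrix(X.shape[1], k)
        value += eta * (np.abs(XH).sum() if penalty == "l1" else np.sum(XH ** 2))
    return value


# A matrix whose columns repeat with period 4 incurs no penalty for the lag k = 4
rng = np.random.default_rng(0)
base = rng.standard_normal((5, 4))
X = np.tile(base, (1, 3))  # shape (5, 12)
A = np.zeros_like(X)
D = X.copy()
observed = np.ones(X.shape, dtype=bool)
observed[0, 0] = False     # pretend one entry is missing
D[0, 0] = np.nan

value = temporal_rpca_objective(D, X, A, observed, lam1=1.0, lam2=0.1, lags=[4], etas=[2.0])
print(value, np.linalg.norm(X, "nuc"))
```

Under this assumption, the penalty with lag \(k\) favours low-rank parts that repeat every \(k\) columns, which suits seasonal time series.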

The operator \(P_{\Omega}\) is the projection operator such that \(P_{\Omega}(\mathbf{M})\) is the projection of \(\mathbf{M}\) on the set of observed data \(\Omega\). This makes it possible to handle missing values. Each of these classes accepts either a time series or a matrix as input. If a time series is passed, it is pre-processed first.
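The sketch below illustrates both ideas with NumPy. The pre-processing applied to a time series is not described above, so `fold_series`, which reshapes a series into a matrix with one period per row and pads the tail with NaN, is only one plausible assumption. `project_observed` is a hypothetical helper that implements \(P_{\Omega}\) by zeroing the unobserved entries.

```python
import numpy as np


def fold_series(series, period):
    """Reshape a 1-D series into a matrix with one period per row, padding with NaN."""
    series = np.asarray(series, dtype=float)
    n_rows = -(-series.size // period)  # ceiling division
    padded = np.full(n_rows * period, np.nan)
    padded[: series.size] = series
    return padded.reshape(n_rows, period)


def project_observed(M, observed):
    """P_Omega(M): keep the entries in the observed set, set the others to zero."""
    return np.where(observed, M, 0.0)


D = fold_series(np.arange(10), period=4)  # shape (3, 4), last two entries are NaN
observed = ~np.isnan(D)                   # Omega: the entries that were actually measured
projected = project_observed(D, observed)
print(projected)
```

Treating the padded entries as unobserved lets a series whose length is not a multiple of the period go through the same matrix formulation as any other input with missing values.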

See the examples folder for a first overview of the implemented classes.

Installation

Install directly from the gitlab repository:

Contributing

Feel free to open an issue or contact us at pnom@quantmetry.com

References

[1] Candès, Emmanuel J., et al. “Robust principal component analysis?” Journal of the ACM (JACM) 58.3 (2011): 1-37.

[2] Wang, Xuehui, et al. “An improved robust principal component analysis model for anomalies detection of subway passenger flow.” Journal of Advanced Transportation 2018 (2018).

[3] Chen, Yuxin, et al. “Bridging convex and nonconvex optimization in robust PCA: Noise, outliers, and missing data.” arXiv preprint arXiv:2001.05484 (2020).

[4] Shahid, Nauman, et al. “Fast robust PCA on graphs.” IEEE Journal of Selected Topics in Signal Processing 10.4 (2016): 740-756.
