
outlier detection and clustering based on sparse data observers

Project description

Sparse Data Observers (SDO) is an unsupervised learning approach developed to cover the need for fast, highly interpretable and intuitively parameterizable anomaly detection. Its extension, SDOclust, performs clustering while preserving the simplicity and applicability of the original approach.

SDO and SDOclust are powerful options when statistical estimates are representative and feature spaces are suited to distance-based analysis. They are lightweight, intuitive, self-adjusting, noise-resistant, able to extract non-convex clusters (SDOclust), and built on robust parameters and interpretable models.

Feasibility and rapid integration into real-world applications are the core goals behind SDO and SDOclust, which can work on most data scenarios without parameter adjustment (simply using the default parameterization).

Installation and dependencies

sdo can be installed from PyPI using

    pip3 install sdoclust

or directly from our GitHub repository:

    pip3 install git+https://github.com/CN-TU/pysdoclust

sdo requires the following packages:

  • numpy
  • math
  • scipy
  • sklearn
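
As a quick sanity check before running the examples, the sketch below reports which of the listed modules cannot be imported in the current environment. It uses only the standard library; missing_modules is a hypothetical helper name and not part of sdo. Note that math ships with Python itself, and that the sklearn module is provided by the scikit-learn distribution on PyPI.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found.

    Hypothetical helper for illustration; not part of sdoclust.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


# Module names as listed in the dependency list above
required = ["numpy", "math", "scipy", "sklearn"]
print("Missing modules:", missing_modules(required))
```

An empty list means every dependency is importable. The plotting examples below additionally use matplotlib, which can be checked the same way.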

Examples of usage

SDO

    import numpy as np
    np.random.seed(1)

    # Generate data
    from sklearn import datasets
    x, y = datasets.make_circles(n_samples=5000, factor=0.3, noise=0.1)

    # SDO outlier scoring
    import sdoclust as sdo
    s = sdo.SDO().fit_predict(x)

    # plotting results
    import matplotlib.pyplot as plt
    fig = plt.figure()
    plt.scatter(x[:,0],x[:,1], s=10, cmap='coolwarm', c=s)
    plt.colorbar(ticks=[np.min(s), np.max(s)])
    plt.title('SDO outlierness scores')
    plt.show()
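
The scores returned above are continuous. If binary outlier flags are needed, one simple option is to flag a fixed fraction of the highest-scoring points. The sketch below assumes that higher scores mean more outlying points; flag_top_fraction and its fraction argument are hypothetical names for illustration, not part of sdoclust, and the 5% default is an arbitrary choice.

```python
def flag_top_fraction(scores, fraction=0.05):
    """Flag the highest-scoring `fraction` of points as outliers.

    Hypothetical helper; assumes higher score = more outlying.
    Ties at the cut-off value are all flagged.
    """
    scores = list(scores)
    if not scores:
        return []
    # Number of points to flag (at least one)
    n_flag = max(1, int(round(fraction * len(scores))))
    # Score of the n_flag-th highest point acts as the cut-off
    cutoff = sorted(scores, reverse=True)[n_flag - 1]
    return [s >= cutoff for s in scores]


# Usage with a small hand-made score list
flags = flag_top_fraction([0.1, 0.2, 5.0, 0.3], fraction=0.25)
print(flags)
```

With the SDO example, the same helper could be applied to s, for instance flag_top_fraction(s, fraction=0.05).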

SDOclust

    import numpy as np
    np.random.seed(1)

    # Generate data
    from sklearn import datasets
    x, y = datasets.make_circles(n_samples=5000, factor=0.3, noise=0.1)

    # SDOclust clustering
    import sdoclust as sdo
    p = sdo.SDOclust().fit_predict(x)

    # plotting results
    import matplotlib.pyplot as plt
    fig = plt.figure()
    plt.scatter(x[:,0],x[:,1], s=10, cmap='coolwarm', c=p)
    plt.title('SDOclust clustering')
    plt.show()

Application notes

SDO and SDOclust perform well with the default parameterization in most applications, but may require adjustment in some cases, typically when datasets have very few elements, when clusters overlap, or when there are many under-represented clusters.

The main SDO parameters are:

  • x, which establishes the number of closest observers used to evaluate each data point.

  • qv, which sets a robust threshold for removing idle observers.

  • k, which fixes the number of observers in the model.

      mdl = sdo.SDO(x=5, qv=0.3, k=500)
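
To make the roles of x, qv and k concrete, the following is a minimal pure-Python sketch of an observer-based scorer. It assumes the following procedure: sample k observers from the data, count how often each observer is among the x closest to a data point, drop observers whose count falls below the qv-quantile, and score each point by its median distance to the x closest remaining observers. This is a simplified illustration under those assumptions, not the sdoclust implementation; toy_sdo_scores and its seed argument are hypothetical names.

```python
import math
import random
from statistics import median


def toy_sdo_scores(data, x=3, qv=0.3, k=20, seed=0):
    """Toy observer-based outlier scores (illustration only)."""
    rng = random.Random(seed)
    # k: number of observers, sampled here at random from the data
    observers = rng.sample(data, min(k, len(data)))

    def nearest(point, obs, n):
        # Indices of the n observers closest to `point`
        order = sorted(range(len(obs)), key=lambda i: math.dist(point, obs[i]))
        return order[:n]

    # x: each data point "votes" for its x closest observers
    counts = [0] * len(observers)
    for p in data:
        for i in nearest(p, observers, x):
            counts[i] += 1

    # qv: observers with a count below the qv-quantile are idle and removed
    threshold = sorted(counts)[int(qv * (len(counts) - 1))]
    active = [o for o, c in zip(observers, counts) if c >= threshold]

    # Score: median distance to the x closest active observers
    scores = []
    for p in data:
        idx = nearest(p, active, x)
        scores.append(median(math.dist(p, active[i]) for i in idx))
    return scores


# Usage: a Gaussian blob plus one far-away point appended at the end
rng = random.Random(1)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(60)] + [(20.0, 20.0)]
scores = toy_sdo_scores(points, x=3, qv=0.3, k=20)
print("index of highest score:", scores.index(max(scores)))
```

In this sketch, a larger k gives a denser model of the data, a larger x smooths the scores over more observers, and a larger qv discards more low-activity observers.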
    

Additionally, SDOclust also incorporates:

  • zeta, which sets the trade-off between locality and globality in the thresholds used to cut graph edges.

  • chi, which selects the chi-th closest observer of any given observer when deciding the edge cut-off thresholds.

  • e, which sets the minimum number of observers that a cluster can have.

      mdl = sdo.SDOclust(zeta=0.6, chi=10, e=3)
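
The sketch below shows one possible reading of how zeta, chi and e could interact when clustering observers. It assumes that each observer's local cut-off is its distance to its chi-th closest observer, that the global cut-off is the mean of the local ones, that zeta blends the two, that two observers are linked when their distance is within both of their cut-offs, and that connected groups with fewer than e observers are discarded. These are assumptions made for illustration, not the sdoclust implementation; toy_observer_clusters is a hypothetical name.

```python
import math
from collections import Counter


def toy_observer_clusters(observers, zeta=0.6, chi=2, e=2):
    """Toy graph-based clustering of observers (illustration only).

    Returns one label per observer; -1 marks observers in discarded groups.
    """
    n = len(observers)
    if n == 0:
        return []
    dist = [[math.dist(a, b) for b in observers] for a in observers]

    # chi: local cut-off = distance to the chi-th closest other observer
    # (index 0 of each sorted row is the observer itself, at distance 0)
    local = [sorted(row)[min(chi, n - 1)] for row in dist]
    global_cut = sum(local) / n

    # zeta: 1.0 uses only the local cut-off, 0.0 only the global one
    cut = [zeta * loc + (1 - zeta) * global_cut for loc in local]

    # Link observers whose distance is within both cut-offs,
    # then label connected components with a depth-first search
    labels = [None] * n
    next_label = 0
    for start in range(n):
        if labels[start] is not None:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] is None and dist[i][j] <= min(cut[i], cut[j]):
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1

    # e: groups with fewer than e observers are discarded (label -1)
    sizes = Counter(labels)
    return [lab if sizes[lab] >= e else -1 for lab in labels]


# Usage: two compact groups of four observers and one isolated observer
observers = [(0, 0), (0, 1), (1, 0), (1, 1),
             (10, 10), (10, 11), (11, 10), (11, 11),
             (30, 0)]
print(toy_observer_clusters(observers, zeta=0.6, chi=2, e=2))
```

Under these assumptions, raising zeta makes the cut-offs follow local density more closely, while raising e removes more small groups.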
    

[1] and [2] provide further explanations on SDO and SDOclust parameters.

Citation

If you use SDO or SDOclust in your research, please cite our publications:

SDO

[1] Iglesias, F., Zseby, T., Zimek, A. (2018). Outlier Detection Based on Low Density Models. In: 2018 IEEE International Conference on Data Mining Workshops (ICDMW), Singapore, pp. 970-979. https://doi.org/10.1109/ICDMW.2018.00140

    @INPROCEEDINGS{SDO2018,
        author    = {F{\'e}lix Iglesias and Tanja Zseby and Arthur Zimek},
        booktitle = {2018 IEEE International Conference on Data Mining Workshops (ICDMW)},
        title     = {Outlier Detection Based on Low Density Models},
        year      = {2018},
        pages     = {970--979},
        doi       = {10.1109/ICDMW.2018.00140}
    }

SDOclust

[2] Iglesias, F., Zseby, T., Hartl, A., Zimek, A. (2023). SDOclust: Clustering with Sparse Data Observers. In: Pedreira, O., Estivill-Castro, V. (eds) Similarity Search and Applications. SISAP 2023. Lecture Notes in Computer Science, vol 14289. Springer, Cham. https://doi.org/10.1007/978-3-031-46994-7_16

    @InProceedings{SDOclust2023,
        title     = {SDOclust: Clustering with Sparse Data Observers},
        author    = {F{\'e}lix Iglesias and Tanja Zseby and Alexander Hartl and Arthur Zimek},
        editor    = {{\'O}scar Pedreira and Vladimir Estivill-Castro},
        booktitle = {Similarity Search and Applications},
        year      = {2023},
        publisher = {Springer Nature Switzerland},
        address   = {Cham},
        pages     = {185--199},
        doi       = {10.1007/978-3-031-46994-7_16}
    }

