
Clustering using different non-parametric models combined with word embeddings

Project description

Clustering using different non-parametric models with the power of BERT embeddings

The project provides Python implementations of different non-parametric clustering models that use BERT embeddings to identify groups of similar objects in textual data.

Nonparametric model concepts

  • Have parameters lying in an infinite-dimensional space
  • Have latent variables even with finite raw data
  • Have an effectively infinite number of parameters
  • Can be understood as having a random number of parameters
  • Allow the number of parameters to grow with the dataset

Definition

Models a collection of distributions (a distribution over distributions)

  • Nonparametric model: the parameters come from a possibly infinite-dimensional space F (Θ ∈ F)

Properties

  1. CRP (Chinese Restaurant Process) defines a distribution over clusterings (i.e. partitions) of the indices 1,…,n

[In a simulated environment]

  • Customer = index
  • Table = cluster

When customer 1 enters, he can sit anywhere he likes (he starts the first table). Customer 2 can then sit at table 1 or start a new table, with the following probabilities:

- Table 1 : 1 / (1 + α)
- New Table (i.e. any empty table) : α / (1 + α)

[In a clustering problem]

Let Nj be the number of data points in cluster j and N the total number of data points seen so far. For data point #(N+1), we have:

P(choose cluster j) = Nj / (α + N)

P(choose a new cluster) = α / (α + N)
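The seating rule above can be sketched as a small sampler (an illustrative sketch only; `crp_sample` is a hypothetical helper, not part of the embed-clustering API):

```python
import random

def crp_sample(n, alpha, seed=0):
    """Sample a partition of n points from the Chinese Restaurant Process."""
    rng = random.Random(seed)
    counts = []        # counts[j] = Nj, the number of points in cluster j
    assignments = []   # assignments[i] = cluster index of point i
    for i in range(n):                    # point i arrives when N = i points are seated
        r = rng.random() * (alpha + i)
        acc = 0.0
        for j, Nj in enumerate(counts):
            acc += Nj                     # join table j with prob Nj / (alpha + N)
            if r < acc:
                counts[j] += 1
                assignments.append(j)
                break
        else:
            counts.append(1)              # open a new table with prob alpha / (alpha + N)
            assignments.append(len(counts) - 1)
    return assignments

assignments = crp_sample(100, alpha=1.0)
```

The first customer always opens a new table (probability α / (α + 0) = 1), and tables are numbered in order of creation.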
  2. The expected number of clusters given n customers (i.e. observations) is O(α log(n))

  • Rich-get-richer effect on clusters: popular tables tend to get more crowded.

A consequence of this is that as more people sit at a particular table, that table grows in popularity, so new patrons become less likely to sit at an empty table.
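The O(α log(n)) growth can be checked exactly: point i+1 opens a new table with probability α/(α+i), so the expected number of clusters is the sum of those probabilities. A quick illustrative check (`expected_clusters` is a hypothetical helper, not part of the package):

```python
import math

def expected_clusters(n, alpha):
    # point i (0-indexed) opens a new table with probability alpha / (alpha + i),
    # so E[number of clusters] = sum of these probabilities over all n points
    return sum(alpha / (alpha + i) for i in range(n))

for n in (10, 100, 1000):
    exact = expected_clusters(n, alpha=1.0)
    approx = 1.0 * math.log(n)        # the O(alpha * log n) approximation
    print(n, round(exact, 2), round(approx, 2))
```

For α = 1 this is the harmonic number H_n, which stays within a constant of log(n).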

  3. Behavior of the CRP with α:
– As α goes to 0, the number of clusters goes to 1
– As α goes to +∞, the number of clusters goes to n
  4. The CRP is an exchangeable process: if we shuffle the order of the data points, the probability of the resulting configuration is unchanged.
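Exchangeability can be verified numerically: the probability of a particular arrival sequence is the product of the per-customer probabilities from the seating rule above, and it depends only on the final cluster sizes, not on the arrival order (a standalone sketch; `seq_probability` is a hypothetical helper):

```python
def seq_probability(labels, alpha):
    """Probability under the CRP of observing this exact sequence of cluster labels."""
    counts = {}
    p = 1.0
    for i, z in enumerate(labels):         # customer i arrives when i are seated
        if z in counts:
            p *= counts[z] / (alpha + i)   # join existing table: Nj / (alpha + N)
            counts[z] += 1
        else:
            p *= alpha / (alpha + i)       # open a new table: alpha / (alpha + N)
            counts[z] = 1
    return p

# two arrival orders that induce the same cluster sizes {3, 2}
a = seq_probability(['A', 'A', 'B', 'A', 'B'], alpha=0.5)
b = seq_probability(['B', 'A', 'B', 'A', 'A'], alpha=0.5)
assert abs(a - b) < 1e-12
```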

For more information:

Stanford University (2016). Chinese Restaurant Viewpoint. Retrieved February 13, 2018 from: https://cs.stanford.edu/~ppasupat/a9online/1083.html

Installation

Use the package manager pip to install embed-clustering.

pip install embed-clustering

Usage

# import the crp algorithm
from embed_clustering.latent_component import crp_algorithm

# read the data you want to cluster
import pandas as pd
df = pd.read_csv('sample.csv')

corpus = df[column] # specify the column you want to cluster

# apply the algorithm by passing the parameters
df['cluster'] = crp_algorithm(corpus, compute='cpu', cleaning=True)
# if you have a GPU, pass compute='cuda'; set cleaning=False to skip text cleaning before clustering

Evaluation

We evaluated the performance of the non-parametric CRP algorithm against a centroid-based algorithm that takes the number of clusters K as a predefined parameter.

To identify an optimal number of clusters, we applied two well-known methods to a collection of arbitrary data: the elbow method and the Silhouette score, which measures cohesion and separation between clusters; if many points have a low or negative silhouette value, the clustering configuration may have too many or too few clusters. We then deployed our model on the same data with no predefined or tuned parameters and found that the non-parametric model obtained good clusterings on both structured and unstructured data.
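The elbow and Silhouette baselines described above can be sketched with scikit-learn (an illustrative sketch on synthetic blobs, not the project's actual evaluation code or dataset):

```python
# Sketch of the k-selection baselines: KMeans inertia gives the elbow curve,
# and the mean silhouette score (in [-1, 1]) measures cohesion vs. separation.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

inertias = {}   # within-cluster sum of squares per k, for the elbow plot
scores = {}     # mean silhouette score per k
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_
    scores[k] = silhouette_score(X, km.labels_)

best_k = max(scores, key=scores.get)  # k with the highest mean silhouette
```

The CRP algorithm, by contrast, needs no such sweep over candidate values of k.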

About

This algorithm was developed by Masume Azizyan & Deepak John Reji as part of their ongoing research on non-parametric models and word embeddings. If you use this work (code, model), please cite us and star the repository at: https://github.com/dreji18/embed-clustering

License

MIT License

Copyright (c) 2022 Masume Azizyan, Deepak John Reji
