A package for clustering categorical data

Project description

Installation

pip install categorical-cluster

Example

This code reads the pickle file of the example dataset provided in this repository, performs clustering on it, and saves the results in output.p.

import pickle
from categorical_cluster import perform_clustering


MIN_SIMILARITY_FIRST_ITERATION = 0.5    # Value from 0 to 1 - % of similarity among entities in clusters
MIN_SIMILARITY_NEXT_ITERATIONS = 0.45   # Value from 0 to 1 - % of similarity among entities in clusters
MIN_ENTITIES_IN_CLUSTER = 4             # Minimum number of elements a cluster can consist of


# Read the example dataset available in the repo
with open("dataset/sample_dataset.p", "rb") as file:
    data = pickle.load(file)


# Perform clustering
clusters = perform_clustering(
    data=data,
    min_elements_in_cluster=MIN_ENTITIES_IN_CLUSTER,
    min_similarity_first_iter=MIN_SIMILARITY_FIRST_ITERATION,
    min_similarity_next_iters=MIN_SIMILARITY_NEXT_ITERATIONS,
)


# Save result
with open("output.p", "wb") as file:
    pickle.dump(clusters, file)

The input data is a list of rows of "tags" (described below):

['envelope laser rectangle', 'casually explained', 'stand up comedy', 'comedy', 'animation', 'animated comedy', 'satire', 'how to', 'advice', 'funny', 'stand up', 'comedian', 'hilarious', 'humor']

The output is clusters of input rows together with their original indexes (row numbers from the original input list):

[{'source_data': ['golf', 'golf highlights', 'ryder cup', 'ryder cup highlights', '2022 ryder cup', '2023 golf', 'marco simone', 'marco simone course', 'marco simone golf', 'luke donald', 'zach johnson', 'u.s. team', 'european team', 'europe golf', 'u.s. golf', 'ryder cup trophy'], 'source_row_number': 22}, {'source_data': ['golf', 'golf highlights', 'ryder cup', 'ryder cup highlights', '2022 ryder cup', '2023 golf', 'marco simone', 'marco simone course', 'marco simone golf', 'luke donald', 'zach johnson', 'u.s. team', 'european team', 'europe golf', 'u.s. golf', 'ryder cup trophy'], 'source_row_number': 235}, {'source_data': ['golf', 'golf highlights', 'ryder cup', 'ryder cup highlights', '2022 ryder cup', '2023 golf', 'marco simone', 'marco simone course', 'marco simone golf', 'luke donald', 'zach johnson', 'u.s. team', 'european team', 'europe golf', 'u.s. golf', 'ryder cup trophy'], 'source_row_number': 484}, {'source_data': ['golf', 'golf highlights', 'ryder cup', 'ryder cup highlights', '2022 ryder cup', '2023 golf', 'marco simone', 'marco simone course', 'marco simone golf', 'luke donald', 'zach johnson', 'u.s. team', 'european team', 'europe golf', 'u.s. golf', 'ryder cup trophy'], 'source_row_number': 538}, {'source_data': ['golf', 'golf highlights', 'ryder cup', 'ryder cup highlights', '2022 ryder cup', '2023 golf', 'marco simone', 'marco simone course', 'marco simone golf', 'luke donald', 'zach johnson', 'u.s. team', 'european team', 'europe golf', 'u.s. golf', 'ryder cup trophy', 'highlights | day 3 | 2023 ryder cup', 'watch highlights of the day 3 at the 2023 ryder cup held at marco simone golf & country club.', '2023 ryder cup held at marco simone golf', 'marco simone golf & country club.', 'highlights of the day 3', 'ryder cup'], 'source_row_number': 627}]
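
Assuming the full output is a list of such clusters (each a list of dicts like the one above), a small hypothetical post-processing snippet can show which original rows ended up together:

# Hypothetical post-processing: print the original row numbers in each cluster.
# Assumes `clusters` is a list of clusters, each a list of dicts as shown above.
for cluster_idx, cluster in enumerate(clusters):
    rows = [entity["source_row_number"] for entity in cluster]
    print(f"cluster {cluster_idx}: rows {rows}")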

An example demonstrating the usage of logs, and how to use them to determine the similarity parameters, can be found in the example_logs.py file.

Description

This package is specifically designed for clustering categorical data. The input should be provided as a list of lists, where each inner list represents a set of "tags" for a particular record. The more similar the tags between two records, the more likely they are to be in the same cluster.
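
For instance, a minimal hand-made input (purely illustrative, not taken from the sample dataset) could look like this:

# A minimal, hypothetical input: a list of tag lists, one per record
data = [
    ["comedy", "stand up", "satire"],          # record 0
    ["comedy", "stand up comedy", "funny"],    # record 1
    ["golf", "ryder cup", "highlights"],       # record 2
]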

This package was initially developed for clustering YouTube videos. The sample data provided in the file "example_dataset.p" can be used to try out this package. This sample data is a collection of unique tags and elements of the titles of YouTube videos that were trending in 2023.

The clustering process is carried out in the following steps:

  1. Encoding Process: In this step, all tags are mapped to integers. This is done to facilitate the comparison of tags between different records. The mapping is done such that each unique tag is assigned a unique integer.

  2. Filtering Process: After the encoding process, records are filtered based on their tags. Records whose tags do not occur in any other record in the dataset are filtered out, so that the clustering process only considers records that have some level of similarity with other records (see the sketch after this list).

  3. Clustering Process: The clustering process consists of two stages: the initial iteration and subsequent iterations. Both stages perform the same operations but are separated so that the results can be tuned by specifying parameters for each. In the first iteration, the similarity threshold is min_similarity_first_iter; in subsequent iterations, it is min_similarity_next_iters.

These steps ensure that the clustering process is efficient and accurate, providing meaningful clusters of records based on their tags.
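
As a rough illustration of steps 1 and 2 (a sketch only, not the package's actual code), the encoding and filtering could look like this:

from collections import Counter


def encode_and_filter(data):
    # Step 1: map each unique tag to a unique integer
    tag_to_id = {}
    encoded = [
        [tag_to_id.setdefault(tag, len(tag_to_id)) for tag in tags]
        for tags in data
    ]
    # Step 2: count in how many records each tag appears, then drop
    # records whose tags occur in no other record
    counts = Counter(tag_id for record in encoded for tag_id in set(record))
    return [record for record in encoded
            if any(counts[tag_id] > 1 for tag_id in record)]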

The clustering loop works as follows (a rough sketch follows the list):

  1. For each cluster (initially, each record is its own cluster), similarities to every other cluster are calculated and compared against the threshold.
  2. If a given cluster does not have at least one similar cluster (based on the current similarity threshold), it is dropped.
  3. The two most similar clusters are merged into one; subsequent calculations are performed against the merged cluster, not the original elements.
  4. If, in a given iteration, a cluster gains no new similar elements, it is checked against min_elements_in_cluster; if the condition is met, it is moved to the final_clusters list.
  5. This process is repeated until no more clusters can be merged.
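
Here is a rough sketch of that loop, using Jaccard similarity over each cluster's combined tags as a stand-in metric; the package's actual implementation and similarity measure may differ:

def jaccard(a, b):
    # Stand-in similarity measure between two tag sets (0 to 1)
    return len(a & b) / len(a | b)


def cluster_loop(clusters, min_similarity, min_elements):
    # Each cluster is a list of records; each record is a set of tags
    final_clusters = []
    while len(clusters) > 1:
        # Step 1: pairwise similarities above the threshold
        sims = []
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                tags_i = set().union(*clusters[i])
                tags_j = set().union(*clusters[j])
                s = jaccard(tags_i, tags_j)
                if s >= min_similarity:
                    sims.append((s, i, j))
        if not sims:
            break  # step 5: nothing left to merge
        # Step 3: merge the single most similar pair
        _, i, j = max(sims)
        merged = clusters[i] + clusters[j]
        # Steps 2 and 4: clusters with no similar partner are
        # finalised (if large enough) or dropped
        has_partner = {i for _, i, _ in sims} | {j for _, _, j in sims}
        for k, c in enumerate(clusters):
            if k not in has_partner and len(c) >= min_elements:
                final_clusters.append(c)
        clusters = [c for k, c in enumerate(clusters)
                    if k in has_partner and k not in (i, j)]
        clusters.append(merged)
    final_clusters += [c for c in clusters if len(c) >= min_elements]
    return final_clusters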

This process ensures that the clusters formed are meaningful and based on the similarity of tags between the records.

Please note that during the clustering process, a single record could potentially be assigned to more than one cluster.
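
Because of this, it can be worth checking the output for duplicates. A hypothetical check, assuming the output structure shown earlier:

from collections import Counter

# Count how many clusters each original row appears in
row_counts = Counter(entity["source_row_number"]
                     for cluster in clusters for entity in cluster)
overlapping_rows = [row for row, n in row_counts.items() if n > 1]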

Logging

During the clustering process, logs are generated that capture the calculated similarities while running the clustering algorithm. These logs contain the values of the calculated similarities and the number of occurrences of these values.

By analyzing these logs, you can determine the optimal similarity parameters for your specific use case. This can help in fine-tuning the clustering process to achieve more accurate and meaningful clusters.

Here is an example of what these logs look like:

[Figure: histogram of logged similarity values (x-axis: similarity value, y-axis: number of occurrences)]

In the above figure, the x-axis represents the similarity values and the y-axis represents the number of occurrences of these values. By analyzing this graph, you can determine the most common similarity values in your data, which can be used to set the similarity parameters for the clustering algorithm.

The function below can be used to generate the plot shown above. It takes a list of similarity values as input and plots a histogram of these values.

import matplotlib.pyplot as plt


def plot_similarity_values(values):
    # Round each similarity value to two decimal places
    values = [round(x, 2) for x in values]
    # Plot a histogram of how often each rounded value occurs
    plt.hist(values, bins=100, edgecolor="k", alpha=0.7)
    plt.title("Histogram of Values Rounded to 0.01")
    plt.xlabel("Value")
    plt.ylabel("Frequency")
    plt.grid(True)
    plt.show()
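
A hypothetical usage, with random numbers standing in for the similarity values captured in the logs:

import random

# Demo only: replace with the similarity values captured in the logs
plot_similarity_values([random.random() for _ in range(1000)])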

Future plans (draft):

1. Enable multiprocessing by rewriting the loop into maps. If no better parallelization idea comes up, a map-and-reduce pass can be done at each iteration: first map each cluster to its similarities against all other clusters; then map again to keep only the most coherent pairing. For example, given pair (1, 3) with 93% coherence and pair (3, 5) with 50% coherence, merge clusters 1 and 3 into a new cluster and leave element 5 for the next iteration.
2. Accept a pandas DataFrame and the columns to cluster on, and return the DataFrame with a new label column.
3. Optimize the similarity calculation: instead of recalculating everything at each iteration, calculate only against the newly merged pair.
