Extract higher-level clusters from keywords

Project description

Simple Keyword Clusterer

A simple machine learning package that clusters keywords into higher-level groups.

Example:
"Senior Frontend Engineer" --> "Frontend Engineer"
"Junior Backend developer" --> "Backend developer"


Installation

pip install simple_keyword_clusterer

Usage

# import the package
from simple_keyword_clusterer import Clusterer

# read your keywords into a list (one keyword per line)
with open("../my_keywords.txt", "r") as f:
    data = f.read().splitlines()

# instantiate object
clusterer = Clusterer()

# apply clustering
df = clusterer.extract(data)

print(df)

Performance

The algorithm automatically picks the number of clusters that yields the best Silhouette Score.
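To illustrate the idea of picking the cluster count by Silhouette Score, here is a minimal standard-library sketch. It is not the package's implementation: the Jaccard distance on word sets, the tiny k-medoids routine, and the names `jaccard_distance`, `k_medoids`, `silhouette`, and `best_clustering` are all assumptions made for this example.

```python
def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two sets of words."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def k_medoids(dist, k, max_iter=20):
    """Tiny deterministic k-medoids on a precomputed distance matrix."""
    n = len(dist)
    medoids = [0]
    # Farthest-point initialisation: add the point farthest from the current medoids.
    while len(medoids) < k:
        candidates = [i for i in range(n) if i not in medoids]
        medoids.append(max(candidates, key=lambda i: min(dist[i][m] for m in medoids)))
    labels = []
    for _ in range(max_iter):
        # Assign every point to its nearest medoid.
        labels = [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]
        # Move each medoid to the member with the smallest total in-cluster distance.
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old medoid if a cluster ends up empty
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(min(members, key=lambda i: sum(dist[i][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels


def silhouette(dist, labels):
    """Mean silhouette coefficient; singleton clusters contribute 0."""
    clusters = {}
    for i, c in enumerate(labels):
        clusters.setdefault(c, []).append(i)
    if len(clusters) < 2:
        return -1.0  # the score is undefined for a single cluster; treat it as worst
    total = 0.0
    for i, c in enumerate(labels):
        own = clusters[c]
        if len(own) == 1:
            continue
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        b = min(sum(dist[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != c)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)


def best_clustering(keywords):
    """Try k = 2 .. n-1 and keep the labelling with the highest silhouette."""
    tokens = [set(kw.lower().split()) for kw in keywords]
    dist = [[jaccard_distance(a, b) for b in tokens] for a in tokens]
    results = {}
    for k in range(2, len(keywords)):
        labels = k_medoids(dist, k)
        results[k] = (silhouette(dist, labels), labels)
    best_k = max(results, key=lambda k: results[k][0])
    return best_k, results[best_k][1]


keywords = [
    "senior frontend engineer", "junior frontend engineer", "frontend engineer",
    "senior backend developer", "junior backend developer", "backend developer",
]
best_k, labels = best_clustering(keywords)
print(best_k, labels)  # prints: 2 [0, 0, 0, 1, 1, 1]
```

On this toy input the frontend and backend titles separate cleanly, so two clusters score highest; noisier data can give a much flatter score curve.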

You can also specify the number of clusters yourself:

# instantiate object
clusterer = Clusterer(n_clusters=4)

# apply clustering
df = clusterer.extract(data)

For best performance, reduce the variance of the data by keeping all keywords within the same semantic context.
A file of job-title keywords, for example, should stay coherent and not contain unrelated items such as gardening keywords.

Even with mixed data, if the items are clearly separable the algorithm should still produce usable output.

Customization

You can customize the clustering through two files:

  • blacklist.txt
  • to_normalize.txt

If the clustering identifies unwanted groups, you can blacklist the offending words by appending them to the blacklist.txt file.

The to_normalize.txt file contains tuples, each describing a replacement to apply to the keywords. For instance:

("back end", "backend"), ("front end", "frontend"), ("sr", "Senior"), ("jr", "junior")

Add your own tuples to the file to extend these replacements.
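As a rough sketch of what such preprocessing could look like, the snippet below applies replacement tuples and a blacklist to a single keyword. It is an assumption about the general technique, not the package's actual code: the function name `normalize_keyword`, the whole-word matching, the lowercasing, and the blacklist entries "remote" and "urgent" are all hypothetical.

```python
import re

# Illustrative values only; the real entries live in to_normalize.txt and blacklist.txt.
TO_NORMALIZE = [("back end", "backend"), ("front end", "frontend"),
                ("sr", "Senior"), ("jr", "junior")]
BLACKLIST = {"remote", "urgent"}  # hypothetical blacklisted words


def normalize_keyword(keyword, replacements=TO_NORMALIZE, blacklist=BLACKLIST):
    """Lowercase a keyword, apply the replacements, then drop blacklisted words."""
    text = keyword.lower()
    for source, target in replacements:
        # Match whole words only, so "sr" does not rewrite the inside of other words.
        text = re.sub(rf"\b{re.escape(source)}\b", target.lower(), text)
    return " ".join(word for word in text.split() if word not in blacklist)


print(normalize_keyword("Sr Back End Developer Remote"))  # prints: senior backend developer
```

Applying the replacements before the blacklist means a blacklisted word is removed whether it appears in its original or its normalized spelling.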

Dependencies

  • Scikit-learn
  • Pandas
  • Matplotlib
  • Seaborn
  • Numpy
  • NLTK
  • Tqdm

Make sure to download the NLTK English stopwords before first use:

import nltk
nltk.download("stopwords")

Contact

To get in touch, send me an email. You can find my contact information on my website.

Download files

Download the file for your platform.

Source Distribution

simple_keyword_clusterer-0.23.tar.gz (12.6 kB)

Uploaded Source

Built Distribution

simple_keyword_clusterer-0.23-py3-none-any.whl (6.6 kB)

Uploaded Python 3

File details

Details for the file simple_keyword_clusterer-0.23.tar.gz.

File metadata

  • Download URL: simple_keyword_clusterer-0.23.tar.gz
  • Upload date:
  • Size: 12.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/3.10.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.8.8

File hashes

Hashes for simple_keyword_clusterer-0.23.tar.gz
Algorithm    Hash digest
SHA256       8a5a15ded822db1e21a7e868cc3120e0e983fe58de63bb4d844e91ccfbf86b5c
MD5          b7a630b90c4c53adf6a038f27008c002
BLAKE2b-256  e98568713e7613a14d8d7d65b3924540637c8cea5c5edb77eee1871ce503baa8
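
A downloaded archive can be checked against the published SHA256 digest with Python's standard hashlib module. The following is a minimal sketch: the helper names `sha256_of` and `matches_expected` are made up for this example, and the local file path is whatever you saved the download as.

```python
import hashlib

# SHA256 digest listed above for simple_keyword_clusterer-0.23.tar.gz
EXPECTED_SHA256 = "8a5a15ded822db1e21a7e868cc3120e0e983fe58de63bb4d844e91ccfbf86b5c"


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """True if the file's SHA256 digest equals the expected hex string."""
    return sha256_of(path) == expected.lower()
```

For example, `matches_expected("simple_keyword_clusterer-0.23.tar.gz")` should return True for an intact download saved under that name in the current directory.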

File details

Details for the file simple_keyword_clusterer-0.23-py3-none-any.whl.

File metadata

  • Download URL: simple_keyword_clusterer-0.23-py3-none-any.whl
  • Upload date:
  • Size: 6.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/3.10.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.8.8

File hashes

Hashes for simple_keyword_clusterer-0.23-py3-none-any.whl
Algorithm    Hash digest
SHA256       ef64e03e86c1ae7cf39c8c08c00e05151fe505f8e11f17fb80531573a40e8b7c
MD5          fa4cb9357e3e6a68f8aead4daa5ed44a
BLAKE2b-256  6601e8e3de7e3722a1d0877a9cdc83cc64850c0c04fa987f0f71c1c70fba1450
