Gain a clue by clustering!

cluestar

This library contains visualisation tools that might help you get started with classification tasks. The idea is that if you can inspect clusters easily, you might gain a clue on what good labels for your dataset might be!

It generates charts that look like this:

Normal plot

There's even a fancy chart that can compare embedding techniques.

Comparing two embeddings

Install

python -m pip install cluestar

Interactive Demo

You can see an interactive demo of the generated widgets here.

You can also toy around with the demo notebook found here.

Usage

The first step is to encode your text data in two dimensions, as shown below.

from sklearn.pipeline import make_pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

pipe = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2))

# `texts` is assumed to be a list of strings, one per document.
X = pipe.fit_transform(texts)

From here you can make an interactive chart with:

from cluestar import plot_text

plot_text(X, texts)
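Since the chart pairs each row of X with one entry of texts, it can help to confirm that the two line up before plotting. The helper below is a minimal sketch; `check_plot_inputs` is a hypothetical name and is not part of cluestar.

```python
def check_plot_inputs(X, texts):
    """Raise ValueError unless X holds exactly one (x, y) pair per text.

    Hypothetical helper for illustration; not part of cluestar.
    Works with nested lists as well as 2D numpy arrays.
    """
    if len(X) != len(texts):
        raise ValueError(f"got {len(X)} rows in X but {len(texts)} texts")
    for i, row in enumerate(X):
        if len(row) != 2:
            raise ValueError(f"row {i} of X has {len(row)} values, expected 2")
    return True


# Two points, two texts: passes the check.
ok = check_plot_inputs([[0.1, 0.2], [0.3, 0.4]], ["first text", "second text"])
```

If the check fails, the usual cause is a reducer that was not set to two output components.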

The best results are likely found when you use UMAP together with something like the Universal Sentence Encoder.

You can also make the chart easier to read by highlighting points whose text contains certain words.

plot_text(X, texts, color_words=["plastic", "voucher", "deliver"])
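To get a feel for which points such a word list would pick out, you can compute the matches yourself before plotting. The sketch below assumes case-insensitive substring matching, which may differ from the rule cluestar applies internally; `contains_any` and the sample texts are hypothetical.

```python
def contains_any(text, words):
    """Return True if any word occurs in text (case-insensitive substring match).

    Hypothetical helper; the matching rule is an assumption, not cluestar's own.
    """
    lowered = text.lower()
    return any(word.lower() in lowered for word in words)


# Made-up sample texts for illustration.
texts = [
    "My voucher did not work",
    "The parcel was delivered late",
    "Great service",
]
color_words = ["plastic", "voucher", "deliver"]

# One boolean per text: True where at least one word matches.
highlighted = [contains_any(t, color_words) for t in texts]
```

Under substring matching, a stem such as "deliver" also catches "delivered" and "delivery", which is often what you want when exploring.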

You can also pass a numeric array, such as predicted probabilities from a model, to control the color.

# First, get an array of probabilities from some model.
# predict_proba (scikit-learn convention) returns one column per class.
p_vals = some_model.predict_proba(texts)[:, 0]
# Use these to assign pretty colors.
plot_text(X, texts, color_array=p_vals)
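As one way to obtain such an array, the sketch below fits a small scikit-learn text classifier and takes the first column of its predicted probabilities. The toy texts, the labels, and the choice of LogisticRegression are assumptions for illustration; any model that yields one number per text would do.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; replace with your own texts and labels.
texts = [
    "the voucher code did not work",
    "my voucher expired yesterday",
    "the parcel was delivered late",
    "delivery took two weeks",
]
labels = [0, 0, 1, 1]

# A pipeline lets the model accept raw strings directly.
some_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
some_model.fit(texts, labels)

# predict_proba returns one column per class, ordered as in some_model.classes_;
# column 0 is the probability of the first class.
p_vals = some_model.predict_proba(texts)[:, 0]

# p_vals can then be passed on, e.g. plot_text(X, texts, color_array=p_vals)
```

Note that `predict` would return a one-dimensional array of labels, so indexing it with `[:, 0]` would fail; `predict_proba` is the call that gives a column per class.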

You can also compare two embeddings interactively. To do this:

from cluestar import plot_text_comparison

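# X1 and X2 are the two embeddings to compare; the same X is passed twice
# here only to keep the example short.
plot_text_comparison(X, X, texts)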
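A comparison is more informative when the two embeddings actually differ. The sketch below builds two 2D embeddings of the same texts with scikit-learn, one from word-level TF-IDF and one from character n-grams. The sample texts and both pipeline configurations are assumptions chosen for illustration.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Made-up sample texts; replace with your own.
texts = [
    "the voucher code did not work",
    "my voucher expired yesterday",
    "the parcel was delivered late",
    "delivery took two weeks",
    "the plastic packaging was damaged",
    "too much plastic in the box",
]

# Embedding 1: word-level TF-IDF reduced to two dimensions.
word_pipe = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
)

# Embedding 2: character n-gram TF-IDF reduced to two dimensions.
char_pipe = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    TruncatedSVD(n_components=2, random_state=0),
)

X1 = word_pipe.fit_transform(texts)
X2 = char_pipe.fit_transform(texts)

# Both arrays have one row per text and can then be compared, e.g.
# plot_text_comparison(X1, X2, texts)
```

Both embeddings must be computed on the same texts in the same order, so that row i in X1 and row i in X2 refer to the same document.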
