Gain a clue by clustering!

cluestar

This library contains visualisation tools that might help you get started with classification tasks. The idea is that if you can inspect clusters easily, you might gain a clue on what good labels for your dataset might be!

It generates charts that look like this:

Normal plot

There's even a fancy chart that can compare embedding techniques.

Comparing two embeddings

Install

python -m pip install cluestar

Interactive Demo

You can see an interactive demo of the generated widgets here.

You can also toy around with the demo notebook found here.

Usage

The first step is to encode text data in two dimensions, like below.

from sklearn.pipeline import make_pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

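# TF-IDF features reduced to two components, so each text gets an (x, y) position.
pipe = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2))

# `texts` is assumed to be a list of strings, one per document.
X = pipe.fit_transform(texts)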
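To make the encoding step concrete, here is a minimal, self-contained sketch of the same pipeline. The `texts` list is made-up illustration data and `random_state=0` is added only to make the run repeatable; neither comes from the library.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical example texts; replace with your own dataset.
texts = [
    "my voucher code does not work",
    "the plastic packaging arrived damaged",
    "when will you deliver my order",
    "can I get a refund for my voucher",
    "the delivery driver was very friendly",
]

# TF-IDF turns each text into a sparse vector, TruncatedSVD squeezes it to 2D.
pipe = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2, random_state=0))
X = pipe.fit_transform(texts)

# One row per text, two columns (the x and y coordinates for the chart).
print(X.shape)
```

The resulting array has one row per input text, which is the alignment the plotting calls below rely on.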

From here you can make an interactive chart via:

from cluestar import plot_text

plot_text(X, texts)

The best results are likely found when you use UMAP together with something like the Universal Sentence Encoder.

You might also make the chart easier to read by highlighting points whose text contains certain words.

plot_text(X, texts, color_words=["plastic", "voucher", "deliver"])
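As a rough illustration of what word-based highlighting amounts to, the sketch below assigns each text the first highlight word it contains. This is not cluestar's internal code: `label_by_word` is a hypothetical helper, and the case-insensitive substring matching it uses is an assumption made for the example.

```python
def label_by_word(texts, color_words):
    """Return, per text, the first word from color_words it contains, else "none"."""
    labels = []
    for text in texts:
        lowered = text.lower()
        # Substring match, so "deliver" also catches "delivery" and "delivered".
        match = next((w for w in color_words if w.lower() in lowered), "none")
        labels.append(match)
    return labels


# Hypothetical example texts.
texts = [
    "The plastic bag was torn",
    "Where is my voucher?",
    "Please deliver it tomorrow",
    "Delivery took too long",
    "Great service overall",
]

labels = label_by_word(texts, ["plastic", "voucher", "deliver"])
print(labels)
```

Under this assumption, a short stem such as "deliver" groups several word forms under one color, which may be why the example above uses it rather than a full word.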

You can also use a numeric array, such as predicted probabilities from a model, to influence the color.

# First, get an array of predicted probabilities from some model
# (column 0 holds the probability of the first class).
p_vals = some_model.predict_proba(texts)[:, 0]
# Use these to assign pretty colors.
plot_text(X, texts, color_array=p_vals)
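For readers without a model at hand, here is one way such a probability array could be produced with scikit-learn. The texts, the labels, and the choice of `LogisticRegression` are all assumptions made for illustration; any classifier exposing `predict_proba` should work the same way.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical texts and labels, purely for illustration.
texts = [
    "my voucher code does not work",
    "can I get a refund for my voucher",
    "the voucher expired too early",
    "when will you deliver my order",
    "the delivery driver was very friendly",
    "please deliver to the back door",
]
labels = ["voucher", "voucher", "voucher", "delivery", "delivery", "delivery"]

# A text classifier that accepts raw strings.
some_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
some_model.fit(texts, labels)

# predict_proba returns one column per class, ordered as in some_model.classes_.
# Column 0 is therefore the probability of the first class in that ordering.
p_vals = some_model.predict_proba(texts)[:, 0]
print(some_model.classes_[0], p_vals.round(2))
```

The array has one value per text, so it lines up row by row with the 2D coordinates.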

You can also compare two embeddings interactively. To do this:

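from cluestar import plot_text_comparison

# X1 and X2 are two 2D embeddings of the same texts (the same X is reused here for brevity).
plot_text_comparison(X1=X, X2=X, texts=texts)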
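A comparison is more informative when the two embeddings actually differ. The sketch below builds two 2D embeddings of the same texts, one from word-level TF-IDF and one from character n-grams. The texts and both pipeline configurations are assumptions chosen for illustration, not recommendations from the library.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical example texts.
texts = [
    "my voucher code does not work",
    "the plastic packaging arrived damaged",
    "when will you deliver my order",
    "can I get a refund for my voucher",
    "the delivery driver was very friendly",
]

# Embedding 1: word-level TF-IDF reduced to two dimensions.
word_pipe = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
)
X1 = word_pipe.fit_transform(texts)

# Embedding 2: character n-gram TF-IDF reduced to two dimensions.
char_pipe = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3)),
    TruncatedSVD(n_components=2, random_state=0),
)
X2 = char_pipe.fit_transform(texts)

# Both arrays must describe the same texts in the same order.
print(X1.shape, X2.shape)
```

Both arrays can then be handed to the comparison chart together with `texts`, since row i in each refers to the same document.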

File details

Details for the file cluestar-0.2.1.tar.gz.

File metadata

  • Download URL: cluestar-0.2.1.tar.gz
  • Upload date:
  • Size: 4.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.6

File hashes

Hashes for cluestar-0.2.1.tar.gz
Algorithm Hash digest
SHA256 d3b816d8a2b60c60a08737a9225129aaae6757983273cd8e9fb8645105d5c61e
MD5 d7dac9f17a000fff77d2b5047171e736
BLAKE2b-256 1acbc36629da325ed5773c48f4f1e34f2ecfe2db2e2b2d0cfa304b4a613604d7

File details

Details for the file cluestar-0.2.1-py2.py3-none-any.whl.

File metadata

  • Download URL: cluestar-0.2.1-py2.py3-none-any.whl
  • Upload date:
  • Size: 5.2 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.6

File hashes

Hashes for cluestar-0.2.1-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 fcf2cd43c3385130cbc71509947f545d9edbdef6a1811e4dbb007e888bcf7fae
MD5 22775d947653f509c1b8543942c07864
BLAKE2b-256 7d1d8cfcec80f1dbe1221dbaa61026b1b9991bb87125ca187f24e04a4f47ec65
