Clustexts

Performs k-means clustering on a collection of texts. The value of k is selected automatically with the elbow method, so the algorithm only needs the minimum and maximum values of k to explore (2 and 20 by default).

Texts are encoded using a TF-IDF bag-of-words representation. Optionally, truncated singular value decomposition (SVD) can be applied to reduce the dimensionality of the resulting matrix, projecting the documents onto a lower-dimensional embedded space that is more compact and may generalize better.
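To make the TF-IDF encoding concrete, here is a minimal standard-library sketch of the weighting. It is not the package's code: the tfidf helper is hypothetical, and it assumes the smoothed IDF formula and L2 normalization that scikit-learn's TfidfVectorizer applies by default.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # words of two or more characters


def tfidf(texts):
    """Return (vocabulary, rows); each row is an L2-normalized TF-IDF vector.

    Hypothetical helper for illustration only.
    """
    docs = [Counter(TOKEN.findall(t.lower())) for t in texts]
    vocab = sorted(set().union(*docs))
    n = len(docs)
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {
        term: math.log((1 + n) / (1 + sum(term in d for d in docs))) + 1
        for term in vocab
    }
    rows = []
    for d in docs:
        weights = [d[term] * idf[term] for term in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows


vocab, rows = tfidf(["one text", "another text", "this sentence"])
```

In the first row, the term "one" (present in a single document) receives a larger weight than "text" (present in two), which is the effect the IDF factor is meant to produce.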

Calling a Clustexts instance on a collection of texts returns an iterable containing the cluster identifier associated with each input document.

Dependencies

Ensure you have the following packages installed:

matplotlib==3.10.1
numpy==2.2.3
pandas==2.2.3
scikit-learn==1.6.1
scipy==1.15.2
seaborn==0.13.2
tqdm==4.67.1

Usage

Example of usage:

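import pandas as pd

# Assumption: the class is importable from the package's top level.
from clustexts import Clustexts

rows = [
  'one text',
  'another text',
  'this sentence',
  'fourth sentence',
  'fifth sentence',
]
df = pd.DataFrame(rows, columns=['text'])

# Explore k from 2 to 10, with custom TF-IDF settings.
cls = Clustexts(
  reducer={},
  range=(2, 10),
  min_gain=0.001,
  vectorizer={'min_df': 0.0}
)
# One cluster identifier per input text.
df['cluster'] = cls(df['text'])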

Parameters

Clustering

  • range: Tuple[int, int] = (2, 20): The minimum and maximum values of k to explore when applying the elbow method.
  • min_size: int = 0: The minimum acceptable cluster size. If this size is reached, the clustering stops.
  • min_gain: float = 0.03: The minimum relative improvement required for the clustering to continue, expressed as a fraction of the inertia (0.03 means 3%).
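The interplay of these three parameters can be sketched as a stopping rule over precomputed k-means runs. This is only an illustration under stated assumptions: pick_k and the runs mapping are hypothetical names, the upper bound of the range is treated as inclusive, and "reached" is read as "some cluster is smaller than min_size". The package may apply the conditions differently.

```python
from collections import Counter


def pick_k(runs, k_range=(2, 20), min_size=0, min_gain=0.03):
    """Choose k from runs = {k: (inertia, labels)}.

    Hypothetical helper, not the library's code. Returns (k, reason),
    where k is the last accepted value (None if no k was accepted).
    """
    k_min, k_max = k_range
    chosen, prev = None, None
    for k in range(k_min, k_max + 1):
        inertia, labels = runs[k]
        # Stop when the smallest cluster is below the accepted size.
        if min(Counter(labels).values()) < min_size:
            return chosen, "min_size"
        if prev is not None:
            # Relative inertia improvement over the previous k.
            gain = (prev - inertia) / prev if prev > 0 else 0.0
            if gain < min_gain:
                return chosen, "min_gain"
        chosen, prev = k, inertia
    return chosen, "range exhausted"


# Made-up inertia values and labels for six documents.
runs = {
    2: (100.0, [0, 0, 0, 1, 1, 1]),
    3: (60.0, [0, 0, 1, 1, 2, 2]),
    4: (59.0, [0, 0, 1, 1, 2, 3]),
}
result = pick_k(runs, k_range=(2, 4))
```

With these numbers, moving from k=3 to k=4 improves the inertia by less than 3%, so the sketch settles on k=3.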

Vectorization (required) & Dimensionality reduction (optional)

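Refer to the scikit-learn documentation for the TfidfVectorizer and TruncatedSVD classes. As in the usage example, their settings are passed as dictionaries through the vectorizer and reducer parameters.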
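One plausible way for such settings to reach scikit-learn is to unpack each dictionary as keyword arguments. The sketch below assumes exactly that, and also assumes that an empty reducer skips the SVD step; neither behavior is confirmed by the package, and the encode function here is a hypothetical stand-in.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer


def encode(texts, vectorizer=None, reducer=None):
    """Hypothetical stand-in for the encoding step: TF-IDF, then optional SVD."""
    X = TfidfVectorizer(**(vectorizer or {})).fit_transform(texts)
    if reducer:  # assumption: None or an empty dict skips the reduction
        X = TruncatedSVD(**reducer).fit_transform(X)
    return X


texts = [
    "one text",
    "another text",
    "this sentence",
    "fourth sentence",
    "fifth sentence",
]
sparse = encode(texts, vectorizer={"min_df": 1})
dense = encode(
    texts,
    vectorizer={"min_df": 1},
    reducer={"n_components": 2, "random_state": 0},
)
```

Here sparse is a TF-IDF matrix with one column per vocabulary term, while dense has one column per SVD component.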

Reporting (optional)

  • plot_density: bool = False: If set to True, the algorithm will plot cluster densities (number of documents in each cluster).
  • plot_k: bool = False: If set to True, the algorithm will plot the inertia trendline for every k that has been explored.
  • show_examples: bool = False: If set to True, the algorithm will display three examples from each output cluster after the elbow has been found.
  • verbose: bool = False: If set to True, the algorithm will print a message on the terminal specifying the clustering termination condition.
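The figures behind the density and example reports can be approximated from the cluster identifiers alone. The following is a rough sketch with a hypothetical summarize helper; it does no plotting and takes the first few texts of each cluster as examples, which may differ from how the package selects them.

```python
from collections import Counter, defaultdict


def summarize(texts, labels, n_examples=3):
    """Return (sizes, examples) per cluster. Hypothetical helper."""
    sizes = Counter(labels)  # number of documents in each cluster
    examples = defaultdict(list)
    for text, label in zip(texts, labels):
        # Keep only the first n_examples texts seen for each cluster.
        if len(examples[label]) < n_examples:
            examples[label].append(text)
    return dict(sizes), dict(examples)


sizes, examples = summarize(["a", "b", "c", "d", "e"], [0, 0, 1, 0, 0])
```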

Methods

  • encode(X: Iterable[str]) -> np.ndarray: Transforms the input texts X into a numerical matrix using the TF-IDF vectorizer, and optionally applies SVD dimensionality reduction.
  • __call__(self, X: Iterable[str]) -> Iterable[int]: Fits the model on the input texts X and returns the cluster identifier of each one.
