Semantic text clustering using sentence embeddings and agglomerative clustering

Project description

semaclust

semaclust (semantic + clustering) is a lightweight Python package for semantic text clustering using sentence embeddings and agglomerative clustering.

Features

  • SentenceTransformer-based text encoding
  • Agglomerative clustering with configurable thresholds
  • Easily map or replace similar text values
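
To make the threshold idea concrete, here is a minimal pure-Python sketch of threshold-based agglomerative clustering. It is not semaclust's implementation: the 2-D vectors are made-up stand-ins for real sentence embeddings, and the function names, the average-linkage rule, and the `threshold=0.3` value are all assumptions chosen for illustration.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def agglomerative_clusters(vectors, threshold):
    """Average-linkage agglomerative clustering (hypothetical helper).

    Starts with one cluster per vector and repeatedly merges the two
    closest clusters, stopping once the closest pair is farther apart
    than `threshold`.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        # Mean pairwise distance between members of the two clusters.
        dists = [cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > threshold:
            break  # nothing left that is similar enough to merge
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy vectors standing in for sentence embeddings (values are invented).
texts = ["New York", "new york city", "Los Angeles", "LA"]
toy_embeddings = [
    [1.0, 0.05],
    [0.95, 0.1],
    [0.1, 1.0],
    [0.05, 0.9],
]

groups = agglomerative_clusters(toy_embeddings, threshold=0.3)
grouped_texts = [[texts[i] for i in group] for group in groups]
print(grouped_texts)
```

A lower threshold leaves more, tighter clusters; a higher one merges more aggressively. In practice the vectors would come from a sentence-embedding model rather than being written by hand.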

Installation

pip install git+https://github.com/cobanov/semaclust.git

Usage

from semaclust import TextClusterer  # import path assumed; the original snippet omits it

# Create a clusterer with default settings
clusterer = TextClusterer()

texts = ["New York", "Los Angeles", "San Francisco", "new york city", "LA", "San Fran"]

# Get clusters: a dict mapping each cluster ID to the texts in that cluster
clusters = clusterer.cluster(texts)
print("Clusters:", clusters)
# Clusters: {1: ['New York', 'new york city'], 2: ['Los Angeles', 'LA'], 0: ['San Francisco', 'San Fran']}

# Get replacement map: each text mapped to its cluster's representative value
replacement_map = clusterer.get_replacement_map(texts)
print("\nReplacement map:", replacement_map)
# Replacement map: {'New York': 'New York', 'new york city': 'New York', 'Los Angeles': 'Los Angeles', 'LA': 'Los Angeles', 'San Francisco': 'San Francisco', 'San Fran': 'San Francisco'}

# Replace values: the input list with every text swapped for its representative
replaced_texts = clusterer.replace_values(texts)
print("\nReplaced texts:", replaced_texts)
# Replaced texts: ['New York', 'Los Angeles', 'San Francisco', 'New York', 'Los Angeles', 'San Francisco']
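
The relationship between the three outputs above can be sketched in plain Python. This is an illustration only, not semaclust's code: `build_replacement_map` is a hypothetical helper, and picking the first member of each cluster as the representative is an assumption that happens to match the sample output.

```python
def build_replacement_map(clusters):
    """Map every text to a representative of its cluster (hypothetical helper).

    Assumption: the first member of each cluster is the representative.
    The library may choose its representative differently.
    """
    replacement_map = {}
    for members in clusters.values():
        canonical = members[0]
        for text in members:
            replacement_map[text] = canonical
    return replacement_map


# Cluster dict in the same shape as the sample output above.
clusters = {
    1: ["New York", "new york city"],
    2: ["Los Angeles", "LA"],
    0: ["San Francisco", "San Fran"],
}
texts = ["New York", "Los Angeles", "San Francisco", "new york city", "LA", "San Fran"]

replacement_map = build_replacement_map(clusters)

# Texts missing from the map are left unchanged.
replaced_texts = [replacement_map.get(text, text) for text in texts]
print(replaced_texts)
```

The same map can be reused on other data containing these values, for example to normalize a column of city names.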

Download files

Source Distribution

semaclust-0.2.0.tar.gz (4.6 kB view details)

Uploaded Source

Built Distribution

semaclust-0.2.0-py3-none-any.whl (4.6 kB view details)

Uploaded Python 3

File details

Details for the file semaclust-0.2.0.tar.gz.

File metadata

  • Download URL: semaclust-0.2.0.tar.gz
  • Upload date:
  • Size: 4.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.7

File hashes

Hashes for semaclust-0.2.0.tar.gz
Algorithm Hash digest
SHA256 cc37093c35a36e1fe82c81381380d564c1773cec26c2862373c86c0efa717342
MD5 7cdd73576525e6736bcca7e89a4db011
BLAKE2b-256 aecc6ceddb7d9e246d3120d8c7231b923d4e4e8db8c32c1fc13de6c5c626628b

File details

Details for the file semaclust-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: semaclust-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 4.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.7

File hashes

Hashes for semaclust-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 3653334409f4df8baa9c9ae4c49819350529cc0d84443167a179a0328dad2511
MD5 5d45ddfe7dbea054733b70dfd6715f79
BLAKE2b-256 2b8318e406c630e13f516319d00a9a914c8668fe6ce9098968e95a68bfed2377
