Semantic text clustering using sentence embeddings and agglomerative clustering
semaclust
semaclust (semantic + clustering) is a lightweight Python package for semantic text clustering using sentence embeddings and agglomerative clustering.
Features
- SentenceTransformer-based text encoding
- Agglomerative clustering with configurable thresholds
- Helpers to map or replace similar text values with a single representative per cluster
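To make the "configurable thresholds" idea concrete, here is a minimal pure-Python sketch of threshold-based agglomerative clustering. It is not semaclust's implementation: the hand-made 2-D `toy_vectors` stand in for real SentenceTransformer embeddings, and the function names, the average-linkage rule, the cosine distance, and the 0.3 threshold are all assumptions chosen for illustration.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def agglomerative_labels(vectors, distance_threshold):
    """Repeatedly merge the two closest clusters (average linkage) until
    the closest pair is at or above distance_threshold."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        # Average pairwise distance between the members of two clusters.
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        if linkage(clusters[i], clusters[j]) >= distance_threshold:
            break  # nothing left that is close enough to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe: combinations guarantees i < j

    labels = [0] * len(vectors)
    for label, members in enumerate(clusters):
        for index in members:
            labels[index] = label
    return labels


texts = ["New York", "Los Angeles", "new york city", "LA"]
# Hypothetical 2-D "embeddings"; a real encoder returns much longer vectors.
toy_vectors = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2], [0.2, 0.9]]

labels = agglomerative_labels(toy_vectors, distance_threshold=0.3)

groups = {}
for text, label in zip(texts, labels):
    groups.setdefault(label, []).append(text)
print(groups)
```

A lower threshold keeps more texts in separate clusters, while a higher one merges more aggressively; the right value depends on the embedding model and the data.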
Installation
pip install git+https://github.com/cobanov/semaclust.git
Usage
from semaclust import TextClusterer  # assumed import path: class exported at the package top level

# Create clusterer
clusterer = TextClusterer()
texts = ["New York", "Los Angeles", "San Francisco", "new york city", "LA", "San Fran"]
# Get clusters
clusters = clusterer.cluster(texts)
print("Clusters:", clusters)
# Clusters: {1: ['New York', 'new york city'], 2: ['Los Angeles', 'LA'], 0: ['San Francisco', 'San Fran']}
# Get replacement map
replacement_map = clusterer.get_replacement_map(texts)
print("\nReplacement map:", replacement_map)
# Replacement map: {'New York': 'New York', 'new york city': 'New York', 'Los Angeles': 'Los Angeles', 'LA': 'Los Angeles', 'San Francisco': 'San Francisco', 'San Fran': 'San Francisco'}
# Replace values
replaced_texts = clusterer.replace_values(texts)
print("\nReplaced texts:", replaced_texts)
# Replaced texts: ['New York', 'Los Angeles', 'San Francisco', 'New York', 'Los Angeles', 'San Francisco']
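The replacement map in the output above is consistent with a simple rule: every member of a cluster maps to that cluster's first member. The sketch below reproduces the outputs under that assumption. `build_replacement_map` is a hypothetical helper, not part of semaclust's API, and the package may choose its representative differently.

```python
def build_replacement_map(clusters):
    """Map each text to the first member of its cluster (assumed rule)."""
    return {
        text: members[0]
        for members in clusters.values()
        for text in members
    }


# Cluster dictionary copied from the usage output above.
clusters = {
    1: ["New York", "new york city"],
    2: ["Los Angeles", "LA"],
    0: ["San Francisco", "San Fran"],
}
texts = ["New York", "Los Angeles", "San Francisco", "new york city", "LA", "San Fran"]

replacement_map = build_replacement_map(clusters)
# Texts missing from the map are left unchanged.
replaced_texts = [replacement_map.get(text, text) for text in texts]

print(replacement_map)
print(replaced_texts)
```

Because the map is a plain dictionary, the same lookup can be applied to any iterable of strings, such as a column of free-text values that needs normalizing.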
Download files
Source Distribution
semaclust-0.2.0.tar.gz (4.6 kB)
Built Distribution
semaclust-0.2.0-py3-none-any.whl (4.6 kB)
File details
Details for the file semaclust-0.2.0.tar.gz.
File metadata
- Download URL: semaclust-0.2.0.tar.gz
- Upload date:
- Size: 4.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cc37093c35a36e1fe82c81381380d564c1773cec26c2862373c86c0efa717342 |
| MD5 | 7cdd73576525e6736bcca7e89a4db011 |
| BLAKE2b-256 | aecc6ceddb7d9e246d3120d8c7231b923d4e4e8db8c32c1fc13de6c5c626628b |
File details
Details for the file semaclust-0.2.0-py3-none-any.whl.
File metadata
- Download URL: semaclust-0.2.0-py3-none-any.whl
- Upload date:
- Size: 4.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3653334409f4df8baa9c9ae4c49819350529cc0d84443167a179a0328dad2511 |
| MD5 | 5d45ddfe7dbea054733b70dfd6715f79 |
| BLAKE2b-256 | 2b8318e406c630e13f516319d00a9a914c8668fe6ce9098968e95a68bfed2377 |