Semantic text clustering using sentence embeddings and agglomerative clustering.
semaclust
semaclust (semantic + clustering) is a lightweight Python package for semantic text clustering using sentence embeddings and agglomerative clustering.
Features
- SentenceTransformer-based text encoding
- Agglomerative clustering with configurable thresholds
- Easily map or replace similar text values
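To make the "configurable thresholds" feature concrete, here is a minimal sketch of threshold-based agglomerative clustering. It is not semaclust's implementation: the small hand-written vectors stand in for real sentence embeddings, and the names `cosine_distance`, `agglomerative_clusters`, and `distance_threshold` as used here, along with the choice of average linkage, are assumptions made for illustration.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def agglomerative_clusters(vectors, distance_threshold):
    """Average-linkage agglomerative clustering (illustrative sketch).

    Starts with one cluster per vector and repeatedly merges the two
    closest clusters. Stops once the closest pair is at or above
    distance_threshold. Returns a list of clusters as lists of indices.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        # Average pairwise cosine distance between two clusters.
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        dist, x, y = min(
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if dist >= distance_threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Toy stand-ins for embeddings (hypothetical values, not model output).
toy_vectors = [
    [1.0, 0.1, 0.0],  # "New York"
    [0.0, 1.0, 0.1],  # "Los Angeles"
    [0.9, 0.2, 0.0],  # "new york city"
    [0.1, 0.9, 0.0],  # "LA"
]
groups = agglomerative_clusters(toy_vectors, distance_threshold=0.1)
print(groups)  # [[0, 2], [1, 3]]
```

A lower threshold yields more, tighter clusters; a higher one merges more aggressively. With a threshold of 0.0 every text stays in its own cluster.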
Installation
pip install git+https://github.com/cobanov/semaclust.git
Usage
from semaclust import TextClusterer  # assumed top-level import path

# Create clusterer
clusterer = TextClusterer()
texts = ["New York", "Los Angeles", "San Francisco", "new york city", "LA", "San Fran"]
# Get clusters
clusters = clusterer.cluster(texts)
print("Clusters:", clusters)
# Clusters: {1: ['New York', 'new york city'], 2: ['Los Angeles', 'LA'], 0: ['San Francisco', 'San Fran']}
# Get replacement map
replacement_map = clusterer.get_replacement_map(texts)
print("\nReplacement map:", replacement_map)
# Replacement map: {'New York': 'New York', 'new york city': 'New York', 'Los Angeles': 'Los Angeles', 'LA': 'Los Angeles', 'San Francisco': 'San Francisco', 'San Fran': 'San Francisco'}
# Replace values
replaced_texts = clusterer.replace_values(texts)
print("\nReplaced texts:", replaced_texts)
# Replaced texts: ['New York', 'Los Angeles', 'San Francisco', 'New York', 'Los Angeles', 'San Francisco']
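The sample output above suggests that each cluster's first member serves as the canonical value for the whole cluster. The following sketch reproduces that mapping from a cluster dictionary. It is inferred from the printed output only, not from the package source, and the standalone helpers `build_replacement_map` and `replace_values` are hypothetical names for illustration.

```python
def build_replacement_map(clusters):
    """Map every text to the first member of its cluster.

    Assumption: the first member is the canonical label, as the
    sample output appears to show.
    """
    replacement_map = {}
    for members in clusters.values():
        canonical = members[0]
        for text in members:
            replacement_map[text] = canonical
    return replacement_map


def replace_values(texts, replacement_map):
    """Replace each text with its canonical value; unknown texts pass through."""
    return [replacement_map.get(text, text) for text in texts]


clusters = {
    1: ["New York", "new york city"],
    2: ["Los Angeles", "LA"],
    0: ["San Francisco", "San Fran"],
}
texts = ["New York", "Los Angeles", "San Francisco", "new york city", "LA", "San Fran"]

mapping = build_replacement_map(clusters)
print(replace_values(texts, mapping))
# ['New York', 'Los Angeles', 'San Francisco', 'New York', 'Los Angeles', 'San Francisco']
```

Keeping the map as a plain dictionary means it can also be applied elsewhere, for example to a column of values, without re-running the clustering.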