Semantic text clustering using sentence embeddings and agglomerative clustering.

Project description

semaclust

semaclust (semantic + clustering) is a lightweight Python package for semantic text clustering using sentence embeddings and agglomerative clustering.

Features

  • SentenceTransformer-based text encoding
  • Agglomerative clustering with configurable thresholds
  • Easily map or replace similar text values
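To make the "configurable thresholds" idea concrete, here is a minimal, standard-library-only sketch of threshold-based agglomerative clustering with average linkage over cosine distance. It is an illustration under stated assumptions, not semaclust's implementation: the hand-written `toy_vectors` stand in for real sentence embeddings, and the names `cosine_distance`, `agglomerative_clusters`, and `distance_threshold` are hypothetical choices made for this example.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def agglomerative_clusters(vectors, distance_threshold):
    """Merge the closest pair of clusters until no pair is within the threshold.

    Returns a list of clusters, each a list of indices into `vectors`.
    """
    clusters = [[i] for i in range(len(vectors))]

    def average_linkage(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > distance_threshold:
            break  # the closest remaining pair is too far apart to merge
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Toy 2-D vectors standing in for sentence embeddings (illustrative values only).
texts = ["New York", "new york city", "Los Angeles", "LA"]
toy_vectors = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]

groups = agglomerative_clusters(toy_vectors, distance_threshold=0.1)
labelled = [[texts[i] for i in group] for group in groups]
print(labelled)
```

A lower threshold keeps more items separate, while a higher one merges more aggressively; with a threshold large enough, everything collapses into a single cluster.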

Installation

pip install git+https://github.com/cobanov/semaclust.git

Usage

# Import path assumed from the package name; adjust if the module layout differs
from semaclust import TextClusterer

# Create clusterer
clusterer = TextClusterer()

texts = ["New York", "Los Angeles", "San Francisco", "new york city", "LA", "San Fran"]

# Get clusters (a dict mapping cluster label to its member texts)
clusters = clusterer.cluster(texts)
print("Clusters:", clusters)
# Clusters: {1: ['New York', 'new york city'], 2: ['Los Angeles', 'LA'], 0: ['San Francisco', 'San Fran']}

# Get replacement map (each text mapped to a representative of its cluster)
replacement_map = clusterer.get_replacement_map(texts)
print("\nReplacement map:", replacement_map)
# Replacement map: {'New York': 'New York', 'new york city': 'New York', 'Los Angeles': 'Los Angeles', 'LA': 'Los Angeles', 'San Francisco': 'San Francisco', 'San Fran': 'San Francisco'}

# Replace values (same order as the input, with each text swapped for its representative)
replaced_texts = clusterer.replace_values(texts)
print("\nReplaced texts:", replaced_texts)
# Replaced texts: ['New York', 'Los Angeles', 'San Francisco', 'New York', 'Los Angeles', 'San Francisco']
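
The replacement map in the output above is consistent with using the first member of each cluster as its representative. The sketch below shows how such a map could be derived from a cluster dictionary and applied to a list. Whether semaclust picks representatives this way is an assumption, and `build_replacement_map` and `apply_replacements` are hypothetical helper names, not part of the package.

```python
def build_replacement_map(clusters):
    """Map every text to the first member of its cluster (assumed rule)."""
    replacement_map = {}
    for members in clusters.values():
        canonical = members[0]
        for text in members:
            replacement_map[text] = canonical
    return replacement_map


def apply_replacements(texts, replacement_map):
    """Replace each text with its representative, leaving unknown texts as-is."""
    return [replacement_map.get(text, text) for text in texts]


# Cluster dictionary copied from the usage output above.
clusters = {
    1: ["New York", "new york city"],
    2: ["Los Angeles", "LA"],
    0: ["San Francisco", "San Fran"],
}
texts = ["New York", "Los Angeles", "San Francisco", "new york city", "LA", "San Fran"]

mapping = build_replacement_map(clusters)
replaced = apply_replacements(texts, mapping)
print(replaced)
```

Picking the first member is only one possible rule; the most frequent or the longest member would be equally reasonable choices for a representative.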

Download files

Source Distribution

semaclust-0.1.1.tar.gz (4.3 kB view details)

Uploaded Source

Built Distribution

semaclust-0.1.1-py3-none-any.whl (4.6 kB view details)

Uploaded Python 3

File details

Details for the file semaclust-0.1.1.tar.gz.

File metadata

  • Download URL: semaclust-0.1.1.tar.gz
  • Size: 4.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for semaclust-0.1.1.tar.gz
Algorithm Hash digest
SHA256 c269d21235ef27fa6e07edf6cde777039107a19dbf65d2b98a360f86723e3407
MD5 01d7848b426b14bbd425b0dcb53f307a
BLAKE2b-256 f5856d18221392efbb41256f44398bffc2623e051ff214532ef1b11927920272

File details

Details for the file semaclust-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: semaclust-0.1.1-py3-none-any.whl
  • Size: 4.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for semaclust-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 00c33e519f09f2ae3c35e9d6c29be40d529cce2a67e7e5e9f6044629fa176664
MD5 a1ccdfd24a118f767513c2a6729b0893
BLAKE2b-256 7afe5fd7ba4a7eded137ba87c636577beba696529db4e4aa7f3c873e9ae3d956
