ACLOSE: Automatic Clustering and Labeling Over Semantic Embeddings
ATMOSE: Automatic Topic Modeling Over Semantic Embeddings
What it does
This package is a tool for quick exploratory data analysis (EDA) of the topics that emerge from your semantic embeddings.
Problem
- I have all these embedding vectors. What are the general topics that emerge from them?
Solution
- ATMOSE will cluster your embeddings and then label the clusters using an LLM.
- Instead of throwing a random sample of embeddings from each cluster at an LLM, ATMOSE uses stratified sampling and refinement to ensure that the topic labels balance generalization and specificity.
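The stratified sampling idea can be sketched as follows. This is a minimal illustration, not ATMOSE's implementation: the source does not say what the strata are, so the sketch assumes they are bands of a per-point membership score, and `stratified_sample` and its parameters are hypothetical names.

```python
import random

def stratified_sample(items, scores, n_strata=3, per_stratum=2, seed=0):
    """Draw items from every score band of one cluster.

    Assumption: strata are equal-sized bands of a membership score, so the
    sample mixes fringe members (low score) with core members (high score).
    """
    if len(items) != len(scores):
        raise ValueError("items and scores must have the same length")
    rng = random.Random(seed)
    # Indices ordered from lowest to highest score.
    order = sorted(range(len(items)), key=lambda i: scores[i])
    n_strata = max(1, min(n_strata, len(order)))
    sample = []
    for s in range(n_strata):
        lo = s * len(order) // n_strata
        hi = (s + 1) * len(order) // n_strata
        stratum = order[lo:hi]
        k = min(per_stratum, len(stratum))
        # Uniform sampling without replacement inside the band.
        sample.extend(items[i] for i in rng.sample(stratum, k))
    return sample

texts = [f"doc {i}" for i in range(9)]
membership = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
picked = stratified_sample(texts, membership, n_strata=3, per_stratum=2)
print(picked)
```

A purely random sample could land entirely in the dense core of a cluster and yield an overly narrow label. Sampling every band is one way to expose the LLM to the cluster's full spread.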
Algorithms (more coming soon)
- UMAP
- HDBSCAN
- TOPSIS
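TOPSIS is the least widely known of the three, so a small sketch may help. It ranks alternatives by their closeness to an ideal point across several criteria. The source does not say how ATMOSE applies it; the example below assumes, purely for illustration, that candidate clusterings are compared on a silhouette score (higher is better) and a noise fraction (lower is better). The function name and the numbers are hypothetical.

```python
import math

def topsis(matrix, weights, benefit):
    """Return one closeness score in [0, 1] per row (alternative).

    matrix  : rows are alternatives, columns are criteria
    weights : relative importance of each criterion
    benefit : True if higher is better for that criterion, False if lower is
    """
    n_cols = len(weights)
    # Vector-normalize each column, then apply the normalized weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0
             for j in range(n_cols)]
    total_w = sum(weights)
    v = [[weights[j] / total_w * row[j] / norms[j] for j in range(n_cols)]
         for row in matrix]
    # Ideal best and ideal worst value per criterion.
    col = lambda j: [r[j] for r in v]
    best = [max(col(j)) if benefit[j] else min(col(j)) for j in range(n_cols)]
    worst = [min(col(j)) if benefit[j] else max(col(j)) for j in range(n_cols)]
    scores = []
    for r in v:
        d_best = math.dist(r, best)
        d_worst = math.dist(r, worst)
        denom = d_best + d_worst
        # If all alternatives are identical, every distance is zero.
        scores.append(d_worst / denom if denom else 0.5)
    return scores

# Hypothetical candidates: [silhouette, noise fraction]
candidates = [[0.60, 0.10], [0.40, 0.30], [0.55, 0.05]]
scores = topsis(candidates, weights=[0.5, 0.5], benefit=[True, False])
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(scores, ranking)
```

The second candidate is worst on both criteria, so it coincides with the ideal-worst point and receives a score of 0.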
LLM agnostic (coming soon)
- All LLMs are supported via LiteLLM
Experiment tracking (coming soon)
- MLflow
- Serialization and versioning of the dimensionality-reduction and clustering models
- Helicone tracking (optional)
C++ compiler required
Before installing ATMOSE, ensure you have:
- Windows: Microsoft Visual C++ Build Tools
- Linux: GCC/G++ compiler (sudo apt-get install build-essential on Ubuntu)
- macOS: Xcode Command Line Tools (xcode-select --install)
Tip for building in Docker
Add this to your Dockerfile:
# Install the C/C++ toolchain and system libraries, then clear the apt cache
RUN apt-get update && apt-get install -y \
    curl \
    build-essential \
    gcc \
    g++ \
    libpq-dev \
    libx11-dev \
    libxrandr-dev \
    libxext-dev \
    libxi-dev \
    libgl1-mesa-dev \
    && rm -rf /var/lib/apt/lists/*
# Install Poetry (the installer reads POETRY_VERSION) and put it on PATH
ENV POETRY_VERSION=1.8.2
RUN curl -sSL https://install.python-poetry.org | python3 -
ENV PATH="/root/.local/bin:$PATH"
Notebook demo
Quickstart
Number of LLM calls
- 2 LLM calls per cluster
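For budgeting, the number of calls can be estimated from the cluster assignments. This is a rough sketch: `estimate_llm_calls` is a hypothetical helper, and it assumes that HDBSCAN noise points (label -1) are not sent to the LLM, which the source does not confirm.

```python
def estimate_llm_calls(cluster_ids, calls_per_cluster=2, noise_label=-1):
    """Estimate LLM calls as calls_per_cluster times the number of clusters.

    Assumption: points labeled noise_label (HDBSCAN marks noise as -1)
    do not form a cluster that gets labeled.
    """
    clusters = {c for c in cluster_ids if c != noise_label}
    return calls_per_cluster * len(clusters)

cluster_ids = [0, 0, 1, 2, 2, -1, 1]  # three clusters plus one noise point
n_calls = estimate_llm_calls(cluster_ids)
print(n_calls)  # 6
```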
Instructions for use
Assume df has columns:
- content_str
- embedding_vector
After calling .label(df, data_description), df gains these additional columns:
- cluster_id
- topic_label
- membership_score
- outlier_score
- silhouette_score
- reduced_vector
File details
Details for the file aclose-0.0.1.tar.gz.
File metadata
- Download URL: aclose-0.0.1.tar.gz
- Upload date:
- Size: 34.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.2 CPython/3.12.2 Windows/10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fe60383a99afe61006356cf7b6d9f0f39b0d9d42b1c9b175721684ab089b2bf5 |
| MD5 | d681e0274df5dbf1b02c73dbb129d849 |
| BLAKE2b-256 | 0ad4366d39ad89dddfe6f64985b4125f7cccc30708280d98e3e74b1142264a15 |
File details
Details for the file aclose-0.0.1-py3-none-any.whl.
File metadata
- Download URL: aclose-0.0.1-py3-none-any.whl
- Upload date:
- Size: 35.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.8.2 CPython/3.12.2 Windows/10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | de8893462891db97dc1dc2ce399065abc740b5e3176aa3bc6faa684713e946c3 |
| MD5 | b5c6a666ef07efa9c1f5a32dcbb85a7e |
| BLAKE2b-256 | 3cbb0465941b98fc341a4c650f4ea14536aec1fdbcb9cd4d710b41faae9b6d4c |