
ACLOSE: Automatic Clustering and Labeling Over Semantic Embeddings


ATMOSE

ATMOSE: Automatic Topic Modeling Over Semantic Embeddings

What it does

This package is a tool for quick exploratory data analysis (EDA) of the topics that emerge from your semantic embeddings.

Problem

  • I have all these embedding vectors. What are the general topics that emerge from them?

Solution

  • ATMOSE will cluster your embeddings and then label the clusters using an LLM.
  • Instead of throwing a random sample of embeddings from each cluster at an LLM, ATMOSE uses stratified sampling and refinement to ensure that the topic labels balance generalization and specificity.
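The idea of stratified sampling within a cluster can be sketched as follows. This is a minimal illustration, not ATMOSE's actual implementation: the source does not say which variable defines the strata, so the sketch assumes strata are equal-size rank bins over a per-document score such as membership_score. The function name stratified_sample and its parameters are hypothetical.

```python
import random


def stratified_sample(items, scores, n_strata=3, per_stratum=2, seed=0):
    """Sample items evenly across score strata (equal-size rank bins).

    Hypothetical helper: the strata definition is an assumption, not
    something the package documentation specifies.
    """
    if len(items) != len(scores):
        raise ValueError("items and scores must have the same length")
    rng = random.Random(seed)
    # Rank items by score, highest first, so strata run from core to fringe.
    ranked = [
        item
        for _, item in sorted(zip(scores, items), key=lambda p: p[0], reverse=True)
    ]
    n_strata = max(1, min(n_strata, len(ranked)))
    sample = []
    for k in range(n_strata):
        # Contiguous, roughly equal-size bins over the ranking.
        lo = k * len(ranked) // n_strata
        hi = (k + 1) * len(ranked) // n_strata
        stratum = ranked[lo:hi]
        sample.extend(rng.sample(stratum, min(per_stratum, len(stratum))))
    return sample
```

Compared with a uniform random sample, this guarantees that the LLM sees both highly typical documents and borderline ones, which is one plausible way to balance general and specific labels.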

Algorithms (more coming soon)

  • UMAP
  • HDBSCAN
  • TOPSIS
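Of the three, TOPSIS is probably the least familiar. It ranks alternatives by how close they are to an ideal solution across several criteria. The source does not say how ATMOSE applies it; one plausible use, assumed here, is ranking candidate clustering runs on several quality metrics at once. The sketch below is a generic, dependency-free TOPSIS, and the function name topsis and the example criteria are illustrative only.

```python
import math


def topsis(matrix, weights, benefit):
    """Return TOPSIS closeness scores in [0, 1], one per row (higher is better).

    matrix:  rows are alternatives, columns are criteria.
    weights: one weight per criterion.
    benefit: True where larger is better, False where smaller is better.
    """
    n_cols = len(weights)
    # Vector-normalize each column so criteria on different scales are comparable.
    norms = [
        math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0 for j in range(n_cols)
    ]
    weighted = [
        [weights[j] * row[j] / norms[j] for j in range(n_cols)] for row in matrix
    ]
    cols = list(zip(*weighted))
    # Ideal best and ideal worst value for each criterion.
    best = [max(c) if benefit[j] else min(c) for j, c in enumerate(cols)]
    worst = [min(c) if benefit[j] else max(c) for j, c in enumerate(cols)]
    scores = []
    for row in weighted:
        d_best = math.dist(row, best)
        d_worst = math.dist(row, worst)
        total = d_best + d_worst
        # If all alternatives are identical, every distance is zero.
        scores.append(d_worst / total if total else 0.5)
    return scores
```

For example, with hypothetical columns (silhouette, outlier fraction), calling topsis(runs, [0.5, 0.5], [True, False]) would favor runs with a high silhouette and few outliers.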

LLM agnostic (coming soon)

  • Any LLM available through LiteLLM will be supported

Experiment tracking (coming soon)

  • MLflow
  • Serialization and versioning of the dimensionality-reduction and clustering models
  • Helicone tracking (optional)

C++ compiler required

Before installing ATMOSE, ensure you have a C++ compiler for your platform:

  • Windows: Microsoft Visual C++ Build Tools

  • Linux: GCC/G++ compiler (sudo apt-get install build-essential on Ubuntu)

  • macOS: Xcode Command Line Tools (xcode-select --install)
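A quick way to check whether a compiler is already visible is to look for common compiler executables on PATH. The snippet below is a small convenience sketch using only the standard library. The list of candidate names is an assumption, and on Windows cl is normally on PATH only inside a Developer Command Prompt, so an empty result there is not conclusive.

```python
import shutil


def find_cpp_compilers(candidates=("g++", "clang++", "c++", "cl")):
    """Return the candidate compiler executables that are found on PATH."""
    return [name for name in candidates if shutil.which(name)]


if __name__ == "__main__":
    found = find_cpp_compilers()
    print("C++ compilers on PATH:", ", ".join(found) if found else "none found")
```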

Tip for building in Docker

Add this to your Dockerfile:

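# Install the C/C++ toolchain and system libraries, then clear the apt cache
RUN apt-get update && apt-get install -y \
    curl \
    build-essential \
    gcc \
    g++ \
    libpq-dev \
    libx11-dev \
    libxrandr-dev \
    libxext-dev \
    libxi-dev \
    libgl1-mesa-dev \
    && rm -rf /var/lib/apt/lists/*

# Pin the Poetry version; the official installer reads POETRY_VERSION
ENV POETRY_VERSION=1.8.2
RUN curl -sSL https://install.python-poetry.org | python3 -
# Make the poetry executable available on PATH
ENV PATH="/root/.local/bin:$PATH"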

Notebook demo

Quickstart

Number of LLM calls

  • 2 LLM calls per cluster
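Two calls per cluster is consistent with a draft-then-refine flow, sketched below. This is an assumption about how the two calls are split, not a description of ATMOSE's internals: the prompts, the label_cluster function, and the llm callable (any function that takes a prompt string and returns text) are all hypothetical.

```python
def label_cluster(samples, data_description, llm):
    """Label one cluster with exactly two LLM calls: draft, then refine.

    `llm` is any callable mapping a prompt string to a response string.
    The prompts are illustrative placeholders, not the package's prompts.
    """
    excerpt = "\n".join(f"- {s}" for s in samples)
    # Call 1: propose an initial label from the sampled texts.
    draft = llm(
        f"Dataset: {data_description}\nTexts:\n{excerpt}\n"
        "Give a short topic label for these texts."
    )
    # Call 2: refine the draft so it is neither too broad nor too narrow.
    refined = llm(
        f"Dataset: {data_description}\nTexts:\n{excerpt}\n"
        f"Draft label: {draft}\n"
        "Refine the label so it is neither too broad nor too narrow."
    )
    return refined.strip()


def total_llm_calls(n_clusters, calls_per_cluster=2):
    """Estimate the total number of LLM calls for a labeling run."""
    return n_clusters * calls_per_cluster
```

Under this budget, a run that finds 12 clusters would make 24 LLM calls, which is useful for estimating cost before labeling a large dataset.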

Instructions for use

The input DataFrame df is assumed to have these columns:

  • content_str
  • embedding_vector

After calling .label(df, data_description), df gains these additional columns:

  • cluster_id
  • topic_label
  • membership_score
  • outlier_score
  • silhouette_score
  • reduced_vector
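Once the columns above are present, a per-topic overview is a natural next step. The sketch below works on plain row dictionaries (with pandas, df.to_dict("records") produces rows of this shape) and uses only column names listed above. The helper summarize_topics is hypothetical, and it assumes every row in a cluster carries the same topic_label.

```python
from collections import defaultdict


def summarize_topics(rows, top_n=2):
    """Group labeled rows by cluster and list each cluster's strongest members.

    Assumes each row has content_str, cluster_id, topic_label and
    membership_score, and that topic_label is constant within a cluster.
    """
    clusters = defaultdict(list)
    for row in rows:
        clusters[row["cluster_id"]].append(row)
    summary = []
    for cluster_id, members in sorted(clusters.items()):
        # Highest membership first, so examples are the most typical texts.
        members.sort(key=lambda r: r["membership_score"], reverse=True)
        summary.append(
            {
                "cluster_id": cluster_id,
                "topic_label": members[0]["topic_label"],
                "size": len(members),
                "examples": [m["content_str"] for m in members[:top_n]],
            }
        )
    return summary
```

Sorting by membership_score surfaces the most representative texts for each label, which makes it easy to check by eye whether a label fits its cluster.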

