ACLOSE: Automatic Clustering and Labeling Of Semantic Embeddings

ACLOSE 🔍✨ Automatic Clustering and Labeling Of Semantic Embeddings

🌟 What is ACLOSE?

ACLOSE is a machine learning library that automates the discovery, labeling, and visualization of topics within text data. It combines dimensionality reduction (PCA and UMAP), density-based clustering (HDBSCAN), and large language models to turn raw embeddings into labeled topics with minimal code.

Think of it as automatic topic discovery without the headaches.

🔥 Why Use ACLOSE?

The Problem ACLOSE Solves

📊 Embedding vectors by themselves aren't helpful for understanding content themes
🧩 Manual topic discovery is tedious and doesn't scale to large datasets
🏷️ Labeling clusters is subjective and time-consuming
⚙️ Tuning clustering algorithms is complex and requires expertise

ACLOSE's Solution

ACLOSE offers a streamlined, three-step process:

Cluster text embeddings using optimized hyperparameters
Label the clusters with semantic topics using LLMs
Visualize the results with publication-quality interactive plots

No more guessing at parameters or manually interpreting cluster contents!

✨ Key Features

🤖 End-to-End Automation: From raw embeddings to labeled topics in just a few lines of code
📐 Multi-Objective Optimization: Intelligent hyperparameter tuning with Pareto front selection
🎯 Smart LLM-Based Labeling: Two-pass approach with core and peripheral point sampling for accurate topics
📊 Interactive Visualizations: Ready-to-use cluster exploration with minimal setup
⚡ Production Ready: Trained models that can be reused for classifying new data
📈 Drift Monitoring: Tools to detect when clustering models need retraining

📦 Installation

Prerequisites

Before installing, make sure you have a C++ compiler:

Windows: Install Microsoft Visual C++ Build Tools
Linux: sudo apt-get install build-essential
macOS: Install Xcode Command Line Tools with xcode-select --install

Install from PyPI

pip install aclose

🚀 Quick Start

1. Cluster your embeddings

import pandas as pd
from aclose import run_clustering

# Example DataFrame with embeddings
# ("..." stands for the remaining components of each vector; use real, full-length embeddings)
df = pd.DataFrame({
    "content": ["Text document 1", "Text document 2", "Text document 3"],
    "embedding_vector": [[0.1, 0.2, ...], [0.3, 0.4, ...], [0.5, 0.6, ...]]
})

# Run clustering with optimized parameters
result = run_clustering(df)

# Get the clustered dataframe
clustered_df = result["clustered_df"]

2. Label your clusters

from aclose import add_labels

# Generate semantic topic labels for clusters
label_result = add_labels(
    cluster_df=clustered_df,
    data_description="Dataset of scientific paper abstracts",
    llm_model="o1-mini"  # Use OpenAI models
)

# Get labeled dataframe and mapping
labeled_df = label_result["dataframe"]
topic_mapping = label_result["labels_dict"]

print(topic_mapping)  # {0: "Machine Learning Applications", 1: "Climate Change Research", ...}

3. Visualize your topics

from aclose import silhouette_fig, scatter_fig, bars_fig

# Generate and display three complementary visualizations
silhouette_fig(labeled_df).show()  # Assess cluster quality
scatter_fig(labeled_df, content_col_name="content").show()  # Explore semantic space
bars_fig(labeled_df).show()  # View topic distribution

📊 Visualizations

ACLOSE provides three powerful visualizations to help you understand your data:

🔍 Cluster Exploration (3D/2D Interactive)

Explore the semantic relationships between your documents in an interactive 3D or 2D visualization. Each point represents a document, color-coded by cluster, with topics labeled at cluster centers.

[Figure: Visualize Clusters]

📊 Topic Distribution

See the relative sizes of each topic in your dataset with a clear, color-coded bar chart. Quickly identify dominant themes and niche topics.

[Figure: Topic Prevalence]
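
The numbers behind such a bar chart are simple to compute yourself. The sketch below is not part of ACLOSE; `topic_shares` is a hypothetical helper that takes any sequence of topic labels (for example the values of the 'topic' column that `add_labels` adds) and returns counts and shares, largest first. The sample labels are made up.

```python
from collections import Counter

def topic_shares(topics):
    """Return (topic, count, share) tuples sorted from largest to smallest topic."""
    counts = Counter(topics)
    total = sum(counts.values())
    return [(topic, n, n / total) for topic, n in counts.most_common()]

# Made-up labels standing in for labeled_df["topic"]
topics = ["Machine Learning", "Climate", "Machine Learning",
          "Genomics", "Machine Learning", "Climate"]

for topic, n, share in topic_shares(topics):
    print(f"{topic:<18} {n:>3}  {share:.0%}")
```

Topics with a very small share are the "niche topics" mentioned above and are often worth inspecting by hand.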

📈 Cluster Quality Assessment

Evaluate the quality of your clustering with a silhouette plot. Higher values indicate better-defined clusters, helping you assess the reliability of your topics.

[Figure: Cluster Quality]
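
To make "higher values indicate better-defined clusters" concrete, here is a minimal pure-Python sketch of the standard silhouette coefficient, s = (b - a) / max(a, b), where a is a point's mean distance to its own cluster and b is its mean distance to the nearest other cluster. This is an illustration of the general metric, not ACLOSE's implementation; the function name and toy points are assumptions, and it expects at least two clusters.

```python
from math import dist
from statistics import mean

def silhouette_values(points, labels):
    """Silhouette coefficient of every point (illustrative, needs >= 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    values = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            values.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the sum includes dist(p, p) == 0, so divide by len(own) - 1)
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(mean(dist(p, q) for q in members)
                for other, members in clusters.items() if other != lab)
        values.append((b - a) / max(a, b))
    return values

# Toy data: two well-separated pairs of points
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
tight = silhouette_values(points, [0, 0, 1, 1])  # labels follow the geometry
mixed = silhouette_values(points, [0, 1, 0, 1])  # labels cut across it
print(round(mean(tight), 2), round(mean(mixed), 2))
```

Values near 1 mean a point sits well inside its cluster; values below 0 suggest it is closer to a different cluster than to its own.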

🧠 Use Cases

1. Quick Exploratory Data Analysis

Instantly discover the main themes in your text corpus without manual annotation or parameter tuning.

from aclose import run_clustering, add_labels, scatter_fig

result = run_clustering(df)
labeled = add_labels(result["clustered_df"])
scatter_fig(labeled["dataframe"]).show()

2. Experimentation and Refinement

Try different dimensionality settings before committing to expensive labeling operations:

# Try 2D clustering (good for visualization)
clustering_2d = run_clustering(df, dims=2)

# Try 3D clustering (better balance of viz & quality)
clustering_3d = run_clustering(df, dims=3)

# Let the algorithm find optimal dimensions
clustering_nd = run_clustering(df, dims=None)

# Compare metrics
print(clustering_2d["metrics_dict"])
print(clustering_3d["metrics_dict"])
print(clustering_nd["metrics_dict"])

# Choose the best and label it
best_clustering = clustering_3d  # based on metrics
labeled = add_labels(best_clustering["clustered_df"])

3. Production ML Pipeline Integration

Reuse trained models to classify new data and monitor distribution drift:

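import hdbscan
from aclose import run_clustering, add_labels

# Train initial models
clustering = run_clustering(training_df)
labeled = add_labels(clustering["clustered_df"])

# Extract models for reuse
umap_model = clustering["umap_model"]
hdbscan_model = clustering["hdbscan_model"]
topic_mapping = labeled["labels_dict"]

# Apply to new data
# get_embeddings is a placeholder for your own embedding function; it should use
# the same embedding model that produced the training vectors.
new_embeddings = get_embeddings(new_df)
# If PCA preprocessing was used during training, new_embeddings presumably need
# the same transform (clustering["pca_model"]) before the UMAP step.
reduced_vectors = umap_model.transform(new_embeddings)
new_labels, probabilities = hdbscan.approximate_predict(hdbscan_model, reduced_vectors)
# Labels missing from the mapping fall back to "Unknown"
new_df["topic"] = [topic_mapping.get(label, "Unknown") for label in new_labels]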
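
ACLOSE's own drift-monitoring tools are not shown in this README. As a generic illustration only, one simple drift signal is the share of points that HDBSCAN marks as noise (label -1): if it rises sharply on new data, the trained clusters may no longer fit. The function names and the 0.10 tolerance below are assumptions made for this sketch.

```python
def noise_ratio(labels):
    """Fraction of points labeled -1 (HDBSCAN's noise label)."""
    labels = list(labels)
    if not labels:
        return 0.0
    return sum(1 for lab in labels if lab == -1) / len(labels)

def needs_retraining(train_labels, new_labels, tolerance=0.10):
    """Flag drift when the noise share grows by more than `tolerance` (assumed threshold)."""
    return noise_ratio(new_labels) - noise_ratio(train_labels) > tolerance

# Made-up label arrays standing in for training and production predictions
train_labels = [0, 0, 1, 1, -1]
stable_batch = [0, 0, 1, -1, 1]
drifted_batch = [0, -1, -1, -1, 1]

print(needs_retraining(train_labels, stable_batch))
print(needs_retraining(train_labels, drifted_batch))
```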

⚙️ How It Works: The Magic Behind ACLOSE

ACLOSE is more than a simple pipeline. It combines the following techniques to produce high-quality topic clusters:

1. Smart Dimensionality Reduction

PCA Preprocessing: Optional noise reduction that preserves a target explained variance ratio
UMAP Transformation: Non-linear dimensionality reduction that maintains local structure
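
A target explained variance ratio translates into a number of PCA components: keep the smallest number of leading components whose cumulative ratio reaches the target. The sketch below shows that rule on made-up ratios; the function name is hypothetical, and how ACLOSE applies `target_pca_evr` internally is not confirmed here. (scikit-learn's `PCA` applies the same rule when `n_components` is a float between 0 and 1.)

```python
from itertools import accumulate

def components_for_target_evr(explained_variance_ratio, target_evr=0.9):
    """Smallest number of leading components whose cumulative ratio reaches the target."""
    cumulative_ratios = accumulate(explained_variance_ratio)
    for k, cumulative in enumerate(cumulative_ratios, start=1):
        if cumulative >= target_evr - 1e-12:  # tolerate floating-point round-off
            return k
    return len(explained_variance_ratio)  # target not reachable: keep everything

# Made-up per-component ratios, sorted in descending order as PCA reports them
ratios = [0.55, 0.25, 0.12, 0.05, 0.03]
n_components = components_for_target_evr(ratios, target_evr=0.9)
print(n_components)
```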

2. Intelligent Clustering

HDBSCAN: Density-based clustering that automatically finds natural groupings
Branch Detection: Optional hierarchical structure identification to find sub-topics

3. Advanced Hyperparameter Optimization

Triple-Objective Pareto Front: Balances silhouette score, noise ratio, and cluster count
TOPSIS Selection: Chooses the optimal configuration from the Pareto front
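
The two steps above can be sketched in plain Python: first drop every trial that another trial beats on all objectives (the Pareto front), then rank the survivors with TOPSIS, which prefers the point closest to the ideal and farthest from the worst case. This is a generic illustration, not ACLOSE's code. The trial values, the equal weights, and the choice to treat cluster count as something to minimize are all assumptions for the example.

```python
from math import sqrt

def pareto_front(rows, maximize):
    """Keep the rows that no other row dominates."""
    def oriented(row):  # flip signs so that larger is always better
        return [v if up else -v for v, up in zip(row, maximize)]
    def dominates(a, b):
        a, b = oriented(a), oriented(b)
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))
    return [r for r in rows if not any(dominates(o, r) for o in rows)]

def topsis(rows, maximize, weights):
    """Return the row with the highest TOPSIS closeness score."""
    n = len(maximize)
    # Vector-normalize each objective column, then apply the weights
    norms = [sqrt(sum(r[j] ** 2 for r in rows)) or 1.0 for j in range(n)]
    scaled = [[weights[j] * r[j] / norms[j] for j in range(n)] for r in rows]
    columns = list(zip(*scaled))
    best = [max(c) if up else min(c) for c, up in zip(columns, maximize)]
    worst = [min(c) if up else max(c) for c, up in zip(columns, maximize)]
    def closeness(s):
        d_best = sqrt(sum((x - y) ** 2 for x, y in zip(s, best)))
        d_worst = sqrt(sum((x - y) ** 2 for x, y in zip(s, worst)))
        return d_worst / (d_best + d_worst) if d_best + d_worst else 0.0
    return max(zip(scaled, rows), key=lambda pair: closeness(pair[0]))[1]

# Made-up trials: (silhouette score, noise ratio, cluster count)
trials = [(0.62, 0.10, 8), (0.55, 0.05, 6), (0.50, 0.20, 9), (0.70, 0.30, 12)]
maximize = (True, False, False)   # assumption: fewer clusters preferred
weights = (1 / 3, 1 / 3, 1 / 3)   # assumption: equal importance

front = pareto_front(trials, maximize)
chosen = topsis(front, maximize, weights)
print(front)
print(chosen)
```

Changing the weights or the direction of an objective changes which configuration wins, so treat these as tuning decisions rather than fixed facts.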

4. Two-Pass Topic Labeling

Core Point Sampling: Identifies representative documents from each cluster's center
Stratified Peripheral Sampling: Refines topics based on the full distribution of documents
Intelligent Prompting: Guides the LLM to generate specific, distinctive topic labels
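
One way to picture the two sampling passes, assuming each document carries a cluster membership strength between 0 and 1 (HDBSCAN exposes such values as `probabilities_`): the first pass takes the strongest members as core examples, and the second pass samples evenly across strength strata so that weaker, peripheral members are also represented. The function names, strata count, and sample sizes are hypothetical; ACLOSE's actual sampling logic may differ.

```python
import random

def core_sample(docs, k):
    """docs: list of (text, membership_strength). Return the k strongest members."""
    return sorted(docs, key=lambda d: d[1], reverse=True)[:k]

def stratified_sample(docs, n_strata, per_stratum, seed=0):
    """Sample from equal-sized strength strata so weak and strong members both appear."""
    ranked = sorted(docs, key=lambda d: d[1])
    if not ranked:
        return []
    rng = random.Random(seed)             # fixed seed for reproducible samples
    size = -(-len(ranked) // n_strata)    # ceiling division: documents per stratum
    picked = []
    for start in range(0, len(ranked), size):
        stratum = ranked[start:start + size]
        picked.extend(rng.sample(stratum, min(per_stratum, len(stratum))))
    return picked

# Made-up cluster of ten documents with increasing membership strength
docs = [(f"doc {i}", i / 9) for i in range(10)]

first_pass = core_sample(docs, k=3)                              # drafts the topic label
second_pass = stratified_sample(docs, n_strata=5, per_stratum=1)  # refines the label
print([text for text, _ in first_pass])
print([text for text, _ in second_pass])
```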

📖 Quick Documentation

For detailed documentation, including guidance on all hyperparameters, see DOCUMENTATION.md. Alternatively, if you're in a hurry, you can chat with the code using a (gimmicky) custom GPT.

Core Functions

`run_clustering`

Performs optimized clustering on embeddings and returns models and results.

result = run_clustering(
    filtered_df,                     # DataFrame with embedding_vector column
    min_clusters=3,                  # Minimum acceptable clusters
    max_clusters=25,                 # Maximum acceptable clusters
    dims=3,                          # Target dimensionality (None to optimize)
    target_pca_evr=0.9,              # PCA explained variance ratio target
    hdbscan_outlier_threshold=10,    # Percentile for core point detection
    # Many more configurable parameters...
)

Returns a dictionary with:

clustered_df: DataFrame with cluster assignments and metadata
umap_model: Fitted UMAP model for reuse
hdbscan_model: Fitted HDBSCAN model
pca_model: Fitted PCA model (if used)
metrics_dict: Clustering quality metrics
branch_detector: Branch detector (if used)

`add_labels`

Generates semantic topic labels for clusters using LLMs.

result = add_labels(
    cluster_df,                     # DataFrame from run_clustering
    llm_model="o1-mini",            # LLM to use for labeling
    language="english",             # Output language
    data_description="Scientific papers", # Context for the LLM
    content_col_name="abstract",    # Column with text content
    # More configuration options...
)

Returns a dictionary with:

dataframe: Original DataFrame with added 'topic' column
labels_dict: Mapping from cluster_id to topic label

Visualization Functions

silhouette_fig(df): Creates a silhouette plot for evaluating cluster quality
scatter_fig(df): Creates a 2D/3D scatter plot of document clusters
bars_fig(df): Creates a bar chart of topic distribution

🔧 Requirements

Python 3.10+
OpenAI API key (for LLM-based labeling)
Helicone API key (optional, for API call tracking)
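
Because labeling calls the OpenAI API, it can help to check for the key before starting a long run. The sketch below assumes the key is supplied through the `OPENAI_API_KEY` environment variable, which is what the official OpenAI Python client reads by default; whether ACLOSE reads it the same way is an assumption, so check DOCUMENTATION.md. The helper name is hypothetical.

```python
import os

def missing_env_vars(required=("OPENAI_API_KEY",)):
    """Return the names of required environment variables that are unset or empty."""
    return [name for name in required if not os.environ.get(name)]

missing = missing_env_vars()
if missing:
    print("Set these variables before calling add_labels:", ", ".join(missing))
```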

🤝 Contributing

We welcome contributions! See CONTRIBUTING.md for details.

Fork the repository
Create your feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

📄 License

Distributed under the MIT License. See LICENSE for more information.

🙏 Acknowledgments

Developed and maintained by Joe Nance
Built on the shoulders of giants: UMAP, HDBSCAN, Optuna, and OpenAI

💡 Request for Features

We plan to add the following features. If you would like to work on one, please create an issue and assign it to yourself:

Support for non-OpenAI and OSS LLMs via LiteLLM
More than two passes for topic label refinement
Support for other clustering algorithms
Lightweight (non-LangChain) utilities for chunking text and creating embeddings
More options for chart visualization

Release history

0.1.0 (this version): Mar 7, 2025
0.0.1: Mar 5, 2025
