A flexible framework for hierarchical clustering of text, numeric, or image data using LLMs.

pyHercules

pyHercules is a flexible Python framework for hierarchical clustering of text, numeric, or image data. Its core algorithm, Hercules, builds the hierarchy with a standard clustering method (recursive k-means or agglomerative clustering) and uses Large Language Models (LLMs) to summarize the clusters at each level. The project includes the core library (pyhercules), a set of "batteries-included" model functions, and a Dash web application for interactive exploration.

Key Features

  • Hierarchical Clustering: Automatically builds a tree of clusters from your data.
  • Choice of Clustering Algorithm: Select between 'kmeans' (default) for iterative partitioning or 'agglomerative' clustering for a bottom-up hierarchical approach.
  • Multi-Modal: Natively handles text, numeric (NumPy, Pandas), and image data (file paths, URLs, PIL Images). (One modality at a time.)
  • LLM-Powered Summarization: Uses Large Language Models (LLMs) to generate human-readable titles and descriptions for each cluster.
  • Flexible Representation: For k-means, choose between direct mode (using original data embeddings) or description mode (using LLM-generated summary embeddings) for clustering at higher levels.
  • Interactive Web App: An included Dash application (pyhercules_app.py) allows for easy data upload, parameter configuration, and visualization of clustering results.
  • Extensible: The core library is dependency-light. Bring your own model functions or use the provided ones in pyhercules_functions.py.
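
The level-by-level idea behind the features above can be illustrated with a small, dependency-free sketch. It is not the pyhercules implementation. It assumes 2-D points in place of embeddings, uses a simplified deterministic k-means, and builds each higher level by clustering the centroids of the level below, which is only one possible reading of direct mode. The names kmeans and build_levels are hypothetical.

```python
from math import dist


def kmeans(points, k, iters=20):
    """Simplified Lloyd's k-means on tuples; returns (labels, centroids)."""
    # Assumption: evenly spaced starting centroids keep the sketch deterministic.
    centroids = [points[i * len(points) // k] for i in range(k)]

    def assign():
        # Label each point with the index of its nearest centroid.
        return [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]

    for _ in range(iters):
        labels = assign()
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assign(), centroids


def build_levels(points, level_counts):
    """Cluster the points, then cluster the resulting centroids, once per count."""
    levels = []
    current = points
    for k in level_counts:
        labels, centroids = kmeans(current, k)
        levels.append(labels)
        current = centroids  # the next level groups this level's cluster centers
    return levels


points = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (20, 1)]
levels = build_levels(points, [3, 2])
for depth, labels in enumerate(levels, start=1):
    print(f"level {depth}: {labels}")
```

In this sketch the first list assigns each input point to a cluster, and each later list assigns the clusters of the previous level to a parent. Summarizing each cluster with an LLM, as the library does, is left out.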

Project Structure

  • pyhercules.py: The core clustering library. Contains the Hercules and Cluster classes.
  • pyhercules_functions.py: A collection of ready-to-use functions for embedding, captioning, and LLM calls (using Hugging Face, Google Gemini, etc.).
  • pyhercules_app.py: A comprehensive Dash web application for interactive clustering and visualization.
  • examples.ipynb: A Jupyter Notebook demonstrating various use cases of the library.
  • requirements-*.txt: Dependency files for different use cases (for reference).
  • setup.py: The packaging configuration script.

Installation

You can install pyhercules directly from PyPI. Several installation options are available depending on your needs.

1. Core Library Only

Choose this option to use the Hercules class with your own model client functions. It is a minimal, lightweight installation.

pip install pyhercules

2. Library with Model Functions

Choose this option to use the pre-built functions in pyhercules_functions.py (e.g., for running the examples.ipynb notebook).

pip install "pyhercules[models]"

3. Full Web Application

Choose this option to run the interactive Dash application. It installs all dependencies.

pip install "pyhercules[app]"
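
After any of the installs above, a quick way to confirm that the package is visible to the current interpreter is to query its metadata with the standard library. This is a generic check, not something pyhercules provides. The helper name installed_version is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print("pyhercules:", installed_version("pyhercules") or "not installed")
```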

Configuration: API Keys

To use models from Google or gated models from Hugging Face (like Gemma), you must configure your API keys. The recommended way is to create a .env file in your project's working directory:

# .env
GOOGLE_API_KEY="your-google-api-key-here"
HUGGINGFACE_HUB_TOKEN="your-hugging-face-token-for-gated-models"

The library will automatically load these variables. Alternatively, you can set them as system environment variables.
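
A small pre-flight check can show which of these variables are currently visible to Python. The sketch below is not part of pyhercules and the helper name missing_keys is hypothetical. It only inspects the process environment, so values that exist solely in a .env file appear once that file has been loaded. Neither key is needed if you use only local, non-gated models.

```python
import os

# Variable names taken from the .env example above.
API_KEYS = ("GOOGLE_API_KEY", "HUGGINGFACE_HUB_TOKEN")


def missing_keys(env=None, keys=API_KEYS):
    """Return the names in `keys` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in keys if not env.get(name)]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Not set in the environment:", ", ".join(missing))
    else:
        print("All API keys are set.")
```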

Usage

1. Running the Dash Web Application (Recommended)

The easiest way to get started is with the interactive app.

  1. Install dependencies:
    pip install "pyhercules[app]"
    
  2. Set API keys: Create a .env file as described in the Configuration section.
  3. Run the app:
    pyhercules-app
    

Then, open your web browser to http://127.0.0.1:8050.

2. Using the Core Library in Python

You can use the Hercules class directly in your scripts. See examples.ipynb for more detailed use cases.

from pyhercules import Hercules
from pyhercules_functions import local_minilm_l6_v2_embedding, local_gemma_3_4b_it_llm

# 1. Sample data
sample_texts = [
    "Introduction to machine learning concepts.",
    "Advanced techniques in deep neural networks.",
    "A guide to Python programming for beginners.",
    "Web development using Flask and Jinja.",
    "Understanding gradient descent and backpropagation.",
]

# 2. Instantiate Hercules with your chosen model clients
# Ensure you have set up your HUGGINGFACE_HUB_TOKEN in a .env file for Gemma
hercules = Hercules(
    level_cluster_counts=[3, 2],  # Desired hierarchy: 3 top-level, then subdivide
    representation_mode="direct",
    text_embedding_client=local_minilm_l6_v2_embedding,
    llm_client=local_gemma_3_4b_it_llm,
    verbose=1
)

# 3. Run clustering
top_clusters = hercules.cluster(sample_texts, topic_seed="computer science topics")

# 4. Print results
if top_clusters:
    for cluster in top_clusters:
        cluster.print_hierarchy(indent_increment=2, print_level_0=False)
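
The example above passes ready-made clients from pyhercules_functions. If you install only the core library, you supply these callables yourself. The sketch below shows what a stand-in text embedding function might look like. The calling convention (a list of strings in, one vector per string out) is an assumption that this README does not confirm, so check the functions in pyhercules_functions.py for the expected signature and return type. The name toy_hash_embedding is hypothetical, and a hashed bag-of-words is only a placeholder for a real embedding model.

```python
import hashlib
import math


def toy_hash_embedding(texts, dim=64):
    """Map each text to a unit-length hashed bag-of-words vector (toy stand-in)."""
    vectors = []
    for text in texts:
        vec = [0.0] * dim
        for token in text.lower().split():
            # A stable hash sends the same token to the same bucket on every run.
            digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
            vec[int(digest, 16) % dim] += 1.0
        # Normalize to unit length; an empty text stays a zero vector.
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vectors


vectors = toy_hash_embedding(["Web development using Flask", "Flask web apps"])
print(len(vectors), "vectors of dimension", len(vectors[0]))
```

Under that assumed convention, such a function would take the place of local_minilm_l6_v2_embedding in the text_embedding_client argument.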

Citation

If you find pyhercules useful in your research, please consider citing our paper.

@article{petnehazi2026hercules,
  title   = {HERCULES: Hierarchical Embedding-based Recursive Clustering Using LLMs for Efficient Summarization},
  author  = {Petnehazi, Gabor and Aradi, Bernadett},
  journal = {International Journal of Cognitive Computing in Engineering},
  volume  = {7},
  pages   = {445--456},
  year    = {2026},
  doi     = {10.1016/j.ijcce.2026.04.002},
  url     = {https://www.sciencedirect.com/science/article/pii/S2666307426000057}
}

License

This project is licensed under the MIT License. See the LICENSE file for details.
