
A library for extracting hierarchical structure from unstructured text using adaptive clustering


AdaptiveHierarchicalTextClustering

AdaptiveHierarchicalTextClustering is a Python library that extracts hierarchical structure from unstructured text using an adaptive clustering approach. It aims to provide an efficient, flexible way to organize and understand large volumes of text by grouping related segments into meaningful hierarchies.

Features

  • Adaptive threshold selection for optimal clustering
  • Rolling window approach for context-aware similarity calculation
  • Token-aware splitting to maintain coherent text segments
  • Hierarchical clustering with tree structure output
  • Easy integration with popular NLP libraries such as sentence-transformers

Installation

To install AdaptiveHierarchicalTextClustering, run the following command:

pip install adaptive-hierarchical-text-clustering

Quick Start

Here's a simple example of how to use AdaptiveHierarchicalTextClustering:

import numpy as np
from sentence_transformers import SentenceTransformer
from adaptive_hierarchical_text_clustering import AdaptiveHierarchicalTextClustering

# Prepare your text data: one string per sentence, in document order (placeholder values shown)
sentences = ["Your", "list", "of", "sentences", "here"]

# Encode sentences (using sentence-transformers as an example)
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(sentences)

# Calculate token counts (whitespace word count as a rough approximation of model tokens)
token_counts = [len(sentence.split()) for sentence in sentences]

# Initialize and fit the clustering model
clustering = AdaptiveHierarchicalTextClustering(
    threshold_adjustment=0.01,
    window_size=3,
    min_split_tokens=10,
    max_split_tokens=50,
    split_tokens_tolerance=5
)
clustering.fit(embeddings, np.array(token_counts))

# Access clustering results
print(clustering.labels_)
print(clustering.tree_)

For a more detailed example, including visualization of the hierarchical structure, see the examples/hierarchy_clustering.py file.
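
Once the model is fitted, a common next step is to regroup the original sentences by cluster. The sketch below assumes that labels_ holds one integer label per input sentence, in input order, following the scikit-learn convention; the text above does not state the shape of labels_, so treat this as an assumption. The helper name group_by_label is hypothetical and not part of the library.

```python
from collections import defaultdict


def group_by_label(sentences, labels):
    """Group sentences by flat cluster label.

    Assumes `labels` has one integer per sentence, in the same order
    (hypothetical helper; not part of the library).
    """
    groups = defaultdict(list)
    for sentence, label in zip(sentences, labels):
        groups[int(label)].append(sentence)
    return dict(groups)


if __name__ == "__main__":
    demo_sentences = ["Cats purr.", "Cats nap often.", "Stocks fell today."]
    demo_labels = [0, 0, 1]  # stand-in for clustering.labels_
    for label, members in group_by_label(demo_sentences, demo_labels).items():
        print(label, members)
```

With a fitted model, the call would be group_by_label(sentences, clustering.labels_), provided the assumption about labels_ holds.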

How It Works

AdaptiveHierarchicalTextClustering works by:

  1. Calculating similarity scores between text segments using a rolling window approach
  2. Adaptively finding an optimal threshold for clustering based on token counts
  3. Building a hierarchical structure of clusters
  4. Providing both flat cluster labels and a tree structure of the hierarchy

The algorithm is particularly suited for tasks where maintaining context and creating a meaningful hierarchy are important.
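
To make steps 1 and 2 above concrete, here is a minimal NumPy sketch of one way a rolling-window similarity and a token-aware threshold search can be written. It is an illustration under stated assumptions, not the library's implementation: the function names, the use of the window mean as context, the choice of the median segment size as the target, and the search bounds are all assumptions.

```python
import numpy as np


def rolling_window_similarities(embeddings, window_size=3):
    """Cosine similarity between each segment and the mean of up to
    `window_size` segments before it (assumed form of the rolling window)."""
    emb = np.asarray(embeddings, dtype=float)
    sims = []
    for i in range(1, len(emb)):
        context = emb[max(0, i - window_size):i].mean(axis=0)
        denom = np.linalg.norm(context) * np.linalg.norm(emb[i])
        sims.append(float(context @ emb[i] / denom) if denom else 0.0)
    return np.array(sims)


def split_indices(similarities, threshold):
    """A new segment starts at i + 1 when similarity i falls below the threshold."""
    return [i + 1 for i, s in enumerate(similarities) if s < threshold]


def segment_token_totals(token_counts, splits):
    """Total token count of each segment produced by the given split points."""
    bounds = [0, *splits, len(token_counts)]
    return [int(sum(token_counts[a:b])) for a, b in zip(bounds, bounds[1:])]


def find_threshold(similarities, token_counts, min_tokens=10, max_tokens=50,
                   tolerance=0, step=0.01):
    """Binary-search a threshold whose median segment size lands in
    [min_tokens - tolerance, max_tokens + tolerance] (assumed criterion)."""
    if len(similarities) == 0:
        return 0.0
    low, high = float(np.min(similarities)), float(np.max(similarities))
    threshold = (low + high) / 2
    while low <= high:
        threshold = (low + high) / 2
        totals = segment_token_totals(
            token_counts, split_indices(similarities, threshold))
        median = float(np.median(totals))
        if min_tokens - tolerance <= median <= max_tokens + tolerance:
            break
        if median < min_tokens:
            high = threshold - step  # segments too small: split less often
        else:
            low = threshold + step   # segments too large: split more often
    return threshold


if __name__ == "__main__":
    # Two toy "topics" represented by orthogonal vectors.
    emb = [[1, 0]] * 3 + [[0, 1]] * 3
    counts = [5] * 6
    sims = rolling_window_similarities(emb, window_size=3)
    t = find_threshold(sims, counts, min_tokens=10, max_tokens=20)
    print(sims, t, split_indices(sims, t))
```

A higher threshold produces more split points and therefore smaller segments, which is why the search moves the upper bound down when segments come out too small. The search returns its last candidate if no threshold satisfies the token range.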

Use Cases

  • Document summarization: Create hierarchical summaries of long documents
  • Topic modeling: Discover hierarchical topic structures in large text corpora
  • Content organization: Automatically organize and structure content for websites or knowledge bases
  • Text segmentation: Identify coherent segments within long texts
  • Hierarchical text classification: Create multi-level classification systems for text data

Contributing

Contributions are welcome. Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • This project was inspired by the need for better hierarchical text organization in various NLP tasks.
  • Thanks to the semantic-router project, which inspired the core functionality of this library.

Contact

If you have any questions or feedback, please open an issue on the GitHub repository.
