semantic-clustify

PyPI version Python 3.8+ License: MIT PyPI Downloads

A powerful and flexible Python tool for semantic clustering of text documents using vector embeddings with support for multiple algorithms and intelligent cluster optimization.

Simple Description

semantic-clustify is a command-line tool and Python library that groups text documents by semantic similarity using pre-computed vector embeddings. It supports multiple clustering algorithms (KMeans, DBSCAN, Hierarchical, Gaussian Mixture), automatic selection of the number of clusters, and JSONL input and output for document analysis pipelines.

Quick Start

pip install semantic-clustify

# Basic usage with vector embeddings
semantic-clustify \
  --input vectorized_data.jsonl \
  --embedding-field "embedding" \
  --method "kmeans" \
  --n-clusters 5

# Auto-detect optimal cluster number
semantic-clustify \
  --input data.jsonl \
  --embedding-field "embedding" \
  --method "kmeans" \
  --n-clusters auto

# Using stdin input
cat vectorized_data.jsonl | semantic-clustify \
  --embedding-field "embedding" \
  --method "dbscan"

Features

  • Multiple Clustering Algorithms: KMeans, DBSCAN, Hierarchical, Gaussian Mixture
  • Intelligent Cluster Optimization: Automatic optimal cluster number detection
  • Vector-Based Processing: Works with pre-computed embeddings from any source
  • JSONL Processing: Seamless input/output in JSONL format
  • High Performance: Optimized with Faiss for large-scale clustering
  • Error Resilience: Continue processing even if individual records fail
  • Stdin Support: Read input from pipes or stdin for flexible data processing
  • Smart Defaults: Default parameters optimized for common use cases
  • Flexible Input: Support file input, stdin, or explicit stdin markers
  • Cluster Quality Metrics: Silhouette score, inertia, and cluster statistics

Installation

Method 1: pip install (Recommended)

# Install core package
pip install semantic-clustify

# Install with specific algorithm support (quotes keep shells such as zsh from expanding the brackets)
pip install "semantic-clustify[faiss]"          # Faiss support for large-scale clustering
pip install "semantic-clustify[advanced]"       # Advanced clustering algorithms
pip install "semantic-clustify[all]"            # All clustering algorithms and optimizations

# Install with development dependencies
pip install "semantic-clustify[dev]"

Method 2: From source

# Clone repository
git clone https://github.com/changyy/py-semantic-clustify.git
cd py-semantic-clustify

# Automated development setup (recommended for contributors)
./setup.sh

# Manual setup
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install package  
pip install -e .                # Core package only
# or
pip install -e ".[dev]"         # With development dependencies
# or  
pip install -e ".[all,dev]"     # With all optional dependencies

Method 3: Development setup

# Automated setup for developers (recommended)
git clone https://github.com/changyy/py-semantic-clustify.git
cd py-semantic-clustify
./setup.sh

# Manual development setup
python3 -m venv venv
source venv/bin/activate

# Install in development mode
pip install -e ".[dev]"

# Run tests to verify installation
python test_runner.py --quick

Install additional packages based on the clustering algorithms you plan to use:

# For high-performance clustering
pip install faiss-cpu  # or faiss-gpu for GPU support

# For advanced clustering algorithms
pip install "scikit-learn>=1.0.0"  # quoted so the shell does not treat ">" as a redirect

# For visualization
pip install matplotlib seaborn plotly

Usage

Command Line Interface

semantic-clustify [OPTIONS]

Required Arguments

  • --embedding-field: Name of the field containing vector embeddings
  • --method: Clustering algorithm to use

Optional Arguments

  • --input: Path to input JSONL file (use "-" for stdin, or omit to read from stdin)
  • --n-clusters: Number of clusters (default: "auto" for automatic detection)
  • --min-cluster-size: Minimum cluster size (default: 2)
  • --max-clusters: Maximum clusters for auto-detection (default: 20)
  • --output-format: Output format - "grouped" or "labeled" (default: "grouped")
  • --output: Output file path (default: auto-generated)
  • --quality-metrics: Show clustering quality metrics

Quick Start Features

The tool supports smart defaults and flexible input methods for easier usage:

Default Algorithms

Each clustering method has optimized default parameters:

  • KMeans: Automatic cluster number detection with elbow method
  • DBSCAN: Adaptive eps and min_samples based on data characteristics
  • Hierarchical: Ward linkage with automatic distance threshold
  • GMM: Gaussian Mixture Model with BIC optimization

Flexible Input Methods

  • File input: --input data.jsonl
  • Stdin (auto-detect): cat data.jsonl | semantic-clustify ...
  • Explicit stdin: --input -

Minimal Example

# The simplest possible usage
cat vectorized_data.jsonl | semantic-clustify --embedding-field "embedding" --method "kmeans"

Input Format

JSONL file with pre-computed vector embeddings:

{"title": "Machine Learning Basics", "content": "Introduction to ML", "embedding": [0.1, 0.2, 0.3, ...]}
{"title": "Deep Learning Overview", "content": "Neural networks explained", "embedding": [0.15, 0.25, 0.35, ...]}
{"title": "Data Science Tools", "content": "Python libraries for data", "embedding": [0.8, 0.1, 0.2, ...]}
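
Since every record needs a vector of the same length under the chosen field, it can help to check a file before clustering. The following is a minimal sketch using only the standard library; `check_embeddings` is a hypothetical helper, not part of semantic-clustify.

```python
import json
from typing import Iterable, List, Tuple


def check_embeddings(lines: Iterable[str], field: str = "embedding") -> Tuple[int, List[str]]:
    """Return (vector dimension, list of problems) for JSONL lines.

    Hypothetical helper for illustration; not part of semantic-clustify.
    """
    dim = 0
    problems: List[str] = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        vector = record.get(field) if isinstance(record, dict) else None
        if not isinstance(vector, list) or not vector:
            problems.append(f"line {lineno}: missing or empty '{field}'")
            continue
        if not all(isinstance(x, (int, float)) and not isinstance(x, bool) for x in vector):
            problems.append(f"line {lineno}: non-numeric value in '{field}'")
            continue
        if dim == 0:
            dim = len(vector)  # the first valid vector fixes the expected dimension
        elif len(vector) != dim:
            problems.append(f"line {lineno}: expected {dim} values, got {len(vector)}")
    return dim, problems


sample = [
    '{"title": "A", "embedding": [0.1, 0.2, 0.3]}',
    '{"title": "B", "embedding": [0.4, 0.5]}',
    '{"title": "C"}',
]
dim, problems = check_embeddings(sample)
print(dim, problems)
```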

Output Formats

Grouped Format (Default)

[
  [
    {"title": "Machine Learning Basics", "content": "Introduction to ML", "embedding": [0.1, 0.2, 0.3, ...], "cluster_id": 0},
    {"title": "Deep Learning Overview", "content": "Neural networks explained", "embedding": [0.15, 0.25, 0.35, ...], "cluster_id": 0}
  ],
  [
    {"title": "Data Science Tools", "content": "Python libraries for data", "embedding": [0.8, 0.1, 0.2, ...], "cluster_id": 1}
  ]
]

Labeled Format

{"title": "Machine Learning Basics", "content": "Introduction to ML", "embedding": [0.1, 0.2, 0.3, ...], "cluster_id": 0}
{"title": "Deep Learning Overview", "content": "Neural networks explained", "embedding": [0.15, 0.25, 0.35, ...], "cluster_id": 0}
{"title": "Data Science Tools", "content": "Python libraries for data", "embedding": [0.8, 0.1, 0.2, ...], "cluster_id": 1}
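
The two formats carry the same information, so one can be derived from the other. The sketch below converts in both directions with plain Python; the function names are hypothetical, and it assumes each labeled record has an integer `cluster_id` as in the samples above.

```python
import json
from typing import Dict, List


def grouped_to_labeled(groups: List[List[Dict]]) -> List[Dict]:
    """Flatten grouped clusters into one record per document."""
    labeled = []
    for cluster_id, group in enumerate(groups):
        for doc in group:
            # Keep an existing cluster_id; otherwise fall back to the group position
            labeled.append({**doc, "cluster_id": doc.get("cluster_id", cluster_id)})
    return labeled


def labeled_to_grouped(records: List[Dict]) -> List[List[Dict]]:
    """Group labeled records by cluster_id, ordered by ascending id."""
    buckets: Dict[int, List[Dict]] = {}
    for doc in records:
        buckets.setdefault(doc["cluster_id"], []).append(doc)
    return [buckets[key] for key in sorted(buckets)]


grouped = [
    [{"title": "Machine Learning Basics", "cluster_id": 0},
     {"title": "Deep Learning Overview", "cluster_id": 0}],
    [{"title": "Data Science Tools", "cluster_id": 1}],
]
labeled = grouped_to_labeled(grouped)
for record in labeled:
    print(json.dumps(record))  # one JSON object per line, as in the labeled format
```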

Supported Algorithms

KMeans Clustering

  • Best for: Well-separated, spherical clusters
  • Auto-optimization: Elbow method, silhouette analysis
  • Parameters: n_clusters, init, max_iter
  • Performance: Excellent for large datasets
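
To make the silhouette analysis mentioned above concrete, here is a small pure-Python version of the mean silhouette coefficient with Euclidean distance. It is a sketch for understanding the metric on tiny inputs, not the code the tool runs; a library implementation such as scikit-learn's would normally be used in practice.

```python
import math
from typing import List, Sequence


def silhouette_score(points: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean silhouette coefficient with Euclidean distance (O(n^2), small inputs only)."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores: List[float] = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Distances from point i to every other point, bucketed by cluster label
        by_cluster = {c: [] for c in clusters}
        for j, (q, lab) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster[lab].append(math.dist(p, q))
        if not by_cluster[own]:
            scores.append(0.0)  # singleton cluster: silhouette is defined as 0
            continue
        a = sum(by_cluster[own]) / len(by_cluster[own])  # cohesion
        b = min(sum(d) / len(d) for c, d in by_cluster.items() if c != own)  # separation
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
good = silhouette_score(points, [0, 0, 1, 1])  # labels follow the two visible groups
bad = silhouette_score(points, [0, 1, 0, 1])   # labels cut across the groups
print(round(good, 3), round(bad, 3))
```

Scores near 1 indicate compact, well-separated clusters; negative scores indicate points that sit closer to another cluster than to their own.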

DBSCAN (Density-Based)

  • Best for: Arbitrary shapes, noise detection
  • Auto-optimization: Adaptive eps using k-distance graph
  • Parameters: eps, min_samples
  • Performance: Good for irregular cluster shapes and noisy data; a single eps can struggle when cluster densities differ widely
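
The k-distance idea behind an adaptive eps can be sketched as follows: compute each point's distance to its k-th nearest neighbour, sort those distances, and take the value at the knee of the curve as a starting eps. This is a generic illustration with hypothetical helper names; how semantic-clustify actually derives eps is not documented here.

```python
import math
from typing import List, Sequence


def k_distances(points: Sequence[Sequence[float]], k: int) -> List[float]:
    """Sorted distance from each point to its k-th nearest neighbour."""
    if not 1 <= k < len(points):
        raise ValueError("k must satisfy 1 <= k < number of points")
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)


def knee_value(curve: Sequence[float]) -> float:
    """Value on the curve farthest from the straight line joining its endpoints."""
    n = len(curve)
    if n < 3:
        return curve[-1]
    x0, y0, x1, y1 = 0.0, curve[0], float(n - 1), curve[-1]

    def gap(i: int) -> float:
        # Unnormalised perpendicular distance of (i, curve[i]) from the chord
        return abs((y1 - y0) * i - (x1 - x0) * curve[i] + x1 * y0 - y1 * x0)

    return curve[max(range(n), key=gap)]


# Two tight groups plus one far-away outlier
points = [
    (0.0, 0.0), (0.0, 0.1), (0.1, 0.0), (0.1, 0.1),
    (5.0, 5.0), (5.0, 5.1), (5.1, 5.0), (5.1, 5.1),
    (20.0, 20.0),
]
eps = knee_value(k_distances(points, k=2))
print(eps)
```

The knee value is only a starting point; in practice eps is usually nudged upward slightly and checked against the resulting noise ratio.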

Hierarchical Clustering

  • Best for: Nested cluster structures
  • Auto-optimization: Dendrogram analysis for optimal cuts
  • Parameters: linkage, distance_threshold
  • Performance: Good for small to medium datasets

Gaussian Mixture Model (GMM)

  • Best for: Overlapping clusters, probabilistic assignment
  • Auto-optimization: BIC/AIC for component selection
  • Parameters: n_components, covariance_type
  • Performance: Good for probabilistic clustering
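
Component selection with BIC or AIC boils down to penalising the fitted log-likelihood by the number of free parameters and keeping the lowest score. The sketch below uses the standard formulas; the log-likelihood values are made up purely for illustration, and the parameter count assumes full covariance matrices.

```python
import math


def gmm_param_count(n_components: int, n_features: int) -> int:
    """Free parameters of a full-covariance Gaussian mixture."""
    weights = n_components - 1
    means = n_components * n_features
    covariances = n_components * n_features * (n_features + 1) // 2
    return weights + means + covariances


def bic(log_likelihood: float, n_params: int, n_samples: int) -> float:
    """Bayesian information criterion; lower is better."""
    return n_params * math.log(n_samples) - 2.0 * log_likelihood


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion; lower is better."""
    return 2.0 * n_params - 2.0 * log_likelihood


# Made-up total log-likelihoods for 200 two-dimensional points (illustration only)
n_samples, n_features = 200, 2
log_likelihoods = {1: -900.0, 2: -700.0, 3: -695.0}

scores = {
    k: bic(ll, gmm_param_count(k, n_features), n_samples)
    for k, ll in log_likelihoods.items()
}
best_k = min(scores, key=scores.get)
print(best_k, round(scores[best_k], 2))
```

In this made-up case the third component barely improves the likelihood, so its extra parameters outweigh the gain and two components score best.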

Examples

Example 1: Basic KMeans with automatic cluster detection

semantic-clustify \
  --input documents.jsonl \
  --embedding-field "embedding" \
  --method "kmeans" \
  --n-clusters auto \
  --quality-metrics \
  --output clustered_documents.jsonl

Example 2: DBSCAN for density-based clustering

cat news_articles.jsonl | semantic-clustify \
  --embedding-field "vector" \
  --method "dbscan" \
  --min-cluster-size 3 \
  --output-format "labeled"

Example 3: Hierarchical clustering with custom parameters

semantic-clustify \
  --input research_papers.jsonl \
  --embedding-field "text_embedding" \
  --method "hierarchical" \
  --n-clusters 8 \
  --output hierarchical_clusters.jsonl

Example 4: Large-scale clustering with Faiss optimization

semantic-clustify \
  --input large_dataset.jsonl \
  --embedding-field "embedding" \
  --method "kmeans" \
  --n-clusters auto \
  --max-clusters 50 \
  --output-format "grouped"

Example 5: Using stdin with quality metrics

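# Several records are needed: a single document cannot form clusters
printf '%s\n' \
  '{"title": "Sample A", "embedding": [0.1, 0.2, 0.3]}' \
  '{"title": "Sample B", "embedding": [0.15, 0.25, 0.35]}' \
  '{"title": "Sample C", "embedding": [0.8, 0.1, 0.2]}' \
  '{"title": "Sample D", "embedding": [0.82, 0.12, 0.22]}' | semantic-clustify \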
  --input - \
  --embedding-field "embedding" \
  --method "kmeans" \
  --quality-metrics

Example 6: GMM clustering for overlapping clusters

semantic-clustify \
  --input mixed_topics.jsonl \
  --embedding-field "semantic_vector" \
  --method "gmm" \
  --n-clusters auto \
  --output probabilistic_clusters.jsonl

CLI-First Development Workflow

For the most efficient development experience, we recommend starting with CLI experimentation:

# Step 1: Try the interactive workflow demo
python examples/cli_clustering_demo.py

# Step 2: Try the comprehensive clustering guide  
python examples/clustering_workflow_guide.py

# Step 3: Use your own vectorized data with CLI-first approach

Benefits: Fast iteration → Parameter tuning → Library integration → Optimal clustering

Library Usage

For programmatic integration, semantic-clustify provides a Python API that processes data in memory as a list of dictionaries (List[Dict]). We recommend a CLI-first workflow: tune and validate parameters on the command line, then carry them into library code.

Recommended Development Workflow

Step 1: CLI Experimentation (Parameter Tuning)

Start with CLI commands on small datasets to find optimal parameters:

# Test different algorithms and parameters
semantic-clustify \
  --input small_sample.jsonl \
  --embedding-field "embedding" \
  --method "kmeans" \
  --n-clusters auto \
  --quality-metrics \
  --output test_kmeans.jsonl

# Compare with DBSCAN
semantic-clustify \
  --input small_sample.jsonl \
  --embedding-field "embedding" \
  --method "dbscan" \
  --quality-metrics \
  --output test_dbscan.jsonl

# Analyze results
head -10 test_kmeans.jsonl
python -c "import json; data=json.load(open('test_kmeans.jsonl')); print(f'Found {len(data)} clusters')"
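
For a slightly fuller look at the grouped output than the one-liner above, a short script can report cluster sizes and how dominant the largest cluster is. This is a sketch assuming the grouped format (a JSON array of arrays); `summarize` is a hypothetical helper.

```python
import json
from typing import Dict, List


def summarize(groups: List[List[Dict]]) -> Dict:
    """Basic size statistics for grouped clustering output."""
    sizes = [len(group) for group in groups]
    total = sum(sizes)
    return {
        "n_clusters": len(sizes),
        "n_documents": total,
        "sizes": sizes,
        "largest_share": max(sizes, default=0) / total if total else 0.0,
    }


# Stand-in for json.load(open("test_kmeans.jsonl")) on grouped output
grouped = json.loads('[[{"title": "A"}, {"title": "B"}], [{"title": "C"}]]')
summary = summarize(grouped)
print(summary)
```

A very large `largest_share` often means most documents collapsed into one cluster, which is a hint to try different parameters.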

Step 2: Library Integration (Optimized Parameters)

Switch to library usage with validated parameters:

from semantic_clustify import SemanticClusterer

# Use parameters validated from CLI experiments
clusterer = SemanticClusterer(
    method="kmeans",
    n_clusters=5,  # From CLI optimization
    min_cluster_size=2
)

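# Process data in memory (illustrative: real input needs full-length vectors
# and at least as many documents as n_clusters)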
data = [
    {"title": "AI Research", "embedding": [0.1, 0.2, 0.3, ...]},
    {"title": "ML Applications", "embedding": [0.15, 0.25, 0.35, ...]},
    {"title": "Data Analysis", "embedding": [0.8, 0.1, 0.2, ...]}
]

# Perform clustering
clustered_groups = clusterer.fit_predict(data, vector_field="embedding")

# Results are grouped by cluster
for cluster_id, group in enumerate(clustered_groups):
    print(f"Cluster {cluster_id}: {len(group)} documents")

Quick Start with In-Memory Processing

from semantic_clustify import SemanticClusterer

# Process data directly in memory
data = [
    {"title": "Python Programming", "content": "Learn Python", "embedding": [0.1, 0.2, 0.3]},
    {"title": "Machine Learning", "content": "ML concepts", "embedding": [0.15, 0.25, 0.35]},
    {"title": "Web Development", "content": "Build websites", "embedding": [0.8, 0.1, 0.2]},
    {"title": "Data Science", "content": "Analyze data", "embedding": [0.12, 0.22, 0.32]}
]

# Create clusterer with automatic cluster detection
clusterer = SemanticClusterer(
    method="kmeans",
    n_clusters="auto",
    min_cluster_size=2
)

# Perform clustering
clusters = clusterer.fit_predict(data, vector_field="embedding")

# Print results
for i, cluster in enumerate(clusters):
    print(f"\nCluster {i} ({len(cluster)} documents):")
    for doc in cluster:
        print(f"  - {doc['title']}")

# Get clustering metrics
metrics = clusterer.get_quality_metrics()
print(f"\nClustering Quality:")
print(f"Silhouette Score: {metrics['silhouette_score']:.3f}")
print(f"Number of Clusters: {metrics['n_clusters']}")

Advanced Library Integration

Batch Processing with Multiple Algorithms

from semantic_clustify import SemanticClusterer
from typing import List, Dict

def compare_clustering_methods(documents: List[Dict], 
                             vector_field: str = "embedding") -> Dict:
    """
    Compare different clustering algorithms and return best results.
    
    Args:
        documents: List of dictionaries with vector embeddings
        vector_field: Name of the field containing vectors
    
    Returns:
        Dictionary with comparison results and best clustering
    """
    
    methods = ["kmeans", "dbscan", "hierarchical", "gmm"]
    results = {}
    
    for method in methods:
        try:
            clusterer = SemanticClusterer(
                method=method,
                n_clusters="auto",
                min_cluster_size=2
            )
            
            clusters = clusterer.fit_predict(documents, vector_field=vector_field)
            metrics = clusterer.get_quality_metrics()
            
            results[method] = {
                "clusters": clusters,
                "metrics": metrics,
                "n_clusters": len(clusters),
                "silhouette_score": metrics.get("silhouette_score", 0)
            }
            
        except Exception as e:
            print(f"Method {method} failed: {e}")
            results[method] = None
    
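    # Find best method by silhouette score, considering only methods that succeeded
    succeeded = [k for k, v in results.items() if v is not None]
    if not succeeded:
        raise RuntimeError("All clustering methods failed")
    best_method = max(succeeded, key=lambda k: results[k]["silhouette_score"])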
    
    return {
        "all_results": results,
        "best_method": best_method,
        "best_clusters": results[best_method]["clusters"],
        "comparison_summary": {
            method: {
                "n_clusters": res["n_clusters"] if res else 0,
                "silhouette": res["silhouette_score"] if res else 0
            } for method, res in results.items()
        }
    }

# Usage example
documents = [
    {"title": "AI Research", "embedding": [0.1, 0.2, 0.3]},
    {"title": "ML Applications", "embedding": [0.15, 0.25, 0.35]},
    {"title": "Data Analysis", "embedding": [0.8, 0.1, 0.2]},
    {"title": "Statistics", "embedding": [0.82, 0.12, 0.22]}
]

comparison = compare_clustering_methods(documents)
print(f"Best method: {comparison['best_method']}")
print(f"Best clustering has {len(comparison['best_clusters'])} clusters")

Dynamic Clustering Pipeline

from semantic_clustify import SemanticClusterer
from typing import Dict, List, Optional
import logging

class DynamicClusteringPipeline:
    """
    Pipeline for adaptive clustering based on data characteristics.
    """
    
    def __init__(self, min_cluster_size: int = 2, max_clusters: int = 20):
        self.min_cluster_size = min_cluster_size
        self.max_clusters = max_clusters
        self.clusterers = {}
        
    def analyze_data_characteristics(self, data: List[Dict], 
                                   vector_field: str) -> Dict:
        """Analyze data to suggest optimal clustering approach."""
        import numpy as np
        
        vectors = np.array([item[vector_field] for item in data])
        n_samples, n_features = vectors.shape
        
        # Calculate data characteristics
        characteristics = {
            "n_samples": n_samples,
            "n_features": n_features,
            "vector_std": float(np.std(vectors)),
            "vector_mean_norm": float(np.mean(np.linalg.norm(vectors, axis=1))),
            "suggested_method": self._suggest_method(n_samples, n_features)
        }
        
        return characteristics
    
    def _suggest_method(self, n_samples: int, n_features: int) -> str:
        """Suggest clustering method based on data size."""
        if n_samples < 100:
            return "hierarchical"  # Good for small datasets
        return "kmeans"  # Scales to medium and large datasets
    
    def adaptive_clustering(self, data: List[Dict], 
                          vector_field: str,
                          method: Optional[str] = None) -> List[List[Dict]]:
        """Perform adaptive clustering based on data characteristics."""
        
        # Analyze data if method not specified
        if method is None:
            characteristics = self.analyze_data_characteristics(data, vector_field)
            method = characteristics["suggested_method"]
            logging.info(f"Auto-selected method: {method}")
        
        # Create or reuse clusterer
        if method not in self.clusterers:
            self.clusterers[method] = SemanticClusterer(
                method=method,
                n_clusters="auto",
                min_cluster_size=self.min_cluster_size,
                max_clusters=self.max_clusters
            )
        
        clusterer = self.clusterers[method]
        
        # Perform clustering
        clusters = clusterer.fit_predict(data, vector_field=vector_field)
        
        # Log results
        metrics = clusterer.get_quality_metrics()
        logging.info(f"Clustering completed: {len(clusters)} clusters, "
                    f"silhouette score: {metrics.get('silhouette_score', 'N/A')}")
        
        return clusters

# Usage example (reuses the `documents` list defined in the previous example)
pipeline = DynamicClusteringPipeline(min_cluster_size=3, max_clusters=15)

# Automatic method selection
clusters = pipeline.adaptive_clustering(documents, "embedding")

# Or force specific method
kmeans_clusters = pipeline.adaptive_clustering(documents, "embedding", method="kmeans")

Integration Patterns

Flask Web Application Integration

# clustering_service.py
from semantic_clustify import SemanticClusterer
from flask import Flask, request, jsonify
from typing import List, Dict
import numpy as np

app = Flask(__name__)

class ClusteringService:
    """Service for real-time document clustering in web applications."""
    
    def __init__(self):
        # Pre-initialize clusterers for different scenarios
        self.clusterers = {
            "fast": SemanticClusterer(method="kmeans", n_clusters="auto"),
            "precise": SemanticClusterer(method="hierarchical", n_clusters="auto"),
            "density": SemanticClusterer(method="dbscan", min_cluster_size=3)
        }
    
    def cluster_documents(self, documents: List[Dict], 
                         vector_field: str = "embedding",
                         mode: str = "fast") -> Dict:
        """Cluster documents with specified quality mode."""
        
        if mode not in self.clusterers:
            mode = "fast"
        
        clusterer = self.clusterers[mode]
        
        try:
            clusters = clusterer.fit_predict(documents, vector_field=vector_field)
            metrics = clusterer.get_quality_metrics()
            
            return {
                "success": True,
                "clusters": clusters,
                "metrics": metrics,
                "n_clusters": len(clusters),
                "mode": mode
            }
        
        except Exception as e:
            return {
                "success": False,
                "error": str(e),
                "mode": mode
            }
    
    def find_similar_clusters(self, query_vector: List[float], 
                            existing_clusters: List[List[Dict]],
                            vector_field: str = "embedding",
                            threshold: float = 0.7) -> List[int]:
        """Find clusters similar to a query vector."""
        similar_clusters = []
        
        for cluster_id, cluster in enumerate(existing_clusters):
            # Calculate cluster centroid
            vectors = [doc[vector_field] for doc in cluster]
            centroid = np.mean(vectors, axis=0)
            
            # Calculate similarity
            similarity = self._cosine_similarity(query_vector, centroid)
            
            if similarity >= threshold:
                similar_clusters.append(cluster_id)
        
        return similar_clusters
    
    def _cosine_similarity(self, a: List[float], b: List[float]) -> float:
        """Calculate cosine similarity between two vectors."""
        a_np, b_np = np.array(a), np.array(b)
        return float(np.dot(a_np, b_np) / (np.linalg.norm(a_np) * np.linalg.norm(b_np)))

# Initialize service
clustering_service = ClusteringService()

@app.route('/cluster', methods=['POST'])
def cluster_documents():
    """API endpoint for document clustering."""
    data = request.get_json(silent=True) or {}  # tolerate a missing or non-JSON body
    
    documents = data.get('documents', [])
    vector_field = data.get('vector_field', 'embedding')
    mode = data.get('mode', 'fast')
    
    result = clustering_service.cluster_documents(
        documents, vector_field, mode
    )
    
    return jsonify(result)

@app.route('/find_similar', methods=['POST'])
def find_similar():
    """API endpoint for finding similar clusters."""
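    data = request.get_json(silent=True) or {}
    
    query_vector = data.get('query_vector')
    existing_clusters = data.get('clusters')
    threshold = data.get('threshold', 0.7)
    
    # Both inputs are required; reject the request rather than fail inside NumPy
    if query_vector is None or existing_clusters is None:
        return jsonify({"error": "query_vector and clusters are required"}), 400
    
    similar_clusters = clustering_service.find_similar_clusters(
        query_vector, existing_clusters, threshold=threshold
    )
    
    return jsonify({"similar_clusters": similar_clusters})

if __name__ == '__main__':
    app.run(debug=True)  # debug mode is for local development only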

Data Pipeline Integration

import pandas as pd
from semantic_clustify import SemanticClusterer
from typing import Dict, Iterator, List  # List is used by save_batch_clusters below

def clustering_pipeline(data_source: Iterator[Dict], 
                       output_path: str,
                       vector_field: str = "embedding",
                       method: str = "kmeans",
                       batch_size: int = 1000) -> None:
    """
    Process large datasets in batches with clustering.
    
    This approach handles memory efficiently for large datasets.
    """
    
    clusterer = SemanticClusterer(
        method=method,
        n_clusters="auto",
        min_cluster_size=2
    )
    
    batch = []
    processed_count = 0
    
    for item in data_source:
        batch.append(item)
        
        if len(batch) >= batch_size:
            # Process batch
            clusters = clusterer.fit_predict(batch, vector_field=vector_field)
            
            # Save batch results
            save_batch_clusters(clusters, output_path, processed_count)
            
            processed_count += len(batch)
            batch = []
            
            print(f"Processed {processed_count} items, found {len(clusters)} clusters")
    
    # Process remaining items
    if batch:
        clusters = clusterer.fit_predict(batch, vector_field=vector_field)
        save_batch_clusters(clusters, output_path, processed_count)

def save_batch_clusters(clusters: List[List[Dict]], 
                       output_path: str, 
                       batch_offset: int) -> None:
    """Save clustering results for a batch."""
    import json
    
    mode = 'a' if batch_offset > 0 else 'w'
    
    with open(output_path, mode) as f:
        for cluster_id, cluster in enumerate(clusters):
            cluster_data = {
                "batch_offset": batch_offset,
                "cluster_id": cluster_id,
                "documents": cluster,
                "size": len(cluster)
            }
            f.write(json.dumps(cluster_data) + '\n')

# Usage with pandas
def cluster_dataframe(df: pd.DataFrame, 
                     vector_column: str = "embedding",
                     method: str = "kmeans") -> pd.DataFrame:
    """Add cluster labels to a pandas DataFrame."""
    
    clusterer = SemanticClusterer(method=method, n_clusters="auto")
    
    # Convert DataFrame to list of dicts, tagging each record with its row position
    # so it can be matched after clustering without comparing whole records
    data = [{**row, "_row_idx": i} for i, row in enumerate(df.to_dict('records'))]
    
    # Perform clustering  
    clusters = clusterer.fit_predict(data, vector_field=vector_column)
    
    # Map cluster labels back to rows; rows missing from the output keep -1
    # (assumes fit_predict preserves extra fields, as in the output formats above)
    labels = [-1] * len(df)
    for cluster_id, cluster in enumerate(clusters):
        for doc in cluster:
            labels[doc["_row_idx"]] = cluster_id
    
    df['cluster_id'] = labels
    
    return df

Performance Optimization

Memory-Efficient Large-Scale Clustering

from semantic_clustify import SemanticClusterer
import numpy as np
from typing import List, Dict, Generator, Iterable

class LargeScaleClusterer:
    """
    Memory-efficient clustering for large datasets.
    """
    
    def __init__(self, method: str = "kmeans", 
                 chunk_size: int = 10000,
                 use_faiss: bool = True):
        self.method = method
        self.chunk_size = chunk_size
        self.use_faiss = use_faiss
        
    def cluster_large_dataset(self, data_generator: Iterable[Dict],
                            vector_field: str = "embedding",
                            sample_ratio: float = 0.1) -> List[List[Dict]]:
        """
        Cluster large dataset using sampling and batch processing.
        
        Args:
            data_generator: Re-iterable source of data items. It is iterated
                twice (sampling, then assignment), so a one-shot generator
                would already be exhausted when assignment starts.
            vector_field: Field containing vector embeddings
            sample_ratio: Ratio of data to sample for initial clustering
        
        Returns:
            List of clusters
        """
        
        # Step 1: Sample data for initial clustering
        sample_data = self._sample_data(data_generator, sample_ratio)
        
        # Step 2: Perform clustering on sample
        sample_clusterer = SemanticClusterer(
            method=self.method,
            n_clusters="auto"
        )
        
        sample_clusters = sample_clusterer.fit_predict(
            sample_data, vector_field=vector_field
        )
        
        # Step 3: Extract cluster centroids
        centroids = self._extract_centroids(sample_clusters, vector_field)
        
        # Step 4: Assign remaining data to clusters
        full_clusters = self._assign_to_clusters(
            data_generator, centroids, vector_field
        )
        
        return full_clusters
    
    def _sample_data(self, data_generator: Generator, 
                    sample_ratio: float) -> List[Dict]:
        """Sample data from generator."""
        import random
        
        sample_data = []
        for item in data_generator:
            if random.random() < sample_ratio:
                sample_data.append(item)
                
            # Limit sample size
            if len(sample_data) >= 10000:
                break
        
        return sample_data
    
    def _extract_centroids(self, clusters: List[List[Dict]], 
                          vector_field: str) -> np.ndarray:
        """Extract cluster centroids."""
        centroids = []
        
        for cluster in clusters:
            vectors = np.array([doc[vector_field] for doc in cluster])
            centroid = np.mean(vectors, axis=0)
            centroids.append(centroid)
        
        return np.array(centroids)
    
    def _assign_to_clusters(self, data_generator: Generator,
                           centroids: np.ndarray,
                           vector_field: str) -> List[List[Dict]]:
        """Assign all data points to nearest centroids."""
        
        # Initialize clusters
        clusters = [[] for _ in range(len(centroids))]
        
        # Process data in chunks
        chunk = []
        for item in data_generator:
            chunk.append(item)
            
            if len(chunk) >= self.chunk_size:
                self._assign_chunk_to_clusters(chunk, centroids, clusters, vector_field)
                chunk = []
        
        # Process remaining chunk
        if chunk:
            self._assign_chunk_to_clusters(chunk, centroids, clusters, vector_field)
        
        return clusters
    
    def _assign_chunk_to_clusters(self, chunk: List[Dict],
                                 centroids: np.ndarray,
                                 clusters: List[List[Dict]],
                                 vector_field: str) -> None:
        """Assign chunk of data to clusters."""
        
        # Extract vectors from chunk
        vectors = np.array([item[vector_field] for item in chunk])
        
        # Calculate distances to centroids
        distances = np.linalg.norm(
            vectors[:, np.newaxis] - centroids[np.newaxis, :], 
            axis=2
        )
        
        # Assign to nearest centroid
        assignments = np.argmin(distances, axis=1)
        
        # Add to clusters
        for item, cluster_id in zip(chunk, assignments):
            item['cluster_id'] = int(cluster_id)
            clusters[cluster_id].append(item)

# Usage example
def process_large_jsonl(file_path: str, output_path: str):
    """Process large JSONL file with memory-efficient clustering."""
    import json
    
    class JsonlSource:
        """Re-iterable JSONL reader: every iteration reopens the file.
        
        cluster_large_dataset() iterates its input twice (sampling, then
        assignment), so a one-shot generator would be exhausted after sampling.
        """
        def __iter__(self):
            with open(file_path, 'r') as f:
                for line in f:
                    if line.strip():  # skip blank lines
                        yield json.loads(line)
    
    clusterer = LargeScaleClusterer(
        method="kmeans",
        chunk_size=5000,
        use_faiss=True
    )
    
    clusters = clusterer.cluster_large_dataset(
        JsonlSource(), 
        vector_field="embedding",
        sample_ratio=0.05  # Use 5% for initial clustering
    )
    
    # Save results
    with open(output_path, 'w') as f:
        json.dump(clusters, f, indent=2)
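
The `_sample_data` method above stops as soon as 10,000 items have been accepted, so on a long stream the sample comes only from the beginning. Reservoir sampling (Algorithm R) is a standard alternative that draws a fixed-size uniform sample from the whole stream in one pass. The sketch below is a swapped-in technique for comparison, not part of semantic-clustify; `reservoir_sample` is a hypothetical helper.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Uniform random sample of up to k items from a stream, in one pass."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for index, item in enumerate(items):
        if index < k:
            reservoir.append(item)  # fill the reservoir with the first k items
        else:
            slot = rng.randint(0, index)  # inclusive on both ends
            if slot < k:
                reservoir[slot] = item  # replace with probability k / (index + 1)
    return reservoir


sample = reservoir_sample(range(1000), k=5, seed=42)
print(sample)
```

Memory use is bounded by `k` regardless of stream length, which fits the same constraint the class above is working under.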

Key Benefits of CLI-First + Library Workflow

  1. Fast Parameter Discovery: CLI for quick algorithm and parameter testing
  2. Quality Validation: Easy visualization of clustering quality with CLI output
  3. Improved Reproducibility: Validate parameters before library integration
  4. Optimized Performance: Choose best algorithm based on data characteristics
  5. Custom Configuration: Configure clustering per dataset or use case
  6. Seamless Transition: Move from CLI prototyping to library integration
  7. Production Ready: Robust error handling and scalability options

CLI-First Workflow Benefits

  • Exploration Phase: Use CLI with small datasets (100-1000 samples)
  • Parameter Tuning: Test different algorithms and find optimal parameters
  • Quality Assessment: Analyze silhouette scores and cluster distributions
  • Integration Phase: Switch to library with validated configuration
  • Production Phase: Scale up processing with optimized parameters

Next Steps

  • Try the workflow: python examples/clustering_workflow_guide.py
  • See API Reference for detailed method documentation
  • Check Examples for more CLI usage patterns
  • Review Configuration for performance tuning

API Reference

Python API Usage

from semantic_clustify import SemanticClusterer

# Create clusterer
clusterer = SemanticClusterer(
    method="kmeans",
    n_clusters="auto",
    min_cluster_size=2,
    max_clusters=20
)

# Process data with vectors
data = [
    {"title": "Doc 1", "embedding": [0.1, 0.2, 0.3]},
    {"title": "Doc 2", "embedding": [0.15, 0.25, 0.35]}
]

clusters = clusterer.fit_predict(data, vector_field="embedding")

# Get quality metrics
metrics = clusterer.get_quality_metrics()
print(f"Silhouette Score: {metrics['silhouette_score']}")

Available Clustering Methods

from semantic_clustify import SemanticClusterer

# List all available methods
methods = SemanticClusterer.list_methods()
print(methods)
# ['kmeans', 'dbscan', 'hierarchical', 'gmm']

# Get method-specific parameters
params = SemanticClusterer.get_method_params("kmeans")
print(params)

Core Classes

SemanticClusterer

from typing import Dict, List, Union

# Signature summary (method bodies omitted)
class SemanticClusterer:
    def __init__(self, method: str, n_clusters: Union[int, str] = "auto", 
                 min_cluster_size: int = 2, max_clusters: int = 20, **kwargs): ...
    
    def fit_predict(self, data: List[Dict], vector_field: str) -> List[List[Dict]]: ...
    def get_quality_metrics(self) -> Dict[str, float]: ...
    def predict_new_data(self, new_data: List[Dict], vector_field: str) -> List[int]: ...

ClusteringComparator

from typing import Dict, List

# Signatures only; method bodies are elided
class ClusteringComparator:
    def compare_methods(self, data: List[Dict], vector_field: str,
                        methods: List[str]) -> Dict[str, Dict]: ...
    def get_best_method(self, comparison_results: Dict) -> str: ...
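How `get_best_method` ranks the results is not documented here. One plausible reading, sketched under the assumption that each entry of the comparison dictionary carries a `silhouette_score` key (as the output of `get_quality_metrics` does), is to pick the method with the highest score. The helper `pick_best_method` and the sample numbers are illustrative only.

```python
from typing import Dict


def pick_best_method(results: Dict[str, Dict[str, float]],
                     metric: str = "silhouette_score") -> str:
    """Return the method whose metric value is highest (assumed result shape)."""
    scored = {name: m[metric] for name, m in results.items() if metric in m}
    if not scored:
        raise ValueError(f"no method reported {metric!r}")
    return max(scored, key=scored.get)


# Illustrative numbers, not real measurements
example = {
    "kmeans": {"silhouette_score": 0.45},
    "dbscan": {"silhouette_score": 0.38},
    "hierarchical": {"silhouette_score": 0.48},
}
print(pick_best_method(example))  # hierarchical
```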

⚙️ Configuration

Algorithm Parameters

KMeans

clusterer = SemanticClusterer(
    method="kmeans",
    n_clusters=5,          # Number of clusters
    init="k-means++",      # Initialization method
    max_iter=300,          # Maximum iterations
    random_state=42        # For reproducibility
)

DBSCAN

clusterer = SemanticClusterer(
    method="dbscan",
    eps=0.5,               # Neighborhood radius
    min_samples=5,         # Min. points in a neighborhood to form a core point
    metric="cosine"        # Distance metric
)
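With `metric="cosine"`, `eps` acts as a threshold on cosine distance (1 minus cosine similarity), which ranges from 0 to 2, assuming the setting is passed through to scikit-learn's DBSCAN unchanged. The sketch below shows how to check whether two vectors would count as neighbors at a given `eps`; the helper `cosine_distance` is hypothetical.

```python
import math
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity: 0 for same direction, 1 for orthogonal, 2 for opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


eps = 0.5
a, b = [0.1, 0.2, 0.3], [0.15, 0.25, 0.35]
print(cosine_distance(a, b) <= eps)  # True: the vectors point in nearly the same direction
```

Measuring a few known-similar and known-dissimilar document pairs this way is a quick way to choose a starting value for `eps`.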

Hierarchical

clusterer = SemanticClusterer(
    method="hierarchical",
    n_clusters=5,          # Number of clusters
    linkage="ward",        # Linkage criterion
    distance_threshold=None # Distance threshold (alternative to n_clusters)
)

Gaussian Mixture Model

clusterer = SemanticClusterer(
    method="gmm",
    n_components=5,        # Number of components
    covariance_type="full", # Covariance type
    max_iter=100           # Maximum iterations
)

Performance Settings

# For large datasets
clusterer = SemanticClusterer(
    method="kmeans",
    use_faiss=True,        # Enable Faiss optimization
    batch_size=10000,      # Batch processing size
    n_jobs=-1              # Use all CPU cores
)
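How `batch_size` is applied internally is not described here. The general idea of batch processing, though, is to walk the input in fixed-size chunks so that only one chunk is in memory at a time. A minimal sketch, with the hypothetical helper `iter_batches`:

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))  # pull the next chunk lazily
        if not batch:
            return
        yield batch
```

Because the helper accepts any iterable, it works on an open file handle as well as on a list, so a large JSONL file never has to be read in full.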

Output Formats

# Configure output format
clusterer = SemanticClusterer(
    method="kmeans",
    output_format="grouped",    # or "labeled"
    include_metrics=True,       # Include quality metrics
    include_centroids=True      # Include cluster centroids
)
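`fit_predict` is documented as returning `List[List[Dict]]`, which matches the "grouped" format: one inner list per cluster. The exact shape of the "labeled" format is not shown; assuming it means a single flat list in which each record carries its cluster ID, the conversion could look like the sketch below. The key name `cluster_id` and the helper `grouped_to_labeled` are assumptions.

```python
from typing import Dict, List


def grouped_to_labeled(groups: List[List[Dict]], key: str = "cluster_id") -> List[Dict]:
    """Flatten grouped clusters into one list, tagging each record with its group index."""
    labeled: List[Dict] = []
    for cluster_id, docs in enumerate(groups):
        for doc in docs:
            labeled.append({**doc, key: cluster_id})  # copy so the input is not mutated
    return labeled


groups = [[{"title": "Doc 1"}, {"title": "Doc 2"}], [{"title": "Doc 3"}]]
print(grouped_to_labeled(groups))
```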

🔍 Performance Tips

  1. Algorithm Selection:

    • Use KMeans for well-separated, roughly spherical clusters
    • Use DBSCAN for arbitrary shapes and noise detection
    • Use Hierarchical for nested structures
    • Use GMM for overlapping clusters
  2. Large Dataset Optimization:

    • Enable Faiss for datasets >10k documents
    • Use sampling for initial parameter tuning
    • Process in batches to manage memory
  3. Vector Quality:

    • Ensure vectors are L2-normalized (scaled to unit length)
    • Use appropriate embedding dimensions (384-768 recommended)
    • Consider dimensionality reduction for very high-dimensional vectors
  4. Parameter Tuning:

    • Start with auto-detection for number of clusters
    • Use silhouette score for quality assessment
    • Validate with small samples before full processing
  5. Memory Management:

    • Use batch processing for large datasets
    • Consider streaming processing for very large files
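For the vector quality tip, normalization usually means scaling each vector to unit L2 length. For unit vectors, squared Euclidean distance equals 2 minus twice the cosine similarity, so Euclidean-based algorithms such as KMeans then group documents by direction rather than magnitude. A minimal sketch, with the hypothetical helper `l2_normalize`:

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)  # nothing to scale; the caller should treat this as invalid
    return [x / norm for x in vec]
```

Applying this to every record's embedding before clustering is cheap. Whether semantic-clustify already normalizes internally is not stated here, so checking the norm of a few input vectors first is a reasonable precaution.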

🐛 Troubleshooting

Common Issues

Import Error: Missing dependencies

pip install scikit-learn faiss-cpu numpy

Memory Error: Large dataset processing

# Use batch processing or sampling
semantic-clustify --input large_file.jsonl --batch-size 5000

Poor Clustering Quality: Low silhouette score

  • Try different algorithms (DBSCAN instead of KMeans)
  • Adjust parameters (eps, min_samples)
  • Check vector quality and normalization

Empty Clusters: No documents in some clusters

  • Reduce number of clusters
  • Increase min_cluster_size parameter
  • Check for duplicate or invalid vectors
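Checking for duplicate or invalid vectors can be done before clustering. The sketch below scans records and reports the indexes of problem entries. It assumes the vectors live under an `embedding` field as in the examples above; the helper `find_vector_problems` and its category names are hypothetical.

```python
import math
from typing import Dict, List


def find_vector_problems(records: List[Dict], field: str = "embedding") -> Dict[str, List[int]]:
    """Group record indexes by problem: missing, non-numeric/non-finite, wrong dimension, zero, duplicate."""
    problems: Dict[str, List[int]] = {
        "missing": [], "non_finite": [], "wrong_dim": [], "zero": [], "duplicate": [],
    }
    expected_dim = None
    seen = set()
    for i, rec in enumerate(records):
        vec = rec.get(field)
        if not isinstance(vec, list) or not vec:
            problems["missing"].append(i)
            continue
        if not all(isinstance(x, (int, float)) and math.isfinite(x) for x in vec):
            problems["non_finite"].append(i)  # NaN, inf, or non-numeric entries
            continue
        if expected_dim is None:
            expected_dim = len(vec)  # the first valid vector sets the expected dimension
        if len(vec) != expected_dim:
            problems["wrong_dim"].append(i)
            continue
        if not any(vec):
            problems["zero"].append(i)  # all components are zero
        key = tuple(vec)
        if key in seen:
            problems["duplicate"].append(i)
        seen.add(key)
    return problems
```

Many exact duplicates can collapse several requested clusters onto the same point, which is one way clusters end up empty.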

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

Development Setup

# Automated setup (recommended)
git clone https://github.com/changyy/py-semantic-clustify.git
cd py-semantic-clustify
./setup.sh

# Manual setup
git clone https://github.com/changyy/py-semantic-clustify.git
cd py-semantic-clustify
python3 -m venv venv
source venv/bin/activate
pip install -e ".[dev]"

Running Tests

Using the test runner (recommended):

# Quick validation (fastest)
python test_runner.py --quick

# Core functionality tests
python test_runner.py --core

# Algorithm-specific tests
python test_runner.py --algorithms

# Performance tests
python test_runner.py --performance

# All tests
python test_runner.py --all

# Tests with coverage report
python test_runner.py --coverage

Direct pytest commands:

# Quick smoke tests
pytest -m "quick or smoke" -v

# Core clustering functionality
pytest -m "core" -v

# Algorithm-specific tests
pytest -m "kmeans" -v
pytest -m "dbscan" -v
pytest -m "hierarchical" -v

# Integration tests
pytest -m "integration" -v

# All tests
pytest -v

# With coverage
pytest --cov=semantic_clustify --cov-report=html -v

Development Tools

The project includes a convenient development script and tools organized in the tools/ directory:

# Quick development commands (using dev.py script)
python dev.py install      # Install in development mode
python dev.py test         # Run all tests
python dev.py test-quick   # Run quick smoke tests
python dev.py test-coverage # Run tests with coverage
python dev.py typecheck    # Run mypy type checking
python dev.py lint         # Run flake8 linting
python dev.py format       # Format code with black
python dev.py clean        # Clean build artifacts
python dev.py build        # Build distribution packages
python dev.py demo         # Run comprehensive demo

# Traditional pytest commands
pytest -v                  # All tests
pytest -m "quick or smoke" # Quick tests only
pytest --cov=semantic_clustify --cov-report=html -v  # With coverage

# Tools directory contains:
# - tools/demo_comprehensive.py     # Comprehensive feature demonstration
# - tools/test_runner.py           # Advanced test runner with options
# - tools/setup.sh                 # Development environment setup
# - tools/README.md                # Detailed tools documentation

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

📊 Benchmarks

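| Algorithm    | Dataset Size | Time (seconds) | Memory (MB) | Silhouette Score |
|--------------|--------------|----------------|-------------|------------------|
| KMeans       | 1,000 docs   | 0.5            | 50          | 0.45             |
| KMeans       | 10,000 docs  | 3.2            | 200         | 0.42             |
| DBSCAN       | 1,000 docs   | 0.8            | 60          | 0.38             |
| DBSCAN       | 10,000 docs  | 8.5            | 250         | 0.35             |
| Hierarchical | 1,000 docs   | 1.2            | 80          | 0.48             |
| GMM          | 1,000 docs   | 2.1            | 90          | 0.41             |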

Benchmarks run on an Intel i7 with 16 GB RAM, using 768-dimensional vectors.

🎯 Integration with text-vectorify

semantic-clustify is designed to work seamlessly with text-vectorify:

# Step 1: Generate embeddings
text-vectorify \
  --input articles.jsonl \
  --input-field-main "title" \
  --input-field-subtitle "content" \
  --process-method "BGEEmbedder" \
  --output vectorized_articles.jsonl

# Step 2: Cluster documents  
semantic-clustify \
  --input vectorized_articles.jsonl \
  --embedding-field "vector" \
  --method "kmeans" \
  --n-clusters auto \
  --output clustered_articles.jsonl
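Note that step 2 passes `--embedding-field "vector"` rather than `"embedding"`: the value has to match whatever field the vectorizing step wrote. When the field name is uncertain, a quick look at one record settles it. The sketch below lists fields that hold a non-empty list of numbers; the helper `candidate_embedding_fields` and the sample record are illustrative, and the actual output schema of text-vectorify is not confirmed here.

```python
import json
from typing import List


def candidate_embedding_fields(jsonl_line: str) -> List[str]:
    """List fields of one JSONL record whose value is a non-empty list of numbers."""
    record = json.loads(jsonl_line)
    return [
        name for name, value in record.items()
        if isinstance(value, list) and value
        and all(isinstance(x, (int, float)) and not isinstance(x, bool) for x in value)
    ]


# Hypothetical record, for illustration only
line = '{"title": "Doc 1", "vector": [0.1, 0.2, 0.3], "tags": ["a"]}'
print(candidate_embedding_fields(line))  # ['vector']
```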

Made with ❤️ for the semantic analysis community

