Topic modeling toolkit for messy text data

Meno: Topic Modeling Toolkit

Installation

Basic Installation

Install the basic package with core dependencies:

pip install meno

CPU-Optimized Installation (Recommended)

Install with embeddings for CPU-only operation (recommended for most users):

pip install "meno[embeddings]"

For a truly CPU-only version with no NVIDIA packages:

pip install "meno[embeddings]" -f https://download.pytorch.org/whl/torch_stable.html

Installation with Optional Components

# Extras are quoted so that shells such as zsh do not expand the brackets

# For additional topic modeling approaches (BERTopic, Top2Vec)
pip install "meno[additional_models]"

# For embeddings with GPU acceleration (only if needed)
pip install "meno[embeddings-gpu]"

# For LDA topic modeling
pip install "meno[lda]"

# For visualization capabilities
pip install "meno[viz]"

# For NLP processing capabilities
pip install "meno[nlp]"

# For large dataset optimization using Polars
pip install "meno[optimization]"

# For developers
pip install "meno[dev,test]"

# For all features (full installation, CPU only)
pip install "meno[full]"

# For all features with GPU acceleration
pip install "meno[full-gpu]"

Development Installation

For development work, clone the repository and install in editable mode:

git clone https://github.com/srepho/meno.git
cd meno
pip install -e ".[dev,test]"

Quick Start

from meno import MenoTopicModeler
import pandas as pd

# Load your data
data = pd.DataFrame({
    "text": [
        "Customer's vehicle was damaged in a parking lot by a shopping cart.",
        "Claimant's home flooded due to heavy rain. Water damage to first floor.",
        "Vehicle collided with another car at an intersection. Front-end damage.",
        "Tree fell on roof during storm causing damage to shingles and gutters.",
        "Insured slipped on ice in parking lot and broke wrist requiring treatment."
    ]
})

# Initialize topic modeler
modeler = MenoTopicModeler()

# Preprocess documents
processed_docs = modeler.preprocess(data, text_column="text")

# Generate embeddings
embeddings = modeler.embed_documents()

# Discover topics
topics_df = modeler.discover_topics(method="embedding_cluster", num_topics=3)

# Visualize results
fig = modeler.visualize_embeddings()
fig.show()

# Generate HTML report
report_path = modeler.generate_report(output_path="topics_report.html")

Overview

Meno streamlines topic modeling on free text, with a special focus on messy datasets such as insurance claims notes and customer correspondence. It combines classical methods like Latent Dirichlet Allocation (LDA) with modern techniques that use large language models (LLMs) via Hugging Face, dimensionality reduction with UMAP, and interactive visualizations. It is built primarily for Jupyter environments but is flexible enough for scripts and other settings.

Key Features

Unsupervised Topic Modeling:
- Automatically discover topics when no pre-existing topics are available using LDA and LLM-based embedding and clustering techniques.
Supervised Topic Matching:
- Match free text against a user-provided list of topics using semantic similarity and classification techniques.
Advanced Visualization:
- Create interactive and static visualizations including topic distributions, embeddings (UMAP projections), cluster analyses, topic coherence views, and per-topic word clouds.
Interactive HTML Reports:
- Generate standalone, interactive HTML reports to present topic analysis to less technical stakeholders, with options for customization and data export.
Robust Data Preprocessing:
- Tackle messy data challenges (misspellings, unknown acronyms) with integrated cleaning built on spaCy and fuzzy matching: context-aware spelling correction plus customizable stop words and lemmatization rules.
Active Learning with Cleanlab:
- Incorporate active learning loops and fine-tuning of labels using Cleanlab, facilitating hand-labeling and iterative improvements, with multiple sampling strategies (e.g., uncertainty sampling).
Flexible Deployment Options:
- CPU-first design with optional GPU acceleration through separate installation options.
- Load models from local files for use in environments without internet access or behind firewalls.
Extensibility & Ease of Use:
- Designed with modularity in mind so that users can plug in new cleaning, modeling, or visualization techniques without deep customization while still maintaining a simple interface.
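
The uncertainty sampling mentioned under Active Learning can be illustrated with a small, library-free sketch. This is not Meno's or Cleanlab's implementation; `margin` and `select_for_labeling` are hypothetical helper names, and the probabilities are made up for illustration. The idea is to send the documents with the smallest gap between their two most likely topics to a human labeler first.

```python
def margin(probabilities):
    """Gap between the two highest class probabilities; a small gap means uncertain."""
    top_two = sorted(probabilities, reverse=True)[:2]
    return top_two[0] - top_two[1]


def select_for_labeling(predictions, batch_size=2):
    """Return indices of the documents whose predictions have the smallest margin."""
    ranked = sorted(range(len(predictions)), key=lambda i: margin(predictions[i]))
    return ranked[:batch_size]


# Hypothetical per-document topic probabilities (each row sums to 1).
predictions = [
    [0.90, 0.05, 0.05],  # confident
    [0.40, 0.35, 0.25],  # uncertain
    [0.34, 0.33, 0.33],  # very uncertain
    [0.70, 0.20, 0.10],  # fairly confident
]
selected = select_for_labeling(predictions)
print(selected)
```

Margin sampling is only one of several possible strategies; entropy or least-confidence scores could be swapped into `margin` without changing the selection loop.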

Example Usage

Basic Topic Discovery

from meno import MenoTopicModeler
import pandas as pd

# Initialize modeler
modeler = MenoTopicModeler()

# Load and preprocess data
df = pd.read_csv("my_documents.csv")
processed_docs = modeler.preprocess(df, text_column="document_text")

# Generate embeddings (as in the Quick Start)
embeddings = modeler.embed_documents()

# Discover topics
topics_df = modeler.discover_topics(method="embedding_cluster", num_topics=10)

# Visualize results
fig = modeler.visualize_embeddings()
fig.show()

Matching Documents to Predefined Topics

# Define topics and descriptions
predefined_topics = [
    "Vehicle Damage",
    "Water Damage",
    "Personal Injury",
    "Property Damage"
]

topic_descriptions = [
    "Damage to vehicles from collisions, parking incidents, or natural events",
    "Damage from water including floods, leaks, and burst pipes",
    "Injuries to people including slips, falls, and accidents",
    "Damage to property from fire, storms, or other causes"
]

# Match documents to topics
matched_df = modeler.match_topics(
    topics=predefined_topics,
    descriptions=topic_descriptions,
    threshold=0.5
)

# View the topic assignments
print(matched_df[["text", "topic", "topic_probability"]].head())
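
To make the role of `threshold` concrete, here is a minimal sketch of similarity-based matching. It is not Meno's implementation: word counts stand in for sentence embeddings, and `bow_vector`, `cosine`, and `match_topic` are hypothetical helper names. A document is assigned the topic whose description it most resembles, unless the best similarity falls below the threshold.

```python
import math
import re
from collections import Counter


def bow_vector(text):
    """Lowercase the text and count word occurrences (a stand-in for an embedding)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_topic(doc, topics, descriptions, threshold=0.5):
    """Return (topic, score) for the best match, or ("Unknown", score) below threshold."""
    doc_vec = bow_vector(doc)
    scores = [cosine(doc_vec, bow_vector(d)) for d in descriptions]
    best = max(range(len(topics)), key=scores.__getitem__)
    if scores[best] < threshold:
        return "Unknown", scores[best]
    return topics[best], scores[best]


topics = ["Vehicle Damage", "Water Damage"]
descriptions = [
    "Damage to vehicles from collisions, parking incidents, or natural events",
    "Damage from water including floods, leaks, and burst pipes",
]
topic, score = match_topic("burst pipes and water leaks", topics, descriptions, threshold=0.3)
print(topic, round(score, 2))
```

Raising the threshold trades coverage for precision: more documents fall into the unmatched bucket, but the remaining assignments are closer to their topic descriptions.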

Generating Reports

# Generate an interactive HTML report
report_path = modeler.generate_report(
    output_path="topic_analysis.html",
    include_interactive=True,
    title="Document Topic Analysis"
)

Documentation

For detailed usage information, see the full documentation.

Examples

The package includes several example notebooks and scripts:

examples/basic_workflow.ipynb: Basic topic modeling workflow in a Jupyter notebook
examples/cpu_only_example.py: Demonstrates CPU-optimized topic modeling
examples/insurance_topic_modeling.py: Topic modeling on insurance complaint dataset
examples/minimal_sample.py: Simple script to generate visualizations
examples/sample_reports/: Directory with pre-generated sample visualizations

Insurance Complaint Analysis

The package includes an example that demonstrates topic modeling on the Australian Insurance PII Dataset from Hugging Face. This dataset contains over 1,500 insurance complaint letters with various types of insurance issues.

To run the insurance example:

# Install required dependencies
pip install -r requirements_insurance_example.txt

# Run the example script
python examples/insurance_topic_modeling.py

The results will be saved in the output directory.

Architecture & Design

The package follows a modular design with clear separation of concerns:

Data Preprocessing Module:

Spelling correction using thefuzz
Acronym resolution
Text normalization (lowercasing, punctuation removal, stemming/lemmatization)
Customizable stop words and lemmatization
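
The preprocessing steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the package's code: `difflib` stands in for thefuzz, the stop word list, vocabulary, and acronym map are tiny made-up examples, and `normalize`, `correct`, and `clean` are hypothetical names. Stemming and lemmatization are left out.

```python
import difflib
import string

STOP_WORDS = {"the", "a", "to", "in", "on", "and", "was", "by"}  # illustrative only
VOCABULARY = ["vehicle", "damage", "parking", "claimant", "flooded"]  # illustrative only
ACRONYMS = {"pol": "policy", "veh": "vehicle"}  # hypothetical acronym map


def normalize(text):
    """Lowercase, strip punctuation, and split into tokens."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return text.split()


def correct(token, cutoff=0.8):
    """Replace a token with its closest vocabulary word if one is similar enough."""
    matches = difflib.get_close_matches(token, VOCABULARY, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def clean(text):
    """Normalize, expand acronyms, fix near-miss spellings, then drop stop words."""
    tokens = normalize(text)
    tokens = [ACRONYMS.get(t, t) for t in tokens]
    tokens = [correct(t) for t in tokens]
    return [t for t in tokens if t not in STOP_WORDS]


cleaned = clean("The veh was damged in the Parkng lot.")
print(cleaned)
```

The `cutoff` value controls how aggressive the correction is; a lower cutoff fixes more typos but also risks rewriting valid words that merely resemble a vocabulary entry.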

Topic Modeling Module:

Unsupervised modeling with LDA or LLM-based embeddings + clustering
Supervised topic matching using semantic similarity
CPU-first design with optional GPU acceleration
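
As a rough picture of the "embeddings + clustering" approach, the sketch below groups toy 2-D vectors with plain k-means. It is an assumption-laden illustration: the package lists hdbscan among its clustering dependencies and may cluster differently, real embeddings have hundreds of dimensions, and `kmeans` is a hypothetical helper written for this example.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # assignments have stabilized
        centroids = new_centroids
    return labels, centroids


# Toy 2-D "embeddings" forming three visibly separate groups.
embeddings = [(0.1, 0.2), (0.0, 0.1), (0.9, 1.0), (1.0, 0.8), (0.5, 5.0), (0.4, 5.2)]
initial = [embeddings[0], embeddings[2], embeddings[4]]
labels, centroids = kmeans(embeddings, initial)
print(labels)
```

Each cluster label then plays the role of a discovered topic; describing the topic (for example by its most frequent words) is a separate step not shown here.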

Visualization Module:

Static plots (topic distributions)
Interactive embedding plots with UMAP projections
Topic coherence visualizations

Report Generation Module:

Interactive HTML reports using Plotly and Jinja2
Customizable appearance and content
Data export options
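
The idea of a standalone HTML report can be sketched with the standard library. This is not the package's report generator: `string.Template` stands in for Jinja2, there is no Plotly chart, and `render_report` and the topic counts are hypothetical. The point is only that a template plus escaped data yields a single self-contained file.

```python
import html
from string import Template

REPORT_TEMPLATE = Template(
    "<html><head><title>$title</title></head>"
    "<body><h1>$title</h1><table>"
    "<tr><th>Topic</th><th>Documents</th></tr>$rows"
    "</table></body></html>"
)


def render_report(title, topic_counts):
    """Render topic counts as an HTML page, escaping any user-supplied text."""
    rows = "".join(
        f"<tr><td>{html.escape(topic)}</td><td>{count}</td></tr>"
        for topic, count in sorted(topic_counts.items(), key=lambda kv: -kv[1])
    )
    return REPORT_TEMPLATE.substitute(title=html.escape(title), rows=rows)


# Hypothetical topic counts, largest topic listed first in the output.
page = render_report("Document Topic Analysis", {"Water Damage": 12, "Fire & Storm": 7})
print(len(page))
```

Writing `page` to a file with `open(path, "w", encoding="utf-8")` would produce a report that opens in any browser without a server.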

Dependencies & Requirements

Python: 3.8, 3.9, 3.10, 3.11, 3.12 (primary target: 3.10)
Core Libraries (always installed):
- Data Processing: pandas, pyarrow
- Machine Learning: scikit-learn
- Text Processing: thefuzz
- Configuration: pydantic, PyYAML, jinja2
Optional Libraries (install based on needs):
- Topic Modeling: gensim (for LDA)
- Additional Topic Models: bertopic, top2vec
- Embeddings (CPU): transformers, sentence-transformers, torch
- Embeddings (GPU): the CPU packages plus accelerate, bitsandbytes
- Dimensionality Reduction: umap-learn
- Clustering: hdbscan
- Data Cleaning & NLP: spaCy
- Visualization: plotly
- Active Learning: cleanlab
- Large Dataset Optimization: polars (for streaming and memory efficiency)
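
Because most of these libraries are optional, it can help to check which ones are importable before running a workflow. The sketch below uses `importlib.util.find_spec` from the standard library. The group labels are only descriptive names chosen for this example (they are not guaranteed to match the package's extras), and `is_installed` and `report_optional_dependencies` are hypothetical helpers.

```python
import importlib.util

# Import names for the optional libraries listed above.
# Note that umap-learn is imported as "umap" and spaCy as "spacy".
OPTIONAL_MODULES = {
    "embeddings": ["sentence_transformers", "transformers", "torch"],
    "lda": ["gensim"],
    "additional models": ["bertopic", "top2vec"],
    "reduction and clustering": ["umap", "hdbscan"],
    "nlp": ["spacy"],
    "visualization": ["plotly"],
    "active learning": ["cleanlab"],
    "large datasets": ["polars"],
}


def is_installed(module_name):
    """Return True if the top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def report_optional_dependencies():
    """Map each group to a {module: installed?} dictionary."""
    return {
        group: {name: is_installed(name) for name in names}
        for group, names in OPTIONAL_MODULES.items()
    }


for group, status in report_optional_dependencies().items():
    missing = [name for name, found in status.items() if not found]
    print(f"{group}: {'ok' if not missing else 'missing ' + ', '.join(missing)}")
```

`find_spec` only locates a module; it does not import it, so the check stays fast even when heavy packages such as torch are present.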

Testing & Contribution

Running Tests

# Run basic tests
python -m pytest -xvs tests/

# Run full tests including embedding model tests
python -m pytest -xvs tests/ --run-functional

# Run with coverage reporting
python -m pytest --cov=meno

Contribution Guidelines

Contributions are welcome! Please see CONTRIBUTING.md for details.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Release history

Version	Release date
1.3.5	Mar 24, 2025
1.3.4	Mar 21, 2025
1.3.3	Mar 21, 2025
1.3.2	Mar 19, 2025
1.3.1	Mar 17, 2025
1.3.0	Mar 17, 2025
1.2.10	Mar 17, 2025
1.2.9	Mar 17, 2025
1.2.8	Mar 14, 2025
1.2.7	Mar 13, 2025
1.2.6	Mar 13, 2025
1.2.5	Mar 13, 2025
1.2.4	Mar 12, 2025
1.2.2	Mar 11, 2025
1.2.1	Mar 11, 2025
1.2.0	Mar 11, 2025
1.1.2	Mar 11, 2025
1.1.1	Mar 7, 2025
1.1.0	Mar 7, 2025
1.0.3	Mar 7, 2025
1.0.2	Mar 7, 2025
1.0.1	Mar 7, 2025
1.0.0	Mar 7, 2025
0.9.1	Mar 6, 2025
0.9.0	Mar 6, 2025
0.8.0	Mar 6, 2025
0.7.0	Mar 6, 2025
0.6.0	Mar 6, 2025
0.5.0	Mar 6, 2025 (this version)
0.4.1	Mar 6, 2025
0.4.0	Mar 6, 2025
0.3.0	Mar 6, 2025
0.2.0	Mar 6, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

meno-0.5.0.tar.gz (5.4 MB)

Uploaded Mar 6, 2025 Source

Built Distribution

meno-0.5.0-py3-none-any.whl (55.3 kB)

Uploaded Mar 6, 2025 Python 3

File details

Details for the file meno-0.5.0.tar.gz.

File metadata

Download URL: meno-0.5.0.tar.gz
Upload date: Mar 6, 2025
Size: 5.4 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for meno-0.5.0.tar.gz
Algorithm	Hash digest
SHA256	`637f42c933e206bac885c64e504bbdad322657a6bceb218342256c8d4a8219a7`
MD5	`b93b81588c17416450741f988cede420`
BLAKE2b-256	`76ce7141583f27d5716f105af4107788cf9259cac034186a73e875aa1ecb178a`

See more details on using hashes here.
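
A downloaded file can be checked against the SHA256 digest listed above with Python's `hashlib`. This is a minimal sketch; `sha256_of_file` and `verify` are hypothetical helper names, and the file path in the usage comment assumes the archive sits in the current directory.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Hash a file in chunks so large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Compare the file's digest with the published one, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Usage (assumes the source distribution was downloaded to the current directory):
# verify("meno-0.5.0.tar.gz",
#        "637f42c933e206bac885c64e504bbdad322657a6bceb218342256c8d4a8219a7")
```

pip can perform the same check automatically when requirements are pinned with `--hash` entries, which is usually preferable for repeatable installs.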

File details

Details for the file meno-0.5.0-py3-none-any.whl.

File metadata

Download URL: meno-0.5.0-py3-none-any.whl
Upload date: Mar 6, 2025
Size: 55.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for meno-0.5.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`cfca7659f4fe475a2971b0f0ee4e9fd73673c452489290f87b493cd41940281e`
MD5	`d4039319f6b92bda464075443c4e59b7`
BLAKE2b-256	`9ad7778d36104317985e3c67ed53cc3903d19e3569efd04dd83fd93c77aaa45d`

See more details on using hashes here.
