Skip to main content

Interactive topic modeling framework for social science research

Project description

Interactive Topic Model (ITM)

An interactive topic modeling framework designed for social science research, where interpretability, iterative refinement, and human-in-the-loop validation are made accessible via a user-friendly interface.

Features

  • Single-line functions: Perform complex operations with minimal coding skills required
  • Interactive visualizations: Explore text data quickly and intuitively
  • Interactive topic refinement: Split topics, merge topics, manually reassign documents
  • Hierarchical structures: Create topic hierarchies via splitting or grouping operations with support for multiple embedding spaces
  • Semantic search: Find documents with similar meanings without relying on them using the same words
  • Preview-then-commit workflow: Preview topic splits before committing changes
  • Full undo/redo support: Track and reverse all modeling operations
  • Document validation: Mark high-confidence assignments and optionally freeze them during refitting
  • Representative documents: Automatically identifies most representative documents for each topic
  • Rich topic metadata: Auto-generated labels, top terms via c-TF-IDF, centroid embeddings
  • Flexible scoring: Multiple modes for document-topic similarity (embedding, TF-IDF, harmonic mean)

Quick Start

import pandas as pd
from interactive_topic_model import InteractiveTopicModel

# Prepare your documents
texts = [
    "Machine learning is a subset of artificial intelligence...",
    "Deep neural networks have revolutionized computer vision...",
    # ... more documents
]

# Initialize model
itm = InteractiveTopicModel(texts)

# Fit initial model
itm.fit()

# View topic summary
print(itm.get_topic_info())

# Access individual topics
topic = itm.topics[0]
print(f"Topic {topic.topic_id}: {topic.label}")
print(f"Top terms: {topic.get_top_terms(n=5)}")
print(topic.get_examples(n=5))
print(topic.get_representative_docs(n=5))

# Interactive refinement
preview = topic.split()
print(preview)
topic.commit_split()

# Merge topics
itm.merge_topics([1, 2], into="new", new_label="Combined Topic")

# Reassign a document
itm.assign_doc(doc_id=42, topic_id=3, validated=True)

# Visualize
fig = visualize_documents(itm)
fig.show()

# Undo if needed
itm.undo()

# Check undo history
print(itm.get_history())

Architecture

Key Classes

InteractiveTopicModel

The main engine that orchestrates topic modeling operations:

  • Manages document assignments, strengths, and validation status
  • Coordinates semantic spaces via BasicTopicModel instances
  • Provides operations: fit, assign, split, merge, group, archive
  • Tracks undo/redo history

InteractiveTopic

Represents a single topic with:

  • Access to documents: get_doc_ids(), get_texts(), get_examples()
  • Representations: get_embedding(), get_ctfidf(), get_top_terms()
  • Representative documents: get_representative_doc_ids(), get_representative_docs()
  • Operations: split(), commit_split()
  • Metadata: label, parent, active status

BasicTopicModel

Manages a semantic space:

  • Embeds documents via embedder (e.g., sentence-transformers)
  • Reduces dimensions via reducer (e.g., UMAP)
  • Clusters via clusterer (e.g., HDBSCAN)
  • Caches embeddings and reduced representations
  • Can have its own vocabulary (for sub-splits with custom vectorizers)

Topic Representations

Topics are characterized by multiple representations:

  1. Representative documents: Documents with highest cluster membership strength
  2. c-TF-IDF vectors: Class-based TF-IDF distinguishing each topic from siblings
  3. Centroid embeddings: Mean embedding of representative documents
  4. Top terms: Highest-weighted terms from c-TF-IDF
  5. Auto-generated labels: Based on top terms

All representations are computed lazily and invalidated when assignments change.

Component Customization

ITM provides default components but allows full customization:

from interactive_topic_model import (
    InteractiveTopicModel,
    SentenceTransformerEmbedder,
    UMAPReducer,
    HDBSCANClusterer,
    custom_vectorizer,
    harmonic_scorer,
)

# Custom components
itm = InteractiveTopicModel(
    texts=texts,
    embedder=SentenceTransformerEmbedder("all-mpnet-base-v2"),
    reducer=UMAPReducer(n_components=5, n_neighbors=15),
    clusterer=HDBSCANClusterer(min_cluster_size=10),
    vectorizer=custom_vectorizer(
        ngram_range=(1, 3),
        min_df=3,
        max_df=0.85,
        use_en_stopwords=True,
    ),
    scorer=harmonic_scorer,
    n_representative_docs=5,
)

Available Components

Embedders:

  • SentenceTransformerEmbedder - sentence-transformers models
  • e5_base_embedder() - Convenience function for E5-base-v2

Reducers:

  • UMAPReducer - UMAP dimensionality reduction

Clusterers:

  • HDBSCANClusterer - Hierarchical DBSCAN clustering

Scorers:

  • embedding_scorer - Cosine similarity on embeddings
  • tfidf_scorer - Cosine similarity on TF-IDF vectors
  • harmonic_scorer - Harmonic mean of both (default)

Vectorizers:

  • default_vectorizer() - Balanced defaults for general use
  • custom_vectorizer() - Customizable n-grams, stopwords, etc.

Workflow Patterns

Refer to vignette for more details.

Iterative Refinement

# 1. Fit initial model
itm.fit()

# 2. Explore topics
df = itm.get_topic_info()

# 3. Split heterogeneous topics
topic = itm.topics[5]
preview = topic.split()
# ... inspect preview ...
new_topic_ids = topic.commit_split()

# 4. Merge similar topics
itm.merge_topics([7, 8, 9], into="new", new_label="Sports")

# 5. Archive noise topics
itm.archive_topic(topic_id=12)

# 6. Validate good assignments
for doc_id in good_docs:
    itm.validate_doc(doc_id)

Preview-Commit Pattern

# Split returns a preview
topic = itm.topics[0]
preview = topic.split(
    clusterer=HDBSCANClusterer(min_cluster_size=5)
)

# Inspect before committing
print(f"Will create {len(preview.new_topic_ids)} new topics")

# Commit when satisfied
new_ids = topic.commit_split()

# Or undo if not satisfied
itm.undo()

Validation and Refitting

# Suggest assignments for unvalidated docs
itm.suggest_assignment(doc_id=42, mode="harmonic", threshold=0.5)

# Validate assignments
itm.validate_doc(doc_id=42)

# Auto-refit outliers
itm.refit(
    target=[itm.OUTLIER_ID],
    mode="harmonic",
    threshold=0.3,
    auto_reassign=True,
    validate_refits=True,
)

Visualization

from interactive_topic_model import visualize_documents, visualize_topic_hierarchy

# 2D scatter plot of documents colored by topic
fig = visualize_documents(itm, use_reduced=True)
fig.show()

# Topic hierarchy dendrogram
fig = visualize_topic_hierarchy(
    itm,
    similarity="harmonic",  # or "embedding", "tfidf"
    include_inactive=False,
)
fig.show()

API Reference

InteractiveTopicModel Methods

Fitting:

  • fit() - Fit initial topic model

Assignment Operations:

  • assign_doc(doc_id, topic_id, strength, validated) - Assign document to topic
  • validate_doc(doc_id) - Mark document as validated
  • suggest_assignment(doc_id, mode, threshold) - Suggest best topic

Topic Operations:

  • merge_topics(topic_ids, into, new_label, archive_sources, validate_refits) - Merge topics
  • archive_topic(topic_id) - Deactivate a topic
  • group_topics(topic_ids, label) - Create hierarchical grouping

Information Retrieval:

  • get_topic_info(top_words_n, show_inactive, show_outliers) - Summary DataFrame
  • get_representative_documents(topic_ids, n) - Representative docs per topic
  • get_examples(topics, n, include_descendants, random_state) - Sample documents
  • get_active_topics() - List of active topic IDs
  • get_outlier_count() - Count of outlier documents
  • get_validated_count() - Count of validated documents

Undo/Redo:

  • undo(steps) - Undo operations
  • redo(steps) - Redo operations
  • can_undo() - Check if undo is possible
  • can_redo() - Check if redo is possible
  • get_history(stack, reverse, include_timestamp) - View edit history

InteractiveTopic Methods

Document Access:

  • get_count(include_descendants) - Number of documents
  • get_doc_ids(include_descendants) - List of document IDs
  • get_texts(include_descendants) - List of document texts
  • get_examples(n, include_descendants, random_state) - Sample documents
  • get_representative_doc_ids() - IDs of representative documents
  • get_representative_docs() - Texts of representative documents

Representations:

  • get_embedding() - Centroid embedding
  • get_ctfidf() - c-TF-IDF vector
  • get_top_terms(n) - Top n terms with scores
  • get_auto_label(n) - Generate label from top terms

Operations:

  • split(clusterer, ...) - Preview topic split
  • commit_split() - Commit the split preview

Properties:

  • label - Topic label (get/set)
  • parent - Parent topic or ITM
  • active - Whether topic is active
  • semantic_space - Associated BasicTopicModel

To-do

  • Update vignette to demonstrate more features and navigation
  • Add serialization features
  • Add advanced semantic features (idea training, etc.)
  • Add assignment export/import features

License

MIT License

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

interactive_topic_model-0.1.0.tar.gz (35.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

interactive_topic_model-0.1.0-py3-none-any.whl (32.1 kB view details)

Uploaded Python 3

File details

Details for the file interactive_topic_model-0.1.0.tar.gz.

File metadata

  • Download URL: interactive_topic_model-0.1.0.tar.gz
  • Upload date:
  • Size: 35.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.19

File hashes

Hashes for interactive_topic_model-0.1.0.tar.gz
Algorithm Hash digest
SHA256 d5385b8f884fdd2f13c3badc76f01f92753c0c55f192b6aaeb220e795dd5319b
MD5 4bc795366b145051b7d130c24ffdad97
BLAKE2b-256 90b02327df47ff18a6f9f39266ab3075c52dada52bfc4297f575f61c3b971b7b

See more details on using hashes here.

File details

Details for the file interactive_topic_model-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for interactive_topic_model-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 bfe5fdeb76e4df3864b20c48bb2bbfd67345b8aaccbc0d7df01ca5aae8c650c6
MD5 006e0a7daf393739d0246a3591fbc37a
BLAKE2b-256 193a7d881d112c043d3742724bae336dff310aee2b6674ecbbfdb69d381e6248

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page