Biterm Topic Model with sklearn-compatible API

Biterm Topic Model

Bitermplus is a high-performance implementation of the Biterm Topic Model for short text analysis, originally developed by Xiaohui Yan, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. Built on a cythonized version of BTM, it features OpenMP parallelization and a modern scikit-learn compatible API for seamless integration into ML workflows.
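
For readers new to the model: a biterm is an unordered pair of words that co-occur in the same short text, and BTM learns topics from these pairs pooled over the whole corpus. The sketch below illustrates only that idea in plain Python. The helper name `extract_biterms` is hypothetical and not part of bitermplus, and treating the whole document as one context window is an assumption that suits short texts.

```python
from itertools import combinations


def extract_biterms(tokens):
    """Return all unordered word pairs from one short document (illustrative helper)."""
    # Sort each pair so that (a, b) and (b, a) count as the same biterm.
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


doc = "deep learning neural networks".split()
biterms = extract_biterms(doc)
print(len(biterms))  # 4 tokens give 4 * 3 / 2 = 6 biterms
print(biterms)
```

The number of biterms grows quadratically with document length, which is one reason the approach is aimed at short texts.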

Key Features

  • Scikit-learn Compatible API — Familiar fit(), transform(), and fit_transform() methods for easy adoption
  • ML Pipeline Integration — Seamless compatibility with sklearn workflows, cross-validation, and grid search
  • High-Performance Computing — Cythonized implementation with OpenMP parallel processing for speed
  • Advanced Inference Methods — Multiple approaches including sum of biterms, sum of words, and mixed inference
  • Comprehensive Model Evaluation — Built-in perplexity, semantic coherence, and entropy metrics
  • Intuitive Topic Interpretation — Simple extraction of topic keywords and document-topic assignments
  • Flexible Text Preprocessing — Customizable vectorization pipeline with sklearn CountVectorizer integration
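
To make the "sum of biterms" inference method from the list above more concrete, here is a minimal sketch of the formulation given in the original BTM paper: P(z|d) is the sum over the document's biterms of P(z|b) P(b|d), with P(z|b) proportional to P(z) P(w1|z) P(w2|z). It is written in plain Python with made-up toy parameters, and it is an assumption that bitermplus follows exactly this computation internally. The function name `infer_sum_of_biterms` is hypothetical.

```python
from itertools import combinations


def infer_sum_of_biterms(tokens, theta, phi):
    """Estimate P(z|d) by averaging P(z|b) over the document's biterms.

    theta[z] is P(z); phi[z][w] is P(w|z). Both are toy inputs here.
    """
    n_topics = len(theta)
    doc_biterms = list(combinations(tokens, 2))
    p_zd = [0.0] * n_topics
    for w1, w2 in doc_biterms:
        # P(z|b) is proportional to P(z) * P(w1|z) * P(w2|z).
        weights = [theta[z] * phi[z][w1] * phi[z][w2] for z in range(n_topics)]
        total = sum(weights)
        for z in range(n_topics):
            # Each biterm occurrence carries weight P(b|d) = 1 / len(doc_biterms).
            p_zd[z] += weights[z] / total / len(doc_biterms)
    return p_zd


# Toy parameters for 2 topics over a 3-word vocabulary (made up for illustration).
theta = [0.5, 0.5]
phi = [
    {"learning": 0.6, "neural": 0.3, "text": 0.1},
    {"learning": 0.1, "neural": 0.1, "text": 0.8},
]
p_zd = infer_sum_of_biterms(["learning", "neural", "text"], theta, phi)
print([round(p, 3) for p in p_zd])
```

The "sum of words" variant replaces biterms with single words, and the mixed variant combines both; they are not shown here.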

Donate

If you find this package useful, please consider donating any amount of money. This will help me spend more time on supporting open-source software.

Buy Me A Coffee

Requirements

  • Python ≥ 3.9
  • NumPy ≥ 1.19.0 — Numerical computing foundation
  • Pandas ≥ 1.2.0 — Data manipulation and analysis
  • SciPy ≥ 1.6.0 — Scientific computing library
  • scikit-learn ≥ 1.0.0 — Machine learning utilities and API compatibility
  • tqdm ≥ 4.50.0 — Progress bars for model training
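
If you want to check an existing environment against the list above before installing, a small standard-library script along these lines may help. It is only a sketch: the helper names `version_tuple` and `check_requirements` are made up for this example, and the version parsing is deliberately simplistic (it compares the first three numeric components and ignores suffixes such as `rc1`).

```python
import sys
from importlib import metadata
from itertools import takewhile

# Minimum versions copied from the requirements list above.
REQUIREMENTS = {
    "numpy": "1.19.0",
    "pandas": "1.2.0",
    "scipy": "1.6.0",
    "scikit-learn": "1.0.0",
    "tqdm": "4.50.0",
}


def version_tuple(version):
    """Turn '1.19.0' into (1, 19, 0), keeping only the leading digits of each part."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = "".join(takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    # Pad short versions such as '1.2' so tuples compare correctly.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def check_requirements(requirements=REQUIREMENTS):
    """Map each package name to 'ok', 'too old', or 'missing'."""
    report = {}
    for name, minimum in requirements.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = version_tuple(installed) >= version_tuple(minimum)
        report[name] = "ok" if ok else "too old"
    return report


print("python:", "ok" if sys.version_info >= (3, 9) else "too old")
for package, status in check_requirements().items():
    print(f"{package}: {status}")
```

In practice pip resolves these dependencies automatically, so the script is mainly useful for diagnosing a pre-existing environment.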

Installation

Standard Installation

Install the latest stable release from PyPI:

pip install bitermplus

Development Version

Install the latest development version directly from the repository:

pip install git+https://github.com/maximtrp/bitermplus.git

Platform-Specific Setup

Linux/Ubuntu: Ensure Python development headers are installed:

sudo apt-get install python3.x-dev  # where x is your Python minor version

Windows: No additional setup required with standard Python installations.

macOS: Install OpenMP support for parallel processing:

# Install Xcode Command Line Tools (Homebrew itself must already be installed)
xcode-select --install
# Install OpenMP library
brew install libomp
pip install bitermplus

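If you encounter OpenMP compilation errors, point the compiler and linker at libomp. The paths below use Homebrew's default prefix on Apple Silicon; on Intel Macs the prefix is /usr/local instead of /opt/homebrew: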

export LDFLAGS="-L/opt/homebrew/opt/libomp/lib"
export CPPFLAGS="-I/opt/homebrew/opt/libomp/include"
pip install bitermplus

Quick Start

Sklearn-style API (Recommended)

import bitermplus as btm

# Sample documents
texts = [
    "machine learning algorithms are powerful",
    "deep learning neural networks process data",
    "natural language processing understands text"
]

# Create and train model
model = btm.BTMClassifier(n_topics=2, random_state=42)
doc_topics = model.fit_transform(texts)

# Get topic keywords
topic_words = model.get_topic_words(n_words=5)
print("Topic 0:", topic_words[0])
print("Topic 1:", topic_words[1])

# Evaluate model
coherence_score = model.score(texts)
print(f"Coherence: {coherence_score:.3f}")
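
As background on what a semantic coherence score measures, the sketch below follows the widely used definition by Mimno et al. (2011): for the top words of a topic, it sums log((D(w_m, w_l) + 1) / D(w_l)) over word pairs, where D counts the documents containing the given words. This is a plain-Python illustration under that assumption, not a copy of what bitermplus computes, and the function name `topic_coherence` is hypothetical.

```python
import math


def topic_coherence(top_words, documents):
    """Semantic coherence of one topic from document co-occurrence counts.

    top_words is ordered from most to least probable; documents is a list of
    token lists. Every top word is assumed to occur in at least one document.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every given word.
        return sum(all(w in doc for w in words) for doc in doc_sets)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            co_occurrences = doc_freq(top_words[m], top_words[l])
            # The +1 keeps the logarithm finite when the pair never co-occurs.
            score += math.log((co_occurrences + 1) / doc_freq(top_words[l]))
    return score


docs = [["machine", "learning"], ["deep", "learning"], ["language", "text"]]
related = topic_coherence(["learning", "deep"], docs)
unrelated = topic_coherence(["learning", "text"], docs)
print(related > unrelated)
```

Scores are at most slightly above zero and usually negative; values closer to zero indicate that a topic's top words tend to appear in the same documents.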

Traditional API

import bitermplus as btm
import numpy as np
import pandas as pd

# Importing data
df = pd.read_csv(
    'dataset/SearchSnippets.txt.gz', header=None, names=['texts'])
texts = df['texts'].str.strip().tolist()

# Preprocessing
X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
docs_vec = btm.get_vectorized_docs(texts, vocabulary)
biterms = btm.get_biterms(docs_vec)

# Initializing and running model
model = btm.BTM(
    X, vocabulary, seed=12321, T=8, M=20, alpha=50/8, beta=0.01)
model.fit(biterms, iterations=20)
p_zd = model.transform(docs_vec)

# Metrics
coherence = model.coherence_
perplexity = model.perplexity_
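
For intuition about the perplexity metric, the following plain-Python sketch uses the standard topic-model definition: the exponential of the negative average log-likelihood per token, with P(w|d) obtained by summing P(w|z) P(z|d) over topics. It assumes bitermplus follows this standard definition; the function signature and the toy inputs are invented for illustration.

```python
import math


def perplexity(p_wz, p_zd, doc_word_counts):
    """Perplexity of a topic model on a corpus.

    p_wz[z][w] is P(w|z), p_zd[d][z] is P(z|d), and doc_word_counts[d] maps
    word index -> count for document d.
    """
    log_likelihood = 0.0
    n_tokens = 0
    for d, counts in enumerate(doc_word_counts):
        for w, n in counts.items():
            # P(w|d) marginalizes the word probability over topics.
            p_wd = sum(p_zd[d][z] * p_wz[z][w] for z in range(len(p_wz)))
            log_likelihood += n * math.log(p_wd)
            n_tokens += n
    return math.exp(-log_likelihood / n_tokens)


# A model that is uniform over a 4-word vocabulary is as uncertain as a fair
# 4-sided die, so its perplexity equals the vocabulary size.
uniform_p_wz = [[0.25] * 4, [0.25] * 4]
uniform_p_zd = [[0.5, 0.5]]
counts = [{0: 2, 3: 1}]
value = perplexity(uniform_p_wz, uniform_p_zd, counts)
print(round(value, 6))
```

Lower perplexity means the model assigns higher probability to the observed words, so it is typically compared across models trained with different numbers of topics.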

Visualization

Visualize your topic modeling results with tmplot:

pip install tmplot

import tmplot as tmp

# Generate interactive topic visualization
tmp.report(model=model, docs=texts)

Documentation

Sklearn-style API Guide: Complete guide to the modern sklearn-compatible interface, with examples and best practices.

Traditional API Tutorial: In-depth tutorial covering advanced topic modeling techniques and model evaluation.

API Reference: Comprehensive documentation of all functions, classes, and parameters.

Migration from v0.7.0 to v0.8.0

The traditional API remains fully compatible. The new sklearn-style API provides a simpler alternative:

Old approach (still works)

# Multi-step manual process
X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
docs_vec = btm.get_vectorized_docs(texts, vocabulary)
biterms = btm.get_biterms(docs_vec)

model = btm.BTM(X, vocabulary, seed=42, T=8, M=20, alpha=50/8, beta=0.01)
model.fit(biterms, iterations=100)
p_zd = model.transform(docs_vec)

New approach (recommended)

# One-liner with automatic preprocessing
model = btm.BTMClassifier(n_topics=8, random_state=42, max_iter=100)
p_zd = model.fit_transform(texts)

Migration Benefits

  • Streamlined Workflow — Direct text input with automatic preprocessing eliminates manual steps
  • Enhanced ML Integration — Native support for sklearn pipelines, cross-validation, and hyperparameter tuning
  • Improved Developer Experience — Clear parameter validation with informative error messages
  • Advanced Model Evaluation — Built-in scoring methods and intuitive topic interpretation tools
  • Backward Compatibility: All existing code using the traditional API will continue to work without modifications.
