Biterm Topic Model with sklearn-compatible API
Biterm Topic Model
Bitermplus is a high-performance implementation of the Biterm Topic Model (BTM) for short text analysis. The model was originally developed by Xiaohui Yan, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. Built on a cythonized version of BTM, bitermplus adds OpenMP parallelization and a scikit-learn compatible API that fits into existing ML workflows.
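For readers new to the model: a biterm is an unordered pair of words that co-occur in the same short text, and BTM learns topics from these pairs rather than from per-document word counts. The sketch below illustrates the idea in plain Python. The helper name `extract_biterms` is hypothetical and is not part of bitermplus, whose own `get_biterms` works on vectorized documents and may differ in detail.

```python
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered pair of tokens from one short document."""
    # Sort each pair so ("a", "b") and ("b", "a") count as the same biterm.
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


doc = "deep learning neural networks".split()
biterms = extract_biterms(doc)

# A document with n tokens yields n * (n - 1) / 2 biterms.
print(len(biterms))
print(biterms)
```

Because every pair in a short text is counted, even documents of three or four words contribute several co-occurrence observations.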
Key Features
- Scikit-learn Compatible API — Familiar fit(), transform(), and fit_transform() methods for easy adoption
- ML Pipeline Integration — Seamless compatibility with sklearn workflows, cross-validation, and grid search
- High-Performance Computing — Cythonized implementation with OpenMP parallel processing for speed
- Advanced Inference Methods — Multiple approaches including sum of biterms, sum of words, and mixed inference
- Comprehensive Model Evaluation — Built-in perplexity, semantic coherence, and entropy metrics
- Intuitive Topic Interpretation — Simple extraction of topic keywords and document-topic assignments
- Flexible Text Preprocessing — Customizable vectorization pipeline with sklearn CountVectorizer integration
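As an illustration of the "sum of biterms" inference mentioned in the list above, the following sketch computes document-topic probabilities as P(z|d) = Σ_b P(z|b) P(b|d), with P(z|b) proportional to P(z) P(w_i|z) P(w_j|z), following the formulation in the original BTM paper. It is a pure-Python sketch under stated assumptions, not the library's Cython implementation: the function name, the toy parameter values, and the fallback for single-word documents are all hypothetical.

```python
from itertools import combinations


def infer_doc_topics(word_ids, p_z, p_wz):
    """Sum-of-biterms inference: P(z|d) = sum over biterms of P(z|b) * P(b|d)."""
    n_topics = len(p_z)
    biterms = list(combinations(word_ids, 2))
    if not biterms:
        # Simplification for this sketch: a document without biterms
        # falls back to the topic prior.
        return list(p_z)
    p_zd = [0.0] * n_topics
    for wi, wj in biterms:
        # P(z|b) is proportional to P(z) * P(wi|z) * P(wj|z).
        weights = [p_z[z] * p_wz[z][wi] * p_wz[z][wj] for z in range(n_topics)]
        total = sum(weights)
        for z in range(n_topics):
            # Each listed biterm occurrence gets equal weight, so P(b|d) = 1 / len(biterms).
            p_zd[z] += weights[z] / total / len(biterms)
    return p_zd


# Toy parameters (hypothetical values): 2 topics, 3 vocabulary words.
p_z = [0.5, 0.5]
p_wz = [
    [0.8, 0.1, 0.1],  # topic 0 favours word 0
    [0.1, 0.1, 0.8],  # topic 1 favours word 2
]
topic_probs = infer_doc_topics([0, 0, 1], p_z, p_wz)
print(topic_probs)
```

The result is a probability vector over topics; for this toy document, dominated by word 0, most of the mass lands on topic 0.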
Donate
If you find this package useful, please consider donating any amount of money. This will help me spend more time on supporting open-source software.
Requirements
- Python ≥ 3.9
- NumPy ≥ 1.19.0 — Numerical computing foundation
- Pandas ≥ 1.2.0 — Data manipulation and analysis
- SciPy ≥ 1.6.0 — Scientific computing library
- scikit-learn ≥ 1.0.0 — Machine learning utilities and API compatibility
- tqdm ≥ 4.50.0 — Progress bars for model training
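To see which of these dependencies are already present in an environment, a small standard-library check such as the one below may help. It only reports installed versions and does not compare them against the minimums listed above; the helper name `installed_versions` is illustrative.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as published on PyPI.
REQUIRED = ["numpy", "pandas", "scipy", "scikit-learn", "tqdm"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(REQUIRED).items():
    print(f"{name}: {ver or 'not installed'}")
```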
Installation
Standard Installation
Install the latest stable release from PyPI:
pip install bitermplus
Development Version
Install the latest development version directly from the repository:
pip install git+https://github.com/maximtrp/bitermplus.git
Platform-Specific Setup
Linux/Ubuntu: Ensure Python development headers are installed:
sudo apt-get install python3.x-dev # where x is your Python minor version
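If you are unsure which minor version to substitute for x, the running interpreter can report it. This is a small sketch; the helper name `dev_package_name` is hypothetical, and it assumes your distribution follows the `python3.x-dev` naming used above.

```python
import sys


def dev_package_name(version_info=sys.version_info):
    """Build the development-headers package name for a given Python version."""
    return f"python{version_info[0]}.{version_info[1]}-dev"


# Prints the package name matching the interpreter that runs this script.
print(dev_package_name())
```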
Windows: No additional setup required with standard Python installations.
macOS: Install OpenMP support for parallel processing:
# Install Xcode Command Line Tools and Homebrew (if not already installed)
xcode-select --install
# Install OpenMP library
brew install libomp
pip install bitermplus
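If you encounter OpenMP compilation errors, point the compiler and linker at the Homebrew libomp installation. The paths below assume the default Apple Silicon Homebrew prefix (/opt/homebrew); on Intel Macs the default prefix is /usr/local, so adjust the paths accordingly: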
export LDFLAGS="-L/opt/homebrew/opt/libomp/lib"
export CPPFLAGS="-I/opt/homebrew/opt/libomp/include"
pip install bitermplus
Quick Start
Sklearn-style API (Recommended)
import bitermplus as btm
# Sample documents
texts = [
"machine learning algorithms are powerful",
"deep learning neural networks process data",
"natural language processing understands text"
]
# Create and train model
model = btm.BTMClassifier(n_topics=2, random_state=42)
doc_topics = model.fit_transform(texts)
# Get topic keywords
topic_words = model.get_topic_words(n_words=5)
print("Topic 0:", topic_words[0])
print("Topic 1:", topic_words[1])
# Evaluate model
coherence_score = model.score(texts)
print(f"Coherence: {coherence_score:.3f}")
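Once a document-topic matrix is available, a common next step is to pick the most probable topic for each document. The sketch below assumes that `fit_transform` returns one row of topic probabilities per document; it uses hypothetical stand-in values instead of a trained model so that it runs on its own, and the helper name `dominant_topics` is not part of bitermplus.

```python
def dominant_topics(doc_topics):
    """Return (topic index, probability) of the most probable topic for each row."""
    result = []
    for row in doc_topics:
        probs = list(row)
        best = max(range(len(probs)), key=probs.__getitem__)
        result.append((best, probs[best]))
    return result


# Stand-in values (hypothetical), shaped like a 3-document, 2-topic result.
example = [[0.9, 0.1], [0.2, 0.8], [0.55, 0.45]]
assignments = dominant_topics(example)
print(assignments)
```

The same function accepts any sequence of rows, so a NumPy array can be passed in place of the nested list.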
Traditional API
import bitermplus as btm
import numpy as np
import pandas as pd
# Importing data
df = pd.read_csv(
'dataset/SearchSnippets.txt.gz', header=None, names=['texts'])
texts = df['texts'].str.strip().tolist()
# Preprocessing
X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
docs_vec = btm.get_vectorized_docs(texts, vocabulary)
biterms = btm.get_biterms(docs_vec)
# Initializing and running model
model = btm.BTM(
X, vocabulary, seed=12321, T=8, M=20, alpha=50/8, beta=0.01)
model.fit(biterms, iterations=20)
p_zd = model.transform(docs_vec)
# Metrics
coherence = model.coherence_
perplexity = model.perplexity_
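For intuition about the perplexity metric, the sketch below implements the standard topic-model definition: perplexity = exp(−Σ_d Σ_w n_dw · ln p(w|d) / Σ_d Σ_w n_dw), where p(w|d) = Σ_z p(w|z) p(z|d). It is written in plain Python with toy inputs and is not claimed to match the library's internal computation exactly; the argument layout is an assumption made for this example.

```python
import math


def perplexity(p_wz, p_zd, counts):
    """Perplexity from topic-word probs (topics x words),
    document-topic probs (docs x topics) and word counts (docs x words)."""
    log_likelihood = 0.0
    n_tokens = 0
    for d, row in enumerate(counts):
        for w, n_dw in enumerate(row):
            if n_dw == 0:
                continue
            # p(w|d) mixes the topic-word distributions by the document's topic weights.
            p_wd = sum(p_zd[d][z] * p_wz[z][w] for z in range(len(p_wz)))
            log_likelihood += n_dw * math.log(p_wd)
            n_tokens += n_dw
    return math.exp(-log_likelihood / n_tokens)


# Toy inputs (hypothetical): both topics are uniform over 4 words.
p_wz = [[0.25] * 4, [0.25] * 4]
p_zd = [[0.5, 0.5], [0.3, 0.7]]
counts = [[1, 2, 0, 1], [0, 1, 1, 0]]
toy_perplexity = perplexity(p_wz, p_zd, counts)
print(toy_perplexity)
```

With uniform word distributions the result is close to the vocabulary size (4 here), which is the worst case for a vocabulary of that size; lower values indicate a better fit to the counted words.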
Visualization
Visualize your topic modeling results with tmplot:
pip install tmplot
import tmplot as tmp
# Generate interactive topic visualization
tmp.report(model=model, docs=texts)
Documentation
- Sklearn-style API Guide: Complete guide to the modern sklearn-compatible interface, with examples and best practices
- Traditional API Tutorial: In-depth tutorial covering advanced topic modeling techniques and model evaluation
- API Reference: Comprehensive documentation of all functions, classes, and parameters
Migration from v0.7.0 to v0.8.0
The traditional API remains fully compatible. The new sklearn-style API provides a simpler alternative:
Old approach (still works)
# Multi-step manual process
X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
docs_vec = btm.get_vectorized_docs(texts, vocabulary)
biterms = btm.get_biterms(docs_vec)
model = btm.BTM(X, vocabulary, seed=42, T=8, M=20, alpha=50/8, beta=0.01)
model.fit(biterms, iterations=100)
p_zd = model.transform(docs_vec)
New approach (recommended)
# Two lines with automatic preprocessing
model = btm.BTMClassifier(n_topics=8, random_state=42, max_iter=100)
p_zd = model.fit_transform(texts)
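Because the new API takes raw texts directly, comparing several topic counts becomes a short loop. The sketch below uses only calls that appear in this README (`BTMClassifier(n_topics=..., random_state=..., max_iter=...)`, `fit_transform`, and `score`); the helper names and the candidate topic counts are hypothetical, and how to interpret the returned scores is left to the reader.

```python
def score_topic_counts(texts, topic_counts, make_model):
    """Fit one model per topic count and collect the value returned by score()."""
    scores = {}
    for n_topics in topic_counts:
        model = make_model(n_topics)
        model.fit_transform(texts)
        scores[n_topics] = model.score(texts)
    return scores


def run_with_bitermplus(texts):
    """Example wiring; requires bitermplus and is not called in this snippet."""
    import bitermplus as btm

    return score_topic_counts(
        texts,
        topic_counts=[4, 8, 12],  # illustrative candidates, not a recommendation
        make_model=lambda k: btm.BTMClassifier(
            n_topics=k, random_state=42, max_iter=100
        ),
    )
```

Passing the model constructor as a parameter keeps the loop independent of the library, which makes it easy to test with a stand-in model.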
Migration Benefits
- Streamlined Workflow — Direct text input with automatic preprocessing eliminates manual steps
- Enhanced ML Integration — Native support for sklearn pipelines, cross-validation, and hyperparameter tuning
- Improved Developer Experience — Clear parameter validation with informative error messages
- Advanced Model Evaluation — Built-in scoring methods and intuitive topic interpretation tools
- Backward Compatibility: All existing code using the traditional API will continue to work without modifications.
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
File details
Details for the file bitermplus-0.9.1.tar.gz.
File metadata
- Download URL: bitermplus-0.9.1.tar.gz
- Upload date:
- Size: 391.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 49c421d71e2212efdaf12a78ac378362720d88e6516ea61a8910d11e2f35486b |
| MD5 | 37257b2f2013509789a8c787becb0c8f |
| BLAKE2b-256 | 3d4bab35910c6838239fe4f7e15db51ff16fca768d8c589bca442a5215cb897a |
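A downloaded archive can be checked against the SHA256 digest listed above with Python's standard `hashlib` module. This is a minimal sketch: the helper names are illustrative, and the local file path passed to `matches_expected` is an assumption about where you saved the download.

```python
import hashlib

# SHA256 digest listed above for bitermplus-0.9.1.tar.gz.
EXPECTED_SHA256 = "49c421d71e2212efdaf12a78ac378362720d88e6516ea61a8910d11e2f35486b"


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True when the file's SHA256 digest equals the expected value."""
    return sha256_of_file(path) == expected.lower()
```

For example, `matches_expected("bitermplus-0.9.1.tar.gz")` should return True for an intact copy of the source distribution saved in the current directory.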