Biterm Topic Model
This is a simple Python implementation of the awesome Biterm Topic Model (BTM). The model is well suited to classifying short texts: it explicitly models word co-occurrence patterns across the whole corpus, which addresses the sparsity of word co-occurrence at the document level.
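To make the idea of a biterm concrete, here is a minimal sketch, assuming a biterm is an unordered pair of distinct words that occur in the same short text. The function name extract_biterms and the toy corpus are made up for illustration; this is not the package's own vec_to_biterms.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered pair of distinct words in one short text."""
    unique_words = sorted(set(tokens))
    return list(combinations(unique_words, 2))


# Hypothetical toy corpus of tokenized titles (not taken from the Reuters data).
corpus = [
    ["oil", "prices", "rise"],
    ["oil", "exports", "rise"],
]

# Biterms are pooled over the whole corpus instead of being kept per document,
# so a pair seen in many short texts accumulates a high count.
biterm_counts = Counter(b for doc in corpus for b in extract_biterms(doc))
print(biterm_counts.most_common(1))  # [(('oil', 'rise'), 2)]
```

Pooling the pairs is what lets the model see co-occurrence patterns that a single short text is too sparse to reveal.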
Simply install by:
    pip install biterm
Load some short texts and vectorize them via sklearn.
    from sklearn.feature_extraction.text import CountVectorizer

    texts = open('./data/reuters.titles').read().splitlines()[:50]
    vec = CountVectorizer(stop_words='english')
    X = vec.fit_transform(texts).toarray()
Get the vocabulary and the biterms from the texts.
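    import numpy as np
    from biterm.utility import vec_to_biterms

    # scikit-learn 1.2+ removed get_feature_names(); use get_feature_names_out() there
    vocab = np.array(vec.get_feature_names())
    biterms = vec_to_biterms(X)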
Create a BTM and pass the biterms to train it.
    from biterm.cbtm import oBTM

    btm = oBTM(num_topics=20, V=vocab)
    topics = btm.fit_transform(biterms, iterations=100)
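Training assigns a topic to every biterm by Gibbs sampling. The following is a rough sketch of a collapsed Gibbs sampler for BTM, assuming each biterm is a pair of vocabulary indices. The function gibbs_btm, its parameters (alpha, beta, seed) and the toy input are hypothetical; it illustrates the general technique and is not the package's actual code.

```python
import numpy as np


def gibbs_btm(biterms, num_topics, vocab_size, iterations=50,
              alpha=1.0, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for BTM; returns (theta, phi)."""
    rng = np.random.default_rng(seed)
    biterms = np.asarray(biterms)
    n_z = np.zeros(num_topics)                  # biterms assigned to each topic
    n_wz = np.zeros((vocab_size, num_topics))   # word counts per topic

    # Start from a random topic for every biterm.
    z = rng.integers(num_topics, size=len(biterms))
    for (wi, wj), k in zip(biterms, z):
        n_z[k] += 1
        n_wz[wi, k] += 1
        n_wz[wj, k] += 1

    for _ in range(iterations):
        for b, (wi, wj) in enumerate(biterms):
            # Remove the current assignment of this biterm from the counts.
            k = z[b]
            n_z[k] -= 1
            n_wz[wi, k] -= 1
            n_wz[wj, k] -= 1

            # Conditional over topics given all other assignments
            # (each biterm contributes two words, hence 2 * n_z).
            denom = (2 * n_z + vocab_size * beta) ** 2
            p = (n_z + alpha) * (n_wz[wi] + beta) * (n_wz[wj] + beta) / denom
            k = rng.choice(num_topics, p=p / p.sum())

            # Record the newly sampled topic.
            z[b] = k
            n_z[k] += 1
            n_wz[wi, k] += 1
            n_wz[wj, k] += 1

    theta = (n_z + alpha) / (n_z.sum() + num_topics * alpha)       # P(z)
    phi = (n_wz + beta) / (n_wz.sum(axis=0) + vocab_size * beta)   # P(w|z)
    return theta, phi


# Hypothetical toy input: four biterms over a four-word vocabulary.
toy_biterms = [(0, 1), (0, 1), (2, 3), (2, 3)]
theta, phi = gibbs_btm(toy_biterms, num_topics=2, vocab_size=4)
print(theta.shape, phi.shape)
```

Every iteration visits every biterm once in a Python loop, which is why this kind of sampler gets slow on large corpora.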
Save a topic plot using pyLDAvis and explore the results! (also see simple_btml.py)
    import pyLDAvis

    vis = pyLDAvis.prepare(btm.phi_wz.T, topics, np.count_nonzero(X, axis=1), vocab, np.sum(X, axis=0))
    pyLDAvis.save_html(vis, './vis/simple_btm.html')  # example output path
Inference is done with Gibbs sampling and is not really fast; the implementation is not meant for production. Use the Cython version (biterm.cbtm instead of biterm.btm) to speed things up a bit. If you have to classify a lot of texts, you can also try online learning, which fits the model on one chunk of biterms at a time:
    import numpy as np
    import pyLDAvis
    from biterm.cbtm import oBTM
    from sklearn.feature_extraction.text import CountVectorizer
    from biterm.utility import vec_to_biterms, topic_summuary  # helper functions

    if __name__ == "__main__":

        texts = open('./data/reuters.titles').read().splitlines()

        # vectorize texts
        vec = CountVectorizer(stop_words='english')
        X = vec.fit_transform(texts).toarray()

        # get vocabulary (use get_feature_names_out() on scikit-learn 1.2+)
        vocab = np.array(vec.get_feature_names())

        # get biterms
        biterms = vec_to_biterms(X)

        # create btm
        btm = oBTM(num_topics=20, V=vocab)

        print("\n\n Train Online BTM ..")
        for i in range(0, len(biterms), 100):  # process chunks of 100 texts
            biterms_chunk = biterms[i:i + 100]
            btm.fit(biterms_chunk, iterations=50)
        topics = btm.transform(biterms)

        print("\n\n Visualize Topics ..")
        vis = pyLDAvis.prepare(btm.phi_wz.T, topics, np.count_nonzero(X, axis=1), vocab, np.sum(X, axis=0))
        pyLDAvis.save_html(vis, './vis/online_btm.html')

        print("\n\n Topic coherence ..")
        topic_summuary(btm.phi_wz.T, X, vocab, 10)

        print("\n\n Texts & Topics ..")
        for i in range(len(texts)):
            print("{} (topic: {})".format(texts[i], topics[i].argmax()))
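The last loop of the script picks the most probable topic for each text with argmax. BTM learns topics over biterms, not over documents, so a per-document distribution has to be derived afterwards. One common way is to average the topic posterior of each biterm in the document. The sketch below assumes this approach; infer_doc_topics and the toy parameters are hypothetical, and btm.transform may compute its result differently.

```python
import numpy as np


def infer_doc_topics(doc_biterms, theta, phi):
    """Estimate P(z|d) as the mean of P(z|b) over the document's biterms.

    theta has shape (num_topics,) and holds P(z);
    phi has shape (vocab_size, num_topics) and holds P(w|z).
    """
    num_topics = len(theta)
    if len(doc_biterms) == 0:
        # No biterms (for example a one-word text): fall back to uniform.
        return np.full(num_topics, 1.0 / num_topics)

    p_zd = np.zeros(num_topics)
    for wi, wj in doc_biterms:
        p_zb = theta * phi[wi] * phi[wj]   # unnormalized P(z|b)
        p_zd += p_zb / p_zb.sum()
    return p_zd / len(doc_biterms)


# Hypothetical toy parameters: 2 topics, 3 words; each column of phi sums to 1.
theta = np.array([0.5, 0.5])
phi = np.array([[0.45, 0.05],
                [0.45, 0.05],
                [0.10, 0.90]])

doc_topics = infer_doc_topics([(0, 1)], theta, phi)
print(doc_topics.argmax())  # 0
```

Averaging over a list that repeats a biterm as often as it occurs weights each biterm by its relative frequency in the document.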
Download files
Filename | Size | File type | Python version |
---|---|---|---|
biterm-0.1.5-cp36-cp36m-macosx_10_7_x86_64.whl | 62.3 kB | Wheel | cp36 |
biterm-0.1.5.tar.gz | 79.7 kB | Source | None |
Hashes for biterm-0.1.5-cp36-cp36m-macosx_10_7_x86_64.whl
Algorithm | Hash digest |
---|---|
SHA256 | bb3b28fdbb31365ee27209c4bc132d6ae9802c2f44648a977f101014e709fa66 |
MD5 | ab819c4d671c710e6d900ae3d0195b33 |
BLAKE2-256 | 194b8267896db7dc084d2f077253f67aab1e96bafbf3768017f183ce2d216cc1 |