Biterm Topic Model
Project description
This is a simple Python implementation of the awesome Biterm Topic Model (BTM), which works particularly well on short texts. Rather than modeling each document on its own, it explicitly models word co-occurrence patterns over the whole corpus, which sidesteps the problem of sparse word co-occurrence at the document level.
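To make the idea concrete: a biterm is an unordered pair of distinct words that co-occur in the same short text. A minimal sketch in plain Python (purely illustrative, not part of the library):

from itertools import combinations

# Every unordered pair of words in a short text forms one biterm.
words = "oil prices rise".split()
biterms = list(combinations(words, 2))
print(biterms)  # [('oil', 'prices'), ('oil', 'rise'), ('prices', 'rise')]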
Simply install it with pip:
pip install biterm
Load some short texts and vectorize them via sklearn.
from sklearn.feature_extraction.text import CountVectorizer
texts = open('./data/reuters.titles').read().splitlines()[:50]
vec = CountVectorizer(stop_words='english')
X = vec.fit_transform(texts).toarray()
Get the vocabulary and the biterms from the texts.
import numpy as np
from biterm.utility import vec_to_biterms

vocab = np.array(vec.get_feature_names())
biterms = vec_to_biterms(X)
Create a BTM and pass the biterms to train it.
from biterm.btm import oBTM
btm = oBTM(num_topics=20, V=vocab)
topics = btm.fit_transform(biterms, iterations=100)
Save a topic plot using pyLDAvis and explore the results! (also see simple_btml.py)
import pyLDAvis

vis = pyLDAvis.prepare(btm.phi_wz.T, topics, np.count_nonzero(X, axis=1), vocab, np.sum(X, axis=0))
pyLDAvis.save_html(vis, './vis/simple_btm.html')
Inference is done with Gibbs sampling, which is not particularly fast, so the implementation is not meant for production use. If you have to classify a lot of texts, though, you can try online learning, as the full example below shows, and use the Cython version (biterm.cbtm) to speed things up a bit.
import numpy as np
import pyLDAvis
from biterm.cbtm import oBTM
from sklearn.feature_extraction.text import CountVectorizer
from biterm.utility import vec_to_biterms, topic_summuary # helper functions
if __name__ == "__main__":

    texts = open('./data/reuters.titles').read().splitlines()

    # vectorize texts
    vec = CountVectorizer(stop_words='english')
    X = vec.fit_transform(texts).toarray()

    # get vocabulary
    vocab = np.array(vec.get_feature_names())

    # get biterms
    biterms = vec_to_biterms(X)

    # create btm
    btm = oBTM(num_topics=20, V=vocab)

    print("\n\n Train Online BTM ..")
    for i in range(0, len(biterms), 100):  # process chunks of 100 texts
        biterms_chunk = biterms[i:i + 100]
        btm.fit(biterms_chunk, iterations=50)
    topics = btm.transform(biterms)

    print("\n\n Visualize Topics ..")
    vis = pyLDAvis.prepare(btm.phi_wz.T, topics, np.count_nonzero(X, axis=1), vocab, np.sum(X, axis=0))
    pyLDAvis.save_html(vis, './vis/online_btm.html')

    print("\n\n Topic coherence ..")
    topic_summuary(btm.phi_wz.T, X, vocab, 10)

    print("\n\n Texts & Topics ..")
    for i in range(len(texts)):
        print("{} (topic: {})".format(texts[i], topics[i].argmax()))
Download files
Download the file for your platform.
Source Distribution: biterm-0.1.5.tar.gz (79.7 kB)
Built Distribution: biterm-0.1.5-cp36-cp36m-macosx_10_7_x86_64.whl (62.3 kB)
File details
Details for the file biterm-0.1.5.tar.gz.
File metadata
- Download URL: biterm-0.1.5.tar.gz
- Upload date:
- Size: 79.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/40.7.3 requests-toolbelt/0.9.1 tqdm/4.29.1 CPython/3.6.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | e46ed58b95e39247d0c56f3339e15a4a14545f2c1957d95c565f0ed3e3d20384
MD5 | f9763474ec9de44636d4c1cc87daf172
BLAKE2b-256 | 36ca5a43511e6ea8ca02cc9e8be1b8898ad79b140c055d4400342dc210ba23bb
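To verify a download against the digests above, one option is Python's hashlib; this sketch assumes the sdist has been downloaded into the current directory:

import hashlib

# Compare the SHA256 digest of the downloaded sdist with the published value.
expected = "e46ed58b95e39247d0c56f3339e15a4a14545f2c1957d95c565f0ed3e3d20384"
with open("biterm-0.1.5.tar.gz", "rb") as f:
    actual = hashlib.sha256(f.read()).hexdigest()
print("OK" if actual == expected else "hash mismatch!")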
File details
Details for the file biterm-0.1.5-cp36-cp36m-macosx_10_7_x86_64.whl.
File metadata
- Download URL: biterm-0.1.5-cp36-cp36m-macosx_10_7_x86_64.whl
- Upload date:
- Size: 62.3 kB
- Tags: CPython 3.6m, macOS 10.7+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/40.7.3 requests-toolbelt/0.9.1 tqdm/4.29.1 CPython/3.6.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | bb3b28fdbb31365ee27209c4bc132d6ae9802c2f44648a977f101014e709fa66
MD5 | ab819c4d671c710e6d900ae3d0195b33
BLAKE2b-256 | 194b8267896db7dc084d2f077253f67aab1e96bafbf3768017f183ce2d216cc1