Biterm Topic Model

Project description

This is a simple Python implementation of the awesome Biterm Topic Model. This model works well for topic discovery in short texts. Rather than relying on sparse word co-occurrence at the document level, it explicitly models word co-occurrence patterns over the whole corpus.
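The core idea can be illustrated without the library: a biterm is an unordered pair of distinct words drawn from the same short text, and the model counts these pairs over the entire corpus rather than per document. A minimal sketch of biterm extraction in plain Python (illustration only, not the package's implementation):

```python
from itertools import combinations

def biterms_from_tokens(tokens):
    """Return every unordered pair of distinct words in one short text."""
    unique = sorted(set(tokens))
    return list(combinations(unique, 2))

pairs = biterms_from_tokens("oil prices rise as oil supply falls".split())
print(pairs[:3])
# [('as', 'falls'), ('as', 'oil'), ('as', 'prices')]
```

Even a one-line news title yields many biterms, which is what gives the model a denser co-occurrence signal than per-document statistics.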

Install it with:

pip install biterm

Load some short texts and vectorize them via sklearn.

    from sklearn.feature_extraction.text import CountVectorizer

    texts = open('./data/reuters.titles').read().splitlines()[:50]
    vec = CountVectorizer(stop_words='english')
    X = vec.fit_transform(texts).toarray()

Get the vocabulary and the biterms from the texts.

    import numpy as np
    from biterm.utility import vec_to_biterms

    vocab = np.array(vec.get_feature_names())
    biterms = vec_to_biterms(X)
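For intuition, `vec_to_biterms` produces, for each document row of the count matrix, the pairs of vocabulary indices that co-occur in that document. A rough, hedged reimplementation of that behaviour (an assumption about the output format, not the package's exact code):

```python
from itertools import combinations
import numpy as np

def vec_to_biterms_sketch(X):
    """For each document row, pair up the indices of all words present."""
    biterms = []
    for row in X:
        word_ids = [int(i) for i in np.nonzero(row)[0]]  # vocab indices in this doc
        biterms.append(list(combinations(word_ids, 2)))
    return biterms

X_demo = np.array([[1, 0, 2, 1],   # words 0, 2, 3 present
                   [0, 1, 1, 0]])  # words 1, 2 present
print(vec_to_biterms_sketch(X_demo))
# [[(0, 2), (0, 3), (2, 3)], [(1, 2)]]
```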

Create a BTM and pass the biterms to train it.

    from biterm.cbtm import oBTM

    btm = oBTM(num_topics=20, V=vocab)
    topics = btm.fit_transform(biterms, iterations=100)

Save a topic plot using pyLDAvis and explore the results! (also see simple_btml.py)

    import pyLDAvis

    vis = pyLDAvis.prepare(btm.phi_wz.T, topics, np.count_nonzero(X, axis=1), vocab, np.sum(X, axis=0))
    pyLDAvis.save_html(vis, './vis/simple_btm.html')

pyLDAvis Visualization

Inference uses Gibbs sampling, which is not particularly fast, so this implementation is not meant for production use. If you have to classify a large number of texts, try online learning instead, and use the Cython version to speed things up a bit.

    import numpy as np
    import pyLDAvis
    from biterm.cbtm import oBTM
    from sklearn.feature_extraction.text import CountVectorizer
    from biterm.utility import vec_to_biterms, topic_summuary  # helper functions

    if __name__ == "__main__":

        texts = open('./data/reuters.titles').read().splitlines()

        # vectorize texts
        vec = CountVectorizer(stop_words='english')
        X = vec.fit_transform(texts).toarray()

        # get vocabulary
        vocab = np.array(vec.get_feature_names())

        # get biterms
        biterms = vec_to_biterms(X)

        # create btm
        btm = oBTM(num_topics=20, V=vocab)

        print("\n\n Train Online BTM ..")
        for i in range(0, len(biterms), 100):  # process chunks of 100 texts
            biterms_chunk = biterms[i:i + 100]
            btm.fit(biterms_chunk, iterations=50)
        topics = btm.transform(biterms)

        print("\n\n Visualize Topics ..")
        vis = pyLDAvis.prepare(btm.phi_wz.T, topics, np.count_nonzero(X, axis=1), vocab, np.sum(X, axis=0))
        pyLDAvis.save_html(vis, './vis/online_btm.html')

        print("\n\n Topic coherence ..")
        topic_summuary(btm.phi_wz.T, X, vocab, 10)

        print("\n\n Texts & Topics ..")
        for i in range(len(texts)):
            print("{} (topic: {})".format(texts[i], topics[i].argmax()))

