Tomoto, The Topic Modeling Tool for Python

What is tomotopy?

tomotopy is a Python extension of tomoto (Topic Modeling Tool), a Gibbs-sampling based topic model library written in C++. It uses the vectorization features of modern CPUs to maximize speed. The current version of tomoto supports several major topic models, including

  • Latent Dirichlet Allocation (tomotopy.LDAModel),

  • Labeled LDA (tomotopy.LLDAModel),

  • Supervised LDA (tomotopy.SLDAModel),

  • Dirichlet Multinomial Regression (tomotopy.DMRModel),

  • Hierarchical Dirichlet Process (tomotopy.HDPModel),

  • Multi Grain LDA (tomotopy.MGLDAModel),

  • Pachinko Allocation (tomotopy.PAModel),

  • Hierarchical PA (tomotopy.HPAModel),

  • Correlated Topic Model (tomotopy.CTModel).

The most recent version of tomotopy is 0.3.0.

Getting Started

You can install tomotopy easily using pip. (https://pypi.org/project/tomotopy/)

$ pip install tomotopy

On Linux, gcc 5 or later is necessary to compile the C++14 code. After installing, you can start using tomotopy by simply importing it.

import tomotopy as tp
print(tp.isa) # prints 'avx2', 'avx', 'sse2' or 'none'

Currently, tomotopy can exploit the AVX2, AVX or SSE2 SIMD instruction sets to maximize performance. When the package is imported, it checks the available instruction sets and selects the best option. If tp.isa reports none, training iterations may take a long time. However, since most modern Intel and AMD CPUs provide a SIMD instruction set, SIMD acceleration can usually deliver a big improvement.
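As a minimal sketch of how a script might react to the detected instruction set, the helper below maps a value of tp.isa to a short message. The function name describe_isa and its messages are hypothetical; only tp.isa and its four possible values come from the text above.

```python
def describe_isa(isa):
    """Return a short note for a value of tp.isa (hypothetical helper)."""
    if isa in ('avx2', 'avx', 'sse2'):
        return 'SIMD acceleration enabled ({})'.format(isa)
    if isa == 'none':
        return 'no SIMD acceleration; training iterations may be slow'
    return 'unrecognized value: {!r}'.format(isa)


if __name__ == '__main__':
    try:
        import tomotopy as tp
        print(describe_isa(tp.isa))
    except ImportError:
        # Lets the sketch run even where tomotopy is not installed
        print('tomotopy is not installed')
```

A check like this can be useful at the start of a long training job, so that a slow run on a machine without SIMD support does not come as a surprise.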

Here is sample code for simple LDA training on texts from the file ‘sample.txt’.

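import tomotopy as tp

# Build an LDA model with 20 topics
mdl = tp.LDAModel(k=20)

# Add one document per line, split into whitespace-separated tokens
with open('sample.txt') as f:
    for line in f:
        mdl.add_doc(line.strip().split())

# Train 100 iterations in total, reporting every 10 iterations
for i in range(0, 100, 10):
    mdl.train(10)
    print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))

# Print the top 10 words of each topic
for k in range(mdl.k):
    print('Top 10 words of topic #{}'.format(k))
    print(mdl.get_topic_words(k, top_n=10))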
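The sample above treats every line of the file as one document and splits it on whitespace. A slightly more defensive version of that step might look like the sketch below, which skips blank lines so that no empty token list is produced. The function name iter_token_lists and the sample lines are made up for illustration.

```python
def iter_token_lists(lines):
    """Yield one whitespace-split token list per non-blank line."""
    for line in lines:
        tokens = line.strip().split()
        if tokens:  # skip blank lines rather than yielding an empty document
            yield tokens


# Made-up input standing in for the lines of 'sample.txt'
sample_lines = ['the cat sat\n', '\n', '  topic models  are fun \n']
docs = list(iter_token_lists(sample_lines))
print(docs)  # [['the', 'cat', 'sat'], ['topic', 'models', 'are', 'fun']]
```

With a real model, each yielded list could be passed to mdl.add_doc in place of line.strip().split().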

Performance of tomotopy

tomotopy uses Collapsed Gibbs Sampling (CGS) to infer the distribution of topics and the distribution of words. Generally, CGS converges more slowly than the Variational Bayes (VB) method that [gensim’s LdaModel] uses, but each of its iterations can be computed much faster. In addition, tomotopy can take advantage of multicore CPUs with a SIMD instruction set, which can result in faster iterations.

[gensim’s LdaModel]: https://radimrehurek.com/gensim/models/ldamodel.html
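To make the idea of a CGS iteration concrete, here is a toy, pure-Python collapsed Gibbs sampler for LDA. It is a textbook-style sketch and not tomotopy's implementation, which is written in C++ and uses SIMD and multiple cores. The function name, the parameter names alpha and eta, and the tiny corpus of integer word ids are all assumptions made for illustration.

```python
import random


def train_lda_cgs(docs, k, vocab_size, alpha=0.1, eta=0.01, iterations=50, seed=0):
    """Toy collapsed Gibbs sampler for LDA over documents of integer word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * k for _ in docs]                 # tokens of doc d assigned to topic t
    topic_word = [[0] * vocab_size for _ in range(k)]   # tokens of word w assigned to topic t
    topic_total = [0] * k                               # tokens assigned to topic t
    assignments = []

    # Give every token a random initial topic and record the counts
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            t = rng.randrange(k)
            z_doc.append(t)
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts
                t = assignments[d][i]
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # p(z = j | rest) is proportional to
                # (n_dj + alpha) * (n_jw + eta) / (n_j + V * eta)
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + eta)
                    / (topic_total[j] + vocab_size * eta)
                    for j in range(k)
                ]
                # Resample the topic and put the token back into the counts
                t = rng.choices(range(k), weights=weights)[0]
                assignments[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1
    return doc_topic, topic_word


# Made-up corpus: word ids 0-2 and 3-5 tend to co-occur
docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 5], [0, 1, 2, 2], [3, 5, 4, 4]]
doc_topic, topic_word = train_lda_cgs(docs, k=2, vocab_size=6)
print(doc_topic)
```

Each iteration only updates integer counts and draws one sample per token, which is why a single CGS sweep is cheap compared with a VB update.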

The following chart compares the running time of the LDA model between tomotopy and gensim. The input data consists of 1000 random documents from English Wikipedia with 1,506,966 words (about 10.1 MB). tomotopy trains for 200 iterations and gensim for 10 iterations.

https://bab2min.github.io/tomotopy/images/tmt_i5.png

↑ Performance in Intel i5-6600, x86-64 (4 cores)

https://bab2min.github.io/tomotopy/images/tmt_xeon.png

↑ Performance in Intel Xeon E5-2620 v4, x86-64 (8 cores, 16 threads)

https://bab2min.github.io/tomotopy/images/tmt_r7_3700x.png

↑ Performance in AMD Ryzen7 3700X, x86-64 (8 cores, 16 threads)

Although tomotopy ran 20 times as many iterations, its overall run was 5 to 10 times faster than gensim's, and it yields a stable result.
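As a back-of-the-envelope reading of these figures, the arithmetic below derives the implied per-iteration speed ratio. It assumes that running time is spread evenly over iterations and ignores any fixed setup cost; a CGS iteration and a VB iteration also do different work, so this is only a rough illustration.

```python
tomotopy_iters, gensim_iters = 200, 10
iteration_ratio = tomotopy_iters / gensim_iters  # 20.0

# Overall speedups quoted in the text: 5 to 10 times
per_iteration_speedups = [s * iteration_ratio for s in (5, 10)]
for overall, per_iter in zip((5, 10), per_iteration_speedups):
    print('overall {}x faster -> roughly {:.0f}x faster per iteration'.format(overall, per_iter))
```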

It is difficult to compare CGS and VB directly because they are totally different techniques. But from a practical point of view, we can compare their speed and results. The following chart shows the log-likelihood per word of the two models’ results.

https://bab2min.github.io/tomotopy/images/LLComp.png

The SIMD instruction set has a great effect on performance. The following is a comparison between SIMD instruction sets.

https://bab2min.github.io/tomotopy/images/SIMDComp.png

Fortunately, most recent x86-64 CPUs provide the AVX2 instruction set, so we can enjoy the performance of AVX2.

Model Save and Load

tomotopy provides save and load methods for each topic model class, so you can save a model to a file whenever you want and reload it from that file.

import tomotopy as tp

mdl = tp.HDPModel()
with open('sample.txt') as f:
    for line in f:
        mdl.add_doc(line.strip().split())

for i in range(0, 100, 10):
    mdl.train(10)
    print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))

# save into file
mdl.save('sample_hdp_model.bin')

# load from file
mdl = tp.HDPModel.load('sample_hdp_model.bin')
for k in range(mdl.k):
    if not mdl.is_live_topic(k): continue
    print('Top 10 words of topic #{}'.format(k))
    print(mdl.get_topic_words(k, top_n=10))

# the saved model is an HDP model,
# so loading it as an LDA model raises an exception
mdl = tp.LDAModel.load('sample_hdp_model.bin')

When you load a model from a file, the model type stored in the file must match the class whose load method you call.

See more at tomotopy.LDAModel.save and tomotopy.LDAModel.load methods.
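Because a type mismatch raises an exception, a caller may want to handle that case explicitly. The sketch below wraps the load call in a hypothetical helper, load_model. The text does not say which exception type is raised, so the helper catches Exception broadly, and a stand-in class is used here so that the example runs without tomotopy or a model file.

```python
def load_model(model_cls, path):
    """Call model_cls.load(path); return None if loading fails."""
    try:
        return model_cls.load(path)
    except Exception as exc:  # exact exception type is not specified in the text
        print('Could not load {} as {}: {}'.format(path, model_cls.__name__, exc))
        return None


class _StubModel:
    """Stand-in for a model class whose load() rejects the file (illustration only)."""

    @classmethod
    def load(cls, path):
        raise RuntimeError('model type mismatch')


result = load_model(_StubModel, 'sample_hdp_model.bin')
print(result)  # None
```

With tomotopy installed, the same helper could be called as load_model(tp.LDAModel, 'sample_hdp_model.bin').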

Documents in the Model and out of the Model

Topic models serve two major purposes. The basic one is to discover topics from a set of documents as the result of a trained model; the more advanced one is to infer topic distributions for unseen documents using a trained model.

We call a document used for the former purpose (model training) a document in the model, and a document used for the latter purpose (unseen during training) a document out of the model.

In tomotopy, these two kinds of documents are created differently. A document in the model is created by the tomotopy.LDAModel.add_doc method. add_doc can be called only before tomotopy.LDAModel.train starts. In other words, once train has been called, add_doc cannot add a document to the model because the set of documents used for training has become fixed.

To get the instance of the created document, use tomotopy.LDAModel.docs like this:

mdl = tp.LDAModel(k=20)
idx = mdl.add_doc(words)
if idx < 0: raise RuntimeError("Failed to add doc")
doc_inst = mdl.docs[idx]
# doc_inst is an instance of the added document

A document out of the model is created by the tomotopy.LDAModel.make_doc method. make_doc can be called only after train starts. If you use make_doc before the set of documents used for training has become fixed, you may get wrong results. Since make_doc returns the instance directly, you can use its return value for other manipulations.

mdl = tp.LDAModel(k=20)
# add_doc ...
mdl.train(100)
doc_inst = mdl.make_doc(unseen_words) # doc_inst is an instance of the unseen document

Inference for Unseen Documents

If a new document is created by tomotopy.LDAModel.make_doc, its topic distribution can be inferred by the model. Inference for an unseen document should be performed using the tomotopy.LDAModel.infer method.

mdl = tp.LDAModel(k=20)
# add_doc ...
mdl.train(100)
doc_inst = mdl.make_doc(unseen_words)
topic_dist, ll = mdl.infer(doc_inst)
print("Topic Distribution for Unseen Docs: ", topic_dist)
print("Log-likelihood of inference: ", ll)

The infer method accepts either a single instance of tomotopy.Document or a list of instances of tomotopy.Document. See more at tomotopy.LDAModel.infer.
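Once a topic distribution is available, a common next step is to pick the most probable topics. The sketch below assumes that topic_dist is a sequence of per-topic probabilities indexed by topic id; the helper name top_topics and the example values are made up for illustration and are not model output.

```python
def top_topics(topic_dist, n=3):
    """Return the n (topic_id, probability) pairs with the highest probability."""
    ranked = sorted(enumerate(topic_dist), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


# Made-up distribution over 4 topics
example_dist = [0.05, 0.60, 0.10, 0.25]
print(top_topics(example_dist, n=2))  # [(1, 0.6), (3, 0.25)]
```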

Examples

You can find example Python code for tomotopy at https://github.com/bab2min/tomotopy/blob/master/example.py .

You can also get the data file used in the example code at https://drive.google.com/file/d/18OpNijd4iwPyYZ2O7pQoPyeTAKEXa71J/view .

License

tomotopy is licensed under the terms of the MIT License, meaning you can use it for any reasonable purpose and remain in complete ownership of all the documentation you produce.

History

  • 0.3.0 (2019-10-06)
    • A new model, tomotopy.LLDAModel was added into the package.

    • A crashing issue of HDPModel was fixed.

    • Since hyperparameter estimation for HDPModel was implemented, the result of HDPModel may differ from previous versions.

      If you want to turn off hyperparameter estimation of HDPModel, set optim_interval to zero.

  • 0.2.0 (2019-08-18)
    • New models including tomotopy.CTModel and tomotopy.SLDAModel were added into the package.

    • A new parameter option rm_top was added for all topic models.

    • The problems in the save and load methods of PAModel and HPAModel were fixed.

    • An occasional crash in loading HDPModel was fixed.

    • The problem that ll_per_word was calculated incorrectly when min_cf > 0 was fixed.

  • 0.1.6 (2019-08-09)
    • Compile errors with clang in the macOS environment were fixed.

  • 0.1.4 (2019-08-05)
    • The issue when add_doc receives an empty list as input was fixed.

    • The issue that tomotopy.PAModel.get_topic_words did not extract the word distribution of subtopics was fixed.

  • 0.1.3 (2019-05-19)
    • The parameter min_cf and its stopword-removing function were added for all topic models.

  • 0.1.0 (2019-05-12)
    • First version of tomotopy

Download files

Download the file for your platform.

Source Distribution

  • tomotopy-0.3.0.tar.gz (953.2 kB), Source

Built Distributions

  • tomotopy-0.3.0-cp37-cp37m-win_amd64.whl (2.6 MB), CPython 3.7m, Windows x86-64

  • tomotopy-0.3.0-cp37-cp37m-win32.whl (1.5 MB), CPython 3.7m, Windows x86

  • tomotopy-0.3.0-cp37-cp37m-manylinux1_x86_64.whl (49.0 MB), CPython 3.7m

  • tomotopy-0.3.0-cp36-cp36m-win_amd64.whl (2.6 MB), CPython 3.6m, Windows x86-64

  • tomotopy-0.3.0-cp36-cp36m-win32.whl (1.5 MB), CPython 3.6m, Windows x86

  • tomotopy-0.3.0-cp36-cp36m-manylinux1_x86_64.whl (49.0 MB), CPython 3.6m

  • tomotopy-0.3.0-cp35-cp35m-win_amd64.whl (2.6 MB), CPython 3.5m, Windows x86-64

  • tomotopy-0.3.0-cp35-cp35m-win32.whl (1.5 MB), CPython 3.5m, Windows x86

  • tomotopy-0.3.0-cp35-cp35m-manylinux1_x86_64.whl (49.0 MB), CPython 3.5m

  • tomotopy-0.3.0-cp34-cp34m-manylinux1_x86_64.whl (49.0 MB), CPython 3.4m
