Tomoto, The Topic Modeling Tool for Python

What is tomotopy?

tomotopy is a Python extension of tomoto (Topic Modeling Tool), a Gibbs-sampling based topic model library written in C++. It uses the vectorization capabilities of modern CPUs to maximize speed. The current version of tomoto supports several major topic models, including:

  • Latent Dirichlet Allocation (tomotopy.LDAModel),

  • Dirichlet Multinomial Regression (tomotopy.DMRModel),

  • Hierarchical Dirichlet Process (tomotopy.HDPModel),

  • Multi Grain LDA (tomotopy.MGLDAModel),

  • Pachinko Allocation (tomotopy.PAModel),

  • Hierarchical PA (tomotopy.HPAModel).

Getting Started

You can install tomotopy easily using pip.

$ pip install tomotopy

On Linux, gcc 5 or later is necessary to compile the C++14 code. After installation, you can start using tomotopy by simply importing it.

import tomotopy as tp
print(tp.isa) # prints 'avx2', 'avx', 'sse2' or 'none'

Currently, tomotopy can exploit the AVX2, AVX or SSE2 SIMD instruction sets to maximize performance. When the package is imported, it checks the available instruction sets and selects the best option. If tp.isa is 'none', training iterations may take a long time. However, since most modern Intel and AMD CPUs provide a SIMD instruction set, SIMD acceleration can yield a large improvement.

Here is sample code for simple LDA training on the texts in the file ‘sample.txt’.

import tomotopy as tp
mdl = tp.LDAModel(k=20)
# each line of sample.txt is one whitespace-tokenized document
with open('sample.txt') as f:
    for line in f:
        mdl.add_doc(line.strip().split())

for i in range(0, 100, 10):
    mdl.train(10)
    print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))

for k in range(mdl.k):
    print('Top 10 words of topic #{}'.format(k))
    print(mdl.get_topic_words(k, top_n=10))
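The sample above feeds every line of the file to the model, including blank ones. As a minimal sketch, the reading step could be separated into a helper that skips lines without tokens. The name load_documents and its encoding parameter are hypothetical and not part of tomotopy; the temporary file only stands in for ‘sample.txt’.

```python
import os
import tempfile


def load_documents(path, encoding="utf-8"):
    """Read one document per line and return a list of token lists.

    Blank lines are skipped so that no empty document is produced.
    """
    documents = []
    with open(path, encoding=encoding) as f:
        for line in f:
            words = line.strip().split()
            if words:  # skip lines that contain no tokens
                documents.append(words)
    return documents


# Small self-check using a temporary file in place of 'sample.txt'.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("topic models find themes\n\n  \ngibbs sampling is simple\n")
    tmp_path = tmp.name

docs = load_documents(tmp_path)
os.remove(tmp_path)
print(docs)
```

Each returned token list can then be passed to mdl.add_doc in the same way as in the sample above.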

Performance of tomotopy

tomotopy uses Collapsed Gibbs Sampling (CGS) to infer the distribution of topics and the distribution of words. In general, CGS converges more slowly than the Variational Bayes (VB) method that [gensim’s LdaModel] uses, but each of its iterations can be computed much faster. In addition, tomotopy can take advantage of multicore CPUs with a SIMD instruction set, which can make iterations even faster.

[gensim’s LdaModel]: https://radimrehurek.com/gensim/models/ldamodel.html
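To make the idea of collapsed Gibbs sampling concrete, here is a toy pure-Python sampler for LDA. It is only an illustrative sketch of the general technique and is not tomoto's C++ implementation: the function name toy_lda_gibbs, the integer word ids, and the alpha and beta defaults are all assumptions chosen for the example.

```python
import random


def toy_lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
                  iterations=50, seed=0):
    """Toy collapsed Gibbs sampler for LDA over documents of integer word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                 # n_dk
    topic_word = [[0] * vocab_size for _ in range(num_topics)]   # n_kw
    topic_total = [0] * num_topics                               # n_k
    assignments = []

    # Assign a random initial topic to every token and fill the count tables.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1

                # p(z = k | rest) is proportional to
                # (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta)
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]

                # Record the newly sampled topic.
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    return doc_topic, topic_word


docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 5], [0, 2, 1, 0], [4, 5, 3, 3]]
doc_topic, topic_word = toy_lda_gibbs(docs, num_topics=2, vocab_size=6)
print(doc_topic)
```

Each iteration only updates integer counts and draws one sample per token, which is why a single CGS iteration is cheap even though many iterations are needed.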

The following chart compares the running time of the LDA model in tomotopy and gensim. The input data consists of 1000 random documents from English Wikipedia with 1,506,966 words (about 10.1 MB). tomotopy was trained for 200 iterations and gensim for 10 iterations.

https://bab2min.github.io/tomotopy/images/tmt_i5.png

Performance on an Intel i5-6600, x86-64 (4 cores)

https://bab2min.github.io/tomotopy/images/tmt_xeon.png

Performance on an Intel Xeon E5-2620 v4, x86-64 (8 cores, 16 threads)

Although tomotopy ran 20 times as many iterations, its overall running time was 5 to 10 times shorter than gensim's, and it yields a stable result.

It is difficult to compare CGS and VB directly because they are totally different techniques. From a practical point of view, however, we can compare their speed and their results. The following chart shows the log-likelihood per word of the two models’ results.

https://bab2min.github.io/tomotopy/images/LLComp.png
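For readers unfamiliar with the metric, the sketch below computes one common form of log-likelihood per word from a document-topic matrix theta and a topic-word matrix phi. This is an assumption made for illustration: the text does not state how tomotopy or gensim compute their values, so the numbers in the chart may follow a different definition. The function name and the tiny inputs are hypothetical.

```python
import math


def log_likelihood_per_word(docs, theta, phi):
    """Average log p(w | d) over all tokens.

    p(w | d) is taken as sum_k theta[d][k] * phi[k][w], where docs hold
    integer word ids, theta is documents x topics and phi is topics x words.
    """
    total_ll = 0.0
    total_tokens = 0
    for d, doc in enumerate(docs):
        for w in doc:
            p = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
            total_ll += math.log(p)
            total_tokens += 1
    return total_ll / total_tokens


# Two topics and four words, all uniform, so every token has p = 0.25.
docs = [[0, 1, 2], [3, 3]]
theta = [[0.5, 0.5], [0.5, 0.5]]
phi = [[0.25] * 4, [0.25] * 4]
ll = log_likelihood_per_word(docs, theta, phi)
print(ll)
```

Under this definition the value is always negative or zero, and values closer to zero indicate that the model assigns higher probability to the observed words.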

The SIMD instruction set has a great effect on performance. The following chart compares the SIMD instruction sets.

https://bab2min.github.io/tomotopy/images/SIMDComp.png

Fortunately, most recent x86-64 CPUs provide the AVX2 instruction set, so we can enjoy the performance of AVX2.

Model Save and Load

tomotopy provides save and load methods for each topic model class, so you can save a model to a file whenever you want and reload it from that file later.

import tomotopy as tp

mdl = tp.HDPModel()
with open('sample.txt') as f:
    for line in f:
        mdl.add_doc(line.strip().split())

for i in range(0, 100, 10):
    mdl.train(10)
    print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))

# save into file
mdl.save('sample_hdp_model.bin')

# load from file
mdl = tp.HDPModel.load('sample_hdp_model.bin')
for k in range(mdl.k):
    if not mdl.is_live_topic(k): continue
    print('Top 10 words of topic #{}'.format(k))
    print(mdl.get_topic_words(k, top_n=10))

# the saved model is an HDP model,
# so loading it with LDAModel raises an exception
mdl = tp.LDAModel.load('sample_hdp_model.bin')

When you load a model from a file, the model type stored in the file must match the class whose load method you call.

See more at tomotopy.LDAModel.save and tomotopy.LDAModel.load methods.
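If you do not know which model type a file holds, one option is to try several classes in turn. The helper below is a sketch of that idea: load_with_any is a hypothetical name, and because the text only says that "an exception" is raised on a mismatch, the helper catches Exception broadly, which is an assumption. FakeLDA and FakeHDP are stand-ins so that the example runs without tomotopy or a model file.

```python
def load_with_any(model_classes, path):
    """Return the first model that loads successfully, or None if all fail."""
    for cls in model_classes:
        try:
            return cls.load(path)
        except Exception:
            # Assumption: the exact exception type for a type mismatch is
            # not documented here, so any failure moves on to the next class.
            continue
    return None


# Stand-in classes that mimic a mismatching and a matching load method.
class FakeLDA:
    @classmethod
    def load(cls, path):
        raise RuntimeError("model type mismatch")


class FakeHDP:
    @classmethod
    def load(cls, path):
        return cls()


mdl = load_with_any([FakeLDA, FakeHDP], "sample_hdp_model.bin")
print(type(mdl).__name__)  # FakeHDP
```

With tomotopy, the call would presumably look like load_with_any([tp.LDAModel, tp.HDPModel], 'sample_hdp_model.bin'). Note that the broad except also hides unrelated failures such as a missing file, so a None result should be checked explicitly.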

Documents in the Model and out of the Model

A topic model can be used for two major purposes. The basic one is to discover topics in a set of documents by training a model on them, and the more advanced one is to infer topic distributions for unseen documents by using a trained model.

We call a document used for the former purpose (model training) a document in the model, and a document used for the latter purpose (unseen during training) a document out of the model.

In tomotopy, these two kinds of documents are created differently. A document in the model is created by the tomotopy.LDAModel.add_doc method. add_doc can be called only before tomotopy.LDAModel.train starts. In other words, once train has been called, add_doc cannot add a document to the model because the set of documents used for training has become fixed.

To get the instance of the created document, use tomotopy.LDAModel.docs like this:

mdl = tp.LDAModel(k=20)
idx = mdl.add_doc(words)
if idx < 0: raise RuntimeError("Failed to add doc")
doc_inst = mdl.docs[idx]
# doc_inst is an instance of the added document

A document out of the model is created by the tomotopy.LDAModel.make_doc method. make_doc can be called only after train starts. If you use make_doc before the set of documents used for training has become fixed, you may get wrong results. Since make_doc returns the instance directly, you can use its return value for other manipulations.

mdl = tp.LDAModel(k=20)
# add_doc ...
mdl.train(100)
doc_inst = mdl.make_doc(unseen_words) # doc_inst is an instance of the unseen document
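Because documents in the model and documents out of the model enter through different methods, it can help to split a corpus up front. The following sketch shows one way to do that in plain Python. The name split_corpus, the holdout_ratio parameter and the fixed seed are hypothetical choices for illustration and not part of tomotopy.

```python
import random


def split_corpus(documents, holdout_ratio=0.2, seed=0):
    """Shuffle the documents and split them into (training, unseen) lists."""
    if not 0.0 <= holdout_ratio < 1.0:
        raise ValueError("holdout_ratio must be in [0, 1)")
    shuffled = list(documents)  # copy so the caller's list is not modified
    random.Random(seed).shuffle(shuffled)
    n_unseen = int(len(shuffled) * holdout_ratio)
    return shuffled[n_unseen:], shuffled[:n_unseen]


corpus = [["doc", str(i)] for i in range(10)]
train_docs, unseen_docs = split_corpus(corpus, holdout_ratio=0.2)
print(len(train_docs), len(unseen_docs))  # 8 2
```

The token lists in train_docs would be passed to add_doc before train is called, and those in unseen_docs to make_doc afterwards.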

Inference for Unseen Documents

If a new document is created by tomotopy.LDAModel.make_doc, its topic distribution can be inferred by the model. Inference for an unseen document should be performed using the tomotopy.LDAModel.infer method.

mdl = tp.LDAModel(k=20)
# add_doc ...
mdl.train(100)
doc_inst = mdl.make_doc(unseen_words)
topic_dist, ll = mdl.infer(doc_inst)
print("Topic Distribution for Unseen Docs: ", topic_dist)
print("Log-likelihood of inference: ", ll)

The infer method accepts either a single instance of tomotopy.Document or a list of instances of tomotopy.Document. See more at tomotopy.LDAModel.infer.
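Once a topic distribution has been inferred, a typical next step is to pick out the dominant topics. The helper below is a small sketch that assumes topic_dist is a sequence of probabilities indexed by topic id, as the printed output in the example above suggests. The name top_topics and the example values are hypothetical.

```python
def top_topics(topic_dist, n=3):
    """Return the n most probable (topic_id, probability) pairs, highest first."""
    ranked = sorted(enumerate(topic_dist), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


# Hand-written distribution standing in for the topic_dist returned by infer.
example_dist = [0.05, 0.60, 0.10, 0.25]
print(top_topics(example_dist, n=2))  # [(1, 0.6), (3, 0.25)]
```

The topic ids returned here could then be passed to get_topic_words to see which words characterize the unseen document.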

License

tomotopy is licensed under the terms of the MIT License, meaning you can use it for any reasonable purpose and remain in complete ownership of all the documentation you produce.

