Tomoto, The Topic Modeling Tool for Python

What is tomotopy?

tomotopy is a Python extension of tomoto (Topic Modeling Tool), a Gibbs-sampling-based topic model library written in C++. It uses the vectorization capabilities of modern CPUs to maximize speed. The current version of tomoto supports several major topic models, including Latent Dirichlet Allocation (tomotopy.LDAModel), Dirichlet Multinomial Regression (tomotopy.DMRModel), Hierarchical Dirichlet Process (tomotopy.HDPModel), Multi Grain LDA (tomotopy.MGLDAModel), Pachinko Allocation (tomotopy.PAModel) and Hierarchical PA (tomotopy.HPAModel).

Getting Started

You can install tomotopy easily using pip.

$ pip install tomotopy

On Linux, gcc 5 or later is necessary to compile the C++14 code. After installation, you can start using tomotopy simply by importing it.

import tomotopy as tp
print(tp.isa) # prints 'avx2', 'avx', 'sse2' or 'none'

Currently, tomotopy can exploit the AVX2, AVX or SSE2 SIMD instruction sets to maximize performance. When the package is imported, it checks the available instruction sets and selects the best option. If tp.isa reports none, training iterations may take a long time. However, since most modern Intel and AMD CPUs provide a SIMD instruction set, SIMD acceleration can usually deliver a big improvement.
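As a small illustration of how the value of tp.isa might be interpreted at startup, here is a minimal sketch. The helper name describe_isa is hypothetical and not part of tomotopy; it relies only on the four values listed above ('avx2', 'avx', 'sse2', 'none').

```python
def describe_isa(isa):
    """Return a short note for an instruction-set string such as tp.isa.

    Hypothetical helper for illustration; not part of tomotopy.
    """
    simd_levels = ('avx2', 'avx', 'sse2')
    if isa in simd_levels:
        return 'SIMD acceleration enabled ({})'.format(isa.upper())
    if isa == 'none':
        return 'No SIMD acceleration: training iterations may be slow'
    return 'Unrecognized instruction set value: {!r}'.format(isa)


# With tomotopy installed, this would be called as describe_isa(tp.isa).
print(describe_isa('avx2'))
print(describe_isa('none'))
```

Keeping the check in a plain function means it can be exercised without tomotopy installed, for example in a deployment script that logs a warning before a long training run.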

Here is sample code for simple LDA training on texts from the ‘sample.txt’ file.

import tomotopy as tp
mdl = tp.LDAModel(k=20)
for line in open('sample.txt'):
    mdl.add_doc(line.strip().split())

for i in range(100):
    mdl.train()
    print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))

for k in range(mdl.k):
    print('Top 10 words of topic #{}'.format(k))
    print(mdl.get_topic_words(k, top_n=10))

Performance of tomotopy

tomotopy uses Collapsed Gibbs Sampling (CGS) to infer the distribution of topics and the distribution of words. Generally, CGS converges more slowly than the Variational Bayes (VB) that [gensim’s LdaModel] uses, but each of its iterations can be computed much faster. In addition, tomotopy can take advantage of multicore CPUs with a SIMD instruction set, which can result in faster iterations.

[gensim’s LdaModel]: https://radimrehurek.com/gensim/models/ldamodel.html
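To make the idea of collapsed Gibbs sampling concrete, here is a toy, pure-Python sketch of a CGS loop for LDA. It is not tomotopy's implementation (which is written in C++ and uses SIMD and multiple cores); the function name lda_collapsed_gibbs and the hyperparameter defaults alpha and eta are assumptions chosen for illustration.

```python
import random


def lda_collapsed_gibbs(docs, k, alpha=0.1, eta=0.01, iterations=50, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenized documents.

    Illustrative only; returns (vocab, doc_topic counts, topic_word counts).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    v = len(vocab)

    doc_topic = [[0] * k for _ in docs]        # tokens of doc d assigned to topic t
    topic_word = [[0] * v for _ in range(k)]   # times word w is assigned to topic t
    topic_total = [0] * k                      # total tokens assigned to topic t
    assignments = []                           # current topic of every token

    # Random initialization of topic assignments.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            t = rng.randrange(k)
            z_d.append(t)
            doc_topic[d][t] += 1
            topic_word[t][word_id[w]] += 1
            topic_total[t] += 1
        assignments.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                t = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][t] -= 1
                topic_word[t][wid] -= 1
                topic_total[t] -= 1
                # Conditional probability of each topic given all other tokens.
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][wid] + eta)
                    / (topic_total[j] + v * eta)
                    for j in range(k)
                ]
                # Sample a new topic and add the token back.
                t = rng.choices(range(k), weights=weights)[0]
                assignments[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][wid] += 1
                topic_total[t] += 1

    return vocab, doc_topic, topic_word


toy_docs = [['apple', 'banana', 'apple'], ['cpu', 'gpu'], ['apple', 'cpu', 'gpu', 'gpu']]
vocab, doc_topic, topic_word = lda_collapsed_gibbs(toy_docs, k=2, iterations=20)
print(doc_topic)
```

Each iteration only touches integer counts and a short list of weights per token, which is one reason a single CGS iteration is cheap compared with a VB update.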

The following chart compares the running time of the LDA model between tomotopy and gensim. The input data consists of 1000 random documents from English Wikipedia with 1,506,966 words (about 10.1 MB). tomotopy trains for 200 iterations and gensim trains for 10 iterations.

https://bab2min.github.io/tomotopy/images/tmt_i5.png

Performance in Intel i5-6600, x86-64 (4 cores)

https://bab2min.github.io/tomotopy/images/tmt_xeon.png

Performance in Intel Xeon E5-2620 v4, x86-64 (8 cores, 16 threads)

Although tomotopy ran 20 times as many iterations, its overall running time was 5 to 10 times shorter than gensim's, and it yields a stable result.
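The figures above imply a rough per-iteration comparison. The following sketch only rearranges the numbers quoted in the text (200 versus 10 iterations, a 5 to 10 times shorter total running time); it is back-of-the-envelope arithmetic, not an additional measurement, and the function name is hypothetical.

```python
def per_iteration_speedup(total_speedup, fast_iterations, slow_iterations):
    """Implied per-iteration speedup when the faster tool also runs more iterations."""
    return total_speedup * fast_iterations / slow_iterations


# tomotopy: 200 iterations, gensim: 10 iterations, total time 5 to 10 times shorter.
for total in (5, 10):
    print('total {}x -> per iteration about {:.0f}x'.format(
        total, per_iteration_speedup(total, 200, 10)))
```

Because a CGS iteration and a VB iteration do different amounts of work toward convergence, this ratio says nothing about model quality on its own.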

It is difficult to compare CGS and VB directly because they are totally different techniques. But from a practical point of view, we can compare their speed and results. The following chart shows the log-likelihood per word of the two models’ results.

https://bab2min.github.io/tomotopy/images/LLComp.png

The SIMD instruction set has a great effect on performance. The following is a comparison between SIMD instruction sets.

https://bab2min.github.io/tomotopy/images/SIMDComp.png

Fortunately, most recent x86-64 CPUs provide the AVX2 instruction set, so we can enjoy the performance of AVX2.

Model Save and Load

tomotopy provides save and load methods for each topic model class, so you can save a model to a file whenever you want and reload it from that file.

import tomotopy as tp

mdl = tp.HDPModel()
for line in open('sample.txt'):
    mdl.add_doc(line.strip().split())

for i in range(100):
    mdl.train()
    print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))

# save into file
mdl.save('sample_hdp_model.bin')

# load from file
mdl = tp.HDPModel.load('sample_hdp_model.bin')
for k in range(mdl.k):
    if not mdl.is_live_topic(k): continue
    print('Top 10 words of topic #{}'.format(k))
    print(mdl.get_topic_words(k, top_n=10))

# the saved model is an HDP model,
# so loading it with the LDA model class will raise an exception
mdl = tp.LDAModel.load('sample_hdp_model.bin')

When you load a model from a file, the model type stored in the file must match the class whose load method you call.

See more at tomotopy.LDAModel.save and tomotopy.LDAModel.load methods.
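If the type of a saved model is not known in advance, one possible approach is to try several classes in turn. The sketch below assumes only what is stated above: each model class has a load method, and loading with the wrong class raises an exception. The helper name load_with_any is hypothetical, and because the exact exception type is not specified here, it catches Exception broadly.

```python
def load_with_any(path, model_classes):
    """Return the first model that one of model_classes can load from path.

    Hypothetical helper for illustration; not part of tomotopy.
    """
    errors = []
    for cls in model_classes:
        try:
            return cls.load(path)
        except Exception as exc:  # exact exception type is an unknown here
            errors.append('{}: {}'.format(cls.__name__, exc))
    raise ValueError('no model class could load {!r}: {}'.format(
        path, '; '.join(errors)))


# With tomotopy installed, usage might look like:
#   import tomotopy as tp
#   mdl = load_with_any('sample_hdp_model.bin', [tp.LDAModel, tp.HDPModel])
```

The order of the list matters only if more than one class accepts the same file; listing the most likely class first avoids unnecessary failed attempts.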

Documents in the Model and out of the Model

Inference for Unseen Documents

License

tomotopy is licensed under the terms of the MIT License, meaning you can use it for any reasonable purpose and remain in complete ownership of all the documentation you produce.

