Tomoto, The Topic Modeling Tool for Python

What is tomotopy?

tomotopy is a Python extension of tomoto (Topic Modeling Tool), a Gibbs-sampling-based topic model library written in C++. It uses the vector (SIMD) instructions of modern CPUs to maximize speed. The current version of tomoto supports several major topic models, including

  • Latent Dirichlet Allocation (tomotopy.LDAModel),

  • Supervised Latent Dirichlet Allocation (tomotopy.SLDAModel),

  • Dirichlet Multinomial Regression (tomotopy.DMRModel),

  • Hierarchical Dirichlet Process (tomotopy.HDPModel),

  • Multi Grain LDA (tomotopy.MGLDAModel),

  • Pachinko Allocation (tomotopy.PAModel),

  • Hierarchical PA (tomotopy.HPAModel),

  • Correlated Topic Model (tomotopy.CTModel).

The most recent version of tomotopy is 0.2.0.

Getting Started

You can install tomotopy easily using pip. (https://pypi.org/project/tomotopy/)

$ pip install tomotopy

On Linux, gcc 5 or later is required to compile the C++14 code. After installation, you can start using tomotopy by simply importing it.

import tomotopy as tp
print(tp.isa) # prints 'avx2', 'avx', 'sse2' or 'none'

Currently, tomotopy can exploit the AVX2, AVX or SSE2 SIMD instruction sets to maximize performance. When the package is imported, it checks which instruction sets are available and selects the best option. If tp.isa reports none, training iterations may take a long time. However, since most modern Intel and AMD CPUs provide a SIMD instruction set, SIMD acceleration can bring a big improvement on typical machines.
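
The check described above can be turned into a small start-up message. The following is only a sketch: isa_message is a hypothetical helper, not part of tomotopy, and it assumes that tp.isa holds one of the four strings listed above. The import is guarded so that the snippet also runs where tomotopy is not installed.

```python
def isa_message(isa):
    """Return a short note for a value of tp.isa (hypothetical helper)."""
    if isa == "none":
        return "No SIMD acceleration; training iterations may be slow"
    return "SIMD acceleration enabled: " + isa

try:
    import tomotopy as tp
    print(isa_message(tp.isa))
except ImportError:
    # tomotopy is not installed in this environment
    print("tomotopy is not installed")
```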

Here is sample code for simple LDA training on texts from the file ‘sample.txt’.

import tomotopy as tp
mdl = tp.LDAModel(k=20)
for line in open('sample.txt'):
    mdl.add_doc(line.strip().split())

for i in range(0, 100, 10):
    mdl.train(10)
    print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))

for k in range(mdl.k):
    print('Top 10 words of topic #{}'.format(k))
    print(mdl.get_topic_words(k, top_n=10))
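
The sample above splits each line on whitespace only, so punctuation and letter case stay attached to the words. A slightly more careful tokenizer might look like the sketch below. Both tokenize and its stopwords parameter are hypothetical and not part of tomotopy, and the regular expression assumes plain ASCII English text.

```python
import re

def tokenize(line, stopwords=frozenset()):
    """Lower-case a line and split it into alphabetic word tokens."""
    words = re.findall(r"[a-z]+", line.lower())
    # Drop any token listed in the (optional) stopword set
    return [w for w in words if w not in stopwords]

tokens = tokenize("The quick brown fox, the lazy dog!", stopwords={"the"})
print(tokens)  # ['quick', 'brown', 'fox', 'lazy', 'dog']
```

The resulting list of words could then be passed to add_doc in place of line.strip().split().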

Performance of tomotopy

tomotopy uses Collapsed Gibbs Sampling (CGS) to infer the distribution of topics and the distribution of words. Generally, CGS converges more slowly than the Variational Bayes (VB) method that [gensim’s LdaModel] uses, but each of its iterations can be computed much faster. In addition, tomotopy can take advantage of multicore CPUs with a SIMD instruction set, which can result in faster iterations.

[gensim’s LdaModel]: https://radimrehurek.com/gensim/models/ldamodel.html
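
To give a feel for what one CGS iteration does, here is a toy collapsed Gibbs sampler for LDA in pure Python. It is a textbook-style sketch and not tomotopy's implementation: the function name toy_lda_cgs, the hyperparameter defaults (alpha, eta) and the tiny corpus are all assumptions made for illustration, and it has none of the multicore or SIMD optimizations described above.

```python
import random

def toy_lda_cgs(docs, k, iterations=50, alpha=0.1, eta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over lists of word strings."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    v = len(vocab)

    doc_topic = [[0] * k for _ in docs]       # tokens of doc d assigned to topic t
    topic_word = [[0] * v for _ in range(k)]  # times word w is assigned to topic t
    topic_total = [0] * k                     # tokens assigned to topic t overall
    assignments = []

    # Give every token a random initial topic and count it
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(k)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                wid = word_id[w]
                z = assignments[d][n]
                # Remove the token's current assignment from the counts
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                # Conditional weight of topic t, proportional to
                # (n_dt + alpha) * (n_tw + eta) / (n_t + V * eta)
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + eta)
                    / (topic_total[t] + v * eta)
                    for t in range(k)
                ]
                # Sample a new topic and add the token back to the counts
                z = rng.choices(range(k), weights=weights)[0]
                assignments[d][n] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1
    return vocab, doc_topic, topic_word

docs = [
    ["apple", "banana", "apple", "fruit"],
    ["goal", "match", "team", "goal"],
    ["fruit", "banana", "juice"],
    ["team", "player", "match"],
]
vocab, doc_topic, topic_word = toy_lda_cgs(docs, k=2)
for d, counts in enumerate(doc_topic):
    print("doc", d, "topic counts:", counts)
```

Each iteration visits every token once and only updates a few counters, which is why a single CGS iteration is cheap even though many iterations are needed.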

The following chart compares the running time of LDA training between tomotopy and gensim. The input data consists of 1000 random documents from English Wikipedia with 1,506,966 words (about 10.1 MB). tomotopy was trained for 200 iterations and gensim for 10 iterations.

https://bab2min.github.io/tomotopy/images/tmt_i5.png

Performance on Intel i5-6600, x86-64 (4 cores)

https://bab2min.github.io/tomotopy/images/tmt_xeon.png

Performance on Intel Xeon E5-2620 v4, x86-64 (8 cores, 16 threads)

Although tomotopy ran 20 times more iterations, its overall running time was 5 to 10 times shorter than that of gensim, and it yields a stable result.

It is difficult to compare CGS and VB directly because they are totally different techniques. From a practical point of view, however, we can compare their speed and their results. The following chart shows the log-likelihood per word of the two models’ results.

https://bab2min.github.io/tomotopy/images/LLComp.png
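
For readers who prefer perplexity to log-likelihood per word, the two are related by a simple transformation. The sketch below assumes that the per-word value is the total log-likelihood (natural logarithm) divided by the number of words. Both helper functions are hypothetical and the numbers are illustrative only, not values read from the chart.

```python
import math

def ll_per_word(total_log_likelihood, n_words):
    """Average log-likelihood per word (natural log assumed)."""
    return total_log_likelihood / n_words

def perplexity(ll_per_word_value):
    """Perplexity is the exponential of the negative per-word log-likelihood."""
    return math.exp(-ll_per_word_value)

# Illustrative numbers only
value = ll_per_word(-8000.0, 1000)
print(value, perplexity(value))
```

A higher (less negative) log-likelihood per word corresponds to a lower perplexity.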

The SIMD instruction set has a great effect on performance. The following chart compares the SIMD instruction sets.

https://bab2min.github.io/tomotopy/images/SIMDComp.png

Fortunately, most recent x86-64 CPUs provide the AVX2 instruction set, so we can enjoy the performance of AVX2.

Model Save and Load

tomotopy provides save and load methods for each topic model class, so you can save a model to a file whenever you want and reload it from that file.

import tomotopy as tp

mdl = tp.HDPModel()
for line in open('sample.txt'):
    mdl.add_doc(line.strip().split())

for i in range(0, 100, 10):
    mdl.train(10)
    print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))

# save into file
mdl.save('sample_hdp_model.bin')

# load from file
mdl = tp.HDPModel.load('sample_hdp_model.bin')
for k in range(mdl.k):
    if not mdl.is_live_topic(k): continue
    print('Top 10 words of topic #{}'.format(k))
    print(mdl.get_topic_words(k, top_n=10))

# the saved model is HDP model,
# so when you load it by LDA model, it will raise an exception
mdl = tp.LDAModel.load('sample_hdp_model.bin')

When you load a model from a file, the model type stored in the file must match the class whose load method you call.

See more at tomotopy.LDAModel.save and tomotopy.LDAModel.load methods.
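
If the type of a saved model is not known in advance, one option is to try several load methods in turn and keep the first one that succeeds. The sketch below assumes only what is stated above, namely that loading with the wrong class raises an exception. Since the exception type is not specified here, it catches Exception broadly. load_with_first_match is a hypothetical helper, and the two stand-in loaders merely imitate tp.LDAModel.load and tp.HDPModel.load so that the snippet runs without tomotopy.

```python
def load_with_first_match(path, loaders):
    """Try each (name, load_function) pair in order; return the first success."""
    errors = []
    for name, load in loaders:
        try:
            return name, load(path)
        except Exception as exc:  # exception type is an assumption, so catch broadly
            errors.append("{}: {}".format(name, exc))
    raise RuntimeError("No loader accepted {!r}: {}".format(path, "; ".join(errors)))

# Stand-in loaders for demonstration; with tomotopy installed one would pass
# [("LDA", tp.LDAModel.load), ("HDP", tp.HDPModel.load)] instead.
def fake_lda_load(path):
    raise ValueError("model type mismatch")

def fake_hdp_load(path):
    return "hdp-model-from-" + path

name, model = load_with_first_match(
    "sample_hdp_model.bin", [("LDA", fake_lda_load), ("HDP", fake_hdp_load)]
)
print(name)  # HDP
```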

Documents in the Model and out of the Model

A topic model can be used for two major purposes. The basic one is to discover topics from a set of documents by training a model, and the more advanced one is to infer topic distributions for unseen documents using a trained model.

We call a document used for the former purpose (model training) a document in the model, and a document used for the latter purpose (unseen during training) a document out of the model.

In tomotopy, these two kinds of documents are created differently. A document in the model is created by the tomotopy.LDAModel.add_doc method. add_doc can be called only before tomotopy.LDAModel.train starts. In other words, once train has been called, add_doc cannot add a document to the model because the set of documents used for training has become fixed.

To acquire the instance of the created document, use tomotopy.LDAModel.docs as follows:

mdl = tp.LDAModel(k=20)
idx = mdl.add_doc(words)
if idx < 0: raise RuntimeError("Failed to add doc")
doc_inst = mdl.docs[idx]
# doc_inst is an instance of the added document

A document out of the model is created by the tomotopy.LDAModel.make_doc method. make_doc should be called only after train starts. If you call make_doc before the set of documents used for training has become fixed, you may get wrong results. Since make_doc returns the instance directly, you can use its return value for other manipulations.

mdl = tp.LDAModel(k=20)
# add_doc ...
mdl.train(100)
doc_inst = mdl.make_doc(unseen_words) # doc_inst is an instance of the unseen document
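
In practice, the two kinds of documents often come from the same corpus: most documents are added with add_doc and a small part is set aside to be passed to make_doc after training. A minimal sketch of such a split follows. split_corpus, its held_out_ratio parameter and the toy corpus are hypothetical and not part of tomotopy.

```python
import random

def split_corpus(docs, held_out_ratio=0.1, seed=0):
    """Shuffle a list of tokenized documents and split it into two lists."""
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)
    n_held_out = int(len(shuffled) * held_out_ratio)
    # First list: documents for add_doc; second list: documents for make_doc
    return shuffled[n_held_out:], shuffled[:n_held_out]

corpus = [["word{}".format(i), "topic"] for i in range(20)]
train_docs, unseen_docs = split_corpus(corpus, held_out_ratio=0.2)
print(len(train_docs), len(unseen_docs))  # 16 4
```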

Inference for Unseen Documents

If a new document is created by tomotopy.LDAModel.make_doc, its topic distribution can be inferred by the model. Inference for unseen documents should be performed using the tomotopy.LDAModel.infer method.

mdl = tp.LDAModel(k=20)
# add_doc ...
mdl.train(100)
doc_inst = mdl.make_doc(unseen_words)
topic_dist, ll = mdl.infer(doc_inst)
print("Topic Distribution for Unseen Docs: ", topic_dist)
print("Log-likelihood of inference: ", ll)

The infer method accepts either a single instance of tomotopy.Document or a list of instances of tomotopy.Document. See more at tomotopy.LDAModel.infer.
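
The inferred topic distribution is often easier to read when reduced to its most probable topics. The sketch below assumes that topic_dist for a single document is a sequence of k probabilities indexed by topic. top_topics is a hypothetical helper, and the example distribution is made up rather than produced by a model.

```python
def top_topics(topic_dist, n=3):
    """Return the n most probable (topic_id, probability) pairs."""
    ranked = sorted(enumerate(topic_dist), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

# Made-up distribution over 4 topics, standing in for the output of infer
example_dist = [0.05, 0.60, 0.10, 0.25]
best = top_topics(example_dist, n=2)
print(best)  # [(1, 0.6), (3, 0.25)]
```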

License

tomotopy is licensed under the terms of the MIT License, meaning you can use it for any reasonable purpose and remain in complete ownership of all the documentation you produce.

History

  • 0.2.0 (2019-08-18)
    • New models including tomotopy.CTModel and tomotopy.SLDAModel were added into the package.

    • A new parameter option rm_top was added for all topic models.

    • The problems in the save and load methods of PAModel and HPAModel were fixed.

    • An occasional crash in loading HDPModel was fixed.

    • The problem that ll_per_word was calculated incorrectly when min_cf > 0 was fixed.

  • 0.1.6 (2019-08-09)
    • Compile errors with clang in the macOS environment were fixed.

  • 0.1.4 (2019-08-05)
    • The issue when add_doc receives an empty list as input was fixed.

    • The issue that tomotopy.PAModel.get_topic_words doesn’t extract the word distribution of subtopics was fixed.

  • 0.1.3 (2019-05-19)
    • The parameter min_cf and its stopword-removing function were added for all topic models.

  • 0.1.0 (2019-05-12)
    • First version of tomotopy
