Tomoto, The Topic Modeling Tool for Python

Project description

What is tomotopy?

tomotopy is a Python extension of tomoto (Topic Modeling Tool), a Gibbs-sampling-based topic model library written in C++. It uses the vectorization (SIMD) capabilities of modern CPUs to maximize speed. The current version of tomoto supports several major topic models, including:

  • Latent Dirichlet Allocation (tomotopy.LDAModel),

  • Labeled LDA (tomotopy.LLDAModel),

  • Supervised LDA (tomotopy.SLDAModel),

  • Dirichlet Multinomial Regression (tomotopy.DMRModel),

  • Hierarchical Dirichlet Process (tomotopy.HDPModel),

  • Multi Grain LDA (tomotopy.MGLDAModel),

  • Pachinko Allocation (tomotopy.PAModel),

  • Hierarchical PA (tomotopy.HPAModel),

  • Correlated Topic Model (tomotopy.CTModel).

The most recent version of tomotopy is 0.3.1.

Getting Started

You can install tomotopy easily using pip. (https://pypi.org/project/tomotopy/)

$ pip install tomotopy

On Linux, gcc 5 or later is necessary to compile the C++14 code. After installation, you can start using tomotopy simply by importing it.

import tomotopy as tp
print(tp.isa) # prints 'avx2', 'avx', 'sse2' or 'none'

Currently, tomotopy can exploit the AVX2, AVX or SSE2 SIMD instruction sets to maximize performance. When the package is imported, it checks the available instruction sets and selects the best option. If tp.isa reports none, training iterations may take a long time. However, since most modern Intel and AMD CPUs provide a SIMD instruction set, SIMD acceleration is usually available and can give a large improvement.
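As a minimal sketch of how the value of tp.isa might be acted on at start-up, the helper below turns it into a short note. The function name simd_note and the wording of the messages are illustrative assumptions; only the four values 'avx2', 'avx', 'sse2' and 'none' come from the text above.

```python
# The four values that tp.isa can take according to the text above.
KNOWN_ISA = ("avx2", "avx", "sse2", "none")

def simd_note(isa):
    """Return a short human-readable note for a tp.isa value (hypothetical helper)."""
    if isa not in KNOWN_ISA:
        return "unknown instruction set: {!r}".format(isa)
    if isa == "none":
        return "no SIMD acceleration; training iterations may be slow"
    return "SIMD acceleration enabled ({})".format(isa.upper())

print(simd_note("avx2"))
print(simd_note("none"))
```

With tomotopy installed, the same helper could be called as simd_note(tp.isa) right after the import.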

Here is sample code for simple LDA training on texts from the ‘sample.txt’ file.

import tomotopy as tp
mdl = tp.LDAModel(k=20)
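with open('sample.txt') as f:  # the file is closed once all lines are read
    for line in f:
        # each line becomes one document of whitespace-separated words
        mdl.add_doc(line.strip().split())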

for i in range(0, 100, 10):
    mdl.train(10)
    print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))

for k in range(mdl.k):
    print('Top 10 words of topic #{}'.format(k))
    print(mdl.get_topic_words(k, top_n=10))
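The sample above treats every line of the file as one document and splits it on whitespace. A small sketch of that input step as a reusable generator follows; the name iter_tokenized_lines, the UTF-8 default and the choice to skip blank lines are assumptions made for illustration and are not part of tomotopy.

```python
import os
import tempfile

def iter_tokenized_lines(path, encoding="utf-8"):
    """Yield one whitespace-split token list per non-empty line of a text file."""
    with open(path, encoding=encoding) as f:
        for line in f:
            words = line.strip().split()
            if words:  # skip blank lines so no empty document is produced
                yield words

# Small demonstration on a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("apple banana apple\n\n  cherry   banana \n")
docs = list(iter_tokenized_lines(tmp.name))
os.remove(tmp.name)
print(docs)  # [['apple', 'banana', 'apple'], ['cherry', 'banana']]
```

Each yielded list could then be passed to mdl.add_doc in place of line.strip().split().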

Performance of tomotopy

tomotopy uses Collapsed Gibbs Sampling (CGS) to infer the distribution of topics and the distribution of words. Generally, CGS converges more slowly than the Variational Bayes (VB) that [gensim’s LdaModel] uses, but each iteration can be computed much faster. In addition, tomotopy can take advantage of multicore CPUs with a SIMD instruction set, which can result in faster iterations.

[gensim’s LdaModel]: https://radimrehurek.com/gensim/models/ldamodel.html
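To make the idea of collapsed Gibbs sampling concrete, here is a toy pure-Python sampler for LDA. It is only a sketch of the textbook algorithm and not tomotopy's C++ implementation; the function name toy_cgs_lda, the hyperparameter defaults and the sample documents are all assumptions chosen for illustration.

```python
import random

def toy_cgs_lda(docs, k, alpha=0.1, eta=0.01, iters=50, seed=0):
    """Run collapsed Gibbs sampling for LDA on a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    w2id = {w: i for i, w in enumerate(vocab)}
    v = len(vocab)
    n_dk = [[0] * k for _ in docs]      # topic counts per document
    n_kw = [[0] * v for _ in range(k)]  # word counts per topic
    n_k = [0] * k                       # total word count per topic
    z = []                              # topic assignment of every token

    # Random initial assignment of a topic to each token.
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            t = rng.randrange(k)
            zd.append(t)
            n_dk[d][t] += 1
            n_kw[t][w2id[w]] += 1
            n_k[t] += 1
        z.append(zd)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = w2id[w]
                t = z[d][i]
                # Remove the token's current assignment from the counts.
                n_dk[d][t] -= 1
                n_kw[t][wid] -= 1
                n_k[t] -= 1
                # Conditional probability of each topic given all other tokens.
                weights = [
                    (n_dk[d][j] + alpha) * (n_kw[j][wid] + eta) / (n_k[j] + v * eta)
                    for j in range(k)
                ]
                t = rng.choices(range(k), weights=weights)[0]
                # Record the newly sampled assignment.
                z[d][i] = t
                n_dk[d][t] += 1
                n_kw[t][wid] += 1
                n_k[t] += 1
    return n_dk, n_kw, vocab

sample_docs = [["cat", "dog", "cat"], ["stock", "bond"], ["cat", "stock", "bond", "bond"]]
doc_topic_counts, topic_word_counts, vocab = toy_cgs_lda(sample_docs, k=2)
print(doc_topic_counts)
```

Each sweep resamples one token at a time from counts that exclude that token, which is why a single iteration is cheap even though many iterations are needed.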

The following charts compare the running time of the LDA model between tomotopy and gensim. The input data consists of 1000 random documents from English Wikipedia with 1,506,966 words (about 10.1 MB). tomotopy was trained for 200 iterations and gensim for 10 iterations.

https://bab2min.github.io/tomotopy/images/tmt_i5.png

↑ Performance in Intel i5-6600, x86-64 (4 cores)

https://bab2min.github.io/tomotopy/images/tmt_xeon.png

↑ Performance in Intel Xeon E5-2620 v4, x86-64 (8 cores, 16 threads)

https://bab2min.github.io/tomotopy/images/tmt_r7_3700x.png

↑ Performance in AMD Ryzen7 3700X, x86-64 (8 cores, 16 threads)

Although tomotopy ran 20 times more iterations, its overall running time was 5 to 10 times faster than gensim's, and it yields a stable result.

It is difficult to compare CGS and VB directly because they are totally different techniques. From a practical point of view, however, we can compare their speed and results. The following chart shows the log-likelihood per word of the two models’ results.

https://bab2min.github.io/tomotopy/images/LLComp.png
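A log-likelihood per word is often easier to read after converting it to perplexity. The sketch below assumes the value is expressed in natural-log units, which the text above does not state; the function name perplexity_from_ll and the input value are made up for illustration.

```python
import math

def perplexity_from_ll(ll_per_word):
    """Convert a per-word log-likelihood (natural log assumed) to perplexity."""
    return math.exp(-ll_per_word)

# A made-up value; lower perplexity means the model fits the words better.
print(perplexity_from_ll(-8.0))
```

Under that assumption, a value such as mdl.ll_per_word could be passed in directly to compare two training runs on the same corpus.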

The SIMD instruction set has a great effect on performance. The following chart compares the SIMD instruction sets.

https://bab2min.github.io/tomotopy/images/SIMDComp.png

Fortunately, most recent x86-64 CPUs provide the AVX2 instruction set, so we can enjoy the performance of AVX2.

Model Save and Load

tomotopy provides save and load methods for each topic model class, so you can save a model to a file whenever you want and reload it from that file.

import tomotopy as tp

mdl = tp.HDPModel()
with open('sample.txt') as f:  # the file is closed once all lines are read
    for line in f:
        mdl.add_doc(line.strip().split())

for i in range(0, 100, 10):
    mdl.train(10)
    print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))

# save into file
mdl.save('sample_hdp_model.bin')

# load from file
mdl = tp.HDPModel.load('sample_hdp_model.bin')
for k in range(mdl.k):
    if not mdl.is_live_topic(k): continue
    print('Top 10 words of topic #{}'.format(k))
    print(mdl.get_topic_words(k, top_n=10))

# the saved model is HDP model,
# so when you load it by LDA model, it will raise an exception
mdl = tp.LDAModel.load('sample_hdp_model.bin')

When you load a model from a file, the model type stored in the file must match the class whose load method you call.

See more at tomotopy.LDAModel.save and tomotopy.LDAModel.load methods.
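When the type of a saved model is not known in advance, one possible approach is to try several classes in turn and keep the first one that loads. The helper below is a hypothetical sketch, not part of tomotopy: it relies only on the behaviour described above, that load raises an exception on a type mismatch, and it catches Exception broadly because the exact exception type is not specified here.

```python
def load_with_first_matching_class(path, model_classes):
    """Try each class's load() in turn and return the first model that loads.

    Assumes a mismatched class raises some exception from load(); the exact
    type is not specified in the text, so Exception is caught broadly.
    """
    errors = []
    for cls in model_classes:
        try:
            return cls.load(path)
        except Exception as exc:
            errors.append("{}: {}".format(cls.__name__, exc))
    raise ValueError(
        "no class could load {!r}: {}".format(path, "; ".join(errors))
    )
```

For the file saved above, a call such as load_with_first_matching_class('sample_hdp_model.bin', [tp.LDAModel, tp.HDPModel]) would be expected to skip LDAModel and return the model loaded by HDPModel.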

Documents in the Model and out of the Model

Topic models can be used for two major purposes. The basic one is to discover topics from a set of documents as the result of a trained model; the more advanced one is to infer topic distributions for unseen documents using a trained model.

We call a document used for the former purpose (model training) a document in the model, and a document used for the latter purpose (unseen during training) a document out of the model.

In tomotopy, these two kinds of documents are created differently. A document in the model is created by the tomotopy.LDAModel.add_doc method. add_doc can be called only before tomotopy.LDAModel.train starts. In other words, once train has been called, add_doc cannot add a document to the model because the set of documents used for training has become fixed.

To acquire the instance of the created document, use tomotopy.LDAModel.docs as follows:

mdl = tp.LDAModel(k=20)
idx = mdl.add_doc(words)
if idx < 0: raise RuntimeError("Failed to add doc")
doc_inst = mdl.docs[idx]
# doc_inst is an instance of the added document

A document out of the model is created by the tomotopy.LDAModel.make_doc method. make_doc can be called only after train starts. If you use make_doc before the set of documents used for training has become fixed, you may get wrong results. Since make_doc returns the instance directly, you can use its return value for other manipulations.

mdl = tp.LDAModel(k=20)
# add_doc ...
mdl.train(100)
doc_inst = mdl.make_doc(unseen_words) # doc_inst is an instance of the unseen document

Inference for Unseen Documents

If a new document is created by tomotopy.LDAModel.make_doc, its topic distribution can be inferred by the model. Inference for unseen documents should be performed using the tomotopy.LDAModel.infer method.

mdl = tp.LDAModel(k=20)
# add_doc ...
mdl.train(100)
doc_inst = mdl.make_doc(unseen_words)
topic_dist, ll = mdl.infer(doc_inst)
print("Topic Distribution for Unseen Docs: ", topic_dist)
print("Log-likelihood of inference: ", ll)

The infer method accepts either a single instance of tomotopy.Document or a list of instances of tomotopy.Document. See more at tomotopy.LDAModel.infer.
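Once a topic distribution has been inferred, a common next step is to pick out the dominant topics. The sketch below assumes the distribution is a sequence of per-topic probabilities indexed by topic id; the helper name top_topics and the example values are made up for illustration.

```python
def top_topics(topic_dist, n=3):
    """Return the n (topic_id, probability) pairs with the highest probability."""
    ranked = sorted(enumerate(topic_dist), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

example_dist = [0.05, 0.60, 0.10, 0.25]  # made-up values for illustration
print(top_topics(example_dist, n=2))  # [(1, 0.6), (3, 0.25)]
```

The returned topic ids could then be passed to get_topic_words to show the words that characterize the unseen document.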

Examples

You can find example Python code for tomotopy at https://github.com/bab2min/tomotopy/blob/master/example.py .

You can also get the data file used in the example code at https://drive.google.com/file/d/18OpNijd4iwPyYZ2O7pQoPyeTAKEXa71J/view .

License

tomotopy is licensed under the terms of MIT License, meaning you can use it for any reasonable purpose and remain in complete ownership of all the documentation you produce.

History

  • 0.3.1 (2019-11-05)
    • An issue where get_topic_dist() returned an incorrect value when min_cf or rm_top was set was fixed.

    • The return value of get_topic_dist() of tomotopy.MGLDAModel document was fixed to include local topics.

    • The estimation speed with tw=ONE was improved.

  • 0.3.0 (2019-10-06)
    • A new model, tomotopy.LLDAModel was added into the package.

    • A crashing issue of HDPModel was fixed.

    • Since hyperparameter estimation for HDPModel was implemented, the result of HDPModel may differ from previous versions.

      If you want to turn off hyperparameter estimation of HDPModel, set optim_interval to zero.

  • 0.2.0 (2019-08-18)
    • New models including tomotopy.CTModel and tomotopy.SLDAModel were added into the package.

    • A new parameter option rm_top was added for all topic models.

    • Problems in the save and load methods of PAModel and HPAModel were fixed.

    • An occasional crash when loading HDPModel was fixed.

    • The problem that ll_per_word was calculated incorrectly when min_cf > 0 was fixed.

  • 0.1.6 (2019-08-09)
    • Compile errors with clang on macOS were fixed.

  • 0.1.4 (2019-08-05)
    • An issue that occurred when add_doc received an empty list as input was fixed.

    • An issue where tomotopy.PAModel.get_topic_words did not extract the word distribution of subtopics was fixed.

  • 0.1.3 (2019-05-19)
    • The parameter min_cf and its stopword-removing function were added for all topic models.

  • 0.1.0 (2019-05-12)
    • First version of tomotopy
