Tomoto, The Topic Modeling Tool for Python
What is tomotopy?
tomotopy is a Python extension of tomoto (Topic Modeling Tool), a Gibbs-sampling based topic model library written in C++. It uses the vectorization capabilities of modern CPUs to maximize speed. The current version of tomoto supports several major topic models, including
Latent Dirichlet Allocation (tomotopy.LDAModel),
Labeled LDA (tomotopy.LLDAModel),
Supervised LDA (tomotopy.SLDAModel),
Dirichlet Multinomial Regression (tomotopy.DMRModel),
Hierarchical Dirichlet Process (tomotopy.HDPModel),
Multi Grain LDA (tomotopy.MGLDAModel),
Pachinko Allocation (tomotopy.PAModel),
Hierarchical PA (tomotopy.HPAModel),
Correlated Topic Model (tomotopy.CTModel).
The most recent version of tomotopy is 0.3.1.
Getting Started
You can install tomotopy easily using pip. (https://pypi.org/project/tomotopy/)
$ pip install tomotopy
On Linux, gcc 5 or later is necessary to compile the C++14 code. After installing, you can start using tomotopy by simply importing it.
import tomotopy as tp
print(tp.isa) # prints 'avx2', 'avx', 'sse2' or 'none'
Currently, tomotopy can exploit the AVX2, AVX or SSE2 SIMD instruction sets to maximize performance. When the package is imported, it checks the available instruction sets and selects the best option. If tp.isa reports none, training iterations may take a long time. However, since most modern Intel and AMD CPUs provide a SIMD instruction set, SIMD acceleration can be expected to give a large improvement on most machines.
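As a small illustration of how the value of tp.isa might be checked at start-up, here is a minimal sketch. The helper name describe_isa is hypothetical and not part of tomotopy; only tp.isa comes from the library. The import is guarded so that the snippet also runs on a machine where tomotopy is not installed.

```python
def describe_isa(isa):
    """Return a short, human-readable note for a value of tp.isa."""
    if isa == "none":
        return "no SIMD acceleration: training iterations may be slow"
    return "SIMD acceleration enabled ({})".format(isa)


try:
    import tomotopy as tp  # only needed for the live check below
    print(describe_isa(tp.isa))
except ImportError:
    # tomotopy is not installed here; show the two possible outcomes instead.
    print(describe_isa("avx2"))
    print(describe_isa("none"))
```

A check like this could be placed before a long training run, so that an unexpectedly slow configuration is noticed early.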
Here is sample code for simple LDA training on texts from the ‘sample.txt’ file.
import tomotopy as tp

mdl = tp.LDAModel(k=20)
for line in open('sample.txt'):
    mdl.add_doc(line.strip().split())

for i in range(0, 100, 10):
    mdl.train(10)
    print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))

for k in range(mdl.k):
    print('Top 10 words of topic #{}'.format(k))
    print(mdl.get_topic_words(k, top_n=10))
Performance of tomotopy
tomotopy uses Collapsed Gibbs Sampling (CGS) to infer the distribution of topics and the distribution of words. Generally, CGS converges more slowly than the Variational Bayes (VB) that [gensim’s LdaModel] uses, but each of its iterations can be computed much faster. In addition, tomotopy can take advantage of multicore CPUs with a SIMD instruction set, which can result in faster iterations.
[gensim’s LdaModel]: https://radimrehurek.com/gensim/models/ldamodel.html
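To make the idea of a CGS iteration concrete, here is a minimal pure-Python sketch of collapsed Gibbs sampling for LDA. It is a textbook-style illustration, not tomotopy's actual implementation (which is written in C++ and uses multithreading and SIMD). The function name and the hyperparameter names alpha and eta are assumptions chosen for this example.

```python
import random


def collapsed_gibbs_lda(docs, k, alpha=0.1, eta=0.01, iterations=50, seed=0):
    """Toy collapsed Gibbs sampler for LDA over lists of tokens."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    v = len(vocab)

    n_dk = [[0] * k for _ in docs]      # topic counts per document
    n_kw = [[0] * v for _ in range(k)]  # word counts per topic
    n_k = [0] * k                       # total token count per topic
    z = []                              # topic assignment of every token

    # Random initialisation of the topic assignments.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            t = rng.randrange(k)
            z_d.append(t)
            n_dk[d][t] += 1
            n_kw[t][word_id[w]] += 1
            n_k[t] += 1
        z.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                t = z[d][i]
                # Remove the token's current assignment from the counts.
                n_dk[d][t] -= 1
                n_kw[t][wid] -= 1
                n_k[t] -= 1
                # Unnormalised conditional probability of each topic.
                weights = [
                    (n_dk[d][j] + alpha) * (n_kw[j][wid] + eta) / (n_k[j] + v * eta)
                    for j in range(k)
                ]
                # Sample a new topic and put the token back into the counts.
                t = rng.choices(range(k), weights=weights)[0]
                z[d][i] = t
                n_dk[d][t] += 1
                n_kw[t][wid] += 1
                n_k[t] += 1
    return n_dk, n_kw, vocab


docs = [["apple", "banana", "apple"], ["car", "road", "car", "road"]]
n_dk, n_kw, vocab = collapsed_gibbs_lda(docs, k=2, iterations=20)
print(n_dk)  # per-document topic counts after sampling
```

Each iteration only updates integer counts and draws one sample per token, which is why a single CGS iteration is cheap compared with a VB update, even though more iterations are usually needed.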
The following charts compare the running time of LDA training between tomotopy and gensim. The input data consists of 1000 random documents from English Wikipedia with 1,506,966 words (about 10.1 MB). tomotopy ran 200 iterations and gensim ran 10 iterations.
↑ Performance in Intel i5-6600, x86-64 (4 cores)
↑ Performance in Intel Xeon E5-2620 v4, x86-64 (8 cores, 16 threads)
↑ Performance in AMD Ryzen7 3700X, x86-64 (8 cores, 16 threads)
Although tomotopy ran 20 times more iterations, it was 5 to 10 times faster than gensim in overall running time, and it yields a stable result.
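The figures quoted above also imply a rough per-iteration comparison. The following sketch only does the arithmetic on those quoted numbers; it is not a new measurement, and since a CGS iteration and a VB iteration do different work, the result should be read as an order-of-magnitude indication only.

```python
# Iteration counts and overall speedup range quoted in the text above.
tomotopy_iterations = 200
gensim_iterations = 10
overall_speedup = (5, 10)  # tomotopy vs. gensim, total running time

# tomotopy performed this many times more iterations in total.
iteration_ratio = tomotopy_iterations / gensim_iterations

# Implied speedup of a single tomotopy iteration over a single gensim iteration.
per_iteration_speedup = tuple(s * iteration_ratio for s in overall_speedup)
print(iteration_ratio, per_iteration_speedup)  # 20.0 (100.0, 200.0)
```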
It is difficult to compare CGS and VB directly because they are totally different techniques. From a practical point of view, however, we can compare their speed and their results. The following chart shows the log-likelihood per word of the two models’ results.
The SIMD instruction set has a great effect on performance. The following is a comparison between SIMD instruction sets.
Fortunately, most recent x86-64 CPUs provide the AVX2 instruction set, so we can enjoy the performance of AVX2.
Model Save and Load
tomotopy provides save and load methods for each topic model class, so you can save a model to a file whenever you want and reload it from that file.
import tomotopy as tp

mdl = tp.HDPModel()
for line in open('sample.txt'):
    mdl.add_doc(line.strip().split())

for i in range(0, 100, 10):
    mdl.train(10)
    print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))

# save into file
mdl.save('sample_hdp_model.bin')

# load from file
mdl = tp.HDPModel.load('sample_hdp_model.bin')
for k in range(mdl.k):
    if not mdl.is_live_topic(k): continue
    print('Top 10 words of topic #{}'.format(k))
    print(mdl.get_topic_words(k, top_n=10))

# the saved model is HDP model,
# so when you load it by LDA model, it will raise an exception
mdl = tp.LDAModel.load('sample_hdp_model.bin')
When you load a model from a file, the model type stored in the file must match the class whose load method you call.
See more at tomotopy.LDAModel.save and tomotopy.LDAModel.load methods.
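If the type of a saved model is not known in advance, one possible approach is to try several classes in turn and keep the first one whose load call succeeds. The sketch below assumes only what is stated above, namely that a mismatched load raises an exception. The helper name load_first_matching is hypothetical, and because the exact exception type is not specified here, it catches Exception broadly, which also hides unrelated errors such as a missing file.

```python
def load_first_matching(model_classes, path):
    """Return the first model that loads from `path`, or None if none does."""
    for cls in model_classes:
        try:
            return cls.load(path)
        except Exception:
            # Assumed to signal a model type mismatch; try the next class.
            continue
    return None


# Intended usage with tomotopy (not executed here, it needs a saved model file):
#   import tomotopy as tp
#   mdl = load_first_matching([tp.LDAModel, tp.HDPModel], 'sample_hdp_model.bin')
#   print(type(mdl))
```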
Documents in the Model and out of the Model
We can use a topic model for two major purposes. The basic one is to discover topics from a set of documents as the result of a trained model, and the more advanced one is to infer topic distributions for unseen documents by using a trained model.
We call a document used for the former purpose (model training) a document in the model, and a document used for the latter purpose (unseen during training) a document out of the model.
In tomotopy, these two kinds of document are created differently. A document in the model is created by the tomotopy.LDAModel.add_doc method. add_doc can be called only before tomotopy.LDAModel.train starts. In other words, once train has been called, add_doc cannot add a document to the model because the set of documents used for training has become fixed.
To acquire the instance of the created document, you should use tomotopy.LDAModel.docs like:
mdl = tp.LDAModel(k=20)
idx = mdl.add_doc(words)
if idx < 0: raise RuntimeError("Failed to add doc")
doc_inst = mdl.docs[idx]
# doc_inst is an instance of the added document
A document out of the model is created by the tomotopy.LDAModel.make_doc method. make_doc can be called only after train starts. If you use make_doc before the set of documents used for training has become fixed, you may get wrong results. Since make_doc returns the instance directly, you can use its return value for other manipulations.
mdl = tp.LDAModel(k=20)
# add_doc ...
mdl.train(100)
doc_inst = mdl.make_doc(unseen_words) # doc_inst is an instance of the unseen document
Inference for Unseen Documents
If a new document is created by tomotopy.LDAModel.make_doc, its topic distribution can be inferred by the model. Inference for an unseen document should be performed using the tomotopy.LDAModel.infer method.
mdl = tp.LDAModel(k=20)
# add_doc ...
mdl.train(100)
doc_inst = mdl.make_doc(unseen_words)
topic_dist, ll = mdl.infer(doc_inst)
print("Topic Distribution for Unseen Docs: ", topic_dist)
print("Log-likelihood of inference: ", ll)
The infer method accepts either a single instance of tomotopy.Document or a list of instances of tomotopy.Document. See more at tomotopy.LDAModel.infer.
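As a sketch of the list form, the helpers below build unseen documents with make_doc, pass them to infer in one call, and report the most probable topic of each. The helper names dominant_topic and infer_dominant_topics are hypothetical. The code also assumes that, for a list input, the first return value of infer holds one topic distribution per document, which should be checked against the documentation of tomotopy.LDAModel.infer.

```python
def dominant_topic(topic_dist):
    """Return (topic index, probability) of the most probable topic."""
    best = max(range(len(topic_dist)), key=lambda t: topic_dist[t])
    return best, topic_dist[best]


def infer_dominant_topics(mdl, token_lists):
    """Infer the dominant topic of each unseen tokenized document.

    `mdl` is expected to be a trained model offering make_doc and infer.
    """
    docs = [mdl.make_doc(tokens) for tokens in token_lists]
    # Assumption: a list input yields a list of topic distributions.
    topic_dists, ll = mdl.infer(docs)
    return [dominant_topic(dist) for dist in topic_dists]


# The pure helper can be tried without a model:
print(dominant_topic([0.1, 0.7, 0.2]))  # (1, 0.7)
```

With a trained model mdl, a call such as infer_dominant_topics(mdl, [unseen_words_a, unseen_words_b]) would return one (topic, probability) pair per document, where unseen_words_a and unseen_words_b stand for token lists of your own.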
Examples
You can find example Python code for tomotopy at https://github.com/bab2min/tomotopy/blob/master/example.py .
You can also get the data file used in the example code at https://drive.google.com/file/d/18OpNijd4iwPyYZ2O7pQoPyeTAKEXa71J/view .
License
tomotopy is licensed under the terms of MIT License, meaning you can use it for any reasonable purpose and remain in complete ownership of all the documentation you produce.
History
- 0.3.1 (2019-11-05)
An issue where get_topic_dist() returns incorrect value when min_cf or rm_top is set was fixed.
The return value of get_topic_dist() of tomotopy.MGLDAModel document was fixed to include local topics.
The estimation speed with tw=ONE was improved.
- 0.3.0 (2019-10-06)
A new model, tomotopy.LLDAModel was added into the package.
A crashing issue of HDPModel was fixed.
Since hyperparameter estimation for HDPModel was implemented, the result of HDPModel may differ from previous versions. If you want to turn off hyperparameter estimation of HDPModel, set optim_interval to zero.
- 0.2.0 (2019-08-18)
New models including tomotopy.CTModel and tomotopy.SLDAModel were added into the package.
A new parameter option rm_top was added for all topic models.
The problems in save and load method for PAModel and HPAModel were fixed.
An occasional crash in loading HDPModel was fixed.
The problem that ll_per_word was calculated incorrectly when min_cf > 0 was fixed.
- 0.1.6 (2019-08-09)
Compile errors with clang in the macOS environment were fixed.
- 0.1.4 (2019-08-05)
The issue when add_doc receives an empty list as input was fixed.
The issue that tomotopy.PAModel.get_topic_words doesn’t extract the word distribution of subtopic was fixed.
- 0.1.3 (2019-05-19)
The parameter min_cf and its stopword-removing function were added for all topic models.
- 0.1.0 (2019-05-12)
First version of tomotopy