tomotopy·PyPI

Tomoto, The Topic Modeling Tool for Python

Project description

What is tomotopy?

tomotopy is a Python extension of tomoto (Topic Modeling Tool), a Gibbs-sampling based topic model library written in C++. It uses the vectorization capabilities of modern CPUs to maximize speed. The current version of tomoto supports several major topic models, including Latent Dirichlet Allocation (tomotopy.LDAModel), Dirichlet Multinomial Regression (tomotopy.DMRModel), Hierarchical Dirichlet Process (tomotopy.HDPModel), Multi Grain LDA (tomotopy.MGLDAModel), Pachinko Allocation (tomotopy.PAModel) and Hierarchical PA (tomotopy.HPAModel).

Getting Started

You can install tomotopy easily using pip.

$ pip install tomotopy

On Linux, gcc 5 or later is necessary to compile the C++14 code. After installing, you can start using tomotopy by simply importing it.

import tomotopy as tp
print(tp.isa) # prints 'avx2', 'avx', 'sse2' or 'none'

Currently, tomotopy can exploit the AVX2, AVX or SSE2 SIMD instruction sets to maximize performance. When the package is imported, it checks the available instruction sets and selects the best option. If tp.isa reports none, training iterations may take a long time. However, since most modern Intel and AMD CPUs provide a SIMD instruction set, SIMD acceleration usually yields a large improvement.
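As a small illustration of the check described above, the sketch below turns the value of tp.isa into a readable message. The helper name describe_isa and the message wording are hypothetical and not part of tomotopy; only tp.isa and its documented values come from the text above.

```python
# Hypothetical helper: only `tp.isa` itself comes from tomotopy.
def describe_isa(isa):
    """Return a short message about the SIMD instruction set in use."""
    if isa == "none":
        return "No SIMD acceleration: training iterations may be slow."
    return "SIMD acceleration enabled ({}).".format(isa)


if __name__ == "__main__":
    try:
        import tomotopy as tp
    except ImportError:
        print("tomotopy is not installed")
    else:
        print(describe_isa(tp.isa))
```

Keeping the message logic in a plain function that takes a string makes it possible to try it without tomotopy installed.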

Here is sample code for simple LDA training on texts from the ‘sample.txt’ file.

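import tomotopy as tp

mdl = tp.LDAModel(k=20)
# each line of the file is one document, tokenized by whitespace
with open('sample.txt') as f:
    for line in f:
        mdl.add_doc(line.strip().split())

# run 100 training iterations and report the log-likelihood after each one
for i in range(100):
    mdl.train()
    print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))

# print the top 10 words of every topic
for k in range(mdl.k):
    print('Top 10 words of topic #{}'.format(k))
    print(mdl.get_topic_words(k, top_n=10))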
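The sample above always runs a fixed number of iterations. As a hedged alternative, the sketch below stops once the log-likelihood per word stops changing much. The function name train_until_stable and its max_iter and tol parameters are hypothetical; it relies only on the train() method and ll_per_word attribute used in the sample. Because Gibbs sampling is stochastic, the log-likelihood fluctuates, so this tolerance check is a heuristic rather than a true convergence test.

```python
# Hypothetical helper: works with any object that has train() and ll_per_word.
def train_until_stable(model, max_iter=100, tol=1e-4):
    """Call model.train() until ll_per_word changes by less than tol.

    Returns the number of iterations actually run.
    """
    previous = None
    for i in range(1, max_iter + 1):
        model.train()
        current = model.ll_per_word
        if previous is not None and abs(current - previous) < tol:
            return i
        previous = current
    return max_iter
```

With a model built as in the sample, train_until_stable(mdl) would replace the fixed range(100) loop.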

Performance of tomotopy

tomotopy uses Collapsed Gibbs Sampling (CGS) to infer the distribution of topics and the distribution of words. Generally, CGS converges more slowly than the Variational Bayes (VB) method that [gensim’s LdaModel] uses, but each iteration can be computed much faster. In addition, tomotopy can take advantage of multicore CPUs with a SIMD instruction set, which results in even faster iterations.

[gensim’s LdaModel]: https://radimrehurek.com/gensim/models/ldamodel.html
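To make the idea of collapsed Gibbs sampling concrete, here is a minimal pure-Python sketch of a CGS sweep for LDA. It is a toy written for illustration, not tomotopy's C++ implementation: the function name gibbs_lda and the parameters alpha, eta, iters and seed are assumptions, and it has none of the SIMD or multicore optimizations described above.

```python
import random


def gibbs_lda(docs, k, alpha=0.1, eta=0.01, iters=50, seed=0):
    """Toy collapsed Gibbs sampler for LDA over lists of tokens.

    Returns (doc_topic, topic_word, vocab), where the first two are count tables.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    v = len(vocab)

    doc_topic = [[0] * k for _ in docs]        # tokens of doc d assigned to topic t
    topic_word = [[0] * v for _ in range(k)]   # times word w is assigned to topic t
    topic_total = [0] * k                      # tokens assigned to topic t
    assign = []                                # current topic of every token

    # Start from a random topic assignment.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(k)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                wid = word_id[w]
                z = assign[d][n]
                # Remove the current token from all counts.
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                # p(z = t | everything else) is proportional to
                # (n_dt + alpha) * (n_tw + eta) / (n_t + V * eta)
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + eta)
                    / (topic_total[t] + v * eta)
                    for t in range(k)
                ]
                z = rng.choices(range(k), weights=weights)[0]
                # Add the token back under its newly sampled topic.
                assign[d][n] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1
    return doc_topic, topic_word, vocab
```

Each iteration only updates integer counts and draws one sample per token, which is why a single CGS iteration is cheap even though many iterations are needed.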

The following chart compares the running time of the LDA model between tomotopy and gensim. The input data consists of 1000 random documents from English Wikipedia with 1,506,966 words (about 10.1 MB). tomotopy trains for 200 iterations and gensim trains for 10 iterations.

https://bab2min.github.io/tomotopy/images/tmt_i5.png

Performance on Intel i5-6600, x86-64 (4 cores)

https://bab2min.github.io/tomotopy/images/tmt_xeon.png

Performance on Intel Xeon E5-2620 v4, x86-64 (8 cores, 16 threads)

Although tomotopy ran 20 times as many iterations, its overall running time was 5 to 10 times shorter than gensim's, and it yields a stable result.

It is difficult to compare CGS and VB directly because they are totally different techniques. From a practical point of view, however, we can compare their speed and their results. The following chart shows the log-likelihood per word of the two models' results.

https://bab2min.github.io/tomotopy/images/LLComp.png

The SIMD instruction set has a great effect on performance. The following chart compares the SIMD instruction sets.

https://bab2min.github.io/tomotopy/images/SIMDComp.png

Fortunately, most recent x86-64 CPUs provide the AVX2 instruction set, so we can enjoy the performance of AVX2.

Model Save and Load

tomotopy provides save and load methods for each topic model class, so you can save a model to a file whenever you want and reload it from that file.

import tomotopy as tp

mdl = tp.HDPModel()
# each line of the file is one document, tokenized by whitespace
with open('sample.txt') as f:
    for line in f:
        mdl.add_doc(line.strip().split())

for i in range(100):
    mdl.train()
    print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))

# save into file
mdl.save('sample_hdp_model.bin')

# load from file
mdl = tp.HDPModel.load('sample_hdp_model.bin')
for k in range(mdl.k):
    # skip topics that HDP no longer uses
    if not mdl.is_live_topic(k): continue
    print('Top 10 words of topic #{}'.format(k))
    print(mdl.get_topic_words(k, top_n=10))

# the saved model is an HDP model,
# so loading it with the LDA model class will raise an exception
mdl = tp.LDAModel.load('sample_hdp_model.bin')

When you load a model from a file, the model type stored in the file must match the class whose load method you call.

See the tomotopy.LDAModel.save and tomotopy.LDAModel.load methods for more details.
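Since a mismatched class raises an exception on load, one way to handle a file of unknown type is to try several classes in turn. The sketch below assumes only that each class exposes a load(path) method as described above. The helper name load_with_matching_class is hypothetical, and because the text does not name the exception type raised, the sketch catches Exception broadly, which also hides unrelated failures such as a missing file.

```python
# Hypothetical helper: tries each class's load() until one succeeds.
def load_with_matching_class(path, model_classes):
    """Return the first model that loads from path, trying classes in order."""
    errors = []
    for cls in model_classes:
        try:
            return cls.load(path)
        except Exception as exc:  # exact exception type is an unknown here
            errors.append("{}: {}".format(cls.__name__, exc))
    raise ValueError(
        "no model class could load {!r}: {}".format(path, "; ".join(errors))
    )
```

A call such as load_with_matching_class('sample_hdp_model.bin', [tp.LDAModel, tp.HDPModel]) would then be expected to fall through to the HDP class.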

Documents in the Model and out of the Model

Inference for Unseen Documents

License

tomotopy is licensed under the terms of the MIT License, meaning you can use it for any reasonable purpose and remain in complete ownership of all the documentation you produce.

Project details

Release history

Version	Release date
0.13.0	Aug 7, 2024
0.12.7	Dec 18, 2023
0.12.6	Dec 11, 2023
0.12.5	Aug 2, 2023
0.12.4	Jan 22, 2023
0.12.3	Jul 20, 2022
0.12.2	Sep 6, 2021
0.12.1	Jun 20, 2021
0.12.0	Apr 29, 2021
0.11.1	Mar 27, 2021
0.10.2	Feb 16, 2021
0.10.1	Feb 14, 2021
0.10.0	Dec 19, 2020
0.9.1	Aug 8, 2020
0.9.0	Aug 4, 2020
0.8.2	Jul 15, 2020
0.8.1	Jun 9, 2020
0.8.0	Jun 6, 2020
0.7.1	May 8, 2020
0.7.0	Apr 18, 2020
0.6.2	Mar 28, 2020
0.5.2	Mar 1, 2020
0.5.1	Jan 11, 2020
0.5.0	Dec 29, 2019
0.4.2	Nov 30, 2019
0.4.1	Nov 27, 2019
0.4.0	Nov 18, 2019
0.3.1	Nov 5, 2019
0.3.0	Oct 6, 2019
0.2.0	Aug 17, 2019
0.1.6	Aug 8, 2019
0.1.4	Aug 4, 2019
0.1.3	May 19, 2019 (this version)
0.1.0	May 11, 2019

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

File	Size	Uploaded	Tags
tomotopy-0.1.3.tar.gz	851.4 kB	May 19, 2019	Source

Built Distributions

File	Size	Uploaded	Tags
tomotopy-0.1.3-cp37-cp37m-win_amd64.whl	1.6 MB	May 19, 2019	CPython 3.7m, Windows x86-64
tomotopy-0.1.3-cp37-cp37m-win32.whl	898.0 kB	May 19, 2019	CPython 3.7m, Windows x86
tomotopy-0.1.3-cp36-cp36m-win_amd64.whl	1.6 MB	May 19, 2019	CPython 3.6m, Windows x86-64
tomotopy-0.1.3-cp36-cp36m-win32.whl	898.0 kB	May 19, 2019	CPython 3.6m, Windows x86
tomotopy-0.1.3-cp35-cp35m-win_amd64.whl	1.6 MB	May 19, 2019	CPython 3.5m, Windows x86-64
tomotopy-0.1.3-cp35-cp35m-win32.whl	898.0 kB	May 19, 2019	CPython 3.5m, Windows x86

File details

Details for the file tomotopy-0.1.3.tar.gz.

File metadata

Download URL: tomotopy-0.1.3.tar.gz
Upload date: May 19, 2019
Size: 851.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: Python-urllib/3.6

File hashes

Hashes for tomotopy-0.1.3.tar.gz
Algorithm	Hash digest
SHA256	`79b223e8ba6cbf33167a369866319f9f220a3f938464328e0f6effaee49757ec`
MD5	`49cb612d10f53d06e951dbb91fcbcbf7`
BLAKE2b-256	`e308f4a0aa28a400b6a3305792447ec8420eaf942c803becf6f48ef4f830994d`

See more details on using hashes here.
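The published digests can be used to check a downloaded file. A minimal sketch using Python's standard hashlib module follows; the helper names sha256_of_file and matches_sha256 are hypothetical, and the expected digest would be copied from the SHA256 row for the file in question.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, matches_sha256('tomotopy-0.1.3.tar.gz', '79b223e8ba6cbf33167a369866319f9f220a3f938464328e0f6effaee49757ec') should return True for an intact copy of the source distribution.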

File details

Details for the file tomotopy-0.1.3-cp37-cp37m-win_amd64.whl.

File metadata

Download URL: tomotopy-0.1.3-cp37-cp37m-win_amd64.whl
Upload date: May 19, 2019
Size: 1.6 MB
Tags: CPython 3.7m, Windows x86-64
Uploaded using Trusted Publishing? No
Uploaded via: Python-urllib/3.7

File hashes

Hashes for tomotopy-0.1.3-cp37-cp37m-win_amd64.whl
Algorithm	Hash digest
SHA256	`db398595378913f1c2300e7ba75f04d96208d9f38678ac4ed22ff02610f7dd8a`
MD5	`f505400ca262ab55b15f62392d0aa602`
BLAKE2b-256	`4ae7ff46e562cef79120327fe1802bd12ba317c489342beb570e609f04409608`


File details

Details for the file tomotopy-0.1.3-cp37-cp37m-win32.whl.

File metadata

Download URL: tomotopy-0.1.3-cp37-cp37m-win32.whl
Upload date: May 19, 2019
Size: 898.0 kB
Tags: CPython 3.7m, Windows x86
Uploaded using Trusted Publishing? No
Uploaded via: Python-urllib/3.7

File hashes

Hashes for tomotopy-0.1.3-cp37-cp37m-win32.whl
Algorithm	Hash digest
SHA256	`8a0b1855d168e2c33bb286ce95d1287411f7df8827a4629299cf938c676867fe`
MD5	`04c31993af7ffeb2d7880001142fadcb`
BLAKE2b-256	`e3c5e1f0486bed78c7735d8b7280f464eb6c05c1c33f1b3dd5233374d62c1b64`


File details

Details for the file tomotopy-0.1.3-cp36-cp36m-win_amd64.whl.

File metadata

Download URL: tomotopy-0.1.3-cp36-cp36m-win_amd64.whl
Upload date: May 19, 2019
Size: 1.6 MB
Tags: CPython 3.6m, Windows x86-64
Uploaded using Trusted Publishing? No
Uploaded via: Python-urllib/3.6

File hashes

Hashes for tomotopy-0.1.3-cp36-cp36m-win_amd64.whl
Algorithm	Hash digest
SHA256	`8487b71ef524f73798d20edf4ad064b0d66d08a6b132f08ba567238685df3865`
MD5	`235b71140258be73ce8daa357b0e3ac4`
BLAKE2b-256	`21a1d3b52ed9b0e07f7e622b15e5f910395fcb0e8b3e1644ff6abb475d327389`


File details

Details for the file tomotopy-0.1.3-cp36-cp36m-win32.whl.

File metadata

Download URL: tomotopy-0.1.3-cp36-cp36m-win32.whl
Upload date: May 19, 2019
Size: 898.0 kB
Tags: CPython 3.6m, Windows x86
Uploaded using Trusted Publishing? No
Uploaded via: Python-urllib/3.6

File hashes

Hashes for tomotopy-0.1.3-cp36-cp36m-win32.whl
Algorithm	Hash digest
SHA256	`1ea86b97b9e3da22d8acaa889c3d54781ad1527a76d5d383e7133f717a30546d`
MD5	`10375a4148da139503504f3b2159badf`
BLAKE2b-256	`f1d1d232f1d8d24951c0ef5647e3ba265676aa4b8d245bbd2d74e7c0b8cd5153`


File details

Details for the file tomotopy-0.1.3-cp35-cp35m-win_amd64.whl.

File metadata

Download URL: tomotopy-0.1.3-cp35-cp35m-win_amd64.whl
Upload date: May 19, 2019
Size: 1.6 MB
Tags: CPython 3.5m, Windows x86-64
Uploaded using Trusted Publishing? No
Uploaded via: Python-urllib/3.5

File hashes

Hashes for tomotopy-0.1.3-cp35-cp35m-win_amd64.whl
Algorithm	Hash digest
SHA256	`cd20c076787d63b52a94e737ed0861f3c474364617e53281285a02863a66f6a0`
MD5	`8b4c4300f1f8035a7530d3c11d676d69`
BLAKE2b-256	`349e824459ec2163f10b80195007e1858d8341ddc5f13339461e2d010a67513b`


File details

Details for the file tomotopy-0.1.3-cp35-cp35m-win32.whl.

File metadata

Download URL: tomotopy-0.1.3-cp35-cp35m-win32.whl
Upload date: May 19, 2019
Size: 898.0 kB
Tags: CPython 3.5m, Windows x86
Uploaded using Trusted Publishing? No
Uploaded via: Python-urllib/3.5

File hashes

Hashes for tomotopy-0.1.3-cp35-cp35m-win32.whl
Algorithm	Hash digest
SHA256	`ec3187ccc65ecf66b0ab11a372821b40b655ae531836922ec88b1065d4ce924b`
MD5	`0d7844dbaa56c790e58c43ad91da9ab1`
BLAKE2b-256	`1b62f97b5d6878c8afcb0a21e5565e63ddf961430ea5b2d927e3201be2667e0e`

