Tomoto, Topic Modeling Tool for Python

What is tomotopy?

tomotopy is a Python extension of tomoto (Topic Modeling Tool), a Gibbs-sampling based topic model library written in C++. It uses the vectorization capabilities of modern CPUs to maximize speed. The current version of tomoto supports several major topic models, including

  • Latent Dirichlet Allocation (tomotopy.LDAModel)

  • Labeled LDA (tomotopy.LLDAModel)

  • Partially Labeled LDA (tomotopy.PLDAModel)

  • Supervised LDA (tomotopy.SLDAModel)

  • Dirichlet Multinomial Regression (tomotopy.DMRModel)

  • Generalized Dirichlet Multinomial Regression (tomotopy.GDMRModel)

  • Hierarchical Dirichlet Process (tomotopy.HDPModel)

  • Hierarchical LDA (tomotopy.HLDAModel)

  • Multi Grain LDA (tomotopy.MGLDAModel)

  • Pachinko Allocation (tomotopy.PAModel)

  • Hierarchical PA (tomotopy.HPAModel)

  • Correlated Topic Model (tomotopy.CTModel)

  • Dynamic Topic Model (tomotopy.DTModel)

  • Pseudo-document based Topic Model (tomotopy.PTModel).

Getting Started

You can install tomotopy easily using pip. (https://pypi.org/project/tomotopy/)

$ pip install --upgrade pip
$ pip install tomotopy

The supported OS and Python versions are:

  • Linux (x86-64) with Python >= 3.6

  • macOS >= 10.13 with Python >= 3.6

  • Windows 7 or later (x86, x86-64) with Python >= 3.6

  • Other OS with Python >= 3.6: Compilation from source code required (with a C++14-compatible compiler)

After installing, you can start using tomotopy simply by importing it.

import tomotopy as tp
print(tp.isa) # prints 'avx2', 'avx', 'sse2' or 'none'

Currently, tomotopy can exploit the AVX2, AVX or SSE2 SIMD instruction sets to maximize performance. When the package is imported, it checks the available instruction sets and selects the best option. If tp.isa reports 'none', training iterations may take a long time. However, since most modern Intel and AMD CPUs provide a SIMD instruction set, SIMD acceleration is usually available and can yield a big improvement.

Here is sample code for simple LDA training on texts from the ‘sample.txt’ file.

import tomotopy as tp
mdl = tp.LDAModel(k=20)
# read one whitespace-tokenized document per line; the file is closed automatically
with open('sample.txt') as f:
    for line in f:
        mdl.add_doc(line.strip().split())

for i in range(0, 100, 10):
    mdl.train(10)
    print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))

for k in range(mdl.k):
    print('Top 10 words of topic #{}'.format(k))
    print(mdl.get_topic_words(k, top_n=10))

mdl.summary()

Performance of tomotopy

tomotopy uses Collapsed Gibbs Sampling (CGS) to infer the distribution of topics and the distribution of words. Generally, CGS converges more slowly than the Variational Bayes (VB) approach that [gensim’s LdaModel] uses, but each iteration can be computed much faster. In addition, tomotopy can take advantage of multicore CPUs with a SIMD instruction set, which can result in faster iterations.

[gensim’s LdaModel]: https://radimrehurek.com/gensim/models/ldamodel.html
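The per-token resampling that CGS performs can be illustrated with a toy sampler for LDA. This is a minimal pure-Python sketch for intuition only, not the C++ implementation used by tomotopy; the function name collapsed_gibbs_lda, the hyperparameter values and the tiny corpus are all assumptions made for this example.

```python
import random


def collapsed_gibbs_lda(docs, num_topics, iterations=50, alpha=0.1, eta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not optimized)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    # Count tables: topics per document, words per topic, tokens per topic.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the current token from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Conditional distribution over topics given all other tokens.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + eta)
                    / (topic_total[t] + vocab_size * eta)
                    for t in range(num_topics)
                ]
                # Sample a new topic and put the token back.
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    return assignments, doc_topic, topic_word, topic_total


docs = [
    "apple banana apple fruit".split(),
    "banana fruit juice apple".split(),
    "engine wheel car engine".split(),
    "car wheel road engine".split(),
]
assignments, doc_topic, topic_word, topic_total = collapsed_gibbs_lda(docs, num_topics=2)
```

Each iteration is just a sweep of cheap count updates, which is why a single CGS iteration can be fast even though many iterations are needed.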

The following charts compare the running time of the LDA model between tomotopy and gensim. The input data consists of 1000 random documents from English Wikipedia with 1,506,966 words (about 10.1 MB). tomotopy trains for 200 iterations and gensim for 10 iterations.

https://bab2min.github.io/tomotopy/images/tmt_i5.png

↑ Performance on Intel i5-6600, x86-64 (4 cores)

https://bab2min.github.io/tomotopy/images/tmt_xeon.png

↑ Performance on Intel Xeon E5-2620 v4, x86-64 (8 cores, 16 threads)

https://bab2min.github.io/tomotopy/images/tmt_r7_3700x.png

↑ Performance on AMD Ryzen7 3700X, x86-64 (8 cores, 16 threads)

Although tomotopy ran 20 times as many iterations, its overall running time was 5 to 10 times shorter than gensim’s, which works out to roughly 100 to 200 times less time per iteration. It also yields stable results.

It is difficult to compare CGS and VB directly because they are totally different techniques. But from a practical point of view, we can compare their speed and results. The following chart shows the log-likelihood per word of the two models’ results.

https://bab2min.github.io/tomotopy/images/LLComp.png

The SIMD instruction set has a great effect on performance. The following chart compares the SIMD instruction sets.

https://bab2min.github.io/tomotopy/images/SIMDComp.png

Fortunately, most recent x86-64 CPUs provide the AVX2 instruction set, so we can enjoy the performance of AVX2.

Vocabulary controlling using CF and DF

CF (collection frequency) and DF (document frequency) are concepts used in information retrieval: CF is the total number of times a word appears in the corpus, and DF is the number of documents in the corpus in which the word appears. tomotopy exposes these two measures as the parameters min_cf and min_df, which trim low-frequency words when building the corpus.

For example, let’s say we have 5 documents #0 ~ #4 which are composed of the following words:

#0 : a, b, c, d, e, c
#1 : a, b, e, f
#2 : c, d, c
#3 : a, e, f, g
#4 : a, b, g

The CF of a and the CF of c are both 4 because each appears 4 times in the entire corpus. But the DF of a is 4 and the DF of c is 2, because a appears in #0, #1, #3 and #4 while c appears only in #0 and #2. So if we trim low-frequency words using min_cf=3, the result is as follows:

(d, f and g are removed.)
#0 : a, b, c, e, c
#1 : a, b, e
#2 : c, c
#3 : a, e
#4 : a, b

However, when min_df=3, the result is as follows:

(c, d, f and g are removed.)
#0 : a, b, e
#1 : a, b, e
#2 : (empty doc)
#3 : a, e
#4 : a, b

As we can see, min_df is a stronger criterion than min_cf. In topic modeling, words that appear repeatedly in only one document do not contribute to estimating the topic-word distribution. So, removing words with low DF is a good way to reduce the model size while preserving the results of the final model. In short, prefer min_df to min_cf.
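The CF and DF figures from the example above can be reproduced with a few lines of plain Python. This is only a sketch of the two definitions; the helper names frequencies and trim are hypothetical and do not describe how tomotopy implements min_cf and min_df internally.

```python
from collections import Counter

# The five example documents #0 ~ #4.
docs = [
    ["a", "b", "c", "d", "e", "c"],
    ["a", "b", "e", "f"],
    ["c", "d", "c"],
    ["a", "e", "f", "g"],
    ["a", "b", "g"],
]


def frequencies(docs):
    """Return (collection frequency, document frequency) for every word."""
    cf = Counter(w for doc in docs for w in doc)        # every occurrence counts
    df = Counter(w for doc in docs for w in set(doc))   # at most once per document
    return cf, df


def trim(docs, min_cf=0, min_df=0):
    """Drop words whose CF or DF falls below the given thresholds."""
    cf, df = frequencies(docs)
    keep = {w for w in cf if cf[w] >= min_cf and df[w] >= min_df}
    return [[w for w in doc if w in keep] for doc in docs]


cf, df = frequencies(docs)
trimmed_by_cf = trim(docs, min_cf=3)  # removes d, f and g
trimmed_by_df = trim(docs, min_df=3)  # removes c, d, f and g; document #2 becomes empty
```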

Model Save and Load

tomotopy provides save and load methods for each topic model class, so you can save a model to a file whenever you want and reload it from that file.

import tomotopy as tp

mdl = tp.HDPModel()
# read one whitespace-tokenized document per line; the file is closed automatically
with open('sample.txt') as f:
    for line in f:
        mdl.add_doc(line.strip().split())

for i in range(0, 100, 10):
    mdl.train(10)
    print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl.ll_per_word))

# save into file
mdl.save('sample_hdp_model.bin')

# load from file
mdl = tp.HDPModel.load('sample_hdp_model.bin')
for k in range(mdl.k):
    if not mdl.is_live_topic(k): continue
    print('Top 10 words of topic #{}'.format(k))
    print(mdl.get_topic_words(k, top_n=10))

# the saved model is HDP model,
# so when you load it by LDA model, it will raise an exception
mdl = tp.LDAModel.load('sample_hdp_model.bin')

When you load a model from a file, the model type stored in the file must match the class whose load method you call.

See more at tomotopy.LDAModel.save and tomotopy.LDAModel.load methods.

Documents in the Model and out of the Model

We can use a topic model for two major purposes. The basic one is to discover topics from a set of documents as the result of training a model, and the more advanced one is to infer topic distributions for unseen documents by using a trained model.

We call a document used for the former purpose (model training) a document in the model, and a document used for the latter purpose (unseen during training) a document out of the model.

In tomotopy, these two kinds of documents are created differently. A document in the model is created by the tomotopy.LDAModel.add_doc method. add_doc can be called only before tomotopy.LDAModel.train starts. In other words, once train has been called, add_doc cannot add a document to the model because the set of documents used for training has become fixed.

To get the instance of the created document, use tomotopy.LDAModel.docs as follows:

mdl = tp.LDAModel(k=20)
idx = mdl.add_doc(words)
if idx < 0: raise RuntimeError("Failed to add doc")
doc_inst = mdl.docs[idx]
# doc_inst is an instance of the added document

A document out of the model is created by the tomotopy.LDAModel.make_doc method. make_doc can be called only after train starts. If you use make_doc before the set of documents used for training has become fixed, you may get wrong results. Since make_doc returns the instance directly, you can use its return value for other manipulations.

mdl = tp.LDAModel(k=20)
# add_doc ...
mdl.train(100)
doc_inst = mdl.make_doc(unseen_doc) # doc_inst is an instance of the unseen document

Inference for Unseen Documents

If a new document is created by tomotopy.LDAModel.make_doc, its topic distribution can be inferred by the model. Inference for unseen documents should be performed using the tomotopy.LDAModel.infer method.

mdl = tp.LDAModel(k=20)
# add_doc ...
mdl.train(100)
doc_inst = mdl.make_doc(unseen_doc)
topic_dist, ll = mdl.infer(doc_inst)
print("Topic Distribution for Unseen Docs: ", topic_dist)
print("Log-likelihood of inference: ", ll)

The infer method accepts either a single instance of tomotopy.Document or a list of instances of tomotopy.Document. See more at tomotopy.LDAModel.infer.

Corpus and transform

Every topic model in tomotopy has its own internal document type. A document of the type suitable for each model can be created and added through that model’s add_doc method. However, adding the same list of documents to several models becomes quite inconvenient, because add_doc has to be called for every document on each model. Thus, tomotopy provides the tomotopy.utils.Corpus class, which holds a list of documents. A tomotopy.utils.Corpus can be inserted into any model by passing it as the corpus argument to __init__ or to the add_corpus method of the model. Inserting a tomotopy.utils.Corpus has the same effect as inserting the documents that the corpus holds.

Some topic models require different data for their documents. For example, tomotopy.DMRModel requires the argument metadata of type str, but tomotopy.PLDAModel requires the argument labels of type List[str]. Since tomotopy.utils.Corpus holds an independent set of documents rather than being tied to a specific topic model, the data types required by a topic model may be inconsistent with the corpus when it is added to that model. In this case, the miscellaneous data can be transformed to fit the target topic model using the transform argument. See the following code for details:

from tomotopy import DMRModel
from tomotopy.utils import Corpus

corpus = Corpus()
corpus.add_doc("a b c d e".split(), a_data=1)
corpus.add_doc("e f g h i".split(), a_data=2)
corpus.add_doc("i j k l m".split(), a_data=3)

model = DMRModel(k=10)
model.add_corpus(corpus)
# The `a_data` field of `corpus` is lost,
# and the `metadata` that `DMRModel` requires is filled with the default value, an empty str.

assert model.docs[0].metadata == ''
assert model.docs[1].metadata == ''
assert model.docs[2].metadata == ''

def transform_a_data_to_metadata(misc: dict):
    return {'metadata': str(misc['a_data'])}
# this function transforms `a_data` to `metadata`

model = DMRModel(k=10)
model.add_corpus(corpus, transform=transform_a_data_to_metadata)
# Now the docs in `model` have non-default `metadata`, generated from the `a_data` field.

assert model.docs[0].metadata == '1'
assert model.docs[1].metadata == '2'
assert model.docs[2].metadata == '3'
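Because the transform function is an ordinary Python callable that maps a dict to a dict, it can be checked on its own before it is passed to add_corpus. The variant below is a hedged sketch: the name safe_a_data_to_metadata and the fallback to an empty string for documents without an a_data field are assumptions made for illustration, not behavior prescribed by tomotopy.

```python
def safe_a_data_to_metadata(misc: dict) -> dict:
    """Map the `a_data` field to `metadata`, tolerating documents without it."""
    value = misc.get('a_data')  # None when the field is missing
    # Fall back to the empty string, the default `metadata` value mentioned above.
    return {'metadata': '' if value is None else str(value)}


with_field = safe_a_data_to_metadata({'a_data': 1})
without_field = safe_a_data_to_metadata({})
```

A function like this would be passed in the same way as in the code above, that is, as the transform argument of add_corpus.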

Parallel Sampling Algorithms

Since version 0.5.0, tomotopy allows you to choose a parallelism algorithm. The algorithm provided in version 0.4.2 and earlier is COPY_MERGE, which is available for all topic models. The new algorithm PARTITION, available since 0.5.0, generally makes training faster and more memory-efficient, but it is not available for all topic models.

The following charts show the speed difference between the two algorithms based on the number of topics and the number of workers.

https://bab2min.github.io/tomotopy/images/algo_comp.png https://bab2min.github.io/tomotopy/images/algo_comp2.png

Performance by Version

Performance changes by version are shown in the following graphs. The time taken to train the LDA model for 1000 iterations was measured. (Docs: 11314, Vocab: 60382, Words: 2364724, Intel Xeon Gold 5120 @2.2GHz)

https://bab2min.github.io/tomotopy/images/lda-perf-t1.png https://bab2min.github.io/tomotopy/images/lda-perf-t4.png https://bab2min.github.io/tomotopy/images/lda-perf-t8.png

Pinning Topics using Word Priors

Since version 0.6.0, the method tomotopy.LDAModel.set_word_prior has been available. It allows you to control the word prior for each topic. For example, the following code sets the weight of the word ‘church’ to 1.0 in topic 0 and to 0.1 in the rest of the topics. This means that the probability that the word ‘church’ is assigned to topic 0 is 10 times higher than the probability of it being assigned to any other topic. Therefore, most occurrences of ‘church’ are assigned to topic 0, so topic 0 comes to contain many words related to ‘church’. This makes it possible to pin particular topics to specific topic numbers.

import tomotopy as tp
mdl = tp.LDAModel(k=20)

# add documents into `mdl`

# setting word prior
mdl.set_word_prior('church', [1.0 if k == 0 else 0.1 for k in range(20)])

See word_prior_example in example.py for more details.
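The 10-to-1 ratio can be checked with simple arithmetic on the prior weights. The sketch below looks at the prior alone and ignores the data-dependent counts that also influence sampling during training; the helper make_word_prior is a hypothetical name introduced for this example, not part of tomotopy.

```python
def make_word_prior(num_topics, pinned_topic, high=1.0, low=0.1):
    """Build a per-topic weight list that favors one topic for a word."""
    return [high if k == pinned_topic else low for k in range(num_topics)]


num_topics = 20
word_prior = make_word_prior(num_topics, pinned_topic=0)

# Ratio between the pinned topic and any other topic: 1.0 / 0.1.
ratio = word_prior[0] / word_prior[1]

# Normalizing the weights shows the share of prior mass per topic:
# 1.0 / (1.0 + 19 * 0.1) for topic 0, and a tenth of that for each other topic.
total = sum(word_prior)
prior_share = [w / total for w in word_prior]
```

A list built this way has the same form as the second argument of set_word_prior in the code above.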

Examples

You can find example Python code for tomotopy at https://github.com/bab2min/tomotopy/blob/main/examples/ .

You can also get the data file used in the example code at https://drive.google.com/file/d/18OpNijd4iwPyYZ2O7pQoPyeTAKEXa71J/view .

License

tomotopy is licensed under the terms of MIT License, meaning you can use it for any reasonable purpose and remain in complete ownership of all the documentation you produce.

History

  • 0.12.3 (2022-07-19)
    • New features
      • Now, inserting an empty document using tomotopy.LDAModel.add_doc() just ignores it instead of raising an exception. If the newly added argument ignore_empty_words is set to False, an exception is raised as before.

      • tomotopy.HDPModel.purge_dead_topics() method is added to remove non-live topics from the model.

    • Bug fixes
      • Fixed an issue that prevents setting user defined values for nuSq in tomotopy.SLDAModel (by @jucendrero).

      • Fixed an issue where tomotopy.utils.Coherence did not work for tomotopy.DTModel.

      • Fixed an issue that often crashed when calling make_dic() before calling train().

      • Resolved the problem that the results of tomotopy.DMRModel and tomotopy.GDMRModel are different even when the seed is fixed.

      • The parameter optimization process of tomotopy.DMRModel and tomotopy.GDMRModel has been improved.

      • Fixed an issue that sometimes crashed when calling tomotopy.PTModel.copy().

  • 0.12.2 (2021-09-06)
    • An issue where calling convert_to_lda of tomotopy.HDPModel with min_cf > 0, min_df > 0 or rm_top > 0 causes a crash has been fixed.

    • A new argument from_pseudo_doc is added to tomotopy.Document.get_topics and tomotopy.Document.get_topic_dist. This argument is valid only for documents of PTModel; it controls the source used for computing the topic distribution.

    • A default value for argument p of tomotopy.PTModel has been changed. The new default value is k * 10.

    • Using documents generated by make_doc without calling infer no longer causes a crash; it just prints warning messages.

    • An issue where the internal C++ code failed to compile in a clang C++17 environment has been fixed.

  • 0.12.1 (2021-06-20)
    • An issue where tomotopy.LDAModel.set_word_prior() causes a crash has been fixed.

    • Now tomotopy.LDAModel.perplexity and tomotopy.LDAModel.ll_per_word return the accurate value when TermWeight is not ONE.

    • tomotopy.LDAModel.used_vocab_weighted_freq was added, which returns term-weighted frequencies of words.

    • Now tomotopy.LDAModel.summary() shows not only the entropy of words, but also the entropy of term-weighted words.

  • 0.12.0 (2021-04-26)
    • Now tomotopy.DMRModel and tomotopy.GDMRModel support multiple values of metadata (see https://github.com/bab2min/tomotopy/blob/main/examples/dmr_multi_label.py )

    • The performance of tomotopy.GDMRModel was improved.

    • A copy() method has been added for all topic models to do a deep copy.

    • An issue was fixed where words that are excluded from training (by min_cf, min_df) have incorrect topic id. Now all excluded words have -1 as topic id.

    • Now all exceptions and warnings generated by tomotopy follow standard Python types.

    • Compiler requirements have been raised to C++14.

  • 0.11.1 (2021-03-28)
    • A critical bug of asymmetric alphas was fixed. Due to this bug, version 0.11.0 has been removed from releases.

  • 0.11.0 (2021-03-26) (removed)
    • A new topic model tomotopy.PTModel for short texts was added into the package.

    • An issue was fixed where tomotopy.HDPModel.infer causes a segmentation fault sometimes.

    • A mismatch of numpy API version was fixed.

    • Now asymmetric document-topic priors are supported.

    • Serializing topic models to bytes in memory is supported.

    • An argument normalize was added to get_topic_dist(), get_topic_word_dist() and get_sub_topic_dist() for controlling normalization of results.

    • Now tomotopy.DMRModel.lambdas and tomotopy.DMRModel.alpha give correct values.

    • Categorical metadata supports for tomotopy.GDMRModel were added (see https://github.com/bab2min/tomotopy/blob/main/examples/gdmr_both_categorical_and_numerical.py ).

    • Python3.5 support was dropped.

  • 0.10.2 (2021-02-16)
    • An issue was fixed where tomotopy.CTModel.train fails with large K.

    • An issue was fixed where tomotopy.utils.Corpus loses its uid values.

  • 0.10.1 (2021-02-14)
    • An issue was fixed where tomotopy.utils.Corpus.extract_ngrams crashes with empty input.

    • An issue was fixed where tomotopy.LDAModel.infer raises exception with valid input.

    • An issue was fixed where tomotopy.HLDAModel.infer generates wrong tomotopy.Document.path.

    • Since a new parameter freeze_topics for tomotopy.HLDAModel.train was added, you can control whether to create a new topic or not when training.

  • 0.10.0 (2020-12-19)
    • The interfaces of tomotopy.utils.Corpus and tomotopy.LDAModel.docs were unified. Now you can access the documents in a corpus in the same manner.

    • __getitem__ of tomotopy.utils.Corpus was improved. Not only indexing by int, but also by Iterable[int], slicing are supported. Also indexing by uid is supported.

    • New methods tomotopy.utils.Corpus.extract_ngrams and tomotopy.utils.Corpus.concat_ngrams were added. They extract n-gram collocations using PMI and concatenate them into single words.

    • A new method tomotopy.LDAModel.add_corpus was added, and tomotopy.LDAModel.infer can receive corpus as input.

    • A new module tomotopy.coherence was added. It provides the way to calculate coherence of the model.

    • A parameter window_size was added to tomotopy.label.FoRelevance.

    • An issue was fixed where NaN often occurs when training tomotopy.HDPModel.

    • Now Python3.9 is supported.

    • A dependency to py-cpuinfo was removed and the initializing of the module was improved.

  • 0.9.1 (2020-08-08)
    • Memory leaks in version 0.9.0 were fixed.

    • tomotopy.CTModel.summary() was fixed.

  • 0.9.0 (2020-08-04)
    • The tomotopy.LDAModel.summary() method, which prints human-readable summary of the model, has been added.

    • The random number generator of package has been replaced with [EigenRand]. It speeds up the random number generation and solves the result difference between platforms.

    • Due to above, even if seed is the same, the model training result may be different from the version before 0.9.0.

    • Fixed a training error in tomotopy.HDPModel.

    • tomotopy.DMRModel.alpha now shows Dirichlet prior of per-document topic distribution by metadata.

    • tomotopy.DTModel.get_count_by_topics() has been modified to return a 2-dimensional ndarray.

    • tomotopy.DTModel.alpha has been modified to return the same value as tomotopy.DTModel.get_alpha().

    • Fixed an issue where the metadata value could not be obtained for the document of tomotopy.GDMRModel.

    • tomotopy.HLDAModel.alpha now shows Dirichlet prior of per-document depth distribution.

    • tomotopy.LDAModel.global_step has been added.

    • tomotopy.MGLDAModel.get_count_by_topics() now returns the word count for both global and local topics.

    • tomotopy.PAModel.alpha, tomotopy.PAModel.subalpha, and tomotopy.PAModel.get_count_by_super_topic() have been added.

[EigenRand]: https://github.com/bab2min/EigenRand

  • 0.8.2 (2020-07-14)
    • New properties tomotopy.DTModel.num_timepoints and tomotopy.DTModel.num_docs_by_timepoint have been added.

    • A bug that caused different results on different platforms even when the seeds were the same was partially fixed. As a result of this fix, tomotopy in 32 bit now yields different training results from earlier versions.

  • 0.8.1 (2020-06-08)
    • A bug where tomotopy.LDAModel.used_vocabs returned an incorrect value was fixed.

    • Now tomotopy.CTModel.prior_cov returns a covariance matrix with shape [k, k].

    • Now tomotopy.CTModel.get_correlations with empty arguments returns a correlation matrix with shape [k, k].

  • 0.8.0 (2020-06-06)
    • Since NumPy was introduced in tomotopy, many methods and properties of tomotopy return not just list, but numpy.ndarray now.

    • Tomotopy has a new dependency NumPy >= 1.10.0.

    • A wrong estimation of tomotopy.HDPModel.infer was fixed.

    • A new method about converting HDPModel to LDAModel was added.

    • New properties including tomotopy.LDAModel.used_vocabs, tomotopy.LDAModel.used_vocab_freq and tomotopy.LDAModel.used_vocab_df were added into topic models.

    • A new g-DMR topic model(tomotopy.GDMRModel) was added.

    • An error at initializing tomotopy.label.FoRelevance in macOS was fixed.

    • An error that occurred when using tomotopy.utils.Corpus created without raw parameters was fixed.

  • 0.7.1 (2020-05-08)
    • tomotopy.Document.path was added for tomotopy.HLDAModel.

    • A memory corruption bug in tomotopy.label.PMIExtractor was fixed.

    • A compile error in gcc 7 was fixed.

  • 0.7.0 (2020-04-18)
    • tomotopy.DTModel was added into the package.

    • A bug in tomotopy.utils.Corpus.save was fixed.

    • A new method tomotopy.Document.get_count_vector was added into Document class.

    • Now linux distributions use manylinux2010 and an additional optimization is applied.

  • 0.6.2 (2020-03-28)
    • A critical bug related to save and load was fixed. Version 0.6.0 and 0.6.1 have been removed from releases.

  • 0.6.1 (2020-03-22) (removed)
    • A bug related to module loading was fixed.

  • 0.6.0 (2020-03-22) (removed)
    • tomotopy.utils.Corpus class that manages multiple documents easily was added.

    • tomotopy.LDAModel.set_word_prior method that controls word-topic priors of topic models was added.

    • A new argument min_df that filters words based on document frequency was added into every topic model’s __init__.

    • tomotopy.label, the submodule about topic labeling was added. Currently, only tomotopy.label.FoRelevance is provided.

  • 0.5.2 (2020-03-01)
    • A segmentation fault problem was fixed in tomotopy.LLDAModel.add_doc.

    • A bug where infer of tomotopy.HDPModel sometimes crashed the program was fixed.

    • A crash in tomotopy.LDAModel.infer with ps=tomotopy.ParallelScheme.PARTITION, together=True was fixed.

  • 0.5.1 (2020-01-11)
    • A bug where tomotopy.SLDAModel.make_doc didn’t support missing values for y was fixed.

    • Now tomotopy.SLDAModel fully supports missing values for response variables y. Documents with missing values (NaN) are included in modeling topic, but excluded from regression of response variables.

  • 0.5.0 (2019-12-30)
    • Now tomotopy.PAModel.infer returns both topic distribution and sub-topic distribution.

    • New methods get_sub_topics and get_sub_topic_dist were added into tomotopy.Document. (for PAModel)

    • New parameter parallel was added for tomotopy.LDAModel.train and tomotopy.LDAModel.infer method. You can select parallelism algorithm by changing this parameter.

    • tomotopy.ParallelScheme.PARTITION, a new algorithm, was added. It works efficiently when the number of workers is large, the number of topics or the size of vocabulary is big.

    • A bug where rm_top didn’t work at min_cf < 2 was fixed.

  • 0.4.2 (2019-11-30)
    • Wrong topic assignments of tomotopy.LLDAModel and tomotopy.PLDAModel were fixed.

    • Readable __repr__ of tomotopy.Document and tomotopy.Dictionary was implemented.

  • 0.4.1 (2019-11-27)
    • A bug in the init function of tomotopy.PLDAModel was fixed.

  • 0.4.0 (2019-11-18)
    • New models including tomotopy.PLDAModel and tomotopy.HLDAModel were added into the package.

  • 0.3.1 (2019-11-05)
    • An issue where get_topic_dist() returns incorrect value when min_cf or rm_top is set was fixed.

    • The return value of get_topic_dist() of tomotopy.MGLDAModel document was fixed to include local topics.

    • The estimation speed with tw=ONE was improved.

  • 0.3.0 (2019-10-06)
    • A new model, tomotopy.LLDAModel was added into the package.

    • A crashing issue of HDPModel was fixed.

    • Since hyperparameter estimation for HDPModel was implemented, the result of HDPModel may differ from previous versions.

      If you want to turn off hyperparameter estimation of HDPModel, set optim_interval to zero.

  • 0.2.0 (2019-08-18)
    • New models including tomotopy.CTModel and tomotopy.SLDAModel were added into the package.

    • A new parameter option rm_top was added for all topic models.

    • The problems in save and load method for PAModel and HPAModel were fixed.

    • An occasional crash in loading HDPModel was fixed.

    • The problem that ll_per_word was calculated incorrectly when min_cf > 0 was fixed.

  • 0.1.6 (2019-08-09)
    • Compile errors with clang in the macOS environment were fixed.

  • 0.1.4 (2019-08-05)
    • The issue when add_doc receives an empty list as input was fixed.

    • The issue that tomotopy.PAModel.get_topic_words doesn’t extract the word distribution of subtopic was fixed.

  • 0.1.3 (2019-05-19)
    • The parameter min_cf and its stopword-removing function were added for all topic models.

  • 0.1.0 (2019-05-12)
    • First version of tomotopy

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tomotopy-0.12.3.tar.gz (1.3 MB view details)

Uploaded Source

Built Distributions

tomotopy-0.12.3-cp310-cp310-win_amd64.whl (5.7 MB view details)

Uploaded CPython 3.10Windows x86-64

tomotopy-0.12.3-cp310-cp310-win32.whl (3.4 MB view details)

Uploaded CPython 3.10Windows x86

tomotopy-0.12.3-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (16.5 MB view details)

Uploaded CPython 3.10manylinux: glibc 2.12+ x86-64

tomotopy-0.12.3-cp39-cp39-win_amd64.whl (5.7 MB view details)

Uploaded CPython 3.9Windows x86-64

tomotopy-0.12.3-cp39-cp39-win32.whl (3.4 MB view details)

Uploaded CPython 3.9Windows x86

tomotopy-0.12.3-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (16.5 MB view details)

Uploaded CPython 3.9manylinux: glibc 2.12+ x86-64

tomotopy-0.12.3-cp39-cp39-macosx_10_15_x86_64.whl (14.6 MB view details)

Uploaded CPython 3.9macOS 10.15+ x86-64

tomotopy-0.12.3-cp38-cp38-win_amd64.whl (5.7 MB view details)

Uploaded CPython 3.8Windows x86-64

tomotopy-0.12.3-cp38-cp38-win32.whl (3.4 MB view details)

Uploaded CPython 3.8Windows x86

tomotopy-0.12.3-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (16.5 MB view details)

Uploaded CPython 3.8manylinux: glibc 2.12+ x86-64

tomotopy-0.12.3-cp38-cp38-macosx_10_15_x86_64.whl (14.7 MB view details)

Uploaded CPython 3.8macOS 10.15+ x86-64

tomotopy-0.12.3-cp37-cp37m-win_amd64.whl (5.7 MB view details)

Uploaded CPython 3.7mWindows x86-64

tomotopy-0.12.3-cp37-cp37m-win32.whl (3.4 MB view details)

Uploaded CPython 3.7mWindows x86

tomotopy-0.12.3-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (16.5 MB view details)

Uploaded CPython 3.7mmanylinux: glibc 2.12+ x86-64

tomotopy-0.12.3-cp37-cp37m-macosx_10_15_x86_64.whl (14.7 MB view details)

Uploaded CPython 3.7mmacOS 10.15+ x86-64

tomotopy-0.12.3-cp36-cp36m-win32.whl (3.4 MB view details)

Uploaded CPython 3.6mWindows x86

tomotopy-0.12.3-cp36-cp36m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (16.6 MB view details)

Uploaded CPython 3.6mmanylinux: glibc 2.12+ x86-64

tomotopy-0.12.3-cp36-cp36m-macosx_10_14_x86_64.whl (14.7 MB view details)

Uploaded CPython 3.6mmacOS 10.14+ x86-64

File details

Details for the file tomotopy-0.12.3.tar.gz.

File metadata

  • Download URL: tomotopy-0.12.3.tar.gz
  • Upload date:
  • Size: 1.3 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.8.13

File hashes

Hashes for tomotopy-0.12.3.tar.gz
Algorithm Hash digest
SHA256 5cf5a016b1d2a8df30785f6260bb47202049d1ff4f0c53808d4e686f3cd6b787
MD5 95192a892571bc2759d9eeff6c438fe9
BLAKE2b-256 40f744ac077820935845dfd63923c3cbb4165dd72b27629f40b1775ed78eaebd

See more details on using hashes here.

File details

Details for the file tomotopy-0.12.3-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: tomotopy-0.12.3-cp310-cp310-win_amd64.whl
  • Upload date:
  • Size: 5.7 MB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.10.5

File hashes

Hashes for tomotopy-0.12.3-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 e937078fe23206043fdaa09c6fbdc1dccdf50beca566704af87ebe04362667d1
MD5 e9de7b499bba8397f8ce78628ba59413
BLAKE2b-256 4f8570a7b5c1693da6a9d2bd4724df5c297d5cb3108182386bb2b7908cf6535a

See more details on using hashes here.

File details

Details for the file tomotopy-0.12.3-cp310-cp310-win32.whl.

File metadata

  • Download URL: tomotopy-0.12.3-cp310-cp310-win32.whl
  • Upload date:
  • Size: 3.4 MB
  • Tags: CPython 3.10, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.10.5

File hashes

Hashes for tomotopy-0.12.3-cp310-cp310-win32.whl
Algorithm Hash digest
SHA256 1b43d6039a46525376e551bb3ffa11b39d201d7d63bf9b868968ab33c2bb6e40
MD5 4efc60e48452d03c710fbe16ff21cdad
BLAKE2b-256 0cbc9e32eb0ff32c24fc82feb02a055767434283ae978d43b3844ae457db9c66

See more details on using hashes here.

File details

Details for the file tomotopy-0.12.3-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for tomotopy-0.12.3-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 7a8d387f81c9a0073f0df9c6d4793f3b3127058d27423f6aeb7056dfd8630d93
MD5 b91cf98c19aec89978614ac7c8c07064
BLAKE2b-256 43d26a07407fdab27b3b5bbd84409fcf02e84512226f7a8c20558b15ff734469

See more details on using hashes here.

File details

Details for the file tomotopy-0.12.3-cp39-cp39-win_amd64.whl.

File metadata

  • Download URL: tomotopy-0.12.3-cp39-cp39-win_amd64.whl
  • Upload date:
  • Size: 5.7 MB
  • Tags: CPython 3.9, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.9.13

File hashes

Hashes for tomotopy-0.12.3-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 ffb57d9d0eb0ac64ca8809a2bfee0d63ccd3ec9f2c4231974b24ab25134b5648
MD5 9dc8fcaf71db023f433bdc077f243558
BLAKE2b-256 722a33dde73d5f21884fa1fdc1a8eed0c2131256e1d25f4a1ff3fa990255a251

See more details on using hashes here.

File details

Details for the file tomotopy-0.12.3-cp39-cp39-win32.whl.

File metadata

  • Download URL: tomotopy-0.12.3-cp39-cp39-win32.whl
  • Upload date:
  • Size: 3.4 MB
  • Tags: CPython 3.9, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.9.13

File hashes

Hashes for tomotopy-0.12.3-cp39-cp39-win32.whl
Algorithm Hash digest
SHA256 1989c2631741624c12f83d9dda8634a297242a8ab6e064cb3c6eb9d408dca3f7
MD5 591eefa7a6fd07f078482675fcd453e9
BLAKE2b-256 29d268800c75fc5f402bbf4220d4bc209eb627208c755a258afee3a754f396eb

File details

Details for the file tomotopy-0.12.3-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for tomotopy-0.12.3-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 ca6345002fe64fbcb1ed8263a9c3198c0180573e4bc87ef6f3c97270a6482fd5
MD5 ada406a909d8bd03f8475d781318982d
BLAKE2b-256 1c353b242b7d8b3956d3edcb99f5c7b99ad4e21680c39abf978bccdd9015775d

File details

Details for the file tomotopy-0.12.3-cp39-cp39-macosx_10_15_x86_64.whl.

File metadata

File hashes

Hashes for tomotopy-0.12.3-cp39-cp39-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 05805151a168fb3a962fee2605e334e6617273e37f2a8fe9c24b242b458543d8
MD5 c0cb6310533da5ba3f942fe901d35837
BLAKE2b-256 82582ae4dedecf273fa9944e8222b8bc847650d0108e889a4727a0285b9ac1c1

File details

Details for the file tomotopy-0.12.3-cp38-cp38-win_amd64.whl.

File metadata

  • Download URL: tomotopy-0.12.3-cp38-cp38-win_amd64.whl
  • Upload date:
  • Size: 5.7 MB
  • Tags: CPython 3.8, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.8.10

File hashes

Hashes for tomotopy-0.12.3-cp38-cp38-win_amd64.whl
Algorithm Hash digest
SHA256 0176bafe397281443515f6682bd49935cc75bcfa0682d683ba5b51f2534b1e4f
MD5 f22d791670f1e65d84ef305cc2a22dff
BLAKE2b-256 fcd16002f180fdcd30148268c4a67fddaec8660606308a00885cdcfcfde2291b

File details

Details for the file tomotopy-0.12.3-cp38-cp38-win32.whl.

File metadata

  • Download URL: tomotopy-0.12.3-cp38-cp38-win32.whl
  • Upload date:
  • Size: 3.4 MB
  • Tags: CPython 3.8, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.8.10

File hashes

Hashes for tomotopy-0.12.3-cp38-cp38-win32.whl
Algorithm Hash digest
SHA256 a24e453340c3a7154b1043de27e8a7fad1b4f74bb4280310d877b6e8314dd8ba
MD5 793afbdb9fc3fe6254346555079d0743
BLAKE2b-256 b90052a16e6285e4880a313661c40a8e1514b5c8a49b1ba1b1d39c702000743c

File details

Details for the file tomotopy-0.12.3-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for tomotopy-0.12.3-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 c6930bd63a513261821299982cd57b3566d22d42e2e95786487a6adeaa86c034
MD5 4147851359c7ce772f13de9659f283f8
BLAKE2b-256 7f77a39c50a550bac30b19a010a309a17dd3568de177ac2f205f315dc3d00327

File details

Details for the file tomotopy-0.12.3-cp38-cp38-macosx_10_15_x86_64.whl.

File metadata

File hashes

Hashes for tomotopy-0.12.3-cp38-cp38-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 4dc4aec2d526626487aa370a4f2642592fe5ce9d2ac98b9e71793d725fb55bee
MD5 df55fe6a9cd05ae86e51a6d4dfbf25b9
BLAKE2b-256 a5526cbace2b67bb9ea310113b75afac4ee38c3030e5b87f37415cb528707a52

File details

Details for the file tomotopy-0.12.3-cp37-cp37m-win_amd64.whl.

File metadata

  • Download URL: tomotopy-0.12.3-cp37-cp37m-win_amd64.whl
  • Upload date:
  • Size: 5.7 MB
  • Tags: CPython 3.7m, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.7.9

File hashes

Hashes for tomotopy-0.12.3-cp37-cp37m-win_amd64.whl
Algorithm Hash digest
SHA256 2dea746f7709dce2761331320f2a696f9da732801bb585908a446ff0597e11a8
MD5 3ad6611e4c878d21ca5faf8bd185e355
BLAKE2b-256 f86353ad058f3e42f8f50980c73a0df3cd783a541db37f7372ddf3eca6142481

File details

Details for the file tomotopy-0.12.3-cp37-cp37m-win32.whl.

File metadata

  • Download URL: tomotopy-0.12.3-cp37-cp37m-win32.whl
  • Upload date:
  • Size: 3.4 MB
  • Tags: CPython 3.7m, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.7.9

File hashes

Hashes for tomotopy-0.12.3-cp37-cp37m-win32.whl
Algorithm Hash digest
SHA256 dff4b8e2c51172ee1d42644ee50e978e533f63f1ddabb346d23055ffda388d3a
MD5 e7f992f6ea87f44d9f7f0b8750f9aaa7
BLAKE2b-256 7b1cfe10b647515a96657e281d68608d8fe815c1378bd3f807804c1e69c0e674

File details

Details for the file tomotopy-0.12.3-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for tomotopy-0.12.3-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 76d247981cf65b90126a801ff97b4f83704133ee47e525069d6b7b5b278a0231
MD5 ad561c31e79ac57b1409f27bbdbb1aec
BLAKE2b-256 542621a753239d70005a77141391db57da7eb56e67d2122bc01a08f31c49e756

File details

Details for the file tomotopy-0.12.3-cp37-cp37m-macosx_10_15_x86_64.whl.

File metadata

File hashes

Hashes for tomotopy-0.12.3-cp37-cp37m-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 cdc32406a7405299c4205c58363c525b460724b016280cffa145de916e306a6d
MD5 dac07dfa97c2eec70dc195a0c6dbf41b
BLAKE2b-256 accc1945aea351fab86e0419677698fb77d6506b6b4c378ea6d723f3eae6e780

File details

Details for the file tomotopy-0.12.3-cp36-cp36m-win32.whl.

File metadata

  • Download URL: tomotopy-0.12.3-cp36-cp36m-win32.whl
  • Upload date:
  • Size: 3.4 MB
  • Tags: CPython 3.6m, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.3 readme-renderer/34.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.10 tqdm/4.64.0 importlib-metadata/4.8.3 keyring/23.4.1 rfc3986/1.5.0 colorama/0.4.5 CPython/3.6.8

File hashes

Hashes for tomotopy-0.12.3-cp36-cp36m-win32.whl
Algorithm Hash digest
SHA256 424b3d9908c86daa53ce3cf355b38a7e4e99832fdd190b6add2c6c17976c78a3
MD5 038bd346f556a9f106531d09712a88cc
BLAKE2b-256 9f9804bbea255cf094c0bab54ce73e32d8c938ba872ff57acb15ff480805d6e6

File details

Details for the file tomotopy-0.12.3-cp36-cp36m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.

File metadata

File hashes

Hashes for tomotopy-0.12.3-cp36-cp36m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Algorithm Hash digest
SHA256 ba9cfe68133efea53451ba6753b0a783126617bfac5be211ab7094d0cccea157
MD5 e12dcf3d24bf31a0c1cf246fa2b72a1c
BLAKE2b-256 52a34e423ad6acac7780da868466ffa0cbc0fa63cdf825dbfac84497f44d75b7

File details

Details for the file tomotopy-0.12.3-cp36-cp36m-macosx_10_14_x86_64.whl.

File metadata

  • Download URL: tomotopy-0.12.3-cp36-cp36m-macosx_10_14_x86_64.whl
  • Upload date:
  • Size: 14.7 MB
  • Tags: CPython 3.6m, macOS 10.14+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.3 readme-renderer/34.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.10 tqdm/4.64.0 importlib-metadata/4.8.3 keyring/23.4.1 rfc3986/1.5.0 colorama/0.4.5 CPython/3.6.15

File hashes

Hashes for tomotopy-0.12.3-cp36-cp36m-macosx_10_14_x86_64.whl
Algorithm Hash digest
SHA256 0aa1aceb92d6a070eac49913e511ea92cdb5adf4e1b0fa9346c9de1925e5919f
MD5 540e821daa2cfc0d9423e7b3f98362db
BLAKE2b-256 f16aed4675c83602e413c76fb773bb23e67804fd8a94884aec1f756d3172a3a7
