Project Description

TOM (TOpic Modeling) is a Python 3 library for topic modeling and browsing, licensed under the MIT license. Its objective is to support the efficient analysis of a text corpus from start to finish through the discovery of latent topics. To this end, TOM provides functions for preparing and vectorizing a text corpus. It also offers a common interface for two topic models (LDA, using either variational inference or Gibbs sampling, and NMF, using alternating least squares with a projected gradient method) and implements three state-of-the-art methods for estimating the optimal number of topics to model a corpus. In addition, TOM builds an interactive Web-based browser that makes it easy to explore a topic model and the related corpus.

Features

Vector space modeling

  • Feature selection based on word frequency
  • Weighting
    • tf
    • tf-idf

Topic modeling

  • Latent Dirichlet Allocation
    • Standard variational Bayesian inference (Latent Dirichlet Allocation. Blei et al, 2003)
    • Online variational Bayesian inference (Online learning for Latent Dirichlet Allocation. Hoffman et al, 2010)
    • Collapsed Gibbs sampling (Finding scientific topics. Griffiths & Steyvers, 2004)
  • Non-negative Matrix Factorization (NMF)
    • Alternating least squares with a projected gradient method (Projected gradient methods for non-negative matrix factorization. Lin, 2007)

Estimating the optimal number of topics

  • Stability analysis (How many topics? Stability analysis for topic models. Greene et al, 2014)
  • Spectral analysis (On finding the natural number of topics with Latent Dirichlet Allocation: Some observations. Arun et al, 2010)
  • Consensus-based analysis (Metagenes and molecular pattern discovery using matrix factorization. Brunet et al, 2004)

Installation

We recommend installing Anaconda (https://www.continuum.io), which automatically installs most of the required dependencies (i.e. pandas, numpy, scipy, scikit-learn, matplotlib, flask). Then install the lda module (pip install lda). Finally, clone or download this repo and run the following command:

python setup.py install

Or install it directly from PyPI:

pip install tom_lib
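
To confirm that the dependencies listed above can be found after installation, a small check like the one below may help. This is a sketch only: the helper name missing_modules is hypothetical and not part of TOM, and the module list simply mirrors the dependencies named in this section (sklearn is the import name of scikit-learn).

```python
from importlib.util import find_spec

# Import names of the dependencies listed in the Installation section.
# "sklearn" is the import name of scikit-learn. The helper below is a
# hypothetical convenience, not something TOM provides.
REQUIRED_MODULES = ["pandas", "numpy", "scipy", "sklearn",
                    "matplotlib", "flask", "lda", "tom_lib"]


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed modules can be found.")
```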

Usage

We provide two sample programs, topic_model.py (which shows you how to load and prepare a corpus, estimate the optimal number of topics, infer the topic model and then manipulate it) and topic_model_browser.py (which shows you how to generate a topic model browser to explore a corpus), to help you get started using TOM.

Load and prepare a textual corpus

The following code snippet shows how to load a corpus of French documents and vectorize them using tf-idf with unigrams.

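# Import path assumed from the sample programs shipped with TOM
from tom_lib.structure.corpus import Corpus

# Load the CSV corpus and vectorize it with tf-idf over unigrams
corpus = Corpus(source_file_path='input/raw_corpus.csv',
                language='french',
                vectorization='tfidf',
                n_gram=1,
                max_relative_frequency=0.8,
                min_absolute_frequency=4)
print('corpus size:', corpus.size)
print('vocabulary size:', len(corpus.vocabulary))
print('Vector representation of document 0:\n', corpus.vector_for_document(0))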
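
To give an intuition for what the frequency thresholds and the tf-idf weighting do, here is a minimal pure-Python sketch. It rests on assumptions: both thresholds are interpreted as document frequencies (the share of documents, and the number of documents, in which a word occurs), and the weighting is the textbook tf * log(N / df) rather than whatever exact variant TOM applies. The function names are hypothetical.

```python
import math
from collections import Counter


def select_vocabulary(docs, max_relative_frequency, min_absolute_frequency):
    """Keep words whose document frequency lies within both thresholds.

    Assumption: both thresholds refer to document frequency.
    """
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    return sorted(
        word for word, df in doc_freq.items()
        if df >= min_absolute_frequency and df / n_docs <= max_relative_frequency
    )


def tfidf_vectors(docs, vocabulary):
    """Return one textbook tf-idf vector (tf * log(N / df)) per document."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # missing words count as 0
        vectors.append([tf[w] * math.log(n_docs / doc_freq[w]) for w in vocabulary])
    return vectors


docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
    "a bird sang".split(),
]
vocabulary = select_vocabulary(docs, max_relative_frequency=0.7,
                               min_absolute_frequency=2)
vectors = tfidf_vectors(docs, vocabulary)
print(vocabulary)   # 'the' is too frequent, 'mat' and others are too rare
print(vectors[0])
```

In this toy corpus, "the" appears in 3 of 4 documents and is dropped by the relative threshold, while words that occur in a single document are dropped by the absolute threshold.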

Instantiate a topic model and infer topics

You can instantiate an NMF or LDA object and then infer topics.

NMF:

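# Import path assumed from the sample programs shipped with TOM
from tom_lib.nlp.topic_model import NonNegativeMatrixFactorization
topic_model = NonNegativeMatrixFactorization(corpus)
topic_model.infer_topics(num_topics=15)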
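
For readers unfamiliar with NMF, the sketch below factorizes a small document-word matrix V into non-negative factors W (documents x topics) and H (topics x words). Note that it uses the classic multiplicative update rules, not the alternating least squares with projected gradient method that TOM relies on; it is only meant to illustrate the idea. All function names are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, num_topics, iterations=200, seed=0, eps=1e-9):
    """Factorize V into W and H with multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(num_topics)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(num_topics)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(num_topics)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(num_topics)]
             for i in range(n)]
    return w, h


# Toy matrix: documents 0-1 use words 0-1, documents 2-3 use words 2-3.
v = [[2, 1, 0, 0],
     [4, 2, 0, 0],
     [0, 0, 1, 3],
     [0, 0, 2, 6]]
w0, h0 = nmf(v, num_topics=2, iterations=0)   # random initialization only
w, h = nmf(v, num_topics=2)
print("error before:", frobenius_error(v, w0, h0))
print("error after: ", frobenius_error(v, w, h))
```

Each row of H can be read as a topic (a weighting over words), and each row of W as the topic mixture of one document.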

LDA (using either the standard variational Bayesian inference or Gibbs sampling):

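# Import path assumed from the sample programs shipped with TOM
from tom_lib.nlp.topic_model import LatentDirichletAllocation

# Standard variational Bayesian inference
topic_model = LatentDirichletAllocation(corpus)
topic_model.infer_topics(num_topics=15, algorithm='variational')
# Collapsed Gibbs sampling
topic_model = LatentDirichletAllocation(corpus)
topic_model.infer_topics(num_topics=15, algorithm='gibbs')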
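
To illustrate what the Gibbs sampling option does conceptually, here is a compact sketch of collapsed Gibbs sampling for LDA on a toy corpus. It is not TOM's implementation (TOM delegates to the lda module mentioned in the installation notes); the function name, the hyperparameter defaults (alpha, beta) and the toy data are assumptions chosen for illustration.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, iterations=50,
              alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # n_dk
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # n_kw
    topic_total = [0] * num_topics                              # n_k
    assignments = []
    # Assign a random initial topic to every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # p(z = t | rest) is proportional to
                # (n_dt + alpha) * (n_tw + beta) / (n_t + V * beta)
                weights = [
                    (doc_topic[d][t] + alpha) * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                # Record the newly sampled assignment.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


# Toy corpus: documents 0-1 use word ids 0-1, documents 2-3 use word ids 2-3.
docs = [[0, 1, 0, 1, 0], [1, 0, 1, 1], [2, 3, 2, 3], [3, 2, 3, 3, 2]]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2, vocab_size=4)
print("document-topic counts:", doc_topic)
print("topic-word counts:    ", topic_word)
```

The returned count matrices can be normalized (after adding alpha and beta) to obtain per-document topic distributions and per-topic word distributions.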

Instantiate a topic model and estimate the optimal number of topics

Here we instantiate an NMF object, then generate plots of the three metrics for estimating the optimal number of topics.

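# Import path assumed from the sample programs shipped with TOM
from tom_lib.visualization.visualization import Visualization

topic_model = NonNegativeMatrixFactorization(corpus)
viz = Visualization(topic_model)
# Stability analysis (Greene et al, 2014)
viz.plot_greene_metric(min_num_topics=5,
                       max_num_topics=50,
                       tao=10, step=1,
                       top_n_words=10)
# Spectral analysis (Arun et al, 2010)
viz.plot_arun_metric(min_num_topics=5,
                     max_num_topics=50,
                     iterations=10)
# Consensus-based analysis (Brunet et al, 2004)
viz.plot_brunet_metric(min_num_topics=5,
                       max_num_topics=50,
                       iterations=10)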
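
The intuition behind stability analysis is that, for a good number of topics, models learned on different samples of the corpus should produce similar topics. The sketch below is a simplified illustration of that idea, not the metric TOM computes: it compares two sets of top words per topic using plain Jaccard similarity and brute-force matching, whereas Greene et al. use a ranking-aware Average Jaccard measure with an optimal assignment. The function names and word lists are made up for the example.

```python
from itertools import permutations


def jaccard(a, b):
    """Jaccard similarity between two collections of words."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def agreement(topics_a, topics_b):
    """Best average Jaccard similarity over all one-to-one topic matchings.

    Brute force over permutations, so only suitable for a handful of topics.
    Both models must have the same number of topics.
    """
    k = len(topics_a)
    best = 0.0
    for perm in permutations(range(k)):
        score = sum(jaccard(topics_a[i], topics_b[perm[i]]) for i in range(k)) / k
        best = max(best, score)
    return best


reference = [["cat", "dog", "pet"], ["stock", "market", "price"]]
same_topics_reordered = [["market", "stock", "price"], ["dog", "cat", "pet"]]
partly_different = [["cat", "dog", "vet"], ["stock", "bond", "price"]]

print(agreement(reference, same_topics_reordered))  # identical topics
print(agreement(reference, partly_different))       # partial overlap
```

Averaging such agreement scores over several resampled models, for each candidate number of topics, yields a stability curve whose peaks suggest suitable values.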

Save/load a topic model

To allow previously learned topic models to be reused, TOM can save them to disk and load them back, as shown below.

# Import path assumed from the sample programs shipped with TOM
import tom_lib.utils as utils
utils.save_topic_model(topic_model, 'output/NMF_15topics.tom')
topic_model = utils.load_topic_model('output/NMF_15topics.tom')
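
The general save-then-reload pattern can be sketched with Python's pickle module. Whether TOM serializes models this way is an assumption, not something stated here; the helper names and the stand-in dictionary are hypothetical.

```python
import os
import pickle
import tempfile


def save_object(obj, path):
    """Serialize obj to path using pickle."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)


def load_object(path):
    """Load a pickled object; only use this on files you trust."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# A plain dictionary stands in for a learned topic model.
model = {"num_topics": 15, "top_words": [["cat", "dog"], ["stock", "market"]]}
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "NMF_15topics.tom")
    save_object(model, path)
    restored = load_object(path)
print(restored == model)
```

Because unpickling can execute arbitrary code, files saved this way should only be loaded when they come from a trusted source.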

Topic model browser: screenshots

[Screenshots: topic cloud, topic details, document details]

Release History

  • 0.2.2 (this version)
  • 0.2.1
  • 0.2.0
  • 0.1.2

Download Files

  • tom_lib-0.2.2.tar.gz (819.5 kB), source distribution, uploaded Jun 24, 2016
