Topic modeling with latent Dirichlet allocation

Topic modeling with latent Dirichlet allocation. lda aims for simplicity.

lda implements latent Dirichlet allocation (LDA) using collapsed Gibbs sampling. LDA is described in Blei et al. (2003) and Pritchard et al. (2000).
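As a rough illustration of what collapsed Gibbs sampling does, here is a minimal NumPy sketch. It is not lda's own implementation (which is written in Cython); the function name `gibbs_lda`, the list-of-word-ids input format, and the hyperparameter values `alpha` and `eta` are assumptions chosen for the example.

```python
import numpy as np


def gibbs_lda(docs, n_topics, vocab_size, n_iter=50, alpha=0.1, eta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative only).

    docs is a list of documents, each a list of integer word ids.
    Returns (doc_topic, topic_word) probability matrices.
    """
    rng = np.random.default_rng(seed)
    ndz = np.zeros((len(docs), n_topics))   # tokens per (document, topic)
    nzw = np.zeros((n_topics, vocab_size))  # tokens per (topic, word)
    nz = np.zeros(n_topics)                 # tokens per topic

    # Start from a random topic assignment for every token.
    z = [rng.integers(n_topics, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for w, k in zip(doc, z[d]):
            ndz[d, k] += 1
            nzw[k, w] += 1
            nz[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take the token out of the counts.
                k = z[d][i]
                ndz[d, k] -= 1
                nzw[k, w] -= 1
                nz[k] -= 1
                # Conditional distribution of this token's topic given
                # all other assignments, up to a normalizing constant.
                p = (ndz[d] + alpha) * (nzw[:, w] + eta) / (nz + vocab_size * eta)
                k = rng.choice(n_topics, p=p / p.sum())
                # Put the token back under its newly sampled topic.
                z[d][i] = k
                ndz[d, k] += 1
                nzw[k, w] += 1
                nz[k] += 1

    # Smoothed point estimates from the final counts.
    doc_topic = (ndz + alpha) / (ndz + alpha).sum(axis=1, keepdims=True)
    topic_word = (nzw + eta) / (nzw + eta).sum(axis=1, keepdims=True)
    return doc_topic, topic_word


# Four tiny documents over a four-word vocabulary (made-up data).
toy_docs = [[0, 1, 0, 1, 0], [2, 3, 3, 2, 2], [0, 0, 1, 1], [3, 2, 3]]
toy_doc_topic, toy_topic_word = gibbs_lda(toy_docs, n_topics=2, vocab_size=4)
```

The topic indicators are the only quantities sampled; the document-topic and topic-word proportions are integrated out and recovered from the counts at the end, which is what "collapsed" refers to.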

Installation

pip install lda

Installation is tested on Linux, OS X, and Windows.

Getting started

lda.LDA implements latent Dirichlet allocation (LDA). The interface follows conventions found in scikit-learn.

The following demonstrates how to inspect a model of a subset of the Reuters news dataset.

>>> import numpy as np
>>> import lda
>>> import lda.datasets
>>> X = lda.datasets.load_reuters()
>>> vocab = lda.datasets.load_reuters_vocab()
>>> titles = lda.datasets.load_reuters_titles()
>>> X.shape
(395, 4258)
>>> model = lda.LDA(n_topics=20, n_iter=500, random_state=1)
>>> model.fit(X)
>>> topic_word = model.topic_word_  # model.components_ also works
>>> n_top_words = 8
>>> for i, topic_dist in enumerate(topic_word):
...     # this slice keeps the n_top_words - 1 (here 7) most probable words
...     topic_words = np.array(vocab)[np.argsort(topic_dist)][:-n_top_words:-1]
...     print('Topic {}: {}'.format(i, ' '.join(topic_words)))
Topic 0: church people told years last year time
Topic 1: elvis music fans york show concert king
Topic 2: pope trip mass vatican poland health john
Topic 3: film french against france festival magazine quebec
Topic 4: king michael romania president first service romanian
Topic 5: police family versace miami cunanan west home
Topic 6: germany german war political government minister nazi
Topic 7: harriman u.s clinton churchill ambassador paris british
Topic 8: yeltsin russian russia president kremlin moscow operation
Topic 9: prince queen bowles church king royal public
Topic 10: simpson million years south irish churches says
Topic 11: charles diana parker camilla marriage family royal
Topic 12: east peace prize president award catholic timor
Topic 13: order nuns india successor election roman sister
Topic 14: pope vatican hospital surgery rome roman doctors
Topic 15: mother teresa heart calcutta missionaries hospital charity
Topic 16: bernardin cardinal cancer church life catholic chicago
Topic 17: died funeral church city death buddhist israel
Topic 18: museum kennedy cultural city culture greek byzantine
Topic 19: art exhibition century city tour works madonna
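If exactly n words per topic are wanted, the ranking step can be wrapped in a small helper. The sketch below uses only NumPy and a made-up two-topic array in place of model.topic_word_; the helper name `top_words` and the toy data are hypothetical and not part of lda.

```python
import numpy as np


def top_words(topic_dist, vocab, n):
    """Return the n highest-probability words for one topic, most probable first."""
    order = np.argsort(topic_dist)[::-1][:n]
    return [vocab[j] for j in order]


# Stand-ins for the vocabulary and for model.topic_word_ (made-up values).
toy_vocab = ("pope", "film", "vatican", "festival")
toy_topic_word = np.array([[0.50, 0.05, 0.40, 0.05],
                           [0.10, 0.50, 0.10, 0.30]])

for i, topic_dist in enumerate(toy_topic_word):
    print("Topic {}: {}".format(i, " ".join(top_words(topic_dist, toy_vocab, 2))))
# Topic 0: pope vatican
# Topic 1: film festival
```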

The document-topic distributions are available in model.doc_topic_.

>>> doc_topic = model.doc_topic_
>>> for i in range(10):
...     print("{} (top topic: {})".format(titles[i], doc_topic[i].argmax()))
0 UK: Prince Charles spearheads British royal revolution. LONDON 1996-08-20 (top topic: 11)
1 GERMANY: Historic Dresden church rising from WW2 ashes. DRESDEN, Germany 1996-08-21 (top topic: 0)
2 INDIA: Mother Teresa's condition said still unstable. CALCUTTA 1996-08-23 (top topic: 15)
3 UK: Palace warns British weekly over Charles pictures. LONDON 1996-08-25 (top topic: 11)
4 INDIA: Mother Teresa, slightly stronger, blesses nuns. CALCUTTA 1996-08-25 (top topic: 15)
5 INDIA: Mother Teresa's condition unchanged, thousands pray. CALCUTTA 1996-08-25 (top topic: 15)
6 INDIA: Mother Teresa shows signs of strength, blesses nuns. CALCUTTA 1996-08-26 (top topic: 15)
7 INDIA: Mother Teresa's condition improves, many pray. CALCUTTA, India 1996-08-25 (top topic: 15)
8 INDIA: Mother Teresa improves, nuns pray for "miracle". CALCUTTA 1996-08-26 (top topic: 15)
9 UK: Charles under fire over prospect of Queen Camilla. LONDON 1996-08-26 (top topic: 0)
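Each row of a document-topic matrix is a probability distribution over topics, so argmax gives only the single most likely topic. The following sketch ranks the two strongest topics per document instead; it uses a made-up array in place of model.doc_topic_, and the variable names are hypothetical.

```python
import numpy as np

# Stand-in for model.doc_topic_: 2 documents, 3 topics (made-up values).
toy_doc_topic = np.array([[0.7, 0.2, 0.1],
                          [0.1, 0.3, 0.6]])

# Every row is a distribution over topics, so it sums to one.
assert np.allclose(toy_doc_topic.sum(axis=1), 1.0)

# Topic indices per document, sorted from most to least probable,
# truncated to the two strongest topics.
top2 = np.argsort(toy_doc_topic, axis=1)[:, ::-1][:, :2]

for i, topics in enumerate(top2):
    print("document {} (top topics: {})".format(i, ", ".join(str(k) for k in topics)))
# document 0 (top topics: 0, 1)
# document 1 (top topics: 2, 1)
```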

Requirements

Python 2.7 or Python 3.3+ is required. The following packages are required

Caveat

lda aims for simplicity. (It happens to be fast, as essential parts are written in C via Cython.) If you are working with a very large corpus you may wish to use more sophisticated topic models such as those implemented in hca and MALLET. hca is written entirely in C and MALLET is written in Java. Unlike lda, hca can use more than one processor at a time. Both MALLET and hca implement topic models known to be more robust than standard latent Dirichlet allocation.

Similar projects

License

lda is licensed under Version 2.0 of the Mozilla Public License.

Download files

Files for lda, version 0.3.0

Size      File type          Python version  Filename
371.6 kB  Wheel              cp27            lda-0.3.0-cp27-none-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.whl
290.3 kB  Wheel              cp27            lda-0.3.0-cp27-none-win32.whl
297.9 kB  Wheel              cp27            lda-0.3.0-cp27-none-win_amd64.whl
372.0 kB  Wheel              cp33            lda-0.3.0-cp33-cp33m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.whl
290.4 kB  Wheel              cp33            lda-0.3.0-cp33-none-win32.whl
297.0 kB  Wheel              cp33            lda-0.3.0-cp33-none-win_amd64.whl
372.2 kB  Wheel              cp34            lda-0.3.0-cp34-cp34m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.whl
290.5 kB  Wheel              cp34            lda-0.3.0-cp34-none-win32.whl
297.0 kB  Wheel              cp34            lda-0.3.0-cp34-none-win_amd64.whl
249.2 kB  Source             None            lda-0.3.0.tar.gz
489.5 kB  Windows Installer  2.7             lda-0.3.0.win32-py2.7.exe
484.5 kB  Windows Installer  3.3             lda-0.3.0.win32-py3.3.exe
484.5 kB  Windows Installer  3.4             lda-0.3.0.win32-py3.4.exe
524.7 kB  Windows Installer  2.7             lda-0.3.0.win-amd64-py2.7.exe
522.3 kB  Windows Installer  3.3             lda-0.3.0.win-amd64-py3.3.exe
522.3 kB  Windows Installer  3.4             lda-0.3.0.win-amd64-py3.4.exe