Topic modeling with latent Dirichlet allocation

lda implements latent Dirichlet allocation (LDA) using collapsed Gibbs sampling. lda is fast and is tested on Linux, OS X, and Windows.

You can read more about lda in the documentation.

Installation

If you have NumPy installed, install lda with pip:

pip install lda

Installation does not require a compiler on Windows or OS X.

Getting started

lda.LDA implements latent Dirichlet allocation (LDA). The interface follows conventions found in scikit-learn.

The following demonstrates how to inspect a model of a subset of the Reuters news dataset. The input below, X, is a document-term matrix (sparse matrices are accepted).

>>> import numpy as np
>>> import lda
>>> import lda.datasets
>>> X = lda.datasets.load_reuters()
>>> vocab = lda.datasets.load_reuters_vocab()
>>> titles = lda.datasets.load_reuters_titles()
>>> X.shape
(395, 4258)
>>> X.sum()
84010
>>> model = lda.LDA(n_topics=20, n_iter=1500, random_state=1)
>>> model.fit(X)  # model.fit_transform(X) is also available
>>> topic_word = model.topic_word_  # model.components_ also works
>>> n_top_words = 8
>>> for i, topic_dist in enumerate(topic_word):
...     topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
...     print('Topic {}: {}'.format(i, ' '.join(topic_words)))

Topic 0: british churchill sale million major letters west britain
Topic 1: church government political country state people party against
Topic 2: elvis king fans presley life concert young death
Topic 3: yeltsin russian russia president kremlin moscow michael operation
Topic 4: pope vatican paul john surgery hospital pontiff rome
Topic 5: family funeral police miami versace cunanan city service
Topic 6: simpson former years court president wife south church
Topic 7: order mother successor election nuns church nirmala head
Topic 8: charles prince diana royal king queen parker bowles
Topic 9: film french france against bardot paris poster animal
Topic 10: germany german war nazi letter christian book jews
Topic 11: east peace prize award timor quebec belo leader
Topic 12: n't life show told very love television father
Topic 13: years year time last church world people say
Topic 14: mother teresa heart calcutta charity nun hospital missionaries
Topic 15: city salonika capital buddhist cultural vietnam byzantine show
Topic 16: music tour opera singer israel people film israeli
Topic 17: church catholic bernardin cardinal bishop wright death cancer
Topic 18: harriman clinton u.s ambassador paris president churchill france
Topic 19: city museum art exhibition century million churches set
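
The input X used above is a document-term matrix: one row per document, one column per vocabulary word, with integer counts as entries. The following is a minimal sketch of building such a matrix from tokenized text, assuming only NumPy. The helper name build_document_term_matrix and the three toy documents are hypothetical and are not part of lda.

```python
import numpy as np


def build_document_term_matrix(tokenized_docs):
    """Return (X, vocab) where X[d, j] counts vocab[j] in document d.

    Hypothetical helper for illustration; it is not part of lda.
    """
    # A sorted vocabulary gives a stable, reproducible column order.
    vocab = sorted({token for doc in tokenized_docs for token in doc})
    index = {token: j for j, token in enumerate(vocab)}

    # Counts are stored as integers, like the Reuters matrix above.
    X = np.zeros((len(tokenized_docs), len(vocab)), dtype=np.int64)
    for d, doc in enumerate(tokenized_docs):
        for token in doc:
            X[d, index[token]] += 1
    return X, tuple(vocab)


# Toy corpus, made up for this example.
docs = [
    "pope vatican rome pope".split(),
    "elvis fans concert".split(),
    "vatican rome concert".split(),
]
X_small, vocab_small = build_document_term_matrix(docs)
print(X_small.shape)  # (3, 6)
print(X_small.sum())  # 10
```

A matrix built this way has the same layout as the Reuters X, so it should be usable in place of X in the example above, with vocab_small playing the role of vocab.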

The document-topic distributions are available in model.doc_topic_.

>>> doc_topic = model.doc_topic_
>>> for i in range(10):
...     print("{} (top topic: {})".format(titles[i], doc_topic[i].argmax()))
0 UK: Prince Charles spearheads British royal revolution. LONDON 1996-08-20 (top topic: 8)
1 GERMANY: Historic Dresden church rising from WW2 ashes. DRESDEN, Germany 1996-08-21 (top topic: 13)
2 INDIA: Mother Teresa's condition said still unstable. CALCUTTA 1996-08-23 (top topic: 14)
3 UK: Palace warns British weekly over Charles pictures. LONDON 1996-08-25 (top topic: 8)
4 INDIA: Mother Teresa, slightly stronger, blesses nuns. CALCUTTA 1996-08-25 (top topic: 14)
5 INDIA: Mother Teresa's condition unchanged, thousands pray. CALCUTTA 1996-08-25 (top topic: 14)
6 INDIA: Mother Teresa shows signs of strength, blesses nuns. CALCUTTA 1996-08-26 (top topic: 14)
7 INDIA: Mother Teresa's condition improves, many pray. CALCUTTA, India 1996-08-25 (top topic: 14)
8 INDIA: Mother Teresa improves, nuns pray for "miracle". CALCUTTA 1996-08-26 (top topic: 14)
9 UK: Charles under fire over prospect of Queen Camilla. LONDON 1996-08-26 (top topic: 8)
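
The same matrix can be read in the other direction, to find the documents that load most heavily on a given topic. The sketch below uses a small made-up array in place of model.doc_topic_ and assumes each row is a probability distribution over topics. The helper documents_for_topic is hypothetical; it reuses the argsort slicing idiom from the top-words loop above.

```python
import numpy as np

# Hypothetical stand-in for model.doc_topic_: 4 documents, 3 topics.
doc_topic = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
])

# Assumption: every row sums to 1 (a distribution over topics).
assert np.allclose(doc_topic.sum(axis=1), 1.0)

# Most probable topic for every document at once.
top_topic = doc_topic.argmax(axis=1)


def documents_for_topic(doc_topic, topic, n=2):
    """Indices of the n documents with the largest weight on `topic`."""
    # argsort is ascending; the reversed slice keeps the last n entries,
    # largest first.
    return np.argsort(doc_topic[:, topic])[:-(n + 1):-1]


print(top_topic.tolist())                          # [0, 2, 0, 1]
print(documents_for_topic(doc_topic, 0).tolist())  # [0, 2]
```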

Requirements

Python 2.7 or Python 3.3+ is required. NumPy is also required and should be installed before lda (see Installation).

Caveat

lda aims for simplicity. (It happens to be fast, as essential parts are written in C via Cython.) If you are working with a very large corpus you may wish to use more sophisticated topic models such as those implemented in hca and MALLET. hca is written entirely in C and MALLET is written in Java. Unlike lda, hca can use more than one processor at a time. Both MALLET and hca implement topic models known to be more robust than standard latent Dirichlet allocation.

Notes

Latent Dirichlet allocation is described in Blei et al. (2003) and Pritchard et al. (2000). Inference using collapsed Gibbs sampling is described in Griffiths and Steyvers (2004).
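
To give a feel for what collapsed Gibbs sampling does, here is a toy sampler written with NumPy. It is a sketch for illustration only and is not lda's implementation, which is written in C via Cython. The function name, the parameter names alpha and eta (symmetric Dirichlet priors on the document-topic and topic-word distributions), and their default values are assumptions made for this example. Each sweep removes one token from the count tables, samples a new topic for it from the conditional distribution given all other assignments, and adds it back.

```python
import numpy as np


def collapsed_gibbs_lda(X, n_topics, n_iter=50, alpha=0.1, eta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA on a dense count matrix X."""
    rng = np.random.RandomState(seed)
    n_docs, n_words = X.shape

    # Expand the counts into one (document, word) pair per token.
    doc_ids, word_ids = [], []
    for d, w in zip(*np.nonzero(X)):
        doc_ids.extend([d] * int(X[d, w]))
        word_ids.extend([w] * int(X[d, w]))

    # Random initial topic assignment for every token.
    z = rng.randint(n_topics, size=len(word_ids))

    ndz = np.zeros((n_docs, n_topics))   # tokens per (document, topic)
    nzw = np.zeros((n_topics, n_words))  # tokens per (topic, word)
    nz = np.zeros(n_topics)              # tokens per topic
    for d, w, k in zip(doc_ids, word_ids, z):
        ndz[d, k] += 1
        nzw[k, w] += 1
        nz[k] += 1

    for _ in range(n_iter):
        for i, (d, w) in enumerate(zip(doc_ids, word_ids)):
            # Remove token i from the count tables.
            k = z[i]
            ndz[d, k] -= 1
            nzw[k, w] -= 1
            nz[k] -= 1

            # p(z_i = k | all other assignments), up to a constant.
            p = (ndz[d] + alpha) * (nzw[:, w] + eta) / (nz + n_words * eta)
            k = rng.choice(n_topics, p=p / p.sum())

            # Add the token back under its newly sampled topic.
            z[i] = k
            ndz[d, k] += 1
            nzw[k, w] += 1
            nz[k] += 1

    # Point estimates from the final counts, smoothed by the priors.
    topic_word = (nzw + eta) / (nzw + eta).sum(axis=1, keepdims=True)
    doc_topic = (ndz + alpha) / (ndz + alpha).sum(axis=1, keepdims=True)
    return topic_word, doc_topic


# Toy corpus: two documents use words 0-1, two use words 2-3.
X_toy = np.array([[4, 3, 0, 0], [3, 4, 0, 0], [0, 0, 4, 3], [0, 0, 3, 4]])
topic_word, doc_topic = collapsed_gibbs_lda(X_toy, n_topics=2)
print(topic_word.shape, doc_topic.shape)  # (2, 4) (4, 2)
```

The pure-Python loop over tokens is far too slow for real corpora; it is written this way only to keep each step of the sampler visible.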

Other implementations

License

lda is licensed under Version 2.0 of the Mozilla Public License.
