lda: Topic modeling with latent Dirichlet allocation
====================================================

|pypi| |travis| |crate|

Topic modeling with latent Dirichlet allocation. ``lda`` aims for simplicity.

``lda`` implements latent Dirichlet allocation (LDA) using collapsed Gibbs
sampling. LDA is described in `Blei et al. (2003)`_ and `Pritchard et al. (2000)`_.
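
To give a feel for what collapsed Gibbs sampling does, here is a minimal
pure-Python sketch. It is not the code used by ``lda`` (whose core is compiled);
the function name ``gibbs_lda``, the toy corpus, and the values chosen for
``alpha``, ``eta`` and ``n_iter`` are all assumptions made for illustration. Each
token's topic is resampled in turn from its conditional distribution given
every other assignment, with the topic and word distributions integrated out.

```python
import random


def gibbs_lda(docs, n_topics, n_vocab, n_iter=200, alpha=0.1, eta=0.01, seed=1):
    """Toy collapsed Gibbs sampler; docs is a list of lists of word ids."""
    rng = random.Random(seed)
    ndz = [[0] * n_topics for _ in docs]            # tokens in doc d assigned to topic k
    nzw = [[0] * n_vocab for _ in range(n_topics)]  # times word w is assigned to topic k
    nz = [0] * n_topics                             # total tokens assigned to topic k
    z = []                                          # current topic of every token

    def update(d, w, k, delta):
        ndz[d][k] += delta
        nzw[k][w] += delta
        nz[k] += delta

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z.append([rng.randrange(n_topics) for _ in doc])
        for w, k in zip(doc, z[d]):
            update(d, w, k, +1)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                update(d, w, z[d][i], -1)  # take the token out of the counts
                # p(z = k | every other assignment), up to a constant factor
                weights = [
                    (ndz[d][k] + alpha) * (nzw[k][w] + eta) / (nz[k] + n_vocab * eta)
                    for k in range(n_topics)
                ]
                z[d][i] = rng.choices(range(n_topics), weights=weights)[0]
                update(d, w, z[d][i], +1)  # put it back under the new topic

    # Smoothed estimates computed from the final counts.
    doc_topic = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in row]
        for row, doc in zip(ndz, docs)
    ]
    topic_word = [
        [(c + eta) / (nz[k] + n_vocab * eta) for c in nzw[k]]
        for k in range(n_topics)
    ]
    return doc_topic, topic_word


# Hypothetical corpus: four short documents over a six-word vocabulary.
docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 5], [0, 1, 2, 0], [3, 5, 4, 4]]
doc_topic, topic_word = gibbs_lda(docs, n_topics=2, n_vocab=6)
```

Every row of ``doc_topic`` and ``topic_word`` returned by this sketch is a
probability distribution, because the smoothing terms in each numerator add up
to the smoothing term in the denominator.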

Installation
------------

``pip install lda``

Getting started
---------------

``lda.LDA`` implements latent Dirichlet allocation (LDA). The interface follows
conventions found in scikit-learn_.
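
The Reuters example in this section loads a ready-made matrix ``X`` of shape
``(395, 4258)``. Assuming that rows are documents, columns are vocabulary
entries, and values are integer word counts, a matrix like it could be built
from your own text roughly as follows. The three documents and the whitespace
tokenizer are hypothetical, and the nested list would still need converting
(for example with ``numpy.array``) before being passed to ``fit``.

```python
from collections import Counter

# Hypothetical mini-corpus; real text needs proper tokenization and cleanup.
docs = ["pope visits poland", "pope mass in rome", "film festival in france"]
tokenized = [doc.split() for doc in docs]

# One column per distinct word, in sorted order.
vocab = sorted({word for doc in tokenized for word in doc})
index = {word: j for j, word in enumerate(vocab)}

# X[i][j] counts how often vocab[j] occurs in document i.
X = [[0] * len(vocab) for _ in tokenized]
for i, doc in enumerate(tokenized):
    for word, count in Counter(doc).items():
        X[i][index[word]] = count
```

Keeping ``vocab`` in the same order as the columns of ``X`` is what makes it
possible to map columns of a fitted model back to words later on.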

The following demonstrates how to inspect a model of a subset of the Reuters
news dataset.

.. code-block:: python

    >>> import numpy as np
    >>> import lda
    >>> import lda.datasets
    >>> X = lda.datasets.load_reuters()
    >>> vocab = lda.datasets.load_reuters_vocab()
    >>> titles = lda.datasets.load_reuters_titles()
    >>> X.shape
    (395, 4258)
    >>> model = lda.LDA(n_topics=20, n_iter=500, random_state=1)
    >>> model.fit(X)
    >>> topic_word = model.topic_word_  # model.components_ also works
    >>> n_top_words = 8
    >>> for i, topic_dist in enumerate(topic_word):
    ...     topic_words = np.array(vocab)[np.argsort(topic_dist)][:-n_top_words:-1]
    ...     print('Topic {}: {}'.format(i, ' '.join(topic_words)))
    Topic 0: church people told years last year time
    Topic 1: elvis music fans york show concert king
    Topic 2: pope trip mass vatican poland health john
    Topic 3: film french against france festival magazine quebec
    Topic 4: king michael romania president first service romanian
    Topic 5: police family versace miami cunanan west home
    Topic 6: germany german war political government minister nazi
    Topic 7: harriman u.s clinton churchill ambassador paris british
    Topic 8: yeltsin russian russia president kremlin moscow operation
    Topic 9: prince queen bowles church king royal public
    Topic 10: simpson million years south irish churches says
    Topic 11: charles diana parker camilla marriage family royal
    Topic 12: east peace prize president award catholic timor
    Topic 13: order nuns india successor election roman sister
    Topic 14: pope vatican hospital surgery rome roman doctors
    Topic 15: mother teresa heart calcutta missionaries hospital charity
    Topic 16: bernardin cardinal cancer church life catholic chicago
    Topic 17: died funeral church city death buddhist israel
    Topic 18: museum kennedy cultural city culture greek byzantine
    Topic 19: art exhibition century city tour works madonna
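
Note that the slice ``[:-n_top_words:-1]`` stops just before the
``n_top_words``-th element from the end, so it keeps ``n_top_words - 1``
entries. That is why each topic above lists seven words even though
``n_top_words`` is 8. The plain-Python sketch below reproduces the behaviour
on a hypothetical five-word vocabulary and shows one way to get exactly ``n``
words instead; ``top_words`` is an illustrative helper, not part of ``lda``.

```python
def top_words(dist, vocab, n):
    """Return the n most probable words, most probable first."""
    order = sorted(range(len(dist)), key=lambda j: dist[j], reverse=True)
    return [vocab[j] for j in order[:n]]


# Hypothetical topic-word distribution over a tiny vocabulary.
vocab = ["church", "pope", "film", "music", "king"]
dist = [0.10, 0.40, 0.05, 0.15, 0.30]

# Same idea as np.argsort(dist): indices in ascending order of probability.
ascending = sorted(range(len(dist)), key=lambda j: dist[j])

sliced = [vocab[j] for j in ascending[:-3:-1]]
print(sliced)                      # ['pope', 'king'] (two words, not three)
print(top_words(dist, vocab, 3))   # ['pope', 'king', 'music']
```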

The document-topic distributions are available in ``model.doc_topic_``.

.. code-block:: python

    >>> doc_topic = model.doc_topic_
    >>> for i in range(10):
    ...     print("{} (top topic: {})".format(titles[i], doc_topic[i].argmax()))
    0 UK: Prince Charles spearheads British royal revolution. LONDON 1996-08-20 (top topic: 11)
    1 GERMANY: Historic Dresden church rising from WW2 ashes. DRESDEN, Germany 1996-08-21 (top topic: 0)
    2 INDIA: Mother Teresa's condition said still unstable. CALCUTTA 1996-08-23 (top topic: 15)
    3 UK: Palace warns British weekly over Charles pictures. LONDON 1996-08-25 (top topic: 11)
    4 INDIA: Mother Teresa, slightly stronger, blesses nuns. CALCUTTA 1996-08-25 (top topic: 15)
    5 INDIA: Mother Teresa's condition unchanged, thousands pray. CALCUTTA 1996-08-25 (top topic: 15)
    6 INDIA: Mother Teresa shows signs of strength, blesses nuns. CALCUTTA 1996-08-26 (top topic: 15)
    7 INDIA: Mother Teresa's condition improves, many pray. CALCUTTA, India 1996-08-25 (top topic: 15)
    8 INDIA: Mother Teresa improves, nuns pray for "miracle". CALCUTTA 1996-08-26 (top topic: 15)
    9 UK: Charles under fire over prospect of Queen Camilla. LONDON 1996-08-26 (top topic: 0)
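
A single top topic can hide how mixed a document is; document 9 above, for
instance, lands in topic 0 rather than the Charles and Camilla topic 11.
Assuming each row of ``doc_topic_`` is a probability distribution over topics,
the second-ranked topic can be inspected as sketched below. The 3-by-3 matrix
is made up for illustration and plain lists stand in for the NumPy array.

```python
# Hypothetical document-topic matrix: one row per document, rows sum to 1.
doc_topic = [
    [0.05, 0.90, 0.05],
    [0.60, 0.10, 0.30],
    [0.20, 0.25, 0.55],
]

# Equivalent of row.argmax() for every document.
top_topics = [max(range(len(row)), key=row.__getitem__) for row in doc_topic]

# The two most probable (topic, probability) pairs for every document.
top_two = [
    sorted(enumerate(row), key=lambda pair: pair[1], reverse=True)[:2]
    for row in doc_topic
]

print(top_topics)   # [1, 0, 2]
print(top_two[1])   # [(0, 0.6), (2, 0.3)]
```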

Requirements
------------

Python 2.7 or Python 3.3+ is required. The following packages are also required:

- numpy_
- scipy_
- pbr_

Caveat
------

``lda`` aims for simplicity. (It happens to be fast, as essential parts are
written in C via Cython_.) If you are working with a very large corpus, you may
wish to use more sophisticated topic models such as those implemented in hca_
and MALLET_. hca_ is written entirely in C and MALLET_ is written in Java.
Unlike ``lda``, hca_ can use more than one processor at a time. Both MALLET_ and
hca_ implement topic models known to be more robust than standard latent
Dirichlet allocation.

Important links
---------------

- Documentation: http://pythonhosted.org/lda
- Source code: https://github.com/ariddell/lda/
- Issue tracker: https://github.com/ariddell/lda/issues

License
-------

lda is licensed under Version 2.0 of the Mozilla Public License.

.. _Python: http://www.python.org/
.. _scikit-learn: http://scikit-learn.org
.. _Cython: http://cython.org
.. _hca: http://www.mloss.org/software/view/527/
.. _MALLET: http://mallet.cs.umass.edu/
.. _numpy: http://www.numpy.org/
.. _scipy: http://docs.scipy.org/doc/
.. _pbr: https://pypi.python.org/pypi/pbr
.. _Blei et al. (2003): http://jmlr.org/papers/v3/blei03a.html
.. _Pritchard et al. (2000): http://www.genetics.org/content/164/4/1567.full

.. |pypi| image:: https://badge.fury.io/py/lda.png
    :target: https://badge.fury.io/py/lda
    :alt: pypi version

.. |travis| image:: https://travis-ci.org/ariddell/lda.png?branch=master
    :target: https://travis-ci.org/ariddell/lda
    :alt: travis-ci build status

.. |crate| image:: https://pypip.in/d/lda/badge.png
    :target: https://pypi.python.org/pypi/lda
    :alt: pypi download statistics