Python Framework for Topic Modeling
Project description
Gensim is a Python framework for unsupervised learning from raw, unstructured digital texts.
It is designed for learning hidden (*latent*) corpus structure.
Once this structure is found, documents can be succinctly expressed in terms of it,
queried for topical similarity, and so on.
Gensim includes the following features:
* Memory independence -- there is no need for the whole text corpus (or any
intermediate term-document matrices) to reside fully in RAM at any one time.
* Provides implementations for several popular topic inference algorithms,
including Latent Semantic Analysis (LSA, LSI) and Latent Dirichlet Allocation (LDA),
and makes adding new ones simple.
* Contains I/O wrappers and converters around several popular data formats.
* Allows similarity queries across documents in their latent, topical representation.
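To illustrate what a similarity query in a latent, topical representation can look like, here is a minimal sketch in plain Python. It does not use gensim's API. The sparse `(topic_id, weight)` vector format, the function names `cosine_similarity` and `rank_by_similarity`, and the sample topic weights are all assumptions made for illustration.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse vectors given as (topic_id, weight) pairs."""
    a = dict(vec_a)
    b = dict(vec_b)
    # Only topics present in both vectors contribute to the dot product.
    dot = sum(weight * b[topic] for topic, weight in a.items() if topic in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, documents):
    """Return (doc_index, similarity) pairs, most similar first."""
    scores = [(i, cosine_similarity(query, doc)) for i, doc in enumerate(documents)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical documents already expressed in a 3-topic latent space.
docs = [
    [(0, 0.9), (1, 0.1)],
    [(1, 0.8), (2, 0.2)],
    [(0, 0.5), (2, 0.5)],
]
query = [(0, 1.0)]  # a query that is entirely about topic 0

ranking = rank_by_similarity(query, docs)
for doc_index, score in ranking:
    print(doc_index, round(score, 3))
```

Comparing documents by topic weights rather than by raw words means two texts can score as similar even when they share few exact terms.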
The principal design objectives behind gensim are:
1. Straightforward interfaces and low API learning curve for developers,
facilitating modifications and rapid prototyping.
2. Memory independence with respect to the size of the input corpus; all intermediate
steps and algorithms operate in a streaming fashion, processing one document
at a time.
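The streaming objective above can be sketched in plain Python as an iterable that yields one document at a time, so that only the current document is held in memory. This is an assumed illustration rather than gensim's own interface: the class name `StreamedCorpus`, its constructor arguments, the whitespace tokenization, and the tiny vocabulary are hypothetical.

```python
import io


class StreamedCorpus:
    """Yield one sparse bag-of-words vector per document, one document at a time."""

    def __init__(self, open_stream, vocabulary):
        self.open_stream = open_stream  # callable returning a fresh iterator of lines
        self.vocabulary = vocabulary    # mapping: token -> integer id

    def __iter__(self):
        # Each line is treated as one document; nothing is accumulated across lines.
        for line in self.open_stream():
            counts = {}
            for token in line.lower().split():
                token_id = self.vocabulary.get(token)
                if token_id is not None:
                    counts[token_id] = counts.get(token_id, 0) + 1
            yield sorted(counts.items())


# An in-memory stand-in for a large file on disk (one document per line).
TEXT = "human computer interface\ncomputer system system\n"
vocab = {"human": 0, "computer": 1, "interface": 2, "system": 3}
corpus = StreamedCorpus(lambda: io.StringIO(TEXT), vocab)

# A single streaming pass: count in how many documents each token id occurs.
document_frequency = {}
for bow in corpus:
    for token_id, _count in bow:
        document_frequency[token_id] = document_frequency.get(token_id, 0) + 1

print(document_frequency)
```

Because `__iter__` reopens the stream on every call, the corpus can be traversed several times, which multi-pass algorithms need, without ever loading all documents at once. In practice the `open_stream` callable would open a file on disk instead of wrapping a string.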
Download files
Source Distribution
gensim-0.3.0.tar.gz (124.4 kB)
Built Distribution
gensim-0.3.0-py2.5.egg (130.8 kB)