Python Framework for Topic Modeling
Gensim is a Python framework for unsupervised learning from raw, unstructured digital texts. It learns the hidden (latent) structure of a corpus; once that structure is found, documents can be expressed succinctly in terms of it, queried for topical similarity, and so on.
Gensim includes the following features:
- Memory independence – there is no need for the whole text corpus (or any intermediate term-document matrices) to reside fully in RAM at any one time.
- Provides implementations for several popular topic inference algorithms, including Latent Semantic Analysis (LSA, LSI) and Latent Dirichlet Allocation (LDA), and makes adding new ones simple.
- Contains I/O wrappers and converters around several popular data formats.
- Allows similarity queries across documents in their latent, topical representation.
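To make the last feature concrete, here is a minimal sketch of a similarity query over documents that have already been mapped into a latent topic space. It is plain Python and does not call gensim; the function names, the dict-based sparse vectors, the made-up topic weights, and the choice of cosine similarity as the measure are all assumptions made for illustration.

```python
import math


def cosine(vec_a, vec_b):
    """Cosine similarity of two sparse vectors given as {topic_id: weight}."""
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        # An all-zero vector has no direction; treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, documents):
    """Return (doc_id, similarity) pairs, most similar first."""
    scores = [(doc_id, cosine(query, vec)) for doc_id, vec in documents.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical topic weights for three documents in a 2-topic latent space.
documents = {
    "doc_a": {0: 0.9, 1: 0.1},
    "doc_b": {0: 0.2, 1: 0.8},
    "doc_c": {0: 0.7, 1: 0.3},
}

# A query that loads entirely on topic 0.
query = {0: 1.0}
ranking = rank_by_similarity(query, documents)
```

Because the comparison happens in topic space rather than word space, two documents can score as similar even when they share few literal words, provided they load on the same topics.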
The principal design objectives behind gensim are:
- Straightforward interfaces and low API learning curve for developers, facilitating modifications and rapid prototyping.
- Memory independence with respect to the size of the input corpus; all intermediate steps and algorithms operate in a streaming fashion, processing one document at a time.
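The streaming objective can be sketched as follows. This is an illustration in plain Python under stated assumptions, not gensim's actual API: `StreamedCorpus` and `document_frequencies` are hypothetical names, and the input is assumed to be a UTF-8 text file with one whitespace-tokenized document per line.

```python
from collections import Counter


class StreamedCorpus:
    """Iterate over a text file one document (line) at a time.

    Hypothetical helper for illustration; not part of gensim's API.
    """

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated
        # repeatedly; only one line is held in memory at a time.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.lower().split()
                # Sparse bag-of-words: sorted (token, count) pairs.
                yield sorted(Counter(tokens).items())


def document_frequencies(corpus):
    """Single streaming pass: count in how many documents each token occurs."""
    df = Counter()
    for bow in corpus:
        df.update(token for token, _ in bow)
    return df
```

Any algorithm written against such an iterable sees one document at a time, so peak memory depends on the size of the model state (here, the vocabulary counts) rather than on the number of documents.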