An implementation of spectral clustering for text document collections
Project description
- Homepage:
- Contact:
Overview
Spectral clustering is a modern clustering technique that is considered effective for image clustering, among other tasks. [1] [2]
This software finds clusters among documents using the bag-of-words representation [3] and TF-IDF weighting [4].
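The pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the package's actual code: the function names (`tfidf_matrix`, `spectral_embedding`, `kmeans`), the whitespace tokenizer, the `log(N/df) + 1` IDF variant, the cosine-similarity affinity, and the farthest-point k-means initialisation are all assumptions made for the example.

```python
from collections import Counter

import numpy as np


def tfidf_matrix(docs):
    """Bag-of-words counts weighted by TF-IDF, with L2-normalised rows."""
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer (assumption)
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: j for j, tok in enumerate(vocab)}
    counts = np.zeros((len(docs), len(vocab)))
    for i, toks in enumerate(tokenized):
        for tok, n in Counter(toks).items():
            counts[i, index[tok]] = n
    df = np.count_nonzero(counts, axis=0)       # document frequency per term
    idf = np.log(float(len(docs)) / df) + 1.0   # one common IDF variant (assumption)
    weighted = counts * idf
    norms = np.linalg.norm(weighted, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return weighted / norms, vocab


def spectral_embedding(X, k):
    """Rows of the top-k eigenvectors of D^-1/2 W D^-1/2, normalised to unit length."""
    W = X.dot(X.T)                  # cosine similarity between documents
    np.fill_diagonal(W, 0.0)        # no self-similarity
    d = W.sum(axis=1)
    d[d == 0] = 1.0
    d_inv_sqrt = 1.0 / np.sqrt(d)
    M = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(M)  # eigenvalues come back in ascending order
    top = eigvecs[:, -k:]
    norms = np.linalg.norm(top, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return top / norms


def kmeans(points, k, n_iter=20):
    """Plain Lloyd iterations with deterministic farthest-point initialisation."""
    centers = [points[0]]
    for _ in range(1, k):
        dists = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dists)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels


# Toy documents invented for this example.
docs = ["palm oil price", "palm oil export", "ship cargo port", "ship port strike"]
X, vocab = tfidf_matrix(docs)
labels = kmeans(spectral_embedding(X, 2), 2)
print(labels)
```

Documents that share no terms end up in different connected components of the similarity graph, so their embedded rows are orthogonal and k-means separates them easily. Real collections are noisier, and the choice of affinity and number of eigenvectors matters much more there.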
Requirements
The following software is required:
Python 2 or 3
NumPy
SciPy
How to use
Prepare documents as raw-text files, and put them in a directory, for example, ‘reuters’.
Prepare a category file. For example, ‘cats.txt’ may contain:
14833 palm-oil veg-oil
14839 ship
This means that the file ‘14833’ has ‘palm-oil’ and ‘veg-oil’ as its categories, and ‘14839’ has ‘ship’ as its category.
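A category file in this format could be read with a few lines of Python. This is a sketch only, assuming one document per line with whitespace-separated fields; `parse_categories` is a hypothetical helper and not part of the package.

```python
def parse_categories(lines):
    """Map each file name to its list of categories.

    Assumes each non-empty line is: <file name> <category> [<category> ...]
    """
    categories = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        categories[fields[0]] = fields[1:]
    return categories


example = ["14833 palm-oil veg-oil", "14839 ship"]
print(parse_categories(example))
```

With a real file, the same function can be called on an open file object, for example `parse_categories(open("cats.txt"))`.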
Run: python scluster/clusterer.py cats.txt reuters/ -m kmeans
Notes
When you use the Reuters set, note that file 17980 may contain a non-Unicode character at line 10. It should probably read: “world economic growth-side measures …”
http://www.daviddlewis.com/resources/testcollections/reuters21578/
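One way to locate or tolerate such a character is to work on the raw bytes before decoding. The helpers below are hypothetical and assume the files are meant to be UTF-8; they are not part of the package.

```python
def undecodable_lines(raw, encoding="utf-8"):
    """Return the 1-based numbers of lines in `raw` (bytes) that fail to decode."""
    bad = []
    for number, line in enumerate(raw.split(b"\n"), start=1):
        try:
            line.decode(encoding)
        except UnicodeDecodeError:
            bad.append(number)
    return bad


def decode_lenient(raw, encoding="utf-8"):
    """Decode bytes, substituting U+FFFD for any byte sequence that is invalid."""
    return raw.decode(encoding, "replace")


# Made-up sample bytes; 0xFC is not valid in UTF-8.
sample = b"first line\nbad \xfc byte\nlast line\n"
print(undecodable_lines(sample))
print(decode_lenient(sample))
```

Reading a document with `open(path, "rb").read()` and passing the result to `undecodable_lines` shows which lines need fixing by hand, while `decode_lenient` lets processing continue without editing the file.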