An implementation of spectral clustering for text document collections
Project description
Overview
Spectral clustering is a modern clustering technique considered effective for image clustering, among other applications. [1] [2]
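As a rough illustration of the general idea (not this package's actual code), a minimal normalized spectral clustering sketch with NumPy might look like the following. The function names `spectral_embed`, `kmeans` and `spectral_cluster`, the use of the symmetric normalized Laplacian, and the farthest-point initialization are all assumptions made for the example.

```python
import numpy as np


def spectral_embed(affinity, k):
    """Embed n items into k dimensions using the normalized graph Laplacian."""
    affinity = np.asarray(affinity, dtype=float)
    degree = affinity.sum(axis=1)
    # Guard against isolated items (zero degree).
    inv_sqrt = 1.0 / np.sqrt(np.maximum(degree, 1e-12))
    # Symmetric normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}
    normalized = inv_sqrt[:, None] * affinity * inv_sqrt[None, :]
    laplacian = np.eye(len(affinity)) - normalized
    # eigh returns eigenvalues in ascending order; keep the k smallest.
    _, vectors = np.linalg.eigh(laplacian)
    embedding = vectors[:, :k]
    # Row-normalize so that each item lies on the unit sphere.
    norms = np.linalg.norm(embedding, axis=1, keepdims=True)
    return embedding / np.maximum(norms, 1e-12)


def kmeans(points, k, iterations=50):
    """Plain Lloyd's k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        dists = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dists)])
    centers = np.array(centers)
    for _ in range(iterations):
        # Squared distance from every point to every center.
        distances = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = distances.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels


def spectral_cluster(affinity, k):
    """Cluster items given a symmetric, non-negative affinity matrix."""
    return kmeans(spectral_embed(affinity, k), k)
```

Here `affinity` is assumed to be a symmetric matrix of pairwise similarities; how the package itself builds that matrix is not specified in this description.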
This software finds clusters among documents based on the bag-of-words representation [3] and TF-IDF weighting [4].
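To make the representation concrete, here is a small standard-library sketch of bag-of-words counting with TF-IDF weighting. It is an illustration only: the helper names `tokenize`, `tfidf_vectors` and `cosine`, the simple alphabetic tokenizer, and the unsmoothed `log(N / df)` form of IDF are assumptions, and the package may use a different variant.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document.

    Uses raw term counts for TF and log(N / df) for IDF; other
    variants (smoothing, sublinear TF) are equally common.
    """
    bags = [Counter(tokenize(doc)) for doc in documents]
    n_docs = len(bags)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for bag in bags for term in bag)
    return [
        {
            term: count * math.log(float(n_docs) / doc_freq[term])
            for term, count in bag.items()
        }
        for bag in bags
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Pairwise cosine similarities between such vectors are one common choice of affinity between documents when clustering text.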
Requirements
The following software is required:
Python 2 or 3
NumPy
SciPy
How to use
Prepare documents as raw-text files, and put them in a directory, for example, ‘reuters’.
Prepare a category file. For example, ‘cats.txt’ may contain:
14833 palm-oil veg-oil
14839 ship
This means that the file ‘14833’ has ‘palm-oil’ and ‘veg-oil’ as its categories, and ‘14839’ has ‘ship’ as its category.
Run: python scluster/clusterer.py cats.txt reuters/ -m kmeans
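For reference, a category file in the format described above could be read with a few lines of Python. This is a sketch under stated assumptions, not the package's own loader: the function name `parse_categories` is hypothetical, and it assumes one document per line with whitespace-separated fields.

```python
def parse_categories(lines):
    """Map each file name to its list of categories.

    Assumes one document per line: the file name first, then its
    categories, all separated by whitespace.
    """
    categories = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        categories[fields[0]] = fields[1:]
    return categories


sample = ["14833 palm-oil veg-oil", "14839 ship"]
print(parse_categories(sample))
```

With a real file, the same function can be given an open file object, for example `parse_categories(open("cats.txt"))`.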
Notes
When using the Reuters set, note that file 17980 may contain a non-Unicode character on line 10. The line should probably read: “world economic growth-side measures …”
http://www.daviddlewis.com/resources/testcollections/reuters21578/
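If editing the file by hand is inconvenient, one possible workaround is to decode documents leniently so that a stray byte does not abort the run. The sketch below is an assumption-laden illustration: the helper name `read_text_lenient` is hypothetical, UTF-8 is assumed as the encoding, and the package itself may read files differently.

```python
import io


def read_text_lenient(path, encoding="utf-8"):
    """Read a text file, replacing undecodable bytes with U+FFFD."""
    # io.open behaves the same on Python 2 and Python 3.
    with io.open(path, encoding=encoding, errors="replace") as handle:
        return handle.read()
```

Replacing the byte keeps the rest of the document usable, at the cost of one unknown character in the affected word.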
Source Distribution
File details
Details for the file scluster-0.0.2.tar.gz.
File metadata
- Download URL: scluster-0.0.2.tar.gz
- Size: 6.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | 18cdb698ccca8c2355b1ef9dbef1340f8ea6003b0cdec845d8f0507cb97b83ad |
MD5 | bddeab556f84f542bc6376110a8679b3 |
BLAKE2b-256 | 41868cd37687f4f6580707e40ebc5f8722ba517cff4ec1c47f271b03eb047829 |