an implementation of spectral clustering for text document collections

Project Description
Homepage: http://github.com/whym/scluster
Contact: http://whym.org

Overview

Spectral clustering is a modern clustering technique that is considered effective for image clustering, among other tasks. [1] [2]

This software finds clusters among documents based on the bag-of-words representation [3] and TF-IDF weighting [4].
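
As a rough illustration of the bag-of-words and TF-IDF idea, the following minimal sketch weights whitespace-separated tokens. It is not taken from scluster: the function name `tfidf_weights`, the lowercase/whitespace tokenization, and the particular variant (relative term frequency times log(N/df)) are assumptions chosen for brevity.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    # Bag-of-words: count term occurrences, ignoring word order.
    bags = [Counter(doc.lower().split()) for doc in docs]
    n_docs = len(bags)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for bag in bags:
        df.update(bag.keys())

    weighted = []
    for bag in bags:
        total = float(sum(bag.values()))
        # tf = relative frequency in the document, idf = log(N / df).
        weighted.append({
            term: (count / total) * math.log(float(n_docs) / df[term])
            for term, count in bag.items()
        })
    return weighted


weights = tfidf_weights(["palm oil prices", "ship oil"])
```

With this variant, a term that occurs in every document (here "oil") gets weight zero, while terms specific to one document keep a positive weight.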

[1] Ulrike von Luxburg, A Tutorial on Spectral Clustering, 2006. http://arxiv.org/abs/0711.0189
[2] Chris H. Q. Ding, Spectral Clustering, 2004. http://ranger.uta.edu/~chqding/Spectral/
[3] http://en.wikipedia.org/wiki/Bag_of_words_model
[4] http://en.wikipedia.org/wiki/Tf%E2%80%93idf
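
For readers new to the technique, here is a minimal sketch of one common form of spectral clustering (normalized Laplacian, then k-means on the leading eigenvectors), in the spirit of the tutorial in [1]. It is an illustration using NumPy and SciPy, not scluster's actual code; the function name `spectral_clusters` and the toy similarity matrix are assumptions. For documents, the similarity matrix could be, for example, cosine similarities between TF-IDF vectors.

```python
import numpy as np
from scipy.cluster.vq import kmeans2


def spectral_clusters(similarity, k):
    """Cluster items from a symmetric, non-negative similarity matrix.

    Hypothetical helper: assumes every row has a positive sum.
    """
    W = np.asarray(similarity, dtype=float)
    degree = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)

    # Symmetric normalized Laplacian: L = I - D^(-1/2) W D^(-1/2)
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

    # eigh returns eigenvalues in ascending order, so the first k
    # eigenvectors belong to the k smallest eigenvalues.
    _, vecs = np.linalg.eigh(L)
    embedding = vecs[:, :k]

    # Normalize each row to unit length before running k-means.
    embedding = embedding / np.linalg.norm(embedding, axis=1, keepdims=True)

    _, labels = kmeans2(embedding, k, minit="++")
    return labels


# Toy example: two groups of three items, strongly similar within a
# group and only weakly similar across groups.
W = np.full((6, 6), 0.01)
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
labels = spectral_clusters(W, 2)
```

The cluster numbering itself is arbitrary; only which items share a label is meaningful.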

Requirements

The following software is required:

  • Python 2 or 3
  • NumPy
  • SciPy

How to use

  1. Prepare documents as raw-text files, and put them in a directory, for example, ‘reuters’.

  2. Prepare a category file. For example, ‘cats.txt’ may contain:

    14833 palm-oil veg-oil
    14839 ship
    

    This means that the file ‘14833’ has ‘palm-oil’ and ‘veg-oil’ as its categories, and ‘14839’ has ‘ship’ as its category.

  3. Run: python scluster/clusterer.py cats.txt reuters/ -m kmeans
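
The category file format described in step 2 can be read with a few lines of Python. The sketch below only illustrates the format (one file name followed by its whitespace-separated categories); `parse_categories` is a hypothetical helper, not a function provided by scluster.

```python
def parse_categories(lines):
    """Map each file name to its list of category labels."""
    categories = {}
    for line in lines:
        fields = line.split()
        if not fields:
            # Skip blank lines such as a trailing empty line.
            continue
        # First field is the file name, the rest are its categories.
        categories[fields[0]] = fields[1:]
    return categories


sample = ["14833 palm-oil veg-oil", "14839 ship", ""]
parsed = parse_categories(sample)
```

To read a real file, the same function could be applied to an open file object, for example `parse_categories(open('cats.txt'))`.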

Notes

  • When you use the Reuters set [5], note that file 17980 may contain a non-Unicode character on line 10. It should probably read: “world economic growth-side measures …”

[5] http://www.daviddlewis.com/resources/testcollections/reuters21578/
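
If you preprocess the documents yourself, one way to cope with such a stray byte is to decode leniently. The sketch below is a general workaround and says nothing about how scluster reads files; `read_document` is a hypothetical helper, and the UTF-8 default is an assumption.

```python
import io


def read_document(path, encoding="utf-8"):
    """Read a text file, replacing undecodable bytes with U+FFFD."""
    # io.open behaves the same way on Python 2 and Python 3.
    with io.open(path, encoding=encoding, errors="replace") as f:
        return f.read()
```

Replaced characters show up as U+FFFD in the returned text, so they can be located and corrected by hand afterwards.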
Release History

  • 0.0.2 (this version)
  • 0.0.1

Download Files

  • scluster-0.0.2.tar.gz (6.8 kB), source distribution, uploaded Dec 30, 2015