Module for automatic text summarization of HTML documents.
Here are some other summarizers:
https://github.com/thavelick/summarize/ - Python, TF (very simple)
Reduction - Python, TextRank (simple)
Open Text Summarizer - C, TF without normalization
Simple program that summarize text - Python, TF without normalization
Intro to Computational Linguistics - Java, LexRank
TextTeaser - Scala
Automatic Document Summarizer - Java, Bipartite HITS (no sources)
Pythia - Python, LexRank & Centroid
SWING - Ruby
Topic Networks - R, topic models & bipartite graphs
Almus: Automatic Text Summarizer - Java, LSA (without source code)
Musutelsa - Java, LSA (always freezes)
MEAD - Perl, various methods + evaluation framework
Installation
Currently sumy can be installed only from the git repository (make sure you have Python installed):
$ wget https://github.com/miso-belica/sumy/archive/master.zip # download the sources
$ unzip master.zip # extract the downloaded file
$ cd sumy-master/
$ [sudo] python setup.py install # install the package
Or simply run:
$ [sudo] pip install git+git://github.com/miso-belica/sumy.git
Usage
Sumy includes a command-line utility for quick summarization of documents.
$ sumy lex-rank --length=10 --url=http://en.wikipedia.org/wiki/Automatic_summarization # what's summarization?
$ sumy luhn --language=czech --url=http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/
$ sumy edmundson --language=czech --length=3% --url=http://cs.wikipedia.org/wiki/Bitva_u_Lipan
$ sumy --help # for more info
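The --length option in the commands above takes either a sentence count (10) or a percentage of the document (3%). How sumy resolves the value internally is not shown here, so the sketch below is only one plausible interpretation. The helper resolve_length is hypothetical and not part of sumy, and its rounding rule is an assumption.

```python
def resolve_length(spec, total_sentences):
    """Turn a length such as "10" or "3%" into a number of sentences.

    Hypothetical helper for illustration only; not part of sumy.
    """
    spec = str(spec).strip()
    if spec.endswith("%"):
        percentage = int(spec[:-1])
        # Assumption: round down, but keep at least one sentence.
        count = max(1, total_sentences * percentage // 100)
    else:
        count = int(spec)
    # Never ask for more sentences than the document contains.
    return min(count, total_sentences)


if __name__ == "__main__":
    print(resolve_length("10", 200))  # absolute sentence count
    print(resolve_length("3%", 200))  # percentage of a 200-sentence document
```

Under these assumptions a percentage always yields at least one sentence, so a very short document still produces a non-empty summary.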
To evaluate a summarization method against a reference summary, use the commands below:
$ sumy_eval lex-rank reference_summary.txt --url=http://en.wikipedia.org/wiki/Automatic_summarization
$ sumy_eval lsa reference_summary.txt --language=czech --url=http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/
$ sumy_eval edmundson reference_summary.txt --language=czech --url=http://cs.wikipedia.org/wiki/Bitva_u_Lipan
$ sumy_eval --help # for more info
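The metrics that sumy_eval reports are not listed here. To illustrate what comparing a generated summary with reference_summary.txt can involve, the sketch below computes co-selection precision, recall and F-score over sentences. The function co_selection_scores is a hypothetical name used for illustration; it is not claimed to be sumy's implementation.

```python
def co_selection_scores(evaluated_sentences, reference_sentences):
    """Return (precision, recall, f_score) for two extractive summaries.

    Hypothetical helper for illustration only; not part of sumy.
    Sentences are compared by exact equality.
    """
    evaluated = set(evaluated_sentences)
    reference = set(reference_sentences)
    if not evaluated or not reference:
        raise ValueError("Both summaries must contain at least one sentence.")

    common = len(evaluated & reference)
    precision = common / len(evaluated)  # share of chosen sentences that are in the reference
    recall = common / len(reference)     # share of reference sentences that were chosen
    if common == 0:
        f_score = 0.0
    else:
        f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


if __name__ == "__main__":
    summary = ["Sentence A.", "Sentence B.", "Sentence C.", "Sentence D."]
    reference = ["Sentence B.", "Sentence C."]
    print(co_selection_scores(summary, reference))
```

Exact sentence matching only makes sense for extractive summaries, where both summaries are drawn from the same source document.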
Python API
You can also use sumy as a library in your project.
# -*- coding: utf8 -*-

from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals

from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.nlp.stemmers.czech import stem_word
from sumy.utils import get_stop_words


if __name__ == "__main__":
    url = "http://www.zsstritezuct.estranky.cz/clanky/predmety/cteni/jak-naucit-dite-spravne-cist.html"
    # Download the page and parse it with a Czech tokenizer.
    parser = HtmlParser.from_url(url, Tokenizer("czech"))

    # LSA summarizer with a Czech stemmer and Czech stop words.
    summarizer = LsaSummarizer(stem_word)
    summarizer.stop_words = get_stop_words("czech")

    # Print a 20-sentence summary of the document.
    for sentence in summarizer(parser.document, 20):
        print(sentence)
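The call summarizer(parser.document, 20) above yields the sentences selected for the summary. To illustrate the general idea of extractive summarization without depending on sumy, here is a minimal term-frequency (TF) sketch in the spirit of the simple TF summarizers listed earlier. It is not how LsaSummarizer works. The name summarize_by_frequency, the naive sentence splitting and the scoring rule are all assumptions made for this example.

```python
import re
from collections import Counter


def summarize_by_frequency(text, sentences_count, stop_words=()):
    """Pick the highest-scoring sentences by average word frequency.

    Illustrative sketch only; not part of sumy.
    """
    # Naive split: a sentence ends with ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    stop_words = set(w.lower() for w in stop_words)

    def words(sentence):
        return [w for w in re.findall(r"\w+", sentence.lower()) if w not in stop_words]

    # Count how often each non-stop word occurs in the whole text.
    frequencies = Counter(w for s in sentences for w in words(s))

    def score(sentence):
        tokens = words(sentence)
        if not tokens:
            return 0.0
        # Normalize by length so long sentences are not favored automatically.
        return sum(frequencies[w] for w in tokens) / len(tokens)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    # Return the selected sentences in their original document order.
    chosen = sorted(ranked[:sentences_count])
    return [sentences[i] for i in chosen]


if __name__ == "__main__":
    text = "Cats sleep a lot. Dogs bark. Cats and dogs are pets. The weather is nice."
    for sentence in summarize_by_frequency(text, 2, stop_words=("a", "and", "are", "the", "is")):
        print(sentence)
```

Returning the chosen sentences in document order rather than score order keeps the summary readable as running text.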
Tests
Run the tests with:
$ nosetests-2.6 && nosetests-3.2 && nosetests-2.7 && nosetests-3.3
Changelog
0.1.0 (2013-MM-DD)
First public release.