Skip to main content
Donate to the Python Software Foundation or Purchase a PyCharm License to Benefit the PSF! Donate Now

An implementation of ROUGE family metrics for automatic summarization

Project description

This is a pure Python implementation of the ROUGE metrics family in the automatic summarization fields
following the paper *ROUGE: A Package for Automatic Evaluation of Summaries Chin-Yew Lin et al.*.
It is an attempt to implement these metrics correctly and elegantly in total Python. It provides the following features:
- ROUGE-N, ROUGE-L and ROUGE-W are currently supported.
- Flexible input. For each metrics supported, a sentence level and a summary level variants are provided, which means you
can use them in a machine translation context with sentence pairs.
- Correctness. All the claimed implemented metrics are tested against a non-trivial amount of data, using the plain old perl script
as a baseline.
- Self-contained. The total implementation is *one single script* in *one single package*.
No dependency except a Python-3 is required. *No perl script involved.*
- Fast. At least faster the perl script wrappers.
- Preprocessing is a total freedom of the client. We establish API on the concept of *sentence*, which is list of tokens
and *sentences*, which is a list of *sentence*. Preprocessing like *stopword-removal*, *stemming* and *tokenization* is
left to client totally.
- Well documented. Every function is `doctest` powered.
- Procedural style API. You don't need to instantiate an object. Just call the function that does the right job.

# Usage
An example use of calculating `ROUGE-2` on sentence level is provided:
from rouge.rouge import rouge_n_sentence_level

summary_sentence = 'the capital of China is Beijing'.split()
reference_sentence = 'Beijing is the capital of China'.split()

# Calculate ROUGE-2.
recall, precision, rouge = rouge_n_sentence_level(summary_sentence, reference_sentence, 2)
print('ROUGE-2-R', recall)
print('ROUGE-2-P', precision)
print('ROUGE-2-F', rouge)

# If you just want the F-measure you can do this:
*_, rouge = rouge_n_sentence_level(summary_sentence, reference_sentence, 2) # Requires a Python-3 to use *_.
print('ROUGE-2-R', recall)

For more usage examples, please refer to ``.

# Install
Currently not uploaded to PyPi...
git clone
python ./ install

# Dependencies
The code is *only* tested on `python==3.6.2` but it should work with higher version of Python.
If you want to run the tests locally, you need to install [Pythonrouge](, which is a wrapper on the original
perl script. To install it:
# not using pip
git clone
python install

# using pip
pip install git+
Then go to the project root directory, run:
# Run doctests.
python -m doctest ./rouge/*.py -v

# Run unittests.
python -m unittest -v
Since the python wrapper is generally slow, the tests takes more than a few minutes to finish.

# Rationals
In this section we talk about the rationals for reinventing the wheels or reimplementing the ROUGE metrics in a few words.

## The Complexity
ROUGE is *not* a trivial metric. First of all, it has a lot of variants, `ROUGE-N`, `ROUGE-L`, `ROUGE-W`, `ROUGE-S`, `ROUGE-SU`, `ROUGE-BE`.
That's why the author called it *a package*. To implement each of them correctly is not a trivial job.
Second, ROUGE has two completely different signatures. By *signature* I mean the shape of the input data. It has both *sentence level* and
*summary level* defined, sometimes in varying forms. Third, ROUGE can handle multiple references and multiple summaries altogether.
It handles multiple references by fixing the summary to a single one and calculating a list of values given a list of references. It then
uses a *jackknifing procedure* to reduce those values to a scalar. It handles multiple summaries by using a *A-or-B* scheme to reduce the list
of values produced by the list of summaries to a scalar. For *A-or-B*, `A` means taking average and `B` means taking the best one.
If you feel calm after learning these, ROUGE can handle a great deal of preprocessing on the input data and each of them can affect the final score.
Few projects on Github implement all these functionalities correctly. Some of them don't even realize the problems of signatures.

## The Plain Old Perl Script
While the plain old perl script `` implements all these things correctly, its interface is quite discouraging.
It has a long array of single-character options with some options affecting the others!
It takes the input from a fixed directory format and requires a configure file in XML, making it less usable in the context of
rapid prototyping and development. Worse still, it *don't* provide an application programmer interface even in perl. You have no way
to use it programmatically except lauching a process, which is very expensive.

## Our Usage Scenario
Although ROUGE is originated from automatic summarization, we want to use it in a sentence-oriented style.
That's why the sentence level API is emphasized in the project. We want to evaluate the average ROUGE score on a large
number of `response-ground_truth` pairs quickly. The wrapper way is not taken since its inefficiency.

## Simplification of the Problem
The first simplification we made is to throw away any preprocessing and stick to a general representation of *sentence*.
A sentence is just a list of strings or tokens. Since then we don't have to implement any tokenizer. In fact we don't
*need* to because there are a lot of libraries can do that nicely, like `nltk` and `spicy`.

The second decision is that we don't care about *multiple references and multiple summaries*.
We *only* care about *a single reference and a single summary*. Unlike BLEU, multiple references means all the same
to every variant of ROUGE (almost always Jackknifing). And how to combine scores from different instance of summary is not our interest.
They can be implemented in the project however, only as extensions to the core metrics.

After making the two decisions, our code can be implemented in a clear and reusable way.
Hopefully, it will also be *extensible* because right now we haven't implement *all* of the ROUGE metrics.
And you know, a ROUGE can never have too many metrics!

# Acknowledgement
The test data is taken literally from [sumeval](
The code borrows ideas from both `sumeval` and the `` script of [tensorflow/nmt](
The API style closely follows that of `nltk.translate.bleu_score`.

# References
[1] ROUGE: A Package for Automatic Evaluation of Summaries.

[2] ROUGE 2.0: Updated and Improved Measures for Evaluation of Summarization Tasks.

[3] Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics.

Project details

Release history Release notifications

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Filename, size & hash SHA256 hash help File type Python version Upload date
easy_rouge-0.2.2-py3-none-any.whl (22.1 kB) Copy SHA256 hash SHA256 Wheel py3

Supported by

Elastic Elastic Search Pingdom Pingdom Monitoring Google Google BigQuery Sentry Sentry Error logging AWS AWS Cloud computing DataDog DataDog Monitoring Fastly Fastly CDN SignalFx SignalFx Supporter DigiCert DigiCert EV certificate StatusPage StatusPage Status page