
A fork of Multilingual ROUGE (a fork of XL-Sum)


Multilingual ROUGE Scoring

This is a port of the XL-Sum repository, packaged as a simple utility that exposes its multilingual ROUGE-L scorer for MIRAGE-Bench, with the following change:

  1. Added multilingual_rouge/bengali_stemmer from the bengali-stemmer repository, as it was not available on PyPI.

I do not own the codebase, so please direct any questions to the XL-Sum repository.

Installation

To install this library, please use the following:

python -m unidic download # for japanese segmentation
pip install -U multilingual-rouge

Alternatively, to install from source:

python -m unidic download # for japanese segmentation
pip install -e .

Overview

ROUGE is the de facto evaluation metric for text summarization. However, it was designed specifically for evaluating English text. Because ROUGE counts overlapping tokens, scores depend heavily on tokenization, stemming, removal of unnecessary characters, and similar preprocessing. This repo addresses these issues by adapting rouge-score (Google's ROUGE implementation) and adding the following main features:

  • Enables multilingual ROUGE scoring by using popular word segmentation and stemming algorithms for various languages.
  • Removes only punctuation characters, as defined by the Unicode data tables, during text normalization. This enables basic ROUGE scoring even when no segmenter or stemmer is available for a language.
  • Provides an easy-to-use interface for plugging in custom tokenization and stemming implementations.
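
To illustrate the normalization described in the second point, here is a minimal sketch that drops characters whose Unicode general category starts with "P" and then computes a unigram-overlap F-measure on whitespace tokens. The helper names strip_punctuation and rouge1_f are hypothetical and are not part of this library; the library's actual normalization may differ in detail.

```python
import unicodedata
from collections import Counter


def strip_punctuation(text):
    """Drop every character whose Unicode general category starts with 'P'."""
    # Hypothetical helper for illustration, not the library's implementation.
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))


def rouge1_f(target, prediction, normalize=True):
    """Unigram-overlap F-measure on lowercased whitespace tokens (illustrative)."""
    if normalize:
        target = strip_punctuation(target)
        prediction = strip_punctuation(prediction)
    t, p = target.lower().split(), prediction.lower().split()
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((Counter(t) & Counter(p)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(t)
    return 2 * precision * recall / (precision + recall)


print(strip_punctuation("a।b"))  # the Bengali/Devanagari danda is punctuation
print(rouge1_f("Hello, world!", "hello world"))
print(rouge1_f("Hello, world!", "hello world", normalize=False))
```

Without normalization, "hello," and "hello" count as different tokens, so the same pair of sentences scores zero. Symbols such as "$" or "+" belong to the Unicode "S" categories, so this sketch leaves them in place.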

Supported language names for stemming

bengali, hindi, turkish, arabic, danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, portuguese, romanian, russian, spanish, swedish

Supported language names for word segmentation

chinese, thai, japanese, burmese
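
Word segmentation matters for these languages because their text is usually written without spaces between words. The sketch below shows that whitespace splitting yields a single token for a Chinese sentence, so two related sentences share no tokens. The character-level split is only a crude, hypothetical stand-in for illustration; it is not one of the segmenters used by this library.

```python
def whitespace_tokens(text):
    """Split on whitespace, as is typical for space-delimited languages."""
    return text.split()


def char_tokens(text):
    """Crude stand-in for a real segmenter: one token per non-space character."""
    return [ch for ch in text if not ch.isspace()]


reference = "我喜欢自然语言处理"
candidate = "我喜欢机器学习"

# Whitespace splitting sees each sentence as one opaque token.
print(whitespace_tokens(reference))

# A finer-grained split exposes the overlap between the two sentences.
shared = set(char_tokens(reference)) & set(char_tokens(candidate))
print(sorted(shared))
```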

Setup

pip3 install -r requirements.txt
python3 -m unidic download # for japanese segmentation
pip3 install --upgrade ./

Example Usage

Using CLI

# --use_stemmer and --lang are optional
python -m rouge_score.rouge \
    --target_filepattern=*.targets \
    --prediction_filepattern=*.decodes \
    --output_filename=scores.csv \
    --use_stemmer=true \
    --lang="bengali"

Using python

  • Default usage
from multilingual_rouge import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
scores = scorer.score('The quick brown fox jumps over the lazy dog',
                      'The quick brown dog jumps on the log.')
  • With provided language
from multilingual_rouge import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True, lang="bengali")
scores = scorer.score('তোমার সাথে দেখা হয়ে ভালো লাগলো।',
                      'আপনার সাথে দেখা হয়ে ভালো লাগলো।')
  • With your own stemming / word segmentation implementation

    Custom stemmer / tokenizer implementations must be callable objects, i.e. functions or instances of classes that implement __call__. If lang is also given, user-provided implementations take precedence over the library-provided ones.

from multilingual_rouge import rouge_scorer

# example with custom stemming
class DummyStemmer(object):
    def __call__(self, token):
        stem = ""
        # your stemmer implementation
        return stem

# example with custom segmenter/tokenizer
def dummy_tokenize(text):
    tokens = []
    # your tokenizer implementation
    return tokens

scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True,
                                  callable_stemmer=DummyStemmer(),
                                  callable_tokenizer=dummy_tokenize)
scores = scorer.score('The quick brown fox jumps over the lazy dog',
                      'The quick brown dog jumps on the log.')
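
As a more concrete version of the placeholders above, the following sketch fills in a toy suffix-stripping stemmer and a regex tokenizer. The names SuffixStemmer and regex_tokenize are hypothetical, both are English-oriented, and the calling convention (stemmer called once per token, tokenizer called once per text) is assumed from the dummy signatures shown above. The block runs on its own, without the library.

```python
import re


class SuffixStemmer:
    """Toy stemmer: strips the first matching suffix from a token."""

    def __init__(self, suffixes=("ing", "ed", "s")):
        # Try longer suffixes first so "ing" is checked before "s".
        self.suffixes = sorted(suffixes, key=len, reverse=True)

    def __call__(self, token):
        for suffix in self.suffixes:
            # Keep at least three characters of stem to avoid over-stripping.
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                return token[: -len(suffix)]
        return token


def regex_tokenize(text):
    """Lowercase the text and return runs of word characters as tokens."""
    return re.findall(r"\w+", text.lower())


stemmer = SuffixStemmer()
tokens = [stemmer(tok) for tok in regex_tokenize("The dogs jumped over sleeping cats.")]
print(tokens)  # ['the', 'dog', 'jump', 'over', 'sleep', 'cat']
```

An instance such as SuffixStemmer() and the function regex_tokenize could then be passed as callable_stemmer and callable_tokenizer, in the same way as the dummy implementations above.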
                      

License

Originally licensed under the Apache 2.0 License.
