A fork of Multilingual ROUGE (a fork of XL-Sum)
Multilingual ROUGE Scoring
This is a port of the XL-Sum repository, packaged as a simple utility that exposes its multilingual ROUGE-L scoring for MIRAGE-Bench, with the following change:
- Added multilingual_rouge/bengali_stemmer from the bengali-stemmer repository, as it was not available on PyPI.
I do not own the codebase, so please direct any questions to the XL-Sum repository.
Installation
To install this library, run the following:
pip install -U multilingual-rouge
python -m unidic download # for japanese segmentation; run after unidic is installed
Alternatively, to install from source:
pip install -e .
python -m unidic download # for japanese segmentation; run after unidic is installed
Overview
ROUGE is the de facto evaluation metric for text summarization. However, it was designed specifically for evaluating English texts. Because of how the metric works, scores depend heavily on text tokenization, stemming, removal of unnecessary characters, and similar preprocessing. This repo tries to address these issues by adding the following main features on top of an adaptation of rouge-score, Google's ROUGE implementation:
- Enables multilingual ROUGE scoring by making use of popular word segmentation / stemming algorithms for various languages.
- Removes only punctuation characters, as defined by the Unicode data tables, as part of text normalization. This enables basic ROUGE scoring even in the absence of a segmenter / stemmer for a given language.
- Provides an easy-to-use interface for plugging in custom tokenization / stemming implementations.
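To make the normalization point concrete, here is a minimal sketch (not the library's actual code) of stripping punctuation by Unicode category and of how that changes a unigram-overlap, ROUGE-1 style score. The helper names `strip_punctuation` and `unigram_f1` are hypothetical, and both the choice to replace punctuation with a space and the plain whitespace tokenization are assumptions made for illustration.

```python
import unicodedata
from collections import Counter


def strip_punctuation(text):
    """Replace every character in a Unicode punctuation category (P*) with a space."""
    cleaned = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in text
    )
    # Collapse the runs of whitespace left behind by the replacement.
    return " ".join(cleaned.split())


def unigram_f1(target, prediction):
    """F1 over clipped unigram overlap, using plain whitespace tokenization."""
    target_counts = Counter(target.lower().split())
    prediction_counts = Counter(prediction.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((target_counts & prediction_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(prediction_counts.values())
    recall = overlap / sum(target_counts.values())
    return 2 * precision * recall / (precision + recall)


target = "the dog sat on the log"
prediction = "the dog sat on the log."

# Without normalization, "log." and "log" count as different tokens.
raw_score = unigram_f1(target, prediction)
# After punctuation stripping, the two texts are identical.
clean_score = unigram_f1(strip_punctuation(target), strip_punctuation(prediction))
print(raw_score, clean_score)
```

Because the check relies only on the Unicode category, the same function also handles non-Latin sentence terminators such as the danda used in Bengali and Hindi, without any language-specific rules.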
Supported language names for stemming
bengali, hindi, turkish, arabic, danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, portuguese, romanian, russian, spanish, swedish
Supported language names for word segmentation
chinese, thai, japanese, burmese
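As a small convenience, the two lists above can be captured in a lookup that tells you what kind of language-specific processing to expect before building a scorer. This is a sketch only: `language_support` and the two constants are hypothetical names that are not part of the library, and the sketch matches names exactly as they are written above.

```python
# Language names copied from the two lists above.
STEMMING_LANGUAGES = frozenset({
    "bengali", "hindi", "turkish", "arabic", "danish", "dutch",
    "english", "finnish", "french", "german", "hungarian", "italian",
    "norwegian", "portuguese", "romanian", "russian", "spanish", "swedish",
})
SEGMENTATION_LANGUAGES = frozenset({"chinese", "thai", "japanese", "burmese"})


def language_support(lang):
    """Return which kind of language-specific processing is listed for `lang`."""
    if lang in STEMMING_LANGUAGES:
        return "stemming"
    if lang in SEGMENTATION_LANGUAGES:
        return "segmentation"
    # Unlisted languages still get punctuation-only normalization.
    return "punctuation-only"


for name in ("bengali", "japanese", "korean"):
    print(name, "->", language_support(name))
```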
Setup
pip3 install -r requirements.txt
python3 -m unidic download # for japanese segmentation
pip3 install --upgrade ./
Example Usage
Using CLI
# --use_stemmer and --lang are optional
python -m rouge_score.rouge \
  --target_filepattern='*.targets' \
  --prediction_filepattern='*.decodes' \
  --output_filename=scores.csv \
  --use_stemmer=true \
  --lang="bengali"
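The CLI reads paired target and prediction files. The sketch below writes such a pair from Python. It assumes one record per line with a trailing newline, which is the default for the upstream rouge-score CLI; this fork may behave differently, so check its source before relying on it. The helper `write_parallel_files` and the `sample` file stem are hypothetical.

```python
import tempfile
from pathlib import Path


def write_parallel_files(pairs, directory, stem="sample"):
    """Write (target, prediction) pairs to <stem>.targets and <stem>.decodes."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    target_path = directory / f"{stem}.targets"
    prediction_path = directory / f"{stem}.decodes"
    # A newline inside a record would break the line-by-line pairing,
    # so flatten all whitespace within each text to single spaces.
    targets = [" ".join(target.split()) for target, _ in pairs]
    predictions = [" ".join(prediction.split()) for _, prediction in pairs]
    # Assumption: every record, including the last, ends with a newline.
    target_path.write_text("\n".join(targets) + "\n", encoding="utf-8")
    prediction_path.write_text("\n".join(predictions) + "\n", encoding="utf-8")
    return target_path, prediction_path


pairs = [
    ("The quick brown fox jumps over the lazy dog",
     "The quick brown dog jumps on the log."),
    ("A second reference\nspanning two lines", "A second prediction"),
]

with tempfile.TemporaryDirectory() as tmp:
    target_path, prediction_path = write_parallel_files(pairs, tmp)
    target_lines = target_path.read_text(encoding="utf-8").splitlines()
    prediction_lines = prediction_path.read_text(encoding="utf-8").splitlines()
    print(target_path.name, len(target_lines))
    print(prediction_path.name, len(prediction_lines))
```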
Using python
- Default usage
from multilingual_rouge import rouge_scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
scores = scorer.score('The quick brown fox jumps over the lazy dog',
                      'The quick brown dog jumps on the log.')
- With provided language
from multilingual_rouge import rouge_scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True, lang="bengali")
scores = scorer.score('তোমার সাথে দেখা হয়ে ভালো লাগলো।',
                      'আপনার সাথে দেখা হয়ে ভালো লাগলো।')
- With your own stemming / word segmentation implementation
Custom stemmer/tokenizer implementations must be callable objects, i.e. functions or classes with a __call__ method implemented. If lang is also given, user-provided implementations take precedence over the library-provided ones.
from multilingual_rouge import rouge_scorer

# example with custom stemming
class DummyStemmer(object):
    def __call__(self, token):
        stem = ""
        # your stemmer implementation
        return stem

# example with custom segmenter/tokenizer
def dummy_tokenize(text):
    tokens = []
    # your tokenizer implementation
    return tokens

scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True,
                                  callable_stemmer=DummyStemmer(),
                                  callable_tokenizer=dummy_tokenize)
scores = scorer.score('The quick brown fox jumps over the lazy dog',
                      'The quick brown dog jumps on the log.')
- For the full list of available keyword arguments, and for reference stemmer and segmenter implementations, refer to rouge_scorer.py, stemmers.py and tokenizers.py.
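For a slightly more concrete picture of the callable contract described above, here is a minimal sketch of a stemmer class and a tokenizer function that satisfy it. `SuffixStemmer` and `regex_tokenize` are hypothetical names, the suffix-stripping rule is deliberately naive, and the regex tokenizer is intended only for space-delimited, Latin-script text; neither reflects how the library's built-in stemmers or segmenters work.

```python
import re


class SuffixStemmer:
    """Callable stemmer: strips the first matching suffix from a token."""

    def __init__(self, suffixes, min_stem_length=3):
        # Try longer suffixes first so that "ing" is checked before "s".
        self.suffixes = sorted(suffixes, key=len, reverse=True)
        self.min_stem_length = min_stem_length

    def __call__(self, token):
        for suffix in self.suffixes:
            stem_length = len(token) - len(suffix)
            # Only strip when enough of the word remains to be a plausible stem.
            if token.endswith(suffix) and stem_length >= self.min_stem_length:
                return token[:stem_length]
        return token


def regex_tokenize(text):
    """Callable tokenizer: lowercase the text, then keep runs of word characters."""
    return re.findall(r"\w+", text.lower())


stemmer = SuffixStemmer(["ing", "s"])
tokens = regex_tokenize("The quick brown dog jumps on the log.")
stems = [stemmer(token) for token in tokens]
print(stems)
```

An instance such as `SuffixStemmer(["ing", "s"])` and the function `regex_tokenize` could then be passed as callable_stemmer and callable_tokenizer in place of the dummy implementations above.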
License
Originally licensed under the Apache 2.0 License.