
Evaluation as a Service for Natural Language Processing





Usage

Detailed documentation can be found here. To install the API, simply run

pip install eaas

To use the API, go through the following two steps.

  • Step 1: Load the default configuration and modify it if needed.
from eaas import Config
config = Config()

# To see the metrics we support, run
print(config.metrics())

# To see the default configuration of a metric, run
print(config.bleu.to_dict())

# Here is an example of modifying the default config.
config.bleu.set_property("smooth_method", "floor")
print(config.bleu.to_dict())
  • Step 2: Initialize the client and send your inputs. Send your inputs as a whole (a list of dicts) rather than one sample at a time, which is much slower.
from eaas import Client
client = Client()
client.load_config(config)  # The config you have created above

# To use this API for scoring, format your input as a list of dictionaries. 
# Each dictionary consists of `source` (string, optional), `references` (list of strings, optional) and `hypothesis` (string, required). Whether `source` and `references` are needed depends on the metrics you want to use. 
# Please do not apply any preprocessing to `source`, `references` or `hypothesis`. 
# We expect normal-cased, detokenized text; all preprocessing is handled by the metrics. 
# Below is a simple example.

inputs = [{"source": "This is the source.", 
           "references": ["This is the reference one.", "This is the reference two."],
           "hypothesis": "This is the generated hypothesis."}]
metrics = ["bleu", "chrf"]  # Set to None to use all supported metrics

score_dic = client.score(inputs, task="sum", metrics=metrics, lang="en", cal_attributes=True) 
# `inputs` is a list of dicts, `task` is the task name (used for calculating attributes), `metrics` is the list of metrics, and `lang` is the two-letter language code.
# You can set cal_attributes=False to save time, since some attribute calculations can be slow.

The output looks like this:

# sample_level is a list of dicts; corpus_level is a dict
{
    'sample_level': [
        {
            'bleu': 32.46679154750991, 
            'attr_compression': 0.8333333333333334, 
            'attr_copy_len': 2.0, 
            'attr_coverage': 0.6666666666666666, 
            'attr_density': 1.6666666666666667, 
            'attr_hypothesis_len': 6, 
            'attr_novelty': 0.6, 
            'attr_repetition': 0.0, 
            'attr_source_len': 5, 
            'chrf': 38.56890099861521
        }
    ], 
    'corpus_level': {
        'corpus_bleu': 32.46679154750991, 
        'corpus_attr_compression': 0.8333333333333334, 
        'corpus_attr_copy_len': 2.0, 
        'corpus_attr_coverage': 0.6666666666666666, 
        'corpus_attr_density': 1.6666666666666667, 
        'corpus_attr_hypothesis_len': 6.0, 
        'corpus_attr_novelty': 0.6, 
        'corpus_attr_repetition': 0.0, 
        'corpus_attr_source_len': 5.0, 
        'corpus_chrf': 38.56890099861521
    }
}
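The per-sample and corpus-level results can then be read with plain dictionary access. Here is a short sketch using the (truncated) example output above; no extra API calls are assumed:

```python
# Example output of client.score(), truncated to the bleu/chrf entries shown above.
score_dic = {
    "sample_level": [
        {"bleu": 32.46679154750991, "chrf": 38.56890099861521},
    ],
    "corpus_level": {
        "corpus_bleu": 32.46679154750991,
        "corpus_chrf": 38.56890099861521,
    },
}

# Per-sample scores: one dict per input sample, in the same order as `inputs`.
sample_bleu = [sample["bleu"] for sample in score_dic["sample_level"]]

# Corpus-level scores use the `corpus_` key prefix.
corpus_bleu = score_dic["corpus_level"]["corpus_bleu"]

print(sample_bleu, corpus_bleu)
```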

Supported Metrics

Currently, EaaS supports the following metrics:

  • bart_score_cnn_hypo_ref: BARTScore is a sequence-to-sequence framework based on the pre-trained language model BART. bart_score_cnn_hypo_ref uses BART fine-tuned on CNNDM. It calculates the average of the generation scores Score(hypothesis|reference) and Score(reference|hypothesis).
  • bart_score_summ: BARTScore using BART fine-tuned on CNNDM. It calculates Score(hypothesis|source).
  • bart_score_mt: BARTScore using BART fine-tuned on ParaBank2. It calculates the average of the generation scores Score(hypothesis|reference) and Score(reference|hypothesis).
  • bert_score_p: BERTScore is a metric designed for evaluating translated text using a BERT-based matching framework. bert_score_p calculates the BERTScore precision.
  • bert_score_r: BERTScore recall.
  • bert_score_f: BERTScore F1 score.
  • bleu: BLEU measures modified n-gram matches between each candidate translation and the reference translations.
  • chrf: chrF measures character-level n-gram matches between the hypothesis and the reference.
  • comet: COMET is a neural framework for training multilingual machine translation evaluation models. comet uses the wmt20-comet-da checkpoint which utilizes source, hypothesis and reference.
  • comet_qe: COMET for quality estimation. comet_qe uses the wmt20-comet-qe-da checkpoint which utilizes only source and hypothesis.
  • mover_score: MoverScore is a metric similar to BERTScore; unlike BERTScore, it uses the Earth Mover’s Distance instead of the Euclidean distance.
  • prism: PRISM is a sequence-to-sequence framework trained from scratch. prism calculates the average of the generation scores Score(hypothesis|reference) and Score(reference|hypothesis).
  • prism_qe: PRISM for quality estimation. It calculates Score(hypothesis|source).
  • rouge1: ROUGE-1 refers to the overlap of unigrams (single words) between the system and reference summaries.
  • rouge2: ROUGE-2 refers to the overlap of bigrams between the system and reference summaries.
  • rougeL: ROUGE-L refers to the longest common subsequence between the system and reference summaries.
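To make the ROUGE-1 definition above concrete, here is a minimal sketch of ROUGE-1 recall (clipped unigram overlap divided by the number of reference unigrams). This is an illustration only, not the implementation EaaS calls:

```python
from collections import Counter

def rouge1_recall(reference: str, hypothesis: str) -> float:
    """Clipped unigram overlap between hypothesis and reference,
    divided by the number of reference unigrams (naive whitespace tokens)."""
    ref_counts = Counter(reference.lower().split())
    hyp_counts = Counter(hypothesis.lower().split())
    overlap = sum(min(hyp_counts[word], count) for word, count in ref_counts.items())
    return overlap / sum(ref_counts.values())

print(rouge1_recall("the cat sat on the mat", "the cat is on the mat"))  # 5/6 ≈ 0.8333
```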

Support for Attributes

The task option in the client.score() function decides which attributes are calculated. Currently, attributes are only supported for the summarization task (task="sum"). The following attributes (reference: this paper) are calculated when cal_attributes is set to True in client.score(). They are all reference-free.

  • source_len: measures the length of the source text.
  • hypothesis_len: measures the length of the hypothesis text.
  • density & coverage: measure to what extent a summary covers the content of the source text.
  • compression: measures the compression ratio from the source text to the generated summary.
  • repetition: measures the rate of repeated segments in the summary; segments are instantiated as trigrams.
  • novelty: measures the proportion of segments in the summary that do not appear in the source document; segments are instantiated as bigrams.
  • copy_len: measures the average length of segments in the summary that are copied from the source document.
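To make two of these definitions concrete, here is a rough sketch of compression and novelty that reproduces the example output above. The tokenizer and the set-based bigram comparison are simplifying assumptions; EaaS's own implementation may differ:

```python
import re

def tokenize(text: str) -> list:
    # Naive tokenizer (assumption): words and punctuation marks as separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

def compression(source: str, hypothesis: str) -> float:
    """Source length divided by hypothesis length, in tokens."""
    return len(tokenize(source)) / len(tokenize(hypothesis))

def novelty(source: str, hypothesis: str, n: int = 2) -> float:
    """Proportion of distinct hypothesis n-grams absent from the source."""
    def ngrams(tokens, n):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    src = ngrams(tokenize(source), n)
    hyp = ngrams(tokenize(hypothesis), n)
    return len(hyp - src) / len(hyp)

src = "This is the source."
hyp = "This is the generated hypothesis."
print(compression(src, hyp))  # 5/6 ≈ 0.8333, matching attr_compression above
print(novelty(src, hyp))      # 3/5 = 0.6, matching attr_novelty above
```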

Support for Common Metrics

We support quick calculation of BLEU and ROUGE (1, 2, L); see the following example for usage.

from eaas import Config, Client
config = Config()
client = Client()
client.load_config(config) 

# Note that the input format is different from that of the `score` function. 
references = [["This is the reference one for sample one.", "This is the reference two for sample one."],
              ["This is the reference one for sample two.", "This is the reference two for sample two."]]
hypothesis = ["This is the generated hypothesis for sample one.", 
              "This is the generated hypothesis for sample two."]

# Calculate BLEU
client.bleu(references, hypothesis, task="sum", lang="en", cal_attributes=False)

# Calculate ROUGEs
client.rouge1(references, hypothesis, task="sum", lang="en", cal_attributes=False)
client.rouge2(references, hypothesis, task="sum", lang="en", cal_attributes=False)
client.rougeL(references, hypothesis, task="sum", lang="en", cal_attributes=False)
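Because this input format differs from client.score(), a small helper can split the `score`-style list of dicts into the parallel lists used above. This is a convenience sketch, not part of the eaas API; the field names follow the `score` input format:

```python
def to_parallel_lists(inputs):
    """Split `score`-style inputs (a list of dicts) into a list of
    reference lists and a list of hypothesis strings, preserving order."""
    references = [sample["references"] for sample in inputs]
    hypothesis = [sample["hypothesis"] for sample in inputs]
    return references, hypothesis

inputs = [{"references": ["Ref one.", "Ref two."], "hypothesis": "Hyp one."}]
references, hypothesis = to_parallel_lists(inputs)
print(references)  # [['Ref one.', 'Ref two.']]
print(hypothesis)  # ['Hyp one.']
```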

Support for Prompts

Prompts can sometimes improve performance for certain metrics (see this paper). In our client.score() function, we support adding prompts to the source/hypothesis/references at both the prefix and suffix positions. An example is shown below.

from eaas import Config, Client
config = Config()
client = Client()
client.load_config(config)

inputs = [
    {
        "source": "This is the source.",
        "references": ["This is the reference one.", "This is two."],
        "hypothesis": "This is the generated hypothesis."
    }
]

prompt_info = {
    "source": {"prefix": "This is source prefix", "suffix": "This is source suffix"},
    "hypothesis": {"prefix": "This is hypothesis prefix", "suffix": "This is hypothesis suffix"},
    "reference": {"prefix": "This is reference prefix", "suffix": "This is reference suffix"}
}

# adding this prompt info will automatically turn the inputs into
# [{'source': 'This is source prefix This is the source. This is source suffix', 
#   'references': ['This is reference prefix This is the reference one. This is reference suffix', 'This is reference prefix This is two. This is reference suffix'], 
#   'hypothesis': 'This is hypothesis prefix This is the generated hypothesis. This is hypothesis suffix'}]

# Here is a simpler example.
# prompt_info = {"source": {"prefix": "This is prefix"}}

score_dic = client.score(inputs, task="sum", metrics=["bart_score_summ"], lang="en", cal_attributes=False, **prompt_info)
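The transformation shown in the comments above amounts to joining the prefix, the original text, and the suffix with single spaces. A minimal sketch of that behavior (an illustration of the output shown above, not the library's own code):

```python
def apply_prompt(text: str, prompt: dict) -> str:
    """Attach optional `prefix` and `suffix` strings to a text, space-separated.
    Missing or empty entries are simply skipped."""
    parts = [prompt.get("prefix", ""), text, prompt.get("suffix", "")]
    return " ".join(part for part in parts if part)

prompt = {"prefix": "This is source prefix", "suffix": "This is source suffix"}
print(apply_prompt("This is the source.", prompt))
# This is source prefix This is the source. This is source suffix
```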
