Compute Word Error Rate for Tibetan language text.

Project description

Tibetan-WER

This module computes the Word Error Rate (WER) and the Syllable Error Rate (SER) for Tibetan language text.
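For context: written Tibetan delimits syllables with the tsek mark (་, U+0F0B), which is what makes a syllable-level error rate natural for the language. The sketch below shows a rough syllable split on that delimiter; it is illustrative only, not the library's internal tokenizer.

```python
TSEK = "\u0F0B"  # ་ , the Tibetan intersyllabic mark (tsek)

def segment_syllables(text: str) -> list[str]:
    # Split on the tsek and drop empty pieces (e.g. from a trailing tsek).
    return [s for s in text.split(TSEK) if s]

print(segment_syllables("ཀ་ཁ་ག"))
```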

Install

Install the library to get started:

pip install --upgrade tibetan_wer

Usage

Basic Usage

The wer function expects a list of predictions and a list of references, and returns a dictionary containing the micro- and macro-averaged WER along with the total numbers of substitutions, insertions, and deletions.

from tibetan_wer.metrics import wer

prediction = ['གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཚལ་ལོ༔']
reference = ['འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཚལ་ལོ༔']

result = wer(prediction, reference)

print(f"Micro-Average WER Score: {result['micro_wer']}")
print(f"Macro-Average WER Score: {result['macro_wer']}")
print(f"Substitutions: {result['substitutions']}")
print(f"Insertions: {result['insertions']}")
print(f"Deletions: {result['deletions']}")
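For intuition on the two averages: by the standard definitions, the micro-average pools edit counts across the whole corpus before dividing by the total reference length, while the macro-average computes a WER per sentence and then takes the mean. The library's exact formulas are not shown here; this is a self-contained sketch under those standard definitions, using a plain Levenshtein distance over token lists.

```python
def edit_distance(pred: list, ref: list) -> int:
    # Classic Levenshtein distance over token lists (subs + ins + dels).
    m, n = len(pred), len(ref)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[m][n]

def micro_macro_wer(preds: list[list], refs: list[list]) -> tuple[float, float]:
    dists = [edit_distance(p, r) for p, r in zip(preds, refs)]
    micro = sum(dists) / sum(len(r) for r in refs)           # pooled errors / pooled words
    macro = sum(d / len(r) for d, r in zip(dists, refs)) / len(refs)  # mean of per-sentence WER
    return micro, macro

micro, macro = micro_macro_wer([["a", "x", "c"], ["a", "b"]],
                               [["a", "b", "c"], ["a", "b"]])
print(micro, macro)  # one substitution out of five words pooled vs. mean of (1/3, 0)
```

Short sentences therefore weigh more heavily in the macro-average than in the micro-average, which is why the library reports both.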

The ser function works the same way, but at the syllable level.

from tibetan_wer.metrics import ser

prediction = ['གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཚལ་ལོ༔']
reference = ['འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཚལ་ལོ༔']

result = ser(prediction, reference)

print(f"Micro-Average SER Score: {result['micro_ser']:.3f}")
print(f"Macro-Average SER Score: {result['macro_ser']:.3f}")
print(f"Substitutions: {result['substitutions']}")
print(f"Insertions: {result['insertions']}")
print(f"Deletions: {result['deletions']}")

Usage for Model Evaluation

The intended use case is assessing a model during training. To use tibetan_wer for this, define custom metrics for a Hugging Face trainer like so:

import evaluate
from tibetan_wer.metrics import wer as tib_wer, ser as tib_ser

cer_metric = evaluate.load("cer")

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    cer = cer_metric.compute(predictions=pred_str, references=label_str)
    tib_wer_res = tib_wer(predictions=pred_str, references=label_str)
    tib_ser_res = tib_ser(predictions=pred_str, references=label_str)

    macro_wer = tib_wer_res['macro_wer']
    micro_wer = tib_wer_res['micro_wer']
    word_subs = tib_wer_res['substitutions']
    word_ins = tib_wer_res['insertions']
    word_dels = tib_wer_res['deletions']

    macro_ser = tib_ser_res['macro_ser']
    micro_ser = tib_ser_res['micro_ser']
    syl_subs = tib_ser_res['substitutions']
    syl_ins = tib_ser_res['insertions']
    syl_dels = tib_ser_res['deletions']

    return {"cer": cer,
            "tib_macro_wer": macro_wer,
            "tib_micro_wer": micro_wer,
            "word_substitutions": word_subs,
            "word_insertions": word_ins,
            "word_deletions": word_dels,
            "tib_macro_ser": macro_ser,
            "tib_micro_ser": micro_ser,
            "syllable_substitutions": syl_subs,
            "syllable_insertions": syl_ins,
            "syllable_deletions": syl_dels,
            }
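The -100 replacement near the top of compute_metrics matters because Hugging Face trainers mark label positions that should be ignored by the loss with -100, and the tokenizer cannot decode that sentinel. Before decoding, those positions must be mapped back to a real token id. A pure-Python equivalent of that one step (the pad id of 0 is hypothetical; the real value comes from your tokenizer):

```python
PAD_TOKEN_ID = 0  # hypothetical; use tokenizer.pad_token_id in practice

def unmask_labels(label_ids: list[list[int]], pad_token_id: int = PAD_TOKEN_ID) -> list[list[int]]:
    # -100 marks positions ignored by the loss; replace it with a decodable token id.
    return [[pad_token_id if t == -100 else t for t in seq] for seq in label_ids]

print(unmask_labels([[5, -100, 7, -100]]))
```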

You can then configure the Transformers Seq2SeqTrainer to use these metrics like so:

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics, # use custom metrics
    tokenizer=processor.feature_extractor,
)

trainer.train()
