Reference-free script fidelity metric for multilingual ASR.

script-fidelity

script-fidelity is a small Python package for Script Fidelity Rate (SFR), a reference-free metric for multilingual ASR. SFR measures the fraction of countable hypothesis characters that belong to the expected Unicode script for a target language.
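To make the definition concrete, here is a minimal standard-library sketch of the idea. It is not the package's implementation. It assumes a character's script can be approximated by the first word of its Unicode name, and that "countable" means letters, combining marks, and decimal digits. Both are assumptions for illustration, and sketch_sfr and its helpers are hypothetical names.

```python
import unicodedata


def approx_script(ch: str) -> str:
    """Approximate a character's script by the first word of its Unicode name."""
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def is_countable(ch: str) -> bool:
    """Assumption: letters (L*), combining marks (M*), and decimal digits (Nd) count."""
    category = unicodedata.category(ch)
    return category[0] in ("L", "M") or category == "Nd"


def sketch_sfr(text: str, expected_script: str) -> float:
    """Fraction of countable characters whose approximate script matches."""
    countable = [ch for ch in text if is_countable(ch)]
    if not countable:
        return 0.0  # assumption: score empty or punctuation-only input as 0
    hits = sum(1 for ch in countable if approx_script(ch) == expected_script)
    return hits / len(countable)


print(sketch_sfr("کابل کې ښه هوا ده", "ARABIC"))         # 1.0
print(sketch_sfr("this is romanized output", "ARABIC"))  # 0.0
```

The name-prefix trick is only a rough stand-in for the Unicode Script property, so treat the sketch as intuition rather than a substitute for compute_sfr.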

Quick signals:

  • Install with uv add script-fidelity
  • Load with HF Evaluate via themechanism/script_fidelity_rate
  • Supports 102 FLEURS language configs (every config except the aggregate "all")

Use SFR alongside WER and CER, not in place of them. SFR checks script validity only; WER and CER measure transcription error against references.

install

For package development in this repo:

uv sync --extra dev

For a downstream project after release:

uv add script-fidelity

python use

from script_fidelity import compute_sfr, compute_sfr_batch

score = compute_sfr("کابل کې ښه هوا ده", language="ps_af")
scores = compute_sfr_batch(
    ["کابل کې ښه هوا ده", "this is romanized output"],
    language="pashto",
)

Digits are counted by default, matching the paper. To treat digits as neutral, pass digit_policy="ignore".

compute_sfr("کابل 2026", language="ps_af", digit_policy="ignore")
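The difference between the two digit policies can be illustrated with a standard-library sketch. It is an approximation under stated assumptions, not the package's code: script is guessed from the Unicode name prefix, counted ASCII digits are assumed to fall outside the target script, and sketch_sfr_digits is a hypothetical helper.

```python
import unicodedata


def approx_script(ch: str) -> str:
    """Approximate a character's script by the first word of its Unicode name."""
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def sketch_sfr_digits(text: str, expected_script: str, digit_policy: str = "count") -> float:
    """Script fraction where decimal digits are either counted or skipped."""
    hits = 0
    total = 0
    for ch in text:
        category = unicodedata.category(ch)
        if category == "Nd":
            if digit_policy == "ignore":
                continue  # neutral: digits affect neither numerator nor denominator
        elif category[0] not in ("L", "M"):
            continue  # whitespace, punctuation, symbols are not counted
        total += 1
        if approx_script(ch) == expected_script:
            hits += 1
    return hits / total if total else 0.0


print(sketch_sfr_digits("کابل 2026", "ARABIC"))                         # 0.5
print(sketch_sfr_digits("کابل 2026", "ARABIC", digit_policy="ignore"))  # 1.0
```

Under these assumptions, four Arabic letters plus four ASCII digits score 0.5 when digits count and 1.0 when they are ignored.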

HF Evaluate use

Local metric:

import evaluate

sfr = evaluate.load("./metrics/script_fidelity_rate", module_type="metric")
sfr.compute(predictions=["کابل کې ښه هوا ده"], language="ps_af")

Hub metric after publishing:

import evaluate

sfr = evaluate.load("themechanism/script_fidelity_rate", module_type="metric")
sfr.compute(predictions=["کابل کې ښه هوا ده"], language="ps_af")

CLI

sfr score --language ps_af --text "کابل کې ښه هوا ده"
sfr audit predictions.jsonl --language ps_af --text-column prediction
sfr audit predictions.csv --language bn_in --text-column transcript --format csv
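The audit commands above read a file that has one text column. As a sketch of what such an input might look like, the snippet below writes a small predictions.jsonl with a prediction field, assuming the conventional JSON Lines layout of one object per line. The id field and the write_predictions_jsonl helper are illustrative; this README does not state that the CLI requires them.

```python
import json
from pathlib import Path


def write_predictions_jsonl(path, texts, text_column="prediction"):
    """Write one JSON object per line, with the hypothesis under text_column."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as handle:
        for index, text in enumerate(texts):
            row = {"id": index, text_column: text}
            # ensure_ascii=False keeps non-Latin scripts readable in the file
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")
    return path


if __name__ == "__main__":
    write_predictions_jsonl(
        "predictions.jsonl",
        ["کابل کې ښه هوا ده", "this is romanized output"],
    )
```

The text_column argument mirrors the --text-column flag, so the same helper could write a transcript column instead.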

ASR batch example

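from script_fidelity import compute_corpus_sfr

# whisper_outputs: your own list of ASR results, each a dict with a "text" key
predictions = [item["text"] for item in whisper_outputs]

# Corpus-level summary for Bengali (FLEURS code bn_in)
summary = compute_corpus_sfr(predictions, language="bn_in")
print(summary["sfr_percent"])
print(summary["dominant_script_counts"])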
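For intuition about what a corpus-level summary can contain, the sketch below pools a list of hypotheses into one score and tallies each hypothesis's dominant script. It assumes character-pooled (micro) averaging and guesses script from the Unicode name prefix. The package's actual aggregation and the exact meaning of its summary keys may differ, and sketch_corpus_summary is a hypothetical name.

```python
import unicodedata
from collections import Counter


def countable_scripts(text: str) -> list:
    """Approximate script of each letter, mark, or decimal digit in text."""
    scripts = []
    for ch in text:
        category = unicodedata.category(ch)
        if category[0] in ("L", "M") or category == "Nd":
            name = unicodedata.name(ch, "")
            scripts.append(name.split(" ")[0] if name else "UNKNOWN")
    return scripts


def sketch_corpus_summary(predictions, expected_script: str) -> dict:
    """Pool characters across hypotheses and count dominant scripts."""
    hits = 0
    total = 0
    dominant = Counter()
    for text in predictions:
        scripts = countable_scripts(text)
        total += len(scripts)
        hits += sum(1 for script in scripts if script == expected_script)
        if scripts:
            # Most frequent approximate script in this hypothesis
            dominant[Counter(scripts).most_common(1)[0][0]] += 1
    sfr = hits / total if total else 0.0
    return {
        "sfr": sfr,
        "sfr_percent": 100 * sfr,
        "dominant_script_counts": dict(dominant),
    }


summary = sketch_corpus_summary(["বাংলা", "abc"], "BENGALI")
print(summary["dominant_script_counts"])
```

A tally like this is handy for spotting which wrong script a model drifts into, for example Latin output where Bengali was expected.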

pandas dataframe example

import pandas as pd
from script_fidelity import compute_sfr

df = pd.read_json("predictions.jsonl", lines=True)
df["sfr"] = df["prediction"].map(lambda text: compute_sfr(text, language="ps_af"))

Transformers compute_metrics example

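import evaluate
import numpy as np

wer = evaluate.load("wer")
sfr = evaluate.load("themechanism/script_fidelity_rate", module_type="metric")

def compute_metrics(eval_pred):
    # processor is your model's processor, created elsewhere in the training script.
    # Assumes predictions are generated token IDs (e.g. predict_with_generate=True).
    predictions, labels = eval_pred
    # Label padding is commonly -100 in Trainer setups; restore the pad token before decoding.
    labels = np.where(labels != -100, labels, processor.tokenizer.pad_token_id)
    pred_text = processor.batch_decode(predictions, skip_special_tokens=True)
    label_text = processor.batch_decode(labels, skip_special_tokens=True)
    return {
        "wer": wer.compute(predictions=pred_text, references=label_text),
        "sfr": sfr.compute(predictions=pred_text, language="ps_af")["sfr"],
    }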

CI gate example

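from script_fidelity import compute_corpus_sfr

# predictions: list of hypothesis strings produced earlier in the pipeline
summary = compute_corpus_sfr(predictions, language="ml_in")
# summary["sfr"] is a fraction, so 0.90 corresponds to 90%
if summary["sfr"] < 0.90:
    raise SystemExit("SFR regression: Malayalam output is below 90% target script")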

shared-script caveats

SFR is a script check, not a language identifier. Pashto, Urdu, Persian, Arabic, Central Kurdish, and Sindhi share Arabic-script Unicode blocks, so output in any of them can score well against another. For Latin-script languages, SFR mostly catches romanization or non-Latin substitution, not language identity. Pair SFR with language ID or lexical checks when shared-script confusions matter.
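The caveat is easy to see with a standard-library approximation: a Pashto sentence and an Urdu sentence both consist entirely of Arabic-script letters, so a script check alone cannot tell them apart. The sketch guesses script from the Unicode name prefix, which is an assumption for illustration and not necessarily how the package classifies characters. The Urdu sentence and the arabic_fraction helper are illustrative additions.

```python
import unicodedata


def arabic_fraction(text: str) -> float:
    """Fraction of letters whose Unicode name starts with ARABIC."""
    letters = [ch for ch in text if unicodedata.category(ch).startswith("L")]
    if not letters:
        return 0.0  # assumption: no letters scores 0
    arabic = sum(1 for ch in letters if unicodedata.name(ch, "").startswith("ARABIC"))
    return arabic / len(letters)


pashto = "کابل کې ښه هوا ده"
urdu = "یہ اردو ہے"  # illustrative Urdu sentence, not from the package docs

# Both sentences are fully Arabic-script, so a script check cannot separate them
print(arabic_fraction(pashto), arabic_fraction(urdu))  # 1.0 1.0
```

A language identifier or a lexical check is what separates these two cases; the script score is identical.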

Use dominant_script() and script_distribution() to inspect failures:

from script_fidelity import dominant_script, script_distribution

dominant_script("this is romanized output")
script_distribution("বাংলা भाषा")
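As a rough picture of what such an inspection reveals, the following standard-library sketch tallies characters by approximate script and picks the most frequent one. sketch_script_distribution and sketch_dominant_script are hypothetical stand-ins; the return shapes of the package's own functions are not specified in this README, and the script guess uses the Unicode name prefix as an assumption.

```python
import unicodedata
from collections import Counter


def sketch_script_distribution(text: str) -> dict:
    """Count letters and combining marks by approximate script."""
    counts = Counter()
    for ch in text:
        if unicodedata.category(ch)[0] in ("L", "M"):
            name = unicodedata.name(ch, "")
            counts[name.split(" ")[0] if name else "UNKNOWN"] += 1
    return dict(counts)


def sketch_dominant_script(text: str):
    """Most frequent approximate script, or None when nothing is countable."""
    counts = sketch_script_distribution(text)
    if not counts:
        return None
    return max(counts, key=counts.get)


print(sketch_dominant_script("this is romanized output"))  # LATIN
print(sketch_script_distribution("বাংলা भाषा"))
```

Mixed-script output like the second string shows up as two keys in the tally, which is usually the first clue when a score drops.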

FLEURS codes

The registry covers the 102 FLEURS language configs, which the sfr languages command lists. The languages studied in the paper also have short aliases:

FLEURS code   Alias             Script
ps_af         pashto            Arabic
ur_pk         urdu              Arabic
ar_eg         arabic            Arabic
fa_ir         persian, farsi    Arabic
hi_in         hindi             Devanagari
bn_in         bengali, bangla   Bengali
ml_in         malayalam         Malayalam
ta_in         tamil             Tamil
so_so         somali            Latin
ka_ge         georgian          Georgian

For the full reviewed registry, see script_fidelity/data/fleurs_registry.json.
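The alias table above can be read as a simple lookup from alias to FLEURS code. The sketch below hard-codes only the ten alias rows listed in this README. resolve_language is a hypothetical helper, and the case-insensitive matching is an assumption rather than documented package behavior.

```python
# Alias rows copied from the table in this README
ALIASES = {
    "ps_af": ["pashto"],
    "ur_pk": ["urdu"],
    "ar_eg": ["arabic"],
    "fa_ir": ["persian", "farsi"],
    "hi_in": ["hindi"],
    "bn_in": ["bengali", "bangla"],
    "ml_in": ["malayalam"],
    "ta_in": ["tamil"],
    "so_so": ["somali"],
    "ka_ge": ["georgian"],
}

# Invert to alias -> code for direct lookup
ALIAS_TO_CODE = {alias: code for code, names in ALIASES.items() for alias in names}


def resolve_language(name: str) -> str:
    """Return the FLEURS code for a code or alias from the table above."""
    key = name.strip().lower()  # assumption: matching ignores case and padding
    if key in ALIASES:
        return key
    if key in ALIAS_TO_CODE:
        return ALIAS_TO_CODE[key]
    raise KeyError(f"unknown language or alias: {name!r}")


print(resolve_language("pashto"))  # ps_af
print(resolve_language("farsi"))   # fa_ir
```

This is why language="ps_af" and language="pashto" can be used interchangeably in the earlier examples.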

Full code table:

Code Language Script
af_za Afrikaans Latin
am_et Amharic Ethiopic
ar_eg Arabic Arabic
as_in Assamese Bengali
ast_es Asturian Latin
az_az Azerbaijani Latin
be_by Belarusian Cyrillic
bg_bg Bulgarian Cyrillic
bn_in Bengali Bengali
bs_ba Bosnian Latin
ca_es Catalan Latin
ceb_ph Cebuano Latin
ckb_iq Central Kurdish Arabic
cmn_hans_cn Mandarin Chinese Han
cs_cz Czech Latin
cy_gb Welsh Latin
da_dk Danish Latin
de_de German Latin
el_gr Greek Greek
en_us English Latin
es_419 Spanish Latin
et_ee Estonian Latin
fa_ir Persian Arabic
ff_sn Fulah Latin
fi_fi Finnish Latin
fil_ph Filipino Latin
fr_fr French Latin
ga_ie Irish Latin
gl_es Galician Latin
gu_in Gujarati Gujarati
ha_ng Hausa Latin
he_il Hebrew Hebrew
hi_in Hindi Devanagari
hr_hr Croatian Latin
hu_hu Hungarian Latin
hy_am Armenian Armenian
id_id Indonesian Latin
ig_ng Igbo Latin
is_is Icelandic Latin
it_it Italian Latin
ja_jp Japanese Han, Hiragana, Katakana
jv_id Javanese Latin
ka_ge Georgian Georgian
kam_ke Kamba Latin
kea_cv Kabuverdianu Latin
kk_kz Kazakh Cyrillic
km_kh Khmer Khmer
kn_in Kannada Kannada
ko_kr Korean Hangul
ky_kg Kyrgyz Cyrillic
lb_lu Luxembourgish Latin
lg_ug Ganda Latin
ln_cd Lingala Latin
lo_la Lao Lao
lt_lt Lithuanian Latin
luo_ke Luo Latin
lv_lv Latvian Latin
mi_nz Maori Latin
mk_mk Macedonian Cyrillic
ml_in Malayalam Malayalam
mn_mn Mongolian Cyrillic
mr_in Marathi Devanagari
ms_my Malay Latin
mt_mt Maltese Latin
my_mm Burmese Myanmar
nb_no Norwegian Bokmal Latin
ne_np Nepali Devanagari
nl_nl Dutch Latin
nso_za Northern Sotho Latin
ny_mw Chichewa Latin
oc_fr Occitan Latin
om_et Oromo Latin
or_in Odia Odia
pa_in Punjabi Gurmukhi
pl_pl Polish Latin
ps_af Pashto Arabic
pt_br Portuguese Latin
ro_ro Romanian Latin
ru_ru Russian Cyrillic
sd_in Sindhi Arabic
sk_sk Slovak Latin
sl_si Slovenian Latin
sn_zw Shona Latin
so_so Somali Latin
sr_rs Serbian Cyrillic
sv_se Swedish Latin
sw_ke Swahili Latin
ta_in Tamil Tamil
te_in Telugu Telugu
tg_tj Tajik Cyrillic
th_th Thai Thai
tr_tr Turkish Latin
uk_ua Ukrainian Cyrillic
umb_ao Umbundu Latin
ur_pk Urdu Arabic
uz_uz Uzbek Latin
vi_vn Vietnamese Latin
wo_sn Wolof Latin
xh_za Xhosa Latin
yo_ng Yoruba Latin
yue_hant_hk Cantonese Han
zu_za Zulu Latin
