
Reference-free script fidelity metric for multilingual ASR.

script-fidelity

script-fidelity is a small Python package for Script Fidelity Rate (SFR), a reference-free metric for multilingual ASR. SFR measures the fraction of countable hypothesis characters that belong to the expected Unicode script for a target language.
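To make the definition concrete, here is a minimal sketch of the ratio. It is not the package's implementation: it assumes that "countable" means letters only, approximates a character's script by the first word of its Unicode name, and returns 0.0 for text with no letters. The names toy_sfr and script_prefix are hypothetical.

```python
import unicodedata


def toy_sfr(text: str, script_prefix: str) -> float:
    """Fraction of letters whose Unicode name starts with script_prefix.

    Illustrative only: the real package may count characters differently.
    """
    # Assumption: only letters are countable (spaces, digits, punctuation skipped).
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        # Assumption: text with nothing to count scores 0.0.
        return 0.0
    in_script = sum(
        1 for ch in letters
        if unicodedata.name(ch, "").startswith(script_prefix)
    )
    return in_script / len(letters)


print(toy_sfr("کابل کې ښه هوا ده", "ARABIC"))
print(toy_sfr("this is romanized output", "ARABIC"))
```

Under these assumptions, the Pashto sentence scores at the top of the range and the romanized sentence at the bottom, which is the contrast SFR is meant to surface.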

Quick signals:

Use SFR with WER and CER. SFR checks script validity; WER and CER measure transcription error against references.

install

For package development in this repo:

uv sync --extra dev

For a downstream project:

uv add script-fidelity

Run the CLI without adding it to a project:

uvx --from script-fidelity sfr score --language ps_af --text "کابل کې ښه هوا ده"

python use

from script_fidelity import compute_sfr, compute_sfr_batch

score = compute_sfr("کابل کې ښه هوا ده", language="ps_af")
scores = compute_sfr_batch(
    ["کابل کې ښه هوا ده", "this is romanized output"],
    language="pashto",
)

Digits count by default, matching the paper. Treat digits as neutral with digit_policy="ignore".

compute_sfr("کابل 2026", language="ps_af", digit_policy="ignore")

HF Evaluate use

Local metric:

import evaluate

sfr = evaluate.load("./metrics/script_fidelity_rate", module_type="metric")
sfr.compute(predictions=["کابل کې ښه هوا ده"], language="ps_af")

Hub metric after publishing:

import evaluate

sfr = evaluate.load("themechanism/script_fidelity_rate", module_type="metric")
sfr.compute(predictions=["کابل کې ښه هوا ده"], language="ps_af")

CLI

sfr score --language ps_af --text "کابل کې ښه هوا ده"
sfr audit predictions.jsonl --language ps_af --text-column prediction
sfr audit predictions.csv --language bn_in --text-column transcript --format csv
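The audit commands above read the hypothesis text from the column named by --text-column. A minimal sketch of producing a matching predictions.jsonl follows, assuming the command accepts standard JSON Lines (one JSON object per line). The helper write_predictions_jsonl is hypothetical and not part of the package.

```python
import json
from pathlib import Path


def write_predictions_jsonl(texts, path, text_column="prediction"):
    """Write one JSON object per line, with each hypothesis under text_column."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as handle:
        for text in texts:
            # ensure_ascii=False keeps non-Latin text readable in the file.
            record = json.dumps({text_column: text}, ensure_ascii=False)
            handle.write(record + "\n")
    return path
```

Calling write_predictions_jsonl(["کابل کې ښه هوا ده"], "predictions.jsonl") would produce a file to pass to sfr audit with --text-column prediction.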

ASR batch example

from script_fidelity import compute_corpus_sfr

# whisper_outputs: ASR results collected elsewhere, one dict per utterance with a "text" key.
predictions = [item["text"] for item in whisper_outputs]

summary = compute_corpus_sfr(predictions, language="bn_in")
print(summary["sfr_percent"])
print(summary["dominant_script_counts"])

pandas dataframe example

import pandas as pd
from script_fidelity import compute_sfr

df = pd.read_json("predictions.jsonl", lines=True)
df["sfr"] = df["prediction"].map(lambda text: compute_sfr(text, language="ps_af"))

Transformers compute_metrics example

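import evaluate
import numpy as np

wer = evaluate.load("wer")
sfr = evaluate.load("themechanism/script_fidelity_rate", module_type="metric")

# processor is the model's processor, defined elsewhere; this assumes it wraps a tokenizer.
def compute_metrics(eval_pred):
    # Assumes predictions are generated token ids (predict_with_generate=True).
    predictions, labels = eval_pred
    # The Trainer pads labels with -100; restore the pad token id before decoding.
    labels = np.where(labels != -100, labels, processor.tokenizer.pad_token_id)
    pred_text = processor.batch_decode(predictions, skip_special_tokens=True)
    label_text = processor.batch_decode(labels, skip_special_tokens=True)
    return {
        "wer": wer.compute(predictions=pred_text, references=label_text),
        "sfr": sfr.compute(predictions=pred_text, language="ps_af")["sfr"],
    }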

CI gate example

from script_fidelity import compute_corpus_sfr

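# predictions: list of hypothesis strings produced by the model under test.
summary = compute_corpus_sfr(predictions, language="ml_in")
if summary["sfr"] < 0.90:
    # SystemExit with a message prints it and exits with a non-zero status, failing the CI job.
    raise SystemExit("SFR regression: Malayalam output is below the 90% target-script threshold")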

shared-script caveats

SFR is a script check, not a language identifier. Pashto, Urdu, Persian, Arabic, Central Kurdish, and Sindhi share Arabic-script Unicode blocks, so SFR cannot tell them apart. What SFR mostly detects is romanization of a non-Latin target or, for Latin-script languages, substitution by a non-Latin script; it does not establish language identity. Pair SFR with language ID or lexical checks when shared-script confusions matter.
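The shared-script caveat can be seen with the standard library alone. The sketch below is an illustration, not the package's logic: it approximates a letter's script by the first word of its Unicode name, and scripts_used is a hypothetical helper. The Urdu sample sentence is supplied here for illustration.

```python
import unicodedata


def scripts_used(text: str) -> set[str]:
    """Return the first word of the Unicode name of every letter in text."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    }


# Pashto sample used elsewhere in this README.
pashto = "کابل کې ښه هوا ده"
# Urdu sample ("this is Urdu"), added for illustration.
urdu = "یہ اردو ہے"

print(scripts_used(pashto))
print(scripts_used(urdu))
```

Both sentences resolve to the same single script label, so a script-level check has no basis for separating them. That is why a language ID or lexical check is needed on top.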

Use dominant_script() and script_distribution() to inspect failures:

from script_fidelity import dominant_script, script_distribution

dominant_script("this is romanized output")
script_distribution("বাংলা भाषा")

FLEURS codes

The registry covers the 102 FLEURS language configs; run the sfr languages command to list them. The languages from the paper also have short aliases:

FLEURS code  Alias            Script
ps_af        pashto           Arabic
ur_pk        urdu             Arabic
ar_eg        arabic           Arabic
fa_ir        persian, farsi   Arabic
hi_in        hindi            Devanagari
bn_in        bengali, bangla  Bengali
ml_in        malayalam        Malayalam
ta_in        tamil            Tamil
so_so        somali           Latin
ka_ge        georgian         Georgian

For the full reviewed registry, see script_fidelity/data/fleurs_registry.json.

Full code table:

Code         Language          Script
af_za        Afrikaans         Latin
am_et        Amharic           Ethiopic
ar_eg        Arabic            Arabic
as_in        Assamese          Bengali
ast_es       Asturian          Latin
az_az        Azerbaijani       Latin
be_by        Belarusian        Cyrillic
bg_bg        Bulgarian         Cyrillic
bn_in        Bengali           Bengali
bs_ba        Bosnian           Latin
ca_es        Catalan           Latin
ceb_ph       Cebuano           Latin
ckb_iq       Central Kurdish   Arabic
cmn_hans_cn  Mandarin Chinese  Han
cs_cz        Czech             Latin
cy_gb        Welsh             Latin
da_dk        Danish            Latin
de_de        German            Latin
el_gr        Greek             Greek
en_us        English           Latin
es_419       Spanish           Latin
et_ee        Estonian          Latin
fa_ir        Persian           Arabic
ff_sn        Fulah             Latin
fi_fi        Finnish           Latin
fil_ph       Filipino          Latin
fr_fr        French            Latin
ga_ie        Irish             Latin
gl_es        Galician          Latin
gu_in        Gujarati          Gujarati
ha_ng        Hausa             Latin
he_il        Hebrew            Hebrew
hi_in        Hindi             Devanagari
hr_hr        Croatian          Latin
hu_hu        Hungarian         Latin
hy_am        Armenian          Armenian
id_id        Indonesian        Latin
ig_ng        Igbo              Latin
is_is        Icelandic         Latin
it_it        Italian           Latin
ja_jp        Japanese          Han, Hiragana, Katakana
jv_id        Javanese          Latin
ka_ge        Georgian          Georgian
kam_ke       Kamba             Latin
kea_cv       Kabuverdianu      Latin
kk_kz        Kazakh            Cyrillic
km_kh        Khmer             Khmer
kn_in        Kannada           Kannada
ko_kr        Korean            Hangul
ky_kg        Kyrgyz            Cyrillic
lb_lu        Luxembourgish     Latin
lg_ug        Ganda             Latin
ln_cd        Lingala           Latin
lo_la        Lao               Lao
lt_lt        Lithuanian        Latin
luo_ke       Luo               Latin
lv_lv        Latvian           Latin
mi_nz        Maori             Latin
mk_mk        Macedonian        Cyrillic
ml_in        Malayalam         Malayalam
mn_mn        Mongolian         Cyrillic
mr_in        Marathi           Devanagari
ms_my        Malay             Latin
mt_mt        Maltese           Latin
my_mm        Burmese           Myanmar
nb_no        Norwegian Bokmal  Latin
ne_np        Nepali            Devanagari
nl_nl        Dutch             Latin
nso_za       Northern Sotho    Latin
ny_mw        Chichewa          Latin
oc_fr        Occitan           Latin
om_et        Oromo             Latin
or_in        Odia              Odia
pa_in        Punjabi           Gurmukhi
pl_pl        Polish            Latin
ps_af        Pashto            Arabic
pt_br        Portuguese        Latin
ro_ro        Romanian          Latin
ru_ru        Russian           Cyrillic
sd_in        Sindhi            Arabic
sk_sk        Slovak            Latin
sl_si        Slovenian         Latin
sn_zw        Shona             Latin
so_so        Somali            Latin
sr_rs        Serbian           Cyrillic
sv_se        Swedish           Latin
sw_ke        Swahili           Latin
ta_in        Tamil             Tamil
te_in        Telugu            Telugu
tg_tj        Tajik             Cyrillic
th_th        Thai              Thai
tr_tr        Turkish           Latin
uk_ua        Ukrainian         Cyrillic
umb_ao       Umbundu           Latin
ur_pk        Urdu              Arabic
uz_uz        Uzbek             Latin
vi_vn        Vietnamese        Latin
wo_sn        Wolof             Latin
xh_za        Xhosa             Latin
yo_ng        Yoruba            Latin
yue_hant_hk  Cantonese         Han
zu_za        Zulu              Latin
