Reference-free script fidelity metric for multilingual ASR.
script-fidelity
script-fidelity is a small Python package for Script Fidelity Rate (SFR), a
reference-free metric for multilingual ASR. SFR measures the fraction of
countable hypothesis characters that belong to the expected Unicode script for a
target language.
Quick signals:
- Install with uv add script-fidelity
- Load with HF Evaluate via themechanism/script_fidelity_rate
- Supports 102 FLEURS language configs, excluding the all config
- PyPI: https://pypi.org/project/script-fidelity/
Use SFR with WER and CER. SFR checks script validity; WER and CER measure transcription error against references.
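The definition above can be sketched with the standard library alone. This is a toy approximation, not the package's implementation: `toy_script` and `toy_sfr` are hypothetical names, the script is guessed from the first word of each character's Unicode name (Python's stdlib has no Script property lookup), and "countable" is assumed to mean letters and combining marks only, so digits are left out entirely.

```python
import unicodedata


def toy_script(ch: str) -> str:
    """Approximate a character's script by the first word of its Unicode name."""
    # Heuristic only: e.g. "ARABIC LETTER KEHEH" -> "ARABIC".
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def toy_sfr(text: str, expected: str) -> float:
    """Fraction of letters and marks whose name prefix equals `expected`."""
    # Assumption: only letters (L*) and combining marks (M*) are countable.
    countable = [ch for ch in text if unicodedata.category(ch)[0] in "LM"]
    if not countable:
        return 0.0  # arbitrary choice for this sketch
    hits = sum(1 for ch in countable if toy_script(ch) == expected)
    return hits / len(countable)


print(toy_sfr("کابل کې ښه هوا ده", "ARABIC"))
print(toy_sfr("this is romanized output", "ARABIC"))
```

The first call scores fully in-script text and the second scores fully romanized text, which is the failure SFR is meant to surface without needing a reference transcript.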
install
For package development in this repo:
uv sync --extra dev
For a downstream project:
uv add script-fidelity
Run the CLI without adding it to a project:
uvx --from script-fidelity sfr score --language ps_af --text "کابل کې ښه هوا ده"
python use
from script_fidelity import compute_sfr, compute_sfr_batch
score = compute_sfr("کابل کې ښه هوا ده", language="ps_af")
scores = compute_sfr_batch(
    ["کابل کې ښه هوا ده", "this is romanized output"],
    language="pashto",
)
Digits count by default, matching the paper. Treat digits as neutral with
digit_policy="ignore".
compute_sfr("کابل 2026", language="ps_af", digit_policy="ignore")
HF Evaluate use
Local metric:
import evaluate
sfr = evaluate.load("./metrics/script_fidelity_rate", module_type="metric")
sfr.compute(predictions=["کابل کې ښه هوا ده"], language="ps_af")
Hub metric after publishing:
import evaluate
sfr = evaluate.load("themechanism/script_fidelity_rate", module_type="metric")
sfr.compute(predictions=["کابل کې ښه هوا ده"], language="ps_af")
CLI
sfr score --language ps_af --text "کابل کې ښه هوا ده"
sfr audit predictions.jsonl --language ps_af --text-column prediction
sfr audit predictions.csv --language bn_in --text-column transcript --format csv
ASR batch example
from script_fidelity import compute_corpus_sfr
predictions = [
    item["text"]
    for item in whisper_outputs
]
summary = compute_corpus_sfr(predictions, language="bn_in")
print(summary["sfr_percent"])
print(summary["dominant_script_counts"])
pandas dataframe example
import pandas as pd
from script_fidelity import compute_sfr
df = pd.read_json("predictions.jsonl", lines=True)
df["sfr"] = df["prediction"].map(lambda text: compute_sfr(text, language="ps_af"))
Transformers compute_metrics example
import evaluate
wer = evaluate.load("wer")
sfr = evaluate.load("themechanism/script_fidelity_rate", module_type="metric")
# `processor` is the model's processor or tokenizer, defined elsewhere.
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    pred_text = processor.batch_decode(predictions, skip_special_tokens=True)
    label_text = processor.batch_decode(labels, skip_special_tokens=True)
    return {
        "wer": wer.compute(predictions=pred_text, references=label_text),
        "sfr": sfr.compute(predictions=pred_text, language="ps_af")["sfr"],
    }
CI gate example
from script_fidelity import compute_corpus_sfr
summary = compute_corpus_sfr(predictions, language="ml_in")
if summary["sfr"] < 0.90:
    raise SystemExit("SFR regression: Malayalam output is below 90% target script")
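When a pipeline evaluates several languages, the same gate can be applied per language. The sketch below is one possible shape, not part of the package: `failed_gates` is a hypothetical helper, and it assumes you have already collected one corpus-level SFR value per language (for example the "sfr" field used above) into a plain dict.

```python
from typing import Dict, List


def failed_gates(
    scores: Dict[str, float],
    thresholds: Dict[str, float],
    default_threshold: float = 0.90,
) -> List[str]:
    """Return one message per language whose SFR is below its threshold."""
    failures = []
    for language, score in sorted(scores.items()):
        # Languages without an explicit threshold fall back to the default.
        threshold = thresholds.get(language, default_threshold)
        if score < threshold:
            failures.append(
                f"{language}: SFR {score:.1%} is below the {threshold:.0%} target"
            )
    return failures
```

In a CI job, call it with the collected scores and raise SystemExit("\n".join(failures)) when the returned list is non-empty, so every regressing language is reported in one run.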
shared-script caveats
SFR is a script check, not a language identifier. Pashto, Urdu, Persian, Arabic, Central Kurdish, and Sindhi share Arabic-script Unicode blocks, so SFR cannot tell them apart. Where Latin script is involved, SFR mostly catches romanization (Latin output for a non-Latin target) or non-Latin substitution (non-Latin output for a Latin target); it does not establish language identity. Pair SFR with language ID or lexical checks when shared-script confusions matter.
Use dominant_script() and script_distribution() to inspect failures:
from script_fidelity import dominant_script, script_distribution
dominant_script("this is romanized output")
script_distribution("বাংলা भाषा")
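For intuition about what such an inspection reports, here is a stdlib-only sketch. It is not the package's implementation, and the real return types of dominant_script() and script_distribution() are not shown in this README: `toy_script_distribution` and `toy_dominant_script` are hypothetical names, scripts are approximated by the first word of each character's Unicode name, and only letters and combining marks are counted.

```python
import unicodedata
from collections import Counter
from typing import Optional


def toy_script_distribution(text: str) -> Counter:
    """Count letters and marks per approximate script (Unicode name prefix)."""
    counts = Counter()
    for ch in text:
        # Skip spaces, digits, and punctuation; keep letters (L*) and marks (M*).
        if unicodedata.category(ch)[0] in "LM":
            counts[unicodedata.name(ch, "UNKNOWN").split()[0]] += 1
    return counts


def toy_dominant_script(text: str) -> Optional[str]:
    """Return the most frequent approximate script, or None for empty input."""
    counts = toy_script_distribution(text)
    return counts.most_common(1)[0][0] if counts else None


print(toy_dominant_script("this is romanized output"))
print(toy_script_distribution("বাংলা भाषा"))
```

A mixed distribution like the second one points to code-switched or partly substituted output, while a single unexpected script points to wholesale romanization or a wrong-language decode.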
FLEURS codes
The registry covers the 102 FLEURS language configs listed by sfr languages.
These paper languages have short aliases:
| FLEURS code | Alias | Script |
|---|---|---|
| ps_af | pashto | Arabic |
| ur_pk | urdu | Arabic |
| ar_eg | arabic | Arabic |
| fa_ir | persian, farsi | Arabic |
| hi_in | hindi | Devanagari |
| bn_in | bengali, bangla | Bengali |
| ml_in | malayalam | Malayalam |
| ta_in | tamil | Tamil |
| so_so | somali | Latin |
| ka_ge | georgian | Georgian |
For the full reviewed registry, see
script_fidelity/data/fleurs_registry.json.
Full code table:
| Code | Language | Script |
|---|---|---|
| af_za | Afrikaans | Latin |
| am_et | Amharic | Ethiopic |
| ar_eg | Arabic | Arabic |
| as_in | Assamese | Bengali |
| ast_es | Asturian | Latin |
| az_az | Azerbaijani | Latin |
| be_by | Belarusian | Cyrillic |
| bg_bg | Bulgarian | Cyrillic |
| bn_in | Bengali | Bengali |
| bs_ba | Bosnian | Latin |
| ca_es | Catalan | Latin |
| ceb_ph | Cebuano | Latin |
| ckb_iq | Central Kurdish | Arabic |
| cmn_hans_cn | Mandarin Chinese | Han |
| cs_cz | Czech | Latin |
| cy_gb | Welsh | Latin |
| da_dk | Danish | Latin |
| de_de | German | Latin |
| el_gr | Greek | Greek |
| en_us | English | Latin |
| es_419 | Spanish | Latin |
| et_ee | Estonian | Latin |
| fa_ir | Persian | Arabic |
| ff_sn | Fulah | Latin |
| fi_fi | Finnish | Latin |
| fil_ph | Filipino | Latin |
| fr_fr | French | Latin |
| ga_ie | Irish | Latin |
| gl_es | Galician | Latin |
| gu_in | Gujarati | Gujarati |
| ha_ng | Hausa | Latin |
| he_il | Hebrew | Hebrew |
| hi_in | Hindi | Devanagari |
| hr_hr | Croatian | Latin |
| hu_hu | Hungarian | Latin |
| hy_am | Armenian | Armenian |
| id_id | Indonesian | Latin |
| ig_ng | Igbo | Latin |
| is_is | Icelandic | Latin |
| it_it | Italian | Latin |
| ja_jp | Japanese | Han, Hiragana, Katakana |
| jv_id | Javanese | Latin |
| ka_ge | Georgian | Georgian |
| kam_ke | Kamba | Latin |
| kea_cv | Kabuverdianu | Latin |
| kk_kz | Kazakh | Cyrillic |
| km_kh | Khmer | Khmer |
| kn_in | Kannada | Kannada |
| ko_kr | Korean | Hangul |
| ky_kg | Kyrgyz | Cyrillic |
| lb_lu | Luxembourgish | Latin |
| lg_ug | Ganda | Latin |
| ln_cd | Lingala | Latin |
| lo_la | Lao | Lao |
| lt_lt | Lithuanian | Latin |
| luo_ke | Luo | Latin |
| lv_lv | Latvian | Latin |
| mi_nz | Maori | Latin |
| mk_mk | Macedonian | Cyrillic |
| ml_in | Malayalam | Malayalam |
| mn_mn | Mongolian | Cyrillic |
| mr_in | Marathi | Devanagari |
| ms_my | Malay | Latin |
| mt_mt | Maltese | Latin |
| my_mm | Burmese | Myanmar |
| nb_no | Norwegian Bokmal | Latin |
| ne_np | Nepali | Devanagari |
| nl_nl | Dutch | Latin |
| nso_za | Northern Sotho | Latin |
| ny_mw | Chichewa | Latin |
| oc_fr | Occitan | Latin |
| om_et | Oromo | Latin |
| or_in | Odia | Odia |
| pa_in | Punjabi | Gurmukhi |
| pl_pl | Polish | Latin |
| ps_af | Pashto | Arabic |
| pt_br | Portuguese | Latin |
| ro_ro | Romanian | Latin |
| ru_ru | Russian | Cyrillic |
| sd_in | Sindhi | Arabic |
| sk_sk | Slovak | Latin |
| sl_si | Slovenian | Latin |
| sn_zw | Shona | Latin |
| so_so | Somali | Latin |
| sr_rs | Serbian | Cyrillic |
| sv_se | Swedish | Latin |
| sw_ke | Swahili | Latin |
| ta_in | Tamil | Tamil |
| te_in | Telugu | Telugu |
| tg_tj | Tajik | Cyrillic |
| th_th | Thai | Thai |
| tr_tr | Turkish | Latin |
| uk_ua | Ukrainian | Cyrillic |
| umb_ao | Umbundu | Latin |
| ur_pk | Urdu | Arabic |
| uz_uz | Uzbek | Latin |
| vi_vn | Vietnamese | Latin |
| wo_sn | Wolof | Latin |
| xh_za | Xhosa | Latin |
| yo_ng | Yoruba | Latin |
| yue_hant_hk | Cantonese | Han |
| zu_za | Zulu | Latin |