Reference-free script fidelity metric for multilingual ASR.

These details have not been verified by PyPI

Project description

script-fidelity

script-fidelity is a small Python package for Script Fidelity Rate (SFR), a reference-free metric for multilingual ASR. SFR measures the fraction of countable hypothesis characters that belong to the expected Unicode script for a target language.
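The definition above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the package's implementation: Python's `unicodedata` does not expose the Unicode Script property, so the hypothetical helper `in_script` approximates it from the character name, and "countable" is assumed here to mean letters, marks, and numbers.

```python
import math
import unicodedata


def in_script(ch: str, script: str) -> bool:
    """Approximate script membership from the Unicode character name.

    Assumption: names such as "ARABIC LETTER ALEF" start with the script name.
    This breaks for Han ("CJK UNIFIED IDEOGRAPH-...") and for combining marks
    shared across scripts, so a real implementation needs proper script data.
    """
    return unicodedata.name(ch, "").startswith(script.upper())


def is_countable(ch: str) -> bool:
    """Assumed rule: letters (L*), marks (M*), and numbers (N*) are countable."""
    return unicodedata.category(ch)[0] in "LMN"


def sketch_sfr(text: str, script: str) -> float:
    """Fraction of countable characters that belong to the expected script."""
    countable = [ch for ch in text if is_countable(ch)]
    if not countable:
        return math.nan  # assumed convention for empty or punctuation-only text
    return sum(in_script(ch, script) for ch in countable) / len(countable)


print(sketch_sfr("کابل کې ښه هوا ده", "Arabic"))         # every letter is Arabic-script
print(sketch_sfr("this is romanized output", "Arabic"))  # no Arabic-script letters
```

Spaces and punctuation fall outside the assumed countable set, so they affect neither the numerator nor the denominator.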

Quick signals:

Install with `uv add script-fidelity`
Load with HF Evaluate via `themechanism/script_fidelity_rate`
Supports 102 FLEURS language configs (the aggregate `all` config is excluded)

Use SFR with WER and CER. SFR checks script validity; WER and CER measure transcription error against references.
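To see why the two kinds of metric complement each other, the sketch below scores two made-up hypotheses against one reference. `cer` is a plain Levenshtein character error rate and `latin_fraction` is a crude stand-in for SFR on a Latin-script target; neither is the package's code nor any library's implementation.

```python
import unicodedata


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance divided by reference length."""
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, h in enumerate(hypothesis, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / len(reference)


def latin_fraction(text: str) -> float:
    """Share of letters whose Unicode name starts with LATIN (rough proxy)."""
    letters = [ch for ch in text if ch.isalpha()]
    hits = sum(unicodedata.name(ch, "").startswith("LATIN") for ch in letters)
    return hits / max(len(letters), 1)


reference = "good weather"
hypotheses = ["good whether", "гуд везер"]  # illustrative outputs, not real data
for hyp in hypotheses:
    # The script check needs no reference; CER does.
    print(hyp, round(cer(reference, hyp), 2), latin_fraction(hyp))
```

Both hypotheses contain errors according to CER, but only the script check separates a misspelling in the right script from output in the wrong script.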

install

For package development in this repo:

uv sync --extra dev

For a downstream project after release:

uv add script-fidelity

python use

from script_fidelity import compute_sfr, compute_sfr_batch

score = compute_sfr("کابل کې ښه هوا ده", language="ps_af")
scores = compute_sfr_batch(
    ["کابل کې ښه هوا ده", "this is romanized output"],
    language="pashto",
)

Digits are counted by default, matching the paper. To treat digits as neutral, pass digit_policy="ignore":

compute_sfr("کابل 2026", language="ps_af", digit_policy="ignore")
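How the two policies could differ is easiest to see on a tiny example. The sketch below is an assumption about the semantics, not the package's code: with `"count"`, digits stay in the denominator and are judged like any other character; with `"ignore"`, they are dropped before scoring. Script membership is approximated from Unicode character names, and `sketch_sfr` is a hypothetical name.

```python
import unicodedata


def sketch_sfr(text: str, script: str, digit_policy: str = "count") -> float:
    """Hypothetical SFR with a digit policy; not the package's implementation."""
    if digit_policy not in {"count", "ignore"}:
        raise ValueError(f"unknown digit_policy: {digit_policy!r}")
    hits = total = 0
    for ch in text:
        category = unicodedata.category(ch)
        if category == "Nd" and digit_policy == "ignore":
            continue  # digits are neutral: neither numerator nor denominator
        if category[0] not in "LMN":
            continue  # assumed: spaces and punctuation are never countable
        total += 1
        # Assumption: a character is in-script if its Unicode name starts with
        # the script name, e.g. "ARABIC LETTER ALEF" or "ARABIC-INDIC DIGIT TWO".
        hits += unicodedata.name(ch, "").startswith(script.upper())
    return hits / total if total else float("nan")


text = "کابل 2026"
print(sketch_sfr(text, "Arabic", digit_policy="count"))   # ASCII digits lower the score
print(sketch_sfr(text, "Arabic", digit_policy="ignore"))  # only the letters are scored
```

Whether the package scores ASCII digits as off-script under the default policy is not stated above, so treat the first result as a property of this sketch only.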

HF Evaluate use

Local metric:

import evaluate

sfr = evaluate.load("./metrics/script_fidelity_rate", module_type="metric")
sfr.compute(predictions=["کابل کې ښه هوا ده"], language="ps_af")

Hub metric after publishing:

import evaluate

sfr = evaluate.load("themechanism/script_fidelity_rate", module_type="metric")
sfr.compute(predictions=["کابل کې ښه هوا ده"], language="ps_af")

CLI

sfr score --language ps_af --text "کابل کې ښه هوا ده"
sfr audit predictions.jsonl --language ps_af --text-column prediction
sfr audit predictions.csv --language bn_in --text-column transcript --format csv
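The audit commands take a file of predictions. As a sketch of the input shape they appear to expect (an assumption inferred from the `--text-column` flag, not from documented behavior), the following writes a small `predictions.jsonl` with one JSON object per line and the hypothesis in a `prediction` field. The `id` field and the records themselves are made up for illustration.

```python
import json
import tempfile
from pathlib import Path

records = [
    {"id": "utt-0001", "prediction": "کابل کې ښه هوا ده"},
    {"id": "utt-0002", "prediction": "this is romanized output"},
]

# Written to the system temp directory here; use any path you like.
path = Path(tempfile.gettempdir()) / "predictions.jsonl"
with path.open("w", encoding="utf-8") as handle:
    for record in records:
        # ensure_ascii=False keeps non-Latin text readable in the file.
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read the file back to confirm every line is a standalone JSON object.
loaded = [json.loads(line) for line in path.read_text(encoding="utf-8").splitlines()]
print(len(loaded), "records written to", path)
```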

ASR batch example

from script_fidelity import compute_corpus_sfr

# whisper_outputs: assumed to be a list of dicts with a "text" field, one per utterance
predictions = [item["text"] for item in whisper_outputs]

summary = compute_corpus_sfr(predictions, language="bn_in")
print(summary["sfr_percent"])
print(summary["dominant_script_counts"])

pandas dataframe example

import pandas as pd
from script_fidelity import compute_sfr

df = pd.read_json("predictions.jsonl", lines=True)
df["sfr"] = df["prediction"].map(lambda text: compute_sfr(text, language="ps_af"))

Transformers compute_metrics example

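import evaluate
import numpy as np

wer = evaluate.load("wer")
sfr = evaluate.load("themechanism/script_fidelity_rate", module_type="metric")

# `processor` is assumed to be the ASR processor already loaded for the model,
# and predictions are assumed to be generated token ids, not logits.
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # The Trainer pads labels with -100; restore the pad token before decoding.
    labels = np.where(labels != -100, labels, processor.tokenizer.pad_token_id)
    pred_text = processor.batch_decode(predictions, skip_special_tokens=True)
    label_text = processor.batch_decode(labels, skip_special_tokens=True)
    return {
        "wer": wer.compute(predictions=pred_text, references=label_text),
        "sfr": sfr.compute(predictions=pred_text, language="ps_af")["sfr"],
    }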

CI gate example

from script_fidelity import compute_corpus_sfr

# predictions: list of hypothesis strings from the evaluation run
summary = compute_corpus_sfr(predictions, language="ml_in")
if summary["sfr"] < 0.90:
    raise SystemExit("SFR regression: Malayalam output is below the 90% target-script threshold")

shared-script caveats

SFR is a script check, not a language identifier. Pashto, Urdu, Persian, Arabic, Central Kurdish, and Sindhi share Arabic-script Unicode blocks, so SFR cannot tell them apart. For Latin-script languages, the score mostly reflects romanization or non-Latin substitution, not language identity. Pair SFR with language ID or lexical checks when shared-script confusions matter.
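The caveat is easy to reproduce: a Pashto sentence and a Persian sentence both consist of Arabic-script code points, so a script check sees no difference between them. The sketch below uses the first word of each Unicode character name as a rough proxy for script; the helper `scripts_used` and the Persian sentence are illustrative and not part of the package.

```python
import unicodedata


def scripts_used(text: str) -> set[str]:
    """Return the leading word of the Unicode name of every letter in the text."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    }


pashto = "کابل کې ښه هوا ده"
persian = "هوا خوب است"  # illustrative Persian sentence, not from the package

print(scripts_used(pashto))
print(scripts_used(persian))
# Identical script sets: a script-level metric cannot separate the two languages.
print(scripts_used(pashto) == scripts_used(persian))
```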

Use dominant_script() and script_distribution() to inspect failures:

from script_fidelity import dominant_script, script_distribution

dominant_script("this is romanized output")
script_distribution("বাংলা भाषा")
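For intuition about what such an inspection reveals, here is a standard-library approximation that tallies letters and marks by the first word of their Unicode name. The function name `sketch_distribution` and its return format are hypothetical; the package's `script_distribution()` may label scripts and report counts differently.

```python
import unicodedata
from collections import Counter


def sketch_distribution(text: str) -> Counter:
    """Count letters and marks per approximate script (Unicode name prefix)."""
    counts = Counter()
    for ch in text:
        if unicodedata.category(ch)[0] in "LM":
            counts[unicodedata.name(ch, "UNKNOWN").split()[0]] += 1
    return counts


mixed = sketch_distribution("বাংলা भाषा")
print(mixed)                    # Bengali and Devanagari characters, tallied separately
print(mixed.most_common(1)[0])  # the dominant script and its count
```

A hypothesis that mixes scripts shows up as more than one key, which is usually the first thing to look for when a score drops.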

FLEURS codes

The registry covers the 102 FLEURS language configs listed by `sfr languages`. The languages studied in the paper also have short aliases:

FLEURS code	Alias	Script
`ps_af`	`pashto`	Arabic
`ur_pk`	`urdu`	Arabic
`ar_eg`	`arabic`	Arabic
`fa_ir`	`persian`, `farsi`	Arabic
`hi_in`	`hindi`	Devanagari
`bn_in`	`bengali`, `bangla`	Bengali
`ml_in`	`malayalam`	Malayalam
`ta_in`	`tamil`	Tamil
`so_so`	`somali`	Latin
`ka_ge`	`georgian`	Georgian
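Alias handling of this kind amounts to a normalizing lookup. The sketch below hard-codes the ten rows of the table above; the `resolve_language` helper and its case-insensitive matching are assumptions for illustration, and the package's own resolution logic may differ.

```python
ALIASES = {
    "pashto": "ps_af",
    "urdu": "ur_pk",
    "arabic": "ar_eg",
    "persian": "fa_ir",
    "farsi": "fa_ir",
    "hindi": "hi_in",
    "bengali": "bn_in",
    "bangla": "bn_in",
    "malayalam": "ml_in",
    "tamil": "ta_in",
    "somali": "so_so",
    "georgian": "ka_ge",
}
SCRIPTS = {
    "ps_af": "Arabic", "ur_pk": "Arabic", "ar_eg": "Arabic", "fa_ir": "Arabic",
    "hi_in": "Devanagari", "bn_in": "Bengali", "ml_in": "Malayalam",
    "ta_in": "Tamil", "so_so": "Latin", "ka_ge": "Georgian",
}


def resolve_language(language: str) -> tuple[str, str]:
    """Map an alias or FLEURS code to (code, script); hypothetical helper."""
    key = language.strip().lower()
    code = ALIASES.get(key, key)  # fall through when a code is passed directly
    if code not in SCRIPTS:
        raise KeyError(f"unknown language: {language!r}")
    return code, SCRIPTS[code]


print(resolve_language("Farsi"))  # ('fa_ir', 'Arabic')
print(resolve_language("bn_in"))  # ('bn_in', 'Bengali')
```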

For the full reviewed registry, see script_fidelity/data/fleurs_registry.json.

Full code table:

Code	Language	Script
`af_za`	Afrikaans	Latin
`am_et`	Amharic	Ethiopic
`ar_eg`	Arabic	Arabic
`as_in`	Assamese	Bengali
`ast_es`	Asturian	Latin
`az_az`	Azerbaijani	Latin
`be_by`	Belarusian	Cyrillic
`bg_bg`	Bulgarian	Cyrillic
`bn_in`	Bengali	Bengali
`bs_ba`	Bosnian	Latin
`ca_es`	Catalan	Latin
`ceb_ph`	Cebuano	Latin
`ckb_iq`	Central Kurdish	Arabic
`cmn_hans_cn`	Mandarin Chinese	Han
`cs_cz`	Czech	Latin
`cy_gb`	Welsh	Latin
`da_dk`	Danish	Latin
`de_de`	German	Latin
`el_gr`	Greek	Greek
`en_us`	English	Latin
`es_419`	Spanish	Latin
`et_ee`	Estonian	Latin
`fa_ir`	Persian	Arabic
`ff_sn`	Fulah	Latin
`fi_fi`	Finnish	Latin
`fil_ph`	Filipino	Latin
`fr_fr`	French	Latin
`ga_ie`	Irish	Latin
`gl_es`	Galician	Latin
`gu_in`	Gujarati	Gujarati
`ha_ng`	Hausa	Latin
`he_il`	Hebrew	Hebrew
`hi_in`	Hindi	Devanagari
`hr_hr`	Croatian	Latin
`hu_hu`	Hungarian	Latin
`hy_am`	Armenian	Armenian
`id_id`	Indonesian	Latin
`ig_ng`	Igbo	Latin
`is_is`	Icelandic	Latin
`it_it`	Italian	Latin
`ja_jp`	Japanese	Han, Hiragana, Katakana
`jv_id`	Javanese	Latin
`ka_ge`	Georgian	Georgian
`kam_ke`	Kamba	Latin
`kea_cv`	Kabuverdianu	Latin
`kk_kz`	Kazakh	Cyrillic
`km_kh`	Khmer	Khmer
`kn_in`	Kannada	Kannada
`ko_kr`	Korean	Hangul
`ky_kg`	Kyrgyz	Cyrillic
`lb_lu`	Luxembourgish	Latin
`lg_ug`	Ganda	Latin
`ln_cd`	Lingala	Latin
`lo_la`	Lao	Lao
`lt_lt`	Lithuanian	Latin
`luo_ke`	Luo	Latin
`lv_lv`	Latvian	Latin
`mi_nz`	Maori	Latin
`mk_mk`	Macedonian	Cyrillic
`ml_in`	Malayalam	Malayalam
`mn_mn`	Mongolian	Cyrillic
`mr_in`	Marathi	Devanagari
`ms_my`	Malay	Latin
`mt_mt`	Maltese	Latin
`my_mm`	Burmese	Myanmar
`nb_no`	Norwegian Bokmal	Latin
`ne_np`	Nepali	Devanagari
`nl_nl`	Dutch	Latin
`nso_za`	Northern Sotho	Latin
`ny_mw`	Chichewa	Latin
`oc_fr`	Occitan	Latin
`om_et`	Oromo	Latin
`or_in`	Odia	Odia
`pa_in`	Punjabi	Gurmukhi
`pl_pl`	Polish	Latin
`ps_af`	Pashto	Arabic
`pt_br`	Portuguese	Latin
`ro_ro`	Romanian	Latin
`ru_ru`	Russian	Cyrillic
`sd_in`	Sindhi	Arabic
`sk_sk`	Slovak	Latin
`sl_si`	Slovenian	Latin
`sn_zw`	Shona	Latin
`so_so`	Somali	Latin
`sr_rs`	Serbian	Cyrillic
`sv_se`	Swedish	Latin
`sw_ke`	Swahili	Latin
`ta_in`	Tamil	Tamil
`te_in`	Telugu	Telugu
`tg_tj`	Tajik	Cyrillic
`th_th`	Thai	Thai
`tr_tr`	Turkish	Latin
`uk_ua`	Ukrainian	Cyrillic
`umb_ao`	Umbundu	Latin
`ur_pk`	Urdu	Arabic
`uz_uz`	Uzbek	Latin
`vi_vn`	Vietnamese	Latin
`wo_sn`	Wolof	Latin
`xh_za`	Xhosa	Latin
`yo_ng`	Yoruba	Latin
`yue_hant_hk`	Cantonese	Han
`zu_za`	Zulu	Latin

Project details

Release history

0.1.1: May 8, 2026
0.1.0: May 8, 2026 (this version)

Download files

Source Distribution

script_fidelity-0.1.0.tar.gz (18.8 kB view details)

Uploaded May 8, 2026 Source

Built Distribution

script_fidelity-0.1.0-py3-none-any.whl (14.2 kB view details)

Uploaded May 8, 2026 Python 3

File details

Details for the file script_fidelity-0.1.0.tar.gz.

File metadata

Download URL: script_fidelity-0.1.0.tar.gz
Upload date: May 8, 2026
Size: 18.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.6.1

File hashes

Hashes for script_fidelity-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`94066f1d808d67e175a10f355bbf8cb18baa1c2866bfb416d7e10631a4b1f495`
MD5	`2afd20021c5aed8ca67d5fdc732119b5`
BLAKE2b-256	`1f5e296aa91436c6f55697c9d59cb545f6d43686b845511e2e3b383be6e0a427`

File details

Details for the file script_fidelity-0.1.0-py3-none-any.whl.

File metadata

Download URL: script_fidelity-0.1.0-py3-none-any.whl
Upload date: May 8, 2026
Size: 14.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.6.1

File hashes

Hashes for script_fidelity-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`49c126423bc5e942d27538a7a767ca94f9a4d782568a2c08ddf2f595f29ca4fd`
MD5	`37da3671b55e91818b7c449c4e8a30d7`
BLAKE2b-256	`62b5313d8b5b75b3730b1e75bacfe6279b2d62fdc49a518bd96ff561a4acbe4e`
