jiwer-compatible WER normalizer with number, email, URL, filler, and symbol normalization for voice AI evaluation in English, German, and French
extended-wer-normalizer
jiwer-compatible text normalizer for Word Error Rate (WER) evaluation in voice AI.
Extends jiwer's built-in transforms with normalizations that matter for real-world ASR evaluation: phone numbers, emails, URLs, currency, percentages, ordinals, filler words, and stuttering.
Installation
pip install extended-wer-normalizer
Quick start
from extended_wer_normalizer import normalize_for_wer
normalize_for_wer("Call 0176 or email info@example.com, it costs $5.99")
# → "call 0 1 7 6 or email info at example dot com it costs five dollars ninety nine cents"
normalize_for_wer("Um, 1st place goes to Dr. Smith with 50% accuracy")
# → "first place goes to doctor smith with fifty percent accuracy"
jiwer integration
Every normalization is a jiwer.AbstractTransform subclass, so it composes freely with jiwer's built-in transforms:
import jiwer
from extended_wer_normalizer.transforms import NormalizeEmails, ExpandDigitRuns
pipeline = jiwer.Compose([
    NormalizeEmails(),
    ExpandDigitRuns(),
    jiwer.ToLowerCase(),
    jiwer.RemovePunctuation(),
    jiwer.ReduceToListOfListOfWords(),
])
wer = jiwer.wer("info at example dot com", "info@example.com", hypothesis_transform=pipeline)
Use the pre-built pipeline directly with jiwer.wer:
import jiwer
from extended_wer_normalizer import english_wer_pipeline
wer = jiwer.wer(
    reference,
    hypothesis,
    reference_transform=english_wer_pipeline,
    hypothesis_transform=english_wer_pipeline,
)
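To see what normalization does to the score, it can help to compute WER by hand for the email example in this section. The sketch below is a plain-Python word-level edit distance, not jiwer's implementation; word_error_rate is a hypothetical helper, and the normalized string is copied from the NormalizeEmails example rather than produced by the library.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


spoken = "info at example dot com"

# Without normalization the written email counts as one wrong word
# plus four missing ones.
print(word_error_rate(spoken, "info@example.com"))         # 1.0

# After the email has been spelled out, the two sides agree.
print(word_error_rate(spoken, "info at example dot com"))  # 0.0
```

The same reasoning applies to phone numbers, currency, and ordinals: a transcript that is correct when read aloud should not be penalized for its written form.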
Available transforms
| Transform | Example |
|---|---|
| ExpandDigitRuns | "0176" → "0 1 7 6" |
| DigitWordsToChars | "zero one seven" → "0 1 7" |
| NormalizeEmails | "user@example.com" → "user at example dot com" |
| NormalizeURLs | "https://example.com/path" → "example dot com" |
| NormalizeCurrency | "$5.99" → "five dollars ninety nine cents" |
| NormalizePercentages | "50%" → "fifty percent" |
| NormalizeOrdinals | "1st" → "first", "15th" → "fifteenth" |
| ExpandAbbreviations | "Dr." → "doctor", "vs." → "versus" |
| NormalizeSymbols | "cats & dogs" → "cats and dogs" |
| RemoveFillerWords | removes um, uh, hmm, er, ah, … |
| CollapseRepetitions | "I I I think" → "I think" |
| ExpandFrenchElisions | "j'aime" → "j aime", "qu'il" → "qu il" (French only) |
Every transform that consumes language-specific data accepts a language="en" keyword (default English): NormalizeEmails(language="fr"), ExpandAbbreviations(language="de"), etc.
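For intuition, three of the simpler transforms in the table can be approximated with the standard library. This is a rough sketch, not the package's code: the function names and the FILLERS set are hypothetical, the filler list only covers the words named in the table, and the filler step assumes punctuation has already been removed.

```python
import re

# Partial filler list taken from the table above; the real list is longer.
FILLERS = {"um", "uh", "hmm", "er", "ah"}


def expand_digit_runs(text: str) -> str:
    """Put a space between the digits of every digit run."""
    return re.sub(r"\d+", lambda m: " ".join(m.group(0)), text)


def collapse_repetitions(text: str) -> str:
    """Collapse a word that is immediately repeated into one occurrence."""
    return re.sub(r"\b(\w+)(?:\s+\1\b)+", r"\1", text, flags=re.IGNORECASE)


def remove_filler_words(text: str) -> str:
    """Drop whitespace-separated tokens that appear in FILLERS."""
    return " ".join(w for w in text.split() if w.lower() not in FILLERS)


print(expand_digit_runs("call 0176"))                # call 0 1 7 6
print(collapse_repetitions("I I I think"))           # I think
print(remove_filler_words("um first place uh goes"))  # first place goes
```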
Pipeline design
The English pipeline applies transforms left-to-right in a single pass:
- Pattern-specific (before punctuation is stripped): email, URL, symbol, abbreviation, currency, percentage, ordinal
- Core: contractions (I'm → i am), lowercase, punctuation removal
- Digit normalization: expand digit runs (0176 → 0 1 7 6), convert digit words (zero → 0)
- Cleanup: filler words, repetition collapse
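The ordering matters because the pattern-specific transforms rely on characters such as @ and . that punctuation removal deletes. The sketch below illustrates this with standard-library stand-ins; normalize_emails, remove_punctuation, and apply_steps are hypothetical helpers with a deliberately simple email regex, not the package's transforms.

```python
import re
import string


def normalize_emails(text: str) -> str:
    """Spell out simple email addresses (a simplified stand-in)."""
    def spell(match: re.Match) -> str:
        return match.group(0).replace("@", " at ").replace(".", " dot ")

    return re.sub(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", spell, text)


def remove_punctuation(text: str) -> str:
    """Delete ASCII punctuation characters."""
    return text.translate(str.maketrans("", "", string.punctuation))


def apply_steps(text: str, steps) -> str:
    """Apply each step in order, feeding the output of one into the next."""
    for step in steps:
        text = step(text)
    return text


text = "Email info@example.com, please"

# Pattern-specific step first: the address is still intact when it runs.
good = apply_steps(text, [normalize_emails, str.lower, remove_punctuation])
print(good)  # email info at example dot com please

# Punctuation removed first: "@" and "." are gone, so nothing matches.
bad = apply_steps(text, [remove_punctuation, str.lower, normalize_emails])
print(bad)   # email infoexamplecom please
```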
Supported languages
Full pipelines (with language-specific abbreviations, fillers, lexicons, and number/ordinal/percentage word forms via num2words) ship for English, German, and French. Any other language value falls back to a minimal pipeline (lowercasing, punctuation removal, whitespace normalization).
from extended_wer_normalizer import normalize_for_wer
# German: titles, fillers, ordinals, currency
normalize_for_wer("Hr. Müller, am 1. Januar, ähm, ungefähr 50% Rabatt", language="de")
# → "herr müller am erste januar ungefähr fünfzig prozent rabatt"
# French: elision contractions, ordinals, comma-decimal currency
normalize_for_wer("M. Dupont, le 1er janvier, c'est €5,99", language="fr")
# → "monsieur dupont le premier janvier c est cinq euros quatre vingt dix neuf centimes"
# Spanish, Italian, … fall through to the minimal pipeline
normalize_for_wer("¡Hola, mundo!", language="es")
# → "hola mundo"
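The minimal fallback can be pictured as three steps: lowercase, drop punctuation, collapse whitespace. The following is a standard-library approximation under that assumption; minimal_normalize is a hypothetical name, and the package's actual fallback may treat edge cases differently.

```python
import unicodedata


def minimal_normalize(text: str) -> str:
    """Lowercase, delete Unicode punctuation, and collapse whitespace."""
    lowered = text.lower()
    # Unicode categories starting with "P" cover ASCII punctuation as well
    # as characters such as the inverted exclamation mark.
    without_punct = "".join(
        ch for ch in lowered if not unicodedata.category(ch).startswith("P")
    )
    return " ".join(without_punct.split())


print(minimal_normalize("¡Hola, mundo!"))  # hola mundo
```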
Per-language pipelines are also exposed for direct use with jiwer.wer:
from extended_wer_normalizer import (
english_wer_pipeline,
german_wer_pipeline,
french_wer_pipeline,
)
To inspect or extend the language data:
from extended_wer_normalizer.languages import get_language_data, supported_languages
supported_languages() # ["de", "en", "fr"]
get_language_data("de").abbreviations["hr."] # "herr"
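As a mental model of how per-language data can drive a transform, the sketch below keys a small abbreviation table by language code and falls back to "no change" for unknown languages. The ABBREVIATIONS dictionary and expand_abbreviations function are hypothetical; the entries are the ones shown in this README, and the package's real data structures may look different.

```python
# Hypothetical per-language tables; keys are lowercase tokens.
ABBREVIATIONS = {
    "en": {"dr.": "doctor", "vs.": "versus"},
    "de": {"hr.": "herr"},
    "fr": {"m.": "monsieur"},
}


def expand_abbreviations(text: str, language: str = "en") -> str:
    """Replace known abbreviations token by token for the given language."""
    table = ABBREVIATIONS.get(language, {})  # unknown language: empty table
    return " ".join(table.get(token.lower(), token) for token in text.split())


print(expand_abbreviations("Dr. Smith vs. Jones"))        # doctor Smith versus Jones
print(expand_abbreviations("Hr. Müller", language="de"))  # herr Müller
print(expand_abbreviations("Dr. Smith", language="es"))   # Dr. Smith
```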
Quirks worth knowing
- Comma vs. period decimals: French uses "," (€5,99, 3,5%); the currency and percentage transforms accept either separator regardless of language.
- German ordinals: matched as 1- to 3-digit numbers followed by "." and a word (e.g. "1. Januar" but not "Es war 1990." or "1.5 Liter"). 4+ digits and decimals are skipped to avoid false positives on years.
- French ordinals: matched as 1er, 1ère, 2e, 2es, 2ème, 2èmes, 2nde, 2nds, 2nd. num2words returns masculine forms (premier, deuxième); feminine variants like première or seconde are not produced.
- Contractions: jiwer.ExpandCommonEnglishContractions runs only for English. French has a custom ExpandFrenchElisions that splits j', l', d', n', s', m', t', c', qu', jusqu', lorsqu', puisqu', quoiqu' from the following word. German has no contraction step.
- German pluralization: most currency units stay invariant (fünf Euro, not fünf Euros).
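Two of the rules above translate naturally into regular expressions. The patterns below are one possible reading of the German ordinal and French elision descriptions, written for illustration only; GERMAN_ORDINAL, FRENCH_ELISION, and both functions are hypothetical and are not taken from the package.

```python
import re

# 1 to 3 digits, not preceded by a digit or decimal separator, followed by
# a period, whitespace, and then a letter (the start of a word).
GERMAN_ORDINAL = re.compile(r"(?<![\d.,])(\d{1,3})\.\s+(?=[^\W\d_])")

# Elided prefixes listed in the README, followed by an apostrophe and a
# word character.
FRENCH_ELISION = re.compile(
    r"\b(jusqu|lorsqu|puisqu|quoiqu|qu|[jldnsmtc])['’](?=\w)",
    re.IGNORECASE,
)


def find_german_ordinals(text: str) -> list[str]:
    """Return the digit strings that look like German ordinals."""
    return [m.group(1) for m in GERMAN_ORDINAL.finditer(text)]


def expand_french_elisions(text: str) -> str:
    """Replace the apostrophe after an elided prefix with a space."""
    return FRENCH_ELISION.sub(r"\1 ", text)


print(find_german_ordinals("am 1. Januar"))   # ['1']
print(find_german_ordinals("Es war 1990."))   # []
print(find_german_ordinals("1.5 Liter"))      # []
print(expand_french_elisions("qu'il"))        # qu il
print(expand_french_elisions("aujourd'hui"))  # aujourd'hui
```

The word boundary before the prefix is what keeps a word such as aujourd'hui intact, since its d is not at the start of a word.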