jiwer-compatible WER normalizer with number, email, URL, filler, and symbol normalization for voice AI evaluation
extended-wer-normalizer
jiwer-compatible text normalizer for Word Error Rate (WER) evaluation in voice AI.
Extends jiwer's built-in transforms with normalizations that matter for real-world ASR evaluation: phone numbers, emails, URLs, currency, percentages, ordinals, filler words, and stuttering.
Installation
pip install extended-wer-normalizer
Quick start
from extended_wer_normalizer import normalize_for_wer
normalize_for_wer("Call 0176 or email info@example.com, it costs $5.99")
# → "call 0 1 7 6 or email info at example dot com it costs five dollars ninety nine cents"
normalize_for_wer("Um, 1st place goes to Dr. Smith with 50% accuracy")
# → "first place goes to doctor smith with fifty percent accuracy"
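To see why this kind of normalization matters for scoring, the sketch below computes WER as word-level edit distance divided by the number of reference words. It uses only the standard library and is not how jiwer or this package computes the metric; `word_error_rate` is a hypothetical helper introduced for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Same content, different surface form: every reference word counts as an error.
unnormalized = word_error_rate("info at example dot com", "info@example.com")

# Once both sides share the spoken form, the score reflects recognition only.
normalized = word_error_rate("info at example dot com", "info at example dot com")
```

Without normalization the pair scores 1.0 even though the recognizer made no mistake; after both sides are brought to the same spoken form the score is 0.0.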
jiwer integration
Every normalization is a jiwer.AbstractTransform subclass — compose them freely:
import jiwer
from extended_wer_normalizer.transforms import NormalizeEmails, ExpandDigitRuns
pipeline = jiwer.Compose([
    NormalizeEmails(),
    ExpandDigitRuns(),
    jiwer.ToLowerCase(),
    jiwer.RemovePunctuation(),
    jiwer.ReduceToListOfListOfWords(),
])
wer = jiwer.wer("info at example dot com", "info@example.com", hypothesis_transform=pipeline)
Use the pre-built pipeline directly with jiwer.wer:
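import jiwer
from extended_wer_normalizer import english_wer_pipeline
# Illustrative strings; substitute your own reference transcript and ASR output.
reference = "First place goes to Dr. Smith."
hypothesis = "first place goes to doctor smith"
wer = jiwer.wer(reference, hypothesis, reference_transform=english_wer_pipeline, hypothesis_transform=english_wer_pipeline)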
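For intuition about what individual transforms do to a string, here is a rough standard-library sketch of digit-run expansion, filler removal, and repetition collapse. It uses plain functions rather than jiwer.AbstractTransform subclasses, and it is not the package's implementation: the function names, the regular expression, and the `FILLERS` set (limited to the fillers this README names) are all assumptions made for illustration.

```python
import re
import string

# Assumed filler list; the package's real list may be longer.
FILLERS = {"um", "uh", "hmm", "er", "ah"}


def expand_digit_runs(text: str) -> str:
    """Put a space between the digits of every run: '0176' -> '0 1 7 6'."""
    return re.sub(r"\d+", lambda m: " ".join(m.group()), text)


def remove_filler_words(text: str) -> str:
    """Drop whitespace-separated tokens that are fillers, ignoring case and edge punctuation."""
    kept = [
        word for word in text.split()
        if word.strip(string.punctuation).lower() not in FILLERS
    ]
    return " ".join(kept)


def collapse_repetitions(text: str) -> str:
    """Keep only the first of consecutive identical words (case-insensitive)."""
    out = []
    for word in text.split():
        if not out or out[-1].lower() != word.lower():
            out.append(word)
    return " ".join(out)


digits = expand_digit_runs("Call 0176")
cleaned = collapse_repetitions(remove_filler_words("Um, I I I think"))
```

Here `digits` is "Call 0 1 7 6" and `cleaned` is "I think". Each step is a pure string-to-string function, which is what makes this style of transform easy to chain.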
Available transforms
| Transform | Example |
|---|---|
| ExpandDigitRuns | "0176" → "0 1 7 6" |
| DigitWordsToChars | "zero one seven" → "0 1 7" |
| WhisperEnglishNormalize | lowercase, punctuation, contractions, compound numbers |
| WhisperBasicNormalize | language-agnostic basic normalization |
| FinalDigitWordCleanup | residual digit-word sweep after compound resolution |
| NormalizeEmails | "user@example.com" → "user at example dot com" |
| NormalizeURLs | "https://example.com/path" → "example dot com" |
| NormalizeCurrency | "$5.99" → "five dollars ninety nine cents" |
| NormalizePercentages | "50%" → "fifty percent" |
| NormalizeOrdinals | "1st" → "first", "15th" → "fifteenth" |
| ExpandAbbreviations | "Dr." → "doctor", "vs." → "versus" |
| NormalizeSymbols | "cats & dogs" → "cats and dogs" |
| RemoveFillerWords | removes um, uh, hmm, er, ah, … |
| CollapseRepetitions | "I I I think" → "I think" |
Pipeline design
The English pipeline applies transforms in an order that ensures idempotence and correct interaction with Whisper's EnglishTextNormalizer:
- Pre-Whisper: email, URL, symbol, abbreviation (patterns Whisper would mangle)
- Digit preparation: expand digit runs, convert digit words
- Core: WhisperEnglishNormalize (lowercase, punctuation, contractions, compound numbers → digits)
- Post-Whisper: digit run expansion, digit word cleanup, currency, percentage, ordinal (patterns Whisper preserves or compacts)
- Cleanup: filler words, repetition collapse
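The effect of stage ordering can be illustrated with a small standard-library sketch. It is not the package's pipeline: `normalize_emails`, `remove_punctuation`, `compose`, and the simplified `EMAIL` pattern are hypothetical stand-ins, chosen only to show why pattern-specific steps need to run before a generic step strips the characters they rely on.

```python
import re
import string

# Assumed, simplified email pattern; real addresses are more varied.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def normalize_emails(text: str) -> str:
    """Spell out '@' and '.' inside anything that looks like an email address."""
    return EMAIL.sub(
        lambda m: m.group().replace("@", " at ").replace(".", " dot "),
        text,
    )


def remove_punctuation(text: str) -> str:
    """Delete every ASCII punctuation character."""
    return text.translate(str.maketrans("", "", string.punctuation))


def compose(*steps):
    """Return a function that applies the given steps from left to right."""
    def run(text: str) -> str:
        for step in steps:
            text = step(text)
        return text
    return run


email_first = compose(normalize_emails, str.lower, remove_punctuation)
punctuation_first = compose(remove_punctuation, str.lower, normalize_emails)

sample = "Write to Info@Example.com."
good = email_first(sample)        # address is spelled out word by word
bad = punctuation_first(sample)   # '@' and '.' are gone before the email step runs
```

With the email step first, `good` is "write to info at example dot com". With punctuation removal first, `bad` is "write to infoexamplecom", because the address no longer looks like an address by the time the email step sees it. Running `email_first` on its own output changes nothing, which is the idempotence property the ordering is meant to preserve.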
Non-English
For non-English text, pass language to get Whisper's BasicTextNormalizer:
normalize_for_wer("Das kostet fünf Euro", language="de")