jiwer-compatible WER normalizer with number, email, URL, filler, and symbol normalization for voice AI evaluation

extended-wer-normalizer

jiwer-compatible text normalizer for Word Error Rate (WER) evaluation in voice AI.

Extends jiwer's built-in transforms with normalizations that matter for real-world ASR evaluation: phone numbers, emails, URLs, currency, percentages, ordinals, filler words, and stuttering.

Installation

pip install extended-wer-normalizer

Quick start

from extended_wer_normalizer import normalize_for_wer

normalize_for_wer("Call 0176 or email info@example.com, it costs $5.99")
# → "call 0 1 7 6 or email info at example dot com it costs five dollars ninety nine cents"

normalize_for_wer("Um, 1st place goes to Dr. Smith with 50% accuracy")
# → "first place goes to doctor smith with fifty percent accuracy"

jiwer integration

Every normalization is a jiwer.AbstractTransform subclass, so you can compose them freely with jiwer's own transforms:

import jiwer
from extended_wer_normalizer.transforms import NormalizeEmails, ExpandDigitRuns

pipeline = jiwer.Compose([
    NormalizeEmails(),
    ExpandDigitRuns(),
    jiwer.ToLowerCase(),
    jiwer.RemovePunctuation(),
    jiwer.ReduceToListOfListOfWords(),
])

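# Apply the same pipeline to both sides so reference and hypothesis are normalized identically.
wer = jiwer.wer(
    "info at example dot com",
    "info@example.com",
    reference_transform=pipeline,
    hypothesis_transform=pipeline,
)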
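
To see why the choice of transform changes the score, the sketch below computes WER by hand with the standard library only. It does not call jiwer or this package: word_error_rate and spell_out_email are hypothetical helpers written for illustration, and the email regex is a simplification.

```python
import re


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),   # substitution (or match, cost 0)
            ))
        prev = curr
    return prev[-1] / len(ref)


def spell_out_email(text: str) -> str:
    """Hypothetical stand-in for an email transform: '@' -> ' at ', '.' -> ' dot '."""
    return re.sub(
        r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+",
        lambda m: m.group(0).replace("@", " at ").replace(".", " dot "),
        text,
    )


reference = "info at example dot com"
hypothesis = "info@example.com"

raw = word_error_rate(reference, hypothesis)
normalized = word_error_rate(reference, spell_out_email(hypothesis))
print(raw, normalized)  # → 1.0 0.0
```

Without normalization the single token "info@example.com" counts as one substitution plus four deletions against a five-word reference; once the address is spelled out, the two sides match word for word.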

Use the pre-built pipeline directly with jiwer.wer:

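import jiwer
from extended_wer_normalizer import english_wer_pipeline

# Illustrative inputs; substitute your own reference transcript and ASR output.
reference = "email info@example.com"
hypothesis = "email info at example dot com"
wer = jiwer.wer(reference, hypothesis, reference_transform=english_wer_pipeline, hypothesis_transform=english_wer_pipeline)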

Available transforms

Transform                  Example or effect
ExpandDigitRuns            "0176" → "0 1 7 6"
DigitWordsToChars          "zero one seven" → "0 1 7"
WhisperEnglishNormalize    lowercase, punctuation, contractions, compound numbers
WhisperBasicNormalize      language-agnostic basic normalization
FinalDigitWordCleanup      residual digit-word sweep after compound resolution
NormalizeEmails            "user@example.com" → "user at example dot com"
NormalizeURLs              "https://example.com/path" → "example dot com"
NormalizeCurrency          "$5.99" → "five dollars ninety nine cents"
NormalizePercentages       "50%" → "fifty percent"
NormalizeOrdinals          "1st" → "first", "15th" → "fifteenth"
ExpandAbbreviations        "Dr." → "doctor", "vs." → "versus"
NormalizeSymbols           "cats & dogs" → "cats and dogs"
RemoveFillerWords          removes um, uh, hmm, er, ah, …
CollapseRepetitions        "I I I think" → "I think"
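
A few of the simpler rows can be approximated with regular expressions. The functions below are hypothetical re-implementations for illustration, not the package's source; the real transforms likely handle more edge cases, and the filler list here is only the five words named in the table.

```python
import re

# Assumption: only the fillers listed in the table; the package's list may be longer.
FILLERS = {"um", "uh", "hmm", "er", "ah"}


def expand_digit_runs(text: str) -> str:
    """Put a space between the digits of every digit run."""
    return re.sub(r"\d+", lambda m: " ".join(m.group(0)), text)


def normalize_symbols(text: str) -> str:
    """Spell out '&' and tidy the surrounding whitespace."""
    return " ".join(text.replace("&", " and ").split())


def remove_filler_words(text: str) -> str:
    """Drop filler tokens, ignoring case and adjacent punctuation."""
    kept = [w for w in text.split() if w.strip(",.!?").lower() not in FILLERS]
    return " ".join(kept)


def collapse_repetitions(text: str) -> str:
    """Collapse a word that is immediately repeated (a simple stutter model)."""
    return re.sub(r"\b(\w+)(?:\s+\1\b)+", r"\1", text)


print(expand_digit_runs("0176"))             # → 0 1 7 6
print(normalize_symbols("cats & dogs"))      # → cats and dogs
print(remove_filler_words("Um, I think"))    # → I think
print(collapse_repetitions("I I I think"))   # → I think
```

The repetition regex is case-sensitive and only matches whole words, so "the theory" is left alone while "the the theory" loses one "the".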

Pipeline design

The English pipeline applies transforms in an order chosen to keep the result idempotent (normalizing already-normalized text changes nothing) and to interact correctly with Whisper's EnglishTextNormalizer:

  1. Pre-Whisper: email, URL, symbol, abbreviation (patterns Whisper would mangle)
  2. Digit preparation: expand digit runs, convert digit words
  3. Core: WhisperEnglishNormalize (lowercase, punctuation, contractions, compound numbers → digits)
  4. Post-Whisper: digit run expansion, digit word cleanup, currency, percentage, ordinal (patterns Whisper preserves or compacts)
  5. Cleanup: filler words, repetition collapse
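
The ordering concern in the steps above can be shown with a two-stage toy pipeline. This is a sketch under stated assumptions: compose, spell_out_emails, and lowercase_and_strip_punctuation are hypothetical helpers, and the punctuation stage is only a crude stand-in for a core normalizer, not Whisper's EnglishTextNormalizer.

```python
import re
from functools import reduce
from typing import Callable, Sequence

Stage = Callable[[str], str]


def compose(stages: Sequence[Stage]) -> Stage:
    """Return a function that applies the stages left to right."""
    def run(text: str) -> str:
        return reduce(lambda acc, stage: stage(acc), stages, text)
    return run


def spell_out_emails(text: str) -> str:
    """Pattern-specific stage: needs '@' and '.' to still be present."""
    return re.sub(
        r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+",
        lambda m: m.group(0).replace("@", " at ").replace(".", " dot "),
        text,
    )


def lowercase_and_strip_punctuation(text: str) -> str:
    """Crude stand-in for a core normalizer."""
    return re.sub(r"[^\w\s]", "", text).lower()


sample = "Email Info@Example.com."

# Pattern-specific stage first, generic stage second.
good = compose([spell_out_emails, lowercase_and_strip_punctuation])
# Reversed order: punctuation is gone before the email stage can see it.
bad = compose([lowercase_and_strip_punctuation, spell_out_emails])

print(good(sample))  # → email info at example dot com
print(bad(sample))   # → email infoexamplecom

# Idempotence check: running the pipeline on its own output changes nothing.
assert good(good(sample)) == good(sample)
```

A check like the final assertion is a cheap regression test for any custom pipeline: if a second pass changes the text, some stage is reacting to another stage's output.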

Non-English

For non-English text, pass language to get Whisper's BasicTextNormalizer:

normalize_for_wer("Das kostet fünf Euro", language="de")
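
For a rough idea of what language-agnostic normalization of this kind does, here is a standard-library approximation. It is an assumption-laden sketch, not Whisper's code and not this package's implementation: basic_normalize is a hypothetical name, and the rule shown (lowercase, replace marks, symbols, and punctuation with spaces, collapse whitespace) is a simplification.

```python
import re
import unicodedata


def basic_normalize(text: str) -> str:
    """Lowercase, blank out marks/symbols/punctuation, and collapse whitespace."""
    # NFKC composes characters such as 'u' + combining diaeresis into 'ü',
    # so accented letters survive the mark-stripping step below.
    text = unicodedata.normalize("NFKC", text.lower())
    # Unicode general categories starting with M (marks), S (symbols),
    # or P (punctuation) become spaces; letters and digits are kept.
    text = "".join(
        " " if unicodedata.category(ch)[0] in "MSP" else ch for ch in text
    )
    return re.sub(r"\s+", " ", text).strip()


print(basic_normalize("Das kostet fünf Euro!"))  # → das kostet fünf euro
```

Unlike the English pipeline, nothing here rewrites numbers or expands abbreviations, so "fünf" and "5" would still count as different words.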
