jiwer-compatible WER normalizer with number, email, URL, filler, and symbol normalization for voice AI evaluation

extended-wer-normalizer

jiwer-compatible text normalizer for Word Error Rate (WER) evaluation in voice AI.

Extends jiwer's built-in transforms with normalizations that matter for real-world ASR evaluation: phone numbers, emails, URLs, currency, percentages, ordinals, filler words, and stuttering.

Installation

pip install extended-wer-normalizer

Quick start

from extended_wer_normalizer import normalize_for_wer

normalize_for_wer("Call 0176 or email info@example.com, it costs $5.99")
# → "call 0 1 7 6 or email info at example dot com it costs five dollars ninety nine cents"

normalize_for_wer("Um, 1st place goes to Dr. Smith with 50% accuracy")
# → "first place goes to doctor smith with fifty percent accuracy"

jiwer integration

Every normalization is a jiwer.AbstractTransform subclass — compose them freely:

import jiwer
from extended_wer_normalizer.transforms import NormalizeEmails, ExpandDigitRuns

pipeline = jiwer.Compose([
    NormalizeEmails(),
    ExpandDigitRuns(),
    jiwer.ToLowerCase(),
    jiwer.RemovePunctuation(),
    jiwer.ReduceToListOfListOfWords(),
])

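# Apply the same pipeline to both sides so they are normalized identically.
wer = jiwer.wer("info at example dot com", "info@example.com", reference_transform=pipeline, hypothesis_transform=pipeline)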

Use the pre-built pipeline directly with jiwer.wer:

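import jiwer
from extended_wer_normalizer import english_wer_pipeline

# Illustrative inputs; substitute your own reference and ASR transcripts.
reference = "call 0 1 7 6 or email info at example dot com"
hypothesis = "Call 0176 or email info@example.com"
wer = jiwer.wer(reference, hypothesis, reference_transform=english_wer_pipeline, hypothesis_transform=english_wer_pipeline)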
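For intuition about what the resulting number measures: WER is the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below is a plain standard-library illustration, not jiwer's implementation, and `word_error_rate` is a hypothetical helper name. It shows why an unnormalized email address is scored as a total miss even when the transcript is correct.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Same spoken content, scored before and after spelling out the address.
raw = word_error_rate("info at example dot com", "info@example.com")
normalized = word_error_rate("info at example dot com", "info at example dot com")
print(raw, normalized)
```

Here the raw comparison counts one substitution and four deletions against five reference words, while the normalized comparison counts no errors.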

Available transforms

Transform                  Example
ExpandDigitRuns            "0176" → "0 1 7 6"
DigitWordsToChars          "zero one seven" → "0 1 7"
WhisperEnglishNormalize    lowercase, punctuation, contractions, compound numbers
WhisperBasicNormalize      language-agnostic basic normalization
FinalDigitWordCleanup      residual digit-word sweep after compound resolution
NormalizeEmails            "user@example.com" → "user at example dot com"
NormalizeURLs              "https://example.com/path" → "example dot com"
NormalizeCurrency          "$5.99" → "five dollars ninety nine cents"
NormalizePercentages       "50%" → "fifty percent"
NormalizeOrdinals          "1st" → "first", "15th" → "fifteenth"
ExpandAbbreviations        "Dr." → "doctor", "vs." → "versus"
NormalizeSymbols           "cats & dogs" → "cats and dogs"
RemoveFillerWords          removes um, uh, hmm, er, ah, …
CollapseRepetitions        "I I I think" → "I think"
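To make a few of these rewrites concrete, here is a rough standard-library sketch of three of them. These are hypothetical regex approximations written for illustration; the function names, the `EMAIL` pattern, and the edge-case behaviour are assumptions and are not taken from the package's source.

```python
import re

# Assumption: a deliberately loose email matcher, good enough for a demo.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def expand_digit_runs(text: str) -> str:
    """Split every run of digits into space-separated single digits."""
    return re.sub(r"\d+", lambda m: " ".join(m.group()), text)


def normalize_emails(text: str) -> str:
    """Spell out '@' and '.' inside anything that looks like an email address."""
    return EMAIL.sub(
        lambda m: m.group().replace("@", " at ").replace(".", " dot "), text
    )


def collapse_repetitions(text: str) -> str:
    """Reduce immediate repeats of the same word to a single occurrence."""
    return re.sub(r"\b(\w+)(?:\s+\1\b)+", r"\1", text)


digits = expand_digit_runs("0176")
email = normalize_emails("user@example.com")
stutter = collapse_repetitions("I I I think")
print(digits, email, stutter, sep="\n")
```

The inputs mirror the examples in the table above, so the printed results can be compared against it line by line.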

Pipeline design

The English pipeline applies transforms in a fixed order so that the result is idempotent (normalizing already-normalized text changes nothing) and so that each step interacts correctly with Whisper's EnglishTextNormalizer:

  1. Pre-Whisper: email, URL, symbol, abbreviation (patterns Whisper would mangle)
  2. Digit preparation: expand digit runs, convert digit words
  3. Core: WhisperEnglishNormalize (lowercase, punctuation, contractions, compound numbers → digits)
  4. Post-Whisper: digit run expansion, digit word cleanup, currency, percentage, ordinal (patterns Whisper preserves or compacts)
  5. Cleanup: filler words, repetition collapse
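Why the ordering matters can be seen with a toy pipeline. The sketch below uses simplified stand-ins for four of the stages (all function names and the `FILLERS` set are hypothetical, and the punctuation step is only a crude substitute for the Whisper normalizer). It checks idempotence for the staged order and shows what goes wrong when punctuation is stripped before emails are spelled out.

```python
import re
import string
from functools import reduce

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")  # loose demo pattern
FILLERS = {"um", "uh", "hmm", "er", "ah"}
STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def normalize_emails(text: str) -> str:
    return EMAIL.sub(
        lambda m: m.group().replace("@", " at ").replace(".", " dot "), text
    )


def lowercase_strip_punct(text: str) -> str:
    # Crude stand-in for the core normalizer: lowercase, delete ASCII punctuation.
    return text.lower().translate(STRIP_PUNCT)


def expand_digit_runs(text: str) -> str:
    return re.sub(r"\d+", lambda m: " ".join(m.group()), text)


def remove_fillers(text: str) -> str:
    return " ".join(word for word in text.split() if word not in FILLERS)


def run(steps, text: str) -> str:
    """Apply each step to the output of the previous one, in order."""
    return reduce(lambda acc, step: step(acc), steps, text)


staged = [normalize_emails, lowercase_strip_punct, expand_digit_runs, remove_fillers]
misordered = [lowercase_strip_punct, normalize_emails, expand_digit_runs, remove_fillers]

text = "Um, call 0176 or email info@example.com"
once = run(staged, text)
twice = run(staged, once)
broken = run(misordered, text)
print(once)
print(broken)
```

In the staged order a second pass leaves the text unchanged. In the misordered variant the address has already lost its "@" and "." by the time the email step runs, so it collapses into a single unrecognizable token.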

Non-English

For non-English text, pass language to get Whisper's BasicTextNormalizer:

normalize_for_wer("Das kostet fünf Euro", language="de")
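As a rough picture of what language-agnostic normalization involves, the sketch below approximates the general approach of Whisper's BasicTextNormalizer with default options: lowercase, drop bracketed spans, replace marks, symbols, and punctuation with spaces, and collapse whitespace. It is a standard-library approximation under those assumptions, not the package's or Whisper's actual code; `basic_normalize` is a hypothetical name, and the final `strip()` is a choice made for this sketch.

```python
import re
import unicodedata


def basic_normalize(text: str) -> str:
    """Approximate language-agnostic cleanup without number or word rewriting."""
    text = text.lower()
    text = re.sub(r"[<\[][^>\]]*[>\]]", "", text)  # drop <...> and [...] spans
    text = re.sub(r"\(([^)]+?)\)", "", text)       # drop (...) spans
    # Replace Unicode marks (M), symbols (S), and punctuation (P) with spaces.
    text = "".join(
        " " if unicodedata.category(ch)[0] in "MSP" else ch
        for ch in unicodedata.normalize("NFKC", text)
    )
    return re.sub(r"\s+", " ", text).strip()


normalized = basic_normalize("Das kostet fünf Euro!")
print(normalized)
```

Note that nothing here converts numbers, currency, or ordinals: those rewrites are language-specific, which is why the non-English path is limited to basic cleanup.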

