
jiwer-compatible WER normalizer with number, email, URL, filler, and symbol normalization for voice AI evaluation in English, German, and French


extended-wer-normalizer

jiwer-compatible text normalizer for Word Error Rate (WER) evaluation in voice AI.

Extends jiwer's built-in transforms with normalizations that matter for real-world ASR evaluation: phone numbers, emails, URLs, currency, percentages, ordinals, filler words, and stuttering.

Installation

pip install extended-wer-normalizer

Quick start

from extended_wer_normalizer import normalize_for_wer

normalize_for_wer("Call 0176 or email info@example.com, it costs $5.99")
# → "call 0 1 7 6 or email info at example dot com it costs five dollars ninety nine cents"

normalize_for_wer("Um, 1st place goes to Dr. Smith with 50% accuracy")
# → "first place goes to doctor smith with fifty percent accuracy"

jiwer integration

Every normalization is a jiwer.AbstractTransform subclass, so you can compose them freely:

import jiwer
from extended_wer_normalizer.transforms import NormalizeEmails, ExpandDigitRuns

pipeline = jiwer.Compose([
    NormalizeEmails(),
    ExpandDigitRuns(),
    jiwer.ToLowerCase(),
    jiwer.RemovePunctuation(),
    jiwer.ReduceToListOfListOfWords(),
])

wer = jiwer.wer("info at example dot com", "info@example.com", reference_transform=pipeline, hypothesis_transform=pipeline)

Use the pre-built pipeline directly with jiwer.wer:

import jiwer
from extended_wer_normalizer import english_wer_pipeline

wer = jiwer.wer(reference, hypothesis, reference_transform=english_wer_pipeline, hypothesis_transform=english_wer_pipeline)
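To see why normalization changes the score, here is a minimal standard-library sketch of WER as word-level edit distance divided by reference length. The function word_error_rate is a hypothetical helper written for illustration; it is not part of jiwer or of this package, and jiwer's own implementation may differ in detail.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Hypothetical helper: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Unnormalized: the email is one token, so every reference word counts as an error.
raw = word_error_rate("info at example dot com", "info@example.com")
# After spelling out the email, reference and hypothesis agree word for word.
normalized = word_error_rate("info at example dot com", "info at example dot com")
print(raw, normalized)  # 1.0 0.0
```

The same transcript goes from 100% WER to 0% once both sides use the same spoken-form convention, which is the gap the transforms are meant to close.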

Available transforms

Transform               Example
ExpandDigitRuns         "0176" → "0 1 7 6"
DigitWordsToChars       "zero one seven" → "0 1 7"
NormalizeEmails         "user@example.com" → "user at example dot com"
NormalizeURLs           "https://example.com/path" → "example dot com"
NormalizeCurrency       "$5.99" → "five dollars ninety nine cents"
NormalizePercentages    "50%" → "fifty percent"
NormalizeOrdinals       "1st" → "first", "15th" → "fifteenth"
ExpandAbbreviations     "Dr." → "doctor", "vs." → "versus"
NormalizeSymbols        "cats & dogs" → "cats and dogs"
RemoveFillerWords       removes um, uh, hmm, er, ah, …
CollapseRepetitions     "I I I think" → "I think"
ExpandFrenchElisions    "j'aime" → "j aime", "qu'il" → "qu il" (French only)

Every transform that consumes language-specific data accepts a language="en" keyword (default English): NormalizeEmails(language="fr"), ExpandAbbreviations(language="de"), etc.
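As a rough illustration of what three of these transforms do to a string, the sketch below reproduces the table's examples with plain regular expressions. The function names and patterns are assumptions made for this example; the package's actual classes may match more cases or handle edge cases differently.

```python
import re


def expand_digit_runs(text: str) -> str:
    """Hypothetical stand-in: put a space between the digits of each run."""
    return re.sub(r"\d+", lambda m: " ".join(m.group()), text)


def spell_out_emails(text: str) -> str:
    """Hypothetical stand-in: read '@' as 'at' and '.' as 'dot' inside emails."""
    def replace(match: re.Match) -> str:
        return match.group().replace("@", " at ").replace(".", " dot ")

    # Simplified address pattern (an assumption, not a full RFC 5322 matcher).
    return re.sub(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", replace, text)


def collapse_repetitions(text: str) -> str:
    """Hypothetical stand-in: keep one copy of an immediately repeated word."""
    return re.sub(r"\b(\w+)(?:\s+\1\b)+", r"\1", text)


print(expand_digit_runs("0176"))              # 0 1 7 6
print(spell_out_emails("user@example.com"))   # user at example dot com
print(collapse_repetitions("I I I think"))    # I think
```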

Pipeline design

The English pipeline applies transforms left-to-right in a single pass:

  1. Pattern-specific (before punctuation is stripped): email, URL, symbol, abbreviation, currency, percentage, ordinal
  2. Core: contractions (I'm → i am), lowercase, punctuation removal
  3. Digit normalization: expand digit runs (0176 → 0 1 7 6), convert digit words (zero → 0)
  4. Cleanup: filler words, repetition collapse
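The ordering matters because pattern-specific steps rely on characters that punctuation removal deletes. The sketch below shows this with a deliberately simplified percentage step (it keeps the digits instead of spelling them out). All names here (compose, percent_to_word, and the others) are hypothetical and only illustrate the ordering argument, not the package's internals.

```python
import re
import string
from functools import reduce


def percent_to_word(text: str) -> str:
    # Simplified assumption: only rewrites the % sign, not the number itself.
    return re.sub(r"(\d+)\s*%", r"\1 percent", text)


def lowercase(text: str) -> str:
    return text.lower()


def remove_punctuation(text: str) -> str:
    # Deletes ASCII punctuation, which includes the % sign.
    return text.translate(str.maketrans("", "", string.punctuation))


def collapse_whitespace(text: str) -> str:
    return " ".join(text.split())


def compose(*steps):
    """Apply the steps left to right, feeding each output into the next step."""
    return lambda text: reduce(lambda acc, step: step(acc), steps, text)


pattern_first = compose(percent_to_word, lowercase, remove_punctuation, collapse_whitespace)
punctuation_first = compose(remove_punctuation, percent_to_word, lowercase, collapse_whitespace)

print(pattern_first("50% Off!"))      # 50 percent off
print(punctuation_first("50% Off!"))  # 50 off
```

When punctuation is stripped first, the % sign is gone before the percentage step runs, so the word "percent" never reaches the WER comparison.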

Supported languages

Full pipelines (with language-specific abbreviations, fillers, lexicons, and number/ordinal/percentage word forms via num2words) ship for English, German, and French. Any other language value selects a minimal fallback pipeline (lowercasing, punctuation removal, whitespace normalization).

from extended_wer_normalizer import normalize_for_wer

# German: titles, fillers, ordinals, percentages
normalize_for_wer("Hr. Müller, am 1. Januar, ähm, ungefähr 50% Rabatt", language="de")
# → "herr müller am erste januar ungefähr fünfzig prozent rabatt"

# French: elision contractions, ordinals, comma-decimal currency
normalize_for_wer("M. Dupont, le 1er janvier, c'est €5,99", language="fr")
# → "monsieur dupont le premier janvier c est cinq euros quatre vingt dix neuf centimes"

# Spanish, Italian, … fall through to the minimal pipeline
normalize_for_wer("¡Hola, mundo!", language="es")
# → "hola mundo"
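For intuition, a minimal fallback of this kind can be sketched with the standard library alone. The function minimal_normalize is a hypothetical name, and treating every Unicode "P" category character as punctuation is an assumption about how such a step might work, not a description of the package's code.

```python
import unicodedata


def minimal_normalize(text: str) -> str:
    """Hypothetical fallback: lowercase, drop punctuation, collapse whitespace."""
    lowered = text.lower()
    # Unicode general categories starting with "P" cover ASCII punctuation
    # as well as marks such as the inverted exclamation mark.
    without_punctuation = "".join(
        ch for ch in lowered if not unicodedata.category(ch).startswith("P")
    )
    return " ".join(without_punctuation.split())


print(minimal_normalize("¡Hola, mundo!"))  # hola mundo
```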

Per-language pipelines are also exposed for direct use with jiwer.wer:

from extended_wer_normalizer import (
    english_wer_pipeline,
    german_wer_pipeline,
    french_wer_pipeline,
)

To inspect or extend the language data:

from extended_wer_normalizer.languages import get_language_data, supported_languages

supported_languages()              # ["de", "en", "fr"]
get_language_data("de").abbreviations["hr."]  # "herr"

Quirks worth knowing

  • Comma vs. period decimals: French uses , (€5,99, 3,5%); the currency and percentage transforms accept either separator regardless of language.
  • German ordinals: matched as 1- to 3-digit numbers followed by . and a word (e.g. "1. Januar" but not "Es war 1990." or "1.5 Liter"). Numbers with 4+ digits are skipped to avoid false positives on years, and decimals are skipped because they are not ordinals.
  • French ordinals: matched as 1er, 1ère, 2e, 2es, 2ème, 2èmes, 2nde, 2nds, 2nd. num2words returns masculine forms (premier, deuxième); feminine variants like première or seconde are not produced.
  • Contractions: jiwer.ExpandCommonEnglishContractions runs only for English. French has a custom ExpandFrenchElisions that splits j', l', d', n', s', m', t', c', qu', jusqu', lorsqu', puisqu', quoiqu' from the following word. German has no contraction step.
  • German pluralization: most currency units stay invariant (fünf Euro, not fünf Euros).
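Two of the rules above, the German ordinal match and the French elision split, can be approximated with regular expressions. The patterns and function names below are assumptions reconstructed from the prose descriptions; the package's real patterns are not shown in this document and may differ.

```python
import re

# Assumed pattern: 1 to 3 digits, not preceded by a digit or a decimal
# separator, followed by ".", whitespace, and then a letter.
GERMAN_ORDINAL = re.compile(r"(?<![\d.,])(\d{1,3})\.\s+(?=[^\W\d_])")

# Assumed pattern: one of the listed elided forms at a word start,
# followed by an apostrophe (straight or typographic) and a word character.
FRENCH_ELISION = re.compile(
    r"\b(jusqu|lorsqu|puisqu|quoiqu|qu|[jldnsmtc])['’](?=\w)",
    re.IGNORECASE,
)


def find_german_ordinals(text: str) -> list[str]:
    """Hypothetical helper: return the digit strings that look like ordinals."""
    return GERMAN_ORDINAL.findall(text)


def split_french_elisions(text: str) -> str:
    """Hypothetical helper: replace the apostrophe after an elided form with a space."""
    return FRENCH_ELISION.sub(r"\1 ", text)


print(find_german_ordinals("am 1. Januar"))   # ['1']
print(find_german_ordinals("Es war 1990."))   # []
print(find_german_ordinals("1.5 Liter"))      # []
print(split_french_elisions("c'est qu'il"))   # c est qu il
print(split_french_elisions("aujourd'hui"))   # aujourd'hui
```

The word-boundary anchor is what leaves aujourd'hui intact: the d there sits in the middle of a word, so it is not treated as an elided article.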
