Skip to main content

SemanticWER: Meaning-Aware ASR Evaluation Toolkit for speech-to-LLM systems

Project description

🔥 SemanticWER

Evaluation framework for speech-to-LLM systems.

PyPI version Python 3.9+ License: MIT


Classic Word Error Rate (WER) measures token accuracy. But modern pipelines look like this:

Speech → ASR → LLM → Task (QA, summarization, agents, RAG)

A 20% WER transcript can preserve meaning — or completely break downstream reasoning. WER cannot tell the difference.

SemanticWER fixes this with a four-component composite score:

SemanticWER = w₁·L + w₂·E + w₃·S + w₄·T
Component What it measures
L — Lexical Standard WER + CER (NIST-compatible)
E — Entity Named entity preservation (PERSON, ORG, DATE, …)
S — Semantic Embedding cosine similarity (SBERT)
T — Task Downstream task success delta

Lower score = better transcript quality.


Installation

# Minimal (WER/CER + regex NER + Jaccard semantic fallback)
pip install semanticwer

# Recommended (full features)
pip install "semanticwer[full]"
python -m spacy download en_core_web_sm

Quick Start

from semanticwer import SemanticWER

metric = SemanticWER()  # defaults: weights=(0.3, 0.2, 0.3, 0.2)

result = metric(
    reference="The patient was prescribed 50mg of metformin twice daily",
    hypothesis="The patient was prescribed 15mg of metformin twice daily",
)

print(result.summary())
# ====================================================
#   SemanticWER Result
# ====================================================
#   Composite Score  : 0.3241  (lower = better)
# ----------------------------------------------------
#   [L] Lexical      : WER=0.1429  CER=0.0541  (w=0.30)
#   [E] Entity       : F1=0.8000  Recall=0.6667  (w=0.20)
#   [S] Semantic     : Sim=0.8923  (w=0.30)
#   [T] Task         : N/A  (w=0.20)
# ====================================================

print(result.wer)           # 0.1429
print(result.semantic_sim)  # 0.8923
print(result.entity_f1)     # 0.8000
print(result.score)         # 0.3241

torchmetrics-Style API

metric = SemanticWER(weights=(0.3, 0.2, 0.3, 0.2))

# Accumulate samples
for ref, hyp in dataset:
    metric.update(ref, hyp)

# Compute over full corpus
result = metric.aggregate()
print(f"Corpus SemanticWER: {result.score:.4f}")

HuggingFace evaluate-Style API

result = metric.compute(
    predictions=hypotheses,
    references=references,
)

Task Utility: The Game-Changer

Connect SemanticWER to your actual downstream task:

Built-in: ROUGE

from semanticwer import SemanticWER
from semanticwer.modules.task import TaskModule

metric = SemanticWER(
    weights=(0.25, 0.25, 0.25, 0.25),
    task_fn=TaskModule.rouge_adapter("rougeL"),
)
result = metric(ref, hyp)
print(result.task_score)  # 0.0–1.0

Built-in: Token F1 (SQuAD-style QA)

metric = SemanticWER(
    task_fn=TaskModule.f1_token_adapter(),
    weights=(0.25, 0.25, 0.25, 0.25),
)

Custom: Any callable

def my_qa_eval(reference: str, hypothesis: str) -> float:
    """Return 1.0 if hypothesis preserves the answer to our question."""
    ref_answer = qa_model(question="Who was mentioned?", context=reference)
    hyp_answer = qa_model(question="Who was mentioned?", context=hypothesis)
    return 1.0 if ref_answer == hyp_answer else 0.0

metric = SemanticWER(
    task_fn=my_qa_eval,
    weights=(0.2, 0.2, 0.3, 0.3),
)

Custom: LLM-as-judge

import anthropic

client = anthropic.Anthropic()

def llm_judge(reference: str, hypothesis: str) -> float:
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=10,
        messages=[{
            "role": "user",
            "content": (
                f"Score semantic equivalence 0.0–1.0 (1.0 = identical meaning).\n"
                f"REF: {reference}\nHYP: {hypothesis}\n"
                f"Respond with only a float."
            ),
        }],
    )
    return float(response.content[0].text.strip())

metric = SemanticWER(
    task_fn=TaskModule.llm_judge_adapter(llm_judge),
    weights=(0.2, 0.2, 0.3, 0.3),
)

NER Backend Selection

# spaCy (default, best accuracy for English)
metric = SemanticWER(ner_backend="spacy")

# HuggingFace transformers pipeline
metric = SemanticWER(ner_backend="hf")

# Lightweight regex (no extra deps)
metric = SemanticWER(ner_backend="regex")

# Disable entity scoring
metric = SemanticWER(ner_backend="none")

CLI

# Single pair
semanticwer --ref "John Smith called at 3pm" --hyp "Tom Jones called at 9am"

# Files (one sentence per line)
semanticwer --ref ref.txt --hyp hyp.txt

# With ROUGE task scoring
semanticwer --ref ref.txt --hyp hyp.txt --task rouge

# JSON output (for pipelines)
semanticwer --ref ref.txt --hyp hyp.txt --output json

# Custom weights
semanticwer --ref ref.txt --hyp hyp.txt --weights 0.4 0.2 0.3 0.1

# CSV output
semanticwer --ref ref.txt --hyp hyp.txt --output csv

Result Object

result = metric(ref, hyp)

result.score            # Composite SemanticWER [0, 1]
result.wer              # Classic WER
result.cer              # Character Error Rate
result.entity_f1        # Entity F1 score
result.entity_recall    # Entity recall
result.semantic_sim     # Cosine similarity [0, 1]
result.task_score       # Task utility score (or None)

result.to_dict()        # Full breakdown as dict
result.to_json()        # Full breakdown as JSON string
result.summary()        # Human-readable table

Reproducibility / Custom Weights

Weights must sum to 1.0. Recommended presets:

Use case Weights (L, E, S, T)
General ASR evaluation (0.3, 0.2, 0.3, 0.2)
Medical / legal (entity-critical) (0.2, 0.4, 0.2, 0.2)
LLM pipeline (task-first) (0.15, 0.15, 0.3, 0.4)
Backward-compatible WER (1.0, 0.0, 0.0, 0.0)

Citation

If you use SemanticWER in research, please cite:

@software{semanticwer2024,
  title     = {SemanticWER: Meaning-Aware ASR Evaluation Toolkit},
  year      = {2024},
  url       = {https://github.com/semanticwer/semanticwer},
}

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

semanticwer-0.1.0.tar.gz (20.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

semanticwer-0.1.0-py3-none-any.whl (17.5 kB view details)

Uploaded Python 3

File details

Details for the file semanticwer-0.1.0.tar.gz.

File metadata

  • Download URL: semanticwer-0.1.0.tar.gz
  • Upload date:
  • Size: 20.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for semanticwer-0.1.0.tar.gz
Algorithm Hash digest
SHA256 73d51f35e4d6ad71f0e32d761b8d4bc1a4c45142484d16cf209e2a0dabe67bdc
MD5 f8cf1bfcef15a54ec9e17ed0feb47697
BLAKE2b-256 cfec7c2848809628b5003fcd9462ddd9c9a3590fd18da88cdb1bfee8b1e4cce4

See more details on using hashes here.

File details

Details for the file semanticwer-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: semanticwer-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 17.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for semanticwer-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 34f1ac5d814d7af52065689aadf5013ed69074eba9568b732b2c13cb5f08a5d6
MD5 cea64dd5763bbb68b6f9f4923b755792
BLAKE2b-256 fbe4a2735c73ab862e9939baa136a9d004e28637af2cfb89675ec13fb7f07e1b

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page