SemanticWER: Meaning-Aware ASR Evaluation Toolkit for speech-to-LLM systems
Project description
🔥 SemanticWER
Evaluation framework for speech-to-LLM systems.
Classic Word Error Rate (WER) measures token accuracy. But modern pipelines look like this:
Speech → ASR → LLM → Task (QA, summarization, agents, RAG)
A 20% WER transcript can preserve meaning — or completely break downstream reasoning. WER cannot tell the difference.
SemanticWER fixes this with a four-component composite score:
SemanticWER = w₁·L + w₂·E + w₃·S + w₄·T
| Component | What it measures |
|---|---|
| L — Lexical | Standard WER + CER (NIST-compatible) |
| E — Entity | Named entity preservation (PERSON, ORG, DATE, …) |
| S — Semantic | Embedding cosine similarity (SBERT) |
| T — Task | Downstream task success delta |
Lower score = better transcript quality.
Installation
# Minimal (WER/CER + regex NER + Jaccard semantic fallback)
pip install semanticwer
# Recommended (full features)
pip install "semanticwer[full]"
python -m spacy download en_core_web_sm
Quick Start
from semanticwer import SemanticWER
metric = SemanticWER() # defaults: weights=(0.3, 0.2, 0.3, 0.2)
result = metric(
reference="The patient was prescribed 50mg of metformin twice daily",
hypothesis="The patient was prescribed 15mg of metformin twice daily",
)
print(result.summary())
# ====================================================
# SemanticWER Result
# ====================================================
# Composite Score : 0.3241 (lower = better)
# ----------------------------------------------------
# [L] Lexical : WER=0.1429 CER=0.0541 (w=0.30)
# [E] Entity : F1=0.8000 Recall=0.6667 (w=0.20)
# [S] Semantic : Sim=0.8923 (w=0.30)
# [T] Task : N/A (w=0.20)
# ====================================================
print(result.wer) # 0.1429
print(result.semantic_sim) # 0.8923
print(result.entity_f1) # 0.8000
print(result.score) # 0.3241
torchmetrics-Style API
metric = SemanticWER(weights=(0.3, 0.2, 0.3, 0.2))
# Accumulate samples
for ref, hyp in dataset:
metric.update(ref, hyp)
# Compute over full corpus
result = metric.aggregate()
print(f"Corpus SemanticWER: {result.score:.4f}")
HuggingFace evaluate-Style API
result = metric.compute(
predictions=hypotheses,
references=references,
)
Task Utility: The Game-Changer
Connect SemanticWER to your actual downstream task:
Built-in: ROUGE
from semanticwer import SemanticWER
from semanticwer.modules.task import TaskModule
metric = SemanticWER(
weights=(0.25, 0.25, 0.25, 0.25),
task_fn=TaskModule.rouge_adapter("rougeL"),
)
result = metric(ref, hyp)
print(result.task_score) # 0.0–1.0
Built-in: Token F1 (SQuAD-style QA)
metric = SemanticWER(
task_fn=TaskModule.f1_token_adapter(),
weights=(0.25, 0.25, 0.25, 0.25),
)
Custom: Any callable
def my_qa_eval(reference: str, hypothesis: str) -> float:
"""Return 1.0 if hypothesis preserves the answer to our question."""
ref_answer = qa_model(question="Who was mentioned?", context=reference)
hyp_answer = qa_model(question="Who was mentioned?", context=hypothesis)
return 1.0 if ref_answer == hyp_answer else 0.0
metric = SemanticWER(
task_fn=my_qa_eval,
weights=(0.2, 0.2, 0.3, 0.3),
)
Custom: LLM-as-judge
import anthropic
client = anthropic.Anthropic()
def llm_judge(reference: str, hypothesis: str) -> float:
response = client.messages.create(
model="claude-3-5-haiku-20241022",
max_tokens=10,
messages=[{
"role": "user",
"content": (
f"Score semantic equivalence 0.0–1.0 (1.0 = identical meaning).\n"
f"REF: {reference}\nHYP: {hypothesis}\n"
f"Respond with only a float."
),
}],
)
return float(response.content[0].text.strip())
metric = SemanticWER(
task_fn=TaskModule.llm_judge_adapter(llm_judge),
weights=(0.2, 0.2, 0.3, 0.3),
)
NER Backend Selection
# spaCy (default, best accuracy for English)
metric = SemanticWER(ner_backend="spacy")
# HuggingFace transformers pipeline
metric = SemanticWER(ner_backend="hf")
# Lightweight regex (no extra deps)
metric = SemanticWER(ner_backend="regex")
# Disable entity scoring
metric = SemanticWER(ner_backend="none")
CLI
# Single pair
semanticwer --ref "John Smith called at 3pm" --hyp "Tom Jones called at 9am"
# Files (one sentence per line)
semanticwer --ref ref.txt --hyp hyp.txt
# With ROUGE task scoring
semanticwer --ref ref.txt --hyp hyp.txt --task rouge
# JSON output (for pipelines)
semanticwer --ref ref.txt --hyp hyp.txt --output json
# Custom weights
semanticwer --ref ref.txt --hyp hyp.txt --weights 0.4 0.2 0.3 0.1
# CSV output
semanticwer --ref ref.txt --hyp hyp.txt --output csv
Result Object
result = metric(ref, hyp)
result.score # Composite SemanticWER [0, 1]
result.wer # Classic WER
result.cer # Character Error Rate
result.entity_f1 # Entity F1 score
result.entity_recall # Entity recall
result.semantic_sim # Cosine similarity [0, 1]
result.task_score # Task utility score (or None)
result.to_dict() # Full breakdown as dict
result.to_json() # Full breakdown as JSON string
result.summary() # Human-readable table
Reproducibility / Custom Weights
Weights must sum to 1.0. Recommended presets:
| Use case | Weights (L, E, S, T) |
|---|---|
| General ASR evaluation | (0.3, 0.2, 0.3, 0.2) |
| Medical / legal (entity-critical) | (0.2, 0.4, 0.2, 0.2) |
| LLM pipeline (task-first) | (0.15, 0.15, 0.3, 0.4) |
| Backward-compatible WER | (1.0, 0.0, 0.0, 0.0) |
Citation
If you use SemanticWER in research, please cite:
@software{semanticwer2024,
title = {SemanticWER: Meaning-Aware ASR Evaluation Toolkit},
year = {2024},
url = {https://github.com/semanticwer/semanticwer},
}
License
MIT
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file semanticwer-0.1.0.tar.gz.
File metadata
- Download URL: semanticwer-0.1.0.tar.gz
- Upload date:
- Size: 20.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
73d51f35e4d6ad71f0e32d761b8d4bc1a4c45142484d16cf209e2a0dabe67bdc
|
|
| MD5 |
f8cf1bfcef15a54ec9e17ed0feb47697
|
|
| BLAKE2b-256 |
cfec7c2848809628b5003fcd9462ddd9c9a3590fd18da88cdb1bfee8b1e4cce4
|
File details
Details for the file semanticwer-0.1.0-py3-none-any.whl.
File metadata
- Download URL: semanticwer-0.1.0-py3-none-any.whl
- Upload date:
- Size: 17.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
34f1ac5d814d7af52065689aadf5013ed69074eba9568b732b2c13cb5f08a5d6
|
|
| MD5 |
cea64dd5763bbb68b6f9f4923b755792
|
|
| BLAKE2b-256 |
fbe4a2735c73ab862e9939baa136a9d004e28637af2cfb89675ec13fb7f07e1b
|