Evaluation toolkit for AI systems in African language contexts — code-switching, dialectal robustness, and low-resource NLP.
NaijaEval
Evaluation infrastructure for AI systems that mainstream benchmarks can't assess — built for African languages, code-switching, and dialectal robustness.
Why this exists
Standard NLP benchmarks such as GLUE, HELM, and XTREME were designed primarily around high-resource languages and standard dialects. When you build a system for Nigerian English, Yoruba, Igbo, Hausa, Nigerian Pidgin, or Swahili, none of those benchmarks tells you whether your system actually works on the input your users produce.
The specific gaps NaijaEval addresses:
- Mainstream benchmarks include no metric for code-switch robustness. A model that scores 0.85 on clean English may collapse when a user switches mid-sentence from English to Yoruba.
- No standard way to measure dialectal degradation. WER on standard British English says nothing about WER on Nigerian English.
- Terminology preservation is unmeasured. BLEU doesn't weight medical or legal terms differently from "the" — but in practice, getting "hypertension" wrong matters more than getting word order slightly wrong.
- Hallucination in low-resource translation is invisible. When a model is undertrained on Swahili, it hallucinates. Standard metrics don't flag this.
NaijaEval provides composable, task-agnostic metrics that work on real African language evaluation challenges — out of the box.
Quickstart
pip install naijaeval
from naijaeval.metrics import (
    CodeSwitchRateMetric,
    TerminologyPreservationMetric,
    HallucinationRateMetric,
    WERMetric,
)

# Measure how mixed your test data is
csr = CodeSwitchRateMetric()
result = csr.compute(
    predictions=["I dey go market abeg, wetin be the price?"],
    references=[],
)
print(f"Code-switch rate: {result.score:.3f}")
# Code-switch rate: 0.444

# Check terminology preservation in medical translation
tpr = TerminologyPreservationMetric(domain="medical")
result = tpr.compute(
    predictions=["Alaisan naa ni malaria ati hypertension."],
    references=[],
)
print(f"Term preservation: {result.score:.3f}")
# Term preservation: 0.150 (most terms not preserved → low Yoruba coverage)

# Detect hallucination in summarisation
hal = HallucinationRateMetric()
result = hal.compute(
    predictions=["The Lagos General Hospital in Kano treated 500 patients."],
    references=["The hospital in Lagos treated patients."],  # source
)
print(f"Hallucination rate: {result.score:.3f}")
print(f"Hallucinated: {result.details['per_sample'][0]['hallucinated']}")
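For intuition, an entity-based check like the one in the last quickstart example can be sketched in plain Python. This is not NaijaEval's implementation: the function names `extract_entities` and `hallucinated_entities`, and the entity heuristic itself (numbers plus capitalised words that are not the first token), are assumptions made for illustration.

```python
import re
from typing import Set

TOKEN_RE = re.compile(r"[A-Za-z]+|\d+")

def extract_entities(text: str) -> Set[str]:
    """Crude heuristic: numbers, plus capitalised words other than the first token."""
    tokens = TOKEN_RE.findall(text)
    return {
        tok for i, tok in enumerate(tokens)
        if tok.isdigit() or (i > 0 and tok[0].isupper())
    }

def hallucinated_entities(prediction: str, source: str) -> Set[str]:
    """Entities in the prediction with no case-insensitive match in the source."""
    source_tokens = {t.lower() for t in TOKEN_RE.findall(source)}
    return {e for e in extract_entities(prediction) if e.lower() not in source_tokens}

def hallucination_rate(prediction: str, source: str) -> float:
    """Share of predicted entities that the source does not support."""
    entities = extract_entities(prediction)
    if not entities:
        return 0.0
    return len(hallucinated_entities(prediction, source)) / len(entities)

prediction = "The Lagos General Hospital in Kano treated 500 patients."
source = "The hospital in Lagos treated patients."
print(sorted(hallucinated_entities(prediction, source)))  # ['500', 'General', 'Kano']
print(f"{hallucination_rate(prediction, source):.2f}")    # 0.60
```

A heuristic like this only catches unsupported names and numbers. It cannot see a fabricated claim that reuses entities from the source, which is the kind of case the planned NLI-based detector is meant to cover.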
Supported tasks and benchmarks
| Benchmark | Task | Languages | Dataset |
|---|---|---|---|
| `naija_mt_v1` | Machine translation | English → Yoruba | MENYO-20k |
| `coswitch_asr_v1` | ASR robustness | Nigerian English / Pidgin | Common Voice |
Supported metrics
| Metric | Category | Description |
|---|---|---|
| `code_switch_rate` | Robustness | Fraction of token pairs that switch language |
| `dialectal_robustness_score` | Robustness | Relative performance drop on dialectal vs standard input |
| `terminology_preservation_rate` | Fidelity | Fraction of domain terms present in output |
| `bleu` | Fidelity | Corpus BLEU (sacrebleu) |
| `chrf` | Fidelity | Character F-score; better suited to morphologically rich languages |
| `wer` | ASR | Word Error Rate |
| `cer` | ASR | Character Error Rate |
| `wer_delta` | ASR | WER degradation from standard to dialectal input |
| `hallucination_rate` | Consistency | Entity-based hallucination detection |
| `consistency_score` | Consistency | N-gram faithfulness to source |
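To make "relative performance drop" concrete, here is one plausible reading for a higher-is-better score: the fraction of the standard-input score that is lost on dialectal input. The formula and the function name `relative_drop` are assumptions for illustration, not taken from the NaijaEval source.

```python
def relative_drop(standard_score: float, dialectal_score: float) -> float:
    """Relative drop of a higher-is-better score when moving to dialectal input.

    0.0 means no degradation; 0.4 means 40% of the standard-input score was lost.
    A negative value means the system did better on dialectal input.
    """
    if standard_score <= 0:
        raise ValueError("standard_score must be positive")
    return (standard_score - dialectal_score) / standard_score

# A model scoring 0.85 on standard input and 0.51 on dialectal input
print(f"{relative_drop(0.85, 0.51):.2f}")  # 0.40
```

Normalising by the standard-input score makes drops comparable across models with different baselines, at the cost of being undefined when the baseline is zero.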
Built-in domain term lists
medical · legal · financial · customer_support
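As a rough illustration of how a domain term list can drive a preservation score, the sketch below counts how many domain terms found in the source also appear verbatim in the output. The three-term list, the function name `term_preservation`, and the choice to compare against the source sentence are all assumptions; the built-in lists and the exact scoring rule used by `TerminologyPreservationMetric` may differ.

```python
import re
from typing import Set

# Hypothetical three-term list; NaijaEval's built-in lists are not reproduced here.
MEDICAL_TERMS = {"malaria", "hypertension", "diabetes"}

def words(text: str) -> Set[str]:
    """Lowercased word set of a sentence."""
    return set(re.findall(r"\w+", text.lower()))

def term_preservation(source: str, output: str, terms: Set[str]) -> float:
    """Fraction of domain terms in the source that also appear in the output."""
    expected = words(source) & terms
    if not expected:
        return 1.0  # nothing to preserve
    return len(expected & words(output)) / len(expected)

source = "The patient has malaria and hypertension."
output = "Alaisan naa ni malaria."  # "hypertension" was dropped
print(term_preservation(source, output, MEDICAL_TERMS))  # 0.5
```

A verbatim match like this treats a correct target-language rendering of a term as "not preserved", so it suits domains where the source term is expected to survive unchanged.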
Built-in language vocabularies
Yoruba (yo) · Igbo (ig) · Hausa (ha) · Nigerian Pidgin (pcm) · Swahili (sw) · Zulu (zu) · Amharic (am)
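Per-language vocabularies make a simple code-switch measure possible: tag each token by vocabulary lookup, then count adjacent token pairs whose tags differ. The sketch below follows that idea with toy vocabularies. The word lists, the `unk` tag, and the normalisation by number of adjacent pairs are assumptions, so its value need not match what `CodeSwitchRateMetric` reports for the same sentence.

```python
import re
from typing import Dict, Set

# Toy vocabularies for illustration only; not NaijaEval's built-in lists.
VOCAB: Dict[str, Set[str]] = {
    "pcm": {"dey", "abeg", "wetin"},
    "en": {"i", "go", "market", "be", "the", "price"},
}

def tag_token(token: str) -> str:
    """Return the first language whose vocabulary contains the token, else 'unk'."""
    for lang, vocabulary in VOCAB.items():
        if token in vocabulary:
            return lang
    return "unk"

def code_switch_rate(text: str) -> float:
    """Fraction of adjacent token pairs whose language tags differ."""
    labels = [tag_token(t) for t in re.findall(r"\w+", text.lower())]
    pairs = list(zip(labels, labels[1:]))
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)

print(f"{code_switch_rate('I dey go market abeg, wetin be the price?'):.3f}")  # 0.500
```

Here unknown tokens get their own tag and therefore count as switches; a real implementation would need a policy for out-of-vocabulary words and for words shared between languages.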
CLI reference
# List everything available
naijaeval list metrics
naijaeval list datasets
naijaeval list benchmarks
# Run a benchmark
naijaeval run \
  --benchmark naija_mt_v1 \
  --predictions preds.txt \
  --references refs.txt \
  --model Helsinki-NLP/opus-mt-en-yo \
  --output results.json
# Compare two models
naijaeval compare model_a.json model_b.json
# Generate HTML report
naijaeval report --input results.json --output report.html
Python API
# Run a full task evaluation
from naijaeval.tasks.translation import TranslationTask

task = TranslationTask(domain="medical")
results = task.evaluate(
    predictions=my_translations,
    references=reference_translations,
    sources=english_sentences,
)
for name, result in results.items():
    print(f"{name}: {result.score:.4f}")

# Compare ASR performance on standard vs dialectal input
from naijaeval.tasks.asr import ASRTask

task = ASRTask()
results = task.evaluate(
    predictions=standard_preds,
    references=standard_refs,
    dialectal_predictions=dialectal_preds,
    dialectal_references=dialectal_refs,
    dialect_name="Nigerian English",
)
print(results["wer_delta"].details["interpretation"])
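For readers unfamiliar with the quantities behind the ASR comparison, the sketch below computes word error rate as word-level edit distance divided by reference length, then takes the difference between a dialectal and a standard pair. Treating `wer_delta` as dialectal WER minus standard WER is an assumption based on the metric's description; the function `wer` here is a standalone illustration, not the library's `WERMetric`.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix processed so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)

standard_wer = wer("I am going to the market", "I am going to the market")
dialectal_wer = wer("I dey go market", "I they go market")
print(f"standard: {standard_wer:.3f}")                   # standard: 0.000
print(f"dialectal: {dialectal_wer:.3f}")                 # dialectal: 0.250
print(f"delta: {dialectal_wer - standard_wer:.3f}")      # delta: 0.250
```

In practice the two WER values come from whole test sets rather than single sentences, and the two sets should be matched in content as closely as possible so the delta reflects dialect rather than topic.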
Extending the toolkit
Register a custom metric:
from naijaeval import register_metric
from naijaeval.metrics.base import BaseMetric, MetricResult

@register_metric("my_custom_score")
class MyCustomScore(BaseMetric):
    name = "my_custom_score"
    description = "My domain-specific evaluation metric."
    higher_is_better = True

    def compute(self, predictions, references, **kwargs):
        score = ...  # your implementation
        return MetricResult(name=self.name, score=score)
Register a custom dataset:
from naijaeval import register_dataset

@register_dataset("my_corpus")
def load_my_corpus(split="test", **kwargs):
    # Return an iterable of {"source": ..., "target": ...} dicts
    ...
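As one way to fill in a loader body, the sketch below reads a two-column, tab-separated stream and yields dicts in the `{"source": ..., "target": ...}` shape described above. The TSV layout and the helper name `load_tsv_pairs` are assumptions; adapt the parsing to however your corpus is stored.

```python
import csv
import io
from typing import Dict, Iterable, Iterator

def load_tsv_pairs(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield {"source": ..., "target": ...} dicts from two-column TSV lines."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        yield {"source": row[0], "target": row[1]}

# In-memory stand-in for a corpus file, with placeholder sentences
sample = io.StringIO(
    "source sentence 1\ttarget sentence 1\n"
    "\n"
    "source sentence 2\ttarget sentence 2\n"
)
rows = list(load_tsv_pairs(sample))
print(len(rows))  # 2
print(rows[0])    # {'source': 'source sentence 1', 'target': 'target sentence 1'}
```

Because the helper is a generator, a large corpus is streamed row by row rather than loaded into memory at once.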
See docs/contributing/adding_metrics.md for the full contribution guide.
Roadmap
v0.1 (current)
- 10 core metrics across 4 categories
- 2 benchmarks (naija_mt_v1, coswitch_asr_v1)
- 5 dataset loaders (MENYO-20k, FLEURS ×3, sample)
- CLI and HTML reports
- Plugin system
v0.2 (planned)
- COMET and BERTScore integration
- NLI-based hallucination detection (upgrade from heuristic)
- Conversational AI task
- Swahili and Igbo translation benchmarks
- Interactive Colab notebook
v0.3 (planned)
- Leaderboard integration
- AfricaNLP workshop benchmark track
Citation
If you use NaijaEval in your research, please cite:
@software{buzugbe2026naijaeval,
  author  = {Buzugbe, Uche},
  title   = {{NaijaEval}: Evaluation toolkit for AI systems in African language contexts},
  year    = {2026},
  url     = {https://github.com/Uchebuzz/naijaeval},
  version = {0.1.0},
}
Contributing
Contributions are welcome. See CONTRIBUTING.md for how to add metrics, datasets, and benchmarks.
The fastest way to make a meaningful contribution is to:
- Add a new metric (see naijaeval/metrics/ for examples)
- Add a dataset loader for an underrepresented African language
- Run your own models against existing benchmarks and submit results
Community
- GitHub Discussions — questions, ideas, benchmark results
- AfricaNLP Workshop — the primary research community this toolkit serves
- Masakhane — African NLP community
License
Apache 2.0 — see LICENSE.
Because good models deserve honest benchmarks.