Evaluation toolkit for AI systems in African language contexts — code-switching, dialectal robustness, and low-resource NLP.

NaijaEval

Evaluation infrastructure for AI systems that mainstream benchmarks can't assess — built for African languages, code-switching, and dialectal robustness.


Why this exists

Standard NLP benchmarks — GLUE, HELM, XTREME — were built for high-resource languages and standard dialects. When you build a system for Nigerian English, Yoruba, Igbo, Hausa, Nigerian Pidgin, or Swahili, none of those benchmarks tell you whether your system actually works.

The specific gaps NaijaEval addresses:

  • No metric exists for code-switch robustness. A model that scores 0.85 on clean English may collapse when a user switches mid-sentence from English to Yoruba.
  • No standard way to measure dialectal degradation. WER on standard British English says nothing about WER on Nigerian English.
  • Terminology preservation is unmeasured. BLEU doesn't weight medical or legal terms differently from "the" — but in practice, getting "hypertension" wrong matters more than getting word order slightly wrong.
  • Hallucination in low-resource translation is invisible. A model that is undertrained on Swahili tends to produce fluent but unsupported output, and standard overlap metrics don't flag it.

NaijaEval provides composable, task-agnostic metrics that work on real African language evaluation challenges — out of the box.


Quickstart

Install from PyPI:

pip install naijaeval

Then import the metrics you need:

from naijaeval.metrics import (
    CodeSwitchRateMetric,
    TerminologyPreservationMetric,
    HallucinationRateMetric,
    WERMetric,
)

# Measure how mixed your test data is
csr = CodeSwitchRateMetric()
result = csr.compute(
    predictions=["I dey go market abeg, wetin be the price?"],
    references=[],
)
print(f"Code-switch rate: {result.score:.3f}")
# Code-switch rate: 0.444

# Check terminology preservation in medical translation
tpr = TerminologyPreservationMetric(domain="medical")
result = tpr.compute(
    predictions=["Alaisan naa ni malaria ati hypertension."],
    references=[],
)
print(f"Term preservation: {result.score:.3f}")
# Term preservation: 0.150  (most terms not preserved → low Yoruba coverage)

# Detect hallucination in summarisation
hal = HallucinationRateMetric()
result = hal.compute(
    predictions=["The Lagos General Hospital in Kano treated 500 patients."],
    references=["The hospital in Lagos treated patients."],  # the source text to check against
)
print(f"Hallucination rate: {result.score:.3f}")
print(f"Hallucinated: {result.details['per_sample'][0]['hallucinated']}")
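
The entity-based idea behind this check can be illustrated without the library. The sketch below is not NaijaEval's implementation. It assumes that an "entity" is either a number or a capitalised word other than the first word of the text, and it reports the ones in the prediction that never occur in the source. The function name `unsupported_entities` is hypothetical.

```python
import re


def unsupported_entities(prediction: str, source: str) -> tuple[list[str], float]:
    """Return entity-like tokens in `prediction` missing from `source`, and their rate.

    Assumption: an "entity" is a number or a capitalised word that is not
    the first token of the text. A real system would use a proper NER model.
    """
    tokens = re.findall(r"\w+", prediction)
    source_vocab = {t.lower() for t in re.findall(r"\w+", source)}

    # Skip the first token, which is capitalised only because it starts the sentence.
    candidates = [t for t in tokens[1:] if t[0].isupper() or t.isdigit()]
    missing = [t for t in candidates if t.lower() not in source_vocab]
    rate = len(missing) / len(candidates) if candidates else 0.0
    return missing, rate


missing, rate = unsupported_entities(
    "The Lagos General Hospital in Kano treated 500 patients.",
    "The hospital in Lagos treated patients.",
)
print(missing, round(rate, 2))
# ['General', 'Kano', '500'] 0.6
```

A capitalisation heuristic like this one over-triggers on titles and misses lowercase entities, which is one reason a score from this sketch should not be expected to match the library's.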

Supported tasks and benchmarks

| Benchmark | Task | Languages | Dataset |
| --- | --- | --- | --- |
| naija_mt_v1 | Machine translation | English → Yoruba | MENYO-20k |
| coswitch_asr_v1 | ASR robustness | Nigerian English / Pidgin | Common Voice |

Supported metrics

| Metric | Category | Description |
| --- | --- | --- |
| code_switch_rate | Robustness | Fraction of token pairs that switch language |
| dialectal_robustness_score | Robustness | Relative performance drop on dialectal vs standard input |
| terminology_preservation_rate | Fidelity | Fraction of domain terms present in output |
| bleu | Fidelity | Corpus BLEU (sacrebleu) |
| chrf | Fidelity | Character F-score; better suited to morphologically rich languages |
| wer | ASR | Word Error Rate |
| cer | ASR | Character Error Rate |
| wer_delta | ASR | WER degradation from standard to dialectal input |
| hallucination_rate | Consistency | Entity-based hallucination detection |
| consistency_score | Consistency | N-gram faithfulness to source |
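
To make the code_switch_rate description concrete, here is a minimal sketch that reads "token pairs" as adjacent pairs, which is an assumption. Language tagging uses a toy lexicon with English as the fallback; the lexicon and the function name are hypothetical, so the score will not match what the library reports for the same sentence.

```python
def code_switch_rate(tokens: list[str], lexicon: dict[str, str], default: str = "en") -> float:
    """Fraction of adjacent token pairs whose language labels differ."""
    labels = [lexicon.get(tok.lower(), default) for tok in tokens]
    if len(labels) < 2:
        return 0.0  # a single token cannot contain a switch
    switches = sum(a != b for a, b in zip(labels, labels[1:]))
    return switches / (len(labels) - 1)


# Toy lexicon (an assumption for illustration): token -> language code.
toy_lexicon = {"dey": "pcm", "abeg": "pcm", "wetin": "pcm"}

tokens = "I dey go market abeg".split()
print(code_switch_rate(tokens, toy_lexicon))
# 0.75  (labels: en, pcm, en, en, pcm -> 3 switches over 4 adjacent pairs)
```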

Built-in domain term lists

medical · legal · financial · customer_support
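
As a rough illustration of how a term list can drive a preservation score, the sketch below counts how many expected terms appear in an output string. It assumes the caller supplies the expected terms explicitly and that matching is case-insensitive on whole words; the library's own denominator and matching rules may differ. The four-item list is hypothetical and does not reproduce the built-in medical list.

```python
import re


def term_preservation(output: str, expected_terms: list[str]) -> float:
    """Fraction of `expected_terms` found in `output` (case-insensitive, whole words)."""
    if not expected_terms:
        return 0.0
    text = output.lower()
    found = [
        term for term in expected_terms
        if re.search(rf"\b{re.escape(term.lower())}\b", text)
    ]
    return len(found) / len(expected_terms)


# Hypothetical mini term list; the library's built-in lists are not reproduced here.
medical_terms = ["malaria", "hypertension", "diabetes", "insulin"]
score = term_preservation("Alaisan naa ni malaria ati hypertension.", medical_terms)
print(score)
# 0.5  (2 of the 4 listed terms occur in the output)
```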

Built-in language vocabularies

Yoruba (yo) · Igbo (ig) · Hausa (ha) · Nigerian Pidgin (pcm) · Swahili (sw) · Zulu (zu) · Amharic (am)


CLI reference

# List everything available
naijaeval list metrics
naijaeval list datasets
naijaeval list benchmarks

# Run a benchmark
naijaeval run \
    --benchmark naija_mt_v1 \
    --predictions preds.txt \
    --references refs.txt \
    --model Helsinki-NLP/opus-mt-en-yo \
    --output results.json

# Compare two models
naijaeval compare model_a.json model_b.json

# Generate HTML report
naijaeval report --input results.json --output report.html

Python API

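# Run a full task evaluation. my_translations, reference_translations and
# english_sentences are placeholders for your own data.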
from naijaeval.tasks.translation import TranslationTask

task = TranslationTask(domain="medical")
results = task.evaluate(
    predictions=my_translations,
    references=reference_translations,
    sources=english_sentences,
)
for name, result in results.items():
    print(f"{name}: {result.score:.4f}")

# Compare ASR performance on standard vs dialectal input
from naijaeval.tasks.asr import ASRTask

task = ASRTask()
results = task.evaluate(
    predictions=standard_preds,
    references=standard_refs,
    dialectal_predictions=dialectal_preds,
    dialectal_references=dialectal_refs,
    dialect_name="Nigerian English",
)
print(results["wer_delta"].details["interpretation"])
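
For readers who want to see what sits behind a WER comparison, here is a standalone sketch. It computes word error rate as word-level edit distance divided by reference length, then takes the plain difference between a dialectal and a standard score. Treating the delta as an absolute difference is an assumption (the library may normalise it differently), and the example transcripts are invented for illustration.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


# Invented transcripts: a clean standard sample and a noisier dialectal one.
standard = wer("please confirm my account balance", "please confirm my account balance")
dialectal = wer("abeg confirm my account balance", "a beg confirm my count balance")
print(standard, dialectal, dialectal - standard)
# 0.0 0.6 0.6
```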

Extending the toolkit

Register a custom metric:

from naijaeval import register_metric
from naijaeval.metrics.base import BaseMetric, MetricResult

@register_metric("my_custom_score")
class MyCustomScore(BaseMetric):
    name = "my_custom_score"
    description = "My domain-specific evaluation metric."
    higher_is_better = True

    def compute(self, predictions, references, **kwargs):
        score = ...  # your implementation
        return MetricResult(name=self.name, score=score)

Register a custom dataset:

from naijaeval import register_dataset

@register_dataset("my_corpus")
def load_my_corpus(split="test", **kwargs):
    # Return an iterable of {"source": ..., "target": ...} dicts
    ...
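
One possible body for such a loader is a generator over a two-column TSV file. The sketch below is self-contained and leaves out the `register_dataset` decorator so it runs without the package; the file layout (source in column one, target in column two) and the function name `iter_tsv_pairs` are assumptions.

```python
import csv
import tempfile
from pathlib import Path
from typing import Iterator


def iter_tsv_pairs(path: str) -> Iterator[dict[str, str]]:
    """Yield {"source": ..., "target": ...} dicts from a two-column TSV file."""
    with Path(path).open(encoding="utf-8", newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            yield {"source": row[0], "target": row[1]}


# Write a tiny sample file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "test.tsv"
    sample.write_text("Good morning\tẸ káàárọ̀\nThank you\tẸ ṣé\n", encoding="utf-8")
    pairs = list(iter_tsv_pairs(str(sample)))

print(len(pairs), pairs[0]["source"])
# 2 Good morning
```

Yielding rows lazily keeps memory use flat for large corpora; wrap the call in `list(...)` if a loader's consumer needs random access.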

See docs/contributing/adding_metrics.md for the full contribution guide.


Roadmap

v0.1 (current)

  • 10 core metrics across 4 categories
  • 2 benchmarks (naija_mt_v1, coswitch_asr_v1)
  • 5 dataset loaders (MENYO-20k, FLEURS ×3, sample)
  • CLI and HTML reports
  • Plugin system

v0.2 (planned)

  • COMET and BERTScore integration
  • NLI-based hallucination detection (upgrade from heuristic)
  • Conversational AI task
  • Swahili and Igbo translation benchmarks
  • Interactive Colab notebook

v0.3 (planned)

  • Leaderboard integration
  • AfricaNLP workshop benchmark track

Citation

If you use NaijaEval in your research, please cite:

@software{buzugbe2026naijaeval,
  author    = {Buzugbe, Uche},
  title     = {{NaijaEval}: Evaluation toolkit for AI systems in African language contexts},
  year      = {2026},
  url       = {https://github.com/Uchebuzz/naijaeval},
  version   = {0.1.0},
}

Contributing

Contributions are welcome. See CONTRIBUTING.md for how to add metrics, datasets, and benchmarks.

The quickest ways to make a meaningful contribution are to:

  1. Add a new metric (see naijaeval/metrics/ for examples)
  2. Add a dataset loader for an underrepresented African language
  3. Run your own models against existing benchmarks and submit results


License

Apache 2.0 — see LICENSE.

Because good models deserve honest benchmarks.
