A robust Python toolkit for low-resource African language pre-processing, emotion labels, evaluation, and routing.

Low-Resource NLP Toolkit

A public, research-facing Python toolkit for African language pre-processing, emotion-label mapping, evaluation, and language/dialect routing.

The project is designed as a safe, open-source base for the NLP engineering problems that recur in low-resource and multilingual AI research: noisy text, code-switching, uneven label taxonomies, small datasets, and the need for transparent evaluation.

Status: 0.2.0 release. Local checks, CI, isolated wheel builds, metadata checks, and install tests pass.

Why This Exists

Low-resource NLP projects often spend too much time rebuilding the same foundations before modelling begins. This toolkit provides a dependable base layer:

  • Text normalisation for noisy social, conversational, and cultural text.
  • Lightweight African language routing for Yoruba, Igbo, Hausa, Nigerian Pidgin, Swahili, and English.
  • Evidence-first code-switch audits that expose token routes, spans, and abstentions.
  • Emotion label harmonisation across categorical and valence-arousal formats.
  • Evaluation utilities for classification and routing experiments.
  • A CLI and examples that run without downloading model weights.
  • Extension points for transformer or embedding backends when a project needs heavier models.

Architecture

flowchart LR
    A["Raw multilingual text"] --> B["Normaliser"]
    B --> C["Tokeniser"]
    C --> D["Language router"]
    C --> E["Emotion label mapper"]
    D --> F["Route decision + confidence"]
    E --> G["Canonical emotion / valence-arousal"]
    F --> H["Evaluation reports"]
    G --> H
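
The left-hand path of the flowchart (normaliser, tokeniser, language router) can be sketched with the standard library alone. The block below is a minimal illustration under stated assumptions, not the toolkit's implementation: the function names, the `<url>` and `<user>` placeholders, the language codes, and the tiny `TOY_LEXICONS` word lists are all hypothetical, and confidence is simply the share of word tokens that match the winning lexicon.

```python
import re
import string
import unicodedata
from collections import Counter


def _nfc(words):
    # Store lexicon entries in NFC so they compare equal to normalised input.
    return {unicodedata.normalize("NFC", w) for w in words}


# Hypothetical toy lexicons, for illustration only. A real router needs
# curated, speaker-reviewed word lists.
TOY_LEXICONS = {
    "pcm": _nfc({"abeg", "una", "wetin", "dey"}),
    "yo": _nfc({"káàrọ̀", "báwo", "ni"}),
    "en": _nfc({"the", "is", "this", "check"}),
}

PLACEHOLDERS = {"<url>", "<user>"}
URL_RE = re.compile(r"https?://\S+")
USER_RE = re.compile(r"@\w+")
REPEAT_RE = re.compile(r"(.)\1{2,}")


def normalise(text):
    """NFC-normalise, mask URLs and mentions, cap character runs at two."""
    text = unicodedata.normalize("NFC", text)
    text = URL_RE.sub("<url>", text)
    text = USER_RE.sub("<user>", text)
    text = REPEAT_RE.sub(r"\1\1", text)
    return text.lower().strip()


def tokenise(text):
    """Split on whitespace and strip ASCII punctuation from each token.

    Whitespace splitting keeps combining tone marks attached to their word;
    a plain \\w+ pattern would drop them.
    """
    tokens = []
    for raw in text.split():
        token = raw if raw in PLACEHOLDERS else raw.strip(string.punctuation)
        if token:
            tokens.append(token)
    return tokens


def route(tokens):
    """Return (language_code, confidence), or (None, 0.0) to abstain."""
    words = [t for t in tokens if t not in PLACEHOLDERS]
    votes = Counter()
    for word in words:
        for code, lexicon in TOY_LEXICONS.items():
            if word in lexicon:
                votes[code] += 1
    if not votes:
        return None, 0.0
    code, hits = votes.most_common(1)[0]
    return code, hits / len(words)


print(route(tokenise(normalise("Ẹ káàrọ̀, báwo ni?"))))
```

Returning `None` when no lexicon matches keeps the abstention explicit instead of forcing a default language onto unfamiliar text.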

Quick Start

python3 -m venv .venv
source .venv/bin/activate
python -m pip install low-resource-nlp-toolkit
low-resource-nlp --version

Route a text sample:

low-resource-nlp route "abeg make una help me check this model output"

Audit code-switched language evidence:

low-resource-nlp audit "abeg make una check this model output"

Normalise text:

low-resource-nlp normalise "Ẹ káàrọ̀!!! Visit https://example.com @user"

Map an emotion label:

low-resource-nlp label joy

Run tests (from a clone of the source repository):

make check

Without make:

python3 scripts/quality_gate.py
PYTHONPATH=src python3 -m unittest discover -s tests

Python Usage

from low_resource_nlp import (
    LexicalLanguageRouter,
    audit_code_switching,
    label_to_valence_arousal,
    normalise_text,
)

text = normalise_text("Ẹ káàrọ̀, báwo ni?")
decision = LexicalLanguageRouter.default().route(text)
audit = audit_code_switching("abeg make una check this model output")
emotion = label_to_valence_arousal("joy")

print(decision.language_code, decision.confidence)
print(audit.language_mix, audit.warnings)
print(emotion)
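
To make the idea of an evidence-first audit concrete, here is a minimal, self-contained sketch of token-level attribution with character spans and an abstention warning. It is an assumption-laden illustration rather than the behaviour of `audit_code_switching`: the class names, the `abstain_above` threshold, and the toy lexicons are hypothetical, and only the `language_mix` and `warnings` field names echo the usage above.

```python
import re
import string
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical toy lexicons. "make" sits in both on purpose, to show how an
# ambiguous token leads to abstention instead of a guess.
TOY_LEXICONS = {
    "pcm": {"abeg", "una", "wetin", "dey", "make"},
    "en": {"check", "this", "output", "model", "make"},
}

TOKEN_RE = re.compile(r"\S+")


@dataclass
class TokenEvidence:
    token: str
    start: int                # character offset where the token begins
    end: int                  # character offset just past the token
    language: Optional[str]   # None means the audit abstained


@dataclass
class AuditSketch:
    evidence: List[TokenEvidence]
    language_mix: Dict[str, int]
    warnings: List[str]


def audit_sketch(text, abstain_above=0.5):
    """Attribute each token to at most one language and record its span."""
    evidence = []
    for match in TOKEN_RE.finditer(text):
        word = match.group().lower().strip(string.punctuation)
        candidates = [c for c, lex in TOY_LEXICONS.items() if word in lex]
        # Attribute only when exactly one lexicon claims the word.
        language = candidates[0] if len(candidates) == 1 else None
        evidence.append(
            TokenEvidence(match.group(), match.start(), match.end(), language)
        )

    mix = dict(Counter(e.language for e in evidence if e.language))
    unattributed = sum(1 for e in evidence if e.language is None)
    warnings = []
    if evidence and unattributed / len(evidence) > abstain_above:
        warnings.append("most tokens are unattributed; treat the mix as unreliable")
    return AuditSketch(evidence, mix, warnings)


result = audit_sketch("abeg make una check this output")
for item in result.evidence:
    print(item.start, item.end, item.token, item.language)
print(result.language_mix, result.warnings)
```

Keeping the per-token spans alongside the aggregate mix lets a reviewer trace each count back to the exact characters that produced it.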

Current Scope

The public release deliberately avoids bundling private datasets or model weights. The core is deterministic, inspectable, and dependency-light. Optional embedding and transformer backends are outside the current core package.

Supported core modules:

  • normalisation: Unicode-aware text cleaning, URL and user-mention normalisation, tokenisation, repeated-character handling.
  • routing: script-aware and lexicon-assisted language routing.
  • audit: token-level code-switch audits with spans, evidence, and abstention warnings.
  • labels: canonical emotion labels and valence-arousal mapping.
  • evaluation: precision, recall, F1, macro/micro summaries, and confusion matrices.
  • datasets: simple CSV/JSONL readers for experiment scaffolding.
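
For readers who want to see what the evaluation quantities listed above mean in practice, the following sketch computes a confusion matrix, per-class precision, recall, and F1, and macro and micro F1 for single-label predictions. The function names and sample labels are hypothetical and do not describe the toolkit's evaluation API; undefined ratios are reported as 0.0 here, which is one common convention among several.

```python
from collections import Counter


def confusion_matrix(gold, pred):
    """Count (gold_label, predicted_label) pairs."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    return Counter(zip(gold, pred))


def per_class_scores(gold, pred):
    """Precision, recall, and F1 for every label seen in gold or pred."""
    matrix = confusion_matrix(gold, pred)
    labels = sorted(set(gold) | set(pred))
    scores = {}
    for label in labels:
        tp = matrix[(label, label)]
        fp = sum(matrix[(g, label)] for g in labels if g != label)
        fn = sum(matrix[(label, p)] for p in labels if p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    scores = per_class_scores(gold, pred)
    return sum(s["f1"] for s in scores.values()) / len(scores) if scores else 0.0


def micro_f1(gold, pred):
    """Pooled F1; with one label per item this equals accuracy."""
    matrix = confusion_matrix(gold, pred)
    correct = sum(n for (g, p), n in matrix.items() if g == p)
    return correct / len(gold) if gold else 0.0


gold = ["yo", "yo", "ha", "ha", "pcm", "pcm"]
pred = ["yo", "ha", "ha", "ha", "pcm", "en"]
print(per_class_scores(gold, pred))
print(macro_f1(gold, pred), micro_f1(gold, pred))
```

Reporting macro and micro figures side by side matters for small, imbalanced datasets, where a strong micro score can hide a class that is never predicted correctly.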

Responsible AI Notes

This toolkit is for research and prototyping. Language, dialect, and emotion labels are socially and culturally sensitive. Do not treat routing or emotion predictions as identity labels, clinical assessments, or ground truth. Always evaluate with speakers, domain experts, and context-specific data.

External Use

External use signals should be public and verifiable: issues from real users, pull requests, tutorial use, workshop demos, citations, package downloads, or adoption by a lab/community project. Self-generated activity should not be counted as impact.
