A robust Python toolkit for low-resource African language pre-processing, emotion labels, evaluation, and routing.
Low-Resource NLP Toolkit
A public, research-facing Python toolkit for African language pre-processing, emotion-label mapping, evaluation, and language/dialect routing.
The project is designed as a safe, open-source starting point for the NLP engineering problems that recur in low-resource and multilingual AI research: noisy text, code-switching, uneven label taxonomies, small datasets, and evaluation that must be transparent.
Status: 0.2.0 release. Local checks, CI, isolated wheel builds, metadata checks, and install tests pass.
Why This Exists
Low-resource NLP projects often spend too much time rebuilding the same foundations before modelling begins. This toolkit provides a dependable base layer:
- Text normalisation for noisy social, conversational, and cultural text.
- Lightweight African language routing for Yoruba, Igbo, Hausa, Nigerian Pidgin, Swahili, and English.
- Evidence-first code-switch audits that expose token routes, spans, and abstentions.
- Emotion label harmonisation across categorical and valence-arousal formats.
- Evaluation utilities for classification and routing experiments.
- A CLI and examples that run without downloading model weights.
- Extension points for transformer or embedding backends when a project needs heavier models.
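As a rough illustration of the first bullet, the sketch below shows one way a normalisation step could work using only the standard library. It is not the toolkit's `normalise_text`: the function name `sketch_normalise`, the `<url>` and `<user>` placeholders, and the rule of collapsing repeats down to two characters are all assumptions made for this example.

```python
import re
import unicodedata

# Hypothetical patterns for this sketch; the toolkit's own rules may differ.
URL_RE = re.compile(r"https?://\S+")
USER_RE = re.compile(r"(?<!\w)@\w+")      # @handle not preceded by a word character
REPEAT_RE = re.compile(r"(.)\1{2,}")      # three or more of the same character
SPACE_RE = re.compile(r"\s+")


def sketch_normalise(text: str) -> str:
    """Illustrative normaliser (assumption, not the toolkit's implementation)."""
    # NFC composes a base letter and combining mark where a precomposed form exists,
    # so visually identical strings compare equal more often.
    text = unicodedata.normalize("NFC", text)
    # Replace URLs and user mentions with fixed placeholders.
    text = URL_RE.sub("<url>", text)
    text = USER_RE.sub("<user>", text)
    # Collapse runs of three or more identical characters down to two.
    text = REPEAT_RE.sub(r"\1\1", text)
    # Squeeze whitespace and trim the ends.
    return SPACE_RE.sub(" ", text).strip()


if __name__ == "__main__":
    print(sketch_normalise("soooo goood!!!   see https://example.com @user"))
```

Using NFC rather than stripping diacritics keeps tone marks and underdots intact, which matters for languages such as Yoruba where they distinguish words.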
Architecture
flowchart LR
A["Raw multilingual text"] --> B["Normaliser"]
B --> C["Tokeniser"]
C --> D["Language router"]
C --> E["Emotion label mapper"]
D --> F["Route decision + confidence"]
E --> G["Canonical emotion / valence-arousal"]
F --> H["Evaluation reports"]
G --> H
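The router stage in the diagram produces a route decision with a confidence. A minimal sketch of lexicon-assisted routing with abstention is shown below. Everything in it is hypothetical: the three-language `TOY_LEXICON`, the `SketchDecision` type, the `sketch_route` function, and the 0.6 threshold are stand-ins chosen for illustration, not the behaviour of `LexicalLanguageRouter`.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# Toy lexicon for illustration only; a real router needs far larger, vetted word lists.
TOY_LEXICON = {
    "pcm": {"abeg", "una", "wetin", "dey"},
    "sw": {"habari", "asante", "sana", "karibu"},
    "en": {"the", "this", "is", "check"},
}


@dataclass(frozen=True)
class SketchDecision:
    language_code: Optional[str]  # None means the router abstained
    confidence: float


def sketch_route(text: str, min_confidence: float = 0.6) -> SketchDecision:
    """Pick the language with the most lexicon hits, or abstain (assumed logic)."""
    hits = Counter()
    for token in text.lower().split():
        for code, words in TOY_LEXICON.items():
            if token in words:
                hits[code] += 1
    if not hits:
        # No evidence at all: abstain rather than guess.
        return SketchDecision(None, 0.0)
    code, count = hits.most_common(1)[0]
    # Confidence is the winning language's share of all lexicon hits.
    confidence = count / sum(hits.values())
    if confidence < min_confidence:
        return SketchDecision(None, confidence)
    return SketchDecision(code, confidence)


if __name__ == "__main__":
    print(sketch_route("abeg wetin una dey check"))
    print(sketch_route("hello world"))
```

Returning `None` instead of a best guess keeps weak evidence visible to downstream evaluation, in the spirit of the abstention warnings described for the audit module.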
Quick Start
python3 -m venv .venv
source .venv/bin/activate
python -m pip install low-resource-nlp-toolkit
low-resource-nlp --version
Route a text sample:
low-resource-nlp route "abeg make una help me check this model output"
Audit code-switched language evidence:
low-resource-nlp audit "abeg make una check this model output"
Normalise text:
low-resource-nlp normalise "Ẹ káàrọ̀!!! Visit https://example.com @user"
Map an emotion label:
low-resource-nlp label joy
Run tests:
make check
Without make:
python3 scripts/quality_gate.py
PYTHONPATH=src python3 -m unittest discover -s tests
Python Usage
from low_resource_nlp import (
    LexicalLanguageRouter,
    audit_code_switching,
    label_to_valence_arousal,
    normalise_text,
)

# Normalise the text, then route the cleaned string with the default router.
text = normalise_text("Ẹ káàrọ̀, báwo ni?")
decision = LexicalLanguageRouter.default().route(text)

# Audit a code-switched sentence and map an emotion label.
audit = audit_code_switching("abeg make una check this model output")
emotion = label_to_valence_arousal("joy")

print(decision.language_code, decision.confidence)
print(audit.language_mix, audit.warnings)
print(emotion)
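To make the idea of harmonising categorical labels with a valence-arousal format more concrete, here is a small standalone sketch. It does not reproduce `label_to_valence_arousal`: the alias table, the canonical label set, and especially the coordinates in `ILLUSTRATIVE_VA` are invented placeholders on an assumed [-1, 1] scale and should not be read as the toolkit's values or as empirical measurements.

```python
from typing import Optional, Tuple

# Hypothetical aliases mapping dataset-specific labels to canonical ones.
ALIASES = {"happy": "joy", "happiness": "joy", "angry": "anger", "sad": "sadness"}

# Placeholder (valence, arousal) pairs on an assumed [-1, 1] scale; illustrative only.
ILLUSTRATIVE_VA = {
    "joy": (0.8, 0.5),
    "anger": (-0.6, 0.7),
    "sadness": (-0.7, -0.4),
}


def sketch_canonical_label(label: str) -> Optional[str]:
    """Return the canonical label, or None when the label is unknown."""
    key = label.strip().lower()
    key = ALIASES.get(key, key)
    return key if key in ILLUSTRATIVE_VA else None


def sketch_valence_arousal(label: str) -> Optional[Tuple[float, float]]:
    """Map a categorical label to a (valence, arousal) pair, or None if unmapped."""
    canonical = sketch_canonical_label(label)
    return None if canonical is None else ILLUSTRATIVE_VA[canonical]


if __name__ == "__main__":
    print(sketch_canonical_label("ANGRY"))
    print(sketch_valence_arousal(" Happy "))
    print(sketch_valence_arousal("bored"))
```

Unknown labels return `None` here so that gaps between taxonomies surface during analysis instead of being silently mapped to a nearby emotion.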
Current Scope
The first public release deliberately avoids bundling private datasets or model weights. The core is deterministic, inspectable, and dependency-light. Optional embedding and transformer backends are outside the current core package.
Supported core modules:
- normalisation: Unicode-aware text cleaning, URL/user normalisation, tokenisation, repeated-character handling.
- routing: script-aware and lexicon-assisted language routing.
- audit: token-level code-switch audits with spans, evidence, and abstention warnings.
- labels: canonical emotion labels and valence-arousal mapping.
- evaluation: precision, recall, F1, macro/micro summaries, and confusion matrices.
- datasets: simple CSV/JSONL readers for experiment scaffolding.
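For readers unfamiliar with the macro/micro distinction mentioned for the evaluation module, the following sketch computes per-class precision, recall, and F1 from a confusion count, then both summaries. The function name `sketch_prf` and the returned dictionary layout are assumptions for this example and do not describe the toolkit's evaluation API.

```python
from collections import Counter
from typing import Dict, Sequence


def sketch_prf(gold: Sequence[str], predicted: Sequence[str]) -> Dict[str, object]:
    """Per-class precision/recall/F1 plus macro and micro F1 (illustrative)."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal in length")
    # Sparse confusion matrix: (gold label, predicted label) -> count.
    confusion = Counter(zip(gold, predicted))
    labels = sorted(set(gold) | set(predicted))
    per_class = {}
    total_tp = total_fp = total_fn = 0
    for label in labels:
        tp = confusion[(label, label)]
        fp = sum(n for (g, p), n in confusion.items() if p == label and g != label)
        fn = sum(n for (g, p), n in confusion.items() if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        per_class[label] = (precision, recall, f1)
        total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn
    # Macro: unweighted mean of per-class F1, so rare classes count equally.
    macro_f1 = sum(f1 for _, _, f1 in per_class.values()) / len(per_class)
    # Micro: pool the counts first, so frequent classes dominate.
    micro_p = total_tp / (total_tp + total_fp) if total_tp + total_fp else 0.0
    micro_r = total_tp / (total_tp + total_fn) if total_tp + total_fn else 0.0
    micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r) if micro_p + micro_r else 0.0
    return {"per_class": per_class, "macro_f1": macro_f1, "micro_f1": micro_f1}


if __name__ == "__main__":
    report = sketch_prf(["pcm", "pcm", "en", "sw"], ["pcm", "en", "en", "sw"])
    print(report["macro_f1"], report["micro_f1"])
```

On small, imbalanced datasets it is usually worth reporting both summaries, since macro F1 exposes weak performance on rare classes that micro F1 can hide.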
Public Project Materials
- Changelog
- Contributing guide
- Documentation index
- Novelty review
- 0.1.0 release plan
- Adoption notes
- Model card template
- Data statement template
- Citation metadata
Responsible AI Notes
This toolkit is for research and prototyping. Language, dialect, and emotion labels are socially and culturally sensitive. Do not treat routing or emotion predictions as identity labels, clinical assessments, or ground truth. Always evaluate with speakers, domain experts, and context-specific data.
External Use
External use signals should be public and verifiable: issues from real users, pull requests, tutorial use, workshop demos, citations, package downloads, or adoption by a lab/community project. Self-generated activity should not be counted as impact.
Download files
File details
Details for the file low_resource_nlp_toolkit-0.2.0.tar.gz.
File metadata
- Download URL: low_resource_nlp_toolkit-0.2.0.tar.gz
- Upload date:
- Size: 19.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4ef0273baad864de16ba97fb893f45e6cb1e0b3a9964549a7606b49c68c1de06 |
| MD5 | e93530b2f2d65725651391f1687f92b1 |
| BLAKE2b-256 | 26d6ada623b27b31a5811504fddd8a1a29186dc6a616518742518c7a0acbe25a |
File details
Details for the file low_resource_nlp_toolkit-0.2.0-py3-none-any.whl.
File metadata
- Download URL: low_resource_nlp_toolkit-0.2.0-py3-none-any.whl
- Upload date:
- Size: 17.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 786f0f95f5001d52547700afb80c5e1a8813f971294fac0481ebf786277fa1e9 |
| MD5 | ba9d11079b8b4d67bbf208c7f0738061 |
| BLAKE2b-256 | 23603744baefd7b60ec8157785d5192fb8d3a01a633d9fbf16045dc22a3fb547 |