A robust Python toolkit for low-resource African language pre-processing, emotion labels, evaluation, and routing.

Low-Resource NLP Toolkit

A public, research-facing Python toolkit for African language pre-processing, emotion-label mapping, evaluation, and language/dialect routing.

The project is designed as a safe, open-source toolkit for the kinds of NLP engineering problems that appear in low-resource and multilingual AI research: noisy text, code-switching, uneven label taxonomies, small datasets, and the need for transparent evaluation.

Status: 0.1.0 seed release from source. Local checks, CI, isolated wheel builds, metadata checks, and install tests pass.

Why This Exists

Low-resource NLP projects often spend too much time rebuilding the same foundations before modelling begins. This toolkit provides a dependable base layer:

  • Text normalisation for noisy social, conversational, and cultural text.
  • Lightweight African language routing for Yoruba, Igbo, Hausa, Nigerian Pidgin, Swahili, and English.
  • Emotion label harmonisation across categorical and valence-arousal formats.
  • Evaluation utilities for classification and routing experiments.
  • A CLI and examples that run without downloading model weights.
  • Extension points for transformer or embedding backends when a project needs heavier models.
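To make the first item concrete, here is a minimal sketch of what normalising noisy social text can involve. It is not the toolkit's implementation: the function name `sketch_normalise`, the `<url>` and `<user>` placeholders, and the rule of collapsing runs of three or more characters down to two are all assumptions chosen for illustration.

```python
import re
import unicodedata

# Illustrative patterns only; the toolkit's own rules may differ.
URL_RE = re.compile(r"https?://\S+")
USER_RE = re.compile(r"@\w+")
REPEAT_RE = re.compile(r"(.)\1{2,}")  # three or more of the same character


def sketch_normalise(text: str) -> str:
    """Hypothetical normaliser: NFC, placeholders, repeat collapsing."""
    # Compose base letters and combining marks where a composed form exists,
    # so visually identical diacritics compare equal.
    text = unicodedata.normalize("NFC", text)
    # Replace URLs and @mentions with placeholder tokens (assumed names).
    text = URL_RE.sub("<url>", text)
    text = USER_RE.sub("<user>", text)
    # Collapse long character runs ("sooooo", "!!!") down to two characters.
    text = REPEAT_RE.sub(r"\1\1", text)
    # Squeeze any remaining whitespace to single spaces.
    return " ".join(text.split())


print(sketch_normalise("Soooo   good!!! Visit https://example.com @user"))
```

Keeping two characters of a run rather than one is a design choice: it preserves the signal that the writer was emphasising something, which can matter for emotion work.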

Architecture

flowchart LR
    A["Raw multilingual text"] --> B["Normaliser"]
    B --> C["Tokeniser"]
    C --> D["Language router"]
    C --> E["Emotion label mapper"]
    D --> F["Route decision + confidence"]
    E --> G["Canonical emotion / valence-arousal"]
    F --> H["Evaluation reports"]
    G --> H
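The router stage in the diagram can be pictured with a small lexicon-overlap sketch. Everything here is hypothetical: the tiny word lists, the language codes, the `sketch_route` function, and the confidence formula are assumptions for illustration, and the sketch ignores the script-aware part of routing. Only the `language_code` and `confidence` field names are borrowed from the Python usage example in this README.

```python
from dataclasses import dataclass

# Tiny illustrative lexicons. Real lexicons would be far larger and
# curated with speakers; these codes and words are assumptions.
LEXICONS = {
    "pcm": {"abeg", "una", "wetin", "dey"},
    "sw": {"habari", "asante", "sana", "karibu"},
    "ha": {"sannu", "nagode"},
    "en": {"the", "and", "is", "please"},
}


@dataclass(frozen=True)
class RouteDecision:
    language_code: str
    confidence: float


def sketch_route(text: str, fallback: str = "und") -> RouteDecision:
    """Hypothetical router: pick the lexicon with the most token hits."""
    tokens = text.lower().split()
    hits = {
        code: sum(token in words for token in tokens)
        for code, words in LEXICONS.items()
    }
    total = sum(hits.values())
    if total == 0:
        # No lexicon evidence at all: return the fallback with zero confidence.
        return RouteDecision(fallback, 0.0)
    # Ties resolve to the first language in LEXICONS order.
    best = max(hits, key=hits.get)
    # Confidence here is the winning share of all lexicon hits.
    return RouteDecision(best, hits[best] / total)


decision = sketch_route("abeg make una help me check this model output")
print(decision.language_code, decision.confidence)
```

A share-of-hits confidence is easy to inspect but says nothing about how much of the text was covered, so a real router would likely also report coverage or abstain on short inputs.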

Quick Start

python3 -m venv .venv
source .venv/bin/activate
python -m pip install -e .
low-resource-nlp --version

Route a text sample:

low-resource-nlp route "abeg make una help me check this model output"

Normalise text:

low-resource-nlp normalise "Ẹ káàrọ̀!!! Visit https://example.com @user"

Map an emotion label:

low-resource-nlp label joy

Run tests:

make check

Without make:

python3 scripts/quality_gate.py
PYTHONPATH=src python3 -m unittest discover -s tests

Python Usage

from low_resource_nlp import LexicalLanguageRouter, normalise_text, label_to_valence_arousal

text = normalise_text("Ẹ káàrọ̀, báwo ni?")
decision = LexicalLanguageRouter.default().route(text)
emotion = label_to_valence_arousal("joy")

print(decision.language_code, decision.confidence)
print(emotion)
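For readers unfamiliar with valence-arousal mapping, the idea can be sketched as a lookup from a canonical label to a pair of coordinates. This is not the toolkit's table: the function names, the alias list, and especially the numeric values below are placeholder assumptions, not empirical norms.

```python
# Placeholder coordinates in [-1, 1] as (valence, arousal).
# These numbers are illustrative assumptions, not measured values.
VALENCE_AROUSAL = {
    "joy": (0.8, 0.6),
    "anger": (-0.7, 0.8),
    "sadness": (-0.7, -0.4),
    "fear": (-0.6, 0.7),
    "neutral": (0.0, 0.0),
}

# Hypothetical aliases that harmonise labels from different taxonomies.
ALIASES = {"happy": "joy", "happiness": "joy", "sad": "sadness", "angry": "anger"}


def sketch_canonical_label(label: str) -> str:
    """Map a raw dataset label to a canonical one, or raise KeyError."""
    key = label.strip().lower()
    key = ALIASES.get(key, key)
    if key not in VALENCE_AROUSAL:
        raise KeyError(f"unknown emotion label: {label!r}")
    return key


def sketch_label_to_valence_arousal(label: str) -> tuple[float, float]:
    """Return the (valence, arousal) pair for a raw or canonical label."""
    return VALENCE_AROUSAL[sketch_canonical_label(label)]


print(sketch_label_to_valence_arousal(" Happy "))
```

Raising on unknown labels, rather than silently mapping them to neutral, keeps taxonomy mismatches visible during dataset harmonisation.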

Current Scope

The first public release deliberately avoids bundling private datasets or model weights. The core is deterministic, inspectable, and dependency-light. Optional embedding and transformer backends are outside the current core package.

Supported core modules:

  • normalisation: Unicode-aware text cleaning, URL and user-mention normalisation, tokenisation, repeated-character handling.
  • routing: script-aware and lexicon-assisted language routing.
  • labels: canonical emotion labels and valence-arousal mapping.
  • evaluation: precision, recall, F1, macro/micro summaries, and confusion matrices.
  • datasets: simple CSV/JSONL readers for experiment scaffolding.
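The metrics named under evaluation can be computed with nothing beyond the standard library. The sketch below shows the standard definitions of per-label precision, recall, and F1 with macro and micro summaries, derived from a confusion matrix. The function names and the returned dictionary shape are assumptions and do not describe the toolkit's API.

```python
from collections import Counter


def sketch_confusion(gold, pred):
    """Count (gold, predicted) pairs; this is a sparse confusion matrix."""
    return Counter(zip(gold, pred))


def sketch_prf(gold, pred):
    """Hypothetical report: per-label P/R/F1 plus macro and micro F1."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and equal length")
    conf = sketch_confusion(gold, pred)
    labels = sorted(set(gold) | set(pred))
    per_label = {}
    tp_sum = fp_sum = fn_sum = 0
    for label in labels:
        tp = conf[(label, label)]
        fp = sum(n for (g, p), n in conf.items() if p == label and g != label)
        fn = sum(n for (g, p), n in conf.items() if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        per_label[label] = (precision, recall, f1)
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    # Macro: unweighted mean of per-label F1, so rare labels count equally.
    macro_f1 = sum(f1 for _, _, f1 in per_label.values()) / len(labels)
    # Micro: pool the counts across labels before computing the scores.
    micro_p = tp_sum / (tp_sum + fp_sum) if tp_sum + fp_sum else 0.0
    micro_r = tp_sum / (tp_sum + fn_sum) if tp_sum + fn_sum else 0.0
    micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r) if micro_p + micro_r else 0.0
    return {"per_label": per_label, "macro_f1": macro_f1, "micro_f1": micro_f1}


report = sketch_prf(["yo", "yo", "ha", "sw"], ["yo", "ha", "ha", "sw"])
print(report["macro_f1"], report["micro_f1"])
```

Reporting macro and micro scores side by side is useful on small, imbalanced datasets, where a single average can hide poor performance on the rarest language or label.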

Responsible AI Notes

This toolkit is for research and prototyping. Language, dialect, and emotion labels are socially and culturally sensitive. Do not treat routing or emotion predictions as identity labels, clinical assessments, or ground truth. Always evaluate with speakers, domain experts, and context-specific data.

External Use

External use signals should be public and verifiable: issues from real users, pull requests, tutorial use, workshop demos, citations, package downloads, or adoption by a lab/community project. Self-generated activity should not be counted as impact.
