Open-source ML tools, libraries, and notebooks for the Nigerian ML ecosystem
NaijaML
Sovereign ML infrastructure for Nigeria.
Production-ready NLP tools for Yoruba, Hausa, Igbo, and Nigerian Pidgin.
Works on CPU. Works offline. No GPU required.
Standard NLP tools don't work for Nigeria. Tokenizers strip Yoruba diacritics. NER models don't recognize Nigerian names or states. Sentiment tools think Pidgin is broken English. Preprocessing libraries flag "sha" and "sef" as misspellings.
NaijaML is an open-source Python library that fixes this — built for the real constraints of developing ML in Nigeria: limited compute, intermittent connectivity, expensive bandwidth, and 500+ languages that the global ML ecosystem ignores.
pip install naijaml
Quick Start
Yoruba Diacritizer
from naijaml.nlp import diacritize_yoruba, diacritize_yoruba_dot_below
diacritize_yoruba_dot_below("Ojo lo si oja")
# → 'Ọjọ lo si ọja' (dot-below only, no tones)
diacritize_yoruba("Ojo lo si oja lana")
# → 'Ọjọ́ ló sí ọjà lànà' (full tonal restoration)
# Dot-below: 97.5% accuracy | 6.4MB bundled
# Full tonal: 90.0% accuracy | 12.6MB auto-downloaded on first use
Igbo Diacritizer
from naijaml.nlp import diacritize_igbo
diacritize_igbo("Kedu ka i mere")
# → 'Kedụ ka ị mere'
# 95.2% accuracy | 4.9MB model | CPU only
Language Detection
from naijaml.nlp import detect_language
detect_language("Bawo ni, se daadaa ni?") # → 'yor'
detect_language("Ina kwana?") # → 'hau'
detect_language("Kedu ka ị mere?") # → 'ibo'
detect_language("How far, wetin dey happen?") # → 'pcm'
# 5 languages: Yoruba, Hausa, Igbo, Pidgin, English | 96.6% accuracy
Sentiment Analysis
from naijaml.nlp import analyze_sentiment
analyze_sentiment("This film too sweet!")
# → {'label': 'positive', 'confidence': 0.64, ...}
analyze_sentiment("I no like am at all")
# → {'label': 'negative', 'confidence': 0.54, ...}
analyze_sentiment("Wannan fim din yana da kyau") # Hausa
# → {'label': 'positive', 'confidence': 0.81, ...}
# Works across Yoruba, Hausa, Igbo, and Pidgin
Load Nigerian Datasets
from naijaml.data import load_dataset
# NaijaSenti — Sentiment in 4 Nigerian languages
data = load_dataset("naijasenti", lang="yor", split="train")
# → 8,522 Yoruba samples, 14,172 Hausa, 10,192 Igbo, 5,121 Pidgin
# MasakhaNER — Named Entity Recognition
ner_data = load_dataset("masakhaner", lang="hau", split="train")
# → Tags: PER, ORG, LOC, DATE
# MasakhaNEWS — News Classification
news = load_dataset("masakhanews", lang="pcm", split="train")
# → Categories: business, entertainment, health, politics, sports, technology
# 7 datasets total | Downloads once, cached offline
Text Preprocessing
from naijaml.nlp import mask_pii, is_pidgin_particle
# Mask Nigerian PII patterns
mask_pii("Call me on 08012345678 or email me@example.com")
# → 'Call me on [PHONE] or [EMAIL]'
# Detects: +234 numbers, 080x/070x/090x, BVN, NIN, emails
# Pidgin-aware — preserves particles other tools strip
is_pidgin_particle("sha") # → True
is_pidgin_particle("sef") # → True
is_pidgin_particle("abeg") # → True
Nigerian Constants
from naijaml.utils.constants import STATES, BANKS, format_naira, get_telco
STATES["Lagos"] # → 'Ikeja'
BANKS["Guaranty Trust Bank"] # → '058'
format_naira(1500000) # → '₦1,500,000.00'
get_telco("08031234567") # → 'MTN'
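For readers curious how helpers like these might work, here is a minimal sketch. The names `format_naira_sketch`, `get_telco_sketch`, and `normalize_number`, and the prefix table itself, are hypothetical: the table is a small, partial sample, and the library's real data and logic may differ.

```python
from typing import Optional

# Partial, illustrative prefix table (an assumption, not the library's data).
TELCO_PREFIXES = {
    "0803": "MTN", "0806": "MTN", "0703": "MTN",
    "0805": "Glo", "0807": "Glo", "0705": "Glo",
    "0802": "Airtel", "0808": "Airtel", "0708": "Airtel",
    "0809": "9mobile", "0817": "9mobile", "0818": "9mobile",
}


def format_naira_sketch(amount: float) -> str:
    """Format an amount with a Naira sign, thousands separators, and 2 decimals."""
    return f"₦{amount:,.2f}"


def normalize_number(number: str) -> str:
    """Keep digits only and rewrite a leading 234 country code as 0."""
    digits = "".join(ch for ch in number if ch.isdigit())
    if digits.startswith("234"):
        digits = "0" + digits[3:]
    return digits


def get_telco_sketch(number: str) -> Optional[str]:
    """Look up the operator by the first four digits; None if the prefix is unknown."""
    return TELCO_PREFIXES.get(normalize_number(number)[:4])


print(format_naira_sketch(1500000))      # → ₦1,500,000.00
print(get_telco_sketch("08031234567"))   # → MTN
```

A prefix identifies the operator that originally issued the number. With number portability, the current carrier can differ, so a lookup like this is best treated as a hint.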
Features
| Feature | Status | Accuracy | Model Size |
|---|---|---|---|
| Language Detection | ✅ | 96.6% | 29.6MB |
| Yoruba Diacritizer (full tonal) | ✅ | 90.0% word | 12.6MB |
| Yoruba Diacritizer (dot-below) | ✅ | 97.5% char | 6.4MB |
| Igbo Diacritizer | ✅ | 95.2% | 4.9MB |
| Sentiment Analysis | ✅ | 72% | 4.3MB |
| Dataset Loaders (7 datasets) | ✅ | — | — |
| Text Preprocessing & PII Masking | ✅ | — | — |
| Nigerian Constants (states, banks, telcos) | ✅ | — | — |
45MB bundled, 13MB downloaded on first use. Everything runs on CPU. No GPU required.
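The table reports word-level accuracy for the full tonal diacritizer and character-level accuracy for the dot-below one, so the two numbers are not directly comparable. The sketch below shows one way to compute both. It assumes prediction and reference have the same base letters (diacritization only adds marks), and the function names are hypothetical rather than part of NaijaML.

```python
import unicodedata
from typing import List


def _clusters(text: str) -> List[str]:
    """Split text into base characters with their combining marks attached."""
    out: List[str] = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and out:
            out[-1] += ch          # attach tone mark / dot-below to its base letter
        else:
            out.append(ch)
    return [c for c in out if not c.isspace()]


def word_accuracy(pred: str, gold: str) -> float:
    """Fraction of whitespace-separated words that match exactly."""
    p = unicodedata.normalize("NFD", pred).split()
    g = unicodedata.normalize("NFD", gold).split()
    if len(p) != len(g) or not g:
        raise ValueError("prediction and reference must have the same word count")
    return sum(a == b for a, b in zip(p, g)) / len(g)


def char_accuracy(pred: str, gold: str) -> float:
    """Fraction of letters (base + marks) that match exactly, ignoring spaces."""
    p, g = _clusters(pred), _clusters(gold)
    if len(p) != len(g) or not g:
        raise ValueError("prediction and reference must have the same letters")
    return sum(a == b for a, b in zip(p, g)) / len(g)
```

One wrong tone mark makes the whole word count as an error at word level but only one letter at character level, which is why word accuracy is the stricter of the two.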
Design Philosophy
CPU-first. Every feature works on a laptop with 4GB RAM. GPU makes things faster but is never required. 95% of African AI talent has no meaningful GPU access — NaijaML is built for them.
Offline-capable. Small models ship with the package; larger ones auto-download from HuggingFace on first use and cache locally. After first run, everything works without internet.
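The download-once, cache-afterwards pattern described above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not NaijaML's loader: `cached_model_path`, `http_fetch`, the cache directory, and the URL are all hypothetical.

```python
import urllib.request
from pathlib import Path
from typing import Callable


def http_fetch(url: str) -> bytes:
    """Download a URL and return its body (needs network access)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()


def cached_model_path(
    name: str,
    url: str,
    cache_dir: Path,
    fetch: Callable[[str], bytes] = http_fetch,
) -> Path:
    """Return the local path of a model file, downloading it only if it is missing."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / name
    if target.exists():
        return target                      # offline path: no network call at all
    data = fetch(url)
    tmp = target.with_name(target.name + ".part")
    tmp.write_bytes(data)
    tmp.replace(target)                    # rename last, so a failed write never leaves a partial file under the final name
    return target
```

Passing `fetch` as a parameter keeps the caching logic testable without a network connection, which matters when connectivity is intermittent.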
Minimal dependencies. Core package needs only numpy, requests, tqdm. We don't pull in PyTorch if we don't need it.
Honest metrics. We report real accuracy numbers, not cherry-picked results. The sentiment model is 72%, not 95%. The Yoruba diacritizer handles dot-below at 97.5% but full tonal is 90%. We tell you upfront.
Nigerian context. Examples use Nigerian names, cities, and data. PII masking handles Nigerian phone formats and national ID numbers. Currency is in Naira, not dollars.
Models
| Model | Size | Approach |
|---|---|---|
| Language Detection | 29.6MB | Naive Bayes + char n-grams (1-4) + language features |
| Yoruba Diacritizer (full) | 12.6MB | Word-level lookup + Viterbi decoding |
| Yoruba Diacritizer (dot-below) | 6.4MB | Syllable-based k-NN |
| Igbo Diacritizer | 4.9MB | Syllable-based k-NN |
| Sentiment Analysis | 4.3MB | TF-IDF + Logistic Regression |
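To make "word-level lookup + Viterbi decoding" concrete, here is a toy sketch of the general technique. Each undiacritized word is looked up to get candidate diacritized forms, and Viterbi then picks the sequence with the best combined candidate and word-to-word score. The lexicon entries, the probabilities, and the function name `viterbi_decode` are invented for illustration and say nothing about the shipped model's data or scoring.

```python
import math
from typing import Dict, List, Tuple

# Toy entries with invented probabilities; this is not the shipped lexicon.
OKO_FARM, OKO_HUSBAND, OKO_VEHICLE = "oko", "ọkọ", "ọkọ̀"
RA_BUY = "rà"

CANDIDATES: Dict[str, Dict[str, float]] = {
    "mo": {"mo": 1.0},
    "ra": {RA_BUY: 0.9, "ra": 0.1},
    "oko": {OKO_FARM: 0.4, OKO_HUSBAND: 0.35, OKO_VEHICLE: 0.25},
}
BIGRAMS: Dict[Tuple[str, str], float] = {
    ("mo", RA_BUY): 0.5,
    (RA_BUY, OKO_VEHICLE): 0.4,
    (RA_BUY, OKO_FARM): 0.1,
}


def viterbi_decode(words, candidates, bigrams, floor=1e-6) -> List[str]:
    """Pick the candidate sequence with the highest total log-probability."""
    if not words:
        return []
    layers = []  # one dict per word: candidate -> (best log-score, back-pointer)
    for i, word in enumerate(words):
        options = candidates.get(word, {word: 1.0})  # unknown words pass through unchanged
        layer = {}
        for cand, p_cand in options.items():
            if i == 0:
                layer[cand] = (math.log(p_cand), None)
                continue
            # Best previous candidate, scored by its own total plus the transition into `cand`.
            score, prev = max(
                (layers[-1][p][0] + math.log(bigrams.get((p, cand), floor)), p)
                for p in layers[-1]
            )
            layer[cand] = (score + math.log(p_cand), prev)
        layers.append(layer)
    # Backtrack from the best final candidate.
    best = max(layers[-1], key=lambda c: layers[-1][c][0])
    path = [best]
    for layer in reversed(layers[1:]):
        best = layer[best][1]
        path.append(best)
    return path[::-1]


print(viterbi_decode(["mo", "ra", "oko"], CANDIDATES, BIGRAMS))
```

In this toy data the most frequent form of "oko" on its own is the unmarked one, but the transition score from the preceding word makes the decoder choose a different candidate. That context sensitivity is what a plain per-word lookup cannot provide.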
Limitations
We believe in transparency. Here's what NaijaML can't do yet:
- Yoruba tones: Dot-below restoration (ọ, ẹ, ṣ) is 97.5% accurate. Full tonal diacritization (à, á, è, é) is 90% word accuracy using Viterbi decoding — remaining errors are due to contextual ambiguity where even native speakers sometimes disagree on tones.
- Sentiment accuracy: 72% on Twitter data. Good enough for trend analysis, not for production decisions on individual texts. Optional transformer models are planned (see Roadmap).
- Pidgin vs English: Pidgin is an English-based creole, so code-mixed texts can be ambiguous. The detector requires Pidgin-specific markers (e.g., "dey", "wetin", "abeg") to classify as Pidgin — English-like text without markers defaults to English. 94.6% Pidgin recall, 99.9% English recall on held-out data.
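The Pidgin marker rule in the last item can be sketched as a post-processing step on a classifier's prediction. This is an assumption about how such a rule could look, not the detector's actual code. The marker set contains only the three words named above, the `'eng'` label for English is a guess by analogy with the other codes, and the function names are hypothetical.

```python
import re

# Only the three markers named in the text; a real list would be longer.
PIDGIN_MARKERS = frozenset({"dey", "wetin", "abeg"})


def has_pidgin_marker(text: str) -> bool:
    """True if any whole word in the text is a known Pidgin marker."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return any(tok in PIDGIN_MARKERS for tok in tokens)


def resolve_pidgin_vs_english(text: str, predicted: str) -> str:
    """Downgrade a 'pcm' prediction to 'eng' when no Pidgin marker is present."""
    if predicted == "pcm" and not has_pidgin_marker(text):
        return "eng"
    return predicted


print(resolve_pidgin_vs_english("How far, wetin dey happen?", "pcm"))  # → pcm
print(resolve_pidgin_vs_english("The meeting starts at noon", "pcm"))  # → eng
```

Matching whole tokens rather than substrings avoids false hits such as "dey" inside an unrelated longer word. The cost of the rule is the one the text describes: Pidgin without any listed marker is labelled English.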
Tokenizer Benchmark
We benchmarked 7 major AI tokenizers (GPT-4, GPT-4o, Llama 3, Gemma 2, Mistral, BERT, XLM-RoBERTa) on Nigerian languages. The results:
| Language | Avg Token Ratio vs English |
|---|---|
| Yoruba | 3.14x |
| Igbo | 2.30x |
| Hausa | 1.75x |
| Pidgin | 1.05x |
Averaged across these tokenizers, Yoruba text needs about 3x as many tokens as English, so it costs about 3x as much to process. GPT-4o's newer tokenizer performs best (1.69x); Mistral performs worst (2.47x). See the full analysis with interactive charts in benchmarks/.
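A token ratio like the ones in the table can be computed from parallel sentences: tokenize each sentence and its English translation, divide the counts, and average. The sketch below shows that arithmetic only and is not the benchmark's actual methodology. `avg_token_ratio` is a hypothetical name, and `byte_tokenize` is a stand-in (one token per UTF-8 byte) so the example runs without external packages. A real tokenizer's encode function would be passed in its place.

```python
from typing import Callable, List, Sequence, Tuple


def byte_tokenize(text: str) -> List[int]:
    """Stand-in tokenizer: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


def avg_token_ratio(
    pairs: Sequence[Tuple[str, str]],
    tokenize: Callable[[str], Sequence] = byte_tokenize,
) -> float:
    """Mean of (target-language token count / English token count) over sentence pairs."""
    if not pairs:
        raise ValueError("need at least one (target, english) pair")
    ratios = [len(tokenize(target)) / len(tokenize(english)) for target, english in pairs]
    return sum(ratios) / len(ratios)


# With the byte-level stand-in, a dotted vowel already costs three tokens where plain "o" costs one.
print(avg_token_ratio([("ọ", "o")]))  # → 3.0
```

The toy output hints at why diacritics-heavy text is expensive for byte-based vocabularies, though the exact ratios depend entirely on the tokenizer plugged in.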
Roadmap
- Hausa diacritizer
- More dataset loaders (MENYO-20k, NollySenti, AfriQA, MasakhaPOS)
- Optional transformer models via pip install naijaml[transformers]
- Named Entity Recognition for Nigerian entities
- Speech-to-text for Nigerian languages
Contributing
We need people who know Nigerian languages, Nigerian data, and Nigerian problems — ML engineers, linguists, data scientists, and domain experts in fintech, agritech, and health.
git clone https://github.com/naijaml/naijaml.git
cd naijaml
pip install -e ".[dev]"
pytest tests/ -v
Acknowledgments
Built with data and research from Masakhane, HausaNLP, and the African NLP community.
License
Apache 2.0