Open-source ML tools, libraries, and notebooks for the Nigerian ML ecosystem


NaijaML

Sovereign ML infrastructure for Nigeria.

Production-ready NLP tools for Yoruba, Hausa, Igbo, and Nigerian Pidgin.
Works on CPU. Works offline. No GPU required.



Standard NLP tools don't work for Nigeria. Tokenizers strip Yoruba diacritics. NER models don't recognize Nigerian names or states. Sentiment tools think Pidgin is broken English. Preprocessing libraries flag "sha" and "sef" as misspellings.

NaijaML is an open-source Python library that fixes this — built for the real constraints of developing ML in Nigeria: limited compute, intermittent connectivity, expensive bandwidth, and 500+ languages that the global ML ecosystem ignores.

```shell
pip install naijaml
```

Quick Start

Yoruba Diacritizer

```python
from naijaml.nlp import diacritize_yoruba, diacritize_yoruba_dot_below

diacritize_yoruba_dot_below("Ojo lo si oja")
# → 'Ọjọ lo si ọja'  (dot-below only, no tones)

diacritize_yoruba("Ojo lo si oja lana")
# → 'Ọjọ́ ló sí ọjà lànà'  (full tonal restoration)

# Dot-below: 97.5% accuracy | 6.4MB bundled
# Full tonal: 90.0% accuracy | 12.6MB auto-downloaded on first use
```
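A quick way to sanity-check a diacritizer on text that already carries diacritics is to strip them and compare the restoration against the original. Here is a stdlib-only helper for the stripping half (it is not part of naijaml): it decomposes characters with NFD and drops the combining marks.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    # Decompose precomposed characters (e.g. ọ → o + dot below),
    # then drop every combining mark
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

strip_diacritics("Ọjọ́ ló sí ọjà lànà")
# → 'Ojo lo si oja lana'
```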

Igbo Diacritizer

```python
from naijaml.nlp import diacritize_igbo

diacritize_igbo("Kedu ka i mere")
# → 'Kedụ ka ị mere'

# 95.2% accuracy | 4.9MB model | CPU only
```

Language Detection

```python
from naijaml.nlp import detect_language

detect_language("Bawo ni, se daadaa ni?")      # → 'yor'
detect_language("Ina kwana?")                  # → 'hau'
detect_language("Kedu ka ị mere?")             # → 'ibo'
detect_language("How far, wetin dey happen?")  # → 'pcm'

# 5 languages: Yoruba, Hausa, Igbo, Pidgin, English | 96.6% accuracy
```

Sentiment Analysis

```python
from naijaml.nlp import analyze_sentiment

analyze_sentiment("This film too sweet!")
# → {'label': 'positive', 'confidence': 0.64, ...}

analyze_sentiment("I no like am at all")
# → {'label': 'negative', 'confidence': 0.54, ...}

analyze_sentiment("Wannan fim din yana da kyau")  # Hausa
# → {'label': 'positive', 'confidence': 0.81, ...}

# Works across Yoruba, Hausa, Igbo, and Pidgin
```

Load Nigerian Datasets

```python
from naijaml.data import load_dataset

# NaijaSenti — sentiment in 4 Nigerian languages
data = load_dataset("naijasenti", lang="yor", split="train")
# → 8,522 Yoruba samples (Hausa: 14,172, Igbo: 10,192, Pidgin: 5,121)

# MasakhaNER — named entity recognition
ner_data = load_dataset("masakhaner", lang="hau", split="train")
# → Tags: PER, ORG, LOC, DATE

# MasakhaNEWS — news classification
news = load_dataset("masakhanews", lang="pcm", split="train")
# → Categories: business, entertainment, health, politics, sports, technology

# 7 datasets total | Downloads once, cached offline
```

Text Preprocessing

```python
from naijaml.nlp import mask_pii, is_pidgin_particle

# Mask Nigerian PII patterns
mask_pii("Call me on 08012345678 or email me@example.com")
# → 'Call me on [PHONE] or email [EMAIL]'
# Detects: +234 numbers, 080x/070x/090x, BVN, NIN, emails

# Pidgin-aware — preserves particles other tools strip
is_pidgin_particle("sha")   # → True
is_pidgin_particle("sef")   # → True
is_pidgin_particle("abeg")  # → True
```
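For intuition, the phone and email patterns can be approximated with plain regular expressions. This is an illustrative sketch only, not naijaml's actual rule set (which also covers BVN and NIN numbers):

```python
import re

# Illustrative patterns: +234 or 0 prefix, then the 70x/80x/81x/90x/91x ranges
NG_PHONE = re.compile(r"(?:\+234|0)[789][01]\d{8}")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def mask_pii_sketch(text: str) -> str:
    text = NG_PHONE.sub("[PHONE]", text)
    return EMAIL.sub("[EMAIL]", text)

mask_pii_sketch("Call me on 08012345678 or email me@example.com")
# → 'Call me on [PHONE] or email [EMAIL]'
```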

Nigerian Constants

```python
from naijaml.utils.constants import STATES, BANKS, format_naira, get_telco

STATES["Lagos"]               # → 'Ikeja'
BANKS["Guaranty Trust Bank"]  # → '058'
format_naira(1500000)         # → '₦1,500,000.00'
get_telco("08031234567")      # → 'MTN'
```

Tokenizer

```python
from naijaml.nlp import Tokenizer

tok = Tokenizer("yoruba")
tokens = tok.encode("Ọjọ́ àìkú")
text = tok.decode(tokens)  # perfect roundtrip

# Or use the unified tokenizer for all 4 languages
tok = Tokenizer("naija")
tok.encode("Ẹ kú àbọ̀")       # Yoruba
tok.encode("Kedụ ka ị mere")  # Igbo
tok.encode("Ina kwana?")      # Hausa

# 63% fewer tokens than GPT-4 for Yoruba | 100% diacritic preservation
```

Features

| Feature | Accuracy / Efficiency | Model Size |
|---|---|---|
| Tokenizer (Yoruba) | 63% fewer tokens vs GPT-4, 45% vs AfriBERTa | 560KB |
| Tokenizer (Igbo) | 50% fewer tokens vs GPT-4, 40% vs AfriBERTa | 550KB |
| Tokenizer (Hausa) | 31% fewer tokens vs GPT-4, 18% vs AfriBERTa | 420KB |
| Tokenizer (Pidgin) | 14% fewer tokens vs GPT-4 | 510KB |
| Tokenizer (Unified) | All 4 languages | 400KB |
| Language Detection | 96.6% accuracy | 29.6MB |
| Yoruba Diacritizer (full tonal) | 90.0% word accuracy | 12.6MB |
| Yoruba Diacritizer (dot-below) | 97.5% char accuracy | 6.4MB |
| Igbo Diacritizer | 95.2% accuracy | 4.9MB |
| Sentiment Analysis | 72% accuracy | 4.3MB |
| Dataset Loaders (7 datasets) | — | — |
| Text Preprocessing & PII Masking | — | — |
| Nigerian Constants (states, banks, telcos) | — | — |

~48MB bundled, 13MB downloaded on first use. Everything runs on CPU. No GPU required.

Design Philosophy

CPU-first. Every feature works on a laptop with 4GB RAM. GPU makes things faster but is never required. 95% of African AI talent has no meaningful GPU access — NaijaML is built for them.

Offline-capable. Small models ship with the package; larger ones auto-download from HuggingFace on first use and cache locally. After first run, everything works without internet.

Minimal dependencies. Core package needs only numpy, requests, tqdm, and tokenizers. We don't pull in PyTorch if we don't need it.

Honest metrics. We report real accuracy numbers, not cherry-picked results. The sentiment model is 72%, not 95%. The Yoruba diacritizer handles dot-below at 97.5% but full tonal is 90%. We tell you upfront.

Nigerian context. Examples use Nigerian names, cities, and data. PII masking handles Nigerian phone formats and national ID numbers. Currency is in Naira, not dollars.

Models

| Model | Size | Approach |
|---|---|---|
| Tokenizers (5 models) | 2.4MB total | BPE trained on dedicated Nigerian language corpora |
| Language Detection | 29.6MB | Naive Bayes + char n-grams (1-4) + language features |
| Yoruba Diacritizer (full) | 12.6MB | Word-level lookup + Viterbi decoding |
| Yoruba Diacritizer (dot-below) | 6.4MB | Syllable-based k-NN |
| Igbo Diacritizer | 4.9MB | Syllable-based k-NN |
| Sentiment Analysis | 4.3MB | TF-IDF + Logistic Regression |
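The "lookup + Viterbi" approach can be sketched generically: each plain word maps to candidate diacritized forms, and Viterbi picks the highest-scoring sequence under unigram (emission) and bigram (transition) scores. Everything below, including the toy candidates and scores, is illustrative and is not naijaml's implementation:

```python
def viterbi(candidates, emit, trans):
    # candidates: per-position lists of candidate forms
    # emit(i, c): log-score of candidate c at position i
    # trans(p, c): log-score of the bigram (p, c)
    best = {c: (emit(0, c), [c]) for c in candidates[0]}
    for i, cands in enumerate(candidates[1:], start=1):
        step = {}
        for c in cands:
            score, path = max(
                ((s + trans(p, c) + emit(i, c), p_path)
                 for p, (s, p_path) in best.items()),
                key=lambda t: t[0],
            )
            step[c] = (score, path + [c])
        best = step
    return max(best.values(), key=lambda t: t[0])[1]

# Toy candidates and scores (illustrative only)
cands = [["mo"], ["ri", "rí"], ["oko", "ọkọ̀"]]
emit = lambda i, c: {"rí": 0.5, "ọkọ̀": 0.2}.get(c, 0.0)
trans = lambda p, c: 0.6 if (p, c) == ("rí", "ọkọ̀") else 0.0
viterbi(cands, emit, trans)
# → ['mo', 'rí', 'ọkọ̀']
```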

Limitations

We believe in transparency. Here's what NaijaML can't do yet:

  • Yoruba tones: Dot-below restoration (ọ, ẹ, ṣ) is 97.5% accurate. Full tonal diacritization (à, á, è, é) is 90% word accuracy using Viterbi decoding — remaining errors are due to contextual ambiguity where even native speakers sometimes disagree on tones.
  • Sentiment accuracy: 72% on Twitter data. Good enough for trend analysis, not for production decisions on individual texts. Optional transformer models coming soon.
  • Pidgin vs English: Pidgin is an English-based creole, so code-mixed texts can be ambiguous. The detector requires Pidgin-specific markers (e.g., "dey", "wetin", "abeg") to classify as Pidgin — English-like text without markers defaults to English. 94.6% Pidgin recall, 99.9% English recall on held-out data.

Tokenizer Benchmark

We benchmarked against GPT-4 (tiktoken), AfriBERTa, and AfroXLMR on Nigerian languages:

Token Efficiency (NaijaML's token reduction vs each tokenizer; higher = better)

| Language | vs GPT-4 | vs AfriBERTa | vs AfroXLMR |
|---|---|---|---|
| Yoruba | 63% | 45% | 12% |
| Igbo | 50% | 40% | -1% |
| Hausa | 31% | 18% | 14% |
| Pidgin | 14% | -1% | — |

Diacritic Handling (critical difference)

| Input | GPT-4 | AfriBERTa | NaijaML |
|---|---|---|---|
| ọ́ (compound) | 2 tokens | 2 tokens | 1 token |
| ẹ̀ (compound) | 3 tokens | 2 tokens | 1 token |
| Ẹ kú àbọ̀ | 8 tokens | 5 tokens | 3 tokens |

Other tokenizers split diacritics because they weren't trained on enough Nigerian data. NaijaML keeps them together.
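The root cause is visible at the codepoint level: Unicode has no precomposed character for ọ́ (dot below plus acute), so it is always at least two codepoints, and a BPE vocabulary trained without enough Nigerian text rarely learns to merge them:

```python
import unicodedata

s = "\u1ecd\u0301"  # ọ + combining acute accent = ọ́
[unicodedata.name(c) for c in s]
# → ['LATIN SMALL LETTER O WITH DOT BELOW', 'COMBINING ACUTE ACCENT']

# NFC cannot fuse it: there is no single-codepoint ọ́
len(unicodedata.normalize("NFC", s))
# → 2
```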

Speed (batch encoding)

| Tokenizer | Speed | vs GPT-4 |
|---|---|---|
| NaijaML (Rust) | 3.8M tok/s | 2.2x faster |
| GPT-4 (tiktoken) | 1.7M tok/s | baseline |
| AfriBERTa | 0.4M tok/s | 4x slower |

See the full analysis in benchmarks/.

Roadmap

  • Hausa diacritizer
  • More dataset loaders (MENYO-20k, NollySenti, AfriQA, MasakhaPOS)
  • Optional transformer models via pip install naijaml[transformers]
  • Named Entity Recognition for Nigerian entities
  • Speech-to-text for Nigerian languages

Contributing

We need people who know Nigerian languages, Nigerian data, and Nigerian problems — ML engineers, linguists, data scientists, and domain experts in fintech, agritech, and health.

```shell
git clone https://github.com/naijaml/naijaml.git
cd naijaml
pip install -e ".[dev]"
pytest tests/ -v
```

Acknowledgments

Built with data and research from Masakhane, HausaNLP, and the African NLP community.

License

Apache 2.0
