Open-source ML tools, libraries, and notebooks for the Nigerian ML ecosystem
NaijaML
Python library for Nigerian language NLP. Supports Yorùbá, Hausa, Igbo, and Nigerian Pidgin.
pip install naijaml
What's Inside
| Feature | What it does | Accuracy |
|---|---|---|
| Language Detection | Identify yor/hau/ibo/pcm/eng | ~95% |
| Yorùbá Diacritizer | Restore ọ, ẹ, ṣ marks | 97.5% |
| Igbo Diacritizer | Restore ị, ọ, ụ marks | 95.2% |
| Sentiment Analysis | Classify pos/neg/neutral | 72% |
| Dataset Loaders | NaijaSenti, MasakhaNER, MasakhaNEWS | - |
| Text Preprocessing | PII masking, Pidgin-aware cleaning | - |
All models run on CPU. No GPU required. Total size: ~17MB.
Quick Start
Language Detection
from naijaml.nlp import detect_language
detect_language("Bawo ni, se daadaa ni?")
# → 'yor'
detect_language("Ina kwana?")
# → 'hau'
detect_language("Kedu ka ị mere?")
# → 'ibo'
detect_language("How far, wetin dey happen?")
# → 'pcm'
Yorùbá Diacritizer
from naijaml.nlp import diacritize_dot_below
diacritize_dot_below("Ojo lo si oja")
# → 'Ọjọ lo si ọja'
diacritize_dot_below("Ese pupo fun iranlowo re")
# → 'Ẹsẹ pupo fun iranlọwọ rẹ'
Igbo Diacritizer
from naijaml.nlp import diacritize_igbo
diacritize_igbo("Kedu ka i mere")
# → 'Kedụ ka ị mere'
diacritize_igbo("Daalu nne")
# → 'Daalu nne'
Sentiment Analysis
from naijaml.nlp import analyze_sentiment
analyze_sentiment("This film too sweet!")
# → {'label': 'positive', 'confidence': 0.64, ...}
analyze_sentiment("I no like am at all")
# → {'label': 'negative', 'confidence': 0.54, ...}
analyze_sentiment("Wannan fim din yana da kyau") # Hausa
# → {'label': 'positive', 'confidence': 0.81, ...}
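The confidences in the examples above are modest (0.54 to 0.81), so a caller may want to treat low-confidence predictions as undecided. Here is a minimal sketch, assuming the result is a dict with `label` and `confidence` keys as shown above. The helper name `label_with_threshold` and the 0.6 cutoff are hypothetical and not part of naijaml.

```python
def label_with_threshold(result, threshold=0.6, fallback="uncertain"):
    """Return the predicted label, or `fallback` if confidence is below `threshold`.

    `result` is assumed to look like {'label': ..., 'confidence': ...}.
    The helper and the default cutoff are illustrative, not a naijaml API.
    """
    if result["confidence"] >= threshold:
        return result["label"]
    return fallback


# Dicts copied from the example outputs above (extra keys omitted).
print(label_with_threshold({"label": "positive", "confidence": 0.64}))  # positive
print(label_with_threshold({"label": "negative", "confidence": 0.54}))  # uncertain
```

The right cutoff depends on the application, so it is worth tuning on your own labelled data.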
Dataset Loaders
from naijaml.data import load_dataset
# NaijaSenti - Sentiment Analysis (Hausa, Igbo, Yorùbá, Pidgin)
data = load_dataset("naijasenti", lang="yor", split="train")
# → [{'text': 'Ọjọ́ yìí dára gan!', 'label': 'positive'}, ...]
# 8,522 Yorùbá samples, 14,172 Hausa, 10,192 Igbo, 5,121 Pidgin
# MasakhaNER - Named Entity Recognition (Hausa, Igbo, Yorùbá)
ner_data = load_dataset("masakhaner", lang="hau", split="train")
# → [{'tokens': ['Shugaba', 'Tinubu', 'ya', ...], 'ner_tags': ['B-PER', 'I-PER', 'O', ...]}, ...]
# Tags: PER (person), ORG (organization), LOC (location), DATE
# MasakhaNEWS - News Classification (Hausa, Igbo, Yorùbá, Pidgin)
news = load_dataset("masakhanews", lang="pcm", split="train")
# → [{'text': '...', 'label': 'sports', 'headline': '...', 'url': '...'}, ...]
# Categories: business, entertainment, health, politics, sports, technology
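MasakhaNER records pair each token with a BIO tag, so downstream code often needs to group them back into whole entities. Here is a minimal sketch, assuming records shaped like the `tokens` / `ner_tags` example above. The helper `bio_to_spans` is hypothetical, and the sample record is a made-up toy, not real dataset output.

```python
def bio_to_spans(tokens, ner_tags):
    """Group BIO-tagged tokens into (entity_type, text) pairs."""
    spans = []
    current_type, current_tokens = None, []

    def flush():
        # Close the entity being built, if there is one.
        if current_tokens:
            spans.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, ner_tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != current_type):
            # A B- tag (or a stray I- tag of a different type) starts a new entity.
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            # Continuation of the current entity.
            current_tokens.append(token)
        else:
            # "O" tag: outside any entity.
            flush()
            current_type, current_tokens = None, []
    flush()
    return spans


# Toy record for illustration only.
record = {
    "tokens": ["Shugaba", "Tinubu", "ya", "isa", "Abuja"],
    "ner_tags": ["B-PER", "I-PER", "O", "O", "B-LOC"],
}
print(bio_to_spans(record["tokens"], record["ner_tags"]))
# [('PER', 'Shugaba Tinubu'), ('LOC', 'Abuja')]
```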
Text Preprocessing
from naijaml.nlp import mask_pii, is_pidgin_particle
# Mask Nigerian phone numbers, emails, BVN, NIN
mask_pii("Call me on 08012345678 or email me@example.com")
# → 'Call me on [PHONE] or email [EMAIL]'
# Check Pidgin particles (words often stripped by other NLP tools)
is_pidgin_particle("sha") # → True
is_pidgin_particle("sef") # → True
is_pidgin_particle("abeg") # → True
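To show the general idea behind PII masking, here is a rough regex-based sketch. It is not naijaml's implementation. The patterns and the helper name `mask_phone_and_email` are assumptions, and BVN and NIN are left out because they are also 11-digit numbers and need more context to tell apart from phone numbers.

```python
import re

# Assumed pattern for Nigerian mobile numbers: a leading 0 or (+)234,
# then 7/8/9, then 0/1, then eight more digits (e.g. 0803..., +234803...).
PHONE_RE = re.compile(r"(?<!\d)(?:\+?234|0)[789][01]\d{8}(?!\d)")

# Deliberately simple email pattern; real-world addresses can be more varied.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_phone_and_email(text):
    """Replace phone numbers and email addresses with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


print(mask_phone_and_email("Reach me on +2348031234567 or ada@example.com"))
# Reach me on [PHONE] or [EMAIL]
```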
Nigerian Constants
from naijaml.utils.constants import STATES, BANKS, format_naira, get_telco
# All 36 states + FCT
STATES["Lagos"] # → 'Ikeja'
# Nigerian banks
BANKS["GTBank"] # → '058'
# Format Naira
format_naira(1500000) # → '₦1,500,000.00'
# Identify telco from phone number
get_telco("08031234567") # → 'MTN'
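For readers curious how helpers like these can work, here is a standalone sketch. The `_sketch` names are hypothetical, the prefix table is a small illustrative subset, and none of this is taken from naijaml's source. Prefix lookup is only a heuristic, since a ported number keeps its original prefix.

```python
def format_naira_sketch(amount):
    """Format a number as Naira with thousands separators and two decimals."""
    return f"₦{amount:,.2f}"


# Illustrative subset of mobile prefixes; a real table would be much longer.
TELCO_PREFIXES = {
    "0803": "MTN",
    "0806": "MTN",
    "0805": "Glo",
    "0802": "Airtel",
    "0809": "9mobile",
}


def get_telco_sketch(phone):
    """Guess the network from the first four digits, or return None if unknown."""
    digits = phone.replace(" ", "").replace("-", "")
    if digits.startswith("+234"):
        # Convert the international form to the local 0-prefixed form.
        digits = "0" + digits[4:]
    return TELCO_PREFIXES.get(digits[:4])


print(format_naira_sketch(1500000))       # ₦1,500,000.00
print(get_telco_sketch("08031234567"))    # MTN
print(get_telco_sketch("+2348051234567")) # Glo
```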
Design Philosophy
- CPU-first: Everything works on a laptop with 4GB RAM
- Minimal dependencies: Core package needs only numpy, requests, tqdm
- Offline-capable: Models cached locally after first download
- Honest metrics: We report real accuracy numbers, not cherry-picked results
Models
All models are lightweight and included in the package:
| Model | Size | Approach |
|---|---|---|
| Language Detection | 1.8MB | Naive Bayes + char n-grams |
| Yorùbá Diacritizer | 6.4MB | Syllable-based k-NN |
| Igbo Diacritizer | 4.9MB | Syllable-based k-NN |
| Sentiment Analysis | 4.3MB | TF-IDF + Logistic Regression |
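To give a feel for the "Naive Bayes + char n-grams" approach listed for language detection, here is a toy sketch in pure Python. It is an illustration of the general technique, not the shipped model. The class name `CharNgramNB`, the trigram size, the add-one smoothing, and the six training sentences (borrowed from the Quick Start examples) are all assumptions.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class CharNgramNB:
    """Multinomial Naive Bayes over character n-grams with add-one smoothing."""

    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.docs = Counter(labels)         # label -> number of training texts
        for text, label in zip(texts, labels):
            self.counts[label].update(char_ngrams(text))
        self.vocab = {gram for c in self.counts.values() for gram in c}
        return self

    def predict(self, text):
        total_docs = sum(self.docs.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.counts.items():
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.docs[label] / total_docs)
            for gram in char_ngrams(text):
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Tiny toy training set; a real model needs far more data per language.
texts = [
    "Bawo ni, se daadaa ni?",
    "Ina kwana?",
    "Wannan fim din yana da kyau",
    "Kedu ka ị mere?",
    "How far, wetin dey happen?",
    "I no like am at all",
]
labels = ["yor", "hau", "hau", "ibo", "pcm", "pcm"]

model = CharNgramNB().fit(texts, labels)
print(model.predict("wetin dey happen"))  # pcm
print(model.predict("Ina kwana?"))        # hau
```

Character n-grams keep the model small and tolerant of spelling variation, which suits short, informal text.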
Limitations
Be aware of current limitations:
- Yorùbá tones: Dot-below restoration is 97.5% accurate, but full tonal diacritization reaches only ~77% because the correct tone marks often depend on context
- Sentiment: 72% accuracy on Twitter data. Production use cases may want the optional transformer model (coming soon)
- Pidgin-English: Short texts can be ambiguous between Pidgin and English
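If you want to check diacritizer quality on your own text, one possible approach is to strip the dot-below marks from correctly written text, restore them, and compare the result against the original. The sketch below is an assumption about how such a check could look, not the evaluation behind the numbers above. The helpers `strip_dot_below` and `word_accuracy` are hypothetical.

```python
import unicodedata

DOT_BELOW = "\u0323"  # Unicode combining dot below


def strip_dot_below(text):
    """Remove dot-below marks while keeping any other diacritics."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", decomposed.replace(DOT_BELOW, ""))


def word_accuracy(predicted, gold):
    """Fraction of whitespace-separated words that match the gold text exactly."""
    pred_words = unicodedata.normalize("NFC", predicted).split()
    gold_words = unicodedata.normalize("NFC", gold).split()
    if len(pred_words) != len(gold_words):
        raise ValueError("predicted and gold must have the same number of words")
    matches = sum(p == g for p, g in zip(pred_words, gold_words))
    return matches / len(gold_words)


gold = "Ọjọ lo si ọja"
plain = strip_dot_below(gold)
print(plain)                       # Ojo lo si oja
print(word_accuracy(plain, gold))  # 0.5 (baseline with nothing restored)
```

In practice you would pass the stripped text through the diacritizer and score its output in place of `plain`.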
Coming Soon
- Hausa diacritizer
- More dataset loaders (MENYO-20k, NollySenti, AfriQA, MasakhaPOS)
- Optional transformer models via pip install naijaml[nlp]
- Named Entity Recognition wrapper
- Speech-to-text for Nigerian languages
Contributing
We welcome contributions from the Nigerian and global ML community.
# Clone and install dev dependencies
git clone https://github.com/naijaml/naijaml.git
cd naijaml
pip install -e ".[dev]"
# Run tests
pytest tests/ -v
License
MIT
Acknowledgments
Built with data from Masakhane, HausaNLP, and the African NLP community.
Download files
File details
Details for the file naijaml-0.1.0.tar.gz.
File metadata
- Download URL: naijaml-0.1.0.tar.gz
- Upload date:
- Size: 4.1 MB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d9d826fb445c5396cc74f17967a914df3ad687f174b0369b875d6b8afca96870 |
| MD5 | ab11942f837ee4a548185f2d53110b9a |
| BLAKE2b-256 | 21928fd649cd1a39becda8932707fa378c9c20f28c42d7477ccf76a4c408df48 |
Provenance
The following attestation bundles were made for naijaml-0.1.0.tar.gz:
Publisher: release.yml on naijaml/naijaml
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: naijaml-0.1.0.tar.gz
  - Subject digest: d9d826fb445c5396cc74f17967a914df3ad687f174b0369b875d6b8afca96870
- Sigstore transparency entry: 944004612
- Sigstore integration time:
- Permalink: naijaml/naijaml@da4d18b7705312281fea667f7d334bd58109a5e9
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/naijaml
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@da4d18b7705312281fea667f7d334bd58109a5e9
- Trigger Event: push
File details
Details for the file naijaml-0.1.0-py3-none-any.whl.
File metadata
- Download URL: naijaml-0.1.0-py3-none-any.whl
- Upload date:
- Size: 3.7 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 760928ce10f5b92aa27b1e8a52c25c582f3c05787b84a1c2b927252f846d2b9a |
| MD5 | 93d0a20a3a869d39a1464b54b77b526b |
| BLAKE2b-256 | 3228e04a767f2176b871262ec68ce908ce1a3ae0c71ef74a116da804d0479aad |
Provenance
The following attestation bundles were made for naijaml-0.1.0-py3-none-any.whl:
Publisher: release.yml on naijaml/naijaml
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: naijaml-0.1.0-py3-none-any.whl
  - Subject digest: 760928ce10f5b92aa27b1e8a52c25c582f3c05787b84a1c2b927252f846d2b9a
- Sigstore transparency entry: 944004616
- Sigstore integration time:
- Permalink: naijaml/naijaml@da4d18b7705312281fea667f7d334bd58109a5e9
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/naijaml
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@da4d18b7705312281fea667f7d334bd58109a5e9
- Trigger Event: push