Open-source ML tools, libraries, and notebooks for the Nigerian ML ecosystem

NaijaML

Sovereign ML infrastructure for Nigeria.

Production-ready NLP tools for Yoruba, Hausa, Igbo, and Nigerian Pidgin.
Works on CPU. Works offline. 17MB total. No GPU required.


Standard NLP tools don't work for Nigeria. Tokenizers strip Yoruba diacritics. NER models don't recognize Nigerian names or states. Sentiment tools think Pidgin is broken English. Preprocessing libraries flag "sha" and "sef" as misspellings.

NaijaML is an open-source Python library that fixes this — built for the real constraints of developing ML in Nigeria: limited compute, intermittent connectivity, expensive bandwidth, and 500+ languages that the global ML ecosystem ignores.

pip install naijaml

Quick Start

Yoruba Diacritizer

from naijaml.nlp import diacritize_dot_below

diacritize_dot_below("Ojo lo si oja")
# → 'Ọjọ lo si ọja'

diacritize_dot_below("Ese pupo fun iranlowo re")
# → 'Ẹsẹ pupo fun iranlọwọ rẹ'

# 97.5% accuracy | 6.4MB model | CPU only | Works offline
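
One way to sanity-check a dot-below restorer on your own text is to strip the marks from correctly written sentences and count how many words come back unchanged. The sketch below is an assumption about how such a check could look, not the procedure behind the accuracy figure above; `strip_dot_below` and `word_accuracy` are hypothetical helper names.

```python
import unicodedata

DOT_BELOW = "\u0323"  # Unicode COMBINING DOT BELOW


def strip_dot_below(text):
    """Remove only the dot-below mark, leaving any tone marks in place."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", decomposed.replace(DOT_BELOW, ""))


def word_accuracy(restore, gold_sentences):
    """Fraction of words that `restore` recovers exactly from stripped input.

    `restore` is any callable str -> str (hypothetical parameter name).
    """
    correct = total = 0
    for gold in gold_sentences:
        gold_nfc = unicodedata.normalize("NFC", gold)
        predicted = unicodedata.normalize("NFC", restore(strip_dot_below(gold_nfc)))
        gold_words, pred_words = gold_nfc.split(), predicted.split()
        total += len(gold_words)
        correct += sum(g == p for g, p in zip(gold_words, pred_words))
    return correct / total if total else 0.0
```

Passing `diacritize_dot_below` as `restore` would score the bundled model on whatever sentences you supply. A callable that returns its input unchanged gives the do-nothing baseline.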

Igbo Diacritizer

from naijaml.nlp import diacritize_igbo

diacritize_igbo("Kedu ka i mere")
# → 'Kedụ ka ị mere'

# 95.2% accuracy | 4.9MB model | CPU only

Language Detection

from naijaml.nlp import detect_language

detect_language("Bawo ni, se daadaa ni?")   # → 'yor'
detect_language("Ina kwana?")                # → 'hau'
detect_language("Kedu ka ị mere?")           # → 'ibo'
detect_language("How far, wetin dey happen?") # → 'pcm'

# 5 languages: Yoruba, Hausa, Igbo, Pidgin, English | ~95% accuracy
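
The Limitations section notes that the detector works best on sentences of five or more words. A caller that wants to avoid acting on very short inputs could wrap the detector in a small guard. This is a sketch only: `detect_if_long_enough` is a hypothetical helper, and the `min_words=5` default simply mirrors that note.

```python
def detect_if_long_enough(text, detector, min_words=5):
    """Return detector(text), or None when the text is too short to trust.

    `detector` is any callable str -> str, for example detect_language.
    Words are counted by whitespace splitting, which is a simplification.
    """
    if len(text.split()) < min_words:
        return None
    return detector(text)
```

With `detect_language` passed as `detector`, a two-word greeting such as "Ina kwana?" would return `None` instead of a possibly unreliable label.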

Sentiment Analysis

from naijaml.nlp import analyze_sentiment

analyze_sentiment("This film too sweet!")
# → {'label': 'positive', 'confidence': 0.64, ...}

analyze_sentiment("I no like am at all")
# → {'label': 'negative', 'confidence': 0.54, ...}

analyze_sentiment("Wannan fim din yana da kyau")  # Hausa
# → {'label': 'positive', 'confidence': 0.81, ...}

# Works across Yoruba, Hausa, Igbo, and Pidgin

Load Nigerian Datasets

from naijaml.data import load_dataset

# NaijaSenti — Sentiment in 4 Nigerian languages
data = load_dataset("naijasenti", lang="yor", split="train")
# → Yoruba train split (8,522 samples). Other languages: 14,172 Hausa, 10,192 Igbo, 5,121 Pidgin

# MasakhaNER — Named Entity Recognition
ner_data = load_dataset("masakhaner", lang="hau", split="train")
# → Tags: PER, ORG, LOC, DATE

# MasakhaNEWS — News Classification
news = load_dataset("masakhanews", lang="pcm", split="train")
# → Categories: business, entertainment, health, politics, sports, technology

# 7 datasets total | Downloads once, cached offline

Text Preprocessing

from naijaml.nlp import mask_pii, is_pidgin_particle

# Mask Nigerian PII patterns
mask_pii("Call me on 08012345678 or email me@example.com")
# → 'Call me on [PHONE] or email [EMAIL]'
# Detects: +234 numbers, 080x/070x/090x, BVN, NIN, emails

# Pidgin-aware — preserves particles other tools strip
is_pidgin_particle("sha")   # → True
is_pidgin_particle("sef")   # → True
is_pidgin_particle("abeg")  # → True
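
To make the idea of pattern-based masking concrete, here is a minimal regex sketch for two of the listed PII types. It is not NaijaML's implementation: the patterns and the name `mask_pii_sketch` are assumptions for illustration. BVN and NIN are left out because both are 11-digit numbers that a bare regex cannot reliably tell apart from a local phone number.

```python
import re

# Illustrative patterns only (assumed, not taken from NaijaML).
# Mobile numbers: +234 / 234 / 0, then 7, 8 or 9, then 0 or 1, then 8 digits.
PHONE_RE = re.compile(r"(?<!\d)(?:\+?234|0)[789][01]\d{8}(?!\d)")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def mask_pii_sketch(text):
    """Replace e-mail addresses and Nigerian mobile numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


print(mask_pii_sketch("Reach Chidi on +2347031234567"))
# Reach Chidi on [PHONE]
```

The lookarounds `(?<!\d)` and `(?!\d)` keep the phone pattern from firing inside longer digit runs such as account or order numbers.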

Nigerian Constants

from naijaml.utils.constants import STATES, BANKS, format_naira, get_telco

STATES["Lagos"]              # → 'Ikeja'
BANKS["GTBank"]              # → '058'
format_naira(1500000)        # → '₦1,500,000.00'
get_telco("08031234567")     # → 'MTN'
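
For readers curious how helpers like these can work, the sketch below shows one plausible approach: Python's format mini-language for the currency string and a prefix lookup for the operator. The function names and the prefix table are assumptions. The table is deliberately partial and only reflects original prefix allocations, so a ported number would be misreported.

```python
# Partial, illustrative prefix table (assumed; not NaijaML's data).
TELCO_PREFIXES = {
    "0803": "MTN", "0806": "MTN", "0703": "MTN",
    "0805": "Glo", "0807": "Glo", "0705": "Glo",
    "0802": "Airtel", "0808": "Airtel", "0708": "Airtel",
    "0809": "9mobile", "0817": "9mobile", "0818": "9mobile",
}


def format_naira_sketch(amount):
    """Format a number with thousands separators and two decimal places."""
    return f"₦{amount:,.2f}"


def get_telco_sketch(number):
    """Look up the operator from the first four digits of the local format."""
    digits = "".join(ch for ch in number if ch.isdigit())
    if digits.startswith("234"):
        digits = "0" + digits[3:]  # +2348031234567 -> 08031234567
    return TELCO_PREFIXES.get(digits[:4])  # None for unknown prefixes
```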

Features

| Feature | Accuracy | Model Size |
|---|---|---|
| Language Detection | ~95% | 1.8MB |
| Yoruba Diacritizer (dot-below) | 97.5% | 6.4MB |
| Igbo Diacritizer | 95.2% | 4.9MB |
| Sentiment Analysis | 72% | 4.3MB |
| Dataset Loaders (7 datasets) | - | - |
| Text Preprocessing & PII Masking | - | - |
| Nigerian Constants (states, banks, telcos) | - | - |

Total model size: ~17MB. Everything runs on CPU. No GPU required.

Design Philosophy

CPU-first. Every feature works on a laptop with 4GB RAM. GPU makes things faster but is never required. 95% of African AI talent has no meaningful GPU access — NaijaML is built for them.

Offline-capable. Models are bundled. Datasets cache locally after first download. Core features work without internet.
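
The download-once behaviour described above can be sketched in a few lines. This is an assumed pattern, not NaijaML's actual cache code: `cached_fetch`, its parameters, and the `.part` temporary suffix are all hypothetical.

```python
from pathlib import Path


def cached_fetch(name, download, cache_dir):
    """Return the bytes for `name`, calling download() only on a cache miss.

    `download` is any zero-argument callable returning bytes, for example
    a function that performs an HTTP request.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / name
    if path.exists():
        return path.read_bytes()  # offline path: no network needed
    data = download()
    tmp = path.with_name(path.name + ".part")
    tmp.write_bytes(data)
    tmp.replace(path)  # rename last, so an interrupted download leaves no cache entry
    return data
```

Writing to a temporary file first matters on intermittent connections: a download that dies halfway never leaves a truncated file under the final name.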

Minimal dependencies. Core package needs only numpy, requests, tqdm. We don't pull in PyTorch if we don't need it.
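
A common way to keep heavy libraries out of the core install is to import them only at the point of use and fail with a clear hint otherwise. The sketch below shows that pattern with the standard library. `load_optional` is a hypothetical helper, and the install hint is supplied by the caller, so no particular extra is assumed to exist.

```python
import importlib
import importlib.util


def load_optional(module_name, install_hint):
    """Import a top-level module on demand, or raise with an install hint."""
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"{module_name} is not installed; try: {install_hint}"
        )
    return importlib.import_module(module_name)
```

Code that needs an optional backend would call this inside the function that uses it, so importing the core package never triggers the heavy import.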

Honest metrics. We report real accuracy numbers, not cherry-picked results. The sentiment model is 72% accurate, not 95%. The Yoruba diacritizer restores dot-below marks at 97.5% accuracy, but full tonal diacritization is only ~77%. We tell you upfront.

Nigerian context. Examples use Nigerian names, cities, and data. PII masking handles Nigerian phone formats and national ID numbers. Currency is in Naira, not dollars.

Models

| Model | Size | Approach |
|---|---|---|
| Language Detection | 1.8MB | Naive Bayes + char n-grams |
| Yoruba Diacritizer | 6.4MB | Syllable-based k-NN |
| Igbo Diacritizer | 4.9MB | Syllable-based k-NN |
| Sentiment Analysis | 4.3MB | TF-IDF + Logistic Regression |
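
To illustrate the first row, here is a toy multinomial Naive Bayes classifier over character trigrams in pure Python. It is a minimal sketch of the general technique, not the bundled model: the class name, the add-one smoothing, and the four-sentence training set are assumptions chosen to keep the example short.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Overlapping character n-grams of the lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams with add-one smoothing."""

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.docs = Counter()               # label -> number of training texts
        self.vocab = set()                  # every n-gram seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.counts[label].update(grams)
            self.docs[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        total_docs = sum(self.docs.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.counts.items():
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.docs[label] / total_docs)  # log prior
            score += sum(math.log((counts[g] + 1) / denom) for g in grams)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data: one sentence per language, for illustration only.
model = NgramNaiveBayes().fit(
    ["Bawo ni, se daadaa ni?", "Ina kwana?",
     "Kedu ka ị mere?", "How far, wetin dey happen?"],
    ["yor", "hau", "ibo", "pcm"],
)
print(model.predict("wetin dey"))  # pcm
```

Character n-grams suit this task because they need no tokenizer and pick up spelling patterns that differ between languages. A model of this kind is essentially a table of counts, which is consistent with the small sizes listed above.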

Limitations

We believe in transparency. Here's what NaijaML can't do yet:

  • Yoruba tones: Dot-below restoration (ọ, ẹ, ṣ) is 97.5% accurate. Full tonal diacritization (à, á, è, é) is ~77% due to contextual ambiguity — even native speakers sometimes disagree on tones.
  • Sentiment accuracy: 72% on Twitter data. That is good enough for trend analysis, but not for production decisions on individual texts. Optional transformer models are planned.
  • Pidgin vs English: Short texts can be ambiguous between Pidgin and informal English. The detector works best on sentences of 5+ words.
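
Given the sentiment limitation above, one reasonable usage pattern is to aggregate predictions over many texts and set aside low-confidence ones, rather than acting on any single result. The sketch below assumes only the `label` and `confidence` keys shown in the Quick Start output. `sentiment_trend` and the 0.6 cut-off are hypothetical and should be tuned on your own data.

```python
from collections import Counter


def sentiment_trend(results, min_confidence=0.6):
    """Share of each label among predictions at or above min_confidence."""
    kept = [r["label"] for r in results if r["confidence"] >= min_confidence]
    total = len(kept)
    shares = {label: n / total for label, n in Counter(kept).items()} if total else {}
    return {"shares": shares, "used": total, "skipped": len(results) - total}
```

Feeding it a list such as `[analyze_sentiment(t) for t in texts]` yields label shares plus a count of skipped predictions, which makes the effect of the confidence filter visible.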

Roadmap

  • Hausa diacritizer
  • More dataset loaders (MENYO-20k, NollySenti, AfriQA, MasakhaPOS)
  • Optional transformer models via pip install naijaml[transformers]
  • Named Entity Recognition for Nigerian entities
  • Speech-to-text for Nigerian languages
  • Nigerian ML Benchmark — how do GPT-4, Llama, Gemma perform on Nigerian tasks?

Contributing

We need people who know Nigerian languages, Nigerian data, and Nigerian problems — ML engineers, linguists, data scientists, and domain experts in fintech, agritech, and health.

git clone https://github.com/naijaml/naijaml.git
cd naijaml
pip install -e ".[dev]"
pytest tests/ -v

Acknowledgments

Built with data and research from Masakhane, HausaNLP, and the African NLP community.

License

Apache 2.0
