
Open-source ML tools, libraries, and notebooks for the Nigerian ML ecosystem


NaijaML

Python library for Nigerian language NLP. Supports Yorùbá, Hausa, Igbo, and Nigerian Pidgin.

pip install naijaml

What's Inside

| Feature | What it does | Accuracy |
|---|---|---|
| Language Detection | Identify yor/hau/ibo/pcm/eng | ~95% |
| Yorùbá Diacritizer | Restore ọ, ẹ, ṣ marks | 97.5% |
| Igbo Diacritizer | Restore ị, ọ, ụ marks | 95.2% |
| Sentiment Analysis | Classify pos/neg/neutral | 72% |
| Dataset Loaders | NaijaSenti, MasakhaNER, MasakhaNEWS | - |
| Text Preprocessing | PII masking, Pidgin-aware cleaning | - |

All models run on CPU. No GPU required. Total size: ~17MB.

Quick Start

Language Detection

from naijaml.nlp import detect_language

detect_language("Bawo ni, se daadaa ni?")
# → 'yor'

detect_language("Ina kwana?")
# → 'hau'

detect_language("Kedu ka ị mere?")
# → 'ibo'

detect_language("How far, wetin dey happen?")
# → 'pcm'

Yorùbá Diacritizer

from naijaml.nlp import diacritize_dot_below

diacritize_dot_below("Ojo lo si oja")
# → 'Ọjọ lo si ọja'

diacritize_dot_below("Ese pupo fun iranlowo re")
# → 'Ẹsẹ pupo fun iranlọwọ rẹ'

Igbo Diacritizer

from naijaml.nlp import diacritize_igbo

diacritize_igbo("Kedu ka i mere")
# → 'Kedụ ka ị mere'

diacritize_igbo("Daalu nne")
# → 'Daalu nne'

Sentiment Analysis

from naijaml.nlp import analyze_sentiment

analyze_sentiment("This film too sweet!")
# → {'label': 'positive', 'confidence': 0.64, ...}

analyze_sentiment("I no like am at all")
# → {'label': 'negative', 'confidence': 0.54, ...}

analyze_sentiment("Wannan fim din yana da kyau")  # Hausa
# → {'label': 'positive', 'confidence': 0.81, ...}
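Some of the confidence values above are close to 0.5, so callers may want to treat low-confidence predictions as undecided. A minimal sketch, assuming only that the result is a dict with `label` and `confidence` keys as shown above; `resolve_label` and the 0.6 threshold are hypothetical and not part of NaijaML:

```python
def resolve_label(result, threshold=0.6, fallback="uncertain"):
    """Return the predicted label, or `fallback` when confidence is below `threshold`.

    `result` is assumed to look like the analyze_sentiment outputs above:
    a dict with at least 'label' and 'confidence' keys.
    """
    if result["confidence"] >= threshold:
        return result["label"]
    return fallback


# Dicts shaped like the outputs shown above (hypothetical helper, not a NaijaML API).
print(resolve_label({"label": "positive", "confidence": 0.64}))  # prints: positive
print(resolve_label({"label": "negative", "confidence": 0.54}))  # prints: uncertain
```

The threshold is a judgment call; a value tuned on held-out data from your own domain is likely to work better than a fixed 0.6.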

Dataset Loaders

from naijaml.data import load_dataset

# NaijaSenti - Sentiment Analysis (Hausa, Igbo, Yorùbá, Pidgin)
data = load_dataset("naijasenti", lang="yor", split="train")
# → [{'text': 'Ọjọ́ yìí dára gan!', 'label': 'positive'}, ...]
# 8,522 Yorùbá samples, 14,172 Hausa, 10,192 Igbo, 5,121 Pidgin

# MasakhaNER - Named Entity Recognition (Hausa, Igbo, Yorùbá)
ner_data = load_dataset("masakhaner", lang="hau", split="train")
# → [{'tokens': ['Shugaba', 'Tinubu', 'ya', ...], 'ner_tags': ['B-PER', 'I-PER', 'O', ...]}, ...]
# Tags: PER (person), ORG (organization), LOC (location), DATE

# MasakhaNEWS - News Classification (Hausa, Igbo, Yorùbá, Pidgin)
news = load_dataset("masakhanews", lang="pcm", split="train")
# → [{'text': '...', 'label': 'sports', 'headline': '...', 'url': '...'}, ...]
# Categories: business, entertainment, health, politics, sports, technology
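Because the loaders return plain lists of dicts, standard-library tools are enough for a first look at the data. A small sketch that computes the label distribution, assuming records shaped like the NaijaSenti output above; `label_distribution` is a hypothetical helper and the sample records are made-up placeholders, not real dataset rows:

```python
from collections import Counter


def label_distribution(records):
    """Return the share of each label in a list of {'text': ..., 'label': ...} dicts."""
    counts = Counter(record["label"] for record in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# Placeholder records for illustration only.
sample = [
    {"text": "placeholder 1", "label": "positive"},
    {"text": "placeholder 2", "label": "positive"},
    {"text": "placeholder 3", "label": "negative"},
    {"text": "placeholder 4", "label": "neutral"},
]

print(label_distribution(sample))
# → {'positive': 0.5, 'negative': 0.25, 'neutral': 0.25}
```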

Text Preprocessing

from naijaml.nlp import mask_pii, is_pidgin_particle

# Mask Nigerian phone numbers, emails, BVN, NIN
mask_pii("Call me on 08012345678 or email me@example.com")
# → 'Call me on [PHONE] or email [EMAIL]'

# Check Pidgin particles (words often stripped by other NLP tools)
is_pidgin_particle("sha")  # → True
is_pidgin_particle("sef")  # → True
is_pidgin_particle("abeg") # → True
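For readers curious how this kind of masking can work, here is a rough regex-based sketch. It is not NaijaML's implementation: `mask_pii_sketch` and both patterns are assumptions made for illustration, and it covers only emails and 11-digit Nigerian mobile numbers (optionally written with a `+234` prefix), not BVN or NIN.

```python
import re

# Simplified email pattern; real-world email syntax is broader than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Nigerian mobile numbers: 0 (or 234 / +234), then 7/8/9, then 0/1, then 8 digits.
# The lookarounds stop the pattern from matching inside a longer digit run.
PHONE_RE = re.compile(r"(?<!\d)(?:\+?234|0)[789][01]\d{8}(?!\d)")


def mask_pii_sketch(text):
    """Replace emails and Nigerian mobile numbers with placeholder tokens."""
    # Mask emails first so digits inside an address are never treated as a phone number.
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


print(mask_pii_sketch("Call me on 08012345678 or email me@example.com"))
# prints: Call me on [PHONE] or email [EMAIL]
```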

Nigerian Constants

from naijaml.utils.constants import STATES, BANKS, format_naira, get_telco

# All 36 states + FCT
STATES["Lagos"]  # → 'Ikeja'

# Nigerian banks
BANKS["GTBank"]  # → '058'

# Format Naira
format_naira(1500000)  # → '₦1,500,000.00'

# Identify telco from phone number
get_telco("08031234567")  # → 'MTN'
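The two helpers above are simple to approximate, which may be useful when the library is not available. The sketch below is an assumption-laden illustration, not NaijaML's code: `format_naira_sketch`, `telco_from_prefix`, and the deliberately partial `TELCO_PREFIXES` table are hypothetical. Note also that number portability means a prefix only indicates the original network.

```python
# Partial, illustrative prefix table; the real allocation list is much longer.
TELCO_PREFIXES = {
    "0803": "MTN", "0806": "MTN",
    "0805": "Glo", "0807": "Glo",
    "0802": "Airtel", "0808": "Airtel",
    "0809": "9mobile", "0818": "9mobile",
}


def format_naira_sketch(amount):
    """Format a number with the Naira sign, thousands separators, and two decimals."""
    return f"₦{amount:,.2f}"


def telco_from_prefix(number):
    """Guess the network from the first four digits of an 11-digit local number."""
    digits = "".join(ch for ch in number if ch.isdigit())
    # Convert the international form 234XXXXXXXXXX to the local form 0XXXXXXXXXX.
    if digits.startswith("234") and len(digits) == 13:
        digits = "0" + digits[3:]
    if len(digits) != 11:
        return None
    return TELCO_PREFIXES.get(digits[:4])


print(format_naira_sketch(1500000))        # prints: ₦1,500,000.00
print(telco_from_prefix("08031234567"))    # prints: MTN
print(telco_from_prefix("+2348051234567")) # prints: Glo
```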

Design Philosophy

  • CPU-first: Everything works on a laptop with 4GB RAM
  • Minimal dependencies: Core package needs only numpy, requests, tqdm
  • Offline-capable: Models cached locally after first download
  • Honest metrics: We report real accuracy numbers, not cherry-picked results

Models

All models are lightweight and included in the package:

| Model | Size | Approach |
|---|---|---|
| Language Detection | 1.8MB | Naive Bayes + char n-grams |
| Yorùbá Diacritizer | 6.4MB | Syllable-based k-NN |
| Igbo Diacritizer | 4.9MB | Syllable-based k-NN |
| Sentiment Analysis | 4.3MB | TF-IDF + Logistic Regression |
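To give a feel for the "Naive Bayes + char n-grams" approach listed for language detection, here is a toy pure-Python sketch. It is not the shipped model: the class name, the trigram size, the add-one smoothing, and the four-sentence training set (borrowed from the Quick Start examples) are all assumptions chosen to keep the example short.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of the lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams with additive smoothing."""

    def __init__(self, n=3, alpha=1.0):
        self.n = n
        self.alpha = alpha
        self.ngram_counts = defaultdict(Counter)  # label -> n-gram counts
        self.doc_counts = Counter()               # label -> number of training texts
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.ngram_counts[label].update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.ngram_counts.items():
            # Log prior plus the sum of smoothed log likelihoods of each n-gram.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            for gram in grams:
                score += math.log((counts[gram] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training set; a usable detector needs far more text per language.
texts = ["Bawo ni, se daadaa ni?", "Ina kwana?", "Kedu ka ị mere?", "How far, wetin dey happen?"]
labels = ["yor", "hau", "ibo", "pcm"]

model = NgramNaiveBayes().fit(texts, labels)
print(model.predict("Ina kwana?"))  # prints: hau
```

Character n-grams work well for this task because closely related spellings and short function words leave distinctive traces even in very short inputs, and the resulting count tables stay small.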

Limitations

Be aware of current limitations:

  • Yorùbá tones: Dot-below restoration is 97.5% accurate, but full tonal diacritization reaches only ~77% because the correct tone marks often depend on context
  • Sentiment: 72% accuracy, measured on Twitter data. Production use cases may need the optional transformer model (coming soon)
  • Pidgin-English: Short texts can be ambiguous between Pidgin and English

Coming Soon

  • Hausa diacritizer
  • More dataset loaders (MENYO-20k, NollySenti, AfriQA, MasakhaPOS)
  • Optional transformer models via pip install naijaml[nlp]
  • Named Entity Recognition wrapper
  • Speech-to-text for Nigerian languages

Contributing

We welcome contributions from the Nigerian and global ML community.

# Clone and install dev dependencies
git clone https://github.com/naijaml/naijaml.git
cd naijaml
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

License

MIT

Acknowledgments

Built with data from Masakhane, HausaNLP, and the African NLP community.
