
Utility library for analysis & (pre)processing of Yorùbá text




Ìrànlọ́wọ́ is a set of utilities to analyze & process Yorùbá text for NLP tasks. The focus is on helping software developers build large, clean text datasets for (further) diacritic restoration and machine translation tasks.


ADR tools

  • Strip all diacritics from word-types
  • Verify that text is NFC or NFD
  • Normalize a corpus (from MS Word or elsewhere) → NFC
  • Split long sentences on certain characters such as ; , : etc.
  • Automatically restore correct diacritics using a pre-trained model
  • Find all variants of every word-type in a given corpus
  • Partially strip diacritics from word-types
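
The normalization and stripping items above can be illustrated with the Python standard library alone. This is a minimal sketch, not Ìrànlọ́wọ́'s own API: the function names are hypothetical, and treating "partial" stripping as "drop tone marks but keep the underdot" is an assumption about what that item means.

```python
import unicodedata

UNDERDOT = "\u0323"  # COMBINING DOT BELOW, as in ẹ, ọ, ṣ


def is_normalized(text, form="NFC"):
    """Return True if `text` is already in the given Unicode normal form."""
    return unicodedata.normalize(form, text) == text


def to_nfc(text):
    """Normalize text (e.g. pasted from MS Word) to NFC."""
    return unicodedata.normalize("NFC", text)


def strip_all_diacritics(text):
    """Remove every combining mark: tone marks and underdots alike."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_tone_marks(text):
    """Hypothetical 'partial' strip: drop tone marks, keep underdots."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(
        ch for ch in decomposed
        if not unicodedata.combining(ch) or ch == UNDERDOT
    )
    return unicodedata.normalize("NFC", kept)


print(strip_all_diacritics("ọjọ́"))  # ojo
print(strip_tone_marks("ọjọ́"))      # ọjọ
```

Decomposing to NFD first means each base letter is separated from its marks, so the same filter works whether the input arrived precomposed or decomposed.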

Ready to use webpage scrapers

  • Bíbélì Mímọ́ (Biblica, Bible Society of Nigeria)
  • Yorùbá Blog
  • BBC Yorùbá

Corpus analysis tools

  • Dataset character distribution
  • Dataset ambiguity statistics → Lexdif, etc. for a given corpus
  • Dataset scoring (proximity to correctly diacritized text, LM perplexity, KL divergence)
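
As a rough illustration of what the first two analysis items involve, the sketch below computes a character distribution and groups the diacritized variants that share one undiacritized base form. It uses only the standard library; the function names are hypothetical and are not taken from Ìrànlọ́wọ́, and the variant count is only a simple proxy for ambiguity, not the Lexdif statistic itself.

```python
from collections import Counter, defaultdict
import unicodedata


def character_distribution(lines):
    """Relative frequency of each non-whitespace character, after NFC."""
    counts = Counter()
    for line in lines:
        nfc = unicodedata.normalize("NFC", line)
        counts.update(ch for ch in nfc if not ch.isspace())
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}


def _strip(word):
    """Remove all combining marks from a word."""
    nfd = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in nfd if not unicodedata.combining(ch))


def variants_by_base_form(lines):
    """Map each undiacritized form to the set of diacritized forms seen.

    Tokenization is plain whitespace splitting; punctuation handling is
    deliberately left out of this sketch.
    """
    variants = defaultdict(set)
    for line in lines:
        for word in unicodedata.normalize("NFC", line).lower().split():
            variants[_strip(word)].add(word)
    return dict(variants)


corpus = ["ọjọ́ ojo", "òjò ọjọ́"]
for base, forms in variants_by_base_form(corpus).items():
    print(base, len(forms))
```

A base form with many distinct variants is one where a diacritic restoration model has more candidates to choose between.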


Obtainable from the Python Package Index (PyPI):

$ pip install iranlowo


  • Show computing environment and installation process
  • Diacritize a phrase
$ python
Python 3.7.3 (default, Mar 27 2019, 16:54:48)
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import iranlowo.adr as ránlọ
>>> ránlọ.diacritize_text("lootoo ni pe ojo gbogbo ni ti ole")
PRED AVG SCORE: -0.0037, PRED PPL: 1.0037
'lóòtóọ́ ni pé ọjọ́ gbogbo ni ti olè' 
  • Diacritize phrases. Note that we use ipython only because it renders nicer, easier-to-read text colours in the terminal.


This is beta software. If you pass the diacritizer out-of-domain text (English, Pidgin, or any other non-Yorùbá text), expect unpredictable, black-box results.

Since this is a work in progress and we are steadily improving it, please submit a pull request with corrections or file an issue if you encounter any problems with correctness or performance.


This project is licensed under the MIT License.
