
Utility library for analysis & (pre)processing of Yorùbá text

Project description

Ìrànlọ́wọ́


Ìrànlọ́wọ́ is a set of utilities to analyze & process Yorùbá text for NLP tasks. The focus is on helping software developers build large, clean text datasets for (further) diacritic restoration and machine translation tasks.

Features

ADR tools

  • Strip all diacritics from word-types
  • Verify that text is NFC or NFD
  • Normalize a corpus (from MS Word or elsewhere) → NFC
  • Split long sentences on delimiter characters such as ";", "," and ":"
  • Automatically restore correct diacritics using a pre-trained model
  • Find all variants of each word-type in a given corpus
  • Partially strip diacritics from word-types
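
The text-level operations in this list can be approximated with the Python standard library alone. The sketch below is not Ìrànlọ́wọ́'s implementation or API: every function name here is hypothetical, and it assumes that "partially strip" means dropping tone marks while keeping the under-dot (U+0323).

```python
import re
import unicodedata
from collections import defaultdict


def strip_all_diacritics(text):
    """Remove every combining mark (tone marks and under-dots)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_tone_marks(text):
    """Partial strip (assumed meaning): drop tone marks, keep the under-dot."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(
        ch for ch in decomposed
        if not unicodedata.combining(ch) or ch == "\u0323"
    )
    return unicodedata.normalize("NFC", kept)


def is_nfc(text):
    """True if the text is already in NFC form."""
    return unicodedata.normalize("NFC", text) == text


def is_nfd(text):
    """True if the text is already in NFD form."""
    return unicodedata.normalize("NFD", text) == text


def split_long_sentence(sentence, delimiters=";,:"):
    """Split on any of the delimiter characters and drop empty pieces."""
    parts = re.split("[" + re.escape(delimiters) + "]", sentence)
    return [p.strip() for p in parts if p.strip()]


def find_variants(tokens):
    """Group NFC-normalized tokens by their fully stripped form."""
    variants = defaultdict(set)
    for tok in tokens:
        tok = unicodedata.normalize("NFC", tok)
        variants[strip_all_diacritics(tok)].add(tok)
    return dict(variants)
```

Normalizing to NFD first matters because some Yorùbá characters are precomposed in NFC (for example "è") while others combine a precomposed base with a separate tone mark (for example "ọ́"); decomposing puts both cases on the same footing before marks are filtered.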

Ready to use webpage scrapers

  • Bíbélì Mímọ́ (Biblica, Bible Society of Nigeria)
  • Yorùbá Blog
  • BBC Yorùbá

Corpus analysis tools

  • Dataset character distribution
  • Dataset ambiguity statistics → Lexdif, etc. for a given corpus
  • Dataset scoring (proximity to correctly diacritized text, LM perplexity, KL divergence)
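
To make these statistics concrete, here is a minimal standard-library sketch. It is not the library's code and the function names are hypothetical. It assumes that Lexdif means the average number of distinct diacritized forms per stripped word form, and that KL divergence is computed between two character distributions; Ìrànlọ́wọ́ may define either differently.

```python
import math
import unicodedata
from collections import Counter, defaultdict


def char_distribution(text):
    """Relative frequency of each non-whitespace code point after NFC.

    Note: in NFC a character such as o + under-dot + acute is two code
    points, so the combining acute is counted as its own entry.
    """
    chars = [ch for ch in unicodedata.normalize("NFC", text) if not ch.isspace()]
    total = len(chars)
    if total == 0:
        return {}
    return {ch: n / total for ch, n in Counter(chars).items()}


def _strip(word):
    """Remove all combining marks from a word."""
    nfd = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in nfd if not unicodedata.combining(ch))


def lexdif(tokens):
    """Assumed definition: mean number of diacritized forms per stripped form."""
    forms = defaultdict(set)
    for tok in tokens:
        tok = unicodedata.normalize("NFC", tok)
        forms[_strip(tok)].add(tok)
    if not forms:
        return 0.0
    return sum(len(v) for v in forms.values()) / len(forms)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; q is floored at eps to avoid division by zero."""
    return sum(
        pv * math.log(pv / max(q.get(k, 0.0), eps))
        for k, pv in p.items()
        if pv > 0.0
    )
```

Under this definition a value of 1.0 means no stripped form is ambiguous in the corpus, and larger values mean a restoration model has more candidates to choose between per word.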

Installation

Obtainable from the Python Package Index (PyPI):

$ pip install iranlowo

Example

  • Show computing environment and installation process
  • Diacritize a phrase
$ python
Python 3.7.3 (default, Mar 27 2019, 16:54:48)
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import iranlowo.adr as ránlọ
>>> ránlọ.diacritize_text("lootoo ni pe ojo gbogbo ni ti ole")
PRED AVG SCORE: -0.0037, PRED PPL: 1.0037
'lóòtóọ́ ni pé ọjọ́ gbogbo ni ti olè' 
  • Diacritize phrases. Note that we use IPython only because it renders nicer, easier-to-read text colours in the terminal.
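
The two numbers logged in the session above (PRED AVG SCORE and PRED PPL) are consistent with perplexity being the exponential of the negated average log-probability. That relationship is inferred from the printed values, not confirmed by the project, so treat the helper below (a hypothetical name) as a reading aid only.

```python
import math


def perplexity_from_avg_score(avg_log_prob):
    """Assumed relationship: PPL = exp(-average log-probability)."""
    return math.exp(-avg_log_prob)


print(round(perplexity_from_avg_score(-0.0037), 4))  # 1.0037
```

A perplexity close to 1 indicates the model was nearly certain about each predicted token for this phrase.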

Disclaimer

This is beta software. If you pass the diacritizer out-of-domain text (English, Pidgin, or any other non-Yorùbá text), expect unpredictable, black-box results.
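
One way to catch some out-of-domain input before it reaches the diacritizer is a rough alphabet check. The sketch below is a hypothetical pre-filter, not part of Ìrànlọ́wọ́. It relies only on the fact that standard Yorùbá orthography does not use the letters c, q, v, x or z, so it will miss plenty of non-Yorùbá text that happens to avoid those letters.

```python
import unicodedata

# Basic Latin letters that standard Yorùbá orthography does not use.
NON_YORUBA_LETTERS = set("cqvxz")


def looks_out_of_domain(text, max_ratio=0.0):
    """Heuristic: True if the share of non-Yorùbá letters exceeds max_ratio."""
    # Strip diacritics so that marked vowels reduce to plain ASCII letters.
    stripped = "".join(
        ch for ch in unicodedata.normalize("NFD", text.lower())
        if not unicodedata.combining(ch)
    )
    letters = [ch for ch in stripped if ch.isalpha()]
    if not letters:
        return False  # nothing alphabetic to judge
    foreign = sum(
        1 for ch in letters
        if ch in NON_YORUBA_LETTERS or ord(ch) > 127
    )
    return foreign / len(letters) > max_ratio
```

A caller could skip or flag any phrase for which this returns True and pass the rest on for diacritization; raising max_ratio tolerates occasional loanwords or names.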

Since this is a work in progress and we are steadily improving it, if you encounter any problems with correctness or performance, please submit a pull request with corrections or file an issue.

License

This project is licensed under the MIT License.
