Utility library for analysis & (pre)processing of Yorùbá text
Project description
Ìrànlọ́wọ́
Ìrànlọ́wọ́ is a set of utilities to analyze & process Yorùbá text for NLP tasks. The focus is on helping software developers build large, clean text datasets for (further) diacritic restoration and machine translation tasks.
Features
ADR tools
- Strip all diacritics from word-types
- Verify that text is NFC or NFD
- Normalize a corpus (from MS Word or elsewhere) → NFC
- Split long sentences on certain characters like ;, :, etc.
- Automatically restore correct diacritics using a pre-trained model
- Find all variants of each word-type in a given corpus
- Partially strip diacritics from word-types
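The stripping, verification, normalization and variant-finding ideas above can be sketched with nothing but the Python standard library. This is a minimal illustration, not Ìrànlọ́wọ́'s own API: the function names `strip_diacritics`, `strip_tones`, `is_nfc` and `find_variants` are hypothetical, and "partial" stripping is assumed here to mean removing tone marks while keeping under-dots.

```python
import unicodedata
from collections import defaultdict

# Assumption: "tone marks" are the combining grave, acute and macron.
TONE_MARKS = {"\u0300", "\u0301", "\u0304"}


def strip_diacritics(text):
    """Remove every combining mark (tone marks and under-dots)."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


def strip_tones(text):
    """Partial strip: drop tone marks only, keep under-dots."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


def is_nfc(text):
    """True if the text is already in NFC form."""
    return unicodedata.normalize("NFC", text) == text


def find_variants(words):
    """Group word-types by their fully stripped form."""
    variants = defaultdict(set)
    for word in words:
        variants[strip_diacritics(word)].add(unicodedata.normalize("NFC", word))
    return dict(variants)


print(strip_diacritics("Yorùbá"))  # Yoruba
print(is_nfc("o\u0323"))           # False: decomposed o + combining dot below
```

Grouping by the stripped form is one simple way to see how ambiguous an undiacritized word is: the more variants share a key, the harder the restoration task for that word.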
Ready-to-use webpage scrapers
- Bíbélì Mímọ́ (Biblica, Bible Society of Nigeria)
- Yorùbá Blog
- BBC Yorùbá
Corpus analysis tools
- Dataset character distribution
- Dataset ambiguity statistics → Lexdif, etc for a given corpus
- Dataset scoring (proximity to correctly diacritized text, LM perplexity, KL divergence)
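To give a feel for the character distribution and KL divergence items above, here is a rough standard-library sketch. It does not reproduce the library's own analysis code: `char_distribution` and `kl_divergence` are hypothetical names, and the small floor `eps` for unseen characters is an assumption made only to keep the logarithm defined.

```python
import math
import unicodedata
from collections import Counter


def char_distribution(text):
    """Relative frequency of each non-whitespace character, after NFC."""
    text = unicodedata.normalize("NFC", text)
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) in nats; characters missing from q get the floor eps."""
    return sum(
        pv * math.log(pv / q.get(ch, eps))
        for ch, pv in p.items()
        if pv > 0
    )


reference = char_distribution("ọjọ́ gbogbo ni ti olè")
candidate = char_distribution("ojo gbogbo ni ti ole")
print(kl_divergence(reference, candidate))
```

A larger value suggests the candidate text uses characters differently from the reference, for example because its diacritics are missing.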
Installation
Obtainable from the Python Package Index (PyPI) → pip install iranlowo
Example
- Show computing environment and installation process
- Diacritize a phrase
$ python
Python 3.7.3 (default, Mar 27 2019, 16:54:48)
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import iranlowo.adr as ránlọ
>>> ránlọ.diacritize_text("lootoo ni pe ojo gbogbo ni ti ole")
PRED AVG SCORE: -0.0037, PRED PPL: 1.0037
'lóòtóọ́ ni pé ọjọ́ gbogbo ni ti olè'
- Diacritize phrases; note that we use ipython only because it renders nicer, easy-to-read text colours in the terminal!
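The two numbers printed next to the prediction in the session above appear to be related in the usual way: perplexity is the exponential of the negative average log-likelihood. This is an assumption about how the underlying model reports its score, not something stated in this README, but the arithmetic is consistent with the output shown.

```python
import math


def perplexity(avg_log_likelihood):
    """Perplexity from an average log-likelihood (assumed to be in nats)."""
    return math.exp(-avg_log_likelihood)


# PRED AVG SCORE from the session above
print(round(perplexity(-0.0037), 4))  # 1.0037
```

A perplexity close to 1 means the model was nearly certain about each predicted token.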
Disclaimer
This is beta software. If you pass the diacritizer out-of-domain text, English, Pidgin, or any other non-Yorùbá text, you will experience very marvelous, black-box results.
Since this is a work in progress and we are steadily improving it, if you encounter any problems with correctness or performance, please submit a pull request with corrections or file an issue.
License
This project is licensed under the MIT License.
Download files
File details
Details for the file iranlowo-0.0.8.3.tar.gz.
File metadata
- Download URL: iranlowo-0.0.8.3.tar.gz
- Size: 87.9 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/3.6.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | ae62ea57b96b9d27bcd3e768655f7faffb3df7a1fd4f78f49db1ac9402dca619
MD5 | 22e2aa01ff4918ff850ada8fa482c76d
BLAKE2b-256 | b0e37516f763688cc1bae9e71db3b33c53d5313e16a52caeb2a89a2774e203a1
File details
Details for the file iranlowo-0.0.8.3-py3-none-any.whl.
File metadata
- Download URL: iranlowo-0.0.8.3-py3-none-any.whl
- Size: 87.9 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/3.6.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | 5679c3421f4092033bd86c60efeebf0273910c2b2a8c5fb3358518efb2ba72df
MD5 | e19836c57f28ca0a929c9fd9641bd1a1
BLAKE2b-256 | 3984fb9e39f146f3128c4976b851b92d230ef0de47fab051c92f56f5e69e762a