Utility package for analysis & (pre)processing of Yorùbá text
Project description
Ìrànlọ́wọ́
Ìrànlọ́wọ́ is a set of utilities for analyzing and processing Yorùbá text for NLP tasks. The initial focus is on supporting automatic diacritic restoration (ADR) and machine translation.
Features
ADR tools
- Strip all diacritics from word-types
- Verify that text is NFC or NFD
- Canonicalize a corpus (from MS Word or elsewhere) → NFC
- Split long sentences on certain characters like ";", ",:", etc.
- Compute a score of diacritic ambiguity in a given corpus
- Find all variants of each word-type in a given corpus
- Automatically restore correct diacritics using a pre-trained model
- Partially strip diacritics from word-types
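Several of the operations listed above can be illustrated with Python's standard library alone. The sketch below is not the package's API: the function names (`is_nfc`, `strip_all`, `strip_tones`, `find_variants`, `ambiguity`) are hypothetical, and the ambiguity measure is a simple proxy (the share of stripped forms that map to more than one diacritized variant), assumed here purely for illustration.

```python
import unicodedata
from collections import defaultdict

# Combining grave, acute and macron: the tonal marks. The dot below
# (U+0323) is deliberately left out, so partial stripping keeps it.
TONE_MARKS = {"\u0300", "\u0301", "\u0304"}


def is_nfc(text):
    """Return True if the text is already in NFC form."""
    return text == unicodedata.normalize("NFC", text)


def strip_all(text):
    """Remove every combining mark (tones and dots below)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_tones(text):
    """Remove tonal marks only, keeping dots below; result is NFC."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


def find_variants(tokens):
    """Group diacritized word-types by their fully stripped form."""
    variants = defaultdict(set)
    for token in tokens:
        token = unicodedata.normalize("NFC", token)
        variants[strip_all(token)].add(token)
    return variants


def ambiguity(tokens):
    """Fraction of stripped forms with more than one variant (a proxy)."""
    variants = find_variants(tokens)
    if not variants:
        return 0.0
    ambiguous = sum(1 for forms in variants.values() if len(forms) > 1)
    return ambiguous / len(variants)


if __name__ == "__main__":
    word = "\u00ccr\u00e0nl\u1ecd\u0301w\u1ecd\u0301"  # Ìrànlọ́wọ́
    print(is_nfc(word), strip_all(word), strip_tones(word))
```

Grouping by the fully stripped form is what makes the variant listing and the ambiguity proxy share one pass over the tokens; a real corpus would also need tokenization and case handling, which are omitted here.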
Ready-to-use webpage scrapers
- Bíbélì Mímọ́
- Yoruba Bible - Bible Society of Nigeria
- Yorùbá Blog
- BBC Yorùbá
Corpus analysis tools
- Dataset scoring (proximity to correctly diacritized text, LM perplexity, KL divergence)
- Dataset character distribution
- Dataset ambiguity statistics → Lexdif, etc.
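As a rough illustration of the character distribution and KL divergence items above, the following standard-library sketch compares two texts. The helper names (`char_distribution`, `kl_divergence`) and the `eps` floor for characters missing from the reference distribution are assumptions made for this example, not the package's own functions or settings.

```python
import math
import unicodedata
from collections import Counter


def char_distribution(text):
    """Relative frequency of each non-space code point after NFC.

    Note: this counts code points, so a letter that has no precomposed
    form with its tone mark contributes two entries (base + mark).
    """
    text = unicodedata.normalize("NFC", text)
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: n / total for ch, n in counts.items()}


def kl_divergence(p, q, eps=1e-9):
    """KL(P || Q) in nats; q is floored at eps to avoid log of zero."""
    return sum(
        prob * math.log(prob / max(q.get(ch, 0.0), eps))
        for ch, prob in p.items()
        if prob > 0
    )


if __name__ == "__main__":
    reference = char_distribution("aab")
    candidate = char_distribution("abb")
    print(kl_divergence(reference, candidate))
```

A divergence near zero suggests the candidate text uses characters in roughly the same proportions as the reference; the floor value strongly affects the result when the candidate contains characters the reference lacks, so it should be chosen with care.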
Download files
Source Distribution
- iranlowo-0.0.5.4.tar.gz (65.2 MB)
Built Distribution
- iranlowo-0.0.5.4-py3-none-any.whl (65.2 MB)
Hashes for iranlowo-0.0.5.4-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | f5c884f9341ecf1a335fbfaf0f6adfab36e4aa42e868a7e98b3416a85c94fba5
MD5 | 20b55e1b43245e8e81bd2a89bb9268a9
BLAKE2b-256 | 8044c8d1f98dab639f6dc6f622d6551395c4ac2c2265d2e1ab9e4c7834bd3fd3