Utility package for analysis & (pre)processing of Yorùbá text

Ìrànlọ́wọ́

Ìrànlọ́wọ́ is a set of utilities for analyzing and processing Yorùbá text for NLP tasks. The initial focus is on supporting automatic diacritic restoration (ADR) and machine translation.

Features

ADR tools

  • Strip all diacritics from word-types
  • Verify that text is NFC or NFD
  • Canonicalize a corpus (from MS Word or elsewhere) → NFC
  • Split long sentences on certain punctuation characters (;, :, etc.)
  • Compute a score of diacritic ambiguity in a given corpus
  • Find all variants of each word-type in a given corpus
  • Automatically restore correct diacritics using a pre-trained model
  • Partially strip diacritics from word-types
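
The stripping, normalization-check, and variant-finding features listed above can be illustrated with the Python standard library alone. This is a minimal sketch, not the package's own API: the function names (strip_diacritics, strip_tone_marks, is_nfc, find_variants) are hypothetical, and treating "partial" stripping as removing tone marks while keeping under-dots is an assumption.

```python
import unicodedata
from collections import defaultdict

# Assumption: "tone marks" means combining grave, acute, and macron.
TONE_MARKS = {"\u0300", "\u0301", "\u0304"}


def strip_diacritics(text):
    """Remove every combining mark (tone marks and under-dots)."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


def strip_tone_marks(text):
    """Hypothetical partial strip: drop tone marks, keep under-dots."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


def is_nfc(text):
    """True if the text is already in NFC form."""
    return unicodedata.normalize("NFC", text) == text


def find_variants(tokens):
    """Group NFC-normalized word-types by their fully stripped form."""
    variants = defaultdict(set)
    for token in tokens:
        token = unicodedata.normalize("NFC", token)
        variants[strip_diacritics(token)].add(token)
    return dict(variants)
```

Normalizing to NFC before grouping matters because the same word can arrive as different code-point sequences (for example from MS Word), which would otherwise be counted as separate variants.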

Ready to use webpage scrapers

  • Bíbélì Mímọ́
  • Yoruba Bible - Bible Society of Nigeria
  • Yorùbá Blog
  • BBC Yorùbá

Corpus analysis tools

  • Dataset scoring (proximity to correctly diacritized text, LM perplexity, KL divergence)
  • Dataset character distribution
  • Dataset ambiguity statistics → Lexdif, etc.
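
As an illustration of the character-distribution and KL-divergence ideas above, here is a small standard-library sketch. It is not the package's implementation: char_distribution and kl_divergence are hypothetical names, and the additive smoothing constant is an assumption chosen only to avoid division by zero for characters missing from one corpus.

```python
import math
import unicodedata
from collections import Counter


def char_distribution(text):
    """Relative frequency of each non-whitespace code point, after NFC."""
    text = unicodedata.normalize("NFC", text)
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: n / total for ch, n in counts.items()}


def kl_divergence(p, q, epsilon=1e-9):
    """KL(p || q) in bits over the union of both supports.

    Assumes p and q each sum to 1. epsilon is an assumed smoothing
    constant so that characters absent from q do not give infinity.
    """
    support = set(p) | set(q)
    norm = 1.0 + epsilon * len(support)
    total = 0.0
    for ch in support:
        p_s = (p.get(ch, 0.0) + epsilon) / norm
        q_s = (q.get(ch, 0.0) + epsilon) / norm
        total += p_s * math.log2(p_s / q_s)
    return total
```

A corpus could then be scored by comparing its character distribution against that of a reference text known to be correctly diacritized; a larger divergence suggests the corpus differs more from the reference.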

Download files

Source Distribution

iranlowo-0.0.4.zip (65.2 MB)
