Ethiopian Language NLP Toolkit

Ethiopian Language Toolkit (etltk)

The Ethiopian Language Toolkit (ETLTK) project aims to develop a suite of open source Natural Language Processing modules for Ethiopian languages.
ETLTK is written in Python and takes inspiration from the spaCy and NLTK libraries.

Installation

pip

etltk supports Python 3.6 or later. We recommend that you install etltk via pip, the Python package manager. To install, simply run:
```
  pip install etltk
```

From Source

Alternatively, you can install from source via the ethiopian_language_toolkit git repository, which gives you more flexibility for developing on top of etltk. For this option, run:
```
  git clone https://github.com/robikieq/ethiopian_language_toolkit.git
  
  cd ethiopian_language_toolkit
  
  pip install -e .
```
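
After either install route, it can help to confirm that the package is visible to the interpreter you intend to use. The snippet below is a minimal sketch using only the standard library (`importlib.metadata` needs Python 3.8 or later); the helper name `installed_version` is hypothetical and not part of etltk.

```python
import importlib.util
from importlib import metadata


def installed_version(dist_name: str, module_name: str):
    """Return the installed version string, or None if the module is absent."""
    if importlib.util.find_spec(module_name) is None:
        return None
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        # The module is importable but ships no distribution metadata.
        return "unknown"


# Prints a version string if etltk is installed, otherwise None.
print(installed_version("etltk", "etltk"))
```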

Documentation

https://etltk.netlify.app/

Usage

Amharic text preprocessing with Amharic document

Preprocessing Amharic text is simple: pass the text to the Amharic document and access all annotations from the returned Amharic document object:

  from etltk import Amharic

  sample_text = """
    ሚያዝያ 14፣ 2014 ዓ.ም 🤗 በአገር ደረጃ የሰው ሰራሽ አስተውሎት /Artificial Intelligence/ አሁን ካለበት ዝቅተኛ ደረጃ ወደ ላቀ ደረጃ ለማድረስ፣ ሃገርኛ ቋንቋዎችን ለዓለም ተደራሽ ለማድረግ፣ አገራዊ አቅምን ለማሳደግ እና ተጠቃሚ ለመሆን በጋራ አብሮ መስራቱ እጅግ ጠቃሚ ነው፡፡

    በማሽን ዓስተምሮ (Machine Learning) አማካኝነት የጽሁፍ ናሙናዎች በአርቲፊሻል ኢንተለጀንስ ሥርዓት ለማሰልጠን፣ የጽሁፍ ዳታን መሰብሰብ እና ማደራጀት፤ የናቹራል ላንጉዌጅ ፕሮሰሲንግ ቱሎችን /Natural Language Processing Tools/ በመጠቀም የጽሁፍ ዳታን ፕሮሰስ ማድረግ ተቀዳሚ እና መሰረታዊ ጉዳይ ነው።
  """

  # Annotating Amharic document
  doc = Amharic(sample_text)

  # print the `clean` text:
  print(doc)
  
  # output: Amharic("ሚያዝያ ዓመተ ምህረት በአገር ደረጃ የሰው ሰራሽ አስተውሎት አሁን ካለበት ዝቅተኛ ደረጃ ወደ ላቀ ደረጃ ለማድረስ ሀገርኛ ቋንቋዎችን ለአለም ተደራሽ ለማድረግ አገራዊ አቅምን ለማሳደግ እና ተጠቃሚ ለመሆን በጋራ አብሮ መስራቱ እጅግ ጠቃሚ ነው በማሽን አስተምሮ አማካኝነት የፅሁፍ ናሙናዎች በአርቲፊሻል ኢንተለጀንስ ስርአት ለማሰልጠን የፅሁፍ ዳታን መሰብሰብ እና ማደራጀት የናቹራል ላንጉዌጅ ፕሮሰሲንግ ቱሎችን በመጠቀም የፅሁፍ ዳታን ፕሮሰስ ማድረግ ተቀዳሚ እና መሰረታዊ ጉዳይ ነው")

Here is another example of performing text cleaning on a piece of plaintext using the clean_amharic function:

from etltk.lang.am import (
  preprocessing,
  clean_amharic
)

sample_text = """
  ሚያዝያ 14፣ 2014 ዓ.ም 🤗 በአገር ደረጃ የሰው ሰራሽ አስተውሎት /Artificial Intelligence/ አሁን ካለበት ዝቅተኛ ደረጃ ወደ ላቀ ደረጃ ለማድረስ፣ ሃገርኛ ቋንቋዎችን ለዓለም ተደራሽ ለማድረግ፣ አገራዊ አቅምን ለማሳደግ እና ተጠቃሚ ለመሆን በጋራ አብሮ መስራቱ እጅግ ጠቃሚ ነው፡፡

  በማሽን ዓስተምሮ (Machine Learning) አማካኝነት የጽሁፍ ናሙናዎች በአርቲፊሻል ኢንተለጀንስ ሥርዓት ለማሰልጠን፣ የጽሁፍ ዳታን መሰብሰብ እና ማደራጀት፤ የናቹራል ላንጉዌጅ ፕሮሰሲንግ ቱሎችን /Natural Language Processing Tools/ በመጠቀም የጽሁፍ ዳታን ፕሮሰስ ማድረግ ተቀዳሚ እና መሰረታዊ ጉዳይ ነው።
"""

# Define a custom preprocessor pipeline
custom_pipeline = [
  preprocessing.remove_emojis, 
  preprocessing.remove_digits,
  preprocessing.remove_ethiopic_punct,
  preprocessing.remove_english_chars, 
  preprocessing.remove_punct
]

# `clean_amharic` applies the custom pipeline if one is given; otherwise it uses the default pipeline
cleaned = clean_amharic(sample_text, abbrev=False, pipeline=custom_pipeline)

# print the `clean` text:
print(cleaned)
# output: ሚያዝያ ዓመተ ምህረት በአገር ደረጃ የሰው ሰራሽ አስተውሎት አሁን ካለበት ዝቅተኛ ደረጃ ወደ ላቀ ደረጃ ለማድረስ ሀገርኛ ቋንቋዎችን ለአለም ተደራሽ ለማድረግ አገራዊ አቅምን ለማሳደግ እና ተጠቃሚ ለመሆን በጋራ አብሮ መስራቱ እጅግ ጠቃሚ ነው በማሽን አስተምሮ አማካኝነት የፅሁፍ ናሙናዎች በአርቲፊሻል ኢንተለጀንስ ስርአት ለማሰልጠን የፅሁፍ ዳታን መሰብሰብ እና ማደራጀት የናቹራል ላንጉዌጅ ፕሮሰሲንግ ቱሎችን በመጠቀም የፅሁፍ ዳታን ፕሮሰስ ማድረግ ተቀዳሚ እና መሰረታዊ ጉዳይ ነው
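
To illustrate the idea of a custom pipeline, here is a minimal standalone sketch of how a list of string-to-string steps can be chained. It does not use etltk: the step functions and `run_pipeline` are hypothetical stand-ins, and the way `clean_amharic` actually applies its pipeline is an assumption here.

```python
import re
from functools import reduce
from typing import Callable, Iterable

Step = Callable[[str], str]


def strip_digits(text: str) -> str:
    """Hypothetical step: drop ASCII digits."""
    return re.sub(r"[0-9]", "", text)


def strip_ethiopic_punct(text: str) -> str:
    """Hypothetical step: replace Ethiopic punctuation (U+1361 to U+1368) with a space."""
    return re.sub(r"[\u1361-\u1368]", " ", text)


def collapse_spaces(text: str) -> str:
    """Hypothetical step: squeeze runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def run_pipeline(text: str, pipeline: Iterable[Step]) -> str:
    """Feed the output of each step into the next, left to right."""
    return reduce(lambda acc, step: step(acc), pipeline, text)


cleaned = run_pipeline(
    "ሚያዝያ 14፣ 2014 ነው።",
    [strip_digits, strip_ethiopic_punct, collapse_spaces],
)
print(cleaned)
```

Because each step takes and returns a string, steps can be reordered or dropped without touching the others, which is what makes a user-supplied pipeline list practical.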

Tokenization - Sentence

Here is a simple example of performing sentence tokenization on a piece of plaintext using the Amharic document.
Within the Amharic document, annotations are further stored in Sentences:

from etltk import Amharic

sample_text = """
  የማሽን ለርኒንግ ስልተ-ቀመሮች  (Algorithms) በመጠቀም ቋንቋዎችን መለየት እና መረዳት፣ የጽሁፍ ይዘቶችን መለየት፣ የቋንቋን መዋቅር መተንተን የሚያስችሉ የሃገሪኛ ናቹራል ላንጉዌጅ ፕሮሰሲንግ ቱሎች (NLP tools) ፣ ስልተ-ቀመሮች እና ሞዴሎችን ማዘጋጀት ተገቢ ነው። በዚህም መሰረት አማርኛ፣ አፋን ኦሮሞ፣ ሶማሊኛ እና ትግርኛ ቋንቋዎችን ለማሽን የማስተማር ሂደትን ቀላልና የተቀላተፍ እንዲሆን ያስችላል፡፡
"""

# Annotating Amharic Text
doc = Amharic(sample_text)

# print the list of `Sentence` objects in the document:
print(doc.sentences)
# output: [Sentence("የማሽን ለርኒንግ ስልተቀመሮች በመጠቀም ቋንቋዎችን መለየት እና መረዳት የፅሁፍ ይዘቶችን መለየት የቋንቋን መዋቅር መተንተን የሚያስችሉ የሀገሪኛ ናቹራል ላንጉዌጅ ፕሮሰሲንግ ቱሎች ስልተቀመሮች እና ሞዴሎችን ማዘጋጀት ተገቢ ነው"), Sentence("በዚህም መሰረት አማርኛ አፋን ኦሮሞ ሶማሊኛ እና ትግርኛ ቋንቋዎችን ለማሽን የማስተማር ሂደትን ቀላልና የተቀላተፍ እንዲሆን ያስችላል")]

Here is another example of performing sentence tokenization on a piece of plaintext using the sent_tokenize function:

from etltk.tokenize.am import sent_tokenize

sample_text = """
  የማሽን ለርኒንግ ስልተ-ቀመሮች  (Algorithms) በመጠቀም ቋንቋዎችን መለየት እና መረዳት፣ የጽሁፍ ይዘቶችን መለየት፣ የቋንቋን መዋቅር መተንተን የሚያስችሉ የሃገሪኛ ናቹራል ላንጉዌጅ ፕሮሰሲንግ ቱሎች (NLP tools) ፣ ስልተ-ቀመሮች እና ሞዴሎችን ማዘጋጀት ተገቢ ነው። በዚህም መሰረት አማርኛ፣ አፋን ኦሮሞ፣ ሶማሊኛ እና ትግርኛ ቋንቋዎችን ለማሽን የማስተማር ሂደትን ቀላልና የተቀላተፍ እንዲሆን ያስችላል፡፡
"""

# sentence tokenization
sentences = sent_tokenize(sample_text)

# print the list of sentences:
print(sentences)
# output: ['የማሽን ለርኒንግ ስልተቀመሮች በመጠቀም ቋንቋዎችን መለየት እና መረዳት የፅሁፍ ይዘቶችን መለየት የቋንቋን መዋቅር መተንተን የሚያስችሉ የሀገሪኛ ናቹራል ላንጉዌጅ ፕሮሰሲንግ ቱሎች ስልተቀመሮች እና ሞዴሎችን ማዘጋጀት ተገቢ ነው', 'በዚህም መሰረት አማርኛ አፋን ኦሮሞ ሶማሊኛ እና ትግርኛ ቋንቋዎችን ለማሽን የማስተማር ሂደትን ቀላልና የተቀላተፍ እንዲሆን ያስችላል']
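
For readers curious about what sentence tokenization involves for Amharic, the following is a rough standalone sketch that splits on sentence-final punctuation. It is not etltk's implementation: the function name `split_sentences` and the choice of delimiters are assumptions, and unlike the toolkit output above it does not clean punctuation inside each sentence.

```python
import re

# Sentence-final marks: Ethiopic full stop (U+1362), a doubled Ethiopic
# wordspace (U+1361 twice) that is often typed in its place, the Ethiopic
# question mark (U+1367), and ASCII "?" and "!".
SENTENCE_END = re.compile(r"(?:\u1362|\u1361{2}|\u1367|[?!])+")


def split_sentences(text: str) -> list:
    """Split on sentence-final punctuation and drop empty pieces."""
    pieces = SENTENCE_END.split(text)
    # Collapse internal whitespace so multi-line input yields tidy sentences.
    return [" ".join(piece.split()) for piece in pieces if piece.strip()]


print(split_sentences("ተገቢ ነው። ያስችላል፡፡"))
```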

Tokenization - Word

Here is a simple example of performing word tokenization on a piece of plaintext using the Amharic document.
Within the Amharic document, annotations are further stored in Words:

from etltk import Amharic

sample_text = """
  “ተረኛ፣ ተረኛ!” አለ ነርሱ። ወይዘሮ
  ታሪኳ፣ “አቤት!” ብለው የሁለት
  ዓመት ልጃቸውን ይዘው ገቡ።
  “ምኑን ነው ያመመው?” ዶክተሯ
  ጠየቁ። “አያዩትም! ፀጉሩ ሳስቷል፤
  ሆዱ ተነፍቷል፤ ድዱም ይደማል”
  አሉ ወይዘሮ ታሪኳ። ዶክተሯም፣
  “በጣም ያሳዝናል፤ እንደዚህ
  ያደረገው የተመጣጠነ ምግብ አለማግኘቱ ነው። አሁንም ወተት፣
  እንቁላል፣ ማር፣ አትክልትና ፍራፍሬ ይመግቡት፤ ቶሎ ይሻለዋል፤
  ለአሁኑ ግን መድኃኒት አዝለታለሁ” በማለት አስረዷቸው። ወይዘሮ
  ታሪኳም “ወይ አለማወቅ! ልጄን በምግብ እጥረት ገድዬው ነበር"
  በማለት አለቀሱ።

  """

# Annotating Amharic Text
doc = Amharic(sample_text)

# print the list of `AmharicWord` objects in the document:
print(doc.words)
# output: ['ተረኛ', 'ተረኛ', 'አለ', 'ነርሱ', 'ወይዘሮ', 'ታሪኳ', 'አቤት', 'ብለው', 'የሁለት', 'አመት', 'ልጃቸውን', 'ይዘው', 'ገቡ', 'ምኑን', 'ነው', 'ያመመው', 'ዶክተሯ', 'ጠየቁ', 'አያዩትም', 'ፀጉሩ', 'ሳስቷል', 'ሆዱ', 'ተነፍቷል', 'ድዱም', 'ይደማል', 'አሉ', 'ወይዘሮ', 'ታሪኳ', 'ዶክተሯም', 'በጣም', 'ያሳዝናል', 'እንደዚህ', 'ያደረገው', 'የተመጣጠነ', 'ምግብ', 'አለማግኘቱ', 'ነው', 'አሁንም', 'ወተት', 'እንቁላል', 'ማር', 'አትክልትና', 'ፍራፍሬ', 'ይመግቡት', 'ቶሎ', 'ይሻለዋል', 'ለአሁኑ', 'ግን', 'መድሀኒት', 'አዝለታለሁ', 'በማለት', 'አስረዷቸው', 'ወይዘሮ', 'ታሪኳም', 'ወይ', 'አለማወቅ', 'ልጄን', 'በምግብ', 'እጥረት', 'ገድዬው', 'ነበር', 'በማለት', 'አለቀሱ']

Here is another example of performing word tokenization on a piece of plaintext using the word_tokenize function:

from etltk.tokenize.am import word_tokenize

sample_text = """
  “ተረኛ፣ ተረኛ!” አለ ነርሱ። ወይዘሮ
  ታሪኳ፣ “አቤት!” ብለው የሁለት
  ዓመት ልጃቸውን ይዘው ገቡ።
  “ምኑን ነው ያመመው?” ዶክተሯ
  ጠየቁ። “አያዩትም! ፀጉሩ ሳስቷል፤
  ሆዱ ተነፍቷል፤ ድዱም ይደማል”
  አሉ ወይዘሮ ታሪኳ። ዶክተሯም፣
  “በጣም ያሳዝናል፤ እንደዚህ
  ያደረገው የተመጣጠነ ምግብ አለማግኘቱ ነው። አሁንም ወተት፣
  እንቁላል፣ ማር፣ አትክልትና ፍራፍሬ ይመግቡት፤ ቶሎ ይሻለዋል፤
  ለአሁኑ ግን መድኃኒት አዝለታለሁ” በማለት አስረዷቸው። ወይዘሮ
  ታሪኳም “ወይ አለማወቅ! ልጄን በምግብ እጥረት ገድዬው ነበር"
  በማለት አለቀሱ።

"""
  
# word tokenization
words = word_tokenize(sample_text)

# print the list of words:
print(words)
# output: ['ተረኛ', 'ተረኛ', 'አለ', 'ነርሱ', 'ወይዘሮ', 'ታሪኳ', 'አቤት', 'ብለው', 'የሁለት', 'አመት', 'ልጃቸውን', 'ይዘው', 'ገቡ', 'ምኑን', 'ነው', 'ያመመው', 'ዶክተሯ', 'ጠየቁ', 'አያዩትም', 'ፀጉሩ', 'ሳስቷል', 'ሆዱ', 'ተነፍቷል', 'ድዱም', 'ይደማል', 'አሉ', 'ወይዘሮ', 'ታሪኳ', 'ዶክተሯም', 'በጣም', 'ያሳዝናል', 'እንደዚህ', 'ያደረገው', 'የተመጣጠነ', 'ምግብ', 'አለማግኘቱ', 'ነው', 'አሁንም', 'ወተት', 'እንቁላል', 'ማር', 'አትክልትና', 'ፍራፍሬ', 'ይመግቡት', 'ቶሎ', 'ይሻለዋል', 'ለአሁኑ', 'ግን', 'መድሀኒት', 'አዝለታለሁ', 'በማለት', 'አስረዷቸው', 'ወይዘሮ', 'ታሪኳም', 'ወይ', 'አለማወቅ', 'ልጄን', 'በምግብ', 'እጥረት', 'ገድዬው', 'ነበር', 'በማለት', 'አለቀሱ']
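
As a rough illustration of the word-level step, the sketch below replaces punctuation with spaces and splits on whitespace. It is a simplified stand-in rather than etltk's tokenizer: `split_words` and the punctuation set are assumptions, and it performs no character normalization (the toolkit output above, for example, also rewrites ዓመት as አመት).

```python
import re

# Ethiopic punctuation U+1361 to U+1368.
ETHIOPIC_PUNCT = "".join(chr(cp) for cp in range(0x1361, 0x1369))
# Common ASCII marks plus curly double and single quotes.
OTHER_PUNCT = "!?\"'.,:;()\u201c\u201d\u2018\u2019"
NON_WORD = re.compile("[" + re.escape(ETHIOPIC_PUNCT + OTHER_PUNCT) + "]")


def split_words(text: str) -> list:
    """Replace punctuation with spaces, then split on any whitespace."""
    return NON_WORD.sub(" ", text).split()


print(split_words("“ተረኛ፣ ተረኛ!” አለ ነርሱ።"))
```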

Normalization

Character-level normalization, such as treating "ጸሀይ" and "ፀሐይ" as the same word
Labialized character normalization, such as "ሞልቱዋል" to "ሞልቷል"
Short form expansion, such as "አ.አ" to "አዲስ አበባ"
Punctuation normalization, such as :: to ።

Here is a simple example of performing normalization on a piece of plaintext using the normalize function:

from etltk.lang.am import normalize

sample_text = """
  ሚያዝያ 14፣ 2014 ዓ.ም በዓገር ደረጃ የሰው ሰራሽ አስተውሎት የውይይት መድረክ ላይ
  የሃገርኛ ቋንቋዎች ትርጉም አገልግሎት፣ 
  ቻትቦት (የውይይት መለዋወጫ ሮቦት): 
  የፅሁፍ ሰነዶች ለመለየት፣ የቃላት ትክክለኛነትን ለማረጋገጥ፣ 
  በቋንቋን ሕግጋት መሠረት ጽሑፎችን ለማዋቀር እና ለመመስረት፣ 
  ረጅም ጽሁፎችን ለማሳጠር፣ አንኳር ጉዳዮችን መለየት ወይም ጥቅል ሃሳብ ለማውጣት፣ 
  ንግግርን ወደ ጽሁፍ ለመቀየር የሚያስችሉ መተግበሪያዎችን ማልማት አስረላጊነቱ ተገልጹዋል::
"""

# normalization
normalized_text = normalize(sample_text)

# print the normalized text:
print(normalized_text)
# output: ሚያዝያ 14፣ 2014 አመተ ምህረት በአገር ደረጃ የሰው ሰራሽ አስተውሎት የውይይት መድረክ ላይ
# የሀገርኛ ቋንቋዎች ትርጉም አገልግሎት፣ 
# ቻትቦት (የውይይት መለዋወጫ ሮቦት)፡ 
# የፅሁፍ ሰነዶች ለመለየት፣ የቃላት ትክክለኛነትን ለማረጋገጥ፣ 
# በቋንቋን ህግጋት መሰረት ፅሁፎችን ለማዋቀር እና ለመመስረት፣ 
# ረጅም ፅሁፎችን ለማሳጠር፣ አንኳር ጉዳዮችን መለየት ወይም ጥቅል ሀሳብ ለማውጣት፣ 
# ንግግርን ወደ ፅሁፍ ለመቀየር የሚያስችሉ መተግበሪያዎችን ማልማት አስረላጊነቱ ተገልጿል።

Here is another example of performing normalization on a piece of plaintext using the normalize_char, normalize_punct, normalize_labialized, and normalize_shortened functions:

from etltk.lang.am.normalizer import ( 
  normalize_labialized, 
  normalize_shortened,
  normalize_punct,
  normalize_char
)

# normalize labialized 
normalized_text = normalize_labialized("ንግግርን ወደ ጽሁፍ ለመቀየር የሚያስችሉ መተግበሪያዎችን ማልማት አስረላጊነቱ ተገልጹዋል")
print(normalized_text)
# output: ንግግርን ወደ ፅሁፍ ለመቀየር የሚያስችሉ መተግበሪያዎችን ማልማት አስረላጊነቱ ተገልጿል

# normalize short forms
normalized_text = normalize_shortened("ሚያዝያ 14፣ 2014 ዓ.ም በዓገር ደረጃ የሰው ሰራሽ አስተውሎት የውይይት መድረክ")
print(normalized_text)
# output: ሚያዝያ 14፣ 2014 ዓመተ ምህረት በአገር ደረጃ የሰው ሰራሽ አስተውሎት የውይይት መድረክ

# normalize punctuation
normalized_text = normalize_punct("መተግበሪያዎችን ማልማት አስረላጊነቱ ተገልጹዋል::")
print(normalized_text)
# output: መተግበሪያዎችን ማልማት አስረላጊነቱ ተገልጿል።

# normalize characters
normalized_text = normalize_char("በቋንቋዉ ሕግጋት መሠረት ጽሑፎችን ማዋቀር እና መመሥረት")
print(normalized_text)
# output: በቋንቋዉ ህግጋት መሰረት ፅሁፎችን ማዋቀር እና መመስረት
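
The four normalization steps can be pictured as simple table lookups. The sketch below is a standalone approximation built only from the examples in this section. It is not etltk's code: the tables are deliberately tiny, the name `normalize_sketch` is hypothetical, and the rule that fourth-order forms fold to the first order is an assumption inferred from outputs such as ሃሳብ to ሀሳብ.

```python
# Abbreviations and labialized pairs taken from the examples above; a real
# table would be much larger.
SHORT_FORMS = {"ዓ.ም": "ዓመተ ምህረት", "አ.አ": "አዲስ አበባ"}
LABIALIZED = {"\u1271\u12cb": "\u1277", "\u1339\u12cb": "\u133f"}  # ቱዋ -> ቷ, ጹዋ -> ጿ

# Homophone series folded onto one canonical series (first seven orders each):
# ሐ and ኀ -> ሀ, ሠ -> ሰ, ዐ -> አ, ጸ -> ፀ.
SERIES = [(0x1210, 0x1200), (0x1280, 0x1200), (0x1220, 0x1230),
          (0x12D0, 0x12A0), (0x1338, 0x1340)]
CHAR_TABLE = {src + i: dst + i for src, dst in SERIES for i in range(7)}
# Assumption: fourth-order forms fold to the first order (ሃ, ሓ, ኃ -> ሀ and ኣ, ዓ -> አ).
CHAR_TABLE.update({0x1203: 0x1200, 0x1213: 0x1200, 0x1283: 0x1200,
                   0x12A3: 0x12A0, 0x12D3: 0x12A0})


def normalize_sketch(text: str) -> str:
    """Apply the four normalization steps in a fixed order."""
    for short, full in SHORT_FORMS.items():      # short form expansion
        text = text.replace(short, full)
    for pair, single in LABIALIZED.items():      # labialized characters
        text = text.replace(pair, single)
    # Punctuation: "::" or a doubled Ethiopic wordspace becomes a full stop.
    text = text.replace("::", "\u1362").replace("\u1361\u1361", "\u1362")
    return text.translate(CHAR_TABLE)            # character-level folding


print(normalize_sketch("መሠረት ተገልጹዋል::"))
```

Short forms are expanded first so that the abbreviation keys still match before character folding rewrites letters such as ዓ.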

Features

Text preprocessing functions.

from etltk.lang.am import preprocessing

| Function | Description |
| --- | --- |
| remove_whitespaces | Remove extra spaces, tabs, and new lines from a text string |
| remove_links | Remove URLs from a text string |
| remove_tags | Remove HTML tags from a text string |
| remove_emojis | Remove emojis from a text string |
| remove_email | Remove email addresses from a text string |
| remove_digits | Remove all digits from a text string |
| remove_english_chars | Remove ASCII characters from a text string |
| remove_arabic_chars | Remove Arabic characters and numerals from a text string |
| remove_chinese_chars | Remove Chinese characters from a text string |
| remove_ethiopic_digits | Remove all Ethiopic digits from a text string |
| remove_ethiopic_punct | Remove Ethiopic punctuation from a text string |
| remove_non_ethiopic | Remove non-Ethiopic characters from a text string |
| remove_stopwords | Remove stop words |
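
Most of these functions amount to a regular-expression substitution over a Unicode range. The sketch below shows three such filters written from scratch. The names `drop_links`, `drop_ethiopic_digits`, and `keep_ethiopic_only` are hypothetical, and the exact patterns etltk uses are not known here; the ranges are taken from the Unicode Ethiopic block (U+1200 to U+137F, with digits and numbers at U+1369 to U+137C).

```python
import re


def drop_links(text: str) -> str:
    """Remove http(s) URLs up to the next whitespace."""
    return re.sub(r"https?://\S+", "", text)


def drop_ethiopic_digits(text: str) -> str:
    """Remove Ethiopic digits and numbers (U+1369 to U+137C)."""
    return re.sub(r"[\u1369-\u137c]", "", text)


def keep_ethiopic_only(text: str) -> str:
    """Remove everything outside the Ethiopic block except whitespace."""
    return re.sub(r"[^\u1200-\u137f\s]", "", text)


print(keep_ethiopic_only(drop_links("ሰነድ https://etltk.netlify.app/ docs")))
```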

Release history

0.0.22 (May 16, 2022), this version
0.0.20 (Apr 27, 2022)
0.0.18 (Apr 25, 2022)
0.0.14 (Apr 23, 2022)
0.0.12 (Apr 22, 2022)

