This is an Amharic document segmentation and normalization tool

Amharic Segmenter and Tokenizer

This is a simple script that splits an Amharic document into sentences and tokens. If you find an issue, please let us know in the GitHub Issues.

The Segmenter is part of the Semantic Models for Amharic Project.

Usage

Install the segmenter: pip install amseg

Tokenization and Segmentation

Use the following code for sentence segmentation and word tokenization:

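from amseg.amharicSegmenter import AmharicSegmenter
# Punctuation lists passed to the constructor (left empty in this example)
sent_punct = []
word_punct = []
segmenter = AmharicSegmenter(sent_punct, word_punct)
# Split a sentence into word and punctuation tokens
words = segmenter.amharic_tokenizer("እአበበ በሶ በላ።")
# Split a text into sentences
sentences = segmenter.tokenize_sentence("እአበበ በሶ በላ። ከበደ ጆንያ፤ ተሸከመ፡!ለምን?")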

Outputs

words = ['እአበበ', 'በሶ', 'በላ', '።']

sentences = ['እአበበ በሶ በላ።', 'ከበደ ጆንያ፤ ተሸከመ፡!', 'ለምን?']
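
To make the behaviour above easier to reason about, here is a minimal standalone sketch that reproduces the same outputs with regular expressions. It is not the amseg implementation: the function names `split_sentences` and `tokenize_words` and the two punctuation sets are assumptions chosen only to match the example.

```python
import re

# Assumed punctuation sets for this sketch (not taken from amseg itself).
SENTENCE_END = "።!?"      # marks that close a sentence in the example above
WORD_PUNCT = "።፤፡፣!?"    # marks emitted as separate tokens


def split_sentences(text):
    """Split text after each run of sentence-ending marks."""
    end = re.escape(SENTENCE_END)
    # A sentence is a run of non-terminators plus any terminators that follow.
    pieces = re.findall(rf"[^{end}]+[{end}]*", text)
    return [p.strip() for p in pieces if p.strip()]


def tokenize_words(text):
    """Return words and punctuation marks as separate tokens."""
    punct = re.escape(WORD_PUNCT)
    # Either a single punctuation mark, or a run of non-space, non-punctuation.
    return re.findall(rf"[{punct}]|[^\s{punct}]+", text)


print(tokenize_words("እአበበ በሶ በላ።"))
# ['እአበበ', 'በሶ', 'በላ', '።']
print(split_sentences("እአበበ በሶ በላ። ከበደ ጆንያ፤ ተሸከመ፡!ለምን?"))
# ['እአበበ በሶ በላ።', 'ከበደ ጆንያ፤ ተሸከመ፡!', 'ለምን?']
```

In this sketch ፤ and ፡ do not end a sentence, which is why the second sentence keeps them and only breaks at the exclamation mark.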

Romanization and Normalization

The following code shows how to normalize and romanize a given Amharic text:

from amseg.amharicNormalizer import AmharicNormalizer as normalizer
from amseg.amharicRomanizer import AmharicRomanizer as romanizer
normalized = normalizer.normalize('ሑለት ሦስት')
romanized = romanizer.romanize('ሑለት ሦስት')

Outputs

normalized = 'ሁለት ሶስት'

romanized = 'ḥulat śosət'
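
The normalized output suggests that variant characters are replaced by a canonical form. A minimal sketch of that idea, assuming a plain character-for-character substitution, is shown below. The table holds only the two pairs visible in the example, and `normalize_sketch` is a hypothetical name, not part of amseg.

```python
# Toy table: only the two replacements visible in the example above.
# A real normalizer would need a much larger mapping.
HOMOPHONES = str.maketrans({"ሑ": "ሁ", "ሦ": "ሶ"})


def normalize_sketch(text):
    """Replace each variant character with its canonical counterpart."""
    return text.translate(HOMOPHONES)


print(normalize_sketch("ሑለት ሦስት"))
# ሁለት ሶስት
```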

Transliteration to Amharic Fidel

The following code shows how to transliterate a given Latin-script text into Amharic Fidel script:

from amseg.amharicTranslitrator import AmharicTranslitrator as transliterator
transliterated = transliterator.transliterate('misa belah')

Outputs

transliterated = 'ሚሳ በላህ'
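
One common way to approach this kind of transliteration is greedy longest-match lookup against a syllable table. The sketch below illustrates that technique only. It is not how amseg works internally, the table is a toy inferred from the single example above, and `transliterate_sketch` is a hypothetical name.

```python
# Toy syllable table inferred from the example 'misa belah' (an assumption).
SYLLABLES = {"mi": "ሚ", "sa": "ሳ", "be": "በ", "la": "ላ", "h": "ህ"}


def transliterate_sketch(text):
    """Map Latin chunks to Fidel, trying the longest chunk first."""
    max_len = max(len(key) for key in SYLLABLES)
    out = []
    i = 0
    while i < len(text):
        for size in range(max_len, 0, -1):
            chunk = text[i:i + size]
            if chunk in SYLLABLES:
                out.append(SYLLABLES[chunk])
                i += len(chunk)
                break
        else:
            # No table entry matched: copy the character through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


print(transliterate_sketch("misa belah"))
# ሚሳ በላህ
```

Characters that are not in the table, such as the space, pass through unchanged.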

Publications

To cite the Amharic segmenter/tokenizer tool, use the following paper

@Article{fi13110275,
AUTHOR = {Yimam, Seid Muhie and Ayele, Abinew Ali and Venkatesh, Gopalakrishnan and Gashaw, Ibrahim and Biemann, Chris},
TITLE = {Introducing Various Semantic Models for Amharic: Experimentation and Evaluation with Multiple Tasks and Datasets},
JOURNAL = {Future Internet},
VOLUME = {13},
YEAR = {2021},
NUMBER = {11},
ARTICLE-NUMBER = {275},
URL = {https://www.mdpi.com/1999-5903/13/11/275},
ISSN = {1999-5903},
DOI = {10.3390/fi13110275}
}


