This is an Amharic document segmentation and normalization tool

Project description

Amharic Segmenter and tokenizer

This is a simple script that splits an Amharic document into sentences and tokens. If you find an issue, please report it in the project's GitHub Issues.

The Segmenter is part of the Semantic Models for Amharic Project.

Usage

Install the segmenter: pip install amseg

Tokenization and Segmentation

Use the following code for sentence segmentation and word tokenization:

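from amseg.amharicSegmenter import AmharicSegmenter

# Sentence-level and word-level punctuation lists, left empty in this example
sent_punct = []
word_punct = []
segmenter = AmharicSegmenter(sent_punct, word_punct)

# Split a sentence into word tokens
words = segmenter.amharic_tokenizer("እአበበ በሶ በላ።")
# Split a text into sentences
sentences = segmenter.tokenize_sentence("እአበበ በሶ በላ። ከበደ ጆንያ፤ ተሸከመ፡!ለምን?")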

Outputs

words = ['እአበበ', 'በሶ', 'በላ', '።']

sentences = ['እአበበ በሶ በላ።', 'ከበደ ጆንያ፤ ተሸከመ፡!', 'ለምን?']
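
To illustrate the idea behind sentence segmentation and word tokenization, here is a minimal regex-based sketch. It is not amseg's implementation: the names SENT_END, WORD_PUNCT, split_sentences and tokenize_words are hypothetical, and the punctuation sets are assumptions chosen so that the sketch reproduces the outputs shown above.

```python
import re

# Assumed sentence-ending marks: Ethiopic full stop, '?', '!', Ethiopic question mark.
SENT_END = "።?!፧"
# Assumed marks that are split off as separate word tokens.
WORD_PUNCT = "።፣፤፥፦፡?!፧,;:"


def split_sentences(text):
    """Split text after each run of sentence-ending marks."""
    end = re.escape(SENT_END)
    # One sentence = non-terminators followed by zero or more terminators.
    pattern = re.compile(r"[^%s]+[%s]*" % (end, end))
    pieces = (m.group().strip() for m in pattern.finditer(text))
    return [p for p in pieces if p]


def tokenize_words(text):
    """Return words and punctuation marks as separate tokens."""
    punct = re.escape(WORD_PUNCT)
    # Either a run of non-space, non-punctuation characters, or one mark.
    return re.findall(r"[^\s%s]+|[%s]" % (punct, punct), text)


print(tokenize_words("እአበበ በሶ በላ።"))
print(split_sentences("እአበበ በሶ በላ። ከበደ ጆንያ፤ ተሸከመ፡!ለምን?"))
```

Keeping the punctuation sets in plain strings makes it easy to experiment with which marks end a sentence, which is presumably the role of the sent_punct and word_punct arguments in the library.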

Romanization and Normalization

The following code shows how to normalize and romanize a given Amharic text:

from amseg.amharicNormalizer import AmharicNormalizer as normalizer
from amseg.amharicRomanizer import AmharicRomanizer as romanizer
normalized = normalizer.normalize('ሑለት ሦስት')
romanized = romanizer.romanize('ሑለት ሦስት')

Outputs

normalized = 'ሁለት ሶስት'

romanized = 'ḥulat śosət'
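
The normalization example above replaces ሑ with ሁ and ሦ with ሶ, that is, it folds homophone characters onto a single spelling. A minimal sketch of that idea with str.translate follows. It is an illustration only, not the library's code: the names SERIES, TABLE and normalize_homophones are hypothetical, and the two series folded here are an assumption based solely on that example.

```python
# Assumption: fold the first seven vowel orders of two homophone series.
#   U+1210..U+1216 (the ሐ series) -> U+1200..U+1206 (the ሀ series)
#   U+1220..U+1226 (the ሠ series) -> U+1230..U+1236 (the ሰ series)
SERIES = [(0x1210, 0x1200), (0x1220, 0x1230)]

# Map each source code point to the matching order in the target series.
TABLE = {src + i: dst + i for src, dst in SERIES for i in range(7)}


def normalize_homophones(text):
    """Replace characters from the folded series; leave all others as-is."""
    return text.translate(TABLE)


print(normalize_homophones("ሑለት ሦስት"))  # ሁለት ሶስት
```

A full normalizer would likely cover more series and labialized forms; the table-driven approach extends to those by adding entries to SERIES or TABLE.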

Transliteration to Amharic Fidel

The following code shows how to transliterate Latin-script text into the Amharic Fidel script:

from amseg.amharicTranslitrator import AmharicTranslitrator as transliterator
transliterated = transliterator.transliterate('misa belah')

Outputs

transliterated = 'ምሳ በላህ'

Publications

To cite the Amharic segmenter/tokenizer tool, use the following paper:

@Article{fi13110275,
AUTHOR = {Yimam, Seid Muhie and Ayele, Abinew Ali and Venkatesh, Gopalakrishnan and Gashaw, Ibrahim and Biemann, Chris},
TITLE = {Introducing Various Semantic Models for Amharic: Experimentation and Evaluation with Multiple Tasks and Datasets},
JOURNAL = {Future Internet},
VOLUME = {13},
YEAR = {2021},
NUMBER = {11},
ARTICLE-NUMBER = {275},
URL = {https://www.mdpi.com/1999-5903/13/11/275},
ISSN = {1999-5903},
DOI = {10.3390/fi13110275}
}
