amharicNLP is a Python package for Amharic Natural Language Processing (NLP) and text preprocessing.
🇪🇹 Amharic NLP Toolkit
Amharic NLP Toolkit is a lightweight, powerful, and easy-to-use Natural Language Processing (NLP) toolkit designed specifically for the Amharic language. It provides complete tools for Amharic text preprocessing, cleaning, tokenization, normalization, stopword removal, stemming, lemmatization, and sentiment analysis.
It is intended as a preprocessing layer for machine learning, deep learning, and LLM projects that work with Amharic text.
🌍 Why Amharic Needs Its Own NLP Toolkit
Amharic, the working language of Ethiopia’s federal government, is morphologically rich and syntactically complex.
A single Amharic word can contain:
✔️ Subject ✔️ Tense ✔️ Negation ✔️ Verb root ✔️ Suffix
Example: “አልሄደም” ("he did not go") = negation prefix አል + verb ሄደ + suffix ም.
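The decomposition above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how amharicNLP analyses words: `split_negation`, `NEG_PREFIX`, and `NEG_SUFFIX` are hypothetical names, and real Amharic morphology has many more patterns than this single circumfix.

```python
# Hypothetical helper: splits the negation circumfix አል...ም from a verb form.
NEG_PREFIX = "አል"
NEG_SUFFIX = "ም"


def split_negation(word: str) -> tuple[str, str, str]:
    """Return (prefix, stem, suffix); prefix and suffix are empty if no circumfix is found."""
    has_circumfix = (
        word.startswith(NEG_PREFIX)
        and word.endswith(NEG_SUFFIX)
        and len(word) > len(NEG_PREFIX) + len(NEG_SUFFIX)
    )
    if not has_circumfix:
        return ("", word, "")
    stem = word[len(NEG_PREFIX):-len(NEG_SUFFIX)]
    return (NEG_PREFIX, stem, NEG_SUFFIX)


print(split_negation("አልሄደም"))
```

A rule this simple will also split words that merely happen to start with አል and end with ም, which is one reason a language-specific toolkit is useful.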
Tools built for English (NLTK, spaCy) are not designed to handle:
- Fidel script
- Complex morphology
- Combined affixes
- Amharic punctuation
- Unicode inconsistencies
amharicNLP solves this challenge with a full, language-specific preprocessing pipeline.
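As a small illustration of the punctuation point, the sketch below shows that ASCII-oriented cleanup (here, stripping Python's `string.punctuation`) leaves the Ethiopic full stop in place. It uses only the standard library; `ETHIOPIC_PUNCTUATION` and `strip_punctuation` are names made up for this example and are not part of amharicNLP.

```python
import string

# Ethiopic punctuation occupies U+1360..U+1368 in Unicode.
ETHIOPIC_PUNCTUATION = "".join(chr(cp) for cp in range(0x1360, 0x1369))


def strip_punctuation(text: str, punctuation: str) -> str:
    """Delete every character listed in `punctuation` from `text`."""
    return text.translate(str.maketrans("", "", punctuation))


sentence = "በላይ ዘለቀ ጀግና የኢትዮጵያ አርበኛ ነበር።"

# ASCII punctuation only: the Ethiopic full stop is untouched.
ascii_only = strip_punctuation(sentence, string.punctuation)
# ASCII plus Ethiopic punctuation: the full stop is removed.
ethiopic_aware = strip_punctuation(sentence, string.punctuation + ETHIOPIC_PUNCTUATION)

print(repr(ascii_only))
print(repr(ethiopic_aware))
```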
⚙️ What Is amharicNLP?
amharicNLP is a modular Python package built for end-to-end Amharic text preprocessing.
🧩 It includes six core components:
- Cleaner – Removes HTML, emojis, numbers & noise
- Normalizer – Fixes inconsistencies in characters & punctuation
- Tokenizer – Splits text into meaningful tokens
- Stopword Processor – Removes common filler words
- Lemmatizer – Converts words to their base dictionary form
- Stemmer – Reduces words to their root for ML tasks
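The demo further down does not call the tokenizer directly, so here is a rough sketch of what word tokenization for Amharic can look like. It is an assumption-laden stand-in written with the standard library, not the behaviour of `AmharicWordTokenizer`: the `tokenize` function and `TOKEN_BOUNDARY` pattern are hypothetical.

```python
import re

# Token boundaries: whitespace plus the Ethiopic punctuation range U+1360..U+1368,
# which includes the traditional word separator "፡" and the full stop "።".
TOKEN_BOUNDARY = re.compile(r"[\s\u1360-\u1368]+")


def tokenize(text: str) -> list[str]:
    """Split on whitespace and Ethiopic punctuation, dropping empty tokens."""
    return [token for token in TOKEN_BOUNDARY.split(text) if token]


print(tokenize("ተማሪዎች በትምህርት ላይ ትኩረት ማድረግ አለባቸው።"))
```

Treating "፡" as a boundary means text written with the traditional word separator is split the same way as space-separated text.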
📦 Installation
Option 1: Install from PyPI (Recommended)
pip install amharicNLP
Option 2: Install Latest Development Version
git clone https://github.com/yonasab12/amharicNLP.git
cd amharicNLP
pip install .
🧪 Full Demo: End-to-End Amharic Text Preprocessing
from amharicNLP.resources.cleaner import AmharicCleaner
from amharicNLP.resources.normalizer import AmharicNormalizer
from amharicNLP.resources.lemmatizer import AmharicLemmatizer
from amharicNLP.resources.stemmer import AmharicStemmer
from amharicNLP.resources.stopwrod import AmharicStopwordProcessor
from amharicNLP.resources.tokenizer import AmharicWordTokenizer
sample_text = "በአገራችን ኢትዮጵያ <h1/> ላ ያሉ ተማሪዎች በትምህርት ላይ ትኩረት ማድረግ አለባቸው። 123 ቁጥር! በላይ ዘለቀ ጀግና የኢትዮጵያ አርበኛ ነበር።"
🧹 Step 1: Cleaning
cleaner = AmharicCleaner()
cleaned_html = cleaner.remove_html(sample_text)
cleaned_text = cleaner.remove_noise(cleaned_html)
print(cleaned_text)
Output
በአገራችን ኢትዮጵያ ላ ያሉ ተማሪዎች በትምህርት ላይ ትኩረት ማድረግ አለባቸው። ቁጥር በላይ ዘለቀ ጀግና የኢትዮጵያ አርበኛ ነበር።
✔️ HTML removed ✔️ Numbers & non-Amharic characters cleaned
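For readers curious about what a cleaning step like this involves, here is a minimal standard-library sketch. It is not the implementation behind `remove_html` or `remove_noise`; the `clean` function and the three patterns are assumptions chosen to reproduce the same kind of result.

```python
import re

HTML_TAG = re.compile(r"<[^>]+>")
# Anything that is neither in the Ethiopic block (U+1200..U+137F) nor whitespace.
NON_ETHIOPIC = re.compile(r"[^\u1200-\u137F\s]")
WHITESPACE_RUN = re.compile(r"\s+")


def clean(text: str) -> str:
    """Drop HTML tags and non-Ethiopic characters, then collapse whitespace."""
    text = HTML_TAG.sub(" ", text)
    text = NON_ETHIOPIC.sub("", text)
    return WHITESPACE_RUN.sub(" ", text).strip()


print(clean("ኢትዮጵያ <h1/> ተማሪዎች 123 ቁጥር! ነበር።"))
```

Because the Ethiopic block also contains Ethiopic punctuation and numerals, this sketch keeps "።" while removing ASCII digits and "!".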
🔤 Step 2: Normalization
normalizer = AmharicNormalizer()
text1 = normalizer.normalize_amharic_chars(cleaned_text)
text2 = normalizer.normalize_punctuation_spacing(text1)
text3 = normalizer.expand_abbreviations(text2)
print(text3)
✔️ Standardized characters ✔️ Clean punctuation spacing
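One common form of Amharic character normalization is folding letters that are pronounced alike onto a single spelling. The sketch below shows the idea with `str.translate`; the choice of families and the names `HOMOPHONE_FAMILIES`, `HOMOPHONE_TABLE`, and `normalize_homophones` are assumptions for illustration, and the mapping used by `AmharicNormalizer` may differ.

```python
# (variant base, canonical base) pairs. In Unicode, the seven basic orders of
# each family sit at consecutive code points, so one offset loop covers them all.
HOMOPHONE_FAMILIES = [("ሐ", "ሀ"), ("ኀ", "ሀ"), ("ሠ", "ሰ"), ("ዐ", "አ"), ("ፀ", "ጸ")]

HOMOPHONE_TABLE = {
    ord(variant) + order: ord(canonical) + order
    for variant, canonical in HOMOPHONE_FAMILIES
    for order in range(7)
}


def normalize_homophones(text: str) -> str:
    """Replace variant letters with their canonical counterparts."""
    return text.translate(HOMOPHONE_TABLE)


print(normalize_homophones("ሠላም"))
print(normalize_homophones("ፀሐይ"))
```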
🪶 Step 3: Stopword Removal
stopword_processor = AmharicStopwordProcessor()
filtered_text = stopword_processor.remove(text3)
print(filtered_text)
✔️ Removes high-frequency filler words
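Conceptually, stopword removal is a set-membership filter over tokens. The sketch below uses a tiny hand-picked stopword set for illustration only; the list shipped with amharicNLP is not shown here, and `STOPWORDS` and `remove_stopwords` are hypothetical names.

```python
# Hypothetical mini stopword set: "on", "was", "and", "is", "to", "but".
STOPWORDS = frozenset({"ላይ", "ነበር", "እና", "ነው", "ወደ", "ግን"})


def remove_stopwords(tokens: list[str]) -> list[str]:
    """Keep only tokens that are not in the stopword set."""
    return [token for token in tokens if token not in STOPWORDS]


tokens = "በላይ ዘለቀ ጀግና የኢትዮጵያ አርበኛ ነበር".split()
print(remove_stopwords(tokens))
```

Matching whole tokens rather than substrings matters: the name በላይ contains the stopword ላይ but is kept.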
📖 Step 4: Lemmatization
lemmatizer = AmharicLemmatizer()
lemmatized_text = lemmatizer.lemmatize(filtered_text)
print(lemmatized_text)
✔️ Converts words to canonical dictionary forms
🌱 Step 5: Stemming
stemmer = AmharicStemmer()
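# Stem word by word. If remove() returns a string rather than a list of tokens,
# split it first (for example with filtered_text.split()), because iterating
# over a string yields single characters.
stemmed = [stemmer.stemaize(word) for word in filtered_text]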
print(stemmed)
✔️ Ideal for ML pipelines (text clustering, topic modeling)
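To give a feel for what stemming does, here is a deliberately crude affix-stripping sketch. It is far simpler than real Amharic morphology and is not the algorithm used by `AmharicStemmer`; the affix lists, `MIN_STEM_LENGTH`, and `light_stem` are assumptions made for this example.

```python
PREFIXES = ("የ",)    # genitive/possessive marker, as in የኢትዮጵያ ("of Ethiopia")
SUFFIXES = ("ዎች",)   # plural marker, as in ተማሪዎች ("students")
MIN_STEM_LENGTH = 2  # never strip a word down to fewer characters than this


def light_stem(word: str) -> str:
    """Strip at most one known prefix and one known suffix."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LENGTH:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            word = word[:-len(suffix)]
            break
    return word


for token in ["ተማሪዎች", "የኢትዮጵያ", "ጀግና"]:
    print(token, "->", light_stem(token))
```

The length guard is a common safeguard in light stemmers: it avoids reducing short words to nothing when they happen to begin or end with an affix-like character.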
🧠 Why This Matters
amharicNLP significantly improves NLP performance by:
- Cleaning & normalizing messy text
- Reducing vocabulary sparsity
- Preparing text for downstream tasks like: ✔ Sentiment analysis ✔ Text classification ✔ POS tagging ✔ Named Entity Recognition (NER) ✔ Language modeling
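The vocabulary sparsity point can be made concrete with a toy count. In the sketch below, two words each appear in two spellings; folding the variant letters halves the number of distinct word types. The token list and `VARIANT_TABLE` are invented for illustration and do not come from amharicNLP.

```python
from collections import Counter

# Toy mapping of two variant letters onto their more common counterparts.
VARIANT_TABLE = str.maketrans({"ሠ": "ሰ", "ፀ": "ጸ"})

tokens = ["ሰላም", "ሠላም", "ጸሀይ", "ፀሀይ", "ሰላም"]

raw_counts = Counter(tokens)
normalized_counts = Counter(token.translate(VARIANT_TABLE) for token in tokens)

print("distinct types before:", len(raw_counts))
print("distinct types after: ", len(normalized_counts))
```

Fewer distinct types for the same text means each type is observed more often, which is what downstream models benefit from.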
🧭 Module Summary
| Step | Module | Purpose |
|---|---|---|
| 1 | AmharicCleaner | Removes noise, HTML, punctuation errors |
| 2 | AmharicNormalizer | Standardizes characters & spacing |
| 3 | AmharicWordTokenizer | Splits text into tokens |
| 4 | AmharicStopwordProcessor | Removes common stopwords |
| 5 | AmharicLemmatizer | Finds base word form |
| 6 | AmharicStemmer | Extracts root word |
🚀 Final Thoughts
amharicNLP bridges the gap between AI and one of Africa’s most important Semitic languages.
With only a few lines of code, you can prepare Amharic data for machine learning, deep learning, and linguistic analysis.
“By teaching computers to understand Amharic, we make technology speak our language.” 🇪🇹💻
✍️ Author
👤 Yonas Abebe
Exploring Amharic NLP, machine learning, and AI tools for Ethiopian languages. GitHub: yonasab12