amharicNLP is a Python package for Amharic Natural Language Processing (NLP) and text preprocessing.

🇪🇹 Amharic NLP Toolkit

Amharic NLP Toolkit is a lightweight, easy-to-use Natural Language Processing (NLP) toolkit designed specifically for the Amharic language. It provides tools for Amharic text cleaning, normalization, tokenization, stopword removal, stemming, lemmatization, and sentiment analysis.

It is intended for machine learning, deep learning, and LLM projects, and for other applications that process Amharic text.

🌍 Why Amharic Needs Its Own NLP Toolkit

Amharic — Ethiopia’s official language — is morphologically rich and syntactically complex.

A single Amharic word can contain:

Subject
Tense
Negation
Verb root
Suffix

Example: “አልሄደም” (“he did not go”) = negation prefix (አል) + verb stem (ሄደ, “he went”, which already encodes subject and tense) + negative suffix (ም).
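As a rough illustration of how much information a single word carries, the sketch below peels the negation markers off a negated verb. `split_negation` is a hypothetical helper written for this example and is not part of amharicNLP. The rule it applies (prefix አል plus suffix ም) is a deliberate simplification that ignores other negative forms.

```python
NEG_PREFIX = "አል"  # negative prefix
NEG_SUFFIX = "ም"   # negative suffix that closes the pair

def split_negation(word: str) -> tuple[str, bool]:
    """Return (core, is_negated) for a word wrapped in the prefix/suffix pair."""
    wrapped = word.startswith(NEG_PREFIX) and word.endswith(NEG_SUFFIX)
    # Require at least one character between the prefix and the suffix.
    if wrapped and len(word) > len(NEG_PREFIX) + len(NEG_SUFFIX):
        return word[len(NEG_PREFIX):-len(NEG_SUFFIX)], True
    return word, False

# The core that remains is the verb stem; the flag records the negation.
print(split_negation("አልሄደም"))
```

A real morphological analyzer needs far more than string matching, which is the motivation for language-specific tooling.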

Tools built primarily for English, such as NLTK and spaCy, do not handle the following well out of the box:

Fidel script
Complex morphology
Combined affixes
Amharic punctuation
Unicode inconsistencies
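To make the punctuation point concrete, here is a small standard-library sketch. It is not amharicNLP code. It shows that Amharic sentence boundaries use the Ethiopic full stop (U+1362) rather than the ASCII period, so a splitter that only looks for "." finds nothing to split. `split_sentences` is a name chosen for this example.

```python
import re
import unicodedata

ETHIOPIC_FULL_STOP = "\u1362"      # ።
ETHIOPIC_QUESTION_MARK = "\u1367"  # ፧

# Sentence-final marks: Ethiopic full stop and question mark, plus ASCII ? and !
SENTENCE_END = re.compile("[\u1362\u1367?!]")

def split_sentences(text: str) -> list[str]:
    """Split on sentence-final marks and drop empty pieces."""
    return [part.strip() for part in SENTENCE_END.split(text) if part.strip()]

text = "ሰላም ነው" + ETHIOPIC_FULL_STOP + " እንዴት ነህ?"
print(unicodedata.name(ETHIOPIC_FULL_STOP))
print(text.split("."))        # no ASCII period occurs, so the text stays in one piece
print(split_sentences(text))  # split at the Ethiopic full stop and the question mark
```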

amharicNLP solves this challenge with a full, language-specific preprocessing pipeline.

⚙️ What Is amharicNLP?

amharicNLP is a modular Python package built for end-to-end Amharic text preprocessing.

🧩 It includes six core components:

Cleaner – Removes HTML, emojis, numbers & noise
Normalizer – Fixes inconsistencies in characters & punctuation
Tokenizer – Splits text into meaningful tokens
Stopword Processor – Removes common filler words
Lemmatizer – Converts words to their base dictionary form
Stemmer – Reduces words to their root for ML tasks
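The tokenizer's job can be approximated with a regular expression over the Ethiopic Unicode block. This sketch assumes that a token is a run of Ethiopic syllables (U+1200 to U+135F) and treats Ethiopic punctuation (U+1360 onward) as a separator. `tokenize_words` is a hypothetical stand-in and says nothing about how AmharicWordTokenizer is implemented.

```python
import re

# Runs of Ethiopic syllables and combining marks; punctuation starts at U+1360.
WORD_RE = re.compile("[\u1200-\u135F]+")

def tokenize_words(text: str) -> list[str]:
    """Return Ethiopic word tokens, dropping punctuation, digits, and Latin text."""
    return WORD_RE.findall(text)

# \u1363 is the Ethiopic comma and \u1362 is the Ethiopic full stop.
sample = "በላይ ዘለቀ\u1363 ጀግና ነበር\u1362"
print(tokenize_words(sample))
```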

📦 Installation

Option 1: Install from PyPI (Recommended)

pip install amharicNLP

Option 2: Install Latest Development Version

git clone https://github.com/yonasab12/amharicNLP.git
cd amharicNLP
pip install .
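After installing, one way to confirm that the package is visible to the current interpreter is to ask the standard library for its installed version. `installed_version` is a helper written for this example; it returns `None` instead of raising when the distribution is missing.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None

# Prints a version string if the install succeeded, otherwise None.
print(installed_version("amharicNLP"))
```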

🧪 Full Demo: End-to-End Amharic Text Preprocessing

from amharicNLP.resources.cleaner import AmharicCleaner
from amharicNLP.resources.normalizer import AmharicNormalizer
from amharicNLP.resources.lemmatizer import AmharicLemmatizer
from amharicNLP.resources.stemmer import AmharicStemmer
from amharicNLP.resources.stopwrod import AmharicStopwordProcessor
from amharicNLP.resources.tokenizer import AmharicWordTokenizer

sample_text = "በአገራችን ኢትዮጵያ <h1/> ላ ያሉ ተማሪዎች በትምህርት ላይ ትኩረት ማድረግ አለባቸው። 123 ቁጥር! በላይ ዘለቀ ጀግና የኢትዮጵያ አርበኛ ነበር።"

🧹 Step 1: Cleaning

cleaner = AmharicCleaner()
cleaned_html = cleaner.remove_html(sample_text)
cleaned_text = cleaner.remove_noise(cleaned_html)
print(cleaned_text)

Output

በአገራችን ኢትዮጵያ ላ ያሉ ተማሪዎች በትምህርት ላይ ትኩረት ማድረግ አለባቸው። ቁጥር በላይ ዘለቀ ጀግና የኢትዮጵያ አርበኛ ነበር።

✔️ HTML removed ✔️ Numbers & non-Amharic characters cleaned
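For readers curious what a cleaning pass of this kind involves, here is a minimal standard-library sketch. It is an assumption about the general technique, not the code behind `remove_html` or `remove_noise`: it strips anything that looks like an HTML tag, keeps only characters from the Ethiopic block (U+1200 to U+137F, which includes Ethiopic punctuation), and collapses whitespace.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")                    # anything shaped like an HTML tag
NON_ETHIOPIC_RE = re.compile(r"[^\u1200-\u137F\s]")  # everything outside the Ethiopic block

def clean(text: str) -> str:
    """Drop tags and non-Ethiopic characters, then collapse runs of whitespace."""
    text = TAG_RE.sub(" ", text)
    text = NON_ETHIOPIC_RE.sub(" ", text)
    return " ".join(text.split())

print(clean("ሰላም <b>ዓለም</b> 123!"))
```

Because the Ethiopic full stop sits inside the kept range, it survives this pass while ASCII digits and "!" do not.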

🔤 Step 2: Normalization

normalizer = AmharicNormalizer()
text1 = normalizer.normalize_amharic_chars(cleaned_text)
text2 = normalizer.normalize_punctuation_spacing(text1)
text3 = normalizer.expand_abbreviations(text2)
print(text3)

✔️ Standardized characters ✔️ Clean punctuation spacing
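Character normalization in Amharic usually means folding letter series that are pronounced the same into one canonical series. The sketch below shows that idea with `str.translate`. The five series pairs are a convention often seen in Amharic NLP work and are an assumption here; the mapping inside `normalize_amharic_chars` may differ. `normalize_homophones` is a hypothetical name.

```python
# (variant base, canonical base) code points; the first 7 vowel orders are mapped.
HOMOPHONE_SERIES = [
    (0x1210, 0x1200),  # ሐ series -> ሀ series
    (0x1280, 0x1200),  # ኀ series -> ሀ series
    (0x1220, 0x1230),  # ሠ series -> ሰ series
    (0x12D0, 0x12A0),  # ዐ series -> አ series
    (0x1340, 0x1338),  # ፀ series -> ጸ series
]

# Translation table from each variant code point to its canonical counterpart.
TABLE = {
    src + order: dst + order
    for src, dst in HOMOPHONE_SERIES
    for order in range(7)
}

def normalize_homophones(text: str) -> str:
    """Replace variant letters with their canonical equivalents."""
    return text.translate(TABLE)

# ሠላም (spelled with U+1220) is rewritten with ሰ (U+1230).
print(normalize_homophones("\u1220\u120b\u121d"))
```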

🪶 Step 3: Stopword Removal

stopword_processor = AmharicStopwordProcessor()
filtered_text = stopword_processor.remove(text3)
print(filtered_text)

✔️ Removes high-frequency filler words
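Conceptually, stopword removal is a membership test against a word list. The sketch below uses a four-word list invented for this example; the list shipped with amharicNLP is not shown in this document and is presumably much larger. `remove_stopwords` is a hypothetical helper.

```python
# Hypothetical mini stopword list, for illustration only.
STOPWORDS = frozenset({"ላይ", "ነበር", "እና", "ነው"})

def remove_stopwords(text: str, stopwords: frozenset = STOPWORDS) -> str:
    """Drop whitespace-separated tokens that appear in the stopword set."""
    kept = [token for token in text.split() if token not in stopwords]
    return " ".join(kept)

print(remove_stopwords("ተማሪዎች በትምህርት ላይ ትኩረት ማድረግ አለባቸው"))
```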

📖 Step 4: Lemmatization

lemmatizer = AmharicLemmatizer()
lemmatized_text = lemmatizer.lemmatize(filtered_text)
print(lemmatized_text)

✔️ Converts words to canonical dictionary forms
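One common way to implement lemmatization is a lookup table from inflected forms to dictionary forms, falling back to the word itself when it is unknown. The sketch below assumes that approach; whether AmharicLemmatizer works this way is not stated here. The two-entry table and `lemmatize_tokens` are made up for the example.

```python
# Hypothetical lemma table: inflected form -> dictionary form.
LEMMAS = {
    "ተማሪዎች": "ተማሪ",  # "students" -> "student"
    "አልሄደም": "ሄደ",   # "he did not go" -> "he went" (verb citation form)
}

def lemmatize_tokens(tokens: list[str]) -> list[str]:
    """Map each token to its lemma; unknown tokens are returned unchanged."""
    return [LEMMAS.get(token, token) for token in tokens]

print(lemmatize_tokens(["ተማሪዎች", "ጀግና"]))
```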

🌱 Step 5: Stemming

stemmer = AmharicStemmer()
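# Stem word by word; iterating over a string directly would yield single characters.
# Assumes remove() returned a string; if it returned a list of tokens, use it as-is.
words = filtered_text.split() if isinstance(filtered_text, str) else filtered_text
stemmed = [stemmer.stemaize(word) for word in words]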
print(stemmed)

✔️ Ideal for ML pipelines (text clustering, topic modeling)
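Stemming differs from lemmatization in that it cuts affixes by rule rather than looking words up. The following sketch strips two plural suffixes and refuses to cut when the remaining stem would be too short. The suffix list, the minimum length, and `toy_stem` are all assumptions for illustration and do not describe AmharicStemmer's actual rules.

```python
# Hypothetical suffix list: plural-definite and plural markers.
SUFFIXES = ("ዎቹ", "ዎች")
MIN_STEM_LENGTH = 2  # do not leave a stem shorter than this many characters

def toy_stem(word: str) -> str:
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word

# Stemming works on individual tokens, so the text is split first.
tokens = "ተማሪዎች ጀግና".split()
print([toy_stem(token) for token in tokens])
```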

🧠 Why This Matters

amharicNLP helps improve downstream NLP performance by:

Cleaning & normalizing messy text
Reducing vocabulary sparsity
Preparing text for downstream tasks such as sentiment analysis, text classification, POS tagging, Named Entity Recognition (NER), and language modeling
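The effect on vocabulary sparsity can be seen with a tiny count. In the sketch below, two spellings of the same word are treated as different vocabulary entries until a single-pair spelling map merges them. The map and the token list are invented for this example and are not taken from amharicNLP.

```python
SPELLING_MAP = {0x1220: 0x1230}  # ሠ -> ሰ (a single pair, for illustration)

def vocabulary_size(tokens: list[str]) -> int:
    """Number of distinct token strings."""
    return len(set(tokens))

# ሠላም, ሰላም, ሰላም: the same word written with two different first letters.
raw_tokens = ["\u1220\u120b\u121d", "\u1230\u120b\u121d", "\u1230\u120b\u121d"]
normalized = [token.translate(SPELLING_MAP) for token in raw_tokens]

print(vocabulary_size(raw_tokens), vocabulary_size(normalized))
```

On a real corpus the same merging happens across many word pairs, which gives a model more examples per vocabulary entry.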

🧭 Module Summary

Step	Module	Purpose
1	AmharicCleaner	Removes HTML, emojis, numbers, and noise
2	AmharicNormalizer	Standardizes characters & spacing
3	AmharicWordTokenizer	Splits text into tokens
4	AmharicStopwordProcessor	Removes common stopwords
5	AmharicLemmatizer	Finds base word form
6	AmharicStemmer	Extracts root word
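The table describes a chain in which each module's output feeds the next. One way to express such a chain is a list of string-to-string functions applied in order, sketched below with two stand-in stages. `run_pipeline`, `strip_digits`, and `collapse_spaces` are hypothetical; methods from the demo such as `cleaner.remove_html` could be used as stages instead, assuming each one takes and returns a string.

```python
from functools import reduce
from typing import Callable

Stage = Callable[[str], str]

def run_pipeline(text: str, stages: list[Stage]) -> str:
    """Apply each stage to the output of the previous one."""
    return reduce(lambda current, stage: stage(current), stages, text)

# Stand-in stages used only to show the composition.
def strip_digits(text: str) -> str:
    return "".join(ch for ch in text if not ch.isdigit())

def collapse_spaces(text: str) -> str:
    return " ".join(text.split())

print(run_pipeline("ሰላም  123 ዓለም", [strip_digits, collapse_spaces]))
```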

🚀 Final Thoughts

amharicNLP bridges the gap between AI and one of Africa’s most important Semitic languages. With only a few lines of code, you can prepare Amharic data for machine learning, deep learning, and linguistic analysis.

“By teaching computers to understand Amharic, we make technology speak our language.” 🇪🇹💻

✍️ Author

👤 Yonas Abebe

Exploring Amharic NLP, machine learning, and AI tools for Ethiopian languages. GitHub: yonasab12

Release history

Version	Release date
1.1.0 (this version)	Nov 28, 2025
1.0.0	Nov 2, 2025
0.8.0	Aug 23, 2025
0.7.0	Aug 23, 2025
0.6.0	Aug 23, 2025
0.5.0	Aug 23, 2025
0.4.0	Aug 22, 2025

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

amharicnlp-1.1.0-py3-none-any.whl (120.8 kB)

Uploaded Nov 28, 2025 Python 3

File details

Details for the file amharicnlp-1.1.0-py3-none-any.whl.

File metadata

Download URL: amharicnlp-1.1.0-py3-none-any.whl
Upload date: Nov 28, 2025
Size: 120.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for amharicnlp-1.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7fd0e6143ca8f9d7ecf23a73f0dde20fb65fe4a954105b5cf140f52ce112f85d`
MD5	`5f1fd7c2b4fa4b00b2baea566d07e4de`
BLAKE2b-256	`e6d1d8a11661f4a7a5afaff34e0b3d2e0f3965f0b7b801233526f0aa85302698`

Provenance

The following attestation bundles were made for amharicnlp-1.1.0-py3-none-any.whl:

Publisher: publish.yml on yonasab12/amharicNLP

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: amharicnlp-1.1.0-py3-none-any.whl
- Subject digest: 7fd0e6143ca8f9d7ecf23a73f0dde20fb65fe4a954105b5cf140f52ce112f85d
- Sigstore transparency entry: 731006868
- Sigstore integration time: Nov 28, 2025
Source repository:
- Permalink: yonasab12/amharicNLP@affdce0667d2aecd8aa2fe2889da10ff1230a615
- Branch / Tag: refs/tags/v1.1.0
- Owner: https://github.com/yonasab12
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@affdce0667d2aecd8aa2fe2889da10ff1230a615
- Trigger Event: release
