A modular toolkit for cleaning and normalizing Amharic text.
Amharic Text Processor
Amharic Text Processor is a modular Python toolkit for cleaning, normalizing, and formatting Amharic text. Each processing step is a small class with a predictable .apply() method, and steps are easily chained with Pipeline.
Why this exists: Amharic text from the web, documents, and OCR often arrives with HTML noise, mixed Ethiopic variants, inconsistent punctuation, legacy abbreviations, and numerals in different forms. This toolkit provides predictable, composable processors so you can rapidly build robust pipelines for ML datasets, search indexing, or downstream NLP tasks without reinventing cleaning logic. Many of these components were developed while processing large volumes of Amharic text crawled from Amharic-focused websites indexed by Common Crawl.
✨ Features
- Composable pipeline built from simple processor classes
- Consistent I/O contract: accepts str or {"text": ...}, returns a dict with "text"
- HTML stripping, whitespace cleanup, Amharic character filtering
- Punctuation and Unicode normalization (keeps Ethiopic marks, preserves decimals) plus configurable regex filtering
- Sentence-level deduplication using fuzzy similarity
- Abbreviation handling for slash/dot forms; dotted abbreviations can be normalized before expansion
- Helpers to add spaces between Ethiopic letters and digits, and to place sentences on separate lines
- Noise removal for common Latin/underscore tokens and foreign-only brackets
- Ethiopic→Latin transliteration using a romanization table
- Pure, side-effect-free processors that are easy to test and extend
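To make the letter/digit spacing helper from the list above concrete, here is a minimal standalone sketch of the idea. It does not use the package: the function name `space_letters_and_digits` and the choice of U+1200 to U+135A as the "Ethiopic letter" range are assumptions made for illustration, and the library's own processor may treat more cases.

```python
import re

# Assumption: treat the Ethiopic syllable range U+1200..U+135A as "letters".
# Ethiopic punctuation and Geez numerals are deliberately left out.
ETHIOPIC_LETTERS = r"\u1200-\u135A"

_LETTER_THEN_DIGIT = re.compile(rf"([{ETHIOPIC_LETTERS}])([0-9])")
_DIGIT_THEN_LETTER = re.compile(rf"([0-9])([{ETHIOPIC_LETTERS}])")


def space_letters_and_digits(text: str) -> str:
    """Insert one space wherever an Ethiopic letter touches an ASCII digit."""
    text = _LETTER_THEN_DIGIT.sub(r"\1 \2", text)
    text = _DIGIT_THEN_LETTER.sub(r"\1 \2", text)
    return text


print(space_letters_and_digits("ዜና11"))  # -> ዜና 11
```

Text that is already spaced passes through unchanged, so the function is safe to run more than once.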
📦 Installation
pip install amharic-text-processor
🚀 Quick Start
from amharic_text_processor import Pipeline
from amharic_text_processor.processors import (
    HtmlStripper,
    WhitespaceNormalizer,
    PunctuationNormalizer,
    UnicodeNormalizer,
    CharacterRemapper,
    AbbreviationExpander,
    DottedAbbreviationNormalizer,
    AmharicCharacterFilter,
    AmharicTransliterator,  # used in the transliteration example below
    CommonNoiseRemover,
)
pipeline = Pipeline([
    HtmlStripper(),                  # drop HTML/script/style
    UnicodeNormalizer(),             # NFC + strip control chars
    CharacterRemapper(),             # normalize Ethiopic variants (ሠ->ሰ, ዐ->አ, ...)
    DottedAbbreviationNormalizer(),  # turn dotted abbreviations into slash form
    AbbreviationExpander(),          # expand slash/dot abbreviations (e.g., ዓ.ም. -> ዓመተ ምሕረት)
    PunctuationNormalizer(),         # unify punctuation (keeps Ethiopic marks, protects decimals)
    WhitespaceNormalizer(),          # collapse repeated whitespace
    AmharicCharacterFilter(),        # keep Ethiopic chars and safe punctuation/digits
    CommonNoiseRemover(),            # drop tokens like IMG_1124 or (FlyDubai)
])
raw = """
<article>
<p> ሰላም። ልኡኣ ዓ.ም. 2016 ሀ/ማርያም በሚሊዮን ይዘት ሰጠ። </p>
<script>alert('ignore me')</script>
</article>
"""
result = pipeline.apply(raw)
print(result["text"])
# -> ሰላም። ሏ ዓመተ ምሕረት 2016 ሀይለ ማርያም በሚሊዮን ይዘት ሰጠ።
# Transliteration to Latin
rawtext = "እሺ፣ የክፍያ ሂደቱን በአማርኛ እመራዎታለሁ። የአካውንት ቁጥርዎን ያስገቡ። 565 የኢትዮጵያ ንግድ ባንክ ነው።"
new_text = AmharicTransliterator().apply(rawtext)
print(new_text["text"])
# -> eshi, yakefeyaa hidatune baamaarenyaa emaraawotaalahu. yaakaawenete quterewone yaasegabu. 565 yaiteyopheyaa negede baaneke nawe.
🔗 Pipeline Contract
- Input: str or dict containing "text": str
- Output: always a dict with at least "text": str
- Processors run in order; output from one is passed to the next
- Fail-fast validation on invalid inputs or processor outputs
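The contract above can be illustrated with a small standalone sketch. This is not the package's implementation: `MiniPipeline`, `extract_text`, `StripEdges`, and `CollapseSpaces` are hypothetical names invented here, and the choice of `TypeError` for invalid data is an assumption about what fail-fast validation could look like.

```python
import re
from typing import Any, Dict, Iterable, Union

TextInput = Union[str, Dict[str, Any]]


def extract_text(data: TextInput) -> str:
    """Accept a str or a dict with a str under "text"; reject anything else."""
    if isinstance(data, str):
        return data
    if isinstance(data, dict) and isinstance(data.get("text"), str):
        return data["text"]
    raise TypeError('expected str or dict containing "text": str')


class StripEdges:
    def apply(self, data: TextInput) -> Dict[str, Any]:
        return {"text": extract_text(data).strip()}


class CollapseSpaces:
    def apply(self, data: TextInput) -> Dict[str, Any]:
        return {"text": re.sub(r"\s+", " ", extract_text(data))}


class MiniPipeline:
    """Hypothetical stand-in that mirrors the documented contract."""

    def __init__(self, processors: Iterable[Any]) -> None:
        self.processors = list(processors)

    def apply(self, data: TextInput) -> Dict[str, Any]:
        current: Dict[str, Any] = {"text": extract_text(data)}  # validate input
        for processor in self.processors:
            out = processor.apply(current)
            # Fail fast if a processor breaks the output contract.
            if not isinstance(out, dict) or not isinstance(out.get("text"), str):
                raise TypeError(f"{type(processor).__name__} returned invalid output")
            current = out
        return current


result = MiniPipeline([StripEdges(), CollapseSpaces()]).apply("  ሰላም   ዓለም ")
print(result["text"])  # -> ሰላም ዓለም
```

Because every step takes and returns the same shape, a processor can be tested in isolation by calling `.apply()` with a plain string.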
📚 Code Documentation
- Each processor and the pipeline include docstrings describing inputs/outputs and behavior (see amharic_text_processor/base.py, pipeline.py, and files in amharic_text_processor/processors/).
- Browse in an editor or via pydoc amharic_text_processor.processors.<name> for details.
- All processors follow the same contract: .apply(data: str | {"text": str}) -> {"text": str, ...}.
- See docs/ for a quick reference (docs/index.md, docs/processors.md). To generate HTML docs locally you can run pdoc -o docs amharic_text_processor.
🧰 Built-in Processors
- HtmlStripper: remove HTML tags and script/style content
- WhitespaceNormalizer: collapse repeated whitespace and trim
- PunctuationNormalizer: unify Ethiopic/ASCII punctuation, collapse repeats, keep decimals intact
- UnicodeNormalizer: normalize Unicode (default NFC) and strip control chars
- AmharicCharacterFilter: keep Ethiopic characters plus safe punctuation/digits
- CharacterRemapper: normalize variant Ethiopic glyphs to canonical forms
- DottedAbbreviationNormalizer: convert dotted abbreviations (e.g., እ.ኤ.አ) into slash form before expansion
- AbbreviationExpander: expand slash/dot Amharic abbreviations to full forms (e.g., ፍ/ቤቱ -> ፍርድ ቤቱ, ፕ/ር -> ፕሮፌሰር, ዓ.ም. -> ዓመተ ምሕረት)
- NumberToGeez: convert Arabic digits in text to Ethiopic (Geez) numerals (e.g., 31 -> ፴፩)
- GeezToNumber: convert Ethiopic (Geez) numerals back to Arabic digits (e.g., ፴፩ -> 31)
- WordNumberToDigits: convert Amharic worded numbers (e.g., “ሁለት ሺህ ሶስት መቶ”) to Arabic digits
- DigitsToWordNumber: turn Arabic digit sequences into Amharic worded numbers (supports up to trillions)
- OldPhoneMapper: convert legacy phone representations to modern forms via a predefined mapping
- EthiopicNumberSpacer: insert spaces between Ethiopic letters and adjacent digits (e.g., "ዜና11" -> "ዜና 11")
- SentenceLineFormatter: place each sentence on its own line after end punctuation
- SentenceDeduplicator: drop exact or near-duplicate sentences with RapidFuzz similarity
- AmharicTransliterator: transliterate Ethiopic (Amharic) text to Latin using a romanization table
- CommonNoiseRemover: remove noisy tokens like IMG_1124 or non-Ethiopic bracketed text (some_not_amharic_words)
- RegexFilter: run a configurable regex substitution with counts
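As an illustration of the digit-to-Geez conversion listed above (31 -> ፴፩), the following standalone sketch shows one way the mapping can work. It is not the package's NumberToGeez: the function names are hypothetical, and the sketch only handles 1 to 9999, leaving any other digit run untouched.

```python
import re

# Geez numerals occupy U+1369..U+137C: ones, then tens, then 100 and 10000.
ONES = [""] + [chr(0x1369 + i) for i in range(9)]  # 1..9
TENS = [""] + [chr(0x1372 + i) for i in range(9)]  # 10..90
HUNDRED = "\u137b"                                  # 100


def pair_to_geez(n: int) -> str:
    """Convert 0..99 to its tens sign followed by its ones sign."""
    return TENS[n // 10] + ONES[n % 10]


def to_geez(n: int) -> str:
    """Convert 1..9999 to Geez numerals (range limit is a sketch assumption)."""
    if not 1 <= n <= 9999:
        raise ValueError("this sketch only supports 1..9999")
    hundreds, rest = divmod(n, 100)
    prefix = ""
    if hundreds == 1:
        prefix = HUNDRED  # 100 is written without a leading "one" sign
    elif hundreds > 1:
        prefix = pair_to_geez(hundreds) + HUNDRED
    return prefix + pair_to_geez(rest)


def digits_to_geez(text: str) -> str:
    """Replace each ASCII digit run in range with Geez numerals."""
    def repl(match: re.Match) -> str:
        value = int(match.group())
        return to_geez(value) if 1 <= value <= 9999 else match.group()
    return re.sub(r"[0-9]+", repl, text)


print(to_geez(31))  # -> ፴፩
```

Numbers are grouped in pairs of digits, so 2016 is read as twenty hundreds plus sixteen. Larger values need the ten-thousand sign (U+137C) and are out of scope here.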
Sentence deduplication example
from amharic_text_processor.processors import SentenceDeduplicator
deduper = SentenceDeduplicator(similarity_threshold=0.85)
text = "ሰላም ዓለም። ሰላም ዓለም። እንዴት ነህ? እርስዎ እንዴት ነው?"
result = deduper.apply(text)
print(result["text"])
# -> ሰላም ዓለም። እንዴት ነህ?
print(result["sentences_removed"]) # duplicates that were dropped
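For readers curious how near-duplicate sentence removal can work, here is a standalone sketch of the general idea. It swaps in the standard library's `difflib.SequenceMatcher` instead of RapidFuzz, so its scores and results will not necessarily match SentenceDeduplicator; the function name `dedupe_sentences` and the splitting rule are assumptions for illustration.

```python
import re
from difflib import SequenceMatcher
from typing import Dict, List, Union


def dedupe_sentences(text: str, threshold: float = 0.85) -> Dict[str, Union[str, List[str]]]:
    """Keep the first occurrence of each sentence; drop later near-duplicates."""
    # Split on whitespace that follows Ethiopic or ASCII end punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[።?!])\s+", text) if s.strip()]
    kept: List[str] = []
    removed: List[str] = []
    for sentence in sentences:
        is_duplicate = any(
            SequenceMatcher(None, sentence, earlier).ratio() >= threshold
            for earlier in kept
        )
        (removed if is_duplicate else kept).append(sentence)
    return {"text": " ".join(kept), "sentences_removed": removed}


out = dedupe_sentences("ሰላም ዓለም። ሰላም ዓለም። እንዴት ነህ?")
print(out["text"])  # -> ሰላም ዓለም። እንдет ነህ?
```

Each sentence is compared against every sentence already kept, which is quadratic in the worst case; that is fine for short documents but worth keeping in mind for long ones.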
🧧 Custom Processor Example
from amharic_text_processor import BaseProcessor

class ExampleProcessor(BaseProcessor):
    def apply(self, data):
        text = BaseProcessor._extract_text(data)
        processed = text.replace("old", "new")
        return {"text": processed, "modified": True}
Add it to a pipeline just like the built-ins.
🧪 Testing
pytest -q
🤝 Contributing
See CONTRIBUTING.md for guidelines on adding processors, running tests, and coding style.
📦 Publishing
GitHub Actions workflows are included:
- CI runs tests on pushes/PRs.
- Publish to PyPI builds and publishes on release creation.
- See CHANGELOG.md for release notes.
📜 License
MIT License.