A modular toolkit for cleaning and normalizing Amharic text.

Amharic Text Processor

Amharic Text Processor is a modular Python toolkit for cleaning, normalizing, and formatting Amharic text. Each processing step is a small class with a predictable .apply() method, and steps are easily chained with Pipeline.

Why this exists: Amharic text from the web, documents, and OCR often arrives with HTML noise, mixed Ethiopic variants, inconsistent punctuation, legacy abbreviations, and numerals in different forms. This toolkit provides predictable, composable processors so you can rapidly build robust pipelines for ML datasets, search indexing, or downstream NLP tasks without reinventing cleaning logic. Many of these components were developed while processing large volumes of Amharic text crawled from Amharic-focused websites indexed by Common Crawl.

✨ Features

Composable pipeline built from simple processor classes
Consistent I/O contract: accepts str or {"text": ...}, returns a dict with "text"
HTML stripping, whitespace cleanup, Amharic character filtering
Punctuation and Unicode normalization (keeps Ethiopic marks, preserves decimals) plus configurable regex filtering
Sentence-level deduplication using fuzzy similarity
Abbreviation handling for slash/dot forms; dotted abbreviations can be normalized before expansion
Helpers to add spaces between Ethiopic letters and digits, and to place sentences on separate lines
Noise removal for common Latin/underscore tokens and foreign-only brackets
Ethiopic→Latin transliteration using a romanization table
Pure, side-effect-free processors that are easy to test and extend
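
As a rough illustration of the two formatting helpers listed above (spacing between Ethiopic letters and digits, and one sentence per line), here is a minimal standard-library sketch. It is not the package's implementation: the function names, the character range chosen for "Ethiopic letters" (U+1200 to U+135A), and the set of sentence-ending marks are all assumptions made for this example.

```python
import re

# Assumption: "Ethiopic letters" means the syllable range U+1200..U+135A,
# and "digits" means ASCII 0-9 only.
LETTER_THEN_DIGIT = re.compile(r"([\u1200-\u135A])([0-9])")
DIGIT_THEN_LETTER = re.compile(r"([0-9])([\u1200-\u135A])")

# Assumption: a sentence ends at the Ethiopic full stop, "?" or "!".
AFTER_SENTENCE_END = re.compile(r"(?<=[።?!])\s+")


def space_letters_and_digits(text: str) -> str:
    """Insert one space wherever an Ethiopic letter touches an ASCII digit."""
    text = LETTER_THEN_DIGIT.sub(r"\1 \2", text)
    return DIGIT_THEN_LETTER.sub(r"\1 \2", text)


def one_sentence_per_line(text: str) -> str:
    """Replace the whitespace after each sentence-ending mark with a newline."""
    return AFTER_SENTENCE_END.sub("\n", text.strip())


print(space_letters_and_digits("ዜና11"))
print(one_sentence_per_line("ሰላም። እንዴት ነህ?"))
```

Text that is already spaced is left unchanged, so both helpers can be applied repeatedly without further effect.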

📦 Installation

pip install amharic-text-processor

🚀 Quick Start

from amharic_text_processor import Pipeline
from amharic_text_processor.processors import (
    HtmlStripper,
    WhitespaceNormalizer,
    PunctuationNormalizer,
    UnicodeNormalizer,
    CharacterRemapper,
    AbbreviationExpander,
    DottedAbbreviationNormalizer,
    AmharicCharacterFilter,
    CommonNoiseRemover,
)

pipeline = Pipeline([
    HtmlStripper(),             # drop HTML/script/style
    UnicodeNormalizer(),        # NFC + strip control chars
    CharacterRemapper(),        # normalize Ethiopic variants (ሠ->ሰ, ዐ->አ, ...)
    DottedAbbreviationNormalizer(),  # turn dotted abbreviations into slash form
    AbbreviationExpander(),     # expand slash/dot abbreviations (e.g., ዓ.ም. -> ዓመተ ምሕረት)
    PunctuationNormalizer(),    # unify punctuation (keeps Ethiopic marks, protects decimals)
    WhitespaceNormalizer(),     # collapse repeated whitespace
    AmharicCharacterFilter(),   # keep Ethiopic chars and safe punctuation/digits
    CommonNoiseRemover(),       # drop tokens like IMG_1124 or (FlyDubai)
])

raw = """
<article>
  <p>  ሰላም። ልኡኣ ዓ.ም. 2016 ሀ/ማርያም በሚሊዮን ይዘት ሰጠ። </p>
  <script>alert('ignore me')</script>
</article>
"""

result = pipeline.apply(raw)
print(result["text"])
# -> ሰላም። ሏ ዓመተ ምሕረት 2016 ሀይለ ማርያም በሚሊዮን ይዘት ሰጠ።

# Transliteration to Latin
from amharic_text_processor.processors import AmharicTransliterator

rawtext = "እሺ፣ የክፍያ ሂደቱን በአማርኛ እመራዎታለሁ። የአካውንት ቁጥርዎን ያስገቡ። 565 የኢትዮጵያ ንግድ ባንክ ነው።"
new_text = AmharicTransliterator().apply(rawtext)
print(new_text["text"])
# -> eshi, yakefeyaa hidatune baamaarenyaa emaraawotaalahu. yaakaawenete quterewone yaasegabu. 565 yaiteyopheyaa negede baaneke nawe.

🔗 Pipeline Contract

Input: str or dict containing "text": str
Output: always a dict with at least "text": str
Processors run in order; output from one is passed to the next
Fail-fast validation on invalid inputs or processor outputs
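
To make the contract concrete, the sketch below re-creates it with a toy pipeline built only on the standard library. It is an illustration under assumptions, not the package's Pipeline: `MiniPipeline`, `to_payload`, `check_output`, `Strip`, and `Broken` are hypothetical names, and the choice of `TypeError` for invalid data is a guess at what "fail-fast" could look like.

```python
from typing import Any, Dict, Iterable, Union

TextInput = Union[str, Dict[str, Any]]


def to_payload(data: TextInput) -> Dict[str, Any]:
    """Normalize input: a bare str becomes {"text": str}; dicts must carry "text"."""
    if isinstance(data, str):
        return {"text": data}
    if isinstance(data, dict) and isinstance(data.get("text"), str):
        return dict(data)
    raise TypeError('expected str or dict containing "text": str')


def check_output(result: Any, step: Any) -> Dict[str, Any]:
    """Fail fast if a processor returns anything other than a dict with "text"."""
    if not isinstance(result, dict) or not isinstance(result.get("text"), str):
        raise TypeError(f'{type(step).__name__} must return a dict with "text": str')
    return result


class MiniPipeline:
    """Hypothetical stand-in: runs steps in order, feeding each output to the next."""

    def __init__(self, steps: Iterable[Any]) -> None:
        self.steps = list(steps)

    def apply(self, data: TextInput) -> Dict[str, Any]:
        payload = to_payload(data)
        for step in self.steps:
            payload = check_output(step.apply(payload), step)
        return payload


class Strip:
    """Toy processor: trims leading and trailing whitespace."""

    def apply(self, data: TextInput) -> Dict[str, Any]:
        return {"text": to_payload(data)["text"].strip()}


class Broken:
    """Toy processor that violates the contract by returning None."""

    def apply(self, data: TextInput) -> None:
        return None


print(MiniPipeline([Strip()]).apply("  ሰላም  "))
```

Passing `Broken()` as a step raises immediately at that step instead of letting a malformed value travel further down the chain.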

📚 Code Documentation

Each processor and the pipeline include docstrings describing inputs/outputs and behavior (see amharic_text_processor/base.py, pipeline.py, and files in amharic_text_processor/processors/).
Browse in an editor or via pydoc amharic_text_processor.processors.<name> for details.
All processors follow the same contract: .apply(data: str | {"text": str}) -> {"text": str, ...}.
See docs/ for a quick reference (docs/index.md, docs/processors.md). To generate HTML docs locally you can run pdoc -o docs amharic_text_processor.

🧰 Built-in Processors

HtmlStripper: remove HTML tags and script/style content
WhitespaceNormalizer: collapse repeated whitespace and trim
PunctuationNormalizer: unify Ethiopic/ASCII punctuation, collapse repeats, keep decimals intact
UnicodeNormalizer: normalize Unicode (default NFC) and strip control chars
AmharicCharacterFilter: keep Ethiopic characters plus safe punctuation/digits
CharacterRemapper: normalize variant Ethiopic glyphs to canonical forms
DottedAbbreviationNormalizer: convert dotted abbreviations (e.g., እ.ኤ.አ) into slash form before expansion
AbbreviationExpander: expand slash/dot Amharic abbreviations to full forms (e.g., ፍ/ቤቱ -> ፍርድ ቤቱ, ፕ/ር -> ፕሮፌሰር, ዓ.ም. -> ዓመተ ምሕረት)
NumberToGeez: convert Arabic digits in text to Ethiopic (Geez) numerals (e.g., 31 -> ፴፩)
GeezToNumber: convert Ethiopic (Geez) numerals back to Arabic digits (e.g., ፴፩ -> 31)
WordNumberToDigits: convert Amharic worded numbers (e.g., “ሁለት ሺህ ሶስት መቶ”) to Arabic digits
DigitsToWordNumber: turn Arabic digit sequences into Amharic worded numbers (supports up to trillions)
OldPhoneMapper: convert legacy phone representations to modern forms via a predefined mapping
EthiopicNumberSpacer: insert spaces between Ethiopic letters and adjacent digits (e.g., "ዜና11" -> "ዜና 11")
SentenceLineFormatter: place each sentence on its own line after end punctuation
SentenceDeduplicator: drop exact or near-duplicate sentences with RapidFuzz similarity
AmharicTransliterator: transliterate Ethiopic (Amharic) text to Latin using a romanization table
CommonNoiseRemover: remove noisy tokens like IMG_1124 or non-Ethiopic bracketed text (some_not_amharic_words)
RegexFilter: run a configurable regex substitution with counts
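
For readers unfamiliar with Ethiopic numerals, the following sketch shows the idea behind a digit-to-Geez conversion such as the 31 -> ፴፩ example above. It is a simplified illustration, not NumberToGeez itself: the function names are hypothetical, only 1 to 9999 is handled, and it assumes the common convention that 100 is written ፻ without a leading ፩.

```python
import re

HUNDRED = "\u137B"  # ፻


def _two_digits(value: int) -> str:
    """Convert 0..99 to Ethiopic numerals (an empty string for 0)."""
    tens, ones = divmod(value, 10)
    out = ""
    if tens:
        out += chr(0x1371 + tens)  # U+1372 (፲, 10) .. U+137A (፺, 90)
    if ones:
        out += chr(0x1368 + ones)  # U+1369 (፩, 1) .. U+1371 (፱, 9)
    return out


def to_geez(number: int) -> str:
    """Convert an integer in 1..9999 to Ethiopic numerals."""
    if not 1 <= number <= 9999:
        raise ValueError("this sketch only handles 1..9999")
    hundreds, rest = divmod(number, 100)
    out = ""
    if hundreds:
        # Assumed convention: a multiplier of one is left implicit before ፻.
        out += ("" if hundreds == 1 else _two_digits(hundreds)) + HUNDRED
    return out + _two_digits(rest)


def digits_to_geez(text: str) -> str:
    """Replace ASCII digit runs in text; out-of-range numbers are left untouched."""
    def convert(match: "re.Match[str]") -> str:
        value = int(match.group())
        return to_geez(value) if 1 <= value <= 9999 else match.group()

    return re.sub(r"[0-9]+", convert, text)


print(to_geez(31))
print(digits_to_geez("ዜና 31"))
```

Larger values need the ፼ (10,000) sign and alternating group separators, which this sketch deliberately leaves out.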

Sentence deduplication example

from amharic_text_processor.processors import SentenceDeduplicator

deduper = SentenceDeduplicator(similarity_threshold=0.85)
text = "ሰላም ዓለም። ሰላም ዓለም። እንዴት ነህ? እርስዎ እንዴት ነው?"
result = deduper.apply(text)
print(result["text"])
# -> ሰላም ዓለም። እንዴት ነህ?
print(result["sentences_removed"])  # duplicates that were dropped
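
The thresholding idea can be tried without installing anything. The sketch below swaps RapidFuzz for the standard library's difflib.SequenceMatcher, so its scores will not match the package's; the function names and the sentence-splitting rule are assumptions for this example, while the "sentences_removed" key mirrors the output shown above.

```python
import re
from difflib import SequenceMatcher
from typing import Dict, List, Union

# Assumption: sentences end at the Ethiopic full stop, "?" or "!".
SENTENCE_BREAK = re.compile(r"(?<=[።?!])\s+")


def split_sentences(text: str) -> List[str]:
    """Split text into sentences, keeping the end punctuation attached."""
    return [s.strip() for s in SENTENCE_BREAK.split(text.strip()) if s.strip()]


def dedupe_sentences(text: str, threshold: float = 0.85) -> Dict[str, Union[str, List[str]]]:
    """Keep the first occurrence of each sentence; drop later near-duplicates."""
    kept: List[str] = []
    removed: List[str] = []
    for sentence in split_sentences(text):
        is_duplicate = any(
            SequenceMatcher(None, sentence, earlier).ratio() >= threshold
            for earlier in kept
        )
        (removed if is_duplicate else kept).append(sentence)
    return {"text": " ".join(kept), "sentences_removed": removed}


result = dedupe_sentences("ሰላም ዓለም። ሰላም ዓለም። እንዴት ነህ?")
print(result["text"])
print(result["sentences_removed"])
```

Each sentence is compared against every sentence kept so far, so the cost grows quadratically with the number of distinct sentences; that is acceptable for short documents but worth keeping in mind for long ones.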

🧧 Custom Processor Example

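from amharic_text_processor import BaseProcessor


class ExampleProcessor(BaseProcessor):
    """Toy processor that replaces the substring "old" with "new"."""

    def apply(self, data):
        # Accept either a str or a {"text": ...} dict, per the pipeline contract.
        text = self._extract_text(data)
        processed = text.replace("old", "new")
        # Always return a dict with "text"; extra keys such as "modified" may ride along.
        return {"text": processed, "modified": True}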

Add it to a pipeline just like the built-ins.

🧪 Testing

pytest -q

🤝 Contributing

See CONTRIBUTING.md for guidelines on adding processors, running tests, and coding style.

📦 Publishing

GitHub Actions workflows are included:

CI: runs tests on pushes and pull requests.
Publish to PyPI: builds and publishes the package when a release is created.

See CHANGELOG.md for release notes.

📜 License

MIT License.

Release history

0.1.3 (this version): Dec 23, 2025
0.1.2: Nov 25, 2025

Download files

Source Distribution

amharic_text_processor-0.1.3.tar.gz (31.5 kB)

Uploaded Dec 23, 2025 Source

Built Distribution

amharic_text_processor-0.1.3-py3-none-any.whl (29.8 kB)

Uploaded Dec 23, 2025 Python 3

File details

Details for the file amharic_text_processor-0.1.3.tar.gz.

File metadata

Download URL: amharic_text_processor-0.1.3.tar.gz
Upload date: Dec 23, 2025
Size: 31.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for amharic_text_processor-0.1.3.tar.gz
Algorithm	Hash digest
SHA256	`f605cb2fc4883b2be31c5dc8e84ba48d28d32c11a29cfca359d14b144b0aeaf3`
MD5	`7d47ddb1664640ffc2808f24c0f50f31`
BLAKE2b-256	`1d354d948b526b08ded8fdd49b31a8baa9341371c7769fde1e127c741cb652bf`
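
If you download a file manually, the published SHA256 digest can be compared against a locally computed one. This is a generic standard-library sketch, not something shipped with the package; the function names and the local file path in the usage comment are assumptions.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# Usage, assuming the sdist was saved to the current directory:
# matches_published_hash(
#     "amharic_text_processor-0.1.3.tar.gz",
#     "f605cb2fc4883b2be31c5dc8e84ba48d28d32c11a29cfca359d14b144b0aeaf3",
# )
```

A mismatch means the local file differs from the one uploaded to PyPI and should not be installed.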

Provenance

The following attestation bundles were made for amharic_text_processor-0.1.3.tar.gz:

Publisher: publish.yml on ethiopicai/Amharic-Text-Processor

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: amharic_text_processor-0.1.3.tar.gz
- Subject digest: f605cb2fc4883b2be31c5dc8e84ba48d28d32c11a29cfca359d14b144b0aeaf3
- Sigstore transparency entry: 776510078
- Sigstore integration time: Dec 23, 2025
Source repository:
- Permalink: ethiopicai/Amharic-Text-Processor@3175feaefb31e7af48118ac4760fa3cdc167c7fe
- Branch / Tag: refs/heads/main
- Owner: https://github.com/ethiopicai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@3175feaefb31e7af48118ac4760fa3cdc167c7fe
- Trigger Event: workflow_dispatch

File details

Details for the file amharic_text_processor-0.1.3-py3-none-any.whl.

File metadata

Download URL: amharic_text_processor-0.1.3-py3-none-any.whl
Upload date: Dec 23, 2025
Size: 29.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for amharic_text_processor-0.1.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f22db39b723ad04bc1202e6c7b049f3e500f306c970a2f8e0473e121334111aa`
MD5	`69996e57ec668e063633205e4bcdbaaa`
BLAKE2b-256	`ccb9079994731627034b5eeef7499315d7ef7fa4e91d9ebe8029f84e5319b909`

Provenance

The following attestation bundles were made for amharic_text_processor-0.1.3-py3-none-any.whl:

Publisher: publish.yml on ethiopicai/Amharic-Text-Processor

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: amharic_text_processor-0.1.3-py3-none-any.whl
- Subject digest: f22db39b723ad04bc1202e6c7b049f3e500f306c970a2f8e0473e121334111aa
- Sigstore transparency entry: 776510104
- Sigstore integration time: Dec 23, 2025
Source repository:
- Permalink: ethiopicai/Amharic-Text-Processor@3175feaefb31e7af48118ac4760fa3cdc167c7fe
- Branch / Tag: refs/heads/main
- Owner: https://github.com/ethiopicai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@3175feaefb31e7af48118ac4760fa3cdc167c7fe
- Trigger Event: workflow_dispatch
