Amharic tokenizer with BPE-like merges over decomposed fidel (Cython)

Amharic Tokenizer 🇪🇹

A GPT-style, BPE-like tokenizer for Amharic that operates on decomposed fidel. The pipeline is: cleaning → fidel decomposition → BPE training/application → detokenization, with a Cython core for speed.
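The BPE training step of this pipeline can be illustrated with a toy merge loop. The sketch below is a generic byte-pair-encoding trainer written for illustration, not the package's Cython implementation. The names most_frequent_pair, merge_pair, and train_bpe are hypothetical, and the input is assumed to be a dict that maps each word (a tuple of symbols) to its count.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, or None if there is none."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    # Ties are broken by first occurrence (dict insertion order).
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merges; return (merges, final segmentation)."""
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:  # nothing left to merge
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words
```

Each learned merge joins two symbols into one longer symbol, so a larger merge budget tends to produce longer subwords.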

What's new in 0.1.2

WordPiece-style continuation prefixes: non-initial subwords are now prefixed with ## during tokenization.

Example: Going → ['G', '##o', '##i', '##n', '##g', '</w>']

Amharic example:

Input: የተባለ ውን የሚያደርገው ም በዚህ ምክንያት ነው

Tokens:

['የአተአ', '##በ', '##ኣለ', '##አ', '</w>', ' ', 'ወእ', '##ነ', '##እ', '</w>', ' ', 'የአመኢየኣ', '##ደ', '##አረ', '##እ', '##ገ', '##አወእ', '</w>', ' ', 'መእ', '</w>', ' ', 'በአ', '##ዘኢ', '##ሀ', '##እ', '</w>', ' ', 'መእ', '##ከ', '##እነእ', '##የኣ', '##ተእ', '</w>', ' ', 'ነ', '##አወእ', '</w>']

Detokenization matches the input.

Detokenization fixes:
- Strips ## prefixes and handles embedded </w> markers so that neither leaks into the detokenized text.
- Avoids extra spaces resulting from end-of-word handling.

Developer ergonomics: AmharicTokenizer.from_default() returns a minimally trained instance for quick experiments.

Note: The </w> token remains an internal end-of-word marker in the token stream; it is never emitted in detokenized text.
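The marker conventions described in this section can be sketched in a few lines. This is an illustrative reimplementation under stated assumptions, not the package's code: mark_continuations and strip_markers are hypothetical helper names, and the second function only removes markers (for Amharic text the result would still be decomposed fidel that needs recomposing).

```python
def mark_continuations(pieces):
    """Prefix every non-initial subword with ## and close the word with </w>."""
    marked = [p if i == 0 else "##" + p for i, p in enumerate(pieces)]
    return marked + ["</w>"]


def strip_markers(tokens):
    """Join tokens into text, dropping ## prefixes and </w> markers."""
    parts = []
    for tok in tokens:
        if tok.startswith("##"):
            tok = tok[2:]
        # Handles both a standalone </w> token and one embedded in a subword.
        parts.append(tok.replace("</w>", ""))
    return "".join(parts)


print(mark_continuations(list("Going")))
```

Spaces are carried as their own tokens in the example above, so stripping the markers and concatenating is enough to restore word boundaries in this sketch.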

Installation

From PyPI (recommended)

python -m venv .venv
source .venv/bin/activate  # Linux/Mac
.venv\Scripts\activate     # Windows

pip install amharic-tokenizer

Verify the CLI:

amh-tokenizer --help

From source (for development)

python -m venv .venv
source .venv/bin/activate
pip install -e .

Training (CLI)

# Train on a cleaned Amharic text corpus and save model
amh-tokenizer train /abs/path/to/cleaned_amharic.txt /abs/path/to/amh_bpe \
  --num-merges 50000 --verbose --log-every 2000

# Example using relative paths
amh-tokenizer train cleaned_amharic.txt amh_bpe --num-merges 50000 --verbose --log-every 2000
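The same training run can presumably be driven from Python with the constructor and the train and save methods listed in the API section. The sketch below assumes those signatures and wraps them in a hypothetical helper, train_and_save, which takes the tokenizer class as an argument so the snippet does not depend on the package being importable. Whether the CLI performs exactly these steps is an assumption.

```python
from pathlib import Path


def train_and_save(tokenizer_cls, corpus_path, model_prefix,
                   num_merges=50000, log_every=2000):
    """Train a tokenizer on a UTF-8 text file and save it under a path prefix."""
    corpus_text = Path(corpus_path).read_text(encoding="utf-8")
    tok = tokenizer_cls(num_merges=num_merges)
    tok.train(corpus_text, verbose=True, log_every=log_every)
    tok.save(model_prefix)
    return tok


# Intended usage (requires the package to be installed):
#   from amharic_tokenizer import AmharicTokenizer
#   train_and_save(AmharicTokenizer, "cleaned_amharic.txt", "amh_bpe")
```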

Quick Usage (Python)

from amharic_tokenizer import AmharicTokenizer

# Load a trained model by its path prefix
tok = AmharicTokenizer.load("amh_bpe_v0.2.0")

text = "ኢትዮጵያ ጥሩ ናት።"

# Tokenize into variable-length subword tokens
tokens = tok.tokenize(text)
print(tokens)

# Tokens to ids
ids = tok.encode(text)  # or tok.convert_tokens_to_ids(tokens)

# Ids back to tokens
tokens = tok.convert_ids_to_tokens(ids)

# Hide the internal </w> end-of-word marker for display
display_tokens = [t.replace('</w>', '') for t in tokens if t != '</w>']
print(display_tokens)

# Detokenize back to the original text
print(tok.detokenize(tokens))

Example Script

# Test a single string
python examples/try_tokenizer.py amh_bpe --text "ኢትዮጵያ ጥሩ ናት።"

# Test a file
python examples/try_tokenizer.py amh_bpe --file cleaned_amharic.txt

Tip: If running examples directly by path, ensure the package is installed (pip install -e .) or run as a module from the project root:

python -m examples.try_tokenizer amh_bpe --text "..."

API

AmharicTokenizer(num_merges=50000)

train(corpus_text, verbose=False, log_every=1000) -> int
tokenize(text) -> list[str]
detokenize(tokens) -> str
save(path_prefix) / load(path_prefix)
is_trained() -> bool
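A quick way to exercise tokenize and detokenize together is a round-trip check. The helper below, roundtrip_failures, is a hypothetical name and not part of the package. It relies only on the two methods listed above and works with any object that provides them.

```python
def roundtrip_failures(tok, texts):
    """Return (original, restored) pairs for texts that do not survive a round trip."""
    failures = []
    for text in texts:
        restored = tok.detokenize(tok.tokenize(text))
        if restored != text:
            failures.append((text, restored))
    return failures


# Intended usage (requires the package to be installed):
#   from amharic_tokenizer import AmharicTokenizer
#   tok = AmharicTokenizer.from_default()
#   print(roundtrip_failures(tok, ["ኢትዮጵያ ጥሩ ናት።"]))
```

An empty list means every input was reproduced exactly.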

Notes

Larger, more diverse corpora and a higher num_merges generally yield longer subwords, and therefore fewer tokens per word.
Training and tokenization work over decomposed fidel; detokenization recomposes the original Amharic characters.
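Fidel decomposition can be illustrated with Unicode arithmetic. The token example earlier on this page suggests that each syllable is split into its first-order consonant plus a vowel carrier from the አ row (for instance, ባ appears as በ followed by ኣ). The sketch below reproduces that pattern under two assumptions: the Ethiopic block is treated as regular rows of eight code points, and the eighth column (labialized forms) is left untouched. The package's actual rules may differ, and decompose and compose are hypothetical names.

```python
ETHIOPIC_START, ETHIOPIC_END = 0x1200, 0x135A  # Ethiopic syllables
VOWEL_ROW = 0x12A0                             # first code point of the አ row


def decompose(text):
    """Split each syllable (orders 1-7) into base consonant + vowel carrier."""
    out = []
    for ch in text:
        cp = ord(ch)
        order = (cp - ETHIOPIC_START) % 8      # column within the 8-wide row
        if ETHIOPIC_START <= cp <= ETHIOPIC_END and order < 7:
            out.append(chr(cp - order))        # first-order form of the row
            out.append(chr(VOWEL_ROW + order)) # carrier encoding the vowel
        else:
            out.append(ch)                     # spaces, punctuation, 8th column
    return "".join(out)


def compose(text):
    """Inverse of decompose: fold base + vowel carrier back into one syllable."""
    out, i = [], 0
    while i < len(text):
        cp = ord(text[i])
        is_base = (ETHIOPIC_START <= cp <= ETHIOPIC_END
                   and (cp - ETHIOPIC_START) % 8 == 0)
        if (is_base and i + 1 < len(text)
                and VOWEL_ROW <= ord(text[i + 1]) <= VOWEL_ROW + 6):
            out.append(chr(cp + ord(text[i + 1]) - VOWEL_ROW))
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

Because every syllable in orders 1-7 becomes exactly one base-plus-carrier pair, compose(decompose(text)) returns the original text in this sketch.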

Troubleshooting

ModuleNotFoundError inside the repo: install in editable mode (pip install -e .) or run scripts from outside the repo to avoid shadowing the installed package.
TestPyPI installs: add PyPI as an extra index so that build dependencies are resolved from PyPI:

pip install -i https://test.pypi.org/simple/ \
    --extra-index-url https://pypi.org/simple amharic-tokenizer

License

This project is licensed under the MIT License – see the LICENSE file for details.

Release history

0.2.6: Nov 17, 2025
0.2.5: Nov 9, 2025
0.2.4: Nov 6, 2025
0.2.3: Nov 5, 2025
0.2.2: Nov 5, 2025
0.2.1: Nov 5, 2025
0.2.0: Oct 30, 2025 (this version)
0.1.9: Oct 30, 2025
0.1.8: Oct 30, 2025

Download files

Source Distribution

amharic_tokenizer-0.2.0.tar.gz (3.8 MB view details)

Uploaded Oct 30, 2025 Source

File details

Details for the file amharic_tokenizer-0.2.0.tar.gz.

File metadata

Download URL: amharic_tokenizer-0.2.0.tar.gz
Upload date: Oct 30, 2025
Size: 3.8 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for amharic_tokenizer-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`d964825a68764b5242bd1222f805e2b7879b071f739b087822a668f9abe972dd`
MD5	`3e5a32efb50eaecec54ffed5fc9e5842`
BLAKE2b-256	`f8894ee3d4a462a5db1295558b796dbcb048b97ee7880c09e441b5b97dc5654d`
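A downloaded archive can be checked against the SHA256 digest listed above using only the standard library. This is a minimal sketch: sha256_of and matches_expected are hypothetical helper names, and the file path is whatever location the archive was saved to.

```python
import hashlib

# SHA256 digest listed for amharic_tokenizer-0.2.0.tar.gz
EXPECTED_SHA256 = "d964825a68764b5242bd1222f805e2b7879b071f739b087822a668f9abe972dd"


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True if the file's digest equals the expected hex digest."""
    return sha256_of(path) == expected.lower()


# Intended usage:
#   matches_expected("amharic_tokenizer-0.2.0.tar.gz")
```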
