Amharic tokenizer with BPE-like merges over decomposed fidel (Cython)
Amharic Tokenizer 🇪🇹
Amharic tokenizer with a GPT-style BPE-like pipeline over decomposed fidel. Implements: cleaning → fidel decomposition → BPE training/application → detokenization, with a Cython core for speed.
What's new in 0.1.2
- WordPiece-style continuation prefixes: non-initial subwords are now prefixed with ## during tokenization.
  - Example: Going → ['G', '##o', '##i', '##n', '##g', '</w>']
  - Amharic example:
    Input: የተባለ ውን የሚያደርገው ም በዚህ ምክንያት ነው
    Tokens: ['የአተአ', '##በ', '##ኣለ', '##አ', '</w>', ' ', 'ወእ', '##ነ', '##እ', '</w>', ' ', 'የአመኢየኣ', '##ደ', '##አረ', '##እ', '##ገ', '##አወእ', '</w>', ' ', 'መእ', '</w>', ' ', 'በአ', '##ዘኢ', '##ሀ', '##እ', '</w>', ' ', 'መእ', '##ከ', '##እነእ', '##የኣ', '##ተእ', '</w>', ' ', 'ነ', '##አወእ', '</w>']
    Detokenization matches the input.
- Detokenization fixes:
  - Strips ## correctly and handles embedded </w> markers without leaking into text.
  - Avoids extra spaces resulting from end-of-word handling.
- Developer ergonomics:
  - AmharicTokenizer.from_default() returns a minimally trained instance for quick experiments.

Note: The </w> token remains an internal end-of-word marker in the token stream; it is never emitted in detokenized text.
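To make the marker conventions concrete, here is a minimal pure-Python sketch of how ## continuation prefixes and the </w> end-of-word marker could be added and then stripped again. The helper names mark_word and join_tokens are hypothetical and are not part of the package API; the sketch also skips fidel recomposition, which the real detokenizer performs.

```python
END = "</w>"   # end-of-word marker
CONT = "##"    # continuation prefix for non-initial subwords


def mark_word(pieces):
    """Prefix every non-initial piece with ## and append the end-of-word marker."""
    if not pieces:
        return []
    return [pieces[0]] + [CONT + p for p in pieces[1:]] + [END]


def join_tokens(tokens):
    """Rebuild plain text: drop ## prefixes and remove </w>, standalone or embedded."""
    parts = []
    for tok in tokens:
        if tok.startswith(CONT):
            tok = tok[len(CONT):]
        parts.append(tok.replace(END, ""))
    return "".join(parts)


tokens = mark_word(list("Going")) + [" "] + mark_word(["to"])
print(tokens)
print(join_tokens(tokens))
```

Because whitespace is carried as its own token, dropping </w> does not need to insert a space, which is one way to avoid the extra-space problem mentioned in the list.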
Installation
From PyPI (recommended)
python -m venv .venv
source .venv/bin/activate # Linux/Mac
.venv\Scripts\activate # Windows
pip install amharic-tokenizer
Verify the CLI:
amh-tokenizer --help
From source (for development)
python -m venv .venv
source .venv/bin/activate
pip install -e .
Training (CLI)
# Train on a cleaned Amharic text corpus and save model
amh-tokenizer train /abs/path/to/cleaned_amharic.txt /abs/path/to/amh_bpe \
--num-merges 50000 --verbose --log-every 2000
# Example using relative paths
amh-tokenizer train cleaned_amharic.txt amh_bpe --num-merges 50000 --verbose --log-every 2000
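To illustrate what --num-merges controls, the following is a toy sketch of classic BPE merge learning in pure Python. It is not the package's Cython implementation: the function names (merge_pair, train_bpe, apply_bpe), the tie-breaking rule, and the choice to let merges absorb </w> are all assumptions made for illustration.

```python
from collections import Counter

END = "</w>"  # end-of-word marker appended to every word


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    vocab = Counter(tuple(word) + (END,) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by comparing the pairs themselves.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Segment one word by replaying the learned merges in order."""
    symbols = tuple(word) + (END,)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


corpus = ["abab", "abab", "ab"]
print(apply_bpe("abab", train_bpe(corpus, num_merges=1)))
print(apply_bpe("abab", train_bpe(corpus, num_merges=2)))
```

Running it with one merge and then two shows the effect of the setting: each additional merge can only fuse existing symbols, so more merges yield fewer, longer subwords.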
Quick Usage (Python)
from amharic_tokenizer import AmharicTokenizer
# Load a trained model
tok = AmharicTokenizer.load("amh_bpe_v0.2.0")
text = "ኢትዮጵያ ጥሩ ናት።"
# Tokenize
tokens = tok.tokenize(text)
print(tokens) # variable-length subword tokens
# Tokens to ids
ids = tok.encode(text) # or tok.convert_tokens_to_ids(tokens)
# Ids to tokens
tokens = tok.convert_ids_to_tokens(ids)
display_tokens = [t.replace('</w>', '') for t in tokens if t != '</w>']
print(display_tokens)
# Detokenize back to original text
print(tok.detokenize(tokens))
Example Script
# Test a single string
python examples/try_tokenizer.py amh_bpe --text "ኢትዮጵያ ጥሩ ናት።"
# Test a file
python examples/try_tokenizer.py amh_bpe --file cleaned_amharic.txt
Tip: If running examples directly by path, ensure the package is installed (pip install -e .) or run as a module from the project root:
python -m examples.try_tokenizer amh_bpe --text "..."
API
AmharicTokenizer(num_merges=50000)
- train(corpus_text, verbose=False, log_every=1000) -> int
- tokenize(text) -> list[str]
- detokenize(tokens) -> str
- save(path_prefix) / load(path_prefix)
- is_trained() -> bool
Notes
- Longer, more diverse corpora and higher num_merges produce longer subwords.
- Training and tokenization work over decomposed fidel; detokenization recomposes the original Amharic characters.
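As a rough illustration of fidel decomposition, the sketch below splits each Ethiopic syllable into its first-order base consonant plus a vowel marker (አ, ኡ, ኢ, ኣ, ኤ, እ, ኦ) and recomposes it afterwards. It assumes the Unicode layout in which each consonant row holds its orders at consecutive code points, and it is inferred from the token example in the release notes rather than taken from the package's source; the function names decompose and compose are hypothetical, and the real Cython core may use a lookup table and treat irregular rows differently.

```python
ETHIOPIC_START = 0x1200  # ሀ, first syllable of the Ethiopic block
ETHIOPIC_END = 0x135A    # last syllable handled by this sketch
VOWEL_BASE = 0x12A0      # አ; the next six code points are ኡ ኢ ኣ ኤ እ ኦ


def decompose(text):
    """Split each syllable into base consonant + vowel marker (orders 1 to 7 only)."""
    out = []
    for ch in text:
        cp = ord(ch)
        order = (cp - ETHIOPIC_START) % 8
        if ETHIOPIC_START <= cp <= ETHIOPIC_END and order < 7:
            out.append(chr(cp - order))          # first-order form of the consonant
            out.append(chr(VOWEL_BASE + order))  # vowel marker for this order
        else:
            out.append(ch)  # spaces, punctuation, eighth-column forms: unchanged
    return "".join(out)


def compose(text):
    """Inverse of decompose: fuse each base consonant with the marker that follows it."""
    out, i = [], 0
    while i < len(text):
        cp = ord(text[i])
        is_base = ETHIOPIC_START <= cp <= ETHIOPIC_END and (cp - ETHIOPIC_START) % 8 == 0
        if is_base and i + 1 < len(text):
            order = ord(text[i + 1]) - VOWEL_BASE
            if 0 <= order < 7:
                out.append(chr(cp + order))
                i += 2
                continue
        out.append(text[i])
        i += 1
    return "".join(out)


sample = "የተባለ ውን የሚያደርገው ም በዚህ ምክንያት ነው"
print(decompose(sample))
print(compose(decompose(sample)) == sample)
```

Decomposition lets the merge step share statistics across all orders of a consonant, at the cost of roughly doubling the sequence length before merges are applied.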
Troubleshooting
- ModuleNotFoundError inside the repo: install in editable mode (pip install -e .) or run scripts from outside the repo to avoid shadowing the installed package.
- TestPyPI installs: resolve build dependencies from PyPI:
pip install -i https://test.pypi.org/simple/ \
--extra-index-url https://pypi.org/simple amharic-tokenizer
License
This project is licensed under the MIT License – see the LICENSE file for details.
Source Distribution
File details
Details for the file amharic_tokenizer-0.2.0.tar.gz.
File metadata
- Download URL: amharic_tokenizer-0.2.0.tar.gz
- Upload date:
- Size: 3.8 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d964825a68764b5242bd1222f805e2b7879b071f739b087822a668f9abe972dd |
| MD5 | 3e5a32efb50eaecec54ffed5fc9e5842 |
| BLAKE2b-256 | f8894ee3d4a462a5db1295558b796dbcb048b97ee7880c09e441b5b97dc5654d |