EthioBBPE
Enterprise-grade Amharic & Ge'ez Biblical Text Tokenizer with Byte Pair Encoding
📖 Overview
EthioBBPE is a professional, production-ready Byte Pair Encoding (BPE) tokenizer specifically designed for Amharic, Ge'ez, and biblical texts. It provides state-of-the-art tokenization for Ethiopian languages with support for ancient scripts and special characters.
✨ Key Features
- 🚀 One-line Installation: `pip install EthioBBPE`
- 🤖 Auto-download: Models are downloaded automatically from the Hugging Face Hub
- 📦 Compressed Storage: Gzip compression for efficient storage (65%+ size reduction)
- ⚡ High Performance: Optimized for speed with batch processing
- 🎯 Perfect Reconstruction: 100% accuracy on Amharic biblical texts
- 🔧 Professional API: Clean, intuitive interface inspired by Hugging Face
- 📊 Production Ready: Checkpointing, quantization, and comprehensive metrics
🚀 Quick Start
Installation
```bash
pip install EthioBBPE
```
Basic Usage
```python
from ethiobbpe import EthioBBPETokenizer

# Load the pretrained tokenizer (auto-downloads from Hugging Face)
tokenizer = EthioBBPETokenizer.from_pretrained()

# Encode text
text = "ሰላም ለኢዮብ ዘኢነበበ ከንቶ ።"
encoded = tokenizer.encode(text)
print(f"Tokens: {encoded.tokens}")
print(f"IDs: {encoded.ids}")

# Decode back to text
decoded = tokenizer.decode(encoded.ids)
print(f"Decoded: {decoded}")
# Output: ሰላም ለኢዮብ ዘኢነበበ ከንቶ ። (100% accurate!)
```
Advanced Usage
```python
from ethiobbpe import AutoTokenizer

# Alternative: use the AutoTokenizer factory
tokenizer = AutoTokenizer.from_pretrained("Nexuss0781/Ethio-BBPE")

# Batch encoding
texts = [
    "በመዠመሪያ፡እግዚአብሔር፡ሰማይንና፡ምድርን፡ፈጠረ።",
    "ወደ ቍስጥንጥንያ አገርም በደረሰች ጊዜ",
    "ሐዋርያ መንፈስ ይቤ እንዘ ያነክር ሕይወቶ"
]
encodings = tokenizer.encode_batch(texts)
for i, enc in enumerate(encodings):
    print(f"Text {i}: {len(enc.tokens)} tokens")

# Callable interface
result = tokenizer("ሰላም ዓለም")
print(result.tokens)

# Get the vocabulary
vocab = tokenizer.get_vocab()
print(f"Vocabulary size: {tokenizer.get_vocab_size()}")
```
📊 Model Details
| Property | Value |
|---|---|
| Model Name | EthioBBPE_AmharicBible |
| Vocabulary Size | 16,000 tokens |
| Training Data | 61,769 lines (Synaxarium + Canon Biblical) |
| Compression | Gzip Level 9 (65%+ reduction) |
| Reconstruction Accuracy | 100% |
| License | Apache 2.0 |
Training Datasets
- Synaxarium Dataset: 366 religious texts
- Canon Biblical Dataset: 61,403 Amharic-English parallel texts
Total corpus: ~27.5 MB of high-quality Amharic biblical text.
🛠️ API Reference
EthioBBPETokenizer
Main tokenizer class with full encoding/decoding capabilities.
Methods
- `from_pretrained(model_name, cache_dir, force_download)`: Load a pretrained model
- `from_file(file_path)`: Load from a local file
- `encode(text, add_special_tokens, truncation, max_length)`: Encode a single text
- `encode_batch(texts, ...)`: Encode multiple texts
- `decode(ids, skip_special_tokens)`: Decode token IDs
- `decode_batch(batch_ids, ...)`: Decode multiple sequences
- `get_vocab()`: Get the vocabulary dictionary
- `get_vocab_size()`: Get the vocabulary size
- `save(path)`: Save the tokenizer to a file
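The sketch below walks through a save/reload round trip using the methods listed above; the keyword usage is assumed to match the signatures as listed, so adjust if your installed version differs.

```python
from ethiobbpe import EthioBBPETokenizer

tokenizer = EthioBBPETokenizer.from_pretrained()

# Persist the tokenizer to a local file, then load it back
# without touching the network.
tokenizer.save("my_tokenizer.json")
local_tokenizer = EthioBBPETokenizer.from_file("my_tokenizer.json")

# Truncated encoding: cap the sequence at max_length tokens.
enc = local_tokenizer.encode("ሰላም ዓለም", truncation=True, max_length=8)
print(enc.ids)
```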
AutoTokenizer
Factory class for automatic tokenizer loading.
```python
from ethiobbpe import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Nexuss0781/Ethio-BBPE")
```
Encoding
Wrapper object for encoding results with properties:
- `ids`: Token IDs (`List[int]`)
- `tokens`: Token strings (`List[str]`)
- `attention_mask`: Attention mask (`List[int]`)
- `type_ids`: Token type IDs (`List[int]`)
- `offsets`: Character offsets (`List[tuple]`)
- `special_tokens_mask`: Special tokens mask (`List[int]`)
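An illustrative sketch of reading these properties, assuming they are parallel lists with one entry per token (as in Hugging Face's Encoding object):

```python
enc = tokenizer.encode("ሰላም ለኢዮብ")

# Walk the parallel lists together: token string, id, and the
# character span the token covers in the input text.
for token, token_id, (start, end) in zip(enc.tokens, enc.ids, enc.offsets):
    print(f"{token!r:12} id={token_id:<6} chars {start}:{end}")

# For an unpadded single sequence the attention mask is all ones.
print(enc.attention_mask)
```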
📈 Performance
Compression Benefits
| Format | Size | Size reduction |
|---|---|---|
| tokenizer.json | 1.3 MB | Baseline |
| vocab.json.gz | 136 KB | 89.8% |
| tokenizer_quantized_8bit.json.gz | 56 KB | 95.7% |
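The gzip archives use standard compression, so a vocabulary file can be inspected with nothing but the Python standard library. This sketch assumes `vocab.json.gz` is a plain JSON token-to-ID mapping, as the file name suggests:

```python
import gzip
import json

# Decompress and parse the vocabulary in one pass.
with gzip.open("vocab.json.gz", "rt", encoding="utf-8") as f:
    vocab = json.load(f)

print(f"{len(vocab)} entries")  # expected: 16000
```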
Speed
- Single encoding: < 1ms
- Batch encoding (32 texts): < 10ms
- Model download: Automatic caching for subsequent uses
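To reproduce these timings on your own hardware, a micro-benchmark along these lines should suffice (numbers will vary by machine):

```python
import time
from ethiobbpe import EthioBBPETokenizer

tokenizer = EthioBBPETokenizer.from_pretrained()
texts = ["ሰላም ለኢዮብ ዘኢነበበ ከንቶ ።"] * 32

# Warm-up call so one-time setup cost is excluded from the measurement.
tokenizer.encode_batch(texts)

start = time.perf_counter()
tokenizer.encode_batch(texts)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"Batch of 32 texts: {elapsed_ms:.2f} ms")
```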
🔬 Technical Details
Architecture
- Algorithm: Byte Pair Encoding (BPE)
- Pre-tokenization: Whitespace splitting
- Special Tokens: [PAD], [CLS], [SEP], [MASK], [UNK] (see the sketch after this list)
- Vocabulary: 16,000 tokens optimized for Amharic/Ge'ez
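The `add_special_tokens` flag on `encode()` controls whether these markers are inserted. The sketch below assumes the flag behaves as in Hugging Face tokenizers (e.g. wrapping the sequence in [CLS] … [SEP] when enabled):

```python
from ethiobbpe import EthioBBPETokenizer

tokenizer = EthioBBPETokenizer.from_pretrained()

# Compare the token stream with and without special-token insertion.
with_special = tokenizer.encode("ሰላም ዓለም", add_special_tokens=True)
plain = tokenizer.encode("ሰላም ዓለም", add_special_tokens=False)
print(with_special.tokens)
print(plain.tokens)
```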
Production Features
- Checkpointing: SHA256 integrity verification for fault-tolerant training (see the sketch after this list)
- Quantization: 8-bit and 4-bit options for deployment optimization
- Metrics Tracking: Comprehensive training statistics in JSON format
- Multiple Export Formats: tokenizer.json, vocab.json.gz, quantized versions
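Integrity checks of this kind need only the standard library. The helper below is hypothetical (EthioBBPE's internal checkpoint API is not documented here) but demonstrates the streaming SHA256 verification described above:

```python
import hashlib

def file_sha256(path: str) -> str:
    """Compute a file's SHA256 digest in streaming fashion."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large checkpoints don't need to
        # fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Compare against the digest recorded when the checkpoint was written.
print(file_sha256("checkpoint.json.gz"))
```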
📝 Examples
Ge'ez Punctuation Handling
```python
from ethiobbpe import EthioBBPETokenizer

tokenizer = EthioBBPETokenizer.from_pretrained()

# Ancient Ge'ez punctuation marks
geez_text = "፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠"
encoded = tokenizer.encode(geez_text)
print(f"Ge'ez punctuation: {encoded.tokens}")
# A single token covers the repeated marks!

decoded = tokenizer.decode(encoded.ids)
assert decoded == geez_text  # Perfect reconstruction
```
Biblical Text Processing
```python
biblical_text = """
ሰላም ለኢዮብ ዘኢነበበ ከንቶ ።
አመ አኀዞ አበቅ ወአመ አህጎለ ጥሪቶ ።
ሐዋርያ መንፈስ ይቤ እንዘ ያነክር ሕይወቶ ።
"""

encoded = tokenizer.encode(biblical_text)
print(f"Token count: {len(encoded.tokens)}")
print(f"Perfect reconstruction: {tokenizer.decode(encoded.ids) == biblical_text}")
```
🤝 Integration
With Hugging Face Transformers
```python
from ethiobbpe import EthioBBPETokenizer

# Use EthioBBPE as the tokenizer in front of a Transformers model
ethio_tokenizer = EthioBBPETokenizer.from_pretrained()

# Integrate with your pipeline: the returned dict can be fed to any
# model that accepts input_ids and attention_mask.
def tokenize_for_model(text):
    encoding = ethio_tokenizer.encode(text)
    return {
        "input_ids": encoding.ids,
        "attention_mask": encoding.attention_mask,
    }
```
With PyTorch DataLoader
```python
import torch
from torch.utils.data import Dataset
from ethiobbpe import EthioBBPETokenizer

class AmharicDataset(Dataset):
    def __init__(self, texts, tokenizer, max_length=512):
        self.texts = texts
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        encoding = self.tokenizer.encode(
            self.texts[idx],
            truncation=True,
            max_length=self.max_length,
        )
        return {
            "input_ids": torch.tensor(encoding.ids),
            "attention_mask": torch.tensor(encoding.attention_mask),
        }
```
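A minimal usage sketch with the AmharicDataset above. Since encoded sequences vary in length, `batch_size=1` sidesteps padding; real batching would need a `collate_fn` that pads `input_ids` and `attention_mask` to a common length:

```python
from torch.utils.data import DataLoader
from ethiobbpe import EthioBBPETokenizer

texts = ["ሰላም ለኢዮብ ዘኢነበበ ከንቶ ።", "ወደ ቍስጥንጥንያ አገርም በደረሰች ጊዜ"]
dataset = AmharicDataset(texts, EthioBBPETokenizer.from_pretrained())

# Iterate one example at a time; each batch is a dict of tensors.
loader = DataLoader(dataset, batch_size=1)
for batch in loader:
    print(batch["input_ids"].shape)
```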
📦 Installation Options
Basic Installation
```bash
pip install EthioBBPE
```
Development Installation
```bash
pip install EthioBBPE[dev]
```
From Source
```bash
git clone https://github.com/nexuss0781/Ethio_BBPE.git
cd Ethio_BBPE
pip install -e .
```
🧪 Testing
```bash
# Run tests
pytest tests/

# With coverage
pytest tests/ --cov=ethiobbpe
```
🙏 Acknowledgments
- Training data from Synaxarium Dataset
- Training data from Canon Biblical Dataset
- Built with Hugging Face Tokenizers
📄 License
Apache License 2.0 - See LICENSE for details.
👤 Author
Nexuss0781
Email: nexuss0781@gmail.com
Download files
File details
Details for the file ethiobbpe-2.0.0.tar.gz.
File metadata
- Size: 14.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 85efac41f4dcd07f6cc50a681da8eb76881b60ae249495df1f549aaa33ffa913 |
| MD5 | e18eb2a11bd5510d9b0a9d0e8b9be472 |
| BLAKE2b-256 | be960e648f201cc077655f83b222db050ac9fdd7656110c1385dab0686c0c387 |
File details
Details for the file ethiobbpe-2.0.0-py3-none-any.whl.
File metadata
- Size: 10.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2d0a9be9d4a3dc699d46e3f1dae1382bb8ba239cdeb274075c61f6c80b5bfc84 |
| MD5 | ee0942e4bb2b8d517a8c855c4cc2df18 |
| BLAKE2b-256 | 3403d99c3c1fd9eb3aaa7b0f40cdc99ae2d27e4e217605953245f6c1beaa1c84 |