Advanced Byte Pair Encoding Tokenizer for Ethiopian Languages (Amharic, Tigrinya, Ge'ez)
Reason this release was yanked:
This older version may have dependency mismatches with your project's dependencies.
Project description
EthioBBPE
Advanced Byte Pair Encoding Tokenizer for Ethiopian Languages
EthioBBPE is a production-ready, high-performance tokenizer optimized for Amharic, Tigrinya, and Ge'ez scripts. Built with advanced features including checkpointing, multi-format compression, and model quantization, it delivers efficient text processing for Ethiopian languages.
✨ Features
🚀 Production-Ready
- Automatic Model Download: Seamlessly downloads pretrained models from Hugging Face Hub on first use
- Embedded Models: Includes pretrained weights in the package for offline usage
- Zero Configuration: Works out of the box with sensible defaults
🔧 Advanced Capabilities
- Checkpointing: SHA256 integrity verification for fault-tolerant training
- Multi-Format Compression: Support for gzip, bz2, and lzma/xz (up to 90% size reduction; see the sketch after this list)
- Model Quantization: 8-bit and 4-bit quantization for efficient deployment
- Batch Processing: Efficient encoding/decoding of text batches
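The checkpointing and compression features amount to compressed JSON plus a recorded digest. Here is a minimal, hypothetical sketch of the idea using only the standard library (not the package's actual internals):

```python
import gzip
import hashlib
import json

# Toy vocabulary standing in for a real checkpoint payload.
vocab = {"ሰላም": 0, "ለ": 1}

# Write the checkpoint as gzip-compressed JSON (level 9, as quoted above).
payload = json.dumps(vocab, ensure_ascii=False).encode("utf-8")
compressed = gzip.compress(payload, compresslevel=9)
with open("vocab.json.gz", "wb") as f:
    f.write(compressed)

# Record the SHA256 digest for later integrity checks.
digest = hashlib.sha256(compressed).hexdigest()

# On resume: verify the digest before trusting the checkpoint.
with open("vocab.json.gz", "rb") as f:
    data = f.read()
assert hashlib.sha256(data).hexdigest() == digest, "checkpoint corrupted"
restored = json.loads(gzip.decompress(data).decode("utf-8"))
assert restored == vocab
```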
📊 Performance
- Vocabulary Size: 16,000 tokens optimized for Ethiopic scripts
- Compression Ratio: ~90% size reduction with gzip level 9
- Perfect Reconstruction: 100% accuracy on Amharic biblical texts
- Fast Inference: Optimized for production workloads
📦 Installation
Basic Installation
```bash
pip install EthioBBPE
```
With Training Dependencies
```bash
pip install "EthioBBPE[training]"
```
Development Installation
```bash
pip install "EthioBBPE[dev]"
```
🎯 Quick Start
Simple Usage
```python
from ethiobbpe import EthioBBPE

# Initialize tokenizer (auto-downloads the model on first use)
tokenizer = EthioBBPE()

# Encode text
text = "ሰላም ለኢዮብ ዘኢነበበ"
encoded = tokenizer.encode(text)
print(encoded['ids'])     # Token IDs
print(encoded['tokens'])  # Token strings

# Decode back to text
decoded = tokenizer.decode(encoded['ids'])
print(decoded)  # Perfect reconstruction: "ሰላም ለኢዮብ ዘኢነበበ"
```
Batch Processing
```python
texts = [
    "በመዠመሪያ፡እግዚአብሔር፡ሰማይንና፡ምድርን፡ፈጠረ።",
    "ሰላም ለኢዮብ ዘኢነበበ ከንቶ ።",
]

# Encode a batch of texts
encoded_batch = tokenizer.encode_batch(texts)

# Decode the batch
decoded_batch = tokenizer.decode_batch([e['ids'] for e in encoded_batch])
```
Load from Hugging Face Hub
```python
from ethiobbpe import EthioBBPE

# Load a specific model from the Hugging Face Hub
tokenizer = EthioBBPE.from_pretrained("nexuss0781/Ethio-BBPE")
```
🔬 Advanced Usage
Custom Configuration
```python
from ethiobbpe import Config, EthioBBPE

# Create a custom configuration
config = Config(
    model_name="MyTokenizer",
    vocab_size=32000,
    compression_format="bz2",
    compression_level=9,
    enable_quantization=True,
    quantization_bits=8,
)

# Use the configuration
tokenizer = EthioBBPE(model_name=config.model_name)
```
Utility Functions
```python
from ethiobbpe import (
    load_compressed_vocab,
    validate_checkpoint,
    list_checkpoints,
    get_model_info,
)

# Load a compressed vocabulary
vocab = load_compressed_vocab("vocab.json.gz")

# Validate a checkpoint
is_valid = validate_checkpoint("checkpoint_ckpt_1.json")

# List available checkpoints
checkpoints = list_checkpoints("./checkpoints")

# Get model information
info = get_model_info("./models/EthioBBPE_AmharicBible")
print(f"Total size: {info['total_size_mb']} MB")
```
Training Your Own Tokenizer
```python
from ethiobbpe.trainer import BBPETrainer

# Initialize the trainer with advanced features
trainer = BBPETrainer(
    vocab_size=16000,
    min_frequency=2,
    compression_format="gzip",
    compression_level=9,
    enable_quantization=True,
    quantization_bits=8,
    max_checkpoints=5,
)

# Train on your data
texts = ["Your Amharic text 1", "Your Amharic text 2", ...]
metrics = trainer.train(
    texts=texts,
    output_dir="./my_tokenizer",
    model_name="MyAmharicTokenizer",
)

print(f"Training completed in {metrics['training_duration_seconds']:.2f}s")
print(f"Final vocab size: {metrics['final_vocab_size']}")
print(f"Compression ratio: {metrics['compression_ratio']:.2%}")
```
📈 Model Details
Training Data
- Synaxarium Dataset: 366 Ethiopian Orthodox Church texts
- Canon Biblical Dataset: 61,403 Amharic-English parallel biblical texts
- Total Corpus: 27.5 MB, 61,769 lines
Performance Metrics
| Metric | Value |
|---|---|
| Vocabulary Size | 16,000 tokens |
| Training Time | ~17 seconds |
| Original Size | 1.34 MB |
| Compressed Size | 136 KB (gzip) |
| Compression Ratio | 89.8% |
| Reconstruction Accuracy | 100% |
Supported Scripts
- ✅ Amharic (አማርኛ)
- ✅ Tigrinya (ትግርኛ)
- ✅ Ge'ez (ግዕዝ)
- ✅ Mixed-language texts
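Because the tokenizer operates at the byte level, mixed-script input needs no special handling. Reusing the `tokenizer` from the Quick Start (the final assertion relies on the round-trip guarantee quoted above):

```python
# Amharic and English in a single input string.
mixed = "EthioBBPE tokenizes ሰላም and English in one pass."
enc = tokenizer.encode(mixed)
print(enc['tokens'])  # Ethiopic and Latin tokens side by side

# Byte-level BPE reconstructs the original string exactly.
assert tokenizer.decode(enc['ids']) == mixed
```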
🛠️ API Reference
EthioBBPE Class
`__init__(model_name: str = "EthioBBPE_AmharicBible", model_dir: Optional[str] = None)`
Initialize the tokenizer with an optional custom model name or directory.
`encode(text: str, add_special_tokens: bool = True, ...) -> Dict[str, Any]`
Encode text into token IDs, tokens, and offsets.
`decode(ids: List[int], skip_special_tokens: bool = True) -> str`
Decode token IDs back to text.
`encode_batch(texts: List[str], ...) -> List[Dict[str, Any]]`
Encode multiple texts efficiently.
`decode_batch(batch_ids: List[List[int]], ...) -> List[str]`
Decode multiple sequences efficiently.
`get_vocab_size() -> int`
Return the vocabulary size.
`get_vocab() -> Dict[str, int]`
Return the full vocabulary mapping.
`save(path: str) -> None`
Save the tokenizer to a file.
`from_pretrained(model_name: str = "nexuss0781/Ethio-BBPE") -> EthioBBPE`
Class method to load a pretrained tokenizer from the Hugging Face Hub.
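For instance, token-to-text alignment can be read off the offsets that `encode()` reports. A minimal sketch; note that the `'offsets'` key name and `(start, end)` character-span format are assumptions inferred from the description above, not a confirmed schema:

```python
text = "ሰላም ለኢዮብ"
enc = tokenizer.encode(text)

# Assumed: enc['offsets'] is a list of (start, end) spans into `text`.
for tok, (start, end) in zip(enc['tokens'], enc['offsets']):
    print(f"{tok!r} -> {text[start:end]!r}")
```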
📁 Package Structure
```text
ethiobbpe/
├── __init__.py            # Main exports
├── tokenizer.py           # Core EthioBBPE class
├── config.py              # Configuration management
├── utils.py               # Utility functions
├── trainer.py             # Advanced BBPE trainer
└── models/                # Pretrained models
    ├── tokenizer.json
    ├── vocab.json.gz
    ├── config.json
    └── training_metrics.json
```
🤝 Integration
Hugging Face Transformers
```python
from transformers import AutoTokenizer

# Use with Hugging Face Transformers
tokenizer = AutoTokenizer.from_pretrained("nexuss0781/Ethio-BBPE")
```
LangChain
```python
from langchain.text_splitter import CharacterTextSplitter

# Use in LangChain pipelines
text_splitter = CharacterTextSplitter(
    separator="",
    chunk_size=512,
    chunk_overlap=50,
)
```
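As written, the splitter above budgets chunks in characters. To budget in EthioBBPE tokens instead, LangChain's `length_function` hook can delegate to the tokenizer. A minimal sketch, assuming the `encode()` return format shown in the Quick Start:

```python
from langchain.text_splitter import CharacterTextSplitter
from ethiobbpe import EthioBBPE

tokenizer = EthioBBPE()

# chunk_size/chunk_overlap now count EthioBBPE tokens, not characters.
token_splitter = CharacterTextSplitter(
    separator="",
    chunk_size=512,
    chunk_overlap=50,
    length_function=lambda t: len(tokenizer.encode(t)['ids']),
)
```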
📄 License
Apache License 2.0 - See LICENSE for details.
🙏 Acknowledgments
- Training data from Synaxarium Dataset
- Biblical texts from Canon Biblical Dataset
- Built with Hugging Face Tokenizers
📬 Contact
- Author: Nexus Research
- Email: nexuss0781@gmail.com
- GitHub: nexuss0781/Ethio_BBPE
- Hugging Face: nexuss0781/Ethio-BBPE
🗺️ Roadmap
- Support for additional Ethiopian languages (Oromo, Somali, Sidama)
- Pre-trained language models using EthioBBPE
- WebAssembly build for browser-based inference
- ONNX export for optimized deployment
- Streaming tokenization for large documents
Made with ❤️ for the Ethiopian NLP community
File details
Details for the file ethiobbpe-1.0.2.tar.gz.
File metadata
- Download URL: ethiobbpe-1.0.2.tar.gz
- Upload date:
- Size: 11.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 9c2744ebeaae5fdce8ed40490fceb2b37b6977c3ca7e4e9d6252a3d5f6de61b9 |
| MD5 | ba10f0fea8fa88e574ceb10fdc3bd0a7 |
| BLAKE2b-256 | f4383408f4fe197ca718124439f2b05034db6e500a65ca71753d01069cb6785f |
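These digests can be checked before installing a manually downloaded file, for example with Python's standard `hashlib`:

```python
import hashlib

# Compare a downloaded sdist against the published SHA256 digest.
expected = "9c2744ebeaae5fdce8ed40490fceb2b37b6977c3ca7e4e9d6252a3d5f6de61b9"
with open("ethiobbpe-1.0.2.tar.gz", "rb") as f:
    actual = hashlib.sha256(f.read()).hexdigest()
assert actual == expected, "SHA256 mismatch: do not install this file"
```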
File details
Details for the file ethiobbpe-1.0.2-py3-none-any.whl.
File metadata
- Download URL: ethiobbpe-1.0.2-py3-none-any.whl
- Upload date:
- Size: 13.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 5ba0b02162b2e2c45d5cd4ff9bb4f3e15d757ce8036321e1a523be64bb26267f |
| MD5 | 6dcb0c295db7694c287f1c88b09cc5d8 |
| BLAKE2b-256 | f05e66b88b4d0a32f5a55d4edc9095e63fadebd8a7732a8569016baa7b2f3f7d |