Advanced Byte Pair Encoding Tokenizer for Ethiopian Languages (Amharic, Tigrinya, Ge'ez)
Reason this release was yanked:
This older version may have dependency mismatches with your project's dependencies.
Project description
EthioBBPE
Advanced Byte Pair Encoding Tokenizer for Ethiopian Languages
EthioBBPE is a production-ready, high-performance tokenizer optimized for Amharic, Tigrinya, and Ge'ez scripts. Built with advanced features including checkpointing, multi-format compression, and model quantization, it delivers efficient text processing for Ethiopian languages.
✨ Features
🚀 Production-Ready
- Automatic Model Download: Seamlessly downloads pretrained models from Hugging Face Hub on first use
- Embedded Models: Includes pretrained weights in the package for offline usage
- Zero Configuration: Works out of the box with sensible defaults
🔧 Advanced Capabilities
- Checkpointing: SHA256 integrity verification for fault-tolerant training
- Multi-Format Compression: Support for gzip, bz2, and lzma/xz (up to 90% size reduction; see the sketch after this list)
- Model Quantization: 8-bit and 4-bit quantization for efficient deployment
- Batch Processing: Efficient encoding/decoding of text batches
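The checkpointing and compression features amount to compressed JSON plus a recorded digest. Here is a minimal, hypothetical sketch of the idea using only the standard library (not the package's actual internals):

```python
import gzip
import hashlib
import json

# Toy vocabulary standing in for a real checkpoint payload.
vocab = {"ሰላም": 0, "ለ": 1}

# Write the checkpoint as gzip-compressed JSON (level 9, as quoted above).
payload = json.dumps(vocab, ensure_ascii=False).encode("utf-8")
compressed = gzip.compress(payload, compresslevel=9)
with open("vocab.json.gz", "wb") as f:
    f.write(compressed)

# Record the SHA256 digest for later integrity checks.
digest = hashlib.sha256(compressed).hexdigest()

# On resume: verify the digest before trusting the checkpoint.
with open("vocab.json.gz", "rb") as f:
    data = f.read()
assert hashlib.sha256(data).hexdigest() == digest, "checkpoint corrupted"
restored = json.loads(gzip.decompress(data).decode("utf-8"))
assert restored == vocab
```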
📊 Performance
- Vocabulary Size: 16,000 tokens optimized for Ethiopic scripts
- Compression Ratio: ~90% size reduction with gzip level 9
- Perfect Reconstruction: 100% accuracy on Amharic biblical texts
- Fast Inference: Optimized for production workloads
📦 Installation
Basic Installation
```bash
pip install EthioBBPE
```
With Training Dependencies
```bash
pip install "EthioBBPE[training]"
```
Development Installation
```bash
pip install "EthioBBPE[dev]"
```
🎯 Quick Start
Simple Usage
```python
from ethiobbpe import EthioBBPE

# Initialize tokenizer (auto-downloads the model on first use)
tokenizer = EthioBBPE()

# Encode text
text = "ሰላም ለኢዮብ ዘኢነበበ"
encoded = tokenizer.encode(text)
print(encoded['ids'])     # Token IDs
print(encoded['tokens'])  # Token strings

# Decode back to text
decoded = tokenizer.decode(encoded['ids'])
print(decoded)  # Perfect reconstruction: "ሰላም ለኢዮብ ዘኢነበበ"
```
Batch Processing
```python
texts = [
    "በመዠመሪያ፡እግዚአብሔር፡ሰማይንና፡ምድርን፡ፈጠረ።",
    "ሰላም ለኢዮብ ዘኢነበበ ከንቶ ።",
]

# Encode a batch of texts
encoded_batch = tokenizer.encode_batch(texts)

# Decode the batch
decoded_batch = tokenizer.decode_batch([e['ids'] for e in encoded_batch])
```
Load from Hugging Face Hub
```python
from ethiobbpe import EthioBBPE

# Load a specific model from the Hugging Face Hub
tokenizer = EthioBBPE.from_pretrained("nexuss0781/Ethio-BBPE")
```
🔬 Advanced Usage
Custom Configuration
```python
from ethiobbpe import Config, EthioBBPE

# Create a custom configuration
config = Config(
    model_name="MyTokenizer",
    vocab_size=32000,
    compression_format="bz2",
    compression_level=9,
    enable_quantization=True,
    quantization_bits=8,
)

# Use the configuration
tokenizer = EthioBBPE(model_name=config.model_name)
```
Utility Functions
```python
from ethiobbpe import (
    load_compressed_vocab,
    validate_checkpoint,
    list_checkpoints,
    get_model_info,
)

# Load a compressed vocabulary
vocab = load_compressed_vocab("vocab.json.gz")

# Validate a checkpoint
is_valid = validate_checkpoint("checkpoint_ckpt_1.json")

# List available checkpoints
checkpoints = list_checkpoints("./checkpoints")

# Get model information
info = get_model_info("./models/EthioBBPE_AmharicBible")
print(f"Total size: {info['total_size_mb']} MB")
```
Training Your Own Tokenizer
```python
from ethiobbpe.trainer import BBPETrainer

# Initialize the trainer with advanced features
trainer = BBPETrainer(
    vocab_size=16000,
    min_frequency=2,
    compression_format="gzip",
    compression_level=9,
    enable_quantization=True,
    quantization_bits=8,
    max_checkpoints=5,
)

# Train on your data
texts = ["Your Amharic text 1", "Your Amharic text 2", ...]
metrics = trainer.train(
    texts=texts,
    output_dir="./my_tokenizer",
    model_name="MyAmharicTokenizer",
)

print(f"Training completed in {metrics['training_duration_seconds']:.2f}s")
print(f"Final vocab size: {metrics['final_vocab_size']}")
print(f"Compression ratio: {metrics['compression_ratio']:.2%}")
```
📈 Model Details
Training Data
- Synaxarium Dataset: 366 Ethiopian Orthodox Church texts
- Canon Biblical Dataset: 61,403 Amharic-English parallel biblical texts
- Total Corpus: 27.5 MB, 61,769 lines
Performance Metrics
| Metric | Value |
|---|---|
| Vocabulary Size | 16,000 tokens |
| Training Time | ~17 seconds |
| Original Size | 1.34 MB |
| Compressed Size | 136 KB (gzip) |
| Compression Ratio | 89.8% |
| Reconstruction Accuracy | 100% |
Supported Scripts
- ✅ Amharic (አማርኛ)
- ✅ Tigrinya (ትግርኛ)
- ✅ Ge'ez (ግዕዝ)
- ✅ Mixed-language texts
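Because the tokenizer operates at the byte level, mixed-script input needs no special handling. Reusing the `tokenizer` from the Quick Start (the final assertion relies on the round-trip guarantee quoted above):

```python
# Amharic and English in a single input string.
mixed = "EthioBBPE tokenizes ሰላም and English in one pass."
enc = tokenizer.encode(mixed)
print(enc['tokens'])  # Ethiopic and Latin tokens side by side

# Byte-level BPE reconstructs the original string exactly.
assert tokenizer.decode(enc['ids']) == mixed
```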
🛠️ API Reference
EthioBBPE Class
`__init__(model_name: str = "EthioBBPE_AmharicBible", model_dir: Optional[str] = None)`
Initialize the tokenizer with an optional custom model name or directory.
`encode(text: str, add_special_tokens: bool = True, ...) -> Dict[str, Any]`
Encode text into token IDs, tokens, and offsets.
`decode(ids: List[int], skip_special_tokens: bool = True) -> str`
Decode token IDs back to text.
`encode_batch(texts: List[str], ...) -> List[Dict[str, Any]]`
Encode multiple texts efficiently.
`decode_batch(batch_ids: List[List[int]], ...) -> List[str]`
Decode multiple sequences efficiently.
`get_vocab_size() -> int`
Return the vocabulary size.
`get_vocab() -> Dict[str, int]`
Return the full vocabulary mapping.
`save(path: str) -> None`
Save the tokenizer to a file.
`from_pretrained(model_name: str = "nexuss0781/Ethio-BBPE") -> EthioBBPE`
Class method to load a pretrained tokenizer from the Hugging Face Hub.
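For instance, token-to-text alignment can be read off the offsets that `encode()` reports. A minimal sketch; note that the `'offsets'` key name and `(start, end)` character-span format are assumptions inferred from the description above, not a confirmed schema:

```python
text = "ሰላም ለኢዮብ"
enc = tokenizer.encode(text)

# Assumed: enc['offsets'] is a list of (start, end) spans into `text`.
for tok, (start, end) in zip(enc['tokens'], enc['offsets']):
    print(f"{tok!r} -> {text[start:end]!r}")
```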
📁 Package Structure
```text
ethiobbpe/
├── __init__.py            # Main exports
├── tokenizer.py           # Core EthioBBPE class
├── config.py              # Configuration management
├── utils.py               # Utility functions
├── trainer.py             # Advanced BBPE trainer
└── models/                # Pretrained models
    ├── tokenizer.json
    ├── vocab.json.gz
    ├── config.json
    └── training_metrics.json
```
🤝 Integration
Hugging Face Transformers
```python
from transformers import AutoTokenizer

# Use with Hugging Face Transformers
tokenizer = AutoTokenizer.from_pretrained("nexuss0781/Ethio-BBPE")
```
LangChain
```python
from langchain.text_splitter import CharacterTextSplitter

# Use in LangChain pipelines
text_splitter = CharacterTextSplitter(
    separator="",
    chunk_size=512,
    chunk_overlap=50,
)
```
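As written, the splitter above budgets chunks in characters. To budget in EthioBBPE tokens instead, LangChain's `length_function` hook can delegate to the tokenizer. A minimal sketch, assuming the `encode()` return format shown in the Quick Start:

```python
from langchain.text_splitter import CharacterTextSplitter
from ethiobbpe import EthioBBPE

tokenizer = EthioBBPE()

# chunk_size/chunk_overlap now count EthioBBPE tokens, not characters.
token_splitter = CharacterTextSplitter(
    separator="",
    chunk_size=512,
    chunk_overlap=50,
    length_function=lambda t: len(tokenizer.encode(t)['ids']),
)
```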
📄 License
Apache License 2.0 - See LICENSE for details.
🙏 Acknowledgments
- Training data from Synaxarium Dataset
- Biblical texts from Canon Biblical Dataset
- Built with Hugging Face Tokenizers
📬 Contact
- Author: Nexus Research
- Email: nexuss0781@gmail.com
- GitHub: nexuss0781/Ethio_BBPE
- Hugging Face: nexuss0781/Ethio-BBPE
🗺️ Roadmap
- Support for additional Ethiopian languages (Oromo, Somali, Sidama)
- Pre-trained language models using EthioBBPE
- WebAssembly build for browser-based inference
- ONNX export for optimized deployment
- Streaming tokenization for large documents
Made with ❤️ for the Ethiopian NLP community
File details
Details for the file ethiobbpe-1.0.2.tar.gz.
File metadata
- Download URL: ethiobbpe-1.0.2.tar.gz
- Upload date:
- Size: 11.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 9c2744ebeaae5fdce8ed40490fceb2b37b6977c3ca7e4e9d6252a3d5f6de61b9 |
| MD5 | ba10f0fea8fa88e574ceb10fdc3bd0a7 |
| BLAKE2b-256 | f4383408f4fe197ca718124439f2b05034db6e500a65ca71753d01069cb6785f |
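These digests can be checked before installing a manually downloaded file, for example with Python's standard `hashlib`:

```python
import hashlib

# Compare a downloaded sdist against the published SHA256 digest.
expected = "9c2744ebeaae5fdce8ed40490fceb2b37b6977c3ca7e4e9d6252a3d5f6de61b9"
with open("ethiobbpe-1.0.2.tar.gz", "rb") as f:
    actual = hashlib.sha256(f.read()).hexdigest()
assert actual == expected, "SHA256 mismatch: do not install this file"
```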
File details
Details for the file ethiobbpe-1.0.2-py3-none-any.whl.
File metadata
- Download URL: ethiobbpe-1.0.2-py3-none-any.whl
- Upload date:
- Size: 13.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 5ba0b02162b2e2c45d5cd4ff9bb4f3e15d757ce8036321e1a523be64bb26267f |
| MD5 | 6dcb0c295db7694c287f1c88b09cc5d8 |
| BLAKE2b-256 | f05e66b88b4d0a32f5a55d4edc9095e63fadebd8a7732a8569016baa7b2f3f7d |