EthioBBPE
Enterprise-grade Amharic & Ge'ez Biblical Text Tokenizer with Byte Pair Encoding
📖 Overview
EthioBBPE is a professional, production-ready Byte Pair Encoding (BPE) tokenizer specifically designed for Amharic, Ge'ez, and biblical texts. It provides state-of-the-art tokenization for Ethiopian languages with support for ancient scripts and special characters.
✨ Key Features
- 🚀 One-line Installation: `pip install EthioBBPE`
- 🤖 Auto-download: Models are downloaded automatically from the Hugging Face Hub
- 📦 Compressed Storage: Gzip compression for efficient storage (65%+ size reduction)
- ⚡ High Performance: Optimized for speed with batch processing
- 🎯 Perfect Reconstruction: 100% accuracy on Amharic biblical texts
- 🔧 Professional API: Clean, intuitive interface inspired by Hugging Face
- 📊 Production Ready: Checkpointing, quantization, and comprehensive metrics
🚀 Quick Start
Installation
```bash
pip install EthioBBPE
```
Basic Usage
```python
from ethiobbpe import EthioBBPETokenizer

# Load the pretrained tokenizer (auto-downloads from Hugging Face)
tokenizer = EthioBBPETokenizer.from_pretrained()

# Encode text
text = "ሰላም ለኢዮብ ዘኢነበበ ከንቶ ።"
encoded = tokenizer.encode(text)
print(f"Tokens: {encoded.tokens}")
print(f"IDs: {encoded.ids}")

# Decode back to text
decoded = tokenizer.decode(encoded.ids)
print(f"Decoded: {decoded}")
# Output: ሰላም ለኢዮብ ዘኢነበበ ከንቶ ። (100% accurate!)
```
Advanced Usage
```python
from ethiobbpe import AutoTokenizer

# Alternative: use the AutoTokenizer factory
tokenizer = AutoTokenizer.from_pretrained("Nexuss0781/Ethio-BBPE")

# Batch encoding
texts = [
    "በመዠመሪያ፡እግዚአብሔር፡ሰማይንና፡ምድርን፡ፈጠረ።",
    "ወደ ቍስጥንጥንያ አገርም በደረሰች ጊዜ",
    "ሐዋርያ መንፈስ ይቤ እንዘ ያነክር ሕይወቶ"
]
encodings = tokenizer.encode_batch(texts)
for i, enc in enumerate(encodings):
    print(f"Text {i}: {len(enc.tokens)} tokens")

# Callable interface
result = tokenizer("ሰላም ዓለም")
print(result.tokens)

# Get the vocabulary
vocab = tokenizer.get_vocab()
print(f"Vocabulary size: {tokenizer.get_vocab_size()}")
```
📊 Model Details
| Property | Value |
|---|---|
| Model Name | EthioBBPE_AmharicBible |
| Vocabulary Size | 16,000 tokens |
| Training Data | 61,769 lines (Synaxarium + Canon Biblical) |
| Compression | Gzip Level 9 (65%+ reduction) |
| Reconstruction Accuracy | 100% |
| License | Apache 2.0 |
Training Datasets
- Synaxarium Dataset: 366 religious texts
- Canon Biblical Dataset: 61,403 Amharic-English parallel texts
Total corpus: ~27.5 MB of high-quality Amharic biblical text.
🛠️ API Reference
EthioBBPETokenizer
Main tokenizer class with full encoding/decoding capabilities.
Methods
- `from_pretrained(model_name, cache_dir, force_download)`: Load a pretrained model
- `from_file(file_path)`: Load from a local file
- `encode(text, add_special_tokens, truncation, max_length)`: Encode a single text
- `encode_batch(texts, ...)`: Encode multiple texts
- `decode(ids, skip_special_tokens)`: Decode token IDs
- `decode_batch(batch_ids, ...)`: Decode multiple sequences
- `get_vocab()`: Get the vocabulary dictionary
- `get_vocab_size()`: Get the vocabulary size
- `save(path)`: Save the tokenizer to a file
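The sketch below walks through a save/reload round trip using the methods listed above; the keyword usage is assumed to match the signatures as listed, so adjust if your installed version differs.

```python
from ethiobbpe import EthioBBPETokenizer

tokenizer = EthioBBPETokenizer.from_pretrained()

# Persist the tokenizer to a local file, then load it back
# without touching the network.
tokenizer.save("my_tokenizer.json")
local_tokenizer = EthioBBPETokenizer.from_file("my_tokenizer.json")

# Truncated encoding: cap the sequence at max_length tokens.
enc = local_tokenizer.encode("ሰላም ዓለም", truncation=True, max_length=8)
print(enc.ids)
```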
AutoTokenizer
Factory class for automatic tokenizer loading.
```python
from ethiobbpe import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Nexuss0781/Ethio-BBPE")
```
Encoding
Wrapper object for encoding results with properties:
- `ids`: Token IDs (`List[int]`)
- `tokens`: Token strings (`List[str]`)
- `attention_mask`: Attention mask (`List[int]`)
- `type_ids`: Token type IDs (`List[int]`)
- `offsets`: Character offsets (`List[tuple]`)
- `special_tokens_mask`: Special tokens mask (`List[int]`)
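An illustrative sketch of reading these properties, assuming they are parallel lists with one entry per token (as in Hugging Face's Encoding object):

```python
enc = tokenizer.encode("ሰላም ለኢዮብ")

# Walk the parallel lists together: token string, id, and the
# character span the token covers in the input text.
for token, token_id, (start, end) in zip(enc.tokens, enc.ids, enc.offsets):
    print(f"{token!r:12} id={token_id:<6} chars {start}:{end}")

# For an unpadded single sequence the attention mask is all ones.
print(enc.attention_mask)
```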
📈 Performance
Compression Benefits
| Format | Size | Size reduction |
|---|---|---|
| tokenizer.json | 1.3 MB | Baseline |
| vocab.json.gz | 136 KB | 89.8% |
| tokenizer_quantized_8bit.json.gz | 56 KB | 95.7% |
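The gzip archives use standard compression, so a vocabulary file can be inspected with nothing but the Python standard library. This sketch assumes `vocab.json.gz` is a plain JSON token-to-ID mapping, as the file name suggests:

```python
import gzip
import json

# Decompress and parse the vocabulary in one pass.
with gzip.open("vocab.json.gz", "rt", encoding="utf-8") as f:
    vocab = json.load(f)

print(f"{len(vocab)} entries")  # expected: 16000
```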
Speed
- Single encoding: < 1ms
- Batch encoding (32 texts): < 10ms
- Model download: Automatic caching for subsequent uses
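To reproduce these timings on your own hardware, a micro-benchmark along these lines should suffice (numbers will vary by machine):

```python
import time
from ethiobbpe import EthioBBPETokenizer

tokenizer = EthioBBPETokenizer.from_pretrained()
texts = ["ሰላም ለኢዮብ ዘኢነበበ ከንቶ ።"] * 32

# Warm-up call so one-time setup cost is excluded from the measurement.
tokenizer.encode_batch(texts)

start = time.perf_counter()
tokenizer.encode_batch(texts)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"Batch of 32 texts: {elapsed_ms:.2f} ms")
```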
🔬 Technical Details
Architecture
- Algorithm: Byte Pair Encoding (BPE)
- Pre-tokenization: Whitespace splitting
- Special Tokens: [PAD], [CLS], [SEP], [MASK], [UNK] (see the sketch after this list)
- Vocabulary: 16,000 tokens optimized for Amharic/Ge'ez
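The `add_special_tokens` flag on `encode()` controls whether these markers are inserted. The sketch below assumes the flag behaves as in Hugging Face tokenizers (e.g. wrapping the sequence in [CLS] … [SEP] when enabled):

```python
from ethiobbpe import EthioBBPETokenizer

tokenizer = EthioBBPETokenizer.from_pretrained()

# Compare the token stream with and without special-token insertion.
with_special = tokenizer.encode("ሰላም ዓለም", add_special_tokens=True)
plain = tokenizer.encode("ሰላም ዓለም", add_special_tokens=False)
print(with_special.tokens)
print(plain.tokens)
```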
Production Features
- Checkpointing: SHA256 integrity verification for fault-tolerant training (see the sketch after this list)
- Quantization: 8-bit and 4-bit options for deployment optimization
- Metrics Tracking: Comprehensive training statistics in JSON format
- Multiple Export Formats: tokenizer.json, vocab.json.gz, quantized versions
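Integrity checks of this kind need only the standard library. The helper below is hypothetical (EthioBBPE's internal checkpoint API is not documented here) but demonstrates the streaming SHA256 verification described above:

```python
import hashlib

def file_sha256(path: str) -> str:
    """Compute a file's SHA256 digest in streaming fashion."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large checkpoints don't need to
        # fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Compare against the digest recorded when the checkpoint was written.
print(file_sha256("checkpoint.json.gz"))
```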
📝 Examples
Ge'ez Punctuation Handling
```python
from ethiobbpe import EthioBBPETokenizer

tokenizer = EthioBBPETokenizer.from_pretrained()

# Ancient Ge'ez punctuation marks
geez_text = "፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠፠"
encoded = tokenizer.encode(geez_text)
print(f"Ge'ez punctuation: {encoded.tokens}")
# A single token covers the repeated marks!

decoded = tokenizer.decode(encoded.ids)
assert decoded == geez_text  # Perfect reconstruction
```
Biblical Text Processing
```python
biblical_text = """
ሰላም ለኢዮብ ዘኢነበበ ከንቶ ።
አመ አኀዞ አበቅ ወአመ አህጎለ ጥሪቶ ።
ሐዋርያ መንፈስ ይቤ እንዘ ያነክር ሕይወቶ ።
"""

encoded = tokenizer.encode(biblical_text)
print(f"Token count: {len(encoded.tokens)}")
print(f"Perfect reconstruction: {tokenizer.decode(encoded.ids) == biblical_text}")
```
🤝 Integration
With Hugging Face Transformers
```python
from ethiobbpe import EthioBBPETokenizer

# Use EthioBBPE as the tokenizer in front of a Transformers model
ethio_tokenizer = EthioBBPETokenizer.from_pretrained()

# Integrate with your pipeline: the returned dict can be fed to any
# model that accepts input_ids and attention_mask.
def tokenize_for_model(text):
    encoding = ethio_tokenizer.encode(text)
    return {
        "input_ids": encoding.ids,
        "attention_mask": encoding.attention_mask,
    }
```
With PyTorch DataLoader
```python
import torch
from torch.utils.data import Dataset
from ethiobbpe import EthioBBPETokenizer

class AmharicDataset(Dataset):
    def __init__(self, texts, tokenizer, max_length=512):
        self.texts = texts
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        encoding = self.tokenizer.encode(
            self.texts[idx],
            truncation=True,
            max_length=self.max_length,
        )
        return {
            "input_ids": torch.tensor(encoding.ids),
            "attention_mask": torch.tensor(encoding.attention_mask),
        }
```
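A minimal usage sketch with the AmharicDataset above. Since encoded sequences vary in length, `batch_size=1` sidesteps padding; real batching would need a `collate_fn` that pads `input_ids` and `attention_mask` to a common length:

```python
from torch.utils.data import DataLoader
from ethiobbpe import EthioBBPETokenizer

texts = ["ሰላም ለኢዮብ ዘኢነበበ ከንቶ ።", "ወደ ቍስጥንጥንያ አገርም በደረሰች ጊዜ"]
dataset = AmharicDataset(texts, EthioBBPETokenizer.from_pretrained())

# Iterate one example at a time; each batch is a dict of tensors.
loader = DataLoader(dataset, batch_size=1)
for batch in loader:
    print(batch["input_ids"].shape)
```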
📦 Installation Options
Basic Installation
```bash
pip install EthioBBPE
```
Development Installation
```bash
pip install EthioBBPE[dev]
```
From Source
```bash
git clone https://github.com/nexuss0781/Ethio_BBPE.git
cd Ethio_BBPE
pip install -e .
```
🧪 Testing
```bash
# Run tests
pytest tests/

# With coverage
pytest tests/ --cov=ethiobbpe
```
🙏 Acknowledgments
- Training data from Synaxarium Dataset
- Training data from Canon Biblical Dataset
- Built with Hugging Face Tokenizers
📄 License
Apache License 2.0 - See LICENSE for details.
👤 Author
Nexuss0781
Email: nexuss0781@gmail.com
Download files
File details
Details for the file ethiobbpe-2.0.0.tar.gz.
File metadata
- Size: 14.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 85efac41f4dcd07f6cc50a681da8eb76881b60ae249495df1f549aaa33ffa913 |
| MD5 | e18eb2a11bd5510d9b0a9d0e8b9be472 |
| BLAKE2b-256 | be960e648f201cc077655f83b222db050ac9fdd7656110c1385dab0686c0c387 |
File details
Details for the file ethiobbpe-2.0.0-py3-none-any.whl.
File metadata
- Size: 10.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2d0a9be9d4a3dc699d46e3f1dae1382bb8ba239cdeb274075c61f6c80b5bfc84 |
| MD5 | ee0942e4bb2b8d517a8c855c4cc2df18 |
| BLAKE2b-256 | 3403d99c3c1fd9eb3aaa7b0f40cdc99ae2d27e4e217605953245f6c1beaa1c84 |